

We've written before about how we built an engagement activity platform to automatically capture and store user engagement activities, and about the steps we took to greatly improve our pipeline stability by eliminating Conflicting Commits errors and maintaining data integrity.

The type of activities we store are customer activities such as viewing a presentation or webinar or clicking on an email. The important thing to note is that an engagement is mutable. For example, a lead L could become a contact C with a new ID, and all engagements that belong to L should be updated with the new ID of C. Therefore, our data lake of engagements is a mutable one.

The discussion in this blog post will focus on two pipelines: one is engagement ingestion, and the other is engagement mutation. Engagement Ingestion is a batch job that ingests Engagement records from Kafka and stores them in the Engagement Table. Engagement Mutation is the other batch job, which handles mutation requests; a single mutation request could lead to changes to multiple engagements.

Engagement Table Schema

All engagements belong to an organization, which is indicated in the column "OrgId" in the table below. Each engagement has a time, which is rounded into a date indicated in the column "EngagementDate". The "Engagement" column is a struct type containing the detailed fields.

The Comparison between Two Partition Schemes

Partition by Org and Date

The table is partitioned by OrgId and EngagementDate, which means data files live under an S3 key with one directory per org/date combination, as sketched below.
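
To make the layout concrete, here is a minimal PySpark sketch of the org/date scheme. It assumes a Spark session with the Delta Lake package on the classpath; the fields inside the Engagement struct and the S3 path are illustrative, not the actual production schema.

```python
# Minimal sketch (not the production job) of the Engagement Table written with
# the org/date partition scheme. OrgId, EngagementDate, and Engagement come from
# the post; the struct fields and the S3 path are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, DateType, TimestampType,
)

spark = (
    SparkSession.builder
    .appName("engagement-partitioning-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

engagement_schema = StructType([
    StructField("OrgId", StringType(), nullable=False),
    StructField("EngagementDate", DateType(), nullable=False),  # engagement time rounded to a date
    StructField("Engagement", StructType([                      # struct holding the detailed fields
        StructField("Id", StringType()),
        StructField("Type", StringType()),                      # e.g. email click, webinar view
        StructField("Time", TimestampType()),
    ])),
])

# Partitioning by both columns produces one directory per org/date combination,
# i.e. S3 keys of the form .../OrgId=<org>/EngagementDate=<date>/<file>.parquet
(
    spark.createDataFrame([], engagement_schema)
    .write.format("delta")
    .partitionBy("OrgId", "EngagementDate")
    .mode("overwrite")
    .save("s3://example-bucket/engagement")   # illustrative path
)
```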

With this partition scheme, files are written to a partition directory of org/date, which could lead to the following issues:

- Small orgs with less engagement per day would have small files, and many small files could be created if small orgs are the majority.
- Once the partition columns are decided, it is impossible to change file sizes for optimization: the partition granularity, not a configuration value, determines how the data is split.

Partition by Org

With the alternative partition scheme, files are written to a partition directory of org, which is a larger granularity, and Delta Lake divides the files within each partition according to the maxFileSize configuration. This partition scheme has the following benefits:

- With the larger granularity, the small files that used to sit in each org/date directory are merged into bigger ones, which reduces the number of small files.
- We can tune file size by setting the configuration "maxFileSize".

We can also see a few differences in the metadata of the Engagement table with the new partition strategy: first, the partitionValues column contains OrgId only; second, the tags column indicates that the z-order column is EngagementDate; third, each file maintains a non-overlapping range of engagement dates.

Here is what will happen to a query: Delta Lake first filters by OrgId to locate the target directory (partition pruning), and then skips files by EngagementDate range (I/O pruning), as the sketch below illustrates.
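
As a rough illustration of the new scheme, the sketch below rewrites the table partitioned by OrgId only, z-orders it by EngagementDate, and runs the kind of query described above. The paths, table location, and org/date values are assumptions, not the production pipeline, and OPTIMIZE ... ZORDER BY requires a Delta Lake or Databricks runtime that supports it.

```python
# Sketch of the new scheme: partition by OrgId only, then z-order by EngagementDate.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # the Delta-enabled session from the earlier sketch

# Rewrite the data partitioned by OrgId only (one directory per org).
(
    spark.read.format("delta").load("s3://example-bucket/engagement")
    .write.format("delta")
    .partitionBy("OrgId")
    .mode("overwrite")
    .save("s3://example-bucket/engagement_by_org")
)

# Co-locate rows by EngagementDate inside each org partition so that per-file
# min/max statistics cover narrow, non-overlapping date ranges.
spark.sql("""
    OPTIMIZE delta.`s3://example-bucket/engagement_by_org`
    ZORDER BY (EngagementDate)
""")

# Partition pruning narrows the scan to the OrgId directory; the date filter then
# skips files whose EngagementDate statistics fall outside the requested range.
spark.sql("""
    SELECT count(*)
    FROM delta.`s3://example-bucket/engagement_by_org`
    WHERE OrgId = 'org-123'
      AND EngagementDate BETWEEN '2021-01-01' AND '2021-01-31'
""").show()
```

Because each file then covers a narrow, non-overlapping date range, the min/max statistics recorded in the Delta log are selective enough for effective file skipping.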

With the new partition scheme, the next question is what the appropriate value for file size (controlled by maxFileSize) should be. A query over each table indicates how well the file size can be aligned to the configuration: the header is the configuration value for the Delta table, each table holds the same set of data (to which the OPTIMIZE command has been applied), and the numbers in the green boxes are the file count in the bucket of that table.
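
Here is a sketch of one way such a check could be run. It assumes the spark.databricks.delta.optimize.maxFileSize key is the relevant OPTIMIZE target-file-size setting in a Databricks/Delta runtime (verify the exact key and default in your environment); the path and the 256 MB target are illustrative.

```python
# Sketch: run OPTIMIZE under a given target file size, then compare the actual
# file count and average size that it produced.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Delta-enabled session

# Assumed config key for the OPTIMIZE target file size; here ~256 MB.
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", str(256 * 1024 * 1024))

spark.sql("""
    OPTIMIZE delta.`s3://example-bucket/engagement_by_org`
    ZORDER BY (EngagementDate)
""")

# DESCRIBE DETAIL reports numFiles and sizeInBytes, which is enough to compare
# the average file size OPTIMIZE produced with the configured target.
spark.sql("DESCRIBE DETAIL delta.`s3://example-bucket/engagement_by_org`") \
     .selectExpr("numFiles", "sizeInBytes", "sizeInBytes / numFiles AS avgFileSizeBytes") \
     .show(truncate=False)
```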
