I’m trying to understand the constraint in the data lake’s definition that the data be unprocessed.

2 min read · Feb 14, 2021

I’m trying to understand the constraint in the data lake’s definition that the data be unprocessed. From my point of view, the data generated by a data model can also be part of the data lake. Moreover, the data reflect reality only as it is maintained in the source systems. Whether the data are “true” can be judged only once their quality is considered; therefore, the truth aspect is just a slogan. With respect to master data, one of the advantages of data warehouses was that the data could be cleaned in the data warehouse, with all the positive and negative implications deriving from this.

Not sure what you mean by mutate. Is it a change of reference? Records are indeed overwritten in source systems, and the changes must be handled accordingly in the data lake, ideally using versioning and specific logic. In addition, many systems allow deleting transactional data such as Purchase Orders or Move Orders. Good luck handling such situations without a proper design. What I mean is that a data warehouse or data lake must either accommodate such scenarios or avoid them by design. In data warehouses this was handled upfront, before the data were consumed, which meant that the users were, in theory, working with the same data. In data lake scenarios, each solution might end up handling the same data differently.
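To make the versioning point concrete, here is a minimal sketch of what I have in mind, under my own assumptions: the source delivers full snapshots keyed by a business key, and `Version`, `apply_snapshot` and `load_id` are hypothetical names, not something taken from the article. Overwrites and deletions in the source become appended versions, with a deletion recorded as a tombstone instead of the record silently disappearing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Version:
    key: str                  # business key, e.g. a Purchase Order number
    payload: Optional[dict]   # None marks a deletion (tombstone)
    load_id: int              # hypothetical sequence number of the load


def apply_snapshot(history, snapshot, load_id):
    """Append a version for every key whose state changed; never overwrite."""
    # Latest known state per key (history is assumed ordered by load_id).
    current = {}
    for v in history:
        current[v.key] = v.payload

    new_history = list(history)
    # New or changed records get a new version.
    for key, payload in sorted(snapshot.items()):
        if current.get(key) != payload:
            new_history.append(Version(key, payload, load_id))
    # Records missing from the snapshot were deleted in the source.
    for key in sorted(current):
        if key not in snapshot and current[key] is not None:
            new_history.append(Version(key, None, load_id))
    return new_history


history = apply_snapshot([], {"PO-1": {"qty": 5}, "PO-2": {"qty": 1}}, load_id=1)
# PO-1 is overwritten and PO-2 is deleted in the source system.
history = apply_snapshot(history, {"PO-1": {"qty": 7}}, load_id=2)
```

With the handling done once at this level, every consumer sees the same history, which is the point I was making about data warehouses.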

I know the definition of idempotence from Mathematics, and I find the way you used it curious. I think Beauchemin means something else by idempotence: no matter how many times you execute a package with the same base data, the results should be the same (no garbage data should remain). The requirement for determinism means that, given the logic used, the results should be predictable. This can also mean, for example, that if the source data are sorted in a different order, this should not influence the result. It can also mean that at any given time only one instance of the process runs.
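A small sketch of how I read these two requirements, assuming a load that is identified by a batch id and rows that carry an `id` (`load_batch` and the field names are hypothetical, used only for illustration): the load first removes whatever an earlier run of the same batch left behind, and it sorts the input so that the arrival order of the source rows does not matter.

```python
def load_batch(target, batch_id, rows):
    """Return a new target in which batch_id holds exactly the given rows."""
    # Idempotence: drop anything a previous run of this batch wrote.
    kept = [r for r in target if r["batch_id"] != batch_id]
    # Determinism: the input order must not influence the result.
    ordered = sorted(rows, key=lambda r: r["id"])
    return kept + [{"batch_id": batch_id, **r} for r in ordered]


rows = [{"id": 2, "value": "b"}, {"id": 1, "value": "a"}]
once = load_batch([], "2021-02-14", rows)
# Second run of the same batch, with the source rows in another order.
twice = load_batch(once, "2021-02-14", list(reversed(rows)))
```

Here `once` and `twice` are identical: rerunning the package leaves no duplicates or garbage behind.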

Partitions have more to do with convenience and performance than with idempotence or determinism. Moreover, partitioning means that a record is assigned to exactly one partition (beforehand or during processing). Partitioning on a “modified date” attribute can lead to unexpected effects, because a record moves to another partition each time it is changed. A better partition key is the creation date, though it might be unusable in certain scenarios.
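One such unexpected effect, sketched under the assumption that an incremental load overwrites only the partition it is asked to refresh (`rewrite_partition` and the field names are hypothetical): with the modified date as key, an updated record lands in a new partition while its stale copy stays in the old one; with the creation date, the record stays where it was.

```python
def rewrite_partition(lake, records, key, value):
    """Overwrite only the partition `value`, leaving all other partitions as they are."""
    updated = dict(lake)
    updated[value] = [r["id"] for r in records if r[key] == value]
    return updated


before = [{"id": 1, "created": "2021-02-01", "modified": "2021-02-01"}]
after = [{"id": 1, "created": "2021-02-01", "modified": "2021-02-10"}]

# Partitioned on the modified date: record 1 ends up in two partitions.
by_modified = rewrite_partition({}, before, "modified", "2021-02-01")
by_modified = rewrite_partition(by_modified, after, "modified", "2021-02-10")

# Partitioned on the creation date: record 1 stays in its single partition.
by_created = rewrite_partition({}, before, "created", "2021-02-01")
by_created = rewrite_partition(by_created, after, "created", "2021-02-01")
```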

Overall an interesting article.

Written by Adrian
