Adrian
2 min read · Oct 31, 2020

Data Lakes: 10 Definitions

“If you think of a Data Mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” (James Dixon, “Pentaho, Hadoop, and Data Lakes”, 2010)

“At its core, it is a data storage and processing repository in which all of the data in an organization can be placed so that every internal and external systems’, partners’, and collaborators’ data flows into it and insights spring out. […] Data Lake is a huge repository that holds every kind of data in its raw format until it is needed by anyone in the organization to analyze.” (Beulah S Purra & Pradeep Pasupuleti, “Data Lake Development with Big Data”, 2015)

“A storage system designed to hold vast amounts of raw data in its native (ingested) format, usually in a flat or semi-structured format. Extract, transform, and load (ETL) operations are usually applied to data lakes to extract local data marts for downstream computation.” (Benjamin Bengfort & Jenny Kim, “Data Analytics with Hadoop”, 2016)

“A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics, and machine learning.” (Piethein Strengholt, “Data Management at Scale”, 2020)

“A data lake is a central location, that holds a large amount of data in its native, raw format, as well as a way to organize large volumes of highly diverse data.” (Databricks)

“A data lake is a concept consisting of a collection of storage instances of various data assets. These assets are stored in a near-exact, or even exact, copy of the source format and are in addition to the originating data stores.” (Gartner)

“A data lake is a collection of long-term data containers that capture, refine, and explore any form of raw data at scale. It is enabled by low-cost technologies that multiple downstream facilities can draw upon, including data marts, data warehouses, and recommendation engines.” (Teradata)

“A data lake is a place to store your structured and unstructured data, as well as a method for organizing large volumes of highly diverse data from diverse sources.” (Oracle)

“A Data Lake is a service which provides a protective ring around the data stored in a cloud object store, including authentication, authorization, and governance support.” (Cloudera)

“A data lake is an unstructured data repository that contains information available for analysis. A data lake ingests data in its raw, original state, straight from data sources, without any cleansing, standardization, remodeling, or transformation. It enables ad hoc queries, data exploration, and discovery-oriented analytics because data management and structure can be applied on the fly at runtime, unlike traditional structured data storage which requires a schema on write.” (TDWI)
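The contrast between applying structure "on the fly at runtime" and requiring a schema on write can be made concrete with a small sketch. What follows is only an illustration under stated assumptions, not how any particular product implements it: the raw records, the field names, and the helper `read_with_schema` are all hypothetical. The raw lines are kept exactly as ingested, and each consumer imposes its own schema when reading.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (no cleansing).
RAW_LINES = [
    '{"id": "1", "amount": "19.90", "country": "DE"}',
    '{"id": "2", "amount": "5", "extra": "ignored"}',
    '{"id": "3", "country": "FR"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: select fields and cast them.

    Fields missing from a raw record become None instead of failing,
    because nothing was validated when the data was written.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


# Two consumers impose different structures on the same raw data.
billing_schema = {"id": int, "amount": float}
geo_schema = {"id": int, "country": str}

billing_rows = list(read_with_schema(RAW_LINES, billing_schema))
geo_rows = list(read_with_schema(RAW_LINES, geo_schema))

for row in billing_rows:
    print(row)
```

With schema on write, the third record would typically have been rejected or repaired at load time because it lacks `amount`. Here it stays in the store untouched, and each reader decides how to handle the gap.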

More quotes on “Data Lakes” at sql-troubles.blogspot.com.

Adrian

IT professional/blogger with more than 24 years of experience in IT: Software Engineering, BI & Analytics, Data, Project, Quality, Database & Knowledge Management