Concept

Modern enterprises are becoming increasingly data-driven and data-intensive, relying on data and analytics across the whole fabric of the business (strategic planning, sales, marketing, finance, operations) to make fact-based decisions and to better analyze and understand business conditions.

Extract-Transform-Load (ETL) processes are typically used to filter, aggregate and transform data from the original sources into a Data Warehouse. Because this approach implies a predefined format and schema for the target data, as well as predefined rules for data ingestion, it is not flexible enough for investigating new data sources or accommodating changes in existing ones. It also introduces design, performance and productivity bottlenecks, since several important low-level decisions are made in advance, at a point where neither the type of data nor the type of queries is fully known, and both may change over time. Moreover, ETL processes often take several hours to complete, introducing long waiting times before new data can be queried and analyzed.

Data Lakes provide an alternative, potentially complementary, approach. They are raw data ecosystems, where large amounts of diverse structured, semi-structured and unstructured data coexist in their original model and format. A Data Lake retains all data, including data kept only because it might be of use at some point in the future, rather than just the predefined parts that are known in advance to serve specific purposes. Data is retained in its natural, raw form, following a “schema on read” rather than a “schema on write” approach, and it is transformed only when a use for it arises.
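To make the "schema on read" idea concrete, the following is a minimal sketch in Python and not a description of any particular Data Lake product. The raw records, the field names and the helper read_with_schema are all hypothetical, introduced only for illustration. Records are kept exactly as they arrived, and a schema is applied only at the moment a question is asked.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (no upfront schema).
RAW_LAKE = [
    '{"user": "alice", "amount": "12.50", "channel": "web"}',
    '{"user": "bob", "amount": 7, "coupon": "SPRING"}',
    '{"sensor": "t-17", "reading": 21.4}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time (illustrative helper, not a real API).

    Keeps only the records that contain every field named in `schema`
    and casts each of those fields with the converter given for it.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            rows.append(
                {field: cast(record[field]) for field, cast in schema.items()}
            )
    return rows


# Two different "views" over the same raw data, each defined only when needed.
sales_schema = {"user": str, "amount": float}
sensor_schema = {"sensor": str, "reading": float}

sales = read_with_schema(RAW_LAKE, sales_schema)
sensors = read_with_schema(RAW_LAKE, sensor_schema)
total_sales = sum(row["amount"] for row in sales)
```

In a "schema on write" setting, the third record would typically have to be rejected or forced into the sales format at ingestion time. Here it simply stays in the raw store until a schema that can interpret it is supplied.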

Whereas Data Warehouses can efficiently serve well-planned and anticipated business needs and operations, Data Lakes are the go-to place for self-service analytics. Data scientists can tap directly into the Data Lake to analyze data from new sources, combine data of different types, come up with new business questions, test hypotheses and derive new insights and knowledge, which enables flexible, fast, ad hoc decision making.