Data Lakes can quickly degrade into so-called data swamps, becoming “data rich, information poor” due to the lack of efficient, fit-for-purpose, and scalable data analytics methodologies and engineering solutions for Big Data Lakes. Key challenges and recurring questions that companies face when enabling extreme-scale analytics over Big Data Lakes that grow in volume and variety over time include:
- Challenge 1: Handling data heterogeneity – How can I flexibly handle heterogeneous data with different models and formats, while at the same time offering high-performance queries and analytics?
- Challenge 2: Reducing storage costs – How can I take advantage of emerging storage tiering opportunities to reduce storage costs by optimizing data placement under dynamically changing data characteristics, access patterns, and business needs?
- Challenge 3: Making sense of the data – How can I resolve different types of entities across multiple sources, mine different types of relations and associations, and find patterns in the data?
- Challenge 4: Monitoring changes – How can I detect changes introduced by newly collected data and assess their impact on my analysis?
- Challenge 5: Supporting the human in the loop – How can I visually and interactively explore the data to extract insights, formulate hypotheses, try different analyses, and compare the effects of different parameters?