A Data Lake is an enterprise-wide system for storing and analyzing disparate sources of data in their native formats. A Data Lake might combine sensor data, social media data, click-streams, location data, log files, and much more with traditional data from existing RDBMSes. The goal is to break the information silos in an enterprise by bringing all the data into a single place for analysis without the restrictions of schema, security, or authorization. Data Lakes are designed to store vast amounts of data, even petabytes, in local or cloud-based clusters consisting of commodity hardware.
In a more technical sense, a Data Lake is a set of tools for ingesting, transforming, storing, securing, recovering, accessing, and analyzing all the relevant data of a company. Which set of tools depends in some part on the specific needs and requirements of your business.
The choice of technologies to be installed will vary depending on the use cases and the streams that need to be processed. Additional considerations could be whether to host a solution internally or to rely on companies providing platforms as a service, like Amazon Web Services (AWS).
Use cases can be broadly categorized into different streams of processing like batch, streaming, and queries with fast read/write times. An appropriate tool set can be configured depending on the kind of processing that needs to take place.
Given the advantages of a central Data Lake, there are some measures and utility challenges within the building process that every enterprise should address. The most important include data management and security for sensitive data. However, there are Hadoop ecosystem tools available on top of the core frameworks that make Data Lakes an enterprise-ready solution. Taking into account these common challenges, we can reclassify a Data Lake as an Enterprise Data Warehouse that does not require you to predefine metadata or define a specific schema before ingesting data into the system.
There are a variety of ways you can use a Data Lake:
Managed Unlimited Data
A Data Lake can store unlimited data, in its original formats and fidelity, for as long as you need. As a data staging area, it can not only prepare data quickly and costeffectively for use in downstream systems, but also serves as an automatic compliance archive to satisfy internal and external regulatory demands. Unlike traditional archival storage solutions, a Data Lake is an online system: all data is available for query.
Accelerate Data Preparation and Reduce Costs
Increasingly, data processing workloads that previously had to run on expensive systems can migrate to a Data Lake, where they run at very low cost, in parallel, much faster than before. Optimizing the placement of these workloads and the data on which they operate frees capacity on high-end data warehouses, making them more valuable by allowing them to concentrate on the business-critical OLAP and other applications for which they were designed.
Explore and Analyze
Fast Above all else, a Data Lake enables analytic agility. IT can provide analysts and data scientists with a self-service environment to ask new questions and rapidly integrate, combine, and explore any data they need. Structure can be applied incrementally, at the right time, rather than necessarily up front. Not limited to standard SQL, a Data Lake offers options for full-text search, machine learning, scripting, and connectivity to existing business intelligence, data discovery, and analytic platforms. A Data Lake finally makes it cost-effective to run data-driven experiments and analysis over unlimited data.
Build Data Applications
Once developed and tested, analytical models can be instantly deployed at web scale to create real-time data-driven applications. Naturally, these applications themselves generate rich usage data that can be used to better understand user behavior and optimize the models. Creating automated feedback loops is easy in a Data Lake because the architecture provides both analytical and operational capabilities; instrumentation data simply lands right back in the Lake where, as with other data, it’s immediately available for human or machine analysis.
For example, a large U.S. home improvement retailer’s data about 100 million customer interactions per year was stored across isolated silos, preventing the company from correlating transactional data with various marketing campaigns and online customer browsing behavior. What this large retailer needed was a “golden record” that unified customer data across all time periods and across all channels, including point-of-sale transactions, home delivery and website traffic.
By making the golden record a reality, the Data Lake delivers key insights for highly targeted marketing campaigns including customized coupons, promotions and emails. And by supporting multiple access methods (batch, real-time, streaming, in-memory, etc.) to a common data set, it enables users to transform and view data in multiple ways (across various schemas) and deploy closed-loop analytics applications that bring time-to-insight closer to real time than ever before.
In a practical sense, a Data Lake is characterized by three key attributes
Collect everything. A Data Lake contains all data, both raw sources over extended periods of time as well as any processed data.
Dive in anywhere. A Data Lake enables users across multiple business units to refine, explore and enrich data on their terms.
Flexible access. A Data Lake enables multiple data access patterns across a shared infrastructure: batch, interactive, online, search, in-memory and other processing engines.