Microsoft Business Intelligence (Data Tools)|data lake vs data warehouse

Monday, May 18, 2020

Technological Benefits of Data Lakes

Data is a business asset for every organisation, and it is audited and protected. Data can take any form: structured, semi-structured, or unstructured. To handle any kind of data, a data lake comes into the picture as a centralized repository that stores the data as-is (relational data from line-of-business applications, and non-relational data from mobile apps, IoT devices, and social media).

The types of raw data that are stored in a data lake can include:

Audio, images and video
Communications (blogs, emails, social media, click-streams)
Operational data (inventory, sales, tickets, tourism)
Machine-generated data (log files, IoT sensor readings)

Most importantly, data lakes are specifically designed to run large-scale analytics workloads in a cost-effective way. Within a data lake, the necessary data is made available to employees at all levels, irrespective of their designation.

All-around Availability of Data - This is the biggest advantage of a data lake implementation for any organisation, because it assures that all employees, irrespective of their designation and role, can have access to data. This is known as data democratization.

Fetches Quality Data - A data lake implementation supports many tools and technologies that provide considerable data processing power for fetching quality data, such as:

Real-time decision analysis - Data lakes take advantage of large quantities of consistent data and deep learning algorithms to arrive at real-time decision analytics, with the help of many supported languages.

Supports SQL and other languages - Conventional data warehouse technologies support SQL, which is good enough for simple analytics. For advanced analytics, a data lake also works with other tools such as Pig, Hive, Tachyon, and Impala, and Spark MLlib is available for machine learning.

Operational Analytics Monitoring - Data lakes offer benefits for companies, data managers, and data processors alike. Because the necessary data is available to employees at every level, it is easy to search, explore, filter, aggregate, and visualize business data in near real-time for application monitoring, log analytics, and click-stream analytics. Just as a Twitter user decides whom to connect with, a data lake user can choose the data required to meet different business objectives.
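As a rough illustration of the filter and aggregate steps mentioned above, the following sketch works on a few click-stream events held as newline-delimited JSON. The event fields (user, page, status) and the sample values are assumptions made up for this example, and a real data lake would normally run such queries through a distributed engine rather than plain Python.

```python
import json
from collections import Counter

# Hypothetical raw click-stream events, one JSON document per line,
# as they might be stored unmodified in a data lake.
RAW_EVENTS = """\
{"user": "u1", "page": "/home", "status": 200}
{"user": "u2", "page": "/cart", "status": 500}
{"user": "u1", "page": "/cart", "status": 200}
{"user": "u3", "page": "/home", "status": 200}
{"user": "u2", "page": "/cart", "status": 500}
"""


def parse_events(raw):
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def hits_per_page(events):
    """Aggregate: count events per page."""
    return Counter(event["page"] for event in events)


def failed_requests(events):
    """Filter: keep only events with a server error status (5xx)."""
    return [event for event in events if 500 <= event["status"] < 600]


events = parse_events(RAW_EVENTS)
page_counts = hits_per_page(events)
errors = failed_requests(events)

print(page_counts.most_common())
print(len(errors), "failed requests")
```

The same two operations, a filter and a group-by count, are what a log analytics or application monitoring dashboard would typically run continuously over newly arriving data.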

Scalable, Versatile and Schema Flexibility - This is another major advantage of a data lake. Data volumes are growing exponentially, and unlike a traditional data warehouse, a data lake offers scalability and is inexpensive as well. Cloud platforms (AWS, Azure, Google Cloud, etc.) now offer features such as auto-scaling and integration that help reduce the cost of compute usage. A data lake can store versatile data such as XML, logs, multimedia, sensor data, chat, social data, binary, and people data from diverse sources. A Hadoop data lake lets us be schema-free, or define multiple schemas for the same data. We can also easily separate the schema from the data, which is good for analytics.
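The idea of keeping the schema separate from the data, and of having more than one schema for the same data, can be sketched in a few lines. This is only a minimal illustration: the sensor records, the two schema dictionaries, and the read_with_schema helper are all hypothetical names invented for this example, not part of Hadoop or any other product.

```python
import json

# Hypothetical raw sensor records kept in the lake exactly as received.
RAW_RECORDS = [
    '{"device": "d1", "temp_c": "21.5", "ts": "2020-05-18T10:00:00", "site": "north"}',
    '{"device": "d2", "temp_c": "19.0", "ts": "2020-05-18T10:05:00", "site": "south"}',
]

# Two illustrative schemas over the same raw data: field name -> type converter.
MONITORING_SCHEMA = {"device": str, "temp_c": float}
SITE_REPORT_SCHEMA = {"site": str, "ts": str}


def read_with_schema(raw_records, schema):
    """Apply a schema at read time: pick the fields and cast their types."""
    rows = []
    for line in raw_records:
        record = json.loads(line)
        rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows


monitoring_rows = read_with_schema(RAW_RECORDS, MONITORING_SCHEMA)
site_rows = read_with_schema(RAW_RECORDS, SITE_REPORT_SCHEMA)

print(monitoring_rows)
print(site_rows)
```

The raw records are never rewritten. Each consumer brings its own schema, so a new analysis can add a third schema without touching the stored data.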

Mukesh Singh

With over 17 years of experience in the Data Engineering stack across a variety of cloud and on-premises systems, I have successfully delivered more than ten complete business product solutions. My expertise lies in building robust infrastructure and architecture to support data engineering, data analytics, and machine learning processes. These solutions have significantly improved collaboration among cross-functional teams, including data scientists, business analysts, software engineers, and stakeholders.

Key Contributions

Data Modelling and Integration

• Data Modeling: Developed various data models to produce suitable data for business users, data analytics, data science, and data visualization teams.
• Legacy Systems and Cloud Technologies: Integrated legacy systems with modern cloud-based technologies (AWS, Azure, GCP), data lakes, and data warehouses.
• Streamlined Data Pipelines: Built efficient data pipelines, data warehouses, BI reports, and dashboards to streamline data access and insights.

Tuesday, July 17, 2018

Data Lake Vs Data Warehouse

We know that data is a business asset for any organisation, which must always be kept secure and accessible to business users whenever it is required.

Currently, two techniques are very popular for storing data for business insights. Here, we differentiate them based on some technical characteristics.

The first is the data warehouse, a highly structured store of data that requires a significant amount of discovery, planning, data modeling, and development work before the data becomes available for analysis by business users.
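To make "highly structured" concrete, here is a small sketch in which the table structure is declared before any data is loaded, and rows that do not fit are rejected at write time. SQLite is used only as a convenient stand-in for a warehouse database; the sales table, its columns, and the load_row helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is designed up front, before any data arrives.
conn.execute(
    """
    CREATE TABLE sales (
        order_id INTEGER PRIMARY KEY,
        region   TEXT NOT NULL,
        amount   REAL NOT NULL CHECK (amount >= 0)
    )
    """
)


def load_row(connection, row):
    """Insert one row; return False if it violates the table's constraints."""
    try:
        connection.execute(
            "INSERT INTO sales (order_id, region, amount) VALUES (?, ?, ?)", row
        )
        return True
    except sqlite3.IntegrityError:
        return False


valid_ok = load_row(conn, (1, "EMEA", 120.0))            # fits the schema
null_region_ok = load_row(conn, (2, None, 50.0))         # violates NOT NULL
negative_amount_ok = load_row(conn, (3, "APAC", -5.0))   # violates CHECK

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(valid_ok, null_region_ok, negative_amount_ok, row_count)
```

Only the row that matches the declared structure is stored, which is the trade-off of a warehouse: more modeling work up front in exchange for data that can be trusted when it is queried.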

The second is the data lake, a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed. In other words, a data lake is a more organic store of data, kept without regard for the perceived value or structure of the data.

Data lakes are a big opportunity to store large amounts of data in an affordable way without having to decide upfront how it must be structured and used. They are typically used to complement traditional data warehouses, which are still better adapted for highly trusted, tightly governed data such as your financial figures, but there are some overlaps between the two repositories.

Data Warehouses compared to Data Lakes - Depending on the business requirements, a typical organization will require both a data warehouse and a data lake, as they serve different needs and use cases.

Characteristics	Data Warehouse	Data Lake
Type of data stored	Structured data (most often in columns & rows in a relational database) from transactional systems, operational databases, and line of business applications	Any type of data structure, any format, including structured, semi-structured, and unstructured data from IoT devices, web sites, mobile apps, social media, and corporate applications
Best way to ingest data	Batch processes	Streaming, micro-batch, or batch processes
Schema	Designed prior to the DW implementation (schema-on-write)	Defined at the time of analysis (schema-on-read)
Typical load pattern	ETL - (Extract, Transform, then Load)	ELT - (Extract, Load, then Transform when the data is needed)
Price/Performance	Fastest query results using higher-cost storage	Query results getting faster using low-cost storage
Data Quality	Highly curated data that serves as the central version of the truth	Any data that may or may not be curated (i.e. raw data)
Users	Business analysts	Data scientists, Data developers, and Business analysts (using curated data)
Analytics pattern	Determine structure, acquire data, then analyze it; iterate back to change structure as needed. Batch reporting, BI and visualizations	Acquire data, analyze it, then iterate to determine its final structured form. Machine Learning, Predictive analytics, data discovery and profiling
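The difference between the two load patterns in the table comes down to when the transformation runs. The sketch below contrasts them using plain Python lists as stand-ins for the warehouse and the lake; the SOURCE records and the transform, etl, and elt functions are hypothetical names chosen for this illustration.

```python
# Hypothetical source records; the "amount" values arrive as strings.
SOURCE = [
    {"order_id": 1, "amount": "120.50", "note": "gift"},
    {"order_id": 2, "amount": "80.00", "note": ""},
]


def transform(record):
    """Shape a record for reporting: keep two fields and cast the amount."""
    return {"order_id": record["order_id"], "amount": float(record["amount"])}


def etl(source):
    """ETL: transform first, so the store only ever holds the shaped rows."""
    return [transform(record) for record in source]


def elt(source):
    """ELT: load the raw records unchanged; transform later, on demand."""
    lake = [dict(record) for record in source]  # raw copy, nothing dropped

    def query():
        return [transform(record) for record in lake]

    return lake, query


warehouse = etl(SOURCE)
lake, query = elt(SOURCE)

print(warehouse)
print(lake)
print(query())
```

Both paths can produce the same report, but only the ELT path still holds the original note field, so a later analysis with different needs can go back to the raw records.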

During the development of a traditional data warehouse, a considerable amount of time is spent analyzing data sources, understanding business processes, profiling data, and modeling data.

In contrast, the default expectation for a data lake is to acquire all of the data and retain all of the data.
