
BIG DATA

UNIT-1
Introduction
• Big data is a term used to describe large and complex sets of data that
are difficult to manage. It can be structured, semi-structured, or
unstructured. Big data is used by organizations to improve their
processes, make decisions, and create better products and services.

• Big data includes structured data, like an inventory database or list of
financial transactions; unstructured data, such as social posts or
videos; and mixed data sets, like those used to train large language
models for AI.
Introduction
• We can have data without information but we cannot have information
without data. With such voluminous data comes the complexity of
managing it well with techniques that are not only effective and
human-friendly but also deliver the desired results in a timely manner.
Types of Digital Data
• Structured Data
• Unstructured Data
• Semi-Structured Data
Structured Data
• Structured Data refers to information that is
organized in a predefined manner, typically
stored in relational databases with clear rows
and columns. It follows a specific schema,
making it easily searchable and processable
using Structured Query Language (SQL). Examples
include customer records, sales transactions,
and sensor readings that adhere to a fixed
format.
Types of Structured Data
• Databases
• Spreadsheet
• SQL
• OLTP systems
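To make the idea concrete, here is a minimal, illustrative sketch (not from the slides) using Python's built-in sqlite3 module; the sales table and its columns are hypothetical. It shows data held in a fixed schema of rows and columns and queried with SQL.

```python
import sqlite3

# In-memory relational table with a fixed, predefined schema (rows and columns)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (product, amount) VALUES (?, ?)",
    [("laptop", 55000.0), ("mouse", 700.0), ("laptop", 58000.0)],
)

# Because the schema is known in advance, SQL can search and aggregate easily
for product, total in conn.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product"
):
    print(product, total)
conn.close()
```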
Sources of Structured Data
• Structured data comes from databases such as Access and OLTP systems
queried with SQL, as well as spreadsheets such as Excel; all of these are in the structured format.
• To summarize, structured data:
• Consists of fully described data sets.
• Has clearly defined categories and sub-categories.
• Is placed neatly in rows and columns.
• Goes into records, and hence the database is regulated by a well-defined structure.
• Can be indexed easily, by the DBMS itself or manually.
Advantages of Structured Data (Easy
to Work With)
• It is easy to work with structured data.
• The advantages are:
• Storage: Both defined and user-defined data types help with the storage of structured data.
• Scalability: Scalability is generally not an issue as data volume grows.
• Security: Ensuring security is easy.
• Update and Delete: Updating and deleting records is easy due to the structured form.
Easy to Work With Structured
Data
• Retrieval of structured data is hassle free.
• The features are as follows:
• Retrieving information: A well-defined structure helps in easy retrieval of data.
• Indexing and searching: Data can be indexed not only on a text string but also
on other attributes. This enables streamlined search.
• Mining data: Structured data can be easily mined, and knowledge can be extracted
from it.
• BI operations: BI works extremely well with structured data, so data mining,
warehousing, etc. can be easily undertaken.
UNSTRUCTURED DATA
• Unstructured Data refers to information that lacks a fixed format or predefined
structure, making it more challenging to store and analyze. This type of data
includes a wide range of content such as text documents, social media posts,
images, videos, and audio recordings. Since unstructured data is not easily
processed using traditional databases, specialized tools and algorithms, such as
artificial intelligence and machine learning, are often required for analysis.

• Unstructured data cannot be stored in the form of rows and columns as in a
database and does not conform to any data model.
• It is difficult to determine the meaning of the data.
• It does not follow any rules; it can be of any type, and thus it is
unpredictable.
CHARACTERISTICS OF UNSTRUCTURED
DATA
SOURCES OF UNSTRUCTURED DATA
• Web pages
• Memos
• Videos (MPEG, etc.)
• Images (JPEG, GIF, etc.)
• Body of an email
• Word documents
• PowerPoint presentations
• Chats
• Reports
• White papers
• Surveys, etc.
Where does unstructured data
come from?
• Anything in a non-database form is unstructured data.
• It can be divided into two broad categories:
• Bitmap objects: e.g., image and video files.
• Textual objects: e.g., Microsoft Word documents, emails, or MS Excel files.
• A lot of unstructured data is also noisy text, such as chats, emails, and SMS texts.
MANAGING UNSTRUCTURED DATA
• INDEXING: Data is indexed to enable faster search and retrieval. An
index is an identifier, defined on the basis of some value in the data, which
represents a larger record in the data set.
• Indexing unstructured data is difficult: text can be indexed on a text
string, but for non-text files, e.g., audio/video, indexing depends
on file names.
• TAGS/METADATA: Using metadata attached to the data to describe and locate it.
• CLASSIFICATION/TAXONOMY: Taxonomy is classifying data on the
basis of the relationships that exist between data. Data can be grouped
and placed in hierarchies based on the taxonomy prevalent in a firm.
• But in the absence of any structure/metadata, identifying relationships
between data is difficult; because the data is unstructured and naming
standards are not consistent across the firm, it is difficult to classify
the data.
• CAS (Content Addressable Storage): It stores data based on its
metadata and assigns a unique name to every object stored in it. An
object is retrieved based on its content and not its location. It is used
to store emails, etc. A minimal sketch of content addressing follows below.
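The following is a toy Python sketch of the content-addressing idea, not a real CAS product: each object's address is a hash of its content, so retrieval depends on what the object is rather than where it is stored.

```python
import hashlib

store = {}  # stand-in for the storage back end

def put(content: bytes) -> str:
    # The object's unique name (address) is derived from its content
    address = hashlib.sha256(content).hexdigest()
    store[address] = content
    return address

def get(address: str) -> bytes:
    # Retrieval is by content-derived address, not by file path or location
    return store[address]

addr = put(b"archived email body")
assert get(addr) == b"archived email body"
# Identical content always maps to the same address, which also deduplicates
assert put(b"archived email body") == addr
```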
CHALLENGES FACED WHILE STORING
UNSTRUCTURED DATA
Semi Structured Data
• Semi-Structured Data is a type of data that
does not follow a rigid schema like structured
data but still contains tags or markers to
define elements within it. This type of data is
often stored in formats such as JSON, XML, or
NoSQL databases, allowing for some level of
organization while maintaining flexibility.
Examples include emails, web pages with
embedded metadata, and data logs.
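As an illustration (the records below are hypothetical, not from the slides), the Python sketch parses JSON in which records carry tags/markers that describe each element, yet no rigid schema is enforced:

```python
import json

# Two records in the same collection need not share the same attributes
records = '''
[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "tags": ["vip"], "address": {"city": "Pune"}}
]
'''

for rec in json.loads(records):
    # Keys act as markers describing each element; missing fields are normal
    print(rec["id"], rec.get("email", "no email on file"))
```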
Semi Structured Data
• Schema-less or self-describing data (e.g., XML, JSON objects) that
does not strictly follow a data model.
• Only about 10% of data in any organization is semi-structured.
• Still, it is important to understand, manage, and analyze this semi-
structured data coming from heterogeneous sources.
• Semi-structured data does not fully conform to a data model, and it
cannot be stored in rows and columns as in a relational database.
• Semi-structured data has tags and markers which help group the
data and describe how it is stored, but these are not sufficient
for management and automation of the data.
Where does semi-structured
data come from?
• Email
• XML
• TCP/IP Packets
• Zipped File
• Binary Executables
• Mark-Up Languages
• Integration of data from heterogeneous sources
Characteristics of semi-structured
data are summarized below:
• It is organized into semantic entities.
• Entities in the same group may not have the same attributes.
• The order of attributes is not necessarily important.
• Not always all attributes are required.
• Size of the same attributes in a group may differ.
• Type of the same attributes in a group may differ.
History of BIG DATA
• The history of Big Data can be traced back to the evolution
of data storage, computing power, and analytics techniques
over several decades.
• The concept of handling large volumes of data began in the
1960s and 1970s with the development of relational
databases. Companies and institutions started using
mainframes to store structured data in an organized manner.
However, computing power was limited, restricting the
ability to process vast amounts of information efficiently.
• During the 1980s and 1990s, advancements in database
management systems, particularly relational database
management systems (RDBMS), allowed businesses to handle
growing data sets more efficiently. The introduction of the
internet further accelerated data generation, leading to an
increased need for better data storage and retrieval
mechanisms.
History of BIG DATA
• The early 2000s marked the emergence of Big Data as a
formal concept. In 2001, analyst Doug Laney introduced
the three Vs model (Volume, Velocity, and Variety) to
describe the growing challenges associated with large-
scale data processing. Around this time, companies like
Google and Yahoo developed distributed computing
frameworks, such as the Google File System (GFS) and
MapReduce, enabling the processing of massive data sets
across multiple machines.
• By the 2010s, open-source frameworks like Apache Hadoop
and Apache Spark became popular, providing scalable and
efficient ways to manage and analyze large-scale data.
Cloud computing platforms such as Amazon Web Services
(AWS), Microsoft Azure, and Google Cloud revolutionized
Big Data storage and processing, making it more
accessible to businesses of all sizes.
History of BIG DATA
• In recent years, the rise of artificial
intelligence (AI), machine learning (ML), and
real-time analytics has further advanced Big Data
applications. The use of streaming data platforms
like Apache Kafka and the adoption of deep
learning models have enabled faster and more
complex data analysis, impacting industries such
as healthcare, finance, and e-commerce.
• Today, Big Data continues to evolve, with
technologies like edge computing, blockchain, and
quantum computing shaping the future of large-
scale data processing and analytics.
Big Data Platform
• Big data platforms are comprehensive frameworks that enable
organizations to store, process, and analyze vast amounts of structured
and unstructured data.
• A big data platform is an integrated computing solution that
combines numerous software systems, tools, and hardware for big
data management. It is a one-stop architecture that solves all the data
needs of a business regardless of the volume and size of the data at
hand. Due to their efficiency in data management, enterprises are
increasingly adopting big data platforms to gather tons of data and
convert them into structured, actionable business insights.
a. Apache Hadoop
• Apache Hadoop is one of the industry's most widely used big data platforms. It is
an open-source framework that enables distributed processing of massive
datasets across clusters. Hadoop provides a scalable and cost-effective
solution for storing, processing, and analyzing massive amounts of structured and
unstructured data.

• One of the key features of Hadoop is its distributed file system, known as Hadoop
Distributed File System (HDFS). HDFS enables data to be stored across multiple
machines, providing fault tolerance and high availability. This feature allows
businesses to store and process data at a previously unattainable scale. Hadoop
also includes a powerful processing engine called MapReduce, which allows for
parallel data processing across the cluster. The prominent companies that use
Apache Hadoop include:
• Yahoo
• Facebook
• Twitter
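To illustrate the MapReduce idea, here is a minimal single-machine word-count sketch in plain Python. A real Hadoop job distributes the map, shuffle, and reduce phases across the cluster, but the logic is the same.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data needs big tools", "data drives decisions"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))  # {'big': 2, 'data': 2, ...}
```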
What are the best Big Data
Platforms?
• The evaluation revolves around four letters, S, A, P, S, which stand for
Scalability, Availability, Performance, and Security.
• There are various tools responsible for managing the hybrid data of IT
systems.
Apache Spark

• Apache Spark is a unified analytics engine for batch processing, streaming data, machine
learning, and graph processing. It is one of the most popular big data platforms used by
companies. One of the key benefits that Apache Spark offers is speed. It is designed to
perform data processing tasks in-memory and achieve significantly faster processing
times than traditional disk-based systems.

• Spark also supports various programming languages, including Java, Scala, Python, and
R, making it accessible to a wide range of developers. Spark offers a rich set of libraries
and tools, such as Spark SQL for querying structured data, MLlib for machine learning,
and GraphX for graph processing. Spark integrates well with other big data technologies,
such as Hadoop, allowing companies to leverage their existing infrastructure. The
prominent companies that use Apache Spark include:

• Netflix, Uber, Airbnb
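As a small illustration, the PySpark sketch below counts words in a text file; the file name logs.txt is hypothetical, and a local Spark installation is assumed.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Entry point for the Spark DataFrame API
spark = SparkSession.builder.appName("WordCount").getOrCreate()

df = spark.read.text("logs.txt")  # hypothetical input file, one line per row
words = df.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))

# Counting happens in parallel across the cluster's executors, in memory
counts = words.groupBy("word").count().orderBy(F.desc("count"))
counts.show(10)

spark.stop()
```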
Google BigQuery
• Google BigQuery is a cloud-based Big Data
analytics platform that allows organizations to
process and analyze massive datasets using SQL-
like queries. It is fully managed, serverless,
and integrates well with other Google Cloud
services, making it ideal for businesses
looking for scalable and fast cloud-based
analytics.
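A minimal sketch of querying BigQuery from Python with the google-cloud-bigquery client library is shown below; it uses BigQuery's public Shakespeare sample table and assumes Google Cloud credentials and a default project are already configured.

```python
from google.cloud import bigquery

client = bigquery.Client()  # picks up configured credentials/project

query = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 5
"""

# The SQL-like query runs serverlessly; we only iterate over the results
for row in client.query(query).result():
    print(row.word, row.total)
```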
Amazon EMR (Elastic MapReduce)
• Amazon EMR is a cloud-based Big Data processing
platform that runs Apache Hadoop, Apache Spark,
and other Big Data frameworks on Amazon Web
Services (AWS). It provides scalable, cost-
effective data processing for applications such
as log analysis, data mining, and machine
learning.
Microsoft Azure Synapse
Analytics
• Microsoft Azure Synapse Analytics is an
enterprise-grade analytics service that
integrates Big Data and data warehousing. It
enables organizations to query and analyze
large datasets using SQL, Spark, and machine
learning capabilities. Azure Synapse is
commonly used for business intelligence,
reporting, and predictive analytics.
• These platforms are widely adopted for their
scalability, efficiency, and ability to handle
diverse data processing workloads across
different industries.
Drivers of Big Data
• Big Data is driven by several key factors that
contribute to its rapid growth, adoption, and
influence across industries.
• These drivers can be categorized into
1. Technological drivers
2. Business drivers
3. Social and Environmental Drivers
Technological drivers

•Increase in Data Generation: The exponential growth of digital data from social media, IoT devices,
sensors, and business transactions fuels Big Data.
•Advancements in Storage Technologies: Cloud computing, distributed file systems (e.g., Hadoop
HDFS), and SSDs enable efficient data storage.
•Computing Power & Scalability: High-performance computing (HPC), GPUs, and cloud platforms
(AWS, Google Cloud, Azure) facilitate real-time data processing.
•Big Data Frameworks & Tools: Technologies like Hadoop, Spark, and NoSQL databases (MongoDB,
Cassandra) allow for large-scale data processing.
•Artificial Intelligence & Machine Learning: AI and ML require vast amounts of data for training
models, driving the need for Big Data solutions.
Business Drivers

•Data-Driven Decision Making: Companies leverage Big Data for predictive analytics, customer
insights, and competitive advantage.
•Cost Reduction: Big Data analytics helps optimize supply chains, reduce operational costs, and
improve efficiency.
•Personalization & Customer Experience: Businesses use Big Data for targeted marketing,
recommendation systems (e.g., Netflix, Amazon), and user engagement.
•Fraud Detection & Risk Management: Financial institutions and cybersecurity firms use Big Data
analytics for fraud detection and anomaly detection.
•Real-Time Processing & Automation: Industries like finance, healthcare, and manufacturing use real-
time data analytics for automation and decision-making.
Social & Environmental Drivers

•Growth of Social Media & Digital Platforms: Platforms like Facebook, Twitter, and YouTube generate
massive user data daily.
•Smart Cities & IoT Integration: Governments and organizations use Big Data to optimize urban planning,
traffic management, and energy consumption.
•Healthcare & Genomics: Medical research and personalized medicine rely on Big Data for disease prediction,
drug discovery, and diagnostics.
•Regulatory Compliance & Governance: Industries are required to manage and analyze large volumes of
compliance-related data (e.g., GDPR, HIPAA).
•Environmental Monitoring & Sustainability: Big Data is used for climate modeling, disaster prediction, and
efficient resource management.
BIG DATA Architecture
• Big data architecture is specifically designed to manage
data ingestion, data processing, and analysis of data
that is too large or complex. Such big data cannot be
stored, processed, and managed by conventional relational
databases. The solution is to organize the technology into a
big data architecture that is able to manage and process the data.
Key Aspects of Big Data Architecture

The following are some key aspects of big data
architecture:
• To store and process data of large size, e.g., 100 GB or more.
• To aggregate and transform a wide variety of
unstructured data for analysis and reporting.
• To access, process, and analyze streamed data in
real time.
1. Data Storage
• Big Data storage consists of distributed file
stores that can hold large, multi-format files
efficiently. A Data Lake is used to store diverse
file formats, including structured, semi-
structured, and unstructured data. This storage is
primarily used for batch operations and supports
blob storage solutions such as:
• HDFS (Hadoop Distributed File System)
• Microsoft Azure Blob Storage
• AWS S3 (Simple Storage Service)
• Google Cloud Storage (GCP Storage)
2. Batch Processing
• Batch processing is a long-running operation that
processes data in chunks by filtering, aggregating, and
preparing it for analysis. These jobs require input
data, process it, and generate output files. Common
batch processing tools include:
• Hive Jobs (SQL-like querying for batch data)
• U-SQL Jobs (Microsoft’s big data processing language)
• Apache Sqoop (Data transfer between RDBMS and Hadoop)
• Apache Pig (High-level scripting for Hadoop)
• Custom MapReduce Jobs (Written in Java, Scala, Python)
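As a simple stand-in for a batch job (not a Hive or Pig example), the pandas sketch below processes a hypothetical transactions.csv in chunks, filtering and aggregating the input and writing an output file, which mirrors the input/process/output shape of a batch job:

```python
import pandas as pd

# Read a large CSV in chunks so the whole file never sits in memory
totals = {}
for chunk in pd.read_csv("transactions.csv", chunksize=100_000):
    valid = chunk[chunk["amount"] > 0]            # filter out bad rows
    grouped = valid.groupby("region")["amount"].sum()
    for region, amount in grouped.items():        # accumulate partial results
        totals[region] = totals.get(region, 0.0) + amount

# Generate the output file, ready for analysis and reporting
pd.Series(totals, name="total_amount").to_csv("region_totals.csv")
```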
3. Real-Time Message Ingestion
• A real-time streaming system handles incoming data
as it arrives, differing from batch processing,
which processes data in scheduled intervals. Data
is continuously collected and stored for
processing. Some common message-based ingestion
tools include:
• Apache Kafka (Highly scalable, distributed event
streaming)
• Apache Flume (Data collection, aggregation, and
movement)
• Azure Event Hubs (Streaming platform for event-
driven applications)
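For illustration, a minimal producer using the kafka-python package is sketched below; the broker address and the sensor-readings topic are assumptions, not fixed names.

```python
import json
from kafka import KafkaProducer  # from the kafka-python package

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed local Kafka broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each reading becomes an event on the topic as soon as it arrives
producer.send("sensor-readings", {"sensor_id": 7, "temperature": 41.3})
producer.flush()  # block until buffered events are delivered
```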
4. Stream Processing
• Unlike batch processing, stream processing handles
real-time data flows by consuming, processing, and
delivering insights within milliseconds to
seconds. This is achieved using publish-subscribe
messaging systems and window-based data processing
techniques.
• Apache Spark Streaming (Micro-batch stream
processing)
• Apache Flink (Low-latency, distributed stream
processing)
• Apache Storm (Real-time distributed computation)
• Processed data is then stored in a sink for
further use.
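The window-based idea can be illustrated with a few lines of plain Python (a toy sketch, not Spark, Flink, or Storm code): each incoming event updates a sliding window and an insight is emitted immediately, rather than waiting for a scheduled batch.

```python
from collections import deque

def sliding_average(stream, window_size=3):
    # Keep only the most recent `window_size` events: a sliding window
    window = deque(maxlen=window_size)
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)  # one insight per event, not per batch

readings = [10, 12, 11, 30, 13, 12, 11]  # e.g., a live sensor feed
for avg in sliding_average(readings):
    print(round(avg, 2))
```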
5. Analytics-Based Datastore
• Once processed, data is stored in a data
warehouse or NoSQL database for querying and
analysis. These analytical stores allow faster
lookups and advanced analytics.
• HBase (NoSQL database for real-time read/write)
• Apache Hive (SQL-based querying on Hadoop)
• Spark SQL (Query engine for structured big data
processing)
• Hive enables metadata abstraction, making it
easier to manage and analyze large datasets.
6. Reporting & Analysis
• The insights generated from Big Data processing
need to be visualized using reporting and analysis
tools. These tools create dashboards, graphs, and
reports to support business intelligence (BI) and
decision-making.
• IBM Cognos
• Oracle Hyperion
• Tableau, Power BI, Looker
• These tools help organizations understand trends,
make predictions, and gain actionable insights.
7. Orchestration
• Orchestration tools automate and manage Big
Data workflows, ensuring data pipelines run
efficiently. They enable data transformation,
movement, and scheduling across different
sources and destinations. Some common
orchestration tools include:
• Apache Oozie (Workflow scheduler for Hadoop)
• Apache Airflow (Task orchestration and workflow
automation)
• Azure Data Factory (Cloud-based ETL and data
movement service)
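As an example of orchestration, here is a minimal Apache Airflow DAG sketch (Airflow 2.x assumed; the DAG name and task bodies are placeholders) wiring three pipeline steps to run daily in order:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder task bodies; a real pipeline would move and transform data
def extract():   print("pull raw data from the source")
def transform(): print("clean and aggregate the data")
def load():      print("write results to the analytics store")

with DAG(dag_id="daily_etl", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3  # dependency order: extract, then transform, then load
```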
5 V’s of Big Data
• There are five V's of Big Data that explain its
characteristics:
• Volume
• Veracity
• Variety
• Value
• Velocity
1. Volume (Size of Data)
• Refers to the massive amount of data generated
daily from sources like social media, IoT
devices, sensors, transactions, and logs.
• Examples: Facebook generates over 4 petabytes of
data per day; the Large Hadron Collider produces 1 petabyte
per second of data during experiments.
• Challenges: Requires scalable storage solutions
like Hadoop HDFS, AWS S3, and Google BigQuery.
2. Velocity (Speed of Data
Generation & Processing)

•Describes the speed at which data is generated, collected, and processed in real time.
•Examples:
•Stock market transactions require millisecond-level processing.
•IoT sensors stream continuous real-time data for predictive maintenance.
•Challenges: Needs low-latency data pipelines using Kafka, Apache Flink, and Spark Streaming.
3. Variety (Different Data
Formats & Sources)
• Data comes in structured, semi-structured, and
unstructured formats from various sources.
• Examples:
• Structured: SQL databases, Excel files.
• Semi-Structured: JSON, XML, NoSQL databases.
• Unstructured: Images, videos, audio, social media
posts.
• Challenges: Requires multi-format storage
(HDFS, MongoDB) and flexible processing
frameworks (Spark, Hadoop).
4. Veracity (Data Quality &
Accuracy)
• Measures trustworthiness, consistency, and
reliability of data.
• Examples:
• Fake news and misinformation on social media.
• Sensor data errors due to hardware malfunctions.
• Challenges: Requires data cleansing, filtering,
and validation using AI/ML techniques.
5. Value (Business &
Analytical Insights)
• The ultimate goal is to extract meaningful
insights that drive business decisions.
• Examples:
• E-commerce: Personalized recommendations (Amazon,
Netflix).
• Healthcare: Predicting disease outbreaks with Big
Data analytics.
• Challenges: Requires AI-driven analytics, data
monetization, and predictive modeling.
Applications of Big Data
1. Healthcare & Medical Research
2. Finance & Banking
3. Retail & E-Commerce
4. Manufacturing & Industrial IoT (IIoT)
5. Transportation & Logistics
6. Smart Cities & IoT
7. Education & Research
8. Cybersecurity & Threat Detection
9. Social Media & Entertainment
10.Agriculture & Environmental Science
Big Data Security
• Big data security is a set of data security measures and
practices to safeguard large volumes of data, known as "big
data," from malware attacks, unauthorized access, and other
security threats.
• The process involves protecting the confidentiality, integrity,
and accessibility of data.
• Big data security management includes data encryption, access
control, authentication, authorization, monitoring, threat
detection, employee training, etc.
Big Data Security Practices
1. Encryption
2. Effective user access control
3. Monitoring Cloud Security
4. Network Traffic Analysis
5. Vulnerability Management
6. Employee training and awareness
7. Insider threat detection
8. Prompt incident response plan
9. Regular Data Backup
Big Data Compliance
• Big Data compliance refers to the process of
ensuring that large-scale data collection,
storage, processing, and analysis adhere to
legal, regulatory, and ethical standards. Due
to the vast volume, variety, and velocity of
Big Data, compliance becomes complex and
requires robust governance mechanisms.
Big Data Auditing
• Big data auditing is the process of analyzing big data to identify
patterns and correlations that may be of interest for an audit. It's also
known as data risk management.
Working of Big Data Auditing
• Analyzes structured and unstructured data, both internally and
externally
• Uses data analytics to identify patterns, trends, descriptions,
exceptions, inconsistencies, and relationships in data sets
• Helps identify and fix errors
• Helps assess data quality
• Helps make informed decisions based on analytics
Big Data Privacy
Big data privacy is protecting individuals' personal and sensitive
data when it comes to collecting, storing, processing, and
analyzing large amounts of data.
Big Data Ethics
Big data ethics refers to the ethical and responsible decisions
that are made when collecting, processing, analyzing, and
deploying large and complex data sets.
Big Data Technology Components
• Cloud Computing is the delivery of computing services over
the internet, allowing users to access resources on-
demand. It includes services like storage, networking, and
software.
• Machine Learning(ML) is a type of artificial intelligence
(AI) that allows computers to learn and improve their
performance over time.
• Business Intelligence (BI) is the process of analyzing
data to help businesses make better decisions. It involves
using technology, strategies, and methodologies to collect
and manage data.
• Natural Language Processing (NLP) is a subfield of
artificial intelligence that helps computers understand
and communicate with humans. It uses machine learning,
computational linguistics, and statistical modeling.
Big Data Analytics
Big Data Analytics uses advanced analytical methods that can
extract important business insights from bulk datasets. Within
these datasets lies both structured (organized) and unstructured
(unorganized) data. Its applications cover different industries such
as healthcare, education, insurance, AI, retail, and
manufacturing.
Big Data Analytics is all about crunching massive amounts of information to uncover
hidden trends, patterns, and relationships. It's like sifting through a giant mountain of data
to find the gold nuggets of insight.
Here's a breakdown of what it involves:
• Collecting Data: Data comes from various sources such as social media, web
traffic, sensors, and customer reviews.

• Cleaning the Data: Imagine having to assess a pile of rocks that includes some gold
pieces. You would have to clean off the dirt and debris first. When data is
cleaned, mistakes are fixed, duplicates are removed, and the data is
formatted properly. A short cleaning sketch follows this list.

• Analyzing the Data: This is where the wizardry takes place. Data analysts employ
powerful tools and techniques to discover patterns and trends, much like
looking for a specific pattern in all those rocks you sorted through.
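A tiny pandas sketch of the cleaning step, on hypothetical review data:

```python
import pandas as pd

# Hypothetical raw records with typical defects: duplicates, bad types, gaps
df = pd.DataFrame({
    "review_id": [1, 2, 2, 3],
    "rating":    [5, None, None, "4"],
})

df = df.drop_duplicates(subset="review_id")                 # remove duplicates
df["rating"] = pd.to_numeric(df["rating"])                  # fix formatting
df["rating"] = df["rating"].fillna(df["rating"].median())   # fill missing values
print(df)
```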
Types of Big Data Analytics

1. Descriptive Analytics: This type helps us understand past events. In social media,
it shows performance metrics, like the number of likes on a post.

2. Diagnostic Analytics: Diagnostic analytics delves deeper to uncover the
reasons behind past events. In healthcare, it identifies the causes of high patient
re-admissions.

3. Predictive Analytics: Predictive analytics forecasts future events based on past
data. Weather forecasting, for example, predicts tomorrow's weather by analyzing
historical patterns.

4. Prescriptive Analytics: This category not only predicts results but also
offers recommendations for action to achieve the best results. In e-commerce, it
may suggest the best price for a product to achieve the highest possible profit.
Types of Big Data Analytics

5. Real-time Analytics: The key function of real-time analytics is
processing data in real time. It allows traders to swiftly make
decisions based on real-time market events.
6. Spatial Analytics: Spatial analytics is about location data. In
urban management, it optimizes traffic flow using data from
sensors and cameras to minimize traffic jams.
7. Text Analytics: Text analytics delves into unstructured text
data. In the hotel business, it can use guest reviews to
enhance services and guest satisfaction.
Challenges of Conventional
Systems in Big Data
● Scalability
● Speed
● Storage
● Data Integration
● Security
Intelligent Data Analysis
Intelligent data analysis (IDA) is a process that uses artificial
intelligence (AI) and other techniques to extract useful information
from large amounts of data. It can help decision makers make better
choices.
• Data preparation: Select and integrate relevant data into a dataset.
• Data mining: Use algorithms to find rules and patterns in the data.
• Result validation: Verify the patterns found by the algorithms.
• Result explanation: Communicate the results in a way that is easy to understand.
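The four steps can be illustrated with a short scikit-learn sketch (using the bundled Iris dataset as a stand-in for "relevant data"):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

# Data preparation: select and integrate relevant data into a dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Data mining: an algorithm finds rules/patterns in the data
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Result validation: verify the patterns on data the algorithm has not seen
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Result explanation: a shallow tree's rules can be printed for humans to read
print(export_text(model))
```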
Nature of data, analytic processes
and tools
• The nature of data refers to its structure, type, and
characteristics, such as being structured (databases,
spreadsheets), unstructured (text, images, videos), or semi-
structured (JSON, XML). It can also be qualitative
(descriptive) or quantitative (numerical) based on its
format and usage.
• The analytical process involves data collection, cleaning,
exploration, feature engineering, modeling, evaluation,
interpretation, and deployment to extract meaningful
insights.
• Analytical tools include Python (Pandas, NumPy, Scikit-
learn), R, SQL, TensorFlow, Tableau, Power BI, Apache Spark,
and OpenCV for data processing, analysis, visualization, and
machine learning.
Reporting
• Reporting primarily involves the presentation of data in a
structured format. Its purpose is to provide a snapshot of
specific metrics or KPIs over a defined period. Reports are
instrumental in summarizing information for stakeholders and are
often automated and scheduled on a regular basis. Ad hoc reports,
created on-demand, can address specific inquiries or issues
promptly. Data visualizations help identify trends, patterns, and
anomalies more intuitively. Dashboards play a crucial role in
presenting real-time data to stakeholders for quick decision-
making.
Analytics
• Analytics goes beyond summarizing data. Its primary purpose is to
uncover meaningful insights and patterns within the data.
Analytics aims to answer questions like ‘why’, ‘how’, and ‘what
next’, by offering a deeper understanding of the underlying
factors driving performance.

The 4 Types of Data Analytics and the Question Each Answers:

• Descriptive: What happened?
• Diagnostic: Why did it happen?
• Predictive: What will happen?
• Prescriptive: What should we do?
Data Analytics Tools
• Apache Hadoop – Distributed storage and processing
of big data using the Hadoop Distributed File
System (HDFS) and MapReduce.
• Apache Spark – Fast, in-memory data processing
engine with support for SQL, streaming, machine
learning, and graph processing.
• Google BigQuery – A fully managed, serverless data
warehouse for high-speed analytics.
• Amazon Redshift – Cloud-based data warehousing
solution for analytics at scale.
• Snowflake – A cloud-based platform optimized for
big data storage, querying, and sharing.
END
