Unit - 1
Data
What is Data?
• The quantities, characters, or symbols on which operations are
performed by a computer, which may be stored and transmitted in
the form of electrical signals and recorded on magnetic, optical, or
mechanical recording media.
• Big Data is a collection of data that is huge in volume, yet growing
exponentially with time. Its size and complexity are such that no
traditional data management tool can store or process it
efficiently.
• Today we live in the digital world. With increased digitization the
amount of structured and unstructured data being created and stored
is exploding.
• The data is being generated from various sources - transactions, social
media, sensors, digital images, videos, audios and clickstreams for
domains including healthcare, retail, energy and utilities.
• For instance, 30 billion pieces of content are shared on Facebook every
month; the photos viewed every 16 seconds in Picasa could cover a
football field.
What is Big Data?
• Big data is a collection of large, complex, and diverse data sets that
are difficult to manage and analyze using traditional data processing
tools. It can include structured, semi-structured, and unstructured
data.
• Big data refers to extremely large and diverse collections of
structured, unstructured, and semi-structured data that continue to
grow exponentially over time.
• These datasets are so huge and complex in volume, velocity, and
variety, that traditional data management systems cannot store,
process, and analyze them.
Why is Big Data Important?
Big data is important because it can help organizations make better
decisions, improve operations, and gain a competitive advantage.
The importance of big data doesn’t simply revolve around how much data
you have. The value lies in how you use it. By taking data from any source
and analyzing it, you can find answers that
1) Streamline resource management,
2) Improve operational efficiencies,
3) Optimize product development,
4) Drive new revenue and growth opportunities and
5) Enable smart decision making.
When you combine big data with high-performance analytics, you can
accomplish business-related tasks such as:
• Determining root causes of failures, issues and defects in near-real
time.
• Spotting anomalies faster and more accurately than the human eye.
• Improving patient outcomes by rapidly converting medical image data
into insights.
• Recalculating entire risk portfolios in minutes.
• Sharpening deep learning models' ability to accurately classify and
react to changing variables.
• Detecting fraudulent behavior before it affects your organization.
• Companies use big data in their systems to improve operations,
provide better customer service, create personalized marketing
campaigns and take other actions that can ultimately increase
revenue and profits.
• Big data describes a massive volume of both structured and
unstructured data that is difficult to process using traditional database
and software techniques.
• The term big data is believed to have originated with web search
companies who needed to query very large distributed aggregations
of loosely-structured data.
• Big data has the potential to help companies improve operations and
make faster, more intelligent decisions.
Semi-structured Data:
Semi-structured data, or partially structured data, doesn’t follow the
tabular structure associated with relational databases or other forms of
data tables. However, it does contain tags and metadata to separate
semantic elements and establish hierarchies of records and fields.
What are Examples of Semi-Structured Data?
HTML code, graphs and tables, e-mails, and XML documents are examples
of semi-structured data, which is often found in object-oriented
databases.
Other examples include NoSQL databases, CSV files, JSON documents,
Electronic Data Interchange (EDI), and RDF.
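The idea that semi-structured data carries its own tags and hierarchy, without a fixed table layout, can be illustrated with a small JSON sketch. The customer records, field names, and the helper field_names below are made up for illustration; only Python's standard json module is assumed.

```python
import json

# Hypothetical records: both describe a customer, but their fields differ,
# which a fixed relational table could only hold by padding with NULLs.
raw = """
[
  {"id": 1, "name": "Asha", "email": "asha@example.com",
   "orders": [{"sku": "A-100", "qty": 2}]},
  {"id": 2, "name": "Ravi", "phone": "555-0100", "orders": []}
]
"""

records = json.loads(raw)


def field_names(record):
    """Return the sorted keys (the 'tags') present in one record."""
    return sorted(record.keys())


for record in records:
    # Each record describes itself; nested lists express the hierarchy.
    print(record["id"], field_names(record), "orders:", len(record["orders"]))
```

The keys play the role of the tags and metadata described above: a program can discover what each record contains at read time instead of relying on a schema fixed in advance.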
What is an Example of Big Data?
The following are some examples of Big Data:
• Stock Exchange
The New York Stock Exchange generates about one terabyte of new trade data
per day. It is ordinary trade data, only at a very large scale.
• Social Media
Statistics show that more than 500 terabytes of new data are ingested into the
databases of the social media site Facebook every day. This data is mainly generated
through photo and video uploads, message exchanges, comments, etc.
• A single jet engine can generate more than 10 terabytes of data in 30 minutes of flight
time. With many thousands of flights per day, data generation reaches
many petabytes.
• 1 bit = a 1 or a 0; 8 bits = 1 byte
• 1024^1 bytes = 1 kilobyte (KB)
• 1024^2 bytes = 1 megabyte (MB)
• 1024^3 bytes = 1 gigabyte (GB)
• 1024^4 bytes = 1 terabyte (TB)
• 1024^5 bytes = 1 petabyte (PB)
• 1024^6 bytes = 1 exabyte (EB)
• 1024^7 bytes = 1 zettabyte (ZB)
• 1024^8 bytes = 1 yottabyte (YB)
• 1024^9 bytes = 1 brontobyte (BB), an informal name, not a standard prefix
• 1024^10 bytes = 1 geopbyte (gB), an informal name, not a standard prefix
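The unit ladder above is simple arithmetic: each step multiplies by 1024. The sketch below is one way to compute it, using hypothetical helper names (bytes_in, human_readable) and stopping at the yottabyte, since the two larger names are informal.

```python
# Multiples of a byte as listed above; each step is a factor of 1024.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def bytes_in(unit):
    """Number of bytes in one `unit`, using 1024 as the step."""
    return 1024 ** UNITS.index(unit)


def human_readable(n_bytes):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(n_bytes)
    for unit in UNITS:
        # Stop once the value fits in this unit, or we run out of units.
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


print(bytes_in("TB"))                         # 1099511627776
print(human_readable(500 * bytes_in("TB")))   # 500.00 TB
print(human_readable(1536))                   # 1.50 KB
```

Note that storage vendors often use powers of 1000 for the same unit names, so figures quoted in terabytes may differ slightly depending on the convention.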
Evolution of Big Data
1. Early days of Computing:
• Data was stored on mainframe computers and was used for business and
scientific applications.
• The amount of data stored and analyzed was limited. Batch processing
techniques were used for data processing.
2. Data warehousing:
• It allowed organizations to store and analyze large amounts of data from
multiple sources.
• The data was primarily structured. The amount of data stored and analyzed was
limited.
3. The rise of the Internet:
• With the rise of the internet in the 1990s, the amount of
data being generated & collected began to grow rapidly.
• The data was more diverse and unstructured, and it was difficult
to process and analyze using traditional techniques.
What’s Captured:
Equipment performance, operational data, and environmental metrics.
Applications:
• Predictive Maintenance: Anticipate when machines might fail to reduce
downtime.
• Automation: Optimize workflows in smart factories or agricultural irrigation
systems.
Example: Smart home devices like thermostats adjust room temperatures based
on usage data.
3. Transaction Data:
Transaction data includes digital records from financial institutions, e-commerce
websites, and point-of-sale systems.
What’s Captured:
Purchase history, payment methods, inventory levels, and customer details.
Applications:
• Fraud Detection: Monitor transactions for unusual activity.
• Demand Forecasting: Predict product requirements based on buying patterns.
Example: E-commerce platforms like Amazon analyze purchase history to
recommend products.
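The fraud detection application for transaction data can be sketched with a very simple rule: flag any amount that lies far from the average. This is only an illustration on made-up amounts; the function name flag_unusual and the threshold of 2 standard deviations are assumptions, and real systems use far richer features and models.

```python
import statistics

# Hypothetical purchase amounts for one customer; one value stands out.
amounts = [42.0, 38.5, 40.0, 41.2, 39.8, 43.1, 950.0, 40.5]


def flag_unusual(amounts, threshold=2.0):
    """Return amounts more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(amounts)
    spread = statistics.pstdev(amounts)
    if spread == 0:
        # All amounts are identical, so nothing is unusual.
        return []
    return [a for a in amounts if abs(a - mean) / spread > threshold]


print(flag_unusual(amounts))   # [950.0]
```

A mean-based rule like this is easy to explain but is itself distorted by extreme values, which is one reason production systems tend to prefer more robust statistics or learned models.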
4. Healthcare Data:
The healthcare industry collects and processes critical information from
hospitals, clinics, diagnostics labs, and wearable devices.
What’s Captured:
Patient records, genetic data, diagnostic images, and treatment
outcomes.
Applications:
• Personalized Medicine: Tailor treatments based on patient history.
• Epidemic Prediction: Use patient data to identify and contain
outbreaks.
Example: Fitness trackers provide real-time health metrics, which
doctors can use to monitor patients remotely.
5. Government and Public Data:
Government agencies and public organizations generate data from
weather monitoring, census collection, and transportation systems.
What’s Captured:
Population statistics, weather forecasts, traffic patterns, and public records.
Applications:
• Policy Making: Use demographic data to create impactful public policies.
• Urban Planning: Optimize infrastructure projects based on traffic and
population data.
Example: Smart traffic systems use data to reduce congestion in urban
areas.
6. Media and Entertainment Data:
Streaming services, gaming platforms, and digital publishers track user
activity and preferences.
What’s Captured:
Viewing habits, subscription details, social media engagement, and user
feedback.
Applications:
• Content Personalization: Recommend movies, songs, or games based on
user preferences.
• Engagement Analytics: Identify what content performs well to optimize
strategies.
Example: Netflix uses data analytics to recommend shows based on viewing
history.
7. Industrial Data:
Collected from robotics, manufacturing systems, and supply chains,
industrial data is critical for process optimization.
What’s Captured:
Production efficiency, inventory levels, shipment statuses, and machine
performance.
Applications:
• Supply Chain Optimization: Ensure timely delivery of goods by
monitoring logistics.
• Quality Assurance: Analyze production data to maintain high standards.
Example: Automotive companies monitor assembly line data to detect
defects early.
8. Scientific Research Data:
Fields like genomics, climate studies, and astronomy generate extensive
datasets from experiments and observations.
What’s Captured:
Satellite imagery, genome sequences, and experimental data.
Applications:
• Climate Models: Predict changes in weather patterns to combat global
warming.
• Medical Research: Develop new treatments or drugs using genomic
data.
Example: Space agencies use satellite data to monitor planetary
conditions.
What are the Main Components of Big Data?
Organizations that integrate the following components effectively can
unlock the potential of big data.
1. Data Sources
What It Includes:
Social media interactions, IoT devices, business transactions, and
customer feedback.
Purpose:
Provide the raw data required for analysis.
2. Data Storage
Key Systems:
• Hadoop Distributed File System (HDFS): For distributed and scalable storage.
• Data Lakes: Store large volumes of unstructured and semi-structured data.
• Cloud Storage: Solutions like Azure, AWS, and Google Cloud for flexible storage.
Purpose:
Organize and securely store data for easy access.
3. Data Processing
Techniques:
• Batch Processing: Tools like MapReduce process large data sets in chunks.
• Real-Time Streaming: Platforms like Apache Spark handle live data streams.
Purpose:
Convert raw data into structured and actionable formats.
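The batch processing idea behind MapReduce can be sketched in plain Python as three phases: map, shuffle, and reduce. This is a single-machine illustration of the pattern, not the Hadoop API; the function names and the two input chunks are made up, and a real cluster would run the map and reduce steps in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


# Stand-ins for the splits of a large file stored across a cluster.
chunks = ["big data is big", "data is growing"]

pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(pairs))
print(counts)   # {'big': 2, 'data': 2, 'is': 2, 'growing': 1}
```

Because each chunk is mapped independently and each key is reduced independently, the same logic scales out by adding machines, which is the main appeal of the model.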
4. Data Analytics
Methods Used:
Statistical models, machine learning algorithms, and predictive analytics.
Tools:
Python libraries like Pandas and Scikit-learn, and platforms like SAS and Tableau.
Purpose:
Derive insights, identify trends, and make data-driven predictions.
5. Data Visualization
How It’s Done:
Dashboards, heatmaps, and interactive graphs using tools like Power BI and
Tableau.
Purpose:
Present findings in an understandable way to help decision-makers.
How Does Big Data Analytics Work?
Big data analytics involves transforming vast amounts of raw data into
actionable insights. The process follows these steps:
1. Data Collection
What Happens: Data is gathered from diverse sources like:
• Social media platforms.
• Internet of Things (IoT) devices.
• Business databases.
• Online transactions.
Goal: Compile data in all formats (structured, unstructured, and
semi-structured) for analysis.
2. Data Cleaning
What Happens: Errors, duplicates, and irrelevant entries are removed. Common
tasks include:
• Fixing typos and standardizing formats.
• Filling missing values to avoid incomplete analysis.
Goal: Ensure the data is accurate and reliable for processing.
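The cleaning tasks listed above (standardizing formats, removing duplicates, filling missing values) can be sketched on a tiny made-up data set. The rows, field names, and the choice to fill gaps with the mean are assumptions for illustration; in practice a library such as Pandas would usually do this work.

```python
# Hypothetical raw rows: inconsistent casing and whitespace,
# one duplicate, and one missing value.
rows = [
    {"city": " Pune ", "temp_c": 31.0},
    {"city": "pune", "temp_c": 31.0},     # duplicate once standardized
    {"city": "Delhi", "temp_c": None},    # missing value
    {"city": "Mumbai", "temp_c": 29.0},
]


def clean(rows):
    # 1. Standardize formats: trim whitespace and use title case.
    standardized = [
        {"city": r["city"].strip().title(), "temp_c": r["temp_c"]} for r in rows
    ]
    # 2. Remove duplicates while preserving the original order.
    seen, unique = set(), []
    for r in standardized:
        key = (r["city"], r["temp_c"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    # 3. Fill missing values with the mean of the known values.
    known = [r["temp_c"] for r in unique if r["temp_c"] is not None]
    fill = sum(known) / len(known)
    return [
        {"city": r["city"], "temp_c": fill if r["temp_c"] is None else r["temp_c"]}
        for r in unique
    ]


cleaned = clean(rows)
for row in cleaned:
    print(row)
```

Standardizing before deduplicating matters here: " Pune " and "pune" only become recognizable as the same entry after their formats agree.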
3. Data Processing
What Happens: Organize and structure data using powerful tools like:
• Apache Hadoop: For distributed storage and processing.
• Apache Spark: For faster, real-time data operations.
Goal: Convert raw data into manageable formats like tables or graphs for further
analysis.
4. Data Analysis
What Happens: Use statistical techniques and machine learning models to
extract insights. Popular methods include:
• Regression analysis for identifying trends.
• Clustering to group similar data points.
• Predictive modeling to forecast future trends.
Goal: Solve key business problems and predict outcomes.
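Regression analysis and predictive modeling can be sketched together with a least-squares line fit followed by a one-step forecast. The monthly sales figures below are made up and deliberately linear, and fit_line is a hypothetical helper; libraries such as Scikit-learn provide the same technique with many more options.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


months = [1, 2, 3, 4, 5]
sales = [100, 120, 140, 160, 180]   # made-up values that grow by 20 per month

slope, intercept = fit_line(months, sales)   # the trend: +20 per month
forecast = slope * 6 + intercept             # predicted sales for month 6
print(slope, intercept, forecast)            # 20.0 80.0 200.0
```

The slope is the identified trend, and plugging a future month into the fitted line is the simplest form of predictive modeling.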
5. Data Visualization
What Happens: Present the results in clear, intuitive visuals using tools like:
• Tableau and Power BI for creating interactive dashboards.
• Charts, heatmaps, and graphs to make data easy to understand.
Goal: Help stakeholders make informed decisions quickly.