Introduction to Big Data
Unit - 1
What is Data?
• The quantities, characters, or symbols on which operations are
performed by a computer, which may be stored and transmitted in
the form of electrical signals and recorded on magnetic, optical, or
mechanical recording media.
• Big Data is a collection of data that is huge in volume, yet growing
exponentially with time. It is data of such large size and complexity that
none of the traditional data management tools can store or process it
efficiently.
• Today we live in the digital world. With increased digitization the
amount of structured and unstructured data being created and stored
is exploding.
• The data is being generated from various sources - transactions, social
media, sensors, digital images, videos, audio and clickstreams - across
domains including healthcare, retail, energy and utilities.
• For instance, about 30 billion pieces of content are shared on Facebook every
month, and the photos viewed every 16 seconds in Picasa could cover a
football field.
What is Big Data?
• Big data is a collection of large, complex, and diverse data sets that
are difficult to manage and analyze using traditional data processing
tools. It can include structured, semi-structured, and unstructured
data.
• Big data refers to extremely large and diverse collections of
structured, unstructured, and semi-structured data that continues to
grow exponentially over time.
• These datasets are so huge and complex in volume, velocity, and
variety, that traditional data management systems cannot store,
process, and analyze them.
Why Big Data is Important?
Big data is important because it can help organizations make better
decisions, improve operations, and gain a competitive advantage.
The importance of big data doesn’t simply revolve around how much data
you have. The value lies in how you use it. By taking data from any source
and analyzing it, you can find answers that
1) Streamline resource management,
2) Improve operational efficiencies,
3) Optimize product development,
4) Drive new revenue and growth opportunities and
5) Enable smart decision making.
When you combine big data with high-performance analytics, you can
accomplish business-related tasks such as:
• Determining root causes of failures, issues and defects in near-real
time.
• Spotting anomalies faster and more accurately than the human eye.
• Improving patient outcomes by rapidly converting medical image data
into insights.
• Recalculating entire risk portfolios in minutes.
• Sharpening deep learning models' ability to accurately classify and
react to changing variables.
• Detecting fraudulent behavior before it affects your organization.
• Companies use Big data in their systems to improve operations,
provide better customer service, create personalized marketing
campaigns and take other actions that, ultimately, can increase
revenue and profits.
• Big data describes a massive volume of both structured and
unstructured data that is difficult to process using traditional database
and software techniques.
• The term big data is believed to have originated with web search
companies who needed to query very large distributed aggregations
of loosely-structured data.
• Big data has the potential to help companies improve operations and
make faster, more intelligent decisions.
Semi-structured Data:
Semi-structured data, or partially structured data, doesn’t follow the
tabular structure associated with relational databases or other forms of
data tables. However, it does contain tags and metadata to separate
semantic elements and establish hierarchies of records and fields.
What are Examples of Semi-Structured Data?
HTML code, graphs and tables, e-mails, XML documents are examples
of semi-structured data, which are often found in object-oriented
databases.
NoSQL databases, CSV, JSON documents, Electronic data interchange
(EDI),RDF
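The record below is a minimal, hypothetical example (the field names are invented); the tags and nesting show how semi-structured data carries its own structure without a fixed table layout:

    import json

    # A hypothetical customer record: tags (field names) and nesting describe the
    # data, but there is no fixed relational schema behind it.
    record = """
    {
      "customer_id": "C-1001",
      "name": "Asha",
      "orders": [
        {"order_id": 1, "items": ["phone", "case"], "total": 699.0},
        {"order_id": 2, "items": ["charger"], "total": 19.5}
      ],
      "preferences": {"newsletter": true, "language": "en"}
    }
    """

    data = json.loads(record)
    # Fields are addressed by name and hierarchy rather than by table and column.
    print(data["orders"][0]["total"])   # 699.0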
What is an Example of Big Data?
Following are some of the Big Data examples-
• Stock Exchange
The New York Stock Exchange is an example of Big Data: it generates about
one terabyte of new trade data per day.
• Social Media
Statistics show that 500+ terabytes of new data get ingested into the
databases of the social media site Facebook every day. This data is mainly
generated by photo and video uploads, message exchanges, comments, etc.
• A single jet engine can generate 10+ terabytes of data in 30 minutes of flight
time. With many thousands of flights per day, the data generated reaches
many petabytes.
• One bit is a 1 or a 0; 8 bits = 1 Byte
• 1024^1 Bytes = 1 KiloByte (KB)
• 1024^2 Bytes = 1 MegaByte (MB)
• 1024^3 Bytes = 1 GigaByte (GB)
• 1024^4 Bytes = 1 TeraByte (TB)
• 1024^5 Bytes = 1 PetaByte (PB)
• 1024^6 Bytes = 1 ExaByte (EB)
• 1024^7 Bytes = 1 ZettaByte (ZB)
• 1024^8 Bytes = 1 YottaByte (YB)
• 1024^9 Bytes = 1 BrontoByte (BB)
• 1024^10 Bytes = 1 GeopByte
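As a quick illustration of these powers of 1024 (a sketch added for illustration, not part of the original slides), the short Python snippet below converts a raw byte count into the nearest unit:

    # Convert a raw byte count into a human-readable unit using powers of 1024.
    UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

    def human_readable(num_bytes: float) -> str:
        for unit in UNITS:
            if num_bytes < 1024 or unit == UNITS[-1]:
                return f"{num_bytes:.2f} {unit}"
            num_bytes /= 1024

    # Roughly the 500 TB Facebook ingests per day.
    print(human_readable(500 * 1024**4))   # "500.00 TB"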
Evolution of Big Data
1. Early days of Computing:
• Data was stored on mainframe computers and was used for business and
scientific applications.
• The amount of data stored & analyzed was limited. For data processing, batch
processing techniques were used.

2. Data warehousing:
• It allowed organizations to store and analyze large amount of data from
multiple sources.
• The data was primarily structured. The amount of data stored & analyzed was
still limited.
3. The rise of the Internet:
• With the rise of the internet in the 1990s, the amount of
data being generated & collected began to grow rapidly.
• The data was more diverse and unstructured, and it was difficult
to process and analyze using traditional techniques.

4. The emergence of Big data:


• In the early 2000s, the term “Big Data” was coined to
describe the large volume of data that was being
generated & collected.
• New technologies such as Hadoop & NoSQL databases
were developed to handle the volume & variety of data
5. The growth of Big data:
• The amount of diverse & unstructured data being
generated & collected has continued to grow rapidly.
• New technologies such as cloud computing and streaming analytics
have been developed to handle the volume, variety & velocity of data.

6. Artificial Intelligence and Machine Learning:


• Big Data is also being used to train Machine Learning models and AI.
• This allows organizations to gain insights & predictions that were not
possible before.
7. Internet of Things and 5G:
• With the advent of the Internet of Things (IoT) and 5G, Big Data is also
becoming more distributed and mobile.
• This is creating new challenges and opportunities for Big Data
processing and analytics.
8. Blockchain and Big Data:
• With the advent of Blockchain technology, big data can be secured in
a way that was not possible before.
• This opens up new opportunities for decentralized data processing
and analytics.
Big Data continues to evolve due to advances in technology and the
proliferation of connected devices and the internet.
History of Big Data
• There were many advancements in technology during World War 2,
which were primarily made to serve military purposes. Those
advancements would become useful to the commercial sector and the
general public, with personal computing becoming a viable option for the
everyday consumer.
• John R. Mashey, Chief Scientist at Silicon Graphics, is considered the
father of the term 'Big Data'.
• Big Data is a term that describes large volumes of high velocity,
complex & variable data that require advanced techniques and
technologies to enable the capture, storage, distribution,
management, and analysis of the information.
• Big Data Analytics is the process of examining and interrogating big
data assets to derive insights of value for decision making.
1) 1940s to 1989 – Data Warehousing and Personal Desktop Computers
• The origins of electronic storage can be traced to the development of the world's
first programmable computer, the Electronic Numerical Integrator and
Computer (ENIAC). It was designed by the U.S. Army during World War 2 to
solve numerical problems, such as calculating the range of artillery fire.
• In the 1950s, the first transistorized computers appeared, such as the
TRansistorized Airborne DIgital Computer (TRADIC) built at Bell Labs; these
machines helped data centers branch out of the military and serve more general
commercial purposes.
• The first personal desktop computer to feature a Graphical User Interface
(GUI) was the Lisa, released by Apple Computer in 1983.
• Throughout the 1980s, companies like Apple, Microsoft, and IBM would
release a wide range of personal desktop computers. Thus, electronic storage
was finally available to the masses.
2) 1989 to 1999 – Emergence of the World Wide Web
• Between 1989 and 1993, British computer scientist Sir Tim Berners-
Lee would create the fundamental technologies required to power
the World Wide Web.
• These web technologies were HyperText Markup Language (HTML),
Uniform Resource Identifier (URI), and Hypertext Transfer Protocol
(HTTP).
• In April 1993, the decision was made to make the underlying code
for these web technologies free, forever.
• This made it possible for individuals, businesses, and organizations that
could afford to pay for an internet service to go online and share
data with other internet-enabled computers.
• As more devices gained access to the internet, the amount of
information that people could access and share at any one time
exploded.
3) 2000s to 2010s – Controlling Data Volume, Social Media and Cloud Computing:
• In the early 2000s, companies such as Amazon, eBay, and Google were generating
large amounts of web traffic, as well as a combination of structured and unstructured data.
• Amazon launched a beta version of AWS (Amazon Web Services) in 2002, which
opened the Amazon.com platform to all developers. By 2004, over 100 applications
were built for it.
• AWS then relaunched in 2006, offering a wide range of cloud infrastructure services,
including Simple Storage Service (S3) and Elastic Compute Cloud (EC2).
• The public launch of AWS attracted a wide range of customers, such as Dropbox,
Netflix, and Reddit, all of whom were cloud-enabled and partnered with AWS before 2010.
• Social media platforms (MySpace, Facebook, Twitter) led to a rise in the spread of
unstructured data. This included the sharing of images and audio files,
animated GIFs, videos, status posts, and direct messages.
• These platforms needed new ways to collect, organize, and make sense of the large
amounts of unstructured data being generated at an accelerated rate.
• This led to the creation of Hadoop, an open-source framework created specifically
to manage big data sets, and the adoption of NoSQL databases, which made
it possible to manage unstructured data, that is, data that does not comply with a
relational database model.
• With these new technologies, companies could now collect large amounts of disparate
data and then extract meaningful insights for more informed decision making.
4) 2010s to now – Optimization Techniques, Mobile Devices and IoT:
• In the 2010s, one of the biggest challenges facing big data was the advent of mobile
devices and the IoT (Internet of Things).
• Millions of people worldwide now had small, internet-enabled devices in their hands,
able to access the web, wirelessly communicate with other internet-enabled
devices, and upload data to the cloud.
• According to a 2017 Data Never Sleeps report by Domo, we were generating 2.5
quintillion bytes of data daily.
The rise of mobile devices and IoT devices also led to new types of data
being collected, organized, and analyzed.
Some examples include:
• Sensor Data (data collected by internet-enabled sensors to provide
valuable, real-time insight into the inner workings of a piece of
machinery)
• Social Data (publicly available social media data from platforms like
Facebook and Twitter)
• Transactional Data (data from online web stores including receipts,
storage records, and repeat purchases)
• Health-related data (heart rate monitors, patient records, medical
history)
Failure of Traditional Database in Handling Big Data
Traditional databases fail to handle big data because of the following
limitations:
• Scalability:- Traditional systems can't scale up to handle large amounts of
data. Scaling up involves adding resources like memory, CPU, or disk
space to a single server. This can be expensive, time-consuming, and
prone to failure.
• Inflexibility:- Traditional systems are not well-suited for handling
unstructured or semi-structured data.
• Latency:- Batch processing introduces latency, making it difficult to
analyze data in real-time.
• Cost:-Scaling traditional systems can be expensive due to the need for
high-end hardware and software licenses.
Cont..
• Traditional databases are optimized for structured data and smaller
datasets, whereas big data requires advanced tools due to its
complexity, volume, and variety.
• Big data is large, complex, and constantly changing, while traditional
data is typically small in size, structured, and static.
• Big data requires specialized tools and techniques to manage and
analyze effectively.
Big data has many qualities—it’s unstructured, dynamic, and complex. Humans
and IoT sensors are producing trillions of gigabytes of data each year.
It's modern data, in an increasingly diverse range of formats and from a variety of
sources. The data's size and scale, along with its speed and complexity, are
challenging for traditional data storage systems.
1. Big Data Is Too Big for Traditional Storage
Facebook stores and analyzes huge quantities of data. Facebook users upload at
least 14.58 million photos per hour. Each photo garners interactions stored along
with it, such as likes and comments. Users have “liked” at least a trillion posts,
comments, and other data points. The more data that is in a relational database,
the longer each operation takes.
2. Big Data Is Too Complex for Traditional Storage
Traditional data is “structured.” A relational database—the type of database
that stores traditional data—consists of records containing clearly defined fields.
You can access this type of database using a relational database management
system (RDBMS) such as MySQL, Oracle DB, or SQL Server.
Big data is largely unstructured, consisting of myriad file types, including images,
videos, audio & social media content. That's why traditional storage solutions are
unsuitable for working with big data: they can't properly categorize it.
Using a non-relational (NoSQL) database such as MongoDB, Cassandra, or Redis can
allow you to gain valuable insights into complex and varied sets of unstructured
data.
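As a rough illustration (not from the original slides), the snippet below stores and queries a schema-free document with pymongo; it assumes a MongoDB server running locally, and the database and collection names are invented:

    from pymongo import MongoClient

    # Assumes a local MongoDB instance; "social_demo" and "posts" are invented names.
    client = MongoClient("mongodb://localhost:27017")
    posts = client["social_demo"]["posts"]

    # Documents in the same collection can have different fields - no fixed schema.
    posts.insert_one({"user": "asha", "text": "launch day!", "likes": 120,
                      "media": {"type": "photo", "url": "http://example.com/p.jpg"}})
    posts.insert_one({"user": "ravi", "text": "great talk", "likes": 8})

    # Query by a field without any prior schema definition.
    for doc in posts.find({"likes": {"$gt": 100}}):
        print(doc["user"], doc["likes"])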
3. Big Data Is Too Fast for Traditional Storage
Big data grows almost instantaneously, and analysis often needs to occur in real
time. An RDBMS isn’t designed for rapid fluctuations.
For example, Internet of Things (IoT) devices need to process large amounts of sensor
data with minimal latency. Sensors transmit data from the “real world” at a near-
constant rate. Traditional storage systems struggle to store and analyze data arriving
at such a velocity.
Another example is cybersecurity. IT departments must inspect each packet of data arriving
through a company's firewall to check whether it contains suspicious code. Many
gigabytes might be passing through the network each day. To avoid falling victim to
an attack, this inspection has to happen in near real time, which is difficult for
traditional storage and processing systems.
Characteristics of Big data:
The characteristics of Big data, traditionally known as the "3 V's" and now often
extended to five, are:
• Volume: The large amount of data (terabytes, petabytes, or exabytes)
that is generated and collected from various sources.
• Variety: The different types of data (structured, semi-structured &
unstructured) that can be included in Big data. This data can come
in different formats (text, images, videos, audio, etc.).
• Velocity: The speed at which data is generated and must be processed in
order to extract value from it. This includes real-time data streams
from social media, IoT devices & sensors.
• Veracity: The uncertainty and diversity of data which makes it difficult to clean,
process and analyze.
• Value: The ability to extract insights and make better decisions by analyzing big
data.

Additional characteristics that are important to consider:


• Complexity: Big data is complex & difficult to understand, which makes it
challenging to extract insights from it.
• Scalability: Big data systems need to scale to handle the growing volume of
data, with the ability to quickly process & analyze it.
• Flexibility: Big data needs to be flexible to handle different types of data and
changing environments.
• Accessibility: Big data needs to be accessible to the right people, at the right time,
and in the right format to drive insights & decision-making.
• Security: Big data needs to be secured to protect sensitive information &
prevent unauthorized access.
Sources of Big Data
Big data originates from numerous sources, each contributing unique insights
that help industries make better decisions. Below are the key sources and their
specific big data applications in the real world.
1. Social Media Data:
Social media platforms like Facebook, Instagram, LinkedIn, and Twitter produce a
massive volume of data every second.
What's Captured: Posts, likes, shares, comments, video views, and hashtags.
Applications:
• Marketing and Advertising: Analyze trends, identify customer preferences, and
craft targeted campaigns.
• Sentiment Analysis: Understand public opinion on brands, products, or social
issues.
Example: Twitter trends provide real-time insights into customer sentiment
during product launches.
2. Machine Data:
Machine data comes from Internet of Things (IoT) devices, sensors, and system
logs, operating in industries like manufacturing, agriculture, and logistics.

What’s Captured:
Equipment performance, operational data, and environmental metrics.
Applications:
• Predictive Maintenance: Anticipate when machines might fail to reduce
downtime.
• Automation: Optimize workflows in smart factories or agricultural irrigation
systems.
Example: Smart home devices like thermostats adjust room temperatures based
on usage data.
3. Transaction Data:
Transaction data includes digital records from financial institutions, e-commerce
websites, and point-of-sale systems.

What’s Captured:
Purchase history, payment methods, inventory levels, and customer details.
Applications:
Fraud Detection: Monitor transactions for unusual activity.
Demand Forecasting: Predict product requirements based on buying patterns.
Example: E-commerce platforms like Amazon analyze purchase history to
recommend products.
4. Healthcare Data:
The healthcare industry collects and processes critical information from
hospitals, clinics, diagnostics labs, and wearable devices.

What’s Captured:
Patient records, genetic data, diagnostic images, and treatment
outcomes.
Applications:
• Personalized Medicine: Tailor treatments based on patient history.
• Epidemic Prediction: Use patient data to identify and contain
outbreaks.
Example: Fitness trackers provide real-time health metrics, which
doctors can use to monitor patients remotely.
5. Government and Public Data:
Government agencies and public organizations generate data from
weather monitoring, census collection, and transportation systems.

What’s Captured:
Population statistics, weather forecasts, traffic patterns, and public records.
Applications:
• Policy Making: Use demographic data to create impactful public policies.
• Urban Planning: Optimize infrastructure projects based on traffic and
population data.
Example: Smart traffic systems use data to reduce congestion in urban
areas.
6. Media and Entertainment Data:
Streaming services, gaming platforms, and digital publishers track user
activity and preferences.

What’s Captured:
Viewing habits, subscription details, social media engagement, and user
feedback.
Applications:
• Content Personalization: Recommend movies, songs, or games based on
user preferences.
• Engagement Analytics: Identify what content performs well to optimize
strategies.
Example: Netflix uses data analytics to recommend shows based on viewing
history.
7. Industrial Data:
Collected from robotics, manufacturing systems, and supply chains,
industrial data is critical for process optimization.

What’s Captured:
Production efficiency, inventory levels, shipment statuses, and machine
performance.
Applications:
• Supply Chain Optimization: Ensure timely delivery of goods by
monitoring logistics.
• Quality Assurance: Analyze production data to maintain high standards.
Example: Automotive companies monitor assembly line data to detect
defects early.
8. Scientific Research Data:
Fields like genomics, climate studies, and astronomy generate extensive
datasets from experiments and observations.

What’s Captured:
Satellite imagery, genome sequences, and experimental data.
Applications:
• Climate Models: Predict changes in weather patterns to combat global
warming.
• Medical Research: Develop new treatments or drugs using genomic
data.
Example: Space agencies use satellite data to monitor planetary
conditions.
What are the Main Components of Big Data?
Organizations that integrate the following components effectively can
unlock the potential of big data.
1. Data Sources
What It Includes:
Social media interactions, IoT devices, business transactions, and
customer feedback.
Purpose:
Provide the raw data required for analysis.
2. Data Storage
Key Systems:
• Hadoop Distributed File System (HDFS): For distributed and scalable storage.
• Data Lakes: Store large volumes of unstructured and semi-structured data.
• Cloud Storage: Solutions like Azure, AWS, and Google Cloud for flexible storage.
Purpose:
Organize and securely store data for easy access.
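As one hedged illustration of cloud storage (not prescribed by the slides), the sketch below uploads a local file to an S3 bucket with boto3; the bucket, prefix, and file names are invented, and valid AWS credentials are assumed to be configured:

    import boto3

    # Assumes AWS credentials are already configured (e.g. via environment variables).
    s3 = boto3.client("s3")

    # Bucket and object key are hypothetical; S3 stores unstructured blobs
    # (images, logs, CSVs) side by side, which suits a data-lake layout.
    s3.upload_file("sensor_readings.csv", "example-data-lake", "raw/sensor_readings.csv")

    # List what landed under the raw/ prefix to confirm the upload.
    response = s3.list_objects_v2(Bucket="example-data-lake", Prefix="raw/")
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])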

3. Data Processing
Techniques:
• Batch Processing: Tools like MapReduce process large data sets in chunks.
• Real-Time Streaming: Platforms like Apache Spark handle live data streams.
Purpose:
Convert raw data into structured and actionable formats.
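For flavor, here is a minimal batch-style PySpark sketch (an illustration only; the file and column names are invented, and a local Spark installation is assumed) that aggregates raw transaction records into a summarized form:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # transactions.csv and its columns (store_id, amount) are hypothetical.
    spark = SparkSession.builder.appName("BatchProcessingSketch").getOrCreate()

    raw = spark.read.csv("transactions.csv", header=True, inferSchema=True)

    # Batch-style aggregation: total sales per store, computed over the whole file.
    summary = (raw.groupBy("store_id")
                  .agg(F.sum("amount").alias("total_sales"),
                       F.count("*").alias("num_transactions")))

    summary.show()
    spark.stop()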
4. Data Analytics
Methods Used:
Statistical models, machine learning algorithms, and predictive analytics.
Tools:
Python libraries like Pandas and Scikit-learn, and platforms like SAS and Tableau.
Purpose:
Derive insights, identify trends, and make data-driven predictions.
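A toy predictive-analytics sketch with Pandas and scikit-learn (the numbers are made up purely for illustration) might look like this:

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Made-up monthly data: advertising spend (in $1000s) versus units sold.
    df = pd.DataFrame({"ad_spend": [10, 15, 20, 25, 30, 35],
                       "units_sold": [120, 150, 205, 240, 310, 330]})

    # Fit a simple linear model to capture the trend.
    model = LinearRegression()
    model.fit(df[["ad_spend"]], df["units_sold"])

    # Predict units sold for a planned spend of $40k.
    print(model.predict(pd.DataFrame({"ad_spend": [40]})))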
5. Data Visualization
How It’s Done:
Dashboards, heatmaps, and interactive graphs using tools like Power BI and
Tableau.
Purpose:
Present findings in an understandable way to help decision-makers.
How Does Big Data Analytics Work?
Big data analytics involves transforming vast amounts of raw data into
actionable insights. Here's a clear and concise step-by-step explanation:
1. Data Collection
What Happens: Data is gathered from diverse sources like:
• Social media platforms.
• Internet of Things (IoT) devices.
• Business databases.
• Online transactions.
Goal: Compile data in all formats—structured, unstructured, and semi-
structured—for analysis.
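As a small, hypothetical illustration of collection (the endpoint URL is invented), data can be pulled from a web API and kept in its raw JSON form for the later steps:

    import json
    import requests

    # Hypothetical REST endpoint; in practice this could be a social media or IoT API.
    response = requests.get("https://api.example.com/v1/events", params={"limit": 100})
    response.raise_for_status()

    events = response.json()   # semi-structured JSON records

    # Land the raw data as-is; cleaning and structuring happen in later steps.
    with open("raw_events.json", "w") as f:
        json.dump(events, f)

    print(f"collected {len(events)} records")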
2. Data Cleaning
What Happens: Errors, duplicates, and irrelevant entries are removed. Common
tasks include:
• Fixing typos and standardizing formats.
• Filling missing values to avoid incomplete analysis.
Goal: Ensure the data is accurate and reliable for processing.
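A minimal cleaning sketch with Pandas (column names and values are invented) that removes duplicates, standardizes formats, and fills missing values:

    import pandas as pd

    # Invented raw records with typical quality problems.
    df = pd.DataFrame({
        "customer": ["Asha", "asha ", "Ravi", "Meena", "Meena"],
        "city":     ["Delhi", "delhi", None, "Chennai", "Chennai"],
        "amount":   [250.0, 250.0, 99.0, None, 310.0],
    })

    df["customer"] = df["customer"].str.strip().str.title()   # fix typos and formatting
    df["city"] = df["city"].str.title().fillna("Unknown")     # standardize and fill gaps
    df["amount"] = df["amount"].fillna(df["amount"].mean())   # fill missing numbers
    df = df.drop_duplicates()                                  # remove exact duplicates

    print(df)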
3. Data Processing
What Happens: Organize and structure data using powerful tools like:
• Apache Hadoop: For distributed storage and processing.
• Apache Spark: For faster, real-time data operations.
Goal: Convert raw data into manageable formats like tables or graphs for further
analysis.
4. Data Analysis
What Happens: Use statistical techniques and machine learning models to
extract insights. Popular methods include:
• Regression analysis for identifying trends.
• Clustering to group similar data points.
• Predictive modeling to forecast future trends.
Goal: Solve key business problems and predict outcomes.
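For example, a tiny clustering sketch with scikit-learn (synthetic numbers, purely illustrative) that groups customers by spend and visit frequency:

    import numpy as np
    from sklearn.cluster import KMeans

    # Synthetic customer features: [monthly_spend, visits_per_month].
    X = np.array([[20, 2], [25, 3], [22, 2],          # low-spend customers
                  [200, 15], [220, 18], [210, 16]])   # high-spend customers

    # Group similar customers into two clusters.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    print(kmeans.labels_)            # cluster assignment for each customer
    print(kmeans.cluster_centers_)   # average profile of each cluster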
5. Data Visualization
What Happens: Present the results in clear, intuitive visuals using tools like:
• Tableau and Power BI for creating interactive dashboards.
• Charts, heatmaps, and graphs to make data easy to understand.
Goal: Help stakeholders make informed decisions quickly.
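Dashboards in Power BI or Tableau are point-and-click tools, but the same idea can be sketched in code; the bar chart below uses matplotlib (not named in the slides) with made-up figures:

    import matplotlib.pyplot as plt

    # Made-up monthly revenue figures for illustration.
    months = ["Jan", "Feb", "Mar", "Apr"]
    revenue = [120, 150, 90, 180]   # in $1000s

    plt.bar(months, revenue, color="steelblue")
    plt.title("Monthly Revenue (illustrative data)")
    plt.ylabel("Revenue ($1000s)")
    plt.tight_layout()
    plt.show()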
