INTRODUCTION TO DATA SCIENCE
INTRODUCTION TO HADOOP:
Hadoop is an open-source software framework that is used for storing and processing
large amounts of data in a distributed computing environment. It is designed to handle big data
and is based on the MapReduce programming model, which allows for the parallel processing
of large datasets.
What is Hadoop?
Hadoop is an open-source software framework for storing large amounts of data and performing
computation on it. The framework is written mainly in Java, with some native code in C and
utility shell scripts.
Hadoop has two main components:
HDFS (Hadoop Distributed File System): This is the storage component of Hadoop, which
allows for the storage of large amounts of data across multiple machines. It is designed to
work with commodity hardware, which makes it cost-effective.
YARN (Yet Another Resource Negotiator): This is the resource management component of
Hadoop, which manages the allocation of resources (such as CPU and memory) for
processing the data stored in HDFS.
Hadoop also includes several additional modules that extend its functionality, such as Hive (a
SQL-like query language), Pig (a high-level platform for creating MapReduce programs), and
HBase (a non-relational, distributed database).
Hadoop is commonly used in big data scenarios such as data warehousing, business
intelligence, and machine learning. It’s also used for data processing, data analysis, and data
mining. It enables the distributed processing of large data sets across clusters of computers
using a simple programming model.
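To make the MapReduce model concrete, below is a minimal word-count sketch written for Hadoop Streaming, which lets plain scripts act as the mapper and reducer by reading standard input and writing standard output. The file name mapper_reducer.py and the way both roles are combined into one script with a command-line switch are illustrative choices, not part of Hadoop itself; in practice the mapper and reducer are usually two separate scripts passed to the hadoop-streaming JAR.

# mapper_reducer.py - a minimal Hadoop Streaming word count (illustrative).
# Run the mapper as:  python mapper_reducer.py map
# Run the reducer as: python mapper_reducer.py reduce
import sys

def mapper():
    # Emit a (word, 1) pair for every word in the input split.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts the mapper output by key, so all counts for one word arrive together.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{total}")
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

The same word-count logic is what a native Java MapReduce job would express with Mapper and Reducer classes.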
History of Hadoop
Hadoop was developed by the Apache Software Foundation, and its co-founders are Doug
Cutting and Mike Cafarella. Co-founder Doug Cutting named it after his son's toy elephant.
In October 2003, Google released its first paper on the Google File System. In January 2006,
MapReduce development started within the Apache Nutch project, with around 6,000 lines of
code for MapReduce and around 5,000 lines of code for HDFS. In April 2006, Hadoop 0.1.0 was released.
Hadoop is an open-source software framework for storing and processing big data. It was
created by the Apache Software Foundation in 2006, based on white papers written by Google
that described the Google File System (GFS, 2003) and the MapReduce programming model (2004).
The Hadoop framework allows for the distributed processing of large data sets across clusters
of computers using simple programming models. It is designed to scale up from single servers
to thousands of machines, each offering local computation and storage. It is used by many
organizations, including Yahoo, Facebook, and IBM, for a variety of purposes such as data
warehousing, log processing, and research. Hadoop has been widely adopted in the industry
and has become a key technology for big data processing.
Features of Hadoop:
1. It is fault tolerant.
2. It is highly available.
3. It is easy to program.
4. It offers huge, flexible storage.
5. It is low cost.
Hadoop has several key features that make it well-suited for big data processing:
Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for
the storage and processing of extremely large amounts of data.
Scalability: Hadoop can scale from a single server to thousands of machines, making it easy
to add more capacity as needed.
Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can continue to
operate even in the presence of hardware failures.
Data Locality: Hadoop provides a data-locality feature: computation is scheduled on the node
where the data is stored, which reduces network traffic and improves performance.
High Availability: Hadoop provides a high-availability feature, which helps to ensure that the
data is always accessible and is not lost.
Flexible Data Processing: Hadoop’s MapReduce programming model allows for the
processing of data in a distributed fashion, making it easy to implement a wide variety of
data processing tasks.
Data Integrity: Hadoop provides a built-in checksum feature, which helps to ensure that the
stored data is consistent and correct.
Data Replication: Hadoop provides a data replication feature, which replicates the data across
the cluster for fault tolerance.
Data Compression: Hadoop provides built-in data compression, which reduces storage space
and improves performance.
YARN: A resource management platform that allows multiple data processing engines like
real-time streaming, batch processing, and interactive SQL, to run and process data stored in
HDFS.
Hadoop Distributed File System
Hadoop has a distributed file system known as HDFS, which splits files into blocks and
distributes them across the nodes of a large cluster. In case of a node failure the system keeps
operating, and the data transfer between nodes needed for recovery is handled by HDFS.
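To make the block-and-replica idea concrete, here is a small conceptual sketch (not the real HDFS code) that splits a file size into 128 MB blocks and assigns each block's replicas to different data nodes. The round-robin placement and node names are purely illustrative; real HDFS uses rack-aware placement.

# Conceptual sketch of HDFS-style block splitting and replication (illustrative only).
BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size (128 MB)
REPLICATION = 3                  # default HDFS replication factor

def plan_blocks(file_size_bytes, nodes):
    """Return a mapping of block id -> nodes holding its replicas (round-robin)."""
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)   # ceiling division
    plan = {}
    for b in range(num_blocks):
        plan[f"block_{b}"] = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
    return plan

# A 500 MB file becomes 4 blocks, each stored on 3 different nodes,
# so losing any single node never loses a block.
print(plan_blocks(500 * 1024 * 1024, ["node1", "node2", "node3", "node4"]))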
Advantages of HDFS: it is inexpensive, immutable in nature, stores data reliably, tolerates
faults, is scalable and block-structured, can process a large amount of data simultaneously,
and more. Disadvantages of HDFS: its biggest disadvantage is that it is not a good fit for
small quantities of data; it also has potential stability issues and can be restrictive and rough
to work with. Hadoop also supports a wide range of software packages such as Apache Flume,
Apache Oozie, Apache HBase, Apache Sqoop, Apache Spark, Apache Storm, Apache Pig,
Apache Hive, Apache Phoenix, and Cloudera Impala.
Hadoop has several advantages that make it a popular choice for big data processing:
Scalability: Hadoop can easily scale to handle large amounts of data by adding more nodes
to the cluster.
Cost-effective: Hadoop is designed to work with commodity hardware, which makes it a
cost-effective option for storing and processing large amounts of data.
Fault-tolerance: Hadoop’s distributed architecture provides built-in fault-tolerance, which
means that if one node in the cluster goes down, the data can still be processed by the other
nodes.
Flexibility: Hadoop can process structured, semi-structured, and unstructured data, which
makes it a versatile option for a wide range of big data scenarios.
Open-source: Hadoop is open-source software, which means that it is free to use and
modify. This also allows developers to access the source code and make improvements or
add new features.
Large community: Hadoop has a large and active community of developers and users who
contribute to the development of the software, provide support, and share best practices.
Integration: Hadoop is designed to work with other big data technologies such as Spark,
Storm, and Flink, which allows for integration with a wide range of data processing and
analysis tools.
Disadvantages:
Not very effective for small data.
Hard cluster management.
Has stability issues.
Security concerns.
Complexity: Hadoop can be complex to set up and maintain, especially for organizations
without a dedicated team of experts.
Latency: Hadoop is not well-suited for low-latency workloads and may not be the best
choice for real-time data processing.
Limited Support for Real-time Processing: Hadoop’s batch-oriented nature makes it less
suited for real-time streaming or interactive data processing use cases.
Limited Support for Structured Data: Hadoop is designed to work with unstructured and
semi-structured data; it is not well-suited for structured data processing.
Data Security: out of the box, Hadoop provides only limited security features such as data
encryption or user authentication, which can make it difficult to secure sensitive data.
Limited Support for Ad-hoc Queries: Hadoop’s MapReduce programming model is not
well-suited for ad-hoc queries, making it difficult to perform exploratory data analysis.
Limited Support for Graph and Machine Learning: Hadoop's core components, HDFS and
MapReduce, are not well-suited for graph and machine learning workloads; specialized
projects such as Apache Giraph and Apache Mahout are available but have some limitations.
Cost: Hadoop can be expensive to set up and maintain, especially for organizations with
large amounts of data.
Data Loss: In the event of a hardware failure, the data stored in a single node may be lost
permanently.
Data Governance: data governance is a critical aspect of data management, and Hadoop does
not provide built-in features for managing data lineage, data quality, data cataloging, and
data auditing.
Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It
is based on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for
more types of computations, which includes interactive queries and stream processing. The main
feature of Spark is its in-memory cluster computing that increases the processing speed of an
application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries and streaming. Apart from supporting all these workloads in a
single system, it reduces the management burden of maintaining separate tools.
Spark started as a sub-project of Hadoop, developed in 2009 in UC Berkeley's AMPLab by
Matei Zaharia. It was open-sourced in 2010 under a BSD license, donated to the Apache
Software Foundation in 2013, and became a top-level Apache project in February 2014.
Speed − Spark helps run applications on a Hadoop cluster up to 100 times faster in memory
and 10 times faster when running on disk. This is possible because it reduces the number of
read/write operations to disk and stores intermediate processing data in memory (a short code
sketch of this idea follows these feature descriptions).
Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you
can write applications in different languages. Spark also comes with 80 high-level operators
for interactive querying.
Advanced Analytics − Spark not only supports 'Map' and 'Reduce'; it also supports SQL
queries, streaming data, machine learning (ML), and graph algorithms.
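A minimal PySpark sketch of the in-memory idea mentioned above: an RDD is cached after it is first computed, so subsequent actions reuse the in-memory data instead of recomputing it. It assumes a local PySpark installation; the application name is arbitrary.

# A minimal PySpark sketch: cache an RDD in memory and reuse it across actions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("InMemoryExample").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1_000_000))
squares = numbers.map(lambda x: x * x).cache()   # keep the computed RDD in memory

print(squares.count())   # first action: computes the RDD and caches it
print(squares.sum())     # second action: served from the in-memory copy

spark.stop()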
Spark can be deployed together with Hadoop components in three ways, as explained below.
Standalone − In a Spark standalone deployment, Spark occupies the place on top of HDFS
(Hadoop Distributed File System), and space is allocated for HDFS explicitly. Here, Spark and
MapReduce run side by side to cover all Spark jobs on the cluster.
Hadoop YARN − In a Hadoop YARN deployment, Spark simply runs on YARN without any
pre-installation or root access required. This helps to integrate Spark into the Hadoop
ecosystem or Hadoop stack and allows other components to run on top of the stack.
Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs in addition to
standalone deployment. With SIMR, a user can start Spark and use its shell without any
administrative access.
Components of Spark
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called
SchemaRDD, which provides support for structured and semi-structured data.
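A minimal sketch of Spark SQL in Python: the modern DataFrame API (the successor to SchemaRDD) is used to register structured data as a temporary view and query it with plain SQL. The table name, column names, and rows are made up for the example.

# A minimal Spark SQL sketch: query a DataFrame with plain SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

people = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)
people.createOrReplaceTempView("people")

# Structured and semi-structured data can be queried once it is registered as a view.
spark.sql("SELECT name, age FROM people WHERE age > 30").show()

spark.stop()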
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets)
transformations on those mini-batches of data.
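A minimal Spark Streaming (DStream) sketch of the mini-batch idea: it reads lines from a socket on localhost:9999 (which could be fed with a tool like netcat) and counts words in 5-second mini-batches. The host, port, and batch interval are arbitrary choices for the example.

# A minimal DStream sketch: word counts over 5-second mini-batches from a socket.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingWordCount")
ssc = StreamingContext(sc, 5)                      # 5-second mini-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))   # RDD-style transformations per batch
counts.pprint()                                    # print each mini-batch's counts

ssc.start()
ssc.awaitTermination()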
MLlib (Machine Learning Library)
MLlib is a distributed machine-learning framework on top of Spark that takes advantage of
Spark's distributed, memory-based architecture. According to benchmarks done by the MLlib
developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine
times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a
Spark interface).
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for
expressing graph computation that can model user-defined graphs by using the Pregel
abstraction API. It also provides an optimized runtime for this abstraction.
Introduction to NoSQL
NoSQL is a type of database management system (DBMS) that is designed to handle and
store large volumes of unstructured and semi-structured data. Unlike traditional relational
databases that use tables with pre-defined schemas to store data, NoSQL databases use flexible
data models that can adapt to changes in data structures and are capable of scaling horizontally to
handle growing amounts of data.
The term NoSQL originally referred to “non-SQL” or “non-relational” databases, but the term
has since evolved to mean “not only SQL,” as NoSQL databases have expanded to include a
wide range of different database architectures and data models.
NoSQL databases are generally classified into four main categories:
1. Document databases: These databases store data as semi-structured documents, such as
JSON or XML, and can be queried using document-oriented query languages (a short example
follows this list).
2. Key-value stores: These databases store data as key-value pairs, and are optimized for simple
and fast read/write operations.
3. Column-family stores: These databases store data as column families, which are sets of
columns that are treated as a single entity. They are optimized for fast and efficient querying
of large amounts of data.
4. Graph databases: These databases store data as nodes and edges, and are designed to handle
complex relationships between data.
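As an illustration of the document model, here is a minimal sketch using MongoDB through the pymongo driver. The connection string, database name, collection, and documents are assumptions made for the example, and MongoDB is only one of several document databases.

# A minimal document-database sketch using MongoDB via pymongo (illustrative names).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# Documents are schemaless, JSON-like records: fields can differ between documents.
db.products.insert_one({"name": "laptop", "price": 899, "tags": ["electronics"]})
db.products.insert_one({"name": "desk", "price": 120, "material": "oak"})

# Queries use a document-oriented filter syntax instead of SQL.
for doc in db.products.find({"price": {"$lt": 500}}):
    print(doc["name"], doc["price"])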
ACID
In the context of Hadoop and data science, "ACID" refers to a set of properties -
Atomicity, Consistency, Isolation, and Durability - that ensure data integrity and reliability when
performing transactions on large datasets within a distributed processing environment like
Hadoop. These properties guarantee that data modifications happen completely and consistently,
even in the face of system failures or concurrent operations. Essentially, every data change is
treated as a single, indivisible unit, which prevents partial updates or inconsistencies within the data.
Breaking down ACID:
Atomicity:
A transaction is either fully completed or not at all; if any part of a transaction fails, the entire
operation is rolled back to its previous state, preventing partial updates.
Consistency:
A transaction must always bring the database from one valid state to another, upholding
predefined data rules and constraints.
Isolation:
Multiple concurrent transactions should be isolated from each other, meaning that the ongoing
operations of one transaction should not interfere with the data being accessed by another
transaction.
Durability:
Once a transaction is committed, the changes made to the data must be permanently stored and
persist even in case of system crashes or power outages.
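To make atomicity and rollback concrete, here is a minimal sketch using SQLite (chosen only because it ships with Python, not because it is part of Hadoop): both halves of a money transfer either commit together or are rolled back together.

# A minimal atomicity sketch with SQLite: an all-or-nothing money transfer.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # starts a transaction; commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 70 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 70 WHERE name = 'bob'")
        raise RuntimeError("simulated failure before the transfer completes")
except RuntimeError:
    pass

# Both balances are unchanged because the whole transaction was rolled back.
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())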
Why is ACID important in Hadoop?
Data Integrity:
With large, distributed datasets, maintaining data consistency becomes crucial, especially when
multiple users or applications are accessing and modifying data simultaneously.
Reliable Analytics:
ACID properties enable data analysts to trust the results of their queries, knowing that the data
they are working with is accurate and not corrupted by incomplete or conflicting updates.
How does Hadoop implement ACID?
Hive:
A popular data warehousing tool built on top of Hadoop that offers ACID-compliant
transactions through specific table formats and configurations, allowing for reliable updates and
deletes on large datasets.
HBase:
A NoSQL database that can be used with Hadoop and provides ACID guarantees for certain
operations, enabling real-time data updates with consistency.
CAP:
In the context of data science and Hadoop, "CAP" refers to the "Consistency,
Availability, and Partition Tolerance" theorem, a fundamental concept in distributed systems
which states that a system can only guarantee two of these three properties at any given time.
When dealing with large datasets on a distributed cluster like Hadoop, you must therefore make
trade-offs between consistency of the data, availability of the system, and tolerance to network
partitions, depending on your application's needs; essentially, you cannot have all three fully
optimized simultaneously.
Key points about CAP in Hadoop:
Consistency:
Ensures that all nodes in a distributed system have the same data at any given time, meaning if
you read data from multiple nodes, you'll get the same value.
Availability:
Guarantees that the system remains operational even if some nodes fail, allowing users to access
data even during partial system outages.
Partition Tolerance:
The ability for a distributed system to continue functioning even when network partitions occur
(parts of the system are isolated from each other).
Base Model:
The rise in popularity of NoSQL databases brought flexibility, fluidity, and ease of
manipulating data, and as a result a new database model was designed to reflect these
properties. The acronym BASE is slightly more confusing than ACID; however, the words
behind it suggest the ways in which the BASE model is different. BASE stands for:
1. Basically Available: Instead of requiring immediate consistency, BASE-modelled NoSQL
databases ensure the availability of data by spreading and replicating it across the nodes of
the database cluster.
2. Soft State: Due to the lack of immediate consistency, the data values may change over time.
The BASE model breaks with the idea of a database that enforces its own consistency,
delegating that responsibility to developers.
3. Eventually Consistent: The fact that BASE does not require immediate consistency does not
mean that it never achieves it. However, until it does, data reads are still possible (even
though they might not reflect the latest writes).
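The following toy sketch (purely illustrative, with no real replication protocol) mimics the BASE behaviour described above: a write is accepted by one replica immediately, reads from other replicas may be stale for a while, and a later synchronization step makes all replicas converge.

# A toy sketch of eventual consistency across three in-memory "replicas".
replicas = [{"x": 1}, {"x": 1}, {"x": 1}]

def write(key, value):
    replicas[0][key] = value           # basically available: accept the write on one replica

def read(replica_id, key):
    return replicas[replica_id][key]   # soft state: the value may be stale

def sync():
    for r in replicas[1:]:             # eventually consistent: propagate the latest state
        r.update(replicas[0])

write("x", 42)
print(read(1, "x"))   # 1  -> a stale read before synchronization
sync()
print(read(1, "x"))   # 42 -> the replicas have converged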
UNIT-V
CASE STUDY:
Disease Prediction
This case study aims to implement a robust machine-learning model that can efficiently
predict a person's disease based on the symptoms that he/she possesses. Let us look into
how we can approach this machine-learning problem:
Approach:
Gathering the Data: Data preparation is the primary step for any machine learning problem.
We will be using a dataset from Kaggle for this problem. This dataset consists of two CSV
files, one for training and one for testing. There are a total of 133 columns in the dataset, out
of which 132 columns represent the symptoms and the last column is the prognosis.
Cleaning the Data: Cleaning is the most important step in a machine learning project. The
quality of our data determines the quality of our machine-learning model. So it is always
necessary to clean the data before feeding it to the model for training. In our dataset, all the
columns are numerical except the target column, i.e. prognosis, which is of string type and is
encoded to numerical form using a label encoder.
Model Building: After gathering and cleaning the data, the data is ready and can be used to
train a machine learning model. We will be using this cleaned data to train the Support Vector
Classifier, Naive Bayes Classifier, and Random Forest Classifier. We will be using
a confusion matrix to determine the quality of the models.
Inference: After training the three models we will be predicting the disease for the input
symptoms by combining the predictions of all three models. This makes our overall prediction
more robust and accurate.
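A minimal scikit-learn sketch of this approach follows. The file names Training.csv and Testing.csv, the prognosis column name, and the step that drops fully empty columns are assumptions about the Kaggle dataset; the three classifiers and the majority vote mirror the steps described above.

# A minimal sketch of the disease-prediction pipeline (assumed Kaggle file names).
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

train = pd.read_csv("Training.csv").dropna(axis=1, how="all")   # drop any fully empty columns
test = pd.read_csv("Testing.csv").dropna(axis=1, how="all")

# Encode the string prognosis labels into numbers.
encoder = LabelEncoder()
X_train = train.drop(columns="prognosis")
y_train = encoder.fit_transform(train["prognosis"])
X_test = test.drop(columns="prognosis")
y_test = encoder.transform(test["prognosis"])

models = {
    "svc": SVC(),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(random_state=42),
}
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    print(name, "confusion matrix:\n", confusion_matrix(y_test, predictions[name]))

# Combine the three models with a majority vote over their predictions.
stacked = np.stack(list(predictions.values()))
final = np.apply_along_axis(lambda votes: np.bincount(votes).argmax(), 0, stacked)
print("ensemble accuracy:", (final == y_test).mean())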
Data Retrieval
A case study on data retrieval in data science could focus on how a company like
Amazon utilizes its vast customer interaction data, stored in a distributed database system, to
efficiently retrieve relevant information for personalized product recommendations, enabling
targeted marketing campaigns and driving sales. This process involves complex SQL queries,
data filtering, and optimized data retrieval techniques to deliver quick results to users despite
the massive data volume.
Key aspects of this case study:
Data Source:
Customer browsing history, purchase history, ratings, demographics, and other interaction data
from the Amazon website and app.
Data Retrieval Challenges:
Scalability: Handling large volumes of data in real-time to provide instant recommendations.
Data Complexity: Integrating data from multiple sources with varying structures.
Performance Optimization: Ensuring fast query response times to maintain user experience.
Data Retrieval Techniques:
SQL Queries: Using optimized SQL queries to filter and extract relevant data from the database.
Data Warehousing: Storing data in a structured data warehouse for efficient querying.
Distributed Computing: Leveraging distributed processing frameworks like Apache Spark to
handle large datasets efficiently.
Caching Mechanisms: Implementing caching strategies to store frequently accessed data for faster
retrieval.
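A minimal sketch combining two of the techniques above, a parameterized SQL query plus an in-memory cache; the SQLite database file store.db and the purchases(customer_id, product, ts) table are hypothetical stand-ins for a real distributed store.

# A minimal sketch: parameterized SQL retrieval with an in-memory cache on top.
import sqlite3
from functools import lru_cache

conn = sqlite3.connect("store.db")   # hypothetical database file

@lru_cache(maxsize=10_000)           # keep results for frequently requested customers
def recent_purchases(customer_id, limit=10):
    rows = conn.execute(
        "SELECT product, ts FROM purchases "
        "WHERE customer_id = ? ORDER BY ts DESC LIMIT ?",
        (customer_id, limit),
    ).fetchall()
    return tuple(rows)               # tuples are immutable, so they are safe to cache

# The first call hits the database; repeated calls for the same customer
# are served from the cache, which keeps response times low.
print(recent_purchases(42))
print(recent_purchases(42))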
Data exploration
Data exploration is the initial stage of data analysis, where data scientists examine a
dataset to identify its characteristics and patterns. It's a statistical process that uses data
visualization tools and statistical methods to help understand the data's quality, range, and scale.
It helps identify errors, outliers, and anomalies.
It helps understand the data's characteristics, which helps determine the type of analysis needed.
It helps ensure that the results are valid and applicable to business goals.
It helps decision-makers understand the data context and make informed decisions.
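A minimal pandas sketch of such a first exploration pass; data.csv is a placeholder file name, and the 3-standard-deviation rule is just one simple way to flag potential outliers.

# A minimal data-exploration sketch with pandas (placeholder file name).
import pandas as pd

df = pd.read_csv("data.csv")

print(df.shape)            # size of the dataset
print(df.dtypes)           # column types (numeric, categorical, ...)
print(df.describe())       # range, scale and spread of the numeric columns
print(df.isnull().sum())   # missing values per column

# A crude outlier check: flag values more than 3 standard deviations from the mean.
numeric = df.select_dtypes(include="number")
outliers = (numeric - numeric.mean()).abs() > 3 * numeric.std()
print(outliers.sum())      # number of flagged values per column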
Disease profiling
"Disease profiling" in data science refers to the process of analyzing large datasets of patient
information, including medical records, genetic data, lifestyle factors, and clinical test results, to
identify patterns and characteristics associated with specific diseases, allowing for better
understanding of disease progression, risk factors, and potential treatment strategies; essentially
creating a detailed "profile" of a disease based on data analysis.
Key aspects of disease profiling:
Data collection:
Gathering comprehensive patient data from various sources like electronic health records
(EHRs), clinical trials, and genomic databases.
Data cleaning and pre-processing:
Standardizing and organizing data to ensure accuracy and consistency for analysis.
Feature engineering:
Identifying relevant variables from the data that could contribute to disease prediction, like
demographics, symptoms, lab results, and genetic markers.
Statistical analysis:
Using descriptive statistics to understand the distribution of disease characteristics within the
population.
Machine learning algorithms:
Applying algorithms like decision trees, random forests, or neural networks to identify patterns
and predict disease risk or progression based on patient profiles.
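A small, hypothetical sketch tying the steps above together: a toy patient table, one engineered feature, and a logistic regression that estimates disease risk. All column names and values are invented for illustration; a real profile would draw on far richer EHR and genomic data.

# A hypothetical disease-profiling sketch: engineered features + a risk model.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

patients = pd.DataFrame({
    "age":         [45, 61, 38, 70, 52, 33, 66, 58],
    "cholesterol": [210, 260, 180, 290, 240, 170, 275, 230],
    "smoker":      [0, 1, 0, 1, 1, 0, 1, 0],
    "has_disease": [0, 1, 0, 1, 1, 0, 1, 0],
})

# Feature engineering: derive a simple high-cholesterol indicator.
patients["high_chol"] = (patients["cholesterol"] > 240).astype(int)

X = patients[["age", "cholesterol", "smoker", "high_chol"]]
y = patients["has_disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
print("estimated disease risk:", model.predict_proba(X_test)[:, 1])   # per-patient probability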
Applications of disease profiling:
Early detection:
Identifying individuals at high risk of developing a disease based on their profile to enable early
intervention and preventative measures.
Personalized medicine:
Tailoring treatment plans to individual patients based on their unique disease profile.
Disease surveillance:
Monitoring trends and outbreaks of diseases within a population using data analysis.
Drug discovery:
Identifying potential targets for new drugs by analyzing the molecular mechanisms underlying
diseases.
Challenges in disease profiling:
Data quality issues:
Inconsistent data formats, missing values, and potential errors in medical records.
Data privacy concerns:
Protecting sensitive patient information while utilizing data for analysis.
Complex disease interactions:
Understanding the interplay of multiple factors contributing to disease development.
Example of disease profiling:
Cancer profiling:
Analyzing genetic mutations in tumor samples to identify the specific subtype of cancer and
guide treatment decisions.
Cardiovascular disease profiling:
Identifying individuals at high risk for heart disease based on factors like family history,
cholesterol levels, and blood pressure.