INTRODUCTION TO DATA SCIENCE

Hadoop is an open-source framework designed for storing and processing large datasets in a distributed computing environment, utilizing the MapReduce programming model. Its main components include HDFS for storage and YARN for resource management, and it supports various modules like Hive and Pig for additional functionality. While Hadoop offers advantages such as scalability and fault tolerance, it also has limitations, including complexity and challenges with small data processing.

UNIT-IV

INTRODUCTION TO HADOOP:
Hadoop is an open-source software framework that is used for storing and processing
large amounts of data in a distributed computing environment. It is designed to handle big data
and is based on the MapReduce programming model, which allows for the parallel processing
of large datasets.
What is Hadoop?
Hadoop is an open-source software programming framework for storing large amounts of data
and performing computation. The framework is based on Java, with some native code in C and
shell scripts.
Hadoop has two main components:
 HDFS (Hadoop Distributed File System): This is the storage component of Hadoop, which
allows for the storage of large amounts of data across multiple machines. It is designed to
work with commodity hardware, which makes it cost-effective.
 YARN (Yet Another Resource Negotiator): This is the resource management component of
Hadoop, which manages the allocation of resources (such as CPU and memory) for
processing the data stored in HDFS.
 Hadoop also includes several other modules that provide additional functionality, such as
Hive (a SQL-like query language), Pig (a high-level platform for creating MapReduce
programs), and HBase (a non-relational, distributed database).
 Hadoop is commonly used in big data scenarios such as data warehousing, business
intelligence, and machine learning. It’s also used for data processing, data analysis, and data
mining. It enables the distributed processing of large data sets across clusters of computers
using a simple programming model.
History of Hadoop
Hadoop was developed under the Apache Software Foundation, and its co-founders are Doug
Cutting and Mike Cafarella. Co-founder Doug Cutting named it after his son's toy elephant.
In October 2003, Google released its paper on the Google File System. In January 2006,
MapReduce development started on Apache Nutch, with around 6,000 lines of code for MapReduce
and around 5,000 lines of code for HDFS. In April 2006, Hadoop 0.1.0 was released.
Hadoop is an open-source software framework for storing and processing big data. It was
created under the Apache Software Foundation in 2006, based on papers published by Google
describing the Google File System (GFS, 2003) and the MapReduce programming model (2004).
The Hadoop framework allows for the distributed processing of large data sets across clusters
of computers using simple programming models. It is designed to scale up from single servers
to thousands of machines, each offering local computation and storage. It is used by many
organizations, including Yahoo, Facebook, and IBM, for a variety of purposes such as data
warehousing, log processing, and research. Hadoop has been widely adopted in the industry
and has become a key technology for big data processing.
Features of Hadoop:
1. It is fault tolerant.
2. It is highly available.
3. Its programming model is easy.
4. It has huge, flexible storage.
5. It is low cost.

Hadoop has several key features that make it well-suited for big data processing:

 Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for
the storage and processing of extremely large amounts of data.
 Scalability: Hadoop can scale from a single server to thousands of machines, making it easy
to add more capacity as needed.
 Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can continue to
operate even in the presence of hardware failures.
 Data locality: Hadoop schedules processing on the same node where the data is stored, which
helps reduce network traffic and improve performance.
 High Availability: Hadoop provides a high-availability feature, which helps ensure that the
data is always available and is not lost.
 Flexible Data Processing: Hadoop’s MapReduce programming model allows for the
processing of data in a distributed fashion, making it easy to implement a wide variety of
data processing tasks.
 Data Integrity: Hadoop provides a built-in checksum feature, which helps ensure that the
stored data is consistent and correct.
 Data Replication: Hadoop provides a data replication feature, which replicates data across
the cluster for fault tolerance.
 Data Compression: Hadoop provides a built-in data compression feature, which helps
reduce storage space and improve performance.
 YARN: A resource management platform that allows multiple data processing engines like
real-time streaming, batch processing, and interactive SQL, to run and process data stored in
HDFS.
Hadoop Distributed File System
Hadoop's distributed file system, HDFS, splits files into blocks and distributes them across the
nodes of large clusters. In case of a node failure, the system keeps operating, and data transfer
between the nodes is facilitated by HDFS.
Advantages of HDFS: it is inexpensive, immutable in nature, stores data reliably, tolerates faults,
is scalable and block structured, and can process large amounts of data simultaneously, among
other things.
Disadvantages of HDFS: its biggest disadvantage is that it is not suited to small quantities of
data; it also has potential stability issues and can be restrictive and rough in nature.
Hadoop also supports a wide range of software packages such as Apache Flume, Apache Oozie,
Apache HBase, Apache Sqoop, Apache Spark, Apache Storm, Apache Pig, Apache Hive,
Apache Phoenix, and Cloudera Impala.
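
As a small, hedged illustration of interacting with HDFS, the following Python sketch wraps the standard hdfs dfs command-line tool with the subprocess module. It assumes a running Hadoop installation with the hdfs command on the PATH; the directory and file names are hypothetical.

# Minimal sketch: copy a local file into HDFS and read it back.
# Assumes Hadoop is installed, HDFS is running, and `hdfs` is on the PATH.
import subprocess

def run(cmd):
    """Run a command and print its standard output."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(result.stdout)

# Create a directory in HDFS (hypothetical path).
run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"])

# Copy a local file into HDFS; HDFS splits it into blocks and replicates them across nodes.
run(["hdfs", "dfs", "-put", "-f", "local_data.txt", "/user/demo/"])

# List the directory and print the file contents back from HDFS.
run(["hdfs", "dfs", "-ls", "/user/demo"])
run(["hdfs", "dfs", "-cat", "/user/demo/local_data.txt"])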

Some common frameworks of Hadoop


1. Hive- It uses HiveQL for data structuring and for writing complicated MapReduce jobs over data in HDFS.
2. Drill- It consists of user-defined functions and is used for data exploration.
3. Storm- It allows real-time processing and streaming of data.
4. Spark- It contains a Machine Learning Library(MLlib) for providing enhanced machine
learning and is widely used for data processing. It also supports Java, Python, and Scala.
5. Pig- It provides Pig Latin, a SQL-like language, and performs data transformation of
unstructured data.
6. Tez- It reduces the complexity of Hive and Pig and helps their code run faster.
Hadoop framework is made up of the following modules:
1. Hadoop MapReduce- a MapReduce programming model for handling and processing large
data.
2. Hadoop Distributed File System- a distributed file system that stores files in blocks across cluster nodes.
3. Hadoop YARN- a platform which manages computing resources.
4. Hadoop Common- it contains packages and libraries which are used by the other modules.
Advantages and Disadvantages of Hadoop
Advantages:
 Ability to store a large amount of data.
 High flexibility.
 Cost effective.
 High computational power.
 Tasks are independent.
 Linear scaling.

Hadoop has several advantages that make it a popular choice for big data processing:

 Scalability: Hadoop can easily scale to handle large amounts of data by adding more nodes
to the cluster.
 Cost-effective: Hadoop is designed to work with commodity hardware, which makes it a
cost-effective option for storing and processing large amounts of data.
 Fault-tolerance: Hadoop’s distributed architecture provides built-in fault-tolerance, which
means that if one node in the cluster goes down, the data can still be processed by the other
nodes.
 Flexibility: Hadoop can process structured, semi-structured, and unstructured data, which
makes it a versatile option for a wide range of big data scenarios.
 Open-source: Hadoop is open-source software, which means that it is free to use and
modify. This also allows developers to access the source code and make improvements or
add new features.
 Large community: Hadoop has a large and active community of developers and users who
contribute to the development of the software, provide support, and share best practices.
 Integration: Hadoop is designed to work with other big data technologies such as Spark,
Storm, and Flink, which allows for integration with a wide range of data processing and
analysis tools.
Disadvantages:
 Not very effective for small data.
 Hard cluster management.
 Has stability issues.
 Security concerns.
 Complexity: Hadoop can be complex to set up and maintain, especially for organizations
without a dedicated team of experts.
 Latency: Hadoop is not well-suited for low-latency workloads and may not be the best
choice for real-time data processing.
 Limited Support for Real-time Processing: Hadoop’s batch-oriented nature makes it less
suited for real-time streaming or interactive data processing use cases.
 Limited Support for Structured Data: Hadoop is designed to work with unstructured and
semi-structured data; it is not well-suited for structured data processing.
 Data Security: Hadoop's built-in security features, such as data encryption and user
authentication, are limited and disabled by default, which can make it difficult to secure
sensitive data.
 Limited Support for Ad-hoc Queries: Hadoop’s MapReduce programming model is not
well-suited for ad-hoc queries, making it difficult to perform exploratory data analysis.
 Limited Support for Graph and Machine Learning: Hadoop’s core components, HDFS and
MapReduce, are not well-suited for graph and machine learning workloads; specialized
components like Apache Giraph and Apache Mahout are available but have some limitations.
 Cost: Hadoop can be expensive to set up and maintain, especially for organizations with
large amounts of data.
 Data Loss: In the event of a hardware failure, data stored on a single node without sufficient
replication may be lost permanently.
 Data Governance: Data governance is a critical aspect of data management; Hadoop does
not provide built-in features to manage data lineage, data quality, data cataloging, and data
auditing.

Apache Spark

Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It
is based on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for
more types of computations, which includes interactive queries and stream processing. The main
feature of Spark is its in-memory cluster computing that increases the processing speed of an
application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries and streaming. Apart from supporting all these workloads in a
single system, it reduces the management burden of maintaining separate tools.

Evolution of Apache Spark

Spark started as one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by
Matei Zaharia. It was open-sourced in 2010 under a BSD license, donated to the Apache Software
Foundation in 2013, and became a top-level Apache project in February 2014.

Features of Apache Spark

Apache Spark has following features.

 Speed − Spark helps run applications in a Hadoop cluster up to 100 times faster in memory,
and 10 times faster when running on disk. This is possible because it reduces the number of
read/write operations to disk and stores intermediate processing data in memory.
 Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so
you can write applications in different languages. Spark also ships with 80 high-level
operators for interactive querying.
 Advanced Analytics − Spark not only supports ‘Map’ and ‘Reduce’; it also supports SQL
queries, streaming data, machine learning (ML), and graph algorithms.

Spark Built on Hadoop

Spark can be deployed alongside Hadoop components in three ways, as explained below.

 Standalone − In a standalone deployment, Spark occupies the place on top of
HDFS (Hadoop Distributed File System), and space is allocated for HDFS explicitly. Here, Spark
and MapReduce run side by side to cover all Spark jobs on the cluster.
 Hadoop YARN − In a YARN deployment, Spark simply runs on YARN without any pre-
installation or root access required. This helps integrate Spark into the Hadoop ecosystem or
Hadoop stack and allows other components to run on top of the stack.
 Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs in addition
to standalone deployment. With SIMR, a user can start Spark and use its shell without any
administrative access.

Components of Spark

The different components of Spark are described below.

Apache Spark Core


Spark Core is the underlying general execution engine for the Spark platform, upon which all other
functionality is built. It provides in-memory computing and the ability to reference datasets in
external storage systems.
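
The sketch below is a minimal PySpark word count built on Spark Core's RDD API; it is only an illustrative example, assuming PySpark is installed and run in local mode (no Hadoop cluster required).

# Minimal PySpark sketch: word count with Spark Core RDDs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("RDDWordCount").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark core provides in-memory computing",
                        "spark extends the mapreduce model"])

counts = (lines.flatMap(lambda line: line.split())   # split each line into words
               .map(lambda word: (word, 1))          # pair each word with a count of 1
               .reduceByKey(lambda a, b: a + b))     # sum the counts per word

print(counts.collect())
spark.stop()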

Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called
SchemaRDD, which provides support for structured and semi-structured data.
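
As a hedged sketch, the example below uses the DataFrame API of current Spark versions (which grew out of the SchemaRDD abstraction mentioned above) to run a SQL query; the table and column names are made up.

# Minimal Spark SQL sketch: register a DataFrame as a temporary view and query it with SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("SparkSQLDemo").getOrCreate()

people = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"])

people.createOrReplaceTempView("people")     # expose the DataFrame to SQL
adults = spark.sql("SELECT name, age FROM people WHERE age > 30")
adults.show()

spark.stop()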

Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets)
transformations on those mini-batches of data.
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework on top of Spark, built on Spark's distributed
memory-based architecture. According to benchmarks done by the MLlib developers against the
Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop
disk-based version of Apache Mahout (before Mahout gained a Spark interface).

GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for
expressing graph computation that can model user-defined graphs using the Pregel abstraction
API. It also provides an optimized runtime for this abstraction.

Map Reduce in Hadoop


One of the three components of Hadoop is Map Reduce. The first component of Hadoop
that is, Hadoop Distributed File System (HDFS) is responsible for storing the file. The second
component that is, Map Reduce is responsible for processing the file.
MapReduce has mainly two tasks, which are divided phase-wise: in the first phase, Map is used,
and in the next phase, Reduce is used.
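
As a hedged illustration of the two phases, the sketch below shows a word count written for Hadoop Streaming in Python; the script names are hypothetical, and Hadoop Streaming pipes text through them via standard input and output.

# mapper.py - Map phase: emit one tab-separated (word, 1) pair per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - Reduce phase: Hadoop sorts mapper output by key, so all counts
# for the same word arrive one after another and can be summed in a single pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

The job would typically be submitted with the hadoop-streaming jar, passing these two scripts as the -mapper and -reducer options along with HDFS -input and -output paths.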

Introduction to NoSQL

NoSQL is a type of database management system (DBMS) that is designed to handle and
store large volumes of unstructured and semi-structured data. Unlike traditional relational
databases that use tables with pre-defined schemas to store data, NoSQL databases use flexible
data models that can adapt to changes in data structures and are capable of scaling horizontally to
handle growing amounts of data.
The term NoSQL originally referred to “non-SQL” or “non-relational” databases, but the term
has since evolved to mean “not only SQL,” as NoSQL databases have expanded to include a
wide range of different database architectures and data models.
NoSQL databases are generally classified into four main categories:
1. Document databases: These databases store data as semi-structured documents, such as
JSON or XML, and can be queried using document-oriented query languages.
2. Key-value stores: These databases store data as key-value pairs, and are optimized for simple
and fast read/write operations.
3. Column-family stores: These databases store data as column families, which are sets of
columns that are treated as a single entity. They are optimized for fast and efficient querying
of large amounts of data.
4. Graph databases: These databases store data as nodes and edges, and are designed to handle
complex relationships between data.
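
As a small conceptual sketch (plain Python data structures, not calls to any particular database's API), the same information could be shaped under each of the four models roughly as follows.

# Conceptual sketch of the four NoSQL data models using plain Python structures.

# 1. Document model: a self-contained, semi-structured (JSON-like) document.
document = {"_id": "user:42", "name": "Asha", "orders": [{"item": "book", "qty": 2}]}

# 2. Key-value model: an opaque value looked up by a single key.
key_value_store = {"user:42": '{"name": "Asha"}'}

# 3. Column-family model: rows keyed by id, grouped into named column families.
column_family_store = {
    "user:42": {
        "profile": {"name": "Asha", "city": "Chennai"},    # one column family
        "activity": {"last_login": "2024-01-15"},          # another column family
    }
}

# 4. Graph model: nodes and edges describing relationships.
nodes = {"user:42": {"label": "Person"}, "book:7": {"label": "Product"}}
edges = [("user:42", "PURCHASED", "book:7")]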

ACID

In the context of Hadoop and data science, "ACID" refers to a set of properties -
Atomicity, Consistency, Isolation, and Durability - that ensure data integrity and reliability when
performing transactions on large datasets within a distributed processing environment like
Hadoop. These properties guarantee that data modifications happen completely and consistently,
even in the face of system failures or concurrent operations. Essentially, every data change is
treated as a single, indivisible unit, preventing partial updates or inconsistencies within the data.
Breaking down ACID:
 Atomicity:
A transaction is either fully completed or not at all; if any part of a transaction fails, the entire
operation is rolled back to its previous state, preventing partial updates.
 Consistency:
A transaction must always bring the database from one valid state to another, upholding
predefined data rules and constraints.
 Isolation:
Multiple concurrent transactions should be isolated from each other, meaning that the ongoing
operations of one transaction should not interfere with the data being accessed by another
transaction.
 Durability:
Once a transaction is committed, the changes made to the data must be permanently stored and
persist even in case of system crashes or power outages.
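
The sketch below is a minimal, non-Hadoop illustration of these properties using Python's built-in sqlite3 module; in the Hadoop ecosystem the same guarantees are obtained through tools such as Hive ACID tables or HBase, discussed next, rather than this API.

# Minimal illustration of atomicity and durability: either both updates of a
# transfer are committed together, or neither is.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'a'")
    # Consistency rule for this example: balances must never go negative.
    (balance,) = conn.execute("SELECT balance FROM accounts WHERE name = 'a'").fetchone()
    if balance < 0:
        raise ValueError("constraint violated: negative balance")
    conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'b'")
    conn.commit()          # durability: committed changes persist
except Exception:
    conn.rollback()        # atomicity: the partial transfer is undone

print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('a', 100), ('b', 0)] - the failed transfer left the data unchanged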
Why is ACID important in Hadoop?
 Data Integrity:
With large, distributed datasets, maintaining data consistency becomes crucial, especially when
multiple users or applications are accessing and modifying data simultaneously.
 Reliable Analytics:
ACID properties enable data analysts to trust the results of their queries, knowing that the data
they are working with is accurate and not corrupted by incomplete or conflicting updates.
How does Hadoop implement ACID?
 Hive:
A popular data warehousing tool built on top of Hadoop that offers ACID-compliant
transactions through specific table formats and configurations, allowing for reliable updates and
deletes on large datasets.
 HBase:
A NoSQL database that can be used with Hadoop and provides ACID guarantees for certain
operations, enabling real-time data updates with consistency.

CAP:

In the context of data science and Hadoop, "CAP" refers to the "Consistency,
Availability, and Partition Tolerance" theorem, a fundamental concept in distributed systems. It
states that a system can only guarantee two of these three properties at any given time. When
dealing with large datasets on a distributed cluster like Hadoop, you must therefore make trade-offs
between consistency of data, availability of the system, and tolerance to network partitions,
depending on your application needs; you cannot have all three fully optimized
simultaneously.
Key points about CAP in Hadoop:
 Consistency:
Ensures that all nodes in a distributed system have the same data at any given time, meaning if
you read data from multiple nodes, you'll get the same value.
 Availability:
Guarantees that the system remains operational even if some nodes fail, allowing users to access
data even during partial system outages.
 Partition Tolerance:
The ability for a distributed system to continue functioning even when network partitions occur
(parts of the system are isolated from each other).

Base Model:

The rise in popularity of NoSQL databases provided flexibility and fluidity with ease to
manipulate data, and as a result a new database model was designed to reflect these
properties. The acronym BASE is slightly more confusing than ACID, but the words
behind it suggest the ways in which the BASE model is different. BASE stands for:
1. Basically Available: Instead of enforcing immediate consistency, BASE-
modelled NoSQL databases ensure the availability of data by spreading and replicating
it across the nodes of the database cluster.
2. Soft State: Due to the lack of immediate consistency, data values may change over time.
The BASE model breaks with the concept of a database that enforces its own
consistency, delegating that responsibility to developers.
3. Eventually Consistent: BASE does not require immediate consistency, but that
does not mean it is never achieved. However, until it is, data reads are still
possible (even though they might not reflect reality).

UNIT-V
CASE STUDY:

Disease Prediction
This case study aims to implement a robust machine-learning model that can efficiently
predict the disease of a human based on the symptoms that he/she possesses. Let us look into
how we can approach this machine-learning problem:

Approach:
 Gathering the Data: Data preparation is the primary step for any machine learning problem.
We will be using a dataset from Kaggle for this problem. This dataset consists of two CSV
files, one for training and one for testing. There are a total of 133 columns in the dataset, of
which 132 columns represent the symptoms and the last column is the prognosis.
 Cleaning the Data: Cleaning is the most important step in a machine learning project. The
quality of our data determines the quality of our machine-learning model. So it is always
necessary to clean the data before feeding it to the model for training. In our dataset all the
feature columns are numerical; the target column, i.e. prognosis, is a string type and is
encoded to numerical form using a label encoder.
 Model Building: After gathering and cleaning the data, the data is ready and can be used to
train a machine learning model. We will be using this cleaned data to train the Support Vector
Classifier, Naive Bayes Classifier, and Random Forest Classifier. We will be using
a confusion matrix to determine the quality of the models.
 Inference: After training the three models we will be predicting the disease for the input
symptoms by combining the predictions of all three models. This makes our overall prediction
more robust and accurate.
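
A hedged sketch of this approach with scikit-learn is shown below; the file name Training.csv and the column layout follow the dataset description above, but the details should be adapted to the actual Kaggle files.

# Sketch: encode the target, train three classifiers, and combine their
# predictions by majority vote. Assumes a Training.csv with 132 symptom
# columns and a string "prognosis" column, as described above.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

data = pd.read_csv("Training.csv").dropna(axis=1)     # drop any empty trailing columns

encoder = LabelEncoder()
X = data.drop(columns=["prognosis"])
y = encoder.fit_transform(data["prognosis"])          # encode disease names to integers

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "svc": SVC(),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(random_state=42),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, predictions[name]))
    print(confusion_matrix(y_test, predictions[name]))

# Majority vote across the three models makes the final prediction more robust.
combined = pd.DataFrame(predictions).mode(axis=1)[0].astype(int).values
print("combined accuracy:", accuracy_score(y_test, combined))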

Setting research goals


Setting research goals in data science involves identifying the problem, reviewing
relevant data, and formulating goals that are specific, measurable, and time-bound. The goals
should be communicated and documented throughout the research process.
Steps for setting research goals in data science
1. Understand the problem: Identify the problem or opportunity that the research will address.
2. Review existing data: Look at relevant data, such as sales, customer feedback, or industry
reports.
3. Formulate SMART goals: Use the SMART criteria to create goals that are specific,
measurable, achievable, relevant, and time-bound.
4. Prioritize and align goals: Depending on the scope of the research, there may be multiple
goals.
5. Communicate and document goals: Clearly communicate and document the goals throughout
the research process.
6. Create a project charter: Include the research goal, deliverables, timetable, and resources
needed.

Data Retrieval
A case study on data retrieval in data science could focus on how a company like
Amazon utilizes its vast customer interaction data, stored in a distributed database system, to
efficiently retrieve relevant information for personalized product recommendations, enabling
targeted marketing campaigns and driving sales. This process involves complex SQL queries,
data filtering, and optimized data retrieval techniques to deliver quick results to users despite the
massive data volume.
Key aspects of this case study:
 Data Source:
Customer browsing history, purchase history, ratings, demographics, and other interaction data
from the Amazon website and app.
 Data Retrieval Challenges:
 Scalability: Handling large volumes of data in real-time to provide instant recommendations.
 Data Complexity: Integrating data from multiple sources with varying structures.
 Performance Optimization: Ensuring fast query response times to maintain user experience.
 Data Retrieval Techniques:
 SQL Queries: Using optimized SQL queries to filter and extract relevant data from the database.
 Data Warehousing: Storing data in a structured data warehouse for efficient querying.
 Distributed Computing: Leveraging distributed processing frameworks like Apache Spark to
handle large datasets efficiently.
 Caching Mechanisms: Implementing caching strategies to store frequently accessed data for faster
retrieval.
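
As a generic, hedged sketch of such a retrieval pipeline (not Amazon's actual system; the file path, schema, and column names are invented for illustration), a distributed SQL query with caching might look like this in Spark:

# Sketch: load interaction data, filter it with a SQL query, and cache the
# frequently accessed subset for fast repeated retrieval.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RetrievalSketch").getOrCreate()

interactions = spark.read.parquet("/data/customer_interactions.parquet")  # hypothetical path
interactions.createOrReplaceTempView("interactions")

recent = spark.sql("""
    SELECT customer_id, product_id, rating, event_time
    FROM interactions
    WHERE rating >= 4 AND event_time >= '2024-01-01'
""")

recent.cache()            # keep the hot subset in memory for later queries
print(recent.count())     # the first action materializes and caches the data

top_products = (recent.groupBy("product_id").count()
                      .orderBy("count", ascending=False)
                      .limit(10))
top_products.show()

spark.stop()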

Data science preparation


Data science preparation involves developing skills in programming, mathematics, and data
visualization, as well as gaining experience with data science tools. You can also prepare for a
data science interview by researching the company and the role.
Skills
 Programming: Learn programming languages like SQL, R, and Python
 Mathematics: Develop strong analytical and mathematical skills
 Data visualization: Learn how to use tools like Tableau, Power BI, and Matplotlib
 Machine learning: Learn about machine learning algorithms, TensorFlow, PyTorch, and Scikit-
Learn
 Data manipulation: Learn how to collect, clean, and label data
Education
 Earn a bachelor's degree in computer science, data science, or a related field
 Consider earning a master's degree in data science
 Take structured courses or attend in-person or online classes

Data exploration
Data exploration is the initial stage of data analysis, where data scientists examine a
dataset to identify its characteristics and patterns. It relies on data visualization tools and
statistical methods to help understand the data's quality, range, and scale.
 It helps identify errors, outliers, and anomalies
 It helps understand the data's characteristics, which helps determine the type of analysis needed
 It helps ensure that the results are valid and applicable to business goals
 It helps decision-makers understand the data context and make informed decisions
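
A small, hedged sketch of a typical first exploration pass with pandas is shown below; the file name and column are hypothetical.

# Minimal data-exploration sketch: inspect structure, quality, range, and outliers.
import pandas as pd

df = pd.read_csv("patients.csv")         # hypothetical dataset

print(df.shape)                          # size of the dataset
print(df.dtypes)                         # column types
print(df.describe())                     # range and scale of numeric columns
print(df.isnull().sum())                 # missing values per column
print(df.duplicated().sum())             # duplicate rows

# Simple outlier check on one numeric column using the interquartile range (IQR).
col = "cholesterol"                      # hypothetical column name
q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)]
print(len(outliers), "potential outliers in", col)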

Disease profiling
"Disease profiling" in data science refers to the process of analyzing large datasets of patient
information, including medical records, genetic data, lifestyle factors, and clinical test results, to
identify patterns and characteristics associated with specific diseases. This allows for a better
understanding of disease progression, risk factors, and potential treatment strategies; essentially,
it creates a detailed "profile" of a disease based on data analysis.
Key aspects of disease profiling:
 Data collection:
Gathering comprehensive patient data from various sources like electronic health records
(EHRs), clinical trials, and genomic databases.
 Data cleaning and pre-processing:
Standardizing and organizing data to ensure accuracy and consistency for analysis.
 Feature engineering:
Identifying relevant variables from the data that could contribute to disease prediction, like
demographics, symptoms, lab results, and genetic markers.
 Statistical analysis:
Using descriptive statistics to understand the distribution of disease characteristics within the
population.
 Machine learning algorithms:
Applying algorithms like decision trees, random forests, or neural networks to identify patterns
and predict disease risk or progression based on patient profiles.
Applications of disease profiling:
 Early detection:
Identifying individuals at high risk of developing a disease based on their profile to enable early
intervention and preventative measures.
 Personalized medicine:
Tailoring treatment plans to individual patients based on their unique disease profile.
 Disease surveillance:
Monitoring trends and outbreaks of diseases within a population using data analysis.
 Drug discovery:
Identifying potential targets for new drugs by analyzing the molecular mechanisms underlying
diseases.
Challenges in disease profiling:
 Data quality issues:
Inconsistent data formats, missing values, and potential errors in medical records.
 Data privacy concerns:
Protecting sensitive patient information while utilizing data for analysis.
 Complex disease interactions:
Understanding the interplay of multiple factors contributing to disease development.
Example of disease profiling:
 Cancer profiling:
Analyzing genetic mutations in tumor samples to identify the specific subtype of cancer and
guide treatment decisions.
 Cardiovascular disease profiling:
Identifying individuals at high risk for heart disease based on factors like family history,
cholesterol levels, and blood pressure.

Presentation and Automation:


Data science presentations use visual aids to communicate findings, while data science
automation uses tools to automate tasks.
Data science presentations
 Visual aids
Data scientists use charts, graphs, and infographics to help audiences understand complex
information.
 Content
Data science presentations often include an overview, problem statement, data source,
methodology, results, interpretation, challenges, and conclusion.
 Audience
Data scientists present their findings to leadership and other stakeholders.
Data science automation
 Tools
Data science automation uses tools like Apache Kafka to collect, process, and analyze data.
 Tasks
Data science automation can automate tasks like data collection, processing, model building,
and data analytics.
 Benefits
Data science automation can help organizations streamline their data management processes
and save time.
 Applications
Data science automation is used in retail, eCommerce, and other industries.
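
As a hedged sketch of the kind of automated data collection mentioned above (assuming the third-party kafka-python package and a Kafka broker at localhost:9092; the topic name is made up):

# Sketch: push application events into a Kafka topic that a downstream
# processing job can consume continuously.
import json
from kafka import KafkaProducer    # third-party package: kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Hypothetical events collected from an application.
events = [
    {"user_id": 1, "action": "view", "item": "laptop"},
    {"user_id": 2, "action": "purchase", "item": "phone"},
]

for event in events:
    producer.send("clickstream", value=event)   # topic name is illustrative

producer.flush()    # ensure all buffered events are delivered
producer.close()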
