INTRODUCTION TO DATA SCIENCE
INTRODUCTION TO HADOOP:
Hadoop is an open-source software framework that is used for storing and processing
large amounts of data in a distributed computing environment. It is designed to handle big data
and is based on the MapReduce programming model, which allows for the parallel processing
of large datasets.
What is Hadoop?
Hadoop is an open-source software framework for storing large amounts of data and performing
computation on it. The framework is written mainly in Java, with some native code in C and
utility shell scripts.
Hadoop has two main components:
HDFS (Hadoop Distributed File System): This is the storage component of Hadoop, which
allows for the storage of large amounts of data across multiple machines. It is designed to
work with commodity hardware, which makes it cost-effective.
YARN (Yet Another Resource Negotiator): This is the resource management component of
Hadoop, which manages the allocation of resources (such as CPU and memory) for
processing the data stored in HDFS.
Hadoop also includes several additional modules that extend its functionality, such as Hive (a
SQL-like query language), Pig (a high-level platform for creating MapReduce programs), and
HBase (a non-relational, distributed database).
Hadoop is commonly used in big data scenarios such as data warehousing, business
intelligence, and machine learning. It’s also used for data processing, data analysis, and data
mining. It enables the distributed processing of large data sets across clusters of computers
using a simple programming model.
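To make the MapReduce model concrete, below is a minimal word-count sketch written for Hadoop Streaming, which lets plain scripts act as the mapper and reducer by reading standard input and writing standard output. The file name mapper_reducer.py and the way both roles are combined into one script with a command-line switch are illustrative choices, not part of Hadoop itself; in practice the mapper and reducer are usually two separate scripts passed to the hadoop-streaming JAR.

# mapper_reducer.py - a minimal Hadoop Streaming word count (illustrative).
# Run the mapper as:  python mapper_reducer.py map
# Run the reducer as: python mapper_reducer.py reduce
import sys

def mapper():
    # Emit a (word, 1) pair for every word in the input split.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts the mapper output by key, so all counts for one word arrive together.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{total}")
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

The same word-count logic is what a native Java MapReduce job would express with Mapper and Reducer classes.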
History of Hadoop
Hadoop was developed by the Apache Software Foundation, and its co-founders are Doug
Cutting and Mike Cafarella. Co-founder Doug Cutting named it after his son's toy elephant.
In October 2003, Google released its first paper on the Google File System. In January 2006,
MapReduce development started within the Apache Nutch project, with around 6,000 lines of
code for MapReduce and around 5,000 lines of code for HDFS. In April 2006, Hadoop 0.1.0 was released.
Hadoop is an open-source software framework for storing and processing big data. It was
created by the Apache Software Foundation in 2006, based on white papers written by Google
that described the Google File System (GFS, 2003) and the MapReduce programming model (2004).
The Hadoop framework allows for the distributed processing of large data sets across clusters
of computers using simple programming models. It is designed to scale up from single servers
to thousands of machines, each offering local computation and storage. It is used by many
organizations, including Yahoo, Facebook, and IBM, for a variety of purposes such as data
warehousing, log processing, and research. Hadoop has been widely adopted in the industry
and has become a key technology for big data processing.
Features of Hadoop:
1. It is fault tolerant.
2. It is highly available.
3. It is easy to program.
4. It offers huge, flexible storage.
5. It is low cost.
Hadoop has several key features that make it well-suited for big data processing:
Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for
the storage and processing of extremely large amounts of data.
Scalability: Hadoop can scale from a single server to thousands of machines, making it easy
to add more capacity as needed.
Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can continue to
operate even in the presence of hardware failures.
Data Locality: Hadoop provides a data-locality feature: computation is scheduled on the node
where the data is stored, which reduces network traffic and improves performance.
High Availability: Hadoop provides a high-availability feature, which helps to ensure that the
data is always accessible and is not lost.
Flexible Data Processing: Hadoop’s MapReduce programming model allows for the
processing of data in a distributed fashion, making it easy to implement a wide variety of
data processing tasks.
Data Integrity: Hadoop provides a built-in checksum feature, which helps to ensure that the
stored data is consistent and correct.
Data Replication: Hadoop provides a data replication feature, which replicates the data across
the cluster for fault tolerance.
Data Compression: Hadoop provides built-in data compression, which reduces storage space
and improves performance.
YARN: A resource management platform that allows multiple data processing engines like
real-time streaming, batch processing, and interactive SQL, to run and process data stored in
HDFS.
Hadoop Distributed File System
Hadoop has a distributed file system known as HDFS, which splits files into blocks and
distributes them across the nodes of a large cluster. In case of a node failure the system keeps
operating, and the data transfer between nodes needed for recovery is handled by HDFS.
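To make the block-and-replica idea concrete, here is a small conceptual sketch (not the real HDFS code) that splits a file size into 128 MB blocks and assigns each block's replicas to different data nodes. The round-robin placement and node names are purely illustrative; real HDFS uses rack-aware placement.

# Conceptual sketch of HDFS-style block splitting and replication (illustrative only).
BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size (128 MB)
REPLICATION = 3                  # default HDFS replication factor

def plan_blocks(file_size_bytes, nodes):
    """Return a mapping of block id -> nodes holding its replicas (round-robin)."""
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)   # ceiling division
    plan = {}
    for b in range(num_blocks):
        plan[f"block_{b}"] = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
    return plan

# A 500 MB file becomes 4 blocks, each stored on 3 different nodes,
# so losing any single node never loses a block.
print(plan_blocks(500 * 1024 * 1024, ["node1", "node2", "node3", "node4"]))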
Advantages of HDFS: it is inexpensive, immutable in nature, stores data reliably, tolerates
faults, is scalable and block-structured, can process a large amount of data simultaneously,
and more. Disadvantages of HDFS: its biggest disadvantage is that it is not a good fit for
small quantities of data; it also has potential stability issues and can be restrictive and rough
to work with. Hadoop also supports a wide range of software packages such as Apache Flume,
Apache Oozie, Apache HBase, Apache Sqoop, Apache Spark, Apache Storm, Apache Pig,
Apache Hive, Apache Phoenix, and Cloudera Impala.
Hadoop has several advantages that make it a popular choice for big data processing:
Scalability: Hadoop can easily scale to handle large amounts of data by adding more nodes
to the cluster.
Cost-effective: Hadoop is designed to work with commodity hardware, which makes it a
cost-effective option for storing and processing large amounts of data.
Fault-tolerance: Hadoop’s distributed architecture provides built-in fault-tolerance, which
means that if one node in the cluster goes down, the data can still be processed by the other
nodes.
Flexibility: Hadoop can process structured, semi-structured, and unstructured data, which
makes it a versatile option for a wide range of big data scenarios.
Open-source: Hadoop is open-source software, which means that it is free to use and
modify. This also allows developers to access the source code and make improvements or
add new features.
Large community: Hadoop has a large and active community of developers and users who
contribute to the development of the software, provide support, and share best practices.
Integration: Hadoop is designed to work with other big data technologies such as Spark,
Storm, and Flink, which allows for integration with a wide range of data processing and
analysis tools.
Disadvantages:
Not very effective for small data.
Hard cluster management.
Has stability issues.
Security concerns.
Complexity: Hadoop can be complex to set up and maintain, especially for organizations
without a dedicated team of experts.
Latency: Hadoop is not well-suited for low-latency workloads and may not be the best
choice for real-time data processing.
Limited Support for Real-time Processing: Hadoop’s batch-oriented nature makes it less
suited for real-time streaming or interactive data processing use cases.
Limited Support for Structured Data: Hadoop is designed to work with unstructured and
semi-structured data; it is not well-suited for structured data processing.
Data Security: out of the box, Hadoop provides only limited security features such as data
encryption or user authentication, which can make it difficult to secure sensitive data.
Limited Support for Ad-hoc Queries: Hadoop’s MapReduce programming model is not
well-suited for ad-hoc queries, making it difficult to perform exploratory data analysis.
Limited Support for Graph and Machine Learning: Hadoop's core components, HDFS and
MapReduce, are not well-suited for graph and machine learning workloads; specialized
projects such as Apache Giraph and Apache Mahout are available but have some limitations.
Cost: Hadoop can be expensive to set up and maintain, especially for organizations with
large amounts of data.
Data Loss: In the event of a hardware failure, the data stored in a single node may be lost
permanently.
Data Governance: data governance is a critical aspect of data management, and Hadoop does
not provide built-in features for managing data lineage, data quality, data cataloging, and
data auditing.
Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It
is based on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for
more types of computations, which includes interactive queries and stream processing. The main
feature of Spark is its in-memory cluster computing that increases the processing speed of an
application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries and streaming. Apart from supporting all these workloads in a
single system, it reduces the management burden of maintaining separate tools.
Spark started as a sub-project of Hadoop, developed in 2009 in UC Berkeley's AMPLab by
Matei Zaharia. It was open-sourced in 2010 under a BSD license, donated to the Apache
Software Foundation in 2013, and became a top-level Apache project in February 2014.
Speed − Spark helps run applications on a Hadoop cluster up to 100 times faster in memory
and 10 times faster when running on disk. This is possible because it reduces the number of
read/write operations to disk and stores intermediate processing data in memory (a short code
sketch of this idea follows these feature descriptions).
Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you
can write applications in different languages. Spark also comes with 80 high-level operators
for interactive querying.
Advanced Analytics − Spark not only supports 'Map' and 'Reduce'; it also supports SQL
queries, streaming data, machine learning (ML), and graph algorithms.
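A minimal PySpark sketch of the in-memory idea mentioned above: an RDD is cached after it is first computed, so subsequent actions reuse the in-memory data instead of recomputing it. It assumes a local PySpark installation; the application name is arbitrary.

# A minimal PySpark sketch: cache an RDD in memory and reuse it across actions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("InMemoryExample").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1_000_000))
squares = numbers.map(lambda x: x * x).cache()   # keep the computed RDD in memory

print(squares.count())   # first action: computes the RDD and caches it
print(squares.sum())     # second action: served from the in-memory copy

spark.stop()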
Spark can be deployed together with Hadoop components in three ways, as explained below.
Standalone − In a Spark standalone deployment, Spark occupies the place on top of HDFS
(Hadoop Distributed File System), and space is allocated for HDFS explicitly. Here, Spark and
MapReduce run side by side to cover all Spark jobs on the cluster.
Hadoop YARN − In a Hadoop YARN deployment, Spark simply runs on YARN without any
pre-installation or root access required. This helps to integrate Spark into the Hadoop
ecosystem or Hadoop stack and allows other components to run on top of the stack.
Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs in addition to
standalone deployment. With SIMR, a user can start Spark and use its shell without any
administrative access.
Components of Spark
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called
SchemaRDD, which provides support for structured and semi-structured data.
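A minimal sketch of Spark SQL in Python: the modern DataFrame API (the successor to SchemaRDD) is used to register structured data as a temporary view and query it with plain SQL. The table name, column names, and rows are made up for the example.

# A minimal Spark SQL sketch: query a DataFrame with plain SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

people = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)
people.createOrReplaceTempView("people")

# Structured and semi-structured data can be queried once it is registered as a view.
spark.sql("SELECT name, age FROM people WHERE age > 30").show()

spark.stop()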
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets)
transformations on those mini-batches of data.
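A minimal Spark Streaming (DStream) sketch of the mini-batch idea: it reads lines from a socket on localhost:9999 (which could be fed with a tool like netcat) and counts words in 5-second mini-batches. The host, port, and batch interval are arbitrary choices for the example.

# A minimal DStream sketch: word counts over 5-second mini-batches from a socket.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingWordCount")
ssc = StreamingContext(sc, 5)                      # 5-second mini-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))   # RDD-style transformations per batch
counts.pprint()                                    # print each mini-batch's counts

ssc.start()
ssc.awaitTermination()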
MLlib (Machine Learning Library)
MLlib is a distributed machine-learning framework on top of Spark that takes advantage of
Spark's distributed, memory-based architecture. According to benchmarks done by the MLlib
developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine
times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a
Spark interface).
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for
expressing graph computation that can model user-defined graphs by using the Pregel
abstraction API. It also provides an optimized runtime for this abstraction.
Introduction to NoSQL
NoSQL is a type of database management system (DBMS) that is designed to handle and
store large volumes of unstructured and semi-structured data. Unlike traditional relational
databases that use tables with pre-defined schemas to store data, NoSQL databases use flexible
data models that can adapt to changes in data structures and are capable of scaling horizontally to
handle growing amounts of data.
The term NoSQL originally referred to “non-SQL” or “non-relational” databases, but the term
has since evolved to mean “not only SQL,” as NoSQL databases have expanded to include a
wide range of different database architectures and data models.
NoSQL databases are generally classified into four main categories:
1. Document databases: These databases store data as semi-structured documents, such as
JSON or XML, and can be queried using document-oriented query languages (a short example
follows this list).
2. Key-value stores: These databases store data as key-value pairs, and are optimized for simple
and fast read/write operations.
3. Column-family stores: These databases store data as column families, which are sets of
columns that are treated as a single entity. They are optimized for fast and efficient querying
of large amounts of data.
4. Graph databases: These databases store data as nodes and edges, and are designed to handle
complex relationships between data.
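As an illustration of the document model, here is a minimal sketch using MongoDB through the pymongo driver. The connection string, database name, collection, and documents are assumptions made for the example, and MongoDB is only one of several document databases.

# A minimal document-database sketch using MongoDB via pymongo (illustrative names).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# Documents are schemaless, JSON-like records: fields can differ between documents.
db.products.insert_one({"name": "laptop", "price": 899, "tags": ["electronics"]})
db.products.insert_one({"name": "desk", "price": 120, "material": "oak"})

# Queries use a document-oriented filter syntax instead of SQL.
for doc in db.products.find({"price": {"$lt": 500}}):
    print(doc["name"], doc["price"])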
ACID
In the context of Hadoop and data science, "ACID" refers to a set of properties -
Atomicity, Consistency, Isolation, and Durability - that ensure data integrity and reliability when
performing transactions on large datasets within a distributed processing environment like
Hadoop. These properties guarantee that data modifications happen completely and consistently,
even in the face of system failures or concurrent operations. Essentially, every data change is
treated as a single, indivisible unit, which prevents partial updates or inconsistencies within the data.
Breaking down ACID:
Atomicity:
A transaction is either fully completed or not at all; if any part of a transaction fails, the entire
operation is rolled back to its previous state, preventing partial updates.
Consistency:
A transaction must always bring the database from one valid state to another, upholding
predefined data rules and constraints.
Isolation:
Multiple concurrent transactions should be isolated from each other, meaning that the ongoing
operations of one transaction should not interfere with the data being accessed by another
transaction.
Durability:
Once a transaction is committed, the changes made to the data must be permanently stored and
persist even in case of system crashes or power outages.
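To make atomicity and rollback concrete, here is a minimal sketch using SQLite (chosen only because it ships with Python, not because it is part of Hadoop): both halves of a money transfer either commit together or are rolled back together.

# A minimal atomicity sketch with SQLite: an all-or-nothing money transfer.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # starts a transaction; commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 70 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 70 WHERE name = 'bob'")
        raise RuntimeError("simulated failure before the transfer completes")
except RuntimeError:
    pass

# Both balances are unchanged because the whole transaction was rolled back.
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())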
Why is ACID important in Hadoop?
Data Integrity:
With large, distributed datasets, maintaining data consistency becomes crucial, especially when
multiple users or applications are accessing and modifying data simultaneously.
Reliable Analytics:
ACID properties enable data analysts to trust the results of their queries, knowing that the data
they are working with is accurate and not corrupted by incomplete or conflicting updates.
How does Hadoop implement ACID?
Hive:
A popular data warehousing tool built on top of Hadoop that offers ACID-compliant
transactions through specific table formats and configurations, allowing for reliable updates and
deletes on large datasets.
HBase:
A NoSQL database that can be used with Hadoop and provides ACID guarantees for certain
operations, enabling real-time data updates with consistency.
CAP:
In the context of data science and Hadoop, "CAP" refers to the "Consistency,
Availability, and Partition Tolerance" theorem, a fundamental concept in distributed systems
which states that a system can only guarantee two of these three properties at any given time.
When dealing with large datasets on a distributed cluster like Hadoop, you must therefore make
trade-offs between consistency of the data, availability of the system, and tolerance to network
partitions, depending on your application's needs; essentially, you cannot have all three fully
optimized simultaneously.
Key points about CAP in Hadoop:
Consistency:
Ensures that all nodes in a distributed system have the same data at any given time, meaning if
you read data from multiple nodes, you'll get the same value.
Availability:
Guarantees that the system remains operational even if some nodes fail, allowing users to access
data even during partial system outages.
Partition Tolerance:
The ability for a distributed system to continue functioning even when network partitions occur
(parts of the system are isolated from each other).
Base Model:
The rise in popularity of NoSQL databases brought flexibility, fluidity, and ease of
manipulating data, and as a result a new database model was designed to reflect these
properties. The acronym BASE is slightly more confusing than ACID; however, the words
behind it suggest the ways in which the BASE model is different. BASE stands for:
1. Basically Available: Instead of requiring immediate consistency, BASE-modelled NoSQL
databases ensure the availability of data by spreading and replicating it across the nodes of
the database cluster.
2. Soft State: Due to the lack of immediate consistency, the data values may change over time.
The BASE model breaks with the idea of a database that enforces its own consistency,
delegating that responsibility to developers.
3. Eventually Consistent: The fact that BASE does not require immediate consistency does not
mean that it never achieves it. However, until it does, data reads are still possible (even
though they might not reflect the latest writes).
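The following toy sketch (purely illustrative, with no real replication protocol) mimics the BASE behaviour described above: a write is accepted by one replica immediately, reads from other replicas may be stale for a while, and a later synchronization step makes all replicas converge.

# A toy sketch of eventual consistency across three in-memory "replicas".
replicas = [{"x": 1}, {"x": 1}, {"x": 1}]

def write(key, value):
    replicas[0][key] = value           # basically available: accept the write on one replica

def read(replica_id, key):
    return replicas[replica_id][key]   # soft state: the value may be stale

def sync():
    for r in replicas[1:]:             # eventually consistent: propagate the latest state
        r.update(replicas[0])

write("x", 42)
print(read(1, "x"))   # 1  -> a stale read before synchronization
sync()
print(read(1, "x"))   # 42 -> the replicas have converged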
UNIT-V
CASE STUDY:
Disease Prediction
This case study aims to implement a robust machine-learning model that can efficiently
predict a person's disease based on the symptoms that he/she possesses. Let us look into
how we can approach this machine-learning problem:
Approach:
Gathering the Data: Data preparation is the primary step for any machine learning problem.
We will be using a dataset from Kaggle for this problem. This dataset consists of two CSV
files, one for training and one for testing. There are a total of 133 columns in the dataset, out
of which 132 columns represent the symptoms and the last column is the prognosis.
Cleaning the Data: Cleaning is the most important step in a machine learning project. The
quality of our data determines the quality of our machine-learning model. So it is always
necessary to clean the data before feeding it to the model for training. In our dataset, all the
columns are numerical except the target column, i.e. prognosis, which is of string type and is
encoded to numerical form using a label encoder.
Model Building: After gathering and cleaning the data, the data is ready and can be used to
train a machine learning model. We will be using this cleaned data to train the Support Vector
Classifier, Naive Bayes Classifier, and Random Forest Classifier. We will be using
a confusion matrix to determine the quality of the models.
Inference: After training the three models we will be predicting the disease for the input
symptoms by combining the predictions of all three models. This makes our overall prediction
more robust and accurate.
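A minimal scikit-learn sketch of this approach follows. The file names Training.csv and Testing.csv, the prognosis column name, and the step that drops fully empty columns are assumptions about the Kaggle dataset; the three classifiers and the majority vote mirror the steps described above.

# A minimal sketch of the disease-prediction pipeline (assumed Kaggle file names).
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

train = pd.read_csv("Training.csv").dropna(axis=1, how="all")   # drop any fully empty columns
test = pd.read_csv("Testing.csv").dropna(axis=1, how="all")

# Encode the string prognosis labels into numbers.
encoder = LabelEncoder()
X_train = train.drop(columns="prognosis")
y_train = encoder.fit_transform(train["prognosis"])
X_test = test.drop(columns="prognosis")
y_test = encoder.transform(test["prognosis"])

models = {
    "svc": SVC(),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(random_state=42),
}
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    print(name, "confusion matrix:\n", confusion_matrix(y_test, predictions[name]))

# Combine the three models with a majority vote over their predictions.
stacked = np.stack(list(predictions.values()))
final = np.apply_along_axis(lambda votes: np.bincount(votes).argmax(), 0, stacked)
print("ensemble accuracy:", (final == y_test).mean())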
Data Retrieval
A case study on data retrieval in data science could focus on how a company like
Amazon utilizes its vast customer interaction data, stored in a distributed database system, to
efficiently retrieve relevant information for personalized product recommendations, enabling
targeted marketing campaigns and driving sales. This process involves complex SQL queries,
data filtering, and optimized data retrieval techniques to deliver quick results to users despite
the massive data volume.
Key aspects of this case study:
Data Source:
Customer browsing history, purchase history, ratings, demographics, and other interaction data
from the Amazon website and app.
Data Retrieval Challenges:
Scalability: Handling large volumes of data in real-time to provide instant recommendations.
Data Complexity: Integrating data from multiple sources with varying structures.
Performance Optimization: Ensuring fast query response times to maintain user experience.
Data Retrieval Techniques:
SQL Queries: Using optimized SQL queries to filter and extract relevant data from the database.
Data Warehousing: Storing data in a structured data warehouse for efficient querying.
Distributed Computing: Leveraging distributed processing frameworks like Apache Spark to
handle large datasets efficiently.
Caching Mechanisms: Implementing caching strategies to store frequently accessed data for faster
retrieval.
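A minimal sketch combining two of the techniques above, a parameterized SQL query plus an in-memory cache; the SQLite database file store.db and the purchases(customer_id, product, ts) table are hypothetical stand-ins for a real distributed store.

# A minimal sketch: parameterized SQL retrieval with an in-memory cache on top.
import sqlite3
from functools import lru_cache

conn = sqlite3.connect("store.db")   # hypothetical database file

@lru_cache(maxsize=10_000)           # keep results for frequently requested customers
def recent_purchases(customer_id, limit=10):
    rows = conn.execute(
        "SELECT product, ts FROM purchases "
        "WHERE customer_id = ? ORDER BY ts DESC LIMIT ?",
        (customer_id, limit),
    ).fetchall()
    return tuple(rows)               # tuples are immutable, so they are safe to cache

# The first call hits the database; repeated calls for the same customer
# are served from the cache, which keeps response times low.
print(recent_purchases(42))
print(recent_purchases(42))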
Data exploration
Data exploration is the initial stage of data analysis, where data scientists examine a
dataset to identify its characteristics and patterns. It's a statistical process that uses data
visualization tools and statistical methods to help understand the data's quality, range, and scale.
It helps identify errors, outliers, and anomalies.
It helps understand the data's characteristics, which helps determine the type of analysis needed.
It helps ensure that the results are valid and applicable to business goals.
It helps decision-makers understand the data context and make informed decisions.
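A minimal pandas sketch of such a first exploration pass; data.csv is a placeholder file name, and the 3-standard-deviation rule is just one simple way to flag potential outliers.

# A minimal data-exploration sketch with pandas (placeholder file name).
import pandas as pd

df = pd.read_csv("data.csv")

print(df.shape)            # size of the dataset
print(df.dtypes)           # column types (numeric, categorical, ...)
print(df.describe())       # range, scale and spread of the numeric columns
print(df.isnull().sum())   # missing values per column

# A crude outlier check: flag values more than 3 standard deviations from the mean.
numeric = df.select_dtypes(include="number")
outliers = (numeric - numeric.mean()).abs() > 3 * numeric.std()
print(outliers.sum())      # number of flagged values per column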
Disease profiling
"Disease profiling" in data science refers to the process of analyzing large datasets of patient
information, including medical records, genetic data, lifestyle factors, and clinical test results, to
identify patterns and characteristics associated with specific diseases, allowing for better
understanding of disease progression, risk factors, and potential treatment strategies; essentially
creating a detailed "profile" of a disease based on data analysis.
Key aspects of disease profiling:
Data collection:
Gathering comprehensive patient data from various sources like electronic health records
(EHRs), clinical trials, and genomic databases.
Data cleaning and pre-processing:
Standardizing and organizing data to ensure accuracy and consistency for analysis.
Feature engineering:
Identifying relevant variables from the data that could contribute to disease prediction, like
demographics, symptoms, lab results, and genetic markers.
Statistical analysis:
Using descriptive statistics to understand the distribution of disease characteristics within the
population.
Machine learning algorithms:
Applying algorithms like decision trees, random forests, or neural networks to identify patterns
and predict disease risk or progression based on patient profiles.
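A small, hypothetical sketch tying the steps above together: a toy patient table, one engineered feature, and a logistic regression that estimates disease risk. All column names and values are invented for illustration; a real profile would draw on far richer EHR and genomic data.

# A hypothetical disease-profiling sketch: engineered features + a risk model.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

patients = pd.DataFrame({
    "age":         [45, 61, 38, 70, 52, 33, 66, 58],
    "cholesterol": [210, 260, 180, 290, 240, 170, 275, 230],
    "smoker":      [0, 1, 0, 1, 1, 0, 1, 0],
    "has_disease": [0, 1, 0, 1, 1, 0, 1, 0],
})

# Feature engineering: derive a simple high-cholesterol indicator.
patients["high_chol"] = (patients["cholesterol"] > 240).astype(int)

X = patients[["age", "cholesterol", "smoker", "high_chol"]]
y = patients["has_disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
print("estimated disease risk:", model.predict_proba(X_test)[:, 1])   # per-patient probability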
Applications of disease profiling:
Early detection:
Identifying individuals at high risk of developing a disease based on their profile to enable early
intervention and preventative measures.
Personalized medicine:
Tailoring treatment plans to individual patients based on their unique disease profile.
Disease surveillance:
Monitoring trends and outbreaks of diseases within a population using data analysis.
Drug discovery:
Identifying potential targets for new drugs by analyzing the molecular mechanisms underlying
diseases.
Challenges in disease profiling:
Data quality issues:
Inconsistent data formats, missing values, and potential errors in medical records.
Data privacy concerns:
Protecting sensitive patient information while utilizing data for analysis.
Complex disease interactions:
Understanding the interplay of multiple factors contributing to disease development.
Example of disease profiling:
Cancer profiling:
Analyzing genetic mutations in tumor samples to identify the specific subtype of cancer and
guide treatment decisions.
Cardiovascular disease profiling:
Identifying individuals at high risk for heart disease based on factors like family history,
cholesterol levels, and blood pressure.