
DATA ANALYTICS
(Professional Elective - I)
for B.Tech (CSE)
3rd Year – 1st Sem (R18)
AY 22-23, Sem-1
Faculty: B. Ravikrishna
UNIT - 2
INTRODUCTION & TOOLS AND ENVIRONMENT

UNIT – II Syllabus
Data Analytics: Introduction to Analytics, Introduction to Tools and Environment, Application of Modeling in Business, Databases & Types of Data and Variables, Data Modeling Techniques, Missing Imputations, Need for Business Modeling.

Topics:
1. Introduction to Data Analytics
2. Data Analytics Tools and Environment
3. Need for Business Modeling
4. Data Modeling Techniques
5. Application of Modeling in Business
6. Databases & Types of Data and Variables
7. Missing Imputations
Role of Data Analytics:
 Gather Hidden Insights – Hidden insights are extracted from data and analyzed with respect to business requirements.
 Generate Reports – Reports are generated from the data and passed on to the respective teams and individuals so they can take further action to grow the business.
 Perform Market Analysis – Market analysis can be performed to understand the strengths and weaknesses of competitors.
 Improve Business Requirements – Analyzing data helps a business improve its understanding of customer requirements and the customer experience.
Ways to Use Data Analytics:
[Figure: ways to use data analytics]
Role of Data Analytics:
1. Improved Decision Making:
 Data analytics eliminates guesswork and manual tasks.
 Whether it is choosing the right content, planning marketing campaigns, or developing products, organizations can use the insights gained from data analytics to make informed decisions, leading to better outcomes and higher customer satisfaction.
Role of Data Analytics:
2. Better Customer Service:
 Data analytics allows you to tailor customer service to customers' needs.
 It also enables personalization and builds stronger relationships with customers.
 Analyzed data can reveal information about customers' interests, concerns, and more.
 It helps you make better recommendations for products and services.
3. Efficient Operations:
 With an improved understanding of what your audience wants, you can streamline your processes, save money, and boost production.
Role of Data Analytics:
4. Effective Marketing:
 Data analytics gives you valuable insight into how your campaigns are performing.
 It helps you fine-tune them for optimal outcomes.
 Additionally, you can find the potential customers who are most likely to interact with a campaign and convert into leads.
Steps Involved in Data Analytics:
1. Understand the Problem:
• Defining the organizational goals and planning a suitable solution is the first step in the analytics process.
• Examples include identifying relevant product recommendations, detecting fraud, and optimizing vehicle routing.

2. Data Collection:
• Next, you need to collect transactional business data and customer-related information from the past few years to address the problems your business is facing.
• The data can include the total units sold for a product, the sales and profit made, and when each order was placed. Past data plays a crucial role in shaping the analysis.
Steps Involved in Data Analytics:
3. Data Cleaning:
• The data you collect will often be disorderly, messy, and contain unwanted or missing values. Such data is not suitable or relevant for data analysis.
• You need to clean the data by removing unwanted, redundant, and missing values to make it ready for analysis.
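As a minimal sketch of this cleaning step, the Python snippet below removes redundant (duplicate) rows and fills missing values with the mean of the observed values, a simple form of imputation. The records and field names are invented for illustration; they are not data from the course.

```python
from statistics import mean

# Toy order records; None marks a missing "units_sold" value (hypothetical data).
records = [
    {"order_id": 1, "units_sold": 10},
    {"order_id": 2, "units_sold": None},
    {"order_id": 3, "units_sold": 14},
    {"order_id": 2, "units_sold": None},  # duplicate order -> redundant row
]

# 1. Drop redundant (duplicate) rows, keeping the first occurrence of each order.
seen, deduped = set(), []
for r in records:
    if r["order_id"] not in seen:
        seen.add(r["order_id"])
        deduped.append(r)

# 2. Impute missing values with the mean of the observed values.
observed = [r["units_sold"] for r in deduped if r["units_sold"] is not None]
fill = mean(observed)  # mean imputation; the median is a common alternative
cleaned = [
    {**r, "units_sold": r["units_sold"] if r["units_sold"] is not None else fill}
    for r in deduped
]
```

Mean imputation is only one option; dropping incomplete rows or using the median are equally common, and the right choice depends on the data.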

4. Data Exploration and Analysis:
• Applying suitable methods can reveal the impact and relationship of certain features relative to other variables.
• You can use data visualization and business intelligence tools, data mining techniques, and predictive modelling to analyze, visualize, and predict future outcomes from this data.
Steps Involved in Data Analytics:
4. Data Exploration and Analysis:
Below are the kinds of results you can get from the analysis:
 You can identify when a customer will purchase the next product.
 You can understand how long it took to deliver the product.
 You get better insight into the kind of items a customer looks for, product returns, etc.
 You will be able to predict the sales and profit for the next quarter.
 You can minimize order cancellations by dispatching only relevant products.
 You will be able to figure out the shortest route to deliver the product.
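For instance, the next-quarter sales prediction mentioned above can be sketched with a simple ordinary least-squares trend line in Python. The quarterly figures are invented for illustration; real forecasting would use more data and richer models.

```python
# Quarterly sales figures (hypothetical numbers for illustration).
sales = [100.0, 110.0, 125.0, 135.0]  # quarters 1..4

n = len(sales)
xs = list(range(1, n + 1))

# Ordinary least-squares fit of sales = a + b * quarter.
x_bar = sum(xs) / n
y_bar = sum(sales) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, sales)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

# Forecast for quarter 5 by extending the fitted line one step ahead.
next_quarter = a + b * (n + 1)
```

With these toy numbers the fitted slope is 12 units per quarter, so the line projects 147.5 for quarter 5; the point is the mechanics of fitting and extrapolating a trend, not the specific figures.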
Steps Involved in Data Analytics:
5. Interpret the Results:
• The final step is to interpret the results and validate whether the outcomes meet your expectations.
• You can find hidden patterns and future trends.
• This will help you gain insights that support appropriate data-driven decision making.
Tools in Data Analytics:
R programming –
 A leading analytics tool used for statistics and data modeling.
 R compiles and runs on various platforms such as UNIX, Windows, and macOS.
 Provides tools to automatically install packages as per user requirements.

Python –
 Python is an open-source, object-oriented programming language that is easy to read, write, and maintain.
 It provides various machine learning and visualization libraries such as Scikit-learn, TensorFlow, Matplotlib, and Seaborn.
Tools in Data Analytics:
Tableau Public:
• Free software that connects to any data source such as Excel, a corporate data warehouse, etc.
• Creates visualizations, maps, dashboards, etc. with real-time updates on the web.
QlikView:
• Offers in-memory data processing, with results delivered to end users quickly.
• Also offers data association and data visualization, with data compressed to almost 10% of its original size.
Tools in Data Analytics:
SAS:
• A programming language and environment for data manipulation and analytics; this tool is easily accessible and can analyze data from different sources.
Microsoft Excel:
• One of the most widely used tools for data analytics.
• Mostly used for clients' internal data, it summarizes data with a preview of pivot tables.
Tools in Data Analytics:
 RapidMiner:
• A powerful, integrated platform that can connect to many data source types, such as Access, Excel, Microsoft SQL, Teradata, Oracle, Sybase, etc.
• This tool is mostly used for predictive analytics, such as data mining, text analytics, and machine learning.
 KNIME:
• Konstanz Information Miner (KNIME) is an open-source data analytics platform which allows you to analyze and model data.
• Offering the benefit of visual programming, KNIME provides a graphical workbench for building analysis workflows.
Tools in Data Analytics:
 OpenRefine:
 Also known as Google Refine, this data cleaning software helps you clean up data for analysis.
 It is used for cleaning messy data, transforming data, and parsing data from websites.
 Apache Spark:
 One of the largest large-scale data processing engines, this tool executes applications in Hadoop clusters up to 100 times faster in memory and 10 times faster on disk.
 Also popular for data pipelines and machine learning model development.
Data Analytics Applications:
Data analytics is used in almost every sector of business.
1. Retail:
Data analytics helps retailers understand their customers' needs and buying habits to predict trends, recommend new products, and boost their business.
It optimizes the supply chain and retail operations at every step of the customer journey.
2. Healthcare:
Healthcare industries analyse patient data to provide lifesaving diagnoses and treatment options.
Data Analytics Applications:
3. Manufacturing: Using data analytics, manufacturing sectors can discover new cost-saving opportunities. They can solve complex supply chain issues, labour constraints, and equipment breakdowns.
4. Banking sector: Banking and financial institutions use analytics to identify probable loan defaulters and the customer churn rate. It also helps in detecting fraudulent transactions immediately.
5. Logistics: Logistics companies use data analytics to develop new business models and optimize routes. This, in turn, ensures that deliveries arrive on time in a cost-efficient manner.
Cluster Computing
 Cluster computing is a collection of tightly or loosely connected computers that work together so that they act as a single entity.
 The connected computers execute operations together, creating the impression of a single system.
 The clusters are generally connected through fast local area networks (LANs).
Cluster Computing
 A relatively inexpensive, unconventional alternative to large server or mainframe computer solutions.
 Resolves the demand for content criticality and process services in a faster way.
 IT companies are implementing cluster computing to augment their scalability, availability, processing speed, and resource management at economical prices.
 Ensures that computational power is always available.
 Provides a single general strategy for implementing and applying parallel high-performance systems independent of the underlying hardware.
Apache Spark
 Apache Spark is a lightning-fast cluster computing technology, designed for fast computation.
 It is based on Hadoop MapReduce and extends the MapReduce model to use it efficiently for more types of computations, including interactive queries and stream processing.
 The main feature of Spark is its in-memory cluster computing, which increases the processing speed of applications.
 Spark is designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries, and streaming.
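The "map" and "reduce" model that Spark extends can be illustrated with a classic word count. The sketch below is plain Python on toy data, not Spark code; in real Spark the mapped partitions would be distributed across a cluster and kept in memory between stages.

```python
from collections import Counter
from functools import reduce

# Two "partitions" of input text (toy data; on a cluster these would be distributed).
lines = ["spark extends mapreduce", "spark runs in memory"]

# Map phase: turn each line into per-word counts (like emitting (word, 1) pairs).
mapped = [Counter(line.split()) for line in lines]

# Reduce phase: merge the partial counts into one final result.
counts = reduce(lambda left, right: left + right, mapped, Counter())
```

The key idea is that the map phase is independent per partition, so it parallelizes trivially, while the reduce phase combines partial results with an associative operation.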
Apache Spark
 Apart from supporting all these workloads in a single system, Spark reduces the management burden of maintaining separate tools.
Apache Spark
Evolution of Apache Spark
• Spark is one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia.
• It was open-sourced in 2010 under a BSD license.
• It was donated to the Apache Software Foundation in 2013.
• Apache Spark became a top-level Apache project in February 2014.
Apache Spark - Features
Speed −
• Spark helps run applications in a Hadoop cluster up to 100 times faster in memory and 10 times faster when running on disk.
• This is possible by reducing the number of read/write operations to disk.
• It stores the intermediate processing data in memory and supports 'map' and 'reduce' operations.
Supports multiple languages −
• Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages.
Apache Spark - Features
Advanced Analytics −
• Spark also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.
Apache Spark – Built on Hadoop
[Diagram: three ways Spark can be built with Hadoop components.]
Apache Spark – Built on Hadoop
Big Data processing is becoming inevitable for small to large enterprises.
There are three ways of Spark deployment:
Standalone −
• Spark Standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System), and space is allocated for HDFS explicitly.
• Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
Apache Spark – Built on Hadoop
Hadoop YARN −
• Hadoop YARN deployment means, simply, that Spark runs on YARN (Yet Another Resource Negotiator) without any pre-installation or root access required.
• This helps integrate Spark into the Hadoop ecosystem or Hadoop stack.
Apache Spark – Built on Hadoop
Spark in MapReduce (SIMR) −
• Spark in MapReduce is used to launch Spark jobs in addition to standalone deployment.
• With SIMR, users can start Spark and use its shell without any administrative access.
Apache Spark – Components
[Diagram: the different components of Spark.]
Apache Spark – Components
Apache Spark Core:
• Spark Core is the underlying general execution engine for the Spark platform; all other functionality is built upon it.
• Spark primarily achieves its speed via a data model called resilient distributed datasets (RDDs), which are stored in memory while being computed upon, thus eliminating expensive intermediate disk writes.
Apache Spark – Components
Spark SQL:
• Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming:
• Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics.
• It ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations on those mini-batches of data.
Apache Spark – Components
MLlib (Machine Learning Library):
• MLlib is a distributed machine learning framework above Spark, because of the distributed memory-based Spark architecture.
• According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
Apache Spark – Components
GraphX:
• GraphX is a distributed graph-processing framework on top of Spark.
• It provides an API for expressing graph computation that can model user-defined graphs using the Pregel abstraction API.
• It also provides an optimized runtime for this abstraction.
Scala:
 Scala is a statically typed programming language that incorporates both functional and object-oriented styles, and is also suitable for imperative programming approaches.
 It is a general-purpose programming language.
 In Scala, everything is an object, whether it is a function or a number.
 It does not have the concept of primitive data.
 Scala primarily runs on the JVM platform; it can also be used to write software for native platforms using Scala Native and for JavaScript runtimes through Scala.js.
Scala:
 Scala is a "Scalable Language" used to write software for multiple platforms; hence it got the name "Scala".
Scala:
 It is designed for applications that are concurrent (parallel), distributed, and resilient (robust) message-driven.
 Scala offers many duck types (structural types).
 Unlike Java, Scala has many features of functional programming languages like Scheme, Standard ML, and Haskell, including type inference, immutability, lazy evaluation, and pattern matching.
Scala:
Where can Scala be used?
 Web applications
 Utilities and libraries
 Data streaming
 Parallel batch processing
 Concurrency and distributed applications
 Data analytics with Spark
 AWS Lambda expressions
Cloudera Impala:
 Cloudera Impala is Cloudera's open-source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop.
 It serves as a native analytic database for Hadoop clusters.
 It is shipped by vendors such as Cloudera, MapR, Oracle, and Amazon.
Cloudera Impala:
 Cloudera Impala is a query engine that runs on Apache Hadoop.
 Impala enables users to issue low-latency SQL queries against data stored in HDFS and Apache HBase without requiring data movement or transformation.
 It is integrated with Hadoop to use the same file and data formats, metadata, security, and resource management frameworks used by MapReduce, Apache Hive, Apache Pig, and other Hadoop software.
Cloudera Impala:
 The result is that large-scale data processing (via MapReduce) and interactive queries can be done on the same system using the same data and metadata, removing the need to migrate data sets into specialized systems and/or proprietary formats simply to perform analysis.
Cloudera Impala:
Features include:
 Supports HDFS and Apache HBase storage.
 Reads Hadoop file formats, including text, LZO, SequenceFile, Avro, RCFile, and Parquet.
 Supports Hadoop security (Kerberos authentication).
 Fine-grained, role-based authorization with Apache Sentry.
 Uses metadata, ODBC driver, and SQL syntax from Apache Hive.
Databases & Types of Data and Variables

Database: A database is a collection of related data.

Database Management System: A DBMS is a software system or set of programs used to define, construct, and manipulate data.

Relational Database Management System: An RDBMS is a software system used to maintain relational databases.
Many relational database systems have the option of using SQL (Structured Query Language) for querying and maintaining the database.
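The define / construct / manipulate cycle of a DBMS can be sketched with Python's built-in sqlite3 module. This is a toy in-memory relational database; the table name and values are hypothetical, chosen only to show the three operations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Define: create a relation (table) with a schema.
cur.execute("CREATE TABLE sales (product TEXT, units INTEGER)")

# Construct: load data into the relation.
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("pen", 120), ("book", 45), ("pen", 30)])

# Manipulate: query the data with SQL.
cur.execute("SELECT product, SUM(units) FROM sales "
            "GROUP BY product ORDER BY product")
rows = cur.fetchall()
conn.close()
```

The same SQL statements would work, with minor dialect differences, against any relational database system.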
End of Unit-2
