
DATA ANALYTICS

By
Dr. G. Naga Satish
Professor
Department of Computer Science and Engineering

BVRIT HYDERABAD College of Engineering for Women


Name of the Course: Data Analytics
Course Code: CS853PE
Year & Semester: B.Tech IV Year II Sem
Section: CSE – B
Name of the Faculty: Dr. G. Naga Satish
Lecture Hour | Date:
Name of the Topic:
Course Outcome(s):



Data Analytics: Introduction to Analytics, Introduction to Tools and Environment, Application of Modelling in Business, Databases & Types of Data and Variables, Data Modelling Techniques, Missing Imputations, etc., Need for Business Modelling



Introduction to Analytics
Analytics is the science of analysis: it applies methods from statistics, data mining, and computer technology to carry out analysis. Analysis itself is the process of breaking complex data down into simpler forms, yielding more compact data that is easier to understand.



There are four major stages in any analytics work:
Stage 1: Descriptive Analytics
Data or information is gathered and summarized. This stage usually caters to questions like “How many students dropped out last year?”

Stage 2: Diagnostic Analytics


Data is analysed and insights are generated that help answer the question raised in the first stage. Here the question becomes “Why has the dropout rate increased in the last year?”



Stage 3: Predictive analytics
With the help of the analyses done in the previous two stages, this stage tries to anticipate unforeseen phenomena, answering questions like “Which students are most likely to drop out?”

Stage 4: Prescriptive Analytics


Finally, the last stage analyses the type of action required to support or avoid the unforeseen phenomena predicted in the previous stage. In this scenario, prescriptive analytics asks queries like “Which students should I target to keep from dropping out?”



Analytics is a journey that combines potential skills, advanced technologies, applications, and processes used by firms to gain business insights from data and statistics. This is done to support business planning.

Reporting Vs Analytics:
Reporting is the presentation of the results of data analysis, whereas analytics is the process or system involved in analysing data to obtain a desired output.



Introduction to Tools and Environment
Commonly used tools for analytics include:
1. Excel
2. BI tools
3. R & Python



Application of Modeling in Business
A statistical model embodies a set of assumptions concerning the generation of the observed
data, and similar data from a larger population.
A model represents, often in considerably idealized form, the data-generating process. 
Signal processing is an enabling technology that encompasses the fundamental theory,
applications, algorithms, and implementations of processing or transferring information
contained in many different physical, symbolic, or abstract formats broadly designated as
signals.
It uses mathematical, statistical, computational, heuristic, and linguistic representations,
formalisms, and techniques for representation, modeling, analysis, synthesis, discovery,
recovery, sensing, acquisition, extraction, learning, security, or forensics.
In manufacturing, statistical models are used to define warranty policies, solve various conveyor-related issues, implement Statistical Process Control, etc.



Databases & Types of Data and variables

Data can be categorized along several parameters, such as category and type.


Data is of two types – numeric and character. Numeric data can be further divided into the subgroups discrete and continuous.
Data can also be divided into two categories – nominal and ordinal.
Based on usage, data is divided into two categories – quantitative and qualitative.



Types of Data
•Structured Data - data that is processed, stored, and retrieved in a fixed format.
Example: Employee details, job positions, and salaries
•Unstructured Data - data that lacks any specific form or structure.
Example: Email
•Semi-Structured Data - data that does not conform to a fixed tabular structure but contains tags or markers that separate and label its elements.
Example: JSON or XML documents



Qualitative Data
Data in which classification of objects is based on attributes and properties.
Example: Softness of skin etc.
Nominal Data
Unordered data assigned to named categories, with no ranking between the categories.
Example: Grade classification like pass or fail for a student's test results.
Ordinal data
Ordered data that is assigned to categories in a ranked fashion
Example: Feedback to a product with 1–5 ranking



Quantitative Data
Data that can be measured and expressed numerically.
Example: Your height and shoe size.
Discrete Data
It can only take certain values.
Example: The number of students in a class
Continuous Data
It can take any value within a specified range
Example: Share price of a company



Qualitative                                                | Quantitative
Data collection is unstructured.                           | Data collection is structured.
It asks WHY.                                               | It is all about HOW MUCH or HOW MANY.
It cannot be computed, as it is non-statistical.           | It is statistical and is about numbers.
It develops initial understanding and defines the problem. | It recommends the final course of action.



What is Data Modeling?
Data Modeling is a set of tools and techniques used to understand and analyse how
an organization should collect, update, and store data. It is a critical skill for the
business analyst who is involved with discovering, analysing, and specifying changes
to how software systems create and maintain information.
Data modeling is the process of creating a data model for the data to be stored in a
Database. This data model is a conceptual representation of Data objects, the
associations between different data objects and the rules. Data modelling helps in the
visual representation of data and enforces business rules, regulatory compliances, and
government policies on the data. Data models ensure consistency in naming conventions, default values, semantics, and security, while ensuring the quality of the data.



What does a Data Modeler do?
They create an entity relationship diagram to visualize relationships between key
business concepts.
They create a conceptual-level data dictionary to communicate data requirements that
are important to business stakeholders.
They create a data map to resolve potential data issues for a data migration or integration
project.
A data modeler would not necessarily query or manipulate data, or become involved in designing or implementing databases or data repositories.



Why use Data Model?
The primary goals of using a data model are:
•Ensures that all data objects required by the database are accurately represented. Omission
of data will lead to creation of faulty reports and produce incorrect results.
•A data model helps design the database at the conceptual, physical and logical levels.
•Data Model structure helps to define the relational tables, primary and foreign keys and
stored procedures.
•It provides a clear picture of the base data and can be used by database developers to create
a physical database.
•It is also helpful to identify missing and redundant data.



There are mainly three different types of data models:
1.Conceptual: This Data Model defines WHAT the system contains. This model is
typically created by Business stakeholders and Data Architects. The purpose is to
organize, scope and define business concepts and rules.
2.Logical: Defines HOW the system should be implemented regardless of the DBMS. This model is typically created by Data Architects and Business Analysts. The purpose is to develop a technical map of rules and data structures.
3.Physical: This Data Model describes HOW the system will be implemented using a
specific DBMS system. This model is typically created by DBA and developers. The
purpose is actual implementation of the database.



Conceptual Model
The main aim of this model is to establish the entities, their attributes, and their relationships.
In this Data modelling level, there is hardly any detail available of the actual Database
structure.
The three basic tenets of the Data Model are
Entity: A real-world thing
Attribute: Characteristics or properties of an entity
Relationship: Dependency or association between two entities



Characteristics of a conceptual data model
•Offers organization-wide coverage of the business concepts.
•This type of data model is designed and developed for a business audience.
•The conceptual model is developed independently of hardware specifications (such as data storage capacity or location) and software specifications (such as DBMS vendor and technology). The focus is to represent data as a user will see it in the "real world."
Conceptual data models, known as Domain models, create a common vocabulary for all stakeholders by establishing basic concepts and scope.



Logical Data Model
Logical data models add further information to the conceptual model elements. They define the structure of the data elements and set the relationships between them.



The advantage of the logical data model is that it provides a foundation for the physical model; however, the modeling structure remains generic.
At this data modeling level, no primary or secondary key is defined, and you need to verify and adjust the connector details that were set earlier for relationships.
Characteristics of a Logical data model
•Describes data needs for a single project but could integrate with other logical data models
based on the scope of the project.
•Designed and developed independently from the DBMS.
•Data attributes will have datatypes with exact precisions and length.
•Normalization processes are typically applied to the model up to 3NF.



Physical Data Model
A Physical Data Model describes the database-specific implementation of the data model. It offers an abstraction of the database and helps generate the schema, thanks to the richness of metadata a Physical Data Model offers.



Characteristics of a physical data model:
•The physical data model describes data needs for a single project or application, though it may be integrated with other physical data models based on project scope.
•The model contains relationships between tables that address cardinality and nullability of the relationships.
•Developed for a specific version of a DBMS, location, data storage, or technology to be used in the project.
•Columns should have exact datatypes, assigned lengths, and default values.
•Primary and foreign keys, views, indexes, access profiles, authorizations, etc. are defined.
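As an illustrative sketch of these characteristics, the DDL below pins down what the logical model leaves generic: exact datatypes and lengths, a default value, primary and foreign keys, and an index. It uses SQLite from Python's standard library; the table and column names are invented for illustration.

```python
import sqlite3

# A physical model for one specific DBMS (SQLite here).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (
        dept_id   INTEGER PRIMARY KEY,            -- primary key
        dept_name VARCHAR(50) NOT NULL            -- exact type and length
    );
    CREATE TABLE employee (
        emp_id   INTEGER PRIMARY KEY,
        emp_name VARCHAR(100) NOT NULL,
        salary   DECIMAL(10, 2) DEFAULT 0.00,     -- default value
        dept_id  INTEGER NOT NULL,
        FOREIGN KEY (dept_id) REFERENCES department(dept_id)
    );
    CREATE INDEX idx_employee_dept ON employee(dept_id);  -- access path
""")
conn.execute("INSERT INTO department VALUES (1, 'CSE')")
conn.execute(
    "INSERT INTO employee (emp_id, emp_name, dept_id) VALUES (1, 'Asha', 1)")
salary = conn.execute("SELECT salary FROM employee").fetchone()[0]
print(salary)  # the DEFAULT was applied because salary was omitted
```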



ADVANTAGES AND DISADVANTAGES OF DATA MODEL:
Advantages of a Data Model:
•The main goal of designing a data model is to make certain that the data objects offered by the functional team are represented accurately.
•The data model should be detailed enough to be used for building the physical database.
•The information in the data model can be used for defining the relationships between tables, primary and foreign keys, and stored procedures.
•A data model helps the business communicate within and across organizations.
•A data model helps document data mappings in the ETL process.
•It helps to recognize the correct sources of data to populate the model.



Disadvantages of a Data Model:
•To develop a data model, one should know the characteristics of the physically stored data.
•As a navigational system, it produces complex application development and management; it therefore requires detailed knowledge of the system.
•Even a small change in structure requires modification of the entire application.
•There is no set data manipulation language in the DBMS.





1. Entity Relationship Diagrams
Also referred to as ER diagrams or ERDs, Entity-Relationship modeling is the default technique for modeling and designing relational (traditional) databases. In this notation the architect identifies:
•Entities, representing objects (or tables in a relational database),
•Attributes of entities, including data type,
•Relationships between entities/objects (or foreign keys in a database).
ERDs work well for designing a relational (classic) database, Excel databases, or CSV files. They suit any kind of tabular data and work well for visualizing database schemas and communicating a top-level view of the data.
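The entity/attribute/relationship triple maps directly onto relational DDL: entities become tables, attributes become typed columns, and relationships become foreign keys. A minimal sketch using SQLite from Python's standard library (the student/school schema is invented for illustration):

```python
import sqlite3

# Two entities (school, student) with attributes and one relationship.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE school (
        school_id INTEGER PRIMARY KEY,            -- entity: school
        name      TEXT NOT NULL                   -- attribute
    );
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,           -- entity: student
        name       TEXT NOT NULL,
        gpa        REAL,
        school_id  INTEGER REFERENCES school(school_id)  -- relationship
    );
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['school', 'student']
```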



2. UML Class Diagrams
UML (Unified Modeling Language) is a standardized family of notations for modeling and designing information systems. It was derived from various existing notations to provide a standard for software engineering. It comprises several different diagrams representing different aspects of a system, one of which, the Class Diagram, can be used for data modeling. Class diagrams are the equivalent of ERDs in the relational world and are mostly used to design classes in object-oriented programming languages (such as Java or C#).
In class diagrams architects define:
•Classes (the equivalent of entities in the relational world),
•Attributes of a class (as in an ERD), including data type,
•Methods associated with a specific class, representing its behavior (in the relational world these would be stored procedures),
•Relationships, grouped into two categories:
•Relationships between objects (instances of classes), differentiated into Dependency, Association, Aggregation, and Composition (equivalent to relationships in an ERD),



•Relationships between classes, of two kinds: Generalization/Inheritance and Realization/Implementation (these have no equivalent in the relational world).
You can use class diagrams to design tabular data (such as in an RDBMS), but they were designed for, and are mostly used with, object-oriented programs (such as in Java or C#).



3. Data Dictionary
Data dictionaries are a tabular definition/representation of data assets. A data dictionary is an inventory of data sets/tables with the list of their attributes/columns.
Core data dictionary elements:
•List of data sets/tables,
•List of attributes/columns of each table, with data type.
Optional data dictionary elements:
•Item descriptions,
•Relationships between tables/columns,
•Additional constraints, such as uniqueness, default values, value constraints, or calculated columns.
A data dictionary is suitable as a detailed specification of data assets and can be supplemented with ER diagrams, as the two serve slightly different purposes.
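A small sketch of how the core elements of such a dictionary (tables, columns, types) might be generated from sample records. The table and column names are invented; real tools would read types from the DBMS catalog instead of inferring them from Python values.

```python
def build_data_dictionary(tables):
    """Build an inventory of tables with their columns and inferred types.
    tables: mapping of table name -> list of row dicts."""
    dictionary = {}
    for table, rows in tables.items():
        columns = {}
        for row in rows:
            for column, value in row.items():
                # Record each column once, with the type of its first value.
                columns.setdefault(column, type(value).__name__)
        dictionary[table] = columns
    return dictionary

sample = {
    "employee": [{"emp_id": 1, "name": "Asha", "salary": 52000.0}],
    "department": [{"dept_id": 10, "dept_name": "CSE"}],
}
print(build_data_dictionary(sample))
```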



MISSING IMPUTATIONS:
In statistics, imputation is the process of replacing missing data with substituted
values. When substituting for a data point, it is known as "unit imputation"; when
substituting for a component of a data point, it is known as "item imputation". There
are three main problems that missing data causes: missing data can introduce a
substantial amount of bias, make the handling and analysis of the data more
arduous, and create reductions in efficiency. Because missing data can create
problems for analyzing data, imputation is seen as a way to avoid the pitfalls involved with listwise deletion of cases that have missing values. That is to say, when one or
more values are missing for a case, most statistical packages default to discarding
any case that has a missing value, which may introduce bias or affect the
representativeness of the results.



Missing data is a common problem in practical data analysis: missing values are simply observations that we intended to make but did not. In datasets, missing values may be represented as ‘?’, ‘nan’, ’N/A’, a blank cell, or sometimes ‘-999’, ’inf’, or ‘-inf’. The following provides an introduction to missing data and describes some basic methods for handling it.
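A minimal sketch of normalizing such sentinel representations to None before analysis. The sentinel set below is an assumption; adjust it to whatever your data source actually uses.

```python
# Common missing-value sentinels (an assumed set; extend as needed).
SENTINELS = {"?", "nan", "NaN", "N/A", "", "-999", "inf", "-inf"}

def normalize_missing(records):
    """Replace sentinel values in a list of row dicts with None."""
    cleaned = []
    for row in records:
        cleaned.append({
            key: (None if str(value).strip() in SENTINELS else value)
            for key, value in row.items()
        })
    return cleaned

rows = [
    {"age": 34, "income": "?"},
    {"age": "N/A", "income": 52000},
    {"age": -999, "income": 48000},
]
print(normalize_missing(rows))
```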



Imputation simply means that we replace the missing values with guessed/estimated ones.
Mean, median, mode imputation
A simple guess for a missing value is the mean, median, or mode (most frequently appearing value) of that variable.
Regression imputation
Mean, median, or mode imputation only looks at the distribution of values of the variable with missing entries. If we know there is a correlation between the missing value and other variables, we can often get better guesses by regressing the missing variable on the other variables.
K-nearest neighbour (KNN) imputation
Besides model-based imputation like regression imputation, neighbour-based imputation can also be used. K-nearest neighbour (KNN) imputation is an example of neighbour-based imputation. For a discrete variable, a KNN imputer uses the most frequent value among the k nearest neighbours; for a continuous variable, it uses their mean (or median).
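Toy sketches of mean imputation and KNN imputation in pure Python, for illustration only; production code would typically use a library such as scikit-learn's SimpleImputer or KNNImputer. The small numeric examples are invented.

```python
import math
import statistics

def mean_impute(values):
    """Replace None entries with the mean of the observed values."""
    fill = statistics.mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def knn_impute(rows, target_index, k=2):
    """Toy KNN imputation for fully numeric rows (tuples): fill the
    target column of each incomplete row with the mean of that column
    over the k rows nearest in the remaining columns."""
    complete = [r for r in rows if r[target_index] is not None]
    imputed = []
    for row in rows:
        if row[target_index] is None:
            # Euclidean distance over the non-target columns.
            def dist(other):
                return math.sqrt(sum(
                    (a - b) ** 2
                    for i, (a, b) in enumerate(zip(row, other))
                    if i != target_index))
            neighbours = sorted(complete, key=dist)[:k]
            fill = statistics.mean(n[target_index] for n in neighbours)
            row = row[:target_index] + (fill,) + row[target_index + 1:]
        imputed.append(row)
    return imputed

print(mean_impute([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
print(knn_impute([(1.0, 10.0), (2.0, 20.0), (9.0, None)], target_index=1))
```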



Depending on the nature of the data or the data type, other imputation methods may be more appropriate.
For example, for longitudinal data, such as patients’ weights over a series of visits, it might make sense to use the last valid observation to fill in the NAs. This is known as Last Observation Carried Forward (LOCF).
In other cases, for instance when dealing with time-series data, it might make sense to interpolate between the observed values before and after a timestamp to fill in missing values.



Multiple imputation
Mean/median/mode imputation, regression imputation, stochastic regression imputation, and KNN imputation are all methods that create a single replacement value for each missing entry. Multiple Imputation (MI), rather than being another such method, is a general approach/framework in which the imputation procedure is carried out multiple times to create several plausible imputed datasets. The key motivation for MI is that a single imputation cannot reflect the sampling variability arising from both the sampled data and the missing values.
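A sketch of the MI framework, drawing each missing entry from a normal distribution fitted to the observed values. This is a deliberate simplification: real MI procedures (e.g. MICE) model each variable conditionally on the others, and the toy data here is invented.

```python
import random
import statistics

def multiple_impute(values, m=5, seed=0):
    """Create m plausible completed datasets by drawing each missing
    entry from a normal distribution fitted to the observed values."""
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [
        [rng.gauss(mu, sigma) if v is None else v for v in values]
        for _ in range(m)
    ]

completed = multiple_impute([4.9, 5.2, None, 5.0, None], m=5)
# Analyse each completed dataset, then pool the estimates.
pooled = statistics.mean(statistics.mean(d) for d in completed)
print(pooled)
```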



Methods to Handle Missing Data
Data can be missing in the following ways:
•Missing Completely At Random (MCAR): When missing values are randomly distributed
across all observations, then we consider the data to be missing completely at random. A quick check is to compare two parts of the data – one with missing observations and the other without. If a t-test finds no difference in means between the two samples, we can assume the data to be MCAR.
•Missing At Random (MAR): The key difference between MCAR and MAR is that under
MAR the data is not missing randomly across all observations, but is missing randomly only
within sub-samples of data. For example, if high school GPA data is missing randomly across
all schools in a district, that data will be considered MCAR. However, if data is randomly
missing for students in specific schools of the district, then the data is MAR.
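The MCAR check described above can be sketched with Welch's two-sample t statistic. The GPA figures are invented, and a real analysis would also compute the p-value (e.g. with scipy.stats.ttest_ind); here a statistic near 0 is consistent with MCAR.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic (unequal variances)."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = statistics.variance(sample_a), statistics.variance(sample_b)
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) \
        / math.sqrt(va / na + vb / nb)

# GPA split by whether another field (say, income) is missing.
gpa_missing_income = [3.1, 3.4, 3.0, 3.3]
gpa_observed_income = [3.2, 3.3, 3.1, 3.2]
t = welch_t(gpa_missing_income, gpa_observed_income)
print(round(t, 3))  # near 0: no evidence against MCAR in this toy data
```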



•Not Missing At Random (NMAR): When the missing data has a structure to it, we
cannot treat it as missing at random. In the above example, if the data was missing for
all students from specific schools, then the data cannot be treated as MAR.



Thank you
