(Mba-Ft - Year-Ii) Data Analysis Group Assignment: Submitted To: Prof. Chetan Jhaveri Date of Submission: 25 July, 2019
Data Analysis
Group Assignment
Submitted by:
Group Id: 181115
Avani Jain (181115)
Arnav Singh (181114)
Krishna Sanghavi (181131)
Vinaykumar Sudani (181161)
Piyush Jalan (181435)
This is the simplest form of data analysis and deals with a single variable from a data set. Its main objective is to describe and summarize the data and to identify trends or patterns in the variable under consideration.
For this, different measures such as the central tendency of the data (i.e. mean, median, mode), maximum, minimum, quartiles, range, variance, etc. are used. Different graphical tools such as bar charts, box plots, pie charts and histograms are also used to represent or visualize the data graphically and analyze it.
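As an illustrative sketch of these descriptive measures (the sample values below are invented, not taken from the retail data, and the report's own analysis is done in R; Python is used here only for illustration):

```python
import statistics

# Hypothetical sample of item weights (illustrative only).
weights = [9.3, 5.92, 17.5, 19.2, 8.93, 10.395, 13.65, 12.858]

mean = statistics.mean(weights)
median = statistics.median(weights)
q1, q2, q3 = statistics.quantiles(weights, n=4)  # quartile cut points
value_range = max(weights) - min(weights)
variance = statistics.variance(weights)          # sample variance

print(f"mean={mean:.3f} median={median:.4f} range={value_range:.2f}")
```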
The retail data has a total of 8523 observations with 12 different variables. Of these 12 variables, Item_Outlet_Sales is the dependent variable, showing the sales volume, while all the remaining variables are independent. These variables include both qualitative and quantitative data.
Quantitative Variable: These are variables with numeric values that can be measured and quantified. In the retail data, Item_Weight, Item_Visibility and Item_MRP are the quantitative variables.
Qualitative Variable: These are variables whose values are names or labels. They are also termed Categorical Variables. In the retail data, Item_Fat_Content, Item_Type, Outlet_Identifier, etc. are categorical or qualitative variables.
Now, we will do the univariate analysis of all the variables in these two groups.
1.1.1 Univariate Analysis of Quantitative Data
A. Item_Weight
Mean 12.858
Maximum 21.350
NA’s (Missing Values) 1463
From this table we can see that a total of 1463 data points are missing for this variable.
Also, the above box plot indicates that there are no outliers for this variable.
B. Item_Visibility
Mean 0.06613
Minimum 0.00000
Maximum 0.32839
NA’s (Missing Values) 0
The above box plot shows that many outliers are present in this variable. There are a total of 144 data points which are outliers. The lower and upper limits of the box plot are 0.000000 and 0.195721 respectively.
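These limits are the box-plot whiskers: under the usual convention (also R's boxplot default), points beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are flagged as outliers, and the whiskers stop at the most extreme points inside those fences. A minimal sketch with invented values:

```python
import statistics

# Hypothetical visibility values; the real limits quoted above are
# 0.000000 and 0.195721, these numbers are invented for illustration.
vis = [0.0, 0.016, 0.019, 0.041, 0.054, 0.066, 0.07, 0.08, 0.09, 0.33]

q1, _, q3 = statistics.quantiles(vis, n=4)
iqr = q3 - q1
fence_lo, fence_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey fences

# Whiskers stop at the most extreme points still inside the fences;
# everything beyond them is flagged as an outlier.
lower_whisker = min(v for v in vis if v >= fence_lo)
upper_whisker = max(v for v in vis if v <= fence_hi)
outliers = [v for v in vis if v < fence_lo or v > fence_hi]
```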
C. Item_MRP
Mean 140.99
Minimum 31.29
Maximum 266.89
The above box plot indicates that there are no outliers in this variable.
1.1.2 Univariate Analysis of Qualitative Data
The Item_Fat_Content column has some inconsistent data, i.e., not all the entries follow the same standard. This will create a hindrance at the time of analysis because R will (by default) define 4 factor levels when the column should contain only 2.
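The fix is a simple recoding of the variant spellings onto the two intended levels. A sketch in plain Python (the report's own work is in R; the variant labels below follow the description in the text):

```python
# Variant spellings as described above: four distinct labels for two categories.
raw = ["Low Fat", "LF", "Regular", "reg", "LF"]

# Map every variant onto its intended category.
mapping = {"LF": "Low Fat", "reg": "Regular"}
clean = [mapping.get(v, v) for v in raw]

print(sorted(set(raw)), "->", sorted(set(clean)))
```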
The database doesn’t contain any information about the outlet size of 2410 stores that opened in Tier 2 and Tier 3 cities. From our preliminary analysis, we found that 555 of these stores opened in Tier 3 cities in 1998, and about 929 (2002) and 926 (2007) opened in Tier 2 cities.
1. We can divide the stores on the basis of the existing ratio of outlet sizes (for Tier 3).
2. We can divide the number of stores in Tier 2 cities equally between High and Medium outlet sizes, or we can consider all of them to be Small, because only Small stores exist in Tier 2 according to the existing data.
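Option 1 (filling by the existing ratio) could be sketched as below. Only the count of 555 missing Tier 3 stores comes from the text; the known size counts used here are hypothetical:

```python
import random

random.seed(0)  # reproducible draw

# Hypothetical Outlet_Size counts among Tier 3 stores with a known size.
known = {"Small": 300, "Medium": 500, "High": 200}
labels = list(known)
weights = list(known.values())

# Fill the 555 missing Tier 3 entries in proportion to the existing ratio.
missing = 555
imputed = random.choices(labels, weights=weights, k=missing)

print({lab: imputed.count(lab) for lab in labels})
```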
This kind of data replacement or filling can create biases; to be more accurate we would have to analyse multiple variables together.
All other categorical data are consistent and don’t contain any missing values.
2. Data Cleaning
Data Cleaning is the process of identifying and removing or correcting inaccurate or incomplete records from a dataset. It refers to recognizing unfinished, unreliable, inaccurate or irrelevant parts of the data and then restoring, remodeling or removing the dirty or crude data.
Item_Weight: Cleaning is required here, as a few entries are missing in this column. If this is not corrected, the analysis will lead to faulty results: the missing data will affect the calculation of the mean, median, etc., which are necessary for the analysis of this variable.
Item_Fat_Content: In this column, some values are stored in different ways, e.g. Low Fat is written as LF in some places and Regular is stored as reg in others. Since the same category is written in different styles, it will be treated as two different categories during the analysis, which will give misleading results. Thus, it becomes important to treat the data to ensure that all the categories are written consistently.
Outlet_Size: Some of the entries in this column are missing. The analysis of this column is necessary, as it will show whether the outlet size is related to the sales and will help to find the relationship between store size and sales.
Item_Visibility: No data points are missing for this variable, but the column contains 144 outliers which can distort the results of the analysis. So cleaning of this variable is essential.
Item_MRP: This variable has neither missing values nor outliers, so it does not need any kind of data cleaning.
2.2 Treatments for Data Cleaning
There are many different treatments available to clean the data, depending on the nature of the variable and the type of data. A few of them are explained below.
a) Missing Value Treatment: There are multiple ways to treat missing values. A few of them are given below.
If the number of observations with missing values is so small that removing them completely does not affect the overall sample size, then we can simply drop those entries. This will not majorly impact the output or the model which we are going to build.
If the percentage of missing values is very high for a particular variable, say 80% of the data is missing, then we can’t drop all these entries. In that case we can avoid using that variable for model building, if possible.
If we are not able to remove the variable completely as mentioned above, then we can replace all missing data points with the mean of the data (calculated from the available data points).
However, in some cases, such as skewed data or data with outliers, the mean is not a suitable replacement, so we have to replace the missing values with the median or mode instead.
In cases where none of the above methods is applicable, we can take random values from the remaining available data and use them to fill the empty entries.
If the data is categorical or qualitative, then we have to treat it differently. The following steps can be taken:
Check which value occurs most often and replace the missing points with that value.
Look at other categorical variables and check in which category the missing entries occur most frequently; then we can replace them with those values.
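The numeric and categorical replacements described above can be sketched as follows (the column contents are invented for illustration):

```python
import statistics

# Hypothetical Item_Weight column with missing entries marked as None.
weights = [9.3, None, 17.5, 19.2, None, 10.395]

observed = [w for w in weights if w is not None]
mean_w = statistics.mean(observed)
# Mean imputation: replace every missing value with the observed mean.
weights_filled = [w if w is not None else mean_w for w in weights]

# Hypothetical categorical column: replace missing labels with the mode.
sizes = ["Small", "Medium", None, "Medium", None]
mode_s = statistics.mode([s for s in sizes if s is not None])
sizes_filled = [s if s is not None else mode_s for s in sizes]
```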
b) Spelling Error Treatment: In the retail data set, there are spelling inconsistencies in some columns. Because of these we will not get correct results, so we have to treat them first by replacing the incorrectly spelt cells with the correct spellings. Columns like Item_Fat_Content have such inconsistencies; we should fix them before proceeding.
c) Outlier Treatment: If we consider these values while doing the analysis, we get incorrect predictions, because outlier values affect the entire output: the mean, median, mode, quartiles and range all change. So, to get correct predictions, we should remove these outliers.
If removal of the outliers is not possible, then another way is to cap these values within the µ ± 3σ limits, e.g. in the retail data set we have outliers in the Item_Visibility variable.
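Capping at the µ ± 3σ limits (winsorising to the three-sigma band) can be sketched as follows, again with invented visibility values:

```python
import statistics

# Hypothetical Item_Visibility values with one extreme point.
vis = [0.05] * 10 + [0.07] * 10 + [0.60]

mu = statistics.mean(vis)
sigma = statistics.pstdev(vis)  # population standard deviation
lo, hi = mu - 3 * sigma, mu + 3 * sigma

# Pull every value outside the mu +/- 3*sigma band back to the nearest limit.
capped = [min(max(v, lo), hi) for v in vis]
```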