(Mba-Ft - Year-Ii) Data Analysis Group Assignment: Submitted To: Prof. Chetan Jhaveri Date of Submission: 25 July, 2019
Data Analysis
Group Assignment
Submitted by:
Group Id: 181115
Avani Jain (181115)
Arnav Singh (181114)
Krishna Sanghavi (181131)
Vinaykumar Sudani (181161)
Piyush Jalan (181435)
This is the simplest form of data analysis and deals with a single variable from a data set. Its main objective is to describe and summarize the data and to identify trends or patterns in the variable under consideration.
For this, different measures such as the central tendency of the data (i.e. mean, median, mode), maximum, minimum, quartiles, range, variance, etc. are used. Different graphical tools such as bar charts, box plots, pie charts and histograms are also used to represent or visualize the data graphically and analyze it.
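As an illustrative sketch of these descriptive measures (the sample values below are invented, not taken from the retail data, and the report's own analysis is done in R; Python is used here only for illustration):

```python
import statistics

# Hypothetical sample of item weights (illustrative only).
weights = [9.3, 5.92, 17.5, 19.2, 8.93, 10.395, 13.65, 12.858]

mean = statistics.mean(weights)
median = statistics.median(weights)
q1, q2, q3 = statistics.quantiles(weights, n=4)  # quartile cut points
value_range = max(weights) - min(weights)
variance = statistics.variance(weights)          # sample variance

print(f"mean={mean:.3f} median={median:.4f} range={value_range:.2f}")
```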
The retail data has a total of 8523 observations with 12 different variables. Of these 12 variables, Item_Outlet_Sales is the dependent variable, showing the sales volume, while all the remaining variables are independent. These variables include both qualitative and quantitative data.
Quantitative Variable: These are variables with numeric values that can be measured and quantified. In the retail data, Item_Weight, Item_Visibility and Item_MRP are the quantitative variables.
Qualitative Variable: These are variables whose values are names or labels. They are also termed Categorical Variables. In the retail data, Item_Fat_Content, Item_Type, Outlet_Identifier, etc. are categorical or qualitative variables.
Now, we will do the univariate analysis of all the variables in these two groups.
1.1.1 Univariate Analysis of Quantitative Data
A. Item_Weight
Mean 12.858
Maximum 21.350
NA’s (Missing Values) 1463
From this table we can see that a total of 1463 data points are missing for this variable.
Also, the above box plot indicates that there are no outliers for this variable.
B. Item_Visibility
Mean 0.06613
Minimum 0.00000
Maximum 0.32839
NA’s (Missing Values) 0
The above box plot shows that many outliers are present in this variable. There are a total of 144 data points which are outliers. The lower and upper limits of the box plot are 0.000000 and 0.195721 respectively.
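These limits are the box-plot whiskers: under the usual convention (also R's boxplot default), points beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are flagged as outliers, and the whiskers stop at the most extreme points inside those fences. A minimal sketch with invented values:

```python
import statistics

# Hypothetical visibility values; the real limits quoted above are
# 0.000000 and 0.195721, these numbers are invented for illustration.
vis = [0.0, 0.016, 0.019, 0.041, 0.054, 0.066, 0.07, 0.08, 0.09, 0.33]

q1, _, q3 = statistics.quantiles(vis, n=4)
iqr = q3 - q1
fence_lo, fence_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey fences

# Whiskers stop at the most extreme points still inside the fences;
# everything beyond them is flagged as an outlier.
lower_whisker = min(v for v in vis if v >= fence_lo)
upper_whisker = max(v for v in vis if v <= fence_hi)
outliers = [v for v in vis if v < fence_lo or v > fence_hi]
```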
C. Item_MRP
Mean 140.99
Minimum 31.29
Maximum 266.89
The above box plot indicates that there are no outliers in this variable.
1.1.2 Univariate Analysis of Qualitative Data
The Item_Fat_Content column has some inconsistent data, i.e., not all the entries follow the same standard. This will create a hindrance at the time of analysis because R will (by default) define 4 factor levels when the column should contain only 2.
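The fix is a simple recoding of the variant spellings onto the two intended levels. A sketch in plain Python (the report's own work is in R; the variant labels below follow the description in the text):

```python
# Variant spellings as described above: four distinct labels for two categories.
raw = ["Low Fat", "LF", "Regular", "reg", "LF"]

# Map every variant onto its intended category.
mapping = {"LF": "Low Fat", "reg": "Regular"}
clean = [mapping.get(v, v) for v in raw]

print(sorted(set(raw)), "->", sorted(set(clean)))
```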
The database doesn’t contain any information about the outlet size of 2410 stores that opened in Tier 2 and Tier 3 cities. From our preliminary analysis, we found that 555 of these stores opened in Tier 3 cities in 1998, and about 929 (2002) and 926 (2007) opened in Tier 2 cities.
1. We can divide the stores on the basis of the existing ratio of outlet sizes (for Tier 3).
2. We can divide the number of stores in Tier 2 cities equally between High and Medium outlet sizes, or we can consider all of them to be Small, because only Small stores exist in Tier 2 according to the existing data.
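Option 1 (filling by the existing ratio) could be sketched as below. Only the count of 555 missing Tier 3 stores comes from the text; the known size counts used here are hypothetical:

```python
import random

random.seed(0)  # reproducible draw

# Hypothetical Outlet_Size counts among Tier 3 stores with a known size.
known = {"Small": 300, "Medium": 500, "High": 200}
labels = list(known)
weights = list(known.values())

# Fill the 555 missing Tier 3 entries in proportion to the existing ratio.
missing = 555
imputed = random.choices(labels, weights=weights, k=missing)

print({lab: imputed.count(lab) for lab in labels})
```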
This kind of data replacement or filling can create biases; to be more accurate we would have to analyse multiple variables together.
All other categorical data are consistent and don’t contain any missing values.
2. Data Cleaning
Data Cleaning is the process of identifying and removing or correcting inaccurate or incomplete records from a dataset. It refers to recognizing unfinished, unreliable, inaccurate or irrelevant parts of the data and then restoring, remodeling or removing the dirty or crude data.
Item_Weight: Cleaning is required here, as a few entries are missing in this column. If this is not corrected, the analysis will lead to faulty results: the missing data will affect the calculation of the mean, median, etc., which are necessary for the analysis of this variable.
Item_Fat_Content: In this column, some values are stored in different ways, e.g. Low Fat is written as LF in some places and Regular is stored as reg in others. Since the same category is written in different styles, it will be treated as two different categories during the analysis, which will give misleading results. Thus, it becomes important to treat the data to ensure that all the categories are written consistently.
Outlet_Size: Some of the entries in this column are missing. The analysis of this column is necessary, as it will show whether the outlet size is related to the sales and will help to find the relationship between store size and sales.
Item_Visibility: No data points are missing for this variable, but the column contains 144 outliers which can distort the results of the analysis. So cleaning of this variable is essential.
Item_MRP: This variable has neither missing values nor outliers, so it does not need any kind of data cleaning.
2.2 Treatments for Data Cleaning
There are many different treatments available to clean the data, depending on the nature of the variable and the type of data. A few of them are explained below.
a) Missing Value Treatment: There are multiple ways to treat missing values. A few of them are given below.
If the number of observations with missing values is so small that removing them completely does not affect the overall sample size, then we can simply drop those entries. This will not majorly impact the output or the model which we are going to build.
If the percentage of missing values is very high for a particular variable, say 80% of the data is missing, then we can’t drop all these entries. In that case we can avoid using that variable for model building, if possible.
If we are not able to remove the variable completely as mentioned above, then we can replace all missing data points with the mean of the data (calculated from the available data points).
However, in some cases, such as skewed data or data with outliers, the mean is not a suitable replacement, so we have to replace the missing values with the median or mode instead.
In cases where none of the above methods is applicable, we can take random values from the remaining available data and use them to fill the empty entries.
If the data is categorical or qualitative, then we have to treat it differently. The following steps can be taken:
Check which value occurs most often and replace the missing points with that value.
Look at other categorical variables and check in which category the missing entries occur most frequently; then we can replace them with those values.
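The numeric and categorical replacements described above can be sketched as follows (the column contents are invented for illustration):

```python
import statistics

# Hypothetical Item_Weight column with missing entries marked as None.
weights = [9.3, None, 17.5, 19.2, None, 10.395]

observed = [w for w in weights if w is not None]
mean_w = statistics.mean(observed)
# Mean imputation: replace every missing value with the observed mean.
weights_filled = [w if w is not None else mean_w for w in weights]

# Hypothetical categorical column: replace missing labels with the mode.
sizes = ["Small", "Medium", None, "Medium", None]
mode_s = statistics.mode([s for s in sizes if s is not None])
sizes_filled = [s if s is not None else mode_s for s in sizes]
```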
b) Spelling Error Treatment: In the retail data set, there are spelling inconsistencies in some columns. Because of these we will not get correct results, so we have to treat them first by replacing the incorrectly spelt cells with the correct spellings. Columns like Item_Fat_Content have such inconsistencies; we should fix them before proceeding.
c) Outlier Treatment: If we consider these values while doing the analysis, we get incorrect predictions, because outlier values affect the entire output: the mean, median, mode, quartiles and range all change. So, to get correct predictions, we should remove these outliers.
If removal of the outliers is not possible, then another way is to cap these values within the µ ± 3σ limits, e.g. in the retail data set we have outliers in the Item_Visibility variable.
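Capping at the µ ± 3σ limits (winsorising to the three-sigma band) can be sketched as follows, again with invented visibility values:

```python
import statistics

# Hypothetical Item_Visibility values with one extreme point.
vis = [0.05] * 10 + [0.07] * 10 + [0.60]

mu = statistics.mean(vis)
sigma = statistics.pstdev(vis)  # population standard deviation
lo, hi = mu - 3 * sigma, mu + 3 * sigma

# Pull every value outside the mu +/- 3*sigma band back to the nearest limit.
capped = [min(max(v, lo), hi) for v in vis]
```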