Submitted By:-Shaikshahanaafroz - Cms20Mba093: 1. Identify The Shape of The Data

This document summarizes an analysis of a vehicle sales dataset containing 72983 rows and 34 columns. It identifies that there are 19 numerical and 15 categorical variables. It checks for any syntax, data type or format errors and finds no issues other than the data type of the PurchDate column. It also checks for any missing values and identifies the number of missing values in each column.

Uploaded by

Shahana Afroz

10/2/21, 2:43 PM AI1_Assignment 1_CMS20MBA093_Shaik shahana afroz - Jupyter Notebook

Submitted By:-ShaikShahanaAfroz_CMS20MBA093
In [2]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [3]: df = pd.read_csv("training.csv")
df

Out[3]:
       RefId  IsBadBuy  PurchDate  Auction  VehYear  VehicleAge  Make       Model                 Trim
0          1         0  12/7/2009  ADESA       2006           3  MAZDA      MAZDA3
1          2         0  12/7/2009  ADESA       2004           5  DODGE      1500 RAM PICKUP 2WD   SP
2          3         0  12/7/2009  ADESA       2005           4  DODGE      STRATUS V6            SX?
3          4         0  12/7/2009  ADESA       2004           5  DODGE      NEON                  SX?
4          5         0  12/7/2009  ADESA       2005           4  FORD       FOCUS                 ZX.
...
72978  73010         1  12/2/2009  ADESA       2001           8  MERCURY    SABLE                 GI
72979  73011         0  12/2/2009  ADESA       2007           2  CHEVROLET  MALIBU 4C             L*
72980  73012         0  12/2/2009  ADESA       2005           4  JEEP       GRAND CHEROKEE 2WD V  La
72981  73013         0  12/2/2009  ADESA       2006           3  CHEVROLET  IMPALA                L*
72982  73014         0  12/2/2009  ADESA       2006           3  MAZDA      MAZDA6

72983 rows × 34 columns

1. Identify the shape of the data

localhost:8888/notebooks/AI 1_Assignment 1_CMS20MBA093_Shaik shahana afroz.ipynb 1/6



In [4]: df.shape

Out[4]: (72983, 34)

The DataFrame has 72983 rows and 34 columns.

2. Identify the Categorical and Numerical variables in the dataset. Name them.

In [5]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72983 entries, 0 to 72982
Data columns (total 34 columns):
 #  Column                             Non-Null Count  Dtype
 0  RefId                              72983 non-null  int64
 1  IsBadBuy                           72983 non-null  int64
 2  PurchDate                          72983 non-null  object
 3  Auction                            72983 non-null  object
 4  VehYear                            72983 non-null  int64
 5  VehicleAge                         72983 non-null  int64
 6  Make                               72983 non-null  object
 7  Model                              72983 non-null  object
 8  Trim                               70623 non-null  object
 9  SubModel                           72975 non-null  object
10  Color                              72975 non-null  object
11  Transmission                       72974 non-null  object
12  WheelTypeID                        69814 non-null  float64
13  WheelType                          69809 non-null  object
14  VehOdo                             72983 non-null  int64
15  Nationality                        72978 non-null  object
16  Size                               72978 non-null  object
17  TopThreeAmericanName               72978 non-null  object
18  MMRAcquisitionAuctionAveragePrice  72965 non-null  float64
19  MMRAcquisitionAuctionCleanPrice    72965 non-null  float64
20  MMRAcquisitionRetailAveragePrice   72965 non-null  float64
21  MMRAcquisitonRetailCleanPrice      72965 non-null  float64
22  MMRCurrentAuctionAveragePrice      72668 non-null  float64
23  MMRCurrentAuctionCleanPrice        72668 non-null  float64
24  MMRCurrentRetailAveragePrice       72668 non-null  float64
25  MMRCurrentRetailCleanPrice         72668 non-null  float64
26  PRIMEUNIT                           3419 non-null  object
27  AUCGUART                            3419 non-null  object
28  BYRNO                              72983 non-null  int64
29  VNZIP1                             72983 non-null  int64
30  VNST                               72983 non-null  object
31  VehBCost                           72983 non-null  float64
32  IsOnlineSale                       72983 non-null  int64
33  WarrantyCost                       72983 non-null  int64
dtypes: float64(10), int64(9), object(15)
memory usage: 18.9+ MB

It has 19 numerical variables (10 float64 and 9 int64) and 15 categorical variables (dtype object).
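The question also asks for the variables to be named. A minimal sketch of one way to list them by dtype with `select_dtypes` follows. The small frame `sample` is a hypothetical stand-in holding a few of the columns, not the real `training.csv`.

```python
import pandas as pd

# Hypothetical stand-in for df: a handful of its columns with made-up rows.
sample = pd.DataFrame({
    "RefId": [1, 2, 3],
    "PurchDate": ["12/7/2009", "12/7/2009", "12/2/2009"],
    "Make": ["MAZDA", "DODGE", "FORD"],
    "VehBCost": [7100.0, 7600.0, 4900.0],
})

# Columns with a numeric dtype (int64, float64) are treated as numerical.
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
# Every remaining column (text columns here) is treated as categorical.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

Running the same two `select_dtypes` calls on `df` should give the 19 and 15 names counted above. A dtype-based split is only a first pass: columns such as IsBadBuy, WheelTypeID, BYRNO and VNZIP1 are stored as numbers but appear to act as flags or codes rather than measurements.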

3. Identify any syntax/data type/format errors


In [6]: df.count()

Out[6]:

RefId                              72983
IsBadBuy                           72983
PurchDate                          72983
Auction                            72983
VehYear                            72983
VehicleAge                         72983
Make                               72983
Model                              72983
Trim                               70623
SubModel                           72975
Color                              72975
Transmission                       72974
WheelTypeID                        69814
WheelType                          69809
VehOdo                             72983
Nationality                        72978
Size                               72978
TopThreeAmericanName               72978
MMRAcquisitionAuctionAveragePrice  72965
MMRAcquisitionAuctionCleanPrice    72965
MMRAcquisitionRetailAveragePrice   72965
MMRAcquisitonRetailCleanPrice      72965
MMRCurrentAuctionAveragePrice      72668
MMRCurrentAuctionCleanPrice        72668
MMRCurrentRetailAveragePrice       72668
MMRCurrentRetailCleanPrice         72668
PRIMEUNIT                           3419
AUCGUART                            3419
BYRNO                              72983
VNZIP1                             72983
VNST                               72983
VehBCost                           72983
IsOnlineSale                       72983
WarrantyCost                       72983
dtype: int64

There are no data type errors apart from PurchDate, which is stored as object (text) rather than as a datetime.
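A possible fix for the PurchDate type is `pd.to_datetime`. The sketch below runs on a hypothetical three-value series and assumes the dates are month/day/year, which the sample rows (12/7/2009, 12/2/2009) do not settle on their own.

```python
import pandas as pd

# Hypothetical values in the same layout as the PurchDate column.
purch = pd.Series(["12/7/2009", "12/2/2009", "1/15/2010"], name="PurchDate")

# Assumption: month comes first. An explicit format avoids silent
# month/day swaps that automatic parsing could introduce.
purch_dt = pd.to_datetime(purch, format="%m/%d/%Y")

print(purch_dt.dtype)
print(purch_dt.dt.month.tolist())
```

On the real data the equivalent would be assigning the converted series back to `df["PurchDate"]`, after which `df.info()` should report a datetime dtype for that column.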

4. Are there any missing values? If yes, mention how many in each column/feature


In [7]: df.isnull().sum()

Out[7]:

RefId                                  0
IsBadBuy                               0
PurchDate                              0
Auction                                0
VehYear                                0
VehicleAge                             0
Make                                   0
Model                                  0
Trim                                2360
SubModel                               8
Color                                  8
Transmission                           9
WheelTypeID                         3169
WheelType                           3174
VehOdo                                 0
Nationality                            5
Size                                   5
TopThreeAmericanName                   5
MMRAcquisitionAuctionAveragePrice     18
MMRAcquisitionAuctionCleanPrice       18
MMRAcquisitionRetailAveragePrice      18
MMRAcquisitonRetailCleanPrice         18
MMRCurrentAuctionAveragePrice        315
MMRCurrentAuctionCleanPrice          315
MMRCurrentRetailAveragePrice         315
MMRCurrentRetailCleanPrice           315
PRIMEUNIT                          69564
AUCGUART                           69564
BYRNO                                  0
VNZIP1                                 0
VNST                                   0
VehBCost                               0
IsOnlineSale                           0
WarrantyCost                           0
dtype: int64

The counts above are the missing values in each column of the dataset.
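Raw counts are easier to judge as a share of all rows. The sketch below recomputes that share for the five columns with the most gaps, using the counts and the row total reported in this notebook. It is an illustration, not a cell from the original notebook.

```python
import pandas as pd

total_rows = 72983  # row count reported by df.shape

# Missing counts copied from the df.isnull().sum() output
# (only the five columns with the most gaps).
missing = pd.Series({
    "Trim": 2360,
    "WheelTypeID": 3169,
    "WheelType": 3174,
    "PRIMEUNIT": 69564,
    "AUCGUART": 69564,
})

# Percentage of rows with a missing value, largest first.
missing_pct = (missing / total_rows * 100).round(2).sort_values(ascending=False)
print(missing_pct)
```

PRIMEUNIT and AUCGUART are missing in more than 95% of rows, so they carry very little information as they stand. On the full frame the same figures would come from `df.isnull().mean() * 100`.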

5. Are there any duplicate values? If yes, please mention

In [8]: df.duplicated().sum()

Out[8]: 0

There are no duplicate rows in the dataset.
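One caveat: `df.duplicated()` compares whole rows, so a column that is unique per row hides repeated records. RefId looks like such an identifier (an assumption based on its name). The sketch below shows the effect on a hypothetical three-row frame.

```python
import pandas as pd

# Hypothetical rows: the same vehicle recorded twice under different RefId values.
sample = pd.DataFrame({
    "RefId": [1, 2, 3],
    "Make": ["DODGE", "DODGE", "FORD"],
    "Model": ["NEON", "NEON", "FOCUS"],
    "VehOdo": [71000, 71000, 65000],
})

# Whole-row comparison: RefId makes every row distinct.
full_row_dupes = int(sample.duplicated().sum())
# Same check with the identifier column left out.
dupes_ignoring_id = int(sample.drop(columns="RefId").duplicated().sum())

print(full_row_dupes, dupes_ignoring_id)
```

Whether the real dataset contains such repeats is not known from the output in this notebook. The second check would have to be run on `df` to find out.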


6. Briefly explain the basic statistical measures of each numerical column

In [9]: df.describe()

              RefId      IsBadBuy       VehYear    VehicleAge   WheelTypeID         VehOdo  MMR
count  72983.000000  72983.000000  72983.000000  72983.000000  69814.000000   72983.000000
mean   36511.428497      0.122988   2005.343052      4.176644      1.494299   71499.995917
std    21077.241302      0.328425      1.731252      1.712210      0.521290   14578.913128
min        1.000000      0.000000   2001.000000      0.000000      0.000000    4825.000000
25%    18257.500000      0.000000   2004.000000      3.000000      1.000000   61837.000000
50%    36514.000000      0.000000   2005.000000      4.000000      1.000000   73361.000000
75%    54764.500000      0.000000   2007.000000      5.000000      2.000000   82436.000000
max    73014.000000      1.000000   2010.000000      9.000000      3.000000  115717.000000

7. Identify any redundant features (columns) and explain why it is redundant.

In [10]:
def getDuplicateColumns(df):
    # Collect the names of columns whose values repeat an earlier column.
    duplicateColumnNames = set()
    for x in range(df.shape[1]):
        col = df.iloc[:, x]
        for y in range(x + 1, df.shape[1]):
            otherCol = df.iloc[:, y]
            if col.equals(otherCol):
                duplicateColumnNames.add(df.columns.values[y])
    return list(duplicateColumnNames)

duplicateColNames = getDuplicateColumns(df)

print('Duplicate Columns are: ')
for column in duplicateColNames:
    print('Column Name : ', column)

Duplicate Columns are:


There are no duplicate columns: no two columns hold identical values.

In [11]: plt.figure(figsize=(20,10))
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='rainbow')
plt.show()
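An exact-equality check misses columns that are redundant without being identical. In the sample rows shown earlier, VehicleAge equals 2009 minus VehYear, which suggests one is derived from the other. The sketch below lists strongly correlated pairs from a correlation matrix. The frame `sample` and the 0.95 cut-off are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: VehicleAge is built as purchase year (2009) minus VehYear.
sample = pd.DataFrame({
    "VehYear": [2006, 2004, 2005, 2004, 2005, 2001],
    "VehOdo": [89046, 93593, 73807, 65617, 69367, 45234],
})
sample["VehicleAge"] = 2009 - sample["VehYear"]

# Absolute correlations, so strong negative links count as well.
corr = sample.corr().abs()

# Keep only the part above the diagonal so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

pairs = upper.stack()
redundant_pairs = pairs[pairs > 0.95]  # 0.95 is an arbitrary threshold
print(redundant_pairs)
```

On the real data the same steps would start from `df.select_dtypes(include='number')`. Any pair flagged this way is only a candidate: whether to drop one of the two columns remains a judgment call.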
