
Datascience 2 PDF

The document describes performing bivariate and multivariate analysis on diabetes data from a UCI dataset. It includes: 1) Bivariate analysis using linear and logistic regression to analyze relationships between various feature pairs. 2) Multiple linear regression to analyze the relationship between outcomes and multiple explanatory features simultaneously. 3) Visualization of the data using various plotting functions in Seaborn, including density plots, histograms, scatter plots, and 3D plots to explore relationships in the data. 4) Mapping and visualization of geographic city data from California using basemap to plot cities on a map with colors and sizes representing population and area features.

Uploaded by

Vijayan .N
Copyright
© All Rights Reserved

5.b) BIVARIATE ANALYSIS ON DIABETES DATA

(i) BIVARIATE ANALYSIS USING LINEAR REGRESSION

PROGRAM:
import pandas as pd
import statsmodels.api as sm
data=pd.read_csv("pima_diabetes.csv")
#create correlation matrix
data.corr()

#Bivariate Analysis of Glucose-Insulin features


#define response variable 1
y1= data['Glucose']

#define explanatory variable 1


x1= data[['Insulin']]

#add constant to predictor variables


x1= sm.add_constant(x1)

#fit linear regression model


model1 = sm.OLS(y1, x1).fit()

#view model summary


print(model1.summary())

#Bivariate Analysis of Age-Pregnancies features


#define response variable 2
y2 = data['Age']

#define explanatory variable 2


x2 = data['Pregnancies']
#add constant to predictor variables
x2 = sm.add_constant(x2)

#fit linear regression model
model2 = sm.OLS(y2, x2).fit()

#view model summary


print(model2.summary())

#Bivariate Analysis of SkinThickness-BMI features


#define response variable 3
y3 = data['SkinThickness']

#define explanatory variable 3


x3 = data[['BMI']]

#add constant to predictor variables


x3 = sm.add_constant(x3)

#fit linear regression model


model3 = sm.OLS(y3, x3).fit()

#view model summary


print(model3.summary())
OUTPUT:
a. Correlation Matrix

b. Bivariate Analysis of Glucose-Insulin features


c. Bivariate Analysis of Age-Pregnancies features

d. Bivariate Analysis of SkinThickness-BMI features
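
As a rough illustration of what each one-predictor fit in this section estimates, the sketch below computes the intercept, slope and R-squared of a least-squares line in plain Python. The helper name simple_ols and the toy numbers are assumptions made for illustration; they are not taken from pima_diabetes.csv.

```python
from statistics import mean

def simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x (hypothetical helper)."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R-squared: 1 - residual sum of squares / total sum of squares
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot

# Toy values for illustration only (not from the dataset)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
intercept, slope, r_squared = simple_ols(x, y)
print(f"const={intercept:.3f} slope={slope:.3f} R^2={r_squared:.3f}")
```

The intercept and slope play the same role as the const row and the predictor row in the coefficient table of a statsmodels OLS summary.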


(ii) BIVARIATE ANALYSIS USING LOGISTIC REGRESSION

PROGRAM:
# importing libraries
import statsmodels.api as sm
import pandas as pd

# loading the training dataset


data = pd.read_csv('pima_diabetes.csv', index_col = 0)

# defining the dependent and independent variables


Xtrain = data[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
'DiabetesPedigreeFunction','Age']]
ytrain = data[['Outcome']]

# building the model and fitting the data


log_reg = sm.Logit(ytrain, Xtrain).fit()

# printing the summary table


print(log_reg.summary())
OUTPUT:
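
The program above passes Xtrain to sm.Logit without sm.add_constant, so the model is fitted without an intercept, and it uses seven predictors rather than one. As a minimal sketch of what a single-predictor logistic fit with an intercept involves, the plain-Python example below runs Newton-Raphson on made-up data. The helper name fit_logistic, the iteration count and the toy values are assumptions for illustration, not part of the original program.

```python
import math

def fit_logistic(x, y, iterations=25):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by Newton-Raphson (hypothetical helper)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Gradient of the log-likelihood with respect to b0 and b1
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Entries of the 2x2 information matrix, weights w = p * (1 - p)
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: coefficients += inverse(information) * gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Toy predictor and 0/1 outcome for illustration only (not from the dataset)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(x, y)
print(f"intercept={b0:.3f} slope={b1:.3f}")
```

With statsmodels, the equivalent step would be to wrap the predictor in sm.add_constant before calling sm.Logit, as the linear regression programs already do for sm.OLS.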
5.c) MULTIPLE REGRESSION ANALYSIS ON DIABETES DATA

PROGRAM:
# importing modules and packages
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import statsmodels.api as ssm

# importing data
df = pd.read_csv('pima_diabetes.csv')

# creating feature variables


X = df.drop('Outcome', axis=1)
Y = df['Outcome']

X=ssm.add_constant(X) #to add constant value in the model


model = ssm.OLS(Y, X).fit() #fitting the model
predictions = model.summary() #summary of the model
print(predictions)
OUTPUT:
6. APPLICATION OF PLOTTING FUNCTIONS ON UCI DATASET

a) NORMAL CURVES

PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

df = pd.read_csv("adult.csv")

#Check the structure of the data
df.info()

sns.set(font_scale=1.5)
sns.catplot(x="relationship", y="age", data=df,
kind="point",hue='income',capsize=0.4,ci=None,aspect=2)

# Show plot
plt.xticks(rotation=90)
plt.show()

sns.set(font_scale=1)
sns.relplot(x="educational-num", y="hours-per-week",
data=df, kind="line",row='income' , ci=None,
hue="relationship",style="relationship",markers=True,
dashes=False,aspect=2)

# Show plot
plt.show()
OUTPUT:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 48842 non-null int64
1 workclass 48842 non-null object
2 fnlwgt 48842 non-null int64
3 education 48842 non-null object
4 educational-num 48842 non-null int64
5 marital-status 48842 non-null object
6 occupation 48842 non-null object
7 relationship 48842 non-null object
8 race 48842 non-null object
9 gender 48842 non-null object
10 capital-gain 48842 non-null int64
11 capital-loss 48842 non-null int64
12 hours-per-week 48842 non-null int64
13 native-country 48842 non-null object
14 income 48842 non-null object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB
b) DENSITY AND CONTOUR PLOTS

PROGRAM:
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

df = pd.read_csv("adult.csv")

# set seaborn style

sns.set_style("white")

#Map a third variable “income” with a hue semantic to show conditional distributions
sns.kdeplot(data=df, x="age", y="educational-num", hue="income")

#Show filled contours

sns.kdeplot(data=df, x="age", y="educational-num", hue="income", fill=True)


sns.kdeplot(data=df, x="age", y="fnlwgt", hue="income")

sns.kdeplot(data=df, x="age", y="fnlwgt", hue="income", fill=True)


sns.kdeplot(data=df, x="age", y="hours-per-week", hue="income")

sns.kdeplot(data=df, x="age", y="hours-per-week", hue="income", fill=True)


c) CORRELATION AND SCATTER PLOTS

PROGRAM:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)

df = pd.read_csv("adult.csv")

# set seaborn style

sns.set_style("white")

sns.scatterplot(data=df[0:100], x="educational-num", y="hours-per-week")


sns.scatterplot(data=df[0:100], x="relationship", y="age")

sns.scatterplot(data=df[0:100], x="relationship", y="age", hue="income")

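# keep numeric columns only: recent pandas versions raise an error when corr() meets text columns
cormat = df.select_dtypes(include="number").corr()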
sns.heatmap(cormat, annot=True);
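
Each cell of the heatmap is a Pearson correlation coefficient between two numeric columns, which is the default method of DataFrame.corr. The plain-Python sketch below builds such a matrix by hand; the helper name pearson and the column values are made up for illustration and are not rows from adult.csv.

```python
from math import sqrt
from statistics import mean

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    var_a = sum((x - a_bar) ** 2 for x in a)
    var_b = sum((y - b_bar) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

# Made-up numeric columns for illustration only
columns = {
    "age": [22, 35, 41, 53, 29],
    "educational-num": [9, 13, 10, 14, 9],
    "hours-per-week": [35, 40, 45, 50, 38],
}
names = list(columns)
# Nested dict: matrix[row][col] holds the correlation of the two columns
matrix = {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}
for r in names:
    print(r, [round(matrix[r][c], 2) for c in names])
```

The matrix is symmetric with ones on the diagonal, which is why the heatmap mirrors itself across the diagonal.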
d) HISTOGRAMS

PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

df = pd.read_csv("adult.csv")

# set seaborn style


sns.set_style("white")

sns.histplot(data=df[:100], x="hours-per-week", kde=True, color="red")
# <AxesSubplot:xlabel='hours-per-week', ylabel='Count'>

sns.distplot(df["hours-per-week"], color="green")
# <AxesSubplot:xlabel='hours-per-week', ylabel='Density'>

sns.histplot(data=df, x="hours-per-week", bins=10)
# <AxesSubplot:xlabel='hours-per-week', ylabel='Count'>

sns.histplot(data=df[:100], x="hours-per-week", hue="income", multiple="stack")
# <AxesSubplot:xlabel='hours-per-week', ylabel='Count'>

sns.histplot(data=df[:100], x="age", hue="income", multiple="stack")
# <AxesSubplot:xlabel='age', ylabel='Count'>

df.hist(figsize=(12,12), layout=(3,3), sharex=False)
# array([[<AxesSubplot:title={'center':'age'}>,
#         <AxesSubplot:title={'center':'fnlwgt'}>,
#         <AxesSubplot:title={'center':'educational-num'}>],
#        [<AxesSubplot:title={'center':'capital-gain'}>,
#         <AxesSubplot:title={'center':'capital-loss'}>,
#         <AxesSubplot:title={'center':'hours-per-week'}>],
#        [<AxesSubplot:>, <AxesSubplot:>, <AxesSubplot:>]], dtype=object)
e) THREE-DIMENSIONAL PLOTTING

PROGRAM:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px

df = pd.read_csv("adult.csv")

fig = px.scatter_3d(df[:200], x='age', y='capital-gain', z='hours-per-week', color='income')


fig.show()

fig1 = px.scatter_3d(df[:200], x='age', y='educational-num', z='relationship', color='income')


fig1.show()
7. VISUALIZING GEOGRAPHIC DATA WITH BASEMAP

PROGRAM:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

cities = pd.read_csv('california_cities.csv')

# Extract the data


lat = cities['latd'].values
lon = cities['longd'].values
population = cities['population_total'].values
area= cities['area_total_km2'].values
# Draw the map background
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution='h', lat_0=37.5, lon_0=-119, width=1E6,
height=1.2E6)
m.shadedrelief()
m.drawcoastlines(color='gray')
m.drawcountries(color='gray')
m.drawstates(color='gray')

# scatter city data, with color reflecting population and size reflecting area
m.scatter(lon, lat, latlon=True, c=np.log10(population), s=area, cmap='Reds', alpha=0.5)

# create colorbar and legend


plt.colorbar(label=r'$\log_{10}({\rm population})$')
plt.clim(3, 7)

# make legend with dummy points
for a in [100, 300, 500]:
    plt.scatter([], [], c='k', alpha=0.5, s=a, label=str(a) + ' km$^2$')
plt.legend(scatterpoints=1, frameon=False, labelspacing=1, loc='lower left')
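
The color scale in this program is easier to read with the arithmetic spelled out: the points are colored by log10(population), and plt.clim(3, 7) stretches the colormap over populations from 1,000 to 10,000,000. The sketch below reproduces that mapping in plain Python. The helper name color_fraction and the sample populations are assumptions for illustration, and the clamping mirrors how out-of-range values are drawn with the end colors.

```python
import math

def color_fraction(population, vmin=3.0, vmax=7.0):
    """Position on the colormap, from 0.0 to 1.0, for a population (hypothetical helper)."""
    value = math.log10(population)
    fraction = (value - vmin) / (vmax - vmin)
    # Values outside the color limits are clamped to the ends of the scale
    return min(1.0, max(0.0, fraction))

# Sample populations for illustration only (not taken from california_cities.csv)
for pop in (1_000, 100_000, 10_000_000):
    print(pop, color_fraction(pop))
```

A city of 100,000 people therefore sits at the midpoint of the Reds colormap, while towns below 1,000 all share the lightest shade.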
