Data Science Lab Manual

The document discusses working with NumPy arrays and Pandas dataframes in Python. It shows how to create arrays and dataframes from lists, dictionaries, other arrays and series. Array slicing and basic operations like shape and size are demonstrated. Dataframes can be constructed from 2D arrays, dictionaries, other dataframes and series.

Uploaded by

HANISHA SAALIH
Copyright: © All Rights Reserved

Ex no: 1 Download, install and explore the features of NumPy, SciPy, Jupyter and Pandas packages
Date:

AIM

PROCEDURE
1. Setting up your machine for data science in Python
2. Download and install Anaconda
3. Installing Anaconda on Windows

This section details the installation of the Anaconda distribution of Python on Windows 10. I think the
Anaconda distribution of Python is the best option for problem solvers who want to use Python.
Anaconda is free (although the download is large and can take time) and can be installed on school or
work computers where you don't have administrator access or the ability to install new programs.
Anaconda comes bundled with about 600 packages pre-installed, including NumPy, Matplotlib and SymPy.

Follow the steps below to install the Anaconda distribution of Python on Windows.

1. Visit Anaconda.com/downloads
2. Select Windows
3. Download the .exe installer
4. Open and run the .exe installer
5. Open the Anaconda Prompt and run some Python code
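Once the Anaconda Prompt is open, a quick check along the following lines can confirm that the main packages import correctly. This is only a minimal sketch: the `versions` dictionary is an illustrative name, and the version numbers printed depend on the installed distribution.

```python
# Quick sanity check after installation: import the core packages
# and print their versions (the values depend on your installation).
import numpy as np
import pandas as pd
import scipy

versions = {
    "numpy": np.__version__,
    "pandas": pd.__version__,
    "scipy": scipy.__version__,
}
for name, version in versions.items():
    print(name, version)
```

If any of the imports fails, the package is probably missing from the active environment.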
Features of Python packages:

1. Pandas

Pandas is a free Python software library for data analysis and data handling. It was created as a
community library project and initially released around 2008. Pandas provides high-performance,
easy-to-use data structures and operations for manipulating data in the form of numerical tables and
time series. Pandas also has multiple tools for reading and writing data between in-memory data
structures and different file formats. In short, it is well suited to quick and easy data manipulation,
data aggregation, reading and writing data, and data visualization. Pandas can take in data from
different types of files such as CSV and Excel, or from a SQL database, and create a Python object
known as a data frame. A data frame contains rows and columns, and it can be used for data
manipulation with operations such as join, merge, groupby and concatenate.
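The merge, groupby and concatenate operations mentioned above can be sketched as follows. The tables, names and scores are made up for illustration and are not part of the manual.

```python
import pandas as pd

# Two small, made-up tables that share the key column "id"
students = pd.DataFrame({"id": [1, 2, 3], "name": ["Asha", "Ravi", "Meena"]})
marks = pd.DataFrame({"id": [1, 1, 2, 3], "score": [80, 90, 70, 60]})

# merge: join the two tables on the shared key
merged = pd.merge(students, marks, on="id")

# groupby: average score per student name
mean_score = merged.groupby("name")["score"].mean()

# concat: stack two data frames row-wise
extra = pd.DataFrame({"id": [4], "name": ["Kiran"]})
all_students = pd.concat([students, extra], ignore_index=True)

print(merged)
print(mean_score)
print(all_students)
```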

2. NumPy

NumPy is a free Python software library for numerical computing on data in the form of large arrays
and multi-dimensional matrices. These multidimensional arrays are the main objects in NumPy; their
dimensions are called axes, and the number of axes is called the rank. NumPy also provides various
tools to work with these arrays, and high-level mathematical functions to manipulate the data with
linear algebra, Fourier transforms, random number generation, etc. Some of the basic array operations
that can be performed using NumPy include adding, slicing, multiplying, flattening, reshaping, and
indexing the arrays. More advanced functions include stacking arrays, splitting them into sections,
broadcasting arrays, etc.
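A few of the operations listed above (reshaping, flattening, stacking, splitting and broadcasting) can be sketched like this. The array values are arbitrary and chosen only for illustration.

```python
import numpy as np

a = np.arange(6)                 # [0 1 2 3 4 5]
m = a.reshape(2, 3)              # reshape into 2 rows x 3 columns
flat = m.flatten()               # back to a 1-D copy

stacked = np.vstack([m, m])      # stack vertically -> shape (4, 3)
left, right = np.hsplit(stacked, [1])  # split the columns at index 1

# Broadcasting: the 1-D row [10, 20, 30] is added to every row of m
shifted = m + np.array([10, 20, 30])

print(m.shape, stacked.shape, left.shape, right.shape)
print(shifted)
```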
3. SciPy

SciPy is a free software library for scientific and technical computing. It was created as a
community library project and initially released around 2001. The SciPy library is built on the
NumPy array object and is part of the NumPy stack, which also includes other scientific computing
libraries and tools such as Matplotlib, SymPy and pandas. Users of the NumPy stack often also use
comparable applications such as MATLAB, GNU Octave and Scilab. SciPy supports scientific computing
tasks such as optimization, integration and interpolation, using linear algebra, Fourier transforms,
random number generation, special functions, etc. Just like in NumPy, multidimensional arrays are the
main objects in SciPy, and they are provided by the NumPy module itself.
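As a rough illustration of the optimization, integration and interpolation tasks named above, the following sketch uses three functions from `scipy.optimize`, `scipy.integrate` and `scipy.interpolate`. The functions being minimised, integrated and interpolated are arbitrary examples.

```python
import numpy as np
from scipy import integrate, interpolate, optimize

# Optimization: minimise f(x) = (x - 2)^2, whose minimum is at x = 2
result = optimize.minimize_scalar(lambda x: (x - 2) ** 2)

# Integration: integrate sin(x) from 0 to pi (the exact value is 2)
area, abs_error = integrate.quad(np.sin, 0, np.pi)

# Interpolation: linear interpolation through three known points
f = interpolate.interp1d([0, 1, 2], [0, 10, 20])
y_mid = float(f(0.5))

print(result.x, area, y_mid)
```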

Python Libraries for Data Visualization

1. Matplotlib

Matplotlib is a data visualization and 2-D plotting library for Python. It was initially released in
2003 and is the most popular and widely used plotting library in the Python community. It comes with
an interactive environment across multiple platforms. Matplotlib can be used in Python scripts, the
Python and IPython shells, the Jupyter notebook, web application servers, etc. It can be used to embed
plots into applications using various GUI toolkits such as Tkinter, GTK+, wxPython and Qt. You can
use Matplotlib to create line plots, bar charts, pie charts, histograms, scatter plots, error charts,
power spectra, stem plots, and many other kinds of chart. The Pyplot module also provides a
MATLAB-like interface while being free and open source.
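The Pyplot interface described above might be used as in the sketch below, which draws a bar chart and a histogram side by side. The data is randomly generated for illustration; the call to `plt.show()` is left commented out so the snippet can also run in a non-interactive session.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, illustrative data only
values = rng.normal(size=200)

# Two plot types side by side: a bar chart and a histogram
fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))
ax_bar.bar(["A", "B", "C"], [3, 7, 5])
ax_bar.set_title("Bar chart")
ax_hist.hist(values, bins=10)
ax_hist.set_title("Histogram")
fig.tight_layout()
# plt.show()  # uncomment when running as a script
```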

2. Seaborn

Seaborn is a Python data visualization library that is based on Matplotlib and closely integrated with
NumPy and pandas data structures. Seaborn has various dataset-oriented plotting functions that operate
on data frames and arrays containing whole datasets, and it internally performs the statistical
aggregation and mapping needed to produce informative plots. It is a high-level interface for creating
attractive and informative statistical graphics that are integral to exploring and understanding data.
Seaborn graphics include bar charts, histograms, scatter plots, box plots, plots with error bars, etc.
Seaborn also has various tools for choosing color palettes that can reveal patterns in the data.

3. Plotly

Plotly is a free, open-source graphing library that can be used to build data visualizations. Plotly
(plotly.py) is built on top of the Plotly JavaScript library (plotly.js) and can be used to create
web-based data visualizations that can be displayed in Jupyter notebooks or web applications using
Dash, or saved as individual HTML files. Plotly provides more than 40 unique chart types such as
scatter plots, histograms, line charts, bar charts, pie charts, error bars, box plots, multiple axes,
sparklines, dendrograms, 3-D charts, etc. Plotly also provides contour plots. In addition, Plotly can
be used offline with no internet connection.

RESULTS
Ex no: 2 Working with Numpy arrays
Date:

AIM

ALGORITHM
PROGRAM
import numpy as np
# Creating array object
arr = np.array([[1, 2, 3], [4, 2, 5]])
# Printing type of arr object
print("Array is of type: ", type(arr))
# Printing array dimensions (axes)
print("No. of dimensions: ", arr.ndim)
# Printing shape of array
print("Shape of array: ", arr.shape)
# Printing size (total number of elements) of array
print("Size of array: ", arr.size)
# Printing type of elements in array
print("Array stores elements of type: ", arr.dtype)
OUTPUT
Array is of type: <class 'numpy.ndarray'>
No. of dimensions: 2
Shape of array: (2, 3)
Size of array: 6
Array stores elements of type: int32
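The same attributes apply to arrays with more than two axes. The sketch below uses a hypothetical 3-D array named `cube` with an explicitly chosen element type; when no `dtype` is given, the default integer type (such as the `int32` shown above) depends on the platform and NumPy version.

```python
import numpy as np

# A 3-D array of shape (2, 3, 4) with an explicit element type
cube = np.zeros((2, 3, 4), dtype=np.float64)

print(cube.ndim)    # number of axes
print(cube.shape)   # length along each axis
print(cube.size)    # total elements = 2 * 3 * 4
print(cube.dtype)   # element type chosen above
```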

RESULTS
Ex no: 2a Working with Numpy arrays
Date:

AIM

ALGORITHM
PROGRAM
Program to Perform Array Slicing
import numpy as np
a = np.array([[1,2,3],[3,4,5],[4,5,6]])
print(a)
# Slice from row index 1 to the end
print("After slicing")
print(a[1:])
OUTPUT:
[[1 2 3]
 [3 4 5]
 [4 5 6]]
After slicing
[[3 4 5]
 [4 5 6]]

RESULTS
Ex no: 2b Working with Numpy arrays
Date:

AIM

ALGORITHM
PROGRAM

Program to Perform Array Slicing


import numpy as np
a = np.array([[1,2,3],[3,4,5],[4,5,6]])
print('Our array is:')
print(a)
# this returns array of items in the second column
print('The items in the second column are:')
print(a[...,1])
print('\n')
# Now we will slice all items from the second row
print('The items in the second row are:')
print(a[1,...])
print('\n')
# Now we will slice all items from column 1 onwards
print('The items column 1 onwards are:')
print(a[...,1:])
OUTPUT:
Our array is:
[[1 2 3]
 [3 4 5]
 [4 5 6]]
The items in the second column are:
[2 4 5]


The items in the second row are:
[3 4 5]


The items column 1 onwards are:
[[2 3]
 [4 5]
 [5 6]]
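For a 2-D array, the ellipsis (`...`) used in the program stands in for a full slice (`:`), so both spellings select the same items. The sketch below checks this and also shows that a basic slice is a view onto the original array; the variable names are illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3], [3, 4, 5], [4, 5, 6]])

# For a 2-D array, `...` and `:` select the same items
same_col = np.array_equal(a[..., 1], a[:, 1])
same_row = np.array_equal(a[1, ...], a[1, :])
same_tail = np.array_equal(a[..., 1:], a[:, 1:])
print(same_col, same_row, same_tail)

# A basic slice is a view: writing through it changes `a` as well
rows_from_1 = a[1:]
rows_from_1[0, 0] = 99
print(a[1, 0])
```

Use `a[1:].copy()` when an independent array is needed.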

RESULT:
Ex no: 3 Create a Dataframe using a list of elements
Date:

AIM

ALGORITHM
PROGRAM

import numpy as np
import pandas as pd

# Take a 2D array with row and column labels as input to your DataFrame
data = np.array([['','Col1','Col2'],['Row1',1,2],['Row2',3,4]])
print(pd.DataFrame(data=data[1:,1:], index=data[1:,0], columns=data[0,1:]))
# Take a 2D array as input to your DataFrame
my_2darray = np.array([[1, 2, 3], [4, 5, 6]])
print(pd.DataFrame(my_2darray))
# Take a dictionary as input to your DataFrame
my_dict = {1: ['1', '3'], 2: ['1', '2'], 3: ['2', '4']}
print(pd.DataFrame(my_dict))
# Take a DataFrame as input to your DataFrame
my_df = pd.DataFrame(data=[4,5,6,7], index=range(0,4), columns=['A'])
print(pd.DataFrame(my_df))
# Take a Series as input to your DataFrame
my_series = pd.Series({"United Kingdom":"London", "India":"New Delhi", "United States":"Washington", "Belgium":"Brussels"})
print(pd.DataFrame(my_series))
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
# Use the `shape` property
print(df.shape)
# Or use the `len()` function with the `index` property
print(len(df.index))
OUTPUT:

     Col1 Col2
Row1    1    2
Row2    3    4

   0  1  2
0  1  2  3
1  4  5  6

   1  2  3
0  1  1  2
1  3  2  4

   A
0  4
1  5
2  6
3  7

                         0
United Kingdom      London
India            New Delhi
United States   Washington
Belgium           Brussels

(2, 3)
2
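Since the exercise title refers to a list of elements, a DataFrame can also be built directly from plain Python lists. The following is a minimal sketch; the list contents and column names are made up for illustration.

```python
import pandas as pd

# A flat list becomes a single column
fruits = ["apple", "banana", "cherry"]
df_single = pd.DataFrame(fruits, columns=["fruit"])

# A list of lists becomes one row per inner list
rows = [["apple", 3], ["banana", 5], ["cherry", 7]]
df_rows = pd.DataFrame(rows, columns=["fruit", "count"])

print(df_single)
print(df_rows)
print(df_rows.shape)
```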

RESULT:
Ex no: 4 Descriptive analytics on the Iris data set
Date:

AIM

ALGORITHM
PROGRAM
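# download iris.csv from https://datahub.io/machine-learning/iris
import pandas as pd
# Reading CSV file
df = pd.read_csv('iris_csv.csv')
# In a Jupyter notebook, run each of the following lines in its own cell to display its result
# Printing top 5 rows
df.head()
# Number of rows and columns
df.shape
# Column names, non-null counts and data types
df.info()
# Summary statistics of the numeric columns
df.describe()
# Number of missing values in each column
df.isnull().sum()
# Number of rows in each class
df.value_counts('class')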
OUTPUT:
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   sepallength  150 non-null    float64
 1   sepalwidth   150 non-null    float64
 2   petallength  150 non-null    float64
 3   petalwidth   150 non-null    float64
 4   class        150 non-null    object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
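Descriptive analytics is often carried one step further by summarising each class separately. The sketch below shows one way to do that with `groupby`. It uses a tiny made-up sample with the same column names as the CSV file, not the real iris measurements, so that it runs without the download.

```python
import pandas as pd

# Tiny made-up sample with the same column names as iris_csv.csv;
# these are NOT the real iris measurements.
sample = pd.DataFrame({
    "sepallength": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "petallength": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# Mean of each numeric column within every class
class_means = sample.groupby("class").mean(numeric_only=True)

# Number of rows per class
class_counts = sample["class"].value_counts()

print(class_means)
print(class_counts)
```

With the real file loaded into `df`, the same two calls can be applied to `df` instead of `sample`.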

RESULT:
Ex no: 5(a) Univariate analysis on UCI diabetes dataset
Date:

AIM

ALGORITHM
PROGRAM
# Univariate analysis of diabetes dataset
import pandas as pd
import numpy as np
import statistics as st
# Load the dataset
df = pd.read_csv("diabetes_csv.csv")
print("MEAN:\n", df.mean(numeric_only=True))
print("MEDIAN:\n", df.median(numeric_only=True))
print("MODE:\n", df.mode(numeric_only=True))
print("STANDARD DEVIATION:\n", df.std(numeric_only=True))
print("VARIANCE:\n", df.var(numeric_only=True))
print("SKEWNESS:\n", df.skew(numeric_only=True))
print("KURTOSIS:\n", df.kurtosis(numeric_only=True))
OUTPUT:

MEAN:
preg      3.845052
plas    120.894531
pres     69.105469
skin     20.536458
insu     79.799479
mass     31.992578
pedi      0.471876
age      33.240885
dtype: float64

MEDIAN:
preg      3.0000
plas    117.0000
pres     72.0000
skin     23.0000
insu     30.5000
mass     32.0000
pedi      0.3725
age      29.0000
dtype: float64

MODE:
   preg  plas  pres  skin  insu  mass   pedi   age
0   1.0    99  70.0   0.0   0.0  32.0  0.254  22.0
1   NaN   100   NaN   NaN   NaN   NaN  0.258   NaN

STANDARD DEVIATION:
preg      3.369578
plas     31.972618
pres     19.355807
skin     15.952218
insu    115.244002
mass      7.884160
pedi      0.331329
age      11.760232
dtype: float64

VARIANCE:
preg       11.354056
plas     1022.248314
pres      374.647271
skin      254.473245
insu    13281.180078
mass       62.159984
pedi        0.109779
age       138.303046
dtype: float64

SKEWNESS:
preg    0.901674
plas    0.173754
pres   -1.843608
skin    0.109372
insu    2.272251
mass   -0.428982
pedi    1.919911
age     1.129597
dtype: float64

KURTOSIS:
preg    0.159220
plas    0.640780
pres    5.180157
skin   -0.520072
insu    7.214260
mass    3.290443
pedi    5.594954
age     0.643159
dtype: float64
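The relationship between these statistics can be checked on a small series. The sketch below uses toy values (not taken from the diabetes dataset) to show that the variance reported by pandas is the square of the standard deviation, both computed with the sample convention `ddof=1`, and that a longer right tail gives positive skewness.

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # toy values

mean = s.mean()
median = s.median()
mode = s.mode().tolist()
std = s.std()        # sample standard deviation (ddof=1 by default)
var = s.var()        # sample variance (ddof=1 by default)

# The variance is the square of the standard deviation
print(mean, median, mode)
print(abs(var - std ** 2) < 1e-12)

# Positive skewness indicates a longer right tail
print(s.skew() > 0)
```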

RESULT:
Ex.No: 5(b) Bivariate analysis: Linear regression modeling
Date:

AIM:

ALGORITHM:
PROGRAM:
# Importing all the libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error
# Loading dataset
diabetes = datasets.load_diabetes()
diabetes.keys()
# To find the content of data, put it in a DataFrame
df = pd.DataFrame(diabetes['data'], columns=diabetes['feature_names'])
x = df
y = diabetes['target']
# Splitting our data into training and testing sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=101)
# Building the model (it is always fitted on the training data)
model = linear_model.LinearRegression()
model.fit(x_train, y_train)
# Prediction of test set result of the prepared model:
# the test feature values are passed in to get the label values predicted by the model
y_pre = model.predict(x_test)
# Cross validation scores
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, x, y, scoring="neg_mean_squared_error", cv=10)
# Calculating the mean of the root mean square errors over the folds
rmse_scores = np.sqrt(-scores).mean()
print("Cross validation", rmse_scores)
# Checking prediction accuracy by r2 score (at most 1; it can be negative for a poor model)
from sklearn.metrics import r2_score
print("r2:", r2_score(y_test, y_pre))
# Calculating Root Mean Square Error
mse = mean_squared_error(y_test, y_pre)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
# Getting weights and intercept of model
print("Weights:", model.coef_)
print("\nIntercept", model.intercept_)
OUTPUT:
Cross validation 54.40461553640237
r2: 0.4576767417719556
RMSE: 58.009275047552
Weights: [ -8.02566358 -308.83945001 583.63074324 299.9976184 -360.68940198
95.14235214 -93.03306818 118.15005596 662.12887711 26.07401648]

Intercept 153.72029738615726
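The RMSE and r2 figures reported above follow from simple formulas: RMSE is the square root of the mean squared difference between true and predicted values, and r2 is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes both by hand with NumPy on toy values that are not related to the diabetes data.

```python
import numpy as np

# Toy true and predicted values (illustrative only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# RMSE = sqrt(mean((y_true - y_pred)^2))
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(rmse, r2)
```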

RESULT:
Ex.No: 5(c) Bivariate analysis: Logistic regression modeling
Date:

AIM:

ALGORITHM:
PROGRAM:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.metrics import mean_squared_error
diabetes = datasets.load_diabetes()
diabetes.keys()
# To find the content of data, put it in a DataFrame
df = pd.DataFrame(diabetes['data'], columns=diabetes['feature_names'])
x = df
y = diabetes['target']
# Splitting our data into training and testing sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=101)
# Importing and building the model
# Note: this target is continuous, so LogisticRegression treats each distinct value as a separate class
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(x_train, y_train)
# Prediction of test set result of the prepared model
y_pre = model.predict(x_test)
# Checking prediction accuracy by r2 score (at most 1; it can be negative for a poor model)
from sklearn.metrics import r2_score
print('r^2:', r2_score(y_test, y_pre))
# Calculating root mean square error
mse = mean_squared_error(y_test, y_pre)
rmse = np.sqrt(mse)
print('RMSE:', rmse)
OUTPUT:

r^2: -0.44401265478624397
RMSE: 94.65723681369009
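The negative r^2 above reflects that logistic regression is a classifier, while this target is a continuous measurement. One possible variation, sketched below, first turns the target into two classes and then reports classification accuracy. The median threshold and the use of `accuracy_score` are assumptions made for illustration and are not part of the original exercise.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes()
x = diabetes["data"]

# Assumption for illustration: label a patient 1 when the target is
# above the median, 0 otherwise. This threshold is not from the manual.
y = (diabetes["target"] > np.median(diabetes["target"])).astype(int)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=101
)

model = LogisticRegression()
model.fit(x_train, y_train)
y_pre = model.predict(x_test)

accuracy = accuracy_score(y_test, y_pre)
print("Accuracy:", accuracy)
```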

RESULT:
Ex.No: 6 Explore Various Plotting Functions
Date:

AIM:

ALGORITHM:
PROGRAM:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('Heart.csv')
df

# Normal curve: Age variable
f, ax = plt.subplots(figsize=(10,6))
x = df['Age']
# distplot is deprecated in recent seaborn releases; sns.histplot(x, kde=True) is the suggested replacement
ax = sns.distplot(x, bins=10)
plt.title('Normal curve')
plt.show()

# Density and contour plots
x = np.linspace(0,5,50)
y = np.linspace(0,5,40)
X, Y = np.meshgrid(x,y)
z = np.sin(X)**10 + np.cos(10+Y*X)*np.cos(X)
plt.contour(x,y,z)
plt.title('Density and Contour Plots')
plt.show()

# Correlation plot: pair plot
sns.pairplot(data=df, vars=['Age','RestBP','Chol'])
plt.title('Correlation Plot Pair plot')
plt.show()

# Histogram
#df.hist(figsize=(12,12),layout=(5,3))
data = np.random.randn(1000)
plt.hist(data, bins=10)
plt.title('Histogram')
plt.show()

# Scatter plot to visualize the relationship between the Age and RestBP variables
f, ax = plt.subplots(figsize=(8,6))
ax = sns.scatterplot(x='Age', y='RestBP', data=df)
plt.title('Scatter plot')
plt.show()
OUTPUT
RESULT:
Ex.No: 7 Three Dimensional Plotting
Date:

AIM:

ALGORITHM:
PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('Heart.csv')
fig = plt.figure()
# Syntax for 3-D projection
ax = plt.axes(projection='3d')
# Defining all 3 axes
x = df['Age']
x = pd.Series(x, name='Age Variable')
y = df['Sex']
y = pd.Series(y, name='sex variable')
z = df['Chol']
z = pd.Series(z, name='cholesterol variable')
# Plotting
ax.plot3D(x, y, z, 'green')
ax.set_title('3D line plot Heart disease dataset')
plt.show()
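The same 3-D axes can be tried without the Heart.csv file. The sketch below plots a synthetic helix, chosen only so that the example is self-contained, as a 3-D line with a few scatter points and labelled axes; `plt.show()` is left commented out so the snippet can also run in a non-interactive session.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic helix, used only so the sketch runs without Heart.csv
t = np.linspace(0, 4 * np.pi, 200)
x, y, z = np.cos(t), np.sin(t), t

fig = plt.figure()
ax = fig.add_subplot(projection="3d")   # 3-D axes
ax.plot(x, y, z, color="green")         # 3-D line
ax.scatter(x[::20], y[::20], z[::20], color="black")  # a few 3-D points
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
ax.set_title("3D line and scatter plot")
# plt.show()  # uncomment when running as a script
```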
OUTPUT:

RESULT:
Ex.No: 8 Visualizing Geographic Data with Basemap
Date:

AIM:

ALGORITHM:
PROGRAM:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
            width=8E6, height=8E6,
            lat_0=20, lon_0=78,)
m.etopo(scale=0.7, alpha=0.7)
# Map (long, lat) to (x, y) for plotting
x, y = m(80, 13)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Chennai', fontsize=20);
OUTPUT:

RESULT:
