Data Analysis with Python
Dr B R Ambedkar National Institute of Technology, Jalandhar
Internship (2 Months)
26-05-2022 to 23-07-2022
Roll No: 19113054
Instructor: K.S. Rana
Acknowledgement
Firstly, I would like to thank Rail Coach Factory for giving me such a great opportunity to do my internship project in their esteemed organization at its Technical Training Centre, Kapurthala.
Finally, I would like to thank my family and friends for all the support and encouragement. I would also like to thank my fellow students for many helpful discussions and good ideas along the way.
Contents
DATA ANALYSIS
Data analysis is a process of inspecting, cleansing, transforming, and modelling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.
Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science, and social science domains.
Data requirements
Data are necessary as inputs to the analysis, which is specified based upon the requirements of those directing the analysis (or the customers, who will use the finished product of the analysis).
Data collection
Data is collected from a variety of sources. The requirements may be communicated by analysts to custodians of the data, such as Information Technology personnel within an organization. The data may also be collected from sensors in the environment, including traffic cameras, satellites, and recording devices. It may also be obtained through interviews, downloads from online sources, or reading documentation.
Data processing
The phases of the intelligence cycle used to convert
raw information into actionable intelligence or
knowledge are conceptually similar to the phases in
data analysis.
Data cleaning
Once processed and organized, the data may be incomplete, contain duplicates, or contain errors. The need for data cleaning arises from problems in the way that the data are entered and stored. Data cleaning is the process of preventing and correcting these errors. Common tasks include record matching, identifying inaccuracies, assessing the overall quality of existing data, deduplication, and column segmentation.
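As a sketch of how these cleaning tasks look in pandas (the rows and column names here are invented purely for illustration):

```python
import pandas as pd
import numpy as np

# A small, made-up table with the kinds of problems described above:
# duplicate rows, missing values, and an obviously wrong entry.
df = pd.DataFrame({
    "name": ["Asha", "Asha", "Ravi", "Meena", None],
    "age": [29.0, 29.0, -5.0, 41.0, 33.0],   # -5 is an invalid age
    "city": ["Delhi", "Delhi", "Pune", None, "Jalandhar"],
})

df = df.drop_duplicates()                  # deduplication
df = df.dropna(subset=["name"])            # drop records missing a key field
df.loc[df["age"] < 0, "age"] = np.nan      # flag impossible values as missing
df["city"] = df["city"].fillna("Unknown")  # fill remaining gaps with a default

print(df)
```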
Exploratory data analysis
Once the datasets are cleaned, they can be analysed. Analysts may apply a variety of techniques, referred to as exploratory data analysis, to begin understanding the messages contained within the obtained data. The process of data exploration may result in additional data cleaning or additional requests for data; thus, the iterative phases described above begin again. Descriptive statistics, such as the average or median, can be generated to aid in understanding the data. Data visualization is also used, allowing the analyst to examine the data in a graphical format in order to obtain additional insights regarding the messages within the data.
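A minimal sketch of generating such descriptive statistics with pandas (the sample values are invented for illustration):

```python
import pandas as pd

# Invented sample data, just to show what describe() summarises.
df = pd.DataFrame({"temperature": [21.5, 23.0, 19.8, 25.1, 22.4, 20.9]})

mean_temp = df["temperature"].mean()
median_temp = df["temperature"].median()
summary = df["temperature"].describe()   # count, mean, std, min, quartiles, max

print(f"mean={mean_temp:.2f}, median={median_temp:.2f}")
```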
Modelling and algorithms
Mathematical formulas or models (known as algorithms) may be applied to the data in order to identify relationships among the variables, for example, correlation or causation. In general terms, models may be developed to evaluate a specific variable based on other variable(s) contained within the dataset, with some residual error depending on the implemented model's accuracy.
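One hedged sketch of this idea, using NumPy to measure correlation and fit a simple linear model (the data points are invented; in practice they would come from the dataset under study):

```python
import numpy as np

# Toy data: y depends roughly linearly on x, plus noise (values invented).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# Correlation between the two variables.
r = np.corrcoef(x, y)[0, 1]

# Fit y ≈ slope*x + intercept by least squares; the residuals reflect
# how well the model explains the data.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

print(f"r={r:.3f}, slope={slope:.2f}, intercept={intercept:.2f}")
```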
Python
Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability through the use of significant indentation. Python is dynamically typed and garbage-collected.
NumPy Library
NumPy, which stands for Numerical Python, is a library
consisting of multidimensional array objects and a
collection of routines for processing those arrays. Using
NumPy, mathematical and logical operations on arrays
can be performed.
Operations using NumPy
Using NumPy, a developer can perform the following operations:
Mathematical and logical operations on arrays.
Fourier transforms and routines for shape manipulation.
Operations related to linear algebra. NumPy has in-built functions for linear algebra and random number generation.
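A small runnable sketch of the mathematical and logical operations listed above (the arrays are invented examples):

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

total = a + b            # element-wise addition
product = a * b          # element-wise multiplication
mask = a > 2             # logical operation yields a boolean array
dot = np.dot(a, b)       # linear algebra: dot product

print(total, product, mask, dot)
```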
Converting a Python list to a NumPy array:
numpy_array = np.array(list_to_convert)

Creating an array of zeros (the dtype can be int or float, as required):
a = np.zeros(shape, dtype=type_of_zeros)
e.g. a = np.zeros((3,4), dtype=np.float16)

Similar to np.zeros:
a = np.ones((3,4), dtype=np.int32)

np.full(shape_as_tuple, value_to_fill, dtype=type_you_want)
a = np.full((2,3), 1, dtype=np.float16)
a would be:
array([[1., 1., 1.],
       [1., 1., 1.]], dtype=float16)

np.empty(shape_as_tuple, dtype=int) returns an uninitialized array, so its
contents are arbitrary; a = np.empty((2,2), dtype=np.int16) might produce:
array([[25824, 25701],
       [ 2606,  8224]], dtype=int16)
Getting an array of evenly spaced values with np.arange and np.linspace:

np.linspace(start, stop, num=50, endpoint=bool_value, retstep=bool_value)
e.g. np.linspace(1, 2, num=5, endpoint=False, retstep=True)

np.arange(start=where_to_start, stop=where_to_stop, step=step_size)

Other common operations:

x = np.array([1,2,3])                 # create an array from a list
x = np.ones((3,2,4), dtype=np.int16)  # 3-D array of ones
x = np.ones((2,3), dtype=np.int16)
x.dtype                               # produces dtype('int16')

y = np.array([[1,3],[5,6]])
x = np.copy(y)                        # copy an array

array_name.T                          # transpose of an array
np.dot(matrix1, matrix2)              # matrix multiplication

a = np.array([[1,2,3],[4,8,16]])
z = np.cross(x, y)                    # cross product
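A runnable sketch tying the creation routines above together (the values are arbitrary examples):

```python
import numpy as np

evens = np.arange(0, 10, 2)                # evenly spaced with a step size
pts, step = np.linspace(1, 2, num=5, endpoint=False, retstep=True)

m = np.full((2, 3), 1, dtype=np.float16)   # constant-filled array
prod = np.dot(np.eye(2), np.array([[1, 3], [5, 6]]))  # matrix product

print(evens, pts, step)
```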
Pandas Library
Pandas is a software library written for the Python
programming language for data manipulation and
analysis. In particular, it offers data structures and
operations for manipulating numerical tables and time
series.
Library features:
DataFrame object for data manipulation with integrated indexing.
Tools for reading and writing data between in-memory data structures and different file formats.
Data alignment and integrated handling of missing data.
Reshaping and pivoting of data sets.
Label-based slicing, fancy indexing, and subsetting of large data sets.
Data structure column insertion and deletion.
Group-by engine allowing split-apply-combine operations on data sets.
Data set merging and joining.
Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure.
Time series functionality: date range generation [6] and frequency conversions, moving window statistics, moving window linear regressions, date shifting and lagging.
Provides data filtration.
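A few of the features listed above can be sketched on an invented sample table (the column names and values are illustrative only):

```python
import pandas as pd

# Invented sample table to illustrate filtration, group-by and pivoting.
df = pd.DataFrame({
    "city": ["Delhi", "Delhi", "Pune", "Pune"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [100, 120, 80, 95],
})

recent = df[df["year"] == 2021]                  # data filtration
by_city = df.groupby("city")["sales"].sum()      # split-apply-combine
pivot = df.pivot(index="city", columns="year", values="sales")  # reshaping

print(by_city)
```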
Matplotlib Library
Matplotlib is a library for creating static, animated, and interactive visualizations in Python. With Matplotlib you can:
Embed plots in JupyterLab and Graphical User Interfaces.
Use a rich array of third-party packages built on Matplotlib.
Seaborn Library
Seaborn is a Python data visualization library based on
matplotlib. It provides a high-level interface for drawing
attractive and informative statistical graphics.
It provides beautiful default styles and color palettes to make statistical plots more attractive. It is built on top of the matplotlib library and is closely integrated with the data structures from pandas.
Seaborn aims to make visualization the central part of exploring and understanding data. It provides dataset-oriented APIs, so that we can switch between different visual representations of the same variables for a better understanding of the dataset.
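A minimal sketch of Seaborn's dataset-oriented API (the sample data is invented; the Agg backend is used so no display is required):

```python
import matplotlib
matplotlib.use("Agg")          # render off-screen, no display needed
import pandas as pd
import seaborn as sns

# Invented sample data for illustration.
df = pd.DataFrame({
    "day": ["Mon", "Tue", "Wed", "Thu"],
    "temperature": [21.5, 23.0, 19.8, 25.1],
})

ax = sns.barplot(data=df, x="day", y="temperature")  # dataset-oriented API
ax.set_title("Temperature by day")
ax.figure.savefig("temps.png")
```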
The weather data set is a time-series data set with per-hour information about the weather conditions at a particular location. It records temperature, dew point temperature, relative humidity, visibility, wind speed, pressure, and conditions. The data is available as a CSV file. We are going to analyse this data using a pandas data frame.
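Since the CSV itself is not reproduced here, the analysis can be sketched on a tiny stand-in frame; the exact column names below are assumptions, chosen to match the fields described above:

```python
import pandas as pd

# In the report the data comes from a CSV, e.g.:
#   weather = pd.read_csv("weather.csv")
# The file isn't included here, so a tiny stand-in frame with assumed
# column names is used instead.
weather = pd.DataFrame({
    "Temp_C": [-1.8, -1.8, 0.1, 2.6],
    "Rel Hum_%": [86, 87, 69, 64],
    "Wind Speed_km/h": [4, 4, 7, 11],
    "Weather": ["Fog", "Fog", "Clear", "Cloudy"],
})

# Typical exploratory operations on such a dataset:
print(weather["Weather"].unique())          # distinct weather conditions
print(weather["Weather"].value_counts())    # how often each occurs
clear_hours = weather[weather["Weather"] == "Clear"]
mean_temp = weather["Temp_C"].mean()
```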
COVID-19 Dataset Analysis With Python
This data is available as a CSV file, downloaded from Kaggle. We will analyse this data using a pandas data frame.
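As the Kaggle CSV is not reproduced in this text, here is a hedged sketch of the analysis on an invented stand-in; the column names (State, Confirmed, Deaths) are assumptions for illustration only:

```python
import pandas as pd

# The report would load the Kaggle CSV with pd.read_csv(...); a small
# invented stand-in with assumed column names sketches the analysis.
covid = pd.DataFrame({
    "State": ["Punjab", "Punjab", "Kerala", "Kerala"],
    "Confirmed": [100, 150, 200, 260],
    "Deaths": [2, 3, 1, 2],
})

totals_by_state = covid.groupby("State")[["Confirmed", "Deaths"]].max()
worst = totals_by_state["Confirmed"].idxmax()
print(totals_by_state)
```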
IPL 2008-2020 Dataset Analysis With
Python
Data is taken from Kaggle and contains ball-by-ball information from IPL 2008 to IPL 2020. We are going to analyse this data using a pandas data frame.
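A hedged sketch of how ball-by-ball data is typically analysed with pandas; the column names (batsman, batsman_runs, bowling_team) and the rows are assumptions invented for illustration:

```python
import pandas as pd

# Invented stand-in for a ball-by-ball IPL table; column names assumed.
balls = pd.DataFrame({
    "batsman": ["V Kohli", "V Kohli", "MS Dhoni", "MS Dhoni", "MS Dhoni"],
    "batsman_runs": [4, 6, 1, 6, 0],
    "bowling_team": ["CSK", "CSK", "RCB", "RCB", "RCB"],
})

runs_by_batsman = balls.groupby("batsman")["batsman_runs"].sum()
top_scorer = runs_by_batsman.idxmax()          # highest run total
sixes = balls[balls["batsman_runs"] == 6]      # filter: all sixes
```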
Netflix Dataset Analysis With Python
This Netflix dataset has information about the TV shows and movies available on Netflix up to 2021. The dataset is available on the Kaggle website for free.
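The kinds of questions asked of this dataset can be sketched on an invented stand-in frame; the column names (type, title, release_year) are assumptions for illustration:

```python
import pandas as pd

# Invented stand-in rows with assumed column names.
netflix = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "title": ["A", "B", "C", "D"],
    "release_year": [2019, 2020, 2021, 2021],
})

counts = netflix["type"].value_counts()            # movies vs TV shows
recent = netflix[netflix["release_year"] == 2021]  # titles from 2021
```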
Census Dataset Analysis 2011 With
Python
The data used here is from the 2011 Census of India, for each district. This data is available as a CSV file, downloaded from Kaggle.
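A hedged sketch of district-level census analysis; the column names (District, State, Population, Literate) and all values below are invented for illustration:

```python
import pandas as pd

# Invented stand-in for a district-level census table; columns assumed.
census = pd.DataFrame({
    "District": ["Jalandhar", "Kapurthala", "Pune"],
    "State": ["Punjab", "Punjab", "Maharashtra"],
    "Population": [2_000_000, 800_000, 9_400_000],
    "Literate": [1_500_000, 550_000, 6_700_000],
})

census["Literacy_Rate"] = census["Literate"] / census["Population"] * 100
state_pop = census.groupby("State")["Population"].sum()
```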
Data Visualisation Techniques
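A minimal sketch of three common visualisation techniques in Matplotlib (the data is invented, and the Agg backend is used so no display is required):

```python
import matplotlib
matplotlib.use("Agg")              # off-screen rendering
import matplotlib.pyplot as plt
import numpy as np

# A few common visualisation techniques on invented data.
x = np.arange(1, 6)
y = np.array([2, 3, 5, 4, 6])

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(x, y)                 # line plot: trends over an ordered axis
axes[0].set_title("Line")
axes[1].bar(x, y)                  # bar chart: comparing categories
axes[1].set_title("Bar")
axes[2].hist(y, bins=3)            # histogram: distribution of values
axes[2].set_title("Histogram")
fig.savefig("visualisations.png")
```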
Page | 45
Page | 46
Page | 47
Page | 48
Page | 49
Page | 50
Page | 51