Programming For Data Science

UNIT 3

DATA SCIENCE USING PYTHON


Data Science Packages: NumPy, SciPy, pandas - Building models and evaluation
with Scikit-learn - Data Loading, Storage and File Formats - Data Wrangling: Clean,
Transform, Merge, Reshape - Plotting and Visualization - Data Aggregation and
Group Operations - Time Series - The Jupyter and PyDev development
environments - Neural Network Basics - Data Exploration in Python - Statistical
Methods for Evaluation using R - Visualization using Python - Building models
and evaluation with Scikit-learn

Data Science Packages: NumPy, SciPy, pandas


NumPy :

NumPy, short for Numerical Python, has long been a cornerstone of numerical computing
in Python. It provides the data structures, algorithms, and library glue needed for most
scientific applications involving numerical data in Python. NumPy contains, among other
things:

• A fast and efficient multidimensional array object ndarray

• Functions for performing element-wise computations with arrays or mathematical operations between arrays

• Tools for reading and writing array-based datasets to disk

• Linear algebra operations, Fourier transform, and random number generation

• A mature C API to enable Python extensions and native C or C++ code to access NumPy's data structures and computational facilities

Beyond the fast array-processing capabilities that NumPy adds to Python, one of its primary
uses in data analysis is as a container for data to be passed between algorithms and libraries.
For numerical data, NumPy arrays are more efficient for storing and manipulating data than
the other built-in Python data structures. Also, libraries written in a lower-level language,
such as C or FORTRAN, can operate on the data stored in a NumPy array without copying
data into some other memory representation. Thus, many numerical computing tools for
Python either assume NumPy arrays as a primary data structure or else target interoperability
with NumPy.
NumPy Library

NumPy is a Python library used for scientific computing. This is Python's scientific
computing core library, providing high-performance multidimensional array objects, tools for
manipulating those arrays, and various mathematical functions. It also contains useful linear
algebra, Fourier transform, and random number capabilities.

NumPy is a scientific computing library for Python. It provides powerful tools for
manipulating and analyzing numerical data. It is used for array-based computations, linear
algebra, Fourier transforms, random number generation, and more.

Example program for NumPy: Here's a short program that creates two random 3x3 matrices, multiplies them, and prints the result:

import numpy as np

# Create random matrices


mat1 = np.random.rand(3, 3)
mat2 = np.random.rand(3, 3)

# Print the matrices


print("Matrix 1:")
print(mat1)
print("\nMatrix 2:")
print(mat2)

# Multiply the matrices


result = np.matmul(mat1, mat2)

# Print the result


print("\nResult of matrix multiplication:")
print(result)

output:
Matrix 1:
[[0.9317527 0.8167975 0.44996549]
[0.33871151 0.02973976 0.93414816]
[0.2228187 0.84413553 0.64904411]]

Matrix 2:
[[0.46360681 0.26930509 0.81966568]
[0.9149686 0.61693763 0.24072468]
[0.00164111 0.46880786 0.67252377]]

Result of matrix multiplication:


[[1.18004941 0.96578622 1.26296152]
[0.18577295 0.54750031 0.91302613]
[0.87672293 0.88506217 0.8223387 ]]
Benefits of Using NumPy Library

1. Easy to use: NumPy is very easy to use, and its syntax is simple, making it easier to
code.

2. Speed: NumPy is very fast as it uses highly optimized C and Fortran libraries under
the hood.

3. Memory efficiency: NumPy is very memory efficient as it stores data in a compact form and uses less memory compared to other libraries.

4. Compatibility: NumPy is compatible with many other libraries such as SciPy, Scikit-
learn, Matplotlib, etc.

5. Array broadcasting: Array broadcasting allows you to perform operations on arrays of different shapes. This helps in writing efficient and concise code (a short sketch follows this list).

6. Math library: NumPy has an extensive math library that provides many mathematical functions such as trigonometric functions, logarithms, etc.

7. Linear algebra support: NumPy supports linear algebra operations such as matrix multiplication, vector operations, etc.
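
As a quick, hedged illustration of array broadcasting (benefit 5), the values below are made up; the one-dimensional array is stretched across every row of the matrix without an explicit loop:

import numpy as np

# A 3x3 matrix and a length-3 vector of offsets
matrix = np.arange(9).reshape(3, 3)
offsets = np.array([10, 20, 30])

# Broadcasting adds the vector to every row of the matrix
shifted = matrix + offsets
print(shifted)
# [[10 21 32]
#  [13 24 35]
#  [16 27 38]]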

Common Use Cases for NumPy:

 Scientific computing and data analysis


 Linear algebra
 Random number generation

pandas :

pandas provides high-level data structures and functions designed to make working with
structured or tabular data intuitive and flexible. pandas blends the array-computing ideas of
NumPy with the kinds of data manipulation capabilities found in spreadsheets and relational
databases (such as SQL). It provides convenient indexing functionality to enable you to
reshape, slice and dice, perform aggregations, and select subsets of data. Since data
manipulation, preparation, and cleaning are such important skills in data analysis, pandas is
one of the main tools used throughout this unit.

Pandas is an open-source library for data analysis and manipulation. It provides a wide
range of data structures and tools for working with data. It is designed for easy data
wrangling and manipulation and can be used for a variety of tasks such as data cleaning, data
analysis, data visualization, and more. Pandas can be used for data analysis in Python and
other languages such as R and Julia.

Python example: Here is an example of using pandas to read a CSV file and display the data as a table:

import pandas as pd

# Read in CSV file
df = pd.read_csv('example.csv')

# Display data as a table
print(df)

Summary:

Pandas: Pandas is a data-analysis library that provides high-level data structures and robust
data analysis tools. It is used for data wrangling, cleaning, and preparation. It is designed to
make data manipulation and analysis easy and intuitive.

Common Use Cases for Pandas:

 Data wrangling, cleaning, and preparation


 Data analysis and exploration
 Time series analysis

SciPy

SciPy (Scientific Python) is another free and open-source Python library for data science that
is extensively used for high-level scientific and technical computations, because it extends
NumPy and provides many user-friendly and efficient routines for scientific calculations.

Features:

 Collection of algorithms and functions built on the NumPy extension of Python

 High-level commands for data manipulation and visualization

 Multidimensional image processing with the SciPy ndimage submodule

 Includes built-in functions for solving differential equations

Applications:

 Multidimensional image operations

 Solving differential equations and the Fourier transform

 Optimization algorithms and Linear algebra

SciPy
SciPy is a collection of packages addressing a number of foundational problems in scientific
computing. Here are some of the tools it contains in its various modules:

scipy.integrate

Numerical integration routines and differential equation solvers

scipy.linalg

Linear algebra routines and matrix decompositions extending beyond those provided in numpy.linalg

scipy.optimize

Function optimizers (minimizers) and root finding algorithms

scipy.signal

Signal processing tools

scipy.sparse

Sparse matrices and sparse linear system solvers

scipy.special

Wrapper around SPECFUN, a FORTRAN library implementing many common mathematical functions, such as the gamma function

scipy.stats

Standard continuous and discrete probability distributions (density functions, samplers, continuous distribution functions), various statistical tests, and more descriptive statistics

Together, NumPy and SciPy form a reasonably complete and mature computational
foundation for many traditional scientific computing applications.

Here's a simple Python script demonstrating the usage of SciPy:

import numpy as np
from scipy import optimize, integrate, stats

# Optimization example
def rosen(x):
    return sum(100.0*(x[1:]-x[:-1]**2.0)**2.0 + (1-x[:-1])**2.0)

initial_guess = np.array([1.3, 0.7, 0.8, 1.9, 1.2])


result = optimize.minimize(rosen, initial_guess)
print("Optimized minimum value:", result.fun)
print("Optimized parameters:", result.x)
# Integration example
result_integration, error = integrate.quad(lambda x: np.exp(-x**2), -np.inf, np.inf)
print("\nResult of integration:", result_integration)

# Statistical functions example


data = np.random.normal(loc=0, scale=1, size=1000)
mean, std_dev = stats.norm.fit(data)
print("\nEstimated mean of data:", mean)
print("Estimated standard deviation of data:", std_dev)

output:

Optimized minimum value: 4.581791550738518e-11


Optimized parameters: [0.99999925 0.99999852 0.99999706 0.99999416
0.99998833]

Result of integration: 1.7724538509055159

Estimated mean of data: -0.032565889512581714


Estimated standard deviation of data: 1.0165728551628443

>>FOR MORE REFERENCE ABOUT NUMPY, PANDAS, SCIPY

Building models and evaluation with Scikit-learn:


What is Scikit-Learn (Sklearn)
Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It
provides a selection of efficient tools for machine learning and statistical modeling, including
classification, regression, clustering, and dimensionality reduction, via a consistent interface
in Python. This library, which is largely written in Python, is built upon NumPy,
SciPy, and Matplotlib.
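
As a minimal, hedged sketch of the model-building workflow (the bundled Iris dataset and the k-nearest-neighbours classifier are chosen purely for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load a small built-in dataset
X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a simple classifier and evaluate it on the held-out data
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))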

>>FOR MORE REFERENCE ABOUT SCIKIT-LEARN

>>FOR MORE REFERENCE ABOUT LEARNING MODEL BUILDING IN SCIKIT-LEARN

Evaluation with Scikit-learn in Python


To evaluate machine learning models with Scikit-learn in Python, you typically use various
metrics and techniques depending on the type of problem you're dealing with (classification,
regression, clustering, etc.). Here's a general overview of how you can evaluate models using
Scikit-learn:

1. Splitting Data: First, you'll need to split your dataset into training and testing sets using the train_test_split function.

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
2.Model Training: Train your machine learning model using the training data.

from sklearn.svm import SVC

model = SVC()
model.fit(X_train, y_train)

3. Model Evaluation:

For classification problems, you can use metrics like accuracy, precision, recall, F1-score,
ROC-AUC, etc.

For regression problems, metrics like mean squared error (MSE), mean absolute error
(MAE), R-squared, etc., are commonly used.

Here's an example of evaluating a classification model:

from sklearn.metrics import accuracy_score, classification_report

# Predictions
y_pred = model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Classification Report
print("Classification Report:")
print(classification_report(y_test, y_pred))

For regression, the process is similar, but you'll use different evaluation metrics:

from sklearn.metrics import mean_squared_error, r2_score


# Predictions
y_pred = model.predict(X_test)
# Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
# R-squared
r2 = r2_score(y_test, y_pred)
print("R-squared:", r2)
4. Cross-validation: It's often good practice to perform cross-validation to get a better estimate of the model's performance. Scikit-learn provides functions like cross_val_score for this purpose.

from sklearn.model_selection import cross_val_score


scores = cross_val_score(model, X_train, y_train, cv=5) # 5-fold cross-validation
print("Cross-Validation Scores:", scores)
print("Mean CV Score:", scores.mean())

These are some basic steps to evaluate machine learning models using Scikit-learn in Python.
Depending on your specific problem and requirements, you might need to delve deeper into
specific metrics and techniques.

Data Loading, Storage and File Formats:


Data loading, storage, and file formats are essential aspects of working with data in Python.
These tasks involve reading data from various sources, storing it in memory or on disk, and
handling different file formats.

Reading data and making it accessible (often called data loading) is a necessary first step for
using most of the tools described here. The term parsing is also sometimes used to describe
loading text data and interpreting it as tables and different data types. We will focus on
data input and output using pandas, though there are numerous tools in other libraries to help
with reading and writing data in various formats.

Input and output typically fall into a few main categories: reading text files and other more
efficient on-disk formats, loading data from databases, and interacting with network sources
like web APIs.

1.Data Loading:
Loading data refers to the process of bringing data into memory from external sources such
as files, databases, or APIs. In Python, several libraries facilitate data loading, including:

Pandas: Pandas is a powerful library for data manipulation and analysis. It provides
functions like read_csv, read_excel, read_sql, etc., to read data from CSV files, Excel files,
SQL databases, etc.

 NumPy: NumPy also offers functions like loadtxt and genfromtxt to load data from
text files into NumPy arrays.
 JSON and CSV modules: Python's built-in json and csv modules are handy for
loading data from JSON and CSV files, respectively.
 Requests: When dealing with web APIs, the requests library is commonly used to
fetch data from remote servers.
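
A hedged sketch of a few of the loaders listed above; the file names and URL are placeholders rather than real resources:

import csv
import json

import numpy as np
import pandas as pd
import requests

# pandas: tabular data from a CSV file (placeholder file name)
df = pd.read_csv('example.csv')

# Built-in csv module: raw rows as lists of strings (placeholder file name)
with open('example.csv', newline='') as fh:
    rows = list(csv.reader(fh))

# Built-in json module: nested data from a JSON file (placeholder file name)
with open('config.json') as fh:
    config = json.load(fh)

# NumPy: plain numeric text file into an array (placeholder file name)
arr = np.loadtxt('numbers.txt')

# requests: fetch JSON from a web API (placeholder URL)
payload = requests.get('https://api.example.com/data').json()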

2.Data Storage:
Data storage involves saving data to disk or a database for later retrieval. Common storage
options in Python include:
 File Systems: Data can be stored in files on disk, using various file formats like CSV,
Excel, JSON, HDF5, etc.

 Databases: Python supports interacting with relational databases (e.g., SQLite,


MySQL, PostgreSQL) and NoSQL databases (e.g., MongoDB) using libraries like
SQLAlchemy, sqlite3, pymongo, etc.
 Cloud Storage: Data can be stored on cloud platforms like Amazon S3, Google Cloud
Storage, or Azure Blob Storage using respective Python SDKs.
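
As one hedged sketch of the database option, the standard-library sqlite3 module together with pandas can round-trip a small table; the database file and table names here are illustrative:

import sqlite3
import pandas as pd

# Create (or open) a local SQLite database file
conn = sqlite3.connect('example.db')

# Store a small DataFrame as a table
df = pd.DataFrame({'id': [1, 2, 3], 'value': [10.5, 23.1, 7.8]})
df.to_sql('measurements', conn, if_exists='replace', index=False)

# Read it back with an SQL query
restored = pd.read_sql('SELECT * FROM measurements', conn)
print(restored)

conn.close()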

3.File Formats:
File formats define the structure and encoding of data when stored in files. Common file
formats used in Python include:

 CSV (Comma-Separated Values): Simple and widely used for tabular data.

 JSON (JavaScript Object Notation): Lightweight and human-readable format for


storing and transmitting data.

 Excel: Popular spreadsheet format supported by libraries like Pandas and openpyxl.

 HDF5 (Hierarchical Data Format version 5): Designed for storing and organizing
large amounts of numerical data.

 Parquet, Avro, ORC: Columnar file formats commonly used in big data processing
frameworks like Apache Spark.

 XML (eXtensible Markup Language): Semi-structured data format often used for
configuration files or web services.

 YAML (YAML Ain't Markup Language): Human-readable data serialization


format similar to JSON, often used in configuration files.
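
The hedged sketch below writes one small DataFrame to a few of these formats and reads the CSV back; the file names are placeholders, and the Excel line assumes an engine such as openpyxl is installed:

import pandas as pd

df = pd.DataFrame({'name': ['a', 'b'], 'score': [1.0, 2.5]})

# Write the same data in several formats
df.to_csv('scores.csv', index=False)
df.to_json('scores.json', orient='records')
df.to_excel('scores.xlsx', index=False)   # requires an Excel engine such as openpyxl

# Read one of them back
round_trip = pd.read_csv('scores.csv')
print(round_trip)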

In summary, Python provides a rich ecosystem of libraries and tools for efficiently loading,
storing, and working with data from various sources and in different formats, empowering
data scientists, engineers, and analysts to handle diverse data requirements effectively.

>>Data Loading, Storage and file format

Data Wrangling: Clean, Transform, Merge, Reshape - Plotting and Visualization
Data Wrangling in Python:
Data Wrangling is the process of gathering, collecting, and transforming raw data into
another format for better understanding, decision-making, accessing, and analysis in
less time. Data Wrangling is also known as Data Munging.
Importance Of Data Wrangling:
Data Organization: Data wrangling involves organizing raw data from various sources,
ensuring it's structured and ready for analysis.
Relevance Filtering: By wrangling data, irrelevant or redundant information can be filtered
out, allowing focus on what's important for users, such as top-selling books in specific genres
like motivation.
Enhanced User Experience: Well-wrangled data enables the presentation of tailored
recommendations to users, improving their experience on the platform.
Informed Decision-Making: Data wrangling helps in extracting insights and patterns,
empowering users to make informed decisions based on factors like sales, ratings, or bundled
book packages.
Optimized Offerings: Through data wrangling, the platform can continuously optimize its
offerings by analyzing user preferences and behaviors, thus staying competitive in the
market.

Data Wrangling in Python:


Data Wrangling is a crucial topic for Data Science and Data Analysis. Pandas Framework
of Python is used for Data Wrangling. Pandas is an open-source library
in Python specifically developed for Data Analysis and Data Science. It is used for
processes like data sorting or filtration, Data grouping, etc.
Data wrangling in Python deals with the below functionalities:
1. Data exploration: In this process, the data is studied, analyzed, and understood by
visualizing representations of data.
2. Dealing with missing values: Most large datasets contain missing (NaN) values; these
need to be handled by replacing them with the mean, mode, or most frequent value of the
column, or simply by dropping the rows that contain them.
3. Reshaping data: In this process, data is manipulated according to the requirements,
where new data can be added or pre-existing data can be modified.
4. Filtering data: Sometimes datasets contain unwanted rows or columns which need to be
removed or filtered out.
5. Other: After applying the above functionalities to the raw dataset, we get an efficient
dataset that meets our requirements, which can then be used for purposes such as data
analysis, machine learning, data visualization, model training, etc.

Below are example of Data Wrangling that implements the above functionalities on a
raw dataset:
Data exploration in Python
Here in Data exploration, we load the data into a dataframe, and then we visualize the data
in a tabular format.
# Import pandas package
import pandas as pd
import numpy as np

# Assign data
data = {'Name': ['Jai', 'Princi', 'Gaurav',
                 'Anuj', 'Ravi', 'Natasha', 'Riya'],
        'Age': [17, 17, 18, 17, 18, 17, 17],
        'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
        'Marks': [90, 76, np.nan, 74, 65, np.nan, 71]}

# Convert into DataFrame
df = pd.DataFrame(data)

# Display data
print(df)

Output:

Summary:
Data Wrangling:

Data wrangling involves preparing raw data for analysis. It consists of several key steps:

Cleaning Data:
 Identify and handle missing data.
 Remove duplicates.
 Correct inconsistent data formats or values.
 Address outliers or anomalies.
Examples:
Cleaning Data:

 Removing or correcting missing values, outliers, or inconsistent data.


Example: Removing rows with missing values from a dataset.
import pandas as pd

# Load data
data = pd.read_csv('data.csv')

# Remove rows with missing values


cleaned_data = data.dropna()

Output: cleaned_data will contain the dataset with rows containing missing values removed.
Transforming Data:

 Convert data types (e.g., from strings to numerical values).


 Standardize data formats.
 Normalize numerical data.
 Create new features or variables based on existing ones.
Example:
Transforming Data:

 Converting data into a suitable format for analysis.


Example: Converting categorical variables into numerical representations.

# Transform categorical variables using one-hot encoding


transformed_data = pd.get_dummies(cleaned_data, columns=['category'])

Output: transformed_data will contain the dataset with categorical variables converted
into numerical representations using one-hot encoding.
Merging Data:

 Combine data from different sources based on common variables or indices.


 Perform inner, outer, left, or right joins to merge datasets.
 Handle merge conflicts or missing values during merging.
Example:
Merging Data:

 Combining multiple datasets based on common keys or indices.


Example: Merging two datasets based on a common column.
# Load second dataset
data2 = pd.read_csv('data2.csv')

# Merge datasets based on 'id' column


merged_data = pd.merge(cleaned_data, data2, on='id')

Output: merged_data will contain the merged dataset based on a common column ('id' in this
example).
Reshaping Data:
 Pivot data from wide to long format or vice versa.
 Stack or unstack data.
 Melt or reshape data to fit specific analysis or visualization requirements.

Example:
Reshaping Data:

 Restructuring data into a different layout or form.


Example: Pivoting a DataFrame.
# Pivot the DataFrame
reshaped_data = merged_data.pivot(index='date', columns='category', values='value')

Output: reshaped_data will contain the pivoted DataFrame, restructured based on the
specified index, columns, and values.
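
The melt (wide-to-long) direction mentioned above works in the opposite sense to pivot; a hedged sketch on a made-up wide table:

import pandas as pd

# Wide layout: one column per subject
wide = pd.DataFrame({
    'student': ['Jai', 'Princi'],
    'maths': [90, 76],
    'science': [85, 80]
})

# Melt into long layout: one row per (student, subject) pair
long_form = wide.melt(id_vars='student', var_name='subject', value_name='marks')
print(long_form)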

>>for more reference about data wrangling

Plotting and Visualization:


Python provides various libraries that come with different features for visualizing
data. All these libraries come with different features and can support various types of
graphs. In this tutorial, we will be discussing four such libraries.
 Matplotlib
 Seaborn
 Bokeh
 Plotly

Note: Plotting Data: Visualizing data using plots such as line plots, scatter plots, histograms, etc.
Data Visualization: Enhancing plots with labels, titles, legends, etc., for better interpretation.
Matplotlib
Matplotlib is an easy-to-use, low-level data visualization library that is built on NumPy
arrays. It consists of various plots like scatter plot, line plot, histogram, etc. Matplotlib
provides a lot of flexibility.
To install this type the below command in the terminal.

pip install matplotlib

After installing Matplotlib, let's see the most commonly used plots using this library.

Scatter Plot
Scatter plots are used to observe relationships between variables and use dots to represent
the relationship between them. The scatter() method in the matplotlib library is used to
draw a scatter plot.

import pandas as pd
import matplotlib.pyplot as plt

# reading the database


data = pd.read_csv("tips.csv")

# Scatter plot with day against tip


plt.scatter(data['day'], data['tip'])

# Adding Title to the Plot


plt.title("Scatter Plot")

# Setting the X and Y labels


plt.xlabel('Day')
plt.ylabel('Tip')

plt.show()

Output:
This graph can be more meaningful if we add colors and also change the size of the points.
We can do this by using the c and s parameters of the scatter() function, respectively.
We can also show the color bar using the colorbar() method, as in the sketch below.
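
A hedged sketch of that idea, assuming the same tips.csv file with total_bill and size columns (as in the standard tips dataset):

import pandas as pd
import matplotlib.pyplot as plt

# reading the database
data = pd.read_csv("tips.csv")

# Colour each point by the bill amount and scale the marker by the party size
plt.scatter(data['day'], data['tip'], c=data['total_bill'], s=data['size'] * 20)

# Show the colour scale
plt.colorbar()

plt.title("Scatter Plot with colour and size")
plt.xlabel('Day')
plt.ylabel('Tip')

plt.show()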

Line Chart

A line chart is used to represent a relationship between two sets of data, X and Y, on
different axes. It is plotted using the plot() function. Let's see the below example.
import pandas as pd
import matplotlib.pyplot as plt

# reading the database


data = pd.read_csv("/content/tips.csv")

# Line plot of the tip and size columns


plt.plot(data['tip'])
plt.plot(data['size'])

# Adding Title to the Plot


plt.title("Line chart")

# Setting the X and Y labels


plt.xlabel('Day')
plt.ylabel('Tip')
plt.show()

output:
Seaborn
Seaborn is a high-level interface built on top of the Matplotlib. It provides beautiful design
styles and color palettes to make more attractive graphs.
To install seaborn type the below command in the terminal.
pip install seaborn
Seaborn is built on top of Matplotlib, therefore it can be used together with Matplotlib as
well. Using both Matplotlib and Seaborn together is a very simple process: we just invoke
the Seaborn plotting function as normal, and then we can use Matplotlib's customization
functions.
Note: Seaborn comes loaded with dataset such as tips, iris, etc. but for the sake of this
tutorial we will use Pandas for loading these datasets.
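
A hedged sketch of combining the two libraries, again assuming the tips.csv file used earlier (with a sex column, as in the standard tips dataset):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# reading the database used in the earlier examples
data = pd.read_csv("tips.csv")

# Seaborn plotting call, followed by Matplotlib customisation
sns.scatterplot(x='day', y='tip', hue='sex', data=data)
plt.title("Scatter Plot drawn with Seaborn")
plt.show()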

Bokeh
Let's move on to the third library on our list. Bokeh is mainly famous for its interactive
chart visualizations. Bokeh renders its plots using HTML and JavaScript in modern web
browsers, presenting elegant and concise construction of novel graphics with high-level
interactivity.
To install this type the below command in the terminal.
pip install bokeh

Interactive Data Visualization

One of the key features of Bokeh is the ability to add interaction to the plots. Let's see
various interactions that can be added.
Interactive Legends
The click_policy property makes the legend interactive. There are two types of interactivity:
 Hiding: Hides the glyphs.
 Muting: Whereas hiding makes the glyph vanish completely, muting just de-emphasizes the
glyph based on the parameters. (A sketch follows this list.)
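
A minimal, hedged sketch of an interactive legend, assuming a recent Bokeh version and made-up line data:

from bokeh.plotting import figure, show

# Two simple lines with legend entries
p = figure(title="Interactive legend example")
p.line([1, 2, 3, 4], [1, 4, 9, 16], legend_label="squares", color="blue")
p.line([1, 2, 3, 4], [1, 2, 3, 4], legend_label="linear", color="green")

# Clicking a legend entry hides the corresponding glyph;
# use "mute" instead of "hide" for the muting behaviour
p.legend.click_policy = "hide"

show(p)  # renders the plot as an HTML document in the browser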

>>for more reference about plotting and visualization


DATA AGGREGATION AND GROUP OPERATIONS
Categorizing a dataset and applying a function to each group, whether an aggregation or
transformation, is often a critical component of a data analysis workflow. After loading,
merging, and preparing a dataset, you may need to compute group statistics or possibly pivot
tables for reporting or visualization purposes. pandas provides a flexible groupby interface,
enabling you to slice, dice, and summarize datasets in a natural way.
One reason for the popularity of relational databases and SQL (which stands for “structured
query language”) is the ease with which data can be joined, filtered, transformed, and
aggregated. However, query languages like SQL are somewhat constrained in the kinds of
group operations that can be performed. As you will see, with the expressiveness of Python
and pandas, we can perform quite complex group operations by utilizing any function that
accepts a pandas object or NumPy array. In this chapter, you will learn how to:
 Split a pandas object into pieces using one or more keys (in the form of functions,
arrays, or DataFrame column names)
 Calculate group summary statistics, like count, mean, or standard deviation, or a user-
defined function
 Apply within-group transformations or other manipulations, like normalization, linear
regression, rank, or subset selection
 Compute pivot tables and cross-tabulations
 Perform quantile analysis and other statistical group analyses
Aggregation:
Aggregation in pandas provides various functions that perform a mathematical or logical
operation on our dataset and returns a summary of that function. Aggregation can be used to
get a summary of columns in our dataset like getting sum, minimum, maximum, etc. from a
particular column of our dataset. The function used for aggregation is agg(); its parameter is
the function we want to perform.
Some functions commonly used in aggregation are sum(), min(), max(), mean(), count(), and describe().
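
The examples that follow assume a small DataFrame named df has already been created; one hedged way to build such a frame (the column names and marks are made up for illustration) is:

import pandas as pd

# Example dataset with numeric columns so that sum/min/max are meaningful
df = pd.DataFrame({
    'Maths': [90, 76, 88, 74, 65],
    'Science': [85, 80, 92, 70, 60],
    'English': [78, 82, 75, 88, 69]
})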

Examples:
 The sum() function is used to calculate the sum of every value.

df.sum()

Output:
 The describe() function is used to get a summary of our dataset

df.describe()

Output:

 We used agg() function to calculate the sum, min, and max of each column in
our dataset.

df.agg(['sum', 'min', 'max'])

Output:

Grouping Operations:
Grouping is used to group data using some criteria from our dataset. It follows the
split-apply-combine strategy:
 Splitting the data into groups based on some criteria.
 Applying a function to each group independently.
 Combining the results into a data structure.
Examples:
We use the groupby() function to group the data on the "Maths" column. It returns a
GroupBy object as the result.

df.groupby(by=['Maths'])

Output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000012581821388>
Applying the groupby() function groups the data on the "Maths" column. To view the first
entry of each formed group, use the first() function.

a = df.groupby('Maths')
a.first()

Output:

First we group based on "Maths", then within each group we group again based on "Science".

b = df.groupby(['Maths', 'Science'])
b.first()

Output:

TIME SERIES
What are time series visualization and analytics?
Time series visualization and analytics empower users to graphically represent time-based
data, enabling the identification of trends and the tracking of changes over different
periods. This data can be presented through various formats, such as line graphs, gauges,
tables, and more.
The utilization of time series visualization and analytics facilitates the extraction of insights
from data, enabling the generation of forecasts and a comprehensive understanding of the
information at hand. Organizations find substantial value in time series data as it allows
them to analyze both real-time and historical metrics.
What is Time Series Data?
Time series data is a sequential arrangement of data points organized in consecutive time
order. Time-series analysis consists of methods for analyzing time-series data to extract
meaningful insights and other valuable characteristics of the data.
Importance of time series analysis
Time-series data analysis is becoming very important in so many industries, like financial
industries, pharmaceuticals, social media companies, web service providers, research, and
many more. To understand the time-series data, visualization of the data is essential. In
fact, any type of data analysis is not complete without visualizations, because one good
visualization can provide meaningful and interesting insights into the data.
Basic Time Series Concepts
 Trend: A trend represents the general direction in which a time series is moving
over an extended period. It indicates whether the values are increasing,
decreasing, or staying relatively constant.
 Seasonality: Seasonality refers to recurring patterns or cycles that occur at
regular intervals within a time series, often corresponding to specific time units
like days, weeks, months, or seasons.
 Moving average: The moving average method is a common technique used in
time series analysis to smooth out short-term fluctuations and highlight longer-
term trends or patterns in the data. It involves calculating the average of a set of
consecutive data points, referred to as a “window” or “rolling window,” as it
moves through the time series
 Noise: Noise, or random fluctuations, represents the irregular and unpredictable
components in a time series that do not follow a discernible pattern. It introduces
variability that is not attributable to the underlying trend or seasonality.
 Differencing: Differencing is used to take the difference between values at a
specified interval. By default, the interval is one; we can specify different values for plots.
It is the most popular method to remove trends in the data.
 Stationarity: A stationary time series is one whose statistical properties, such as
mean, variance, and autocorrelation, remain constant over time.
 Order: The order of differencing refers to the number of times the time series
data needs to be differenced to achieve stationarity.
 Autocorrelation: Autocorrelation is a statistical method used in time series
analysis to quantify the degree of similarity between a time series and a lagged
version of itself.
 Resampling: Resampling is a technique in time series analysis that involves
changing the frequency of the data observations. It's often used to transform the
data to a different frequency (e.g., from daily to monthly) to reveal patterns or
trends more clearly.
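
The following hedged sketch ties a few of the concepts above (resampling, the moving average, and differencing) to pandas, using a synthetic daily series generated purely for illustration:

import numpy as np
import pandas as pd

# Synthetic daily series: an upward trend plus random noise
dates = pd.date_range("2023-01-01", periods=120, freq="D")
values = np.linspace(10, 50, 120) + np.random.normal(scale=2, size=120)
ts = pd.Series(values, index=dates)

monthly = ts.resample("MS").mean()      # resampling: daily -> monthly (month-start) averages
rolling = ts.rolling(window=7).mean()   # moving average over a 7-day window
differenced = ts.diff()                 # first-order differencing to help remove the trend

print(monthly.head())
print(rolling.tail())
print(differenced.head())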
Types of Time Series Data
Time series data can be broadly classified into two sections:
1. Continuous Time Series Data: Continuous time series data involves measurements or
observations that are recorded at regular intervals, forming a seamless and uninterrupted
sequence. This type of data is characterized by a continuous range of possible values and is
commonly encountered in various domains, including:
 Temperature Data: Continuous recordings of temperature at consistent
intervals (e.g., hourly or daily measurements).
 Stock Market Data: Continuous tracking of stock prices or values throughout
trading hours.
 Sensor Data: Continuous measurements from sensors capturing variables like
pressure, humidity, or air quality.
2. Discrete Time Series Data: Discrete time series data, on the other hand, consists of
measurements or observations that are limited to specific values or categories. Unlike
continuous data, discrete data does not have a continuous range of possible values but
instead comprises distinct and separate data points. Common examples include:
 Count Data: Tracking the number of occurrences or events within a specific
time period.
 Categorical Data: Classifying data into distinct categories or classes (e.g.,
customer segments, product types).
 Binary Data: Recording data with only two possible outcomes or states.
Visualization Approach for Different Data Types:
 Plotting data in a continuous time series can be effectively represented
graphically using line, area, or smooth plots, which offer insights into the
dynamic behavior of the trends being studied.
 To show patterns and distributions within discrete time series data, bar charts,
histograms, and stacked bar plots are frequently utilized. These methods provide
insights into the distribution and frequency of particular occurrences or
categories throughout time.

THE JUPYTER NOTEBOOK


Introduction
The Jupyter Notebook is an open-source web application that you can use to create and share
documents that contain live code, equations, visualizations, and text. Jupyter Notebook is
maintained by the people at Project Jupyter.
Jupyter Notebooks are a spin-off project from the IPython project, which used to have an
IPython Notebook project itself. The name Jupyter comes from the core programming
languages it supports: Julia, Python, and R. Jupyter ships with the IPython kernel, which
allows you to write your programs in Python, but there are currently over 100 other kernels
that you can also use.
Getting Up and Running with Jupyter Notebook
The Jupyter Notebook is not included with Python, so if you want to try it out, you will need
to install Jupyter.
There are many distributions of the Python language. This section will focus on just two of
them for the purposes of installing Jupyter Notebook. The most popular is CPython, which is
the reference version of Python that you can get from its website. It is also assumed that
you are using Python 3.
Installation
If you are using CPython, then you can use a handy tool that comes with Python called pip
to install Jupyter Notebook like this:
Shell
$ pip install jupyter
The next most popular distribution of Python is Anaconda. Anaconda has its own installer
tool called conda that you could use for installing a third-party package. However, Anaconda
comes with many scientific libraries preinstalled, including the Jupyter Notebook, so you
don't actually need to do anything other than install Anaconda itself.
Starting the Jupyter Notebook Server
Now that you have Jupyter installed, let's learn how to use it. To get started, all you need to
do is open up your terminal application and go to a folder of your choice. I recommend using
something like your Documents folder to start out with and create a subfolder there
called Notebooks or something else that is easy to remember.
Then just go to that location in your terminal and run the following command:
Shell

$ jupyter notebook
This will start up Jupyter and your default browser should start (or open a new tab) to the
following URL: http://localhost:8888/tree
Your browser should now look something like this:

Note that right now you are not actually running a Notebook; instead you are just running
the Notebook server. Let's actually create a Notebook now!
Creating a Notebook
Now that you know how to start a Notebook server, you should probably learn how to create
an actual Notebook document.
All you need to do is click on the New button (upper right), and it will open up a list of
choices. On my machine, I happen to have Python 2 and Python 3 installed, so I can create a
Notebook that uses either of these. For simplicity's sake, let's choose Python 3.
Your web page should now look like this:

Naming
You will notice that at the top of the page is the word Untitled. This is the title for the page
and the name of your Notebook. Since that isn't a very descriptive name, let's change it!
Just move your mouse over the word Untitled and click on the text. You should now see an
in-browser dialog titled Rename Notebook. Let's rename this one to Hello Jupyter:
Running Cells
A Notebook's cell defaults to using code whenever you first create one, and that cell uses the
kernel that you chose when you started your Notebook.
In this case, you started yours with Python 3 as your kernel, so that means you can write
Python code in your code cells. Since your initial Notebook has only one empty cell in it, the
Notebook can't really do anything.
Thus, to verify that everything is working as it should, you can add some Python code to the
cell and try running its contents.
Let's try adding the following code to that cell:

Python

print('Hello Jupyter!')
Running a cell means that you will execute the cell's contents. To execute a cell, you can just
select the cell and click the Run button that is in the row of buttons along the top. It's towards
the middle. If you prefer using your keyboard, you can just press Shift + Enter.
When I ran the code above, the output looked like this:

If you have multiple cells in your Notebook, and you run the cells in order, you can share
your variables and imports across cells. This makes it easy to separate out your code into
logical chunks without needing to reimport libraries or recreate variables or functions in
every cell.
When you run a cell, you will notice that there are some square braces next to the word In to
the left of the cell. The square braces will auto fill with a number that indicates the order that
you ran the cells. For example, if you open a fresh Notebook and run the first cell at the top
of the Notebook, the square braces will fill with the number 1.
PyDev development environments

PYDEV DEVELOPMENT ENVIRONMENTS

Python is an interpreted programming language and claims to be a very effective
programming language. Python was developed by Guido van Rossum.
The name Python is based on the TV show called Monty Python's Flying Circus. During
execution the Python source code is translated into bytecode, which is then interpreted by
the Python interpreter. Python source code can also run on the Java Virtual Machine; in this
case you are using Jython.
Key features of Python are:
 high-level data types, as for example extensible lists
 statement grouping is done by indentation instead of brackets
 variable or argument declaration is not necessary
 support for object-oriented, procedural and functional programming styles
Installation
Download Python from http://www.python.org. Download the version 3.3.1 or higher of
Python. If you are using Windows, you can use the native installer for Python.
Eclipse Python plugin
The following assumes that you already have Eclipse installed. For an installation description
of Eclipse, please see Eclipse IDE for Java.
For Python development under Eclipse you can use the PyDev Plugin which is an open source
project.
Configuration of Eclipse
You also have to tell Eclipse the location of your Python installation. Open the
Window → Preferences → PyDev → Interpreter - Python menu.

Press the New button and enter the path to python.exe in your Python installation directory.
For Linux and Mac OS X users this is normally /usr/bin/python.
The result should look like the following.

Your first Python program in Eclipse

Select File → New → Project. Select PyDev → PyDev Project.

Create a new project with the name "de.vogella.python.first". Select your Python version and
your interpreter.

Press Finish.
Select Window → Open Perspective → Other. Select the PyDev perspective.

Select the "src" folder of your project, right-click it and select New → PyDev Module. Create a
module "FirstModule".
Right-click your module and select Run As → Python Run.
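
The content of FirstModule can be as small as the following sketch; after Run As → Python Run, the output appears in the Eclipse console:

# FirstModule.py - a first program to run from the PyDev perspective
print("Hello from PyDev")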

NEURAL NETWORK BASICS


Neural networks extract identifying features from data, lacking pre-programmed
understanding. Network components include neurons, connections, weights, biases,
propagation functions, and a learning rule. Neurons receive inputs, governed by thresholds
and activation functions. Connections involve weights and biases regulating information
transfer. Learning, adjusting weights and biases, occurs in three stages: input computation,
output generation, and iterative refinement enhancing the network's proficiency in diverse
tasks.
These include:
1. The neural network is simulated by a new environment.
2. Then the free parameters of the neural network are changed as a result of this
simulation.
3. The neural network then responds in a new way to the environment because of
the changes in its free parameters.

Importance of Neural Networks


The ability of neural networks to identify patterns, solve intricate puzzles, and adjust to
changing surroundings is essential. Their capacity to learn from data has far-reaching
effects, ranging from revolutionizing technology like natural language processing and self-
driving automobiles to automating decision-making processes and increasing efficiency in
numerous industries. The development of artificial intelligence is largely dependent on
neural networks, which also drive innovation and influence the direction of technology.
How does Neural Networks work?
Let's understand how a neural network works with an example:
Consider a neural network for email classification. The input layer takes features like email
content, sender information, and subject. These inputs, multiplied by adjusted weights, pass
through hidden layers. The network, through training, learns to recognize patterns
indicating whether an email is spam or not. The output layer, with a binary activation
function, predicts whether the email is spam (1) or not (0). As the network iteratively
refines its weights through backpropagation, it becomes adept at distinguishing between
spam and legitimate emails, showcasing the practicality of neural networks in real-world
applications like email filtering.
Working of a Neural Network
Neural networks are complex systems that mimic some features of the functioning of the
human brain. It is composed of an input layer, one or more hidden layers, and an output
layer made up of layers of artificial neurons that are coupled. The two stages of the basic
process are called backpropagation and forward propagation.

Forward Propagation
 Input Layer: Each feature in the input layer is represented by a node on the
network, which receives input data.
 Weights and Connections: The weight of each neuronal connection indicates
how strong the connection is. Throughout training, these weights are changed.
 Hidden Layers: Each hidden layer neuron processes inputs by multiplying them
by weights, adding them up, and then passing them through an activation
function. By doing this, non-linearity is introduced, enabling the network to
recognize intricate patterns.
 Output: The final result is produced by repeating the process until the output
layer is reached.
Backpropagation
 Loss Calculation: The network's output is evaluated against the real target
values, and a loss function is used to compute the difference. For a regression
problem, the Mean Squared Error (MSE) is commonly used as the cost function.
 Loss Function: For MSE, the loss is the average of the squared differences between
the predictions and the targets: MSE = (1/n) Σ (y_i - ŷ_i)², where y_i is the true value
and ŷ_i is the network's prediction (see the NumPy sketch at the end of this subsection).
 Gradient Descent: Gradient descent is then used by the network to reduce the
loss. To lower the inaccuracy, weights are changed based on the derivative of
the loss with respect to each weight.
 Adjusting weights: The weights are adjusted at each connection by applying
this iterative process, or backpropagation, backward across the network.
 Training: During training with different data samples, the entire process of
forward propagation, loss calculation, and backpropagation is done iteratively,
enabling the network to adapt and learn patterns from the data.
 Activation Functions: Model non-linearity is introduced by activation functions
like the rectified linear unit (ReLU) or sigmoid. Their decision on whether to
"fire" a neuron is based on the whole weighted input.
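
To make forward propagation, the MSE loss, and gradient-descent weight updates concrete, here is a hedged, deliberately tiny NumPy sketch of a one-hidden-layer network trained on made-up data. It illustrates the mechanics described above and is not a production implementation:

import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 4 samples, 3 input features, 1 numeric target
X = rng.normal(size=(4, 3))
y = rng.normal(size=(4, 1))

# One hidden layer with 5 neurons and a sigmoid activation
W1, b1 = rng.normal(size=(3, 5)), np.zeros((1, 5))
W2, b2 = rng.normal(size=(5, 1)), np.zeros((1, 1))
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(1000):
    # Forward propagation: weighted sums followed by activations
    hidden = sigmoid(X @ W1 + b1)
    output = hidden @ W2 + b2            # linear output for regression

    # Mean squared error loss
    loss = np.mean((output - y) ** 2)

    # Backpropagation: gradients of the loss w.r.t. each parameter
    grad_out = 2 * (output - y) / len(X)
    grad_W2 = hidden.T @ grad_out
    grad_b2 = grad_out.sum(axis=0, keepdims=True)
    grad_hidden = grad_out @ W2.T * hidden * (1 - hidden)
    grad_W1 = X.T @ grad_hidden
    grad_b1 = grad_hidden.sum(axis=0, keepdims=True)

    # Gradient descent: move each weight against its gradient
    W1 -= lr * grad_W1
    b1 -= lr * grad_b1
    W2 -= lr * grad_W2
    b2 -= lr * grad_b2

print("Final training loss:", loss)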
Learning of a Neural Network
1. Learning with supervised learning
In supervised learning, the neural network is guided by a teacher who has access to both
input-output pairs. The network creates outputs based on inputs without taking into account
the surroundings. By comparing these outputs to the teacher-known desired outputs, an
error signal is generated. In order to reduce errors, the network's parameters are changed
iteratively, and training stops when performance is at an acceptable level.
2. Learning with Unsupervised learning
Equivalent output variables are absent in unsupervised learning. Its main goal is to
comprehend the underlying structure of the incoming data (X). No instructor is present to
offer advice. Modeling data patterns and relationships is the intended outcome instead. Terms
like regression and classification are related to supervised learning, whereas unsupervised
learning is associated with clustering and association.
3. Learning with Reinforcement Learning
Through interaction with the environment and feedback in the form of rewards or penalties,
the network gains knowledge. Finding a policy or strategy that optimizes cumulative
rewards over time is the goal for the network. This kind is frequently utilized in gaming and
decision-making applications.
Types of Neural Networks
There are seven types of neural networks that can be used.
 Feedforward Networks: A feedforward neural network is a simple artificial
neural network architecture in which data moves from input to output in a single
direction. It has input, hidden, and output layers; feedback loops are absent. Its
straightforward architecture makes it appropriate for a number of applications,
such as regression and pattern recognition.
 Multilayer Perceptron (MLP): MLP is a type of feedforward neural network
with three or more layers, including an input layer, one or more hidden layers,
and an output layer. It uses nonlinear activation functions.
 Convolutional Neural Network (CNN): A Convolutional Neural
Network (CNN) is a specialized artificial neural network designed for image
processing. It employs convolutional layers to automatically learn hierarchical
features from input images, enabling effective image recognition and
classification. CNNs have revolutionized computer vision and are pivotal in
tasks like object detection and image analysis.
 Recurrent Neural Network (RNN): An artificial neural network type intended
for sequential data processing is called a Recurrent Neural Network (RNN). It is
appropriate for applications where contextual dependencies are critical, such as
time series prediction and natural language processing, since it makes use of
feedback loops, which enable information to survive within the network.
 Long Short-Term Memory (LSTM): LSTM is a type of RNN that is designed
to overcome the vanishing gradient problem in training RNNs. It uses memory
cells and gates to selectively read, write, and erase information.
DATA EXPLORATION IN PYTHON
Part 1: How to load data file(s) using Pandas?
Input data sets can be in various formats (.XLS, .TXT, .CSV, JSON). In Python, it is
easy to load data from any source, due to its simple syntax and the availability of predefined
libraries, such as Pandas. Here I will make use of Pandas itself.
Pandas features a number of functions for reading tabular data as a Pandas DataFrame
object. Below are the common functions that can be used to read data (including
read_csv in Pandas):

Loading data from a CSV file(s):


import pandas as pd
df = pd.read_csv("train.csv") #I am working in Windows environment
print(df.head(3))
Loading data from excel file(s):
df = pd.read_excel("E:/EMP.xlsx", "Data") # Load Data sheet of excel file EMP
print(df)
Output:

Loading data from a txt file(s):


df = pd.read_csv("E:/Test.txt", sep='\t') # Load data from text file with tab '\t' delimiter
print(df)
Output
Part 2: How to convert a variable to a different data type?
Converting a variable from one data type to another is an important and common procedure
we perform after loading data. Let's look at some of the commands to perform these
conversions:
 Convert numeric variables to string variables and vice versa
string_outcome = str(numeric_input) #Converts numeric_input to string_outcome
integer_outcome = int(string_input) #Converts string_input to integer_outcome
float_outcome = float(string_input) #Converts string_input to float_outcome

The latter operations are especially useful when you read a value from the user using
input() (raw_input() in Python 2). By default, the values are read as strings.
 Convert character date to Date: There are multiple ways to do this. The simplest
would be to use the datetime library and the strptime function.
 Here is the code:
from datetime import datetime
char_date = 'Apr 1 2015 1:20 PM' #creating example character date
date_obj = datetime.strptime(char_date, '%b %d %Y %I:%M%p')
print(date_obj)
Part 3: How to transpose a Data set or dataframe using Pandas?
Here, I want to transpose Table A into Table B on the variable Product. This task can be
accomplished by using Pandas dataframe.pivot:

#Transposing Pandas dataframe by a variable


df = pd.read_excel("E:/transpose.xlsx", "Sheet1") # Load Data sheet of excel file EMP
print(df)
result = df.pivot(index='ID', columns='Product', values='Sales')
print(result)
Output

Part 4: How to sort a Pandas DataFrame?


Sorting of data can be done using dataframe.sort_values(). It can be based on multiple
variables, in either ascending or descending order.

#Sorting Pandas Dataframe


df = pd.read_excel("E:/transpose.xlsx", "Sheet1") #Add variable name(s) to sort by
print(df.sort_values(['Product', 'Sales'], ascending=[True, False]))
Above, we have a table with variables ID, Product and Sales. Now, we want to sort it by

Product and Sales (in descending order) as shown in table 2.

Part 5: How to create plots (Histogram, Scatter, Box Plot)?

Data visualization always helps to understand the data easily. Python has libraries
like matplotlib and seaborn to create multiple graphs effectively. Let's look at some
visualizations to understand the behavior of the variable(s) listed below:

 The distribution of age

 Relation between age and sales; and

 If sales are normally distributed or not?

Histogram:
#Plot Histogram
import matplotlib.pyplot as plt
import pandas as pd
df=pd.read_excel("E:/First.xlsx", "Sheet1")
#Plots in matplotlib reside within a figure object, use plt.figure to create new figure
fig=plt.figure()
#Create one or more subplots using add_subplot, because you can't create blank figure
ax = fig.add_subplot(1,1,1)
#Variable
ax.hist(df['Age'],bins = 5)
#Labels and Title
plt.title('Age distribution')
plt.xlabel('Age')
plt.ylabel('#Employee')
plt.show()
Output

Scatter plot:
#Plots in matplotlib reside within a figure object, use plt.figure to create new figure
fig=plt.figure()
#Create one or more subplots using add_subplot, because you can't create blank figure
ax = fig.add_subplot(1,1,1)
#Variable
ax.scatter(df['Age'],df['Sales'])
#Labels and Title
plt.title('Sales and Age distribution')
plt.xlabel('Age')
plt.ylabel('Sales')
plt.show()
Output
Box-plot:
import seaborn as sns
sns.boxplot(df['Age'])
sns.despine()
Output

Part 6: How to generate frequency tables with Pandas?


Frequency tables can be used to understand the distribution of a single categorical variable
or the joint distribution of n categorical variables.

import pandas as pd
df = pd.read_excel("E:/First.xlsx", "Sheet1")
print(df)
test = df.groupby(['Gender', 'BMI'])
print(test.size())
Output

Part 7: How to do sample Data set in Python?


To select a sample of a data set, we will use the numpy and random libraries. Sampling a data
set always helps to understand the data quickly. Let's say, from the EMP table, I want to select
a random sample of 5 employees.
#Create sample from dataframe
import numpy as np
import pandas as pd
from random import sample
# create random index
rindex = np.array(sample(range(len(df)), 5))
# get 5 random rows from the dataframe df
dfr = df.iloc[rindex]
print(dfr)
Output

Part 8: How to remove duplicate values of a variable in a Pandas Dataframe?


Often, we encounter duplicate observations. To tackle this in Python, we can use
dataframe.drop_duplicates().
#Remove duplicate values based on values of variables "Gender" and "BMI"
rem_dup = df.drop_duplicates(['Gender', 'BMI'])
print(rem_dup)
Output

Part 9: How to group variables in Pandas to calculate count, average, sum?


To understand the count, average, and sum of a variable, I would suggest you use
dataframe.describe() with Pandas groupby().
Let's look at the code:
test= df.groupby(['Gender'])
test.describe()

Output
Part 10: How to recognize and Treat missing values and outliers in Pandas?
To identify missing values, we can use dataframe.isnull(). You can also refer to the article
"Data Munging in Python (using Pandas)", where we have done a case study to recognize
and treat missing and outlier values.
# Identify missing values of dataframe
df.isnull()
Output

To treat missing values, there are various imputation methods available. You can refer to
articles on methods to detect outlier and missing values. Imputation methods for both
missing and outlier values are almost similar. Here we will discuss general-case imputation
methods to replace missing values. Let's do it using an example:

#Example to impute missing values in Age by the mean


import numpy as np
meanAge = np.mean(df.Age) #Using numpy mean function to calculate the mean value
df.Age = df.Age.fillna(meanAge) #replacing missing values in the DataFrame
Part 11: How to merge / join data sets and Pandas dataframes?
Joining / merging is one of the common operation required to integrate datasets from
different sources. They can be handled effectively in Pandas using merge function:
Code:
df_new = pd.merge(df1, df2, how = 'inner', left_index = True, right_index = True) #
merges df1 and df2 on index
# By changing how = 'outer', you can do outer join.
# Similarly how = 'left' will do a left join
# You can also specify the columns to join instead of indexes, which are used by default.

STATISTICAL METHODS FOR EVALUATION USING R


R has a range of functions for carrying out summary statistics on single variables, such as
mean(), median(), sd(), var(), min(), max(), sum(), and quantile().
Some functions produce more than one result. The quantile() function for example:

> data1
[1] 3 5 7 5 3 2 6 8 5 6 9
> quantile(data1)
0% 25% 50% 75% 100%
2.0 4.0 5.0 6.5 9.0
The functions operate over a one-dimensional object (called a vector in R-speak). If your data
are a column of a larger dataset then you'll need to use attach() or the $ operator so that R can
"find" your data.
> mf
Length Speed Algae NO3 BOD
1 20 12 40 2.25 200
2 21 14 45 2.15 180
3 22 12 45 1.75 135
4 23 16 80 1.95 120
> mean(Speed)
Error in mean(Speed) : object 'Speed' not found
> mean(mf$Speed)
[1] 15.8
> attach(mf)
> quantile(Algae)
0% 25% 50% 75% 100%
25 40 65 75 85
> detach(mf)
If your data contain NA elements, the result may show as NA. You can overcome this in
many (but not all) commands by adding na.rm = TRUE as a parameter.
> nad
[1] NA 3 5 7 5 3 2 6 8 5 6 9 NA NA
> mean(nad)
[1] NA
> mean(nad, na.rm = TRUE)
[1] 5.363636
T-test
Student's t-test is a classic method for comparing mean values of two samples that are
normally distributed (i.e. they have a Gaussian distribution). Such samples are described as
being parametric and the t-test is a parametric test. In R the t.test() command will carry out
several versions of the t-test.

t.test(x, y, alternative, mu, paired, var.equal, …)

 x – a numeric sample.
 y – a second numeric sample (if this is missing the command carries out a 1-sample
test).
 alternative – how to compare means, the default is “two.sided”. You can also specify
“less” or “greater”.
 mu – the true value of the mean (or mean difference). The default is 0.
 paired – the default is paired = FALSE. This assumes independent samples. The
alternative paired = TRUE is used for matched pair tests.
 var.equal – the default is var.equal = FALSE. This treats the variance of the two samples
separately. If you set var.equal = TRUE you conduct a classic t-test using pooled
variance.
 … – there are additional parameters that we aren‟t concerned with here.
In most cases you want to compare two independent samples:

> mow
[1] 12 15 17 11 15
> unmow
[1] 8 9 7 9

> t.test(mow, unmow)

Welch Two Sample t-test

data: mow and unmow


t = 4.8098, df = 5.4106, p-value = 0.003927
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.745758 8.754242
sample estimates:
mean of x mean of y
14.00 8.25
If you specify a single variable you can carry out a 1-sample test by specifying the mean to
compare against:
> data1
[1] 3 5 7 5 3 2 6 8 5 6 9
> t.test(data1, mu = 5)

One Sample t-test

data: data1
t = 0.55902, df = 10, p-value = 0.5884
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
3.914249 6.813024
sample estimates:
mean of x
5.363636
If you have matched pair data you can specify paired = TRUE as a parameter (see the Paired tests section below).

U-test
The U-test is used for comparing the median values of two samples. You use it when the data
are not normally distributed, so it is described as a non-parametric test. The U-test is often
called the Mann-Whitney U-test but is generally attributed to Wilcoxon (Wilcoxon Rank Sum
test), hence in R the command is wilcox.test().

wilcox.test(x, y, alternative, mu, paired, …)

• x – a numeric sample.
• y – a second numeric sample (if this is missing the command carries out a 1-sample test).
• alternative – how to compare the samples; the default is "two.sided". You can also specify "less" or "greater".
• mu – the true value of the median (or median difference). The default is 0.
• paired – the default is paired = FALSE. This assumes independent samples. The alternative paired = TRUE is used for matched pair tests.
• … – there are additional parameters that we aren't concerned with here.
> wilcox.test(Grass, Heath)

Wilcoxon rank sum test with continuity correction

data: Grass and Heath


W = 20.5, p-value = 0.03625
alternative hypothesis: true location shift is not equal to 0
If there are tied ranks you may get a warning message about exact p-values. You can safely
ignore this!
Paired tests

The t-test and the U-test can both be used when your data are in matched pairs. Sometimes
this kind of test is also called a repeated measures test (depending on circumstance). You can
run the test by adding paired = TRUE to the appropriate command.

Here is an example where the data show the effectiveness of greenhouse sticky traps in
catching whitefly. Each trap has a white side and a yellow side. To compare white and yellow
we can use a matched pair.

> mpd
white yellow
1 4 4
2 3 7
3 4 2
4 1 2
5 6 7
6 4 10
7 6 5
8 4 8

> attach(mpd)
> t.test(white, yellow, paired = TRUE, var.equal = TRUE)

Paired t-test

data: white and yellow


t = -1.6567, df = 7, p-value = 0.1415
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.9443258 0.6943258
sample estimates:
mean of the differences
-1.625
> detach(mpd)
You can do a similar thing with the wilcox.test().

Chi Squared tests

Tests for association are easily carried out using the chisq.test() command. Your data need to
be arranged as a contingency table. Here is an example:

> bird
Garden Hedgerow Parkland Pasture Woodland
Blackbird 47 10 40 2 2
Chaffinch 19 3 5 0 2
Great Tit 50 0 10 7 0
House Sparrow 46 16 8 4 0
Robin 9 3 0 0 2
Song Thrush 4 0 6 0 0
In this dataset the columns form one set of categories (habitats) and the rows form another set (bird species). In the original spreadsheet (CSV file) the first column contains the bird species names; these are "converted" to row names when the data are imported:

> bird = read.csv(file = "birds.csv", row.names = 1)


You can carry out a test for association with the chisq.test() function. This time though, you
should assign the result to a named variable.

> cs = chisq.test(bird)
Warning message:
In chisq.test(bird) : Chi-squared approximation may be incorrect

> cs

Pearson's Chi-squared test


data: bird
X-squared = 78.274, df = 20, p-value = 7.694e-09

In this instance you get a warning message (this is because there are expected values < 5).
The basic “result” shows the overall significance but there are other components that may
prove useful:

> names(cs)
[1] "statistic" "parameter" "p.value"   "method"    "data.name" "observed"
[7] "expected"  "residuals" "stdres"
You can view the components using the $ like so:

> cs$expected
Garden Hedgerow Parkland Pasture Woodland
Blackbird 59.915254 10.955932 23.623729 4.4508475 2.0542373
Chaffinch 17.203390 3.145763 6.783051 1.2779661 0.5898305
Great Tit 39.745763 7.267797 15.671186 2.9525424 1.3627119
House Sparrow 43.898305 8.027119 17.308475 3.2610169 1.5050847
Robin 8.305085 1.518644 3.274576 0.6169492 0.2847458
Song Thrush 5.932203 1.084746 2.338983 0.4406780 0.2033898
Now you can see the expected values. Other useful components are $residuals and $stdres,
which are the Pearson residuals and the standardized residuals respectively.

You can also use square brackets to get part of the result e.g.
> cs$stdres[1, ]
Garden Hedgerow Parkland Pasture Woodland
-3.2260031 -0.3771774 4.7468743 -1.4651818 -0.0471460
Now you can see the standardized residuals for the Blackbird row.

Yates' correction

When using a 2 x 2 contingency table it is common practice to reduce the Observed – Expected differences by 0.5. To do this add correct = TRUE to the original function e.g.

> your.chi = chisq.test(your.data, correct=TRUE)


If your table is larger than 2 x 2 then the correction will not be applied (the basic test will run
instead) even if you request it.

Goodness of Fit test

A goodness of fit test is a special kind of test of association. You use it when you have a set
of data in categories that you want to compare to a "known" standard. A classic example is in
genetics, where the theory suggests a particular ratio of phenotypes:

> pea
[1] 116 40 31 13

> ratio
[1] 9 3 3 1
Here you have counts of pea plants (there are 4 phenotypes) from an experiment in cross-
pollination. The genetic theory suggests that your results should be in the ratio 9:3:3:1. Are
these results in that ratio? The goodness of fit test will tell you.

> gfit = chisq.test(pea, p = ratio, rescale.p = TRUE)


> gfit

Chi-squared test for given probabilities


data: pea
X-squared = 1.4222, df = 3, p-value = 0.7003
Notice that the rescale.p = TRUE parameter was used to ensure that the expected
probabilities sum to one. The final result is not significant, meaning that the peas observed
are not statistically significantly different from the expected ratio.

The result contains the same components as the regular chi squared test:

> gfit$expected
[1] 112.5 37.5 37.5 12.5
> gfit$stdres
[1] 0.4988877 0.4529108 -1.1775681 0.1460593

VISUALIZATION USING PYTHON

Data visualization is the practice of translating data into a visual context, such as a map or graph, to make the data easier for the human brain to understand and to draw insight from. The main goal of data visualization is to make it easier to identify patterns, trends, and outliers in large data sets. The term is often used interchangeably with related terms, including information graphics, information visualization, and statistical graphics.

Process of Data Visualization


Several different phases are involved in the data visualization process, to facilitate or reveal existing relationships, or to discover something new in a dataset.
1. Filtering and processing.
Filtering and processing transform raw data into information by analyzing, interpreting, summarizing, comparing, and exploring it.
2. Translation & visual representation.
Creating the visual representation by defining the imagery, language, and context in which it is presented to the recipient.
3. Visualization and interpretation.
Finally, a visualization is effective if it has a cognitive impact and contributes to knowledge construction.

Data visualization formats


1. Bar Charts
Bar charts are one of the most popular ways to visualize data because they quickly present data in an understandable format that lets viewers see highs and lows at a glance. They are very versatile and are often used for comparing different categories, analyzing changes over time, or comparing parts of a whole. The three variations on the bar chart are:
Vertical column: used for chronological data, which should run in left-to-right order.
Horizontal column: used to visualize categories.
Full stacked column: used to visualize categories that together add up to 100%.

Source: Netquest- A Comprehensive Guide to Data Visualization (Melisa Matias)
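For illustration (not from the source guide), a minimal matplotlib sketch of a bar chart; the categories and values below are invented:

import matplotlib.pyplot as plt

categories = ['A', 'B', 'C', 'D']   # hypothetical categories
values = [23, 45, 12, 36]           # hypothetical counts

plt.bar(categories, values, color='steelblue')   # vertical column chart
# plt.barh(categories, values) would draw a horizontal column chart instead
plt.xlabel('Category')
plt.ylabel('Value')
plt.title('Bar Chart Example')
plt.show()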


2. Histograms
Histograms represent a frequency distribution in the form of bars, where the area of each bar is proportional to the number of values it represents. They offer an overview of how a population or sample is distributed with respect to a particular attribute. The two variations of the histogram are:

1. Vertical columns

2. Horizontal columns

Source: Netquest- A Comprehensive Guide to Data Visualization (Melisa Matias)
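A minimal matplotlib sketch of a histogram (not from the source guide); the data are randomly generated for illustration:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(1000)   # 1000 values drawn from a standard normal distribution

plt.hist(data, bins=30, color='seagreen', edgecolor='black')   # vertical histogram
# plt.hist(data, bins=30, orientation='horizontal') gives horizontal bars
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram Example')
plt.show()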


3. Pie charts
A pie chart consists of a circle divided into categories, each representing a portion of the whole. They should be divided into no more than five data groups, and they are useful for comparing the relative sizes of the parts that make up a whole.
The two variations of the pie chart are:
1. Standard: used to show relationships between the parts and the whole.
2. Donut: a stylistic variation that makes room for a total value or a design element in the center.

Source: Netquest- A Comprehensive Guide to Data Visualization (Melisa Matias)
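A minimal matplotlib sketch of a pie chart and its donut variant (not from the source guide); the categories and shares are invented:

import matplotlib.pyplot as plt

labels = ['Rent', 'Food', 'Travel', 'Other']   # hypothetical categories
sizes = [40, 30, 20, 10]                       # shares that add up to 100

plt.pie(sizes, labels=labels, autopct='%1.1f%%')   # standard pie chart
# For a donut chart, add a white circle in the centre:
# plt.gca().add_artist(plt.Circle((0, 0), 0.6, color='white'))
plt.title('Pie Chart Example')
plt.show()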


4. Scatter Plot
Scatter plots use points spread over a Cartesian coordinate plane to show the relationship between two variables. They also help us determine whether different groups of data are related or not.

Source: Netquest- A Comprehensive Guide to Data Visualization (Melisa Matias)
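A minimal matplotlib sketch of a scatter plot (not from the source guide), using randomly generated, loosely correlated data:

import numpy as np
import matplotlib.pyplot as plt

x = np.random.rand(50)
y = 2 * x + np.random.normal(0, 0.2, 50)   # y loosely related to x, plus noise

plt.scatter(x, y, color='darkorange')
plt.xlabel('Variable X')
plt.ylabel('Variable Y')
plt.title('Scatter Plot Example')
plt.show()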

5. Heat Maps
Heat maps represent individual values from a data set in a matrix using variations in color or color intensity. They often use color to help viewers compare and contrast data across two categories. They are useful for analyzing web pages, where the areas most users interact with are represented by "hot" colors, and the areas that receive the fewest clicks are displayed in "cold" colors.

Source: Netquest- A Comprehensive Guide to Data Visualization (Melisa Matias)
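A minimal matplotlib sketch of a heat map (not from the source guide), coloring a random matrix; seaborn's heatmap() is a popular alternative if that library is available:

import numpy as np
import matplotlib.pyplot as plt

matrix = np.random.rand(5, 7)   # hypothetical 5 x 7 grid of values

plt.imshow(matrix, cmap='hot', aspect='auto')   # color intensity encodes each value
plt.colorbar(label='Value')
plt.title('Heat Map Example')
plt.show()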


6. Line Plot
Line plots are used to display changes or trends in data over time. They are especially useful for showing relationships, acceleration, deceleration, and volatility in a data set.
Source: Netquest- A Comprehensive Guide to Data Visualization (Melisa Matias)
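A minimal matplotlib sketch of a line plot (not from the source guide), showing a made-up monthly trend:

import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']   # hypothetical time axis
sales = [120, 135, 128, 150, 170, 165]                # hypothetical values

plt.plot(months, sales, marker='o')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.title('Line Plot Example')
plt.show()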

BUILDING MODELS AND EVALUATION WITH SCIKIT

Scikit-learn is an open-source Python library that implements a range of machine learning, pre-processing, cross-validation, and visualization algorithms using a unified interface. It provides a plethora of tools for various machine-learning tasks such as classification, regression, clustering, and many more.
Installation of Scikit-learn
At the time of writing, the latest version of scikit-learn is 1.1, and it requires Python 3.8 or newer.
Scikit-learn requires:
• NumPy
• SciPy
as its dependencies. Before installing scikit-learn, ensure that you have NumPy and SciPy installed. Once you have a working installation of NumPy and SciPy, the easiest way to install scikit-learn is using pip:
!pip install -U scikit-learn
Let us get started with the modeling process now.
Step 1: Load a Dataset
A dataset is nothing but a collection of data. A dataset generally has two main components:
• Features: (also known as predictors, inputs, or attributes) these are simply the variables of our data. There can be more than one, and hence they are represented by a feature matrix ('X' is a common notation for the feature matrix). A list of all the feature names is termed the feature names.
• Response: (also known as the target, label, or output) this is the output variable that depends on the feature variables. We generally have a single response column, and it is represented by a response vector ('y' is a common notation for the response vector). All the possible values taken by the response vector are termed the target names.
Loading an exemplar dataset: scikit-learn comes loaded with a few example datasets, like the iris and digits datasets for classification and the Boston house prices dataset for regression (note that the Boston dataset has been removed from recent scikit-learn releases).
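For instance, here is a minimal sketch (assuming scikit-learn is installed) of loading the bundled iris dataset and inspecting its feature matrix and response vector:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data              # feature matrix: 150 samples x 4 features
y = iris.target            # response vector: class labels 0, 1, 2

print(iris.feature_names)  # names of the four features
print(iris.target_names)   # names of the three iris species
print(X.shape, y.shape)    # (150, 4) (150,)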
Loading an external dataset: now consider the case when we want to load an external dataset. For this purpose, we can use the pandas library for easily loading and manipulating datasets.
To install pandas, use the following pip command:
! pip install pandas
In pandas, the important data types are:
• Series: a one-dimensional labeled array capable of holding any data type.
• DataFrame: a two-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.
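As a minimal sketch, loading an external CSV file into a DataFrame might look like the following (the file name 'data.csv' and the column name 'target' are hypothetical placeholders, not part of any particular dataset):

import pandas as pd

df = pd.read_csv('data.csv')      # hypothetical external dataset
print(df.head())                  # first five rows
print(df.dtypes)                  # column data types

X = df.drop(columns=['target'])   # feature matrix: every column except the label
y = df['target']                  # response vector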
Step 2: Splitting the Dataset
One important aspect of all machine learning models is determining their accuracy. To do this, one could train the model on the given dataset and then predict the response values for the same dataset using that model, and hence find the accuracy of the model.
But this method has several flaws in it, like:
• The goal is to estimate the likely performance of a model on out-of-sample data, which this approach cannot do.
• Maximizing training accuracy rewards overly complex models that won't necessarily generalize.
• Unnecessarily complex models may over-fit the training data.
A better option is to split our data into two parts: the first one for training our machine learning model, and the second one for testing our model (a sketch using train_test_split follows the list below).
Advantages of train/test split:
• The model is tested on data that was not used to train it.
• Response values are known for the test dataset; hence predictions can be evaluated.
• Testing accuracy is a better estimate than training accuracy of out-of-sample performance.
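A minimal sketch of such a split using scikit-learn's train_test_split on the iris data (the 25% test size and the random_state value are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

print(X_train.shape, X_test.shape)   # (112, 4) (38, 4)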
Step 3: Training the Model
Now it's time to train some prediction models using our dataset. Scikit-learn provides a wide range of machine learning algorithms that have a unified/consistent interface for fitting, predicting, evaluating accuracy, etc. The example given below uses the KNN (K-nearest neighbors) classifier.
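A minimal sketch, continuing from the split above, of fitting a KNeighborsClassifier and checking its testing accuracy (k = 3 is an arbitrary choice here):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3 neighbours
knn.fit(X_train, y_train)                   # train on the training split

y_pred = knn.predict(X_test)                # predict on unseen data
print("Testing accuracy:", accuracy_score(y_test, y_pred))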
Features of Scikit-learn
• Simple and efficient tools for data mining and data analysis. It features various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, etc.
• Accessible to everybody and reusable in various contexts.
• Built on top of NumPy, SciPy, and matplotlib.
• Open source and commercially usable (BSD license).
Benefits of using Scikit-learn Libraries
• Consistent interface to machine learning models.
• Provides many tuning parameters, but with sensible defaults.
• Exceptional documentation.
• Rich set of functionalities for companion tasks.
• Active community for development and support.
