Programming For Data Science
NumPy, short for Numerical Python, has long been a cornerstone of numerical computing
in Python. It provides the data structures, algorithms, and library glue needed for most
scientific applications involving numerical data in Python. NumPy contains, among other
things:
• A mature C API to enable Python extensions and native C or C++ code to access NumPy's
data structures and computational facilities
Beyond the fast array-processing capabilities that NumPy adds to Python, one of its primary
uses in data analysis is as a container for data to be passed between algorithms and libraries.
For numerical data, NumPy arrays are more efficient for storing and manipulating data than
the other built-in Python data structures. Also, libraries written in a lower-level language,
such as C or FORTRAN, can operate on the data stored in a NumPy array without copying
data into some other memory representation. Thus, many numerical computing tools for
Python either assume NumPy arrays as a primary data structure or else target interoperability
with NumPy.
NumPy Library
NumPy is a Python library used for scientific computing. This is Python's scientific
computing core library, providing high-performance multidimensional array objects, tools for
manipulating those arrays, and various mathematical functions. It also contains useful linear
algebra, Fourier transform, and random number capabilities.
NumPy is a scientific computing library for Python. It provides powerful tools for
manipulating and analyzing numerical data. It is used for array-based computations, linear
algebra, Fourier transforms, random number generation, and more.
Example program for NumPy: Here's a short program that creates two random 3x3 matrices,
multiplies them, and prints the result:
import numpy as np
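# The rest of the original listing is not shown; the lines below are a minimal
# completion based on the description above (the product printout does not
# appear in the truncated output that follows).
m1 = np.random.rand(3, 3)      # first random 3x3 matrix
m2 = np.random.rand(3, 3)      # second random 3x3 matrix
product = np.matmul(m1, m2)    # multiply the two matrices
print("Matrix 1:")
print(m1)
print("Matrix 2:")
print(m2)
print("Product:")
print(product)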
output:
Matrix 1:
[[0.9317527 0.8167975 0.44996549]
[0.33871151 0.02973976 0.93414816]
[0.2228187 0.84413553 0.64904411]]
Matrix 2:
[[0.46360681 0.26930509 0.81966568]
[0.9149686 0.61693763 0.24072468]
[0.00164111 0.46880786 0.67252377]]
Advantages of NumPy:
1. Easy to use: NumPy is very easy to use, and its syntax is simple, making it easier to
code.
2. Speed: NumPy is very fast as it uses highly optimized C and Fortran libraries under
the hood.
3. Compatibility: NumPy is compatible with many other libraries such as SciPy, Scikit-learn,
Matplotlib, etc.
4. Math library: NumPy has an extensive math library that provides many
mathematical functions such as trigonometric functions, logarithms, etc.
5. Linear algebra support: NumPy supports linear algebra operations such as matrix
multiplication, vector operations, etc.
pandas :
pandas provides high-level data structures and functions designed to make working with
structured or tabular data intuitive and flexible. pandas blends the array-computing ideas of
NumPy with the kinds of data manipulation capabilities found in spreadsheets and relational
databases (such as SQL). It provides convenient indexing functionality to enable you to
reshape, slice and dice, perform aggregations, and select subsets of data. Since data
manipulation, preparation, and cleaning are such important skills in data analysis, pandas is
one of the primary tools used for these tasks.
Pandas is an open-source library for data analysis and manipulation. It provides a wide
range of data structures and tools for working with data. It is designed for easy data
wrangling and manipulation and can be used for a variety of tasks such as data cleaning, data
analysis, data visualization, and more. Pandas is used for data analysis in Python; similar
data-frame libraries exist in other languages such as R and Julia.
Python example: Here is an example of using Pandas to read a CSV file and display the
data as a table:
import pandas as pd
# Read in CSV file
df = pd.read_csv('example.csv')
# Display data as a table
print(df)
Summary:
Pandas: Pandas is a data-analysis library that provides high-level data structures and robust
data analysis tools. It is used for data wrangling, cleaning, and preparation. It is designed to
make data manipulation and analysis easy and intuitive.
SciPy
SciPy (Scientific Python) is another free and open-source Python library for data science that
is extensively used for high-level scientific and technical computations. It extends NumPy and
provides many user-friendly and efficient routines for scientific calculations.
SciPy is a collection of packages addressing a number of foundational problems in scientific
computing. Here are some of the tools it contains in its various modules:
scipy.integrate
Numerical integration routines and differential equation solvers
scipy.linalg
Linear algebra routines and matrix decompositions extending beyond those provided in
numpy.linalg
scipy.optimize
Function optimizers (minimizers) and root finding algorithms
scipy.signal
Signal processing tools
scipy.sparse
Sparse matrices and sparse linear system solvers
scipy.special
A wrapper around SPECFUN, a FORTRAN library implementing many common mathematical
functions, such as the gamma function
scipy.stats
Standard continuous and discrete probability distributions, various statistical tests, and more
descriptive statistics
Together, NumPy and SciPy form a reasonably complete and mature computational
foundation for many traditional scientific computing applications.
import numpy as np
from scipy import optimize, integrate, stats
# Optimization example: minimize the Rosenbrock function
def rosen(x):
    return sum(100.0*(x[1:]-x[:-1]**2.0)**2.0 + (1-x[:-1])**2.0)
# (The original call is not shown; this is one way to run the optimizer.)
x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
result = optimize.minimize(rosen, x0, method='nelder-mead')
print(result.x)
output:
Evaluating machine learning models with Scikit-learn
1. Splitting Data: First, you'll need to split your dataset into training and testing sets using the
train_test_split function, as sketched below.
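A minimal sketch of this step, assuming the data come from scikit-learn's bundled iris dataset
(any feature matrix X and label vector y would work the same way):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load a feature matrix X and a label vector y
X, y = load_iris(return_X_y=True)
# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)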
2. Model Training: Fit the chosen model on the training data. The snippet below uses a
support vector classifier, which needs to be imported from sklearn.svm:
from sklearn.svm import SVC
model = SVC()
model.fit(X_train, y_train)
3. Model Evaluation:
For classification problems, you can use metrics like accuracy, precision, recall, F1-score,
ROC-AUC, etc.
For regression problems, metrics like mean squared error (MSE), mean absolute error
(MAE), R-squared, etc., are commonly used.
# Predictions
y_pred = model.predict(X_test)
# Accuracy (the metric functions come from sklearn.metrics)
from sklearn.metrics import accuracy_score, classification_report
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# Classification Report
print("Classification Report:")
print(classification_report(y_test, y_pred))
For regression, the process is similar, but you'll use different evaluation metrics:
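As an illustration, here is a hedged sketch of those regression metrics on a small synthetic
dataset (the data and model here are placeholders, not part of the original example):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Synthetic regression data
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))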
These are some basic steps to evaluate machine learning models using Scikit-learn in Python.
Depending on your specific problem and requirements, you might need to delve deeper into
specific metrics and techniques.
Reading data and making it accessible (often called data loading) is a necessary first step for
using most of the tools in this book. The term parsing is also sometimes used to describe
loading text data and interpreting it as tables and different data types. I'm going to focus on
data input and output using pandas, though there are numerous tools in other libraries to help
with reading and writing data in various formats.
Input and output typically fall into a few main categories: reading text files and other more
efficient on-disk formats, loading data from databases, and interacting with network sources
like web APIs.
1. Data Loading:
Loading data refers to the process of bringing data into memory from external sources such
as files, databases, or APIs. In Python, several libraries facilitate data loading, including:
Pandas: Pandas is a powerful library for data manipulation and analysis. It provides
functions like read_csv, read_excel, read_sql, etc., to read data from CSV files, Excel files,
SQL databases, etc.
NumPy: NumPy also offers functions like loadtxt and genfromtxt to load data from
text files into NumPy arrays.
JSON and CSV modules: Python's built-in json and csv modules are handy for
loading data from JSON and CSV files, respectively.
Requests: When dealing with web APIs, the requests library is commonly used to
fetch data from remote servers.
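As a small illustration of the built-in json module and the NumPy loaders mentioned above
(the file names here are placeholders, not files referenced elsewhere in this material):
import json
import numpy as np
# Load a JSON document into Python dicts and lists
with open('records.json') as f:        # hypothetical file name
    records = json.load(f)
# Load whitespace-delimited numeric text into a NumPy array
values = np.loadtxt('values.txt')      # hypothetical file name
print(type(records), values.shape)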
2. Data Storage:
Data storage involves saving data to disk or a database for later retrieval. Common storage
options in Python include:
File Systems: Data can be stored in files on disk, using various file formats like CSV,
Excel, JSON, HDF5, etc.
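A minimal sketch of writing a DataFrame to a few of these formats with pandas (the file names
are placeholders; the Excel writer additionally requires the openpyxl package):
import pandas as pd
df = pd.DataFrame({'Name': ['Jai', 'Princi'], 'Age': [17, 17]})
df.to_csv('people.csv', index=False)       # plain-text CSV
df.to_json('people.json')                  # JSON
df.to_excel('people.xlsx', index=False)    # Excel (needs openpyxl)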
3. File Formats:
File formats define the structure and encoding of data when stored in files. Common file
formats used in Python include:
CSV (Comma-Separated Values): Simple and widely used for tabular data.
Excel: Popular spreadsheet format supported by libraries like Pandas and openpyxl.
HDF5 (Hierarchical Data Format version 5): Designed for storing and organizing
large amounts of numerical data.
Parquet, Avro, ORC: Columnar file formats commonly used in big data processing
frameworks like Apache Spark.
XML (eXtensible Markup Language): Semi-structured data format often used for
configuration files or web services.
In summary, Python provides a rich ecosystem of libraries and tools for efficiently loading,
storing, and working with data from various sources and in different formats, empowering
data scientists, engineers, and analysts to handle diverse data requirements effectively.
Below is an example of data wrangling that implements the above functionalities on a
raw dataset:
Data exploration in Python
Here in Data exploration, we load the data into a dataframe, and then we visualize the data
in a tabular format.
# Import pandas and NumPy packages
import pandas as pd
import numpy as np
# Assign data (np.nan marks the missing marks)
data = {'Name': ['Jai', 'Princi', 'Gaurav',
'Anuj', 'Ravi', 'Natasha', 'Riya'],
'Age': [17, 17, 18, 17, 18, 17, 17],
'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
'Marks': [90, 76, np.nan, 74, 65, np.nan, 71]}
# Convert into DataFrame
df = pd.DataFrame(data)
# Display data
df
Output:
Summary:
Data Wrangling:
Data wrangling involves preparing raw data for analysis. It consists of several key steps:
Cleaning Data:
Identify and handle missing data.
Remove duplicates.
Correct inconsistent data formats or values.
Address outliers or anomalies.
Examples:
Cleaning Data:
# Load data
data = pd.read_csv('data.csv')
# Drop rows that contain missing values
cleaned_data = data.dropna()
Output: cleaned_data will contain the dataset with rows containing missing values removed.
Transforming Data:
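The transformation code itself is not shown in the original; a minimal sketch of one-hot
encoding with pandas, using an illustrative frame and column name:
import pandas as pd
data = pd.DataFrame({'id': [1, 2, 3],
                     'Gender': ['M', 'F', 'M']})
# One-hot encode the categorical 'Gender' column
transformed_data = pd.get_dummies(data, columns=['Gender'])
print(transformed_data)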
Output: transformed_data will contain the dataset with categorical variables converted
into numerical representations using one-hot encoding.
Merging Data:
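The merge code itself is not shown; a minimal sketch joining two small frames on a shared
'id' column:
import pandas as pd
left = pd.DataFrame({'id': [1, 2, 3], 'Name': ['Jai', 'Princi', 'Gaurav']})
right = pd.DataFrame({'id': [1, 2, 3], 'Marks': [90, 76, 88]})
# Inner join on the common 'id' column
merged_data = pd.merge(left, right, on='id')
print(merged_data)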
Output: merged_data will contain the merged dataset based on a common column ('id' in this
example).
Reshaping Data:
Pivot data from wide to long format or vice versa.
Stack or unstack data.
Melt or reshape data to fit specific analysis or visualization requirements.
Example:
Reshaping Data:
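The pivot call is not shown in the original; a minimal sketch with made-up index, columns,
and values:
import pandas as pd
long_df = pd.DataFrame({'Date': ['d1', 'd1', 'd2', 'd2'],
                        'Product': ['A', 'B', 'A', 'B'],
                        'Sales': [10, 20, 15, 25]})
# Pivot from long to wide: rows indexed by Date, one column per Product
reshaped_data = long_df.pivot(index='Date', columns='Product', values='Sales')
print(reshaped_data)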
Output: reshaped_data will contain the pivoted DataFrame, restructured based on the
specified index, columns, and values.
Note: Plotting Data: Visualizing data using plots such as line plots, scatter plots, histograms, etc.
Data Visualization: Enhancing plots with labels, titles, legends, etc., for better interpretation.
Matplotlib
Matplotlib is an easy-to-use, low-level data visualization library that is built on NumPy
arrays. It consists of various plots like scatter plot, line plot, histogram, etc. Matplotlib
provides a lot of flexibility.
To install this, type the below command in the terminal:
pip install matplotlib
After installing Matplotlib, let's see the most commonly used plots using this library.
Scatter Plot
Scatter plots are used to observe relationships between variables and use dots to represent
the relationship between them. The scatter() method in the matplotlib library is used to
draw a scatter plot.
import pandas as pd
import matplotlib.pyplot as plt
# The original data source is not shown; a small example DataFrame stands in for it
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [5, 7, 4, 6, 8]})
plt.scatter(df['x'], df['y'])
plt.show()
Output:
This graph can be more meaningful if we add colors and also change the size of the
points. We can do this by using the c and s parameters respectively of the scatter function.
We can also show the color bar using the colorbar() method.
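A small sketch of those options (the arrays here are random placeholders):
import numpy as np
import matplotlib.pyplot as plt
x = np.random.rand(50)
y = np.random.rand(50)
colors = np.random.rand(50)           # c: one color value per point
sizes = 300 * np.random.rand(50)      # s: one marker size per point
plt.scatter(x, y, c=colors, s=sizes)
plt.colorbar()                        # show the color scale
plt.show()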
Line Chart
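The original line-chart code is not included here; a minimal sketch with made-up values:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 5, 3]
plt.plot(x, y)          # line chart: points joined in order
plt.title('Line Chart')
plt.show()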
output:
Seaborn
Seaborn is a high-level interface built on top of Matplotlib. It provides beautiful design
styles and color palettes to make more attractive graphs.
To install seaborn type the below command in the terminal.
pip install seaborn
Seaborn is built on top of Matplotlib, so it can be used together with Matplotlib as
well. Using both Matplotlib and Seaborn together is a very simple process: we just
invoke the Seaborn plotting function as normal, and then we can use Matplotlib's
customization functions (see the example below).
Note: Seaborn comes loaded with datasets such as tips, iris, etc., but for the sake of this
tutorial we will use Pandas for loading these datasets.
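A minimal sketch of combining the two libraries (a tiny hand-made DataFrame stands in for
the tips dataset mentioned above):
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({'total_bill': [10, 20, 30, 40],
                   'tip': [1, 3, 4, 6]})
# Seaborn plotting function ...
sns.scatterplot(x='total_bill', y='tip', data=df)
# ... followed by Matplotlib customization
plt.title('Tip vs Total Bill')
plt.show()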
Bokeh
Let's move on to the third library of our list. Bokeh is mainly famous for its interactive
chart visualization. Bokeh renders its plots using HTML and JavaScript and targets modern
web browsers, presenting elegant, concise construction of novel graphics with high-level
interactivity.
To install this type the below command in the terminal.
pip install bokeh
One of the key features of Bokeh is the ability to add interaction to the plots. Let's see the
various interactions that can be added.
Interactive Legends
The click_policy property makes the legend interactive. There are two types of interactivity –
Hiding: hides the glyph completely.
Muting: rather than making the glyph vanish completely, muting just de-emphasizes the
glyph based on the supplied parameters.
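A minimal sketch of an interactive legend (the data and colors are arbitrary; click_policy can
also be set to "mute", typically together with a muted_alpha on each glyph):
import numpy as np
from bokeh.plotting import figure, show
x = np.linspace(0, 10, 100)
p = figure(title="Interactive legend example")
p.line(x, np.sin(x), color="navy", legend_label="sin")
p.line(x, np.cos(x), color="firebrick", legend_label="cos")
# Clicking a legend entry hides the corresponding glyph
p.legend.click_policy = "hide"
show(p)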
Examples:
The sum() function is used to calculate the sum of every value.
df.sum()
Output:
The describe() function is used to get a summary of our dataset
df.describe()
Output:
We use the agg() function to calculate the sum, min, and max of each column in
our dataset, as sketched below.
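The agg() call is not shown in the original; a minimal sketch on a small numeric frame (the
column names mirror the grouping examples that follow):
import pandas as pd
df = pd.DataFrame({'Maths': [80, 90, 70, 90],
                   'Science': [85, 75, 95, 75]})
# Apply several aggregations to every column at once
print(df.agg(['sum', 'min', 'max']))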
Output:
Grouping Operations:
Grouping is used to group data using some criteria from our dataset. It follows a split-
apply-combine strategy:
Splitting the data into groups based on some criteria.
Applying a function to each group independently.
Combining the results into a data structure.
Examples:
We use the groupby() function to group the data on the “Maths” column. It returns a
GroupBy object as the result.
df.groupby(by=['Maths'])
Output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000012581821388>
Applying the groupby() function groups the data on the “Maths” column. To view the result
of the formed groups, use the first() function.
a = df.groupby('Maths')
a.first()
Output:
First we group based on “Maths”, and then within each group we group based on “Science”.
b = df.groupby(['Maths', 'Science'])
b.first()
Output:
TIME SERIES
What are time series visualization and analytics?
Time series visualization and analytics empower users to graphically represent time-based
data, enabling the identification of trends and the tracking of changes over different
periods. This data can be presented through various formats, such as line graphs, gauges,
tables, and more.
The utilization of time series visualization and analytics facilitates the extraction of insights
from data, enabling the generation of forecasts and a comprehensive understanding of the
information at hand. Organizations find substantial value in time series data as it allows
them to analyze both real-time and historical metrics.
What is Time Series Data?
Time series data is a sequential arrangement of data points organized in consecutive time
order. Time-series analysis consists of methods for analyzing time-series data to extract
meaningful insights and other valuable characteristics of the data.
Importance of time series analysis
Time-series data analysis is becoming very important in so many industries, like financial
industries, pharmaceuticals, social media companies, web service providers, research, and
many more. To understand the time-series data, visualization of the data is essential. In
fact, any type of data analysis is not complete without visualizations, because one good
visualization can provide meaningful and interesting insights into the data.
Basic Time Series Concepts
Trend: A trend represents the general direction in which a time series is moving
over an extended period. It indicates whether the values are increasing,
decreasing, or staying relatively constant.
Seasonality: Seasonality refers to recurring patterns or cycles that occur at
regular intervals within a time series, often corresponding to specific time units
like days, weeks, months, or seasons.
Moving average: The moving average method is a common technique used in
time series analysis to smooth out short-term fluctuations and highlight longer-
term trends or patterns in the data. It involves calculating the average of a set of
consecutive data points, referred to as a “window” or “rolling window,” as it
moves through the time series.
Noise: Noise, or random fluctuations, represents the irregular and unpredictable
components in a time series that do not follow a discernible pattern. It introduces
variability that is not attributable to the underlying trend or seasonality.
Differencing: Differencing takes the difference between values separated by a
specified interval (the lag). By default the interval is one, but we can specify
different values. It is one of the most popular methods for removing trends from the data.
Stationarity: A stationary time series is one whose statistical properties, such as
mean, variance, and autocorrelation, remain constant over time.
Order: The order of differencing refers to the number of times the time series
data needs to be differenced to achieve stationarity.
Autocorrelation: Autocorrelation is a statistical method used in time series
analysis to quantify the degree of similarity between a time series and a lagged
version of itself.
Resampling: Resampling is a technique in time series analysis that involves
changing the frequency of the data observations. It's often used to transform the
data to a different frequency (e.g., from daily to monthly) to reveal patterns or
trends more clearly.
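To make the moving average, differencing, and resampling ideas above concrete, here is a
short pandas sketch on a made-up daily series:
import numpy as np
import pandas as pd
# Hypothetical daily series of 90 observations
idx = pd.date_range('2023-01-01', periods=90, freq='D')
ts = pd.Series(np.random.randn(90).cumsum(), index=idx)
smooth = ts.rolling(window=7).mean()    # 7-day moving average (rolling window)
diff = ts.diff()                        # first-order differencing (interval of one)
monthly = ts.resample('M').mean()       # resample from daily to monthly frequency
print(monthly.head())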
Types of Time Series Data
Time series data can be broadly classified into two sections:
1. Continuous Time Series Data: Continuous time series data involves measurements or
observations that are recorded at regular intervals, forming a seamless and uninterrupted
sequence. This type of data is characterized by a continuous range of possible values and is
commonly encountered in various domains, including:
Temperature Data: Continuous recordings of temperature at consistent
intervals (e.g., hourly or daily measurements).
Stock Market Data: Continuous tracking of stock prices or values throughout
trading hours.
Sensor Data: Continuous measurements from sensors capturing variables like
pressure, humidity, or air quality.
2. Discrete Time Series Data: Discrete time series data, on the other hand, consists of
measurements or observations that are limited to specific values or categories. Unlike
continuous data, discrete data does not have a continuous range of possible values but
instead comprises distinct and separate data points. Common examples include:
Count Data: Tracking the number of occurrences or events within a specific
time period.
Categorical Data: Classifying data into distinct categories or classes (e.g.,
customer segments, product types).
Binary Data: Recording data with only two possible outcomes or states.
Visualization Approach for Different Data Types:
Continuous time series data can be effectively represented
graphically using line, area, or smooth plots, which offer insights into the
dynamic behavior of the trends being studied.
To show patterns and distributions within discrete time series data, bar charts,
histograms, and stacked bar plots are frequently utilized. These methods provide
insights into the distribution and frequency of particular occurrences or
categories throughout time.
Starting the Jupyter Notebook server
To start the Notebook server, run the following command in a terminal:
$ jupyter notebook
This will start up Jupyter and your default browser should start (or open a new tab) to the
following URL: http://localhost:8888/tree
Your browser should now look something like this:
Note that right now you are not actually running a Notebook, but instead you are just running
the Notebook server. Let's actually create a Notebook now!
Creating a Notebook
Now that you know how to start a Notebook server, you should probably learn how to create
an actual Notebook document.
All you need to do is click on the New button (upper right), and it will open up a list of
choices. On my machine, I happen to have Python 2 and Python 3 installed, so I can create a
Notebook that uses either of these. For simplicity's sake, let's choose Python 3.
Your web page should now look like this:
Naming
You will notice that at the top of the page is the word Untitled. This is the title for the page
and the name of your Notebook. Since that isn't a very descriptive name, let's change it!
Just move your mouse over the word Untitled and click on the text. You should now see an
in-browser dialog titled Rename Notebook. Let's rename this one to Hello Jupyter:
Running Cells
A Notebook's cell defaults to using code whenever you first create one, and that cell uses the
kernel that you chose when you started your Notebook.
In this case, you started yours with Python 3 as your kernel, so that means you can write
Python code in your code cells. Since your initial Notebook has only one empty cell in it, the
Notebook can't really do anything.
Thus, to verify that everything is working as it should, you can add some Python code to the
cell and try running its contents.
Let's try adding the following code to that cell:
print('Hello Jupyter!')
Running a cell means that you will execute the cell's contents. To execute a cell, you can just
select the cell and click the Run button that is in the row of buttons along the top. It's towards
the middle. If you prefer using your keyboard, you can just press Shift + Enter.
When I ran the code above, the output looked like this:
If you have multiple cells in your Notebook, and you run the cells in order, you can share
your variables and imports across cells. This makes it easy to separate out your code into
logical chunks without needing to reimport libraries or recreate variables or functions in
every cell.
When you run a cell, you will notice that there are some square braces next to the word In to
the left of the cell. The square braces will auto fill with a number that indicates the order that
you ran the cells. For example, if you open a fresh Notebook and run the first cell at the top
of the Notebook, the square braces will fill with the number 1.
PyDev development environments
Press the New button and enter the path to python.exe in your Python installation directory.
For Linux and Mac OS X users this is normally /usr/bin/python.
The result should look like the following.
Press Finish.
Select Window → Open Perspective → Other and select the PyDev perspective.
Select the "src" folder of your project, right-click it and select New → PyDev Module. Create a
module "FirstModule".
Right-click your module and select Run As → Python Run.
Forward Propagation
Input Layer: Each feature in the input layer is represented by a node on the
network, which receives input data.
Weights and Connections: The weight of each neuronal connection indicates
how strong the connection is. Throughout training, these weights are changed.
Hidden Layers: Each hidden layer neuron processes inputs by multiplying them
by weights, adding them up, and then passing them through an activation
function. By doing this, non-linearity is introduced, enabling the network to
recognize intricate patterns.
Output: The final result is produced by repeating the process until the output
layer is reached.
Backpropagation
Loss Calculation: The network's output is evaluated against the real target
values, and a loss function is used to compute the difference. For a regression
problem, the Mean Squared Error (MSE) is commonly used as the cost function.
Loss Function: for MSE, the loss is the average of the squared differences between the
predicted and true values, MSE = (1/n) Σ (y_pred − y_true)².
Gradient Descent: Gradient descent is then used by the network to reduce the
loss. To lower the inaccuracy, weights are changed based on the derivative of
the loss with respect to each weight.
Adjusting weights: The weights are adjusted at each connection by applying
this iterative process, or backpropagation, backward across the network.
Training: During training with different data samples, the entire process of
forward propagation, loss calculation, and backpropagation is done iteratively,
enabling the network to adapt and learn patterns from the data.
Activation Functions: Non-linearity is introduced into the model by activation functions
like the rectified linear unit (ReLU) or sigmoid. They decide how strongly a neuron
“fires” based on its total weighted input.
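To make the forward-propagation steps above concrete, here is a minimal NumPy sketch of a
single pass through one hidden layer; the layer sizes, random weights, and dummy target are
made up for illustration:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random(3)                             # 3 input features
W1, b1 = rng.random((4, 3)), rng.random(4)    # hidden layer: 4 neurons
W2, b2 = rng.random((1, 4)), rng.random(1)    # output layer: 1 neuron
h = sigmoid(W1 @ x + b1)                      # weighted sum + activation (hidden layer)
y_pred = sigmoid(W2 @ h + b2)                 # output
loss = np.mean((y_pred - 1.0) ** 2)           # MSE against a dummy target of 1.0
print(y_pred, loss)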
Learning of a Neural Network
1. Learning with supervised learning
In supervised learning, the neural network is guided by a teacher who has access to both
input-output pairs. The network creates outputs based on inputs without taking into account
the surroundings. By comparing these outputs to the desired outputs known to the teacher, an
error signal is generated. In order to reduce errors, the network's parameters are adjusted
iteratively, stopping when performance reaches an acceptable level.
2. Learning with Unsupervised learning
Equivalent output variables are absent in unsupervised learning. Its main goal is to
comprehend incoming data's (X) underlying structure. No instructor is present to offer
advice. Modeling data patterns and relationships is the intended outcome instead. Words
like regression and classification are related to supervised learning, whereas unsupervised
learning is associated with clustering and association.
3. Learning with Reinforcement Learning
Through interaction with the environment and feedback in the form of rewards or penalties,
the network gains knowledge. Finding a policy or strategy that optimizes cumulative
rewards over time is the goal for the network. This kind is frequently utilized in gaming and
decision-making applications.
Types of Neural Networks
Several types of neural networks are commonly used, including the following.
Feedforward Networks: A feedforward neural network is a simple artificial
neural network architecture in which data moves from input to output in a single
direction. It has input, hidden, and output layers; feedback loops are absent. Its
straightforward architecture makes it appropriate for a number of applications,
such as regression and pattern recognition.
Multilayer Perceptron (MLP): MLP is a type of feedforward neural network
with three or more layers, including an input layer, one or more hidden layers,
and an output layer. It uses nonlinear activation functions.
Convolutional Neural Network (CNN): A Convolutional Neural
Network (CNN) is a specialized artificial neural network designed for image
processing. It employs convolutional layers to automatically learn hierarchical
features from input images, enabling effective image recognition and
classification. CNNs have revolutionized computer vision and are pivotal in
tasks like object detection and image analysis.
Recurrent Neural Network (RNN): An artificial neural network type intended
for sequential data processing is called a Recurrent Neural Network (RNN). It is
appropriate for applications where contextual dependencies are critical, such as
time series prediction and natural language processing, since it makes use of
feedback loops, which enable information to survive within the network.
Long Short-Term Memory (LSTM): LSTM is a type of RNN that is designed
to overcome the vanishing gradient problem in training RNNs. It uses memory
cells and gates to selectively read, write, and erase information.
DATA EXPLORATION IN PYTHON
Part 1: How to load data file(s) using Pandas?
Input data sets can be in various formats (.XLS, .TXT, .CSV, JSON). In Python, it is
easy to load data from any source, due to its simple syntax and availability of predefined
libraries, such as Pandas. Here I will make use of Pandas itself.
Pandas features a number of functions for reading tabular data as a Pandas DataFrame
object. Commonly used readers include read_csv, read_table, read_excel, read_json,
read_html, and read_sql.
These conversion operations are especially useful when you read a value from the user with
input() (raw_input() in Python 2); by default, the values are read as strings.
Convert character date to Date: There are multiple ways to do this. The simplest
would be to use the datetime library and the strptime function.
Here is the code:
from datetime import datetime
char_date = 'Apr 1 2015 1:20 PM' # creating example character date
date_obj = datetime.strptime(char_date, '%b %d %Y %I:%M%p')
print(date_obj)
Part 3: How to transpose a Data set or dataframe using Pandas?
Here, I want to transpose Table A into Table B on the variable Product. This task can be
accomplished by using Pandas dataframe.pivot:
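Tables A and B themselves are not reproduced here; a minimal sketch of the described pivot,
with made-up data:
import pandas as pd
table_a = pd.DataFrame({'Customer': ['C1', 'C1', 'C2', 'C2'],
                        'Product': ['Soap', 'Oil', 'Soap', 'Oil'],
                        'Sales': [10, 12, 20, 22]})
# Transpose on the variable Product: one column per product value
table_b = table_a.pivot(index='Customer', columns='Product', values='Sales')
print(table_b)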
Data visualization always helps to understand the data easily. Python has libraries
like matplotlib and seaborn to create multiple graphs effectively. Let's look at some of them.
Histogram:
#Plot Histogram
import matplotlib.pyplot as plt
import pandas as pd
df=pd.read_excel("E:/First.xlsx", "Sheet1")
#Plots in matplotlib reside within a figure object, use plt.figure to create new figure
fig=plt.figure()
#Create one or more subplots using add_subplot, because you can't create blank figure
ax = fig.add_subplot(1,1,1)
#Variable
ax.hist(df['Age'],bins = 5)
#Labels and Title
plt.title('Age distribution')
plt.xlabel('Age')
plt.ylabel('#Employee')
plt.show()
Output
Scatter plot:
#Plots in matplotlib reside within a figure object, use plt.figure to create new figure
fig=plt.figure()
#Create one or more subplots using add_subplot, because you can't create blank figure
ax = fig.add_subplot(1,1,1)
#Variable
ax.scatter(df['Age'],df['Sales'])
#Labels and Title
plt.title('Sales and Age distribution')
plt.xlabel('Age')
plt.ylabel('Sales')
plt.show()
Output
Box-plot:
import seaborn as sns
sns.boxplot(df['Age'])
sns.despine()
Output
import pandas as pd
df=pd.read_excel("E:/First.xlsx", "Sheet1")
print(df)
test= df.groupby(['Gender','BMI'])
test.size()
Output
Part 10: How to recognize and Treat missing values and outliers in Pandas?
To identify missing values, we can use dataframe.isnull(). You can also refer to the article
“Data Munging in Python (using Pandas)”, where a case study on recognizing and treating
missing and outlier values is presented.
# Identify missing values of dataframe
df.isnull()
Output
To treat missing values, there are various imputation methods available. You can refer to
these articles for methods to detect outlier and missing values. Imputation methods for
both missing and outlier values are broadly similar; here we discuss only the general approach.
Summary statistics in R
> data1
[1] 3 5 7 5 3 2 6 8 5 6 9
> quantile(data1)
0% 25% 50% 75% 100%
2.0 4.0 5.0 6.5 9.0
The functions operate over a one-dimensional object (called a vector in R-speak). If your data
are a column of a larger dataset then you'll need to use attach() or the $ so that R can “find”
your data.
> mf
Length Speed Algae NO3 BOD
1 20 12 40 2.25 200
2 21 14 45 2.15 180
3 22 12 45 1.75 135
4 23 16 80 1.95 120
> mean(Speed)
Error in mean(Speed) : object 'Speed' not found
> mean(mf$Speed)
[1] 15.8
> attach(mf)
> quantile(Algae)
0% 25% 50% 75% 100%
25 40 65 75 85
> detach(mf)
If your data contain NA elements, the result may show as NA. You can overcome this in
many (but not all) commands by adding na.rm = TRUE as a parameter.
> nad
[1] NA 3 5 7 5 3 2 6 8 5 6 9 NA NA
> mean(nad)
[1] NA
> mean(nad, na.rm = TRUE)
[1] 5.363636
T-test
Student's t-test is a classic method for comparing mean values of two samples that are
normally distributed (i.e. they have a Gaussian distribution). Such samples are described as
being parametric and the t-test is a parametric test. In R the t.test() command will carry out
several versions of the t-test. The basic form is: t.test(x, y, alternative, mu, paired, var.equal, …)
x – a numeric sample.
y – a second numeric sample (if this is missing the command carries out a 1-sample
test).
alternative – how to compare means, the default is “two.sided”. You can also specify
“less” or “greater”.
mu – the true value of the mean (or mean difference). The default is 0.
paired – the default is paired = FALSE. This assumes independent samples. The
alternative paired = TRUE is used for matched pair tests.
var.equal – the default is var.equal = FALSE. This treats the variance of the two samples
separately. If you set var.equal = TRUE you conduct a classic t-test using pooled
variance.
… – there are additional parameters that we aren't concerned with here.
In most cases you want to compare two independent samples:
> mow
[1] 12 15 17 11 15
> unmow
[1] 8 9 7 9
With only one sample the command carries out a one-sample test against the value of mu:
> t.test(data1, mu = 5)
One Sample t-test
data: data1
t = 0.55902, df = 10, p-value = 0.5884
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
3.914249 6.813024
sample estimates:
mean of x
5.363636
If you have matched pair data you can specify paired = TRUE as a parameter (see more in
14.4).
U-test
The U-test is used for comparing the median values of two samples. You use it when the data
are not normally distributed, so it is described as a non-parametric test. The U-test is often
called the Mann-Whitney U-test but is generally attributed to Wilcoxon (Wilcoxon Rank Sum
test), hence in R the command is wilcox.test(). The basic form is: wilcox.test(x, y, alternative,
mu, paired, …)
x – a numeric sample.
y – a second numeric sample (if this is missing the command carries out a 1-sample
test).
alternative – how to compare means, the default is “two.sided”. You can also specify
“less” or “greater”.
mu – the true value of the median (or median difference). The default is 0.
paired – the default is paired = FALSE. This assumes independent samples. The
alternative paired = TRUE is used for matched pair tests.
… – there are additional parameters that we aren't concerned with here.
> wilcox.test(Grass, Heath)
The t-test and the U-test can both be used when your data are in matched pairs. Sometimes
this kind of test is also called a repeated measures test (depending on circumstance). You can
run the test by adding paired = TRUE to the appropriate command.
Here is an example where the data show the effectiveness of greenhouse sticky traps in
catching whitefly. Each trap has a white side and a yellow side. To compare white and yellow
we can use a matched pair.
> mpd
white yellow
1 4 4
2 3 7
3 4 2
4 1 2
5 6 7
6 4 10
7 6 5
8 4 8
> attach(mpd)
> t.test(white, yellow, paired = TRUE, var.equal = TRUE)
Paired t-test
Chi-squared test
Tests for association are easily carried out using the chisq.test() command. Your data need to
be arranged as a contingency table. Here is an example:
> bird
Garden Hedgerow Parkland Pasture Woodland
Blackbird 47 10 40 2 2
Chaffinch 19 3 5 0 2
Great Tit 50 0 10 7 0
House Sparrow 46 16 8 4 0
Robin 9 3 0 0 2
Song Thrush 4 0 6 0 0
In this dataset the columns form one set of categories (habitats) and the rows form another set
(bird species). In the original spreadsheet (CSV file) the first column contains the bird species
names; these are “converted” to row names when the data are imported:
> cs = chisq.test(bird)
Warning message:
In chisq.test(bird) : Chi-squared approximation may be incorrect
> cs
In this instance you get a warning message (this is because there are expected values < 5).
The basic “result” shows the overall significance but there are other components that may
prove useful:
> names(cs)
[1] “statistic” “parameter” “p.value” “method” “data.name” “observed”
[7] “expected” “residuals” “stdres”
You can view the components using the $ like so:
> cs$expected
Garden Hedgerow Parkland Pasture Woodland
Blackbird 59.915254 10.955932 23.623729 4.4508475 2.0542373
Chaffinch 17.203390 3.145763 6.783051 1.2779661 0.5898305
Great Tit 39.745763 7.267797 15.671186 2.9525424 1.3627119
House Sparrow 43.898305 8.027119 17.308475 3.2610169 1.5050847
Robin 8.305085 1.518644 3.274576 0.6169492 0.2847458
Song Thrush 5.932203 1.084746 2.338983 0.4406780 0.2033898
Now you can see the expected values. Other useful components are $residuals and $stdres,
which are the Pearson residuals and the standardized residuals respectively.
You can also use square brackets to get part of the result e.g.
> cs$stdres[1, ]
Garden Hedgerow Parkland Pasture Woodland
-3.2260031 -0.3771774 4.7468743 -1.4651818 -0.0471460
Now you can see the standardized residuals for the Blackbird row.
Yates' correction
A goodness of fit test is a special kind of test of association. You use it when you have a set
of data in categories that you want to compare to a “known” standard. A classic example is in
genetics, where the theory suggests a particular ratio of phenotypes:
> pea
[1] 116 40 31 13
> ratio
[1] 9 3 3 1
Here you have counts of pea plants (there are 4 phenotypes) from an experiment in cross-
pollination. The genetic theory suggests that your results should be in the ratio 9:3:3:1. Are
these results in that ratio? The goodness of fit test will tell you.
> gfit = chisq.test(pea, p = ratio, rescale.p = TRUE)
The result contains the same components as the regular chi squared test:
> gfit$expected
[1] 112.5 37.5 37.5 12.5
> gfit$stdres
[1] 0.4988877 0.4529108 -1.1775681 0.1460593
Data visualization is the practice of translating data into visual contexts, such as a map
or graph, to make data easier for the human brain to understand and to draw
insights from. The main goal of data visualization is to make it easier to identify
patterns, trends, and outliers in large data sets. The term is often used interchangeably
with others, including information graphics, information visualization, and statistical graphics.
1. Vertical bar charts (standing columns)
2. Horizontal bar charts
3. Heat maps
Heat maps represent individual values from a set of data in a matrix
using color variation or color intensity. They often use color to help viewers
compare and contrast data across two distinct categories. They are useful for viewing
web pages, where the areas most users encounter are represented by “hot” colors,
and pages that receive the fewest clicks are displayed in “cold” colors. There are
many more chart types besides these.
Installation of Scikit- learn
The latest version of Scikit-learn is 1.1 and it requires Python 3.8 or newer.
Scikit-learn requires:
NumPy
SciPy as its dependencies.
Before installing scikit-learn, ensure that you have NumPy and SciPy installed. Once you
have a working installation of NumPy and SciPy, the easiest way to install scikit-learn is
using pip:
!pip install -U scikit-learn
Let us get started with the modeling process now.
Step 1: Load a Dataset
A dataset is nothing but a collection of data. A dataset generally has two main components:
Features: (also known as predictors, inputs, or attributes) they are simply the
variables of our data. They can be more than one and hence represented by
a feature matrix ('X' is a common notation to represent feature matrix). A list
of all the feature names is termed feature names.
Response: (also known as the target, label, or output) This is the output variable
depending on the feature variables. We generally have a single response column
and it is represented by a response vector ('y' is a common notation to
represent response vector). All the possible values taken by a response vector are
termed target names.
Loading exemplar dataset: scikit-learn comes loaded with a few example datasets like the
iris and digits datasets for classification and the boston house prices dataset for regression.
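A minimal sketch of loading one of these bundled datasets and inspecting its feature names
and target names:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data                  # feature matrix
y = iris.target                # response vector
print(iris.feature_names)      # list of feature names
print(iris.target_names)       # possible target names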
Loading external dataset: Now, consider the case when we want to load an external
dataset. For this purpose, we can use the pandas library for easily loading and
manipulating datasets.
To install pandas, use the following pip command:
! pip install pandas
In pandas, important data types are:
Series: Series is a one-dimensional labeled array capable of holding any data
type.
DataFrame: A DataFrame is a 2-dimensional labeled data structure with columns of
potentially different types. You can think of it like a spreadsheet or SQL table,
or a dict of Series objects. It is generally the most commonly used pandas object.
Step 2: Splitting the Dataset
One important aspect of all machine learning models is to determine their accuracy. Now,
in order to determine their accuracy, one can train the model using the given dataset and
then predict the response values for the same dataset using that model and hence, find the
accuracy of the model.
But this method has several flaws in it, like:
The goal is to estimate the likely performance of a model on out-of-sample data.
Maximizing training accuracy rewards overly complex models that won't
necessarily generalize beyond the training data.
Unnecessarily complex models may over-fit the training data.
A better option is to split our data into two parts: the first one for training our machine
learning model, and the second one for testing our model.
Advantages of train/test split
The model can be trained on one part of the data and tested on a different part that
was not used for training.
Response values are known for the test dataset; hence predictions can be
evaluated.
Testing accuracy is a better estimate than training accuracy of out-of-sample
performance.
Step 3: Training the Model
Now, it's time to train some prediction models using our dataset. Scikit-learn provides a
wide range of machine learning algorithms that have a unified/consistent interface for
fitting, predicting accuracy, etc.
The example given below uses the KNN (K nearest neighbors) classifier.
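The listing itself does not appear above; a minimal, self-contained sketch of the step using the
bundled iris data and a KNN classifier:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
knn = KNeighborsClassifier(n_neighbors=3)   # same fit/predict interface as other estimators
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))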
Features of Scikit-learn
Simple and efficient tools for data mining and data analysis. It features various
classification, regression, and clustering algorithms including support vector
machines, random forests, gradient boosting, k-means, etc.
Accessible to everybody and reusable in various contexts.
Built on top of NumPy, SciPy, and matplotlib.
Open source, commercially usable – BSD license.
Benefits of using Scikit-learn Libraries
Consistent interface to machine learning models
Provides many tuning parameters but with sensible defaults.
Exceptional documentation
Rich set of functionalities for companion tasks.
Active community for development and support.