
Python Cheat Sheet for Data Analysis

Data Loading

Read CSV dataset
# load without header
df = pd.read_csv(<CSV path>, header=None)
# load using first row as header
df = pd.read_csv(<CSV path>, header=0)

Print first few entries
# n = number of entries; default 5
df.head(n)

Print last few entries
# n = number of entries; default 5
df.tail(n)

Assign header names
df.columns = headers

Replace "?" with NaN
df = df.replace("?", np.nan)

Retrieve data types
df.dtypes

Retrieve statistical description
# default use
df.describe()
# include all attributes
df.describe(include="all")

Retrieve data set summary
df.info()

Save data frame to CSV
df.to_csv(<output CSV path>)
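As a quick illustration, here is a minimal loading-and-inspection sketch; the file name usedcars.csv and the header names are hypothetical stand-ins, not from the original sheet:

import pandas as pd
import numpy as np

# hypothetical file and headers, for illustration only
headers = ["make", "price", "horsepower"]
df = pd.read_csv("usedcars.csv", header=None)
df.columns = headers

df = df.replace("?", np.nan)   # mark unknown entries as NaN
print(df.head())               # first 5 rows
print(df.dtypes)               # column data types
df.to_csv("usedcars_clean.csv", index=False)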
Data Wrangling

Replace missing data with frequency
MostFrequentEntry = df['attribute_name'].value_counts().idxmax()
df['attribute_name'].replace(np.nan, MostFrequentEntry, inplace=True)

Replace missing data with mean
AverageValue = df['attribute'].astype(<data_type>).mean(axis=0)
df['attribute'].replace(np.nan, AverageValue, inplace=True)

Fix the data types
# data_type can be int, float, str, etc.
df[['attribute1', 'attribute2', ...]] = df[['attribute1', 'attribute2', ...]].astype('data_type')

Data normalization
df['attribute_name'] = df['attribute_name'] / df['attribute_name'].max()

Binning
# n is the number of bins needed
bins = np.linspace(min(df['attribute_name']), max(df['attribute_name']), n)
GroupNames = ['Group1', 'Group2', 'Group3', ...]
df['binned_attribute_name'] = pd.cut(df['attribute_name'], bins, labels=GroupNames, include_lowest=True)

Change column name
df.rename(columns={'old_name': 'new_name'}, inplace=True)

Indicator variables
dummy_variable = pd.get_dummies(df['attribute_name'])
df = pd.concat([df, dummy_variable], axis=1)
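Putting these steps together, a minimal wrangling sketch, assuming df is already loaded and that a numeric column named price (a hypothetical stand-in) has missing values:

import numpy as np
import pandas as pd

# 'price' is a hypothetical column, for illustration only
AverageValue = df['price'].astype('float').mean(axis=0)
df['price'].replace(np.nan, AverageValue, inplace=True)

# scale to [0, 1]
df['price'] = df['price'] / df['price'].max()

# bin into three equal-width groups (3 groups need 4 bin edges)
bins = np.linspace(min(df['price']), max(df['price']), 4)
df['price_binned'] = pd.cut(df['price'], bins, labels=['Low', 'Medium', 'High'], include_lowest=True)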
Exploratory Data Analysis

Complete data frame correlation
df.corr()

Specific attribute correlation
df[['attribute1', 'attribute2', ...]].corr()

Scatter plot
from matplotlib import pyplot as plt
plt.scatter(df['attribute_1'], df['attribute_2'])

Regression plot
import seaborn as sns
sns.regplot(x='attribute_1', y='attribute_2', data=df)

Box plot
import seaborn as sns
sns.boxplot(x='attribute_1', y='attribute_2', data=df)

Grouping by attributes
df_group = df[['attribute_1', 'attribute_2', ...]]

GroupBy statements
# Group by a single attribute
df_group = df_group.groupby(['attribute_1'], as_index=False).mean()
# Group by multiple attributes
df_group = df_group.groupby(['attribute_1', 'attribute_2'], as_index=False).mean()

Pivot tables
grouped_pivot = df_group.pivot(index='attribute_1', columns='attribute_2')

Pseudocolor plot
from matplotlib import pyplot as plt
plt.pcolor(grouped_pivot, cmap='RdBu')

Pearson coefficient and p-value
from scipy import stats
pearson_coef, p_value = stats.pearsonr(df['attribute_1'], df['attribute_2'])
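For instance, a minimal correlation check, assuming df is already loaded; engine-size and price are hypothetical column names:

from scipy import stats

# hypothetical columns, for illustration only
pearson_coef, p_value = stats.pearsonr(df['engine-size'], df['price'])
# a coefficient near +/-1 with a small p-value suggests a strong linear relationship
print(pearson_coef, p_value)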



Model Development

Linear regression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

Train linear regression model
X = df[['attribute_1', 'attribute_2', ...]]
Y = df['target_attribute']
lr.fit(X, Y)

Generate output predictions
Y_hat = lr.predict(X)

Identify the coefficient and intercept
coeff = lr.coef_
intercept = lr.intercept_

Residual plot
import seaborn as sns
sns.residplot(x=df['attribute_1'], y=df['attribute_2'])

Distribution plot
import seaborn as sns
# can include other parameters like color, label, etc.
sns.distplot(df['attribute_name'], hist=False)
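As a worked illustration, a minimal end-to-end linear regression sketch, assuming df is already loaded; highway-mpg and price are hypothetical column names:

from sklearn.linear_model import LinearRegression

# hypothetical predictor and target, for illustration only
X = df[['highway-mpg']]
Y = df['price']

lr = LinearRegression()
lr.fit(X, Y)                  # estimate intercept and slope
Y_hat = lr.predict(X)         # predicted prices
print(lr.intercept_, lr.coef_)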
Polynomial regression
# creates a polynomial model of order n
f = np.polyfit(x, y, n)
p = np.poly1d(f)
# p is the polynomial model used to generate the predicted output
Y_hat = p(x)
# Y_hat is the predicted output

Multi-variate polynomial regression
from sklearn.preprocessing import PolynomialFeatures
Z = df[['attribute_1', 'attribute_2', ...]]
pr = PolynomialFeatures(degree=n)
Z_pr = pr.fit_transform(Z)

Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
Input = [('scale', StandardScaler()),
         ('polynomial', PolynomialFeatures(include_bias=False)),
         ('model', LinearRegression())]
pipe = Pipeline(Input)
Z = Z.astype(float)
pipe.fit(Z, y)
ypipe = pipe.predict(Z)
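For example, a minimal sketch of the pipeline in use, chaining scaling, polynomial feature expansion and fitting in one call; the column names are hypothetical stand-ins:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

# hypothetical columns, for illustration only
Z = df[['horsepower', 'curb-weight']].astype(float)
y = df['price']

pipe = Pipeline([('scale', StandardScaler()),
                 ('polynomial', PolynomialFeatures(degree=2, include_bias=False)),
                 ('model', LinearRegression())])
pipe.fit(Z, y)            # all three steps run in sequence
print(pipe.score(Z, y))   # in-sample R^2 of the final model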
R2 value
# For a linear regression model
X = df[['attribute_1', 'attribute_2', ...]]
Y = df['target_attribute']
lr.fit(X, Y)
R2_score = lr.score(X, Y)

# For a polynomial regression model
from sklearn.metrics import r2_score
f = np.polyfit(x, y, n)
p = np.poly1d(f)
R2_score = r2_score(y, p(x))

MSE value
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(Y, Y_hat)

Model Evaluation and Refinement

Split data for training and testing
from sklearn.model_selection import train_test_split
y_data = df['target_attribute']
x_data = df.drop('target_attribute', axis=1)
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.10, random_state=1)

Cross-validation score
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
lre = LinearRegression()
# cv indicates the number of folds for the cross validation
Rcross = cross_val_score(lre, x_data[['attribute_1']], y_data, cv=n)
Mean = Rcross.mean()
Std_dev = Rcross.std()
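For instance, a 4-fold cross-validation sketch, assuming x_data and y_data from the split above; horsepower is a hypothetical feature name:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# hypothetical single-feature model, for illustration only
lre = LinearRegression()
Rcross = cross_val_score(lre, x_data[['horsepower']], y_data, cv=4)
# one R^2 score per fold; the spread hints at how stable the model is
print(Rcross.mean(), Rcross.std())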
Cross-validation prediction
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression
lre = LinearRegression()
yhat = cross_val_predict(lre, x_data[['attribute_1']], y_data, cv=4)

Ridge regression and prediction
from sklearn.linear_model import Ridge
pr = PolynomialFeatures(degree=2)
x_train_pr = pr.fit_transform(x_train[['attribute_1', 'attribute_2', ...]])
# apply the fitted feature mapping to the test set
x_test_pr = pr.transform(x_test[['attribute_1', 'attribute_2', ...]])
RidgeModel = Ridge(alpha=1)
RidgeModel.fit(x_train_pr, y_train)
yhat = RidgeModel.predict(x_test_pr)

Grid search
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
parameters = [{'alpha': [0.001, 0.1, 1, 10, 100, 1000, 10000, ...]}]
RR = Ridge()
Grid1 = GridSearchCV(RR, parameters, cv=4)
Grid1.fit(x_data[['attribute_1', 'attribute_2', ...]], y_data)
BestRR = Grid1.best_estimator_
BestRR.score(x_test[['attribute_1', 'attribute_2', ...]], y_test)
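A minimal sketch of the grid search in use, assuming the train/test split above; the alpha grid and the horsepower column are illustrative choices, not from the original sheet:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

# illustrative hyperparameter grid and feature
parameters = [{'alpha': [0.001, 0.1, 1, 10, 100]}]
Grid1 = GridSearchCV(Ridge(), parameters, cv=4)
Grid1.fit(x_train[['horsepower']], y_train)

print(Grid1.best_params_)                                            # best alpha found
print(Grid1.best_estimator_.score(x_test[['horsepower']], y_test))   # test R^2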

© Copyright IBM Corporation 2023. All rights reserved.
