Python for Data Science
pandas
Dr. Ilkay Altintas and Dr. Leo Porter
Twitter: #UCSDpython4DS
By the end of this video, you should be able to:
§ Describe the value of Pandas to data science in Python
§ Highlight the key data structures of Pandas
§ Discuss the capabilities of Pandas that have resulted in its
widespread adoption as a tool for analytics
pandas Benefits
• Data variety support
• Data integration
• Data transformation
• Descriptive statistics
• Support for time-series data
• Visualizations
Image Source: http://www.kdnuggets.com/wp-content/uploads/data-variety.png
pandas Data Structures
• pandas Series
• pandas DataFrame
pandas Series
• A 1-dimensional labeled array
• Supports many data types
• Axis labels are collectively called the index
• Get and set values by index label
• Valid argument to most NumPy methods
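The Series properties listed above can be sketched as follows. The labels and values are hypothetical and only serve to illustrate label-based access and NumPy interoperability.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three values labeled by day (not from the lecture).
s = pd.Series([72.5, 68.0, 80.25], index=["mon", "tue", "wed"])

first = s["mon"]      # get a value by its index label
s["tue"] = 70.0       # set a value by its index label
roots = np.sqrt(s)    # NumPy functions accept a Series and keep its labels
```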
pandas DataFrame
• A 2-dimensional labeled data structure
• Can be thought of as a dictionary of Series objects
• Columns can be of potentially different types
• Optional parameters for fine-tuning:
• index (row labels)
• columns (column labels)
Pandas provides many constructors to create DataFrames!
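As a minimal sketch, two of those constructors are shown here: one builds a DataFrame from a dictionary of Series, the other from a nested list with explicit index and columns. All names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical columns; each Series becomes one column of the DataFrame.
df = pd.DataFrame(
    {
        "title": pd.Series(["Heat", "Alien", "Up"]),
        "year": pd.Series([1995, 1979, 2009]),
        "rating": pd.Series([4.0, 4.5, 3.5]),
    }
)

# The optional index and columns parameters label rows and columns explicitly.
grid = pd.DataFrame(
    [[1, 2], [3, 4]],
    index=["row_a", "row_b"],
    columns=["col_x", "col_y"],
)
```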
Summary
Pandas supports all steps of the data science pipeline:
Import → Transform → Visualize
Python for Data Science
pandas:
Data Ingestion
Dr. Ilkay Altintas and Dr. Leo Porter
Twitter: #UCSDpython4DS
By the end of this video, you should be able to:
§ Describe the efficient, easy-to-use methods
that pandas provides for importing data into
memory
§ Identify functions such as ‘read_csv’ for reading a
CSV file into a DataFrame
§ Discuss other data sources that pandas can
import from directly
Data Ingestion
read_csv
• Input: Path to a comma-separated values (CSV) file
• Output: Pandas DataFrame object containing the contents of
the file
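A minimal sketch of read_csv follows. To keep it self-contained, an in-memory buffer stands in for a file path; the CSV content is hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a path such as "movies.csv".
csv_text = "movieId,title,rating\n1,Heat,4.0\n2,Alien,4.5\n"

movies = pd.read_csv(io.StringIO(csv_text))  # header row becomes the column labels
```

With a real file, the call would typically be pd.read_csv followed by the path to that file.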
read_json
• Input: Path to a JSON file or a valid JSON string
• Output: Pandas DataFrame or Series object containing
the contents
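A short sketch of both output types, assuming hypothetical JSON text wrapped in a buffer (recent pandas versions prefer a file-like object over a raw string literal):

```python
import io
import pandas as pd

# Hypothetical JSON: a list of records becomes a DataFrame.
json_text = '[{"movieId": 1, "rating": 4.0}, {"movieId": 2, "rating": 4.5}]'
ratings = pd.read_json(io.StringIO(json_text))

# A flat JSON object can be read as a Series by passing typ="series".
series_text = '{"a": 1, "b": 2}'
counts = pd.read_json(io.StringIO(series_text), typ="series")
```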
read_html
• Input: A URL, a file, or a raw HTML string
• Output: A list of Pandas DataFrames
read_sql_query
• Input 1: SQL query
• Input 2: Database connection
• Output: Pandas DataFrame object containing the result of
the query
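A minimal sketch of read_sql_query, assuming an in-memory SQLite database with a hypothetical ratings table (the table and column names are illustrative only):

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory database with a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (movie_id INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [(1, 4.0), (1, 3.0), (2, 4.5)],
)
conn.commit()

# Input 1 is the SQL query, input 2 is the database connection.
high = pd.read_sql_query(
    "SELECT movie_id, rating FROM ratings WHERE rating >= 4.0", conn
)
conn.close()
```

Note that read_sql_table, unlike read_sql_query, does not accept a plain DBAPI connection such as the sqlite3 one used here; it needs an SQLAlchemy connectable.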
Image Source: http://www.sqa.org.uk/e-learning/SQLIntro01CD/images/pic024.jpg
read_sql_table
• Input 1: Name of the SQL table in the database
• Input 2: Database connection
• Output: Pandas DataFrame object
containing the contents of the table
Image Source: http://www.w3processing.com/SQL/images/SQL002.png
Summary
• There are many other methods available in Pandas to ingest data:
• Google BigQuery
• SAS files
• Excel tables
• Clipboard contents
• Pickle files
• http://pandas.pydata.org/pandas-docs/stable/api.html#input-output
Python for Data Science
pandas:
Descriptive Statistics
Dr. Ilkay Altintas and Dr. Leo Porter
Twitter: #UCSDpython4DS
By the end of this video, you should be able to:
§ Describe the capabilities of Pandas for performing
statistical analysis on data
§ Leverage frequently used functions such as describe()
§ Explore other statistical functions in Pandas, which is
constantly evolving
describe()
• Syntax: data_frame.describe()
• Output: Summary statistics of the DataFrame's numeric columns (count, mean, std, min, quartiles, max)
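As a quick sketch with hypothetical numeric data, describe() returns a DataFrame whose rows are the individual statistics:

```python
import pandas as pd

# Hypothetical numeric columns.
df = pd.DataFrame({"rating": [4.0, 3.0, 4.5, 2.5], "votes": [10, 20, 30, 40]})

summary = df.describe()  # one row per statistic, one column per numeric column
mean_rating = summary.loc["mean", "rating"]
```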
corr()
• Syntax: data_frame.corr()
• Computes the pairwise Pearson correlation coefficient (ρ) of columns
• ρ = covariance of the two columns divided by the product of their standard deviations
• Other coefficients available: Kendall, Spearman
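The relationship between correlation, covariance, and standard deviation can be checked with a small sketch. The data is hypothetical and chosen so that the correlations are exactly +1 and -1.

```python
import pandas as pd

# Hypothetical columns: y rises with x, z falls with x.
df = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0], "z": [4.0, 3.0, 2.0, 1.0]}
)

pearson = df.corr()  # pairwise Pearson coefficients (the default method)

# Pearson's rho computed by hand: covariance over the product of standard deviations.
rho_xy = df["x"].cov(df["y"]) / (df["x"].std() * df["y"].std())
```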
func = min(), max(), mode(), median()
• The general syntax for calling these functions is
• data_frame.func()
• Frequently used optional parameter:
• axis = 0 (compute down the rows, one result per column) or 1 (compute across the columns, one result per row)
mean()
• Syntax: data_frame.mean(axis={0 or 1})
• Axis = 0 : Index (one mean per column)
• Axis = 1 : Columns (one mean per row)
• Output: Series or DataFrame with the mean values
std()
• Syntax: data_frame.std(axis={0 or 1})
• Axis = 0 : Index (one value per column)
• Axis = 1 : Columns (one value per row)
• Output: Series or DataFrame with the standard deviation values
• Normalized by N-1 by default (ddof=1)
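The effect of the axis parameter on mean() and std() can be sketched with a small hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical data: two columns, three rows.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 6.0, 8.0]})

col_means = df.mean(axis=0)  # one mean per column: a -> 2.0, b -> 6.0
row_means = df.mean(axis=1)  # one mean per row: 2.5, 4.0, 5.5
col_std = df.std(axis=0)     # sample standard deviation (N-1): a -> 1.0, b -> 2.0
col_max = df.max()           # min(), max(), mode(), median() follow the same pattern
```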
any()
• Output: Returns whether ANY element is True
• Benefits:
• Can quickly detect whether any cell matches a condition
all()
• Output: Returns whether ALL elements are True
• Benefits:
• Can detect if a column or row matches a condition very quickly
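A brief sketch of any() and all() applied to a boolean condition, using hypothetical data that contains one invalid (negative) age:

```python
import pandas as pd

# Hypothetical data with one invalid age.
df = pd.DataFrame({"age": [25, -3, 40], "score": [88, 92, 79]})

negative_by_column = (df < 0).any()      # per column: does any cell match?
any_negative = (df < 0).any().any()      # single answer for the whole DataFrame
scores_all_positive = (df["score"] > 0).all()
```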
Summary
• Some other functions that are worth exploring:
• count()
• clip()
• rank()
• round()
• http://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-stats
Python for Data Science
pandas:
Data Cleaning
Dr. Ilkay Altintas and Dr. Leo Porter
Twitter: #UCSDpython4DS
By the end of this video, you should be able to:
§ Explain why there is a need to clean data
§ Describe data cleaning as an activity
§ Leverage key methods pandas provides for
data cleaning
Real-world data is messy!
• Missing values
• Outliers in the data
• Invalid data (e.g. negative values for age)
• NaN value (np.nan)
• None value
Handling Data Quality Issues
• Replace the value
• Fill gaps forward / backward
• Drop fields
• Interpolation
df.replace()
• Example values shown on the slide: 9999.0000 and 0.0000
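The code on this slide did not survive extraction. The sketch below assumes that 9999.0 is a placeholder for missing readings and that it is replaced with 0.0; the column names and data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical readings where 9999.0 marks a missing measurement (an assumption).
df = pd.DataFrame({"temp": [21.5, 9999.0, 19.0], "wind": [3.2, 4.1, 9999.0]})

cleaned = df.replace(9999.0, 0.0)        # swap the placeholder for 0.0
as_missing = df.replace(9999.0, np.nan)  # or mark it as missing instead
```

Replacing the placeholder with NaN rather than 0 keeps it out of later statistics such as mean().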
Fill missing data gaps forward and backward
http://pandas.pydata.org/pandas-docs/stable/missing_data.html
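Forward and backward filling can be sketched on a hypothetical Series with a two-value gap:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a gap in the middle.
s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()   # carry the last valid value forward
backward = s.bfill()  # pull the next valid value backward
```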
Drop fields using dropna()
• axis=0: drop rows that contain missing values
• axis=1: drop columns that contain missing values
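The two axis settings of dropna() behave as follows on a small hypothetical DataFrame with one missing cell:

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "a" has one missing value.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

rows_kept = df.dropna(axis=0)  # drops the row that contains NaN
cols_kept = df.dropna(axis=1)  # drops the column that contains NaN
```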
Perform linear interpolation
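A minimal sketch of linear interpolation, assuming equally spaced observations in a hypothetical Series:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two missing values between 10 and 40.
s = pd.Series([10.0, np.nan, np.nan, 40.0])

filled = s.interpolate(method="linear")  # fills the gap with evenly spaced values
```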
Summary
• There are many other ways to transform missing data:
• Using ‘polynomial’ interpolation
• Using Regular Expressions for replacement
• More : http://pandas.pydata.org/pandas-docs/version/0.15.2/missing_data.html#numeric-replacement
Python for Data Science
pandas:
Data Visualization
Dr. Ilkay Altintas and Dr. Leo Porter
Twitter: #UCSDpython4DS
By the end of this video, you should be able to:
§ Identify key plotting functions of Pandas
§ Recognize how easy the native Pandas plotting
methods are to use (e.g., with DataFrames)
DataFrame
df.plot.bar()
df.plot.box()
df.plot.hist()
df.plot()
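The four plotting calls named above can be sketched as follows. This assumes matplotlib is installed, since the pandas plotting methods depend on it, and the DataFrame is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Hypothetical average ratings per year for two genres.
df = pd.DataFrame(
    {"drama": [3.5, 4.0, 4.5, 3.0], "comedy": [3.0, 3.5, 2.5, 4.0]},
    index=[2012, 2013, 2014, 2015],
)

ax_bar = df.plot.bar()    # one group of bars per row
ax_box = df.plot.box()    # one box per column
ax_hist = df.plot.hist()  # overlaid histograms of each column
ax_line = df.plot()       # default: one line per column
```

Each call returns a matplotlib Axes object, which can then be used to set titles, labels, or axis limits.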
Summary
Explore here: http://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-plotting
Python for Data Science
pandas:
Frequent Data Operations
Dr. Ilkay Altintas and Dr. Leo Porter
Twitter: #UCSDpython4DS
By the end of this video, you should be able to:
§ Select data (rows or columns) in a DataFrame using
Pandas methods
§ Add or delete rows or columns in a DataFrame
§ Perform aggregation operations (group by)
Slice Out Columns
Filter Out Rows
Insert New Column
Add a New Row
Delete a Row
Delete a Column
Group By and Aggregate
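The code for these slides was not preserved, so the following is only a sketch of one common way to perform each operation. The DataFrame, its column names, and the values are hypothetical.

```python
import pandas as pd

# Hypothetical movie table.
df = pd.DataFrame(
    {
        "title": ["Heat", "Alien", "Solaris"],
        "genre": ["Crime", "SciFi", "SciFi"],
        "rating": [4.0, 4.5, 3.5],
    }
)

# Slice out columns: pass a list of column labels.
subset = df[["title", "rating"]]

# Filter out rows: index with a boolean condition.
good = df[df["rating"] >= 4.0]

# Insert a new column: assign to a new column label.
df["rating_pct"] = df["rating"] / 5.0 * 100

# Add a new row: concatenate a one-row DataFrame.
new_row = pd.DataFrame(
    [{"title": "Up", "genre": "Animation", "rating": 3.5, "rating_pct": 70.0}]
)
df = pd.concat([df, new_row], ignore_index=True)

# Delete a row by its index label, then delete a column by name.
df = df.drop(index=0)
df = df.drop(columns=["rating_pct"])

# Group by and aggregate: mean rating per genre.
mean_by_genre = df.groupby("genre")["rating"].mean()
```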
Summary
We saw only a subset of the available transformations; there is more to explore here:
http://pandas.pydata.org/pandas-docs/stable/api.html
Python for Data Science
pandas:
Merging DataFrames
Dr. Ilkay Altintas and Dr. Leo Porter
Twitter: #UCSDpython4DS
By the end of this video, you should be able to:
§ Explain that data is usually distributed
across different locations and tables
§ Combine data from distinct DataFrames
to obtain the big picture
§ Distinguish among different ways to
combine data sets
Example DataFrames: left and right
pandas.concat(): Stack DataFrames
Inner Join using pandas.concat()
Stack DataFrames using append() (deprecated in pandas 1.4 and removed in pandas 2.0; use pandas.concat() instead)
Inner Join using merge()
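The example DataFrames from the slides were not preserved. The sketch below uses hypothetical left and right tables sharing a key column to contrast stacking, an inner join with concat(), and an inner join with merge().

```python
import pandas as pd

# Hypothetical tables; keys K1 and K2 appear in both.
left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "B": ["B1", "B2", "B3"]})

# Stack the rows; columns missing from one table are filled with NaN.
stacked = pd.concat([left, right], ignore_index=True)

# Inner join with concat: align on the index and keep only shared labels.
inner_concat = pd.concat(
    [left.set_index("key"), right.set_index("key")], axis=1, join="inner"
)

# Inner join with merge: match rows on the key column.
merged = pd.merge(left, right, how="inner", on="key")
```

merge() matches on column values, while concat() with join="inner" aligns on index labels, which is why the key column is moved into the index first.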
Summary
More adventure:
http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
Python for Data Science
pandas:
Frequent String Operations
Dr. Ilkay Altintas and Dr. Leo Porter
Twitter: #UCSDpython4DS
By the end of this video, you should be able to:
§ Describe what operations the string methods can perform
§ Navigate your way to find the right string method for you
§ Perform basic string operations in Pandas
str.split()
str.contains()
str.replace()
str.extract() – Returns first match found
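The four string methods can be sketched on a hypothetical Series of movie titles:

```python
import pandas as pd

# Hypothetical titles with the release year in parentheses.
s = pd.Series(["Toy Story (1995)", "Heat (1995)", "Alien (1979)"])

parts = s.str.split(" ")                                  # list of tokens per element
has_story = s.str.contains("Story")                       # boolean mask
renamed = s.str.replace("Alien", "Aliens", regex=False)   # literal replacement
year = s.str.extract(r"\((\d{4})\)", expand=False)        # first match of the capture group
```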
Summary
Explore more :
http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods
Python for Data Science
pandas:
Parsing Timestamps
Dr. Ilkay Altintas and Dr. Leo Porter
Twitter: #UCSDpython4DS
By the end of this video, you should be able to:
§ Explain what Unix time / POSIX time / epoch time is
§ Describe data types for datetime
§ Select rows based on time stamps
§ Sort tables in chronological order
Unix time / POSIX time / epoch time
• Number of seconds elapsed since 00:00:00 Coordinated Universal Time (UTC),
Thursday, 1 January 1970
• Prominent in Unix-like systems
• Parsing timestamps: We have to read the POSIX time and work out what the
exact time stamp was
Data Types for Timestamps
• Generic data type: datetime64[ns]
• Convert int64 timestamp to <M8[ns] (little-endian) or >M8[ns] (big-endian), depending on your
machine
Convert Timestamp to Python Format
to_datetime()
Select Rows Based on Timestamps
Sort Tables in Chronological Order
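Converting, selecting, and sorting can be sketched together. The table is hypothetical, and the timestamps are assumed to be POSIX seconds, hence unit="s".

```python
import pandas as pd

# Hypothetical ratings table with POSIX timestamps in seconds.
df = pd.DataFrame(
    {"movieId": [1, 2, 3], "timestamp": [1256677221, 946684800, 1112486027]}
)

# Convert int64 seconds since the epoch to a datetime column.
df["parsed_time"] = pd.to_datetime(df["timestamp"], unit="s")

# Select rows using a human-readable date.
recent = df[df["parsed_time"] > "2005-01-01"]

# Sort the table in chronological order.
ordered = df.sort_values(by="parsed_time")
```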
Summary
• POSIX / Unix time can be hard to read for users
• Converting to Python datetime format gives practical ways to:
• Select data based on human readable time stamps
• Create conditions using understandable time stamps
Python for Data Science
pandas:
Summary of Movie Rating Notebook
Dr. Ilkay Altintas and Dr. Leo Porter
Twitter: #UCSDpython4DS
• read_csv()
• Convert a CSV file to a Pandas DataFrame
• Data structures: DataFrame & Series
• describe()
• min(), max()
• std()
• mode()
• corr()
• any(), all()
• Detection:
• isnull()
• any()
• Cleaning:
• dropna()
• Inline plotting
• Histograms
• Boxplots
• Changing limits on the y-axis
• Slicing Columns
• Filtering Rows
• groupby()
• mean()
• count()
• merge()
• how='inner'
• on=keys
• str.split()
• str.contains()
• str.extract()
• POSIX Time
• to_datetime()
• Data type:
• datetime64[ns]
• Select using time
• Sort using time
Summary
• A typical data ingestion and transformation cycle
• Movies notebook as a representative example