Python Lecture 5 (2025)

The lecture covers Python programming topics including data reshaping, data cleaning, and techniques for handling missing or duplicate data. Key concepts discussed include slicing, dicing, concatenation, merging, and joining DataFrames, as well as methods for identifying and addressing data issues. Practical coding exercises are provided through associated .ipynb files on the course platform.

Lecture 5.

Python Programming

Dr. ir. Guido van Capelleveen, 10 March 2025

Today’s coding topics:
1. Reshaping data
1. Slicing and dicing
2. Concat, merge & join

2. Data cleaning
1. Missing data
2. Wrong data formats
3. Wrong data
4. Duplicate data
5. Duplicate labels

Each of the five topics has an associated .ipynb file on Canvas to practice with code.
Reshaping data
Slicing & dicing
Slicing: Slicing is a technique for selecting specific subsets of data from a DataFrame or Series object. It allows you to
extract rows and columns based on their position or labels.

Dicing: Dicing is a technique for selecting specific subsets of data from a DataFrame based on multiple dimensions or
criteria. It allows you to extract precise portions of your data for analysis or manipulation.
Slicing & dicing
These are principles commonly known in the field of data warehousing for selecting and presenting parts of data.
Slicing/dicing syntax

Slices can be made using square brackets [ ] with a start index and a stop index, separated
by a colon (:). When both values are integers, pandas assumes you want to slice by position.
Slicing/dicing on objects
Slices can be taken from many kinds of objects. For example, remember that we did the same with strings?
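
As a reminder, a minimal sketch of the start:stop syntax on a string and a list. The variable names are made up for illustration.

```python
text = "pandas"
first_three = text[0:3]   # positions 0, 1 and 2; the stop index is excluded
from_third = text[2:]     # stop left out, so the slice runs to the end

numbers = [10, 20, 30, 40, 50]
middle = numbers[1:4]     # positions 1, 2 and 3

print(first_three, from_third, middle)
```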
Syntax: default values and step size.
Let’s create a pandas Series object.

Take a slice containing only the odd values.
See how we use default values.
See how we set the step size.

Take a slice containing only the last two values.
See how we use a starting index for going backwards in the data.
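
The slide code is not reproduced here, so the following is only a sketch of what the two slices could look like, assuming a Series that holds the values 1 to 6 (the name s and its contents are assumptions).

```python
import pandas as pd

# Hypothetical Series with the default integer index 0..5
s = pd.Series([1, 2, 3, 4, 5, 6])

# Start and stop left at their defaults, step size 2: every second value
odd_values = s[::2]

# A negative start index counts backwards from the end
last_two = s[-2:]

print(odd_values.tolist())
print(last_two.tolist())
```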
Slicing syntax: using string indexes
Let’s again create a pandas Series object, but now with string indexes.

It slices where the strings are found in the index, based on the index order: alphabetical (default).
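
A small sketch of slicing by string labels, assuming a Series with the labels a to d (names and values are illustrative). Note that, unlike positional slicing, a label slice includes the stop label.

```python
import pandas as pd

# Hypothetical Series with string labels as its index
s2 = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Label-based slice: both "b" and "c" are part of the result
b_to_c = s2["b":"c"]

print(b_to_c.tolist())
```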
Syntax: specific positions .iloc[position]

Previously defined.

For a specific position, independent of the index data type (e.g., when it is a string), we can use .iloc[position].
Or we can use conditions to specify how we would like to slice.
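
Both options can be sketched as follows, again assuming an illustrative Series with string labels.

```python
import pandas as pd

# Hypothetical Series with string labels as its index
s2 = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

second = s2.iloc[1]        # value at position 1, whatever the label is
first_two = s2.iloc[0:2]   # positions 0 and 1; the stop position is excluded
large = s2[s2 > 20]        # a boolean condition keeps only the matching values

print(second, first_two.tolist(), large.tolist())
```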
Syntax: DataFrames instead of Series
DataFrames have two dimensions: rows and columns.
Thus, we can also set selection criteria on
each of the dimensions for slicing and dicing.

Previously defined.

[Code annotations: in each selection, the first criterion selects the rows and the second selects the columns.]
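
A sketch of row and column selection on a DataFrame. The frame, its labels and its values are made up for illustration. The part before the comma selects rows, the part after it selects columns.

```python
import pandas as pd

# Hypothetical DataFrame with string row labels
df = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cem", "Dee"],
        "age": [31, 45, 27, 52],
        "city": ["Ede", "Goes", "Ede", "Oss"],
    },
    index=["r1", "r2", "r3", "r4"],
)

# Slicing by label: rows r1 up to and including r2, two named columns
by_label = df.loc["r1":"r2", ["name", "age"]]

# Slicing by position: first two rows, first two columns
by_position = df.iloc[0:2, 0:2]

# Dicing: several criteria on the rows, plus a column selection
diced = df.loc[(df["age"] > 30) & (df["city"] == "Ede"), ["name"]]

print(by_label)
print(diced)
```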

Concat, merge & join
Concatenation fundamentals
Concatenation: Concatenation in pandas is a powerful method for combining multiple DataFrames or Series objects into
a single data structure.

[Figure: two ways to concatenate. (1) Glue by rows: A and B are stacked on top of each other. (2) Glue by columns: C and D are placed side by side.]

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
Concatenation of dataframes (syntax)
In these three DataFrames (df4, df5, df6) we
have data with 4 keys (A, B, C, D) but with
different index positions.

Concat the DataFrames.

[Code annotations: Key, Values, Index (positions)]
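
The actual contents of df4, df5 and df6 are not shown here, so the frames below are stand-ins built by a hypothetical helper, make_frame. The sketch shows gluing by rows with pd.concat().

```python
import pandas as pd


def make_frame(rows):
    """Hypothetical helper: columns A-D, values encode the column and row."""
    return pd.DataFrame({k: [f"{k}{i}" for i in rows] for k in "ABCD"}, index=rows)


# Same keys (columns), different index positions
df4 = make_frame([0, 1])
df5 = make_frame([2, 3])
df6 = make_frame([4, 5])

# Default axis=0: the frames are stacked on top of each other
stacked = pd.concat([df4, df5, df6])

print(stacked)
```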
Concatenation (glue by column, bad example)
To concatenate by columns, matching the rows on their index, we use the parameter axis=1.

Previously defined.

Notice how the columns are used to concatenate.

Since these DataFrames contain columns
with equal indexes, nothing gets ‘glued’ as in
method 2, i.e. ‘glue by columns’.
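
One way this can go wrong, sketched under the assumption that the frames share their column names but have no row index values in common (the frames are illustrative stand-ins): with axis=1 no rows line up, so the result is mostly NaN.

```python
import pandas as pd

# Hypothetical frames: same column names, non-overlapping row indexes
df4 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
df5 = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]}, index=[2, 3])

# Glue by columns: rows are matched on their index
wide = pd.concat([df4, df5], axis=1)

# Count the cells that could not be filled
n_missing = int(wide.isna().sum().sum())

print(wide)
print(n_missing)
```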
Concatenation (glue a column, good example)

Previously defined.

When one DataFrame has more index values than the other, NaN (not a number) values will arise.
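
A sketch of gluing a column on, assuming two illustrative frames with different columns and a partly shared index. The row that exists in only one frame gets NaN in the other frame's column.

```python
import pandas as pd

# Hypothetical frames: different columns, partly overlapping index
left = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=[0, 1, 2])
right = pd.DataFrame({"E": ["E0", "E1"]}, index=[0, 1])

# Rows 0 and 1 are matched; row 2 has no partner in `right`
glued = pd.concat([left, right], axis=1)

print(glued)
```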
Joins
'Inner join': Returns only the rows that have matching values in both
DataFrames.
'Outer joins': Returns all rows from both DataFrames. If there is no
match, the result will contain NaN values.
• Left (outer) join: Returns all rows from the left DataFrame and
the matching rows from the right DataFrame. If there is no match,
the result will contain NaN values.
• Right (outer) join: Similar to a left join but returns all rows from
the right DataFrame and matching rows from the left DataFrame.
• Full (outer) join: A full outer join is equivalent to an outer join in
pandas. It includes all rows from both DataFrames, filling in missing
values with NaN where there is no match.

Joining is a principle that we know from relational databases.

It joins data based on ‘equal’ values being present in both data sets.

Join using the concat method
Joining principles are available in the concat() method.

Previously defined.

It thus limits the result to only the rows whose index occurs in both A (df4) and B (df8).
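
A sketch of the join parameter of concat(). The contents of df4 and df8 below are assumptions chosen so that only part of the index overlaps.

```python
import pandas as pd

# Hypothetical stand-ins for A (df4) and B (df8)
df4 = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=[0, 1, 2])
df8 = pd.DataFrame({"E": ["E1", "E2", "E3"]}, index=[1, 2, 3])

# Inner join: keep only index values present in both frames
inner = pd.concat([df4, df8], axis=1, join="inner")

# Outer join (the default): keep all index values, fill gaps with NaN
outer = pd.concat([df4, df8], axis=1, join="outer")

print(inner)
print(outer)
```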

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html
Join a dataframe (using a key)
Instead of using the index or columns, we can also glue the two data sets by using a key. For example, join data from A
and B based on the client_id key in each data set. We use the merge() method, with the argument ‘on’ to set the key
we want to use as our inner joining criterion.

It now limits the result to the rows for which there is data about key1 in both data sets A (left) and B (right).
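
A sketch of an inner join on a key with merge(). The two data sets and their values are invented. Only client_id is taken from the text.

```python
import pandas as pd

# Hypothetical data sets A (left) and B (right) sharing a client_id column
clients = pd.DataFrame({"client_id": [1, 2, 3], "name": ["Ann", "Bob", "Cem"]})
orders = pd.DataFrame({"client_id": [2, 3, 4], "amount": [250, 120, 80]})

# how="inner" is the default: keep only keys present in both data sets
joined = pd.merge(clients, orders, on="client_id")

print(joined)
```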
Joining (multiple keys?, join method?)
We can of course set multiple keys in the join.

Furthermore, we can set the join method.
We can set keys even if they have different names in the two data sets.
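
The three options can be sketched with merge() as follows. The frames and the column names key1, key2, k1 and k2 are assumptions for illustration.

```python
import pandas as pd

left = pd.DataFrame(
    {"key1": ["K0", "K0", "K1"], "key2": ["K0", "K1", "K0"], "A": [1, 2, 3]}
)
right = pd.DataFrame(
    {"key1": ["K0", "K1", "K2"], "key2": ["K0", "K0", "K0"], "B": [10, 20, 30]}
)

# Multiple keys: rows match only if both key columns are equal
both_keys = pd.merge(left, right, on=["key1", "key2"])

# Join method: keep every row of the left frame
left_join = pd.merge(left, right, on=["key1", "key2"], how="left")

# Keys with different names in the two data sets
right_renamed = right.rename(columns={"key1": "k1", "key2": "k2"})
diff_names = pd.merge(
    left, right_renamed, left_on=["key1", "key2"], right_on=["k1", "k2"]
)

print(both_keys)
print(left_join)
print(diff_names)
```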
Pivoting
Pivoting: Pivoting is the reorganization of a data frame by aggregating the values of selected columns into the rows of
a new data frame.
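
One possible way to do this in pandas is pivot_table(). The sketch below uses an invented sales table and sums the amounts per region and product.

```python
import pandas as pd

# Hypothetical long-format data
sales = pd.DataFrame(
    {
        "region": ["North", "North", "South", "South", "South"],
        "product": ["apple", "pear", "apple", "apple", "pear"],
        "amount": [10, 5, 7, 3, 8],
    }
)

# One row per region, one column per product, amounts summed per cell
table = sales.pivot_table(
    index="region", columns="product", values="amount", aggfunc="sum"
)

print(table)
```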
Data Cleaning
Clean your data
Bad data could be:

• Empty cells
• Data in wrong format
• Wrong data
• Duplicates
Missing data
• To make detecting missing values easier (and
consistent across different array dtypes), pandas provides:
• the isna() method
• the notna() method
• NaN stands for Not a Number (but means a missing
number). Different forms exist based on the data
type (e.g., NaT).
• fillna() can replace these NaNs.
• Do not confuse NaN with values such as None. None
indicates that there is no object; it is usually what a
method returns when there was no return value. None is not
the same as 0, False, or an empty object.
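
A sketch of the three methods on an invented Series. One detail worth knowing: when None is placed in a numeric Series, pandas converts it to NaN, so isna() flags it as missing as well.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with two missing entries
scores = pd.Series([7.5, np.nan, 6.0, None])

missing = scores.isna()     # True where a value is missing
present = scores.notna()    # the opposite of isna()
filled = scores.fillna(0)   # replace the missing values with 0

# NaT is the missing-value marker for dates and times
nat_is_missing = pd.isna(pd.NaT)

print(missing.tolist())
print(filled.tolist())
```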

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.notna.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
Wrong data formats
This is preceded by some code that generates the following data frame.

Applying datetime methods to such columns is not possible
because of the wrong data format.
As one option, we can drop the data.
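
The data frame from the slide is not reproduced here. The sketch below assumes a date column stored as text with one unparseable entry, converts it with pd.to_datetime(), and drops the row that could not be converted.

```python
import pandas as pd

# Hypothetical data: the last date is not in a usable format
df = pd.DataFrame(
    {"date": ["2025-03-10", "2025-03-11", "not a date"], "value": [1, 2, 3]}
)

# errors="coerce" turns unparseable entries into NaT instead of raising
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Drop the rows in which the date is missing
cleaned = df.dropna(subset=["date"])

# Datetime methods now work on the column
days = cleaned["date"].dt.day.tolist()

print(cleaned)
```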
Wrong data formats
But we can also modify specific values, for example by using the .loc[] indexer.
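
A sketch of repairing a single cell with .loc[], assuming we know what the correct value should be. The data is invented.

```python
import pandas as pd

# Hypothetical data: row 1 holds a value that is not a date
df = pd.DataFrame({"date": ["2025-03-10", "unknown", "2025-03-12"]})

# Overwrite one cell, addressed by row label and column label
df.loc[1, "date"] = "2025-03-11"

# After the fix, the whole column converts cleanly
df["date"] = pd.to_datetime(df["date"])
days = df["date"].dt.day.tolist()

print(df)
```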
Duplicates
Let’s create a situation where there is a duplicate row in our DataFrame.
Then, let’s identify the duplicates in the DataFrame and drop them. We use the drop_duplicates()
method.
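
A sketch with an invented frame in which the last row repeats the first one.

```python
import pandas as pd

# Hypothetical data: row 2 is an exact copy of row 0
df = pd.DataFrame({"name": ["Ann", "Bob", "Ann"], "age": [31, 45, 31]})

# duplicated() marks every repeat of an earlier row
dup_mask = df.duplicated()

# drop_duplicates() keeps the first occurrence of each row
deduped = df.drop_duplicates()

print(dup_mask.tolist())
print(deduped)
```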
Finally, be careful with duplicate labels!

What happens if we use the ‘key’/‘label’ of a column or row to obtain its data when two columns (or rows) have the
same label?

You can use the is_unique attribute and the duplicated() method to test column and index labels and
detect these.
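
A sketch of the problem and its detection, using an invented frame with two columns labelled A. Note that is_unique is an attribute (no parentheses), while duplicated() is a method.

```python
import pandas as pd

# Hypothetical frame with a duplicate column label
df = pd.DataFrame([[1, 2, 3]], columns=["A", "B", "A"])

# Selecting "A" returns both columns, so the result is a DataFrame, not a Series
selected = df["A"]

cols_unique = df.columns.is_unique      # False: a label occurs more than once
dup_labels = df.columns.duplicated()    # marks the repeated label

# Keep only the first column for each label
unique_cols = df.loc[:, ~df.columns.duplicated()]

print(selected.shape, cols_unique, dup_labels.tolist())
```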
Questions?
