Python Lecture 5 (2025)

The lecture covers Python programming topics including data reshaping, data cleaning, and techniques for handling missing or duplicate data. Key concepts discussed include slicing, dicing, concatenation, merging, and joining DataFrames, as well as methods for identifying and addressing data issues. Practical coding exercises are provided through associated .ipynb files on the course platform.


Lecture 5.

Python Programming

Dr. ir. Guido van Capelleveen, 10 March 2025


Today’s coding topics:
1. Reshaping data
   1. Slicing and dicing
   2. Concat, merge & join

2. Data cleaning
   1. Missing data
   2. Wrong data formats
   3. Wrong data
   4. Duplicate data
   5. Duplicate labels

Each of the five topics has an associated .ipynb file on Canvas to practice with code.
Reshaping data
Slicing & dicing
Slicing: Slicing is a technique for selecting specific subsets of data from a DataFrame or Series object. It allows you to
extract rows and columns based on their position or labels.

Dicing: Dicing is a technique for selecting specific subsets of data from a DataFrame based on multiple dimensions or
criteria. It allows you to extract precise portions of your data for analysis or manipulation.
Slicing & dicing
These are principles commonly known in the field of data warehousing for selecting and presenting parts of data.
Slicing/dicing syntax

Slices are made with square brackets [ ], giving a start index and a stop index separated by a colon (:).
With integer values, pandas assumes you want to slice by position.
Slicing/dicing on objects
Slices can be taken from several kinds of objects. For example, remember that we did the same on strings?
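A minimal sketch of the same start:stop syntax on a string, a list, and a pandas Series. The variable names and values are made up for illustration and are not taken from the lecture notebooks.

```python
import pandas as pd

text = "pandas"
letters = list(text)
series = pd.Series(letters)  # default index 0, 1, 2, ...

# In all three cases the start position is included and the stop position is excluded.
print(text[1:4])               # 'and'
print(letters[1:4])            # ['a', 'n', 'd']
print(series[1:4].tolist())    # ['a', 'n', 'd']
```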
Syntax: default values and step size.
Let’s create a pandas Series object.

Take a slice containing only the odd values. See how we use default values and how we set the step size.

Take a slice containing only the last two values. See how we use a starting index that counts backwards from the end of the data.
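The two slices described on this slide could look as follows. This is a sketch on an assumed Series holding the numbers 1 to 10; the Series used in the lecture itself is not shown in the text.

```python
import pandas as pd

numbers = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Start and stop are left at their defaults; a step size of 2 takes every second value.
odd_values = numbers[::2]
print(odd_values.tolist())   # [1, 3, 5, 7, 9]

# A negative start counts backwards from the end; the stop is left at its default.
last_two = numbers[-2:]
print(last_two.tolist())     # [9, 10]
```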
Slicing syntax: using string indexes
Let’s again create a pandas Series object, but this time with string indexes.

The slice is taken where the strings are found in the index, based on the index order (alphabetical by default).
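A small sketch of slicing with string labels, using a hypothetical Series called stock. Note that with labels, unlike with positions, the stop label is included. The second slice uses labels that are not in the index; this works here only because the assumed index is sorted alphabetically.

```python
import pandas as pd

# Hypothetical data with an alphabetically sorted string index.
stock = pd.Series([10, 20, 30, 40], index=["apple", "banana", "cherry", "date"])

# Label-based slice: both "banana" and "cherry" are included.
by_label = stock["banana":"cherry"]
print(by_label.tolist())      # [20, 30]

# Labels that do not occur in the index are placed by sort order (sorted index required).
by_order = stock["b":"d"]
print(by_order.index.tolist())   # ['banana', 'cherry']
```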
Syntax: specific positions .iloc[position]

Previously defined.

For a specific position, independent of the index data type (e.g., when it’s a string), we can use .iloc[position].

Or we can use conditions to specify how we would like to slice.
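Both ideas can be sketched on a hypothetical Series with a string index (the data is invented for illustration):

```python
import pandas as pd

stock = pd.Series([10, 20, 30, 40], index=["apple", "banana", "cherry", "date"])

# .iloc selects by position, regardless of the index data type.
first = stock.iloc[0]
last = stock.iloc[-1]
print(first, last)            # 10 40

# A boolean condition keeps only the entries for which it is True.
large = stock[stock > 15]
print(large.index.tolist())   # ['banana', 'cherry', 'date']
```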
Syntax: DataFrames instead of Series
A DataFrame has two dimensions: rows and columns.
Thus, we can also have selection criteria on
each of the dimensions for slicing and dicing.

Previously defined.

[Figure: code examples in which the first part of each selection is annotated as rows and the second part as columns.]
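As a sketch of selecting on both dimensions at once, the following uses a small made-up DataFrame. In .iloc and .loc, the part before the comma selects rows and the part after the comma selects columns.

```python
import pandas as pd

# Hypothetical data, for illustration only.
people = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cem", "Dee"],
     "age": [28, 35, 41, 30],
     "city": ["Enschede", "Zwolle", "Enschede", "Utrecht"]},
    index=["r1", "r2", "r3", "r4"],
)

# Slice by position: first two rows, first two columns (stop is excluded).
by_position = people.iloc[0:2, 0:2]

# Slice by label: rows r2 up to and including r3, two named columns.
by_label = people.loc["r2":"r3", ["name", "city"]]

# Dice with a condition on the rows and a selection on the columns.
diced = people.loc[people["age"] > 30, ["name", "age"]]
print(diced["name"].tolist())   # ['Bob', 'Cem']
```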


Concat, merge & join
Concatenation fundamentals
Concatenation: Concatenation in pandas is a powerful method for combining multiple DataFrames or Series objects into
a single data structure.

[Figure: two ways to concatenate. (1) Glue by rows: B is appended below A. (2) Glue by columns: D is placed next to C.]

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
Concatenation of dataframes (syntax)
In these three DataFrames (df4, df5, df6) we have data with 4 keys (A, B, C, D) but with different index positions.

[Figure: code that concatenates the DataFrames, annotated with the key, the values, and the index (positions).]
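A minimal sketch of gluing by rows. The names df4, df5 and df6 follow the slide, but their contents are not shown in the text, so the values below are invented stand-ins.

```python
import pandas as pd

# Stand-in data: same four keys (columns), different index positions.
df4 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"],
                    "C": ["C0", "C1"], "D": ["D0", "D1"]}, index=[0, 1])
df5 = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"],
                    "C": ["C2", "C3"], "D": ["D2", "D3"]}, index=[2, 3])
df6 = pd.DataFrame({"A": ["A4", "A5"], "B": ["B4", "B5"],
                    "C": ["C4", "C5"], "D": ["D4", "D5"]}, index=[4, 5])

# The default axis=0 stacks the frames on top of each other.
stacked = pd.concat([df4, df5, df6])
print(stacked.shape)            # (6, 4)
print(stacked.index.tolist())   # [0, 1, 2, 3, 4, 5]
```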
Concatenation (glue by column, bad example)
To concatenate by columns, matching on the row index, we use the parameter axis=1.

Previously defined.

Notice how columns are used to concatenate.

Since the row indexes of these DataFrames do not overlap, nothing gets ‘glued’ side by side as in
method 2, i.e. ‘glue by columns’.
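The effect can be reproduced with two small made-up DataFrames whose row indexes do not overlap. This is a sketch under that assumption, not the lecture's own data.

```python
import pandas as pd

upper = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=[0, 1])
lower = pd.DataFrame({"A": [5, 6], "B": [7, 8]}, index=[2, 3])

# axis=1 aligns on the row index; no index value occurs in both frames.
side_by_side = pd.concat([upper, lower], axis=1)

print(side_by_side.shape)                     # (4, 4)
print(list(side_by_side.columns))             # ['A', 'B', 'A', 'B']
print(int(side_by_side.isna().sum().sum()))   # 8 cells are NaN
```

Every row is filled from only one of the two frames, so half of the cells are NaN and the column labels are repeated.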
Concatenation (glue a column, good example)

Previously defined.

When one DataFrame has more index values than the other, NaN (not a number) will arise for the entries without a match.
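A sketch of the good case, with invented data: the row indexes overlap, so the columns are glued side by side, and NaN appears only for the index value that exists in one frame.

```python
import pandas as pd

longer = pd.DataFrame({"A": [1, 2, 3]}, index=[0, 1, 2])
shorter = pd.DataFrame({"C": [10, 20]}, index=[0, 1])

glued = pd.concat([longer, shorter], axis=1)

print(glued.shape)                      # (3, 2)
print(bool(pd.isna(glued.loc[2, "C"])))  # True: row 2 has no value in 'shorter'
```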
Joins
'Inner join': Returns only the rows that have matching values in both
DataFrames.
'Outer joins': Returns all rows from both DataFrames. If there is no
match, the result will contain NaN values.
• Left (outer) join: Returns all rows from the left DataFrame and
the matching rows from the right DataFrame. If there is no match,
the result will contain NaN values.
• Right (outer) join: Similar to a left join but returns all rows from
the right DataFrame and matching rows from the left DataFrame.
• Full (outer) join: A full outer join is equivalent to an outer join in
pandas. It includes all rows from both DataFrames, filling in missing
values with NaN where there is no match.

Join is a concept that we know from relational databases.

It combines data by matching ‘equal’ values in both data sets.
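The join types listed above can be compared by counting the rows they return. The sketch below uses the pandas merge() method with its how parameter on two invented data sets; the column names key, a_val and b_val are hypothetical.

```python
import pandas as pd

a = pd.DataFrame({"key": [1, 2, 3], "a_val": ["x", "y", "z"]})
b = pd.DataFrame({"key": [2, 3, 4], "b_val": ["p", "q", "r"]})

# Keys 2 and 3 occur in both sets; key 1 only in a, key 4 only in b.
row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    row_counts[how] = len(a.merge(b, on="key", how=how))

print(row_counts)   # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}
```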


Join using the concat method
Joining principles are available in the concat() method.

Previously defined.

It thus limits the result to only the rows whose index values occur in both A (df4) and B (df8).
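A sketch of an inner join through concat(). The names df4 and df8 follow the slide, but the contents below are assumed, since the original definitions are not part of the text.

```python
import pandas as pd

# Stand-in data: index values 2 and 3 occur in both frames.
df4 = pd.DataFrame({"A": ["A0", "A1", "A2", "A3"]}, index=[0, 1, 2, 3])
df8 = pd.DataFrame({"E": ["E2", "E3", "E6", "E7"]}, index=[2, 3, 6, 7])

# join="inner" keeps only the index values present in both frames.
inner = pd.concat([df4, df8], axis=1, join="inner")
print(inner.index.tolist())   # [2, 3]

# The default, join="outer", keeps all index values and fills the gaps with NaN.
outer = pd.concat([df4, df8], axis=1)
print(outer.index.tolist())   # [0, 1, 2, 3, 6, 7]
```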

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html
Join a dataframe (using a key)
Instead of using the index, we can also glue the two data sets together by using a key column. For example, join data from A
and B based on the client_id key in each data set. We use the merge() method, with the argument ‘on’ to set the key
we want to use as our inner-join criterion.

The result is now limited to the rows for which the key (key1 in this example) occurs in both data sets A (left) and B (right).
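A minimal sketch of such a merge on a key column named key1. The two data sets and their values are invented for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key1": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key1": ["K0", "K2", "K3"], "B": ["B0", "B2", "B3"]})

# merge() performs an inner join by default: only keys found in both sets remain.
merged = pd.merge(left, right, on="key1")
print(merged["key1"].tolist())   # ['K0', 'K2']
print(list(merged.columns))      # ['key1', 'A', 'B']
```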
Joining (multiple keys?, join method?)
We can of course set multiple keys in the join.

Furthermore, we can set the join method.

We can set keys even if they have different names in the two data sets.
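The three options can be sketched with merge() and its on, how, left_on and right_on parameters. All data and column names below (key1, key2, client_id, customer) are hypothetical examples.

```python
import pandas as pd

left = pd.DataFrame({"key1": ["K0", "K0", "K1"], "key2": ["a", "b", "a"], "A": [1, 2, 3]})
right = pd.DataFrame({"key1": ["K0", "K1", "K1"], "key2": ["a", "a", "b"], "B": [10, 20, 30]})

# Multiple keys: a row matches only if both key1 and key2 are equal.
both_keys = pd.merge(left, right, on=["key1", "key2"])
print(len(both_keys))    # 2

# Join method: keep every row of the left data set, NaN where there is no match.
left_join = pd.merge(left, right, on=["key1", "key2"], how="left")
print(len(left_join))    # 3

# Keys with different names in the two data sets.
clients = pd.DataFrame({"client_id": [1, 2, 3], "name": ["Ann", "Bob", "Cem"]})
orders = pd.DataFrame({"customer": [2, 3, 3], "amount": [50, 20, 70]})
renamed = pd.merge(clients, orders, left_on="client_id", right_on="customer")
print(renamed["name"].tolist())   # ['Bob', 'Cem', 'Cem']
```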
Pivoting
Pivoting: Pivoting is the reorganization of a DataFrame in which the values of selected columns become the rows (and columns) of
the new DataFrame, with the remaining values aggregated where needed.
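One way to sketch this in pandas is the pivot_table() method. The sales data and the choice of sum as the aggregation are assumptions made for this example.

```python
import pandas as pd

sales = pd.DataFrame({
    "city": ["Enschede", "Enschede", "Zwolle", "Zwolle", "Zwolle"],
    "year": [2024, 2025, 2024, 2025, 2025],
    "amount": [10, 20, 5, 7, 3],
})

# Cities become the rows, years become the columns, and amounts are summed per cell.
pivoted = sales.pivot_table(index="city", columns="year", values="amount", aggfunc="sum")

print(pivoted.loc["Zwolle", 2025])   # 10 (the two 2025 rows for Zwolle, 7 + 3)
```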
Data Cleaning
Clean your data
Bad data could be:

• Empty cells
• Data in wrong format
• Wrong data
• Duplicates
Missing data
• To make detecting missing values easier (and
consistent across different array dtypes), pandas provides:
• the isna() method
• the notna() method

• NaN stands for Not a Number (but here it means a missing
value). Different forms exist depending on the data
type (e.g., NaT for datetime data).
fillna() can replace these missing values.
• Do not confuse NaN with None. None is a Python value
indicating that there is no object; it is typically what
a function returns when it has no explicit return value. None is not
the same as 0, False, or an empty object.
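The methods above can be sketched on a small invented DataFrame with one numeric and one datetime column. One point worth noting: although None is conceptually different from NaN, pandas does count a None as missing when it checks with isna().

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing number (NaN) and one missing date (NaT).
df = pd.DataFrame({
    "score": [1.0, np.nan, 3.0],
    "when": pd.to_datetime(["2025-03-10", None, "2025-03-12"]),
})

print(df["score"].isna().tolist())    # [False, True, False]
print(df["when"].notna().tolist())    # [True, False, True]

# fillna() replaces the missing values, here with 0.
filled = df["score"].fillna(0)
print(filled.tolist())                # [1.0, 0.0, 3.0]

# pandas treats None as a missing value as well.
print(pd.isna(None))                  # True
```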

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.notna.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
Wrong data formats
The code that generates the following DataFrame came earlier.

Applying datetime methods to such columns is not possible because of the wrong data format.

As one option, we can drop the data.
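A sketch of the drop option on invented data. It assumes the column should hold dates in the form YYYY-MM-DD, and it uses pd.to_datetime() with errors="coerce" to mark unparseable entries as NaT before dropping them; that conversion step is one possible approach, not necessarily the one used in the lecture notebook.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2025-03-10", "2025-03-11", "unknown"],
    "value": [1, 2, 3],
})

# Entries that do not match the expected format become NaT instead of raising an error.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")

# Drop the rows whose date could not be parsed.
cleaned = df.dropna(subset=["date"])

# Datetime methods now work on the column.
print(cleaned["date"].dt.day.tolist())   # [10, 11]
```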
Wrong data formats
But we can also modify specific values, for example by using the .loc[] indexer.
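A sketch of repairing a single value with .loc[] instead of dropping the row. The data and the corrected value are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"date": ["2025-03-10", "20250311", "2025-03-12"]})

# Overwrite the one badly formatted entry: row label 1, column "date".
df.loc[1, "date"] = "2025-03-11"

# With every value in the same format, the column can be converted.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
print(df["date"].dt.day.tolist())   # [10, 11, 12]
```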
Duplicates
Let’s create a situation where there is a duplicate row in our DataFrame.

Then, let’s identify the duplicates in the DataFrame and drop these from it. We use the drop_duplicates()
method.
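Both steps, sketched on a made-up DataFrame in which the third row repeats the first:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Ann"], "age": [30, 25, 30]})

# duplicated() marks every row that repeats an earlier row.
mask = df.duplicated()
print(mask.tolist())             # [False, False, True]

# drop_duplicates() keeps the first occurrence and removes the repeats.
deduplicated = df.drop_duplicates()
print(deduplicated.index.tolist())   # [0, 1]
```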
Finally, be careful with duplicate labels!

What happens if we use the ‘key’/‘label’ of a column or row to obtain its data when two columns (or rows) have the
same ‘label’?

You can use the is_unique attribute and the duplicated() method to test column and index labels and
detect these.
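A sketch of the problem and of the two checks, using an invented DataFrame with the column label A twice. The last step, keeping only the first column per label, is one possible remedy and an assumption of this example.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["A", "B", "A"])

# Selecting "A" returns both columns with that label, not a single Series.
selected = df["A"]
print(selected.shape)                     # (2, 2)

# is_unique is an attribute (no parentheses); duplicated() flags repeated labels.
print(df.columns.is_unique)               # False
print(df.columns.duplicated().tolist())   # [False, False, True]

# Possible remedy: keep only the first column for each label.
unique_columns = df.loc[:, ~df.columns.duplicated()]
print(list(unique_columns.columns))       # ['A', 'B']
```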
Questions?
