Python Lecture 5 (2025)
Python Programming
2. Data cleaning
1. Missing data
2. Wrong data formats
3. Wrong data
4. Duplicate data
5. Duplicate labels
Each of the five topics has an associated .ipynb file on Canvas to practice with code.
Reshaping data
Slicing & dicing
Slicing: Slicing is a technique for selecting specific subsets of data from a DataFrame or Series object. It allows you to
extract rows and columns based on their position or labels.
Dicing: Dicing is a technique for selecting specific subsets of data from a DataFrame based on multiple dimensions or
criteria. It allows you to extract precise portions of your data for analysis or manipulation.
Slicing & dicing
These are principles commonly known in the field of data warehousing for selecting and presenting parts of data.
Slicing/dicing syntax
Slices can be made using square brackets [ ] with a start index and a stop index, separated
by a : (colon) symbol. pandas then assumes you want to slice by position.
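A minimal sketch of the bracket syntax. The Series name and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical example data with the default integer index 0..4.
values = pd.Series([10, 20, 30, 40, 50])

# start:stop selects positions 0, 1 and 2; the stop position is excluded.
first_three = values[0:3]
print(first_three)
```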
Slicing/dicing on objects
Slices can be taken from multiple objects. For example, remember we did the same on strings?
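As a reminder, the same start:stop syntax works on built-in objects such as strings and lists. The variable names below are illustrative only.

```python
# Slicing a string: characters at positions 0, 1 and 2.
text = "pandas"
head_of_text = text[0:3]

# Slicing a list: elements at positions 2 and 3.
numbers = [0, 1, 2, 3, 4, 5]
middle_numbers = numbers[2:4]

print(head_of_text, middle_numbers)
```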
Syntax: default values and step size.
Let’s create a pandas Series object.
Take a slice containing only the odd values.
See how we use default values.
See how we set the step size.
Take a slice containing only the last two values.
See how we use a starting index for going backwards in the data.
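A possible version of these two slices, assuming a Series that holds the numbers 1 to 6 (the original example data is not shown here).

```python
import pandas as pd

# Assumed example data: the numbers 1 to 6.
numbers = pd.Series([1, 2, 3, 4, 5, 6])

# Start and stop are left empty (default values), the step size is 2.
# This takes every second element starting at position 0: the odd values.
odd_values = numbers[::2]

# A negative start index counts backwards from the end of the data.
last_two = numbers[-2:]

print(odd_values.tolist(), last_two.tolist())
```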
Slicing syntax: using string indexes
Let's again create a pandas Series object, but with string indexes.
It slices where the string is found, based on the index order (alphabetical by default).
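A small sketch with made-up labels. Note that, unlike a positional slice, a label slice in pandas includes the stop label.

```python
import pandas as pd

# Hypothetical Series with string labels as its index.
scores = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# Slice from label "b" up to and including label "c".
b_to_c = scores["b":"c"]
print(b_to_c)
```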
Syntax: specific positions .iloc[position]
For a specific position, independent of the index data type (e.g., when it's a string), we can use .iloc[position].
Or use conditions for how we would like to slice.
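A minimal sketch of both options, using an assumed Series with string labels.

```python
import pandas as pd

# Hypothetical Series with string labels as its index.
scores = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# .iloc always works by position, whatever the index data type is.
second_value = scores.iloc[1]
first_two = scores.iloc[0:2]

# A condition (boolean mask) keeps only the elements for which it is True.
above_twenty = scores[scores > 20]

print(second_value, first_two.tolist(), above_twenty.tolist())
```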
Syntax: dataFrames instead of Series
In DataFrames we have two dimensions (rows and columns).
Thus, we can also have selection criteria on
each of the dimensions for slicing and dicing.
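A sketch of selecting on both dimensions at once. The DataFrame and its labels are invented for illustration.

```python
import pandas as pd

# Hypothetical DataFrame with row labels x, y, z and columns A, B, C.
df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
    index=["x", "y", "z"],
)

# By position: first two rows, last two columns (stop positions excluded).
by_position = df.iloc[0:2, 1:3]

# By label: rows x up to and including y, columns A and C.
by_label = df.loc["x":"y", ["A", "C"]]

# Dicing with a condition on the rows and a label on the columns.
b_where_a_large = df.loc[df["A"] > 1, "B"]

print(by_position, by_label, b_where_a_large, sep="\n\n")
```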
[Figure: Concat. Along the rows, A and B are glued together (A on top of B); along the columns, C and D are glued together (C next to D).]
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
Concatenation of dataframes (syntax)
In these three DataFrames (df4, df5, df6) we have data with 4 keys (A, B, C, D) but with
different index positions.
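A minimal sketch of such a concatenation. The names df4, df5 and df6 come from the slide; their contents below are assumed.

```python
import pandas as pd

# Assumed contents: the same four keys (columns), different index positions.
df4 = pd.DataFrame(
    {"A": ["A0", "A1"], "B": ["B0", "B1"], "C": ["C0", "C1"], "D": ["D0", "D1"]},
    index=[0, 1],
)
df5 = pd.DataFrame(
    {"A": ["A2", "A3"], "B": ["B2", "B3"], "C": ["C2", "C3"], "D": ["D2", "D3"]},
    index=[2, 3],
)
df6 = pd.DataFrame(
    {"A": ["A4", "A5"], "B": ["B4", "B5"], "C": ["C4", "C5"], "D": ["D4", "D5"]},
    index=[4, 5],
)

# By default pd.concat glues the frames together along the rows (axis=0).
combined = pd.concat([df4, df5, df6])
print(combined)
```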
It thus limits the rows to only those with matching indexes in both A (df4) and B (df8).
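One way to get this behaviour is pd.concat along the columns with join="inner". This is an assumption about the code that was shown; the contents of df4 and df8 are made up.

```python
import pandas as pd

# Assumed contents: df4 and df8 share only the index values 2 and 3.
df4 = pd.DataFrame(
    {"A": ["A0", "A1", "A2", "A3"], "B": ["B0", "B1", "B2", "B3"]},
    index=[0, 1, 2, 3],
)
df8 = pd.DataFrame(
    {"E": ["E2", "E3", "E6", "E7"], "F": ["F2", "F3", "F6", "F7"]},
    index=[2, 3, 6, 7],
)

# axis=1 glues the frames side by side; join="inner" keeps only the
# index values that occur in both frames.
inner = pd.concat([df4, df8], axis=1, join="inner")
print(inner)
```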
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html
Join a dataframe (using a key)
Instead of gluing the two data sets together along the index or columns, we can also join them using a key. For example, join data from A
and B based on the client_id key in each data set. We use the merge() method with the 'on' argument to set the key
we want to use as our inner joining criterion.
The result is now limited to the rows for which there is data about the key in both data sets A (left) and B (right).
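A minimal sketch of such a join. The key client_id comes from the slide; the other columns and all values are invented.

```python
import pandas as pd

# Hypothetical data sets A (left) and B (right) sharing the client_id key.
clients = pd.DataFrame({"client_id": [1, 2, 3], "name": ["Ann", "Bob", "Cas"]})
orders = pd.DataFrame({"client_id": [2, 3, 4], "amount": [250, 100, 75]})

# merge() performs an inner join by default: only keys found in both remain.
joined = pd.merge(clients, orders, on="client_id")
print(joined)
```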
Joining (multiple keys?, join method?)
We can of course set multiple keys in the join.
Furthermore, we can set the join method.
We can set keys even if they have different names in the two data sets.
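A sketch of the three options using the merge() arguments on, how, left_on and right_on. All data and column names are illustrative.

```python
import pandas as pd

left = pd.DataFrame(
    {"key1": ["a", "a", "b"], "key2": ["x", "y", "x"], "left_val": [1, 2, 3]}
)
right = pd.DataFrame(
    {"key1": ["a", "b", "b"], "key2": ["x", "x", "y"], "right_val": [10, 20, 30]}
)

# Multiple keys: rows match only when key1 AND key2 are both equal.
inner = pd.merge(left, right, on=["key1", "key2"])

# Join method: how="outer" keeps every key combination from either side.
outer = pd.merge(left, right, on=["key1", "key2"], how="outer")

# Keys with different names in the two data sets.
clients = pd.DataFrame({"client_id": [1, 2], "name": ["Ann", "Bob"]})
orders = pd.DataFrame({"customer": [2, 2, 3], "amount": [250, 100, 75]})
named = pd.merge(clients, orders, left_on="client_id", right_on="customer")

print(len(inner), len(outer), len(named))
```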
Pivoting
Pivoting: Pivoting is the reorganization of a data frame by aggregating the values of selected columns and presenting them as rows in
the new data frame.
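A small sketch using pivot_table(), which is one of the pandas options for pivoting. The sales data below is invented.

```python
import pandas as pd

# Hypothetical long-format data: one row per sale.
sales = pd.DataFrame(
    {
        "region": ["North", "North", "South", "South", "North"],
        "product": ["a", "b", "a", "b", "a"],
        "amount": [1, 2, 3, 4, 5],
    }
)

# Regions become the rows, products become the columns,
# and the amounts are aggregated by summing them.
pivoted = sales.pivot_table(
    index="region", columns="product", values="amount", aggfunc="sum"
)
print(pivoted)
```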
Data Cleaning
Clean your data
Bad data could be:
• Empty cells
• Data in wrong format
• Wrong data
• Duplicates
Missing data
• To make detecting missing values easier (and
across different array dtypes), pandas provides the:
• isna() method
• notna() method.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.notna.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
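A minimal sketch of both methods on made-up data with one missing value per column.

```python
import pandas as pd

# Hypothetical data frame with one missing value in each column.
df = pd.DataFrame({"age": [25.0, None, 31.0], "city": ["Delft", "Leiden", None]})

# isna() marks missing cells with True, notna() is the opposite.
missing_mask = df.isna()
present_mask = df.notna()

# Summing the mask counts the missing values per column.
missing_per_column = missing_mask.sum()
print(missing_per_column)
```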
Wrong data formats
The code that generated the following data frame was shown earlier.
Applying datetime methods to such columns is not possible because of the wrong data format.
We can drop the data as an option.
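A sketch of this situation, assuming a Date column stored as text. The column names and values are made up; converting with pd.to_datetime() and errors="coerce" is one possible approach, not necessarily the one shown on the slide.

```python
import pandas as pd

# Hypothetical data: dates stored as text, one invalid and one missing.
df = pd.DataFrame(
    {
        "Date": ["2020-12-01", "2020-12-02", "not a date", None],
        "Duration": [60, 45, 30, 50],
    }
)

# The column holds plain strings, so the .dt accessor is not available yet.
try:
    df["Date"].dt.year
    dt_worked = True
except AttributeError:
    dt_worked = False

# Convert to datetime; values that cannot be parsed become NaT (missing).
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

# Option: drop the rows that have no valid date.
cleaned = df.dropna(subset=["Date"])

# Datetime methods now work on the cleaned column.
years = cleaned["Date"].dt.year
print(cleaned)
```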
Wrong data formats
But we can also modify specific values. For example, using the .loc[] indexer.
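A minimal sketch, assuming a Duration column in which one value is clearly a typing error. The column name and values are hypothetical.

```python
import pandas as pd

# Hypothetical data: the value 450 in row 1 should have been 45.
df = pd.DataFrame({"Duration": [60, 450, 45]})

# .loc[row label, column label] selects one cell, which we overwrite.
df.loc[1, "Duration"] = 45
print(df)
```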
Duplicates
Let's create a situation where there is a duplicate row in our data frame.
Then, let's identify the duplicates in the data frame and drop these from the data frame. We use the drop_duplicates()
method.
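A sketch of these steps on invented data, where the last row repeats the first one.

```python
import pandas as pd

# Hypothetical data frame in which row 2 is an exact copy of row 0.
df = pd.DataFrame({"name": ["Ann", "Bob", "Ann"], "score": [8, 6, 8]})

# duplicated() marks every repeated row with True (first occurrence stays False).
is_duplicate = df.duplicated()

# drop_duplicates() returns a new data frame without the repeated rows.
deduplicated = df.drop_duplicates()
print(deduplicated)
```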
Finally, be careful with duplicate labels!
What happens if we use the 'key'/'label' of a column or row to obtain the data of that column or row when two columns have the
same 'label'?
You can use the is_unique attribute and the duplicated() method on the columns and the index to test the labels and
detect these.
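A small sketch with a made-up data frame that has two columns labelled A. The last step, keeping only the first column per label, is one possible remedy and is not taken from the slide.

```python
import pandas as pd

# Hypothetical data frame with a duplicate column label "A".
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["A", "B", "A"])

# Selecting "A" now returns both columns (a DataFrame, not a Series).
selection = df["A"]

# is_unique is an attribute of the labels; duplicated() marks the repeats.
labels_unique = df.columns.is_unique
repeated_labels = df.columns.duplicated()

# One possible remedy: keep only the first column for each label.
first_only = df.loc[:, ~df.columns.duplicated()]
print(first_only)
```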
Questions?