String manipulation means cleaning, transforming, and processing text data so that it is suitable for analysis. Pandas exposes a collection of vectorized methods through the .str accessor that work on the string columns of a DataFrame: converting case, trimming whitespace, splitting, extracting patterns, replacing values, and more.
In this article, we will perform string manipulation using the dataset shown below:
import pandas as pd
import numpy as np
data = {
    'Name': ['Lukas', 'Sofia', 'Hiroshi', 'Marta', 'Yannis', np.nan, 'Elena'],
    'City': ['Berlin', 'Madrid', 'Tokyo', 'Warsaw', 'Athens', 'Oslo', 'Lisbon'],
}
df = pd.DataFrame(data)
print(df)
Output
      Name    City
0    Lukas  Berlin
1    Sofia  Madrid
2  Hiroshi   Tokyo
3    Marta  Warsaw
4   Yannis  Athens
5      NaN    Oslo
6    Elena  Lisbon
Column Datatype in Pandas
Columns that look like strings may be stored internally as another datatype, such as object or a numeric type. To keep string operations consistent, it is often useful to convert the relevant columns to the string dtype.
Below, we convert the entire DataFrame to string type using .astype('string').
print(df.astype('string'))
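Every column of the returned DataFrame has the string dtype, so the .str methods are available on all of them. Note that astype() returns a new DataFrame; df itself is unchanged unless the result is assigned back.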
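As a minimal sketch of the same idea applied to selected columns, the example below converts two columns and assigns the result back. The Zip column is a hypothetical addition for illustration and is not part of the article's dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Lukas', 'Sofia', np.nan],
    'Zip': [10115, 28001, 11000],  # hypothetical integer column, not in the article's dataset
})

# astype() returns a new object, so assign it back to keep the conversion.
df['Name'] = df['Name'].astype('string')
df['Zip'] = df['Zip'].astype('string')

print(df.dtypes)

# The former integer column now supports .str operations such as slicing.
zip_prefix = df['Zip'].str[:2]
print(zip_prefix.tolist())  # ['10', '28', '11']
```

The missing name stays missing after the conversion; it is not turned into the text 'nan'.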
String Operations in Pandas
Below are the commonly used string manipulation methods in Pandas, explained with short examples.
1. lower(): This method converts every character in the column to lowercase, ensuring consistent text formatting.
print(df['Name'].str.lower())
Output
0 lukas
1 sofia
2 hiroshi
3 marta
4 yannis
5 NaN
6 elena
Name: Name, dtype: object
2. upper(): This method transforms all characters in the column to uppercase for uniform, standardized text.
print(df['Name'].str.upper())
Output
0 LUKAS
1 SOFIA
2 HIROSHI
3 MARTA
4 YANNIS
5 NaN
6 ELENA
Name: Name, dtype: object
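One common reason to normalize case is to compare values that differ only in capitalization. The sketch below uses a small hypothetical Series of city spellings, not the article's dataset.

```python
import pandas as pd

# Hypothetical values that differ only in capitalization.
cities = pd.Series(['berlin', 'BERLIN', 'Berlin', 'Oslo'])

# Lowercase everything once, then compare against a lowercase literal.
normalized = cities.str.lower()
is_berlin = normalized == 'berlin'

print(is_berlin.tolist())    # [True, True, True, False]
print(normalized.nunique())  # 2
```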
3. strip(): This method removes leading and trailing whitespace (spaces, tabs, newlines) from each string. The names in this dataset have no padding, so the output below is unchanged.
print(df['Name'].str.strip())
Output
0 Lukas
1 Sofia
2 Hiroshi
3 Marta
4 Yannis
5 NaN
6 Elena
Name: Name, dtype: object
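Because the dataset above contains no padded values, the effect of strip() is easier to see on messy input. The following sketch uses hypothetical padded strings and also shows lstrip() and stripping a specific character.

```python
import pandas as pd

# Hypothetical messy values; the article's dataset has no padding.
raw = pd.Series(['  Lukas', 'Sofia  ', '\tMarta\n', '--Elena--'])

stripped = raw.str.strip()     # whitespace removed from both ends
left_only = raw.str.lstrip()   # whitespace removed from the left side only
no_dashes = raw.str.strip('-') # strips the given characters instead of whitespace

print(stripped.tolist())   # ['Lukas', 'Sofia', 'Marta', '--Elena--']
print(no_dashes.tolist())  # ['  Lukas', 'Sofia  ', '\tMarta\n', 'Elena']
```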
4. split(): This method splits each string into a list of parts based on a given separator.
print(df['Name'].str.split('a'))
Output
0 [Luk, s]
1 [Sofi, ]
2 [Hiroshi]
3 [M, rt, ]
4 [Y, nnis]
5 NaN
6 [Elen, ]
Name: Name, dtype: object
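When the pieces should become separate columns rather than lists, split() accepts expand=True. The sketch below assumes a hypothetical Series of combined "name city" strings.

```python
import pandas as pd

# Hypothetical combined values, separated by a single space.
full = pd.Series(['Lukas Berlin', 'Sofia Madrid', 'Hiroshi Tokyo'])

# expand=True returns a DataFrame with one column per piece.
parts = full.str.split(' ', expand=True)
parts.columns = ['Name', 'City']
print(parts)

# Without expand, .str[0] picks the first piece of each list.
first = full.str.split(' ').str[0]
print(first.tolist())  # ['Lukas', 'Sofia', 'Hiroshi']
```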
5. len(): This method returns the number of characters in each string. The result below is float64 because the missing value cannot be stored in an integer column.
print(df['Name'].str.len())
Output
0 5.0
1 5.0
2 7.0
3 5.0
4 6.0
5 NaN
6 5.0
Name: Name, dtype: float64
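If integer lengths are preferred even when values are missing, one option is to work on the string dtype discussed earlier. This is a small sketch of that behavior on a hypothetical three-element Series.

```python
import numpy as np
import pandas as pd

# Using the nullable string dtype instead of the default object dtype.
names = pd.Series(['Lukas', 'Sofia', np.nan], dtype='string')

lengths = names.str.len()
print(lengths)        # nullable integer result; the missing entry prints as <NA>
print(lengths.dtype)  # Int64
```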
6. cat(): This method concatenates all strings in the column into a single string using a chosen separator. Missing values are skipped by default, which is why NaN does not appear in the output.
print(df['Name'].str.cat(sep=', '))
Output
Lukas, Sofia, Hiroshi, Marta, Yannis, Elena
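The sketch below illustrates two further uses of cat() on small hypothetical Series: keeping a placeholder for missing values with na_rep, and joining two Series element by element.

```python
import numpy as np
import pandas as pd

names = pd.Series(['Lukas', np.nan, 'Elena'])
cities = pd.Series(['Berlin', 'Oslo', 'Lisbon'])

# Default: missing values are skipped.
joined = names.str.cat(sep=', ')
print(joined)  # Lukas, Elena

# na_rep substitutes a placeholder instead of skipping.
joined_na = names.str.cat(sep=', ', na_rep='?')
print(joined_na)  # Lukas, ?, Elena

# Passing another Series concatenates row by row; a missing name yields NaN.
pairs = names.str.cat(cities, sep=' - ')
print(pairs.tolist())
```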
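7. get_dummies(): This method returns a DataFrame of 0/1 indicator columns, one per unique value, which is a one-hot encoding suitable for modeling. Each string is first split on a separator ('|' by default), so a single cell can contribute to several columns.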
print(df['City'].str.get_dummies())
Output
   Athens  Berlin  Lisbon  Madrid  Oslo  Tokyo  Warsaw
0       0       1       0       0     0      0       0
1       0       0       0       1     0      0       0
2       0       0       0       0     0      1       0
3       0       0       0       0     0      0       1
4       1       0       0       0     0      0       0
5       0       0       0       0     1      0       0
6       0       0       1       0     0      0       0
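The separator matters when one cell holds several labels. A minimal sketch, assuming a hypothetical Series in which cities are joined with '|':

```python
import pandas as pd

# Hypothetical multi-label cells; '|' is the default separator.
visited = pd.Series(['Berlin|Oslo', 'Oslo', 'Lisbon'])

dummies = visited.str.get_dummies(sep='|')
print(dummies)
# Columns are sorted alphabetically: Berlin, Lisbon, Oslo.
# Row 0 is flagged for both Berlin and Oslo.
```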
8. startswith(): This method checks whether each string begins with the specified prefix.
print(df['Name'].str.startswith('E'))
Output
0 False
1 False
2 False
3 False
4 False
5 NaN
6 True
Name: Name, dtype: object
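The NaN in the result prevents it from being used directly as a boolean filter. One way around this, sketched on a reduced hypothetical copy of the dataset, is the na parameter:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Lukas', np.nan, 'Elena'],
    'City': ['Berlin', 'Oslo', 'Lisbon'],
})

# na=False treats missing names as "does not match", giving a clean boolean mask.
mask = df['Name'].str.startswith('E', na=False)
print(mask.tolist())  # [False, False, True]

matches = df[mask]
print(matches)
```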
9. endswith(): This method checks whether each string ends with the specified suffix.
print(df['Name'].str.endswith('a'))
Output
0 False
1 True
2 False
3 True
4 False
5 NaN
6 True
Name: Name, dtype: object
10. replace(): This method replaces occurrences of a specific substring or pattern with a new value.
print(df['Name'].str.replace('Elena', 'Emily'))
Output
0 Lukas
1 Sofia
2 Hiroshi
3 Marta
4 Yannis
5 NaN
6 Emily
Name: Name, dtype: object
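replace() can treat its first argument either as literal text or as a regular expression. Since the default for regex has differed between pandas versions, the sketch below passes it explicitly; the Series is a hypothetical subset of the names.

```python
import pandas as pd

names = pd.Series(['Lukas', 'Marta', 'Elena'])

# Literal replacement: every 'a' becomes '_'.
literal = names.str.replace('a', '_', regex=False)
print(literal.tolist())  # ['Luk_s', 'M_rt_', 'Elen_']

# Regex replacement: only an 'a' at the end of the string becomes 'o'.
pattern = names.str.replace(r'a$', 'o', regex=True)
print(pattern.tolist())  # ['Lukas', 'Marto', 'Eleno']
```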
11. repeat(): This method duplicates each string a given number of times.
print(df['Name'].str.repeat(2))
Output
0 LukasLukas
1 SofiaSofia
2 HiroshiHiroshi
3 MartaMarta
4 YannisYannis
5 NaN
6 ElenaElena
Name: Name, dtype: object
12. count(): This method counts how many times a specific substring or pattern appears in each string.
print(df['Name'].str.count('a'))
Output
0 1.0
1 1.0
2 0.0
3 2.0
4 1.0
5 NaN
6 1.0
Name: Name, dtype: float64
13. find(): This method returns the zero-based index of the first occurrence of a substring within each string, or -1 if the substring is not found.
print(df['Name'].str.find('a'))
Output
0 3.0
1 4.0
2 -1.0
3 1.0
4 1.0
5 NaN
6 4.0
Name: Name, dtype: float64
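As a brief illustration of the -1 convention, the following sketch compares find() with its right-hand counterpart rfind() on a hypothetical subset of the names and turns the result into a presence flag.

```python
import pandas as pd

names = pd.Series(['Lukas', 'Hiroshi', 'Marta'])

first = names.str.find('a')   # index of the first 'a', or -1
last = names.str.rfind('a')   # index of the last 'a', or -1
has_a = first >= 0            # -1 means "not found"

print(first.tolist())  # [3, -1, 1]
print(last.tolist())   # [3, -1, 4]
print(has_a.tolist())  # [True, False, True]
```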
14. findall(): This method returns a list of all occurrences of a pattern found in each string.
print(df['Name'].str.findall('a'))
Output
0 [a]
1 [a]
2 []
3 [a, a]
4 [a]
5 NaN
6 [a]
Name: Name, dtype: object
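findall() interprets its argument as a regular expression, so it can match more than a fixed character. A small sketch on a hypothetical subset of the names, collecting every vowel regardless of case:

```python
import re
import pandas as pd

names = pd.Series(['Lukas', 'Hiroshi', 'Elena'])

# Character class matches any vowel; IGNORECASE also catches the capital 'E'.
vowels = names.str.findall(r'[aeiou]', flags=re.IGNORECASE)
print(vowels.tolist())
# [['u', 'a'], ['i', 'o', 'i'], ['E', 'e', 'a']]
```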
15. islower(): This method checks whether all cased characters in each string are lowercase. Every name in the dataset starts with a capital letter, so the result is False for each one.
print(df['Name'].str.islower())
Output
0 False
1 False
2 False
3 False
4 False
5 NaN
6 False
Name: Name, dtype: object
16. isupper(): This method checks whether all cased characters in each string are uppercase. Every name in the dataset contains lowercase letters, so the result is False for each one.
print(df['Name'].str.isupper())
Output
0 False
1 False
2 False
3 False
4 False
5 NaN
6 False
Name: Name, dtype: object
17. isnumeric(): This method checks whether each string contains only numeric characters.
print(df['Name'].str.isnumeric())
Output
0 False
1 False
2 False
3 False
4 False
5 NaN
6 False
Name: Name, dtype: object
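Since none of the names are numeric, the check is easier to follow on values where the answer varies. The sketch below uses hypothetical code strings; note that a decimal point or an empty string makes the result False.

```python
import pandas as pd

# Hypothetical values, not from the article's dataset.
codes = pd.Series(['123', '12.5', 'abc', ''])

numeric = codes.str.isnumeric()
print(numeric.tolist())  # [True, False, False, False]
```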
18. swapcase(): This method swaps uppercase letters to lowercase and lowercase letters to uppercase for each string.
print(df['Name'].str.swapcase())
Output
0 lUKAS
1 sOFIA
2 hIROSHI
3 mARTA
4 yANNIS
5 NaN
6 eLENA
Name: Name, dtype: object