Data Processing is an important part of any task that includes data-driven work. It helps us to provide meaningful insights from the data. As we know Python is a widely used programming language, and there are various libraries and tools available for data processing.
In this article, we are going to see Data Processing in Python, Loading, Printing rows and Columns, Data frame summary, Missing data values Sorting and Merging Data Frames, Applying Functions, and Visualizing Dataframes.
What is Data Processing in Python?
Data processing in Python refers to manipulating, transforming, and analyzing data by using Python. It contains a series of operations that aim to change raw data into structured data. or meaningful insights. By converting raw data into meaningful insights it makes it suitable for analysis, visualization, or other applications.Python provides several libraries and tools that facilitate efficient data processing, making it a popular choice for working with diverse datasets.
Refer to this article – Introduction to Data Processing
What is Pandas?
Pandas is a powerful, fast, and open-source library built on NumPy. It is used for data manipulation and real-world data analysis in Python. Easy handling of missing data, Flexible reshaping and pivoting of data sets, and size mutability make pandas a great tool for performing data manipulation and handling the data efficiently.
Loading Data in Pandas DataFrame
Reading CSV file using pd.read_csv and loading data into a data frame. Import pandas as using pd for the shorthand. You can download the data from here.
Python
#Importing pandas library
import pandas as pd
#Loading data into a DataFrame
data_frame=pd.read_csv('Mall_Customers.csv')
Printing rows of the Data
By default, data_frame.head() displays the first five rows and data_frame.tail() displays last five rows. If we want to get first ‘n’ number of rows then we use, data_frame.head(n) similar is the syntax to print the last n rows of the data frame.
Python
#displaying first five rows
display(data_frame.head())
#displaying last five rows
display(data_frame.tail())
Output:
CustomerID Genre Age Annual Income (k$) Spending Score (1-100)
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
CustomerID Genre Age Annual Income (k$) Spending Score (1-100)
195 196 Female 35 120 79
196 197 Female 45 126 28
197 198 Male 32 126 74
198 199 Male 32 137 18
199 200 Male 30 137 83
[4]
0s
# Program to print all the column name of the dataframe
print(list(data_frame.columns))
Printing the column names of the DataFrame
Python
# Program to print all the column name of the dataframe
print(list(data_frame.columns))
Output:
['CustomerID', 'Genre', 'Age', 'Annual Income (k$)', 'Spending Score (1-100)']
Summary of Data Frame
The functions info() prints the summary of a DataFrame that includes the data type of each column, RangeIndex (number of rows), columns, non-null values, and memory usage.
Python
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerID 200 non-null int64
1 Genre 200 non-null object
2 Age 200 non-null int64
3 Annual Income (k$) 200 non-null int64
4 Spending Score (1-100) 200 non-null int64
dtypes: int64(4), object(1)
memory usage: 7.9+ KB
Descriptive Statistical Measures of a DataFrame
The describe() function outputs descriptive statistics which include those that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values. For numeric data, the result’s index will include count, mean, std, min, and max as well as lower, 50, and upper percentiles. For object data (e.g. strings), the result’s index will include count, unique, top, and freq.
Python
Output:
CustomerID Age Annual Income (k$) Spending Score (1-100)
count 200.000000 200.000000 200.000000 200.000000
mean 100.500000 38.850000 60.560000 50.200000
std 57.879185 13.969007 26.264721 25.823522
min 1.000000 18.000000 15.000000 1.000000
25% 50.750000 28.750000 41.500000 34.750000
50% 100.500000 36.000000 61.500000 50.000000
75% 150.250000 49.000000 78.000000 73.000000
max 200.000000 70.000000 137.000000 99.000000
Missing Data Handing
Find missing values in the dataset
The isnull( ) detects the missing values and returns a boolean object indicating if the values are NA. The values which are none or empty get mapped to true values and not null values get mapped to false values.
Python
Output:
CustomerID Genre Age Annual Income (k$) Spending Score (1-100)
0 False False False False False
1 False False False False False
2 False False False False False
3 False False False False False
4 False False False False False
.. ... ... ... ... ...
195 False False False False False
196 False False False False False
197 False False False False False
198 False False False False False
199 False False False False False
[200 rows x 5 columns]
[8]
0s
Find the number of missing values in the dataset
To find out the number of missing values in the dataset, use data_frame.isnull( ).sum( ). In the below example, the dataset doesn’t contain any null values. Hence, each column’s output is 0.
Python
data_frame.isnull().sum()
Output:
CustomerID 0
Genre 0
Age 0
Annual Income (k$) 0
Spending Score (1-100) 0
dtype: int64
Removing missing values
The data_frame.dropna( ) function removes columns or rows which contains atleast one missing values.
data_frame = data_frame.dropna()
By default, data_frame.dropna( ) drops the rows where at least one element is missing. data_frame.dropna(axis = 1) drops the columns where at least one element is missing.
Fill in missing values
We can fill null values using data_frame.fillna( ) function.
data_frame = data_frame.fillna(value)
But by using the above format all the null values will get filled with the same values. To fill different values in the different columns we can use.
data_frame[col] = data_frame[col].fillna(value)
Row and column manipulations
Removing rows
By using the drop(index) function we can drop the row at a particular index. If we want to replace the data_frame with the row removed then add inplace = True in the drop function.
Python
#Removing 4th indexed value from the dataframe
data_frame.drop(4).head()
Output:
CustomerID Genre Age Annual Income (k$) Spending Score (1-100)
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
5 6 Female 22 17 76
[ ]
This function can also be used to remove the columns of a data frame by adding the attribute axis =1 and providing the list of columns we would like to remove.
Renaming rows
The rename function can be used to rename the rows or columns of the data frame.
Python
data_frame.rename({0:"First",1:"Second"})
Output:
CustomerID Genre Age Annual Income (k$) Spending Score (1-100)
First 1 Male 19 15 39
Second 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
... ... ... ... ... ...
195 196 Female 35 120 79
196 197 Female 45 126 28
197 198 Male 32 126 74
198 199 Male 32 137 18
199 200 Male 30 137 83
[200 rows x 5 columns]
Adding new columns
Python
#Creates a new column with all the values equal to 1
data_frame['NewColumn'] = 1
data_frame.head()
Output:
CustomerID Genre Age Annual Income (k$) Spending Score (1-100) \
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
NewColumn
0 1
1 1
2 1
3 1
4 1
Sorting DataFrame values
Sort by column
The sort_values( ) are the values of the column whose name is passed in the by attribute in the ascending order by default we can set this attribute to false to sort the array in the descending order.
Python
data_frame.sort_values(by='Age', ascending=False).head()
Output:
CustomerID Genre Age Annual Income (k$) Spending Score (1-100) \
70 71 Male 70 49 55
60 61 Male 70 46 56
57 58 Male 69 44 46
90 91 Female 68 59 55
67 68 Female 68 48 48
NewColumn
70 1
60 1
57 1
90 1
67 1
Sort by multiple columns
Python
data_frame.sort_values(by=['Age','Annual Income (k$)']).head(10)
Output:
CustomerID Genre Age Annual Income (k$) Spending Score (1-100) \
33 34 Male 18 33 92
65 66 Male 18 48 59
91 92 Male 18 59 41
114 115 Female 18 65 48
0 1 Male 19 15 39
61 62 Male 19 46 55
68 69 Male 19 48 59
111 112 Female 19 63 54
113 114 Male 19 64 46
115 116 Female 19 65 50
NewColumn
33 1
65 1
91 1
114 1
0 1
61 1
68 1
111 1
113 1
115 1
Merge Data Frames
The merge() function in pandas is used for all standard database join operations. Merge operation on data frames will join two data frames based on their common column values. Let’s create a data frame.
Python
#Creating dataframe1
df1 = pd.DataFrame({
'Name':['Jeevan', 'Raavan', 'Geeta', 'Bheem'],
'Age':[25, 24, 52, 40],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']})
df1
Output:
Name Age Qualification
0 Jeevan 25 Msc
1 Raavan 24 MA
2 Geeta 52 MCA
3 Bheem 40 Phd
Now we will create another data frame.
Python
#Creating dataframe2
df2 = pd.DataFrame({'Name':['Jeevan', 'Raavan', 'Geeta', 'Bheem'],
'Salary':[100000, 50000, 20000, 40000]})
df2
Output:
Name Salary
0 Jeevan 100000
1 Raavan 50000
2 Geeta 20000
3 Bheem 40000
Now. let’s merge these two data frames created earlier.
Python
#Merging two dataframes
df = pd.merge(df1, df2)
df
Output:
Name Age Qualification Salary
0 Jeevan 25 Msc 100000
1 Raavan 24 MA 50000
2 Geeta 52 MCA 20000
3 Bheem 40 Phd 40000
Apply Function
By defining a function beforehand
The apply( ) function is used to iterate over a data frame. It can also be used with lambda functions.
Python
# Apply function
def fun(value):
if value > 70:
return "Yes"
else:
return "No"
data_frame['Customer Satisfaction'] = data_frame['Spending Score (1-100)'].apply(fun)
data_frame.head(10)
Output:
CustomerID Genre Age Annual Income (k$) Spending Score (1-100) \
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
5 6 Female 22 17 76
6 7 Female 35 18 6
7 8 Female 23 18 94
8 9 Male 64 19 3
9 10 Female 30 19 72
NewColumn Customer Satisfaction
0 1 No
1 1 Yes
2 1 No
3 1 Yes
4 1 No
5 1 Yes
6 1 No
7 1 Yes
8 1 No
9 1 Yes
By using the lambda operator
This syntax is generally used to apply log transformations and normalize the data to bring it in the range of 0 to 1 for particular columns of the data.
Python
const = data_frame['Age'].max()
data_frame['Age'] = data_frame['Age'].apply(lambda x: x/const)
data_frame.head()
Output:
CustomerID Genre Age Annual Income (k$) Spending Score (1-100) \
0 1 Male 0.271429 15 39
1 2 Male 0.300000 15 81
2 3 Female 0.285714 16 6
3 4 Female 0.328571 16 77
4 5 Female 0.442857 17 40
NewColumn Customer Satisfaction
0 1 No
1 1 Yes
2 1 No
3 1 Yes
4 1 No
Visualizing DataFrame
Scatter plot
The plot( ) function is used to make plots of the data frames.
Python
# Visualization
data_frame.plot(x ='CustomerID', y='Spending Score (1-100)',kind = 'scatter')
Output:

Scatter plot of the Customer Satisfaction column
Histogram
The plot.hist( ) function is used to make plots of the data frames.
Python
Output:

Histogram for the distribution of the data
Conclusion
There are other functions as well of pandas data frame but the above mentioned are some of the common ones generally used for handling large tabular data. One can refer to the pandas documentation as well to explore more about the functions mentioned above.
Similar Reads
Read And Write Tabular Data using Pandas
Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with ârelationalâ or âlabeledâ data both easy and intuitive. It aims to be the fundamental, high-level building block for doing practical, real-world data analysis in Python.The two primary d
3 min read
How to preprocess string data within a Pandas DataFrame?
Sometimes, the data which we're working on might be stuffed in a single column, but for us to work on the data, the data should be spread out into different columns and the columns must be of different data types. When all the data is combined in a single string, the string needs to be preprocessed.
3 min read
Data profiling in Pandas using Python
Pandas is one of the most popular Python library mainly used for data manipulation and analysis. When we are working with large data, many times we need to perform Exploratory Data Analysis. We need to get the detailed description about different columns available and there relation, null check, dat
1 min read
Streamlined Data Ingestion with Pandas
Data Ingestion is the process of, transferring data, from varied sources to an approach, where it can be analyzed, archived, or utilized by an establishment. The usual steps, involved in this process, are drawing out data, from its current place, converting the data, and, finally loading it, in a lo
9 min read
Python | Pandas Series.data
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric python packages. Pandas is one of those packages and makes importing and analyzing data much easier. Pandas series is a One-dimensional ndarray with axis labels. The labels need not be un
2 min read
Python | Pandas DataFrame.where()
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages and makes importing and analyzing data much easier. Pandas where() method in Python is used to check a data frame for one or more conditio
2 min read
Reading rpt files with Pandas
In most cases, we usually have a CSV file to load the data from, but there are other formats such as JSON, rpt, TSV, etc. that can be used to store data. Pandas provide us with the utility to load data from them. In this article, we'll see how we can load data from an rpt file with the use of Pandas
2 min read
Split Pandas Dataframe by Rows
In this article, we will explore the process of Split Pandas Dataframe by Rows. The Pandas DataFrame serves as the focal point, and throughout this discussion, we will experiment with various methods to Split Pandas Dataframe by Rows. Split Pandas DataFrame by Rows In this article, we will elucidate
3 min read
Python | Pandas dataframe.std()
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric python packages. Pandas is one of those packages and makes importing and analyzing data much easier. Pandas dataframe.std() function return sample standard deviation over requested axis.
2 min read
Data Structures in Pandas
Pandas is an open-source library that uses for working with relational or labeled data both easily and intuitively. It provides various data structures and operations for manipulating numerical data and time series. It offers a tool for cleaning and processes your data. It is the most popular Python
4 min read