
Data Structure
Networking
RDBMS
Operating System
Java
MS Excel
iOS
HTML
CSS
Android
Python
C Programming
C++
C#
MongoDB
MySQL
Javascript
PHP
- Selected Reading
- UPSC IAS Exams Notes
- Developer's Best Practices
- Questions and Answers
- Effective Resume Writing
- HR Interview Questions
- Computer Glossary
- Who is Who
Creating a DataFrame Using Excel Files
What is a dataframe?
A dataframe is a two-dimensional object used to store data in a tabular format, where data is arranged in rows and columns. One can create a dataframe using various ways, and one of the most common methods is importing data from excel files. In this document, we will focus on how to create a dataframe using Excel files in Python, step-by-step.
Why are dataframes important for data analysis
Dataframes are important for data analysis for several reasons ?
Easy to read and manipulate ? Dataframes are a two-dimensional table-like data structure that allow for easy organization and manipulation of data. They are designed to handle complex data operations and can handle data with missing values or different data types.
Efficient data processing ? Dataframes can be used to efficiently process large datasets. They are designed to support vectorized operations, which can perform computations on entire arrays of data at once, rather than iterating over each row or column individually.
Integration with other libraries ? Dataframes are widely used in Python libraries for data analysis and scientific computing such as Pandas, NumPy, Matplotlib, and Scikit-learn. They can be easily integrated with these libraries to perform various tasks such as data cleaning, visualization, and machine learning.
Data exploration and visualization ? Dataframes provide a powerful set of tools for exploring and visualizing data. They allow for easy filtering, sorting, and grouping of data, and can be used to create a wide range of visualizations, including histograms, scatterplots, and heatmaps.
Overall, dataframes are an essential tool for data analysis, and can greatly simplify the process of working with large and complex datasets.
Prerequisites
Before we dive into the task few things should is expected to be installed onto your system ?
List of recommended settings ?
pip install pandas, excel
It is expected that the user will have access to any standalone IDE such as VS-Code, PyCharm, Atom or Sublime text.
Even online Python compilers can also be used such as Kaggle.com, Google Cloud platform or any other will do.
Updated version of Python. At the time of writing the article I have used 3.10.9 version.
Knowledge of the use of Jupyter notebook.
Knowledge and application of virtual environment would be beneficial but not required.
It is also expected that the person will have a good understanding of statistics and mathematics.
Step 1: Importing Libraries
Before we start, we need to import the necessay libraries in Python. We will use pandas, an open-source data manipulation library to manipulate excel files. Use the following code to import pandas ?
import pandas as pd
Step 2: Reading Data from Excel Files
To create a dataframe using Excel files, we first need to read the data from the Excel files. We can do this with the help of the read_excel() function provided by pandas. This function can read the data from both .xls and .xlsx formats.
Basic creation and reading
To create an excel file with one sheet we can create a dataframe. To read the data from a single sheet of the Excel file, you can simply pass the filename of the Excel file along with the sheet name or index to the read_excel() function as shown below ?
Example
# create a sample excel file with 2 columns and 5 rows and 3 sheets import pandas as pd df1 = pd.DataFrame({'Name': ['John', 'Smith', 'Alex', 'James', 'Peter'], 'Age': [25, 30, 27, 22, 32]}) df2 = pd.DataFrame({'Name': ['John', 'Smith', 'Alex', 'James', 'Peter'], 'Age': [25, 30, 27, 22, 32]}) df3 = pd.DataFrame({'Name': ['John', 'Smith', 'Alex', 'James', 'Peter'], 'Age': [25, 30, 27, 22, 32]}) with pd.ExcelWriter('sample.xlsx') as writer: df1.to_excel(writer, sheet_name='Sheet1', index=False) df2.to_excel(writer, sheet_name='Sheet2', index=False) df3.to_excel(writer, sheet_name='Sheet3', index=False) # import the excel file import pandas as pd df = pd.read_excel('sample.xlsx', sheet_name='Sheet1') print(df)
Output
Where filename is the path of the excel file and df is the name of the dataframe.
... Name Age 0 John 25 1 Smith 30 2 Alex 27 3 James 22 4 peter 32
If you want to read a specific range of rows and columns from the sheet, you can use the optional parameters 'header' (to specify the number of rows to be used as columns headers) and 'usecols' (to specify the columns to be read) as shown below ?
df = pd.read_excel('filename.xlsx', sheet_name='Sheet1', header=0, usecols='A:C')
The above code will read only the first three columns(A, B, C) of the Sheet1 with the first row as the header.
Reading Multiple Sheets
Sometimes you may have to read data from multiple sheets. To do this, you can pass the sheet names as a list to the read_excel() function as shown below ?
Example
df1 = pd.read_excel('filename.xlsx', sheet_name=['Sheet1', 'Sheet2']) print (df1)
The above code will read data from both Sheet1 and Sheet2.
Output
Name Age 0 John 25 1 Smith 30 2 Alex 27 3 James 22 4 peter 32 Name Age 0 John 25 1 Smith 30 2 Alex 27 3 James 22 4 peter 32
Reading Data from Multiple Excel Files
To read data from multiple Excel files, you can use the glob() function from the glob module. The glob function searches for all the files that match a specified pattern and returns a list of file names. Use the following code to import the glob module ?
import glob
Once you have imported the glob module, use the following code to read the data from all the Excel files in a folder ?
path = r'folder_path/*.xlsx' files = glob.glob(path) dfs = [] for file in files: data = pd.read_excel(file, sheet_name='Sheet1') dfs.append(data) df = pd.concat(dfs, ignore_index=True)
The above code will read data from all the Excel files with .xlsx extension present in the folder_path directory.
Step 3: Data Cleaning and Manipulation
Now that we have read the data from the Excel file, we can perform various data cleaning and manipulation operations to prepare the data for analysis. Some of the common data cleaning operations include removing duplicates, handling missing values, dropping irrelevant columns, etc.
Removing Duplicates
To remove duplicates from a dataframe, we can use the drop_duplicates() function provided by pandas.
Example
Use the following code to remove duplicates from the dataframe ?
import pandas as pd df = pd.DataFrame({'Name': ['John', 'Smith', 'Alex', 'James', 'Peter', 'John', 'Smith', 'Alex', 'James', 'Peter'], 'Age': [25, 30, 27, 22, 32, 25, 30, 27, 22, 32]}) print(df)
Output
Name Age 0 John 25 1 Smith 30 2 Alex 27 3 James 22 4 peter 32 5 John 25 6 Smith 30 7 Alex 27 8 James 22 9 peter 32
Example
df.drop_duplicates(inplace=True) print(df)
Output
Name Age 0 John 25 1 Smith 30 2 Alex 27 3 James 22 4 peter 32
The above code will remove all the duplicate rows from the dataframe.
Handling Missing Values
To handle missing values in a dataframe, we can use various functions provided by pandas like isnull(), fillna() and dropna().
isnull()? This function is used to check for missing values. Use the following code to check for missing values ?
df.isnull().sum()
The above code will return the total number of missing values in each column of the dataframe.
dropna() ? This function is used to remove rows with missing values. Use the following code to remove all rows with at least one missing value ?
df.dropna(inplace=True)
The above code will remove all rows with at least one missing value from the dataframe.
fillna()? This function is used to fill missing values with a specified value. Use the following code to fill all missing values with 0 ?
df.fillna(0, inplace=True)
The above code will fill all the missing values in the dataframe with 0.
Dropping Irrelevant Columns
To drop irrelevant columns from a dataframe, we can use the drop() function provided by pandas. Use the following code to drop a column named 'column1' ?
df.drop('column1', axis=1, inplace=True)
The above code will remove the column named 'column1' from the dataframe.
Conclusion
In this document, we have discussed how to create a dataframe using Excel files in Python. We have covered various aspects like reading data from Excel files, handling missing values and removing duplicates. This knowledge can be applied to real-world scenarios like importing data from various sources for analysis.