Your Go-To Pandas Cheatsheet for Efficient Data Processing

With Pandas, you can effectively preprocess and manipulate datasets to derive meaningful insights and make informed decisions. The examples below assume the usual imports: import pandas as pd, import numpy as np, import sqlite3, and import matplotlib.pyplot as plt.

Pandas Series

A Series in Pandas is a one-dimensional labeled array capable of holding data of any type, providing both a sequence of values and corresponding labels or indices for each value.

Creating a Series

This section explains how to create a Series in Pandas, which is a one-dimensional labeled array.

1. From a list:

data = [10, 20, 30, 40, 50]
series = pd.Series(data)

2. From NumPy arrays:

my_array = np.array([10, 20, 30, 40, 50])
series_from_array = pd.Series(my_array)

3. From dictionaries:

my_dict = {'a': 10, 'b': 20, 'c': 30}
series_from_dict = pd.Series(my_dict)

4. From tuples:

my_tuple = (10, 20, 30, 40, 50)
series_from_tuple = pd.Series(my_tuple)

5. From sets:

# A set is unordered, so convert it to a list first;
# passing a set directly to pd.Series() raises a TypeError.
my_set = {10, 20, 30, 40, 50}
series_from_set = pd.Series(list(my_set))

6. From scalars:

my_scalar = 10
series_from_scalar = pd.Series(my_scalar)

Accessing Elements in a Series

Here, you will learn how to access specific elements in a Series using various indexing techniques.

1. Accessing by position:

You can access elements in a Series by their numerical position using square brackets [] and the index position. The position index starts from 0.

series = pd.Series([10, 20, 30, 40, 50])
print(series[2])  # Access the element at index 2

2. Accessing by label/index:

If you have assigned custom labels or unique index values to the Series, you can access elements using those labels.

series = pd.Series([10, 20, 30, 40, 50], index=['A', 'B', 'C', 'D', 'E'])
print(series['C'])  # Access the element with label 'C'

3. Accessing by boolean indexing:

Boolean indexing allows you to select specific elements based on a condition, creating a boolean mask that identifies which elements satisfy the condition.

# Create a Series
s = pd.Series([10, 20, 30, 40, 50])

# Create a boolean mask based on a condition
mask = s > 25

# Use the boolean mask to access elements that satisfy the condition
selected_elements = s[mask]

Basic Operations on Series

This part covers the basic operations you can perform on a Series.

1. Arithmetic operations:

You can perform arithmetic operations on a Series, such as addition, subtraction, multiplication, and division. These operations are applied element-wise to numerical values.

series1 = pd.Series([1, 2, 3, 4, 5])
series2 = pd.Series([10, 20, 30, 40, 50])

# Addition
result = series1 + series2

# Subtraction
result = series1 - series2

# Multiplication
result = series1 * series2

# Division
result = series1 / series2

2. Aggregation functions:

Pandas Series provide built-in aggregation functions that allow you to calculate various statistical measures on the data, such as sum, mean, median, minimum, maximum, and more.

series = pd.Series([10, 20, 30, 40, 50])

total = series.sum()
average = series.mean()

# Applying a custom function to square each element
squared = series.apply(lambda x: x ** 2)

Pandas DataFrame Cheatsheet

A DataFrame in Pandas is a two-dimensional labeled data structure consisting of columns of potentially different data types, allowing for efficient handling, analysis, and manipulation of tabular data.

Creating a DataFrame

This section focuses on creating a DataFrame (df), which is a two-dimensional labeled data structure in Pandas, meaning a df can contain multiple columns and rows.

1. From a dictionary:

You can create a DataFrame from a dictionary, where keys represent column names and values represent data for each column.

data = {'Name': ['John', 'Emma', 'Peter'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)

2. From a list of lists:

You can create a DataFrame from a list of lists, where each inner list represents a row of data.

data = [['John', 25, 'New York'],
        ['Emma', 30, 'London'],
        ['Peter', 35, 'Paris']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])

3. From a NumPy array:

To create a DataFrame from a NumPy array in Pandas, you can use the pd.DataFrame() function.

# Create a NumPy array
data = np.array([[10, 20, 30],
                 [40, 50, 60],
                 [70, 80, 90]])

# Create a DataFrame from the NumPy array
df = pd.DataFrame(data, columns=['A', 'B', 'C'])

Loading Data into a DataFrame

Here, you will explore different methods to load external data into a DataFrame. It covers reading data from CSV, Excel, and JSON files, and from SQL databases.

1. From a CSV file:

Use the pd.read_csv() function to read data from a CSV file and create a DataFrame.

df = pd.read_csv('data.csv')

2. From an Excel file:

Use the pd.read_excel() function to read data from an Excel file and create a DataFrame.

df = pd.read_excel('data.xlsx')

3. From a SQL database:

Use the pd.read_sql() function to execute a SQL query and load the results into a DataFrame.

conn = sqlite3.connect('database.db')
query = 'SELECT * FROM table_name'
df = pd.read_sql(query, conn)

4. From a JSON file:

Use the pd.read_json() function to read data from a JSON file and create a DataFrame.

df = pd.read_json('data.json')

5. From a URL or web API:

Use the appropriate method, such as pd.read_csv() or pd.read_json(), passing the URL of the data source.

# Sample URL containing CSV data
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'

# Load data from the URL into a DataFrame
df = pd.read_csv(url)

# Display the DataFrame
print(df.head())

Accessing and Manipulating DataFrame Elements

This part focuses on accessing and manipulating elements within a DataFrame. You will learn how to select specific rows and columns using various indexing techniques.

1. Accessing columns:

To access a specific DataFrame column, use square brackets [] with a single column name (or a list of names for multiple columns) as the key.

df = pd.DataFrame({'Name': ['John', 'Emma', 'Peter'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'London', 'Paris']})

# Access a single column
name_column = df['Name']

# Access multiple columns
name_age_columns = df[['Name', 'Age']]

2. Accessing rows:

You can access rows in a DataFrame using the loc[] or iloc[] indexers. loc[] is used for label-based indexing, while iloc[] is used for positional indexing.

# Access a single row by label
row = df.loc[0]

# Access multiple rows by label
rows = df.loc[1:2]

# Access a single row by position
row = df.iloc[0]

# Access multiple rows by position
rows = df.iloc[1:3]

3. Modifying DataFrame elements:

You can modify DataFrame elements by assigning new values using indexing techniques.

# Modify a single element
df.at[0, 'Name'] = 'Johnny'

# Modify a column
df['Age'] = df['Age'] + 1

# Modify multiple columns (use .str.upper() on string columns)
df[['Name', 'City']] = df[['Name', 'City']].apply(lambda x: x.str.upper())

4. Adding a column:

To add a column to a DataFrame in Pandas, you can assign values to a new column using indexing. Here's an example:

df = pd.DataFrame({'Name': ['John', 'Emma', 'Peter'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'London', 'Paris']})

# Add a new column
df['Salary'] = [50000, 60000, 70000]

Data Indexing and Slicing

Data indexing and slicing in Pandas allow selecting and manipulating specific subsets of data by label or position.

1. Indexing by label and position:

Pandas allows indexing DataFrame elements using either label-based indexing (loc[]) or positional indexing (iloc[]).

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [6, 7, 8, 9, 10]})

# Indexing using label-based indexing
label_indexing = df.loc[2, 'A']

# Indexing using positional indexing
positional_indexing = df.iloc[3, 1]

2. Slicing data by rows and columns:

You can slice DataFrame data to select specific rows and columns using the slicing notation.

# Slicing rows
sliced_rows = df[1:4]

# Slicing columns
sliced_columns = df[['A', 'B']]

Filtering and Sorting DataFrame Data

This section covers techniques for filtering and sorting data within a DataFrame. You will learn how to apply conditions to select rows based on specific criteria, create boolean masks, and use logical operators for complex filtering.

1. Here's an example that demonstrates how to retrieve rows based on a specific criterion in a Pandas DataFrame:

df = pd.DataFrame({'Name': ['John', 'Emma', 'Peter', 'Emily', 'Daniel'],
                   'Age': [25, 30, 35, 28, 32],
                   'City': ['New York', 'London', 'Paris', 'London', 'Berlin']})

# Select rows where Age is greater than 30
selected_rows = df[df['Age'] > 30]
print(selected_rows)

2. Here's an example that demonstrates how to create boolean masks for a DataFrame in Pandas:

# Create a boolean mask for rows where City is 'London'
mask = df['City'] == 'London'
print(mask)

3. Pandas also provides several sorting methods to arrange data in a DataFrame based on column values. Here are some of the common sorting methods:

a. sort_values(): sorts the DataFrame based on the values of one or more columns.

df = pd.DataFrame({'Name': ['John', 'Emma', 'Peter'],
                   'Age': [25, 30, 20],
                   'Salary': [50000, 60000, 45000]})

# Sort by 'Age' in ascending order
df_sorted_age_asc = df.sort_values('Age')

# Sort by 'Salary' in descending order
df_sorted_salary_desc = df.sort_values('Salary', ascending=False)

b. sort_index(): sorts the DataFrame based on the row labels (index).

df = pd.DataFrame({'Name': ['John', 'Emma', 'Peter'],
                   'Age': [25, 30, 20]},
                  index=[2, 1, 0])

# Sort by row labels (index) in ascending order
df_sorted_index_asc = df.sort_index()

# Sort by row labels (index) in descending order
df_sorted_index_desc = df.sort_index(ascending=False)

c. sort_values() with multiple columns: sorts the DataFrame based on multiple columns.

df = pd.DataFrame({'Name': ['John', 'Emma', 'Peter'],
                   'Age': [25, 30, 20],
                   'Salary': [50000, 60000, 45000]})

# Sort by 'Age' in ascending order, and for tied values, sort by 'Salary' in descending order
df_sorted_multi = df.sort_values(by=['Age', 'Salary'], ascending=[True, False])

Data Cleaning

Data cleaning with Pandas involves handling missing values, duplicates, and unnecessary columns, ensuring data integrity and quality for analysis.

1. Handling missing values:

Missing values are a common occurrence in datasets. In Pandas, you can handle missing values using methods like dropna() and fillna().

df = pd.DataFrame({'A': [1, 2, None, 4, 5],
                   'B': [6, None, 8, 9, 10]})

# Drop rows with missing values
df_dropna = df.dropna()

# Fill missing values with 0
df_fillna = df.fillna(0)

# Specify missing values as None or np.nan while creating the DataFrame
data = {'Name': ['John', 'Emma', None],
        'Age': [25, 30, pd.NA],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)

2. Dealing with duplicates:

Duplicates in a DataFrame can impact analysis and lead to incorrect results. Pandas provides the drop_duplicates() method to handle duplicates.

df = pd.DataFrame({'A': [1, 2, 2, 3, 4],
                   'B': ['a', 'b', 'b', 'c', 'd']})

# Drop duplicates based on all columns
df_deduplicated = df.drop_duplicates()

# Drop duplicates based on column 'A'
df_deduplicated_col = df.drop_duplicates('A')

3. Removing unnecessary columns:

Sometimes, certain columns are not required for analysis or modeling. You can remove columns using the drop() method.

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})

# Remove column 'B'
df_removed_col = df.drop('B', axis=1)

# Remove multiple columns 'B' and 'C'
df_removed_cols = df.drop(['B', 'C'], axis=1)

Data Transformation

Data transformation with Pandas involves applying functions to data, adding new columns, grouping and aggregating data, and reshaping data.

1. Applying functions to data:

Pandas allows you to apply custom functions to manipulate data in a DataFrame using the apply() method.

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6]})

# Apply a function to double the values in column 'A'
df['A_doubled'] = df['A'].apply(lambda x: x * 2)

2. Adding new columns:

You can add new columns to a DataFrame by assigning values to them. The new column will be aligned with the existing data.

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6]})

# Add a new column 'C' with the sum of columns 'A' and 'B'
df['C'] = df['A'] + df['B']

3. Grouping and aggregating data:

Grouping and aggregating data allows you to calculate summary statistics based on specific groups or categories in the DataFrame.

df = pd.DataFrame({'Name': ['John', 'Emma', 'John', 'Emma', 'John'],
                   'Age': [25, 30, 35, 28, 32],
                   'Salary': [50000, 60000, 70000, 55000, 80000]})

# Group by 'Name' and calculate the average 'Age' and maximum 'Salary'
grouped_df = df.groupby('Name').agg({'Age': 'mean', 'Salary': 'max'})

4. Reshaping data with pivot table and melt:

Pivot tables and melt allow you to reshape and transform your data between wide and long formats.

df = pd.DataFrame({'Name': ['John', 'Emma', 'Peter'],
                   'Subject': ['Math', 'Science', 'Math'],
                   'Score': [80, 90, 85]})

# Convert the DataFrame to a pivot table
pivot_table = df.pivot_table(values='Score', index='Name', columns='Subject')

# Convert the pivot table back to the original (long) format
melted_df = pd.melt(pivot_table.reset_index(), id_vars='Name', value_vars=['Math', 'Science'])

Pandas Cheatsheet Python: Advanced Data Manipulation

In this section, we cover slightly more advanced data-manipulation techniques: combining and merging DataFrames, working with date and time data, and handling categorical data.

Combining and Merging DataFrames

1. Concatenating DataFrames:

df1 = pd.DataFrame({'A': [1, 2, 3],
                    'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [7, 8, 9],
                    'B': [10, 11, 12]})

# Concatenate vertically
df_concat_vertical = pd.concat([df1, df2])

# Concatenate horizontally
df_concat_horizontal = pd.concat([df1, df2], axis=1)

2. Joining and merging DataFrames:

Joining and merging DataFrames allows you to combine them based on common columns or indexes. You can use the pd.merge() function or DataFrame's join() method for this purpose.

df1 = pd.DataFrame({'A': [1, 2, 3],
                    'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [2, 3, 4],
                    'C': [7, 8, 9]})

# Perform an inner join based on column 'A'
df_inner_join = pd.merge(df1, df2, on='A')

# Perform a left join based on column 'A'
df_left_join = df1.merge(df2, on='A', how='left')

Working with Dates and Times

1. Handling date and time data:

Pandas provides specialized data types and functions for handling date and time data. You can convert columns to the datetime data type using the pd.to_datetime() function.

df = pd.DataFrame({'Date': ['2022-01-01', '2022-02-01', '2022-03-01'],
                   'Sales': [100, 200, 300]})

# Convert 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'])

2. Extracting information from dates:

Once the date column is in datetime format, you can extract various components such as year, month, day, etc. using the .dt accessor.

# Extract year and month from 'Date' column
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

Handling Categorical Data

1. Encoding categorical variables:

Categorical variables need to be encoded into numeric form for analysis. Pandas provides various methods like pd.get_dummies() for one-hot encoding and pd.Categorical() for ordinal encoding.

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B']})

# Perform one-hot encoding using pd.get_dummies()
encoded_df = pd.get_dummies(df['Category'])

# Perform ordinal encoding using pd.Categorical()
df['Category'] = pd.Categorical(df['Category'], ordered=True, categories=['A', 'B', 'C'])
df['Category'] = df['Category'].cat.codes

2. Performing categorical data operations:

Once categorical variables are encoded, you can perform operations like grouping, sorting, and statistical calculations on them.

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B'],
                   'Value': [10, 20, 30, 40, 50]})

# Group by 'Category' and calculate the mean of 'Value'
grouped_df = df.groupby('Category')['Value'].mean()

# Sort by 'Category' in ascending order
sorted_df = df.sort_values('Category')

# Perform statistical calculations by 'Category'
stats_df = df.groupby('Category')['Value'].agg(['mean', 'std'])

Data Visualization with Pandas

Basic Plotting

1. Line plots: line plots are useful for visualizing trends and patterns over time or continuous variables. You can create line plots using the .plot() method in Pandas.

# Create a DataFrame
data = {'Year': [2010, 2011, 2012, 2013, 2014],
        'Sales': [100, 200, 300, 400, 500]}
df = pd.DataFrame(data)

# Create a line plot
df.plot(x='Year', y='Sales', kind='line')

# Display the plot
plt.show()

2. Bar plots: bar plots are suitable for comparing categories or discrete variables. Pandas provides an easy way to create bar plots using the .plot() method.

# Create a DataFrame
data = {'City': ['New York', 'London', 'Paris'],
        'Population': [8500000, 8900000, 2200000]}
df = pd.DataFrame(data)

# Create a bar plot
df.plot(x='City', y='Population', kind='bar')

# Display the plot
plt.show()

3. Scatter plots: scatter plots help visualize the relationship between two continuous variables. Pandas makes it simple to create scatter plots using the .plot() method.

# Create a DataFrame
data = {'Age': [25, 30, 35, 40, 45],
        'Income': [50000, 60000, 70000, 80000, 90000]}
df = pd.DataFrame(data)

# Create a scatter plot
df.plot(x='Age', y='Income', kind='scatter')

# Display the plot
plt.show()
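The merging section mentions DataFrame's join() method alongside pd.merge() but shows no example of it. Here is a minimal sketch, reusing the same illustrative df1/df2 values: unlike pd.merge(), join() aligns on the index by default, so the shared column is set as the index first.

```python
import pandas as pd

# join() aligns on the index by default, unlike pd.merge(), which
# joins on common columns. Set the shared column 'A' as the index.
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}).set_index('A')
df2 = pd.DataFrame({'A': [2, 3, 4], 'C': [7, 8, 9]}).set_index('A')

# Left join on the shared index; rows of df1 with no match get NaN in 'C'
joined = df1.join(df2, how='left')
print(joined)
```

Because unmatched rows receive NaN, the joined 'C' column becomes a float column even though the inputs were integers.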
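To tie the pieces together, here is a short sketch chaining several operations from this cheatsheet: filling a missing value with fillna(), deriving a column with apply(), and summarizing with groupby(). The column names and figures are illustrative only, not from any real dataset.

```python
import pandas as pd

# Create a small DataFrame with one missing salary
df = pd.DataFrame({'Name': ['John', 'Emma', 'John', 'Emma'],
                   'Salary': [50000, 60000, None, 55000]})

# Fill the missing value, then derive a 10% bonus column via apply()
df['Salary'] = df['Salary'].fillna(0)
df['Bonus'] = df['Salary'].apply(lambda s: s * 0.1)

# Group by 'Name' and compute the mean salary per person
summary = df.groupby('Name')['Salary'].mean()
print(summary)
```

Each step returns a regular DataFrame or Series, so the cleaning, transformation, and aggregation techniques above compose freely in any order your analysis requires.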