Becoming A Data Analyst
Every effort has been made in the preparation of this book to ensure the
accuracy of the information presented. However, the information contained in
this book is sold without warranty, either express or implied. Neither the
author, nor Packt Publishing, and its dealers and distributors will be held
liable for any damages caused or alleged to be caused directly or indirectly by
this book.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK
ISBN: 978-1-80512-641-6
www.packt.com
Table of Contents
1. Becoming a Data Analyst: A beginner’s guide to kickstarting your data
analysis journey
2. 1 Understanding the Business Context of Data Analysis
I. Join our book community on Discord
II. A Data Analyst’s Role in the Data Analytics Lifecycle
i. Business Understanding
ii. Data Inspection
iii. Data Pre-processing & Preparation
iv. Exploratory Data Analysis
v. Data Validation
vi. Explanatory Data Analysis
III. Summary
3. 2 Introduction to SQL
I. Join our book community on Discord
II. SQL and its use cases
i. Brief History of SQL
ii. SQL and data analysis
III. Different Databases
i. Relational vs. Non-Relational Databases
ii. Popular DBMS’s
IV. SQL Terminology
i. Query
ii. Statement
iii. Clause
iv. Keyword
v. View
V. Setting up your environment
i. Choosing a DBMS
ii. Installing necessary software
iii. Creating a sample database
VI. Writing Basic SQL Queries
i. SELECT Statement
ii. Structure of a Query
iii. INSERT Statements
iv. UPDATE Statements
v. DELETE Statements
vi. SQL basic rules and syntax
VII. Filtering and organizing data with clauses
i. WHERE Clause
ii. ORDER BY Clause
iii. DISTINCT Clause
iv. LIMIT Clause
VIII. Using operators and functions
i. Comparison operators
ii. Logical operators (AND, OR)
iii. LIKE operator
iv. Arithmetic operators
v. Functions for calculations
vi. Functions for text manipulation
vii. Date functions
IX. Summary
4. 3 Joining Tables in SQL
I. Join our book community on Discord
II. Table relations
i. Implementing Relationships in SQL
ii. SQL Joins
iii. Best Practices for Using JOIN in SQL
III. Summary
5. 4 Creating Business Metrics with Aggregations
I. Join our book community on Discord
II. Aggregations in Business Metrics
i. Aggregations in SQL to Analyze Data
ii. GROUP BY Clause
iii. HAVING clause
iv. Best Practices for Aggregations
III. Summary
6. 5 Advanced SQL
I. Join our book community on Discord
II. Working with subqueries
i. Types of subqueries
ii. Non-code explanation of subquery
iii. Using a basic subquery on our library database
iv. Subquery vs joining tables
v. Rules of subquery usage
vi. More advanced subquery usage on our library database
III. Common Table Expressions
i. Use cases for CTEs
ii. Examples with the library database
IV. Window functions: A panoramic view of your data
i. Example with the Library Database
V. Understanding date time manipulation
i. Date and time functions
ii. Examples with the library database
VI. Understanding text manipulation
i. Text functions
ii. Text functions in the library database
VII. Best practices: bringing it all together
i. Write readable SQL Code
ii. Be careful with NULL values
iii. Use subqueries and CTEs wisely
iv. Think about performance
v. Test your queries
VIII. Summary
7. 6 SQL for Data Analysis Case Study
I. Join our book community on Discord
II. Setting up the database
III. Performing data analysis with SQL
IV. Exploring the data
i. General data insights
V. Analyzing the data
i. Examining the clothing category
ii. Determining the number of customers
iii. Researching the top payment methods
iv. Gathering customer feedback
v. Exploring the relationship between ratings and sales
vi. Finding the percentage of products with reviews
vii. Effectiveness of discounts
viii. Identifying the top customers
ix. Top-selling clothing products
x. Top 5 high-performing products
xi. Most popular product by country
xii. Researching the performance of delivery
xiii. Future projections with linear regression
VI. Summary
8. 7 Fundamental Statistical Concepts
I. Join our book community on Discord
II. Descriptive statistics
i. Levels of measurement
ii. Measures of central tendency
iii. Measures of variability
III. Inferential statistics
i. Probability theory
ii. Probability distributions
iii. Correlation vs causation
IV. Summary
9. 8 Testing Hypotheses
I. Join our book community on Discord
II. Technical requirements
III. Introduction to Hypothesis Testing
i. Role of Hypothesis Testing in Data Analysis
ii. Null and Alternative Hypothesis
iii. Step by Step Guide to Performing Hypothesis Testing
IV. One Sample t-Test
V. Conditions for Performing a One-Sample T-Test
i. Case Study: Average Exam Scores
VI. Two Sample t-Test
i. Case Study: Comparing Exam Scores Between Two Schools
VII. Chi Square Test
i. Case Study: Effect of Tutoring on Passing Rates
VIII. Analysis of Variance (ANOVA)
i. Case Study: Comparing Exam Scores Among Three Schools
IX. Summary
10. 9 Business Statistics Case Study
I. Join our book community on Discord
II. Technical requirements
III. Case Study Overview
i. Learning Objectives:
ii. Questions:
iii. Solutions:
IV. Additional Topics to Explore
i. Text Analytics
ii. Big Data
iii. Time Series Analysis
iv. Predictive Analytics
v. Prescriptive Analytics & Optimization
vi. Database Management
V. Where to practice
VI. Summary
11. 10 Data analysis and programming
I. Join our book community on Discord
II. The role of programming and our case
III. Different programming languages
i. Python
ii. R
iii. SQL
iv. Julia
v. MATLAB
IV. Working with the Command Line Interface (CLI)
i. Command Line Interface (CLI) vs Graphical User Interface
(GUI)
ii. Accessing the CLI
iii. Typical CLI tasks
iv. Using the CLI for programming
V. Setting up your system for Python programming
i. Check if Python is installed
ii. MacOS
iii. Linux
iv. Windows
v. Browser (cloud-based)
vi. Testing the Python setup
VI. Python use cases for CleanAndGreen
i. Data Cleaning and Preparation
ii. Data Visualization
iii. Statistical Modeling
iv. Predictive Modeling/Machine Learning
v. General remarks on Python
VII. Summary
12. 11 Introduction to Python
I. Join our book community on Discord
II. Understanding the Python Syntax
i. Print Statements
ii. Comments
iii. Variables
iv. Operations on variables
v. Operators and Expressions
III. Exploring Data Types in Python
i. Strings
ii. Integers
iii. Floats
iv. Booleans
v. Type Conversion
IV. Indexing and Slicing in Python
V. Unpacking Data Structures
i. Lists
ii. Dictionaries
iii. Sets
iv. Tuples
VI. Mastering Control Flow Structures
i. Conditional Statements in Python
ii. Looping in Python
VII. Functions in Python
i. Creating Your Own Functions
ii. Python Built-In Functions
VIII. Summary
13. 12 Analyzing data with NumPy & Pandas
I. Join our book community on Discord
II. Introduction to NumPy
i. Installing and Importing NumPy
ii. Basic NumPy Operations
III. Statistical and Mathematical Operations
i. Mathematical Operations with NumPy Arrays
IV. Multi-dimensional Arrays
i. Creating Multi-dimensional Arrays
ii. Accessing elements in Multi-dimensional Arrays
iii. Reading Data from a CSV File
V. Introduction to Pandas
i. Series and DataFrame
ii. Loading Data with Pandas
iii. Data Analysis with Pandas
iv. Data Analysis
VI. Summary
14. 13 Introduction to Exploratory Data Analysis
I. Join our book community on Discord
II. The Importance of EDA
i. The EDA Process
ii. Tools and Techniques
III. Univariate Analysis
i. Analyzing Continuous Variables
ii. Analyzing Categorical Variables
IV. Bivariate Analysis
i. Understanding bivariate analysis
ii. Correlation vs Causation
iii. Visualizing relationships between two continuous variables
V. Multivariate analysis
i. Heatmaps
ii. Pair plots
VI. Summary
15. 14 Data Cleaning
I. Join our book community on Discord
II. Technical requirements
III. Importance of data cleaning
i. Impact on data quality
ii. Relevance to business decisions
IV. Common data cleaning challenges
i. Inconsistent formats
ii. Misspellings and Inaccuracies
iii. Duplicate records
V. Dealing with missing values
i. Causes of missing values
ii. Strategies for handling missing values
iii. Types of missing data
VI. Dealing with duplicate values
i. Causes of duplicate data
ii. Identification and removal
VII. Dealing with outliers
i. Types of outliers
ii. Impact on analysis
iii. Techniques for identifying and handling outliers
VIII. Cleaning and transforming data
i. Handling inconsistencies
ii. Converting categorical data
iii. Normalizing numerical features
IX. Data validation
i. Validation methods
X. Summary
16. 17 Exploratory Data Analysis Case Study
I. Join our book community on Discord
II. Technical Requirements
III. E-commerce Sales Optimization Case Study
i. Time Series Analysis
ii. Customer Segmentation
iii. Product Analysis
iv. Payment and Returns
v. Case Study Answers
IV. Summary
Becoming a Data Analyst: A
beginner’s guide to kickstarting
your data analysis journey
Welcome to Packt Early Access. We’re giving you an exclusive preview of
this book before it goes on sale. It can take many months to write a book, but
our authors have cutting-edge information to share with you today. Early
Access gives you an insight into the latest developments by making chapter
drafts available. The chapters may be a little rough around the edges right
now, but our authors will update them over time. You can dip in and out
of this book or follow along from start to finish; Early Access is designed to
be flexible. We hope you enjoy getting to know more about the process of
writing a Packt book.
Business Understanding
The first and probably most important phase is business understanding, as your work here sets the direction for the rest of the project. Mistakes here result in performing unnecessary work or providing a solution that does not solve the intended problem. There are three main areas in this phase:
1. Defining business objectives: Here you will define the actual goal of the data project. The basis of every project is to provide a solution that solves a problem or improves a process. An important concept here is symptoms versus the root cause. Symptoms are the visible signs or effects of an issue in a system; they are what trigger the investigation of an issue or the need for a solution.
The root cause is the underlying issue that all the symptoms stem from.
Unlike symptoms, root causes are often not visible or apparent without a
thorough investigation. Addressing the symptoms will only provide
temporary fixes, while addressing the root cause will resolve the problem
more permanently. When speaking with stakeholders, they will often spend most of their time talking about symptoms. Many times, what they say is the problem really isn’t the problem. As a data analyst, you must know how to ask the right questions to sift through the symptoms, identify the root cause, and frame the correct business problem. Tools for success: Five Whys and Fishbone diagrams. The Five Whys is a quick and effective technique to uncover a root cause: you begin with a problem statement and then ask “why” five times, or as many times as needed, to land at the underlying root cause. Below is a diagram that visually depicts the process.
The Fishbone Diagram, as pictured below, is a visual aid to identify and
organize the possible causes and effects of a problem. This is also a great
method for prioritizing the different root problems to solve.
2. Gather relevant information: Once you determine your business objective, your next task is to gather the data and any additional information regarding the project. A typical task is identifying your data sources. These will mostly be internal sources and may include external data as well. Data sources can include company databases, reports, or third-party research, as well as additional interviews with stakeholders, web scraping, or surveys.
Tools for success: An important skill for a data analyst is the ability to
navigate an organization to source all the information and data that you need.
This will be done through interviews and emails. It’s also important to know
that you will work with other people who will have different roles and
responsibilities within the project. A great way to keep all this information
organized is a RACI matrix, pictured below. A RACI matrix helps you keep
track of the responsibilities of others in your project and whom you consulted
and informed. This is a great way to ensure a high level of communication in
a project.
3. Determine key performance indicators (KPIs) and metrics: KPIs and
metrics are used to measure and evaluate a business process. A metric is a
measure used to track and monitor a business process. Examples include
website traffic and profit margin. A KPI is a measurable value that’s often
tied to business goals and strategy. Examples include net promoter score,
conversion rates, and customer lifetime value. While they are very similar in
nature, KPIs are more closely aligned with business goals while metrics are
used to track any business activity. When speaking with stakeholders, you
would also like to define the critical success factors, the essential activities
that must go well in order for the objective to be achieved. Based on those
success factors, you will develop your KPIs and metrics. These measures will
be part of the essential data used for business decisions.
Tools for Success: A Scope of Work (SOW) or project charter are excellent
tools to ensure a clear scope for your project. It is a document that can
summarize all the information gathered in the previous steps that would
include the project overview, your tasks, expected results, timeline, and
expected deliverable. The expected deliverable is one of the most important
elements as it defines the work you will present at the end of the project.
Deliverables can include a dashboard, setting up an automated process to
maintain the dashboard, a simple report, or a PowerPoint presentation.
Data Inspection
Once the business objective has been established, a data analyst will then set
out to understand their data. This phase will introduce more technical work
involving the initial data collection. There are three major areas:
1. Collect initial data: You already identified your data sources; here you will explore your company’s databases and reports, or extract external data through web scraping, to build your initial dataset.
2. Determine data availability: Identify how often this data is gathered or updated, and how you will be able to access it for future use.
3. Explore data and characteristics: Take a first look to identify important variables, data types, and the format of your data. You will also determine whether you need to gather more data or enrich it, and identify the initial data that will be used for analysis.
Tools for success: To collect the initial data, a data analyst can typically use
SQL to explore a database. If there is data that needs to be scraped from the
web or even a PDF file, programming tools such as Python or R can be used.
Data validation: You want to ensure that your data follows the business
logic or rules. This can be investigated by exploring the ranges or
formatting of your variables. Mistakes can occur, especially when data
is entered manually. Essentially, you verify whether your data makes
sense. For example, a customer record with an age of 150 indicates an error.
Missing value treatment: Missing data is a common issue when cleaning data. There are multiple methods, including deletion, ignoring the missing data, adding a missing-data indicator column, and imputation. More details on these methods are covered in the data cleaning chapter later in this book.
Removing duplicates: Duplicates will lead to misleading numbers and
will cause incorrect conclusions.
Outlier treatment: Here you will identify outliers and determine how you will treat them. An outlier is a data point that is significantly different from most of the other data points. Outliers are usually the result of either natural variation in a process or errors. There are multiple treatment methods, including ignoring them, imputation, deletion, or transformation.
Data normalization: When data is moved through different tools and phases of the data pipeline, data types may be converted unintentionally. Here you will fix the formatting, convert units of measurement, or standardize categorical data.
Feature engineering: Where you transform variables to better represent
the data. This can involve binning, aggregations, or combining variables.
Data Validation
After exploring your data, you want to ensure your analyses make sense
according to the business logic. Like in the data preparation phase, you will
perform a second data validation step to verify your numbers make sense
before you present them in your deliverable. It is often helpful to include data
quality checks within your process. Tools for success: Many different tools
can be used to perform data validation or quality checks. Microsoft Excel can
be used to quickly compare expected and actual values. If you are building
and automating a data pipeline, implementing unit tests or error handling is
essential to ensure errors will be caught and dealt with for future data.
Summary
In this chapter we went over the importance of understanding the business
context of a project. This included an overview of each phase of the data
analytics lifecycle while including the typical roles and responsibilities of a
data analyst in each phase. Different tools were provided to aid a data
analyst’s understanding of the business problem of a project. We introduced
certain topics such as statistics, SQL, and exploratory data analysis that we
will cover in more detail in future chapters.
2 Introduction to SQL
Join our book community on Discord
https://packt.link/EarlyAccessCommunity
Let's dive into the fascinating world of SQL and kickstart our journey toward
becoming a data analyst!
SQL and its use cases
SQL is the abbreviation of Structured Query Language. It is the language we
need to interact with the database. A database is a collection of organized
information. This information is also known as data. Typically, when we talk
about databases, we mean electronically stored databases. Let’s think of an
example of a database. It usually helps me to picture something that we use
on a daily basis: for example, the database of the local library. They need to
keep track of a number of things, amongst others:
The books they own
Their members
The loans: who has borrowed which book and when it is due back
This information can all be stored in a database. A database like this would
consist of different tables:
Books
Members
Loans
A Books table, for example, could have the following columns:
Title
Author
ISBN
Publication year
Publisher
In Figure 2.1 you can see an example of what that table could look like.
Please mind that this is a simplified version; we’ll make it slightly more complicated, bit by bit, as we learn more.
As you can see, every row of the table represents a book. And don’t get too
excited about these really cool titles; it’s unfortunately dummy data. So, at
this point you may wonder: great, but what part of this is SQL? And that
would be a great question! We can use SQL to do the following things:
Create databases and tables
Add, update, and delete the data in those tables
Retrieve data from one or more tables so that we can analyze it
Since we are focusing on data analytics, the last use case is the
most important for us. We will need to get data from a table and do all sorts
of things with it. But let’s start with the origin of SQL first.
Different Databases
We use SQL to interact with a database. There are different types of
databases out there. Databases typically have a Database Management System (DBMS) available to inspect and manage data. Each DBMS has its
own set of features and strengths, making them suitable for specific
applications and use cases. Let’s first explore the differences between
relational and non-relational databases and then discuss some popular
database management systems.
Relational Databases
In Figure 2.3 we see the book table. And as you can see, we now use the id of
the author to represent the author.
We call the id column in the table the primary key. A primary key is a unique
identifier for each record in a table, ensuring that no two records have the
same value for the primary key attribute. Primary keys help enforce data
integrity and consistency, and they are used to establish relationships between
tables. When we use this key in another table to refer to the record in the
other table, it’s called a foreign key.SQL is the standard language for
interacting with relational databases. Relational databases are very common.
Some of the advantages of relational databases are data consistency, support
for complex queries, and ease of data retrieval. There are also some downsides: relational databases are not well suited to unstructured data, and they can become slow at very large data volumes. In these cases, non-relational databases might be a better
option.
Non-Relational Databases
Popular DBMS’s
There are several popular DBMS options available. They all have their own
set of features and strengths. Let’s list the most popular ones.
As a new data analyst, you’re probably not going to be the person choosing a
certain database. However, in your data analyst journey, you will likely
encounter and work with various databases and DBMS. So, knowing these
common ones and understanding their strengths will help you along the way. Next up, we’ll be dealing with some SQL terminology that is going to be
necessary for your success as a data analyst.
SQL Terminology
Let’s familiarize ourselves with key SQL terminology to better understand
the structure of relational databases and the language used to interact with
them. In this section, we’ll go through a brief explanation of what they are
and what their role in data management is. Let’s start with one of the most
central terms: query.
Query
A query is a request for specific data or information from a database. In
SQL, queries are used to get, update, insert, or delete data stored in tables.
Queries are written using SQL statements, which are composed of clauses
and keywords. Let’s see what statements are next.
Statement
An SQL statement is a text string composed of SQL commands, clauses, and
keywords. It is used to perform the various query tasks such as creating,
updating, deleting, or getting data in a relational database. SQL statements
are the building blocks of SQL queries, and they typically contain one or
more clauses. Which brings us to the next key concept: clause.
Clause
A clause is a part of an SQL statement that performs a specific function or
operation. Clauses can for example be used to define conditions, specify
sorting, group data, or join tables. Some common SQL clauses include
SELECT, FROM, WHERE, and ORDER BY. Each clause is often associated
with specific SQL keywords.
Keyword
Keywords are reserved words in SQL that have a predefined meaning and
are used to construct SQL statements. They help define the structure and
syntax of a query. Examples of SQL keywords include SELECT, INSERT,
UPDATE, DELETE, and CREATE. When writing more complex SQL
queries, you'll often use keywords in conjunction with views.
View
A view is a virtual table that is based on the result of an SQL query. It does
not store data itself but provides a way to access and manipulate data from
one or more underlying tables. Views can be used to simplify complex
queries and customize data presentation for specific users. Let’s move on to
setting up our environment, so that we can get some practice with writing
SQL soon.
Choosing a DBMS
Selecting a suitable database management system is the first step in setting up
your environment. There are several popular options out there such as
MySQL, PostgreSQL, SQL Server, Oracle, and SQLite. For beginners,
MySQL or SQLite are excellent choices due to their simplicity. In the
examples, we’ll be using MySQL, so that would be my recommendation to
go with for now.
So, let’s assume you’ll be installing MySQL, this is what you’ll need to
install:
MySQL Server
MySQL Workbench
You can follow the official installation guides provided by the respective
database management system to ensure a smooth setup. Here are the guides for MySQL:
Windows: https://dev.mysql.com/doc/mysql-installation-excerpt/8.0/en/windows-installation.html
MacOS: https://dev.mysql.com/doc/mysql-installation-excerpt/8.0/en/macos-installation.html
Linux: https://dev.mysql.com/doc/mysql-installation-excerpt/8.0/en/linux-installation.html
Once you’ve managed to do this, it’s time to set up a sample database that we
can use.
After this, we can open MySQL Workbench. This brings us to the start
screen.
Figure 2.4 – Welcome to MySQL Workbench
Fill out the form as above, and click on OK. The connection library should
now appear. We can double click it, and the connection should open. It
should look like Figure 2.6.
Figure 2.6 – Opened connection
Alright, if you’re here, that’s a great place to be! Let’s add in a schema and
some tables. For that, you can use the following code snippet (you can find
this in the associated GitHub folder):
-- Creating schema for the library
CREATE SCHEMA Library;
-- Creating tables within the Library schema
USE Library;
-- Creating the Authors table
CREATE TABLE Authors (
id INT AUTO_INCREMENT PRIMARY KEY,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(50) NOT NULL,
nationality VARCHAR(50)
);
-- Creating the Books table
CREATE TABLE Books (
id INT AUTO_INCREMENT PRIMARY KEY,
title VARCHAR(100) NOT NULL,
author_id INT,
isbn VARCHAR(20),
publication_year INT,
publisher VARCHAR(50),
FOREIGN KEY (author_id) REFERENCES Authors(id)
);
-- Creating the Members table
CREATE TABLE Members (
id INT AUTO_INCREMENT PRIMARY KEY,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(50) NOT NULL,
join_date DATE,
email VARCHAR(100) UNIQUE
);
-- Creating the BorrowedBooks table
CREATE TABLE BorrowedBooks (
id INT AUTO_INCREMENT PRIMARY KEY,
book_id INT,
member_id INT,
borrow_date DATE,
due_date DATE,
return_date DATE,
FOREIGN KEY (book_id) REFERENCES Books(id),
FOREIGN KEY (member_id) REFERENCES Members(id)
);
Figure 2.7 – Query editor with the SQL statements for schema and table
creation
We now need to hit the lightning icon (the leftmost one). Make sure you
didn’t select a part of the SQL, because then it will only try and execute that
part. It should show green checks and success messages in the bottom part.
The tables won’t show until you hit the refresh icon next to schemas (upper
left part). You can see the newly created tables on the left in Figure 2.8.
The tables are still empty. In order to add some data in, we need to execute
another set of statements. Go ahead and clear the query editor. Then copy the
below code snippet in:
-- Populating the Authors table
INSERT INTO Authors (id, first_name, last_name, nationality) VALUES
(1, 'Akira', 'Suzuki', 'Japanese'),
(2, 'Amara', 'Diop', 'Senegalese'),
(3, 'Johannes', 'Müller', 'German'),
(4, 'Ying', 'Li', 'Chinese'),
(5, 'Aisling', 'O''Brien', 'Irish'),
(6, 'Carlos', 'Fernandez', 'Spanish'),
(7, 'Adeola', 'Adeyemi', 'Nigerian'),
(8, 'Anastasia', 'Ivanova', 'Russian'),
(9, 'Sofia', 'Silva', 'Portuguese'),
(10, 'Vivek', 'Patel', 'Indian');
-- Populating the Books table
INSERT INTO Books (id, title, author_id, isbn, publication_year, publisher) VALUES
(1, 'Data Analysis with SQL', 1, '1234567890123', 2022, 'TechPress'),
(2, 'SQL for Data Analysts', 2, '2345678901234', 2021, 'DataBooks'),
(3, 'Mastering SQL for Data Analysis', 3, '3456789012345', 2020, 'Analytics Publishing'),
(4, 'Efficient Data Analytics with SQL', 4, '4567890123456', 2019, 'TechPress'),
(5, 'Advanced SQL Techniques for Data Analysts', 5, '5678901234567', 2018, 'DataBooks'),
(6, 'SQL for Business Intelligence', 6, '6789012345678', 2021, 'Analytics Publishing'),
(7, 'Data Wrangling with SQL', 7, '7890123456789', 2020, 'TechPress'),
(8, 'SQL for Big Data', 8, '8901234567890', 2019, 'DataBooks'),
(9, 'Data Visualization Using SQL', 9, '9012345678901', 2018, 'Analytics Publishing'),
(10, 'SQL for Data Science', 10, '0123456789012', 2021, 'TechPress');
-- Populating the Members table
INSERT INTO Members (id, first_name, last_name, email, join_date) VALUES
(1, 'Alice', 'Johnson', '[email protected]', '2021-01-01'),
(2, 'Bob', 'Smith', '[email protected]', '2021-02-15'),
(3, 'Chiara', 'Rossi', '[email protected]', '2021-05-10'),
(4, 'David', 'Gonzalez', '[email protected]', '2021-07-20'),
(5, 'Eve', 'Garcia', '[email protected]', '2021-09-30'),
(6, 'Femi', 'Adeyemi', '[email protected]', '2021-11-10'),
(7, 'Grace', 'Kim', '[email protected]', '2021-12-15'),
(8, 'Henrik', 'Jensen', '[email protected]', '2022-03-05'),
(9, 'Ingrid', 'Pettersson', '[email protected]', '2022-04-20'),
(10, 'Jia', 'Wang', '[email protected]', '2022-06-01');
-- Populating the BorrowedBooks table
INSERT INTO BorrowedBooks (id, book_id, member_id, borrow_date, due_date) VALUES
(1, 1, 1, '2022-04-01', '2022-05-01'),
(2, 2, 2, '2022-04-05', '2022-05-05'),
(3, 3, 3, '2022-04-10', '2022-05-10'),
(4, 4, 4, '2022-04-15', '2022-05-15'),
(5, 5, 5, '2022-04-20', '2022-05-20'),
(6, 6, 6, '2022-04-25', '2022-05-25'),
(7, 7, 7, '2022-04-30', '2022-05-30'),
(8, 8, 8, '2022-05-02', '2022-06-02'),
(9, 9, 9, '2022-05-05', '2022-06-05'),
(10, 10, 10, '2022-05-08', '2022-06-08');
Run this by clicking on the lightning icon. Again, you should get green confirmation messages. Without writing any SQL ourselves, we can verify that the data is in there: right click on “BorrowedBooks” on the left, and then click on “Select Rows – Limit 1000”,
as displayed in Figure 2.9.
And that’s it! You have your system up and running. Let’s finally get to it:
writing our own SQL statements.
SELECT Statement
The SELECT statement is the foundation of data retrieval in SQL. We use it
to fetch data from one or more tables within a database. The basic structure of
a SELECT query is as follows:
SELECT column_name(s)
FROM table_name
WHERE conditions;
For example, to retrieve all books in our library catalog, the query would look
like:
SELECT *
FROM books;
With the * we say we’d like to get all the columns. And we specify the table
books. This will return all the books in our table, as you can see in Figure
2.11.
Figure 2.11 – The result of the SELECT * from books query
We could also have specified only the title and isbn column:
SELECT title, isbn
FROM books;
That would have given us the output that you can see in Figure 2.12.
Figure 2.12 – Select statement for only two columns
By the way, SQL keywords are case insensitive, and in most setups the same goes for column and table names: we could spell them in upper- or lowercase and the database would still know what we’re referring to. (Note that on some systems, such as MySQL on Linux, table names are case sensitive.)
Structure of a Query
A SQL query is composed of several clauses, each serving a specific purpose.
The main clauses in a SELECT query are SELECT, which lists the columns to retrieve; FROM, which names the table to retrieve them from; and WHERE, which optionally filters the rows with conditions. Later in this chapter we’ll add clauses such as ORDER BY, DISTINCT, and LIMIT.
INSERT Statements
INSERT statements are used for adding new data to a table. The basic
structure of an INSERT query is:
INSERT INTO table_name (column1, column2, column3, ...)
VALUES (value1, value2, value3, ...);
For example, to add a new book to our library catalog, the query would look
like:
INSERT INTO books (title, author_id, publication_year)
VALUES ('How to SQL book', 1, 2015);
Please note that we’re not specifying all the columns here. Depending on how the database is created, this may or may not be allowed. In our case it is, and the missing columns will be filled with the value NULL. Sometimes you need to change
some values of a row after adding it. This can be done with the UPDATE
statement.
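To double-check that the new row made it in, you could run a quick query like this (a small sanity check, not part of the original example):
SELECT *
FROM books
WHERE title = 'How to SQL book';
This should return exactly one row, with NULL in the columns we didn’t specify.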
UPDATE Statements
With the UPDATE statements we can modify existing data in a table. The
basic structure of an UPDATE query is:
UPDATE table_name
SET column1 = value1, column2 = value2, ...
WHERE conditions;
For example, to update the publication year of a book in our library catalog,
the query would look like:
UPDATE books
SET publication_year = 2016
WHERE id = 11;
Using the WHERE clause on columns other than the id might require you to disable safe update mode:
SET SQL_SAFE_UPDATES = 0;
UPDATE books
SET publication_year = 2017
WHERE title = 'How to SQL book';
We now use the first line to turn off safe update mode. And after that, we can
update based on non-id fields. We can also delete records from the table.
Let’s explore how to do that next.
DELETE Statements
This might be a surprise, but the DELETE statements are used for removing
data from a table. The basic structure of a DELETE query is:
DELETE FROM table_name
WHERE conditions;
For example, to remove a book from our library catalog, the query would
look like:
DELETE FROM books
WHERE title = 'How to SQL book';
This will remove the book that we just added. We really need to master these
basic SQL statements to interact with data. Before we do so, let’s see some
basic syntax rules.
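One of those rules concerns reserved words used as identifiers. The exact example from this draft isn’t reproduced here, but a sketch along these lines illustrates the idea (shipping_log is a hypothetical table that happens to have a column literally named from):
SELECT `from`
FROM shipping_log;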
Here we only select the column named “from”. And since that is a keyword, we have to enclose it in backticks in order to use it as a column name.
Capitalization
It is a common convention to write SQL keywords in uppercase. For example:
SELECT title, author_id
FROM books
WHERE publication_year > 2010;
In this query, the keywords SELECT, FROM, and WHERE are capitalized,
while the table and column names remain in lowercase. However, the
following also works:
SeLeCT tITle, autHor_ID
From BOOKS
WheRE publiCatION_yeaR > 2010;
Single quotes
Single quotes are used to enclose string literals, such as text or date
values. For example:
INSERT INTO books (title, author)
VALUES ('The Art of Data Analysis', 'Jane Doe');
Here, the title and author values are enclosed in single quotes since they are
string literals. We have seen the queries end with semicolons so far. Let’s talk
about it.
Semicolons
The semicolon is used to mark the end of an SQL statement. While some
database management systems do not require semicolons, it's a good practice
to include them for clarity and to avoid potential errors. For example:
SELECT * FROM books;
UPDATE books SET price = price * 1.04;
In this example, semicolons are used to mark the end of each SQL statement.
This helps to make it clear where one statement ends and the next begins. In
this particular example we are increasing all the prices. The basic rules are
important whenever you’re writing SQL. That should be enough to be ready
for becoming more proficient at filtering the data next.
WHERE Clause
The WHERE clause filters records based on specified conditions. Above we
filtered for the condition equal to a certain id or title. We use WHERE in
conjunction with SELECT, UPDATE, and DELETE statements. For
example, to find all books published before 2019 in our library catalog, the
query would look like:
SELECT *
FROM books
WHERE publication_year < 2019;
This will return two results. The books with id 5 and 9. Let’s have a look at
how we can sort our results.
ORDER BY Clause
We can sort our results with the ORDER BY clause. We can specify one or
multiple columns to sort on. We can also choose to sort the results in
ascending (ASC) or descending (DESC) order. For example, to retrieve
books sorted by publication year in ascending order, the query would look
like:
SELECT *
FROM books
WHERE publication_year < 2019
ORDER BY publication_year ASC;
We can also specify a second column to sort on, to break ties. With the query above, both results share the publication year 2018, and they come back in this order:
5  Advanced SQL Techniques for Data Analysts  5  5678901234567  2018  DataBooks
9  Data Visualization Using SQL  9  9012345678901  2018  Analytics Publishing
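The exact snippet isn’t reproduced in this draft, but adding the publisher as a second sort column would look like this:
SELECT *
FROM books
WHERE publication_year < 2019
ORDER BY publication_year ASC, publisher ASC;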
Now the results swap around: when the publication year is the same, the rows are sorted by publisher from A to Z.
9  Data Visualization Using SQL  9  9012345678901  2018  Analytics Publishing
5  Advanced SQL Techniques for Data Analysts  5  5678901234567  2018  DataBooks
We can do more things with the clauses. Let’s see how we can use
DISTINCT to only get unique values.
DISTINCT Clause
The DISTINCT clause is used to eliminate duplicate records from the result
set of a query. For example, to retrieve a list of unique publishers from our
library catalog, the query would look like:
SELECT DISTINCT publisher
FROM books;
This will return a list of the unique publishers in our table; this is the result:
TechPress
DataBooks
Analytics Publishing
Since we only select the publisher column, that’s the only one we’ll get.
DISTINCT is often used for inner SELECT statements, something we’ll learn
about in later chapters. For now, let’s have a look at the LIMIT clause.
LIMIT Clause
We use the LIMIT clause to limit the number of records returned by a query.
This can be particularly useful when working with large datasets or when you
need to get only a specific number of records. For example, to fetch the top 5
oldest books in our library catalog, the query would look like:
SELECT *
FROM books
ORDER BY publication_year ASC
LIMIT 5;
We select all the columns from the books table, and order them by
publication year. We then only get the first 5 results, as indicated by
LIMIT 5. These filtering clauses really add to what we can do and the
questions we can answer with the use of SQL. Let’s add some other great
tools in our SQL toolbox and learn how to work with operators and functions.
Comparison operators
The comparison operators are used in the WHERE clause to filter records
based on the specified conditions. We have seen = already. This is for
checking whether a certain field equals a certain value. We also saw the <, to
check for books before a certain publication year. The common comparison
operators include =, <>, >, <, >=, and <=. For example, to find books
published in or after 2020 you would use the following query:
SELECT *
FROM books
WHERE publication_year >= 2020;
This will return the books with id 1, 2, 3, 6, 7 and 10. The other comparison operators work in a similar way. Sometimes you’d like to specify multiple conditions. This can be achieved with the logical operators AND and OR.
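The exact query from this draft isn’t shown here, but combining two conditions with AND along these lines produces that single result against our data:
SELECT *
FROM books
WHERE publication_year < 2019 AND publisher = 'DataBooks';
OR works the same way, except that a row is returned when either of the conditions is true.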
This will return only one result (the book with id 5). When we check for a certain text value, we surround it with single quotes, as you can see in the example above. That comparison looks for an exact match. We can also look for text occurring anywhere in a field, which can be done with the LIKE operator.
LIKE operator
The LIKE operator is used to search for a specified pattern in a column. A pattern is built from literal characters and wildcards: we use the percentage symbol (%) to represent zero or more characters and
the underscore (_) to represent a single character. For example, to find books
with titles starting with "The", we would use the following query:
SELECT *
FROM books
WHERE title LIKE 'The%';
To find books with titles containing the word "Data", we could do this:
SELECT *
FROM Books
WHERE title LIKE '%Data%';
We can also do more things with the numeric data fields. This can be done
with the use of the arithmetic operators.
Arithmetic operators
The Arithmetic operators (+, -, *, /, %) add the ability to perform
calculations on numeric columns. Let's imagine that our books table has a
price column. We can calculate the discounted price of each book, assuming
a 10% discount:
SELECT title, price, price * 0.9 AS discounted_price
FROM books;
We can also work with multiple arithmetic operators. Let’s calculate the price
difference between the original price and a discounted price of 15%:
SELECT title, price, price - (price * 0.85) AS price_difference
FROM books;
Or we could get the value of a single page with the following query where
we calculate the price per page of each book:
SELECT title, price, page_count, price / page_count AS price_per_page
FROM books;
And lastly, I had to be a bit creative with my example here. Let’s assume we
have a special lucky 7 discount campaign and we want to find the remainder
when the price is divided by 7. Here’s how to do that:
SELECT title, price, price % 7 AS remainder
FROM books;
Let’s say that our books table had a page_count column (it doesn’t, so this won’t work without adding it). We can calculate the average number of pages
per book by using the AVG function:
SELECT AVG(page_count)
FROM books;
The total of all the pages of all the books in our library could be calculated
like this:
SELECT SUM(page_count)
FROM books;
And we could find the publication year of the oldest book with the use of the
MIN function:
SELECT MIN(publication_year)
FROM books;
There are also functions to manipulate text values. A text value is commonly called a string. We can use the LENGTH function to get the number of characters in a string.
The following SQL will give back a number for each title, representing the
length.
SELECT LENGTH(title)
FROM books;
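LENGTH is not the only text function. As a small additional sketch (these functions are standard in MySQL but aren’t part of the original example), UPPER converts text to uppercase and CONCAT glues strings together:
SELECT UPPER(title) AS upper_title,
       CONCAT(title, ' - ', publisher) AS title_with_publisher
FROM books;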
Date functions
Date functions are used to manipulate and perform calculations on date and
time values. Some common date functions include CURRENT_DATE ,
CURRENT_TIMESTAMP , DATE_ADD , and DATE_DIFF . For example, to find
borrow sessions that were due before today, we can say:
SELECT *
FROM BorrowedBooks
WHERE due_date < CURRENT_DATE
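DATE_ADD, mentioned above, shifts a date by a given interval. As an illustrative sketch (not an example from the original text), this shows each due date next to a hypothetical one-week extension:
SELECT id, due_date, DATE_ADD(due_date, INTERVAL 7 DAY) AS extended_due_date
FROM BorrowedBooks;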
And to get the number of days that books were allowed to be borrowed we
could say:
SELECT book_id, DATEDIFF(due_date, borrow_date) AS allowed_borrow_duration
FROM BorrowedBooks
WHERE due_date < CURRENT_DATE
Here we use the DATEDIFF function. This will come up with the number of
days between two dates. We show the result in the output with the name
allowed_borrow_duration. These functions and operators enable us to answer ever more complicated questions with SQL and a dataset. That’s it for this introductory chapter!
Summary
You’ve made your way through the first chapter on SQL, well done! We
introduced the fundamental concepts of SQL and its role in data analysis. We
began by exploring the history and importance of SQL, followed by the
difference between relational and non-relational databases. Then we saw an
overview of different database management systems, such as MySQL,
PostgreSQL, SQL Server, Oracle, and SQLite. We then delved into essential
SQL terminologies, such as queries, statements, clauses, keywords, and
views. We also provided guidance on setting up your environment, choosing
a suitable database management system, and creating a sample database. Next,
we covered basic SQL query writing, including the use of SELECT,
INSERT, UPDATE, and DELETE statements. We also explained how to
filter data using various clauses like WHERE, ORDER BY, DISTINCT, and
LIMIT. Lastly, we discussed the use of operators and functions in SQL,
including comparison operators, logical operators (AND, OR), the LIKE
operator, arithmetic operators, and various functions for calculations, text
manipulation, and date manipulation. At this point, you should have a solid
understanding of the basic SQL concepts and be able to write simple queries
to interact with databases. In the next chapters, we will expand on these
concepts and explore how to join tables, work with groupings, and create
more complex queries for data analysis.
3 Joining Tables in SQL
Join our book community on Discord
https://packt.link/EarlyAccessCommunity
After learning about the SQL basics and understanding how databases work,
it's time to dive a little deeper. We'll discuss table relationships and joins.
Joining tables is typically considered a tough topic, but it's one that you'll
really need to become comfortable with to be a data analyst. Don't worry,
we'll build it up carefully. The topics we'll cover in this chapter are:
So please join me on this journey to learn about table joins. We'll continue to
work with our library example.
Table relations
Imagine we tried to put all the data in our library database into one big table.
What would happen? Well, we would quickly find that we're repeating a lot
of the information. For example, if a book gets borrowed by multiple
members, we'd have to repeat its details each time it is borrowed. This
repetition is not efficient to say the least. Because we have repetition, the data
becomes harder to maintain, and it increases the likelihood of errors. Imagine
that a detail of the member changes, we'd have to change it in all the rows
where the member borrowed a book. Chances are that we’d forget to change it in some spots, and we’d lose data integrity as a result. This is why we separate
our data into multiple tables, and to make this practical, we establish
relationships between these tables. Let's revisit our library database, which is
already divided into tables as per the previous chapter: authors , books ,
borrowed_books, and members. We have the following relationships
between these tables:
Each book is written by one author, and an author can write multiple
books. This is a one-to-many relationship between the authors and
books tables.
Looked at from the other direction, the relationship from books to authors is a many-to-one relationship.
Each member can borrow multiple books, and each book can be
borrowed by multiple members. This is a many-to-many relationship
between the members and the books tables. This relationship is
represented using the borrowed_books table.
There's also the concept of a one-to-one relationship that we don't have
in our database. However, let's suppose we have a table for rare_books
in the library, where each rare book has a unique insurance policy. This
forms a one-to-one relationship between the rare_books and
insurance_policies tables. One rare book only has one insurance
policy. And one insurance policy is only for one rare book.
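These two tables are not part of our library database, but purely as an illustration, a one-to-one relationship like this could be enforced by giving the foreign key a UNIQUE constraint (all names and types below are assumptions):
-- Hypothetical one-to-one relationship: each rare book has exactly one policy
CREATE TABLE rare_books (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(100) NOT NULL
);
CREATE TABLE insurance_policies (
    id INT AUTO_INCREMENT PRIMARY KEY,
    rare_book_id INT NOT NULL UNIQUE,
    insured_value DECIMAL(10,2),
    FOREIGN KEY (rare_book_id) REFERENCES rare_books(id)
);
Because rare_book_id is UNIQUE, each rare book can appear in at most one insurance policy, which is exactly the one-to-one constraint described above.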
And here are the first two entries of our books table:
In the books table, author_id is a foreign key that links each book to an
author in the authors table. The value of author_id in the books table
matches the value of id (the primary key) in the authors table.
SQL Joins
So, when we want to retrieve data from multiple related tables, we perform a
join. A join combines records from two or more tables based on a related
column between them. This allows you to create a new result set that includes
data from multiple tables. Here are some examples of questions that can be answered with the use of SQL joins:
Which members have borrowed which books?
Which authors wrote the books in our collection?
Which books have never been borrowed?
There are several types of joins. We'll discuss the most common ones here,
starting with inner join. The joins that we will discuss deal with NULL values
differently. We're going to explain the differences by using the following
data:
And here are the first five entries of our books table:
The DROP TABLE statement is used in SQL to remove an entire table from
a database. It permanently deletes the table and all of its data, indexes,
triggers, and other associated objects. The DROP TABLE statement is
primarily used when you no longer need a table and want to remove it from
the database altogether.The basic syntax of the DROP TABLE statement is as
follows:
DROP TABLE table_name;
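For example (purely illustrative, and only something you would run if you really no longer need the table and its data), dropping a hypothetical staging table would look like this:
-- staging_books is a hypothetical table name used only for this illustration
DROP TABLE staging_books;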
Unlike the DROP TABLE statement, which removes the entire table, the
DELETE statement focuses on deleting specific data within the table.
If you recall from the last chapter, not specifying any conditions and not
using a WHERE clause leads to deleting all records in that table. Below is an
example where the statements will clear all data from the tables:
DELETE FROM books;
DELETE FROM members;
Next, let's focus on populating all the tables we deleted records from or just
created.
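The statements that reshape the tables aren’t reproduced in this draft. As a rough sketch of what they might look like, the old Authors and BorrowedBooks tables from the previous chapter would be dropped (taking the foreign keys that reference them into account) and recreated with the columns implied by the INSERT statements below, while books and members keep their earlier definitions. The column types and sizes here are assumptions:
-- Assumed shape of the reworked tables (not the book's exact definitions)
CREATE TABLE authors (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    country VARCHAR(50),
    birth_year INT,
    email VARCHAR(100)
);
CREATE TABLE borrowed_books (
    id INT PRIMARY KEY,
    book_id INT,
    member_id INT,
    borrow_date DATE,
    due_date DATE
);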
Comments in SQL
With -- we can add a comment to SQL. The database will ignore the text that comes after it on that line. We use it in these examples to indicate what the
SQL snippet is doing.
-- Populating the Authors table
INSERT INTO authors VALUES (1, 'Emeka Okafor', 'Nigeria', 1982, '[email protected]'),
(2, 'Mei Lin', 'China', 1978, '[email protected]'),
(3, 'Sophia Martin', 'USA', 1985, '[email protected]'),
(4, 'Juan Rodriguez', 'Spain', 1970, '[email protected]'),
(5, 'Aya Morimoto', 'Japan', 1990, '[email protected]'),
(6, 'Maria Low', 'Germany', 1981, '[email protected]');
-- Re-Populating the Books table
INSERT INTO books VALUES (1, 'Data Analytics for Beginners', 1, '978-111111111', 2020, 'DataPress Publishing'),
(2, 'The Art of Data Analysis', 2, '978-222222222', 2019, 'Analytical Press'),
(3, 'Mastering Python', 1, '978-333333333', 2021, 'Python Publishers'),
(4, 'SQL for Data Science', 3, '978-444444444', 2022, 'Science Publishing House'),
(5, 'The Basics of Quantum Computing', NULL, '978-555555555', 2023, 'Quantum Press'),
(6, 'The Basics of Blockchain', 5, '978-666666666', 2022, 'Blockchain Press'),
(7, 'SQL for Data Analysis', NULL, '978-777777777', 2023, 'Data Analysis Press'),
(8, NULL, 4, '978-888888888', 2019, 'AI Publishing');
-- Populating the Members table
INSERT INTO members (first_name, last_name, email, join_date) VALUES
('Alice', 'Johnson', '[email protected]', '2021-01-01'),
('Bob', 'Smith', '[email protected]', '2021-02-15'),
('Chiara', 'Rossi', '[email protected]', '2021-05-10'),
('David', 'Gonzalez', '[email protected]', '2021-07-20'),
('Eve', 'Garcia', '[email protected]', '2021-09-30'),
('Femi', 'Adeyemi', '[email protected]', '2021-11-10'),
('Grace', 'Kim', '[email protected]', '2021-12-15'),
('Henrik', 'Jensen', '[email protected]', '2022-03-05'),
('Ingrid', 'Pettersson', '[email protected]', '2022-04-20'),
('Jia', 'Wang', '[email protected]', '2022-06-01');
-- Populating the Borrowed_Books table
INSERT INTO borrowed_books (id, book_id, member_id, borrow_date, due_date) VALUES
(1, 1, 1, '2022-04-01', '2022-05-01'),
(2, 2, 2, '2022-04-05', '2022-05-05'),
(3, 3, 3, '2022-04-10', '2022-05-10'),
(4, 4, 4, '2022-04-15', '2022-05-15'),
(5, 5, 5, '2022-04-20', '2022-05-20'),
(6, 6, NULL, '2022-04-25', '2022-05-25'),
(7, NULL, 7, '2022-04-30', '2022-05-30'),
(8, 4, 8, '2022-05-02', '2022-06-02'),
(9, NULL, NULL, '2022-05-05', '2022-06-05'),
(10, 7, 10, '2022-05-08', '2022-06-08'),
(11, 3, 6, NULL, NULL);
Note that we did not specify the column names when inserting data into the authors and books tables. Why? Because we’re adding data to all columns, in the order they are defined, so it makes no difference if you leave the column list out. In contrast, we did specify the columns for the members table, leaving out the id column because that’s set to AUTO_INCREMENT. Copy and paste each of these statements individually and
execute them. Once you have successfully done everything above, we can
move to the next section and learn about the types of joins.You can verify
you have the correct data if you perform a select all statement and you'll get
the following output:
Figure 3.0 – Select all statements and their output
Several types of joins in SQL allow you to combine data from multiple tables
based on specified conditions:
1. INNER JOIN
2. LEFT JOIN (or LEFT OUTER JOIN)
3. RIGHT JOIN (or RIGHT OUTER JOIN)
4. FULL JOIN (or FULL OUTER JOIN)
5. CROSS JOIN (or Cartesian Join)
6. SELF JOIN
Now you are ready to learn about each of these. We will begin exploring
inner Join first.
Inner Join
An inner join retrieves data that exists in both tables being joined. It returns
records that have matching values in both tables. Let's see a practical
example. If you want to find all members who have borrowed a book, you
would use an inner join.
SELECT members.first_name, members.last_name, borrowed_books.book_id
FROM borrowed_books
INNER JOIN members ON borrowed_books.member_id = members.id;
This query will return the names of the members who have borrowed a book
and the ID of the book they borrowed. Here’s how the query works: the FROM clause starts from the borrowed_books table, the INNER JOIN clause matches each borrowed_books row with the members row whose id equals its member_id, and the SELECT clause returns the member’s first and last name together with the id of the borrowed book. Rows without a match on either side are left out.
We can make a slightly more complex example; if you want to find all books
that have been borrowed and by whom, you will use a double inner
join.Here's an example using the library database (schema):
SELECT books.title, members.first_name, members.last_name
FROM borrowed_books AS bb
INNER JOIN books ON bb.book_id = books.id
INNER JOIN members ON bb.member_id = members.id;
This query will return the title of each borrowed book and the name of the
member who borrowed it.
As seen above in Figure 3.1, the borrowed books are returned with the first
and last names of the members who borrowed them.
Left Join
A left join retrieves all the records from the left table and any matching
records from the right table. If there is no match, the result is NULL on the
right side. For example, you may want to find all authors and any books
they've written that are in the library, whether the books have been borrowed
or not.
SELECT authors.name, books.title
FROM authors
LEFT JOIN books ON authors.id = books.author_id;
This query will return a list of all authors and the titles of any books they've
written. If an author hasn't written any books in the library, the books.title
field for that author will be NULL .
Figure 3.2 – Results of Left Join Query
In this case, the result shows that Emeka Okafor is the author of two books
while Juan Rodriguez is the author of a book without a title, hence the
NULL value in the title column. Moreover, Maria Low is not the author of
any books in our library database and that's why there's a NULL value for
her. This query joins two tables ( books , authors ) and retrieves the author's
name from the authors table and the book's title from the books table.
Right Join
A right join retrieves all the records from the right table and the matching
records from the left. The result is NULL on the left side if there is no match.
If a value in the left table doesn't occur in the right table, this value is left out
(pun intended). When might such a join be useful? For instance, when we
want a list of all books and who wrote them (if anyone).
SELECT authors.name, books.title
FROM authors
RIGHT JOIN books
ON authors.id = books.author_id;
Figure 3.3 - Results of Right Join Query
This query will return all books, even those that don’t have a matching author in the authors table; for those books, the author’s name will show as NULL. Authors without any books, such as Maria Low, are left out of the result.
Figure 3.4 - Results of Right Join Query
Full Join
A full join, also known as full outer join, retrieves all records where there is
a match in either the left or the right table. If there is no match, the result is
NULL on either side. Here’s an example where we want to find all books and
authors, whether the book has an author in the authors table or not, and
whether the author has a book in the books table or not:
SELECT authors.name, books.title
FROM authors
FULL JOIN books
ON authors.id = books.author_id;
However, this gives an error because MySQL doesn't support the FULL JOIN
keyword (other flavors of SQL like PostgreSQL do support this keyword, so
this query would work in that instance). Hence, we will have to try another
approach. We will have to "simulate" the results of a FULL JOIN .
SELECT a.name, b.title
FROM authors a
LEFT JOIN books b ON a.id = b.author_id
UNION
SELECT a.name, b.title
FROM authors a
RIGHT JOIN books b ON a.id = b.author_id
WHERE a.id IS NULL;
Cross Join
The CROSS JOIN or Cartesian join basically results in each row of the first
table being matched with every row in the second table (known as the
cartesian product in mathematics). This type of join is useful for retrieving
all possible table record combinations. Here's an example to better understand
this:
SELECT a.name, b.title
FROM authors AS a
CROSS JOIN books AS b;
As seen below in Figure 3.6, the result contains 48 rows of all possible
combinations for the author names and book titles.
Figure 3.6 - Results of Cross Join Query
Self Join
As evident from its name, a SELF JOIN is used to join a table to itself based on a condition. For example, an employees table might contain a manager_id column that refers to another row in the same table: the employee’s manager. Here’s the query:
SELECT e.employee_name, m.employee_name AS manager_name
FROM employees e
JOIN employees m ON e.manager_id = m.employee_id;
Note that the same table has two aliases, "e" and "m", and is treated as two individual tables joined on the manager_id column. The output returns each employee’s name and the respective manager’s name. Because this is an inner join, employees whose manager_id is NULL (typically top-level managers) won’t appear in the result; if you wanted to include them with a NULL manager_name, you would use a LEFT JOIN instead. Congratulations! You have learned all about the types of joins in SQL. Now let’s look at best practices and performance tips to optimize your SQL join queries.
There are multiple ways to write joins, but using the explicit JOIN syntax,
rather than just separating your tables with commas, leads to clearer and
maintainable code. Another benefit of explicit JOIN syntax is separating your
join logic from your filtering logic. Consider this example from our library
database:
-- Implicit JOIN Syntax
SELECT members.first_name, members.last_name, books.title
FROM members, borrowed_books, books
WHERE members.id = borrowed_books.member_id AND books.id = borrowed_books.book_id;
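The explicit version of the same query isn’t reproduced in this draft; it would look along these lines:
-- Explicit JOIN Syntax
SELECT members.first_name, members.last_name, books.title
FROM members
JOIN borrowed_books ON members.id = borrowed_books.member_id
JOIN books ON books.id = borrowed_books.book_id;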
The explicit JOIN syntax is easier to read and separates our join conditions
from any other filter conditions we might have.
Being mindful of your join conditions is crucial to obtaining the correct data.
Careless join conditions can result in unintended cross-joins (cartesian
products) or just outright incorrect data. Consider these two join conditions in
our library database:
-- Incorrect JOIN condition, resulting in a Cartesian product
SELECT books.title, authors.name
FROM books
JOIN authors ON books.id = authors.id;
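A corrected version of the same query, joining through the author_id foreign
key in the books table, could be sketched as:
-- Correct JOIN condition, using the author_id foreign key
SELECT books.title, authors.name
FROM books
JOIN authors ON books.author_id = authors.id;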
The first query will join books and authors whose ids match, which isn't
correct in our data model. The second query correctly joins books and
authors using the author_id foreign key in the books table.
Remember that in a LEFT JOIN , all rows from the first (left) table will be
included in the result set, even if there's no matching record in the second
(right) table. Conversely, in a RIGHT JOIN , all rows from the second (right)
table will be included, even if there's no matching record in the first (left)
table. For instance, if you want to find all authors and their related books, you
might use a LEFT JOIN , with `authors` as the first table. This would ensure
that authors who have yet to write any books are included in the result. If you
add another LEFT JOIN to a second table in the same statement, the rows of
that first table are still the ones being preserved.
Aliases help make the SQL query easier to read and help the code look
cleaner. As an additional tip, once an alias is defined, you only have to type
the alias rather than repeating the full table name. This saves time and helps
prevent errors when a table name is long or complex. Here's the same example as we
used in RIGHT JOIN above:
SELECT a.name, b.title
FROM authors AS a
RIGHT JOIN books AS b
ON a.id = b.author_id;
Summary
Great job on completing this chapter! This chapter was dedicated to
deepening your understanding of SQL by introducing the concept of joining
tables. Joining tables in SQL is crucial for fetching data spread across
different tables and stitching them together logically. This way, we can
perform a more complex and insightful analysis. We started this chapter by
emphasizing the importance and purpose of joins in SQL. From there, we
took a comprehensive dive into the various types of joins: INNER JOIN ,
LEFT JOIN , RIGHT JOIN , FULL JOIN , CROSS JOIN , and SELF JOIN , each
with its unique application in data analysis. In each section, we broke down
the structure and syntax of the join type, followed by real-world examples.
This approach illustrates how each join works and provides you with
ample practice to cement your understanding. We clarified how INNER JOIN
is used to fetch matching records from multiple tables, while LEFT JOIN ,
RIGHT JOIN , and FULL JOIN help retrieve both matched and unmatched records
from either or both tables. We also discussed the less commonly used
SELF JOIN and CROSS JOIN and demonstrated their usage in specific
scenarios. The chapter wrapped up with best practices for using joins to
ensure that you write efficient and clean SQL join queries. At this point, you
should feel comfortable using various SQL joins to combine data from
different tables effectively. It must have been tough! But remember,
mastering SQL joins is a significant step towards becoming a proficient data
analyst. In the upcoming chapters, we'll learn about creating business metrics
with aggregations, advanced SQL techniques, and much more.
4 Creating Business Metrics with
Aggregations
Join our book community on Discord
https://packt.link/EarlyAccessCommunity
Now that you know joins, you are ready for the next step in your journey to
become a data analyst. We have arrived at aggregations and their significance
in creating business metrics. With our current SQL knowledge, we are ready
to dig deeper into data and use this data to gain actionable insights. This
chapter will teach you about aggregations and arm you with the knowledge
and practical skills to use them effectively in SQL, enabling you to
extract valuable business metrics from large datasets. Here's the overview of
what we're going to cover:
Let's level up our SQL game and dive into SQL aggregations and their role
in shaping business metrics!
The COUNT function returns the number of rows (or non- NULL values in a
column) that match a condition. Its basic syntax is:
SELECT COUNT(column_name)
FROM table_name
WHERE condition;
Let's use an example from our library database to demonstrate how the
COUNT function is applied. Suppose we want to know how many books are
currently borrowed. We'd use the COUNT function on the borrowed books
table for this.
SELECT COUNT(*) AS "Currently Borrowed"
FROM borrowed_books
WHERE return_date IS NULL;
In Figure 4.1, we can see that the result is a single aggregate value. The result
is 10 because we set NULL for all 10 return_date records. The result
informs us that 10 books are currently borrowed.
Figure 4.1 - Results of the COUNT function to find the number of borrowed
books
Similarly, we can use COUNT to identify how many members are registered in
the library:
SELECT COUNT(*) AS "TOTAL NUMBER OF MEMBERS"
FROM members;
Figure 4.2 - Results of the COUNT function to find the total number of
members
With the COUNT function, we've added the ability to quantify data sets based
on specific conditions to our SQL toolbox. Now, let's transition to a function
that goes beyond counting – the SUM function.
The SUM function in SQL allows us to add all the values in a specific
column. Its syntax is similar to that of COUNT :
SELECT SUM(column_name)
FROM table_name
WHERE condition;
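To demonstrate SUM , we need some fines to add up. If your copy of the
borrowed_books table does not have a fine column yet, a statement along these
lines would add one (the DECIMAL data type here is an assumption; the column
name matches the queries that follow):
ALTER TABLE borrowed_books
ADD COLUMN fine_amount DECIMAL(10, 2);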
After that, here are the UPDATE statements to add a fine and a return date to a specific row:
UPDATE borrowed_books
SET fine_amount = 1000
WHERE id=2;
UPDATE borrowed_books
SET return_date= '2022-10-11'
WHERE id=2;
If we want to calculate the total amount of fines collected, we can use the
SUM function as follows:
SELECT SUM(fine_amount)
FROM borrowed_books
WHERE return_date IS NOT NULL;
Figure 4.3 - Results of the SUM function to find the total amount of fines
collected
This query sums the fine_amount for all records where return_date
is not NULL (meaning the book has been returned). Note that if a return date
is not NULL , but the fine_amount is NULL for that row, that row
simply won't affect the result. By utilizing the SUM function, we
can quickly perform calculations that would be time-consuming to do
manually. It might be obvious but let’s make sure this is clear: SUM works on
numerical data types only. Now, let's move on to a function that performs
calculations and provides insights about the data distribution. I’m talking
about the AVG function.
Suppose we want to calculate the average birth years of the authors in our
library to gain an estimate of their ages. We can use the AVG function as
follows:
SELECT AVG(birth_year)
FROM authors
WHERE birth_year IS NOT NULL;
Figure 4.4 - Results of the AVG function to find the average year of birth
This query calculates the average birth_year for all records where
birth_year is not NULL . It is worth observing that the average value is a
decimal. Using the AVG function, we can understand the central tendency of
our data, which can provide valuable insights in various scenarios. After
dealing with AVG , let's look at how we can identify the smallest and largest
values in our data set using the MIN and MAX functions.
Using MIN and MAX Functions in SQL
The MIN and MAX functions in SQL return the smallest and largest values
in a column, respectively. These functions are beneficial for identifying the
range of our data. Here is the basic syntax for the MIN and MAX functions:
SELECT MIN(column_name)
FROM table_name
WHERE condition;
SELECT MAX(column_name)
FROM table_name
WHERE condition;
Let's illustrate their use with our library database. We will use the
join_date column in the members table, showing each member's joining
date. If we want to find out the shortest and longest period the members have
been visiting our library, we can use the MIN and MAX functions as follows:
SELECT MIN(join_date) FROM members;
SELECT MAX(join_date) FROM members;
The first query will return the earliest join_date , which belongs to the
longest-standing member. The result is displayed in Figure 4.5.
Figure 4.6 - Results of the MAX function on the join_date column
The second query will return the latest join_date , which belongs to the most
recently joined member. The result is displayed in Figure 4.6. By using the
MIN and MAX functions, we can get a sense of the boundaries or extremities
in our data. This informs us about our dataset's overall spread of
values. Having covered aggregations over the entire table, you might wonder how
we can perform these calculations over groups of data rather than the whole
dataset. For instance, what if we wanted to know each member's total fines or
average loan duration? This is where the power of the GROUP BY clause in
SQL comes into play.
GROUP BY Clause
The GROUP BY clause is used in SQL to group rows with the same values
in specified columns into aggregated data, which can be helpful when we
want to perform computations on categorized data. Here's the basic syntax of
GROUP BY :
SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s);
When you group results by a specific column, it essentially means that the
output will be arranged so that rows with identical values in the specified
column will be clustered together. Let's use the library database again to
illustrate this. If we wanted to find out how many books each member has
borrowed, we could group the results by the id of the member in the
borrowed_books table:
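A sketch of such a query (the same query reappears in the best-practices
section at the end of this chapter) is:
SELECT member_id, COUNT(book_id) AS BooksBorrowed
FROM borrowed_books
GROUP BY member_id;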
Figure 4.9 - Results of the GROUP BY clause, grouped by the member id.
This SQL query will return a list of member ids along with the count of
books each member has borrowed (Figure 4.9). The GROUP BY clause is
something you’ll need a lot because it allows us to analyze our data at more
granular levels. It's like zooming in from a high-level overview to the details
of each category.
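Other aggregate functions combine with GROUP BY in exactly the same way. For
example, a query of this shape (the alias is arbitrary) gives each member's
total fines:
SELECT member_id, SUM(fine_amount) AS total_fines
FROM borrowed_books
GROUP BY member_id;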
This query will return each member's id and the total fines they've incurred,
as displayed in Figure 4.10. Using aggregate functions with GROUP BY allows
us to perform more specific and detailed analysis, which could reveal
meaningful patterns and insights within our data. While GROUP BY allows us
to segment our data for analysis, we may want to filter the results of our
groupings. SQL achieves this through the HAVING clause, which we will
discuss in the next section. Just as the WHERE clause is used to filter rows, the
HAVING clause is used to filter groups. This is especially useful when finding
groups that match certain conditions.
HAVING clause
The HAVING clause was added to SQL because the WHERE keyword could
not be used with aggregate functions. HAVING is typically used with the
GROUP BY clause to filter groups based on the result of an aggregate
operation. Here's the basic
syntax of HAVING :
SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition;
The key difference between HAVING and WHERE is that WHERE filters
individual rows before the aggregation operation while HAVING filters the
results after the GROUP BY operation. This is why the order of these clauses
also matters. In other words, WHERE applies conditions on individual rows
before they're even grouped, whereas HAVING filters groups after aggregate
functions form them. For example, using our library database, let's say we
want to find out which members have borrowed one or more books. We
could use the HAVING clause as follows:
SELECT member_id, COUNT(book_id) as "Number of Borrowed Books"
FROM borrowed_books
GROUP BY member_id
HAVING COUNT(book_id) >= 1;
Figure 4.11 - Results of the HAVING clause to find which members have
borrowed one or more books.
This query will return a list of members who have borrowed one or more
books, as displayed in Figure 4.11.
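We can filter on an aggregated fine total just as easily. A sketch of such a
query, assuming we filter on the summed fine_amount , is:
SELECT member_id, SUM(fine_amount) AS total_fines
FROM borrowed_books
GROUP BY member_id
HAVING SUM(fine_amount) > 500;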
This will list all members who have incurred fines totalling over $500. As you
can see, there is one match, as displayed in Figure 4.12. Here are a few general
rules when it comes to GROUP BY and HAVING with aggregations:
Any column that appears in the SELECT clause must be either part of the
GROUP BY clause or used in an aggregate function.
You must use the HAVING clause to apply conditions to the groups
themselves.
The GROUP BY clause usually comes after the WHERE clause and before
the ORDER BY clause.
Let’s look at some best practices for aggregations before we wrap up this
chapter.
The column that you choose to aggregate can significantly impact the results
of your analysis. Therefore, picking a column that makes sense in the context
of your question or task is crucial. In our library database, what should we do
if we want to find out how many books each member has borrowed?
Aggregating on the member_id column would make sense, as it provides a
count per member:
SELECT member_id, COUNT(book_id) as BooksBorrowed
FROM borrowed_books
GROUP BY member_id;
This way, we aggregate only the relevant data, reducing the computational
load.
Keep Readability in Mind
Clear, understandable SQL queries are essential for both your future self and
others who might interact with your code. This includes using clear aliases
for your aggregated columns and correctly formatting and indenting your
SQL queries. For instance, BooksBorrowed is a much more meaningful name
than COUNT(book_id) and would make it easier for others (and future you)
to understand your query:
SELECT
member_id,
COUNT(book_id) as BooksBorrowed
FROM borrowed_books
GROUP BY member_id;
Aggregations are complex. Keeping these best practices in mind will help the
readability and understandability of your analyses. And that’s it for this
chapter. Let’s go over what we’ve learnt.
Summary
In the chapter "Creating Business Metrics with Aggregations", we learned to
analyze data by using aggregations together with GROUP BY and HAVING . We
started the journey by emphasizing the importance of aggregations in analyzing
data, explaining how they provide valuable insights by summarizing large
volumes of data into meaningful metrics. Next, we took a deep dive into various
aggregate functions, including COUNT , SUM , AVG , MIN , and MAX . Each function
was introduced with its definition, syntax, and practical examples from our
library database. We discussed how COUNT can help determine the total number of
rows in a table or the number of non- NULL values in a given column, and how
SUM can be used to total numerical data. We explored the AVG function to
calculate average values and discussed how MIN and MAX can find the smallest
and largest values. We then introduced the GROUP BY clause, which allows you
to group results by a specific column and pair it with aggregate functions to
derive meaningful summaries from the data. We distinguished between the
roles of the HAVING and WHERE clauses, explaining that while WHERE filters
records before grouping and aggregation, HAVING filters after. We continued to
link theory with practice by providing use cases from our library database,
demonstrating how to combine GROUP BY , HAVING , and various aggregate
functions to answer complex questions and filter aggregated results
effectively. Lastly, we touched upon best practices when using aggregations.
We underscored the importance of selecting columns carefully when
performing aggregations and provided guidelines to improve query
performance and readability. The journey continues, so let's move on to even
more exciting SQL concepts!
5 Advanced SQL
Join our book community on Discord
https://packt.link/EarlyAccessCommunity
You have a solid SQL foundation, congrats! But that’s not where we’re going
to stop. We’re going to take your SQL expertise to the next level. You have
the basic building blocks you need to create queries. This chapter will add
even more building blocks for using SQL. This will enhance your data
analysis capabilities and help you with transforming data into actionable
insights. Here's what this chapter will cover:
As you can tell, this chapter is called Advanced SQL for a reason. But fear
not, we’ll take it slow. And by the end of this chapter, you will have a solid
understanding of these advanced SQL features and you’ll be able to utilize
them effectively in your daily data analysis tasks. So, let's get started with
subqueries to kick off this chapter!
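In its most basic form, a subquery is a SELECT statement nested inside another
SELECT statement. A minimal sketch (the table and column names are
placeholders) looks like this:
SELECT column_name
FROM table_name
WHERE column_name = (
    SELECT column_name
    FROM another_table
    WHERE condition
);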
The first SELECT is the main query, the second SELECT between the
parentheses is the subquery. Subqueries are always written inside a pair of
parentheses. You might use a subquery when the result of the entire query
depends on a value that is not known in advance and must first be computed
from the data. If a simple comparison can't express that, you may need a
subquery. You'll need to know how to navigate such queries, because there are
situations where you simply need them. However, subqueries come with a cost. They can be
slower than regular queries because they require multiple passes over the
data. The database has to execute the inner query before it can execute the
outer query. This can be quite slow if the inner query returns a large amount
of data. That's why we should not overuse subqueries and be mindful of their
cost. You should always look for ways to rewrite your queries to avoid
subqueries if possible. For example, in some cases, you can use a JOIN
instead of a subquery, which is faster. We'll see a more concrete example
soon, but let’s first discuss the different types of subqueries in more detail.
Types of subqueries
Depending on the data they return, subqueries can be divided into
four types:
Scalar subquery: This type of subquery returns a single value. It's used
where you need only one value. This can, for example, be the result of a
COUNT .
Row subquery: This type returns a single row of data. This is usually
helpful when inserting values in a table when you must fetch the data
from a temporary table.
Column subquery: It produces a single column of data.
Table subquery: This type of subquery returns an entire table. This is
used when you want to create a temporary result set for the duration of a
query; this kind of subquery appears in the FROM clause.
Before we see a real example, let’s make sure that you understand the
concept by comparing the SQL subquery to a real-life subquery.
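As a sketch of the kind of subquery discussed here (it matches the JOIN
version shown a little further down), the query could look like this:
SELECT name
FROM authors
WHERE id IN (
    SELECT author_id
    FROM books
    WHERE publication_year > 2019
);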
The following screenshot shows the results of this subquery. As you can see,
there are three authors who have published books after 2019.
Fig 5.1 - Results of the subquery which returns only the name of the authors
Note that the preceding query could also be run using an INNER JOIN . This
would be the query in that case:
SELECT a.name
FROM authors a
JOIN books b ON b.author_id = a.id
WHERE publication_year > 2019;
The subquery approach returns three rows while JOIN will return four rows
because Emeka Okafor has published twice after 2019. That’s why we get her
name twice, and that's something we don't want. This can be fixed by adding the
DISTINCT keyword before a.name ; in that case, the JOIN query will typically
perform better because it only has to make one pass through the data. Next,
let's see some differences between using a subquery and a JOIN and when to
choose which one.
You’re doing great! With these basics out of the way, let’s discuss more
complex questions we can answer with subqueries.
More advanced subquery usage on our library database
Alright, leveling up… What if you were to find the number of books each
author has written? Here’s a subquery in the SELECT clause using an
aggregate function to create a temporary column for the outer query's
runtime. (You may need to read this twice, sorry about that. Look at the
following example, that helps too!)
SELECT name,
(
SELECT COUNT(*)
FROM books
WHERE author_id = authors.id
) AS book_count
FROM authors;
The inner query uses the aggregate function COUNT()
to retrieve the count of books published by each author.
Because the inner query references authors.id , it is correlated with the
outer query: the outer query fetches all rows from the authors table first.
The inner query is then executed once for each of those rows to find the count
of books for that author, filtered by the id from the authors table.
As you can see, the results in the following screenshot depict all authors and
their book count:
Fig 5.2 – Result of the subquery with COUNT
Okay, don’t look at the following code snippet but see if you can come up
with the solution to the following task yourself. Here’s the task: find the book
titles borrowed by members who joined the library between 1st May 2021
and 31st December 2022. Try to write your own query for this problem and
then compare it to the solution here. Ready? Let's have a look.
SELECT id, title
FROM books
WHERE id IN (
SELECT book_id
FROM borrowed_books
WHERE return_date IS NULL
AND book_id IS NOT NULL
)
AND id IN (
SELECT book_id
FROM borrowed_books
WHERE member_id IN (
SELECT id
FROM members
WHERE join_date BETWEEN '2021-05-01' AND '2022-12-31'
)
);
It might look daunting, but when you see what’s going on, you’ll find it
simple. Before we explain the query, let’s see the result:
Fig 5.3 – Book titles borrowed by members who joined the library between
1st May 2021 and 31st December 2022
Okay, so how did we get there? Let’s break it down. There are two inner
queries. The first one selects all the ids of the books from the
borrowed_books table for which the corresponding return_date is NULL
and book_id is NOT NULL – books that are currently borrowed. The second
subquery selects all ids from the members table who joined between May
2021 and December 2022. And then the outer query uses the ids returned
from these subqueries as the condition to select the book titles and their ids
from the books table. At this point, you're ready to face an even more
complex subquery involving a JOIN that retrieves data from across three
tables. Suppose we want to find out the books borrowed by the member who
joined the library most recently. This task is not straightforward and requires
two pieces of information. We first need to determine the most recent join
date from the members table, and then we need to find the books borrowed
by the member who joined on that date. Next, we need to query the books
table for these books. Here's how to do it. Using a subquery, we are able to
accomplish this within a single SQL statement:
SELECT b.title
FROM books AS b
RIGHT JOIN borrowed_books AS bb ON b.id = bb.book_id
WHERE bb.member_id = (
SELECT id
FROM members
WHERE join_date = (
SELECT MAX(join_date)
FROM members
)
LIMIT 1
);
This example has two nested inner queries as we select all ids from the
members table based on their join_date . Note that we want to find the
members who joined the library most recently, so we can use the
MAX(join_date) function via a subquery within a subquery to find the
join_date with the highest value, indicating the member who joined most
recently. And finally, the main query retrieves the titles of the books
borrowed by this member. We need the LIMIT 1 at the end, because it's
possible that there are multiple members who joined on the same latest date.
(Not the case in our dataset though.)
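Moving on to Common Table Expressions (CTEs): a CTE is a named, temporary
result set defined with the WITH keyword. A sketch of the CTE discussed next
(the name of the country-of-birth column, birth_country , is an assumption
about the schema) could be:
WITH usa_authors AS (
    SELECT *
    FROM authors
    WHERE birth_country = 'USA'  -- the country column name is an assumption
)
SELECT a.name, b.title
FROM usa_authors AS a
JOIN books AS b ON b.author_id = a.id;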
With this CTE, we created a named subquery at the top and then called that
temporary data result set in the sub(sequent) query when we needed it. The
result of this query is shown in the following screenshot:
The CTE contains all rows of records from the authors table for any author
born in the USA. Later on, when we need to find book details for all authors
from the USA, we use the CTE to match the id with the books table and
return matching records. This next example will emphasize the usefulness of
CTEs in SQL. Imagine you want to find the member details of those who
have borrowed books by the author named Emeka Okafor. Take a
moment to think through your approach before you dive into the following query:
WITH Emeka_books AS (
SELECT id
FROM books
WHERE author_id = (
SELECT id
FROM authors
WHERE name = 'Emeka Okafor'
)
),
Emeka_borrowers AS (
SELECT member_id
FROM borrowed_books
WHERE book_id IN (SELECT id FROM Emeka_books)
)
SELECT *
FROM members
WHERE id IN (SELECT member_id FROM Emeka_borrowers);
There are two CTEs in this query. In a way, it might remind you of nested
subqueries. So, the first CTE finds out the ids of all books written by Emeka
Okafor. The resulting temporary data set is assigned to be called
Emeka_books . After that, another CTE is defined. This one is called
Emeka_borrowers and it finds the member ids of library members who have
borrowed books whose ids appear in the Emeka_books result. Note that we
call upon our first CTE at this point instead of rewriting the whole subquery
again. Finally, we select all the columns from the members table with ids
matched by the ids of members who have borrowed books written by Emeka
Okafor. The second CTE is used here to match the ids. Phew. The following
figure shows the results:
Fig 5.6 CTE query to find the members who borrowed books written by
Emeka Okafor
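The query behind Figure 5.7 follows the same pattern. A sketch of it (the CTE
name, the date range expression, and the member columns selected are
assumptions) is:
WITH april_borrows AS (
    SELECT member_id, COUNT(*) AS books_borrowed
    FROM borrowed_books
    WHERE borrow_date BETWEEN '2022-04-01' AND '2022-04-30'
    GROUP BY member_id
)
SELECT m.first_name, m.last_name, ab.books_borrowed
FROM members AS m
JOIN april_borrows AS ab ON m.id = ab.member_id;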
Fig 5.7 CTE query to find the number of books borrowed by each member in
April 2022
With (yup, pun intended) a CTE, we’ve made the query easier to read and
understand. We’ve separated the logic for counting the books borrowed in
April 2022 from the logic for retrieving the member details, resulting in a
cleaner query. Quite some advanced new topics, right? Hold on, there's still
more to come! Next, we will learn about window functions.
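One of the first things a window function lets you do is compute an aggregate
per member while keeping every individual row. A sketch of such a query, using
MIN and MAX as window functions over each member's borrowings, would be:
SELECT member_id,
       borrow_date,
       MIN(borrow_date) OVER (PARTITION BY member_id) AS first_borrow_date,
       MAX(borrow_date) OVER (PARTITION BY member_id) AS last_borrow_date
FROM borrowed_books;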
Please note that this query could return multiple rows for each
member_id , because it will return a row for each entry in
borrowed_books . If you'd like to see just one row for each
member, with their first and last borrow dates, you could modify
the query to use the GROUP BY clause:
SELECT member_id, MIN(borrow_date) AS first_borrow_date,
       MAX(borrow_date) AS last_borrow_date
FROM borrowed_books
GROUP BY member_id;
Okay, a new example! What if you want to find the ranks of the books based
on the number of times they have been borrowed? This is a question that the
window function specializes in, using its RANK function. We are going to
solve this using a window function and a CTE. (Let’s hope you still
remember CTEs. If not, nothing to be ashamed about. You’re learning so
much new stuff! Just glance at them once more, they’re waiting for your
return a few pages back.)
WITH book_borrow_counts AS (
SELECT book_id, COUNT(member_id) AS total_borrows
FROM borrowed_books
GROUP BY book_id
)
SELECT b.title, bbc.book_id, bbc.total_borrows,
RANK() OVER (
ORDER BY total_borrows DESC
) AS borrow_rank
FROM book_borrow_counts bbc
JOIN books b on b.id = bbc.book_id;
The preceding query has two parts: a CTE and a window function. In the
CTE (the WITH clause), we find the number of times each book has been
borrowed by a member and then use this information in the main query to
find the rank of books based on how many times they have been borrowed in
the window function. This is done by joining the book_borrow_counts CTE
with the books table on the id of the books. Your result should look
something like this:
Moving on, imagine that we want to find out the cumulative number of books
borrowed by each member in chronological order. For this, we can use the
COUNT function as a window function and the OVER clause to define our
window. Here's how we can do it:
SELECT
m.first_name,
m.last_name,
bb.borrow_date,
COUNT(*) OVER (
PARTITION BY bb.member_id
ORDER BY bb.borrow_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_books_borrowed
FROM
members AS m
JOIN borrowed_books AS bb ON m.id = bb.member_id
ORDER BY
bb.member_id,
bb.borrow_date;
In this query, the COUNT(*) OVER (...) construct is our window function.
The PARTITION BY bb.member_id clause means that we're creating a
separate window for each member. The ORDER BY bb.borrow_date clause
means that the rows are ordered by the borrow date within each window.
Finally, the
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW clause means that
for each row, the window includes all preceding rows plus the current row. The output
should look like this:
Fig 5.10: Window Function query to find the cumulative number of books
borrowed by each member in chronological order
As you can see, this query returns a list of all books borrowed, along with the
members who borrowed them. Each row also shows the cumulative number of books
that the member had borrowed up to and including the borrow date of that
row. Phew, that was quite the topic. We're not done yet! Next, we're going to
talk about date and time manipulation in SQL. We scratched the surface of
this in Chapter 1 already. It's a critical skill for any data analyst, so let's
dive a bit deeper. It is also a little easier to digest; your brain might
appreciate that after all the hard work it has done in this chapter so far!
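To get the current date and time in MySQL, the first of the two queries
compared below would be:
SELECT NOW();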
or
SELECT CURDATE();
The result of the first one will show you the current date with a timestamp,
while the second query will only show the current date without the time.
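Another workhorse is the DATEDIFF() function. The query discussed next could
be sketched as follows (the exact columns selected are assumptions based on
the description that follows):
SELECT b.title,
       CONCAT(m.first_name, ' ', m.last_name) AS member_name,
       bb.due_date,
       bb.return_date,
       DATEDIFF(bb.return_date, bb.due_date) AS days_late
FROM borrowed_books AS bb
JOIN books AS b ON b.id = bb.book_id
JOIN members AS m ON m.id = bb.member_id
WHERE bb.return_date IS NOT NULL;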
This query will return a list of all books that have been returned, along with
the member who borrowed them, the due date, the return date, and the
number of days late. Cool, right? Let’s first look at the output we get before
we explain it in more detail.
Fig 5.11 DATEDIFF() Function to find the number of days late on book
returns.
The DATEDIFF function takes two inputs, the later date and the earlier date,
and subtracts them to return the difference between the two in days. We also
link the books table with the members and borrowed_books tables using
JOIN twice and filter the results based on the condition that return_date
is NOT NULL .
Let's look at one more example. This time, first see if you can come up with
the solution before sneaking a peek at the one below. Here's the exciting
question: can you create a table in the query result to find each book's year,
month, and day from the borrow_date ? Once you've tried that out yourself,
compare it with the solution below.
SELECT book_id,
YEAR(borrow_date) AS borrow_year,
MONTH(borrow_date) AS borrow_month,
DAY(borrow_date) AS borrow_day
FROM borrowed_books
WHERE borrow_date IS NOT NULL;
Let’s do one more example. What if we want to determine how many books
are currently borrowed and overdue? We can do this by comparing the due
date with the current date and finding out which books have a NULL in
return_date (not yet returned) and the due date of returning has passed:
SELECT
COUNT(*) AS Books_Borrowed_Currently
FROM
borrowed_books
WHERE
due_date < CURDATE() AND return_date IS NULL;
So, this query will count all the books in the borrowed_books table that are
currently overdue and have not yet been returned. The
Books_Borrowed_Currently in the result will tell you the total number of
such books. You can see the result in the following screenshot:
Fig 5.13 - Find the number of books currently borrowed and overdue.
Here’s another challenging yet fun task for you. Instead of finding only the
year, month, and day of the books borrowed, let's find the day of the week, the
day of the month, the day of the year, and the day and month names for the books borrowed. You
might have to explore the functions required for this problem.
SELECT book_id,
DAYOFWEEK(borrow_date) AS DayOfWeek,
DAYOFMONTH(borrow_date) AS DayOfMonth,
DAYNAME(borrow_date) AS DayName,
MONTHNAME(borrow_date) AS MonthName,
DAYOFYEAR(borrow_date) AS DayOfYear
FROM borrowed_books
WHERE borrow_date IS NOT NULL;
This is doing something very similar to the YEAR , MONTH , and DAY functions.
You can see the results here:
Fig 5.14 Results of finding each book's borrowing days and day names.
This is probably enough for NOW(). (Yup, definitely for me.) Date and time
functions are something we typically use quite a bit in our data analysis tasks.
And we’ve made our way through! We’re ready to move to the next section,
where we will look at text manipulation functions.
Text functions
So, SQL offers numerous text functions. The ones we will use in this section
include CONCAT() , UPPER() , LOWER() , SUBSTRING() , REPLACE() , and COALESCE() .
Next, let’s see how to replace a specific text portion in strings with something
else you want. Here’s the query to do that:
SELECT title,
REPLACE(publisher, 'Publishing', 'Analytics') AS Analytics_pub
FROM books;
The output of this query can be seen in Figure 5.16. It’s a table with two
columns: title (which simply contains the title of each book as listed in the
books table) and Analytics_pub (which contains the name of the publisher
for each book, but with every instance of the word 'Publishing' replaced with
'Analytics').
Let’s take another example to clarify the other text functions we listed.
Suppose we want to know how many books each member has borrowed.
Moreover, we want to return the respective member name with their
borrowed book count, but the member names in the members table are
recorded in two different columns. And let’s say we’re not sure about the
casing of these columns (although in our example, it’s actually nicely done
already). We want to make sure they’re all title case, which means that the
first letters of the words are in uppercase. We can achieve this by using the
CONCAT() , UPPER() , LOWER() , and SUBSTRING() functions:
WITH member_names AS (
    SELECT id,
        CONCAT(UPPER(SUBSTRING(first_name, 1, 1)), LOWER(SUBSTRING(first_name, 2))) AS FirstName,
        CONCAT(UPPER(SUBSTRING(last_name, 1, 1)), LOWER(SUBSTRING(last_name, 2))) AS LastName
    FROM members
)
SELECT
    CONCAT(FirstName, ' ', LastName) AS FullName,
    COALESCE(COUNT(b.member_id), 0) AS num_books
FROM member_names AS m
LEFT JOIN borrowed_books AS b ON m.id = b.member_id
GROUP BY m.id;
This query will return a list of all members, with their names in title case,
along with the number of books each has borrowed:
Fig 5.17: Results of the query to show the full member names in title case and
the number of books they have borrowed.
We combined a CTE and LEFT JOIN to solve this problem. The CTE works
to create a temporary data set that stores three columns from the members
table: id, title case first name, and title case last name. Also, we used the
SUBSTRING() , UPPER() , LOWER() and CONCAT() functions to create the
columns to store title case first and last names.
CONCAT(UPPER(SUBSTRING(first_name, 1, 1)), LOWER(SUBSTRING(first_name, 2)))
is quite the expression. But don’t be intimidated, let’s break it down into
smaller chunks. This expression takes the first_name field, extracts the first
character ( SUBSTRING(first_name, 1, 1) ) and converts it to uppercase
( UPPER ). It then extracts the remainder of the string starting from the second
character ( SUBSTRING(first_name, 2) ) and converts it to lowercase
( LOWER ). These two parts are then concatenated together using CONCAT ,
resulting in a first_name that is in title case. This is assigned the alias
FirstName . We repeat the same process for the last name. After that, we call
this CTE in the query and find the count of books borrowed by each member
from the borrowed_books table with a LEFT JOIN to include the NULL s in
the member id. The COALESCE() function replaces a NULL value with 0 (strictly
speaking, COUNT() already returns 0 for members with no matching rows, so
COALESCE() acts as a safety net here), hence the 0 after "Ingrid Pettersson".
That's enough on text functions.
You’re almost there! In the next section, we will wrap up our discussion on
advanced SQL by sharing some best practices.
Summary
In this chapter, we tackled the more complex aspects of SQL. You should
now be equipped with the ability to complete difficult data analysis tasks.
The key lies in mastering advanced SQL techniques such as subqueries,
CTEs, window functions, and date-time and text manipulation techniques. We
started this advanced journey by exploring subqueries, a great tool allowing
you to nest queries within queries. They contribute to solving tough data
problems in manageable parts. Next, we introduced CTEs to simplify
complex queries by breaking them into smaller, more digestible sections.
This makes the code more readable which then leads to better maintainable
code. We then delved into the concept of window functions which enables
you to perform calculations over sets of rows related to the current row. This
advanced feature has opened new horizons for data analysis, allowing for
complex calculations like running totals and rankings. After this, we ventured
into the realm of date-time manipulation. We provided you with the tools and
techniques to manage and manipulate date and time data effectively; these
will help you get valuable time-based insights. And in the final section, we
explored text manipulation. The text functions are a must-have for dealing
with string data. We covered various SQL functions to process and
manipulate textual data. To bring these concepts to life, we used practical
examples from our library database. And now we’re ready for the next step:
test all this SQL knowledge in practice with a case study! Spoiler that might
make you happy: after quite some chapters, we’re going to let go of the
library example!
6 SQL for Data Analysis Case Study
Join our book community on Discord
https://packt.link/EarlyAccessCommunity
At this point, you’re finally prepared with the necessary tools to deal with a
more complex project from scratch. And that’s exactly what we’re going to
be doing in this chapter. In a nutshell, we’ll:
Exploring the data: Make sure that you are familiar with the data in the
database. Understand what kind of tables there are, what sort of data
there is and how the tables are related.
Know the dataset: Find out the essential words you must focus on
because this will help you know what you need to fetch or where from.
Map to their respective table(s): For instance, customer details are in
the customer , orders and shopping_information tables; suppose
you need to find the largest purchase, then you figure out that there is a
column called total_cost which is in the orders table. Those are some
great insights to get started with!
Determine and choose a primary table to fetch data: Here, you need
to ensure the table you choose has the most relevant information, and
you can join other tables to it later on for secondary information.
Determine if a JOIN might be necessary: Sometimes, you need to gain
some other insightful information for this question and other times, you
only need a single table.
Identify whether you need to use aggregate functions, subqueries,
CTEs or other complex elements of SQL queries: According to what
information is required, you might need to use more complex features if
the required data is derivative or not directly present in any table.
We are going to make the distinction between exploring the data and
analyzing the data. Let’s first explore our dataset.
Now that we know this, let's see if we can help Umar answer some of his
questions by writing a few queries.
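A first exploratory query (a sketch, using the product_category table that
appears later in this chapter) is enough to produce the list in Figure 6.1:
SELECT *
FROM product_category;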
Figure 6.1: The list of categories in the store
As you can see in the preceding screenshot, there are 31 product categories
offered by the e-store. All (in this case, only two) columns are shown because
of the asterisk (*) after the SELECT keyword. It’s important to remember that
this is still a new business of his and that it might not look like a lot of data
because of that. Now that we have explored the data roughly, it’s time to start
answering Umar’s questions and take the step towards analyzing the data.
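The first question concerns the Clothing category. A sketch of the query
described below (the column names follow those used later in this chapter) is:
SELECT p.product_id, p.product_name, p.product_price, p.discounted_price
FROM product AS p
JOIN product_category AS c ON p.category_id_fk = c.category_id
WHERE c.category_name = 'Clothing';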
The preceding query will list the details of each product in the Clothing
category, as you can see in the following screenshot:
Figure 6.2: Product listings under the “Clothing” category
The query joins the product and product_category tables to return all the
products in the Clothing category along with their product_price and
discounted_price to present a fuller picture. The result in Figure 6.2 shows
that there are only 13 clothing products in the store. This means Umar is
short-supplied if he wants to promote these products more.
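Next, Umar wants to know how many customers he has. One way to sketch that
count (the customer_id_fk column name is an assumption, following the _fk
naming used for other foreign keys in the orders table):
SELECT COUNT(DISTINCT customer_id_fk) AS unique_customers
FROM orders;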
As you can see, he has 114 unique customers. Let’s see how they typically
pay.
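A grouped count over the orders table answers that; here is a sketch (the
payment_method_id_fk column name is an assumption):
SELECT payment_method_id_fk, COUNT(*) AS times_used
FROM orders
GROUP BY payment_method_id_fk
ORDER BY times_used DESC;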
The following screenshot reflects the query output and shows that the
payment method with id 3 is the most popular among the store’s customers.
Meanwhile, payment methods with ids 1, 2, and 6 come in second, followed by
the payment method with id 10 as the third most popular.
Figure 6.4: The most popular payment methods
Figure 6.5: Enhanced top payment methods query with the payment method
name
This gives us sufficient insights into the payment methods. Let’s help Umar
with his questions regarding customer feedback.
Figure 6.6: Brief customer feedback query results showing average ratings
for top products.
However, the output in the preceding screenshot might not satisfy what Umar
Ali seeks when asking for customer feedback. It gives a general overview,
but it does not present the actual feedback. We can improve the query and
fetch more relevant information to help make the results more meaningful.
Here’s the updated query:
SELECT product_id_fk, AVG(rating) AS average_rating,
       GROUP_CONCAT(review SEPARATOR '|| ') AS reviews
FROM product_reviews
GROUP BY product_id_fk
HAVING AVG(rating) > 4
ORDER BY average_rating DESC;
The result of the preceding query, shown in the following screenshot, is more
granular and shows the individual reviews. It treats each unique product id as
a separate group, concatenates that product's reviews, and filters out records
with an average rating of 4 or lower.
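The next step is to link those top-rated products to their order volumes. A
sketch of the query discussed below (the ids in the IN clause are placeholders
for the ids returned by the previous ratings query):
SELECT p.product_id, p.product_name, SUM(o.quantity) AS total_sold
FROM product AS p
JOIN orders AS o ON p.product_id = o.product_id_fk
WHERE p.product_id IN (1, 2, 3)  -- placeholder ids from the previous query
GROUP BY p.product_id, p.product_name;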
The preceding query uses a JOIN to connect the product and orders table to
obtain the product_id , product_name and SUM of the number of orders
from the orders table. Additionally, it filters the results using the WHERE and
IN clauses to check for specific product ids to return (only if they exist). The
product ids retrieved here were the same as our last query when we
determined the highest average ratings.
The preceding screenshot should tell Umar more about the relation between
the number of products sold and their rating. The role of the rating is not
super clear looking at this table, but let’s see how many products even have
reviews before we draw too many conclusions.
These results are not great for an e-commerce store because reviews are
essential for building trust with the customer base as a relatively new store.
Umar is a little shocked and sees a great opportunity for improvement.
Effectiveness of discounts
On a similar note, the next step in the sales analysis of this store is
determining whether the discounted products have done well with respect to
total sales per product or not. Doing so would help Umar conclude if it’s wise
to keep the current approach or try a new tactic. Let’s help him answer that
question with the following query:
SELECT p.product_id, p.product_name,
SUM(o.quantity) as total_sold
FROM product p
JOIN orders o ON p.product_id = o.product_id_fk
WHERE p.is_on_discount = TRUE
GROUP BY p.product_id, p.product_name
ORDER BY total_sold DESC;
Figure 6.11: Total sales for discounted products, grouped by product id and
name
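The query can be extended to include revenue. A sketch of the enhanced version
described below (revenue is taken as quantity multiplied by the discounted
price) is:
SELECT p.product_id, p.product_name, p.discounted_price,
       SUM(o.quantity) AS total_sold,
       SUM(o.quantity * p.discounted_price) AS total_revenue
FROM product AS p
JOIN orders AS o ON p.product_id = o.product_id_fk
WHERE p.is_on_discount = TRUE
GROUP BY p.product_id, p.product_name, p.discounted_price
ORDER BY total_sold DESC, total_revenue DESC;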
The preceding query adds a few things: another SUM column which returns
the total revenue by multiplying the quantity and discounted prices at which
the discounted products are sold, uses the GROUP BY clause on the discounted
price, too, and adds the total revenue as one of the factors for the ORDER BY
clause. The revenue shown in the following screenshot is not proportional to
the number of items sold, because each product is priced differently.
Figure 6.12: Query result of total sales and revenue generated from the
discounted products
So, despite not selling even half the number of items as the Razer Huntsman
Gaming Keyboard, the Nikon D850 DSLR Camera still generates many times
more revenue, because the price of one Nikon D850 DSLR Camera is more
than 20x the price of the Razer Huntsman Gaming Keyboard. It would be
interesting to compare these results with the non-discounted products and see
how that reflects on the sales revenue generated. The query for that is the
same except for one small change. The Boolean value will be revised to
FALSE for the p.is_on_discount column:
Figure 6.13: Query result of total sales and revenue generated from the non-
discounted products
Looking closer, the number of items sold and the total revenue generated
seem higher for non-discounted products. Now, this doesn't paint a complete
picture or even imply a correlation, because there are other factors to
consider, such as the total number of discounted and non-discounted
products, but it’s an indication that Umar should dive into this.
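Another of Umar's questions concerns his top-spending customers. A sketch of
the query discussed below (the customer table's column names are assumptions,
modelled on the foreign-key naming used elsewhere in the schema):
SELECT o.customer_id_fk,
       CONCAT(c.first_name, ' ', c.last_name) AS customer_name,
       SUM(o.total_cost) AS total_spent
FROM orders AS o
JOIN customer AS c ON c.customer_id = o.customer_id_fk
GROUP BY o.customer_id_fk, c.first_name, c.last_name
HAVING SUM(o.total_cost) > 5000
ORDER BY total_spent DESC;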
In this query, a JOIN must fetch each customer's name corresponding to the
id, which we can retrieve from the orders table. But to present it in a way that
makes more sense, we use a CONCAT function to return the full name of each
customer alongside their respective customer_id values. The SUM of the
total_cost column returns the amount spent by each customer on
transactions in the store. The results in the following screenshot are ordered
based on the total amount (in descending order) spent on purchases, grouped
by their ids, and filtered using the HAVING clause so only customers who
have spent more than 5000 on purchases are returned in the query results.
Top-selling clothing products
Do you remember when we found the product line for the Clothing
category? We are now ready to dive deeper into it and expand on that to show
more meaningful information. We’re going to combine the information
obtained from the Clothing category, with the orders table and the
product table:
The orders table is used in the preceding query because we need to find the
sum of sales for each respective product. It’s striking that none of the
products returned by this query are discounted. Perhaps Umar can continue to
research why that is. In the meantime, he asked us to inspect the top-
performing products. Let’s have a look!
This query also uses two JOINS to link three tables, as it’s important to fetch
product, category and sales information to return the columns shown in the
following screenshot:
Figure 6.16: Top 5 best-selling products and their details, including total
revenue and sales
The preceding query selects the product names, category names, and ids, the
sum of the total_cost column, and the quantity column to give revenue
and the number of orders, respectively. These results are grouped by each
unique combination of product name and id, and ordered in descending order
with respect to both total revenue and total sales. The results are limited to 5
rows because Umar only wants information for the top 5 selling
products.You could enhance the result of the last query by adding
information about the categories to understand which of the categories these
top products belong to. In terms of business context, this serves as a key point
to better highlight these categories or take the alternative and focus on low-
selling categories and products instead to improve their performance. The
updated query will then look like:
SELECT p.product_name, p.product_id, c.category_id, c.category_name,
       SUM(o.total_cost) AS Total_Revenue, SUM(o.quantity) AS Total_Sales
FROM product p
JOIN product_category c ON p.category_id_fk = c.category_id
JOIN orders o ON p.product_id = o.product_id_fk
GROUP BY p.product_name, p.product_id, c.category_id, c.category_name
ORDER BY Total_Revenue DESC, Total_Sales DESC
LIMIT 5;
The preceding query expands on the previous one and also selects the
category_id and category_name , so the sales figures are returned together
with each product's category:
Figure 6.17: Top 5 best-selling products with their categories and number of
items sold
Next let’s tackle a slightly more complex problem related to the most popular
products.
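The query being broken down below could be sketched as follows (the customer
table's country column and the join keys are assumptions about the schema; the
quantity column from the earlier queries is summed and aliased total_quantity):
WITH country_product_sales AS (
    SELECT c.country, p.product_name, SUM(o.quantity) AS total_quantity
    FROM orders AS o
    JOIN customer AS c ON c.customer_id = o.customer_id_fk
    JOIN product AS p ON p.product_id = o.product_id_fk
    GROUP BY c.country, p.product_name
)
SELECT country, product_name, total_quantity
FROM country_product_sales AS cps
WHERE total_quantity = (
    SELECT MAX(total_quantity)
    FROM country_product_sales
    WHERE country = cps.country
)
ORDER BY country ASC;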
Let’s break down this complex query into parts. The CTE ( WITH statement)
creates a temporary table to hold the records for the country names, product
names and their total sales using the SUM function on the total_quantity
column. These results are grouped by both country and product names. Then,
in the SELECT query, the CTE table (created above) is used to fetch the data.
Additionally, a subquery selects the maximum number of items sold for each
country and product, and these results are ordered in ascending order for the
country names. It’s a great sign for Umar to see such a diverse “heat map” of
customers for the international store:
Figure 6.18: Best-selling product for each country the products have been
sold in
This business model will not sustain revenue, as it will lead to many
disgruntled customers if they order something urgent and it arrives, on
average, 2 weeks later. As a result, they will be less willing to make more
purchases from the store. Naturally, it begs the question: has this affected the
sales revenue generated by the store in the last year then? Not only does it
help ascertain the effect of the late deliveries, but it also helps analyze the
bigger picture of store performance in the last year. Here is the query to find
out:
SELECT YEAR(order_date) as year, MONTH(order_date) as month, SUM
(total_cost) as monthly_sales
FROM orders
WHERE order_date >= DATE_SUB(CURRENT_DATE, INTERVAL 1 YEAR)
GROUP BY year, month
ORDER BY year DESC, month DESC;
The result of the preceding query returns more than 100 rows. It highlights
the fact that ECommerceHub has lost many of its customers, and only a few
remain. Why could this be? We have already figured out one possible reason,
long delivery delay times. Considering all this information, can you identify
future sales projections in the next 6 months? Let’s see how this can be done.
The inner query in the SELECT clause executes first, given that it’s a
subquery. The subquery finds the sum of all rows to find the total cost where
the order_date is in the last 6 months, as displayed in the following
screenshot:
Summary
In this chapter, we helped Umar with performing data analysis on his
ECommerceHub store data. We used the SQL skills we newly acquired
in the last few chapters to solve a hypothetical real-world example
of data analysis with MySQL. We started by importing the data into our
database, and after that, we explored the different tables and their structure.
Next, Umar took us by the hand, asking for insights into his ECommerceHub
dataset. Now that you have worked your way through this chapter, you should
have an idea of how to use your SQL skills to perform data analysis on a
project that resembles real-world examples. You could come across a case
like this in your career. By now, you will have gained some confidence in
solving real-world data analysis projects in SQL. Let's round up this pillar of
data analysis before moving on to the Data Cleaning & Exploratory Data
Analysis pillar.
7 Fundamental Statistical Concepts
Join our book community on Discord
https://packt.link/EarlyAccessCommunity
Descriptive statistics
Descriptive statistics provide a summary of historical data or what happened
in a business process. The ability to uncover accurate insights comes from
knowing how to properly measure and interpret your data using statistics.
Metrics, KPIs, and OKRs are also born out of your descriptive
analysis. In this section, we will cover the levels of measurement and the
measures of central tendency and variability.
Levels of measurement
Data can be categorized as qualitative or quantitative. Qualitative data
refers to information that is descriptive in nature, capturing qualities. It
cannot be directly measured and normally takes the form of categories or
attributes. Examples include:
The average is calculated as the sum of all values divided by the number of
observations in the dataset. While the average is a commonly used statistic, it
is highly sensitive to outliers (abnormally high or low values in the dataset).
Outliers can result from errors or natural events. If your dataset has
a lot of outliers influencing the average, it is best to report the median
(discussed next). If you determine that the outliers are the result of errors, you
can remove them from your calculation and then report the average.
Advantages of the Average
Data analyst’s use case for the average: A data analyst can use the mean to
calculate the average sales revenue per month for a company. By analyzing
the average revenue, the analyst can identify trends, patterns, and seasonality
in sales, allowing them to make informed decisions regarding inventory
management, marketing strategies, and financial forecasting. Additionally,
the average revenue can be compared to industry benchmarks or historical
data to evaluate the company's performance relative to its competitors or past
performance.
Median
Does not take all the collected data into account in the calculation, and
only uses a small portion of the dataset
Data analyst’s use case for the median: A data analyst is examining the
salaries of employees in a large organization. Instead of using the average,
which can be heavily influenced by outliers or extreme values, the analyst
calculates the median salary. This measure provides a better representation of
the central tendency of the salary distribution, making it useful for
understanding the typical earning potential within the organization.
Mode
The mode is the value that occurs most often in a dataset. This measure
is often used for summarizing categorical data.
Advantages of the mode
No calculation involved.
Not influenced by outliers
Data analyst’s use case for the mode: For an e-commerce platform, the mode
can be used to identify the most frequently purchased product category. This
information can be leveraged to optimize inventory management, marketing
strategies, and product recommendations to enhance customer satisfaction
and maximize sales. By identifying the mode, the data analyst can understand
customer preferences and tailor the platform's offerings accordingly. This
insight can drive strategic decisions, such as prioritizing stock availability
and promoting popular product categories to increase customer engagement
and drive revenue. The following are less common variations of the
previous measures:
Weighted mean
Reflects importance: All values in a business process may not carry the
same importance or priority. The weighted mean allows a data analyst to
emphasize the business significance of certain observations.
Imbalanced data: The weighted mean accounts for imbalances in your
data, as one can give more weight to underrepresented groups to
decrease bias.
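As a quick worked example (the numbers are invented for illustration): if an
exam has two sections weighted 30% and 70%, and a student scores 80 and 90 on
them, the weighted mean is
\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i} = \frac{0.3 \times 80 + 0.7 \times 90}{0.3 + 0.7} = 87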
Data Analyst’s use case for the weighted mean: A data analyst in the
education sector can use the weighted mean to calculate the average test
scores of students, considering the weightage assigned to different sections of
the exam. This measure allows for a more accurate representation of student
performance by accounting for variations in the importance of different test
components. For example, if certain sections of the exam carry higher
weights, the weighted mean provides a more comprehensive understanding of
students' overall mastery of the subject. By analyzing the weighted mean,
educators can identify areas of strength and weakness, design targeted
interventions, and provide personalized feedback to improve student learning
outcomes.
Trimmed mean
Robust to outliers: Because the extreme values are cut off from the
calculation, this measure is not sensitive to the presence of outliers. If
outliers are determined to be errors or nonrepresentative of the business
process, this measure would be preferred.
Loss of data: Deleting collected data must be done with caution and is
not always a preferred action. Important information may be lost.
Challenges in interpretability: Additional explanation of how the
calculation was performed may be needed to avoid confusion or
misrepresentation of the data.
Subjectivity: There is no hard rule on the percentage to remove in the
calculation. This can lead to bias and should be done with careful
consideration of ethics and relation to the business process.
Data analyst’s use case for the trimmed mean: In sentiment analysis of
customer reviews, the trimmed mean can be used to analyze ratings by
removing extreme outliers. This measure provides a robust estimate of the
overall sentiment expressed in the reviews while mitigating the impact of
outliers that may skew the results. By trimming the extreme values, the data
analyst can focus on the general sentiment of customers and identify trends
and patterns in their feedback. This helps businesses gain valuable insights
into customer satisfaction, product improvements, and areas that require
attention, enabling them to enhance their offerings and make data-driven
decisions based on reliable sentiment analysis.
Measures of variability
Measures of variability describe the spread, dispersion, or variability of the
data points in a dataset. These measures summarize how the data points are
distributed around the previously mentioned measures of central tendency
(mean, median, mode, and so on). They allow data analysts to assess the reliability
of results, identify outliers to investigate, and manage uncertainty. The
following are common measures of variability that data analysts should
know:
Range: This measure is the difference between the largest and smallest
numbers in a dataset. Data analysts can use this number to get a quick
understanding of the spread of the data. A use case would include
quality control to assess the variability of product measurements to
ensure consistency.
Data analyst’s use case for range: A data analyst working in quality control
for a manufacturing company can use the range to assess the variability in
product dimensions. By calculating the difference between the maximum and
minimum values, the analyst can determine the acceptable range within
which products must fall to meet quality standards. This also helps identify
any deviations from specifications and assists in identifying potential issues
in the manufacturing process or material quality. By monitoring the range
over time, the analyst can track process improvements and ensure consistency
in product quality.
Interquartile Range (IQR): The range between the first and third quartiles, also known as the 25th and 75th percentiles. It is best visually represented in a box plot, as pictured below.
Figure 3. A box plot labeled with each quartile, with outliers displayed at the ends.
Data analyst’s use case for IQR: A data analyst in a retail company is
analyzing sales performance across different product categories and stores.
To understand variability and identify outliers, the analyst calculates the IQR
for each category. The IQR provides a measure of data spread, helping
identify categories with consistent sales and those with higher variability. It
also assists in detecting potential outliers, enabling further investigation and
targeted interventions for improved performance and decision-making in
inventory management, pricing, promotions, and product assortment.
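A quick Python sketch of both the range and the IQR, using the standard library's statistics module and made-up monthly sales figures:
from statistics import quantiles

monthly_sales = [120, 135, 150, 160, 180, 210, 400]   # made-up values with one outlier
value_range = max(monthly_sales) - min(monthly_sales)
q1, q2, q3 = quantiles(monthly_sales, n=4)             # quartile estimates
iqr = q3 - q1
print(value_range, iqr)
Note that different tools estimate quartiles slightly differently, so the exact IQR can vary a little from one implementation to another.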
Variance: Measures how far data points are from the mean. It is
calculated by averaging the squared differences from the mean.
Data Analyst’s use case for variance: In financial portfolio analysis, the
variance can be used to measure the volatility of different investments. By
quantifying the dispersion of returns around the mean, the analyst can
compare and assess the risk associated with various investment options. A
higher variance indicates higher volatility and potential fluctuations in
investment returns, while a lower variance implies greater stability. Portfolio
managers and investors can utilize variance to optimize their investment
strategies, diversify their portfolios, and balance risk and return based on
their risk tolerance and investment objectives.
Standard deviation: The square root of the variance. Because it is expressed in the same units as the data, it is often easier to interpret than the variance.
Data analyst’s use case for standard deviation: In market research, the
standard deviation can be used to measure the variability in customer ratings
for a product or service. This helps identify the consistency of customer
perceptions, enabling businesses to focus on areas that need improvement. A
higher standard deviation indicates greater variability in ratings, suggesting
that customer opinions are more dispersed. With this analysis, a business can
pinpoint specific features or aspects of their offerings that contribute to
customer satisfaction or dissatisfaction. This information helps prioritize
product enhancements, customer service improvements, and targeted
marketing campaigns to address customer needs and preferences effectively.
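A minimal Python sketch of the variance and the standard deviation, assuming a handful of made-up daily returns:
from statistics import pvariance, pstdev

daily_returns = [0.5, -1.2, 0.3, 2.1, -0.7, 1.4]   # made-up percentage returns
print(pvariance(daily_returns))   # average squared distance from the mean
print(pstdev(daily_returns))      # square root of the variance, in the same units as the data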
Skewness: A measure of the asymmetry of a distribution, indicating whether the data are concentrated toward lower or higher values relative to the mean.
Data Analyst’s use case for skewness: A data analyst is studying the
distribution of customer ages in a retail database. By calculating the skewness
of the age distribution, the analyst can identify if the data is skewed towards
younger or older customers. This measure provides insights into customer
demographics, enabling targeted marketing campaigns or the development of
age-specific products.
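For illustration, here is one common way to estimate skewness in Python (the moment-based coefficient), using made-up customer ages:
from statistics import fmean, pstdev

ages = [22, 24, 25, 26, 27, 29, 31, 34, 45, 58]   # made-up ages with a long right tail
m, s = fmean(ages), pstdev(ages)
skew = fmean(((x - m) / s) ** 3 for x in ages)
print(round(skew, 2))   # a positive value: most customers are young, with a tail of older ones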
Outliers: These are extreme values that are very different from the
majority of the dataset. There is always a story behind an outlier. They
can arise due to various reasons including measurement errors, data
entry errors, or part of the natural business process. Because there is
variation in every process, the presence of outliers can be expected.
Data analyst’s use case for outliers: A data analyst is working for a
transportation company and is analyzing the fuel efficiency of a fleet of
vehicles. As part of the analysis, the analyst wants to identify any outliers in
the fuel efficiency data that may indicate unusual or erroneous measurements.
By examining the dataset and applying outlier detection techniques, the
analyst can pinpoint vehicles that have exceptionally high or low fuel
efficiency compared to most of the fleet. These outliers may represent
vehicles with mechanical issues, measurement errors, or other factors
influencing their fuel efficiency. And with that, we have explored the world of
descriptive statistics, where we learned how to summarize and analyze data
using various measures of central tendency and variation. Descriptive
statistics provided us with a comprehensive understanding of the
characteristics of a dataset. However, our analytics job doesn't stop there. We
will now explore the powers of inferential statistics to support more advanced
decision making.
Inferential statistics
Inferential statistics allow us to go beyond the specific dataset at hand and
generalize or make conclusions. In this section we will begin by introducing
the concepts of probability, then dive into sampling, estimation, and end with
the concept of correlation vs causation.
Probability theory
Probability refers to the likelihood of a particular event occurring. It provides
a framework for quantifying uncertainty and making informed decisions in
the face of randomness. Whether you're predicting customer behavior,
analyzing stock market trends, or evaluating the effectiveness of a marketing
campaign, understanding probability is essential for accurate and reliable data
analysis. To reiterate, probability deals with the likelihood of events occurring. By assigning numerical values to these likelihoods, we can express them quantitatively, enabling us to make probabilistic statements and predictions. We will explore basic concepts and the different kinds of distributions that allow us to support more advanced decision making. Data analysts can conduct experiments to draw more advanced and meaningful insights from data. An experiment is any controlled and repeatable process that generates outcomes. Each possible outcome of an experiment is represented as an event, and the collection of all possible outcomes is called the sample space. Experiments allow a data analyst to gather data, validate assumptions, and make inferences, such as evaluating different product features, assessing the effectiveness of different strategies, and exploring the strength of relationships between variables.
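As a tiny illustration of these terms, here is a Python sketch of a classic experiment, rolling a fair six-sided die, and the probability of one event:
# The experiment: rolling a fair six-sided die
sample_space = {1, 2, 3, 4, 5, 6}   # every possible outcome
event_even = {2, 4, 6}              # the event "an even number is rolled"
probability_even = len(event_even) / len(sample_space)
print(probability_even)             # 0.5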
Probability distributions
Building upon the concepts of sample spaces and experiments, probability
distributions enable us to study and quantify the likelihoods associated with
various events and outcomes. By examining the probability distribution of a
random variable, we gain insights into the likelihoods of different values it
can take on. There are several types of probability distributions; one of the most important for data analysts is the normal distribution. It follows the 68-95-99.7 rule, where 68% of the data lies within 1 standard deviation of the mean, 95% of the data lies within 2 standard deviations of the mean, and 99.7% of the data lies within 3 standard deviations of the mean.
Figure 5 - A normal distribution with each section of the 68-95-99.7 rule labeled
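You can verify the 68-95-99.7 rule yourself with Python's built-in statistics module; this sketch uses a standard normal distribution, but any normal distribution behaves the same way:
from statistics import NormalDist

nd = NormalDist(mu=0, sigma=1)
for k in (1, 2, 3):
    p = nd.cdf(k) - nd.cdf(-k)   # probability of landing within k standard deviations of the mean
    print(k, round(p, 4))        # roughly 0.6827, 0.9545, 0.9973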
Correlation vs causation
Correlation refers to the strength and direction of the relationship between two variables, summarized by a correlation coefficient represented as a number between -1 and 1. A mistake that can easily be made is to imply that one event leads to, or causes, another event just because of a high correlation coefficient. Let's consider the following example:
Figure 11. Chart showing a positive correlation between the number of McDonald's restaurants and the rise of inflation over time. It shows that two unrelated events can have a high correlation.
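To make the coefficient concrete, here is a short Python sketch that computes a Pearson correlation coefficient by hand on made-up experience and salary values:
from math import sqrt
from statistics import fmean

years_experience = [1, 3, 5, 7, 9]      # made-up values
salary = [40, 52, 61, 75, 88]           # made-up values, in thousands

mx, my = fmean(years_experience), fmean(salary)
covariance_sum = sum((x - mx) * (y - my) for x, y in zip(years_experience, salary))
denominator = sqrt(sum((x - mx) ** 2 for x in years_experience)
                   * sum((y - my) ** 2 for y in salary))
r = covariance_sum / denominator
print(round(r, 3))   # close to 1: a strong positive correlation, which still says nothing about causation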
Summary
In this chapter, we introduced various descriptive and inferential statistical
concepts that data analysts should know. An in-depth understanding of descriptive statistics is necessary to perform an effective exploratory data analysis. While there are many more concepts and advanced terms, entry-level data analysts do not need to know a lot of statistics beyond descriptive
and inferential. In the next chapter, we will build upon the concepts here to
learn how to perform hypothesis testing. Hypothesis testing will allow a data
analyst to conduct their own experiments allowing for deeper business
insights.
8 Testing Hypotheses
Join our book community on Discord
https://packt.link/EarlyAccessCommunity
Interpreting the Results
The output table will give you the sample mean, variance, the observed t Statistic, and the P(T<=t) one-tail and two-tail values. The two key values to look at are the P(T<=t) two-tail and the t Statistic:
P(T<=t) Two-tail: This is the p-value. If this value is less than your
chosen alpha level (often 0.05), then you reject the null hypothesis and
conclude that there is a significant difference between the sample mean
and the hypothesized population mean.
t Statistic: This value can be positive or negative, indicating whether the
sample mean is greater or less than the hypothesized mean.
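As an aside, the same kind of one-sample t-test can be run outside Excel as well. Here is a minimal Python sketch with SciPy on made-up scores (this is not the Excel workflow the chapter follows):
from scipy import stats

scores = [72, 78, 81, 69, 74, 77, 80, 73]          # made-up sample scores
t_stat, p_two_tail = stats.ttest_1samp(scores, popmean=75)
print(t_stat, p_two_tail)   # compare p_two_tail against your chosen alpha, e.g. 0.05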
For the 'Variable 1 Range' field, select your range of the Data Analyst
High School scores (B3:B38).
For the 'Variable 2 Range' field, select your range of Data Scientist High
School scores (F3:F38).
For 'Hypothesized Mean Difference', input 75.
You can leave 'Labels', 'Alpha', and 'Output Range' fields as is unless
you have specific preferences.
1. Click 'OK'. Excel will return the t-test results in a new window.
Interpreting the Results
The output table will give you the means and variances of both groups, the observed t Statistic, and the P(T<=t) one-tail and two-tail values. The key values to look at are the P(T<=t) two-tail and the t Statistic:
P(T<=t) Two-tail: This is the p-value. If this value is less than your
chosen alpha level (often 0.05), then you reject the null hypothesis and
conclude that there is a significant difference between the two group
means.
t Statistic: This value can be positive or negative, indicating which
sample has the larger mean.
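For completeness, here is what the equivalent two-sample t-test could look like in Python with SciPy, again on made-up scores rather than the chapter's Excel data:
from scipy import stats

data_analyst_scores = [72, 78, 81, 69, 74, 77, 80, 73]     # made-up group 1
data_scientist_scores = [70, 75, 79, 68, 72, 74, 77, 71]   # made-up group 2
t_stat, p_two_tail = stats.ttest_ind(data_analyst_scores, data_scientist_scores)
print(t_stat, p_two_tail)   # a small p-value suggests the group means differ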
While the t-tests we've examined up until now are adept at comparing means
and require the data to be somewhat normally distributed, we face a different
issue when our data are categorical. Specifically, when the relationship
between two variables is of interest, rather than just comparing means. What
do we do then? This leads us into our next topic: the Chi-Square test. The Chi-
Square test is a non-parametric method used in statistics to determine if
there's a significant association between two categorical variables in a
sample. It is particularly useful when analyzing data that are categorized into
groups, not numbers. In the next section, we'll delve into the Chi-Square test,
including how to implement it in Excel and interpret the results.
1. Compute the expected values for each cell to create the expected values table. The formula for each expected value is (Row Total * Column Total) / Grand Total; a small worked sketch follows this step.
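To see the formula in action, here is a small Python sketch that builds an expected values table from a made-up 2x2 contingency table (the counts are purely illustrative):
# Made-up observed counts: rows could be gender, columns could be two departments
observed = [[30, 20],
            [25, 25]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

expected = [[row_total * col_total / grand_total for col_total in col_totals]
            for row_total in row_totals]
print(expected)   # each cell is (row total * column total) / grand total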
In this case:
Null hypothesis (H0): µ1 = µ2 = µ3 (The mean scores for all schools are
equal)
Alternative hypothesis (H1): At least one school's mean score is
different
Interpreting the Results
The output table will include the 'Between Groups' and 'Within Groups' variance, the F statistic, and the P-value.
F statistic: This is the test statistic. It's a ratio of the variance between
groups to the variance within groups.
P-value: This is the probability of getting an F statistic as extreme as, or
more extreme than, the observed value, assuming the null hypothesis is
true. If this value is less than your chosen alpha level (often 0.05), then
you reject the null hypothesis and conclude that there is a significant
difference between at least two of the group means.
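As with the t-tests, the same ANOVA can be reproduced in Python with SciPy if you prefer code; here is a minimal sketch on made-up scores for three schools:
from scipy import stats

school_a = [72, 78, 81, 69, 74]
school_b = [70, 75, 79, 68, 72]
school_c = [74, 80, 83, 71, 76]

f_stat, p_value = stats.f_oneway(school_a, school_b, school_c)
print(f_stat, p_value)   # a p-value below alpha suggests at least one school's mean differs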
Summary
In this chapter, we embarked on a journey through four fundamental
statistical tests: One-Sample T-Test, Two-Sample T-Test, Chi-Square Test,
and ANOVA. Each test has its unique capabilities and contexts in which it is
most appropriate. We started with the One-Sample T-Test, a technique that
allows us to compare a sample mean to a known population mean. This test
helps us to understand whether our sampled data deviates significantly from
the known population mean. Through an interactive case study, we practiced
using Excel to perform a One-Sample T-Test, strengthening our practical
understanding of statistical analysis. Moving forward, we explored the Two-
Sample T-Test. Unlike the One-Sample T-Test, this test allows us to compare
the means of two independent groups. It's particularly useful when we need to
understand differences between two groups based on sampled data. Next, we
delved into the Chi-Square test, a non-parametric method used to determine if
there is a significant association between two categorical variables. This
statistical tool helped us analyze relationships between categorical variables,
adding a new layer of complexity to our data analysis capabilities. Finally, we
explored Analysis of Variance or ANOVA, a powerful statistical technique
that extends the two-sample t-test to compare means across more than two
groups. With the help of Excel, we learned how to implement ANOVA and
how to interpret its results, empowering us to make data-driven decisions
across a variety of scenarios. Together, these tests provide a solid foundation
for statistical analysis. By understanding when and how to use these tests, we
can make informed decisions based on our data, drive meaningful insights,
and answer complex questions about the world around us.
9 Business Statistics Case Study
Join our book community on Discord
https://packt.link/EarlyAccessCommunity
This chapter will dive into the practical application of business statistics
through a comprehensive case study. This case study gives prospective data
analysts a practical setting to use the descriptive and inferential concepts that
we’ve covered in this book so far. After the case study, we will provide
sample interview questions that will help you prepare for your interviews.
For further optional study, additional advanced topics in analytics will be
provided along with resources. By the end of this chapter, you will have a
firm grasp of using statistical methods to examine business data and make
relevant inferences. Additionally, you will gain real-world experience processing data and developing business-related questions.
Learning Objectives:
1. Perform descriptive statistics to summarize and understand the data
2. Conduct t-tests to compare means between two groups
3. Perform chi-square tests to examine associations between categorical
variables
4. Conduct ANOVA tests to compare means among more than two groups
5. Calculate correlation coefficients to measure the strength and direction
of relationships between two variables
Questions:
1. Descriptive Statistics: The HR department is curious about the basic
statistics of salaries and experience across different departments. They
ask you:
What is the average, median, and range of salaries for each
department?
Also, how does the average years of experience vary for each
performance score category?
2. T-tests: During a meeting, the HR manager raises a concern about
gender pay equity. They ask you:
Is there a significant difference in the average salary between male
and female employees?
Also, we have been investing heavily in training for the Sales
department. Is their experience significantly different from the IT
department?
3. Chi-Square Test: The HR department is planning some initiatives to
promote diversity and inclusion. They ask you:
Is there a significant association between gender and department?
Also, does the department an employee works in influence their
performance score?
4. ANOVA: The finance department wants to ensure that the budget
allocation for salaries is fair across all departments. They ask you:
Is there a significant difference in the average salary among the
different departments?
Also, does the average years of experience vary significantly
among the different performance score categories?
5. Correlation: The HR manager is interested in understanding the factors
that influence an employee's salary. They ask you:
Is there a correlation between years of experience and salary?
What about performance score and salary?
Solutions:
Text Analytics
Text analytics, often known as text mining, draws essential knowledge,
intelligence, and insights from unstructured text data. Usually, it entails
structuring the input text, finding patterns in the structured data, and
assessing and interpreting the results. Natural Language Processing (NLP),
sentiment analysis, categorization, and entity recognition are a few methods it
uses. These methods are drawn from linguistics, computer science, and
machine learning. Text analytics can be used in various situations by a data
analyst. An example is in customer service to examine customer reviews and
feedback to find recurring themes and feelings. This may assist a company in
understanding what clients are saying about their goods or services and
making adjustments as necessary. Text analytics can be used in a social
media setting to understand better the attitudes and trends surrounding a
specific subject or brand, offering helpful information for marketing and
strategic planning. Resources to learn and practice text analytics:
Learning Roadmap:
Hands on Big Data Analytics with PySpark by James Cross, Rudy Lai, and Bartłomiej Potaczek.
Big Data Analysis with Python by Ivan Marin, Ankit Shukla, and Sarang VK.
Practical Big Data Analytics by Nataraj Dasgupta.
Predictive Analytics
Using data, statistical algorithms, and machine learning approaches,
predictive analytics determines the likelihood of future events based on
historical data. Predictive analytics seeks to deliver the most accurate forecast
of what will occur by going beyond simply knowing what has already
occurred. Predictive analytics is a potent tool for data analysts that may assist
businesses in making data-driven decisions. It requires a number of stages,
including designing the project, gathering data, analyzing data, developing a
predictive model, validating and putting the model into use, and tracking the
model over time. Predictive analytics can be used in various fields and
industries. Here are a few examples:
Learning Roadmap:
Database Management
The tasks involved with administering a database, which is a structured set of
data, are referred to as database management. This can entail creating the
database's structure, entering data, implementing security controls, preserving
the data's integrity, and retrieving data as required. In order to collect and
analyze data, a database management system (DBMS) communicates with
applications, end users, and the database itself. A DBMS can be thought of as a file manager that handles data in a database, as opposed to saving files in a file system. There are various kinds of DBMS, including the most popular, Relational Database Management Systems (RDBMS), as well as NoSQL DBMS, In-Memory DBMS, Columnar DBMS, and others. Database management can be
used in various fields and industries. Here are a few examples:
SQL for Data Analytics by Jun Shan, Matt Goldwasser, and Upom Malik.
SQL Query Design Patterns and Best Practices by Steve Hughes, Dennis Neer, and Dr. Ram Babu Singh.
Learning Roadmap:
Summary
In this chapter, we conducted a comprehensive business statistics case study
using a dataset of 500 employees from a multinational corporation. The
dataset included each employee's ID, gender, age, department, years of
experience, salary, and performance score. This provided a practical
application of the statistical concepts discussed in Fundamental Statistics
Concepts and Testing Hypotheses. We also provided sample interview
questions to aid your preparation for the job search and additional analytics
topics to explore. In the next chapter, Data Analysis and Programming, we
will learn the applications and importance of programming languages for data
analytics.
10 Data analysis and programming
Join our book community on Discord
https://packt.link/EarlyAccessCommunity
It’s time to take you beyond the world of spreadsheets and dashboards, into
the dynamic world of programming. I’m super excited and honored to be
your guide. Let's be honest: for some people this is the scariest pillar. And
that’s okay, it’s going to be my personal mission to show you that this is the
most fun part of the job (which aligns with my motto: aim high!). And even if
I can’t convince you it’s fun, you’ll absolutely end up needing it and
appreciate the role of programming in the bigger picture. You can definitely
get started as a data analyst without programming skills. But eventually,
there’s no other option than to embrace it. Programming is an important skill
for anyone involved in data analysis. Honestly, there’s no need to run away
or hide. Everybody can do it! Just like almost anything new, it’s tough in the
beginning. But programming can eventually really become a second nature.
Here’s what we’ll be doing:
Python
Python is a very obvious one to put on the list. It’s a widely used
multipurpose language that is used for many things, amongst which data
analysis. (Fun fact, just like one of your authors, it’s from the early 90s and
the Netherlands.) Python comes with a lot of libraries that are great for data
analysis, including NumPy, Pandas, Matplotlib, and a lot more. A library is some sort of add-on, written in that language, that can be easily used to
perform certain tasks. In most cases, you just need to know which statistical
thing you need to do, and then if you give the function of the library the
correct parameters, the slightly harder math is handled inside the library and
it just gives you back the correct result. Let's have a look at how to create a list and iterate over it in Python (we'll do the same for the languages below so you can see the strong similarities between different languages). Here's how Python
does it:
numbers = [1, 2, 3, 4, 5]
for i in numbers:
    print(i)
Python heavily relies on the correct indentation level to group code blocks,
whereas other languages use curly brackets to create code blocks. An example
of such a language is R. Let’s see R next.
R
R is primarily used for statistical analysis and data visualization. It is a
common choice for academic research, and many people know the basics of
R after completing a program at a university. But that’s not the only reason
that it’s a very popular choice among advanced statisticians and data
scientists. It offers a rich library of statistical and graphical methods. Here is
an example of a basic R syntax for creating and iterating through a vector
(like lists in Python):
my_vector <- c(1, 2, 3, 4, 5)
for (i in my_vector) {
print(i)
}
However, R's syntax can be less intuitive for beginners, and it is less versatile
than Python for tasks outside of pure statistical computing. Let’s not forget to
mention a very important one that we’ve seen already.
SQL
We have seen SQL (Structured Query Language), and at this point you're probably quite comfortable with it. It shouldn't be missing from this short list of great choices of programming languages for data analysis. SQL is a domain-specific language used in programming for managing and manipulating databases. SQL is great for querying and extracting data, and it's used by most companies due to the ubiquity of SQL databases. It is often used in combination with other languages like Python for database-related functions, using connectors. Here is an example of a basic SQL query that
displays a table of all records in the Employees table where an employee is
more than 30 years old:
SELECT * FROM Employees WHERE Employee_Age > 30;
Julia
Julia is a high-performance programming language for technical computing
that addresses performance requirements for numerical and scientific
computing while also being effective for general-purpose programming. Its
syntax is similar to those of other technical computing languages. Here is how you might create and iterate over a list in Julia:
numbers = [1, 2, 3, 4, 5]
for item in numbers
    println(item)
end
However, Julia is a newer language, and its ecosystem is less developed than those of other languages like Python and R. Additionally, it might be too advanced to
use for simple data analysis.
MATLAB
Matlab is a high-level language and interactive environment for numerical
computation and programming. It is prevalent in academia and engineering,
where it is used for tasks such as signal processing, image processing, and
simulations. This is how you can iterate through a vector with a loop in
Matlab:
my_vector = [1, 2, 3, 4, 5];
for i = 1:length(my_vector)
    disp(my_vector(i));
end
Now that we’ve opened up the CLI, let’s see what we can use it for.
It’s very common to use the CLI for and during programming. Of course, the
GUI is user friendly and chances are that you have a lot of experience with it
as opposed to the CLI, but let’s see why we’d be using the CLI instead,
especially during programming.
Depending on the outcome, let’s set up your system or skip ahead to the
testing setup section.
MacOS
If Python 3 is not installed on macOS, then follow these steps:
1. Install Homebrew if you haven't installed it already. Open your Terminal, type the following command, and hit Enter:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
The terminal should return a success message once Homebrew is installed.
1. The next step is to install Python using Homebrew. This can be done by
typing the following command on your Terminal (after installing
Homebrew):
brew install python
It will download for a minute or maybe two, and then display a success message. Close the terminal, open it again, and see if python --version or python3 --version now shows the correct version.
Linux
On Linux (depending on the exact Linux version), use the following
commands:
sudo apt-get update
This is to update the package manager itself. After that, install the version of
Python you want, for example:
sudo apt-get install python3.11
After installation, you can check the Python version again to confirm that the
installation was successful.
Windows
1. Install Python. Download the latest version of the Python installer from
the official website: (https://www.python.org/downloads/windows/). For
example 3.11.4 in Figure 10.3. During the installation process, make
sure to check the box that says "Add Python to PATH".
2. Once installed, open the command prompt and check the Python version
to confirm the installation with the following command, followed by an
enter:
python --version
Figure 10.3 – Windows Python download page
Browser (cloud-based)
If you don't have a computer with Linux, macOS, or Windows, not to worry. There are great cloud-based Python platforms as well. Some good options are Google Colab or Jupyter Notebook via Binder. They require no installation and give you direct access to Python from the browser, without relying on your own machine's processor or storage.
Google Colab
Hello world
Next, we need to get to this folder in the command line. On Windows and macOS, you can right-click on the folder that contains “helloworld.py”,
and select an option like “open in terminal”. This opens the folder that
contains the file in the terminal. In Figure 10.4 you can see what this looks
like for Windows.
Figure 10.4 – Open in Terminal (third from the bottom)
You can check the path that shows in the command prompt on Windows, and type "pwd" and hit Enter on Linux and macOS. If that's indeed the right folder, go ahead and type one of the two following options, followed by Enter:
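The two options being referred to are presumably the following (which one works depends on how Python is registered on your system):
python helloworld.py
python3 helloworld.py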
If it says something like "no such file or directory", you're not in the folder that contains helloworld.py . Make sure to right-click on the folder that contains the file when you open the terminal, or navigate there using the command line (you can google the cd command, short for change directory, to learn how to do this if necessary).
Data Visualization
CleanAndGreen might want to visualize the waste generation and recycling
trends over time across different cities. In Python, there are many libraries for
data visualization, some famous ones are Matplotlib and Seaborn, which can
be used to create a variety of informative plots and graphs. We will not cover those in depth in this book, but this is a very relevant use case that we definitely recommend exploring once you're past your rookie stage in Python!
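Just to give you a taste of what that could look like (using made-up monthly totals; we won't build on this example), a minimal Matplotlib sketch might be:
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
waste_collected_kg = [1300, 1250, 1400, 1500]   # made-up totals

plt.plot(months, waste_collected_kg, marker="o")
plt.title("Waste collected per month")
plt.ylabel("Kilograms")
plt.show()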
Statistical Modeling
CleanAndGreen wants to determine the factors that affect recycling rates in
different neighborhoods. With Python, you can use several libraries for this
purpose. Some common choices would be statsmodels or scikit-learn. These
libraries can be used to build statistical models and perform hypothesis tests.
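As a rough illustration of that idea (with made-up numbers, and without going into the statistics behind it), a simple linear model in scikit-learn could look like this:
from sklearn.linear_model import LinearRegression

# Made-up data: average household income (in thousands) vs. neighborhood recycling rate (%)
incomes = [[35], [48], [52], [61], [75]]
recycling_rates = [22, 30, 33, 41, 48]

model = LinearRegression().fit(incomes, recycling_rates)
print(model.coef_, model.intercept_)   # the estimated effect of income on the recycling rate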
Summary
In this chapter, we explored the need for programming in data analysis,
illuminating its necessity in our data-driven age. You should now understand why programming is crucial for any data analyst, and in later sections, we will show how to work with Python, a leading language in this field. With the
help of the CleanAndGreen startup example, we outlined how programming
aids in data collection, handling large volumes of data and performing
intricate data analysis to derive insights. By mastering programming, one can
realize the potential of the data and convert it into actionable, data-driven decisions. After that, we provided an overview of the different programming
languages utilized in data analysis. We compared and contrasted popular
languages for data analysis, such as Python, R, SQL, Julia, and MATLAB,
highlighting the pros and cons of each and offering very small code snippets
to showcase their syntax and usability differences. Our focus then shifted to
Python – one of the most popular programming languages. We explained
why Python is often the choice for data analysis due to its simplicity,
flexibility, and its vast library support. Next, we transitioned into the practical
aspect of setting up your Python programming environment. We provided
step-by-step instructions for all operating systems as well as modern cloud-
based development environments like Google Colab. After each step, we
ensured that you understood the expected results, setting you up for success
in your Python programming journey. And, of course, we also introduced you
to Command Line Interface (CLI) and drew comparisons between the CLI
and GUI, highlighting the advantages of using the CLI for programming
tasks. We ended with some practical use cases we could use Python for in the light of supporting CleanAndGreen. As we conclude this chapter, we've
equipped you with the foundational knowledge you need to apply in the
upcoming chapters, where we will use Python to solve real-world data
problems, given our CleanAndGreen example. Get ready to code!
11 Introduction to Python
Join our book community on Discord
https://packt.link/EarlyAccessCommunity
Welcome to the next phase of your learn to code journey. At this point,
you've already understood why programming is necessary for data analysis.
It's time to get your hands dirty with some actual coding. We’ve chosen
Python as a first language for you. It has a relatively easy syntax and it is extremely common in the data analysis field. We could write a whole series of books on Python to teach you all the ins and outs. The approach we chose
here is to give you a quick overview of the language and how you can use it
for data analysis. This is by far not a comprehensive guide, but rather a first
look. This is what we’re going to cover in this chapter:
Indeed, there's a lot to cover. Not to worry, we will take it step-by-step, and
by the end of this chapter, you will have a solid foundation of Python. At that
point you’re ready to look at some libraries of Python to unlock new levels of
data analysis and problem-solving capabilities.We'll continue to lean on our
greenFuture case study to bring these concepts to life. So, let's go!
Understanding the Python Syntax
The syntax refers to the set of rules that specify how to write the code for the
program. It includes how to structure your programs and how different parts
of your program should be arranged. It's the basic rules we need to write our
Python programs. A program consists of statements. And every statement
should go on its own line in the Python file. Let’s look at one of the most
basic statements first: the print statement.
Python files
Print Statements
The print statement in Python is our go-to tool for displaying output to the
console. Let's start with a basic example. Imagine if you just started your day
at greenFuture and you want to greet your team with a little Python script.
You can use a print statement for that.
print("Good morning, greenFuture team!")
And that’s how to do the most basic print statement. In your code, you can
include comments to explain what the code is doing and make it easier to
read. Let’s see how to do that next.
Comments
It’s a good practice to explain what your code is doing. Especially when the
code and logic gets more complicated and you couldn’t tell what it does at
first glance. We can use comments to do this.
Comments are ignored by the Python interpreter, and they can be used to
explain difficult snippets in the code and to add some documentation to it. So
comments don't influence the outcome of the program. Our previous code
snippet is of course very easy, but for demonstration purposes we’ll add a
comment on top, so that you can see what that looks like.
# Printing a greeting for the team
print("Good morning, greenFuture team!")
The outcome will remain exactly the same. We’ll use comments in our code
snippets to explain certain parts every now and then throughout the examples.
This Python script will always print the exact same thing. It would be great if
it would “vary” a bit more. Well, that's where variables come into the mix! Let's have a look at them.
Variables
Variables are like containers for data. They have a name and they can hold a
specific value. Variables allow us to store and manipulate data in Python
programs. For instance, let's say you're tracking the amount of plastic waste
greenFuture has recycled in a week. Here's how you might use a variable to
do that.
plastic_waste_recycled = 1300 # weight in kilograms
print(plastic_waste_recycled)
The output is 1300, because the variable holds that value. We can also perform operations on variables; let's see a few basic examples next.
Operations on variables
Let’s say the team at greenFuture manages to recycle an additional 700
kilograms of plastic waste. We can update our variable like so:
plastic_waste_recycled = 1300 # initial weight in kilograms
plastic_waste_recycled += 700 # additional weight recycled
print(plastic_waste_recycled)
On the second line, we use the += symbol. This means that we increment the
variable with what is on the right side of the operation. It’s the same as:
plastic_waste_recycled = plastic_waste_recycled + 700
Now, let's make it a bit more interesting. Suppose you want to print the
progress of a waste recycling task. You can include variable values in your
print statements combined with regular text. You can do it like this:
task_percent_completed = 45
print("The waste recycling task is", task_percent_completed, "%
completed.")
Output:
The waste recycling task is 45% completed.
Now, let’s take it a step further. Suppose we want to print a formatted string
that includes variable values. We can use f-strings for that. You might be
wondering, but why do I need to use f-strings if I can use the syntax above,
which is more straightforward? It’s because it allows you to insert multiple
variables into a string seamlessly. You can achieve it with the previous
syntax as well. However, the below one is very common and you need to be
able to read it:
task_name = "Aluminium recycling"
task_percent_completed = 45
print(f"The {task_name} task is {task_percent_completed}% comple
ted.")
We start with the f for format before the quotes. That tells Python that
anything between { and } needs to be replaced with the value of a variable.
And that’s why the output will be:
The Aluminium recycling task is 45% completed.
Above, three different variables are assigned three different values, separated
by commas. Note that the result is the exact same as when you declare and
assign each variable on separate lines. We have already seen that we can add
two numbers, but there are more allowed operations. Let’s see some more of
them.
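The snippet being described next is not shown here; based on the output and the explanation that follows, it presumably defines the three weights and adds them together, something like:
# weights in kilograms
plastic = 1300
glass = 700
aluminium = 800
total_recycled = plastic + glass + aluminium
print(total_recycled)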
At the end we print it. This would not be necessary for the task of calculating
the total. However, it helps us to verify the result. Here’s the output:
2800
This operation added the values stored in the variables plastic , glass , and
aluminium together to give us the total weight of recycled waste. Let's try
something more complex. Suppose the weight of waste to be recycled for the
next week is expected to increase by a factor of 1.15. How could we calculate
the expected weight for each type of waste?
# weights in kilograms
plastic = 1300
glass = 700
aluminium = 800
# increase factor
increase_factor = 1.15
# calculating expected weights of each type of waste
expected_plastic = plastic * increase_factor
expected_glass = glass * increase_factor
expected_aluminium = aluminium * increase_factor
print(expected_plastic, expected_glass, expected_aluminium)
These calculations use the multiplication operator (*) to increase each weight
by a factor of 1.15. These numbers can help the greenFuture team understand
the composition of their recycling impact. There are some surprises in the
output though:
1494.9999999999998 804.9999999999999 919.9999999999999
You can see that there are a lot more decimals after the floating point than
you would expect. For example, 1.15 * 1300 = 1495. But we get
1494.9999999999998. What is going on? This is a problem that all
programming languages have to deal with. It’s related to translating a decimal
number to a binary number that the computer inherently works with. Think of
it like writing 1/3 in decimal notation; you can never be exactly as precise as stating 1/3, no matter how many decimal places you add, such as 0.3333333333333333333333. In Python, there's a library called decimal that helps mitigate these issues by providing arbitrary precision arithmetic. Now, to make our output more readable, we can format the floating-point numbers to show only two decimal places. We can use this print statement to do that:
print(f'{expected_plastic:.2f}, {expected_glass:.2f}, {expected_aluminium:.2f}')
We also have operators to compare operands. Let’s have a look at those now.
Comparison operators
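The comparison example discussed below isn't shown here; presumably it checks the recycled total against a target, along these lines:
total_recycled = 2800   # kilograms, from the earlier example
target = 3000
print(total_recycled > target)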
This will output False since 2800 is less than 3000. These operators are
fundamental in controlling the flow of your program and making decisions
based on certain conditions. Let's make it more interesting and add logical
operators to our skillset next.
Logical operators
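The snippet being discussed is not shown here; a reconstruction with made-up thresholds could look like this:
plastic = 1300
glass = 700
total = plastic + glass

print(total >= 2000 and plastic <= 1500)   # True and True  -> True
print(total >= 3000 or glass >= 1000)      # False or False -> False
print(not total >= 3000)                   # not False      -> True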
In this case, we use logical (or, and, not) and comparison operators (<=, >=)
in conjunction with the arithmetic operators we already discussed. The print
statements output the result of the logical expressions, which are boolean
values ( True or False ). As you can see, understanding Python's syntax and using print statements, variables, operators, and expressions is powerful. These are the foundational concepts that we'll be building upon in the upcoming sections. You might have noticed how dynamically Python chooses data types: it produced a float for some operations even though the inputs included integers. So, let's dive more into the data types available in Python to help store data more efficiently.
Strings
Strings are sequences of characters. They are used to represent the textual
data in Python. To understand better, let’s see an example:
city = "New York"
print(city)
Output:
New York
We have created a string variable city with the value New York . Now, let's
use this in a slightly more complex situation. Suppose we want to display a
message about the city where greenFuture operates:
city = "New York"
message = "GreenFuture operates in " + city
print(message)
Output:
GreenFuture operates in New York
In this example, we used the + operator to concatenate two strings. There are
many more things you can do to manipulate strings in Python which you will
explore if you continue your Python journey.
Integers
In the previous sections, we've already used integers, which represent whole
numbers. But let’s create another integer variable representing the number of
greenFuture employees to illustrate this data type:
employees = 50
print(employees)
As you can see, integers are for whole numbers. We can also have decimal point numbers, of course; that's where the float data type comes in.
Floats
Floats represent real numbers, that is, numbers with a decimal point.
Suppose we want to represent the average amount of waste generated per
person in the city as part of the data collection by greenFuture:
average_waste_per_person = 0.87 # in kilograms
print(average_waste_per_person)
Output:
0.87
As you can see, the value is a floating point number. Floats can also be the
result of an operation on two integers. For example, 2 / 3 will result in the
float 0.6666666666666666.
Booleans
In Python, there is also another data type known as a boolean. It stores values as either True or False . For example, expanding on the example above for
integers, we can use a boolean data type to check whether the employee
count is above a specific number.
employees = 50
is_greater = employees > 40
print(is_greater)
Booleans are used a lot for decision making in the code, and we’ll use them
for control flow statements that we will discuss soon, such as if else
statements and loops. Let’s talk about how to change the data type first.
Type Conversion
Python also supports explicit type conversion of data types. Before we
demonstrate that, it would be wise to print out the existing data type of your
variable first. This can be done with type() .
average_waste_per_person = 0.87
print(type(average_waste_per_person))
Output:
<class 'float'>
As you can see, 0.87 is of type float. Now, let’s improve on this example and
convert the variable's data type. Here’s how you can do that:
average_waste_per_person_int = int(average_waste_per_person)
print(average_waste_per_person_int)
print(type(average_waste_per_person_int))
Output:
0
<class 'int'>
In the examples above, the first one prints the data type of the variable that
stores the average waste per person. The second example goes a step further
and explicitly converts the data type to an integer, stores the result in a new
variable, and then prints the value and type of this new variable. This value is
now of type int.
Something else happened though, the value changed from 0.87 to 0. Let’s
talk about this. When a float value is converted to an integer in Python using
the int() function, the decimal portion of the float is truncated, not
rounded. This means that everything after the decimal point is simply
discarded, and only the whole number part is kept. In the case of 0.87, the
whole number part is 0, and the decimal part .87 is discarded, resulting in 0.
This behavior is consistent regardless of whether the decimal part is less or
more than .5. It's important to be aware of this behavior as it can lead to unexpected results if you're used to working with rounding instead of truncation. A popular use case of type conversion is when you take user
input for a numerical value. In such a case, Python defaults to storing the
results as a “string” type rather than a numerical value. So, you must convert
the type of data to use the actual numerical value in different operations.
Here’s a code snippet that demonstrates that:
# Collecting daily recycling input from the user
daily_recycling_input = input("Enter the number of plastic bottles recycled today: ")
# Printing the type of daily_recycling_input to show it's a string
print(type(daily_recycling_input)) # Output: <class 'str'>
# Converting the string input to an integer
daily_recycling_int = int(daily_recycling_input)
# Now we can perform numerical operations
weekly_recycling_total = daily_recycling_int * 7
# Printing the weekly recycling total
print(f"The total number of plastic bottles recycled in a week is: {weekly_recycling_total}")
After this, it pauses. It waits for the user input. Let’s suppose we enter 1 as
input. It then stores 1 in the daily_recycling_input . And after that it
prints:
<class 'str'>
The total number of plastic bottles recycled in a week is: 7
Let’s go over what happens in the code snippet. We first collect the daily
recycling number using the input() function. The input function pauses and
waits for user input. We store the entered value in the
daily_recycling_input variable. We then print the type of
daily_recycling_input to demonstrate that it's a string. Next, we convert
daily_recycling_input to an integer using the int() function, allowing
us to perform numerical operations on it. Finally, we calculate the weekly
recycling total by multiplying daily_recycling_int by 7, and print the
result.
There are other types of conversion as well:
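The other conversions being referred to presumably include float() and str() ; for example:
employees = 50
employees_text = str(employees)    # int -> str
print(type(employees_text))        # <class 'str'>

average_waste = float("0.87")      # str -> float
print(type(average_waste))         # <class 'float'>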
This is how to convert data types from one type to another. There’s more you
likely want to do with data. Suppose you want to extract specific values from
a string or textual data type in Python. How can you do that? With indexing
and slicing, you can access, modify, and manipulate the data stored in strings
and other sequence types. And that’s exactly what we’re going to talk about
next!
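The indexing example being described next is not shown here; it presumably accesses the first character of a city name, like this:
city = "New York"
print(city[0])   # index 0 is the first character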
As you can see, we need the square brackets to access the values in the string.
This is what it will output:
N
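The negative-indexing example that produces the next output is presumably:
print(city[-2])   # the second-to-last character of "New York"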
Output:
r
When you use negative indexing, you must remember that -1 refers to the last
character (on the right-most side of the string). Here we take the second last
character with index -2. That’s the indexing part. With the use of the index,
we can also perform slicing. Slicing is used to extract specific portions of a
string. It’s useful when you only need certain parts of a string and can dispose
of the rest. The syntax for this uses a colon in between the starting character
and ending character on either side of it. If no end index is specified, it
continues until the end of the string:
city = "New Jersey"
selected_text = city[4:]
print(selected_text)
In this example we start at index 4 (the fifth character) and continue until the
end of the string. This will be the output:
Jersey
Slicing and indexing can also be used on lists. Let’s learn about lists and
other complex data structures next.
Unpacking Data Structures
In Python, we often deal with data structures that can hold multiple items,
like lists, dictionaries, sets, and tuples. This is great whenever you want to bundle different data records, such as all the users, all the products, and more.
Let's explore each type in detail, starting with lists.
Lists
A list is an ordered collection of items. It’s a mutable data structure, meaning
you can change its elements after declaration. It also allows duplicate
elements and elements of different data types in a single list. Like strings,
access to list elements uses zero-based indexing. Let's create a list of the
types of waste that greenFuture recycles:
waste_types = ["plastic", "glass", "aluminium"]
print(waste_types)
You can access individual elements of a list by their index, just like the
characters of strings:
# Get the first item
first_item = waste_types[0]
print(first_item) # Output: 'plastic'
# Get the last item
last_item = waste_types[-1]
print(last_item) # Output: 'aluminium'
In our journey with GreenFuture, it's often essential to know how many types
of waste materials we are dealing with. Python makes this easy with the
len() function. This handy function tells us the number of items in a list.
Let's see it in action:
list_length = len(waste_types)
print(list_length) # Output: 3
With just a simple function, we now know there are three types of waste
materials in our list. There are a lot of methods built in on the list data type. Let's explore them.
Lists in Python are equipped with a variety of methods to make our life
easier. Let's delve into some of these methods.
Appending Items
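The append example being described is presumably:
waste_types.append("paper")
print(waste_types)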
As you can see, after appending, the list got extended with paper. This is how
to add a single element. There’s also a way of adding a full list of elements to
the list.
Suppose we have a list of additional materials. We can add all these items to
our existing list with the extend() method:
additional_waste_types = ["organic", "e-waste"]
waste_types.extend(additional_waste_types)
print(waste_types)
As you can see, both elements got added to our list, without having a list
inside our list. This is how to make the list bigger, we can also make it
smaller by removing elements.
Removing Items
If for some reason, we stop recycling glass, we can remove it from our list
using the remove() method:
waste_types.remove("glass")
print(waste_types)
As you can see, glass has been removed from our list. Our list is a little
unorganized right now, let’s see how to fix that with sort.
Keeping our list of materials sorted helps in quicker access and better
organization. The sort() method is here to help:
waste_types.sort()
print(waste_types)
The list is now sorted A-Z, making it easy for us to see whether a certain
waste type is present or not.
Counting Occurrences
Curious to know how many times a particular material appears in our list?
The count() method has got us covered:
count_plastic = waste_types.count("plastic")
print(count_plastic)
Since all the elements are only present once in our list, the result will either
be 1 or 0. Since plastic is present, it is 1. If we’d look for something that is
not on the list, it would be 0. We can also figure out on which position an
element on the list is on.
If we want to know the position of a material in our list, we can use the
index() method:
index_aluminium = waste_types.index("aluminium")
print(index_aluminium)
Since aluminium is the first element on the list, the index of aluminium is 0.
If the element is not on the list, it will throw a ValueError . When the error
is not handled, the program will stop and crash. You’ll learn about how to
handle errors later.
For a different perspective, we might want to look at our list in reverse. The
reverse() method makes this an easy task:
waste_types.reverse()
print(waste_types)
Dictionaries
A dictionary is an unordered collection of key-value pairs. The key for each
pair must be unique, and it is immutable after declaration. You can store
values with any data type and access is provided through the keys, not
indexing. For example, we might use a dictionary to store the amount of
waste greenFuture recycles in each category:
waste_recycled_kg = {"plastic": 1200, "glass": 800, "aluminium":
1500}
print(waste_recycled_kg)
Output:
{"plastic": 1200, "glass": 800, "aluminium": 1500}
We can also access the amount of plastic recycled by using the key "plastic":
plastic_recycled = waste_recycled_kg["plastic"]
print(plastic_recycled)
Output:
1200
Let's have a look at some of the other common things we need to do with
dictionaries in our day-to-day tasks as data analysts.
dict_size = len(waste_recycled_kg)
print(dict_size) # Output: 3
This will print 3. With a simple function call, we now know there are three
key-value pairs in our dictionary. Let’s see some built-in methods on
dictionaries.
Adding Items
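The example being described is presumably the following; the value 900 matches the dictionary contents shown later in this section:
waste_recycled_kg["paper"] = 900
print(waste_recycled_kg)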
The dictionary is updated to also contain a key-value pair for paper. It is also
possible to update an existing key-value pair.
Updating Items
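The original example for this section is not shown here. Updating works by assigning a new value to an existing key; the value 1600 matches the dictionary contents printed later in this section:
waste_recycled_kg["aluminium"] = 1600
print(waste_recycled_kg)
The aluminium entry now holds 1600 instead of its previous value.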
Removing Items
If for some reason, we stop recycling glass, we can remove it from our
dictionary using the pop() method:
waste_recycled_kg.pop("glass")
print(waste_recycled_kg)
The glass: 800 key-value pair has been removed. And our dictionary now
consists of three key-value pairs. If you try to pop a key that is not in the
dictionary, you’ll get an error. You can check whether a certain key exists as
well.
Before attempting to access or remove an item, it's wise to check if the key
exists to avoid the error. The in keyword helps us here:
print("glass" in waste_recycled_kg) # Output: False
This will print False . The next step would be to use an if-else statement,
we’ll see how to create those soon!
Sometimes, we might want to take a look at all the keys or all the values in
our dictionary. The keys() and values() methods are perfect for this:
print(waste_recycled_kg.keys())
# Output: dict_keys(['plastic', 'aluminium', 'paper'])
print(waste_recycled_kg.values())
# Output: dict_values([1200, 1600, 900])
These come in handy once we’ve seen some more control flow statements.
What we’ve seen now is just a glimpse into the capabilities of dictionaries,
but it's definitely a good place to start managing and querying data. While
dictionaries are incredibly useful due to their key-value pair structure, there
are other data structures in Python that offer different advantages. As we
continue our exploration, we'll delve into sets next.
Sets
A set is an unordered collection of unique items. It is also a mutable data
structure that supports mathematical set operators like Union but does not
allow indexing or slicing. Suppose greenFuture operates in multiple cities,
and we want to keep track of them. The cities will be unique items and we
don’t want duplicates, so we store them in a set:
cities = {"New York", "San Francisco", "Chicago", "New York"}
print(cities)
Output:
{"New York", "San Francisco", "Chicago"}
Although we added "New York" twice, it only appears once in the set
because all items in a set must be unique, and it doesn’t support duplicate
values like in lists. Hence, it's useful for eliminating duplicates. Note the
difference in the type of parentheses used. Sets use curly braces, while lists
use square brackets. Let’s talk about what we can do with sets.
Just like with lists and dictionaries, it's often necessary to know the size of
our set. The len() function is our go-to for this again:
set_size = len(cities)
print(set_size) # Output: 3
With a simple function call, we now know there are three unique cities in our
set. Let’s see some methods that we’ll commonly use for our data analysis
tasks on sets.
Sets in Python are equipped with a variety of methods to make our data
analysis tasks smoother. Let's explore some of these methods and keywords.
As you’ll see, a part of what we often need to do is similar to what we want
for dictionaries and lists, but the way we need to do it differs a little bit.
Adding Items
When GreenFuture expands to a new city, we need to add it to our set. Here’s
how:
cities.add("Los Angeles")
print(cities)
As you can see, Los Angeles has been added to our set. In a similar way, we
can remove cities as well.
Removing Items
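The removal example being described is presumably:
cities.remove("Chicago")
print(cities)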
In the output you can see that Chicago is no longer in our set. When we try to remove something that's not in our set, we'll get an error. Luckily, we can check whether an element exists in our set first.
We don’t want to get errors. That’s why it’s wise to check if the item exists,
before attempting to remove it. The in keyword helps us here:
print("Chicago" in cities)
Output:
False
In this case, Chicago is not in the set and the statement evaluates to False.
Sets allow for some special mathematical operations. Let’s see those next.
Set Operations
Sets support various mathematical operations that can be very useful in data
analysis. Let's explore a few of the available operations.
Union
The first one we’ll discuss is union. Union creates a new set combining all
unique items from two sets. In order to make sure you understand it, we’ll
repeat the current values of the sets.
cities = {"New York", "San Francisco", "Los Angeles"}
other_cities = {"Boston", "Miami", "New York"}
all_cities = cities.union(other_cities)
print(all_cities)
Intersection
We have just seen how to use union to get all the unique elements in both
sets. We can also find the common elements in two sets. This is done with the
intersection method. Here’s an example of how to do that:
cities = {"New York", "San Francisco", "Los Angeles"}
other_cities = {"Boston", "Miami", "New York"}
common_cities = cities.intersection(other_cities)
print(common_cities)
We repeated the values of the sets for clarity. This is the output:
{'New York'}
Since the other_cities and the cities only have New York in common,
that’s going to be the only element in the common_cities set. We can also
find the elements that they don’t have in common.
Difference
In order to find the elements in one set that are not in the other, we can use
the difference method. Here’s how to do that.
cities = {"New York", "San Francisco", "Los Angeles"}
other_cities = {"Boston", "Miami", "New York"}
unique_cities = cities.difference(other_cities)
print(unique_cities)
This will give the values that are present in cities , but not present in
other_cities . And this is what the output will be:
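{'San Francisco', 'Los Angeles'}
(The exact order may differ, since sets are unordered.)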
As you can see, cities has Los Angeles and San Francisco; these values are not
in other_cities . New York is in both sets, which is why it's not in the
output. There's a lot more you can do with sets, but this gives you a solid
foundation to start with. The last data structure we'll discuss is the tuple.
Tuples
A tuple is an ordered collection of items that cannot be changed once created,
i.e., an immutable data type. It also allows zero-based indexing and duplicate
values to be stored. Moreover, tuples can be used as keys in dictionaries or as
elements in sets. For instance, we might use a tuple to store the latitude and
longitude of greenFuture's office:
office_location = (40.7128, -74.0060)  # coordinates for New York
print(office_location)
We can access the latitude (the first item) just like we would in a list:
latitude = office_location[0]
print(latitude)
Output:
40.7128
Tuples are similar to lists except that they are immutable and use parentheses,
while lists use square brackets. Apart from that, many operations work the same
way for both.
And that's it: list, dictionary, set, and tuple. These basic data types and
structures in Python will be the foundation for many of the tasks you'll have
to perform. They provide a flexible way to represent real-world data and
manipulate it to derive insights. The topics we're going to discuss next will
open up a new world of tasks you can perform with Python. Combining your
knowledge of data structures with your soon-to-be-gained knowledge of control
flow structures will be a great tool for solving your data analysis tasks. So,
let's go!
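We'll start with conditional statements: the if statement executes a code block only when its condition is true. The original first example isn't reproduced here; a minimal sketch of what it likely looked like, assuming x = 3 and y = 10:
x = 3
y = 10
if x > y:
    print("x is greater than y!")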
This will output nothing. Why? The print statement is only executed if x > y,
and since x is not greater than y, it is not executed. Let's look at this
example:
x = 30
y = 10
if x > y:
    print("x is greater than y!")
This time it prints the message. Why? Because now x is greater than y, so the
code block associated with the if (in this case the print line) is executed.
The code block could consist of multiple lines; they would all have to be
indented, otherwise they're considered outside of the if statement.
So far, we've only seen if statements and nothing that needs to happen only if
the condition is false. If the condition is false, the program can execute an
optional else block. Let's explore this with an example:
x = 3
y = 10
if x > y:
    print("x is greater than y!")
else:
    print("x is smaller than or equal to y!")
The else doesn't specify a condition; it doesn't need one. When the if
condition is not true, execution ends up in the else. Again, the code in the
else block needs to be indented in order to belong to it. Here's what this
will output:
x is smaller than or equal to y!
Since the statement x > y evaluates to false, the else block is executed. We
can even specify multiple if conditions in one if statement. The second one is
an “else if” written as elif. This second statement will only be evaluated if the
first one is false.
x = 3
y = 10
z = 5
if x > y:
    print("x is greater than y!")
elif x > z:
    print("x is greater than z!")
else:
    print("x is smaller than or equal to z!")
The flow works from top to bottom through the if-elif-else statement. First,
the if condition is checked; when it is found to be false, the interpreter
checks whether the elif condition is true; and when the elif is false too, it
executes the code inside the else block. In this example, both x > y and x > z
are false, so the else block prints its message.
Anything that evaluates to True or False can be used as the condition of an if
statement. Now, let's continue using our greenFuture example and tackle the
problem of finding data on specified waste materials:
waste_recycled_kg = {"plastic": 1200, "glass": 800, "aluminium":
1500}
if "plastic" in waste_recycled_kg:
plastic_kg = waste_recycled_kg["plastic"]
print(f"Plastic: {plastic_kg} kg")
elif "glass" in waste_recycled_kg:
glass_kg = waste_recycled_kg["glass"]
print(f"Glass: {glass_kg} kg")
elif "aluminium" in waste_recycled_kg:
aluminium_kg = waste_recycled_kg["aluminium"]
print(f"Aluminium: {aluminium_kg} kg")
else:
print("No data available for the specified materials.")
Since "plastic" is in the dictionary, only the first branch runs, and this
snippet prints Plastic: 1200 kg. Let's make sure you understand what is going
on. What do you think the following code snippet will print?
waste_recycled_kg = {"plastic": 1200, "glass": 800, "aluminium":
1500}
if "plastic" in waste_recycled_kg:
plastic_kg = waste_recycled_kg["plastic"]
print(f"Plastic: {plastic_kg} kg")
if "glass" in waste_recycled_kg:
glass_kg = waste_recycled_kg["glass"]
print(f"Glass: {glass_kg} kg")
if "aluminium" in waste_recycled_kg:
aluminium_kg = waste_recycled_kg["aluminium"]
print(f"Aluminium: {aluminium_kg} kg")
Since these are three separate if statements, all three conditions are evaluated.
Since all conditions are true, all the code blocks will be executed. This is
what it prints:
Plastic: 1200 kg
Glass: 800 kg
Aluminium: 1500 kg
You’ll find yourself using if statements a lot. Another very commonly used
control flow structure is the loop. Let’s talk about looping next.
Looping in Python
Loops are a way for us to execute a code block repeatedly for as long as the
specified condition(s) remain true. Python has two main types of loops: for
loops and while loops. We will look at each of them below.
For Loops
A for loop is used for iterating over a sequence (such as a list, tuple, or string)
or other iterable objects. It’s used in almost every program you will come
across. Let's say greenFuture has a list of cities they plan to expand to, and
we want to print out each city's name. Now, there’s another, more
cumbersome, way to do this apart from the for loop. You could access each
element by their indexes in the list and then print them. Here’s what that
would look like.
cities = ["Boston", "Denver", "Los Angeles", "Seattle"]
print(cities[0])
print(cities[1])
print(cities[2])
print(cities[3])
You can achieve the same thing with the below code snippet that uses a for-in
loop.
cities = ["Boston", "Denver", "Los Angeles", "Seattle"]
for city in cities:
    print(city)
Do you see how the example with the loop is a lot less code? We need to
specify a temporary name for each element in the list. Here we choose city .
We could have chosen anything, this would have done the same thing:
cities = ["Boston", "Denver", "Los Angeles", "Seattle"]
for x in cities:
    print(x)
But city is more descriptive than x , so that’s the better pick. Here’s what it
will output (in both cases):
Boston
Denver
Los Angeles
Seattle
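We can also loop over a dictionary. The original snippet isn't shown here; a minimal sketch using the .items() method, assuming the waste_recycled_kg dictionary from earlier:
waste_recycled_kg = {"plastic": 1200, "glass": 800, "aluminium": 1500}
for category, amount in waste_recycled_kg.items():
    print(f"greenFuture has recycled {amount} kilograms of {category}.")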
As you can see, we need to define a name for the key ( category ) and the
value ( amount ). Here is the output:
greenFuture has recycled 1200 kilograms of plastic.
greenFuture has recycled 800 kilograms of glass.
greenFuture has recycled 1500 kilograms of aluminium.
That’s how we can use for loops. Let’s talk about while loops next.
While Loops
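A while loop keeps executing its code block for as long as its condition evaluates to True. The original example isn't reproduced here; a minimal sketch, assuming greenFuture recycles batches of 250 kg until less than 250 kg of waste remains:
waste_left = 1000
while waste_left >= 250:
    waste_left -= 250  # recycle one batch of 250 kg
    print(f"{waste_left} kg of waste left.")
The condition is checked before every iteration, so the loop stops once waste_left drops below 250.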
You might encounter situations where you need to exit a loop prematurely or
skip an iteration. Python provides two keywords for these scenarios: break
and continue. Let's have a look at each of them with examples from our
greenFuture data analysis tasks.
The break statement allows you to exit the loop prematurely when a certain
condition is met. This can be particularly useful when you've found what
you're looking for and there's no need to continue the loop. If you’d continue
to execute the loop, that would be a waste of computing power.
Suppose greenFuture has a list of cities where it operates, and we want to
check if a particular city is in the list. Once we find the city, there's no
need to continue checking the rest of the list, because we've found our answer.
cities = ["New York", "Los Angeles", "Chicago", "Houston", "Phoe
nix"]
target_city = "Chicago"
for city in cities:
print(f"Currently looping over {city}.")
if city == target_city:
print(f"{target_city} is in the list of cities.")
break # Exit the loop when the target city is found
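Running this stops the loop as soon as Chicago is found:
Currently looping over New York.
Currently looping over Los Angeles.
Currently looping over Chicago.
Chicago is in the list of cities.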
The continue statement allows you to skip the rest of the current iteration and
proceed to the next one. This can be useful when you want to skip certain items
in a loop.
Let's say greenFuture has a list of waste materials, and we want to print out
the types of waste, skipping any entry that is glass.
waste_types = ["plastic", "glass", "aluminium", "paper", "organi
c"]
for waste in waste_types:
if waste == "glass":
continue # Skip rest of the code in this iteration
print(waste)
The iteration for glass is skipped because, whenever waste is 'glass', the
continue statement is executed. This means the print(waste) statement in that
iteration is skipped, and execution proceeds to the next iteration of the loop.
These two control flow tools, break and continue , provide additional control
over how your loops execute. Loops are a great building block for automating
repetitive tasks and letting our program run until certain conditions are met.
Another very useful basic Python building block that will help us automate
tasks and structure our code is the function. Let's explore functions next.
Functions in Python
A function is a block of code that can be used to perform a single action.
Functions provide modularity for your application and a high degree of code
reusability. Whenever you feel the need to copy-paste, chances are you might
need a function instead (or a class, but that’s out of scope for this book).
When you use functions, you follow the D-R-Y principle (Don't Repeat
Yourself) in programming. This principle aims to eliminate duplicate code.
Reusing code makes the application more flexible, easier to read and easier to
maintain.
That must sound pretty good! So let’s see how we can define functions to get
started with this.
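The original definition isn't shown above; a minimal sketch of what it likely looked like, assuming 3000 kg of total waste and 1200 kg recycled (which produces the output below):
def calculate_waste_left(total_waste, waste_recycled):
    return total_waste - waste_recycled

waste_left = calculate_waste_left(3000, 1200)
print(f"After recycling, there will be {waste_left} kilograms of waste left.")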
Output:
After recycling, there will be 1800 kilograms of waste left.
The cool thing about functions is that we can call them as often as we like,
with different arguments. So we can also say:
waste_left1 = calculate_waste_left(300, 250)
waste_left2 = calculate_waste_left(1000, 200)
waste_left3 = calculate_waste_left(14, 7)
This will store the values 50, 800, and 7 in waste_left1 , waste_left2 , and
waste_left3 respectively.
Whenever we write a snippet of code that we need to reuse, we can create a
function for it. Python also comes with a lot of built-in functions; let's
explore a few of those next:
len()
print()
input()
type()
These are highly useful in many programs. For instance, you will almost always
find a print statement in any program, as one of its functions is to help with
debugging. Similarly, the len() function helps find the number of elements in
an iterable object, and the type() function, as we discussed in type conversion
before, helps find out the data type of existing objects. Of course, many more
built-in functions exist, and we are only scratching the surface here.
Let’s talk about a very often used built-in function that we haven’t seen yet:
the range() function. The range() function is used in Python to generate a
sequence of numbers. It's often used in loops to control the number of
iterations. The range() function takes up to three arguments: start, stop, and
step. The start argument is the first number of the sequence, the stop argument
is exclusive (the sequence stops just before it), and the step argument is the
difference between consecutive numbers in the sequence. Its syntax is as follows:
range(start, stop, step)
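For example, wrapping a range in list() shows the numbers it generates:
print(list(range(1, 8)))      # [1, 2, 3, 4, 5, 6, 7]
print(list(range(0, 10, 2)))  # [0, 2, 4, 6, 8]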
Let’s look at a practical example. We could use the range() function to
iterate through the years and calculate the projected amount of waste
recycled.
# Assume greenFuture is currently operating in 3 cities
current_cities = 3
# They plan to expand to 10 cities over the next 7 years
target_cities = 10
# Each city recycles 1000 kg of waste per year
recycle_per_city_per_year = 1000
# We'll project the total waste recycled over the next 7 years
for year in range(1, 8):
    # Assume greenFuture expands to 1 new city each year
    current_cities += 1
    total_recycled_this_year = current_cities * recycle_per_city_per_year
    print(f"Year {year}:")
    print(f"  Cities operating: {current_cities}")
    print(f"  Total waste recycled this year: {total_recycled_this_year} kg")
    print("-" * 45)
In this example, the range() function is used to simulate the passage of time
(years) as greenFuture expands its operations to more cities. Each iteration of
the loop represents a new year, with an additional city added to the
current_cities count, and the total waste recycled for that year is
calculated and printed. Here’s the output:
Year 1:
Cities operating: 4
Total waste recycled this year: 4000 kg
---------------------------------------------
Year 2:
Cities operating: 5
Total waste recycled this year: 5000 kg
---------------------------------------------
This is one example of how to use range() in for loops; you'll come across it
often. Let's see a few more popular built-in functions next.
There are a lot of built-in functions, and discussing all of them is out of
scope. But let's have a look at an example that combines several of them. You
must be very familiar with the print() function by now, but do you remember
input() , which we saw earlier in this chapter? Here's an example that
calculates the total waste left after taking user input (as whole numbers) for
the amount of total waste and waste recycled. We'll explain the different
functions used afterwards:
def calculate_waste_left(total_waste, waste_recycled):
    return abs(total_waste - waste_recycled)

total_waste = int(input("Enter the total amount of waste: "))
waste_recycled = int(input("Enter the amount of waste recycled: "))
waste_left = calculate_waste_left(total_waste, waste_recycled)
# Round waste left to 2 decimal places
rounded_waste_left = round(waste_left, 2)
output_message = "After recycling, there will be {} kilograms of waste left.".format(
    rounded_waste_left
)
print(output_message)
The program pauses twice to wait for input; I entered the values 100 and 80
here. This is what it outputs:
Enter the total amount of waste:
100
Enter the amount of waste recycled:
80
After recycling, there will be 20 kilograms of waste left.
Note that this example incorporates a few built-in functions (in order of
appearance): abs(), int(), input(), round(), format(), and print(). Here's a
quick summary of these functions: we have already discussed print(), and as
evident, input() helps with engaging users by taking input from the console.
The int() function, as seen previously in type conversion, converts the input
string into an integer so we can perform calculations on it. The abs() function
finds the absolute value of the difference between total waste and waste
recycled. Lastly, the round() function takes two arguments: the value and the
number of decimal places to which you want to round it.
As you can see, functions, whether they are built-in or defined by you, are
very helpful for coding. They let you encapsulate blocks of code that perform
specific tasks in one place, and then call that code whenever you want to
perform the task. And that's it for the basics of Python! Let's recap what
we've learnt before we move on to some more complex topics.
Summary
In this chapter, we started Python programming and got a basic
understanding of its syntax and many of its components. Right now, you
should have a comprehensive understanding of Python's syntax, its data
types, and the concept of indexing, control flow structures, data structures,
and functions.We began with the basics of Python’s syntax: print statements
and variables which are essential for output and data storage, respectively.
We then discussed operators and expressions, fundamental elements for
performing operations and constructing expressions in Python.Next, we
explored the various data types you can utilize in Python. We looked into the
usage and properties of strings, integers, floats and booleans. We have also
seen how to manipulate these data types in Python to extract information and
efficiently conduct operations. After that, we addressed the topic of indexing
in Python, an invaluable technique for accessing elements in a data structure.
Using various examples, we showed the significance of correct indexing for
handling data effectively, as it's vital to understand the concept behind it.
When it comes to data structures, we looked at lists, dictionaries, sets, and
tuples. For each data structure, we outlined its characteristics, demonstrated
how to work with them, and discussed their uses in Python programming.
Our focus then shifted to control flow structures, emphasizing conditional
statements and loops. Lastly, we discussed functions in Python. We started
by creating our own functions, highlighting how to encapsulate reusable
pieces of code. We also reviewed some of Python's built-in functions, like
print, input, and type, that Python natively provides for various tasks.
Throughout this chapter, we utilized our greenFuture example to make each
topic more practical to help you learn. This exploration of Python's
fundamentals sets the stage for more advanced topics and future projects. In
the next chapter, you’ll see the basics of some popular libraries for data
analysis with Python.
12 Analyzing data with NumPy &
Pandas
Join our book community on Discord
https://packt.link/EarlyAccessCommunity
After getting to know the Python basics, it's time to dive into some data
analysis with Python. Python has become the language of choice for data
analysis, largely due to its rich ecosystem of libraries. Python's built-in
data structures, like lists and tuples, do not offer satisfactory performance
and functionality when it comes to numerical analysis of large and complex
datasets.
This is where two very popular libraries that are widely used in the data
community come in: NumPy and Pandas. NumPy is a library that provides a
high-performance multidimensional array object for efficient computation on
arrays and matrices. By comparison, lists in Python are not good enough for
numerical operations, especially when the data size is large. Pandas provides
high-level data structures like Series and DataFrame, built on top of NumPy
arrays, along with many tools for data analysis tasks, such as handling missing
data, merging/joining, reshaping, and pivoting datasets. The standard built-in
Python data structures would require significant custom code for similar tasks.
Our Python foundation is still quite fresh, so we will not cover these
libraries in great depth. These topics each deserve their own books, and we're
only spending one chapter to familiarize you with them and get you curious to
learn more about them.
This is what we’ll cover:
Fundamentals of NumPy
Basic NumPy operations
Statistical and mathematical operations
Multi-dimensional Arrays
Fundamentals of Pandas
Series and DataFrame
Loading data with Pandas
Data cleaning and preparation
Data analysis and visualization
You probably can’t wait any longer, so let’s get started with NumPy right
away!
Introduction to NumPy
NumPy is a Python library that is short for Numerical Python. This library is
the cornerstone for numerical computing in Python. It provides support for
arrays, matrices and many mathematical functions to operate on these data
structures. It gives us highly optimized arrays and mathematical functions.
Unlike Python lists, NumPy arrays are homogeneous and allow for vectorized
operations, making computations faster. You would not need this for small
amounts of data and simple operations, but it is necessary when dealing with
large datasets or performing complex mathematical operations. And since this
is a common scenario in data analysis, let's see how we can start using it.
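The installation command isn't shown above; the standard way is to install NumPy with pip:
pip install numpy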
You should type this on the command line, press enter and ta-da. Now we
have NumPy installed. We can go ahead and use it in our Python programs
by importing it. It's a common practice to import NumPy as np for ease of
use:
import numpy as np
The as np gives an alias for our numpy module. Whenever you need to use
the module, you’ll only have to type np and not the full word numpy . With
NumPy now at our fingertips, we're all set to delve into some basic
operations.
Creating Arrays
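The creation snippet isn't reproduced here; a minimal sketch, assuming the week of plastic recycling figures that produce the output below:
plastic_recycled = np.array([200, 300, 400, 250, 600, 350, 275])
print(plastic_recycled)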
We first create it and after that we print it. This is the output:
[200 300 400 250 600 350 275]
We have seen that we can use indexing and slicing on Python lists. Of course
we can do this on the NumPy arrays as well. Accessing elements in a NumPy
array is similar to indexing a list in Python. Let's retrieve the amount of
plastic recycled on the first and last day of the week:
first_day = plastic_recycled[0]
last_day = plastic_recycled[-1]
print(f"First day: {first_day}, Last day: {last_day}")
Output:
First day: 200, Last day: 275
With NumPy, we can also slice and dice the data to get specific portions of
the array. Here’s how to slice the array to get the data for the weekdays only:
weekdays_recycled = plastic_recycled[:5]
print(weekdays_recycled)
Just like Python lists, when we don’t specify the start value, it starts at the
beginning. The end index is 5, meaning the last index it will print is index 4:
[200 300 400 250 600]
There are also operations that are not available for plain Python lists but
that we can use easily with NumPy arrays. Let's explore those next!
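The two lines referred to below aren't shown; a minimal sketch, assuming NumPy's sum and mean functions applied to the plastic_recycled array (the variable names are illustrative):
total_recycled = np.sum(plastic_recycled)
average_recycled = np.mean(plastic_recycled)
print(f"Total: {total_recycled}, Average: {average_recycled}")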
With just two simple lines of code, we've computed the total and average
recycling for the week. Let's see another example with some more statistical
operations. We’ll calculate some basic statistics for the amount of glass
recycled over a week:
glass_recycled = np.array([150, 200, 250, 300, 200, 100, 50])
# Calculating mean, median and variance
mean_glass = np.mean(glass_recycled)
median_glass = np.median(glass_recycled)
variance_glass = np.var(glass_recycled)
print(f"Mean: {mean_glass}, Median: {median_glass}, Variance: {v
ariance_glass}")
We are calculating the mean, median (middle value) and the variance (degree
of spread). Here is the output:
Mean: 178.57142857142858, Median: 200.0, Variance: 6326.530612244898
With just a few lines of code, we've managed to compute some fundamental
statistics. We would have to check with NumPy, but it’s probably statistically
significant how easy that was! There are other cool features of NumPy that
are worth mentioning. The basic mathematical operations can also be applied
to complete arrays. Let’s see how to do that next.
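The arithmetic snippet isn't reproduced here; a minimal sketch that matches the output below, reusing the plastic and glass arrays (the variable names are assumptions):
plastic_recycled = np.array([200, 300, 400, 250, 600, 350, 275])
glass_recycled = np.array([150, 200, 250, 300, 200, 100, 50])
total_daily = plastic_recycled + glass_recycled
difference_daily = plastic_recycled - glass_recycled
multiplied_daily = plastic_recycled * glass_recycled
print(f"Total recycled daily: {total_daily}")
print(f"Difference recycled daily: {difference_daily}")
print(f"Multiplied recycled daily: {multiplied_daily}")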
We’ve added the NumPy arrays on top of this code snippet again to make it a
bit easier to process what is happening. As you can see, it adds, subtracts and
multiplies the values on the same indices of each array. To be fair, we’re not
sure how we would need the multiplication in this concrete example. It’s just
there to show you what it does and how easy it is to use. This is the output:
Total recycled daily: [350 500 650 550 800 450 325]
Difference recycled daily: [ 50 100 150 -50 400 250 225]
Multiplied recycled daily: [ 30000 60000 100000 75000 120000 35000 13750]
To sum this up, we dare to say that math with NumPy is as easy as
np.array([1, 2, 3]) . Let's look at something that's typically more complex to
work with (mainly because your brain needs to keep track of more things at
once): multidimensional arrays.
Multi-dimensional Arrays
Let's spice it up a bit and see what we can do with multidimensional arrays.
Simply put, these are arrays of arrays. For a 2D array, this means that the
outer array contains arrays and the inner arrays contain non-array elements.
They allow us to represent different kinds of data in a structured form. Let’s
see how we can create one.
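The creation snippet isn't shown here; a minimal sketch of a 2D array with one row of daily figures per city (the exact values are assumptions):
recycled_2d = np.array([
    [200, 300, 400, 250, 600, 350, 275],  # city 1
    [150, 200, 250, 300, 200, 100, 50]    # city 2
])
print(recycled_2d)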
As you can see, it’s simply two arrays inside one array. The first inner array
represents the first city and the second inner array represents the second city.
Let’s see how to access the elements in this 2D array.
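For instance, here is the first city's value for the first day (a sketch based on the array above):
city_1_day_1_recycled = recycled_2d[0][0]
print(city_1_day_1_recycled)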
And here is the value of recycled material on the third day of the second city:
city_2_day_3_recycled = recycled_2d[1][2]
print(city_2_day_3_recycled)
As you can see, with the first [1] we access the elements of the outer array,
which is an array in itself. With the second [2] we access the elements of
this inner array. Let’s see how we can read the data from a CSV file and
analyze this with the use of NumPy.
Suppose we have the weekly recycle data stored in a CSV file. Here’s what
the file weekly_recycling_data.csv could look like:
Week,Plastic,Glass,Aluminium
1,200,150,300
2,250,180,350
3,220,160,310
4,240,170,320
5,230,155,305
6,210,145,290
7,205,140,285
8,235,165,315
9,245,175,325
10,225,160,310
We can easily load this data into a NumPy array using the np.loadtxt()
function. It could be a lot bigger and it would work just the same.
weekly_data = np.loadtxt('weekly_recycling_data.csv', delimiter=',', skiprows=1, usecols=(1, 2, 3))
print(weekly_data)
With the data loaded, let's calculate the total and average for each type
(plastic, glass and aluminium) per week:
total_recycling_weekly = np.sum(weekly_data, axis=0)
average_recycling_weekly = np.mean(weekly_data, axis=0)
print(f"Total recycling weekly: {total_recycling_weekly}")
print(f"Average recycling weekly: {average_recycling_weekly}")
Output:
Total recycling weekly: [2260. 1600. 3110.]
Average recycling weekly: [226. 160. 311.]
And that was just a few seconds work! At this point, we've only just
scratched the surface of what NumPy can do. With a bit of practice, you'll
find it to be an ally in your data analysis work. For now this is enough. Let’s
move from the structured world of NumPy to the tabular world of Pandas!
These libraries are very commonly combined and you'll see how they
complement each other in doing your data analysis tasks. So let’s go!
Introduction to Pandas
Now that we’re familiar with the basics of NumPy, let’s take the next step.
Let’s learn about Pandas. Pandas is great for data manipulation and analysis.
With Pandas we have access to many functions and structures to support our
data analysis tasks. It is a high-level data manipulation tool, built on the
Numpy package. Its key data structure is called DataFrame, which you can
think of as an in-memory 2D table (like a spreadsheet), with labeled axes
(rows and columns). This not only allows for the storage of data but also the
manipulation and analysis of it. We’re not really exaggerating when we state
that it's a data analyst's best friend. Let’s see how to get our system ready for
using Pandas.
Installing and Importing Pandas
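The installation and import lines aren't reproduced here; the standard approach is to install Pandas with pip:
pip install pandas
and then import it in your Python code with an alias:
import pandas as pd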
The alias that is the convention to use for Pandas is pd. Now that we are all
set, let's create some data structures!
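The Series snippet isn't shown above; a minimal sketch, assuming the weekly plastic figures from before:
plastic_recycled = pd.Series([200, 300, 400, 250, 600, 350, 275])
print(plastic_recycled)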
As you can see we have all the values in our list, and they’re all given a row
number. Now, let's create a DataFrame (which is pretty much a collection of
Series) to represent the amount of different materials recycled over a week:
recycled_data = {
    'Plastic': [200, 300, 400, 250, 600, 350, 275],
    'Glass': [150, 200, 250, 300, 200, 100, 50],
    'Aluminium': [100, 150, 200, 150, 300, 200, 175]
}
recycled_df = pd.DataFrame(recycled_data)
print(recycled_df)
We have to create a dictionary that maps each key (the column name) to a list
(the values for each row in that column). We could have done that directly in
the pd.DataFrame() call, as we did with pd.Series() . Instead, we split it up
and create the dictionary separately for readability.
Here is what the above code snippet will output:
Plastic Glass Aluminium
0 200 150 100
1 300 200 150
2 400 250 200
3 250 300 150
4 600 200 300
5 350 100 200
6 275 50 175
Reading data from files is straightforward with Pandas. Let's assume we have
a CSV file named recycling_data.csv . Here’s the content of the CSV file:
Date,Plastic,Glass,Aluminium,Paper
2023-01-01,200,150,,300
2023-01-08,250,,350,400
2023-01-15,220,160,310,
2023-01-22,240,170,320,450
2023-01-29,230,155,305,420
2023-02-05,,145,290,410
2023-02-12,205,140,285,400
2023-02-19,235,165,315,430
2023-02-26,245,175,325,440
2023-03-05,225,160,310,425
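The loading line isn't reproduced here; reading the CSV into a DataFrame looks like this (the variable name recycling_data_csv is the one used in the rest of this section):
recycling_data_csv = pd.read_csv('recycling_data.csv')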
This command loads the data from the CSV file into a DataFrame object. Let's
see how we can view and inspect the content.
Once you have loaded your data, it's often a good practice to take a peek at it
to understand its structure, contents, and how the data is organized. Pandas
provides several methods to do just that. Here’s how you can see the first 5
rows:
print(recycling_data_csv.head())
As you can see, we have rows and columns. There are also a few NaN (not a
number). This is missing data. We can also view the last 5 rows. Here’s how
to do that:
print(recycling_data_csv.tail())
You can tell by the row numbers that these are the last rows. If we want to
get a quick summary, we can use the info() method:
print(recycling_data_csv.info())
Amazing what you can do with Pandas and a few basic methods. The ease
with which we loaded and inspected our data is just a teaser of the power of
Pandas. Let’s see a few more neat things we can do with Pandas to up our
data manipulation and analysis game.
Before we can use data, it usually needs to be cleaned up. Let's deal with a
very common cleaning step: handling missing data.
Missing data is present in most datasets you’ll ever get to work with. Pandas
makes it easy to identify and handle such missing data. We have some
missing values in our recycling data. Let’s first find the missing data:
missing_data = recycling_data_csv.isnull()
print(missing_data)
In the snippet above, isnull() identifies missing data. When we print the
result, you can see it has the value True wherever data is missing:
Date Plastic Glass Aluminium Paper
0 False False False True False
1 False False True False False
2 False False False False True
3 False False False False False
4 False False False False False
5 False True False False False
6 False False False False False
7 False False False False False
8 False False False False False
9 False False False False False
We can fill the missing data with the preceding values. Here's how to do that:
filled_data = recycling_data_csv.ffill()
print(filled_data)
This solves the problem for the missing values in the Glass, Paper, and Plastic
columns, but not for the Aluminium column. This is because the missing entry in
the Aluminium column is in the first row, so there is no preceding value. We
could also choose to fill the empty data with the value that follows. Here's
how to do that:
filled_data_b = recycling_data_csv.bfill()
print(filled_data_b)
We now set the missing values to the value that follows. That is one approach
to dealing with missing data. You could also opt to drop the rows with missing
data entirely; here's how:
clean_data = recycling_data_csv.dropna()
Rows 0, 1, 2, and 5 are gone, because they all had missing data:
Date Plastic Glass Aluminium Paper
3 2023-01-22 240.0 170.0 320.0 450.0
4 2023-01-29 230.0 155.0 305.0 420.0
6 2023-02-12 205.0 140.0 285.0 400.0
7 2023-02-19 235.0 165.0 315.0 430.0
8 2023-02-26 245.0 175.0 325.0 440.0
9 2023-03-05 225.0 160.0 310.0 425.0
It is worth noting that these methods don't alter the original object, which is
why we assign the results to new variables ( filled_data , filled_data_b and
clean_data ). Which one you choose depends on what you need. We will continue
with the bfill() data; from here on, assume it has been assigned back to
recycling_data_csv . We can also add columns and transform the data with
Pandas. Let's go there next.
Data Transformation
Transforming data is a bit like sculpting; we start with a rough raw block and
chisel away and reveal the form within. Here's how you can transform data
with Pandas. We’ll start by adding a new column. This new column should
represent the total recycling of the materials for that week. So we want to add
the values. In order to add the values, we need to exclude the Date column:
numeric_data = recycling_data_csv.drop(columns='Date')
print(numeric_data)
We then use this numeric_data to calculate the sum for each week and add it as
a column Total to our recycling_data_csv object.
recycling_data_csv['Total'] = numeric_data.sum(axis=1)
print(recycling_data_csv)
Please note that this doesn't change the CSV file, only the object in memory
that represents the CSV. This is the updated table:
Date Plastic Glass Aluminium Paper Total
0 2023-01-01 200.0 150.0 350.0 300.0 1000.0
1 2023-01-08 250.0 160.0 350.0 400.0 1160.0
2 2023-01-15 220.0 160.0 310.0 450.0 1140.0
3 2023-01-22 240.0 170.0 320.0 450.0 1180.0
4 2023-01-29 230.0 155.0 305.0 420.0 1110.0
5 2023-02-05 205.0 145.0 290.0 410.0 1050.0
6 2023-02-12 205.0 140.0 285.0 400.0 1030.0
7 2023-02-19 235.0 165.0 315.0 430.0 1145.0
8 2023-02-26 245.0 175.0 325.0 440.0 1185.0
9 2023-03-05 225.0 160.0 310.0 425.0 1120.0
If we decide that we want to rename the column Total, we can. Here’s how to
do it:
recycling_data_csv.rename(columns={'Total': 'Total Recycled'}, inplace=True)
The last column will have the name "Total Recycled" now. The inplace=True makes
the operation happen directly on recycling_data_csv , which is why we don't
need to assign the result to another (or the same) variable for the
modification to take effect. This is more memory efficient if you only need the
modified version of the DataFrame, because only one copy needs to be kept in
memory.
At this point, the data happens to be sorted by date and given the row numbers
0 to 9 accordingly. We can also choose to sort it differently. Let's say we
want to sort it based on the Total Recycled value:
sorted_data = recycling_data_csv.sort_values(by='Total Recycled')
print(sorted_data)
This creates a new DataFrame with the rows in a different order; this is what
it looks like:
Date Plastic Glass Aluminium Paper Total Recycled
0 2023-01-01 200.0 150.0 350.0 300.0 1000.0
6 2023-02-12 205.0 140.0 285.0 400.0 1030.0
5 2023-02-05 205.0 145.0 290.0 410.0 1050.0
4 2023-01-29 230.0 155.0 305.0 420.0 1110.0
9 2023-03-05 225.0 160.0 310.0 425.0 1120.0
2 2023-01-15 220.0 160.0 310.0 450.0 1140.0
7 2023-02-19 235.0 165.0 315.0 430.0 1145.0
1 2023-01-08 250.0 160.0 350.0 400.0 1160.0
3 2023-01-22 240.0 170.0 320.0 450.0 1180.0
8 2023-02-26 245.0 175.0 325.0 440.0 1185.0
The rows are now sorted in the new order. It is important to note that this
doesn't renumber the rows; the original index labels are kept. At this point we
have cleaned and transformed our data, which will help us better understand and
analyze it. So far, our cleaned-up data only exists in the memory of the
application. Let's see how we can save these results.
Saving Your Results
In order to continue to work on the data in a later stage, we need to save our
results into accessible formats. Here’s how you can save a DataFrame to a
new CSV file.
sorted_data.to_csv('cleaned_recycling_data.csv', index=False)
With the to_csv() method, we inscribe our cleaned data onto a new CSV file,
ready to be shared with the greenFuture council. The index=False parameter
ensures that the index column doesn't tag along uninvited. Here’s the content
of the newly created CSV file:
Date,Plastic,Glass,Aluminium,Paper,Total Recycled
2023-01-01,200.0,150.0,350.0,300.0,1000.0
2023-02-12,205.0,140.0,285.0,400.0,1030.0
2023-02-05,205.0,145.0,290.0,410.0,1050.0
2023-01-29,230.0,155.0,305.0,420.0,1110.0
2023-03-05,225.0,160.0,310.0,425.0,1120.0
2023-01-15,220.0,160.0,310.0,450.0,1140.0
2023-02-19,235.0,165.0,315.0,430.0,1145.0
2023-01-08,250.0,160.0,350.0,400.0,1160.0
2023-01-22,240.0,170.0,320.0,450.0,1180.0
2023-02-26,245.0,175.0,325.0,440.0,1185.0
Now we know how to store our DataFrames, let’s proceed to some basic data
analysis with Pandas.
Data Analysis
It’s time to do some basic data analysis with Python. Of course, this is the
very moment we’ve been building towards. Let’s start with a more elaborate
CSV file recycle_data_city.csv for this example:
Date,Plastic,Glass,Aluminium,Paper,City
2023-01-01,120.0,200.0,,300.0,New York
2023-01-08,150.0,250.0,100.0,350.0,New York
2023-01-15,130.0,,90.0,310.0,Los Angeles
2023-01-22,140.0,240.0,80.0,320.0,Los Angeles
2023-01-29,160.0,260.0,110.0,360.0,Chicago
We will use this CSV to work with. Let’s start by adding the Total column
again:
recycling_data_csv = pd.read_csv('recycle_data_city.csv')
recycling_data_csv['Total'] = recycling_data_csv[['Plastic', 'Glass', 'Aluminium', 'Paper']].sum(axis=1)
As you can see, it has some missing values. We could fill them or drop the
rows; in this case, we choose to leave them as is, which is something to keep
in mind when interpreting the results in the next few steps. The next step is
to group the data.
We can group the data by columns. In our example, it would make sense to
group the data by city. The code for this is quite intuitive:
grouped_data = recycling_data_csv.groupby('City')
This has grouped the data by city. In order to make sense of it, we need to
aggregate the data. Here’s how to do that:
aggregated_data = grouped_data.sum()
print(aggregated_data)
You can see that it grouped the data by city and added up the numbers to
display the totals. For the dates it did something less intuitive: it
concatenated them. It would make sense to drop the Date column altogether. Grouping and
aggregation allow us to slice and dice the data to uncover patterns and
insights. We can now easily find the city with the most recycling (but please
keep in mind there were missing values that we left as is):
top_city = aggregated_data['Total'].idxmax()
print(f'The city with the highest recycling is {top_city}.')
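Given the sample data above, this prints:
The city with the highest recycling is New York.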
In the snippets in this section, we grouped our data by city and then
aggregated it to find the total recycling per city. This way, we can easily
compare the recycling rates of different cities. Grouping and aggregation
really help us make sense of the data!
There's one more topic to discuss at this point. In order to present your data,
data visualization is the way to go. Of course, this can be done with Python
too. Let's see how we can get started with data visualization with Pandas.
Data Visualization
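The plotting snippet referenced below isn't reproduced here; a first attempt likely looked something like this, reusing the aggregated_data from the previous section:
aggregated_data['Total'].plot(kind='bar')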
To visualize our data, Pandas uses Matplotlib beneath the surface. In order to
run the above snippet, we need to make sure Matplotlib is installed. Here's how
to do that:
pip install matplotlib
This installs Matplotlib, and now the above code snippet can run. There won't
be any visible output if you run this code as a plain script in VS Code; if you
run it in a Jupyter notebook, the plot is shown automatically. To get some
visuals from a script, we need to use Matplotlib explicitly in our code.
Matplotlib is a Python library specialized in plotting data. Here's the updated
version that results in a visual:
import pandas as pd
import matplotlib.pyplot as plt # Importing Matplotlib
recycling_data_csv = pd.read_csv('recycle_data_city.csv')
recycling_data_csv['Total'] = recycling_data_csv[['Plastic', 'Glass', 'Aluminium', 'Paper']].sum(axis=1)
grouped_data = recycling_data_csv.groupby('City')
aggregated_data = grouped_data.sum()
ax = aggregated_data['Total'].plot(kind='bar', color='skyblue', figsize=(10, 6))
ax.set_title('Total Recycled per City')
ax.set_xlabel('City')
ax.set_ylabel('Total Recycled (kg)')
plt.show() # Displaying the plot
Again, this is just a very quick introduction showing only a very small part of
what you can do with Pandas, NumPy, and Matplotlib. And even that small part
already gives us an amazing toolkit for creating descriptive statistics and for
grouping, aggregating, and visualizing data. You're really at the next level
right now. Let's end this chapter with a summary.
Summary
In this chapter, we learnt about data analysis with NumPy and Pandas, two of
Python's most powerful libraries. We kicked things off with NumPy, diving
into its fundamentals and exploring the ease of creating arrays and
performing basic operations. We’ve also seen how to use NumPy for
statistical and mathematical operations. After that, we touched upon
multidimensional arrays. At that point, we were ready for Pandas. As we
transitioned to Pandas, we started to work with Series and DataFrames. These are
the core data structures that make Pandas a joy to work with. Loading data
with Pandas is easy, and it got us ready for the process of data cleaning and
preparation. We’ve seen how to use Pandas to handle missing data and
transform our data. We ended with a very basic example of how to use
Pandas combined with Matplotlib for visualization. This chapter was not
trying to teach you all the ins and outs of these libraries, but mainly to tell
you enough to understand how excited you should be about them. You can
probably imagine that using these libraries is an invaluable skill for data
analysts.
13 Introduction to Exploratory
Data Analysis
Join our book community on Discord
https://packt.link/EarlyAccessCommunity
1. Data cleaning typically comes first in the process. You will fill in any
missing values, eliminate duplicates, and fix any errors here. If you skip
this step, your subsequent analyses may contain noise and error. It's
important to remember that data cleaning can be an iterative process
revisited as your investigation progresses.
2. Data summarization, the next step, involves using descriptive statistics
like mean, median, variance, and standard deviation. This is a great
place to start looking for the first patterns or anomalies that require
additional investigation.
3. The next stage is data visualization, which is essential for
comprehending the underlying structure of the data. When compared to
just numerical summaries, visuals like histograms, box plots, and scatter
plots help the reader understand the data more quickly and intuitively.
They enable you to support or refute initial hypotheses and support the
development of new ones.
4. Insight Generation is the last step, bringing everything together. You
ought to be able to resolve your initial queries by fusing the numerical
summaries and the visual depictions, and you ought to be able to come
up with actionable or further-researchable insights. Additionally, you
can point out areas where more information might be required for a
more complete understanding.
It's crucial to keep track of your conclusions, presumptions, and any decisions
you make throughout this process. In addition to keeping your analysis
transparent, proper documentation leaves a trail for anyone who might later
review or expand on your work.
Knowing how EDA works is like having a map for a challenging journey. Using the
appropriate tools and a growth mindset guarantees that you examine all the
essential aspects of your data.
With this foundational knowledge in place, let's switch gears and discuss the
tools and methods that will enable you to perform EDA efficiently.
Univariate Analysis
It's time to explore the various layers of data analysis after navigating the
Exploratory Data Analysis (EDA) fundamentals and becoming familiar with
its basic tools and techniques. First up is a type of analysis called a univariate
analysis, which looks at just one variable at a time. Though it might seem
simple, don't undervalue its strength. Each variable's behavior, distribution,
and summary statistics can be thoroughly understood to provide profound
insights and direct subsequent, more intricate analyses.
This section will define Univariate Analysis and discuss why it's an essential
first step in any data analysis process. After that, we will divide our
discussion into two main groups: continuous variables and categorical
variables. You'll discover analysis and visualization methods for each
category, each one completing the picture your dataset represents.
By the end of this section, you'll be skilled
at examining single variables to discover their unique traits, behaviors, and
patterns. This crucial step in the EDA process lays the foundation for more
complex analyses like bivariate and multivariate analysis, which we will
examine later.
Density plots, which are slick iterations of histograms, are used to see how
the data distribution is shaped. Density plots provide a more continuous view
of the data distribution than histograms, which are constrained by the rigid
structure of bins. This makes comparing the distribution of several variables
at once simpler. Density plots can be used in market research to compare the
distribution of customer satisfaction ratings across various product categories
and pinpoint areas that require improvement. The following picture displays
an example of a visual with multiple density plots.
Figure 13.3: Distribution of weights of different categories visualized by
density plots
Histograms, box plots, and density plots are three tools that each provide a
different perspective on your continuous variables. The quality and
interpretability of your analysis can be significantly impacted by knowing
when to use which tool.After examining continuous variables, let's now look
at categorical variables, a different species in the data zoo. Understanding
them is equally important as they are frequently key to significant
classifications and groupings in your data.
Pie charts, which show each category's share of the total, provide an
at-a-glance view of your categorical data. Although they are visually
appealing, they work best when you have only a small number of categories. A
pie chart,
for instance, could show the proportion of customer feedback broken down
into three categories: positive, neutral, and negative. This would provide
stakeholders with a quick overview of the general customer mood.
Your specific requirements, the complexity of your data, and the message you
want to convey will determine which of these tools to use: bar charts, pie
charts, frequency tables, and so on. Your overall data analysis skills will
gain depth and breadth when you learn how to use these tools to analyze
categorical variables.
We are well-equipped to advance our exploratory journey now that we have
delved into the nuances of analyzing
categorical variables. The next step is bivariate analysis, where we'll look at
the connections between two variables and draw conclusions that one-
variable analyses could not.
Bivariate Analysis
After learning the foundations of exploratory data analysis (EDA) and diving
deep into univariate analysis, you now have the knowledge necessary to
comprehend each variable separately. It's time to broaden our focus and
investigate the relationships between various variables. Welcome to the world
of bivariate analysis, where each variable interacts with the others to reveal
insights that are frequently hidden in analyses with only one variable.We will
examine what bivariate analysis is in this section and why it is an essential
component of EDA. We'll discuss important ideas like Correlation vs.
Causation and explore several methods for observing and analyzing
relationships between two types of variables, including Continuous-
Continuous, Categorical-Categorical, and Continuous-Categorical.By the
time you finish reading this section, you will be able to comprehend each
variable independently and investigate how they interact to create patterns or
trends. You'll discover how to recognize correlations among variables and
develop theories for additional research, laying the groundwork for even
more complex multivariate analysis.
Correlation vs Causation
The distinction between correlation and causation is one of the most
important concepts to grasp in bivariate analysis. Correlation denotes a
statistical relationship between two variables, indicating that the other
variable tends to change similarly when one variable changes. Nevertheless,
correlation does not prove causation. Simply because two variables are
correlated does not imply that one is the cause of the other. The common error
of mistaking correlation for causation can lead to erroneous conclusions and
misguided actions.
Suppose, for instance, that you observe a strong positive correlation between
ice cream sales and the number of drowning
positive correlation between ice cream sales and the number of drowning
incidents over several years. While it may be tempting to believe that ice
cream sales cause an increase in drownings, it is more likely that both are
influenced by a third factor: the weather, specifically hot summer
temperatures. This is known as a "confounding variable," and when
interpreting correlations, it is essential to consider such possibilities. A
classic example is the correlated relationship between two logically unrelated
quantities, divorce rates and margarine consumption: despite the statistical
relationship, one does not cause the other.
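As a small illustration (a sketch with made-up numbers, not from the original text), pandas can compute the correlation coefficient between two columns directly; the number only measures how the variables move together and says nothing about causation:
import pandas as pd

df = pd.DataFrame({
    'ice_cream_sales': [100, 150, 200, 250, 300],
    'drowning_incidents': [2, 3, 4, 5, 6]
})
# Pearson correlation between the two columns; high correlation, no causation implied
print(df['ice_cream_sales'].corr(df['drowning_incidents']))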
Multivariate analysis
With a solid foundation in Univariate and Bivariate Analysis under our belts,
we are now at the crossroads of more complex data relationships: the realm
of Multivariate Analysis. As its name suggests, this method investigates
relationships between more than two variables, allowing us to explore and
decode complex interdependencies within our data sets.
Multivariate Analysis is not merely an extension of the bivariate techniques
covered thus far. It is a rich, multifaceted domain that can help uncover
insights that would otherwise remain concealed if only two variables were
considered. Multivariate Analysis provides the tools and methodologies
necessary to segment customers into distinct clusters based on their purchasing
habits or to understand how multiple factors simultaneously impact sales
revenue.
In this section, we will traverse the expansive terrain of Multivariate Analysis,
revealing its techniques and applications. We will discover how to manage
multiple variables without becoming overwhelmed and how to extract
concise, actionable insights from complex data scenarios.
Heatmaps
In multivariate analysis, heatmaps are a potent tool for illustrating the
relationships between more than two variables in an easily digestible format.
In a heatmap, data values are displayed as colors, providing a quick visual
summary that can reveal patterns, trends, and correlations in complex
datasets. Essentially, it is a table-like representation in which individual cells
are colored according to the variables' magnitude, with the color's intensity
typically denoting the value's relative prominence.
Figure 13.10: Heatmap depicting the relationships between food and health
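A minimal sketch of building a correlation heatmap with pandas and Matplotlib (the file and column names reuse the earlier recycling example and are assumptions here):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('recycle_data_city.csv')
corr = df[['Plastic', 'Glass', 'Aluminium', 'Paper']].corr()

plt.imshow(corr, cmap='viridis')  # color each cell by its correlation value
plt.colorbar()
plt.xticks(range(len(corr.columns)), corr.columns, rotation=45)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title('Correlation heatmap')
plt.show()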
Pair plots
Pair plots, also known as scatterplot matrices, are indispensable in
multivariate analysis for examining the pairwise relationships between
variables. Each variable in your dataset is paired with another variable in a
pair plot grid. A histogram or kernel density plot displaying the distribution
of the variable in question is typically displayed on the diagonal. Off the
diagonal, you will find scatter plots between pairs of variables, which reveal
any correlation between them.
Figure 13.11: Pair plot example done in python
Pair plots are especially useful when you have a moderate number of
variables and want to quickly determine whether any potential relationships
warrant further investigation. They are an excellent starting point before
diving into more detailed and specific analyses, such as regression models or
more complex multivariate techniques.
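A minimal sketch using pandas' built-in scatter_matrix (the dataset and columns reuse the earlier recycling example and are assumptions here):
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

df = pd.read_csv('recycle_data_city.csv')
# Scatter plots off the diagonal, histograms on the diagonal
scatter_matrix(df[['Plastic', 'Glass', 'Aluminium', 'Paper']], figsize=(8, 8), diagonal='hist')
plt.show()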
Summary
Exploratory Data Analysis (EDA), a cornerstone of data analytics, was
explored in this chapter. Understanding how EDA turns raw data into
actionable insights was our first step. The EDA process revealed a structured
approach to complex data sets. EDA tools and techniques include statistical and
graphical methods for data exploration.
Univariate Analysis—studying continuous and categorical variables—was
thoroughly examined. Histograms, box plots, and bar charts helped us understand
data features. We advanced to Bivariate Analysis to understand the
relationships between continuous and categorical variables. The chapter
concluded with Multivariate Analysis, where heatmaps and pair plots helped us
understand complex multivariable relationships.
After finishing this chapter, we should remember that EDA is a narrative
process that lets us meaningfully interact with data. The goal is to find the
data's story, which can inform decisions, policy, or even lives.
We are prepared to put theory into practice now that we have mastered the
foundations of EDA. The Exploratory Data Analysis Case
Study in the following chapter will give us a hands-on introduction to data
storytelling.
14 Data Cleaning
Join our book community on Discord
https://packt.link/EarlyAccessCommunity
By the end of this chapter, you should have a good grasp of what data
cleaning is and how important it is to the process of data analytics. With the
help of real-world examples and Python code snippets, you will be prepared
to take on the obstacles that data cleaning poses in the real world. You will be
able to convert unorganized, unclean data into a resource that is ready for
intelligent analysis. This chapter includes insightful information that will
improve your capacity to handle data precisely and confidently, regardless of
your level of experience.
Technical requirements
You can find all materials for this chapter in the following address:
https://github.com/PacktPublishing/Becoming-a-Data-Analyst-First-
Edition/blob/main/Chapter14-Data-Cleaning/Chapter-14-Data-
Cleaning.ipynb
Inconsistent formats
Dealing with mismatched formats is one of the frequent difficulties in data
cleaning. There are many different sources of data, each with its own format
for dates, monetary symbols, and other rules. Confusion can be caused by
these inconsistencies, which can also prevent data from integrating smoothly.
For data to be adequately processed and comprehended, it must be aligned to
a standard format. Early correction of format issues guarantees that the data
behaves consistently, allowing for more efficient and precise analysis. The
following is an example of fixing inconsistent dates in Python.
Standardizing dates
import pandas as pd
# Create a DataFrame with inconsistent date formats
df = pd.DataFrame({'date': ['2022-08-01', '01/08/2022', 'August 1, 2022']})
# Convert the date column to a uniform format
df['date'] = pd.to_datetime(df['date'])
print(df)
The next example deals with inconsistent phone number formats. Instructions:
1. Create DataFrame: Create a DataFrame with inconsistent phone
number formats.
2. Standardize Phone Numbers: Use the replace() method with regex
to remove all non-digit characters, standardizing the format.
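A minimal sketch following these instructions (the phone numbers are made up for illustration):
import pandas as pd
# Create a DataFrame with inconsistent phone number formats
df = pd.DataFrame({'phone': ['(123) 456-7890', '123.456.7890', '123-456-7890']})
# Remove all non-digit characters to standardize the format
df['phone'] = df['phone'].str.replace(r'\D', '', regex=True)
print(df)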
Duplicate records
An additional frequent problem in data cleaning is duplicate records. These
identical entries may result from incorrect system settings, data merging from
many sources, or repeated data entry. By overrepresenting particular data
points, duplicate records might skew analysis and produce biased results.
Careful investigation and the application of specific tools or algorithms are
frequently required to identify and eliminate duplicates. Addressing duplicate
records correctly ensures that each piece of information is appropriately and
individually represented, which helps create a more accurate and objective
picture of the data environment. The following are examples of fixing duplicate
records in Python.
Example 1: Removing all duplicates
# Create a DataFrame with duplicate records
df = pd.DataFrame({'A': [1, 2, 2, 3], 'B': [5, 6, 6, 7]})
# Remove duplicate rows
df = df.drop_duplicates()
print(df)
Example 3: Keeping the last occurrence of duplicates
df = pd.DataFrame({'A': [1, 2, 2, 3], 'B': [5, 6, 6, 7]})
# Remove duplicates but keep the last occurrence
df = df.drop_duplicates(keep='last')
print(df)
A more sophisticated method that keeps the size of the original dataset is
imputing, or filling in missing values. Various methods can be used, depending
on the type of data.
Mean/median/mode imputation: simple statistical measures like the mean, median,
or mode can be used to impute missing values. Although simple to use, this
method may alter the distribution of the original data, particularly if the
missingness is not random.
Another option, instead of removing rows, is to remove the columns containing
any missing values using dropna(axis=1). This is often applied when specific
features have many missing values and their removal won't affect the analysis
significantly.
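A quick sketch of that column-dropping approach (the values are made up); the chapter's mean-imputation example using scikit-learn's SimpleImputer follows right after it:
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, None, 22], 'Income': [50000, 70000, 80000, 60000]})
# Drop every column that contains at least one missing value
df_no_missing_columns = df.dropna(axis=1)
print(df_no_missing_columns)  # only the 'Income' column remains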
import pandas as pd
from sklearn.impute import SimpleImputer
# Creating data with missing values
data = {
    'Age': [25, 30, None, 22],
    'Income': [50000, 70000, 80000, None]
}
df = pd.DataFrame(data)
# Using mean imputation for missing values
mean_imputer = SimpleImputer(strategy="mean")
df_mean_imputed = pd.DataFrame(mean_imputer.fit_transform(df), columns=df.columns)
print(df_mean_imputed)
These code examples show different ways to deal with missing data. The
method you choose should match the type and nature of the missing data and
the goals of the analysis. In some situations, simple methods like removal or
mean/median imputation may be enough, but in others, more advanced
methods like K-Nearest Neighbors or iterative imputation may be needed.
Understanding the structure of the data and why values are missing is key to
choosing the right method and making sure that the way missing values are
handled doesn't introduce bias or change the results of the analysis.
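For instance, a minimal sketch of K-Nearest Neighbors imputation on the same small Age/Income example, using scikit-learn's KNNImputer:
import pandas as pd
from sklearn.impute import KNNImputer
# Reusing the small dataset with missing values
df = pd.DataFrame({
    'Age': [25, 30, None, 22],
    'Income': [50000, 70000, 80000, None]
})
# Each missing value is estimated from the 2 most similar rows
knn_imputer = KNNImputer(n_neighbors=2)
df_knn_imputed = pd.DataFrame(knn_imputer.fit_transform(df), columns=df.columns)
print(df_knn_imputed)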
In data cleaning, figuring out what kind of data is missing and how to handle
it is not just an important step. It's a key part of the analysis process that can
have a big effect on the conclusions that can be drawn from the data. Whether
the missing data is completely random or tied to things that can be seen or
even things that can't be seen, knowing what it is helps analysts deal with it in
a way that keeps the analysis's integrity. This nuanced way of dealing with
missing data is important for turning raw, imperfect data into insights that can
be used.
Dealing with duplicate values
Duplicate data, or entries that are repeated in a dataset, may seem unimportant, but it can lead to wrong conclusions and bad decisions.
Duplicates can cause numbers to be inflated, distributions to be off, and
relationships to be misrepresented. Human mistakes, system bugs, or
inconsistent data integration can cause them. To keep data accurate and
reliable, it is essential to understand why duplicates happen and set up strong
ways to find and get rid of them. This section talks about where duplicate
data usually comes from and how to find and get rid of it.
import pandas as pd
# Creating a DataFrame with duplicate data
data = {
    'Customer_ID': [101, 102, 103, 101, 104, 102],
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Bob'],
    'Address': ['123 Main St', '456 Pine St', '789 Oak St',
                '123 Main St', '321 Cedar St', '456 Pine St']
}
df = pd.DataFrame(data)
# Removing duplicates based on the 'Customer_ID' column
df_no_duplicates_by_id = df.drop_duplicates(subset=['Customer_ID'])
print("\nDataFrame after Removing Duplicates Based on 'Customer_ID':")
print(df_no_duplicates_by_id)
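Reusing the df above, a minimal sketch of keeping the most recent entry for each Customer_ID might look like this:
# Remove duplicate Customer_IDs but keep the last occurrence
df_keep_last = df.drop_duplicates(subset=['Customer_ID'], keep='last')
print(df_keep_last)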
By setting the keep parameter to 'last', we ensure that the last occurrence of each duplicate is retained in the DataFrame. This approach can be appropriate when handling time-stamped data, where the most recent entry represents the most accurate or relevant information. These scenarios show different ways to deal with duplicate data, mirroring the varied problems that real-world data presents. Whether the goal is to focus on specific columns or to choose which occurrences to keep, these methods let analysts clean data in a way that fits the needs of their analysis and give a more complete picture of how data cleaning works.
Types of outliers
In the vast world of data, outliers can show up in many ways. Each presents
its own challenges and needs a different way to be found and dealt with.
Understanding the different kinds of outliers is not just a matter of putting
them into groups; it is also a necessary step in figuring out how complicated
the data is. In this section, we look at the different kinds of outliers, such as
univariate, multivariate, and contextual outliers, laying the groundwork for
finding and treating them effectively.
Univariate Outliers: These are extreme values in the distribution of a single variable. For example, consider a dataset containing the heights of adult humans. Most values will typically fall within a range of 150 to 200 centimeters. However, a data entry error leads to one record showing a height of 500 centimeters. This value is a univariate outlier, as it is extreme within a single variable's distribution. Univariate outliers are easy to find and usually come from mistakes in data entry or rare but legitimate events. In this case, the outlier is probably a mistake, and statistical measures like the Z-score or graphs like a histogram can help find it quickly (a short Z-score sketch follows this list).
Multivariate Outliers: These are more complicated because they involve a combination of more than one variable. Imagine a dataset containing the weights and heights of a group of individuals. Most data points might form a consistent pattern where weight increases with height. However, a record showing a weight of 50 kilograms for a height of 180 centimeters might be considered a multivariate outlier. This outlier is not extreme in either variable individually, but the combination of weight and height is unusual. Detecting multivariate outliers often requires more sophisticated techniques, such as the Mahalanobis distance, which considers the relationship between multiple variables.
Global and Contextual Outliers: Global outliers deviate markedly from the rest of the dataset, while contextual outliers are unusual only within a certain subgroup or context.
In a dataset of monthly temperatures for a city, a reading far colder than anything ever recorded there would be a global outlier, since it deviates significantly from the entire dataset. Contextual outliers depend on the season: a value of -10 degrees Celsius might be normal in winter, but the same reading recorded in summer would be a contextual outlier.
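A minimal Z-score sketch for the height example (the sample values are assumed for illustration):
import pandas as pd
# Heights in centimeters, including one suspicious entry of 500 (assumed values)
heights = pd.Series([162, 168, 171, 155, 180, 174, 166, 177, 183, 159, 172, 500])
# Z-score: how many standard deviations each value lies from the mean
z_scores = (heights - heights.mean()) / heights.std()
# Flag values more than 3 standard deviations from the mean as potential outliers
print(heights[z_scores.abs() > 3])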
These examples show that outliers can look different and that it's essential to
know what kind of outlier you're dealing with so you can handle it correctly.
Even though univariate outliers might be easy to spot, multivariate and
contextual outliers often require a deeper understanding of how the data is
organized and how it relates to other data.
Impact on analysis
The presence of outliers is more than just a statistical curiosity; it can have a
big effect on how an analysis turns out. Outliers have a lot of different effects
on analysis, from simple statistical measures to complex predictive models.
This section goes into detail about how outliers can change results, confuse
visualizations, and affect how well a model works. This shows how important
it is to recognize and deal with these unusual observations.
The following illustrates how the presence of outliers can affect the skew of a distribution.
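As a minimal sketch (with assumed values), a single extreme value shifts the mean and the skewness of a distribution noticeably, while the median barely moves:
import pandas as pd
# Roughly symmetric sample (assumed values)
clean = pd.Series([48, 50, 51, 49, 52, 50, 47, 53, 50, 49])
# Same sample with one extreme value added
with_outlier = pd.concat([clean, pd.Series([500])], ignore_index=True)
print("mean:", clean.mean(), "->", with_outlier.mean())
print("median:", clean.median(), "->", with_outlier.median())
print("skew:", round(clean.skew(), 2), "->", round(with_outlier.skew(), 2))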
Handling inconsistencies
In the realm of data, inconsistencies are almost inevitable. They may stem
from human errors, technological glitches, or even well-intended variations in
data entry. These inconsistencies can disrupt the harmony of the dataset and
lead to misleading results. Consider the following scenario: A customer
database where country names are entered in different formats like "USA,"
"U.S.A.," and "United States." These inconsistencies can lead to confusion
and incorrect analysis.
import pandas as pd
# Example data
data = {'country': ['USA', 'U.S.A.', 'United States', 'UK', 'U.K.']}
df = pd.DataFrame(data)
# Standardizing the country names
df['country'] = (df['country']
                 .replace(['USA', 'U.S.A.'], 'United States')
                 .replace(['UK', 'U.K.'], 'United Kingdom'))
The code uses the replace() method to standardize the country names to a
consistent format, so the analysis treats all the variations as the same entity.
Addressing such inconsistencies involves standardizing the data to ensure
that it adheres to a standard format or structure. This may include text
cleaning, date format standardization, or handling variations in categorical
values.
Categorical responses can also be converted into a numerical binary format (0 and 1) to make them suitable for statistical modeling, using encoding techniques such as label encoding, one-hot encoding, or a simple mapping with the map() method. Such conversion allows algorithms to process categorical data efficiently.
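A minimal sketch of such a mapping (the survey column and values here are hypothetical):
import pandas as pd
# Hypothetical survey responses with inconsistent capitalization
df = pd.DataFrame({'subscribed': ['Yes', 'no', 'YES', 'No']})
# Normalize the text, then map the categories to a binary format
df['subscribed'] = df['subscribed'].str.lower().map({'yes': 1, 'no': 0})
print(df)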
Data validation
The accuracy of data analysis depends critically on data validation. Each data
point that enters our analytical pipelines must be authenticated and verified.
Without it, our findings and insights could be at risk of having errors and
inconsistencies that compromise their quality. We will explore the complex
facets of data validation in this section, highlighting its importance and going
in-depth on its various methodologies. Fundamentally, data validation is the process of confirming that a dataset satisfies the required quality standards and is free of errors. Consider it data quality assurance: double-checking the text and figures before approving them for further analysis. This step's significance cannot be overstated. Data is often called the 'new oil' that fuels decision-making, shapes strategy, and illuminates patterns. If that oil is tainted, the machinery of decision-making may stall or veer off course. Ensuring that data is valid before use prevents erroneous inferences and preserves the integrity of the data processing steps that come after.
Validation methods
The methods used to validate data are just as varied as the data landscape
itself. Let's examine some popular methods for validation and then immerse
ourselves in scenarios that illustrate them.
Range checks are the protective barriers ensuring data values stay within expected bounds. They're akin to setting minimum and maximum temperature thresholds on a thermostat.
Scenario: Imagine a school database storing students' grades. Logically, these grades should lie between 0 and 100. If a data entry lists a student's grade as 150, it's a clear indication of an anomaly. Range checks would flag this, ensuring such errors are highlighted for correction.
import pandas as pd
# Example data
data = {'grades': [95, 50, 150, 88, -10]}
df = pd.DataFrame(data)
# Applying range checks to flag invalid grades
df['invalid_grade'] = (df['grades'] < 0) | (df['grades'] > 100)
This code snippet adds a new column 'invalid_grade' that will be True if the grade is outside the 0-100 range, which helps in identifying and correcting anomalous entries.
Pattern matching examines the data's structure to make sure its format and layout conform to expectations. It is the skill of identifying a rhythm in data and ensuring that each data point moves in time with it.
Scenario: Consider an international organization collecting phone numbers from different countries. A U.S. phone number has a distinct pattern, different from a U.K. number. By employing pattern matching, the organization ensures each number adheres to its country-specific format, ensuring accuracy and uniformity. The example below applies the same idea to validating email addresses.
import re
import pandas as pd
# Example data (the first address is a valid example; the others are malformed)
data = {'email': ['[email protected]', 'sarah.example', 'david@example']}
df = pd.DataFrame(data)
# Defining a pattern for valid email addresses
pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
# Applying pattern matching to flag invalid emails
df['invalid_email'] = df['email'].apply(lambda x: not bool(re.match(pattern, x)))
The code defines a regular expression pattern that matches valid email addresses. It then applies this pattern to the 'email' column and creates a new column 'invalid_email' to flag any addresses that do not meet the pattern.
The cornerstone of data trust is consistency. Consistency checks guarantee the harmony of the data and make sure it doesn't convey conflicting messages.
Scenario: In a retail inventory, if the total number of items sold exceeds the initial stock without any replenishment, there's an inconsistency. Such discrepancies can skew analysis and projections, underscoring the need for rigorous consistency checks.
import pandas as pd
# Example data
data = {'initial_stock': [100, 50, 200],
        'sales': [50, 60, 150],
        'replenished': [0, 20, 50]}
df = pd.DataFrame(data)
# Checking if sales exceed available stock (initial plus replenished)
df['inconsistent_data'] = (df['sales'] > (df['initial_stock'] + df['replenished']))
This code checks whether the sales for each product exceed the sum of the
initial stock and replenished stock. If so, it flags the row as having
inconsistent data in a new column 'inconsistent_data'. These code
examples provide practical guidance on implementing essential data
validation techniques. They serve as robust tools for safeguarding the data's
quality and reliability, ensuring that it's primed for insightful analysis.
Summary
Data cleaning is a broad term for several tasks that are essential to the success
of data analytics projects. Every step, from dealing with missing values and
outliers to validating and normalizing data, has a big effect on the quality of
the insights that can be drawn from it. Focusing on the best ways to clean
data can lead to more useful and actionable results, which can help businesses
and organizations make better decisions. Data cleaning is about more than just preparation; it is at the heart of what makes data analytics such a useful tool in today's data-driven world.
17 Exploratory Data Analysis Case
Study
Join our book community on Discord
https://packt.link/EarlyAccessCommunity
Welcome to the Exploratory Data Analysis Case Study. This chapter will
provide you with hands-on experience in conducting a comprehensive
exploratory data analysis (EDA) on a real-world dataset. EDA is a
fundamental skill for any data analyst, as it enables you to comprehend the
subtleties of your data, recognize patterns, and make informed decisions. This
chapter will guide you through a case study that simulates a typical business
scenario. You will be provided with a dataset and business questions to
answer. This exercise lets you utilize various EDA techniques we’ve
previously covered to inform business strategies. At the conclusion of this
chapter, you will have a solid understanding of how to analyze data and
effectively communicate your findings. In this chapter, we're going to cover
the following main topics:
Technical Requirements
The dataset for the case study can be found in this book’s repository at the
following link: https://github.com/PacktPublishing/Becoming-a-Data-
Analyst-First-Edition/blob/main/Chapter-17-EDA-Case-
Study/MotionEase_Sales_Data.csv
Customer Segmentation
1. MotionEase wants to roll out a VIP program but isn't sure who to invite. Your task is to identify the top 10% of customers based on their spending. What makes these customers special? (A starter sketch follows this list.)
2. Customer retention is a big deal at MotionEase. The company wants to
reward customers who come back to make multiple purchases. Can you
identify these loyal customers and find out what keeps them coming
back?
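One possible starting point, as a minimal sketch only (the column names customer_id and purchase_amount are assumptions, not the actual MotionEase schema):
import pandas as pd
# Load the case-study data (column names below are assumptions)
df = pd.read_csv('MotionEase_Sales_Data.csv')
# Total spending per customer
spend = df.groupby('customer_id')['purchase_amount'].sum()
# Keep customers at or above the 90th percentile of total spending
vip_customers = spend[spend >= spend.quantile(0.90)].sort_values(ascending=False)
print(vip_customers.head())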
Product Analysis
1. In the vast inventory of MotionEase, some products are stars while others are not. Can you spotlight the top 5 best-selling products in each category and suggest why they might be the customer favorites? (A starter sketch follows this list.)
2. Returns are a headache for any retail business. MotionEase is no
exception. Your challenge is to identify products that are frequently
returned and hypothesize why this might be happening.
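Again as a minimal sketch only (category, product_name, and quantity are assumed column names, not the actual MotionEase schema):
import pandas as pd
# Load the case-study data (column names below are assumptions)
df = pd.read_csv('MotionEase_Sales_Data.csv')
# Units sold per product within each category
units = df.groupby(['category', 'product_name'])['quantity'].sum()
# Top 5 best-selling products in each category
top5_per_category = (units.sort_values(ascending=False)
                          .groupby(level='category')
                          .head(5))
print(top5_per_category)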
Summary
In this chapter, we conducted an in-depth case study to practice Exploratory
Data Analysis (EDA) skills. After introducing a business scenario involving MotionEase, we provided a dataset for investigation. We covered
various aspects of EDA through hands-on exercises. We answered questions
regarding, among other things, monthly sales trends, the impact of weekdays
on sales, and customer segmentation. Each question was presented in a
narrative format, and solutions included code explanations to ensure a
comprehensive understanding of the techniques employed. As we transition
into the next chapter, "Introduction to Data Visualization," we'll build upon
the foundational skills acquired in this chapter. Data visualization is a crucial
aspect of data analysis, as it provides a graphical representation of data to
make insights easier to comprehend. We will investigate various visualization
types and learn how to create them using Python libraries. Therefore, let's add
another layer of sophistication to our data analysis toolbox!