SQL Notes

Uploaded by Daniel Wu

Three SQL Concepts You Must Know to Pass the Data Science Interview

I’ve interviewed a lot of data scientist candidates and have found that
a lot of SQL interview questions for data science eventually boil down
to three generalized types of conceptual understanding.


If you’re an interviewing data scientist, these problems are a must-know!

They’re also great filter questions, testing more than the basic one-hour
primer study of a candidate who has read up on the difference between an
INNER and a LEFT JOIN. Here’s an example question from each.

1. Getting the first or last value for each user in a `transactions` table

`transactions`
+---------------+---------+
| user_id | int |
| created_at | datetime|
| product | varchar |
+---------------+---------+
Question: Given the user transactions table above,
write a query to get the first purchase for each user.

Why does this matter? How would you query for the first time a person
commented on a post? How do we cohort users by start date? All of these
analyses need this concept of querying based on the first or last time,
and it can definitely be solved without using an expensive partition
function.
Explanation:

We want to take a table that looks like this:

user_id | created_at | product
--------+------------+--------
123 | 2019-01-01 | apple
456 | 2019-01-02 | banana
123 | 2019-01-05 | pear
456 | 2019-01-10 | apple
789 | 2019-01-11 | banana

and turn it into this:

user_id | created_at | product
--------+------------+--------
123 | 2019-01-01 | apple
456 | 2019-01-02 | banana
789 | 2019-01-11 | banana

How do we get there?

We can solve this problem by doing a multi-column join.


First, how do we figure out the first time each user purchased? This can
be done with a simple GROUP BY, aggregating for the minimum datetime.
Notice how the table has a created_at column. This is the column that
determines which row is the first purchase for each specific user, so we
can write a query with an aggregation to get the minimum datetime for
every user.
SELECT user_id, MIN(created_at) AS min_created_at
FROM transactions
GROUP BY 1

Awesome. Now all we have to do is join this result back to the original
table on two columns: user_id and created_at. The self-join will
effectively filter for the first purchase. Then all we have to do is grab
all of the columns from the left-side table.

SELECT t.user_id, t.created_at, t.product
FROM transactions AS t
INNER JOIN (
    SELECT user_id, MIN(created_at) AS min_created_at
    FROM transactions
    GROUP BY 1
) AS t1
ON t.user_id = t1.user_id
AND t.created_at = t1.min_created_at
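As a sanity check, the query above can be run end to end with Python’s sqlite3 module and the sample rows from the walkthrough (the in-memory database and insert statements are just scaffolding for illustration, not part of the original exercise):

```python
import sqlite3

# In-memory database with the sample `transactions` rows from above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (user_id INT, created_at TEXT, product TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(123, "2019-01-01", "apple"),
     (456, "2019-01-02", "banana"),
     (123, "2019-01-05", "pear"),
     (456, "2019-01-10", "apple"),
     (789, "2019-01-11", "banana")],
)

# Self-join against the per-user minimum created_at to keep only the
# first purchase for each user.
rows = conn.execute("""
    SELECT t.user_id, t.created_at, t.product
    FROM transactions AS t
    INNER JOIN (
        SELECT user_id, MIN(created_at) AS min_created_at
        FROM transactions
        GROUP BY 1
    ) AS t1
      ON t.user_id = t1.user_id
     AND t.created_at = t1.min_created_at
    ORDER BY t.user_id
""").fetchall()
print(rows)
```

One caveat worth knowing: if a user somehow had two transactions at exactly the same minimum created_at, this self-join would return both rows; a ROW_NUMBER-based approach (covered later in these notes) avoids that edge case.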

2. Knowing the difference between a LEFT JOIN and INNER JOIN in practice

`users`
+---------+---------+
| id | int |
| name | varchar |
| city_id | int |<-+
+---------+---------+ |
|
|
`cities` |
+---------+---------+ |
| id | int |<-+
| name | varchar |
+---------+---------+
Question: Given the `users` and `cities` tables above,
write a query to return the list of cities without any
users.

Why does this matter? Anyone can memorize the definitions of an inner
join and a left join when asked during an interview, and the Venn diagram
provides an adequate explanation. But can the candidate actually apply
the difference in practice?
Explanation:
What is the actual difference between a LEFT JOIN and INNER JOIN?

INNER JOIN: returns rows when there is a match in both tables.
LEFT JOIN: returns all rows from the left table, even if there are no
matches in the right table.

Okay, so we know that each user in the users table must live in a city,
given the city_id field. However, the cities table doesn’t have a
user_id field. So if we run an INNER JOIN between these two tables on
the city_id, we’ll get only the cities that have users; all of the
cities without users will be filtered out.

SELECT cities.name, users.id
FROM cities
INNER JOIN users
ON users.city_id = cities.id

But what if we run a LEFT JOIN between cities and users?

cities.name | users.id
_____________|__________
seattle | 123
seattle | 124
portland | null
san diego | 534
san diego | 564

Here we see that since we are keeping all of the rows from the LEFT side
of the join, and there’s no match between the city of Portland and any
user that exists in the database, Portland shows up with a NULL user.
Therefore, all we have to do now is add a WHERE filter for rows where
the users-side value is NULL.

SELECT cities.name, users.id
FROM cities
LEFT JOIN users
ON users.city_id = cities.id
WHERE users.id IS NULL
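To verify the anti-join behavior yourself, here is a minimal sketch with Python’s sqlite3; the sample city and user rows are made up to match the result table above (Portland deliberately has no users):

```python
import sqlite3

# Minimal `users` and `cities` tables matching the article's schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INT, name TEXT, city_id INT)")
conn.execute("CREATE TABLE cities (id INT, name TEXT)")
conn.executemany("INSERT INTO cities VALUES (?, ?)",
                 [(1, "seattle"), (2, "portland"), (3, "san diego")])
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(123, "a", 1), (124, "b", 1), (534, "c", 3), (564, "d", 3)])

# LEFT JOIN keeps every city; filtering on users.id IS NULL keeps only
# the cities with no matching user (an "anti-join").
rows = conn.execute("""
    SELECT cities.name, users.id
    FROM cities
    LEFT JOIN users ON users.city_id = cities.id
    WHERE users.id IS NULL
""").fetchall()
print(rows)  # [('portland', None)]
```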

3. Aggregations with a conditional statement

`transactions`
+---------------+---------+
| user_id | int |
| created_at | datetime|
| product | varchar |
+---------------+---------+
Question: Given the same user transactions table as before,
write a query to get the total purchases made in the morning
versus afternoon/evening (AM vs PM) by day.

Why does this matter? If you can’t use conditional statements and/or
aggregate with conditional statements, there’s no way to run any kind of
analytics. How do you look at differences in populations based on new
features or variables?
Explanation:
Notice that whenever the question asks for a “versus,” we’re comparing
two groups. Every time we have to compare two groups we must use a
GROUP BY. It’s in the name. Heh.
In this case, we need to create a separate column to actually run our
GROUP BY on, which here is whether the created_at field falls in the AM
or the PM. So let’s create a condition in SQL to differentiate them.

CASE WHEN
HOUR(created_at) > 11
THEN 'PM' ELSE 'AM' END AS time_of_day

Pretty simple. We extract the hour from the created_at column and set
the new column value time_of_day to AM or PM based on this condition.
Now we just have to run a GROUP BY on the original created_at field
truncated to the day AND the new column we created that differentiates
each row. The last aggregation is then the output we want: total
purchases, via the COUNT function.

SELECT
DATE_TRUNC('day', created_at) AS date
, CASE WHEN
HOUR(created_at) > 11
THEN 'PM' ELSE 'AM' END AS time_of_day
, COUNT(*)
FROM transactions
GROUP BY 1,2
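DATE_TRUNC and HOUR are Postgres/MySQL-style functions, so the exact spelling varies by dialect. A runnable equivalent in SQLite (via Python’s sqlite3, with a few made-up sample rows) uses date() in place of DATE_TRUNC('day', …) and strftime('%H', …) in place of HOUR():

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (user_id INT, created_at TEXT, product TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, "2019-01-01 09:30:00", "apple"),
     (2, "2019-01-01 13:00:00", "pear"),
     (3, "2019-01-01 15:45:00", "pear"),
     (4, "2019-01-02 08:10:00", "banana")],
)

# SQLite spelling of the query above: date() truncates to the day, and
# CAST(strftime('%H', ...) AS INT) extracts the hour.
rows = conn.execute("""
    SELECT
        date(created_at) AS date,
        CASE WHEN CAST(strftime('%H', created_at) AS INT) > 11
             THEN 'PM' ELSE 'AM' END AS time_of_day,
        COUNT(*)
    FROM transactions
    GROUP BY 1, 2
    ORDER BY 1, 2
""").fetchall()
print(rows)
```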
RANK Function
The RANK function is used to rank rows based on the ordering given by the ORDER BY clause.
For example, if you want to find the name of the car with the third highest power, you can
use the RANK function.
Let’s see the RANK function in action:

SELECT name, company, power,
RANK() OVER(ORDER BY power DESC) AS PowerRank
FROM Cars

The script above ranks all the records in the Cars table in order of
descending power.

SELECT name, company, power,
RANK() OVER(PARTITION BY company ORDER BY power DESC) AS PowerRank
FROM Cars

In the script above, we partition the results by the company column. Now
for each company, the RANK is reset to 1.

DENSE_RANK Function
The DENSE_RANK function is similar to the RANK function; however,
DENSE_RANK does not skip any ranks when there is a tie between the ranks
of the preceding records. Compare the following two scripts.

SELECT name, company, power,
RANK() OVER(PARTITION BY company ORDER BY power DESC) AS PowerRank
FROM Cars

SELECT name, company, power,
DENSE_RANK() OVER(PARTITION BY company ORDER BY power DESC) AS DensePowerRank
FROM Cars
ROW_NUMBER Function
Unlike the RANK and DENSE_RANK functions, the ROW_NUMBER function simply
returns the row number of the sorted records, starting with 1. For
example, if the RANK and DENSE_RANK of the first two records in the
ORDER BY column are equal, both of them are assigned 1 as their RANK and
DENSE_RANK. However, the ROW_NUMBER function will assign values 1 and 2
to those rows without taking the fact that they are equal into account.
Execute the following script to see the ROW_NUMBER function in action.

SELECT name, company, power,
ROW_NUMBER() OVER(ORDER BY power DESC) AS RowRank
FROM Cars

From the output, you can see that the ROW_NUMBER function simply assigns
a new row number to each record irrespective of its value.
The PARTITION BY clause can also be used with the ROW_NUMBER function as
shown below:

SELECT name, company, power,
ROW_NUMBER() OVER(PARTITION BY company ORDER BY power DESC) AS RowRank
FROM Cars

Similarities between RANK, DENSE_RANK, and ROW_NUMBER Functions

The RANK, DENSE_RANK and ROW_NUMBER functions have the following similarities:
1- All of them require an ORDER BY clause.
2- All of them return an increasing integer with a base value of 1.
3- When combined with a PARTITION BY clause, all of these functions reset the returned
integer value to 1, as we have seen.
4- If there are no duplicated values in the column used by the ORDER BY clause, these
functions return the same output.
To illustrate the last point, suppose we have a table Cars1 in the ShowRoom database with
no duplicate values in the power column. Execute the following script:

SELECT name, company, power,
RANK() OVER(ORDER BY power DESC) AS [Rank],
DENSE_RANK() OVER(ORDER BY power DESC) AS [Dense Rank],
ROW_NUMBER() OVER(ORDER BY power DESC) AS [Row Number]
FROM Cars1
Difference between RANK, DENSE_RANK and ROW_NUMBER Functions
The only difference between the RANK, DENSE_RANK and ROW_NUMBER functions appears when
there are duplicate values in the column being used in the ORDER BY clause.
If you go back to the Cars table in the ShowRoom database, you can see it contains lots of
duplicate values. Let’s try to find the RANK, DENSE_RANK, and ROW_NUMBER of the Cars
table ordered by power. Execute the following script:

SELECT name, company, power,
RANK() OVER(ORDER BY power DESC) AS [Rank],
DENSE_RANK() OVER(ORDER BY power DESC) AS [Dense Rank],
ROW_NUMBER() OVER(ORDER BY power DESC) AS [Row Number]
FROM Cars

From the output, you can see that the RANK function skips the next N-1
ranks if there is a tie between N previous ranks. On the other hand, the
DENSE_RANK function does not skip ranks when there is a tie. Finally,
the ROW_NUMBER function has no concern with ranking; it simply returns
the row number of the sorted records. Even if there are duplicate values
in the column used in the ORDER BY clause, the ROW_NUMBER function will
not return duplicate values. Instead, it will continue to increment
irrespective of the duplicates.
select
sal,
RANK() over(order by sal desc) as Rank,
DENSE_RANK() over(order by sal desc) as DenseRank,
ROW_NUMBER() over(order by sal desc) as RowNumber
from employee
Output:
--------|-------|-----------|----------
sal |Rank |DenseRank |RowNumber
--------|-------|-----------|----------
5000 |1 |1 |1
3000 |2 |2 |2
3000 |2 |2 |3
2975 |4 |3 |4
2850 |5 |4 |5
--------|-------|-----------|----------
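The output table above can be reproduced with Python’s sqlite3 (window functions require SQLite 3.25+). The aliases rnk/dense_rnk/row_num are renamed here only to avoid quoting keywords; everything else mirrors the query above:

```python
import sqlite3  # window functions need SQLite >= 3.25

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (sal INT)")
conn.executemany("INSERT INTO employee VALUES (?)",
                 [(5000,), (3000,), (3000,), (2975,), (2850,)])

# The tied 3000s get RANK 2,2 (the next rank skips to 4), DENSE_RANK 2,2
# (the next rank is 3), and distinct ROW_NUMBERs 2 and 3.
rows = conn.execute("""
    SELECT sal,
           RANK()       OVER (ORDER BY sal DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY sal DESC) AS dense_rnk,
           ROW_NUMBER() OVER (ORDER BY sal DESC) AS row_num
    FROM employee
""").fetchall()
print(rows)
```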

select
  customer_id
, mon_purch
, row_number() over (order by mon_purch desc, customer_id)
, rank() over (order by mon_purch desc)
, dense_rank() over (order by mon_purch desc)
from customer_purchases
order by 2 desc, 3

Output:

customer_id  mon_purch  row_number  rank  dense_rank
6 400 1 1 1
4 300 2 2 2
8 300 3 2 2
1 200 4 4 3
3 200 5 4 3
7 200 6 4 3
2 150 7 7 4
5 100 8 8 5
SELECT employee_id,
       full_name,
       department,
       salary,
       salary / MAX(salary) OVER (PARTITION BY department ORDER BY salary DESC)
           AS salary_metric
FROM employee
ORDER BY 5;

Train_id | Station       | Time
---------+---------------+----------
110      | San Francisco | 10:00:00
110      | Redwood City  | 10:54:00
110      | Palo Alto     | 11:02:00
110      | San Jose      | 12:35:00
120      | San Francisco | 11:00:00
120      | Redwood City  | Non Stop
120      | Palo Alto     | 12:49:00
120      | San Jose      | 13:30:00

Suppose we want to add a new column called “time to next station”. To obtain
this value, we subtract the station times for pairs of contiguous stations. We
can calculate this value without using a SQL window function, but that can be
very complicated. It’s simpler to do it using the LEAD window function. This
function compares values from one row with the next row to come up with a
result. In this case, it compares the values in the “time” column for a station
with the station immediately after it.
So, here we have another SQL window function example, this time for the
train schedule:

SELECT
train_id,
station,
time as "station_time",
lead(time) OVER (PARTITION BY train_id ORDER BY time) - time
AS time_to_next_station
FROM train_schedule;

Note that we calculate the time to the next station with an expression
combining an individual column and a window function; this is not
possible with aggregate functions.

In the next example, we will add a new column that shows how much time
has elapsed from the train’s first stop to the current station. We will
call it “elapsed travel time”. The MIN window function will obtain the
trip’s start time, which we subtract from the current station time.
Here’s the next SQL window function example:

SELECT
  train_id,
  station,
  time as "station_time",
  time - min(time) OVER (PARTITION BY train_id ORDER BY time)
    AS elapsed_travel_time,
  lead(time) OVER (PARTITION BY train_id ORDER BY time) - time
    AS time_to_next_station
FROM train_schedule;
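Postgres can subtract time values directly; to try the same idea in SQLite (via Python’s sqlite3, using only the train 110 rows so every time is a real time), one workaround is converting times to seconds with strftime('%s', …). This is a sketch under those assumptions, not the article’s exact output:

```python
import sqlite3  # window functions need SQLite >= 3.25

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE train_schedule (train_id INT, station TEXT, time TEXT)")
conn.executemany(
    "INSERT INTO train_schedule VALUES (?, ?, ?)",
    [(110, "San Francisco", "10:00:00"),
     (110, "Redwood City", "10:54:00"),
     (110, "Palo Alto", "11:02:00"),
     (110, "San Jose", "12:35:00")],
)

# strftime('%s', ...) converts a time to seconds so we can subtract.
# LEAD(time) looks one row ahead within the train's partition; MIN(time)
# over the same window is the trip's first stop (a running minimum, which
# with ascending time order is always the first time).
rows = conn.execute("""
    SELECT
        station,
        (strftime('%s', time)
           - strftime('%s', MIN(time) OVER w)) / 60 AS elapsed_minutes,
        (strftime('%s', LEAD(time) OVER w)
           - strftime('%s', time)) / 60 AS minutes_to_next_station
    FROM train_schedule
    WINDOW w AS (PARTITION BY train_id ORDER BY time)
    ORDER BY time
""").fetchall()
print(rows)
```

The last station has no following row, so LEAD returns NULL and the final minutes_to_next_station is None.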


select emp_name, dealer_id, sales, avg(sales) over() as avgsales
from q1_sales;

+-----------------+------------+--------+-----------+
| emp_name        | dealer_id  | sales  | avgsales  |
+-----------------+------------+--------+-----------+
| Beverly Lang    | 2          | 16233  | 13631     |
| Kameko French   | 2          | 16233  | 13631     |
| Ursa George     | 3          | 15427  | 13631     |
| Ferris Brown    | 1          | 19745  | 13631     |
| Noel Meyer      | 1          | 19745  | 13631     |
| Abel Kim        | 3          | 12369  | 13631     |
| Raphael Hull    | 1          | 8227   | 13631     |
| Jack Salazar    | 1          | 9710   | 13631     |
| May Stout       | 3          | 9308   | 13631     |
| Haviva Montoya  | 2          | 9308   | 13631     |
+-----------------+------------+--------+-----------+
select emp_name, dealer_id, sales, avg(sales) over (partition by dealer_id) as avgsales
from q1_sales;

+-----------------+------------+--------+-----------+
| emp_name        | dealer_id  | sales  | avgsales  |
+-----------------+------------+--------+-----------+
| Ferris Brown    | 1          | 19745  | 14357     |
| Noel Meyer      | 1          | 19745  | 14357     |
| Raphael Hull    | 1          | 8227   | 14357     |
| Jack Salazar    | 1          | 9710   | 14357     |
| Beverly Lang    | 2          | 16233  | 13925     |
| Kameko French   | 2          | 16233  | 13925     |
| Haviva Montoya  | 2          | 9308   | 13925     |
| Ursa George     | 3          | 15427  | 12368     |
| Abel Kim        | 3          | 12369  | 12368     |
| May Stout       | 3          | 9308   | 12368     |
+-----------------+------------+--------+-----------+
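The contrast between an empty OVER () and OVER (PARTITION BY …) can be checked with Python’s sqlite3, reusing the same ten q1_sales rows (the DISTINCT and ROUND calls are added here only to keep the output compact):

```python
import sqlite3  # window functions need SQLite >= 3.25

# The q1_sales rows from the output tables above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE q1_sales (emp_name TEXT, dealer_id INT, sales INT)")
conn.executemany(
    "INSERT INTO q1_sales VALUES (?, ?, ?)",
    [("Beverly Lang", 2, 16233), ("Kameko French", 2, 16233),
     ("Ursa George", 3, 15427), ("Ferris Brown", 1, 19745),
     ("Noel Meyer", 1, 19745), ("Abel Kim", 3, 12369),
     ("Raphael Hull", 1, 8227), ("Jack Salazar", 1, 9710),
     ("May Stout", 3, 9308), ("Haviva Montoya", 2, 9308)],
)

# An empty OVER () makes the whole result set one window, so every row
# sees the same global average; PARTITION BY dealer_id restarts the
# window for each dealer.
global_avg = conn.execute(
    "SELECT AVG(sales) OVER () FROM q1_sales LIMIT 1").fetchone()[0]
per_dealer = conn.execute("""
    SELECT DISTINCT dealer_id,
           ROUND(AVG(sales) OVER (PARTITION BY dealer_id), 2) AS avgsales
    FROM q1_sales
    ORDER BY dealer_id
""").fetchall()
print(global_avg)   # 13630.5 (shown rounded to 13631 in the table above)
print(per_dealer)
```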