SQL Notes
I’ve interviewed many data scientist candidates and have found that many SQL
interview questions for data science eventually boil down to three general
concepts.
`transactions`
+------------+----------+
| user_id    | int      |
| created_at | datetime |
| product    | varchar  |
+------------+----------+
Question: Given the user transactions table above,
write a query to get the first purchase for each user.
Why does this matter? How would you query for the first time a person
commented on a post, or first read the post itself? How do we cohort users by
start date? All of these analyses rely on querying by first or last
occurrence, and the problem can definitely be solved without using an
expensive partition function.
Explanation:
First, group the transactions by user_id and take MIN(created_at) to get the
time of each user’s first purchase. Now all we have to do is join this result
back to the original table on two columns: user_id and created_at. The self
join will effectively filter for the first purchase. Then all we have to do is
select all of the columns from the left-side table.
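The two steps above can be sketched as follows. This is a minimal sketch rather than a definitive answer: it runs the query through Python’s built-in sqlite3 module so it can be executed end to end, and the sample rows are hypothetical (only the table and column names come from the notes).

```python
import sqlite3

# Hypothetical sample data; only the schema comes from the notes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (user_id INT, created_at TEXT, product TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, "2020-01-03 09:00:00", "socks"),
        (1, "2020-01-01 14:30:00", "hat"),
        (2, "2020-02-10 08:15:00", "shoes"),
        (2, "2020-02-11 19:45:00", "belt"),
    ],
)

first_purchase_sql = """
SELECT t.user_id, t.created_at, t.product
FROM transactions AS t
INNER JOIN (
    -- Step 1: earliest purchase time per user
    SELECT user_id, MIN(created_at) AS min_created_at
    FROM transactions
    GROUP BY user_id
) AS f
    -- Step 2: join back on both columns to keep only the first purchase
    ON t.user_id = f.user_id
   AND t.created_at = f.min_created_at
ORDER BY t.user_id
"""
first_purchases = conn.execute(first_purchase_sql).fetchall()
print(first_purchases)
```

Note that if a user has two transactions with the same earliest created_at, this join returns both rows.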
`users`
+---------+---------+
| id      | int     |
| name    | varchar |
| city_id | int     |<-+
+---------+---------+  |
                       |
`cities`               |
+---------+---------+  |
| id      | int     |<-+
| name    | varchar |
+---------+---------+
Question: Given the `users` and `cities` tables above,
write a query to return the list of cities without any
users.
Why does this matter? Anyone can memorize the definitions of an inner
join and a left join when asked during an interview. The Venn diagram provides
an adequate explanation. But can the candidate actually apply the
difference in practice?
Explanation:
What is the actual difference between a LEFT JOIN and an INNER JOIN?
Okay, so we know that each user in the users table must live in a city, given
the city_id field. However, the cities table doesn’t have a user_id field. If
we run an INNER JOIN between these two tables on users.city_id = cities.id,
we’ll get all of the cities that have users, and all of the cities without
users will be filtered out. If we instead run a LEFT JOIN from cities to
users, every city is kept:
cities.name | users.id
____________|_________
seattle     | 123
seattle     | 124
portland    | null
san diego   | 534
san diego   | 564
Here we see that because a LEFT JOIN keeps all of the rows from the LEFT side
table, and no user in the database matches the city of Portland, users.id
shows up as NULL for that city. Therefore all we have to do now is add a
WHERE filter that keeps the rows where a value from the users table is NULL.
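Here is a minimal sketch of that LEFT JOIN plus NULL filter, again run through Python’s built-in sqlite3 module. The user names are hypothetical; the city names and user ids mirror the small result table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cities (id INT, name TEXT);
CREATE TABLE users (id INT, name TEXT, city_id INT);
INSERT INTO cities VALUES (1, 'seattle'), (2, 'portland'), (3, 'san diego');
-- Hypothetical user names; nobody lives in portland
INSERT INTO users VALUES
    (123, 'ann', 1), (124, 'bob', 1), (534, 'cho', 3), (564, 'dee', 3);
""")

cities_without_users = conn.execute("""
SELECT cities.name
FROM cities
LEFT JOIN users
    ON users.city_id = cities.id   -- keep every city, matched or not
WHERE users.id IS NULL             -- keep only the unmatched cities
""").fetchall()
print(cities_without_users)
```

Swapping LEFT JOIN for INNER JOIN here would return no rows at all, since the unmatched city is dropped before the WHERE filter runs.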
`transactions`
+------------+----------+
| user_id    | int      |
| created_at | datetime |
| product    | varchar  |
+------------+----------+
Question: Given the same user transactions table as before,
write a query to get the total purchases made in the morning
versus afternoon/evening (AM vs PM) by day.
Why does this matter? If you can’t use conditional statements, or aggregate
with conditional statements, you can’t run most kinds of analytics. How do
you look at differences in populations based on new features or variables?
Explanation:
Notice whenever the question asks for a versus statement, we’re comparing
two groups. Every time we have to compare two groups we must use a
GROUP BY. It’s in the name. Heh.
In this case, we need to create a separate column to run our GROUP BY on:
whether each created_at value falls in the AM or the PM. Let’s create a
condition in SQL to differentiate them.
CASE WHEN
HOUR(created_at) > 11
THEN 'PM' ELSE 'AM' END AS time_of_day
Pretty simple. We extract the hour from the created_at column and set the
new column time_of_day to AM or PM based on this condition. Now we just have
to run a GROUP BY on the original created_at field truncated to the day AND
on the new column we created. The last aggregation is the output we want,
total purchases, which we get by running the COUNT function.
SELECT
    DATE_TRUNC('day', created_at) AS date
    , CASE WHEN
        HOUR(created_at) > 11
        THEN 'PM' ELSE 'AM' END AS time_of_day
    , COUNT(*) AS total_purchases
FROM transactions
GROUP BY 1, 2
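To try the AM versus PM aggregation end to end, here is a sketch using Python’s built-in sqlite3 module. SQLite has no DATE_TRUNC or HOUR function, so DATE() and strftime('%H', ...) stand in for them; that substitution and the sample rows are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (user_id INT, created_at TEXT, product TEXT)"
)
# Hypothetical sample rows
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, "2020-01-01 09:00:00", "hat"),
        (2, "2020-01-01 11:59:00", "socks"),
        (1, "2020-01-01 12:00:00", "belt"),
        (3, "2020-01-02 18:30:00", "shoes"),
    ],
)

am_pm_counts = conn.execute("""
SELECT
    DATE(created_at) AS purchase_date            -- truncate to the day
    , CASE WHEN
        CAST(strftime('%H', created_at) AS INTEGER) > 11
        THEN 'PM' ELSE 'AM' END AS time_of_day   -- hour 12 and later is PM
    , COUNT(*) AS total_purchases
FROM transactions
GROUP BY purchase_date, time_of_day
ORDER BY purchase_date, time_of_day
""").fetchall()
print(am_pm_counts)
```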
RANK Function
The RANK function assigns a rank to each row based on the ORDER BY clause.
For example, if you want to find the name of the car with the third-highest power, you can use the
RANK function.
Let’s see the RANK function in action. With ORDER BY power DESC and no PARTITION BY, RANK
ranks all the records in the Cars table in order of descending power. Adding a PARTITION BY clause
ranks the records within each group instead:
SELECT name,company, power,
RANK() OVER(PARTITION BY company ORDER BY power DESC) AS PowerRank
FROM Cars
In the script above, we partition the results by the company column, so for each company the RANK
restarts at 1.
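A small runnable sketch of the partitioned ranking, using Python’s built-in sqlite3 module. It assumes the bundled SQLite is version 3.25 or newer (the first release with window functions), and the car names, companies, and power values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Cars (name TEXT, company TEXT, power INT)")
# Hypothetical rows for illustration only
conn.executemany(
    "INSERT INTO Cars VALUES (?, ?, ?)",
    [
        ("car_a", "Acme", 200),
        ("car_b", "Acme", 150),
        ("car_c", "Acme", 100),
        ("car_d", "Bolt", 300),
        ("car_e", "Bolt", 120),
    ],
)

ranked_cars = conn.execute("""
SELECT name, company, power,
    RANK() OVER(PARTITION BY company ORDER BY power DESC) AS PowerRank
FROM Cars
ORDER BY company, PowerRank
""").fetchall()
for row in ranked_cars:
    print(row)  # the rank starts again at 1 for each company
```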
DENSE_RANK Function
The DENSE_RANK function is similar to the RANK function; however, DENSE_RANK does not skip
any ranks when there is a tie between preceding records.
ROW_NUMBER Function
The ROW_NUMBER function simply assigns a new row number to each record, irrespective of its
value.
The PARTITION BY clause can also be used with ROW_NUMBER function as shown below:
SELECT name, company, power,
ROW_NUMBER() OVER(PARTITION BY company ORDER BY power DESC) AS RowRank
FROM Cars
To summarize: the RANK function skips the next N-1 ranks when N records tie. The DENSE_RANK
function, on the other hand, does not skip ranks after a tie. Finally, the ROW_NUMBER function is
not concerned with ties. It simply returns the row number of the sorted records. Even if there are
duplicate values in the column used in the ORDER BY clause, the ROW_NUMBER function will not
return duplicate values. Instead, it will continue to increment irrespective of the duplicates.
select
    sal,
    RANK() over(order by sal desc) as Rank,
    DENSE_RANK() over(order by sal desc) as DenseRank,
    ROW_NUMBER() over(order by sal desc) as RowNumber
from employee
Output:
--------|-------|-----------|----------
sal |Rank |DenseRank |RowNumber
--------|-------|-----------|----------
5000 |1 |1 |1
3000 |2 |2 |2
3000 |2 |2 |3
2975 |4 |3 |4
2850 |5 |4 |5
--------|-------|-----------|----------
select
    customer_id
    , mon_purch
    , row_number() over (order by mon_purch desc, customer_id) as row_num
    , rank() over (order by mon_purch desc, customer_id) as rnk
    , dense_rank() over (order by mon_purch desc, customer_id) as dense_rnk
from customer_purchases
order by 2 desc, 3
Suppose we want to add a new column called “time to next station”. To obtain
this value, we subtract the station times for pairs of consecutive stations. We
can calculate this value without using a SQL window function, but that can be
very complicated. It’s simpler to do it using the LEAD window function. This
function returns a value from a following row so that it can be combined with
the current row. In this case, it pairs the value in the “time” column for a
station with the value for the station immediately after it.
So, here we have another SQL window function example, this time for the
train schedule:
SELECT
train_id,
station,
time as "station_time",
lead(time) OVER (PARTITION BY train_id ORDER BY time) - time
AS time_to_next_station
FROM train_schedule;
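The LEAD query can be tried with Python’s built-in sqlite3 module. This is only a sketch under two assumptions: the schedule rows are hypothetical, and times are stored as integer minutes after midnight so that plain subtraction works in SQLite (which needs version 3.25 or newer for window functions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE train_schedule (train_id INT, station TEXT, time INT)")
# Hypothetical schedule; time = minutes after midnight
conn.executemany(
    "INSERT INTO train_schedule VALUES (?, ?, ?)",
    [
        (110, "station_a", 600),
        (110, "station_b", 654),
        (110, "station_c", 662),
        (120, "station_a", 660),
        (120, "station_c", 709),
    ],
)

next_station_rows = conn.execute("""
SELECT
    train_id,
    station,
    time AS "station_time",
    lead(time) OVER (PARTITION BY train_id ORDER BY time) - time
        AS time_to_next_station
FROM train_schedule
ORDER BY train_id, time
""").fetchall()
for row in next_station_rows:
    print(row)  # the last stop of each train has no next row, so it gets NULL
```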
In the next example, we will add a new column that shows how much time has
elapsed from the train’s first stop to the current station. We will call it “elapsed
travel time”. The MIN window function obtains the trip’s start time, which we
subtract from the current station time. Here’s the next SQL window function
example:
SELECT
    train_id,
    station,
    time as "station_time",
    time - min(time) OVER (PARTITION BY train_id ORDER BY time)
        AS elapsed_travel_time,
    lead(time) OVER (PARTITION BY train_id ORDER BY time) - time
        AS time_to_next_station
FROM train_schedule;
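A runnable sketch of the elapsed-time column, using Python’s built-in sqlite3 module with the same assumptions as before: hypothetical rows, times stored as integer minutes after midnight, and SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE train_schedule (train_id INT, station TEXT, time INT)")
# Hypothetical schedule; time = minutes after midnight
conn.executemany(
    "INSERT INTO train_schedule VALUES (?, ?, ?)",
    [
        (110, "station_a", 600),
        (110, "station_b", 654),
        (110, "station_c", 662),
        (120, "station_a", 660),
        (120, "station_c", 709),
    ],
)

elapsed_rows = conn.execute("""
SELECT
    train_id,
    station,
    time - min(time) OVER (PARTITION BY train_id ORDER BY time)
        AS elapsed_travel_time
FROM train_schedule
ORDER BY train_id, time
""").fetchall()
for row in elapsed_rows:
    print(row)
```

With ORDER BY inside the window, min(time) is a running minimum up to the current row. Because the rows are sorted by ascending time, that running minimum is always the first stop’s time, so the result matches the minimum over the whole trip.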
select emp_name, dealer_id, sales,
    avg(sales) over (partition by dealer_id) as avgsales
from q1_sales;
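The query above keeps one row per employee and attaches the dealer-level average to each of them, which a plain GROUP BY cannot do. A minimal sketch with Python’s built-in sqlite3 module follows; the rows are hypothetical and SQLite 3.25 or newer is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE q1_sales (emp_name TEXT, dealer_id INT, sales INT)")
# Hypothetical sales figures
conn.executemany(
    "INSERT INTO q1_sales VALUES (?, ?, ?)",
    [
        ("ann", 1, 100),
        ("bob", 1, 300),
        ("cho", 2, 50),
        ("dee", 2, 150),
        ("eli", 2, 100),
    ],
)

avg_rows = conn.execute("""
SELECT emp_name, dealer_id, sales,
    avg(sales) OVER (PARTITION BY dealer_id) AS avgsales
FROM q1_sales
ORDER BY dealer_id, emp_name
""").fetchall()
for row in avg_rows:
    print(row)  # every row keeps its own sales next to the dealer average
```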
Window Function