SQL Notes

Uploaded by

Daniel Wu

Three SQL Concepts You Must Know to Pass the Data Science Interview
I’ve interviewed a lot of data scientist candidates and have found there are a
lot of SQL interview questions for data science that eventually boil down to
three generalized types of conceptual understandings.

If you’re a data scientist preparing for interviews, these problems are a must know!

They’re also great filter questions for checking that a candidate has gone
beyond a one-hour primer on the differences between an INNER JOIN and a
LEFT JOIN. Here’s an example question from each.

1. Getting the first or last value for each user in a `transactions` table.

Why does this matter? How would you query for the first time a person
commented on a post and read the post itself? How do we cohort users by
start date? All of these analyses need this concept of querying based on first
or last time and it definitely can be solved without using an expensive
partition function.
Explanation:

We want to take a table that looks like this:

user_id | created_at | product
--------+------------+--------
123     | 2019-01-01 | apple
456     | 2019-01-02 | banana
123     | 2019-01-05 | pear
456     | 2019-01-10 | apple
789     | 2019-01-11 | banana

and turn it into this:

user_id | created_at | product
--------+------------+--------
123     | 2019-01-01 | apple
456     | 2019-01-02 | banana
789     | 2019-01-11 | banana

How do we get there?

We can solve this problem by doing a multi-column join.

First, how do we figure out the first time each user purchased? This is
pretty simple: a GROUP BY aggregation that takes the minimum datetime.
Notice how the table has a created_at column. This is the column that
determines which row is the first purchase for a specific user, so we can
write a query with an aggregation to get the minimum datetime for every user.
SELECT user_id, MIN(created_at) AS min_created_at
FROM transactions
GROUP BY 1

Awesome. Now all we have to do is join this table back to the original on two
columns: user_id and created_at. The self join will effectively filter for the
first purchase. Then all we have to do is grab all of the columns on the left
side table.

SELECT t.user_id, t.created_at, t.product
FROM transactions AS t
INNER JOIN (
    SELECT user_id, MIN(created_at) AS min_created_at
    FROM transactions
    GROUP BY 1
) AS t1
    ON t.user_id = t1.user_id
    AND t.created_at = t1.min_created_at
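
The two-column join can be checked end to end. The sketch below is one way to do it, assuming Python’s built-in sqlite3 module and an in-memory database loaded with the sample rows from the table above. The variable name first_purchases is only for illustration, and the GROUP BY is spelled out by column name rather than position.

```python
import sqlite3

# In-memory SQLite database holding the sample transactions table from above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (user_id INTEGER, created_at TEXT, product TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (123, "2019-01-01", "apple"),
        (456, "2019-01-02", "banana"),
        (123, "2019-01-05", "pear"),
        (456, "2019-01-10", "apple"),
        (789, "2019-01-11", "banana"),
    ],
)

# Join every row to its user's earliest created_at; only first purchases match.
first_purchases = conn.execute(
    """
    SELECT t.user_id, t.created_at, t.product
    FROM transactions AS t
    INNER JOIN (
        SELECT user_id, MIN(created_at) AS min_created_at
        FROM transactions
        GROUP BY user_id
    ) AS t1
        ON t.user_id = t1.user_id
        AND t.created_at = t1.min_created_at
    ORDER BY t.user_id
    """
).fetchall()

for row in first_purchases:
    print(row)
```

Swapping MIN for MAX in the subquery returns the last purchase per user instead.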

2. Knowing the difference between a LEFT JOIN and INNER JOIN in practice.

`users`
+---------+---------+
| id      | int     |
| name    | varchar |
| city_id | int     |<-+
+---------+---------+  |
                       |
                       |
`cities`               |
+---------+---------+  |
| id      | int     |<-+
| name    | varchar |
+---------+---------+
Question: Given the `users` and `cities` tables above,
write a query to return the list of cities without any
users.

Why does this matter? Anyone can memorize the definitions of an inner
join and a left join for an interview. The Venn diagram provides an adequate
explanation. But can the candidate actually apply the difference in practice?
Explanation:
What is the actual difference between a LEFT JOIN and INNER JOIN?

INNER JOIN: returns rows when there is a match in both tables.
LEFT JOIN: returns all rows from the left table, even if
there are no matches in the right table.

Okay, so we know that each user in the users table must live in a city, given
the city_id field. However, the cities table doesn’t have a user_id field. If
we run an INNER JOIN between these two tables on city_id, we’ll get all of the
cities that have users, and all of the cities without users will be filtered
out.

But what if we run a LEFT JOIN between cities and users?

SELECT cities.name, users.id
FROM cities
LEFT JOIN users
    ON users.city_id = cities.id

Because we keep every row from the table on the LEFT side of the join, a city
with no matching users in the database (Portland, for example) still appears
in the result, with NULL in the users.id column. Therefore all we have to do
now is add a WHERE filter that keeps the rows where a value from the users
table is NULL.

SELECT cities.name, users.id
FROM cities
LEFT JOIN users
    ON users.city_id = cities.id
WHERE users.id IS NULL
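
To see the filter at work, here is a minimal sketch using Python’s built-in sqlite3 module. The rows inserted into users and cities are made up for illustration (only the schema and the Portland example come from the text above), as is the variable name cities_without_users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, city_id INTEGER)")

# Hypothetical sample data: Portland has no users pointing at it.
conn.executemany(
    "INSERT INTO cities VALUES (?, ?)",
    [(1, "New York"), (2, "Seattle"), (3, "Portland")],
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Ann", 1), (2, "Bob", 2), (3, "Cy", 1)],
)

# LEFT JOIN keeps every city; unmatched cities get NULL in the users columns,
# so filtering on users.id IS NULL leaves only the cities without users.
cities_without_users = conn.execute(
    """
    SELECT cities.name, users.id
    FROM cities
    LEFT JOIN users
        ON users.city_id = cities.id
    WHERE users.id IS NULL
    """
).fetchall()

print(cities_without_users)
```

With an INNER JOIN in place of the LEFT JOIN, the same WHERE filter would return no rows at all, because unmatched cities never reach the filter.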

3. Aggregations with a conditional statement

Why does this matter? If you can’t use conditional statements and/or
aggregate with conditional statements, there’s no way to run any kind of
analytics. How do you look at differences in populations based on new
features or variables?
Explanation:
Notice that whenever a question asks for one thing versus another (here,
purchases in the AM versus the PM), we’re comparing two groups. Every time we
have to compare two groups, we use a GROUP BY. It’s in the name. Heh.
In this case, we need to create a separate column to run our GROUP BY on:
whether the created_at value falls in the AM or the PM. Let’s create a
condition in SQL to differentiate them.

CASE WHEN
HOUR(created_at) > 11
THEN 'PM' ELSE 'AM' END AS time_of_day

Pretty simple. We extract the hour from the created_at column and set the
new column time_of_day to AM or PM based on this condition. Now we just have
to run a GROUP BY on the original created_at field truncated to the day AND
the new column we created. The final aggregation is the output we want,
total purchases, which we get by running the COUNT function.

SELECT
    DATE_TRUNC('day', created_at) AS date
    , CASE WHEN
        HOUR(created_at) > 11
        THEN 'PM' ELSE 'AM' END AS time_of_day
    , COUNT(*) AS total_purchases
FROM transactions
GROUP BY 1, 2
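
The same conditional aggregation can be tried locally. The sketch below assumes Python’s built-in sqlite3 module; because SQLite has neither DATE_TRUNC nor HOUR, it substitutes date() and strftime('%H', ...) as stand-ins, which is an adaptation rather than the query above verbatim. The sample timestamps are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (user_id INTEGER, created_at TEXT)")

# Hypothetical purchases spread across two days, before and after noon.
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [
        (123, "2019-01-01 09:15:00"),
        (456, "2019-01-01 11:59:00"),
        (123, "2019-01-01 12:00:00"),
        (789, "2019-01-02 08:00:00"),
        (456, "2019-01-02 15:30:00"),
        (123, "2019-01-02 23:10:00"),
    ],
)

# date() truncates to the day; strftime('%H', ...) yields the hour (00-23).
purchases_by_half_day = conn.execute(
    """
    SELECT
        date(created_at) AS day
        , CASE WHEN
            CAST(strftime('%H', created_at) AS INTEGER) > 11
            THEN 'PM' ELSE 'AM' END AS time_of_day
        , COUNT(*) AS total_purchases
    FROM transactions
    GROUP BY day, time_of_day
    ORDER BY day, time_of_day
    """
).fetchall()

for row in purchases_by_half_day:
    print(row)
```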
RANK Function
The RANK function assigns a rank to each row based on the ORDER BY clause inside OVER.
For example, if you want to find the name of the car with the third highest power, you can use the
RANK function.
Let’s see the RANK function in action:

SELECT name, company, power,
       RANK() OVER (ORDER BY power DESC) AS PowerRank
FROM Cars

The script above ranks all the records in the Cars table in order of descending power.

The PARTITION BY clause can also be used with the RANK function:

SELECT name, company, power,
       RANK() OVER (PARTITION BY company ORDER BY power DESC) AS PowerRank
FROM Cars

In the script above, we partition the results by the company column. Now the RANK restarts at 1
for each company.
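
Both forms of RANK can be compared side by side. The following is a minimal sketch, assuming Python’s built-in sqlite3 module linked against SQLite 3.25 or newer (the first release with window functions). The rows in Cars and the CompanyRank alias are hypothetical; only the table and column names come from the scripts above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Cars (name TEXT, company TEXT, power INTEGER)")

# Hypothetical rows; power values are distinct so every rank is unambiguous.
conn.executemany(
    "INSERT INTO Cars VALUES (?, ?, ?)",
    [
        ("Corolla", "Toyota", 140),
        ("Camry", "Toyota", 200),
        ("Supra", "Toyota", 335),
        ("Civic", "Honda", 158),
        ("Accord", "Honda", 192),
        ("Mustang", "Ford", 450),
        ("Focus", "Ford", 160),
    ],
)

rows = conn.execute(
    """
    SELECT name, company, power,
           RANK() OVER (ORDER BY power DESC) AS PowerRank,
           RANK() OVER (PARTITION BY company ORDER BY power DESC) AS CompanyRank
    FROM Cars
    ORDER BY company, CompanyRank
    """
).fetchall()

# Map each car to its overall rank and to its rank within its own company.
overall_rank = {name: overall for name, _, _, overall, _ in rows}
company_rank = {name: within for name, _, _, _, within in rows}

# The car with the third highest power overall.
third_most_powerful = [n for n, r in overall_rank.items() if r == 3]
print(third_most_powerful, company_rank)
```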

DENSE_RANK Function
The DENSE_RANK function is similar to the RANK function. However, DENSE_RANK does not skip any
ranks when preceding records are tied. Compare the following two scripts, which differ only in
the ranking function.

-- RANK: leaves gaps in the numbering after ties
SELECT name, company, power,
       RANK() OVER (PARTITION BY company ORDER BY power DESC) AS PowerRank
FROM Cars

-- DENSE_RANK: no gaps in the numbering after ties
SELECT name, company, power,
       DENSE_RANK() OVER (PARTITION BY company ORDER BY power DESC) AS DensePowerRank
FROM Cars
ROW_NUMBER Function
Unlike the RANK and DENSE_RANK functions, the ROW_NUMBER function simply returns the row
number of the sorted records, starting with 1. For example, if the first two records have equal
values in the ORDER BY column, RANK and DENSE_RANK both assign 1 to each of them. The
ROW_NUMBER function, however, assigns 1 and 2 to those rows without taking into account that
they are equal. Execute the following script to see the ROW_NUMBER function in action.

SELECT name, company, power,
       ROW_NUMBER() OVER (ORDER BY power DESC) AS RowRank
FROM Cars

The ROW_NUMBER function simply assigns a new row number to each record irrespective of its value.
The PARTITION BY clause can also be used with the ROW_NUMBER function, as shown below:

SELECT name, company, power,
       ROW_NUMBER() OVER (PARTITION BY company ORDER BY power DESC) AS RowRank
FROM Cars

Similarities between RANK, DENSE_RANK, and ROW_NUMBER Functions

The RANK, DENSE_RANK, and ROW_NUMBER functions have the following similarities:
1- All of them require an ORDER BY clause.
2- All of them return an increasing integer with a base value of 1.
3- When combined with a PARTITION BY clause, all of these functions reset the returned integer
value to 1, as we have seen.
4- If there are no duplicate values in the column used by the ORDER BY clause, these functions
return the same output.
To illustrate the last point, take a table Cars1 in the ShowRoom database that has no
duplicate values in the power column. Execute the following script:

SELECT name, company, power,
       RANK() OVER (ORDER BY power DESC) AS [Rank],
       DENSE_RANK() OVER (ORDER BY power DESC) AS [Dense Rank],
       ROW_NUMBER() OVER (ORDER BY power DESC) AS [Row Number]
FROM Cars1
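
The claim that the three functions agree when there are no duplicates is easy to verify. Here is a rough sketch, assuming Python’s built-in sqlite3 module with SQLite 3.25 or newer; the contents of Cars1 are invented, and plain aliases replace the bracketed SQL Server ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Cars1 (name TEXT, company TEXT, power INTEGER)")

# Hypothetical rows with no duplicate values in the power column.
conn.executemany(
    "INSERT INTO Cars1 VALUES (?, ?, ?)",
    [
        ("Corolla", "Toyota", 140),
        ("Civic", "Honda", 158),
        ("Camry", "Toyota", 200),
        ("Mustang", "Ford", 450),
    ],
)

rows = conn.execute(
    """
    SELECT name,
           RANK() OVER (ORDER BY power DESC) AS rank_value,
           DENSE_RANK() OVER (ORDER BY power DESC) AS dense_rank_value,
           ROW_NUMBER() OVER (ORDER BY power DESC) AS row_number_value
    FROM Cars1
    ORDER BY power DESC
    """
).fetchall()

# With no ties, all three functions produce the same number on every row.
all_agree = all(r == d == n for _, r, d, n in rows)
print(rows, all_agree)
```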
Difference between RANK, DENSE_RANK and ROW_NUMBER Functions
The only difference between the RANK, DENSE_RANK, and ROW_NUMBER functions appears when there
are duplicate values in the column used in the ORDER BY clause.
If you go back to the Cars table in the ShowRoom database, you can see it contains lots of duplicate
values. Let’s find the RANK, DENSE_RANK, and ROW_NUMBER of the Cars table ordered
by power. Execute the following script:

SELECT name, company, power,
       RANK() OVER (ORDER BY power DESC) AS [Rank],
       DENSE_RANK() OVER (ORDER BY power DESC) AS [Dense Rank],
       ROW_NUMBER() OVER (ORDER BY power DESC) AS [Row Number]
FROM Cars

With duplicate values, the RANK function skips the next N-1 ranks after a tie between N rows.
The DENSE_RANK function, on the other hand, does not skip ranks after a tie. Finally, the
ROW_NUMBER function has no concern with ranking. It simply returns the row number of the sorted
records. Even if there are duplicate values in the column used in the ORDER BY clause, the
ROW_NUMBER function will not return duplicate values. Instead, it will continue to increment
irrespective of the duplicates.
select
sal,
RANK() over(order by sal desc) as Rank,
DENSE_RANK() over(order by sal desc) as DenseRank,
ROW_NUMBER() over(order by sal desc) as RowNumber
from employee
Output:
--------|-------|-----------|----------
sal |Rank |DenseRank |RowNumber
--------|-------|-----------|----------
5000 |1 |1 |1
3000 |2 |2 |2
3000 |2 |2 |3
2975 |4 |3 |4
2850 |5 |4 |5
--------|-------|-----------|----------
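
The output table above can be reproduced directly. This is a minimal sketch, assuming Python’s built-in sqlite3 module with SQLite 3.25 or newer; the five sal values are taken from the table, while the aliases rnk, dense_rnk, and row_num are stand-ins chosen to avoid clashing with function names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (sal INTEGER)")

# The five salaries from the output table above, including the tie at 3000.
conn.executemany(
    "INSERT INTO employee VALUES (?)",
    [(5000,), (3000,), (3000,), (2975,), (2850,)],
)

ranked = conn.execute(
    """
    SELECT sal,
           RANK() OVER (ORDER BY sal DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY sal DESC) AS dense_rnk,
           ROW_NUMBER() OVER (ORDER BY sal DESC) AS row_num
    FROM employee
    ORDER BY row_num
    """
).fetchall()

for row in ranked:
    print(row)
```

After the tie at 3000, RANK jumps to 4, DENSE_RANK continues with 3, and ROW_NUMBER never repeats.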

select
    customer_id
    , mon_purch
    , row_number() over (order by mon_purch desc, customer_id)
    , rank() over (order by mon_purch desc)
    , dense_rank() over (order by mon_purch desc)
from customer_purchases
order by 2 desc, 3

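Output:

customer_id | mon_purch | row_number | rank | dense_rank
------------+-----------+------------+------+-----------
6           | 400       | 1          | 1    | 1
4           | 300       | 2          | 2    | 2
8           | 300       | 3          | 2    | 2
1           | 200       | 4          | 4    | 3
3           | 200       | 5          | 4    | 3
7           | 200       | 6          | 4    | 3
2           | 150       | 7          | 7    | 4
5           | 100       | 8          | 8    | 5
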
SELECT employee_id,
       full_name,
       department,
       salary,
       salary / MAX(salary) OVER (PARTITION BY department ORDER BY salary DESC)
           AS salary_metric
FROM employee
ORDER BY 5;

Train_id | Station       | Time
---------+---------------+---------
110      | San Francisco | 10:00:00
110      | Redwood City  | 10:54:00
110      | Palo Alto     | 11:02:00
110      | San Jose      | 12:35:00
120      | San Francisco | 11:00:00
120      | Redwood City  | Non Stop
120      | Palo Alto     | 12:49:00
120      | San Jose      | 13:30:00

Suppose we want to add a new column called “time to next station”. To obtain
this value, we subtract the station times for pairs of contiguous stations. We
can calculate this value without using a SQL window function, but that can be
very complicated. It’s simpler to do it using the LEAD window function. This
function returns a value from a following row, so it can be combined with
values from the current row. In this case, it fetches the value in the “time”
column for the station immediately after the current one.
So, here we have another SQL window function example, this time for the
train schedule:

SELECT
train_id,
station,
time as "station_time",
lead(time) OVER (PARTITION BY train_id ORDER BY time) - time
AS time_to_next_station
FROM train_schedule;

Note that the result is computed by an expression that combines an individual
column with a window function; this is not possible with ordinary aggregate
functions under GROUP BY, which collapse the rows they aggregate.
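
The LEAD query can be tried on the schedule above. The sketch below assumes Python’s built-in sqlite3 module with SQLite 3.25 or newer. Two adaptations are assumptions of this sketch, not part of the original query: SQLite stores the times as text, so the difference is computed in whole minutes via strftime('%s', ...), and the “Non Stop” row for train 120 is left out because it has no time value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE train_schedule (train_id INTEGER, station TEXT, time TEXT)"
)
# Schedule from the table above, minus the "Non Stop" row (no time to compare).
conn.executemany(
    "INSERT INTO train_schedule VALUES (?, ?, ?)",
    [
        (110, "San Francisco", "10:00:00"),
        (110, "Redwood City", "10:54:00"),
        (110, "Palo Alto", "11:02:00"),
        (110, "San Jose", "12:35:00"),
        (120, "San Francisco", "11:00:00"),
        (120, "Palo Alto", "12:49:00"),
        (120, "San Jose", "13:30:00"),
    ],
)

# LEAD fetches the next station's time within the same train; the last stop
# of each train has no following row, so its result is NULL (None in Python).
time_to_next = conn.execute(
    """
    WITH with_next AS (
        SELECT train_id, station, time,
               LEAD(time) OVER (PARTITION BY train_id ORDER BY time) AS next_time
        FROM train_schedule
    )
    SELECT train_id, station,
           (CAST(strftime('%s', next_time) AS INTEGER)
            - CAST(strftime('%s', time) AS INTEGER)) / 60 AS minutes_to_next_station
    FROM with_next
    ORDER BY train_id, time
    """
).fetchall()

for row in time_to_next:
    print(row)
```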

In the next example, we will add a new column that shows how much time has
elapsed from the train’s first stop to the current station. We will call it “elapsed
travel time”. The MIN window function obtains the trip’s start time, which we
subtract from the current station time. Here’s the next SQL window function
example:
SELECT
    train_id,
    station,
    time as "station_time",
    time - min(time) OVER (PARTITION BY train_id ORDER BY time)
        AS elapsed_travel_time,
    lead(time) OVER (PARTITION BY train_id ORDER BY time) - time
        AS time_to_next_station
FROM train_schedule;
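
The elapsed travel time can be checked the same way. This is a rough sketch under the same assumptions as any SQLite port of the query: Python’s built-in sqlite3 module with SQLite 3.25 or newer, times stored as text and converted to minutes with strftime('%s', ...), and the “Non Stop” row omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE train_schedule (train_id INTEGER, station TEXT, time TEXT)"
)
conn.executemany(
    "INSERT INTO train_schedule VALUES (?, ?, ?)",
    [
        (110, "San Francisco", "10:00:00"),
        (110, "Redwood City", "10:54:00"),
        (110, "Palo Alto", "11:02:00"),
        (110, "San Jose", "12:35:00"),
        (120, "San Francisco", "11:00:00"),
        (120, "Palo Alto", "12:49:00"),
        (120, "San Jose", "13:30:00"),
    ],
)

# MIN(time) over each train's partition is that train's departure time;
# subtracting it from the current row's time gives the elapsed minutes.
elapsed = conn.execute(
    """
    WITH with_start AS (
        SELECT train_id, station, time,
               MIN(time) OVER (PARTITION BY train_id ORDER BY time) AS start_time
        FROM train_schedule
    )
    SELECT train_id, station,
           (CAST(strftime('%s', time) AS INTEGER)
            - CAST(strftime('%s', start_time) AS INTEGER)) / 60 AS elapsed_minutes
    FROM with_start
    ORDER BY train_id, time
    """
).fetchall()

for row in elapsed:
    print(row)
```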

select emp_name, dealer_id, sales, avg(sales) over() as avgsales
from q1_sales;

+-----------------+------------+--------+-----------+
| emp_name        | dealer_id  | sales  | avgsales  |
+-----------------+------------+--------+-----------+
| Beverly Lang    | 2          | 16233  | 13631     |
| Kameko French   | 2          | 16233  | 13631     |
| Ursa George     | 3          | 15427  | 13631     |
| Ferris Brown    | 1          | 19745  | 13631     |
| Noel Meyer      | 1          | 19745  | 13631     |
| Abel Kim        | 3          | 12369  | 13631     |
| Raphael Hull    | 1          | 8227   | 13631     |
| Jack Salazar    | 1          | 9710   | 13631     |
| May Stout       | 3          | 9308   | 13631     |
| Haviva Montoya  | 2          | 9308   | 13631     |
+-----------------+------------+--------+-----------+

select emp_name, dealer_id, sales, avg(sales) over (partition by dealer_id) as avgsales
from q1_sales;

+-----------------+------------+--------+-----------+
| emp_name        | dealer_id  | sales  | avgsales  |
+-----------------+------------+--------+-----------+
| Ferris Brown    | 1          | 19745  | 14357     |
| Noel Meyer      | 1          | 19745  | 14357     |
| Raphael Hull    | 1          | 8227   | 14357     |
| Jack Salazar    | 1          | 9710   | 14357     |
| Beverly Lang    | 2          | 16233  | 13925     |
| Kameko French   | 2          | 16233  | 13925     |
| Haviva Montoya  | 2          | 9308   | 13925     |
| Ursa George     | 3          | 15427  | 12368     |
| Abel Kim        | 3          | 12369  | 12368     |
| May Stout       | 3          | 9308   | 12368     |
+-----------------+------------+--------+-----------+
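
The two averages in the tables above can be recomputed from the listed sales figures. The following is a minimal sketch, assuming Python’s built-in sqlite3 module with SQLite 3.25 or newer; the aliases avg_all and avg_dealer are illustrative, and SQLite returns unrounded averages where the tables show rounded ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE q1_sales (emp_name TEXT, dealer_id INTEGER, sales INTEGER)")
conn.executemany(
    "INSERT INTO q1_sales VALUES (?, ?, ?)",
    [
        ("Beverly Lang", 2, 16233), ("Kameko French", 2, 16233),
        ("Ursa George", 3, 15427), ("Ferris Brown", 1, 19745),
        ("Noel Meyer", 1, 19745), ("Abel Kim", 3, 12369),
        ("Raphael Hull", 1, 8227), ("Jack Salazar", 1, 9710),
        ("May Stout", 3, 9308), ("Haviva Montoya", 2, 9308),
    ],
)

# OVER () averages across every row; PARTITION BY dealer_id averages within
# each dealer. Both keep one output row per employee.
rows = conn.execute(
    """
    SELECT emp_name, dealer_id, sales,
           AVG(sales) OVER () AS avg_all,
           AVG(sales) OVER (PARTITION BY dealer_id) AS avg_dealer
    FROM q1_sales
    ORDER BY dealer_id, emp_name
    """
).fetchall()

overall_avg = rows[0][3]
dealer_avg = {dealer_id: avg for _, dealer_id, _, _, avg in rows}
print(overall_avg, dealer_avg)
```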
Window Function
