Do The Math
Contents
Module 1: SQL BASICS
1. SQL OVERVIEW
2. Concepts of DBMS, RDBMS, HADOOP, SPARK & PySpark
3. DDL DML DCL & TCL
4. Tables, Fields & Records
5. Keys & Constraints
6. Normalization
7. Basic Structure, Operations, Aggregate Functions
8. Subqueries
9. Joins
Module 1 Assignment & Practice Questions
Module 2: SQL INTERMEDIATE
1. Views
2. Indexes
3. Window Functions
4. Common Table Expressions (CTEs)
5. Grouping Sets
6. Stored Procedures
A very similar SQL-like language called Hive Query Language (HQL) is used in a data
warehouse system called Apache Hive, which is built to work on a big data platform
called Hadoop. It is used for querying and managing large datasets residing in distributed
storage. Hive uses the Hadoop Distributed File System (HDFS) for storage and
MapReduce/YARN for parallel processing.
If performance is key: If you need to pull data frequently and quickly, such as to
support an application that uses online analytical processing (OLAP), MySQL
performs much better. Hive isn’t designed to be an online transactional platform,
and thus performs much more slowly than MySQL.
If your datasets are relatively small (gigabytes): Hive works very well on large
datasets, but MySQL performs much better with smaller datasets and can be
optimized in a range of ways.
If you need to update and modify many records frequently: MySQL does this kind of
activity all day long. Hive, on the other hand, doesn’t really do this well (or at all,
depending). And if you need an interactive experience, use MySQL.
"Big Data" consists of very large volumes of heterogeneous data that is being generated,
often, at high speeds. These data sets cannot be managed and processed using
traditional data management tools and applications at hand. Big Data requires the use
of a new set of tools, applications and frameworks to process and manage the data. We
identify Big Data by a few characteristics which are specific to Big Data. These
characteristics of Big Data are popularly known as Three V's of Big Data. The three v's
of Big Data are Volume, Velocity, and Variety as shown below. Volume refers to the size
of data that we are working with. With the advancement of technology and the rise of
social media, the amount of data is growing very rapidly. This data is spread across
different places, in different formats, in large volumes ranging from Gigabytes to
Terabytes, Petabytes, and even more. Velocity refers to the speed at which the data is
being generated. Different applications have different latency requirements, and in
today's competitive world decision makers want the necessary data/information in the
least amount of time possible, generally in near real time, or in real time in certain
scenarios. In different fields and different areas of technology, we see
data getting generated at different speeds. A few examples include trading/stock
exchange data, tweets on Twitter, status updates/likes/shares on Facebook, and many
others. Variety refers to the different formats in which the data is being
generated/stored. Different applications generate/store the data in different formats. In
today's world, there are large volumes of unstructured data being generated apart from
the structured data getting generated in enterprises. Until the advancements in Big Data
technologies, the industry didn't have any powerful and reliable tools/technologies which
can work with such voluminous unstructured data that we see today. Apart from the
traditional flat files, spreadsheets, relational databases etc., we have a lot of
unstructured data stored in the form of images, audio files, video files, web logs, sensor
data, and many others. This aspect of varied data formats is referred to as Variety in the
Big Data world.
coding approach rather than using Pig or Hive scripts and vice versa. Internally, a
compiler translates Hive Query Language statements into a directed acyclic graph of
MapReduce jobs, which are submitted to Hadoop for execution.
Apache Spark is a fast and general-purpose cluster computing system, written in the Scala
programming language, for big data processing, with built-in modules for streaming, SQL,
machine learning and graph processing. It is well known for its speed, ease of use,
generality and the ability to run virtually everywhere. Even though Spark is one of
the most sought-after tools for data engineers, data scientists can also benefit from Spark
when doing exploratory data analysis, feature extraction, supervised learning and model
evaluation. Spark SQL is a Spark module for structured data processing.
To support Python with Spark, the Apache Spark community released a tool called PySpark.
PySpark offers the PySpark shell, which links the Python API to the Spark core and initializes
the Spark context. Since the majority of data scientists and analytics experts today use Python
because of its rich library set, integrating Python with Spark is a boon.
To learn more about the Hadoop architecture (HDFS – Hadoop Distributed File System)
and Map Reduce algorithm, refer to the links and materials in the appendix section.
Data is stored in records. A record is composed of fields and contains all the data about
one person, company, or item in a database. In this database, a record contains the data
for one customer support incident report. Records appear as rows in the database table;
each row is a record for one Log ID.
A field is part of a record and contains a single piece of data for the subject of the
record. In the database table illustrated above, each record contains four fields: Log ID,
Operator, Resolved & Duration. Fields appear as columns in a database table.
NOT NULL - Ensures that a column cannot have a NULL value, which means
that you cannot insert a new record, or update a record without adding a value to
this field.
Eg: CREATE TABLE Persons (
ID int NOT NULL,
LastName varchar(255) NOT NULL,
FirstName varchar(255) NOT NULL,
Age int
);
PRIMARY KEY - Uniquely identifies each record in a table. A primary key must
contain unique values and cannot contain NULL values.
Eg: CREATE TABLE Persons (
ID int NOT NULL,
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Age int,
PRIMARY KEY (ID)
);
A table can have only one primary key, which may consist of single or multiple
fields. A Candidate Key is a set of one or more fields/columns that can identify a
record uniquely in a table. There can be multiple Candidate Keys in one table. Each
Candidate Key can work as the Primary Key, but only one Candidate Key can be the
Primary Key.
CHECK - Ensures that the values in a column satisfy a specified condition.
Eg: CREATE TABLE Persons (
ID int NOT NULL,
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Age int,
CHECK (Age>=18)
);
DEFAULT - Sets a default value for a column if no value is specified when a record is inserted.
Eg: CREATE TABLE Orders (
ID int NOT NULL,
OrderNumber int NOT NULL,
OrderDate date DEFAULT GETDATE()
);
INDEX - Used to create and retrieve data from the database very quickly. The
CREATE INDEX statement is used to create indexes in tables. The users cannot
see the indexes; they are just used to speed up searches/queries. Updating a table
with indexes takes more time than updating a table without them (because the
indexes also need an update), so only create indexes on columns that will be
frequently searched against.
Eg: CREATE INDEX index_name
ON table_name (column1, column2, ...);
6. Normalization
Normalization is a database design technique which organizes tables in a manner that
reduces redundancy and dependency of data. It divides larger tables into smaller tables
and links them using relationships. The inventor of the relational model, Edgar Codd,
proposed the theory of normalization with the introduction of First Normal Form, and he
continued to extend the theory with Second and Third Normal Form. Later he joined with
Raymond F. Boyce to develop the theory of Boyce-Codd Normal Form.
Eg: Assume a video library maintains a database of movies rented out. Without any
normalization, all information is stored in one table as shown below.
Membership ID   Full Name     Physical Address           Movies Rented     Category
1               Janet Jones   Street A, Plot 5           Thor              Action
2               Robert Phil   5th Street, Zone A         Rush Hour         Comedy
2               Robert Phil   5th Street, Zone A         Jurassic Park     Action
3               Robert Phil   2nd Avenue, Ward Street    Avengers          Action
3               Robert Phil   2nd Avenue, Ward Street    Fast & Furious    Action
2NF (Second Normal Form) Rules
Rule 1- Be in 1NF
Rule 2- there should be no partial dependency
The above 1NF table is at Membership ID & Movies Rented level (the primary key), and the
Physical Address column is dependent only on the Membership ID, not on the full primary
key of Membership ID & Movies Rented. Hence, to get rid of this Partial Dependency, the
Physical Address column can be removed from the above table and the above table can be
divided into 2 tables as shown below.
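A sketch of the resulting split, with columns inferred from the description above (the table names are illustrative):
Table 1 - Members
Membership ID   Full Name     Physical Address
1               Janet Jones   Street A, Plot 5
2               Robert Phil   5th Street, Zone A
3               Robert Phil   2nd Avenue, Ward Street
Table 2 - Movies Rented
Membership ID   Movies Rented     Category
1               Thor              Action
2               Rush Hour         Comedy
2               Jurassic Park     Action
3               Avengers          Action
3               Fast & Furious    Action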
3NF (Third Normal Form) Rules
Rule 1- Be in 2NF
Rule 2- Has no transitive functional dependencies
Consider the above 2NF table with another Salutation column. Now the Salutation
column is not dependent on the primary key (Membership ID) but it is dependent on
another non-key column which is Full Name. This is Transitive Dependency, when a
non-prime attribute depends on other non-prime attributes rather than depending upon
the prime attributes or primary key.
To move our 2NF table into 3NF and get rid of the Transitive Dependency, we need to
divide our table again as shown below.
Salutation ID Salutation
1 Ms.
2 Mr.
3 Mrs.
4 Dr.
We have again divided our tables and created a new table which stores Salutations.
There are no transitive functional dependencies, and hence our table is in 3NF. In Table
3, Salutation ID is the primary key, and in Table 1, Salutation ID is a foreign key
referencing the primary key in Table 3.
Now our little example is at a level that cannot be decomposed further to attain higher
forms of normalization; in fact, it is already in higher normal forms. Separate efforts are
normally needed to move complex databases into the next levels of normalization. To
learn about higher forms of Normalization (BCNF & 4NF) refer to the link -
https://www.studytonight.com/dbms/database-normalization.php
- The SELECT query is used to select rows from a table of a database, which is
indicated using the FROM clause
- The asterisk (*) indicates all columns; SELECT * with no WHERE clause returns all
rows and all columns
- The columns to be picked or displayed are specified by listing out the column
names separated by commas
- SQL is not case sensitive, and hence lower-case syntax can also be used for the
keywords
- The WHERE statement is used to apply conditions on rows, and multiple
conditions can be applied by using the keywords AND/OR. For better readability the
conditions can be enclosed in parentheses
- ORDER BY sorts the output in ascending or descending order of a column. The
default sort order is ascending; to sort in descending order the keyword DESC is used.
Eg: Customer_Information_Table
Query to pick records of male customers with income greater than 100,000
SELECT Name,
       Annual_Income,
       Gender,
       Occupation
FROM Customer_Information_Table
WHERE Gender = 'Male' and Annual_Income > 100000
ORDER BY Annual_Income desc;
- The DISTINCT keyword identifies unique rows from a table and specifying
multiple column names with DISTINCT keyword results in selecting unique
combinations of all the columns from the table
- The INSERT INTO statement is used to insert new records in a table.
Syntax: INSERT INTO table_name (column1, column2, column3, ...)
VALUES (value1, value2, value3, ...);
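For instance, a minimal sketch that inserts one row into the Persons table created earlier (the values are illustrative):
Eg: INSERT INTO Persons (ID, LastName, FirstName, Age)
VALUES (1, 'Smith', 'Anna', 25);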
- The UPDATE statement is used to modify the existing records in a table
UPDATE table_name
SET column1 = value1, column2 = value2, ...
WHERE condition;
- The DELETE statement is used to delete existing records in a table
DELETE FROM table_name WHERE condition;
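As a sketch against the Customer_Information_Table used earlier (the values shown are illustrative, not taken from that table):
Eg: UPDATE Customer_Information_Table
SET Occupation = 'Engineer'
WHERE Name = 'John';
DELETE FROM Customer_Information_Table
WHERE Annual_Income < 10000;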
Eg: Write a query to obtain the number of customers per country in descending order
from the below customer_level table
SELECT country,
count(customer_id) as no_of_customers
FROM customer_level
GROUP BY country
ORDER BY count(customer_id) DESC;
Country no_of_customers
USA 3
India 2
Australia 2
England 1
The resultant data is at Country level, and it is obtained by rolling up the customer-level
table and taking the count of customers at country level. Similarly, any
table can be rolled up from one level to another by applying any of the aggregate
functions like count(), sum(), avg(), min(), max() and doing a GROUP BY.
Now, to filter on the rolled-up table, the HAVING clause can be used. The HAVING
clause was added to SQL because the WHERE keyword cannot be used with aggregate
functions on the rolled-up table.
Eg: To filter out for countries having more than 2 customers, the following query can be
written:
SELECT country,
count(customer_id) as no_of_customers
FROM customer_level
GROUP BY country
HAVING count(customer_id) > 2
ORDER BY count(customer_id) DESC;
The CASE statement goes through conditions and returns a value when the first condition
is met (like an IF-THEN-ELSE statement). So, once a condition is true, it stops reading
further conditions and returns the result. If no conditions are true, it returns the value in
the ELSE clause. If there is no ELSE part and no conditions are true, it returns NULL.
Syntax: CASE
WHEN condition1 THEN result1
WHEN condition2 THEN result2
WHEN conditionN THEN resultN
ELSE result
END;
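For example, a sketch that bands customers by income using the Customer_Information_Table from the earlier example (the band thresholds are illustrative):
Eg: SELECT Name,
       CASE WHEN Annual_Income > 100000 THEN 'High'
            WHEN Annual_Income > 50000 THEN 'Medium'
            ELSE 'Low'
       END as income_band
FROM Customer_Information_Table;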
The UNION operator is used to combine the result-set of two or more SELECT
statements
Each SELECT statement within UNION must have the same number of columns
The columns must also have similar data types
The columns in each SELECT statement must also be in the same order
Syntax: SELECT column_1, column_2, column_3
FROM table1
UNION
SELECT column_1,column_2,column_3
FROM table2
The UNION operator selects only distinct values by default. To allow duplicate
values, use UNION ALL.
The COALESCE function can be used for NULL value treatment by creating another column
which replaces the NULL value with a non-NULL value from another column, or with any
other non-NULL value.
Eg: select coalesce(column_a, column_b, 'string_value') as columnA
from tableA
The above statement would yield the first encountered Non Null value under columnA.
The above coalesce statement can also be written with CASE WHEN, as shown below:
select case when column_a is not null then column_a
when column_a is null and column_b is not null then column_b
when column_a is null and column_b is null then 'string_value' end as columnA
from tableA
The LIMIT clause in a SELECT query sets a maximum number of rows for the result set.
ii. where
Once we have the total working set of data, the first-pass WHERE constraints are
applied to the individual rows, and rows that do not satisfy the constraint are discarded.
Each of the constraints can only access columns directly from the tables requested in
the FROM clause. Aliases in the SELECT part of the query are not accessible in most
databases since they may include expressions dependent on parts of the query that
have not yet executed.
iii. group by
The remaining rows after the WHERE constraints are applied are then grouped based on
common values in the column specified in the GROUP BY clause. Because of the
grouping, there will only be as many rows as there are unique values in that column.
Implicitly, this means that you should only need to use this when you have aggregate
functions in your query.
iv. having
If the query has a GROUP BY clause, then the constraints in the HAVING clause are
applied to the grouped rows, discarding the grouped rows that don't satisfy the
constraint. Like in the WHERE clause, aliases are also not accessible from this step in
most databases.
v. select
Any expressions in the SELECT part of the query are finally computed.
vi. distinct
Of the remaining rows, rows with duplicate values in the column marked
as DISTINCT will be discarded.
vii. order by
If an order is specified by the ORDER BY clause, the rows are then sorted by the
specified data in either ascending or descending order. Since all the expressions in
the SELECT part of the query have been computed, you can reference aliases in this
clause.
viii. limit/offset
Finally, the rows that fall outside the range specified by the LIMIT and OFFSET are
discarded, leaving the final set of rows to be returned from the query.
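To tie these steps together, here is a sketch of a query touching each clause, written against the customer_level table used earlier (the filter values are illustrative):
Eg: SELECT DISTINCT country,
       count(customer_id) as no_of_customers
FROM customer_level
WHERE customer_id IS NOT NULL
GROUP BY country
HAVING count(customer_id) > 1
ORDER BY no_of_customers DESC
LIMIT 5;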
8. Subqueries
In SQL a Subquery can be simply defined as a query within another query. In other
words, we can say that a Subquery is a query that is embedded in WHERE clause of
another SQL query.
Important rules for Subqueries:
You can place the Subquery in a number of SQL
clauses: WHERE clause, HAVING clause, FROM clause.
Subqueries can be used with SELECT, UPDATE, INSERT and DELETE statements,
along with comparison operators such as =, >=, <=, or the LIKE operator.
A subquery is a query within another query. The outer query is called the main
query and the inner query is called the subquery.
The subquery generally executes first, and its output is used to complete the
query condition for the main or outer query.
Subquery must be enclosed in parentheses.
Subqueries are on the right side of the comparison operator.
An ORDER BY clause generally cannot be used in a Subquery; a GROUP BY clause can be
used in a Subquery when the inner result needs to be grouped.
Use single-row operators with single row Subqueries. Use multiple-row operators
with multiple-row Subqueries.
Eg: Write a query to display name, location, phone_number of students from master
table whose section is A
Master_table
Student
Name Roll_No Section
Ram 101 A
Raj 102 B
Ravi 103 A
Sumanth 104 C
SELECT Name,
       Roll_No,
       Location,
       Phone_Number
FROM Master_table
WHERE Roll_No IN (SELECT Roll_No FROM Student WHERE Section = 'A');
First, the inner subquery SELECT Roll_No FROM Student WHERE Section = 'A' executes and
returns the Roll_No of every student whose Section is 'A'. Then the outer query executes
and returns the Name, Roll_No, Location and Phone_Number from the Master_table for the
students whose Roll_No was returned by the inner subquery.
Output:
Name Roll_No Location Phone_Number
Ram 101 Chennai 929389238923
Ravi 103 Mumbai 298342837492
9. JOINS
A SQL Join statement is used to combine data or rows from two or more tables based
on a common field between them. Different types of Joins are:
INNER JOIN
LEFT JOIN
RIGHT JOIN
FULL JOIN
The INNER JOIN keyword selects all rows from both tables where the join condition is
satisfied. It creates the result set by combining the rows from both tables in which the
value of the common field is the same.
Syntax: SELECT tableA.column1,tableA.column2,tableB.column1
FROM tableA
INNER JOIN tableB
ON tableA.matching_column = tableB.matching_column;
We can also write JOIN instead of INNER JOIN. JOIN is same as INNER JOIN.
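As a sketch, the Student and Master_table tables from the subquery example above can be joined on their common Roll_No field:
Eg: SELECT Student.Name,
       Student.Section,
       Master_table.Location
FROM Student
INNER JOIN Master_table
ON Student.Roll_No = Master_table.Roll_No;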
LEFT JOIN returns all the rows of the table on the left side of the join and the matching
rows from the table on the right side of the join. For rows with no matching row on the
right side, the result set will contain NULL. LEFT JOIN is also known as LEFT OUTER
JOIN.
Syntax: SELECT tableA.column1,tableA.column2,tableB.column1
FROM tableA
LEFT JOIN tableB
ON tableA.matching_column = tableB.matching_column;
RIGHT JOIN is similar to LEFT JOIN. This join returns all the rows of the table on the right
side of the join and the matching rows from the table on the left side of the join. For rows
with no matching row on the left side, the result set will contain NULL. RIGHT JOIN
is also known as RIGHT OUTER JOIN.
Syntax: SELECT tableA.column1,tableA.column2,tableB.column1
FROM tableA
RIGHT JOIN tableB
ON tableA.matching_column = tableB.matching_column;
FULL JOIN creates the result set by combining the results of both LEFT JOIN and RIGHT
JOIN. The result set will contain all the rows from both tables. For rows with no match,
the result set will contain NULL values.
Syntax: SELECT tableA.column1,tableA.column2,tableB.column1
FROM tableA
FULL JOIN tableB
ON tableA.matching_column = tableB.matching_column;
CROSS JOIN creates a result set whose size is the number of rows in the first table
multiplied by the number of rows in the second table. This kind of result is called the
Cartesian Product, where each row from the 1st table joins with every row of the other table.
If the 1st table contains x rows and the 2nd table contains y rows, then the resulting
cross-joined table contains x*y rows.
Syntax: SELECT *
FROM tableA
CROSS JOIN tableB;
Eg: Table_1 & Table_2
Item_ID Item_Name Item_Unit Company_ID
1 Pot Rice Pcs 122
2 Cheese Mix Pcs 125
c. Using the same table, for each month, identify the reporting_level_0 that has the highest
number of reporting_level_4’s under it (Refer to Module 2 topics – CTE & Window Functions to
solve this question) and obtain the output table at month-reporting_level_0 level having the
reporting_level_0s with the highest number of reporting_level_4s.
d. Below is the table with students and their grades in different topics. Convert this 1NF table to
3NF
e. Write a SQL statement to make a list with order no, purchase amount, customer name and their
cities for those orders whose order amount is between 500 and 2000. Use the below 2 tables and
illustrate the output table
orders: tableA
Order_no Purch_amt Ord_date Customer_id Salesman_id
70001 150.5 2012-10-05 3005 5002
70009 270.65 2012-09-10 3001 5005
70002 65.26 2012-10-05 3002 5001
70004 110.5 2012-08-17 3009 5003
70007 948.5 2012-09-10 3005 5002
70005 2400.6 2012-07-27 3007 5001
70008 5760 2012-09-10 3002 5001
70010 1983.43 2012-10-10 3004 5006
70003 2480.4 2012-10-10 3009 5003
70012 250.45 2012-06-27 3008 5002
70011 75.29 2012-08-17 3003 5007
70013 3045.6 2012-04-25 3006 5001
Customers: tableB
Customer_id Cust_name City Grade Salesman_id
3002 Nick Rimando New York 100 5001
3005 Graham Zusi California 200 5002
3001 Brad Guzan London 300 5005
3004 Fabian Johns Paris 300 5006
3007 Brad Davis New York 200 5001
3009 Geoff Camero Berlin 100 5003
3008 Julian Green London 300 5002
3003 Jozy Altidor Moscow 200 5007
g. Write a query in SQL to display those employees whose first name contains the letter 'z',
and display their department and city using the below tables. Also illustrate the output table
Departments: tableA
Department_ID Department_Name Location_ID
10 Administration 1700
20 Marketing 1800
30 Purchasing 1700
40 Human Resources 2400
50 Shipping 1500
60 IT 1400
70 Public Relations 2700
Employees: tableB
Employee_ID First_Name Department_ID
100 Zack 10
101 Zohan 10
102 Jim 20
103 Jill 30
104 Jejo 30
105 Zaakir 40
106 Yacob 50
Locations: tableC
Location_ID City
1700 Venice
1800 Rome
1900 Tokyo
2000 London
2100 New York
2200 Paris
2300 Beijing
h. Convert the item level (ASIN level) table vn5018r.p1m_100k_final_dec18 to date-item level by
using the dates present in the table vn5018r.evaluation_dates_dec18 such that every item is
present across every date. Save the result into a table.
i. For all the date – asin combinations (date-asin level table) created in the above table obtain
the instock and publish flags and have_it flag from the date-item level flags table
vn5018r.top_100K_instock_published_dec18 and create a 1/0 flag column called
not_in_catelogue to tag all the rows that do not obtain any flag from the flags table. The join
key would be catlg_item_id & calendar_date.
j. Create the same date-asin level table with flags as created above (question i.) for another new list
using these three, new list tables vn5018r.p1m_100k_final_dec18_unadj (asin level table),
vn5018r.evaluation_dates_dec18_unadj (dates table) and
vn5018r.top_100K_instock_published_dec18_unadj (item-date level flags), using the same join key.
Once created, unify this resultant date-asin level flags table with the above (question i.) date-asin
level flags table with a flag to denote if the row is part of the original list or new list.
Views can hide complexity - If you have a query that requires joining several
tables, or has complex logic or calculations, you can code all that logic into a
view, then select from the view just like you would a table.
Views can be used as a security mechanism - A view can select certain columns
and/or rows from a table, and permissions can be set on the view instead of the
underlying tables. This allows surfacing only the data that a user needs to see.
Views can simplify supporting legacy code - If you need to refactor a table that
would break a lot of code, you can replace the table with a view of the same
name. The view provides the exact same schema as the original table, while the
actual schema has changed. This keeps the legacy code that references the
table from breaking, allowing you to change the legacy code at your leisure.
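As a minimal sketch, the rolled-up customer query from Module 1 could be wrapped in a view and then queried like a table (the view name is illustrative):
Eg: CREATE VIEW country_customer_counts AS
SELECT country,
       count(customer_id) as no_of_customers
FROM customer_level
GROUP BY country;
SELECT * FROM country_customer_counts;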
2. Indexes
An index is a schema object. It is used by the server to speed up the retrieval of rows by
using a pointer. It can reduce disk I/O (input/output) by using a rapid path access method
to locate data quickly. An index helps to speed up SELECT queries and WHERE clauses, but
it slows down data input with UPDATE and INSERT statements. Indexes can be
created or dropped with no effect on the data.
Indexes should be avoided if the table is small, if the columns are not often used in
queries, or if the column is updated frequently.
An index can be dropped using the DROP INDEX command. Syntax: DROP INDEX index_name.
3. Window Functions
Window functions operate on a set of rows and return a single value for each row from
the underlying query. The term window describes the set of rows on which the function
operates. When you use a window function in a query, define the window using the
OVER() clause. The OVER() clause (window definition) differentiates window functions
from other analytical and reporting functions. A query can include multiple window
functions with the same or different window definitions.
The AVG() window function operates on the rows defined in the window and returns a
value for each row.
Eg: select emp_name, dealer_id, sales, avg(sales) over() as avg_sales from q1_sales;
The PARTITION BY and ORDER BY clauses can also be applied to the above query.
Window functions are applied to the rows within each partition and sorted according to
the order specification.
The following query uses the AVG() window function with the PARTITION BY clause to
determine the average car sales for each dealer in Q1:
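A sketch of such a query, assuming the same q1_sales columns as in the earlier example:
Eg: select emp_name,
       dealer_id,
       sales,
       avg(sales) over(partition by dealer_id) as avg_sales_by_dealer
from q1_sales;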
To rank each employee within a dealer based on sales, the row_number() function
can be used, partitioned by dealer_id and ordered by sales. Eg: row_number()
over(partition by dealer_id order by sales desc) as rank.
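Put into a full query against q1_sales, the sketch would look like this (the alias sales_rank is used here since RANK is a reserved word in some databases):
Eg: select emp_name,
       dealer_id,
       sales,
       row_number() over(partition by dealer_id order by sales desc) as sales_rank
from q1_sales;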
4. Common Table Expressions (CTEs)
A Common Table Expression (CTE) is a named temporary result set, defined in a WITH
clause, that can be referenced by the SELECT, INSERT, UPDATE or DELETE statement
that immediately follows it.
Syntax: WITH cte_name AS
(
Select …………………..…..
),
cte_name2 as
(
Select …………………..…..
)
Select * from cte_name
UNION ALL
Select * from cte_name2;
Two CTEs can be written one after the other, separated by a comma.
A CTE can reference itself and previously defined CTEs in the same WITH
clause. Forward referencing is not allowed.
Specifying more than one WITH clause in a CTE is not allowed. For example, if
a CTE_query_definition contains a subquery, that subquery cannot contain a
nested WITH clause that defines another CTE.
The following clauses cannot be used in a CTE_query_definition:
o INTO
o FOR BROWSE
When a CTE is used in a statement that is part of a batch, the statement before it
must be followed by a semicolon.
When executing a CTE, any hints that reference a CTE may conflict with other
hints that are discovered when the CTE accesses its underlying tables, in the
same manner as hints that reference views in queries. When this occurs, the query
returns an error.
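For instance, a minimal CTE sketch using the customer_level table from Module 1 (the CTE name is illustrative):
WITH country_counts AS
(
SELECT country,
       count(customer_id) as no_of_customers
FROM customer_level
GROUP BY country
)
SELECT * FROM country_counts
WHERE no_of_customers > 2;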
5. Grouping Sets
The result set returned by GROUPING SETS is the union of the aggregates based on the
columns specified in each set of the grouping sets.
Whenever aggregates are required at a single level, the GROUP BY clause is the usual
solution. But there can be a requirement to get these aggregates based on different sets
of columns in the same result set. We can get the same result using the UNION operator
with different queries, but the use of multiple queries with the UNION operator is not the
optimum way to achieve this and will result in longer query execution time. This can be
simplified and optimized by using Grouping Sets.
For example, the below query would yield a unified table displaying the sum of units sold
at year level and year-month level
Eg: select month,
year,
sum(units_sold)
from sales_table
group by
year,
month
Union All
Select ' ' as month,
year,
sum(units_sold)
from sales_table
group by
year
The above approach of obtaining the data at multiple levels by using the UNION ALL
operator is not the optimum way of doing it, and hence the same result can be obtained
through an optimized approach using grouping sets. The below query produces the sum
of units sold at month-year level, because of the grouping set (month, year), and at year
level, because of the grouping set (year).
Eg: select month,
       year,
       sum(units_sold)
from sales_table
group by grouping sets
(
(month, year),
(year)
)
This is much more optimized than the use of 2 queries with a union all.
d. From the below yearly employee sales table, obtain the total sales for each year
without rolling up the table to year level and by using window functions. Write down the
resultant output table.
f. From the below table obtain the total units sold at month-year-company level, at year-
company level and at company level using grouping sets. On the rolled-up table, rank
the year-company combinations and the companies based on the units sold, with a flag
indicating the level which is ranked. Write the resultant output table.
SQL Practice Questions ANSWER KEY.docx
- Cursors
o https://www.c-sharpcorner.com/UploadFile/f0b2ed/cursors-in-sql/
o https://www.tutorialspoint.com/plsql/plsql_cursors.htm
- Triggers
o https://www.geeksforgeeks.org/sql-trigger-student-database/
o https://www.tutorialspoint.com/plsql/plsql_triggers.htm
- Stored Procedures
o https://www.w3schools.com/sql/sql_stored_procedures.asp
o https://www.tutorialspoint.com/t_sql/t_sql_stored_procedures.htm
o https://www.c-sharpcorner.com/article/how-to-create-a-stored-procedure-in-sql-server-management-studio/
Appendix
Hadoop & HDFS:
HDFS Infrastructure - https://www.youtube.com/watch?v=DLutRT6K2rM&t=1185s
04_IntroductionToMapReduce.pdf
1. https://www.youtube.com/watch?v=3DA1grSp4mU
2. https://www.youtube.com/watch?v=oBqju4ZkD58
3. https://www.youtube.com/watch?v=WLOevQgoZo4
4. https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/FileSystemShell.html
Hadoop-for-dumies_Dirk.deRoos-Paul.C_2014.pdf
You can refer to Hadoop for Dummies attached above to learn it in further detail.
1_IntroductionToBigDataHadoop.pdf
Once BFD Jupyter has been set up, the Jupyter Notebook can be started in nohup bash mode
by running the following commands
- Open cdc00 or cdc01 on putty and login with the user id on whose VM the Jupyter notebook
was setup and run the below commands
- cd bfd-jupyter
- nohup bash ./start.sh > jupyter_log.txt &
- cat jupyter_log.txt
These commands will initiate a Jupyter Notebook with the port number and token of the
form highlighted below, using which Jupyter can be opened in the browser. The link will be of the
format - http://cdc-main-client01.bfd.walmart.com:42424/?token=e31a4e26513ae9bbd5c679d467ee05d6cec6bb3ab1df613d
Once Jupyter has been opened, a new .ipynb notebook can be started, and PySpark, HQL &
Pandas can be initialized through code, after which SQL code can be run in Spark or in Hive
using the spark.sql or execute_hql functions respectively. PySpark is the Spark Python API,
which exposes the Spark programming model to Python.
hql_load.py
The hql_load.py code file can be placed in the same folder where the notebooks are being run.
The spark.sql() function can be used to run SQL code in Spark, and this can handle data with up
to around 100 million rows. When the size of the data is too large, the SQL code will fail to run on
Spark, and that is when it must be run in Hive using the execute_hql function.
Code to Initialize PySpark (should be mandatorily placed at the start of the Notebook, in the first
code cell and run)
#connect to spark
import sys, os, time
#import pandas
from pyspark.sql import SparkSession  # SparkSession entry point for the builder below
username = os.environ.get('USER')
username_hive = username.replace( "-", "_" )
spark = SparkSession \
    .builder \
    .appName("WMT_Deliver_IT_Data_Revamp_New_with_new_pre_order_logic_inc_del_ts")\
    .master("yarn-client")\
    .config("spark.driver.allowMultipleContexts", "true") \
    .config("spark.dynamicAllocation.enabled", "true")\
    .config("spark.dynamicAllocation.initialExecutors", "50")\
    .config("spark.dynamicAllocation.minExecutors", "50")\
    .config("spark.executor.memory", "32g")\
    .config("spark.driver.memory", "32g")\
    .config("spark.cores.max", 64)\
    .config("spark.shuffle.service.enabled", "true")\
    .config("spark.rdd.compress", "true")\
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
    .config("spark.kryoserializer.buffer", "128k")\
    .config("spark.kryoserializer.buffer.max", "2047m")\
    .config("spark.executor.userClassPathFirst", "false")\
    .config("spark.streaming.unpersist", "true")\
    .enableHiveSupport()\
    .getOrCreate()
Settings to initialize Pandas & HQL (should be run after spark has been initialized)
import sys
print "Starting program, path: %s" % (sys.path)  # Make sure to print to stderr instead if you use this in a custom mapper or reducer!
Jupyter Notebooks
According to Project Jupyter, the Jupyter Notebook, formerly known as the IPython
Notebook, is an open-source web application that allows users to create and share
documents that contain live code, equations, visualizations, and narrative text. Uses
include data cleaning and transformation, numerical simulation, statistical modeling, data
visualization, machine learning, and much more. The word Jupyter is a loose acronym
for Julia, Python, and R, but today Jupyter supports many programming languages.
Interest in Jupyter Notebooks has grown dramatically.
Apache Spark
According to Apache, Spark is a unified analytics engine for large-scale data processing,
used by well-known, modern enterprises, such as Netflix, Yahoo, and eBay. With speeds
up to 100x faster than Hadoop, Apache Spark achieves high performance for static,
batch, and streaming data, using a state-of-the-art DAG (Directed Acyclic Graph)
scheduler, a query optimizer, and a physical execution engine. Spark’s polyglot
programming model allows users to write applications quickly in Scala, Java, Python, R,
and SQL. Spark includes libraries for Spark SQL (DataFrames and Datasets), MLlib
(Machine Learning), GraphX (Graph Processing), and DStreams (Spark Streaming). You
can run Spark using its standalone cluster mode, on Amazon EC2, Apache Hadoop
YARN, Mesos, or Kubernetes.
PySpark
The Spark Python API, PySpark, exposes the Spark programming model to Python.
PySpark is built on top of Spark’s Java API. Data is processed in Python and cached
and shuffled in the JVM. According to Apache, Py4J enables Python programs running
in a Python interpreter to dynamically access Java objects in a JVM.
REFERENCES
1. https://www.tutorialspoint.com/dbms/sql_overview.htm
2. https://www.tutorialspoint.com/sql/sql-overview.htm
3. https://www.softwaretestingmaterial.com/sql-tutorial-sql-overview/
4. https://blog.matthewrathbone.com/2015/12/08/hive-vs-mysql.html
5. https://intellipaat.com/tutorial/hadoop-tutorial/mapreduce-yarn/
6. https://www.guru99.com/difference-dbms-vs-rdbms.html
7. https://www.mssqltips.com/sqlservertip/3132/big-data-basics--part-1--introduction-to-big-data/
8. https://www.mssqltips.com/sqlservertip/3140/big-data-basics--part-3--overview-of-hadoop/
9. https://www.dezyre.com/article/mapreduce-vs-pig-vs-hive/163
10. https://www.wisdomjobs.com/e-university/hadoop-tutorial-484/hdfs-concepts-14768.html
11. https://www.datacamp.com/community/tutorials/apache-spark-python
12. https://spark.apache.org/docs/2.3.0/sql-programming-guide.html
13. https://spark.apache.org/docs/2.3.0/
14. https://www.tutorialspoint.com/pyspark/pyspark_environment_setup.htm
15. https://www.geeksforgeeks.org/sql-ddl-dml-dcl-tcl-commands/
16. https://www.cengage.com/school/corpview/RegularFeatures/DatabaseTutorial/db_elements/db_elements2.htm
17. https://www.guru99.com/database-normalization.html
18. https://www.studytonight.com/dbms/third-normal-form.php
19. https://www.studytonight.com/dbms/second-normal-form.php
20. https://launchschool.com/books/sql_first_edition/read/constraints
21. https://www.w3schools.com/sql/sql_create_index.asp
22. https://www.geeksforgeeks.org/sql-sub-queries/
23. http://www.sql-join.com/sql-join-types
24. https://www.geeksforgeeks.org/sql-join-set-1-inner-left-right-and-full-joins/
25. https://stackoverflow.com/questions/1278521/why-do-you-create-a-view-in-a-database
26. https://www.geeksforgeeks.org/sql-views/
27. https://www.geeksforgeeks.org/sql-indexes/
28. https://drill.apache.org/docs/sql-window-functions-introduction/
29. https://www.w3resource.com/sql-exercises/joins-hr/index.php
30. https://drill.apache.org/docs/sql-window-functions-introduction/
31. https://www.geeksforgeeks.org/cte-in-sql/
32. https://docs.microsoft.com/en-us/sql/t-sql/queries/with-common-table-expression-transact-sql?view=sql-server-2017
33. https://blogs.msdn.microsoft.com/sreekarm/2008/12/28/grouping-sets-in-sql-server-2008/
34. https://spark.apache.org/docs/0.9.0/python-programming-guide.html
35. https://towardsdatascience.com/a-brief-introduction-to-pyspark-ff4284701873
36. https://medium.com/@GaryStafford/getting-started-with-pyspark-for-big-data-analytics-using-jupyter-notebooks-and-docker-ba39d2e3d6c7
37. https://sqlbolt.com/lesson/select_queries_order_of_execution