
DO THE MATH


Contents
Module 1: SQL BASICS
1. SQL OVERVIEW
2. Concepts of DBMS, RDBMS, HADOOP, SPARK & PySpark
3. DDL DML DCL & TCL
4. Tables, Fields & Records
5. Keys & Constraints
6. Normalization
7. Basic Structure, Operations, Aggregate Functions
8. Subqueries
9. Joins
Module 1 Assignment & Practice Questions
Module 2: SQL INTERMEDIATE
1. Views
2. Indexes
3. Window Functions

4. Common Table Expressions

5. Grouping Sets

Module 2 Assignment & Practice Questions


Module 3: SQL ADVANCED (Optional)
1. Cursors
2. Triggers

3. Stored Procedures

Module 3 Assignment & Practice Questions


APPENDIX
Hadoop, PySpark, Jupyter Notebooks, HQL Settings


Module 1: SQL BASICS


1. SQL OVERVIEW:
Structured Query Language (SQL) is a database query language used for storing and
managing data in databases. SQL was the first commercial language introduced for
E. F. Codd's relational model of databases. Today almost all RDBMSs (MySQL, Oracle,
Informix, Sybase, MS Access) use SQL as the standard database query language. SQL
is used to perform all types of data operations in an RDBMS. It is an American National
Standards Institute (ANSI) standard and the standard language for accessing and
manipulating databases. Using SQL, we can create databases, tables and stored
procedures (SPs), execute queries, and retrieve, insert, update and delete data in a
database.

A very similar SQL-like language called Hive Query Language (HQL) is used in Apache Hive,
a data warehouse system built to work on the big data platform Hadoop. It is used to
query and manage large datasets residing in distributed storage. Hive uses the Hadoop
Distributed File System (HDFS) for storage and MapReduce/YARN for parallel processing.
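
For illustration, a minimal HiveQL sketch looks almost identical to standard SQL (the page_views table and its columns are hypothetical, not part of this document):

Eg: CREATE TABLE page_views (
    user_id BIGINT,
    page STRING,
    view_time TIMESTAMP
)
PARTITIONED BY (dt STRING)
STORED AS ORC;

SELECT page, count(*) AS views
FROM page_views
WHERE dt = '2018-12-01'
GROUP BY page;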

WHEN TO USE HIVE

 If you have large (terabytes/petabytes) datasets to query: Hive is designed
specifically for analytics on large datasets and works well for a range of complex
queries. Hive is the most approachable way to quickly (relatively) query and
inspect datasets already stored in Hadoop.
 If extensibility is important: Hive has a range of user function APIs that can be
used to build custom behavior into the query engine.

WHEN TO USE MYSQL

 If performance is key: If you need to pull data frequently and quickly, such as to
support an application that uses online transaction processing (OLTP), MySQL
performs much better. Hive isn’t designed to be an online transactional platform,
and thus performs much more slowly than MySQL.
 If your datasets are relatively small (gigabytes): Hive works very well on large
datasets, but MySQL performs much better with smaller datasets and can be
optimized in a range of ways.
 If you need to update and modify many records frequently: MySQL does this kind of
activity all day long. Hive, on the other hand, doesn’t really do this well (or at all,
depending). And if you need an interactive experience, use MySQL.


2. Concepts of DBMS, RDBMS, Hadoop, Spark & PySpark


A DBMS is software used to store and manage data. The DBMS was introduced
during the 1960s to store any kind of data. It also offers manipulation of the data, such
as insertion, deletion, and updating. A DBMS also performs functions like
defining, creating, revising and controlling the database. It is specially designed to create
and maintain data and to enable individual business applications to extract the desired
data.

A Relational Database Management System (RDBMS) is an advanced version of a DBMS.
It came into existence during the 1970s. An RDBMS allows the
organization to access data more efficiently than a DBMS. An RDBMS is a software system
used to store data in the form of tables. In this kind of system, data is managed and
stored in rows and columns, known as tuples and attributes. RDBMSs are powerful data
management systems and are widely used across the world.

Difference between DBMS & RDBMS

Storage: A DBMS stores data as a file, while an RDBMS stores data in the form of tables.
Database structure: A DBMS stores data in either a navigational or hierarchical form, while an RDBMS uses a tabular structure where the headers are the column names and the rows contain the corresponding values.
Number of users: A DBMS supports a single user only, while an RDBMS supports multiple users.
ACID (Atomicity, Consistency, Isolation, Durability): In a regular database the data may not be stored following the ACID model, which can develop inconsistencies in the database. Relational databases are harder to construct, but they are consistent and well structured, and they obey ACID.
Type of program: A DBMS is the program for managing the databases on computer networks and system hard disks, while an RDBMS is the database system used for maintaining the relationships among the tables.
Hardware and software needs: A DBMS has low software and hardware needs; an RDBMS has higher hardware and software needs.
Normalization: A DBMS does not support normalization; an RDBMS can be normalized.
Distributed databases: A DBMS does not support distributed databases; an RDBMS offers support for distributed databases.
Ideally suited for: A DBMS mainly deals with small quantities of data; an RDBMS is designed to handle large amounts of data.
Client-server: A DBMS does not support client-server architecture; an RDBMS supports client-server architecture.
Data fetching: In a DBMS, data fetching is slower for complex and large amounts of data; in an RDBMS it is rapid because of its relational approach.
Data redundancy: Data redundancy is common in the DBMS model; in an RDBMS, keys and indexes do not allow data redundancy.
Data relationship: In a DBMS there is no relationship between data; in an RDBMS, data is stored in the form of tables which are related to each other with the help of foreign keys.
Security: A DBMS offers no security; an RDBMS offers multiple levels of security, with log files created at the OS, command and object level.
Data access: In a DBMS, data elements need to be accessed individually; in an RDBMS, data can be easily accessed using SQL queries, and multiple data elements can be accessed at the same time.
Examples: Examples of a DBMS are a file system, XML, the Windows Registry, etc.; examples of an RDBMS are MySQL, Oracle, SQL Server, etc.

Big Data & Hadoop

Big Data" consists of very large volumes of heterogeneous data that is being generated,
often, at high speeds.  These data sets cannot be managed and processed using
traditional data management tools and applications at hand.  Big Data requires the use
of a new set of tools, applications and frameworks to process and manage the data. We
identify Big Data by a few characteristics which are specific to Big Data. These
characteristics of Big Data are popularly known as Three V's of Big Data. The three v's
of Big Data are Volume, Velocity, and Variety as shown below. Volume refers to the size
of data that we are working with. With the advancement of technology and with the
invention of social media, the amount of data is growing very rapidly.  This data is
spread across different places, in different formats, in large volumes ranging from
Gigabytes to Terabytes, Petabytes, and even more. Velocity refers to the speed at
which the data is being generated. Different applications have different latency
requirements and in today's competitive world, decision makers want the necessary
data/information in the least amount of time as possible.  Generally, in near real time or
real time in certain scenarios. In different fields and different areas of technology, we see
data getting generated at different speeds. A few examples include trading/stock
exchange data, tweets on Twitter, status updates/likes/shares on Facebook, and many
others. Variety refers to the different formats in which the data is being
generated/stored. Different applications generate/store the data in different formats. In
today's world, there are large volumes of unstructured data being generated apart from
the structured data getting generated in enterprises. Until the advancements in Big Data

Mu Sigma Confidential 4
Document Header

technologies, the industry didn't have any powerful and reliable tools/technologies which
can work with such voluminous unstructured data that we see today. Apart from the
traditional flat files, spreadsheets, relational databases etc., we have a lot of
unstructured data stored in the form of images, audio files, video files, web logs, sensor
data, and many others. This aspect of varied data formats is referred to as Variety in the
Big Data world.

Hadoop is an open source framework, from the Apache Software Foundation, capable of
processing large amounts of heterogeneous data sets in a distributed fashion across
clusters of commodity computers and hardware using a simplified programming model.
Hadoop provides a reliable shared storage and analysis system. The Hadoop framework
is based closely on the following principle: in pioneer days they used oxen for heavy
pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We
shouldn't be trying for bigger computers, but for more systems of computers.

Here are the prominent characteristics of Hadoop:

 Hadoop provides a reliable shared storage (HDFS) and analysis system (MapReduce).
 Hadoop is highly scalable and unlike the relational databases, Hadoop scales
linearly. Due to linear scale, a Hadoop Cluster can contain tens, hundreds, or
even thousands of servers.
 Hadoop is very cost effective as it can work with commodity hardware and does
not require expensive high-end hardware.
 Hadoop is highly flexible and can process both structured as well as unstructured
data.
 Hadoop has built-in fault tolerance. Data is replicated across multiple nodes
(replication factor is configurable) and if a node goes down, the required data can
be read from another node which has the copy of that data. And it also ensures
that the replication factor is maintained, even if a node goes down, by replicating
the data to other available nodes.
 Hadoop works on the principle of write once and read multiple times.
 Hadoop is optimized for large and very large data sets. For instance, a small
amount of data like 10 MB when fed to Hadoop, generally takes more time to
process than traditional systems.
Once big data is loaded into Hadoop, what is the best way to use this data? This is the
question that confronts most Hadoop developers. There are various coding approaches,
such as using Hadoop MapReduce or alternative components like Apache Pig and Hive.
Each of these coding approaches has pros and cons, and it is up to the Hadoop
developers to evaluate which approach works best for their business requirements and
skills. Hadoop MapReduce is a framework, or programming model, in the Hadoop
ecosystem for processing large unstructured data sets in a distributed manner using a
large number of nodes. Pig and Hive are components that sit on top of the Hadoop
framework for processing large data sets without the user having to write Java-based
MapReduce code. Pig and Hive, open source alternatives to Hadoop MapReduce, were
built so that Hadoop developers could do the same things they would do in Java in a less
verbose way, writing fewer lines of code that are easy to understand. Pig, Hive and
MapReduce coding approaches are complementary components on the Hadoop stack.
A Hadoop developer who knows Pig and/or Hive does not have to learn Java; however,
certain jobs can be executed more effectively using the Hadoop MapReduce
coding approach rather than Pig or Hive scripts, and vice versa. Internally, a
compiler translates Hive Query Language statements into a directed acyclic graph of
MapReduce jobs, which are submitted to Hadoop for execution.
Apache Spark is a fast and general-purpose cluster computing system, written in the Scala
programming language, for big data processing, with built-in modules for streaming, SQL,
machine learning and graph processing. It is well known for its speed, ease of use,
generality and the ability to run virtually everywhere. And even though Spark is one of
the most sought-after tools for data engineers, data scientists can also benefit from Spark
when doing exploratory data analysis, feature extraction, supervised learning and model
evaluation. Spark SQL is a Spark module for structured data processing.
To support Python with Spark, the Apache Spark community released a tool called PySpark.
PySpark offers the PySpark shell, which links the Python API to the Spark core and initializes
the Spark context. Since the majority of data scientists and analytics experts today use
Python because of its rich library set, integrating Python with Spark is a boon.

To learn more about the Hadoop architecture (HDFS – Hadoop Distributed File System)
and Map Reduce algorithm, refer to the links and materials in the appendix section.

3. DDL DML DCL & TCL


DDL (Data Definition Language) consists of the SQL commands that can be used to
define the database schema. It deals with descriptions of the database schema
and is used to create and modify the structure of database objects in the database.
Examples of DDL commands:
CREATE – is used to create the database or its objects (like tables, indexes,
functions, views, stored procedures and triggers).
DROP – is used to delete objects from the database.
ALTER – is used to alter the structure of the database.
TRUNCATE – is used to remove all records from a table, including all spaces
allocated for the records.
COMMENT – is used to add comments to the data dictionary.
RENAME – is used to rename an object existing in the database.
DML (Data Manipulation Language): The SQL commands that deal with the
manipulation of data present in the database belong to DML, or Data Manipulation Language,
and this includes most SQL statements.
Examples of DML:
SELECT – is used to retrieve data from a database.
INSERT – is used to insert data into a table.
UPDATE – is used to update existing data within a table.
DELETE – is used to delete records from a database table.
DCL (Data Control Language): DCL includes commands such as GRANT and
REVOKE, which mainly deal with the rights, permissions and other controls of the
database system.
Examples of DCL commands:
GRANT – gives users access privileges to the database.
REVOKE – withdraws users' access privileges given with the GRANT
command.


TCL (Transaction Control Language): TCL commands deal with transactions within
the database.
Examples of TCL commands:
COMMIT – commits a transaction.
ROLLBACK – rolls back a transaction in case an error occurs.
SAVEPOINT – sets a save point within a transaction.
SET TRANSACTION – specifies characteristics for the transaction.
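
For illustration, here is a minimal transaction sketch; the Accounts table and the amounts are hypothetical, and the exact transaction syntax varies slightly across databases:

Eg: START TRANSACTION;
UPDATE Accounts SET Balance = Balance - 500 WHERE Account_ID = 1;
SAVEPOINT after_debit;
UPDATE Accounts SET Balance = Balance + 500 WHERE Account_ID = 2;
-- If the second update had failed, it could be undone with: ROLLBACK TO after_debit;
COMMIT;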
4. Tables, Fields & Records
A database table is composed of records and fields that hold data. Tables are also
called datasheets. Each table in a database holds data about a different, but related,
subject.

Eg: Complaints Table

Log ID Operator Resolved Duration


120137 CS1 Yes 211
120138 CS2 No 134
120139 CS1 No 89

Data is stored in records. A record is composed of fields and contains all the data about
one person, company, or item in a database. In this database, a record contains the data
for one customer support incident report. Records appear as rows in the database table,
and each record here is identified by its Log ID.

Eg: The row highlighted is one record of the table

Log ID Operator Resolved Duration


120137 CS1 Yes 211
120138 CS2 No 134
120139 CS1 No 89

A field is part of a record and contains a single piece of data for the subject of the
record. In the database table illustrated above, each record contains four fields which
are Log ID, Operator, Resolved & Duration. Fields appear as columns in a database
table.

5. Keys & Constraints


SQL constraints are used to specify rules for the data in a table. Constraints are used to
limit the type of data that can go into a table. This ensures the accuracy and reliability of
the data in the table. If there is any violation between the constraint and the data action,
the action is aborted. Constraints can be column level or table level. Column level
constraints apply to a column, and table level constraints apply to the whole table.

The following constraints are commonly used in SQL:


 NOT NULL - Ensures that a column cannot have a NULL value, which means
that you cannot insert a new record, or update a record without adding a value to
this field.

Eg: CREATE TABLE Persons (
    ID int NOT NULL,
    LastName varchar(255) NOT NULL,
    FirstName varchar(255) NOT NULL,
    Age int
);

 UNIQUE - Ensures that all values in a column are different
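
Eg (a sketch in the same style as the other examples; the inline UNIQUE syntax below is accepted by MySQL and most databases):

CREATE TABLE Persons (
    ID int NOT NULL UNIQUE,
    LastName varchar(255) NOT NULL,
    FirstName varchar(255),
    Age int
);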


 PRIMARY KEY - A combination of a NOT NULL and UNIQUE. Uniquely
identifies each row in a table

Eg: CREATE TABLE Persons (
    ID int NOT NULL,
    LastName varchar(255) NOT NULL,
    FirstName varchar(255),
    Age int,
    PRIMARY KEY (ID)
);
A table can have only one primary key, which may consist of a single field or multiple
fields. A Candidate Key is a set of one or more fields/columns that can identify a
record uniquely in a table. There can be multiple Candidate Keys in one table. Each
Candidate Key can work as the Primary Key, but only one Candidate Key can be the
Primary Key.

 FOREIGN KEY - Uniquely identifies a row/record in another table. It is a key
used to link two tables together. A FOREIGN KEY is a field (or collection of fields)
in one table that refers to the PRIMARY KEY in another table. The table
containing the foreign key is called the child table, and the table containing the
candidate key is called the referenced or parent table. An example is sketched below.
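
Eg (a minimal sketch assuming the Persons table defined above; the Orders table is illustrative):

CREATE TABLE Orders (
    OrderID int NOT NULL,
    OrderNumber int NOT NULL,
    PersonID int,
    PRIMARY KEY (OrderID),
    FOREIGN KEY (PersonID) REFERENCES Persons(ID)
);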

 CHECK - Ensures that all values in a column satisfy a specific condition. It is
used to limit the value range that can be placed in a column.

Eg: CREATE TABLE Persons (
    ID int NOT NULL,
    LastName varchar(255) NOT NULL,
    FirstName varchar(255),
    Age int,
    CHECK (Age>=18)
);

 DEFAULT - Sets a default value for a column when no value is specified


Eg: CREATE TABLE Orders (
    ID int NOT NULL,
    OrderNumber int NOT NULL,
    OrderDate date DEFAULT GETDATE()
);

 INDEX - Used to create and retrieve data from the database very quickly. The
CREATE INDEX statement is used to create indexes in tables. The users cannot
see the indexes; they are just used to speed up searches/queries.
Updating a table with indexes takes more time than updating a table without them
(because the indexes also need an update), so only create indexes on columns
that will be frequently searched against.

Eg: CREATE INDEX index_name
ON table_name (column1, column2, ...);

6. Normalization
Normalization is a database design technique which organizes tables in a manner that
reduces redundancy and dependency of data. It divides larger tables into smaller tables
and links them using relationships. The inventor of the relational model, Edgar Codd,
proposed the theory of normalization with the introduction of First Normal Form, and he
continued to extend the theory with Second and Third Normal Form. Later he joined with
Raymond F. Boyce to develop the theory of Boyce-Codd Normal Form.

1NF (First Normal Form) Rules

 Each table cell should contain a single value.


 Each record needs to be unique.

Eg: Assume a video library maintains a database of movies rented out. Without any
normalization, all information is stored in one table as shown below.

Membership ID   Full Name     Physical Address          Movies Rented              Category
1               Janet Jones   Street A, Plot 5          Pirates of the bay, Thor   Action, Action
2               Robert Phil   5th Street, Zone A        Rush Hour, Jurassic Park   Comedy, Action
3               Robert Phil   2nd Avenue, Ward Street   Avengers, Fast & Furious   Action, Action

The above table in 1NF is as shown below:

Membership ID   Full Name     Physical Address          Movies Rented        Category
1               Janet Jones   Street A, Plot 5          Pirates of the bay   Action
1               Janet Jones   Street A, Plot 5          Thor                 Action
2               Robert Phil   5th Street, Zone A        Rush Hour            Comedy
2               Robert Phil   5th Street, Zone A        Jurassic Park        Action
3               Robert Phil   2nd Avenue, Ward Street   Avengers             Action
3               Robert Phil   2nd Avenue, Ward Street   Fast & Furious       Action
2NF (Second Normal Form) Rules

 Rule 1- Be in 1NF
 Rule 2- there should be no partial dependency

The above 1NF table is at Membership ID & Movie rented level (primary key) and the
Physical Address column is only dependent on the membership ID and is not dependent
on both Membership ID & Movies Rented (Primary Key). Hence to get rid of this Partial
Dependency, the Physical Address column can be removed from the above table and
the above table can be divided into 2 tables as shown below

Membership ID Full Name Physical Address


1 Janet Jones Street A, Plot 5
2 Robert Phil 5th Street, Zone A
3 Robert Phil 2nd Avenue, Ward Street

Membership ID Movies Rented Category


1 Pirates of the bay Action
1 Thor Action
2 Rush Hour Comedy
2 Jurassic Park Action
3 Avengers Action
3 Fast & Furious Action
Table 1 contains membership information and Table 2 contains information on
movies rented. Membership ID is the Primary Key of Table 1 and the Foreign Key of
Table 2. A Foreign Key references the primary key of another table and helps connect your
tables.
A Foreign Key helps in maintaining referential integrity. For example, if a person tries to
insert a record into Table 2 with a Membership ID that is not present in Table 1 (the parent
table), then the database will throw an error, since Membership ID in Table 2 is declared as
a foreign key referencing Membership ID of Table 1.

3NF (Third Normal Form) Rules

 Rule 1- Be in 2NF
 Rule 2- Has no transitive functional dependencies

Membership ID   Full Name     Physical Address          Salutation
1               Janet Jones   Street A, Plot 5          Ms.
2               Robert Phil   5th Street, Zone A        Mr.
3               Robert Phil   2nd Avenue, Ward Street   Mr.

Consider the above 2NF table with another Salutation column. Now the Salutation
column is not dependent on the primary key (Membership ID) but it is dependent on
another non-key column which is Full Name. This is Transitive Dependency, when a
non-prime attribute depends on other non-prime attributes rather than depending upon
the prime attributes or primary key.

To move our 2NF table into 3NF and get rid of the Transitive Dependency, we need to
again divide our table as shown below.

Membership ID   Full Name     Physical Address          Salutation ID
1               Janet Jones   Street A, Plot 5          1
2               Robert Phil   5th Street, Zone A        2
3               Robert Phil   2nd Avenue, Ward Street   2

Membership ID Movies Rented Category


1 Pirates of the bay Action
1 Thor Action
2 Rush Hour Comedy
2 Jurassic Park Action
3 Avengers Action
3 Fast & Furious Action

Salutation ID Salutation
1 Ms.
2 Mr.
3 Mrs.
4 Dr.

We have again divided our tables and created a new table which stores Salutations.
There are no transitive functional dependencies, and hence our table is in 3NF. In Table
3, Salutation ID is the primary key, and in Table 1, Salutation ID is a foreign key referencing
the primary key of Table 3.

Now our little example is at a level that cannot further be decomposed to attain higher
forms of normalization. In fact, it is already in higher normalization forms. Separate
efforts for moving into next levels of normalizing data are normally needed in complex
databases. To learn about higher forms of Normalization (BCNF & 4NF) refer to the link -
https://www.studytonight.com/dbms/database-normalization.php
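
As an illustration, a minimal DDL sketch of the 3NF schema above could look like the following (the table and column names are chosen here for illustration and are not prescribed by the text):

CREATE TABLE Salutations (
    Salutation_ID int NOT NULL,
    Salutation varchar(10),
    PRIMARY KEY (Salutation_ID)
);

CREATE TABLE Members (
    Membership_ID int NOT NULL,
    Full_Name varchar(255),
    Physical_Address varchar(255),
    Salutation_ID int,
    PRIMARY KEY (Membership_ID),
    FOREIGN KEY (Salutation_ID) REFERENCES Salutations(Salutation_ID)
);

CREATE TABLE Movies_Rented (
    Membership_ID int NOT NULL,
    Movie_Rented varchar(255) NOT NULL,
    Category varchar(50),
    PRIMARY KEY (Membership_ID, Movie_Rented),
    FOREIGN KEY (Membership_ID) REFERENCES Members(Membership_ID)
);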

7. Basic Structure, Operations & Aggregate Functions


The basic syntax of the SQL ‘select’ statement is

SELECT column1,
column2
FROM table_name

- The SELECT query is used to select rows from a table of a database, which is
indicated using the FROM clause
- The asterisk (*) indicates all rows and all columns
- The columns to be picked or displayed are specified by listing out the column
names separated by commas
- SQL is not case sensitive and hence lower-case syntax can also be used for the
keywords
- The WHERE clause is used to apply conditions on rows, and multiple
conditions can be applied by using the keywords AND/OR. For better readability the
conditions can be enclosed in parentheses
- ORDER BY sorts the output in ascending or descending order of a column. The
default sort order is ascending; to sort in descending order the keyword
DESC is used.

Eg: Customer_Information_Table

Name Annual_Income ($) Gender Occupation


John Joe 100,000 Male Professor
Sarah Thomas 120,000 Female Dentist
Devin Martin 75,000 Male Junior Analyst
Chandler Martin 95,000 Male Database Admin
Jose Flores 140,000 Male Software Engineer

Query to pick records of male customers with income greater than 100,000
SELECT Name,
Annual_Income,
Gender,
Occupation
FROM Customer_Information_Table
WHERE Gender = ‘Male’ and Annual_Income > 100000
ORDER BY Annual_Income desc;

- The DISTINCT keyword identifies unique rows from a table and specifying
multiple column names with DISTINCT keyword results in selecting unique
combinations of all the columns from the table
- The INSERT INTO statement is used to insert new records in a table.
Syntax: INSERT  INTO table_name (column1, column2, column3, ...)
VALUES (value1, value2, value3, ...);
- The UPDATE statement is used to modify the existing records in a table

UPDATE table_name
SET column1 = value1, column2 = value2, ……….
WHERE condition;
- The DELETE statement is used to delete existing records in a table
DELETE FROM table_name WHERE condition;
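
For instance, a quick sketch against the Customer_Information_Table above (the inserted and modified values are illustrative):

INSERT INTO Customer_Information_Table (Name, Annual_Income, Gender, Occupation)
VALUES ('Maria Lopez', 110000, 'Female', 'Architect');

UPDATE Customer_Information_Table
SET Annual_Income = 80000
WHERE Name = 'Devin Martin';

DELETE FROM Customer_Information_Table
WHERE Name = 'Chandler Martin';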

An Aggregate Function allows you to perform a calculation on a set of values to return


a single scalar value. We often use aggregate functions with the GROUP BY and
HAVING clauses of the SELECT statement. The following are the most commonly
used SQL aggregate functions:

 AVG – calculates the average of a set of values


 COUNT – counts rows in a specified table or view
 MIN – gets the minimum value in a set of values
 MAX – gets the maximum value in a set of values
 SUM – calculates the sum of values

Eg: Write a query to obtain the number of customers per country in descending order
from the below customer_level table

Customer_ID   Customer Name   City        Country
1             Thomas Hardy    London      England
2             John Joe        Sydney      Australia
3             Jean Justin     Melbourne   Australia
4             James Sabre     Austin      USA
5             Jovin Jam       Mumbai      India
6             Jamie Lannis    Seattle     USA
7             Paul Hardy      Chicago     USA
8             Rahul David     New Delhi   India

SELECT country,
count(customer_id) as no_of_customers
FROM customer_level
GROUP BY country
ORDER BY count(customer_id) DESC;

The above query would yield the following result:

Country no_of_customers
USA 3
India 2
Australia 2
England 1

The resultant data is at Country level and it is obtained by rolling up the customer
level table, taking the count of the number of customers at country level. Similarly, any
table can be rolled up from one level to another by applying any of the aggregate
functions like count(), sum(), avg(), min(), max() and doing a GROUP BY.

Now, to filter on the rolled-up table, the HAVING clause can be used. The HAVING
clause was added to SQL because the WHERE keyword cannot be used with aggregate
functions on the rolled-up table.
Eg: To filter out for countries having more than 2 customers, the following query can be
written:
SELECT country,
count(customer_id) as no_of_customers
FROM customer_level
GROUP BY country
HAVING count(customer_id) > 2
ORDER BY count(customer_id) DESC;

The CASE statement goes through conditions and returns a value when the first condition
is met (like an IF-THEN-ELSE statement). So, once a condition is true, it will stop
reading and return the result. If no conditions are true, it returns the value in the ELSE
clause. If there is no ELSE part and no conditions are true, it returns NULL.
Syntax: CASE
    WHEN condition1 THEN result1
    WHEN condition2 THEN result2
    WHEN conditionN THEN resultN
    ELSE result
END;

Eg: SELECT OrderID, Quantity,


CASE
    WHEN Quantity > 30 THEN "The quantity is greater than 30"
    WHEN Quantity = 30 THEN "The quantity is 30"
    ELSE "The quantity is under 30"
END AS QuantityText
FROM OrderDetails;
The above Case When statement is used to display an additional text column called
QuantityText, whose value depends on the value of the column Quantity.

The UNION operator is used to combine the result-set of two or more SELECT
statements

 Each SELECT statement within UNION must have the same number of columns
 The columns must also have similar data types
 The columns in each SELECT statement must also be in the same order
Syntax: SELECT column_1, column_2, column_3
FROM table1
UNION
SELECT column_1,column_2,column_3
FROM table2
The UNION operator selects only distinct values by default. To allow duplicate
values, use UNION ALL.


The Coalesce function can be used for Null value treatment by creating another column
which replaces the Null value with a Non Null value from another column or any other
Non-Null value.
Eg: select coalesce(column_a, column_b, ‘string_value’) as columnA
from tableA
The above statement would yield the first encountered Non Null value under columnA.
The above coalesce statement can be written with case when also as shown below:
select case when column_a is not null then column_a
when column_a is null and column_b is not null then column_b
when column_a is null and column_b is null then ‘string_value’ end as columnA
from tableA

The LIMIT clause in a SELECT query sets a maximum number of rows for the result set.

Eg: Select ………. From table_name LIMIT 5


The above query would yield 5 rows from the table.
To obtain the Top rows, the LIMIT clause can be combined with the ORDER BY clause.
Eg: To select the student with the highest marks the following query can be used –
Select student_name from exam_table order by marks desc LIMIT 1
Wildcard characters are used with the SQL LIKE operator. The LIKE operator is used
in a WHERE clause to search for a specified pattern in column. The most widely used
wildcard character is the ‘%’ operator.
Eg: 1. where customer_name LIKE ‘a%’ finds any value under the customer_name
column that starts with “a”
2. where customer_name LIKE ‘%a’ finds any value under the column that ends with “a”
3. where customer_name LIKE ‘%a%’ finds any value under the column customer_name
that has an “a” in any position

QUERY ORDER of EXECUTION

i. from & joins

The FROM clause, and subsequent JOINs, are first executed to determine the total
working set of data that is being queried. This includes subqueries in this clause, and
can cause temporary tables to be created under the hood containing all the columns and
rows of the tables being joined.

ii. where

Once we have the total working set of data, the first-pass WHERE constraints are
applied to the individual rows, and rows that do not satisfy the constraint are discarded.
Each of the constraints can only access columns directly from the tables requested in
the FROM clause. Aliases in the SELECT part of the query are not accessible in most
databases since they may include expressions dependent on parts of the query that
have not yet executed.

iii. group by


The remaining rows after the WHERE constraints are applied are then grouped based on
common values in the column specified in the GROUP BY clause. Because of the
grouping, there will only be as many rows as there are unique values in that column.
Implicitly, this means that you should only need to use this when you have aggregate
functions in your query.

iv. having

If the query has a GROUP BY clause, then the constraints in the HAVING clause are
then applied to the grouped rows, discarding the grouped rows that don't satisfy the
constraint. Like the WHERE clause, aliases are also not accessible from this step in
most databases.

v. select

Any expressions in the SELECT part of the query are finally computed.

vi. distinct

Of the remaining rows, rows with duplicate values in the column marked
as DISTINCT will be discarded.

vii. order by

If an order is specified by the ORDER BY clause, the rows are then sorted by the
specified data in either ascending or descending order. Since all the expressions in
the SELECT part of the query have been computed, you can reference aliases in this
clause.

viii. limit/offset

Finally, the rows that fall outside the range specified by the LIMIT and OFFSET are
discarded, leaving the final set of rows to be returned from the query.
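
To tie the steps together, here is a hedged sketch reusing the customer_level table from the aggregate-functions example (the city filter is illustrative); the clauses are written in the usual order, but they execute roughly in the order described above:

SELECT country,
count(customer_id) as no_of_customers   -- v. select
FROM customer_level                     -- i. from & joins
WHERE city <> 'London'                  -- ii. where
GROUP BY country                        -- iii. group by
HAVING count(customer_id) > 1           -- iv. having
ORDER BY no_of_customers DESC           -- vii. order by
LIMIT 3;                                -- viii. limit/offset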

8. Subqueries
In SQL a Subquery can be simply defined as a query within another query. In other
words, a Subquery is a query that is embedded in the WHERE clause (or another clause) of
another SQL query.
Important rules for Subqueries:
 You can place the Subquery in a number of SQL
clauses: the WHERE clause, the HAVING clause and the FROM clause.
Subqueries can be used with SELECT, UPDATE, INSERT and DELETE statements
along with expression operators, such as the equality or comparison
operators =, >=, <= and the LIKE operator.
 A subquery is a query within another query. The outer query is called the main
query and the inner query is called the subquery.


 The subquery generally executes first, and its output is used to complete the
query condition for the main or outer query.
 Subquery must be enclosed in parentheses.
 Subqueries are on the right side of the comparison operator.
 ORDER BY command cannot be used in a Subquery. GROUP BY command can be
used to perform same function as ORDER BY command.
 Use single-row operators with single row Subqueries. Use multiple-row operators
with multiple-row Subqueries.

Eg: Write a query to display name, location, phone_number of students from master
table whose section is A
Master_table

Name Roll_No Location Phone_Number


Ram 101 Chennai 929389238923
Raj 102 Delhi 012931320938
Ravi 103 Mumbai 298342837492
Sumanth 104 Bangalore 938747237298

Student
Name Roll_No Section
Ram 101 A
Raj 102 B
Ravi 103 A
Sumanth 104 C

SELECT Name,
Roll_No,
Location,
Phone_Number
FROM Master_table
WHERE Roll_No IN (SELECT Roll_No FROM Student WHERE Section = ‘A’);
First the subquery “SELECT Roll_No FROM Student WHERE Section = ‘A’” executes and
returns the Roll_No values from the Student table whose Section is ‘A’. Then the outer query
executes and returns the Name, Location and Phone_Number from the Master_table
for the students whose Roll_No was returned by the inner subquery.
Output:
Name Roll_No Location Phone_Number
Ram 101 Chennai 929389238923
Ravi 103 Mumbai 298342837492

9. JOINS
A SQL Join statement is used to combine data or rows from two or more tables based
on a common field between them. Different types of Joins are:
 INNER JOIN
 LEFT JOIN


 RIGHT JOIN
 FULL JOIN
The INNER JOIN keyword selects all rows from both tables where the join condition is satisfied.
This keyword creates the result-set by combining all rows from both tables where
the condition is satisfied, i.e. the value of the common field is the same.
Syntax: SELECT tableA.column1,tableA.column2,tableB.column1
FROM tableA
INNER JOIN tableB
ON tableA.matching_column = tableB.matching_column;

tableA: First table.
tableB: Second table
matching_column: Column common to both the tables.

We can also write JOIN instead of INNER JOIN; JOIN is the same as INNER JOIN.
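
As a quick illustration (a sketch reusing the Master_table and Student tables from the Subqueries section), the subquery example above could equivalently be written as an inner join:

SELECT m.Name,
m.Roll_No,
m.Location,
m.Phone_Number
FROM Master_table m
INNER JOIN Student s
ON m.Roll_No = s.Roll_No
WHERE s.Section = 'A';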

LEFT JOIN returns all the rows of the table on the left side of the join and the matching rows
for the table on the right side of the join. For rows that have no matching row on the
right side, the result-set will contain NULL. LEFT JOIN is also known as LEFT OUTER
JOIN.
Syntax: SELECT tableA.column1,tableA.column2,tableB.column1
FROM tableA
LEFT JOIN tableB
ON tableA.matching_column = tableB.matching_column;

tableA: First table.
tableB: Second table
matching_column: Column common to both the tables


RIGHT JOIN is like LEFT JOIN. This join returns all the rows of the table on the right
side of the join and the matching rows for the table on the left side of the join. For rows
that have no matching row on the left side, the result-set will contain NULL. RIGHT JOIN
is also known as RIGHT OUTER JOIN.
Syntax: SELECT tableA.column1,tableA.column2,tableB.column1
FROM tableA
RIGHT JOIN tableB
ON tableA.matching_column = tableB.matching_column;

tableA: First table.
tableB: Second table
matching_column: Column common to both the tables

FULL JOIN creates the result-set by combining the results of both LEFT JOIN and RIGHT
JOIN. The result-set will contain all the rows from both tables. For rows that have no
match, the result-set will contain NULL values.
Syntax: SELECT tableA.column1,tableA.column2,tableB.column1
FROM tableA
FULL JOIN tableB
ON tableA.matching_column = tableB.matching_column;

tableA: First table.
tableB: Second table
matching_column: Column common to both the tables


For examples on the working of Joins refer to the link: http://www.sql-join.com/sql-join-types

CROSS JOIN creates the result-set which is the number of rows in the first table
multiplied by the number of rows in the second table. This kind of result is called the
Cartesian Product, where each row from 1st table joins with all the rows of another table.
If 1st table contains x rows and 2nd table contains y rows then the resultant cross joined
table contains x*y rows
Syntax: SELECT *
FROM tableA
CROSS JOIN tableB;
Eg: Table_1 & Table_2
Item_ID Item_Name Item_Unit Company_ID
1 Pot Rice Pcs 122
2 Cheese Mix Pcs 125

Company_ID Company_Name Company_City


123 Jack Hills Boston
124 J&J London
SELECT Table_1.*,
Table_2.Company_Name, Table_2.Company_City
FROM Table_1
CROSS JOIN Table_2
Output of the above query will be:
Item_ID   Item_Name    Item_Unit   Company_ID   Company_Name   Company_City
1         Pot Rice     Pcs         122          Jack Hills     Boston
2         Cheese Mix   Pcs         125          Jack Hills     Boston
1         Pot Rice     Pcs         122          J&J            London
2         Cheese Mix   Pcs         125          J&J            London

Module 1 Practice Questions:


a. Using the cdc hive month-ASIN level table
vn5018r.P1M_Amz_Top_100K_All_Exclusions_consolidated identify the reporting_level_4 with
the highest number of ASINs over all the months combined (ASIN is an Amazon Item and the
above table is at ASIN level and reporting_level_4 is the subcategory that contains the ASINs.
reporting_level_4 is subcategory, reporting_level_3 is category, reporting_level_2 is
department, reporting_level_1 is super department and reporting_level_0 is division)
b. Identify the month for which the reporting_level_4 identified above has the highest number of
ASINs


c. Using the same table, for each month, identify the reporting_level_0 that has the highest
number of reporting_level_4’s under it (Refer to Module 2 topics – CTE & Window Functions to
solve this question) and obtain the output table at month-reporting_level_0 level having the
reporting_level_0s with the highest number of reporting_level_4s.
d. Below is the table with students and their grades in different topics. Convert this 1NF table to
3NF

UnitID StudentID Date TutorID Topic Room Grade Book TutEmail


U1 St1 23.02.03 Tut1 GMT 629 4.7 Deumlich [email protected]
U2 St1 18.11.02 Tut3 Gin 631 5.1 Zehnder [email protected]
U1 St4 23.02.03 Tut1 GMT 629 4.3 Deumlich [email protected]
U5 St2 05.05.03 Tut3 PhF 632 4.9 Dummlers [email protected]
U4 St2 04.07.03 Tut5 AVQ 621 5.0 SwissTopo [email protected]

e. Write a SQL statement to make a list with order no, purchase amount, customer name and their
cities for those orders whose order amount is between 500 and 2000. Use the below 2 tables and
illustrate the output table
orders: tableA
Order_no Purch_amt Ord_date Customer_id Salesman_id
70001 150.5 2012-10-05 3005 5002
70009 270.65 2012-09-10 3001 5005
70002 65.26 2012-10-05 3002 5001
70004 110.5 2012-08-17 3009 5003
70007 948.5 2012-09-10 3005 5002
70005 2400.6 2012-07-27 3007 5001
70008 5760 2012-09-10 3002 5001
70010 1983.43 2012-10-10 3004 5006
70003 2480.4 2012-10-10 3009 5003
70012 250.45 2012-06-27 3008 5002
70011 75.29 2012-08-17 3003 5007
70013 3045.6 2012-04-25 3006 5001

Customers: tableB
Customer_id Cust_name City Grade Salesman_id
3002 Nick Rimando New York 100 5001
3005 Graham Zusi California 200 5002
3001 Brad Guzan London 300 5005
3004 Fabian Johns Paris 300 5006
3007 Brad Davis New York 200 5001
3009 Geoff Camero Berlin 100 5003
3008 Julian Green London 300 5002
3003 Jozy Altidor Moscow 200 5007

f. Using the table vn5018r.deliver_date_item_output_format_dec18 calculate the deliver it


percentage as sum(total_deliver_it_units)/sum(total_units). Calculate the percentage for
each of the reporting_level_1 category at month- reporting_level_0- reporting_level_1 level.


g. Write a query in SQL to display those employees whose first name contains the letter ‘z’,
and display their department and city using the below tables. Also illustrate the output table
Departments: tableA
Department_ID Department_Name Location_ID
10 Administration 1700
20 Marketing 1800
30 Purchasing 1700
40 Human Resources 2400
50 Shipping 1500
60 IT 1400
70 Public Relations 2700
Employees: tableB
Employee_ID First_Name Department_ID
100 Zack 10
101 Zohan 10
102 Jim 20
103 Jill 30
104 Jejo 30
105 Zaakir 40
106 Yacob 50
Locations: tableC
Location_ID City
1700 Venice
1800 Rome
1900 Tokyo
2000 London
2100 New York
2200 Paris
2300 Beijing

h. Convert the item level (ASIN level) table vn5018r.p1m_100k_final_dec18 to date-item level by
using the dates present in the table vn5018r.evaluation_dates_dec18 such that every item is
present across every date. Save the result into a table.

i. For all the date – asin combinations (date-asin level table) created in the above table obtain
the instock and publish flags and have_it flag from the date-item level flags table
vn5018r.top_100K_instock_published_dec18 and create a 1/0 flag column called
not_in_catelogue to tag all the rows that do not obtain any flag from the flags table. The join
key would be catlg_item_id & calendar_date.

j. Create the same date-asin level table with flags as created above (question i.) for another new list
using these three, new list tables vn5018r.p1m_100k_final_dec18_unadj (asin level table),
vn5018r.evaluation_dates_dec18_unadj (dates table) and
vn5018r.top_100K_instock_published_dec18_unadj (item-date level flags), using the same join key.
Once created, unify this resultant date-asin level flags table with the above (question i.) date-asin
level flags table with a flag to denote if the row is part of the original list or new list.


Module 2: SQL INTERMEDIATE


1. Views
Views in SQL are kind of virtual tables. A view also has rows and columns as they are in
a real table in the database. We can create a view by selecting fields from one or more
tables present in the database. A View can either have all the rows of a table or specific
rows based on certain condition. In this article we will learn about creating, deleting and
updating Views.
A View provides the following benefits:

 Views can hide complexity - If you have a query that requires joining several
tables, or has complex logic or calculations, you can code all that logic into a
view, then select from the view just like you would a table.
 Views can be used as a security mechanism - A view can select certain columns
and/or rows from a table, and permissions set on the view instead of the
underlying tables. This allows surfacing only the data that a user needs to see.
 Views can simplify supporting legacy code - If you need to refactor a table that
would break a lot of code, you can replace the table with a view of the same
name. The view provides the exact same schema as the original table, while the
actual schema has changed. This keeps the legacy code that references the
table from breaking, allowing you to change the legacy code at your leisure.

Syntax: CREATE VIEW view_name AS


SELECT column1, column2
FROM table_name
WHERE condition;
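
For instance, a small sketch building on the customer_level table from Module 1 (the view name is illustrative):

Eg: CREATE VIEW customers_per_country AS
SELECT country,
count(customer_id) as no_of_customers
FROM customer_level
GROUP BY country;

SELECT * FROM customers_per_country WHERE no_of_customers > 1;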

2. Indexes
An index is a schema object. It is used by the server to speed up the retrieval of rows by
using a pointer. It can reduce disk I/O (input/output) by using a rapid path access method
to locate data quickly. An index helps to speed up SELECT queries and WHERE clauses, but
it slows down data input with UPDATE and INSERT statements. Indexes can be
created or dropped with no effect on the data.

Syntax: CREATE INDEX index_name
ON table_name (column_name);

Indexes are created in the following cases:


- A column contains a wide range of values


- A column does not contain many null values
- One or more columns are frequently used together in a where clause or a join
condition

Indexes should be avoided if either the table is small, or the columns are not used often
in the query or if the column is updated frequently

An Index can be dropped using DROP command. Syntax: DROP INDEX index_name.

To learn more about Clustered and Non-Clustered Indexes refer to
https://docs.microsoft.com/en-us/sql/relational-databases/indexes/clustered-and-nonclustered-indexes-described?view=sql-server-2017

3. Window Functions
Window functions operate on a set of rows and return a single value for each row from
the underlying query. The term window describes the set of rows on which the function
operates. When you use a window function in a query, define the window using the
OVER() clause. The OVER() clause (window definition) differentiates window functions
from other analytical and reporting functions. A query can include multiple window
functions with the same or different window definitions.

The OVER() clause has the following capabilities:

 Defines window partitions to form groups of rows. (PARTITION BY clause), i.e.


the PARTITION BY clause subdivides the window into partitions.
 Orders rows within a partition. (ORDER BY clause) and the ORDER BY clause
defines the logical order of the rows within each partition of the result set.

The AVG() window function operates on the rows defined in the window and returns a
value for each row.

Eg: select emp_name, dealer_id, sales, avg(sales) over() as avg_sales from q1_sales;

Emp_Name Dealer_ID Sales Avg_Sales


Beverly Lang 1 1234 4589.42
George 3 3651 4589.42
John 2 4233 4589.42
Alex 4 6231 4589.42
Alexander 1 5672 4589.42
Shivam 3 7891 4589.42
Smith 2 3214 4589.42


The PARTITION BY and ORDER BY clauses can also be applied to the
above query. Window functions are applied to the rows within each partition and sorted
according to the order specification.

The following query uses the AVG() window function with the PARTITION BY clause to
determine the average car sales for each dealer in Q1:

select emp_name, dealer_id, sales, avg(sales) over(partition by dealer_id) as avg_sales


from q1_sales;

Emp_Name       Dealer_ID   Sales   Avg_Sales
Beverly Lang   1           1234    3453
Alexander      1           5672    3453
John           2           4233    3723.5
Smith          2           3214    3723.5
George         3           3651    5771
Shivam         3           7891    5771
Alex           4           6231    6231

To rank each of the employee within a dealer based on sales, the row_number() function
can be used and partitioned by dealer_id and ordered by sales. Eg: row_number()
over(partition by dealer_id order by sales desc) as rank.
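
A full sketch of such a ranking query over the same q1_sales table could look like this (the alias sales_rank is illustrative):

select emp_name,
dealer_id,
sales,
row_number() over(partition by dealer_id order by sales desc) as sales_rank
from q1_sales;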

The different types of window functions are:


- Value based: first_value(), lag(), last_value(), lead()
- Aggregate: avg(), count(), max(), min(), sum()
- Ranking: cume_dist(), dense_rank(), ntile(), percent_rank(), rank(), row_number()

4. Common Table Expressions


Common Table Expressions (CTEs) were introduced into standard SQL to simplify
various classes of SQL queries for which a derived table was unsuitable. Introduced
in SQL Server 2005, the common table expression (CTE) is a temporary
named result set that you can reference within a SELECT, INSERT, UPDATE, or
DELETE statement. You can also use a CTE when you CREATE a view, as part of the view’s
SELECT query. In addition, as of SQL Server 2008, you can add a CTE to the new
MERGE statement. CTEs are written using the WITH clause,
Syntax: WITH cte_name as
(
Select ……….,…
),


cte_name2 as
(
Select …………………..…..
)
Select * from cte_name
UNION ALL
Select * from cte_name2;

2 CTEs can be written one after the other with comma separation.

The following guidelines apply to common table expressions:

 A CTE must be followed by a single SELECT, INSERT, UPDATE, or DELETE


statement that references some or all the CTE columns. A CTE can also be
specified in a CREATE VIEW statement as part of the defining SELECT statement
of the view.

 Multiple CTE query definitions can be defined in a nonrecursive CTE. The


definitions must be combined by one of these set operators: UNION ALL, UNION,
INTERSECT, or EXCEPT.

 A CTE can reference itself and previously defined CTEs in the same WITH
clause. Forward referencing is not allowed.

 Specifying more than one WITH clause in a CTE is not allowed. For example, if
a CTE_query_definition contains a subquery, that subquery cannot contain a
nested WITH clause that defines another CTE.

 The following clauses cannot be used in the CTE_query_definition:

o ORDER BY (except when a TOP clause is specified)

o INTO

o OPTION clause with query hints

o FOR BROWSE

 When a CTE is used in a statement that is part of a batch, the statement before it
must be followed by a semicolon.

 A query referencing a CTE can be used to define a cursor.

 Tables on remote servers can be referenced in the CTE.


 When executing a CTE, any hints that reference a CTE may conflict with other
hints that are discovered when the CTE accesses its underlying tables, in the
same manner as hints that reference views in queries. When this occurs, the query
returns an error.
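
As an illustration, a hedged sketch combining a CTE with the q1_sales table from the Window Functions section (the CTE name and the 4000 threshold are illustrative):

WITH dealer_sales as
(
SELECT dealer_id,
avg(sales) as avg_sales
FROM q1_sales
GROUP BY dealer_id
)
SELECT dealer_id,
avg_sales
FROM dealer_sales
WHERE avg_sales > 4000;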

5. Grouping Sets
The result set returned by GROUPING SET is the union of the aggregates based on the
columns specified in each set in the Grouping set.
Whenever an aggregate is required, the GROUP BY clause is the usual solution.
There can be a requirement to get these aggregates based on different sets of columns in
the same result set. We can get the same result using the UNION operator with different
queries, but using multiple queries with the UNION operator is not the optimal way to
achieve this and results in longer query execution times. This can be simplified and
optimized by using Grouping Sets.

For example, the below query would yield a unified table displaying the sum of units sold
at year level and year-month level
Eg: select month,
year,
sum(units_sold)
from sales_table
group by
year,
month

Union All

Select ‘ ‘ as month,
year,
sum(units_sold)
from sales_table
group by
year

The above approach of obtaining the data at multiple levels by using Union All operator
is not the optimum way of doing it and hence the same result can be obtained through
an optimized approach, using grouping sets. The below query would produce the sum of
units sold at month – year level because of the grouping set (month, year) and at year
level because of the grouping set (year).

Eg: Select year,


month,
sum(units_sold)
from sales_table
group by
month,
year
grouping sets
(
(month, year),
(year)
)

This is much more optimized than the use of 2 queries with a union all.

Module 2 Practice Questions:


a. Rank the salaries of all the employees within each department using the table below
and then pick the top ranked (highest salaried) employee within each department. Write
a single query for the same and show the output table

employee_id full_name department salary


100 Mary Johns Sales 1000
101 Sean Moldy IT 1500
102 Peter Dugan Sales 2000
103 Lilian Penn Sales 1700
104 Milton Kowarsky IT 1800
105 Mareen Bisset Accounts 1200
106 Airton Graue Accounts 1100
107 John Joe Sales 1100
108 Cherry Quir IT 1600
109 Jijo James Sales 2100
110 Jean Justin Sales 1800
111 Paul Chris IT 1900
112 Samuel Jackson Accounts 1300
113 Jovin Jolly Accounts 1200

b. Calculate the Deliver IT % as ratio of sum of total_deliver_it_units to total_units using


the cdc table vn5018r.deliver_date_item_output_format_dec18. Calculate the
percentage at all levels of product hierarchy (From overall level till reporting_level_4
level) and produce a single resultant table. Obtain the resultant table using the Union All
method as well as grouping sets method and identify the difference between the 2
methods.

c. Convert the cdc table vn5018r.have_date_item_level_sku_corr_output_format_dec18


which is at asin-date level to asin level by picking only the latest instance of each asin
based on the date. Implement in a single query through CTE and window function.
Window Function can be used to rank all the date instances of each ASIN. Check the
level of the original source table and resultant ASIN level table and verify.

d. From the below yearly employee sales table, obtain the total sales for each year
without rolling up the table to year level and by using window functions. Write down the
resultant output table.


year sales_employee sale_amount


2016 John 350
2016 David 425
2017 Melwin 225
2017 George 570
2017 Jack 325
2018 James 260
2018 Jill 780
e. From the Month – ASIN level cdc table
vn5018r.P1M_Amz_Top_100K_All_Exclusions_consolidated obtain the monthly
reporting_level_1 (L1) hierarchy level % distribution of ASINs by writing a single query
using CTEs. The % distribution of an L1 is the number of ASINs mapped under an L1 for
a given month divided by the total number of ASINs present in that month.

f. From the below table obtain the total units sold at month-year-company level, at year-
company level and at company level using grouping sets. On the rolled-up table, rank
the year-company combinations and the companies based on the units sold, with a flag
indicating the level which is ranked. Write the resultant output table.

company_name year month day units_sold
X 2016 January Sunday 4200
X 2016 January Friday 3250
X 2016 January Saturday 2425
X 2016 February Tuesday 1450
X 2016 February Saturday 6300
X 2017 January Thursday 4300
X 2017 January Monday 1350
X 2017 February Wednesday 1000
X 2017 February Sunday 4700
X 2017 February Tuesday 1800
Y 2016 January Sunday 4230
Y 2016 January Friday 3251
Y 2016 January Saturday 2426
Y 2016 February Tuesday 1451
Y 2016 February Saturday 6301
Y 2017 January Thursday 4301
Y 2017 January Monday 1351
Y 2017 February Wednesday 1001
Y 2017 February Sunday 4701
Y 2017 February Tuesday 1801

Practice Questions Answer Key:


SQL Practice Questions ANSWER KEY.docx

Module 3: SQL ADVANCED (Optional)

- Cursors
o https://www.c-sharpcorner.com/UploadFile/f0b2ed/cursors-in-sql/
o https://www.tutorialspoint.com/plsql/plsql_cursors.htm

- Triggers
o https://www.geeksforgeeks.org/sql-trigger-student-database/
o https://www.tutorialspoint.com/plsql/plsql_triggers.htm

- Stored Procedures
o https://www.w3schools.com/sql/sql_stored_procedures.asp

Mu Sigma Confidential 30
Document Header

o https://www.tutorialspoint.com/t_sql/t_sql_stored_procedures.
htm
o https://www.c-sharpcorner.com/article/how-to-create-a-stored-
procedure-in-sql-server-management-studio/

Appendix
Hadoop & HDFS:
HDFS Infrastructure - https://www.youtube.com/watch?v=DLutRT6K2rM&t=1185s

MapReduce - https://www.youtube.com/watch?v=6OemZEJdMp8

Introduction to MapReduce class 2 (attached): 04_IntroductionToMapReduce.pdf

Linux and HDFS commands

Unix and HDFS Commands.sql (attached)

HDFS commands (in order):
1. https://www.youtube.com/watch?v=3DA1grSp4mU
2. https://www.youtube.com/watch?v=oBqju4ZkD58
3. https://www.youtube.com/watch?v=WLOevQgoZo4
4. https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/FileSystemShell.html

Hadoop-for-dumies_Dirk.deRoos-Paul.C_2014.pdf (attached)

You can refer to the Hadoop for Dummies book attached above to learn Hadoop in more detail.

1_IntroductionToBigDataHadoop.pdf (attached)

MapReduce & YARN - https://www.youtube.com/watch?v=-Z_PayGTrE0

YARN failure cases:
https://stackoverflow.com/questions/30694747/how-container-failure-is-handled-for-a-yarn-mapreduce-job
http://timepasstechies.com/handling-failures-hadoopmapreduce-yarn/

Steps on setting up a BFD Jupyter Notebook with PySpark on a new VM

https://confluence.walmart.com/display/ADTECH/Setup+BFD+Jupyter+with+PySpark#SetupBFDJupyterwithPySpark-Step-by-stepguide(IfyouhaveroleaccountwithBFDwriteaccess)
Refer to this Confluence page for steps on how to set up the Jupyter Notebook on a VM.

Once BFD Jupyter has been set up, the Jupyter Notebook can be started in nohup bash mode by running the following commands:
- Open cdc00 or cdc01 in PuTTY, log in with the user ID on whose VM the Jupyter Notebook was set up, and run the commands below
- cd bfd-jupyter
- nohup bash ./start.sh > jupyter_log.txt &
- cat jupyter_log.txt
These commands start a Jupyter Notebook and write a link containing the port number and token to the log; that link can be used to open Jupyter in the browser. The link will be of the format https://cdc-main-client01.bfd.walmart.com:42424/?token=e31a4e26513ae9bbd5c679d467ee05d6cec6bb3ab1df613d
Once Jupyter has been opened, a new notebook can be created and PySpark, HQL and pandas can be initialized through the code below, after which SQL can be run in Spark or in Hive using the spark.sql or execute_hql functions respectively. PySpark is the Spark Python API, which exposes the Spark programming model to Python.

The hql_load.py code used to initialize the HQL settings is attached:

hql_load.py

The hql_load.py file should be placed in the same folder where the notebooks are run. The spark.sql() function can be used to run SQL in Spark and handles data with up to around 100 million rows. When the data is much larger than that, the query will fail in Spark and must instead be run in Hive using the execute_hql function; the syntax for both is shown below.

Code to initialize PySpark (this must be placed in the first code cell of the notebook and run before anything else):
#connect to spark
import sys, os, time
#import pandas

username = os.environ.get('USER')
username_hive = username.replace("-", "_")

from os.path import expanduser, join

from pyspark.sql import SparkSession
from pyspark.sql import Row

# warehouse_location points to the default location for managed databases and tables
warehouse_location = 'spark-warehouse'

spark = SparkSession \
    .builder \
    .appName("WMT_Deliver_IT_Data_Revamp_New_with_new_pre_order_logic_inc_del_ts") \
    .master("yarn-client") \
    .config("spark.driver.allowMultipleContexts", "true") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.initialExecutors", "50") \
    .config("spark.dynamicAllocation.minExecutors", "50") \
    .config("spark.executor.memory", "32g") \
    .config("spark.driver.memory", "32g") \
    .config("spark.cores.max", 64) \
    .config("spark.shuffle.service.enabled", "true") \
    .config("spark.rdd.compress", "true") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer", "128k") \
    .config("spark.kryoserializer.buffer.max", "2047m") \
    .config("spark.executor.userClassPathFirst", "false") \
    .config("spark.streaming.unpersist", "true") \
    .enableHiveSupport() \
    .getOrCreate()

Settings to initialize pandas & HQL (run after Spark has been initialized; this environment runs Python 2, hence the print statements and execfile below):
import sys
print "Starting program, path: %s" % (sys.path)  # Make sure to print to stderr instead if you use this in a custom mapper or reducer!

# Activate the virtual environment
activate_this = '/usr/local/bfd-virtualenvs/bfd_virtualenv_20171004_27/bin/activate_this.py'
execfile(activate_this, dict(__file__=activate_this))

print "Activation complete, path: %s" % (sys.path)  # Make sure to print to stderr instead if you use this in a custom mapper or reducer!

%run hql_load.py
print "HQL loaded"

Syntax to run queries in HQL:
query = """
CREATE TABLE table_name STORED as ORC as
select ………..
from …….
………..
"""
execute_hql(hql_statement = query)
print("DONE")

Syntax to run queries in Spark:
query = """
select ……………
from ………….
……….
"""
spark.sql(query).write.saveAsTable(table_name, format = 'ORC', mode = 'overwrite')
print("DONE")


Syntax to save the result of a query into a CSV:
spark.sql(query).toPandas().to_csv("CSV_Name.csv", encoding='utf-8', index=False)
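
Putting the three snippets together, a minimal end-to-end sketch might look like the following (the database, table and column names here are hypothetical placeholders, not project tables):

# Hypothetical example: build a summary table in Hive, then export a slice of it with Spark
query = """
CREATE TABLE sandbox_db.monthly_units_summary STORED as ORC as
select year, month, sum(units_sold) as total_units
from sandbox_db.daily_units
group by year, month
"""
execute_hql(hql_statement = query)

export_query = """
select year, month, total_units
from sandbox_db.monthly_units_summary
"""
spark.sql(export_query).toPandas().to_csv("monthly_units_summary.csv", encoding='utf-8', index=False)
print("DONE")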

Jupyter Notebooks
According to Project Jupyter, the Jupyter Notebook, formerly known as the IPython Notebook, is an open-source web application that allows users to create and share documents that contain live code, equations, visualizations, and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. The name Jupyter is a loose acronym for Julia, Python, and R, but today Jupyter supports many programming languages. Interest in Jupyter Notebooks has grown dramatically.

Apache Spark
According to Apache, Spark is a unified analytics engine for large-scale data processing,
used by well-known, modern enterprises, such as Netflix, Yahoo, and eBay. With speeds
up to 100x faster than Hadoop, Apache Spark achieves high performance for static,
batch, and streaming data, using a state-of-the-art DAG (Directed Acyclic Graph)
scheduler, a query optimizer, and a physical execution engine. Spark’s polyglot
programming model allows users to write applications quickly in Scala, Java, Python, R,
and SQL. Spark includes libraries for Spark SQL (DataFrames and Datasets), MLlib
(Machine Learning), GraphX (Graph Processing), and DStreams (Spark Streaming). You
can run Spark using its standalone cluster mode, on Amazon EC2, Apache Hadoop
YARN, Mesos, or Kubernetes.

PySpark
The Spark Python API, PySpark, exposes the Spark programming model to Python.
PySpark is built on top of Spark’s Java API. Data is processed in Python and cached
and shuffled in the JVM. According to Apache, Py4J enables Python programs running
in a Python interpreter to dynamically access Java objects in a JVM.
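
As a minimal illustration of that model (using a couple of rows from the employee practice table above; the temp-view name is just a placeholder), a PySpark snippet in the notebook could look like this, with Python defining the data while the processing runs in the JVM:

# Hedged sketch: a tiny in-memory DataFrame queried through spark.sql
rows = [(100, "Mary Johns", "Sales", 1000), (101, "Sean Moldy", "IT", 1500), (102, "Peter Dugan", "Sales", 2000)]
df = spark.createDataFrame(rows, ["employee_id", "full_name", "department", "salary"])
df.createOrReplaceTempView("employees_tmp")  # register the DataFrame so SQL can see it
spark.sql("select department, max(salary) as max_salary from employees_tmp group by department").show()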

REFERENCES
1. https://www.tutorialspoint.com/dbms/sql_overview.htm
2. https://www.tutorialspoint.com/sql/sql-overview.htm
3. https://www.softwaretestingmaterial.com/sql-tutorial-sql-overview/
4. https://blog.matthewrathbone.com/2015/12/08/hive-vs-mysql.html
5. https://intellipaat.com/tutorial/hadoop-tutorial/mapreduce-yarn/
6. https://www.guru99.com/difference-dbms-vs-rdbms.html
7. https://www.mssqltips.com/sqlservertip/3132/big-data-basics--part-1--introduction-to-big-data/
8. https://www.mssqltips.com/sqlservertip/3140/big-data-basics--part-3--overview-of-hadoop/
9. https://www.dezyre.com/article/mapreduce-vs-pig-vs-hive/163
10. https://www.wisdomjobs.com/e-university/hadoop-tutorial-484/hdfs-concepts-14768.html
11. https://www.datacamp.com/community/tutorials/apache-spark-python
12. https://spark.apache.org/docs/2.3.0/sql-programming-guide.html
13. https://spark.apache.org/docs/2.3.0/
14. https://www.tutorialspoint.com/pyspark/pyspark_environment_setup.htm
15. https://www.geeksforgeeks.org/sql-ddl-dml-dcl-tcl-commands/
16. https://www.cengage.com/school/corpview/RegularFeatures/DatabaseTutorial/db_elements/db_elements2.htm
17. https://www.guru99.com/database-normalization.html
18. https://www.studytonight.com/dbms/third-normal-form.php
19. https://www.studytonight.com/dbms/second-normal-form.php
20. https://launchschool.com/books/sql_first_edition/read/constraints
21. https://www.w3schools.com/sql/sql_create_index.asp
22. https://www.geeksforgeeks.org/sql-sub-queries/
23. http://www.sql-join.com/sql-join-types
24. https://www.geeksforgeeks.org/sql-join-set-1-inner-left-right-and-full-joins/
25. https://stackoverflow.com/questions/1278521/why-do-you-create-a-view-in-a-database
26. https://www.geeksforgeeks.org/sql-views/
27. https://www.geeksforgeeks.org/sql-indexes/
28. https://drill.apache.org/docs/sql-window-functions-introduction/
29. https://www.w3resource.com/sql-exercises/joins-hr/index.php
30. https://drill.apache.org/docs/sql-window-functions-introduction/
31. https://www.geeksforgeeks.org/cte-in-sql/
32. https://docs.microsoft.com/en-us/sql/t-sql/queries/with-common-table-expression-transact-sql?view=sql-server-2017
33. https://blogs.msdn.microsoft.com/sreekarm/2008/12/28/grouping-sets-in-sql-server-2008/
34. https://spark.apache.org/docs/0.9.0/python-programming-guide.html
35. https://towardsdatascience.com/a-brief-introduction-to-pyspark-ff4284701873
36. https://medium.com/@GaryStafford/getting-started-with-pyspark-for-big-data-analytics-using-jupyter-notebooks-and-docker-ba39d2e3d6c7
37. https://sqlbolt.com/lesson/select_queries_order_of_execution