1.1 Lecture Slides Python and Tableau - The Complete Data Analytics Bootcamp
Python and Tableau: The Complete Data Analytics Bootcamp!
A Data Analyst's Toolkit
When you become a Data Analyst, there are two tools that you should be
skilled at: Python and Tableau.
Python has become one of the most popular programming languages in the world in recent
years.
It's used in everything from machine learning to building websites and software testing. It can
be used by developers and non-developers alike. Python is commonly used for developing
websites and software, task automation, data analysis, and data visualization.
What is Tableau?
Tableau is a Business Intelligence tool for visually analyzing data.
Users can create and distribute interactive, shareable dashboards that depict the
trends, variations, and density of the data in the form of graphs and charts.
Tableau can connect to files, relational databases and Big Data sources to acquire and process data.
The software allows data blending and real-time collaboration.
Tableau Features
Tableau supports powerful data discovery and exploration, enabling users to answer
important questions in seconds.
No prior programming knowledge is needed; users without relevant experience can start
creating visualizations in Tableau immediately.
It can connect to several data sources that other BI tools do not support. Tableau enables
users to create reports by joining and blending different datasets.
In this course we are using Tableau Public, the free version of Tableau Desktop. It has most
features of Tableau Desktop, except that Tableau Public can only connect to flat files and can
only publish online (we can't save our work locally on our computer; it has to be saved online).
What Next
We've set up Python
Download Anaconda on Windows/Mac
Use Spyder
Project Brief.pdf
Lecture Slides.pdf
You only need to install pandas once; after that, you don't need to install it again. Now that this
is done, we can go back to Spyder and import pandas.
Investigating Variables
String: These are characters or a mix of characters and numbers
Int: These are whole numbers
Float: These are decimals
List: A collection of items. You can change a list; for example, if I want to change
pear to banana, I can with Python.
Tuple: A collection of items, but you can't change the items in a tuple. So if I want to
change pear to banana in a tuple, I'm not able to. I'll have to create a new tuple.
Range: A sequence of numbers. For example, range(10) starts at 0 and stops before 10,
so it covers 0 through 9. Likewise, range(2,9) starts at 2 and stops before 9, so it
covers 2 through 8.
Dictionary: Dictionaries consist of pairs of keys and their corresponding values.
Set: Sets store unordered values. Unlike Tuples and Lists, Sets cannot contain
duplicate data.
Bool: Represents True or False.
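The variable types above can be tried out directly in Spyder. This is a minimal sketch; the fruit names and numbers are made up for illustration.

```python
# Illustrative values only; none of these come from the course data files.
name = "Order 66"            # str: characters, possibly mixed with digits
quantity = 3                 # int: a whole number
price = 4.75                 # float: a decimal number

fruits = ["apple", "pear", "cherry"]   # list: items can be changed
fruits[1] = "banana"                   # replace pear with banana in place

fruit_tuple = ("apple", "pear", "cherry")   # tuple: items cannot be changed
# fruit_tuple[1] = "banana" would raise a TypeError, so build a new tuple
new_tuple = fruit_tuple[:1] + ("banana",) + fruit_tuple[2:]

numbers = list(range(2, 9))            # range: starts at 2, stops before 9
stock = {"apple": 10, "pear": 0}       # dict: key/value pairs
unique_fruits = {"apple", "pear", "apple"}   # set: duplicates collapse
in_stock = stock["apple"] > 0          # bool: True or False

print(fruits, new_tuple, numbers, unique_fruits, in_stock)
```

In Spyder, the Variable Explorer shows the type of each of these names after the script runs.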
Sales Analysis for Value Inc
Value Inc is a retail store that sells household items in bulk all over the world.
The Sales Manager:
Has no sales reporting, only a rough idea of how sales are going.
Has no idea of the monthly cost, profit and top-selling products.
Wants a dashboard covering these, and says the data is currently stored in an Excel sheet.
Files to Download
Data Files: transaction.csv
Logo: Value Inc. Logo.png
Looking at the Columns
What is a Series?
A Pandas Series is like a column in a table. It is a one-dimensional array
holding data of any type.
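As a rough illustration, selecting one column of a DataFrame gives back a Series. The table and column names below are placeholders, not the actual contents of transaction.csv.

```python
import pandas as pd

# A small made-up table; the column names are hypothetical
data = pd.DataFrame({
    "ItemDescription": ["Mug", "Lamp", "Rug"],
    "NumberOfItemsPurchased": [6, 2, 1],
})

# Selecting a single column returns a one-dimensional pandas Series
quantities = data["NumberOfItemsPurchased"]

print(type(quantities).__name__)
print(quantities.sum())
```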
Profit and Markup
Two of the most important metrics in sales data are profit and markup.
Profit is the selling price minus the cost, and markup is the profit divided by the cost.
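A small worked example, assuming the common definitions profit = sales - cost and markup = profit / cost (markup measured against cost). The variable names and figures are made up for illustration.

```python
# Made-up per-item figures for one transaction
cost_per_item = 8.00
selling_price_per_item = 10.00
items_purchased = 5

cost = cost_per_item * items_purchased             # total cost of the transaction
sales = selling_price_per_item * items_purchased   # total sales of the transaction

profit = sales - cost      # what is left after covering the cost
markup = profit / cost     # profit as a proportion of the cost

print(profit, markup)
```

Here the markup comes out as a proportion, so 0.25 would be read as a 25% markup on cost.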
Round() Function
The round() function returns a rounded version of the specified number,
with the specified number of decimals.
The default number of decimals is 0, meaning that the function will return
the nearest integer.
Syntax: round(number, digits)
You can also view other lists of functions here:
https://www.w3schools.com/python/python_ref_functions.asp
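A quick sketch of round() with and without the digits argument; the markup value is made up.

```python
markup = 0.256789   # made-up value

rounded_two = round(markup, 2)   # keep two decimals
rounded_int = round(markup)      # no digits given: nearest integer, returned as an int

print(rounded_two, rounded_int)
```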
Loc
Pandas DataFrame.loc attribute accesses a group of rows and columns
by label(s) or a boolean array in the given DataFrame.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
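The two ways of using loc mentioned above (labels and a boolean array) might look like this. The DataFrame, its index labels and its column names are invented for the example.

```python
import pandas as pd

# Made-up data with explicit row labels
df = pd.DataFrame(
    {"Country": ["UK", "France", "UK"], "Sales": [120.0, 80.0, 45.5]},
    index=["t1", "t2", "t3"],
)

# By label: row "t2", column "Sales"
one_cell = df.loc["t2", "Sales"]

# By boolean array: keep only the rows where Country is UK
uk_sales = df.loc[df["Country"] == "UK", "Sales"]

print(one_cell)
print(uk_sales.tolist())
```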
Split()
The split() method splits a string into a list.
You can specify the separator, default separator is any whitespace.
string.split(separator, maxsplit)
https://www.w3schools.com/python/ref_string_split.asp
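A short sketch of split() with the default separator, an explicit separator, and maxsplit. The timestamp string is made up.

```python
timestamp = "2019-11-05 14:32:07"   # made-up value

date_part, time_part = timestamp.split()    # default: split on whitespace
year, month, day = date_part.split("-")     # explicit separator
first, rest = timestamp.split("-", 1)       # maxsplit=1: split only once

print(year, month, day)
print(first, "|", rest)
```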
Replace()
The replace() method replaces a specified phrase with another specified
phrase.
string.replace(oldvalue, newvalue, count)
https://www.w3schools.com/python/ref_string_replace.asp
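For illustration, replace() with and without the optional count argument; the text is made up.

```python
text = "Nov-Nov-2019"   # made-up value

all_replaced = text.replace("Nov", "November")        # every occurrence
first_replaced = text.replace("Nov", "November", 1)   # count=1: first occurrence only

print(all_replaced)
print(first_replaced)
```

Strings are immutable, so replace() returns a new string and leaves text unchanged.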
Lower()
The lower() method returns a string where all characters are lower case.
string.lower()
https://www.w3schools.com/python/ref_string_lower.asp
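A minimal example of lower(); the item description is invented.

```python
description = "WHITE Metal Lantern"   # made-up value

lowered = description.lower()   # returns a new, all-lowercase string

print(lowered)
```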
drop()
The drop() function is used to drop specified labels from rows or
columns.
Remove rows or columns by specifying label names and corresponding
axis, or by specifying directly index or column names. When using a multi-
index, labels on different levels can be removed by specifying the level.
https://www.w3resource.com/pandas/dataframe/dataframe-drop.php
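A sketch of dropping a column and dropping a row with drop(). The DataFrame and its column names are made up for the example.

```python
import pandas as pd

# Made-up data
df = pd.DataFrame({"Day": [5, 6], "Month": ["Nov", "Dec"], "Sales": [50.0, 20.0]})

without_day = df.drop("Day", axis=1)    # drop a column by label (axis=1 means columns)
without_first_row = df.drop(index=0)    # drop a row by its index label

print(list(without_day.columns))
print(without_first_row)
```

drop() returns a new DataFrame by default, so df itself still has the Day column unless the result is assigned back to it.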
pandas.DataFrame.to_csv()
The pandas.DataFrame.to_csv() method writes (exports) a pandas DataFrame
to a CSV file.
By default, to_csv() exports the DataFrame with a comma delimiter and
the row index as the first column.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
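To see the effect of the default row index, this sketch writes a made-up DataFrame to an in-memory buffer instead of a real file. In a project you would pass a file name (a hypothetical "output.csv", say) in place of the buffer.

```python
import io
import pandas as pd

# Made-up data
df = pd.DataFrame({"Item": ["Mug", "Lamp"], "Sales": [50.0, 20.0]})

# Default behaviour: the row index is written as the first column
with_index = df.to_csv()

# index=False leaves the row index out of the file
buffer = io.StringIO()            # stands in for a file path
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(with_index)
print(csv_text)
```

Leaving the index out is often convenient when the file is going to be loaded into another tool, since the index would otherwise appear as an unnamed extra column.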
Blue Bank Loan Analysis
Files to Download
Data Files: loan_data.csv
Logo: Blue Bank Logo.png
JSON Files
JSON is a lightweight data-interchange format and is plain text written in
JavaScript object notation
with statement
The with statement in Python is used to manage common resources like file
streams, making the code cleaner and much more readable. It makes sure the
resource is released (for example, the file is closed) when the block ends,
even if an exception occurs.
https://www.w3schools.com/python/ref_string_lower.asp
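A minimal sketch of reading a JSON file with a with statement. To keep it self-contained it first writes a tiny made-up file to a temporary folder; the file name and its contents are placeholders, not the course's loan data.

```python
import json
import os
import tempfile

# Write a tiny made-up JSON file so the example can run on its own
path = os.path.join(tempfile.mkdtemp(), "loan_sample.json")
with open(path, "w") as f:
    json.dump([{"purpose": "credit_card", "int.rate": 0.11}], f)

# The with block closes the file automatically when it ends
with open(path) as json_file:
    records = json.load(json_file)   # parse the JSON text into Python objects

print(records[0]["int.rate"])
print(json_file.closed)
```

json.load() returns ordinary Python lists and dictionaries, which can then be passed to pd.DataFrame() if a table is needed.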
Lists
Lists are used to store multiple items in a single variable.
Lists are one of 4 built-in data types in Python used to store collections of
data, the other 3 are Tuple, Set, and Dictionary, all with different qualities
and usage.
Lists are created using square brackets
int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11).
Borrowers judged by Blue Bank to be more risky are assigned higher interest rates.
installment: The monthly installments owed by the borrower if the loan is funded.
log.annual.inc: The natural log of the self-reported annual income of the borrower.
dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
If dti > 1, the borrower has more debt than income.
If dti < 1, the borrower has more income than debt.
Columns
fico: The FICO credit score of the borrower.
- 300 - 400: Very Poor
- 401 - 600: Poor
- 601 - 660: Fair
- 661 - 780: Good
- 781 - 850: Excellent
days.with.cr.line: The number of days the borrower has had a credit line.
revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card
billing cycle).
revol.util: The borrower's revolving line utilization rate (the amount of the credit line used
relative to total credit available).
inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months. (If there
are a lot of inquiries, that’s an issue)
delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in
the past 2 years.
pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens,
or judgments).
Unique()
The unique() method returns all of the distinct values in a column.
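For example, on a made-up purpose column (the values below are placeholders):

```python
import pandas as pd

# Made-up data
loans = pd.DataFrame({"purpose": ["credit_card", "educational", "credit_card"]})

purposes = loans["purpose"].unique()   # distinct values, in order of first appearance

print(list(purposes))
```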
describe()
Pandas describe() is used to view some basic statistical details like
percentile, mean, std etc. of a data frame or a series of numeric values.
Syntax: DataFrame.describe(percentiles=None, include=None,
exclude=None)
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html
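A quick look at what describe() returns for a numeric Series. The scores are made up.

```python
import pandas as pd

fico = pd.Series([640, 700, 720, 780])   # made-up scores

summary = fico.describe()   # count, mean, std, min, quartiles and max

print(summary)
print(summary["mean"], summary["50%"])
```

The result is itself a Series, so individual statistics can be picked out by label, as in the last line.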
NumPy
NumPy is a Python library used for working with arrays.
It also has functions for working in the domains of linear algebra, Fourier
transforms, and matrices.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html
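One way NumPy could be useful with this dataset: log.annual.inc is stored as a natural log, so applying the exponential to an array recovers the income itself. The values below are made up, and annual_income is a name introduced just for this sketch.

```python
import numpy as np

log_income = np.array([11.0, 10.5, 11.5])   # made-up log.annual.inc values

# np.exp works element-wise on the whole array and undoes the natural log
annual_income = np.exp(log_income)

print(annual_income.round(2))
```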
IF Statement
Python if Statement is used for decision-making operations.
It contains a body of code which runs only when the condition given in
the if statement is true. If the condition is false, then the optional else
statement runs which contains some code for the else condition.
https://www.w3schools.com/python/python_conditions.asp
- Equals: a == b
- Not Equals: a != b
- Less than: a < b
- Less than or equal to: a <= b
- Greater than: a > b
- Greater than or equal to: a >= b
FICO RANGE
fico >= 300 and fico < 400: 'Very Poor'
fico >= 400 and fico < 600: 'Poor'
fico >= 600 and fico < 660: 'Fair'
fico >= 660 and fico < 780: 'Good'
fico >= 780: 'Excellent'
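The FICO ranges can be written as an if / elif / else chain. This is a sketch under two assumptions: the Fair band is taken to start at 600 so that no score falls between bands, and anything below 300 is labelled 'Unknown'. The variable name ficocat and the score are made up.

```python
fico = 700   # made-up score

if fico >= 300 and fico < 400:
    ficocat = "Very Poor"
elif fico >= 400 and fico < 600:
    ficocat = "Poor"
elif fico >= 600 and fico < 660:
    ficocat = "Fair"
elif fico >= 660 and fico < 780:
    ficocat = "Good"
elif fico >= 780:
    ficocat = "Excellent"
else:
    ficocat = "Unknown"   # assumption: scores below 300 are not expected

print(ficocat)
```

Only the first branch whose condition is true runs, so the order of the branches matters when ranges touch.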
For Loops
A for loop is used for iterating over a sequence (that is either a list, a
tuple, a dictionary, a set, or a string).
This is less like the for keyword in other programming languages, and
works more like an iterator method as found in other object-orientated
programming languages.
With the for loop we can execute a set of statements, once for each item
in a list, tuple, set etc.
https://www.w3schools.com/python/python_for_loops.asp
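A small for loop that runs the same statements once per item in a list, here counting how often each value appears. The list contents are made up.

```python
purposes = ["credit_card", "educational", "credit_card"]   # made-up values

counts = {}
for purpose in purposes:
    # Runs once for each item in the list
    counts[purpose] = counts.get(purpose, 0) + 1

print(counts)
```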
Python Try and Except
When an error occurs, or an exception as we call it, Python will normally stop
and generate an error message.
These exceptions can be handled using the try statement.
The try block lets you test a block of code for errors.
The except block lets you handle the error.
https://www.w3schools.com/python/python_try_except.asp
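As an illustration, try and except can keep a loop going when one value cannot be converted. The raw values are made up, and storing None for bad entries is just one possible choice.

```python
raw_scores = ["720", "n/a", "655"]   # made-up values; "n/a" cannot be converted

parsed = []
for value in raw_scores:
    try:
        parsed.append(int(value))   # may raise ValueError
    except ValueError:
        parsed.append(None)         # handle the error instead of stopping

print(parsed)
```

Catching a specific exception such as ValueError, rather than using a bare except, avoids hiding unrelated errors.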
Matplotlib
Matplotlib is a multi-platform data visualization library built on NumPy
arrays and designed to work with the broader SciPy stack.
Most of the Matplotlib utilities lie under the pyplot submodule, which is
usually imported under the plt alias: import matplotlib.pyplot as plt
https://www.w3schools.com/python/matplotlib_pyplot.asp
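A minimal bar chart sketch using the plt alias. The categories and counts are made up, and the "Agg" backend line is only there so the example also runs without a display; in Spyder you would normally leave it out and the plot appears in the Plots pane.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt

categories = ["Fair", "Good", "Excellent"]   # made-up FICO categories
counts = [12, 30, 8]                         # made-up number of loans in each

fig, ax = plt.subplots()
bars = ax.bar(categories, counts, color="green", width=0.4)
ax.set_xlabel("FICO category")
ax.set_ylabel("Number of loans")

plt.close(fig)   # free the figure once we are done with it
```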
Groupby
Pandas groupby is used to group data by category and apply a function
to each group. It also helps to aggregate data efficiently.
The DataFrame.groupby() function splits the data into groups based on
some criteria. pandas objects can be split on any of their axes.
Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True,
sort=True, group_keys=True, squeeze=False, **kwargs)
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html
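A sketch of grouping by a category and aggregating each group. The rows are made up; the column names follow the loan data columns described in these slides.

```python
import pandas as pd

# Made-up rows
loans = pd.DataFrame({
    "purpose": ["credit_card", "educational", "credit_card", "educational"],
    "int.rate": [0.10, 0.14, 0.12, 0.16],
})

# Split the rows by purpose, then take the mean interest rate of each group
mean_rate = loans.groupby("purpose")["int.rate"].mean()

print(mean_rate.round(2))
```

The result is a Series indexed by the group labels, so mean_rate["credit_card"] gives the average for that purpose.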
Size()
The size of an array is the total number of elements in the array
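In pandas, size shows up in a few places. The sketch below uses made-up data to show the element count of a column, the element count of a whole DataFrame, and the number of rows in each group after a groupby.

```python
import pandas as pd

# Made-up rows
loans = pd.DataFrame({
    "purpose": ["credit_card", "educational", "credit_card"],
    "fico": [700, 650, 720],
})

column_size = loans["fico"].size               # number of elements in one column
table_size = loans.size                        # rows times columns
group_sizes = loans.groupby("purpose").size()  # number of rows in each group

print(column_size, table_size)
print(group_sizes)
```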
BlogMe: Sentiment and Keyword Analysis
BlogMe, a famous blogging business, has a dataset
of news articles that needs further analysis.
Firstly, they'd like keywords to be extracted from
the article headlines; secondly, they need to
determine the sentiment of the news articles.
Files to Download
Data Files: articles.xlsx
BlogMe_sources.xlsx
Logo: BlogMe Logo.png
Functions
A function is a block of code which only runs when it is called.
You can pass data, known as parameters, into a function.
A function can return data as a result.
https://www.w3schools.com/python/python_functions.asp
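A small function in the spirit of the keyword task. The name has_keyword, its parameters and the headline are hypothetical, chosen only to show parameters going in and a value coming back.

```python
def has_keyword(title, keyword):
    """Return 1 if keyword appears in title (ignoring case), otherwise 0."""
    if keyword.lower() in title.lower():
        return 1
    return 0


# The function body only runs when the function is called
flag = has_keyword("Markets CRASH as rates rise", "crash")
no_flag = has_keyword("Markets calm as rates hold", "crash")

print(flag, no_flag)
```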
Classes
Functions generally represent calculations or formulas in your script.
Classes are similar to functions; however, classes (or rather their instances) are
for representing things.
https://www.w3schools.com/python/python_classes.asp
Classes
A class is a blueprint for how something should be defined. It doesn’t actually
contain any data. So something like a Car class will specify that a car name
and car make are necessary for defining a car, but it doesn’t contain the name
or make of any specific car.
While the class is the blueprint, an instance is an object that is built from a class
and contains real data. An instance of the Car class is not a blueprint anymore.
It's an actual car with a make, like Ford, and a name, like F150.
https://www.w3schools.com/python/python_classes.asp
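The Car blueprint described above might be sketched like this. The attribute and method names are illustrative choices.

```python
class Car:
    """Blueprint: every car needs a make and a name, but none is stored here."""

    def __init__(self, make, name):
        self.make = make
        self.name = name

    def describe(self):
        return f"{self.make} {self.name}"


# An instance is built from the class and holds real data
truck = Car("Ford", "F150")

print(truck.describe())
```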
VADER
In our project, we will be using VADER sentiment analysis.
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon
and rule-based sentiment analysis tool that is specifically attuned to
sentiments expressed in social media. VADER uses a sentiment lexicon:
a list of lexical features (e.g., words) which are generally labelled
according to their semantic orientation as either positive or negative.
VADER has been found to be quite successful when dealing with social
media texts, NY Times editorials, movie reviews, and product reviews. This
is because VADER not only tells about the Positivity and Negativity score
but also tells us about how positive or negative a sentiment is.
It is fully open-sourced under the MIT License.
https://pypi.org/project/vaderSentiment/
VADER
Advantages of using VADER
https://pypi.org/project/vaderSentiment/
TABLEAU
Tableau Workbook, Worksheets and Dashboards
Tableau uses a workbook and sheet file structure, much like Microsoft Excel.
A workbook contains sheets. A sheet can be a worksheet, a dashboard, or a
story.
A worksheet contains a single view along with shelves, cards, legends,
and the Data and Analytics panes in its side bar.
JOIN – the data sources have one or more columns in common that you can
combine together, creating a wider table
Groups
You can create a group to combine related members in a field.
For example, if you are working with a view that shows average test scores
by major, you might want to group certain majors together to create major
categories.
English and History might be combined into a group called Liberal Arts
Majors, while Biology and Physics might be grouped as Science Majors.
Groups are useful both for correcting data errors and for answering "what
if" type questions.
Sets
You can use sets to compare and ask questions about a subset of data.
Sets are custom fields that define a subset of data based on some
conditions.
Filters
Filtering is an essential part of analyzing data. Tableau offers many
ways to filter data from your view.
You can also display interactive filters in the view and format them.
Calculated Fields
If your underlying data doesn't include all of the fields you need to answer
your questions, you can create new fields in Tableau using calculations and
then save them as part of your data source.
For example, you may create a calculated field that returns True if Sales is
greater than $500,000 and otherwise returns False.
You can replace the constant value of “500000” in the formula with a
parameter. Then, using the parameter control, you can dynamically change
the threshold in your calculation.