Python in Power BI - Unleash The - Hayden Van Der Post
Reactive Publishing
CONTENTS
Title Page
Chapter 1: Introduction to Python in Power BI
Chapter 2: Data Import and Manipulation with Python
Chapter 3: Data Visualization with Python in Power BI
Chapter 4: Statistical Analysis and Machine Learning with Python
Chapter 5: Advanced Python Scripting in Power BI
Chapter 6: Real-world Use Cases and Projects
Chapter 7: Trends and Resources
Data Visualization Guide
Time Series Plot
Correlation Matrix
Histogram
Scatter Plot
Bar Chart
Pie Chart
Box and Whisker Plot
Risk Heatmaps
Additional Resources
How to install python
Python Libraries
Key Python Programming Concepts
How to write a Python Program
CHAPTER 1:
INTRODUCTION TO
PYTHON IN POWER BI
Power BI has revolutionized the way we handle business intelligence
and data analytics. Introduced in 2015, Power BI has rapidly evolved
into one of the most powerful tools for data visualization and
reporting, catering to a broad spectrum of users from data analysts
to business executives. Its user-friendly interface, extensive data
connectivity, and robust analytical capabilities have positioned it as a
cornerstone in the data analytics landscape.
Power BI's ability to integrate with these diverse data sources allows
users to bring together data from disparate systems into a single,
cohesive framework. This integration is enabled through Power
Query, a data connection technology that enables users to discover,
connect, combine, and refine data across a wide variety of sources.
DAX is similar to Excel formulas but is far more robust and optimized
for large datasets and complex calculations. It allows users to
perform sophisticated data analysis and create dynamic, interactive
reports. Some common DAX functions include SUM, AVERAGE, CALCULATE, FILTER, and RELATED.
Installing Python
1. Download Python:
Visit the official Python website at [python.org]
(https://www.python.org/) and download the latest version of Python.
It is recommended to download Python 3.x as Python 2.x is no
longer supported.
2. Install Python:
Run the installer and make sure to check the option "Add Python to
PATH". This is crucial as it allows you to run Python from the
command line. Follow the installation instructions provided by the
installer.
3. Verify Installation:
To ensure Python is installed correctly, open Command Prompt
(Windows) or Terminal (Mac/Linux) and type:
```bash
python --version
```
This command should return the version of Python installed on your
machine.
3. Verify Power BI Desktop Installation:
Once installed, open Power BI Desktop. You should be greeted with
the Power BI interface, ready to create your first report.
After installing both Python and Power BI Desktop, the next step is to
integrate Python within Power BI.
3. Verify Configuration:
To ensure that Power BI can use Python, click on "Detect Python
home directories". If everything is set up correctly, Power BI will
detect the Python installation and display the path.
1. Install Pandas:
Pandas is a powerful data manipulation library. To install it, open
Command Prompt or Terminal and type:
```bash
pip install pandas
```
2. Install Matplotlib:
Matplotlib is a popular library for creating static, interactive, and
animated visualizations. Install it using:
```bash
pip install matplotlib
```
3. Install Seaborn:
Seaborn is based on Matplotlib and provides a high-level interface
for drawing attractive and informative statistical graphics. Install it
using:
```bash
pip install seaborn
```
4. Install Scikit-learn:
Scikit-learn is essential for machine learning tasks. Install it using:
```bash
pip install scikit-learn
```
5. Install other libraries as needed:
Depending on your specific use case, you might need other libraries
such as `numpy`, `statsmodels`, `nltk`, `plotly`, etc. These can be
installed similarly using the `pip install` command.
To use Python visuals in Power BI, you need to enable the Python
scripting option.
Prerequisites
Before we dive into the setup, let's ensure you have the essential
prerequisites:
1. Regular Updates:
Keep your Python installation and libraries up to date. Regular
updates ensure compatibility with Power BI and access to the latest
features and security patches.
2. Environment Management:
Use virtual environments to manage different project dependencies.
Tools like `virtualenv` and `conda` allow you to create isolated
environments, preventing conflicts between library versions.
3. Documentation:
Document your Python scripts and configurations. Clear
documentation helps you and your team understand the setup and
troubleshoot any issues that arise.
4. Testing:
Test your Python scripts outside of Power BI before integrating them.
This ensures that the scripts run correctly and produce the
expected results.
Python Fundamentals
```python
data_list = [10, 20, 30, 40, 50]
```
```python
data_tuple = (10, 20, 30, 40, 50)
```
```python
data_dict = {"A": 10, "B": 20, "C": 30}
```
```python
data_set = {10, 20, 30, 40, 50}
```
- If Statement:
```python
value = 20
if value > 10:
    print("Value is greater than 10")
```
- For Loop:
```python
for item in data_list:
    print(item)
```
- While Loop:
```python
count = 0
while count < 5:
    print(count)
    count += 1
```
1. Defining Functions:
```python
def greet(name):
    return f"Hello, {name}!"

print(greet("Power BI User"))
```
2. Using Modules:
Python has a rich ecosystem of modules that you can import and
use in your code. For example, the `math` module provides
mathematical functions.
```python
import math
print(math.sqrt(16)) # Output: 4.0
```
1. Pandas:
Pandas is a powerful data manipulation and analysis library. It
provides data structures like Series and DataFrame.
```python
import pandas as pd
# Creating a DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
```
2. NumPy:
NumPy is fundamental for numerical computations, providing
support for arrays.
```python
import numpy as np
# Creating an array
arr = np.array([1, 2, 3, 4, 5])
```
3. Matplotlib:
Matplotlib is used for creating static, interactive, and animated
visualizations.
```python
import matplotlib.pyplot as plt
```
Practical Examples
```python
import pandas as pd
# Load data from a CSV file
df = pd.read_csv('data.csv')
```

```python
# Summarize the data
summary = df.describe()
print(summary)
```

```python
import matplotlib.pyplot as plt
# Create a histogram
plt.hist(df['Column'], bins=10, edgecolor='black')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Histogram')
plt.show()
```
Best Practices
1. Code Readability:
Write clear and readable code by following consistent naming
conventions and adhering to the PEP 8 style guide.
2. Documentation:
Document your code with comments and docstrings to explain the
purpose and functionality of different sections.
```python
def calculate_sum(a, b):
    """
    Calculate the sum of two numbers.

    Args:
        a (int): First number
        b (int): Second number

    Returns:
        int: Sum of the two numbers
    """
    return a + b
```
3. Error Handling:
Use try-except blocks to handle potential errors gracefully.
```python
try:
    result = 10 / 0
except ZeroDivisionError:
    print("Error: Division by zero is not allowed")
```
```python
# Example: Cleaning data using pandas
import pandas as pd

df = pd.DataFrame(dataset)
df = df.fillna(0)  # fill missing values (0 used here as a simple placeholder)
```
This script imports the pandas library, loads the data into a
DataFrame, cleans it by filling missing values, and returns the
cleaned DataFrame.
```python
# Example: Creating a scatter plot using matplotlib
import matplotlib.pyplot as plt
# Data preparation
x = dataset['column_x']
y = dataset['column_y']
# Scatter plot
plt.scatter(x, y)
plt.xlabel('Column X')
plt.ylabel('Column Y')
plt.title('Scatter Plot of Column X vs. Column Y')
plt.show()
```
This script generates a scatter plot from the data fields dragged into
the visual, providing a customized visual representation directly
within Power BI.
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Load data
df = pd.DataFrame(dataset)

# Decompose the series ('value' and period=12 are placeholder choices)
decomposition = seasonal_decompose(df['value'], model='additive', period=12)
decomposition.plot()
plt.show()
```
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load data
df = pd.DataFrame(dataset)

# Features and target ('feature' and 'target' are placeholder column names)
X = df[['feature']]
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
predictions = model.predict(X_test)

# Display predictions
result = pd.DataFrame({'Actual': y_test, 'Predicted': predictions})
print(result)
```
This script trains a linear regression model on historical data and
makes predictions, which can then be visualized within Power BI.
1. Code Optimization:
Ensure your Python scripts are optimized for performance. Avoid
using unnecessary loops and prefer vectorized operations with
libraries like pandas and NumPy.
2. Error Handling:
Implement robust error handling in your scripts to manage potential
issues gracefully.
```python
try:
    # Your Python code
    pass
except Exception as e:
    print(f"Error occurred: {e}")
```
4. Security Considerations:
Be mindful of data security and privacy, especially when dealing with
sensitive information. Ensure that your scripts do not expose any
confidential data.
Python offers several built-in data types that are pivotal for data
handling and manipulation. These include integers, floating-point
numbers, strings, and booleans.
1. Integers and Floats:
```python
# Example usage
int_num = 42
float_num = 3.14
```
2. Strings:
Strings (str) are sequences of characters enclosed within single or
double quotes. They are used for text manipulation.
```python
# Example usage
text = "Hello, Power BI!"
```
3. Booleans:
Booleans (bool) represent two values: True or False. They are often
used in conditional statements and logical operations.
```python
# Example usage
is_active = True
is_complete = False
```
Data Structures
1. Lists:
Lists are ordered collections of items that are mutable (i.e., they can
be changed). They can store heterogeneous data types.
```python
# Example usage
data_list = [10, 20, 30, 'Power BI', True]
```
2. Tuples:
Tuples are similar to lists but are immutable (i.e., they cannot be
changed). They are defined using parentheses.
```python
# Example usage
data_tuple = (10, 20, 30, 'Power BI', True)
```
3. Dictionaries:
Dictionaries (dict) are unordered collections of key-value pairs. They
are highly efficient for lookups and data retrieval.
```python
# Example usage
data_dict = {'name': 'Power BI', 'version': 2023, 'active': True}
```
4. Sets:
Sets are unordered collections of unique items. They are useful for
operations involving membership tests and deduplication.
```python
# Example usage
data_set = {10, 20, 30, 'Power BI', True}
```
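For instance, a quick sketch of a membership test and de-duplication (the values below are arbitrary):
```python
# Membership test
data_set = {10, 20, 30, 40, 50}
print(20 in data_set)        # True

# De-duplication: converting a list to a set removes repeated items
raw_values = [10, 10, 20, 30, 30, 30]
unique_values = set(raw_values)
print(unique_values)         # {10, 20, 30}
```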
```python
import pandas as pd

df = pd.DataFrame(data)
df['Name'].fillna('Unknown', inplace=True)         # placeholder fill value
df['Age'].fillna(df['Age'].mean(), inplace=True)
```
This script cleans the dataset by filling missing values in the 'Name'
and 'Age' columns, preparing it for further analysis and visualization
within Power BI.
2. Data Aggregation:
```python
import pandas as pd

# Example data
sales_data = [{'Product': 'A', 'Sales': 100},
              {'Product': 'B', 'Sales': 150},
              {'Product': 'A', 'Sales': 200},
              {'Product': 'B', 'Sales': 250}]

df = pd.DataFrame(sales_data)
aggregated_sales = df.groupby('Product')['Sales'].sum().reset_index()
print(aggregated_sales)
```
```python
import pandas as pd
import matplotlib.pyplot as plt
# Example dataset
data = {'Month': ['January', 'February', 'March', 'April'],
'Revenue': [1000, 1500, 1200, 1300]}
df = pd.DataFrame(data)
```
1. namedtuple:
Named tuples are like regular tuples but provide named fields,
making the code more readable and self-documenting.
```python
from collections import namedtuple
# Define a namedtuple
Employee = namedtuple('Employee', ['name', 'age', 'role'])
# Create instances
emp1 = Employee('Alice', 30, 'Engineer')
emp2 = Employee('Bob', 35, 'Manager')
emp1.name # Accessing field by name
```
2. defaultdict:
defaultdict is a dictionary subclass that returns a default value if the
key is not found. It simplifies handling missing keys.
```python
from collections import defaultdict
# Default value of 0 for missing keys
sales_data = defaultdict(int)
sales_data['Product A'] += 100
print(sales_data['Product B'])  # 0 -- no KeyError for a missing key
```
3. OrderedDict:
OrderedDict maintains the order of items as they are added, which is
useful for tasks where the order of elements matters.
```python
from collections import OrderedDict
# Define an OrderedDict
ordered_data = OrderedDict()
ordered_data['first'] = 1
ordered_data['second'] = 2
ordered_data['third'] = 3
ordered_data
```
```python
import pandas as pd
# Load dataset
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'Occupation': ['Engineer', 'Doctor', 'Artist', 'Engineer']}
df = pd.DataFrame(data)
```
2. Custom Visualizations:
Python's extensive visualization libraries, such as Matplotlib,
Seaborn, and Plotly, allow for the creation of sophisticated and
customized visualizations that go beyond what is natively possible
within Power BI.
```python
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Example dataset
data = {'Month': ['January', 'February', 'March', 'April'],
        'Revenue': [1000, 1500, 1200, 1300]}
df = pd.DataFrame(data)

sns.barplot(x='Month', y='Revenue', data=df)
plt.title('Monthly Revenue')
plt.show()
```
```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Example dataset
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([1.2, 1.9, 3.0, 4.1, 4.8])

# Fit the model
model = LinearRegression()
model.fit(X, y)

# Make predictions
predictions = model.predict(X)
print(predictions)
```
This script fits a linear regression model to a simple dataset and
makes predictions, illustrating Python’s capability in statistical
modeling.
4. Automation:
Python can automate repetitive and complex tasks within Power BI,
such as data extraction, transformation, and loading (ETL)
processes, thus saving valuable time and reducing the potential for
human error.
```python
import pandas as pd
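# A minimal ETL-style sketch -- the file name and column names below are
# placeholders for illustration
df = pd.read_csv('sales_export.csv')                              # extract
df['Net Sales'] = df['Sales'] - (df['Sales'] * df['Discount'])    # transform
df = df.dropna()                                                  # clean before loading into Power BI
```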
```python
# Example of a complex and time-consuming data operation
import pandas as pd
import numpy as np
```
2. Limited Interactivity:
Python visuals within Power BI are static and do not support
interactivity in the same way as native Power BI visuals. This can be
a limitation when creating dynamic reports that require interactive
elements.
3. Restricted Environment:
Power BI's Python scripting environment has certain limitations
regarding the libraries and packages available. While many popular
libraries are supported, some specific or custom libraries may not be
available or require special handling.
```python
# Attempt to use an unsupported library
import some_custom_library
```
4. Security Concerns:
Python scripts can pose security risks, especially when handling
sensitive data. Organizations need to ensure that proper security
measures are in place to prevent unauthorized access or execution
of malicious code.
```python
# Example of a security measure: mask a sensitive column before it is displayed
import pandas as pd

df = pd.DataFrame(data)
df['sensitive_column'] = "MASKED"
```
Practical Considerations
Given these capabilities and limitations, it is crucial to assess when
and how to use Python within Power BI effectively. Consider the
following practical tips:
2. Optimize Performance:
Optimize your Python scripts to ensure they run efficiently. This
includes minimizing data size, using efficient algorithms, and
leveraging vectorized operations in libraries like pandas and NumPy.
3. Enhance Security:
Implement security best practices to safeguard sensitive data and
prevent unauthorized script execution. Masking sensitive information
and restricting script permissions are essential steps.
5. Regular Maintenance:
Regularly review and update Python scripts to ensure they remain
functional and efficient as the underlying data and report
requirements evolve.
To truly unleash the potential of Power BI, one must harness the
scripting power of Python. This section will guide you through
creating your first Python script within Power BI, enabling a deeper
level of data manipulation and visualization.
```sh
pip install pandas numpy matplotlib seaborn
```
Importing Data
```python
import pandas as pd
import matplotlib.pyplot as plt
```
Practical Considerations
Conclusion
```python
import pandas as pd
# Reading data in chunks
chunksize = 10 ** 6  # 1 million rows at a time
for chunk in pd.read_csv('large_data.csv', chunksize=chunksize):
    process(chunk)  # Replace with actual data processing function
```
```python
df = pd.read_csv('data.csv', dtype={'column1': 'int32', 'column2':
'float32'})
```
```python
df.drop(columns=['unnecessary_column'], inplace=True)
```
```python
# Vectorized operation
df['new_column'] = df['existing_column'] * 2
```
1. Plot Size: Adjust the plot size to ensure clarity without rendering
excessively large plots that slow down performance.
```python
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
```
```python
import seaborn as sns
sns.set(style="whitegrid")
sns.boxplot(x="category", y="value", data=df)
```
```python
# Example script within Power BI
import pandas as pd
df = pd.DataFrame(dataset)
filtered_df = df[['necessary_column1', 'necessary_column2']]
```
```python
try:
    # Your script logic
    pass
except Exception as e:
    print(f"An error occurred: {e}")
```
```python
# Function to clean data
def clean_data(df):
    """
    Remove missing values from DataFrame.

    :param df: pandas DataFrame
    :return: cleaned pandas DataFrame
    """
    df.dropna(inplace=True)
    return df

df = clean_data(df)
```
Performance Considerations
Performance optimization is critical when integrating Python scripts
within Power BI:
```python
# Optimized aggregation with pandas
summary = df.groupby('category').agg({'value': 'sum'}).reset_index()
```
```python
import cProfile
cProfile.run('main()')
```
```python
# Mask sensitive data
df['sensitive_column'] = "MASKED"
```
```python
import re

def validate_input(user_input):
    if not re.match("^[a-zA-Z0-9_]*$", user_input):
        raise ValueError("Invalid input")
    return user_input
```
CSV files remain one of the most commonly used data formats due
to their simplicity and compatibility. Python's `pandas` library
provides a straightforward method to read CSV files:
```python
import pandas as pd

# Reading a CSV file with specified data types
df = pd.read_csv('path_to_your_file.csv', dtype={'column1': 'int32', 'column2': 'float32'})
```
```python
import pandas as pd

# Reading multiple sheets from an Excel file
sheets = pd.read_excel('path_to_your_file.xlsx', sheet_name=['Sheet1', 'Sheet2'])
```
```python
from sqlalchemy import create_engine
import pandas as pd

# Create a database connection (connection string is a placeholder)
engine = create_engine('sqlite:///my_database.db')

# Querying data
df = pd.read_sql('SELECT * FROM table_name', engine)
```
For PostgreSQL:
```python
from sqlalchemy import create_engine
import pandas as pd

# Placeholder credentials -- replace with your own connection details
engine = create_engine('postgresql://user:password@localhost:5432/my_database')
df = pd.read_sql('SELECT * FROM table_name', engine)
```
APIs provide a means to access and import data from web services.
Python's `requests` library simplifies interacting with APIs:
```python
import requests
import pandas as pd

# Placeholder endpoint -- replace with the API you are querying
response = requests.get('https://api.example.com/data')
df = pd.DataFrame(response.json())
```
```python
import requests
from requests.auth import HTTPBasicAuth
import pandas as pd

# Placeholder endpoint and credentials
response = requests.get('https://api.example.com/data', auth=HTTPBasicAuth('user', 'password'))
df = pd.DataFrame(response.json())
```
```python
import pandas as pd
import xml.etree.ElementTree as ET
df = pd.DataFrame(data)
```
Text files, especially those with delimited data, are easy to handle
with Python:
```python
import pandas as pd

# Reading a delimited text file (tab-separated shown as an example)
df = pd.read_csv('path_to_your_file.txt', delimiter='\t')
```
Cloud storage services like AWS S3 are increasingly used for data
storage. Python’s `boto3` library provides an interface to interact with
AWS services:
```python
import boto3
import pandas as pd
```

```python
import gspread
import pandas as pd
from oauth2client.service_account import ServiceAccountCredentials
```

```python
# Creating a Series
data = pd.Series([1, 2, 3, 4, 5])
print(data)
```

```python
# Creating a DataFrame
data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 24, 35]}
df = pd.DataFrame(data)
print(df)
```
One of the most common tasks is loading data into pandas for
analysis. Whether your data is stored in CSV files, Excel
spreadsheets, databases, or other formats, `pandas` provides
seamless methods to import it.
- CSV Files:
```python
df = pd.read_csv('path_to_your_file.csv')
```
- Excel Files:
```python
df = pd.read_excel('path_to_your_file.xlsx', sheet_name='Sheet1')
```
- SQL Databases:
```python
from sqlalchemy import create_engine
engine = create_engine('sqlite:///my_database.db')
df = pd.read_sql('SELECT * FROM my_table', engine)
```
- Basic Inspection:
```python
# Display the first few rows of the DataFrame
print(df.head())
```
- Handling Duplicates:
```python
# Drop duplicate rows
df_unique = df.drop_duplicates()
```
- Sorting Data:
```python
# Sort DataFrame by a column
df_sorted = df.sort_values(by='Age', ascending=False)
```
- Grouping Data:
```python
# Group by a column and calculate the mean
df_grouped = df.groupby('Age').mean()
```
- Merging DataFrames:
```python
# Merge two DataFrames on a common column
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Value1': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Value2': ['D', 'E', 'F']})
df_merged = pd.merge(df1, df2, on='ID', how='inner')
```
- Joining DataFrames:
```python
# Join DataFrames using their indexes
df1.set_index('ID', inplace=True)
df2.set_index('ID', inplace=True)
df_joined = df1.join(df2, how='inner')
```
Removing Duplicates
- Removing Duplicates:
```python
# Drop duplicate rows
df_unique = df.drop_duplicates()
```
```python
# Convert to lowercase
df['Name'] = df['Name'].str.lower()
```
Standardizing Data
- Renaming Columns:
```python
# Rename columns for consistency
df.rename(columns={'Name': 'Customer_Name', 'Age':
'Customer_Age'}, inplace=True)
```
- Reordering Columns:
```python
# Reorder columns
df = df[['Customer_Name', 'Customer_Age', 'Date']]
```
Data Transformation
To integrate these data cleaning steps within Power BI, you can use
Python scripts in the Power Query editor. Here’s how you can do it:
```python
import pandas as pd
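# A minimal sketch -- in the Power Query editor, the Python step receives the
# current table as 'dataset' and should return a DataFrame; the column names
# below are placeholders
df = pd.DataFrame(dataset)
df = df.drop_duplicates()
df['Name'] = df['Name'].str.lower()
df.rename(columns={'Name': 'Customer_Name', 'Age': 'Customer_Age'}, inplace=True)
```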
Before diving into the practical aspects, it's crucial to understand the
difference between merging and joining datasets. While both terms
are often used interchangeably, they have nuanced differences in
the `pandas` library:
Imagine you are working with two datasets: one containing customer
information and another containing transaction details. By merging
these datasets, you can analyze customer behavior more
comprehensively. Let's look at various scenarios:
- Inner Join: Only includes rows with matching keys in both datasets.
- Outer Join: Includes all rows from both datasets. Missing values
are handled appropriately.
- Left Join: Includes all rows from the left dataset and matching rows
from the right dataset.
- Right Join: Includes all rows from the right dataset and matching
rows from the left dataset.
```python
import pandas as pd
# Customer dataset
customers = pd.DataFrame({
'CustomerID': [1, 2, 3, 4],
'Name': ['John Doe', 'Jane Smith', 'Jim Brown', 'Jake Blues']
})
# Transaction dataset
transactions = pd.DataFrame({
'TransactionID': [101, 102, 103, 104],
'CustomerID': [1, 2, 2, 3],
'Amount': [200, 150, 300, 400]
})
# Inner join
inner_joined = pd.merge(customers, transactions, on='CustomerID',
how='inner')
# Outer join
outer_joined = pd.merge(customers, transactions, on='CustomerID',
how='outer')
# Left join
left_joined = pd.merge(customers, transactions, on='CustomerID',
how='left')
# Right join
right_joined = pd.merge(customers, transactions, on='CustomerID',
how='right')
```
```python
# Customer dataset with index
customers.set_index('CustomerID', inplace=True)
# Join on the CustomerID index
joined = customers.join(transactions.set_index('CustomerID'), how='inner')
```
Advanced Merging Techniques
```python
# Vertical concatenation
vertical_concat = pd.concat([customers, transactions], axis=0)
# Horizontal concatenation
horizontal_concat = pd.concat([customers, transactions], axis=1)
```
```python
# Example datasets
df1 = pd.DataFrame({
    'Key1': ['A', 'B', 'C'],
    'Key2': ['K1', 'K2', 'K3'],
    'Value': [1, 2, 3]
})
df2 = pd.DataFrame({
    'Key1': ['A', 'B', 'C'],
    'Key2': ['K1', 'K2', 'K4'],
    'Value': [4, 5, 6]
})

# Merge on multiple keys
merged = pd.merge(df1, df2, on=['Key1', 'Key2'], how='inner')
```
```python
# Example datasets with overlapping column names
df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2'],
    'key': [1, 2, 3]
})
df2 = pd.DataFrame({
    'A': ['A3', 'A4', 'A5'],
    'B': ['B3', 'B4', 'B5'],
    'key': [1, 2, 4]
})

# Suffixes distinguish the overlapping columns
merged = pd.merge(df1, df2, on='key', how='inner', suffixes=('_left', '_right'))
```
To integrate merging and joining operations within Power BI, you can
leverage Python scripts within the Power Query editor. This section
illustrates how to execute these operations seamlessly.
```python
import pandas as pd
```
```python
import pandas as pd
# Sample dataset
data = pd.DataFrame({
'Name': ['John', 'Jane', 'Jim', 'Jake', None],
'Age': [28, 34, None, 45, 23],
'City': ['New York', 'Los Angeles', 'Chicago', None, 'Houston']
})

# Identify missing values
print(data.isnull().sum())
```
Once missing values are identified, the next step is to analyze their
pattern. Understanding how missing data is distributed across your
dataset can inform your handling strategy.
```python
# Visualizing missing data
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(data.isnull(), cbar=False)
plt.show()
```
Using a heatmap, you can visualize the pattern of missing data. This
can reveal whether missing data is scattered randomly or if certain
sections of your dataset are more affected.
There are several strategies to handle missing values, each with its
pros and cons. The appropriate method depends on the nature and
extent of the missing data, as well as the specific use case.
```python
# Removing rows with missing values
data_dropped_rows = data.dropna()
```
Keep in mind that this method can lead to data loss and should be
used cautiously.
```python
# Imputing missing values with mean
data['Age'].fillna(data['Age'].mean(), inplace=True)
```
```python
# Forward fill
data.fillna(method='ffill', inplace=True)
# Backward fill
data.fillna(method='bfill', inplace=True)
```
```python
# Interpolation
data['Age'] = data['Age'].interpolate()
```
```python
from sklearn.ensemble import RandomForestRegressor
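# A hedged sketch of model-based imputation: predict missing 'Age' values from
# other numeric columns ('Salary' and 'Tenure' are hypothetical, for illustration)
import pandas as pd

df = pd.DataFrame({'Age': [28, 34, None, 45, 23],
                   'Salary': [50000, 64000, 58000, 90000, 41000],
                   'Tenure': [2, 6, 4, 12, 1]})

known = df[df['Age'].notna()]
missing = df[df['Age'].isna()]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(known[['Salary', 'Tenure']], known['Age'])
df.loc[df['Age'].isna(), 'Age'] = model.predict(missing[['Salary', 'Tenure']])
```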
To integrate missing value handling within Power BI, you can use
Python scripts in the Power Query editor. This allows for seamless
data preparation and transformation.
```python
import pandas as pd
```
# 1. Renaming Columns
```python
import pandas as pd
# Sample dataset
data = pd.DataFrame({
'col1': [1, 2, 3],
'col2': [4, 5, 6]
})
# Renaming columns
data.rename(columns={'col1': 'Column1', 'col2': 'Column2'},
inplace=True)
print(data)
```
```python
# Changing data type
data['Column1'] = data['Column1'].astype(float)
print(data.dtypes)
```
# 3. Filtering Data
```python
# Filtering data based on condition
filtered_data = data[data['Column1'] > 1]
print(filtered_data)
```
# 4. Sorting Data
```python
# Sorting data by a column
sorted_data = data.sort_values(by='Column2', ascending=False)
print(sorted_data)
```
```python
# Applying a custom function to a column
data['Column1'] = data['Column1'].apply(lambda x: x * 2)
print(data)
```
# 2. Aggregating Data
```python
# Aggregating data
grouped_data = data.groupby('Column1').sum()
print(grouped_data)
```
# 3. Pivoting Data
```python
# Sample dataset for pivoting
data = pd.DataFrame({
'A': ['foo', 'foo', 'bar', 'bar'],
'B': ['one', 'two', 'one', 'two'],
'C': [1, 2, 3, 4]
})

# Pivot: rows from 'A', columns from 'B', values from 'C'
pivot_data = data.pivot(index='A', columns='B', values='C')
print(pivot_data)
```
```python
# Sample datasets
left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K3'], 'B': ['B0', 'B1', 'B3']})
# Merging datasets
merged_data = pd.merge(left, right, on='key', how='inner')
print(merged_data)
```
```python
# Sample time series data
timeseries_data = pd.DataFrame({
'date': pd.date_range(start='1/1/2022', periods=5, freq='D'),
'value': [1, 2, 3, 4, 5]
})
timeseries_data.set_index('date', inplace=True)
# Resampling data
resampled_data = timeseries_data.resample('2D').sum()
print(resampled_data)
```
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Sample dataset
data = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# Normalizing data
scaler = StandardScaler()
normalized_data = scaler.fit_transform(data)
print(normalized_data)
```
```python
import pandas as pd

def calculate_percentage(part, whole):
    return (part / whole) * 100
```
This function takes two parameters, `part` and `whole`, and returns
the percentage. Applying such a function to your dataset can be
straightforward, especially with libraries like pandas.
Suppose you have a dataset containing sales data, and you need to
calculate the percentage contribution of each product to the total
sales. Here's how you can do it:
```python
import pandas as pd

# Sample dataset
data = {'Product': ['A', 'B', 'C'],
        'Sales': [150, 300, 500]}
df = pd.DataFrame(data)

# Percentage contribution of each product to total sales
df['Percentage'] = calculate_percentage(df['Sales'], df['Sales'].sum())
print(df)
```
Output:
```
  Product  Sales  Percentage
0       A    150   15.789474
1       B    300   31.578947
2       C    500   52.631579
```
```python
# Sample dataset
data = {'Product': ['A', 'B', 'C'],
        'Sales': [150, 300, 500],
        'Discount': [0.1, 0.2, 0.15]}
df = pd.DataFrame(data)

# Net sales after applying the discount
df['Net Sales'] = df['Sales'] - (df['Sales'] * df['Discount'])
print(df)
```
Output:
```
Product Sales Discount Net Sales
0 A 150 0.10 135.0
1 B 300 0.20 240.0
2 C 500 0.15 425.0
```
```python
# Sample dataset
data = {'Region': ['North', 'South', 'North', 'East', 'South'],
        'Sales': [200, 150, 300, 400, 100]}
df = pd.DataFrame(data)

# Average sales by region
region_sales = df.groupby('Region')['Sales'].mean().reset_index()
print(region_sales)
```
Output:
```
  Region  Sales
0   East  400.0
1  North  250.0
2  South  125.0
```
```python
import pandas as pd

# Sample dataset
data = {'Product': ['A', 'B', 'C'],
        'Sales': [150, 300, 500],
        'Discount': [0.1, 0.2, 0.15]}
df = pd.DataFrame(data)

# Calculate Net Sales for use in the Power BI report
df['Net Sales'] = df['Sales'] - (df['Sales'] * df['Discount'])
```
By running this script in Power BI, you can dynamically calculate the
`Net Sales` and integrate this calculation seamlessly into your Power
BI reports.
Conclusion
Before you can start using Python scripts within Power Query, you
need to ensure that your environment is properly set up. Here are
the steps to get started:
Once your environment is set up, you can start incorporating Python
scripts into your Power Query transformations. Here’s a step-by-step
guide to writing and executing Python scripts in Power Query:
3. Write Your Python Script: Enter your Python script that performs
the desired data transformation. For example, here’s a script that
loads a dataset, performs a simple transformation, and returns the
result:
```python
import pandas as pd

# Sample dataset
data = {'Product': ['A', 'B', 'C'],
        'Sales': [150, 300, 500],
        'Discount': [0.1, 0.2, 0.15]}
df = pd.DataFrame(data)

# Simple transformation: net sales after discount
df['Net Sales'] = df['Sales'] - (df['Sales'] * df['Discount'])
```
# Step-by-Step Guide
1. Load Data into Power BI: Load your sales data into Power BI. For
simplicity, let’s assume the data is stored in an Excel file:
```plaintext
Product, Sales, Discount
A, 150, 0.1
B, 300, 0.2
C, 500, 0.15
```
2. Transform Data with Power Query: Open Power Query Editor and
load the dataset.
3. Add Python Script: In the Power Query Editor, go to Home > New
Source > Other > Python script. Enter the following script to
categorize products based on their net sales:
```python
import pandas as pd
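# A minimal sketch -- 'dataset' is the table Power Query passes to the script,
# and the 200 cut-off for 'High' vs 'Low' is an arbitrary value for illustration
df = pd.DataFrame(dataset)
df['Net Sales'] = df['Sales'] - (df['Sales'] * df['Discount'])
df['Category'] = df['Net Sales'].apply(lambda x: 'High' if x >= 200 else 'Low')
```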
4. Execute the Script: Click OK to run the script. Power Query will
display the transformed dataset with the new `Category` column.
# Step-by-Step Guide
1. Load Data into Power BI: Load the following datasets into Power
BI:
2. Transform Data with Power Query: Open Power Query Editor and
load both datasets.
3. Add Python Script: In the Power Query Editor, go to Home > New
Source > Other > Python script. Enter the following script to merge
the datasets and calculate aggregate metrics:
```python
import pandas as pd
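# A minimal sketch with placeholder tables and column names; in practice these
# frames come from the datasets loaded into Power Query
sales = pd.DataFrame({'Product': ['A', 'B', 'C'],
                      'Category': ['X', 'X', 'Y'],
                      'Sales': [150, 300, 500],
                      'Discount': [0.1, 0.2, 0.15]})
summary = sales.groupby('Category').agg({'Sales': 'sum', 'Discount': 'mean'}).reset_index()
```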
4. Execute the Script: Click OK to run the script. Power Query will
display the aggregated dataset with total sales and average discount
by category.
Performance Considerations
```python
import pandas as pd
import swifter  # third-party library that parallelizes pandas apply

# Load data
data = pd.read_csv('large_dataset.csv')

# Parallelized apply
data['Net Sales'] = data.swifter.apply(lambda x: x['Sales'] - (x['Sales'] * x['Discount']), axis=1)
```
Monitoring and profiling your code can help identify bottlenecks and
optimize performance. Here are some tools and techniques:
```python
# line_profiler's @profile decorator (used with kernprof) marks the function to be profiled
@profile
def data_transformation():
    data['Net Sales'] = data['Sales'] - (data['Sales'] * data['Discount'])
```

```python
# Load dataset
data = pd.read_csv('sales_data.csv')
```
By loading the data into a DataFrame and inspecting its first few
rows, you gain an initial overview of the dataset, including column
names and data types.
Filtering Data
Sorting Data
Aggregating Data
Data Transformation
By transforming data, you can derive new insights and make your data
more suitable for analysis and visualization.
Pivoting data allows you to create summary tables that are easier to
analyze and visualize.
Understanding Matplotlib
```bash
pip install matplotlib
```
With Matplotlib installed, let’s delve into its core components and
how they can be utilized to enhance your Power BI reports.
```python
import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 20, 25, 30, 40]

# Create a figure and axes
fig, ax = plt.subplots()

# Plot data
ax.plot(x, y)
# Show plot
plt.show()
```
Customizing Visualizations
```python
# Add grid
ax.grid(True)
# Show plot
plt.show()
```
Creating Subplots
Matplotlib supports the creation of multiple plots within a single
figure, known as subplots. This is particularly useful for comparing
multiple datasets side by side.
```python
import matplotlib.pyplot as plt

# Sample data
x1, y1 = [1, 2, 3, 4, 5], [1, 4, 9, 16, 25]
x2, y2 = [1, 2, 3, 4, 5], [25, 16, 9, 4, 1]

# Create two side-by-side subplots
fig, (ax1, ax2) = plt.subplots(1, 2)

# Plot data
ax1.plot(x1, y1)
ax2.plot(x2, y2)
# Add titles
ax1.set_title('Plot 1')
ax2.set_title('Plot 2')
# Show plot
plt.show()
```
```python
import matplotlib.pyplot as plt
import pandas as pd
# Show plot
plt.show()
```
In the next section, we will delve deeper into creating various types
of charts and graphs using Matplotlib, further expanding your
visualization toolkit within Power BI.
# Line Charts
Line charts are used to show trends over time or continuous data
points. They are one of the simplest yet most effective ways to
visualize data changes. Let’s create a basic line chart using
Matplotlib within Power BI.
```python
import matplotlib.pyplot as plt
# Sample data
months = ['January', 'February', 'March', 'April', 'May']
sales = [150, 200, 250, 300, 350]

# Create the line chart
fig, ax = plt.subplots()
ax.plot(months, sales, marker='o', linestyle='-', color='b')
ax.set_title('Monthly Sales')
ax.set_xlabel('Month')
ax.set_ylabel('Sales')

# Show plot
plt.show()
```
In this example, we first define our data: monthly sales figures. Using
`ax.plot()`, we plot the data points with a line connecting them. The
`marker='o'` parameter adds markers to each data point, while
`linestyle='-'` and `color='b'` customize the line style and color. Titles
and labels are added to ensure the chart is informative.
# Bar Charts
```python
import matplotlib.pyplot as plt
# Sample data
categories = ['A', 'B', 'C', 'D', 'E']
values = [5, 7, 3, 8, 6]

# Create the bar chart
fig, ax = plt.subplots()
ax.bar(categories, values)
ax.set_title('Category Values')
ax.set_xlabel('Category')
ax.set_ylabel('Value')

# Show plot
plt.show()
```
# Scatter Plots
```python
import matplotlib.pyplot as plt
# Sample data
height = [150, 160, 170, 180, 190]
weight = [50, 60, 65, 70, 75]

# Create the scatter plot
fig, ax = plt.subplots()
ax.scatter(height, weight, color='r')
ax.set_title('Height vs Weight')
ax.set_xlabel('Height')
ax.set_ylabel('Weight')

# Show plot
plt.show()
```
In this example, `ax.scatter()` is used to generate a scatter plot.
Each point on the plot represents a data pair from the `height` and
`weight` lists. The `color='r'` parameter sets the point color to red.
The plot is titled and labeled to ensure the relationship between
height and weight is easily understood.
# Pie Charts
```python
import matplotlib.pyplot as plt
# Sample data
segments = ['Frogs', 'Hogs', 'Dogs', 'Logs']
sizes = [15, 30, 45, 10]

# Create the pie chart
fig, ax = plt.subplots()
ax.pie(sizes, labels=segments, autopct='%1.1f%%', startangle=90)

# Add title
ax.set_title('Animal Proportions')
# Show plot
plt.show()
```
Here, `ax.pie()` creates a pie chart. The `sizes` list represents the
size of each pie segment, while `labels=segments` attaches labels to
each segment. The `autopct='%1.1f%%'` parameter formats the
percentage display on each segment, and `startangle=90` rotates
the pie chart to start from the top.
# Histograms
```python
import matplotlib.pyplot as plt
# Sample data
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]
# Create a histogram
fig, ax = plt.subplots()
ax.hist(data, bins=5, color='c', edgecolor='black')
# Show plot
plt.show()
```
In this example, `ax.hist()` creates a histogram. The `data` list
provides the values to be binned, and the `bins=5` parameter
specifies the number of bins. The `color='c'` and `edgecolor='black'`
parameters customize the bar colors and edges. Titles and labels
are added to aid interpretation.
# Box Plots
```python
import matplotlib.pyplot as plt
# Sample data
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Create the box plot
fig, ax = plt.subplots()
ax.boxplot(data)
ax.set_title('Box Plot')
ax.set_ylabel('Value')

# Show plot
plt.show()
```
In this example, `ax.boxplot()` generates a box plot for the provided
`data` list. Titles and labels are added to ensure the plot is clear and
informative.
```python
import matplotlib.pyplot as plt
import pandas as pd
# Show plot
plt.show()
```
Having created basic charts and graphs using Matplotlib, the next
step is to customize these visualizations to meet specific
requirements and enhance their visual appeal. Customization is key
to making your data stand out, providing more precise insights, and
delivering a more engaging user experience. In this section, we'll
delve into the various customization techniques available in
Matplotlib and how to apply them effectively within Power BI.
```python
import matplotlib.pyplot as plt
# Apply a style
plt.style.use('ggplot')
# Sample data
months = ['January', 'February', 'March', 'April', 'May']
sales = [150, 200, 250, 300, 350]

# Plot with the 'ggplot' style applied
fig, ax = plt.subplots()
ax.plot(months, sales)

# Show plot
plt.show()
```
Colors and markers play a pivotal role in making plots more readable
and visually appealing. Matplotlib provides extensive options for
customizing these elements.
```python
import matplotlib.pyplot as plt
# Sample data
months = ['January', 'February', 'March', 'April', 'May']
sales = [150, 200, 250, 300, 350]
sales2 = [180, 220, 260, 310, 380]

# Plot both series with different colours, markers, and line styles
fig, ax = plt.subplots()
ax.plot(months, sales, marker='o', linestyle='-', color='b', label='Product 1')
ax.plot(months, sales2, marker='s', linestyle='--', color='g', label='Product 2')
ax.legend()
plt.show()
```
In this example, we plot two sets of sales data with different colors
and markers. The `marker='s'` parameter changes the marker to a
square, and `linestyle='--'` changes the line style to dashed. The
`label` parameter adds a legend entry for each line.
```python
import matplotlib.pyplot as plt
# Sample data
months = ['January', 'February', 'March', 'April', 'May']
sales = [150, 200, 250, 300, 350]
# Show plot
plt.show()
```
# Adding Annotations
```python
import matplotlib.pyplot as plt
# Sample data
months = ['January', 'February', 'March', 'April', 'May']
sales = [150, 200, 250, 300, 350]

# Plot the data before annotating it
fig, ax = plt.subplots()
ax.plot(months, sales)
# Add annotation
ax.annotate('Peak Sales', xy=('May', 350), xytext=('March', 320),
arrowprops=dict(facecolor='black', shrink=0.05))
# Show plot
plt.show()
```
# Customizing Legends
```python
import matplotlib.pyplot as plt
# Sample data
months = ['January', 'February', 'March', 'April', 'May']
sales = [150, 200, 250, 300, 350]
sales2 = [180, 220, 260, 310, 380]

# Plot both series with legend labels
fig, ax = plt.subplots()
ax.plot(months, sales, label='Product A')
ax.plot(months, sales2, label='Product B')
# Customize legend
ax.legend(loc='upper left', fontsize='large', title='Products',
title_fontsize='medium')
# Show plot
plt.show()
```
```python
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y1 = [1, 4, 9, 16, 25]
y2 = [1, 3, 6, 10, 15]
# Create subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
# First subplot
ax1.plot(x, y1, marker='o', linestyle='-', color='b')
ax1.set_title('Quadratic Function')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
# Second subplot
ax2.plot(x, y2, marker='s', linestyle='--', color='r')
ax2.set_title('Triangular Numbers')
ax2.set_xlabel('x')
ax2.set_ylabel('y')
# Adjust layout
plt.tight_layout()
# Show plot
plt.show()
```
```python
import matplotlib.pyplot as plt
# Sample data
months = ['January', 'February', 'March', 'April', 'May']
sales = [150, 200, 250, 300, 350]

# Customize title and axis-label fonts
fig, ax = plt.subplots()
ax.plot(months, sales)
ax.set_title('Monthly Sales', fontsize=16, fontweight='bold', fontstyle='italic', fontfamily='serif')
ax.set_xlabel('Month', fontsize=12)
ax.set_ylabel('Sales', fontsize=12)

# Show plot
plt.show()
```
In this example, we customize the title, x-label, and y-label fonts. The
`fontsize`, `fontweight`, `fontstyle`, and `fontfamily` parameters
provide control over the text appearance, enhancing the plot's
overall readability and aesthetics.
In the following section, we will explore advanced visualization
techniques, allowing you to elevate your Matplotlib charts to a new
level of complexity and sophistication.
```python
import numpy as np
import matplotlib.pyplot as plt

# Sample data
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)
y3 = np.tan(x)
y4 = np.exp(x)
# Create subplots
fig, axs = plt.subplots(2, 2, figsize=(12, 8))
# First subplot
axs[0, 0].plot(x, y1, 'b-')
axs[0, 0].set_title('Sine Function')
# Second subplot
axs[0, 1].plot(x, y2, 'r--')
axs[0, 1].set_title('Cosine Function')
# Third subplot
axs[1, 0].plot(x, y3, 'g-.')
axs[1, 0].set_title('Tangent Function')
# Fourth subplot
axs[1, 1].plot(x, y4, 'k:')
axs[1, 1].set_title('Exponential Function')
# Adjust spacing
plt.tight_layout()
# Show plot
plt.show()
```
```python
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Load the Iris dataset and draw pairwise relationships
iris = sns.load_dataset('iris')
sns.pairplot(iris, hue='species', height=2.5)
plt.show()
```
The pair plot in this example uses the Iris dataset to visualize
pairwise relationships between different features, categorized by
species. The `hue='species'` parameter adds color coding to
differentiate between species, and `height=2.5` adjusts the size of
the plots.
```python
import plotly.express as px

# Interactive scatter of the Iris dataset, coloured by species
df = px.data.iris()
fig = px.scatter(df, x='sepal_width', y='sepal_length', color='species', hover_data=['petal_length'])
# Show plot
fig.show()
```
In this example, `px.scatter()` creates an interactive scatter plot that
includes hover information and color coding by species. Users can
interact with the plot, zoom, and hover over points to see additional
details.
```python
import seaborn as sns
import matplotlib.pyplot as plt

# Load the flights dataset and pivot it to months x years
flights = sns.load_dataset('flights')
flights_pivot = flights.pivot(index='month', columns='year', values='passengers')
# Create a heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(flights_pivot, annot=True, fmt="d", cmap="YlGnBu")
# Add title
plt.title('Flight Passengers Over Time')
# Show plot
plt.show()
```
```python
import plotly.express as px

# Life expectancy by country (Gapminder data)
data = px.data.gapminder().query("year == 2007")
fig = px.choropleth(data, locations='iso_alpha', color='lifeExp',
                    hover_name='country', color_continuous_scale='Viridis')
# Show plot
fig.show()
```
Here, `px.choropleth()` creates an interactive map visualizing life
expectancy across different countries. The `color_continuous_scale`
parameter applies a color gradient, making the differences in life
expectancy visually distinct.
```python
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Show plot
plt.show()
```
In this example, a scatter plot and a box plot are combined within a
single figure. This combination allows for a detailed exploration of
different aspects of the Iris dataset, facilitating more nuanced
insights.
```python
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y1 = [1, 4, 9, 16, 25]
y2 = [1, 3, 6, 10, 15]
# Plot data
ax.plot(x, y1, marker='o', linestyle='-', color='b', label='Quadratic')
ax.plot(x, y2, marker='s', linestyle='--', color='r', label='Triangular')
# Customize legend
ax.legend(loc='upper left', fontsize='large', title='Functions',
title_fontsize='medium')
# Show plot
plt.show()
```
Introduction to Seaborn
Installing Seaborn
```bash
pip install seaborn
```
This command will install Seaborn along with its dependencies,
including Matplotlib and NumPy.
Let's explore some of the key functionalities of Seaborn and how you
can leverage them within Power BI.
```python
import seaborn as sns
import matplotlib.pyplot as plt
# Example Dataset
data = sns.load_dataset('tips')

# Distribution of total bill amounts
sns.histplot(data['total_bill'], kde=True)
plt.title('Total Bill Distribution')
plt.show()
```
In Power BI, you can utilize Python scripts to generate similar plots,
providing a rich visual context for data distributions.
2. Box Plots
Box plots are invaluable for visualizing the spread and skewness of
data, highlighting outliers effectively.
```python
# Box Plot
sns.boxplot(x='day', y='total_bill', data=data)
plt.title('Total Bill Distribution by Day')
plt.xlabel('Day')
plt.ylabel('Total Bill')
plt.show()
```
3. Heatmaps
```python
# Correlation Heatmap
correlation = data.corr()
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
```
Embedding heatmaps in Power BI, you can enhance the analytical
depth of your dashboards, providing users with intuitive visual cues
on data relationships.
4. Pair Plots
```python
# Pair Plot
sns.pairplot(data)
plt.show()
```
```python
# Customizing Plots
sns.set_style('whitegrid')
sns.set_palette('pastel')
```
To integrate Seaborn visuals within Power BI, you can utilize Python
scripts in the Power BI Desktop. Here’s a step-by-step guide:
- In the Python script editor, input your Seaborn code. Ensure the
dataset passed from Power BI is referenced correctly.
```python
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
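# A minimal sketch -- 'dataset' is the DataFrame Power BI passes to the Python
# visual; the column names below are placeholders
df = pd.DataFrame(dataset)
sns.boxplot(x='Category', y='Value', data=df)
plt.title('Distribution by Category')
plt.show()
```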
- Click the play button to run the script and render the Seaborn
visualization within Power BI.
Introduction to Plotly
Installing Plotly
Before you can harness the power of Plotly within Power BI, you
need to ensure it's installed in your Python environment. Open your
command prompt or terminal and execute:
```bash
pip install plotly
```
Plotly’s strength lies in its simplicity and flexibility. Let's explore some
fundamental plot types and how you can create them using Plotly,
then move on to integrating these plots within Power BI.
1. Scatter Plots
```python
import plotly.express as px
# Example Dataset
df = px.data.iris()

# Interactive scatter plot
fig = px.scatter(df, x='sepal_width', y='sepal_length', color='species')
fig.show()
```
This code snippet creates a scatter plot of sepal width versus sepal
length, colored by species, providing an interactive way to explore
the Iris dataset.
2. Line Charts
Line charts are perfect for visualizing trends over time. Plotly’s line
charts come with interactive features that enhance data exploration.
```python
# Interactive Line Chart
fig = px.line(df, x='date', y='value', title='Time Series Analysis')
fig.show()
```
This creates an interactive line chart where users can hover over
data points to see detailed annotations, making it ideal for time
series analysis.
3. 3D Plots
```python
# 3D Scatter Plot
fig = px.scatter_3d(df, x='sepal_length', y='sepal_width',
z='petal_length', color='species')
fig.update_layout(title='3D Scatter Plot of Iris Dataset')
fig.show()
```
4. Heatmaps
```python
# Interactive Heatmap
import plotly.graph_objects as go

# Minimal heatmap example with an inline matrix
fig = go.Figure(data=go.Heatmap(z=[[1, 20, 30], [20, 1, 60], [30, 60, 1]]))
fig.show()
```
```python
# Customizing Interactive Plots
fig.update_traces(marker=dict(size=12, line=dict(width=2,
color='DarkSlateGrey')),
selector=dict(mode='markers'))
fig.update_layout(title='Customized Plot', xaxis_title='X Axis',
yaxis_title='Y Axis')
fig.show()
```
To bring these interactive Plotly visuals into Power BI, you can use
Python scripts within the Power BI Desktop environment. Follow
these steps to integrate Plotly plots:
```python
import plotly.express as px
import pandas as pd
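# A minimal sketch -- 'dataset' is the DataFrame Power BI passes to the Python
# visual; the column names below are placeholders
df = pd.DataFrame(dataset)
fig = px.scatter(df, x='column_x', y='column_y', color='category')
fig.show()
```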
- Click the play button to run the script and render the Plotly
visualization within Power BI.
```bash
pip install plotly
pip install geopandas
pip install folium
```
Plotly provides robust tools for creating interactive maps. Below are
examples of how to leverage Plotly for geospatial data:
Scatter geo maps are ideal for plotting points on a world map or
within specific regions. They can be used to visualize data such as
customer locations, sales points, or event occurrences.
```python
import plotly.express as px
# Sample Data
data = px.data.gapminder().query("year == 2007")

# Scatter geo map of population by country
fig = px.scatter_geo(data, locations='iso_alpha', size='pop', hover_name='country')
fig.show()
```
2. Choropleth Maps
```python
# Choropleth Map
fig = px.choropleth(data, locations="iso_alpha", color="lifeExp",
hover_name="country", projection="natural earth",
title="Life Expectancy in 2007")
fig.show()
```
This code generates a choropleth map where countries are colored
based on life expectancy data, providing an intuitive way to compare
regions.
```python
import geopandas as gpd
import matplotlib.pyplot as plt

# Load the bundled world boundaries dataset
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
# Plotting
world.plot()
plt.title('World Map')
plt.show()
```
This script uses Geopandas to load and plot a world map, providing
a basis for more complex geospatial visualizations.
```python
# Load city locations and plot them over the world map
cities = gpd.read_file(gpd.datasets.get_path('naturalearth_cities'))

# Plotting
ax = world.plot(color='white', edgecolor='black')
cities.plot(ax=ax, color='red')
plt.title('World Map with Cities')
plt.show()
```
```python
import folium
# Create a Map
m = folium.Map(location=[45.5236, -122.6750], zoom_start=13)
```
```python
# Adding Markers
folium.Marker([45.5236, -122.6750], popup='Portland').add_to(m)
# Adding a Layer
folium.CircleMarker([45.5236, -122.6750], radius=50, color='blue',
fill=True).add_to(m)
```
This example adds a marker and a circle marker to the map, making
it more informative and interactive.
```python
import plotly.express as px
import pandas as pd
```
- Click the play button to run the script and render the geospatial
visualization within Power BI.
Practical Applications of Geospatial Data Visualization
```python
import plotly.express as px
# Sample Data
data = px.data.gapminder().query("year == 2007")

# Scatter geo map of 2007 population, exported to an HTML file
fig = px.scatter_geo(data, locations='iso_alpha', size='pop', hover_name='country')
fig.write_html('population_2007.html')
```
This code snippet creates a scatter geo map of the global population
in 2007 and exports it as an HTML file.
PNG files are widely used for their balance between quality and file
size, making them suitable for embedding in Power BI reports.
```python
import matplotlib.pyplot as plt
# Sample Data
x = [1, 2, 3, 4]
y = [10, 20, 25, 30]
# Create Plot
plt.plot(x, y)
plt.title('Sample Line Plot')
# Export to PNG
plt.savefig('sample_line_plot.png')
```
This script generates a simple line plot and exports it as a PNG file.
```python
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Create Figure
fig = make_subplots(rows=1, cols=2)
```
```python
import matplotlib.pyplot as plt
import pandas as pd
# Plotting
plt.plot(df['Date'], df['Sales'])
plt.title('Sales Over Time')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.show()
```
In the Python script editor, input the Python code for your
visualization. Below is an example using Plotly to create an
interactive scatter plot.
```python
import plotly.express as px
import pandas as pd
```
```python
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Creating a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Heatmap')
plt.show()
```
```python
import plotly.express as px
import pandas as pd
```
```python
import statsmodels.api as sm
import pandas as pd
import matplotlib.pyplot as plt
```
1. Data Aggregation
Aggregating data helps reduce the volume of data points while
preserving the overall trends and patterns. Use aggregation
techniques to summarize data at a higher level:
```python
import pandas as pd
```
2. Data Sampling
```python
import pandas as pd

# Work with a representative 10% random sample of the data
df_sample = df.sample(frac=0.1, random_state=42)
```
3. Incremental Loading
Loading data incrementally, rather than all at once, can help manage
performance and memory usage. This is particularly useful when
data is updated frequently or in real-time:
```python
import pandas as pd
df = load_data_incrementally()
```
```python
import pandas as pd

# Columnar formats such as Parquet are much faster to read and write
df.to_parquet('data.parquet')
df = pd.read_parquet('data.parquet')
```
These formats provide faster read and write operations, reducing the
time needed to process and visualize large datasets.
```python
import pandas as pd
```
```python
import plotly.express as px
import pandas as pd
# Aggregating data
aggregated_data = df.groupby(['Year', 'Product Category']).agg({'Sales': 'sum'}).reset_index()
```
Conclusion
```python
import numpy as np
# Creating an array
data = np.array([1, 2, 3, 4, 5])
# Calculating mean
mean = np.mean(data)
print(f'Mean: {mean}')
```
```python
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({'Scores': [90, 85, 78, 92, 88]})
# Calculating summary statistics
summary = df['Scores'].describe()
print(summary)
```
```python
from scipy import stats
```

```python
import statsmodels.api as sm
```

```python
import seaborn as sns
import matplotlib.pyplot as plt
```
Descriptive Statistics
```python
mean = np.mean(df['Scores'])
median = np.median(df['Scores'])
mode = stats.mode(df['Scores'])
print(f'Mean: {mean}, Median: {median}, Mode: {mode}')
```
```python
variance = np.var(df['Scores'])
std_dev = np.std(df['Scores'])
print(f'Variance: {variance}, Standard Deviation: {std_dev}')
```
```python
skewness = stats.skew(df['Scores'])
kurtosis = stats.kurtosis(df['Scores'])
print(f'Skewness: {skewness}, Kurtosis: {kurtosis}')
```
Inferential Statistics
```python
# Performing a two-sample t-test
sample1 = [85, 90, 78, 92, 88]
sample2 = [80, 85, 75, 89, 84]
t_statistic, p_value = stats.ttest_ind(sample1, sample2)
print(f'T-statistic: {t_statistic}, P-value: {p_value}')
```
```python
# One-way ANOVA
f_statistic, p_value = stats.f_oneway(sample1, sample2, [82, 87, 80,
90, 85])
print(f'F-statistic: {f_statistic}, P-value: {p_value}')
```
Imagine you're working for a retail company, and you're tasked with
analyzing customer satisfaction scores across different regions. You
have collected data on customer satisfaction scores from five
regions, and you want to determine if there are significant
differences between them.
```python
import pandas as pd
# Creating a DataFrame with sample data
data = {
    'Region': ['North', 'South', 'East', 'West', 'Central'] * 4,
    'Satisfaction Score': [80, 85, 78, 92, 88, 76, 84, 82, 90, 86,
                           81, 87, 79, 93, 89, 77, 83, 81, 91, 87]
}
df = pd.DataFrame(data)
```
2. Descriptive Statistics
```python
summary = df.groupby('Region')['Satisfaction Score'].describe()
print(summary)
```
```python
import seaborn as sns
import matplotlib.pyplot as plt
```
```python
from scipy import stats
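# A sketch of the one-way ANOVA for this scenario: compare satisfaction scores
# across the five regions
groups = [group['Satisfaction Score'].values for _, group in df.groupby('Region')]
f_statistic, p_value = stats.f_oneway(*groups)
print(f'F-statistic: {f_statistic}, P-value: {p_value}')
```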
```python
import numpy as np
# Sample data
data = np.array([23, 20, 28, 24, 30, 22, 26])
# Calculating mean
mean = np.mean(data)
print(f'Mean: {mean}')
```
```python
# Calculating median
median = np.median(data)
print(f'Median: {median}')
```
```python
from scipy import stats
# Calculating mode
mode = stats.mode(data)
print(f'Mode: {mode.mode[0]}, Count: {mode.count[0]}')
```
Measures of Variability
```python
# Calculating range
range_value = np.ptp(data)
print(f'Range: {range_value}')
```
```python
# Calculating variance
variance = np.var(data)
print(f'Variance: {variance}')
```
Distribution Shape
```python
# Calculating skewness
skewness = stats.skew(data)
print(f'Skewness: {skewness}')
```
```python
# Calculating kurtosis
kurtosis = stats.kurtosis(data)
print(f'Kurtosis: {kurtosis}')
```
1. Data Preparation
```python
import pandas as pd
import numpy as np
from scipy import stats

# Placeholder sales figures for illustration
sales_data = {'Sales': [150, 300, 500, 150, 420, 300, 150]}
df = pd.DataFrame(sales_data)
```
```python
# Mean sales
mean_sales = np.mean(df['Sales'])
print(f'Mean Sales: {mean_sales}')
# Median sales
median_sales = np.median(df['Sales'])
print(f'Median Sales: {median_sales}')
# Mode sales
mode_sales = stats.mode(df['Sales'])
print(f'Mode Sales: {mode_sales.mode[0]}, Count: {mode_sales.count[0]}')
```
```python
import seaborn as sns
import matplotlib.pyplot as plt
```
Application in Power BI
```python
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
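# A minimal sketch for a Power BI Python visual -- 'dataset' is supplied by
# Power BI and 'Sales' is a placeholder column name
df = pd.DataFrame(dataset)
sns.histplot(df['Sales'], kde=True)
plt.title('Sales Distribution')
plt.show()
```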
Hypothesis Testing
5. Make a Decision:
- Compare the p-value to α: if p ≤ α, reject H₀; otherwise, fail to reject H₀.
Implementing Hypothesis Testing in Python
```python
import pandas as pd
from scipy import stats
import powerbipy as pbi
```
```python
sales_data = pbi.load_dataset('sales_data')
sales_before = sales_data['sales_before']
sales_after = sales_data['sales_after']
```
```python
t_stat, p_value = stats.ttest_rel(sales_before, sales_after)
```
```python
alpha = 0.05
if p_value <= alpha:
    result = ("Reject the null hypothesis. The marketing campaign "
              "significantly impacted sales.")
else:
    result = ("Fail to reject the null hypothesis. No significant impact "
              "from the marketing campaign.")
```
```python
pbi.display_text(f"T-statistic: {t_stat:.2f}, P-value: {p_value:.4f}")
pbi.display_text(result)
```
- Sales Before Campaign: [320, 340, 300, 310, 305, 290, 280]
- Sales After Campaign: [350, 360, 330, 345, 325, 315, 310]
We will apply the hypothesis testing steps to evaluate the
campaign's effectiveness.
```python
sales_before = [320, 340, 300, 310, 305, 290, 280]
sales_after = [350, 360, 330, 345, 325, 315, 310]
```
```python
t_stat, p_value = stats.ttest_rel(sales_before, sales_after)
print(f"T-statistic: {t_stat:.2f}, P-value: {p_value:.4f}")
```
```python
alpha = 0.05
if p_value <= alpha:
    result = ("Reject the null hypothesis. The marketing campaign "
              "significantly impacted sales.")
else:
    result = ("Fail to reject the null hypothesis. No significant impact "
              "from the marketing campaign.")
print(result)
```
Output:
```
T-statistic: -7.89, P-value: 0.0002
Reject the null hypothesis. The marketing campaign significantly
impacted sales.
```
Introduction to scikit-learn
What is scikit-learn?
2. Extensive Documentation:
The library includes comprehensive documentation and numerous
examples, making it accessible for both beginners and experienced
data scientists.
Installing scikit-learn
```bash
pip install scikit-learn
```
If you are working within the Power BI environment, make sure to set
up the Python scripting environment accordingly.
Core Concepts and Workflow
1. Data Preparation:
Load and preprocess your data using pandas or NumPy.
2. Model Selection:
Choose an appropriate machine learning model from scikit-learn's
extensive library.
3. Model Training:
Train the model using your dataset.
4. Model Evaluation:
Evaluate the model's performance using various metrics.
5. Prediction:
Make predictions on new data.
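As a minimal, self-contained sketch of that workflow (the customer data below is made up purely for illustration):
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Data preparation (toy data for illustration)
df = pd.DataFrame({'age': [25, 32, 47, 51, 38, 29, 44, 56],
                   'balance': [1200, 3400, 5600, 800, 2300, 1500, 4200, 700],
                   'churn': [0, 0, 1, 1, 0, 0, 1, 1]})
X = df[['age', 'balance']]
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# 2-3. Model selection and training
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 4. Model evaluation
print(accuracy_score(y_test, model.predict(X_test)))

# 5. Prediction on new data
print(model.predict(pd.DataFrame({'age': [40], 'balance': [2000]})))
```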
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import powerbipy as pbi
```
```python
churn_data = pbi.load_dataset('churn_data')
X = churn_data[['customer_age', 'account_balance']]
y = churn_data['churn']
```
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
```
```python
model = LogisticRegression()
model.fit(X_train, y_train)
```
Step 5: Making Predictions
```python
y_pred = model.predict(X_test)
```
```python
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)
```
```python
pbi.display_text(f"Model Accuracy: {accuracy:.2f}")
pbi.display_text("Confusion Matrix:")
pbi.display_dataframe(pd.DataFrame(conf_matrix, columns=
['Predicted Negative', 'Predicted Positive'], index=['Actual Negative',
'Actual Positive']))
pbi.display_text("Classification Report:")
pbi.display_text(class_report)
```
2. Streamlined Workflow:
Integrating scikit-learn with Power BI allows for a streamlined
workflow where data manipulation, model training, and result
visualization happen within a single environment.
3. Scalability:
Scikit-learn’s efficient implementation ensures that your machine
learning models can scale with your data, making it suitable for both
small datasets and large-scale applications.
4. Customization:
The flexibility of scikit-learn allows for customization of models and
pipelines to fit specific business needs.
1. Linear Regression:
Ideal for models where the relationship between the independent
and dependent variables is linear.
4. Logistic Regression:
Used for binary classification problems; despite its name, it is fundamentally a classification model rather than a regression model.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import powerbipy as pbi
```
```python
housing_data = pbi.load_dataset('housing_data')
X = housing_data[['square_footage', 'num_bedrooms']]
y = housing_data['price']
```
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
```
```python
model = LinearRegression()
model.fit(X_train, y_train)
```
```python
y_pred = model.predict(X_test)
```
```python
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
```
```python
pbi.display_text(f"Mean Squared Error: {mse:.2f}")
pbi.display_text(f"R-squared: {r2:.2f}")
```
```python
from sklearn.linear_model import Ridge
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)
y_ridge_pred = ridge_model.predict(X_test)
```
Polynomial Regression
```python
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train)
y_poly_pred = poly_model.predict(X_test_poly)
```
2. R-squared (R²):
Represents the proportion of variance in the dependent variable
explained by the model. Higher values (closer to 1) indicate better fit.
3. Adjusted R-squared:
Similar to R² but adjusts for the number of predictors in the model.
Useful when comparing models with different numbers of predictors.
```python
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae:.2f}")
```
Practical Considerations
1. Data Preprocessing:
Ensure data is clean and preprocessed adequately. Handle missing
values, outliers, and scaling appropriately.
2. Feature Engineering:
Create meaningful features that capture the underlying patterns in
the data. This includes polynomial features, interaction terms, and
other domain-specific features.
3. Model Selection:
Choose the appropriate model based on the data characteristics and
the problem at hand. Use cross-validation to assess model
performance robustly.
4. Regularization:
Apply regularization techniques like Ridge and Lasso to prevent
overfitting, especially when dealing with high-dimensional data.
5. Model Interpretation:
Interpret the model coefficients to understand the relationship
between predictors and the target variable. This helps in deriving
actionable insights.
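For instance, a quick way to inspect coefficients is a small summary table; a minimal sketch, assuming the linear regression model and the housing features fitted above:
```python
coef_summary = pd.DataFrame({
    'feature': X.columns,
    'coefficient': model.coef_
})
print(coef_summary)
print(f"Intercept: {model.intercept_:.2f}")
```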
1. Logistic Regression:
Despite its name, logistic regression is fundamentally a classification
model used for binary outcomes. It models the probability of a
categorical dependent variable based on one or more predictor
variables.
2. K-Nearest Neighbors (KNN):
A non-parametric method used for classification and regression. In
classification, it assigns a class based on the majority class among a
fixed number of nearest neighbors.
3. Decision Trees:
These models use a tree-like graph of decisions and their possible
consequences, making them easy to visualize and interpret.
4. Random Forest:
An ensemble method that creates multiple decision trees and
merges them together to get a more accurate and stable prediction.
6. Naive Bayes:
Based on Bayes' theorem, this classifier assumes independence
between predictors. It's particularly effective for large datasets and
real-time prediction.
7. Neural Networks:
These models are inspired by the human brain and are capable of
capturing complex patterns in data. They are highly effective in tasks
involving image and speech recognition.
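Of the models listed above, Naive Bayes does not appear in the worked example that follows; as a minimal sketch, a Gaussian Naive Bayes classifier can be trained on the same X_train/y_train split created below:
```python
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X_train, y_train)
y_nb_pred = nb.predict(X_test)
```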
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import powerbipy as pbi
```
```python
customer_data = pbi.load_dataset('customer_data')
X = customer_data[['age', 'income', 'previous_purchase']]
y = customer_data['purchase']
```
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
```
```python
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```
```python
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)
```
```python
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_knn_pred = knn.predict(X_test)
```
```python
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_rf_pred = rf.predict(X_test)
```
```python
from sklearn.svm import SVC
svm = SVC(kernel='linear')
svm.fit(X_train, y_train)
y_svm_pred = svm.predict(X_test)
```
1. Accuracy:
The ratio of correctly predicted instances to the total instances.
2. Confusion Matrix:
A table that describes the performance of a classification model by
showing the true positives, false positives, true negatives, and false
negatives.
5. Cross-Validation:
A technique to assess how the results of a statistical analysis will
generalize to an independent dataset. Common methods include k-
fold cross-validation and leave-one-out cross-validation.
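As a quick illustration of k-fold cross-validation with scikit-learn, a minimal sketch using the logistic regression classifier and the data from the example above:
```python
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"Mean CV accuracy: {cv_scores.mean():.2f} (+/- {cv_scores.std():.2f})")
```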
```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, roc_curve
precision = precision_score(y_test, y_svm_pred)
recall = recall_score(y_test, y_svm_pred)
f1 = f1_score(y_test, y_svm_pred)
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")
```
Practical Considerations
1. Data Preprocessing:
Clean and preprocess data adequately. Handle missing values,
outliers, and ensure proper scaling of features.
2. Feature Selection:
Select the most relevant features to improve model performance and
reduce overfitting.
3. Model Tuning:
Use techniques such as GridSearchCV or RandomizedSearchCV to
tune hyperparameters for optimal model performance.
5. Model Interpretation:
Tools like SHAP (SHapley Additive exPlanations) and LIME (Local
Interpretable Model-agnostic Explanations) can help interpret
complex models.
Understanding Clustering
1. K-Means Clustering:
One of the simplest and most popular clustering algorithms. It
partitions data into K clusters, minimizing the variance within each
cluster.
2. Hierarchical Clustering:
Builds a hierarchy of clusters either by merging smaller clusters into
bigger ones (agglomerative) or splitting bigger clusters into smaller
ones (divisive).
```python
import pandas as pd
from sklearn.cluster import KMeans
import powerbipy as pbi
import matplotlib.pyplot as plt
import seaborn as sns
```
```python
customer_data = pbi.load_dataset('customer_data')
X = customer_data[['age', 'annual_income', 'spending_score']]
```
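Before fixing the number of clusters, it helps to run the elbow method; a minimal sketch that plots the within-cluster inertia for a range of k values, using the feature set X loaded above:
```python
inertia = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(X)
    inertia.append(km.inertia_)
plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()
```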
From the Elbow plot, assume the optimal number of clusters (k) is 4.
```python
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300,
n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X)
```
```python
plt.figure(figsize=(10, 7))
sns.scatterplot(x='annual_income', y='spending_score',
hue=y_kmeans, palette='viridis', data=customer_data)
plt.title('Customer Segments')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.show()
```
```python
customer_data['Cluster'] = y_kmeans
pbi.display_table(customer_data)
```
Hierarchical Clustering
```python
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(X, method='ward')
plt.figure(figsize=(10, 7))
dendrogram(Z)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Data Points')
plt.ylabel('Distance')
plt.show()
```
DBSCAN
Example: DBSCAN
```python
from sklearn.cluster import DBSCAN
# Fit DBSCAN (eps and min_samples chosen for illustration; tune them for your data)
dbscan = DBSCAN(eps=0.5, min_samples=5)
y_dbscan = dbscan.fit_predict(X)
sns.scatterplot(x='annual_income', y='spending_score', hue=y_dbscan, palette='viridis', data=customer_data)
plt.title('DBSCAN Clustering')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.show()
```
Segmentation Techniques
Customer Segmentation:
```python
from sklearn.mixture import GaussianMixture
# Fit a Gaussian Mixture Model (4 components assumed, matching the K-Means example above)
gmm = GaussianMixture(n_components=4, random_state=42)
y_gmm = gmm.fit_predict(X)
customer_data['Segment'] = y_gmm
```
1. Data Scaling:
Ensure that data is properly scaled to avoid bias in the clustering
algorithm. Techniques like Min-Max scaling or Standardization can
be used.
2. Feature Selection:
Select relevant features that contribute significantly to the clustering
or segmentation process.
3. Interpreting Results:
Use visualization tools to interpret and present the clustering results
effectively.
4. Validation:
Validate the stability of clusters using methods like Silhouette Score
and Cross-Validation.
```python
from sklearn.metrics import silhouette_score
score = silhouette_score(X, y_kmeans)
print(f"Silhouette Score: {score:.2f}")
```
```python
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Re-run K-Means on the PCA-reduced data (4 clusters assumed, as above)
y_kmeans_pca = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42).fit_predict(X_pca)
plt.figure(figsize=(10, 7))
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=y_kmeans_pca,
palette='viridis')
plt.title('Clustering with PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
```
Conclusion
Several models are pivotal for time series analysis, each suited for
different types of data and objectives.
4. Prophet:
Developed by Facebook, Prophet is an open-source forecasting tool
designed to handle complex time series data with daily observations
and strong seasonal effects.
```python
import pandas as pd
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from prophet import Prophet  # the package was renamed from 'fbprophet' to 'prophet'
import matplotlib.pyplot as plt
import powerbipy as pbi
```
```python
sales_data = pbi.load_dataset('sales_data')
sales_data['date'] = pd.to_datetime(sales_data['date'])
sales_data.set_index('date', inplace=True)
```
```python
plt.figure(figsize=(10, 5))
plt.plot(sales_data.index, sales_data['sales'], label='Sales Data')
plt.title('Sales Data Time Series')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.show()
```
```python
from statsmodels.tsa.seasonal import seasonal_decompose
# Decompose the series (period=12 assumed; set it to your data's seasonal cycle)
decomposition = seasonal_decompose(sales_data['sales'], model='additive', period=12)
decomposition.plot()
plt.show()
```
```python
from pmdarima import auto_arima
# Let auto_arima search for reasonable (p, d, q) values
stepwise_model = auto_arima(sales_data['sales'], seasonal=False, trace=True)
print(stepwise_model.summary())
```
Prophet requires a specific format with columns `ds` (date) and `y`
(value).
```python
sales_data_reset = sales_data.reset_index()
sales_data_prophet = sales_data_reset.rename(columns={'date':
'ds', 'sales': 'y'})
```
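With the frame in this shape, fitting and forecasting is straightforward; a minimal sketch (a 30-day horizon is assumed here):
```python
model = Prophet()
model.fit(sales_data_prophet)
future = model.make_future_dataframe(periods=30)
forecast = model.predict(future)
model.plot(forecast)
plt.show()
```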
1. Stationarity:
Ensure that the time series data is stationary. Non-stationary data
can be transformed through differencing.
2. Seasonal Patterns:
Recognize and adjust for seasonality in your data to avoid
misleading results.
3. Model Validation:
Validate your models using out-of-sample tests and cross-validation
techniques to ensure robustness.
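For time series, a simple out-of-sample check holds back the most recent observations and compares forecasts against them; a minimal sketch using the sales_data frame from above (an ARIMA(1, 1, 1) order is assumed for illustration):
```python
train = sales_data['sales'][:-12]   # all but the last 12 periods
test = sales_data['sales'][-12:]    # hold out the most recent 12 periods
model_fit = ARIMA(train, order=(1, 1, 1)).fit()
pred = model_fit.forecast(steps=12)
rmse = np.sqrt(((pred.values - test.values) ** 2).mean())
print(f"Out-of-sample RMSE: {rmse:.2f}")
```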
```python
sales_data['sales'].fillna(method='ffill', inplace=True)
```
```python
sales_data['sales_diff'] = sales_data['sales'].diff()
sales_data.dropna(inplace=True)
```
6. Model Selection:
Choose the appropriate model based on the data's characteristics
and the analysis objective. Use metrics like AIC, BIC, and RMSE for
model comparison.
```python
aic_values = []
for p in range(0, 3):
    for q in range(0, 3):
        try:
            model = ARIMA(sales_data['sales'], order=(p, 1, q))
            model_fit = model.fit()
            aic_values.append((p, q, model_fit.aic))
        except Exception:
            continue
best_params = min(aic_values, key=lambda x: x[2])
print(f'Best ARIMA parameters: p={best_params[0]}, q={best_params[1]}, AIC={best_params[2]}')
```
```bash
pip install pandas numpy scikit-learn matplotlib powerbipy
```
```python
import pandas as pd
import powerbipy as pbi
customer_data = pbi.load_dataset('customer_data')
```
```python
# Handle missing values
customer_data.fillna(customer_data.mean(), inplace=True)
# Scale numeric features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
customer_data[['age', 'balance', 'tenure']] = scaler.fit_transform(customer_data[['age', 'balance', 'tenure']])
```
Step 5: Splitting the Data
```python
from sklearn.model_selection import train_test_split
X = customer_data.drop('churn', axis=1)
y = customer_data['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```
```python
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100,
random_state=42)
model.fit(X_train, y_train)
```
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1-Score: {f1:.2f}')
```
```python
import joblib
joblib.dump(model, 'customer_churn_model.pkl')
```
In Power BI, use Python scripts to load the saved model and make
predictions on new data.
```python
import joblib
import powerbipy as pbi
loaded_model = joblib.load('customer_churn_model.pkl')
# Make predictions on new data (a dataset of customers to score is assumed here)
new_data = pbi.load_dataset('new_data')
new_data['churn_prediction'] = loaded_model.predict(new_data.drop('customer_id', axis=1))
output = new_data[['customer_id', 'churn_prediction']]
```
Once you have the predictions, you can visualize them directly in
Power BI. Create a new visual to display the churn predictions
alongside other customer information.
```python
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20, 30]
}
grid_search = GridSearchCV(estimator=model,
param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
```
```python
import shap
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
```
```python
from flask import Flask, request, jsonify
import joblib
import pandas as pd

app = Flask(__name__)
model = joblib.load('customer_churn_model.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json(force=True)
    prediction = model.predict(pd.DataFrame(data))
    return jsonify(prediction.tolist())

if __name__ == '__main__':
    app.run(port=5000, debug=True)
```
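Once the service is running locally, you can test it with a simple request; for example (the JSON keys are illustrative and should match the feature columns the model was trained on):
```bash
curl -X POST http://localhost:5000/predict \
     -H "Content-Type: application/json" \
     -d '{"age": [45], "balance": [2500], "tenure": [3]}'
```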
By embedding machine learning models into Power BI, you are not just
enhancing your analytics but transforming your entire approach to
data-driven decision-making. This integration empowers you to
predict trends, automate processes, and optimize business
outcomes with unprecedented accuracy and efficiency.
Introduction
Background
Implementation
```python
# Load data
sales_data = pbi.load_dataset('sales_data')
```
```python
from statsmodels.tsa.arima.model import ARIMA
# Fit an ARIMA model (order (1, 1, 1) assumed for illustration)
model = ARIMA(sales_data['sales'], order=(1, 1, 1))
model_fit = model.fit()
# Forecast
forecast = model_fit.forecast(steps=12)
```
Integration in Power BI: The forecasted sales data were loaded back
into Power BI and visualized to highlight trends and potential stock
issues.
```python
# Load forecast into Power BI
forecast_data = pd.DataFrame({'forecasted_sales': forecast})
pbi.save_dataset('forecasted_sales', forecast_data)
```
Outcome
Background
Implementation
Data Collection: Patient records, including demographics, medical
history, and previous hospital admissions, were gathered.
```python
# Load data
patient_data = pbi.load_dataset('patient_data')
```
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X = patient_data.drop('readmission', axis=1)
y = patient_data['readmission']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
# Persist the trained model so the Power BI script below can load it
import joblib
joblib.dump(model, 'readmission_model.pkl')
```
```python
import joblib
# Load model and new data
loaded_model = joblib.load('readmission_model.pkl')
new_patient_data = pbi.load_dataset('new_patient_data')
# Predict readmissions
new_patient_data['readmission_prediction'] = loaded_model.predict(new_patient_data.drop('patient_id', axis=1))
output = new_patient_data[['patient_id', 'readmission_prediction']]
```
Outcome
Background
Implementation
```python
# Load data
transaction_data = pbi.load_dataset('transaction_data')
# Feature engineering
transaction_data['transaction_frequency'] = transaction_data.groupby('customer_id')['transaction_amount'].transform('count')
transaction_data['average_transaction'] = transaction_data.groupby('customer_id')['transaction_amount'].transform('mean')
```
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X = transaction_data.drop('fraud', axis=1)
y = transaction_data['fraud']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
```
```python
import joblib
# Load model and new data
loaded_model = joblib.load('fraud_detection_model.pkl')
new_transaction_data = pbi.load_dataset('new_transaction_data')
# Feature engineering
new_transaction_data['transaction_frequency'] = new_transaction_data.groupby('customer_id')['transaction_amount'].transform('count')
new_transaction_data['average_transaction'] = new_transaction_data.groupby('customer_id')['transaction_amount'].transform('mean')
# Predict fraud
new_transaction_data['fraud_prediction'] = loaded_model.predict(new_transaction_data.drop('transaction_id', axis=1))
output = new_transaction_data[['transaction_id', 'fraud_prediction']]
```
Outcome
Conclusion
```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
total = sum(numbers)
print(total)
```
Using the `sum()` function here is not only concise but also more
efficient than writing a loop to accumulate the sum.
Example:
```python
# Traditional loop
squares = []
for x in range(10):
    squares.append(x ** 2)
# List comprehension
squares = [x ** 2 for x in range(10)]
```
The latter is not only more readable but also faster, as the list
comprehension is typically optimized in Python's interpreter.
Example:
```python
# Using map to square all numbers in a list
numbers = [1, 2, 3, 4, 5]
squared_numbers = list(map(lambda x: x ** 2, numbers))
```
Example:
```python
def generate_squares(n):
    for i in range(n):
        yield i ** 2
squares = generate_squares(10)
for square in squares:
    print(square)
```
```python
import pandas as pd
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5],
'B': [10, 20, 30, 40, 50]
})
# Vectorized addition
df['C'] = df['A'] + df['B']
```
```python
df['category'] = df['category'].astype('category')
```
Example:
```python
import cProfile
def calculate_squares():
    result = [x ** 2 for x in range(1000000)]
    return result
cProfile.run('calculate_squares()')
```
```python
import pandas as pd
# Load data
data = pd.read_csv('sales_data.csv')
# Fill missing values using forward fill
data.ffill(inplace=True)
# Total sales per product, sorted for easier visualization (column names assumed)
total_sales = data.groupby('Product')['Sales'].sum().sort_values(ascending=False)
print(total_sales)
```
In this script, the data is loaded, missing values are filled using
forward fill, and the total sales per product are calculated using
`groupby`, followed by sorting for better visualization—all done
efficiently using Pandas.
Writing efficient Python code is not just about making your scripts run
faster; it's about adopting practices that lead to cleaner, more
readable, and maintainable code. Leveraging Python’s built-in
functions, list comprehensions, generators, and optimized libraries
like Pandas, along with profiling and benchmarking, can transform
your Power BI projects, making them more robust and performant. In
the realm of data analytics, where every second counts, such
efficiency can be the difference between a good analyst and a great
one.
```python
def calculate_discount(price, discount):
    """
    Function to calculate the discount on a price.
    """
    discounted_price = price - (price * discount / 100)
    return discounted_price
```
```python
import pandas as pd
# Sample data
data = pd.DataFrame({
'Product': ['A', 'B', 'C'],
'Price': [100, 200, 300],
'Discount': [10, 15, 20]
})
# Apply the discount function defined above
data['Discounted_Price'] = calculate_discount(data['Price'], data['Discount'])
print(data)
```
```python
import pandas as pd
# Load data
data = pd.read_csv('sales_data.csv')
# Filter data
filtered_data = data[data['Sales'] > 1000]
# Group by product to calculate total sales per product
summary = filtered_data.groupby('Product')['Sales'].sum()
print(summary)
```
In this script, the data is filtered to include only rows where sales
exceed 1000 units, and then it is grouped by product to calculate the
total sales per product.
```python
import numpy as np
# Generate random data
data = np.random.randn(1000)
```
Example:
```python
import matplotlib.pyplot as plt
# Sample data
products = ['A', 'B', 'C']
sales = [1500, 2000, 1700]
# Bar chart of sales by product
plt.bar(products, sales)
plt.title('Sales by Product')
plt.xlabel('Product')
plt.ylabel('Sales')
plt.show()
```
This script generates a bar chart displaying sales data, which can
then be embedded into a Power BI dashboard for enhanced visual
storytelling.
```python
# utils.py
def calculate_discount(price, discount):
    return price - (price * discount / 100)

import pandas as pd
# Sample data
data = pd.DataFrame({
'Product': ['A', 'B', 'C'],
'Price': [100, 200, 300],
'Discount': [10, 15, 20],
'Tax_Rate': [5, 10, 15]
})
print(data)
```
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Load data
data = pd.read_csv('sales_data.csv')
# Prepare data
X = data[['Marketing_Spend', 'Seasonality']]
y = data['Sales']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
predictions = model.predict(X_test)
print(predictions)
```
By default, Python uses the `int` type for integers, which can handle
large values but also consumes more memory. For datasets where
the range of values is known, using more efficient types such as
`numpy.int8`, `numpy.int16`, or `numpy.int32` can save memory.
Example:
```python
import numpy as np
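import pandas as pd

# A minimal sketch with illustrative data: one million small integers stored as int64 by default
df = pd.DataFrame({'flag': np.random.randint(0, 100, size=1_000_000)})
print(df['flag'].memory_usage(deep=True))  # roughly 8 MB with int64

# Downcast to int8, since every value fits in the range -128..127
df['flag'] = df['flag'].astype(np.int8)
print(df['flag'].memory_usage(deep=True))  # roughly 1 MB with int8
```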
In this example, changing the data type from `int64` to `int8` reduces
memory usage significantly, making it an effective optimization
strategy.
Example:
```python
import pandas as pd
```
Reading large datasets into memory all at once can quickly exhaust
your system's resources. Instead, consider techniques such as
chunking, which involves processing data in smaller, more
manageable pieces.
Example:
```python
import pandas as pd
# Read a large CSV in chunks instead of all at once
chunks = []
for chunk in pd.read_csv('large_dataset.csv', chunksize=100000):
    chunks.append(chunk)  # or process each chunk and keep only the result
df = pd.concat(chunks)
```
Memory Profiling
Using memory_profiler
The `memory_profiler` library provides a line-by-line analysis of
memory usage in your scripts.
Example:
```python
import pandas as pd
from memory_profiler import profile

@profile
def process_data():
    data = pd.read_csv('large_dataset.csv')
    data['Processed_Column'] = data['Original_Column'].apply(lambda x: x + 1)
    return data

if __name__ == "__main__":
    process_data()
```
Optimizing DataFrames
Example:
```python
import pandas as pd
data = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': [1.0, 2.0, 3.0, 4.0]
})
print(data.dtypes)
print(data.memory_usage(deep=True))
```
Example:
```python
data.drop(columns=['Unnecessary_Column'], inplace=True)
```
3. Setting Appropriate Indexes:
Using appropriate indexes can speed up operations and reduce
memory usage.
Example:
```python
data.set_index('ID_Column', inplace=True)
```
Example:
```python
def data_generator(file_path):
    with open(file_path) as file:
        for line in file:
            yield process_line(line)  # process_line is a placeholder for your own parsing function
```
By using generators, you can process each line of a large file without
loading the entire file into memory, ensuring efficient memory usage.
Caching Results
Using lru_cache
Example:
```python
from functools import lru_cache

@lru_cache(maxsize=128)
def expensive_computation(param):
    # Simulate an expensive computation
    result = param ** 2
    return result

# Example usage
print(expensive_computation(4))
print(expensive_computation(4))  # This call will use the cached result
```
1. Syntax Errors:
These occur when the Python parser encounters code that doesn't
conform to the syntax rules of the language. Syntax errors are often
straightforward to identify and fix.
Example:
```python
print("Hello, World!
```
In this example, the missing closing quotation mark results in a
syntax error.
2. Runtime Errors:
These occur during the execution of the script. Runtime errors can
range from simple issues like division by zero to more complex
problems like accessing a non-existent file.
Example:
```python
result = 10 / 0
```
Here, attempting to divide by zero will raise a `ZeroDivisionError`.
3. Logical Errors:
These are the most challenging to detect because they don't cause
the script to crash but lead to incorrect results. Logical errors arise
from mistakes in the program logic.
Example:
```python
total = sum([1, 2, 3])
average = total / 4 # Logical error: should divide by 3
```
This error leads to an incorrect average calculation.
Example:
```python
try:
    result = 10 / 0
except ZeroDivisionError:
    print("Error: Division by zero is not allowed.")
```
Example:
```python
try:
    value = int(input("Enter a number: "))
    result = 10 / value
except ValueError:
    print("Error: Invalid input. Please enter a valid number.")
except ZeroDivisionError:
    print("Error: Division by zero is not allowed.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
```
The `else` block can be used to execute code that should run only if
the `try` block doesn’t raise an exception. The `finally` block can be
used to execute code regardless of whether an exception occurred
or not.
Example:
```python
try:
    value = int(input("Enter a number: "))
    result = 10 / value
except (ValueError, ZeroDivisionError) as e:
    print(f"Error occurred: {e}")
else:
    print(f"The result is {result}")
finally:
    print("Execution completed.")
```
Inserting print statements at various points in your code can help you
trace the flow of execution and inspect variable values.
Example:
```python
def calculate_average(numbers):
    total = sum(numbers)
    print(f"Total: {total}")
    average = total / len(numbers)
    print(f"Average: {average}")
    return average
calculate_average([1, 2, 3, 4])
```
Example:
```python
import logging
logging.basicConfig(level=logging.DEBUG)

def calculate_average(numbers):
    total = sum(numbers)
    logging.debug(f"Total: {total}")
    average = total / len(numbers)
    logging.debug(f"Average: {average}")
    return average

calculate_average([1, 2, 3, 4])
```
Using Debuggers
Example:
```python
import pdb

def calculate_average(numbers):
    pdb.set_trace()
    total = sum(numbers)
    average = total / len(numbers)
    return average

calculate_average([1, 2, 3, 4])
```
Running this script will enter the interactive debugging mode at the
`pdb.set_trace()` line, allowing you to step through your code.
Visual Studio Code Debugger
Example:
```python
import pandas as pd
```
Reading and writing files can lead to errors if the file doesn’t exist or
if there are permission issues. Use `try` and `except` blocks to
handle such errors.
Example:
```python
try:
    with open('non_existent_file.txt', 'r') as file:
        data = file.read()
except FileNotFoundError:
    print("Error: The file does not exist.")
except IOError:
    print("Error: An I/O error occurred.")
```
```python
import requests
import pandas as pd

url = "https://api.example.com/data"
response = requests.get(url)
if response.status_code == 200:
    data = response.json()
    df = pd.DataFrame(data)
else:
    print(f"Failed to retrieve data. Status code: {response.status_code}")
```
2. Transform Data:
```python
# Example transformation: convert date column to datetime format
df['date'] = pd.to_datetime(df['date'])
```
You can schedule this script to run at regular intervals using Task
Scheduler on Windows or Cron jobs on Unix-based systems,
ensuring your Power BI dataset is always up to date.
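For example, a crontab entry that runs the script every hour might look like this (the interpreter path and script name are placeholders):
```bash
# minute hour day-of-month month day-of-week command
0 * * * * /usr/bin/python3 /path/to/refresh_powerbi_data.py
```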
```python
import pandas as pd
# Load dataset
df = pd.read_csv('data.csv')
# Remove outliers
df = df[df['column_name'] < df['column_name'].quantile(0.99)]
```
Using such scripts, you can ensure that your data cleaning process
is consistent and error-free, saving you from manually performing
these tasks each time you refresh your dataset.
```python
import pandas as pd
import matplotlib.pyplot as plt
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Load dataset
df = pd.read_csv('data.csv')
# Generate a plot
plt.figure(figsize=(10, 6))
plt.plot(df['date'], df['value'])
plt.title('Daily Values')
plt.xlabel('Date')
plt.ylabel('Value')
plt.savefig('report_plot.png')

# Compose the email (sender_email, receiver_email, subject and body are assumed to be defined earlier)
msg = MIMEMultipart()
msg['From'] = sender_email
msg['To'] = receiver_email
msg['Subject'] = subject
msg.attach(MIMEText(body, 'plain'))
```
```python
# 'response' is assumed to be the result of a POST to the Power BI REST API refresh endpoint (not shown)
if response.status_code == 202:
    print("Dataset refresh initiated successfully.")
else:
    print(f"Failed to initiate dataset refresh. Status code: {response.status_code}")
```
By scheduling this script to run at regular intervals, you can ensure that your Power BI dashboards are always up to date with the latest data.
Understanding APIs
APIs are a set of protocols and tools that allow different software
applications to communicate with each other. They enable you to
access data from various services, such as financial markets, social
media platforms, and weather services, in a structured and
standardized manner.
```python
import requests

api_url = "https://api.exchangerate-api.com/v4/latest/USD"
response = requests.get(api_url)
if response.status_code == 200:
    data = response.json()
    print(data)
else:
    print(f"Error: {response.status_code}")
```
```python
import requests
import pandas as pd

api_url = "https://api.exchangerate-api.com/v4/latest/USD"
response = requests.get(api_url)
if response.status_code == 200:
    data = response.json()
    df = pd.DataFrame(data["rates"].items(), columns=["Currency", "Rate"])
else:
    print(f"Error: {response.status_code}")
```
This script retrieves the latest exchange rates and transforms the
data into a Pandas DataFrame, which is then loaded into Power BI
for further analysis.
Web Scraping
While APIs provide structured access to data, not all data sources
offer APIs. This is where web scraping comes into play. Web
scraping involves extracting data from websites by parsing the HTML
content. Python's libraries, such as `BeautifulSoup` and `Selenium`,
make web scraping a manageable task.
Example: Web Scraping with BeautifulSoup
Let's say you need to scrape stock prices from a financial news
website.
```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.example.com/stock-prices"
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table', {'id': 'stock-prices'})
    rows = table.find_all('tr')
    data = []
    for row in rows[1:]:
        cols = row.find_all('td')
        data.append({
            'Symbol': cols[0].text.strip(),
            'Price': float(cols[1].text.strip())
        })
    df = pd.DataFrame(data)
else:
    print(f"Error: {response.status_code}")
```
When working with APIs and web scraping, you must often handle
authentication and rate limiting. Authentication ensures that only
authorized users can access the data, while rate limiting prevents
overloading the server with too many requests in a short period.
```python
import requests
import pandas as pd

api_url = "https://api.example.com/data"
headers = {
    "Authorization": "Bearer YOUR_API_KEY"
}
response = requests.get(api_url, headers=headers)
if response.status_code == 200:
    data = response.json()
    df = pd.DataFrame(data)
else:
    print(f"Error: {response.status_code}")
```
```python
import time
import requests

api_url = "https://api.example.com/data"
headers = {
    "Authorization": "Bearer YOUR_API_KEY"
}
for i in range(10):
    response = requests.get(api_url, headers=headers)
    if response.status_code == 200:
        data = response.json()
        # Process data...
    else:
        print(f"Error: {response.status_code}")
    time.sleep(1)  # pause between requests to stay within the API's rate limit
```
1. Sample DataFrames:
```python
import pandas as pd
sales_data = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'product': ['A', 'B', 'C', 'D'],
'amount': [100, 150, 200, 250]
})
customer_data = pd.DataFrame({
'customer_id': [1, 2, 3, 5],
'name': ['John', 'Jane', 'Doe', 'Smith'],
'region': ['North', 'East', 'South', 'West']
})
```
2. Merging DataFrames:
```python
merged_data = pd.merge(sales_data, customer_data,
on='customer_id', how='inner')
print(merged_data)
```
Suppose you want to calculate the total sales amount for each
product.
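A minimal sketch of such a script, using the sales_data frame defined above:
```python
total_sales = sales_data.groupby('product')['amount'].sum().reset_index()
print(total_sales)
```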
This script groups the sales data by 'product' and calculates the sum
of the 'amount' for each group, providing a summary of total sales
per product.
Reshaping Data
1. Sample DataFrame:
```python
sales_data = pd.DataFrame({
'date': ['2023-01-01', '2023-01-02', '2023-01-01', '2023-01-02'],
'product': ['A', 'A', 'B', 'B'],
'amount': [100, 150, 200, 250]
})
```
2. Pivoting Data:
```python
pivoted_data = sales_data.pivot(index='date', columns='product',
values='amount')
print(pivoted_data)
```
1. Sample DataFrame:
```python
data = pd.DataFrame({
'product': ['A', 'B', 'C', 'D'],
'amount': [100, None, 200, None]
})
```
2. Filling Missing Values:
```python
filled_data = data.fillna(data['amount'].mean())
print(filled_data)
```
This script fills the missing values in the 'amount' column with the
mean of the existing values, ensuring that the DataFrame is
complete for analysis.
1. Sample DataFrame:
```python
data = pd.DataFrame({
'region': ['North', 'North', 'South', 'South'],
'product': ['A', 'B', 'A', 'B'],
'amount': [100, 150, 200, 250]
}).set_index(['region', 'product'])
```
Suppose you have a time series data and you need to resample it to
a different frequency.
1. Sample DataFrame:
```python
date_range = pd.date_range(start='2023-01-01', periods=6, freq='D')
time_series_data = pd.DataFrame({
'date': date_range,
'value': [10, 20, 30, 40, 50, 60]
}).set_index('date')
```
2. Resampling Data:
```python
resampled_data = time_series_data.resample('2D').sum()
print(resampled_data)
```
```python
import pandas as pd

def remove_duplicates(df):
    return df.drop_duplicates()

def standardize_column_names(df):
    df.columns = [col.strip().lower().replace(' ', '_') for col in df.columns]
    return df
```
```python
# Sample DataFrame
data = pd.DataFrame({
'Product': ['A', 'B', 'C', 'D'],
'Sales': [100, 150, 200, 250]
})
# Normalizing the 'Sales' column using z-score
data = normalize_column(data, 'Sales', method='z-score')
print(data)
```
```python
# Sample DataFrame
data = pd.DataFrame({
    'Product ID': [1, 2, 2, 4],
    'Sales': [100, 150, None, 250],
    'Category': ['A', 'B', 'A', 'B']
})
# Apply the package's cleaning helpers (the import path assumes the module lives in a my_data_tools package)
from my_data_tools.data_cleaning import remove_duplicates, standardize_column_names
data = standardize_column_names(remove_duplicates(data))
print(data)
```
This script demonstrates how to use the `my_data_tools` package,
illustrating the convenience and organization provided by packaging
your modules.
1. Adding Docstrings:
```python
# file: data_cleaning.py
import pandas as pd

def remove_duplicates(df):
    """
    Remove duplicate rows from a DataFrame.

    Parameters:
    df (pandas.DataFrame): The DataFrame to clean.

    Returns:
    pandas.DataFrame: The cleaned DataFrame with duplicates removed.
    """
    return df.drop_duplicates()

def fill_missing_values(df, method='mean'):
    """
    Fill missing values in a DataFrame using the column mean or median.

    Parameters:
    df (pandas.DataFrame): The DataFrame with missing values.
    method (str): The method to fill missing values ('mean' or 'median').

    Returns:
    pandas.DataFrame: The DataFrame with missing values filled.
    """
    if method == 'mean':
        return df.fillna(df.mean())
    elif method == 'median':
        return df.fillna(df.median())
    else:
        raise ValueError("Method must be 'mean' or 'median'")

def standardize_column_names(df):
    """
    Standardize column names in a DataFrame by converting to lower case and replacing spaces with underscores.

    Parameters:
    df (pandas.DataFrame): The DataFrame with column names to standardize.

    Returns:
    pandas.DataFrame: The DataFrame with standardized column names.
    """
    df.columns = [col.strip().lower().replace(' ', '_') for col in df.columns]
    return df
```
```python
import unittest
import pandas as pd
from data_cleaning import remove_duplicates, fill_missing_values, standardize_column_names

class TestDataCleaning(unittest.TestCase):
    def test_remove_duplicates(self):
        df = pd.DataFrame({'A': [1, 2, 2, 4]})
        result = remove_duplicates(df)
        self.assertEqual(len(result), 3)

    def test_fill_missing_values(self):
        df = pd.DataFrame({'A': [1, None, 3]})
        result = fill_missing_values(df, method='mean')
        self.assertEqual(result['A'].iloc[1], 2)

    def test_standardize_column_names(self):
        df = pd.DataFrame({' A ': [1], 'B Column': [2]})
        result = standardize_column_names(df)
        self.assertIn('a', result.columns)
        self.assertIn('b_column', result.columns)

if __name__ == '__main__':
    unittest.main()
```
Unit tests like these ensure that your functions work correctly and
handle various scenarios, enhancing the reliability of your reusable
scripts.
```python
import pandas as pd
# Inefficient
df = df.drop(columns=['unnecessary_column'])
# Efficient
df.drop(columns=['unnecessary_column'], inplace=True)
```
```python
# Inefficient
for index, row in df.iterrows():
    df.at[index, 'new_column'] = row['column1'] + row['column2']
# Efficient
df['new_column'] = df['column1'] + df['column2']
```
- Use Appropriate File Formats: CSV files are easy to use but not
always the most efficient. Formats like Parquet and Feather are
optimized for performance and can handle larger datasets more
efficiently.
```python
# Reading a CSV file
df = pd.read_csv('data.csv')
# Reading the same data from a Parquet file (requires pyarrow or fastparquet)
df = pd.read_parquet('data.parquet')
```
```python
chunk_size = 10000
chunks = []
for chunk in pd.read_csv('large_data.csv', chunksize=chunk_size):
    chunks.append(chunk)
df = pd.concat(chunks)
```
```python
df['integer_column'] = df['integer_column'].astype('int8')
```
```python
import gc
# Explicitly trigger garbage collection after deleting large intermediate objects
gc.collect()
```
```python
# Inefficient
result = expensive_operation(data)
result_again = expensive_operation(data)
# Efficient
result = expensive_operation(data)
result_again = result
```
```python
import cProfile
def my_function():
    # Your code here
    pass
cProfile.run('my_function()')
```
```python
from multiprocessing import Pool
def process_data(data_chunk):
    # Your processing code here
    return data_chunk  # placeholder: return the processed chunk

# Run the processing function across chunks in parallel
# (data_chunks is assumed to be an iterable of chunks prepared earlier)
if __name__ == '__main__':
    with Pool(processes=4) as pool:
        results = pool.map(process_data, data_chunks)
```
```python
import numba
@numba.jit
def fast_function(data):
    # Your optimized code here (plain loops and NumPy operations work best with numba)
    result = data
    return result
```
By adopting these best practices, you can ensure that your Python
scripts within Power BI not only perform efficiently but also remain
maintainable and scalable. As you incorporate these techniques into
your workflow, you'll find that even the most complex analyses
become manageable and performant, paving the way for more
insightful and impactful data-driven decisions.
In the realm of data analytics, the ability to not only analyze historical
data but also predict future trends is invaluable. Python's integration
with Power BI opens up a wealth of possibilities for advanced
analytics and predictive modeling, transforming raw data into
actionable insights that can drive strategic decisions. This section
delves into the advanced techniques and methodologies necessary
to harness the full potential of predictive modeling within Power BI
using Python.
```python
import pandas as pd
# Load data
df = pd.read_csv('data.csv')
# Fill missing values with the mean of the column
df.fillna(df.mean(), inplace=True)
```
```python
# Normalize numeric features
df['normalized_feature'] = (df['feature'] - df['feature'].mean()) / df['feature'].std()
```
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Split data into training and testing sets
X = df[['feature1', 'feature2']]
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=0)
# Train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
predictions = model.predict(X_test)
```
```python
from sklearn.linear_model import LogisticRegression
# Train the classifier
log_model = LogisticRegression()
log_model.fit(X_train, y_train)
# Predict
log_predictions = log_model.predict(X_test)
```
```python
from sklearn.ensemble import RandomForestRegressor
# Train a random forest (a regression task is assumed here, mirroring the linear model above)
rf_model = RandomForestRegressor(n_estimators=100, random_state=0)
rf_model.fit(X_train, y_train)
# Predict
rf_predictions = rf_model.predict(X_test)
```
```python
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score
# Regression evaluation
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
# Classification evaluation
accuracy = accuracy_score(y_test, log_predictions)
```
- Cross-Validation: Implement cross-validation to assess how the
model generalizes to an independent dataset. This involves splitting
the data into multiple folds and training the model on each fold.
```python
from sklearn.model_selection import cross_val_score
# Perform cross-validation
cv_scores = cross_val_score(log_model, X, y, cv=5)
```
To leverage predictive models within Power BI, you can use Python
scripts to run the models and display the results directly in Power BI
reports.
```python
# Power BI Python script example
import pandas as pd
from sklearn.linear_model import LinearRegression
# Load data
dataset = pd.read_csv('path_to_data.csv')
# Prepare data
X = dataset[['feature1', 'feature2']]
y = dataset['target']
# Train the model
model = LinearRegression()
model.fit(X, y)
# Predict
dataset['predicted'] = model.predict(X)
```
```python
from statsmodels.tsa.holtwinters import ExponentialSmoothing
```
```python
# Load customer data
customer_data = pd.read_csv('customer_data.csv')
# Feature engineering
customer_data['tenure_scaled'] = customer_data['tenure'] / customer_data['tenure'].max()
# Predict churn (churn_model is assumed to be a classifier trained earlier)
customer_data['churn_prediction'] = churn_model.predict(customer_data[['tenure_scaled', 'monthly_charges']])
```
By embracing these advanced analytics and predictive modeling
techniques, you can uncover deeper insights and forecast future
trends with greater accuracy. The integration of Python within Power
BI not only enhances your analytical capabilities but also streamlines
the process of transforming data into actionable intelligence. As you
continue to explore and apply these methodologies, you'll find
yourself at the forefront of data-driven decision-making, equipped
with the tools to anticipate and navigate the complexities of an ever-
evolving business landscape.
CHAPTER 6: REAL-
WORLD USE CASES AND
PROJECTS
In the competitive world of sales and marketing, leveraging data to
drive decisions is no longer optional—it’s imperative. Python's
integration with Power BI provides a powerful toolkit for transforming
raw data into actionable insights, enabling businesses to refine their
strategies, optimize campaigns, and ultimately drive growth. This
section delves into how you can harness the power of Python within
Power BI to elevate your sales and marketing analytics to new
heights.
- Data Import: Use Python to import data from multiple sources into
Power BI. For instance, you can extract data from a CRM system
using an API and load it into a pandas DataFrame.
```python
import pandas as pd
import requests
```python
# Handle missing values
df_sales.fillna(method='ffill', inplace=True)
```
```python
from statsmodels.tsa.holtwinters import ExponentialSmoothing
```
```python
from sklearn.cluster import KMeans
```
```python
# Calculate conversion rate
df_sales['conversion_rate'] = df_sales['conversions'] / df_sales['impressions']
# Calculate ROI
df_sales['roi'] = (df_sales['revenue'] - df_sales['cost']) / df_sales['cost']
```
```python
import matplotlib.pyplot as plt
import seaborn as sns
```
```python
# Example Python script for Power BI dashboard
import pandas as pd
import matplotlib.pyplot as plt
# Load data
df = pd.read_csv('sales_data.csv')
```
```python
from statsmodels.tsa.statespace.sarimax import SARIMAX
```
```python
from sklearn.cluster import KMeans
```
```python
from mlxtend.frequent_patterns import apriori, association_rules
```
By integrating Python with Power BI, you can unlock advanced analytics
capabilities that drive more informed and strategic sales and
marketing decisions. Whether it's through sophisticated forecasting
models, customer segmentation, or campaign performance analysis,
the combination of these tools empowers you to derive deeper
insights and achieve better business outcomes. As you continue to
explore and apply these techniques, you'll find yourself better
equipped to navigate the complexities of the modern sales and
marketing landscape, ultimately driving growth and success for your
organization.
- Data Import: Use Python to fetch data from multiple sources and
consolidate it within Power BI. For example, you might extract
financial statements from an ERP system and market data from a
financial API, then merge them into a cohesive dataset.
```python
import pandas as pd
import requests
# Merge datasets
df_financial = pd.merge(erp_data, df_market, on='date')
```
```python
# Handle missing values
df_financial.fillna(method='bfill', inplace=True)
```
With the data prepared, you can apply various advanced analytics
techniques to extract valuable insights.
- Revenue Forecasting: Use time series analysis and machine
learning models to predict future revenue. Python’s `statsmodels`
and `scikit-learn` libraries are particularly useful for this purpose.
```python
from statsmodels.tsa.holtwinters import ExponentialSmoothing
```
```python
from sklearn.linear_model import LinearRegression
```
```python
# Calculate net cash flow
df_financial['net_cash_flow'] = df_financial['cash_inflow'] - df_financial['cash_outflow']
```
```python
import matplotlib.pyplot as plt
import seaborn as sns
```
```python
# Example Python script for Power BI dashboard
import pandas as pd
import matplotlib.pyplot as plt
# Load data
df = pd.read_csv('financial_data.csv')
```
```python
from sklearn.linear_model import Ridge
```
```python
from statsmodels.tsa.arima.model import ARIMA
```
```python
from sklearn.ensemble import RandomForestClassifier
```
Conclusion
- Data Import: Use Python to gather data from various sources and
consolidate it within Power BI. For instance, you might pull customer
interaction data from a CRM system and website behavior data from
Google Analytics, then merge these datasets for a holistic view.
```python
import pandas as pd
import requests
# Merge datasets
df_customer = pd.merge(crm_data, df_web, on='customer_id')
```
```python
# Handle missing values
df_customer.fillna(method='bfill', inplace=True)
# Standardize formats
df_customer['purchase_date'] = pd.to_datetime(df_customer['purchase_date'])
```
With your data prepared, you can apply various advanced analytics
techniques to segment and profile your customers effectively.
```python
from sklearn.cluster import KMeans
```
```python
# Calculate average metrics for each segment
segment_profiles = df_customer.groupby('segment').mean()
```
```python
# Example of segment profiling output
segment_profiles[['age', 'income', 'total_spend',
'purchase_frequency']]
```
```python
import matplotlib.pyplot as plt
import seaborn as sns
```
```python
# Example Python script for Power BI dashboard
import pandas as pd
import matplotlib.pyplot as plt
# Load data
df = pd.read_csv('customer_segments.csv')
```
```python
from sklearn.cluster import DBSCAN
```
```python
from sklearn.mixture import GaussianMixture
```
```python
from sklearn.ensemble import RandomForestClassifier
```
- Data Import: Use Python to gather data from various sources and
consolidate it within Power BI. For instance, you might pull inventory
data from an ERP system and transportation data from a TMS, then
merge these datasets for a comprehensive view.
```python
import pandas as pd
import requests
# Merge datasets
df_supply_chain = pd.merge(erp_data, df_transport,
on='shipment_id')
```
```python
# Handle missing values
df_supply_chain.fillna(method='bfill', inplace=True)
# Standardize formats
df_supply_chain['delivery_date'] = pd.to_datetime(df_supply_chain['delivery_date'])
```
With your data prepared, you can apply various advanced analytics
techniques to optimize your supply chain processes effectively.
```python
import statsmodels.api as sm
# Prepare data for forecasting
df_demand = df_supply_chain[['date', 'demand']]
df_demand.set_index('date', inplace=True)
```
```python
from scipy.optimize import minimize
```
```python
import pulp
# The LP problem, the decision variables (route_vars), and the supply/demand
# dictionaries are assumed to have been defined earlier (not shown here)
# Add constraints
for i in sources:
    problem += pulp.lpSum([route_vars[i, j] for j in destinations]) <= supply[i]
for j in destinations:
    problem += pulp.lpSum([route_vars[i, j] for i in sources]) == demand[j]
# Solve problem
problem.solve()
```
```python
import matplotlib.pyplot as plt
import seaborn as sns
```
```python
# Example Python script for Power BI dashboard
import pandas as pd
import matplotlib.pyplot as plt
# Load data
df = pd.read_csv('supply_chain_data.csv')
```
```python
from statsmodels.tsa.holtwinters import ExponentialSmoothing
```
```python
import numpy as np
from scipy.optimize import linprog
```
```python
from ortools.graph import pywrapgraph
# max_flow is assumed to be a SimpleMaxFlow instance with arcs added earlier (not shown)
# Solve problem
max_flow.Solve(0, 4)
```
In the digital age, the vast expanse of the internet and the dynamic
world of social media have become vital arenas for businesses to
understand consumer behavior, track engagement, and measure the
impact of their digital strategies. By integrating Python with Power BI,
organizations can unlock the potential of web analytics and social
media data, transforming raw metrics into actionable insights that
drive business growth and enhance online presence.
```python
import requests
from bs4 import BeautifulSoup
```
```python
import tweepy
```
- Data Cleaning: Parse and clean the raw data to ensure consistency
and accuracy.
```python
import pandas as pd
```
```python
# Convert created_at to datetime
df_tweets['created_at'] = pd.to_datetime(df_tweets['created_at'])
```
```python
from textblob import TextBlob
```
```python
# Calculate engagement metrics
df_tweets['engagement'] = df_tweets['likes'] + df_tweets['retweets']
```
```python
# Calculate conversion rate
df_web_analytics['conversion_rate'] = df_web_analytics['conversions'] / df_web_analytics['visits']
```
```python
import matplotlib.pyplot as plt
import seaborn as sns
```
```python
# Example Python script for Power BI dashboard
import pandas as pd
import matplotlib.pyplot as plt
```
```python
import statsmodels.api as sm
```
```python
import matplotlib.pyplot as plt
```
```python
from wordcloud import WordCloud
# Generate a word cloud of common keywords
text = ' '.join(df_tweets['text'])
wordcloud = WordCloud(width=800, height=400).generate(text)
```
Integrating Python with Power BI for web analytics and social media
analysis empowers businesses to derive deep insights from their
digital presence. By leveraging advanced analytics techniques and
creating compelling visualizations, you can transform raw web and
social media data into actionable insights that drive strategic
decision-making and business growth.
Introduction to HR Analytics
```python
import pandas as pd
```
```python
import requests
```
```python
# Fill missing values in the dataset
df_hr['salary'].fillna(df_hr['salary'].mean(), inplace=True)
df_hr.dropna(subset=['employee_id', 'hire_date'], inplace=True)
```
- Data Transformation: Create derived metrics such as tenure,
performance scores, and engagement indices.
```python
# Calculate employee tenure
df_hr['hire_date'] = pd.to_datetime(df_hr['hire_date'])
df_hr['tenure'] = (pd.to_datetime('today') - df_hr['hire_date']).dt.days / 365
```
```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
```
```python
from textblob import TextBlob
```
- Skill Gap Analysis: Identify skill gaps within the workforce and plan
targeted training programs.
```python
# Calculate skill gap metric (example)
df_hr['skill_gap'] = df_hr['required_skills'] - df_hr['current_skills']
```
Visualization and Reporting
```python
import matplotlib.pyplot as plt
import seaborn as sns
```
```python
# Example Python script for Power BI dashboard
import pandas as pd
import matplotlib.pyplot as plt
# Load HR data
df_hr = pd.read_csv('hr_data.csv')
```
```python
import statsmodels.api as sm
# Analyze results ('results' is assumed to be a fitted statsmodels model, not shown)
print(results.summary())
```
```python
import matplotlib.pyplot as plt
```
Healthcare data analysis has never been more critical. The datasets
generated in this sector, whether from patient records, clinical trials,
or operational metrics, impact lives. The fusion of Python and Power
BI offers unparalleled opportunities to harness this data,
transforming it into actionable insights.
Understanding Healthcare Data
```python
import pandas as pd
# Remove duplicates
df.drop_duplicates(inplace=True)
```
```python
import matplotlib.pyplot as plt
import seaborn as sns
# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
```
Predictive Modeling
```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
```
Visualization in Power BI
Example:
```python
import matplotlib.pyplot as plt
import seaborn as sns
# Sample data
data = dataset
# Visualization
sns.set(style="whitegrid")
plt.figure(figsize=(10, 6))
sns.boxplot(x='age', y='bmi', data=data)
plt.title('BMI vs Age')
plt.show()
```
1. Data Collection: Gather EHR data for patients discharged over the
past year.
2. Data Cleaning: Address missing values and inconsistencies.
3. Feature Engineering: Create new features such as average length
of stay or number of chronic conditions.
4. Model Building: Train a classification model to predict
readmissions.
5. Deployment: Integrate the model into Power BI for real-time
predictions.
```python
# Anonymize patient data
df['patient_id'] = df['patient_id'].apply(lambda x: 'ID_' + str(x))
```
```python
import pandas as pd
```
```python
import matplotlib.pyplot as plt
import seaborn as sns
```
Geospatial Analysis
```python
import geopandas as gpd
```
```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
```
Example:
```python
import matplotlib.pyplot as plt
import seaborn as sns
# Sample data
data = dataset
# Visualization
plt.figure(figsize=(12, 6))
sns.scatterplot(x='longitude', y='latitude', hue='AQI', data=data)
plt.title('Spatial Distribution of Air Quality Index')
plt.show()
```
```python
# Anonymize sensitive locations
df['location'] = df['location'].apply(lambda x: 'Location_' + str(x))
```
```python
import matplotlib.pyplot as plt
import seaborn as sns
```
Geospatial Analysis
```python
import geopandas as gpd
```
```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
```
Example:
```python
import matplotlib.pyplot as plt
import seaborn as sns
# Sample data
data = dataset
# Visualization
plt.figure(figsize=(12, 6))
sns.lineplot(x='report_date', y='health_metric', hue='region',
data=data)
plt.title('Health Metrics Over Time')
plt.xlabel('Date')
plt.ylabel('Health Metric')
plt.legend(title='Region')
plt.show()
```
```python
# Anonymize sensitive data
df['individual_id'] = df['individual_id'].apply(lambda x: 'ID_' + str(x))
```
Government and public sector analytics with Python in Power BI
provide powerful tools for transforming complex datasets into
actionable insights. By following the outlined steps and leveraging
advanced tools, analysts can drive transparency, efficiency, and
data-driven decision-making in the public sphere. This integration not
only enhances the accessibility of data insights but also fosters a
culture of evidence-based policy-making and resource optimization.
Data Collection:
The dataset comprises historical sales data extracted from a retail
ERP system, including transaction details, product information,
promotional data, and customer demographics.
```python
import pandas as pd
```
```python
# Fill missing values
sales_data.fillna(method='ffill', inplace=True)
```
```python
import matplotlib.pyplot as plt
import seaborn as sns
# Sales trend over time
plt.figure(figsize=(12, 6))
sns.lineplot(x='transaction_date', y='sales_amount',
data=sales_data)
plt.title('Sales Trends Over Time')
plt.xlabel('Date')
plt.ylabel('Sales Amount')
plt.show()
```
Predictive Modeling:
Build a time series forecasting model to predict future sales.
```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# Feature engineering
sales_data['day_of_week'] = sales_data['transaction_date'].dt.dayofweek
sales_data['month'] = sales_data['transaction_date'].dt.month
```
```python
import matplotlib.pyplot as plt
import seaborn as sns
# Sample data
data = dataset
# Visualization
plt.figure(figsize=(12, 6))
sns.lineplot(x='transaction_date', y='sales_amount', data=data)
plt.title('Predicted Sales Trends')
plt.xlabel('Date')
plt.ylabel('Sales Amount')
plt.show()
```
Outcome:
The forecasted sales data helps the retail company optimize its
inventory levels, minimize stockouts, and tailor marketing campaigns
to anticipated demand, ultimately improving profitability and
customer satisfaction.
Data Collection:
Obtain patient records, including demographic information, medical
history, treatment plans, and outcome data.
```python
# Load the patient dataset
patient_data = pd.read_csv('patient_data.csv')
```
```python
# Fill missing values
patient_data.fillna(method='bfill', inplace=True)
```
```python
# Visualize age distribution
plt.figure(figsize=(12, 6))
sns.histplot(patient_data['age'], bins=30)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
```
Predictive Modeling:
Develop models to predict patient outcomes based on various
factors.
```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
# Feature engineering
X = patient_data[['age', 'gender', 'treatment_type']]
y = patient_data['outcome']
```
```python
import matplotlib.pyplot as plt
import seaborn as sns
# Sample data
data = dataset
# Visualization
plt.figure(figsize=(12, 6))
sns.barplot(x='treatment_type', y='outcome', hue='gender',
data=data)
plt.title('Predicted Outcomes by Treatment Type and Gender')
plt.xlabel('Treatment Type')
plt.ylabel('Outcome')
plt.show()
```
Outcome:
The hospital can identify high-risk patients, tailor treatment plans,
and allocate resources more efficiently, leading to improved patient
outcomes and optimized healthcare delivery.
Case Study 3: Financial Risk Analysis and Mitigation
Data Collection:
Gather financial records, including transaction data, market trends,
and economic indicators.
```python
# Load the financial dataset
financial_data = pd.read_csv('financial_data.csv')
```
```python
# Fill missing values
financial_data.fillna(method='ffill', inplace=True)
```
```python
# Financial trends over time
plt.figure(figsize=(12, 6))
sns.lineplot(x='transaction_date', y='transaction_amount',
data=financial_data)
plt.title('Financial Trends Over Time')
plt.xlabel('Date')
plt.ylabel('Transaction Amount')
plt.show()
Predictive Modeling:
Build models to predict financial risks and recommend mitigation
strategies.
```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Feature engineering (categorical columns are assumed to be numerically encoded)
X = financial_data[['market_segment', 'transaction_amount', 'economic_indicator']]
y = financial_data['risk_level']

# Train and evaluate the risk classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print('Accuracy:', accuracy_score(y_test, model.predict(X_test)))
```
```python
import matplotlib.pyplot as plt
import seaborn as sns
# 'dataset' is the dataframe that Power BI passes to the Python visual
data = dataset
# Visualization (correlations are computed on numeric columns only)
plt.figure(figsize=(12, 6))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Financial Variables')
plt.show()
```
Outcome:
The financial institution can proactively identify potential risks,
implement mitigation strategies, and enhance risk management
practices, leading to reduced financial losses and improved stability.
Enhanced AI Capabilities
Key Features:
Example:
1. Access AutoML:
- Navigate to the Dataflows section in Power BI Service.
- Select the dataset for which you want to create a predictive model.
- Click on the "Create New" option and choose "AutoML Model."
Example:
```python
import matplotlib.pyplot as plt
import seaborn as sns
# 'dataset' is the dataframe that Power BI passes to the Python visual
data = dataset
# Create visualization
plt.figure(figsize=(10, 6))
sns.lineplot(x='date', y='value', data=data)
plt.title('Trend Over Time')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
```
Power BI's mobile app has also seen significant updates, ensuring
that users can access and interact with their reports on the go.
These enhancements focus on improving the user experience and
providing more functionality on mobile devices.
Key Features:
- Real-Time Analytics: Edge devices can process data locally,
enabling real-time analytics and reducing the dependency on cloud-
based processing.
- Scalability: Edge computing supports scalable data processing
architectures, distributing workloads across numerous edge devices.
- Enhanced Security: By keeping sensitive data localized, edge
computing can enhance data security and privacy.
Python Code
```python
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# For the purpose of this example, let's create random time series data
# Assuming these are daily stock prices for a year
np.random.seed(0)
dates = pd.date_range('20230101', periods=365)
prices = np.random.randn(365).cumsum() + 100  # Random walk + starting price of 100

# Create a DataFrame
df = pd.DataFrame({'Date': dates, 'Price': prices})

# Plot the resulting time series
plt.figure(figsize=(12, 6))
plt.plot(df['Date'], df['Price'])
plt.title('Daily Stock Prices Over One Year')
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()
```
Python Code
```python
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# For the purpose of this example, let's create some synthetic stock return data
np.random.seed(0)
# Generating synthetic daily returns data for 5 stocks
stock_returns = np.random.randn(100, 5)

# Plot the correlation matrix of the returns as a heatmap
returns_df = pd.DataFrame(stock_returns, columns=['Stock 1', 'Stock 2', 'Stock 3', 'Stock 4', 'Stock 5'])
sns.heatmap(returns_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Stock Returns')
plt.show()
```
Python Code
```python
import matplotlib.pyplot as plt
import numpy as np
```
Python Code
```python
import matplotlib.pyplot as plt
import numpy as np
```
Python Code
```python
import matplotlib.pyplot as plt
import numpy as np
```
Python Code
```python
import matplotlib.pyplot as plt

# Illustrative portfolio weights (placeholder values)
labels = ['Stocks', 'Bonds', 'Cash']
weights = [60, 30, 10]
plt.pie(weights, labels=labels, autopct='%1.1f%%')
# Adding a title
plt.title('Portfolio Composition')
plt.show()
```
Python Code
```python
import matplotlib.pyplot as plt
import numpy as np
```
Python Code
```python
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Illustrative risk scores for a set of assets across sectors (placeholder values)
assets = ['Asset A', 'Asset B', 'Asset C', 'Asset D']
sectors = ['Technology', 'Energy', 'Finance', 'Healthcare']
risk_scores = np.random.rand(len(assets), len(sectors))

# Create a DataFrame
df_risk = pd.DataFrame(risk_scores, index=assets, columns=sectors)

# Plot the risk heatmap
sns.heatmap(df_risk, annot=True, cmap='Reds')
plt.title('Risk Heatmap')
plt.show()
```
1. Download Python:
Visit the official Python website at python.org.
Navigate to the Downloads section and choose
the latest version for Windows.
Click on the download link for the Windows
installer.
2. Run the Installer:
Once the installer is downloaded, double-click the
file to run it.
Make sure to check the box that says "Add
Python 3.x to PATH" before clicking "Install Now."
Follow the on-screen instructions to complete the
installation.
3. Verify Installation:
Open the Command Prompt by typing cmd in the
Start menu.
Type python --version and press Enter. If Python
is installed correctly, you should see the version
number.
macOS
1. Download Python:
Visit python.org.
Go to the Downloads section and select the
macOS version.
Download the macOS installer.
2. Run the Installer:
Open the downloaded package and follow the on-
screen instructions to install Python.
macOS might already have Python 2.x installed.
Installing from python.org will provide the latest
version.
3. Verify Installation:
Open the Terminal application.
Type python3 --version and press Enter. You
should see the version number of Python.
Linux
Python is usually pre-installed on Linux distributions. To check whether it is installed, open a terminal and run python3 --version. If it is missing or outdated, install or upgrade it with your distribution's package manager, for example sudo apt install python3 on Debian/Ubuntu or sudo dnf install python3 on Fedora.
Using Anaconda
Anaconda is a popular Python distribution that bundles Python with many data science libraries and the conda package manager. To install it, follow these steps:
1. Download Anaconda:
Visit the Anaconda website at anaconda.com.
Download the Anaconda Installer for your
operating system.
2. Install Anaconda:
Run the downloaded installer and follow the on-
screen instructions.
3. Verify Installation:
Open the Anaconda Prompt (Windows) or your
terminal (macOS and Linux).
Type python --version or conda list to see the
installed packages and Python version.
PYTHON LIBRARIES
Installing Python libraries is a crucial step in setting up your Python
environment for development, especially in specialized fields like
finance, data science, and web development. Here's a
comprehensive guide on how to install Python libraries using pip,
conda, and directly from source.
Using pip
pip is the Python Package Installer and is included by default with
Python versions 3.4 and above. It allows you to install packages
from the Python Package Index (PyPI) and other indexes.
For example, you can use pip to install libraries for working with financial data, for building interactive charts, and for time series forecasting with Prophet, which fits an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality.
2. Operators
Operators are used to perform operations on variables and values.
Python divides operators into several types:
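Among the most common are arithmetic, comparison, and logical operators, illustrated below with arbitrary values:
```python
a, b = 7, 3

# Arithmetic operators
print(a + b, a - b, a * b, a / b, a // b, a % b, a ** b)

# Comparison operators return booleans
print(a > b, a == b, a != b)

# Logical operators combine boolean expressions
print(a > 0 and b > 0, a > 10 or b > 10, not a > 10)
```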
3. Control Flow
Control flow refers to the order in which individual statements,
instructions, or function calls are executed or evaluated. The primary
control flow statements in Python are if, elif, and else for conditional
operations, along with loops (for, while) for iteration.
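A minimal sketch combining a conditional with both loop types (the values are arbitrary):
```python
temperatures = [18, 22, 31, 27]

# for loop with if/elif/else
for temp in temperatures:
    if temp > 30:
        print(temp, '-> hot')
    elif temp > 20:
        print(temp, '-> warm')
    else:
        print(temp, '-> cool')

# while loops repeat until a condition becomes False
count = 3
while count > 0:
    print('counting down:', count)
    count -= 1
```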
4. Functions
Functions are blocks of organized, reusable code that perform a
single, related action. Python provides a vast library of built-in
functions but also allows you to define your own using the def
keyword. Functions can take arguments and return one or more
values.
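For instance, a small function with a default argument (the names and numbers here are purely illustrative):
```python
def total_sales(amounts, tax_rate=0.2):
    """Return the gross total of a list of sale amounts."""
    net = sum(amounts)
    return net * (1 + tax_rate)

print(total_sales([100, 250, 75]))        # uses the default tax rate
print(total_sales([100, 250, 75], 0.1))   # overrides the default
```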
5. Data Structures
Python includes several built-in data structures that are essential for
storing and managing data:
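The ones you will reach for most often are lists, tuples, dictionaries, and sets, sketched here with placeholder values:
```python
# List: ordered and mutable
prices = [101.5, 99.2, 103.7]
prices.append(100.0)

# Tuple: ordered and immutable
point = (3, 4)

# Dictionary: key-value pairs
stock = {'ticker': 'MSFT', 'price': 410.2}
print(stock['ticker'])

# Set: unordered collection of unique items
sectors = {'Tech', 'Energy', 'Tech'}
print(sectors)  # duplicates are removed automatically
```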
7. Error Handling
Error handling in Python is managed through the use of try-except
blocks, allowing the program to continue execution even if an error
occurs. This is crucial for building robust applications.
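A minimal sketch of a try-except block handling a division-by-zero error:
```python
def safe_divide(numerator, denominator):
    try:
        return numerator / denominator
    except ZeroDivisionError:
        print('Cannot divide by zero; returning None instead.')
        return None

print(safe_divide(10, 2))
print(safe_divide(10, 0))
```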
8. File Handling
Python makes reading and writing files easy with built-in functions
like open(), read(), write(), and close(). It supports various modes,
such as text mode (t) and binary mode (b).
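For example, writing a small text file and reading it back (the file name is arbitrary):
```python
# Write two lines to a text file ('w' opens the file in text write mode)
with open('notes.txt', 'w') as f:
    f.write('First line\n')
    f.write('Second line\n')

# Read the file back in
with open('notes.txt', 'r') as f:
    contents = f.read()

print(contents)
```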
9. Libraries and Frameworks
Python's power is significantly amplified by its vast ecosystem of
libraries and frameworks, such as Flask and Django for web
development, NumPy and Pandas for data analysis, and TensorFlow
and PyTorch for machine learning.
7. Defining Functions
Functions are blocks of code that run when called. They can take
parameters and return results. Defining reusable functions makes
your code modular and easier to debug:
```python
def greet(name):
    return f"Hello, {name}!"

print(greet("Alice"))

greeter_instance = Greeter("Alice")
print(greeter_instance.greet())
```
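The last two lines above assume a Greeter class has already been defined; a minimal sketch of such a class, inferred from how it is called:
```python
class Greeter:
    def __init__(self, name):
        # Store the name so greet() can use it later
        self.name = name

    def greet(self):
        return f"Hello, {self.name}!"
```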