Sony AI Content
CONTENTS
1 IMPORT LIBRARIES
2 NEURAL NETWORKS AND RELEVANT FIGURE
3 PYTHON LIBRARIES
4 INTRODUCTION TO ANACONDA NAVIGATOR
5 INTRODUCTION TO GOOGLE COLAB
6 INTRODUCTION TO KAGGLE
7 CUSTOMER DATASET
8 SOURCE CODE AND OUTPUT
IMPORT LIBRARIES
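The import listing itself is not reproduced in the text; a typical set of imports for a project like this might look as follows (an assumption, not the report's exact code):

import numpy as np                     # numerical arrays
import pandas as pd                    # tabular data handling
import matplotlib.pyplot as plt        # plotting
from sklearn.preprocessing import StandardScaler, MinMaxScaler  # normalization
from tensorflow import keras           # neural networks (if deep learning is used)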
NEURAL NETWORKS
Hidden layers: Hidden layers are intermediate layers between the input and
output layers. They perform complex computations and transformations on the
input data. A neural network can have many hidden layers, each consisting of
multiple neurons (nodes).
Weights and biases: Weights and biases are adjustable parameters associated
with the connections between neurons. Each connection has a weight, which
determines the strength or importance of the input signal. Biases, on the other
hand, provide an additional adjustable offset that shifts a neuron's activation,
allowing the network to fit the data even when all inputs are zero.
Activation functions: Activation functions are mathematical functions applied to
each neuron's weighted input that introduce non-linearities into the neural network,
enabling it to model complex relationships between inputs and outputs. Common
activation functions include sigmoid, tanh, ReLU (Rectified Linear Unit), and softmax.
Output layer: The output layer generates the final predictions or outputs of the
neural network. The number of artificial neurons in the output layer depends on
the specific problem being solved. For example, in a binary classification task a
single output neuron can represent the two classes (e.g., “yes” or “no”), while in
multi-class classification there is typically one neuron per class.
Loss function: The loss function measures the discrepancy between the predicted
outputs of the neural network and the true values. It quantifies the network's
performance and guides the learning process by providing the error signal used
to adjust the weights.
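To make these components concrete, here is a minimal sketch (illustrative values only, not taken from this project) of a single forward pass through a tiny network, showing weights, biases, a sigmoid activation, an output layer, and a cross-entropy loss:

import numpy as np

def sigmoid(z):
    # Activation function introducing non-linearity
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2])            # input layer (2 features, made-up values)

W1 = np.random.randn(3, 2) * 0.1     # weights: input -> hidden layer of 3 neurons
b1 = np.zeros(3)                     # biases for the hidden layer
hidden = sigmoid(W1 @ x + b1)        # hidden layer activations

W2 = np.random.randn(1, 3) * 0.1     # weights: hidden -> output
b2 = np.zeros(1)                     # bias for the output neuron
y_pred = sigmoid(W2 @ hidden + b2)   # output layer (binary classification)

y_true = 1.0                         # assumed ground-truth label
# Loss function: binary cross-entropy measuring prediction error
loss = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(y_pred, loss)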
PyTorch (by Facebook)
Open-sourced on GitHub in 2017, PyTorch is one of the newer deep learning
frameworks.
KERAS
Keras is a high-level API used for building and training neural networks. It is
designed to be user-friendly, modular, and easy to extend, and it allows you to
build, train, and deploy deep learning models with minimal code. Its intuitive
interface makes it suitable for beginners and experts alike.
Keras is relatively easy to learn and work with because it provides a Python
frontend with a high level of abstraction while offering a choice of multiple
back-ends for computation. This abstraction can make Keras slower than
lower-level deep learning frameworks, but extremely beginner-friendly.
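As an illustration of how little code a Keras model needs, here is a minimal sketch; the layer sizes and the input dimension of 8 are arbitrary assumptions, not values from this project:

from tensorflow import keras
from tensorflow.keras import layers

# Minimal sketch: a small fully connected network for binary classification.
model = keras.Sequential([
    keras.Input(shape=(8,)),                  # assumed number of input features
    layers.Dense(16, activation="relu"),      # hidden layer
    layers.Dense(8, activation="relu"),       # hidden layer
    layers.Dense(1, activation="sigmoid"),    # output layer
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",     # loss function
              metrics=["accuracy"])

model.summary()
# Training would then be a single call, e.g.:
# model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)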
PREPARE THE DATA
Format your dataset appropriately for input into the RNN.
This typically involves splitting the data into input sequences and corresponding
output sequences.
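For example, a common way to split a one-dimensional series into input sequences and their corresponding outputs is a sliding window; the window length and toy series below are arbitrary assumptions:

import numpy as np

def make_sequences(series, window=3):
    # Split a 1-D series into input windows and the value that follows each window.
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])   # input sequence
        y.append(series[i + window])     # corresponding output
    return np.array(X), np.array(y)

data = np.arange(10, dtype=float)        # toy series: 0..9
X, y = make_sequences(data, window=3)
print(X.shape, y.shape)                  # (7, 3) (7,)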
Data Normalization Techniques
Min-Max Scaling:
This method rescales each value into a fixed range, typically 0 to 1, using the
minimum and maximum of the data. It is sensitive to outliers, because extreme
values stretch the range and compress the remaining data points.
Z-Score Normalization:
This method standardizes each value by subtracting the mean and dividing by the
standard deviation, so the transformed data has a mean of 0 and a standard
deviation of 1.
Decimal Scaling:
This method normalizes values by moving the decimal point, dividing each value
by a power of 10 chosen so that the largest absolute value becomes less than 1.
Log Transformation:
This technique is also useful for data with outliers, because compressing large
values reduces their impact on the data.
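A minimal sketch of the four techniques in pandas/NumPy; the column name and values are made-up examples:

import numpy as np
import pandas as pd

df = pd.DataFrame({"amount": [12, 18, 25, 30, 32]})       # example values only
x = df["amount"]

min_max = (x - x.min()) / (x.max() - x.min())              # Min-Max scaling to [0, 1]
z_score = (x - x.mean()) / x.std()                         # Z-score normalization
decimal = x / 10 ** int(np.ceil(np.log10(x.abs().max())))  # decimal scaling
log_tf = np.log1p(x)                                       # log transformation

print(pd.DataFrame({"min_max": min_max, "z_score": z_score,
                    "decimal": decimal, "log": log_tf}))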
ADVANTAGES
1. Calculate the range of the data set
To find the range of a data set, find the maximum and minimum values in the
data set, then subtract the minimum from the maximum. Arranging your data set
in order from smallest to largest can help you find these values easily. Here's the
formula:
Range of x values = xmaximum - xminimum
Example: The scientist's data set has a minimum x value of 12 and a maximum
x value of 32, so the range of x values = 32 - 12 = 20.
2. Subtract the minimum x value from the value of this data point
Next, take the x value of the data point you're analyzing and subtract the
minimum x value from it. You can start with any data point in your set.
Example: The scientist's first data point is 25, so the scientist subtracts the
minimum x value from that:
x - xminimum = 25 - 12 = 13
3. Insert these values into the formula and divide
The final step of applying this formula to an individual data point is to divide the
difference between the specific data point and the minimum by the range. In this
process, that would mean taking the result of step two and dividing it by the result
from step one.
Example: For this data point, the scientist fills in the complete equation:
xnormalized = (x - xminimum) / range of x = 13 / 20 = 0.65
This result falls between zero and one, so they applied the normalization formula
correctly.
Since the normalization formula is useful for analyzing and comparing complete
sets of data, it's important to apply it to each data point so that you can compare
your whole set. You might automate this with a spreadsheet program to save time.
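The three steps can also be expressed directly in code; the data points below are assumed values chosen to match the scientist example (minimum 12, maximum 32, data point 25):

# Worked example of the three steps, using assumed data that matches the text.
data = [12, 25, 28, 32, 15]

x_min = min(data)                  # minimum value
x_range = max(data) - x_min        # step 1: range = 32 - 12 = 20

# steps 2 and 3 applied to every data point
normalized = [(x - x_min) / x_range for x in data]
print(normalized)                  # 25 -> (25 - 12) / 20 = 0.65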
Normalization formula for custom ranges
While this normalization formula brings all results into a range between zero and
one, there is a variation on the normalization formula to use if you're trying to put
all data within a custom range where the lowest value is a and the highest value is
b:
xnormalized = a + ((x - xminimum) × (b - a)) / (xmaximum - xminimum)
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
X = data.iloc[:,:-1]:
data is typically a DataFrame containing your dataset.
[:,:-1] selects all rows and all columns except the last one. This is achieved
by using : to select all rows and :-1 to select all columns up to the last one.
So, X contains all the input features (independent variables) of the dataset.
y = data.iloc[:,-1]:
Similarly, data.iloc[:,-1] selects all rows and only the last column.
This extracts the target variable (dependent variable) from the dataset.
So, y contains the target variable.
In summary, after executing these lines of code, you will have X, containing the
input features, and y, containing the target variable.
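Putting the imports and the slicing together, a minimal sketch looks like this; the file name data.csv is a placeholder, not the project's actual dataset:

import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("data.csv")       # placeholder path for the dataset

X = data.iloc[:, :-1]                # all rows, every column except the last -> features
y = data.iloc[:, -1]                 # all rows, last column only -> target variable

scaler = StandardScaler()            # z-score scaling of the features
X_scaled = scaler.fit_transform(X)

print(X_scaled.shape, y.shape)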
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a crucial initial step in data science projects. It
is the method of studying and exploring data sets to understand their main
characteristics, uncover patterns, locate outliers, and identify relationships between
variables. EDA is normally carried out as a preliminary step before undertaking
more formal statistical analyses or modeling.
Outlier Detection: Identifying unusual values that deviate from other data points.
Outliers can influence statistical analyses and might indicate data entry errors or
unique cases.
Handling Missing Values: Detecting and deciding how to address missing data
points, whether by imputation or removal, depending on their impact and the
amount of missing data.
Summary Statistics: Calculating key statistics that provide insight into data
trends and nuances.
Testing Assumptions: Many statistical tests and models assume the data meet
certain conditions (like normality or homoscedasticity). EDA helps verify these
assumptions.
Exploratory Data Analysis (EDA) is important for several reasons, especially in the
context of data science and statistical modeling. Here are some of the key reasons
why EDA is a critical step in the data analysis process:
Understanding Data Structures: EDA helps in getting familiar with the dataset,
understanding the number of features, the type of data in each feature, and the
distribution of data points. This understanding is crucial for selecting appropriate
analysis or prediction techniques.
Testing Assumptions: Many statistical models assume that data follow a certain
distribution or that variables are independent. EDA involves checking these
assumptions. If the assumptions do not hold, the conclusions drawn from the
model could be invalid.
Informing Feature Selection and Engineering: Insights gained from EDA can
inform which features are most relevant to include in a model and how to
transform them (scaling, encoding) to improve model performance.
Optimizing Model Design: By understanding the data’s characteristics, analysts
can choose appropriate modeling techniques, decide on the complexity of the
model, and better tune model parameters.
Facilitating Data Cleaning: EDA helps in spotting missing values and errors in
the data, which are critical to address before further analysis to improve data
quality and integrity.
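As a concrete illustration of these EDA steps, here is a minimal pandas sketch; the file name and the amount column are assumptions:

import pandas as pd

df = pd.read_csv("data.csv")              # placeholder path

print(df.info())                          # structure: columns, dtypes, non-null counts
print(df.describe())                      # summary statistics for numeric columns
print(df.isnull().sum())                  # missing values per column

# Simple outlier check on an assumed numeric column using the IQR rule
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)]
print(len(outliers), "potential outliers")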
Python Libraries
Plotly: An interactive graphing library for making interactive plots; it offers more
sophisticated visualization capabilities than static plotting libraries.
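For instance, an interactive scatter plot takes only a few lines; the data frame and column names below are made up for illustration:

import pandas as pd
import plotly.express as px

df = pd.DataFrame({"age": [25, 32, 47, 51], "spend": [120, 340, 560, 410]})  # toy data
fig = px.scatter(df, x="age", y="spend", title="Customer spend by age")
fig.show()   # opens an interactive plot in the notebook or browser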
PYTHON PLATFORM (ANACONDA)
The name Anaconda is shared by two unrelated projects. One is the installation program used by
Fedora, Red Hat Enterprise Linux and some other Linux distributions.
During installation, a target computer’s hardware is identified and configured and the appropriate file
systems for the system’s architecture are created. Finally, anaconda allows the user to install the
operating system software on the target computer. Anaconda can also upgrade existing installations of
earlier versions of the same distribution. After the installation is complete, you can reboot into your
installed system and continue doing customization using the initial setup program.
Anaconda is a fairly sophisticated installer. It supports installation from local and remote sources such
as CDs and DVDs, images stored on a hard drive, NFS, HTTP, and FTP. Installation can be scripted
with kickstart to provide a fully unattended installation that can be duplicated on scores of machines. It
can also be run over VNC on headless machines. A variety of advanced storage devices including
LVM, RAID, iSCSI, and multipath are supported from the partitioning program. Anaconda provides
advanced debugging features such as remote logging, access to the Python interactive debugger, and
remote saving of exception dumps.
The other Anaconda, the one relevant to data science work, is a Python distribution that is popular
for data analysis and scientific computing.
• Available for Windows, Mac OS X and Linux.
• Includes many popular packages: NumPy, SciPy, Matplotlib, Pandas, IPython, Cython.
• Includes Spyder, a Python development environment.
• Includes conda, a platform-independent package manager.
INTRODUCTION TO GOOGLE COLAB
Another attractive feature that Google offers to developers is the use of GPUs.
Colab supports GPUs, and it is totally free. One reason for making it free to the
public could be to establish its software as a standard in academia for teaching
machine learning and data science. It may also have the long-term goal of building
a customer base for Google Cloud APIs, which are sold on a per-use basis.
Irrespective of the reasons, the introduction of Colab has eased the learning and
development of machine learning applications.
INTRODUCTION TO KAGGLE
Kaggle is an online community platform for data scientists and machine learning enthusiasts. Kaggle
allows users to collaborate with other users, find and publish datasets, use GPU integrated notebooks,
and compete with other data scientists to solve data science challenges. The aim of this online platform
(founded in 2010 by Anthony Goldbloom and Jeremy Howard and acquired by Google in 2017) is to
help professionals and learners reach their goals in their data science journey with the powerful tools
and resources it provides. As of today (2021), there are over 8 million registered users on Kaggle.
One of the sub-platforms that made Kaggle such a popular resource is its competitions. In a similar
way that HackerRank plays that role for software developers and computer engineers, “Kaggle
Competitions” has significant importance for data scientists; you can learn more about them in
our Kaggle Competition Guide and learn how to analyze a dataset step-by-step in our Kaggle
Competition Tutorial. In data science competitions like Kaggle's or DataCamp's, companies and
organizations share a large number of challenging data science tasks with generous rewards, and data
scientists, from beginners to experienced, compete to complete them. Kaggle also provides the
Kaggle Notebook, which, just like DataLab, allows you to edit and run your code for data science
tasks on your browser, so your local computer doesn't have to do all the heavy lifting and you don't
need to set up a new development environment on your own.
Kaggle provides powerful cloud resources and allows you to use a maximum of 30 hours of GPU
and 20 hours of TPU per week. You can upload your datasets to Kaggle and download others' datasets
as well. Additionally, you can check other people's datasets and notebooks and start discussion topics
on them. All your activity is scored on the platform and your score increases as you help others and
share useful information. Once you start earning points, you will be placed on a live leaderboard of 8
million Kaggle users.
CUSTOMER PURCHASE DATASET
NORMALIZATION OF DATASET
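A minimal sketch of how such a normalization could be produced; the file name and column names are assumptions, not the project's actual schema:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Placeholder file and column names for the customer purchase dataset
df = pd.read_csv("customer_purchase.csv")
numeric_cols = ["age", "annual_income", "purchase_amount"]

scaler = MinMaxScaler()                                    # scales each column to [0, 1]
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

print(df.head())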