
INDEX

S.NO  CONTENT                               PAGE NO
1     IMPORT LIBRARIES                      0-1
2     NEURAL NETWORKS AND RELEVANT FIGURE   3-17
3     PYTHON LIBRARIES                      18
4     INTRODUCTION TO ANACONDA NAVIGATOR    19
5     INTRODUCTION TO GOOGLE COLAB          20
6     INTRODUCTION TO KAGGLE                21
7     CUSTOMER DATASET                      22
8     SOURCE CODE AND OUTPUT                24-25

1
IMPORT LIBRARIES

NEURAL NETWORKS: Neural networks are computational models that mimic the complex functions of the human brain. They consist of interconnected nodes, or neurons, that process and learn from data, enabling tasks such as pattern recognition and decision making in machine learning.

PYTORCH: PyTorch is an open-source machine learning framework based on the Python programming language and the Torch library. Torch is an open-source ML library used for creating deep neural networks and is written in the Lua scripting language. PyTorch is one of the preferred platforms for deep learning research.

KERAS: Keras is an open-source library that provides a Python interface for artificial neural networks. Keras began as independent software and was later integrated into the TensorFlow library.

MXNET: Apache MXNet is an open-source deep learning software framework that trains and deploys deep neural networks. It aims to be scalable, allows fast model training, and supports a flexible programming model and multiple programming languages. The MXNet library is portable and can scale to multiple GPUs and machines.

CHAINER: Chainer is an open-source deep learning framework written purely in Python on top of the NumPy and CuPy libraries. Its development is led by the Japanese venture company Preferred Networks in partnership with IBM, Intel, Microsoft, and Nvidia.

THEANO: Theano is a Python library and optimizing compiler for manipulating and evaluating mathematical expressions, especially matrix-valued ones. In Theano, computations are expressed using a NumPy-esque syntax and compiled to run efficiently on either CPU or GPU architectures.
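
A minimal sketch of how these libraries are typically imported in Python, assuming the corresponding packages (torch, tensorflow, mxnet, chainer, theano) are already installed in the environment:

import torch                      # PyTorch
from tensorflow import keras      # Keras (bundled with TensorFlow)
import mxnet as mx                # Apache MXNet
import chainer                    # Chainer
import theano                     # Theano

print(torch.__version__)          # quick check that the imports worked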

2
NEURAL NETWORKS

The structure of a neural network comprises layers of interconnected nodes, commonly referred to as neurons or units. These neurons are organized into three main types of layers: an input layer, one or more hidden layers, and an output layer. Let us understand each key element of the neural network in detail:
 Input layer: The input layer is responsible for receiving the initial data or
features that are fed into the neural network. Each neuron in the input layer
depicts a specific feature or attribute of the input data.

 Hidden layers: Hidden layers are intermediate layers between the input and
output layers. They perform complex computations and transformations on the
input data. A neural network can have numerous hidden layers, each consisting of
numerous neurons or nodes.

 Neurons (Nodes): Neurons or artificial neurons are fundamental units of a


neural network. They receive input signals and perform computations to produce
an output. Neurons in the hidden and output layers utilize activation functions to
introduce non-linearities into the network, allowing it to learn complex patterns.

 Weights and biases: Weights and biases are adjustable parameters associated with the connections between neurons. Each connection has a weight, which determines the strength or importance of the input signal. Biases, on the other hand, provide an additional adjustable offset that shifts a neuron's activation, giving the network more flexibility to fit the data.

 Activation functions: Activation functions are mathematical functions applied to a neuron's weighted input that introduce non-linearities into the neural network, enabling it to model complex relationships between inputs and outputs. Common activation functions include sigmoid, tanh, ReLU (Rectified Linear Unit), and softmax.

 Output layer: The output layer generates the final predictions or outputs of the neural network. The number of artificial neurons in the output layer depends on the specific problem being solved. For example, a binary classification task may use a single output neuron (e.g., predicting “yes” or “no”), while multi-class classification uses one neuron per class.

3
 Loss function: The loss function measures the discrepancy between the predicted outputs of the neural network and the true values. It quantifies the network’s performance and guides the learning process by providing feedback on how far the predictions are from the targets.

 Backpropagation: Backpropagation is a learning algorithm used to train a


neural network. It involves propagating the error (difference between predicted
and actual outputs) backward through the network and adjusting the weights and
biases iteratively to minimize the loss function.

 Optimization algorithm: Optimization algorithms, such as gradient descent, are employed to update the weights and biases during training. These algorithms determine the direction and magnitude of weight adjustments based on the gradients of the loss function with respect to the network parameters. The sketch after this list ties these elements together in code.
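
As an illustration of these elements, a minimal PyTorch sketch (using random placeholder data, with layer sizes chosen only for the example) builds a tiny feedforward network with an input layer, one hidden layer, an output layer, activation functions, a loss function, and a single backpropagation and optimization step:

import torch
import torch.nn as nn

# A minimal feedforward network: 4 input features, one hidden layer, 1 output
model = nn.Sequential(
    nn.Linear(4, 8),   # input layer -> hidden layer (weights and biases)
    nn.ReLU(),         # activation function introduces non-linearity
    nn.Linear(8, 1),   # hidden layer -> output layer
    nn.Sigmoid(),      # squashes the output to (0, 1) for binary classification
)

loss_fn = nn.BCELoss()                                   # loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # optimization algorithm

# One training step on a small random batch (placeholder data)
x = torch.rand(16, 4)                      # 16 samples, 4 features each
y = torch.randint(0, 2, (16, 1)).float()   # binary targets

pred = model(x)              # forward pass
loss = loss_fn(pred, y)      # measure discrepancy between prediction and target
optimizer.zero_grad()
loss.backward()              # backpropagation computes the gradients
optimizer.step()             # gradient descent updates weights and biases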

4
PyTorch (by Facebook)

 Open sourced on GitHub.com in 2017, PyTorch is one of the newer deep


learning frameworks.

 PyTorch also optimizes performance by taking advantage of native support for


asynchronous execution from Python. Benefits include built-in dynamic
graphs and a stronger community than that of TensorFlow.

 However, PyTorch doesn't provide a framework to deploy trained models directly online, and an API server is needed for production. It also requires a third-party tool, Visdom, for visualization, whose features are rather limited.

PyTorch has the following features:

Production Ready: Transition seamlessly between eager and graph modes with TorchScript, and accelerate the path to production with TorchServe.
Distributed Training: Scalable distributed training and performance optimization in research and production is enabled by the torch.distributed backend.
Robust Ecosystem: A rich ecosystem of tools and libraries extends PyTorch and supports development in computer vision, NLP, and more.
Cloud Support: PyTorch is well supported on major cloud platforms, providing frictionless development and easy scaling.
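
For example, the eager-to-graph transition mentioned above can be sketched with TorchScript as follows (the model and the file name are illustrative):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 2), nn.ReLU())   # ordinary eager-mode model

scripted = torch.jit.script(model)     # compile to a TorchScript graph
scripted.save("model_scripted.pt")     # serialized form, loadable without the Python source
loaded = torch.jit.load("model_scripted.pt")
print(loaded(torch.rand(1, 4)))        # run the loaded graph-mode model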


5
KERAS
Keras is a high-level API for building and training neural networks. It is designed to be user-friendly, modular, and easy to extend, and it allows you to build, train, and deploy deep learning models with minimal code. Its API is intuitive and easy to use, making it ideal for beginners and experts alike.

Keras is relatively easy to learn and work with because it provides a Python front end with a high level of abstraction while offering multiple back ends for computation. This abstraction can make Keras slower than lower-level deep learning frameworks, but it remains extremely beginner-friendly.
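
A minimal sketch of the build-compile-train workflow in Keras, assuming TensorFlow is installed (the layer sizes and the commented-out training data are illustrative):

from tensorflow import keras
from tensorflow.keras import layers

# A small fully connected binary classifier built with the Sequential API
model = keras.Sequential([
    keras.Input(shape=(4,)),                 # 4 input features
    layers.Dense(16, activation="relu"),     # hidden layer
    layers.Dense(1, activation="sigmoid"),   # output layer
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=10, batch_size=32)   # train on your own data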

6
PREPARE THE DATA
 Format your dataset appropriately for input into the RNN.

 This typically involves splitting the data into input sequences and corresponding
output sequences.

 Ensure that your data is properly scaled and normalized if necessary.
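
A minimal sketch of this preparation step for a one-dimensional series; the make_sequences helper, the toy data, and the window size are illustrative:

import numpy as np

def make_sequences(series, window):
    # Split a 1-D series into (input sequence, next value) pairs for an RNN
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])   # input sequence
        y.append(series[i + window])     # corresponding output value
    return np.array(X), np.array(y)

data = np.arange(10, dtype=float)                        # toy series 0.0 ... 9.0
data = (data - data.min()) / (data.max() - data.min())   # scale to [0, 1]
X, y = make_sequences(data, window=3)
print(X.shape, y.shape)                                  # (7, 3) (7,)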

7
Data Normalization techniques

Min-Max Scaling:

 This technique rescales data into the range 0 to 1 by subtracting the minimum value from each data point and then dividing by the difference between the maximum and minimum values.

 Care is needed when the data contain outliers: because the minimum and maximum define the scale, extreme values can compress the remaining points into a narrow band, so the data should be inspected for outliers first.

Z-Score Normalization:

 This technique converts data to a standard normal scale by subtracting the mean from each data point and then dividing by the standard deviation.

 It is particularly useful when the data are approximately normally distributed, because the standardized values are easier to interpret and compare.

8
Decimal Scaling:

 This technique normalizes data by moving the decimal point: each data point is divided by 10^j, where j is the smallest integer such that the largest absolute value becomes less than 1.

 It is useful when dealing with very large values, because it reduces the data to a manageable range with a very simple transformation.

Log Transformation:

 This technique converts data to a logarithmic scale by taking the logarithm of each data point.

 It is useful when the data span a wide range of values, because it reduces the variation in the data.

 It is also helpful when the data contain outliers, because it dampens their impact.
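
A minimal NumPy sketch of the four techniques above, using illustrative values (note that the log transformation assumes strictly positive data):

import numpy as np

x = np.array([12.0, 25.0, 180.0, 320.0, 2600.0])   # illustrative positive values

min_max = (x - x.min()) / (x.max() - x.min())      # Min-Max Scaling -> range [0, 1]
z_score = (x - x.mean()) / x.std()                 # Z-Score Normalization
j = int(np.ceil(np.log10(np.abs(x).max())))        # smallest j with max|x| / 10**j < 1
decimal = x / 10 ** j                              # Decimal Scaling
log_t = np.log10(x)                                # Log Transformation

for name, values in [("min-max", min_max), ("z-score", z_score),
                     ("decimal", decimal), ("log", log_t)]:
    print(name, np.round(values, 3))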

9
ADVANTAGES

 Normalization in machine learning helps to reduce data redundancy and


improve data integrity.

 Normalization also improves data consistency.

 Normalization also helps to reduce the complexity of queries.

10
1. Calculate the range of the data set

 To find the range of a data set, find the maximum and minimum values in the data set, then subtract the minimum from the maximum. Arranging your data set in order from smallest to largest can help you find these values easily. Here's the formula:

 Range of x values = x_maximum - x_minimum

 Example: A scientist is using the normalization formula to analyze a set of data. They ran their experiment four times, and their results were 12, 25, 28 and 32. The largest data point in the set is 32, and the smallest is 12.

 Range of x values = 32 - 12 = 20

2. Subtract the minimum x value from the value of this data point

 Next, take the x value of the data point you're analyzing and subtract the minimum x value from it. You can start with any data point in your set.

 Example: The scientist analyzes the data point 25 first, so they subtract the minimum x value from it:

 x - x_minimum = 25 - 12 = 13

11
3. Insert these values into the formula and divide

 The final step of applying this formula to an individual data point is to divide the
difference between the specific data point and the minimum by the range. In this
process, that would mean taking the result of step two and dividing it by the result
from step one.

 Example: For this data point, the scientist fills in the complete equation:
x_normalized = (x - x_minimum) / range of x = 13 / 20 = 0.65

 This result falls between zero and one, so they applied the normalization formula correctly.

4. Repeat with additional data points

 Since the normalization formula is useful for analyzing and comparing complete
sets of data, it's important to apply it to each data point so that you can compare
your whole set. You might automate this with a spreadsheet program to save time.

 Example: The scientist completes their analysis by using the normalization


formula on the remaining three data points, 12, 28 and 32. Their results are 0, 0.8
and 1.
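
The four steps above can be reproduced in a few lines of Python; the normalize helper here is illustrative:

def normalize(x, xs):
    # Min-max normalization of one value x against the data set xs
    x_min, x_max = min(xs), max(xs)
    return (x - x_min) / (x_max - x_min)

data = [12, 25, 28, 32]                               # the scientist's results
print([round(normalize(x, data), 2) for x in data])   # [0.0, 0.65, 0.8, 1.0]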

12
Normalization formula for custom ranges

 While this normalization formula brings all results into a range between zero and
one, there is a variation on the normalization formula to use if you're trying to put
all data within a custom range where the lowest value is a and the highest value is
b:

 x_normalized = a + ((x - x_minimum) * (b - a)) / range of x
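
A short sketch of this custom-range variant; the normalize_to helper and the target range [-1, 1] are illustrative:

def normalize_to(x, xs, a, b):
    # Scale x into the custom range [a, b] using the min-max formula
    x_min, x_max = min(xs), max(xs)
    return a + (x - x_min) * (b - a) / (x_max - x_min)

print(round(normalize_to(25, [12, 25, 28, 32], a=-1, b=1), 2))   # 0.3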

The example below uses scikit-learn's StandardScaler, which applies z-score normalization to every feature column of a dataset:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# read the dataset
data = pd.read_csv('day.csv')

# split into input features (all columns except the last) and target (last column)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# z-score normalization: subtract the mean and divide by the standard deviation
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X)

# convert the normalized features back to a DataFrame with the original column names
X_normalized_df = pd.DataFrame(X_normalized, columns=X.columns)
print(X_normalized_df)

13
X = data.iloc[:,:-1]:
 data is typically a DataFrame containing your dataset.

 .iloc is a method in pandas used for integer-location based indexing. It allows


you to select rows and columns by their integer indices.

 [:,:-1] selects all rows and all columns except the last one. This is achieved
by using : to select all rows and :-1 to select all columns up to the last one.

 So, X contains all the input features (independent variables) of the dataset.

y = data.iloc[:,-1]:

 Similarly, data.iloc[:,-1] selects all rows and only the last column.
 This extracts the target variable (dependent variable) from the dataset.
 So, y contains the target variable.
In summary, after executing these lines of code, you will have:

 X: A DataFrame containing all the input features (independent variables) of the dataset.

 y: A Series containing the target variable (dependent variable) of the dataset.
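
A tiny illustrative example of this slicing on a made-up DataFrame:

import pandas as pd

df = pd.DataFrame({"feature_1": [1, 2, 3],
                   "feature_2": [10, 20, 30],
                   "target":    [0, 1, 0]})

X = df.iloc[:, :-1]        # all rows, every column except the last -> features
y = df.iloc[:, -1]         # all rows, only the last column -> target
print(X.shape, y.shape)    # (3, 2) (3,)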

14
Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a crucial initial step in data science projects. It refers to the method of studying and exploring data sets to understand their main characteristics, uncover patterns, locate outliers, and identify relationships between variables, usually through a combination of analysis and visualization. EDA is normally carried out as a preliminary step before undertaking more formal statistical analyses or modeling.

Key aspects of EDA include:

 Distribution of Data: Examining the distribution of data points to understand


their range, central tendencies (mean, median), and dispersion (variance, standard
deviation).

 Graphical Representations: Utilizing charts such as histograms, box plots,


scatter plots, and bar charts to visualize relationships within the data and
distributions of variables.

 Outlier Detection: Identifying unusual values that deviate from other data points.
Outliers can influence statistical analyses and might indicate data entry errors or
unique cases.

 Correlation Analysis: Checking the relationships between variables to


understand how they might affect each other. This includes computing correlation
coefficients and creating correlation matrices.

 Handling Missing Values: Detecting and deciding how to address missing data
points, whether by imputation or removal, depending on their impact and the
amount of missing data.

15
 Summary Statistics: Calculating key statistics that provide insight into data
trends and nuances.
 Testing Assumptions: Many statistical tests and models assume the data meet
certain conditions (like normality or homoscedasticity). EDA helps verify these
assumptions.
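
A minimal pandas sketch of these EDA steps, reusing the day.csv file from the earlier example (the numeric_only argument assumes a reasonably recent pandas version):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("day.csv")

print(df.head())                     # get familiar with the data
print(df.describe())                 # summary statistics: mean, std, quartiles
print(df.isnull().sum())             # missing values per column
print(df.corr(numeric_only=True))    # correlation matrix of numeric columns

df.hist(figsize=(10, 8))             # distribution of each numeric column
plt.show()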

Exploratory Data Analysis (EDA) is important for several reasons, especially in the
context of data science and statistical modeling. Here are some of the key reasons
why EDA is a critical step in the data analysis process:

 Understanding Data Structures: EDA helps in getting familiar with the dataset,
understanding the number of features, the type of data in each feature, and the
distribution of data points. This understanding is crucial for selecting appropriate
analysis or prediction techniques.

 Identifying Patterns and Relationships: Through visualizations and statistical


summaries, EDA can reveal hidden patterns and intrinsic relationships between
variables. These insights can guide further analysis and enable more effective
feature engineering and model building.

 Detecting Anomalies and Outliers: EDA is essential for identifying errors or


unusual data points that may adversely affect the results of your analysis.
Detecting these early can prevent costly mistakes in predictive modeling and
analysis.

 Testing Assumptions: Many statistical models assume that data follow a certain
distribution or that variables are independent. EDA involves checking these
assumptions. If the assumptions do not hold, the conclusions drawn from the
model could be invalid.

 Informing Feature Selection and Engineering: Insights gained from EDA can
inform which features are most relevant to include in a model and how to
transform them (scaling, encoding) to improve model performance.

16
 Optimizing Model Design: By understanding the data’s characteristics, analysts
can choose appropriate modeling techniques, decide on the complexity of the
model, and better tune model parameters.

 Facilitating Data Cleaning: EDA helps in spotting missing values and errors in
the data, which are critical to address before further analysis to improve data
quality and integrity.

 Enhancing Communication: Visual and statistical summaries from EDA can


make it easier to communicate findings and convince others of the validity of
your conclusions, particularly when explaining data-driven insights to
stakeholders without technical backgrounds.

17
Python Libraries

Normally, a library is a collection of books, or a room or place where many books are stored to be used later. Similarly, in the programming world, a library is a collection of precompiled code that can be used later in a program for specific, well-defined operations. Besides precompiled code, a library may contain documentation, configuration data, message templates, classes, values, etc.

A Python library is a collection of related modules. It contains bundles of code that can be used repeatedly in different programs, which makes Python programming simpler and more convenient because we don't need to write the same code again and again. Python libraries play a very vital role in fields such as machine learning, data science, and data visualization. Let's have a look at some of the commonly used libraries:

 pandas: Provides extensive functions for data manipulation and analysis, including data structure handling and time series functionality. pandas is an important library for data scientists. It is an open-source library that provides flexible, high-level data structures and a variety of analysis tools, easing data analysis, data manipulation, and data cleaning. pandas supports operations like sorting, re-indexing, iteration, concatenation, conversion of data, visualization, and aggregation.

 Matplotlib: A plotting library for creating static, interactive, and animated visualizations in Python. This library is responsible for plotting numerical data, which is why it is used in data analysis. It is also an open-source library and plots high-quality figures such as pie charts, histograms, scatter plots, and graphs.

 Seaborn: Built on top of Matplotlib, it provides a high-level interface for drawing


attractive and informative statistical graphics.

 Plotly: An interactive graphing library for making interactive plots and offers
more sophisticated visualization capabilities.
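
A small sketch combining these libraries on a made-up DataFrame:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({"hours": [1, 2, 3, 4, 5],
                   "score": [52, 58, 65, 71, 80]})

print(df.describe())                              # pandas: quick summary statistics

sns.scatterplot(data=df, x="hours", y="score")    # seaborn: statistical plot
plt.title("Score vs. hours studied")              # matplotlib: figure control
plt.show()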

18
PYTHON PLATFORM (ANACONDA)

Anaconda is the installation program used by Fedora, Red Hat Enterprise Linux and some other
distributions.

During installation, a target computer’s hardware is identified and configured, and the appropriate file systems for the system’s architecture are created. Finally, Anaconda allows the user to install the operating system software on the target computer. Anaconda can also upgrade existing installations of earlier versions of the same distribution. After the installation is complete, you can reboot into your installed system and continue doing customization using the initial setup program.

Anaconda is a fairly sophisticated installer. It supports installation from local and remote sources such
as CDs and DVDs, images stored on a hard drive, NFS, HTTP, and FTP. Installation can be scripted
with kickstart to provide a fully unattended installation that can be duplicated on scores of machines. It
can also be run over VNC on headless machines. A variety of advanced storage devices including
LVM, RAID, iSCSI, and multipath are supported from the partitioning program. Anaconda provides
advanced debugging features such as remote logging, access to the python interactive debugger, and
remote saving of exception dumps.

Anaconda is a Python distribution that is popular for data analysis and scientific computing.
• Available for Windows, Mac OS X, and Linux.
• Includes many popular packages: NumPy, SciPy, Matplotlib, pandas, IPython, Cython.
• Includes Spyder, a Python development environment.
• Includes conda, a platform-independent package manager.

19
INTRODUCTION TO GOOGLE COLAB

Google is quite aggressive in AI research. Over many years, Google developed an AI framework called TensorFlow and a development tool called Colaboratory. Today TensorFlow is open-sourced, and since 2017 Google has made Colaboratory free for public use. Colaboratory is now known as Google Colab or simply Colab.

Another attractive feature that Google offers to developers is the use of GPUs. Colab supports GPU acceleration, and it is totally free. One reason for making it free to the public could be to make its software a standard in academia for teaching machine learning and data science. It may also have a long-term perspective of building a customer base for Google Cloud APIs, which are sold on a per-use basis.

Irrespective of the reasons, the introduction of Colab has eased the learning and
development of machine learning applications.

20
INTRODUCTION TO KAGGLE

Kaggle is an online community platform for data scientists and machine learning enthusiasts. Kaggle allows users to collaborate with other users, find and publish datasets, use GPU-integrated notebooks, and compete with other data scientists to solve data science challenges. The aim of this online platform (founded in 2010 by Anthony Goldbloom and Jeremy Howard and acquired by Google in 2017) is to help professionals and learners reach their goals in their data science journey with the powerful tools and resources it provides. As of 2021, there were over 8 million registered users on Kaggle.

One of the sub-platforms that made Kaggle such a popular resource is its competitions. In a similar way that HackerRank plays that role for software developers and computer engineers, “Kaggle Competitions” has significant importance for data scientists; you can learn more about them in our Kaggle Competition Guide and learn how to analyze a dataset step-by-step in our Kaggle Competition Tutorial. In data science competitions like Kaggle’s or DataCamp’s, companies and organizations share challenging data science tasks with generous rewards, and data scientists, from beginners to experienced, compete to complete them. Kaggle also provides Kaggle Notebooks, which, just like DataLab, allow you to edit and run your code for data science tasks in your browser, so your local computer doesn't have to do all the heavy lifting and you don't need to set up a new development environment on your own.

Kaggle provides powerful resources on cloud and allows you to use a maximum of 30 hours of GPU
and 20 hours of TPU per week. You can upload your datasets to Kaggle and download others' datasets
as well. Additionally, you can check other people's datasets and notebooks and start discussion topics
on them. All your activity is scored on the platform and your score increases as you help others and
share useful information. Once you start earning points, you will be placed on a live leaderboard of 8
million Kaggle users.

21
CUSTOMER PURCHASE DATASET

22
NORMALIZATION OF DATASET

23
24
25
26
