
Ballari V.V. Sangha's
PROUDHADEVARAYA INSTITUTE OF TECHNOLOGY, HOSPETE
(Affiliated to VTU, Belagavi, Karnataka & Recognized by AICTE, New Delhi)

Department of Artificial Intelligence & Machine Learning

MACHINE LEARNING LAB
(BCSL606)

VI Semester

VISVESVARAYA TECHNOLOGICAL UNIVERSITY,
Jnana Sangama, Belagavi
KARNATAKA-583225.
Ballari V.V. Sangha's
PROUDHADEVARAYA INSTITUTE OF TECHNOLOGY, HOSPETE
(Affiliated to VTU, Belagavi, Karnataka & Recognized by AICTE, New Delhi)

Department of Artificial Intelligence & Machine Learning

LAB MANUAL
(2024-2025)

MACHINE LEARNING LAB (BCSL606)

VI Semester

Prepared by: Dr. Venumadhava. M
B.E., M.Tech., Ph.D.
INSTITUTE VISION

To become a premier institute in imparting technical education by creating competent engineers having dynamic adaptability with high morals and concern towards environment and society.

INSTITUTE MISSION

M1 Promote active learning strategies to facilitate student-centric learning.
M2 Provide self-learning capabilities to enhance employability and entrepreneurial skills.
M3 Inculcate human values and ethics to make learners sensitive towards societal issues.
Vision of the Department:

To become a center of excellence in providing technical education in the field of Artificial Intelligence & Machine Learning Engineering, producing globally competent and socially responsible engineering graduates.

Mission of the Department:

M1 To provide theoretical foundation and practical training to the students in the areas related to computing, with exposure to the latest tools and technologies.
M2 To establish continuous industry-institute interaction, participation and collaboration to contribute skilled, competent engineers.
M3 To build human values, social values, entrepreneurship skills and professional ethics among the IT technocrats.

Programme Educational Objectives (PEOs):

PEO1 Analyze and solve real-time computer science and engineering problems through fundamental knowledge of mathematics, science, engineering and technology.
PEO2 Pursue a successful career in the field of Computer Science & Engineering by contributing to the profession as an excellent employee, or as an entrepreneur.
PEO3 Exhibit ethical values, communication and teamwork skills, and the ability for lifelong learning to address computer engineering and societal issues.

Programme Specific Outcomes (PSOs):

The graduates will be able to:

PSO1 Understand, design, implement and test algorithms/programs for the development of different computer-based systems.
PSO2 Acquire and apply hardware/software tools to build real-world applications.
PSO3 Adopt and incorporate the changing technologies for creating innovative career paths, to be an entrepreneur, and to pursue higher studies.

PROGRAMME OUTCOMES (POs):

Undergraduate engineering programmes are designed to prepare graduates to attain the following programme outcomes:

PO1 Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.
PO2 Problem analysis: Identify, formulate, review research literature, and analyze complex engineering problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.
PO3 Design/development of solutions: Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for the public health and safety, and the cultural, societal, and environmental considerations.
PO4 Conduct investigations of complex problems: Use research-based knowledge and research methods including design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid conclusions.
PO5 Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools including prediction and modelling to complex engineering activities with an understanding of the limitations.
PO6 The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional engineering practice.
PO7 Environment and sustainability: Understand the impact of the professional engineering solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for, sustainable development.
PO8 Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the engineering practice.
PO9 Individual and teamwork: Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.
PO10 Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.
PO11 Project management and finance: Demonstrate knowledge and understanding of the engineering and management principles and apply these to one's own work, as a member and leader in a team, to manage projects and in multidisciplinary environments.
PO12 Life-long learning: Recognize the need for, and have the preparation and ability to engage in, independent and life-long learning in the broadest context of technological change.
Machine Learning Lab (Semester 6)
Course Code: BCSL606                        CIE Marks: 50
Teaching Hours/Week (L:T:P:S): 0:0:2:0      SEE Marks: 50
Credits: 01                                 Total Marks: 100
Examination type (SEE): Practical           Exam Hours: 03
Course objectives:
 To become familiar with data and visualize univariate, bivariate, and multivariate data using statistical
techniques and dimensionality reduction.
 To understand various machine learning algorithms such as similarity-based learning, regression, decision trees,
and clustering.
 To familiarize with learning theories, probability-based models and developing the skills required for
decision-making in dynamic environments.
Sl.NO Experiments
1 Develop a program to create histograms for all numerical features and analyze the distribution of each feature.
Generate box plots for all numerical features and identify any outliers. Use California Housing dataset.

Book 1: Chapter 2

2 Develop a program to Compute the correlation matrix to understand the relationships between pairs of features.
Visualize the correlation matrix using a heatmap to know which variables have strong positive/negative
correlations. Create a pair plot to visualize pairwise relationships between features. Use California Housing
dataset.
Book 1: Chapter 2
3 Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of the
Iris dataset from 4 features to 2.

Book 1: Chapter 2
4 For a given set of training data examples stored in a .CSV file, implement and demonstrate the Find-S algorithm
to output a description of the set of all hypotheses consistent with the training examples.

Book 1: Chapter 3
5 Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly generated values of x in the range [0,1]. Perform the following based on the dataset generated.
a. Label the first 50 points {x1,……,x50} as follows: if (xi ≤ 0.5), then xi ∊ Class1, else xi ∊ Class2
b. Classify the remaining points, x51,……,x100 using KNN. Perform this for k=1,2,3,4,5,20,30
Book 2: Chapter – 2
6 Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select
appropriate data set for your experiment and draw graphs

Book 1: Chapter – 4
7 Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use Boston
Housing Dataset for Linear Regression and Auto MPG Dataset (for vehicle fuel efficiency prediction) for
Polynomial Regression.

Book 1: Chapter – 5
8 Develop a program to demonstrate the working of the decision tree algorithm. Use Breast Cancer Data set for
building the decision tree and apply this knowledge to classify a new sample.

Book 2: Chapter – 3
9 Develop a program to implement the Naive Bayesian classifier considering Olivetti Face Data set for training.
Compute the accuracy of the classifier, considering a few test data sets.

Book 2: Chapter – 4
10 Develop a program to implement k-means clustering using Wisconsin Breast Cancer data set and visualize the
clustering result.

Book 2: Chapter – 4
Course outcomes (Course Skill Set):
At the end of the course the student will be able to:
 Illustrate the principles of multivariate data and apply dimensionality reduction techniques.
 Demonstrate similarity-based learning methods and perform regression analysis.
 Develop decision trees for classification and regression problems, and Bayesian models for probabilistic
learning.
 Implement the clustering algorithms to share computing resources.

Assessment Details (both CIE and SEE)


The weightage of Continuous Internal Evaluation (CIE) is 50% and for Semester End Exam (SEE) is 50%. The minimum passing mark for the CIE is 40% of the maximum marks (20 marks out of 50) and for the SEE the minimum passing mark is 35% of the maximum marks (18 out of 50 marks). A student shall be deemed to have satisfied the academic requirements and earned the credits allotted to each subject/course if the student secures a minimum of 40% (40 marks out of 100) in the sum total of the CIE (Continuous Internal Evaluation) and SEE (Semester End Examination) taken together.
Continuous Internal Evaluation (CIE):
CIE marks for the practical course are 50 Marks.
 The split-up of CIE marks for record/ journal and test are in the ratio 60:40.
 Each experiment is to be evaluated for conduction with an observation sheet and record write- up.
Rubrics for the evaluation of the journal/write-up for hardware/software experiments are
designed by the faculty who is handling the laboratory session and are made known to students at
the beginning of the practical session.
 Record should contain all the specified experiments in the syllabus and each experiment write- up will
be evaluated for 10 marks.
 Total marks scored by the students are scaled down to 30 marks (60% of maximum marks).
 Weightage to be given for neatness and submission of record/write-up on time.
 Department shall conduct a test of 100 marks after the completion of all the experiments listed in the
syllabus.
 In a test, test write-up, conduction of experiment, acceptable result, and
procedural knowledge will carry a weightage of 60% and the rest 40% for
viva-voce.
 The suitable rubrics can be designed to evaluate each student’s performance and learning ability.
 The marks scored shall be scaled down to 20 marks (40% of the maximum marks).
 The Sum of scaled-down marks scored in the report write-up/journal and marks of a test is the total CIE
marks scored by the student.
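The scaling described above can be checked with a quick worked example (the marks values below are hypothetical):

```python
# Hypothetical worked example of the CIE computation described above.
record_marks = [8, 9, 7, 8, 9, 10, 8, 9, 9, 8]  # 10 experiments, 10 marks each
record_total = sum(record_marks)                 # out of 100
record_scaled = record_total * 30 / 100          # scaled down to 30 marks

test_marks = 70                                  # test conducted for 100 marks
test_scaled = test_marks * 20 / 100              # scaled down to 20 marks

cie = record_scaled + test_scaled                # total CIE out of 50
print(cie)  # 39.5
```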
INTRODUCTION
Machine Learning
Machine learning is used everywhere, from automating mundane tasks to offering intelligent insights, and industries in every sector try to benefit from it. You may already be using a device that utilizes it, for example a wearable fitness tracker like Fitbit, or an intelligent home assistant like Google Home. But there are many more examples of ML in use.
• Prediction: Machine learning can be used in prediction systems. Considering a loan example, to compute the probability of a default, the system will need to classify the available data into groups.
• Image recognition: Machine learning can be used for face detection in an image as well. There is a separate category for each person in a database of several people.
• Speech recognition: This is the translation of spoken words into text. It is used in voice searches and more. Voice user interfaces include voice dialing, call routing, and appliance control. It can also be used for simple data entry and the preparation of structured documents.
• Medical diagnoses: ML can be trained to recognize cancerous tissues.
• Financial industry and trading: Companies use ML in fraud investigations and credit checks.
Types of Machine Learning
Machine learning algorithms can be classified into 3 types:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
Overview of Supervised Learning Algorithm
In supervised learning, an AI system is presented with data which is labeled, meaning that each data point is tagged with the correct label. The goal is to approximate the mapping function so well that, when you have new input data (x), you can predict the output variable (Y) for that data.
Types of Supervised learning:
• Classification: A classification problem is when the output variable is a category, such as "red" or "blue", or "disease" and "no disease".
• Regression: A regression problem is when the output variable is a real value, such as "dollars" or "weight".
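The classification/regression distinction can be sketched in a few lines of scikit-learn (the data here, hours studied versus pass/fail and exam score, is made up for illustration):

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy input: hours studied (hypothetical numbers)
X = [[1], [2], [3], [4], [5]]

# Classification: the output is a category (0 = fail, 1 = pass)
y_class = [0, 0, 0, 1, 1]
clf = LogisticRegression().fit(X, y_class)
print(clf.predict([[4.5]]))   # a label, e.g. [1]

# Regression: the output is a real value (an exam score)
y_reg = [40.0, 50.0, 60.0, 70.0, 80.0]
reg = LinearRegression().fit(X, y_reg)
print(reg.predict([[4.5]]))   # a number, about [75.]
```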
Overview of Unsupervised Learning Algorithm
In unsupervised learning, an AI system is presented with unlabeled, uncategorized data and the
system’s algorithms act on the data without prior training. The output is dependent upon the
coded algorithms. Subjecting a system to unsupervised learning is one way of testing AI.
Types of Unsupervised learning:
•Clustering: A clustering problem is where you want to discover the inherent groupings in the
data, such as grouping customers by purchasing behavior.
•Association: An association rule learning problem is where you want to discover rules that
describe large portions of your data, such as people that buy X also tend to buy Y.
Overview of Reinforcement Learning
A reinforcement learning algorithm, or agent, learns by interacting with its environment. The agent receives rewards for performing correctly and penalties for performing incorrectly, and learns without intervention from a human by maximizing its reward and minimizing its penalty. It is a type of dynamic programming that trains algorithms using a system of reward and punishment. As an example of a task best solved by such an algorithm, suppose the agent is given 2 options, i.e. a path with water or a path with fire. A reinforcement algorithm works on a reward system: if the agent uses the fire path, rewards are subtracted, and the agent learns that it should avoid the fire path. If it had chosen the water path, the safe path, then some points would have been added to the reward points, and the agent would learn which path is safe and which path isn't. By leveraging the rewards obtained, the agent improves its knowledge of the environment to select the next action.
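The fire/water example above can be sketched as a tiny reward-update loop (the reward values, learning rate, and path names are made up for illustration):

```python
import random

# Toy sketch of the fire/water example: the agent repeatedly picks a
# path and updates its value estimate from the reward it receives.
REWARDS = {"fire": -10, "water": +10}   # assumed environment rewards
values = {"fire": 0.0, "water": 0.0}    # the agent's estimates
alpha = 0.5                             # learning rate

random.seed(0)
for _ in range(20):
    # explore: pick a random path and observe its reward
    path = random.choice(["fire", "water"])
    values[path] += alpha * (REWARDS[path] - values[path])

# exploit: the agent now prefers the path with the higher estimate
best = max(values, key=values.get)
print(best)  # water
```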
Machine Learning Approaches
Decision tree learning: Decision tree learning uses a decision tree as a predictive model, which maps observations about an item to conclusions about the item's target value.
Association rule learning: Association rule learning is a method for discovering interesting relations between variables in large databases.
Artificial neural networks
An artificial neural network (ANN) learning algorithm, usually called "neural network" (NN), is
a learning algorithm that is vaguely inspired by biological neural networks. Computations are
structured in terms of an interconnected group of artificial neurons, processing information using
a connectionist approach to computation. Modern neural networks are non-linear statistical data
modeling tools. They are usually used to model complex relationships between inputs and
outputs, to find patterns in data, or to capture the statistical structure in an unknown joint
probability distribution between observed variables
Deep learning
Falling hardware prices and the development of GPUs for personal use in the last few years have
contributed to the development of the concept of deep learning which consists of multiple hidden
layers in an artificial neural network. This approach tries to model the way the human brain
processes light and sound into vision and hearing. Some successful applications of deep learning
are computer vision and speech recognition.
Inductive logic programming
Inductive logic programming (ILP) is an approach to rule learning using logic programming as a
uniform representation for input examples, background knowledge, and hypotheses. Given an
encoding of the known background knowledge and a set of examples represented as a logical
database of facts, an ILP system will derive a hypothesized logic program that entails all positive
and no negative examples. Inductive programming is a related field that considers any kind of
programming languages for representing hypotheses (and not only logic programming), such as
functional programs.
Clustering
Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that observations within the same cluster are similar according to some pre-designated criterion or criteria, while observations drawn from different clusters are dissimilar. Different clustering
techniques make different assumptions on the structure of the data, often defined by some
similarity metric and evaluated for example by internal compactness (similarity between
members of the same cluster) and separation between different clusters. Other methods are based
on estimated density and graph connectivity. Clustering is a method of unsupervised learning,
and a common technique for statistical data analysis.
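As a minimal sketch of cluster analysis on toy 1-D data, using scikit-learn's KMeans (the same estimator family Experiment 10 uses; the data points are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obviously separated groups of 1-D points (toy data)
X = np.array([[1.0], [1.2], [0.8], [8.0], [8.3], [7.9]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # one cluster label per point, e.g. [1 1 1 0 0 0]
print(km.cluster_centers_)  # centers near 1.0 and 8.07
```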
Bayesian networks
A Bayesian network, belief network or directed acyclic graphical model is a probabilistic
graphical model that represents a set of random variables and their conditional independencies
via a directed acyclic graph (DAG). For example, a Bayesian network could represent the
probabilistic relationships between diseases and symptoms. Given symptoms, the network can be
used to compute the probabilities of the presence of various diseases. Efficient algorithms exist
that perform inference and learning.
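For a single disease and symptom, the inference a Bayesian network performs reduces to Bayes' rule; a minimal sketch (all probabilities here are made-up numbers):

```python
# Hypothetical numbers for one disease D and one symptom S.
p_d = 0.01             # prior P(D)
p_s_given_d = 0.9      # P(S | D)
p_s_given_not_d = 0.1  # P(S | not D)

# Bayes' rule: P(D | S) = P(S | D) * P(D) / P(S)
p_s = p_s_given_d * p_d + p_s_given_not_d * (1 - p_d)
p_d_given_s = p_s_given_d * p_d / p_s
print(round(p_d_given_s, 3))  # 0.083
```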
What is Python?
A programming language with strong similarities to Perl, but with powerful typing and object-oriented features.
- Commonly used for producing HTML content on websites. Great for text files.
- Useful built-in types (lists, dictionaries).
- Clean syntax, powerful extensions.
Why Python?
• Natural Language Toolkit (NLTK)
• Ease of use; interpreter
• AI processing, symbolic: Python's built-in datatypes for strings, lists, and more. Java or C++ requires the use of special classes for this.
• AI processing, statistical: Python has strong numeric processing capabilities (matrix operations, etc.), suitable for probability and machine learning code.
Look at a sample of code:

x = 34 - 23          # A comment.
y = "Hello"          # Another one.
z = 3.45
if z == 3.45 or y == "Hello":
    x = x + 1
    y = y + " World"  # String concat.
print(x)
print(y)
Enough to Understand the Code
• Assignment uses = and comparison uses ==.
• For numbers, + - * / % behave as expected. Special use of + for string concatenation. Special use of % for string formatting.
• Logical operators are words (and, or, not), not symbols (&&, ||, !).
• The basic printing command is print().
• The first assignment to a variable creates it. Variable types don't need to be declared; Python figures out the variable types on its own.
Basic Datatypes
• Integers (default for whole numbers)
  z = 5 // 2  # Answer is 2; // is integer division (plain / gives 2.5).
• Floats
  x = 3.456
• Strings: Can use "" or '' to specify: "abc" and 'abc' are the same thing. Unmatched quotes can occur within the string: "matt's". Use triple double-quotes for multi-line strings or strings that contain both ' and " inside of them: """a'b"c"""
Whitespace
Whitespace is meaningful in Python, especially indentation and placement of newlines.
- Use a newline to end a line of code. (Not a semicolon, as in C++ or Java.) (Use \ when you must go to the next line prematurely.)
- There are no braces { } to mark blocks of code in Python; use consistent indentation instead. The first line with less indentation is considered outside of the block.
- Often a colon appears at the start of a new block. (We'll see this later for function and class definitions.)
Comments
• Start comments with #; the rest of the line is ignored.
• You can include a "documentation string" as the first line of any new function or class that you define.
• The development environment, debugger, and other tools use it: it's good style to include one.

def my_function(x, y):
    """This is the docstring. This function does blah blah blah."""
    # The code would go here...
Python and Types
Python determines the data types in a program automatically: "dynamic typing".
But Python is not casual about types; it enforces them after it figures them out: "strong typing".
So, for example, you can't just append an integer to a string. You must first convert the integer to a string itself.

x = "the answer is "  # Decides x is a string.
y = 23                # Decides y is an integer.
print(x + y)          # Python will complain about this (TypeError).
print(x + str(y))     # This works: the answer is 23
Naming Rules
• Names are case sensitive and cannot start with a number. They can contain letters, numbers, and underscores: bob Bob _bob _2_bob_ bob_2 BoB
• There are some reserved words: and, as, assert, break, class, continue, def, del, elif, else, except, finally, for, from, global, if, import, in, is, lambda, nonlocal, not, or, pass, raise, return, try, while, with, yield

Accessing a Non-existent Name
• If you try to access a name before it's been created (by placing it on the left side of an assignment), you'll get an error.

>>> y
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'y' is not defined
>>> y = 3
>>> y
3

Multiple Assignment
• You can also assign to multiple names at the same time.

>>> x, y = 2, 3
>>> x
2
>>> y
3

String Operations
• We can use some methods built in to the string data type to perform formatting operations on strings:

>>> "hello".upper()
'HELLO'

• There are many other handy string operations available. Check the Python documentation for more.
Printing with Python
• You can print a string to the screen using print().
• Using the % string operator in combination with print(), we can format our output text.

>>> print("%s xyz %d" % ("abc", 34))
abc xyz 34

print() automatically adds a newline to the end of the string. If you pass it several strings, it prints them separated by a space.

>>> print("abc")
abc
>>> print("abc", "def")
abc def
Numpy
Let’s start with NumPy. Among other things, NumPy contains:
A powerful N-dimensional array object. Sophisticated (broadcasting/universal) functions. Tools
for integrating C/C++ and Fortran code.
Useful linear algebra, Fourier transforms, and random number capabilities.
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional
container of generic data.
The key to NumPy is the ndarray object, an n-dimensional array of homogeneous data types,
with many operations being performed in compiled code for performance. There are several
important differences between NumPy arrays and the standard Python sequences:
NumPy arrays have a fixed size. Modifying the size means creating a new array. NumPy arrays
must be of the same data type, but this can include Python objects. More efficient mathematical
operations than built-in sequence types.
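The differences listed above can be seen in a short sketch (assuming NumPy is installed):

```python
import numpy as np

py_list = [1, 2, 3, 4]
arr = np.array(py_list)  # fixed-size array with a homogeneous dtype

# Element-wise arithmetic works on arrays, not on lists
print(arr * 2)      # [2 4 6 8]
print(py_list * 2)  # [1, 2, 3, 4, 1, 2, 3, 4]  (list repetition)

# "Resizing" an ndarray actually builds a new array
bigger = np.append(arr, 5)
print(bigger)       # [1 2 3 4 5]
print(arr.size)     # 4 -- the original array is unchanged
```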
Numpy datatypes
To begin, NumPy supports a wider variety of data types than are built-in to the Python language
by default. They are defined by the numpy.dtype class and include:
intc (same as a C integer) and intp (used for indexing) int8, int16, int32, int64
uint8, uint16, uint32, uint64 float16, float32, float64 complex64, complex128
bool_, int_, float_, complex_ are shorthand for defaults.
These can be used as functions to cast literals or sequence types, as well as arguments to numpy
functions that accept the dtype keyword argument.
Some examples:

>>> import numpy as np
>>> x = np.float32(1.0)
>>> x
1.0
>>> y = np.int_([1, 2, 4])
>>> y
array([1, 2, 4])
>>> z = np.arange(3, dtype=np.uint8)
>>> z
array([0, 1, 2], dtype=uint8)
>>> z.dtype
dtype('uint8')

Numpy arrays
There are a couple of mechanisms for creating arrays in NumPy:
•Conversion from other Python structures (e.g., lists, tuples).
•Built-in NumPy array creation (e.g., arange, ones, zeros, etc.).
•Reading arrays from disk, either from standard or custom formats (e.g. reading in from a CSV
file).
•And others … In general, any numerical data that is stored in an array-like container can be
converted to an ndarray through use of the array() function. The most obvious examples are
sequence types like lists and tuples.
>>> x = np.array([2, 3, 1, 0])
>>> x = np.array([[1, 2.0], [0, 0], (1+1j, 3.)])
>>> x
array([[ 1.+0.j, 2.+0.j],
       [ 0.+0.j, 0.+0.j],
       [ 1.+1.j, 3.+0.j]])
There are a couple of built-in NumPy functions which will create arrays from scratch.
• zeros(shape) -- creates an array filled with 0 values with the specified shape. The default dtype is float64.

>>> np.zeros((2, 3))
array([[ 0., 0., 0.],
       [ 0., 0., 0.]])

• ones(shape) -- creates an array filled with 1 values.
• arange() -- creates arrays with regularly incrementing values.

>>> np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.arange(2, 10, dtype=float)
array([ 2., 3., 4., 5., 6., 7., 8., 9.])
>>> np.arange(2, 3, 0.1)
array([ 2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9])

• linspace() -- creates arrays with a specified number of elements, spaced equally between the specified beginning and end values.

>>> np.linspace(1., 4., 6)
array([ 1. , 1.6, 2.2, 2.8, 3.4, 4. ])

• random.random(shape) -- creates arrays of random floats over the interval [0, 1).

>>> np.random.random((2, 3))
array([[ 0.75688597, 0.41759916, 0.35007419],
       [ 0.77164187, 0.05869089, 0.98792864]])

Printing an array can be done with print().

>>> import numpy as np
>>> a = np.arange(3)
>>> print(a)
[0 1 2]
>>> a
array([0, 1, 2])
>>> b = np.arange(9).reshape(3, 3)
>>> print(b)
[[0 1 2]
 [3 4 5]
 [6 7 8]]
>>> c = np.arange(8).reshape(2, 2, 2)
>>> print(c)
[[[0 1]
  [2 3]]

 [[4 5]
  [6 7]]]
Indexing
Single-dimension indexing is accomplished as usual.

>>> x = np.arange(10)
>>> x[2]
2
>>> x[-2]
8

Multi-dimensional arrays support multi-dimensional indexing.

>>> x.shape = (2, 5)  # now x is 2-dimensional
>>> x[1, 3]
8
>>> x[1, -1]
9

Using fewer dimensions to index will result in a subarray.

>>> x[0]
array([0, 1, 2, 3, 4])

This means that x[i, j] == x[i][j], but the second method is less efficient. Slicing is possible just as it is for typical Python sequences.

>>> x = np.arange(10)
>>> x[2:5]
array([2, 3, 4])
>>> x[:-7]
array([0, 1, 2])
>>> x[1:7:2]
array([1, 3, 5])
>>> y = np.arange(35).reshape(5, 7)
>>> y[1:5:2, ::3]
array([[ 7, 10, 13],
       [21, 24, 27]])
Array operations
Basic operations apply element-wise; the result is a new array with the resultant elements. Operations like *= and += modify the existing array.

>>> a = np.arange(5)
>>> b = np.arange(5)
>>> a + b
array([0, 2, 4, 6, 8])
>>> a - b
array([0, 0, 0, 0, 0])
>>> a ** 2
array([ 0, 1, 4, 9, 16])
>>> a > 3
array([False, False, False, False, True])
>>> 10 * np.sin(a)
array([ 0., 8.41470985, 9.09297427, 1.41120008, -7.56802495])
>>> a * b
array([ 0, 1, 4, 9, 16])

Since multiplication is done element-wise, you need to specifically perform a dot product for matrix multiplication.

>>> a = np.zeros(4).reshape(2, 2)
>>> a
array([[ 0., 0.],
       [ 0., 0.]])
>>> a[0, 0] = 1
>>> a[1, 1] = 1
>>> b = np.arange(4).reshape(2, 2)
>>> b
array([[0, 1],
       [2, 3]])
>>> a * b
array([[ 0., 0.],
       [ 0., 3.]])
>>> np.dot(a, b)
array([[ 0., 1.],
       [ 2., 3.]])

There are also some built-in methods of ndarray objects. Universal functions which may also be applied include exp, sqrt, add, sin, cos, etc.

>>> a = np.random.random((2, 3))
>>> a
array([[ 0.68166391, 0.98943098, 0.69361582],
       [ 0.78888081, 0.62197125, 0.40517936]])
>>> a.sum()
4.1807421388722164
>>> a.min()
0.4051793610379143
>>> a.max(axis=0)
array([ 0.78888081, 0.98943098, 0.69361582])
>>> a.min(axis=1)
array([ 0.68166391, 0.40517936])
Pandas
Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use
data structures and data analysis tools for the Python programming language. Python with Pandas
is used in a wide range of fields including academic and commercial domains including finance,
economics, Statistics, analytics, etc.In this tutorial, we will learn the various features of Python
Pandas and how to use them in practice.
Pandas is an open-source Python Library providing high-performance data manipulation and
analysis tool using its powerful data structures. The name Pandas is derived from the word Panel
Data – an Econometrics from Multidimensional data.
In 2008, developer Wes McKinney started developing pandas when in need of high performance,
flexible tool for analysis of data.
Prior to Pandas, Python was majorly used for data mugging and preparation. It had very little
contribution towards data analysis. Pandas solved this problem. Using Pandas, we can
accomplish five typical steps in the processing and analysis of data, regardless of the origin of
data — load, prepare, manipulate, model, and analyze.

Python with Pandas is used in a wide range of fields including academic and commercial
domains including finance, economics, Statistics, analytics, etc.
Key Features of Pandas
•Fast and efficient DataFrame object with default and customized indexing.
•Tools for loading data into in-memory data objects from different file formats.
•Data alignment and integrated handling of missing data.
•Reshaping and pivoting of data sets.
•Label-based slicing, indexing and subsetting of large data sets.
•Columns can be inserted into or deleted from a data structure.
•Group-by functionality for aggregation and transformations.
•High-performance merging and joining of data.
•Time-series functionality.
The standard Python distribution does not come bundled with the Pandas module. A lightweight way to install it is with the popular Python package installer, pip:
pip install pandas
If you install the Anaconda Python distribution, Pandas is installed with it by default.
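The load, prepare, and manipulate steps mentioned above can be sketched in a few lines. The column names and values below are made up purely for illustration:

```python
import numpy as np
import pandas as pd

# Load: build a small frame in memory (in practice, pd.read_csv loads from a file)
df = pd.DataFrame({"city": ["Hospete", "Ballari", "Hospete"],
                   "price": [4.2, np.nan, 3.8]})

# Prepare: integrated missing-data handling (fill NaN with the column mean)
df["price"] = df["price"].fillna(df["price"].mean())

# Manipulate: group-by aggregation
mean_by_city = df.groupby("city")["price"].mean()
print(mean_by_city)
```

The same fillna/groupby pattern scales unchanged to frames with millions of rows.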
Spyder (software)
Spyder is an open-source, cross-platform integrated development environment (IDE) for scientific programming in the Python language. Spyder integrates with a number of prominent packages in the scientific Python stack, including NumPy, SciPy, Matplotlib, pandas, IPython, SymPy and Cython, as well as other open-source software.
1. Develop a program to create histograms for all numerical features and analyze the
distribution of each feature. Generate box plots for all numerical features and identify
any outliers. Use California Housing dataset.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load the dataset


data = fetch_california_housing(as_frame=True)
df = data.frame

# Plot histograms
df.hist(figsize=(12, 8), bins=30, color='blue', edgecolor='black')
plt.suptitle("Feature Distributions")
plt.show()

# Plot boxplots
plt.figure(figsize=(12, 8))
sns.boxplot(data=df, palette="Set2")
plt.xticks(rotation=45)
plt.title("Box Plots of Features")
plt.show()

# Detect outliers using IQR method


outliers_count = {}
for col in df.select_dtypes(include=[np.number]).columns:
    Q1, Q3 = df[col].quantile([0.25, 0.75])
    IQR = Q3 - Q1
    lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
    outliers_count[col] = ((df[col] < lower) | (df[col] > upper)).sum()

# Print outlier summary
print("\nOutliers Count:")
print(outliers_count)

# Print dataset summary
print("\nDataset Summary:")
print(df.describe())
OUTPUT:

Outliers Count:
{'MedInc': 681, 'HouseAge': 0, 'AveRooms': 511, 'AveBedrms': 1424, 'Population': 1196, 'AveOccup': 711, 'Latitude': 0, 'Longitude': 0,

MedInc HouseAge ... Longitude MedHouseVal


count 20640.000000 20640.000000 ... 20640.000000 20640.000000
mean 3.870671 28.639486 ... -119.569704 2.068558
std 1.899822 12.585558 ... 2.003532 1.153956
min 0.499900 1.000000 ... -124.350000 0.149990
25% 2.563400 18.000000 ... -121.800000 1.196000
50% 3.534800 29.000000 ... -118.490000 1.797000
75% 4.743250 37.000000 ... -118.010000 2.647250
max 15.000100 52.000000 ... -114.310000 5.000010
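The IQR rule used in the program can be verified by hand on a tiny made-up series in which the outlier is obvious:

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 100])  # 100 is an obvious outlier
Q1, Q3 = s.quantile([0.25, 0.75])             # Q1 = 2.0, Q3 = 4.0
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR  # fences at -1.0 and 7.0
n_outliers = ((s < lower) | (s > upper)).sum()
print(n_outliers)  # 1
```

Only the value 100 falls outside the fences, matching the rule applied to each housing feature above.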
2. Develop a program to Compute the correlation matrix to understand the relationships between
pairs of features. Visualize the correlation matrix using a heatmap to know which variables have
strong positive/negative correlations. Create a pair plot to visualize pairwise relationships
between features. Use California Housing dataset.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Step 1: Load the California Housing Dataset


california_data = fetch_california_housing(as_frame=True)
data = california_data.frame

# Step 2: Compute the correlation matrix


correlation_matrix = data.corr()

# Step 3: Visualize the correlation matrix using a heatmap


plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of California Housing Features')
plt.show()

# Step 4: Create a pair plot to visualize pairwise relationships


sns.pairplot(data, diag_kind='kde', plot_kws={'alpha': 0.5})
plt.suptitle('Pair Plot of California Housing Features', y=1.02)
plt.show()
OUTPUT:
3. Develop a program to implement Principal Component Analysis (PCA) for reducing the
dimensionality of the Iris dataset from 4 features to 2.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load Iris dataset


iris = load_iris()
data, labels = iris.data, iris.target

# Apply PCA (2 components)


pca = PCA(n_components=2)
reduced_data = pca.fit_transform(data)

# Plot PCA results


plt.figure(figsize=(8, 6))
for i, color in enumerate(['r', 'g', 'b']):
    plt.scatter(*reduced_data[labels == i].T, color=color, label=iris.target_names[i])

plt.title("PCA on Iris Dataset")


plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend()
plt.grid()
plt.show()
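How much information survives the 4-to-2 reduction can be checked through the explained variance ratio; for the Iris dataset the first two components typically retain over 95% of the total variance:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=2).fit(iris.data)

# Fraction of the total variance captured by each principal component
print("Explained variance ratios:", pca.explained_variance_ratio_)
retained = pca.explained_variance_ratio_.sum()
print(f"Total variance retained: {retained:.2%}")
```

A high retained fraction justifies plotting and reasoning about the data in just two dimensions.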
OUTPUT:
4. A given set of training data examples stored in a .CSV file, implement and demonstrate the
Find- S algorithm to output a description of the set of all hypotheses consistent with the
training examples.

import pandas as pd

def find_s_algorithm(file_path):
    data = pd.read_csv(file_path)

    print("Training data:")
    print(data)

    attributes = data.columns[:-1]
    class_label = data.columns[-1]

    # Start with the most specific hypothesis: the first positive example.
    # Each later positive example generalizes any mismatching attribute to '?',
    # and a '?' is never specialized back to a concrete value.
    hypothesis = None

    for index, row in data.iterrows():
        if row[class_label] == 'Yes':
            values = [row[a] for a in attributes]
            if hypothesis is None:
                hypothesis = values
            else:
                hypothesis = [h if h == v else '?' for h, v in zip(hypothesis, values)]

    return hypothesis

file_path = 'training_data.csv'
hypothesis = find_s_algorithm(file_path)
print("\nThe final hypothesis is:", hypothesis)

CSV File:

Outlook Temperature Humidity Windy PlayTennis


Sunny Hot High FALSE No
Sunny Hot High TRUE No
Overcast Hot High FALSE Yes
Rain Cold High FALSE Yes
Rain Cold High TRUE No
Overcast Hot High TRUE Yes
Sunny Hot High FALSE No
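Since the CSV contents are shown above, the same algorithm can be exercised on the table held in memory, which is a convenient way to trace it by hand. Generalizing across the three positive rows yields ['?', '?', 'High', '?']:

```python
import pandas as pd

# Same training table as the CSV above, held in memory
rows = [
    ["Sunny",    "Hot",  "High", "FALSE", "No"],
    ["Sunny",    "Hot",  "High", "TRUE",  "No"],
    ["Overcast", "Hot",  "High", "FALSE", "Yes"],
    ["Rain",     "Cold", "High", "FALSE", "Yes"],
    ["Rain",     "Cold", "High", "TRUE",  "No"],
    ["Overcast", "Hot",  "High", "TRUE",  "Yes"],
    ["Sunny",    "Hot",  "High", "FALSE", "No"],
]
data = pd.DataFrame(rows, columns=["Outlook", "Temperature", "Humidity", "Windy", "PlayTennis"])

hypothesis = None
for _, row in data.iterrows():
    if row["PlayTennis"] == "Yes":
        values = list(row)[:-1]
        # Start from the first positive example, then generalize mismatches to '?'
        hypothesis = values if hypothesis is None else [
            h if h == v else "?" for h, v in zip(hypothesis, values)]

print(hypothesis)  # ['?', '?', 'High', '?']
```

Only Humidity = High is shared by all positive examples, so it is the only attribute left specific.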
OUTPUT:
5. Develop a program to implement k-Nearest Neighbour algorithm to classify the randomly generated
100 values of x in the range of [0,1]. Perform the following based on the dataset generated:
1. Label the first 50 points {x1,……,x50} as follows: if (xi ≤ 0.5), then xi ∊ Class1, else xi ∊ Class2.
2. Classify the remaining points, x51,……,x100 using KNN. Perform this for k=1,2,3,4,5,20,30.

import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

# Generate random data


data = np.random.rand(100)
labels = ["Class1" if x <= 0.5 else "Class2" for x in data[:50]]

# k-NN classifier function


def knn(train_data, train_labels, test_point, k):
    distances = sorted(((abs(test_point - x), label) for x, label in zip(train_data, train_labels)),
                       key=lambda d: d[0])
    return Counter(label for _, label in distances[:k]).most_common(1)[0][0]

# Train and test data


train_data, train_labels, test_data = data[:50], labels, data[50:]

# k values specified in the experiment
k_values = [1, 2, 3, 4, 5, 20, 30]

# Perform classification and plot results


for k in k_values:
    predictions = [knn(train_data, train_labels, x, k) for x in test_data]

    plt.figure(figsize=(8, 5))
    plt.scatter(train_data, [0] * len(train_data),
                c=["blue" if l == "Class1" else "red" for l in train_labels],
                label="Train Data", marker="o")
    plt.scatter(test_data, [1] * len(test_data),
                c=["blue" if p == "Class1" else "red" for p in predictions],
                label="Test Data", marker="x")

    plt.title(f"k-NN (k={k}) Classification")
    plt.xlabel("Data Points")
    plt.ylabel("Classification Level")
    plt.legend()
    plt.grid(True)
    plt.show()
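Before plotting random data, the voting logic is worth a quick sanity check on hand-picked points. This sketch re-implements the same sort-then-vote logic under a hypothetical name, knn_1d, so it runs on its own:

```python
from collections import Counter

def knn_1d(train_data, train_labels, test_point, k):
    # Same logic as knn() above: sort by distance, majority vote among k nearest
    distances = sorted(((abs(test_point - x), label)
                        for x, label in zip(train_data, train_labels)),
                       key=lambda d: d[0])
    return Counter(label for _, label in distances[:k]).most_common(1)[0][0]

train = [0.1, 0.2, 0.8, 0.9]
labels = ["Class1", "Class1", "Class2", "Class2"]
print(knn_1d(train, labels, 0.25, 1))  # nearest neighbour is 0.2 -> Class1
print(knn_1d(train, labels, 0.7, 3))   # neighbours 0.8, 0.9, 0.2 vote -> Class2
```

With only four training points the majority vote is easy to verify by inspection.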
OUTPUT:
6. Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points.
Select appropriate data set for your experiment and draw graphs

import numpy as np
import matplotlib.pyplot as plt

def gaussian_kernel(x, xi, tau):
    return np.exp(-np.sum((x - xi) ** 2) / (2 * tau ** 2))

def locally_weighted_regression(x, X, y, tau):
    m = X.shape[0]
    weights = np.array([gaussian_kernel(x, X[i], tau) for i in range(m)])
    W = np.diag(weights)
    X_transpose_W = X.T @ W
    theta = np.linalg.inv(X_transpose_W @ X) @ X_transpose_W @ y
    return x @ theta

np.random.seed(42)
X = np.linspace(0, 2 * np.pi, 100)
y = np.sin(X) + 0.1 * np.random.randn(100)
X_bias = np.c_[np.ones(X.shape), X]

x_test = np.linspace(0, 2 * np.pi, 200)


x_test_bias = np.c_[np.ones(x_test.shape), x_test]
tau = 0.5
y_pred = np.array([locally_weighted_regression(xi, X_bias, y, tau) for xi in x_test_bias])

plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='red', label='Training Data', alpha=0.7)
plt.plot(x_test, y_pred, color='blue', label=f'LWR Fit (tau={tau})', linewidth=2)
plt.xlabel('X', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Locally Weighted Regression', fontsize=14)
plt.legend(fontsize=10)
plt.grid(alpha=0.3)
plt.show()
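The behaviour of the Gaussian kernel is the heart of the algorithm: a query point weights its own neighbourhood close to 1 and far points close to 0, which is what makes the fit local. A small isolated check of the same kernel:

```python
import numpy as np

def gaussian_kernel(x, xi, tau):
    return np.exp(-np.sum((x - xi) ** 2) / (2 * tau ** 2))

tau = 0.5
w_self = gaussian_kernel(np.array([1.0]), np.array([1.0]), tau)  # distance 0 -> weight 1
w_near = gaussian_kernel(np.array([1.0]), np.array([1.2]), tau)  # small distance -> near 1
w_far  = gaussian_kernel(np.array([1.0]), np.array([3.0]), tau)  # large distance -> near 0
print(w_self, w_near, w_far)
```

Smaller tau narrows the neighbourhood, producing a wigglier fit; larger tau approaches ordinary least squares.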
OUTPUT:
7. Develop a program to demonstrate the working of Linear Regression and Polynomial
Regression. Use Boston Housing Dataset for Linear Regression and Auto MPG Dataset (for
vehicle fuel efficiency prediction) for Polynomial Regression.
(Note: load_boston was removed from scikit-learn 1.2, so the program below substitutes the
California Housing dataset for the linear-regression part.)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score

def train_plot(X, y, model, title, xlabel, ylabel):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    plt.scatter(X_test, y_test, color="blue", label="Actual")
    plt.scatter(X_test, y_pred, color="red", label="Predicted", alpha=0.5)
    plt.xlabel(xlabel), plt.ylabel(ylabel), plt.title(title)
    plt.legend(), plt.show()
    print(f"{title} | MSE: {mean_squared_error(y_test, y_pred):.4f} | R²: {r2_score(y_test, y_pred):.4f}\n")

# Load datasets
housing = fetch_california_housing(as_frame=True)
auto_mpg = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
                       sep=r'\s+',
                       names=["mpg", "cylinders", "displacement", "horsepower",
                              "weight", "acceleration", "model_year", "origin"],
                       na_values="?").dropna()

# Train & plot


train_plot(housing.data[["AveRooms"]], housing.target, LinearRegression(),
           "Linear Regression - California Housing", "Avg Rooms", "Median Home Value")
train_plot(auto_mpg[["displacement"]], auto_mpg["mpg"],
           make_pipeline(PolynomialFeatures(2), StandardScaler(), LinearRegression()),
           "Polynomial Regression - Auto MPG", "Displacement", "MPG")
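What PolynomialFeatures(2) actually feeds into the linear model can be seen on a toy input: a degree-2 expansion of a single column adds a bias column and a squared term, and the "polynomial" fit is then just linear regression on these expanded features:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0], [3.0]])
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
print(X_poly)  # each row becomes [1, x, x^2]
```

Passing include_bias=False would drop the leading column of ones when the downstream model fits its own intercept.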
OUTPUT:
8. Develop a program to demonstrate the working of the decision tree algorithm. Use Breast
Cancer Data set for building the decision tree and apply this knowledge to classify a new
sample

# Importing necessary libraries


import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)


print(f"Model Accuracy: {accuracy * 100:.2f}%")
new_sample = np.array([X_test[0]])
prediction = clf.predict(new_sample)

prediction_class = "Benign" if prediction == 1 else "Malignant"


print(f"Predicted Class for the new sample: {prediction_class}")

plt.figure(figsize=(12,8))
tree.plot_tree(clf, filled=True, feature_names=data.feature_names,
class_names=data.target_names)
plt.title("Decision Tree - Breast Cancer Dataset")
plt.show()
OUTPUT:
9. Develop a program to implement the Naive Bayesian classifier considering Olivetti Face Data
set for training. Compute the accuracy of the classifier, considering a few test data sets

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load dataset & split


X, y = fetch_olivetti_faces(shuffle=True, random_state=42, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)

# Train & predict


gnb = GaussianNB().fit(X_train, y_train)
y_pred = gnb.predict(X_test)

# Print metrics
print(f'Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%')
print("\nClassification Report:\n", classification_report(y_test, y_pred, zero_division=1))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print(f'\nCross-validation accuracy: {cross_val_score(gnb, X, y, cv=5, scoring="accuracy").mean() * 100:.2f}%')

# Display sample predictions


fig, axes = plt.subplots(3, 5, figsize=(12, 8))
for ax, img, label, pred in zip(axes.ravel(), X_test, y_test, y_pred):
    ax.imshow(img.reshape(64, 64), cmap="gray")
    ax.set_title(f"True: {label}, Pred: {pred}")
    ax.axis('off')

plt.show()
OUTPUT:
10. Develop a program to implement k-means clustering using Wisconsin Breast Cancer data
set and visualize the clustering result.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, classification_report

# Load & preprocess data


X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# Apply K-Means
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_scaled)
df = pd.DataFrame(X_pca, columns=['PC1', 'PC2']).assign(Cluster=kmeans.labels_, True_Label=y)

# Print evaluation metrics


print(f"Confusion Matrix:\n{confusion_matrix(y, kmeans.labels_)}\n")
print(f"Classification Report:\n{classification_report(y, kmeans.labels_)}")

# Plot results
for label, title in zip(['Cluster', 'True_Label'], ['K-Means Clustering', 'True Labels']):
    sns.scatterplot(data=df, x='PC1', y='PC2', hue=label, palette='coolwarm', s=80,
                    edgecolor='black', alpha=0.7)
    plt.title(title), plt.legend(title=label), plt.show()

# Plot centroids (projected into the same 2-D PCA space)
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1', s=80,
                edgecolor='black', alpha=0.7)
plt.scatter(*PCA(n_components=2).fit(X_scaled).transform(kmeans.cluster_centers_).T,
            s=200, c='red', marker='X', label='Centroids')
plt.title('K-Means Clustering with Centroids'), plt.legend(), plt.show()
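Besides comparing cluster labels with the true diagnosis, an unsupervised quality measure such as the silhouette score evaluates the clustering without any labels at all. A sketch on the same scaled data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_scaled)

# Silhouette ranges from -1 (bad) to +1 (dense, well-separated clusters)
score = silhouette_score(X_scaled, labels)
print(f"Silhouette score for k=2: {score:.3f}")
```

Computing this score for several candidate k values is a common way to choose the number of clusters when true labels are unavailable.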
OUTPUT:
