
ABSTRACT

In an era defined by data, the convergence of mathematics and data science is reshaping
the landscape of research and industry alike. This project, titled "Computational
Mathematics in the Era of Data Science," investigates how fundamental mathematical
principles underpin modern data-driven methodologies. By exploring the theoretical
underpinnings of linear algebra, calculus, probability, and optimization, the project
demonstrates how these concepts are harnessed to solve complex problems across diverse
domains.

A central focus of the study is Principal Component Analysis (PCA), a powerful
technique used for dimensionality reduction. Through detailed mathematical derivations and
computational examples, the project illustrates how PCA can transform high-dimensional
datasets into more manageable forms while preserving essential variance. Applications in fields
such as healthcare, finance, artificial intelligence, and climate science are examined to highlight
both the versatility and impact of these techniques.

Furthermore, the project addresses contemporary challenges in computational
mathematics—such as scalability, interpretability, and data security—and outlines future
directions including quantum computing, explainable AI, and advanced optimization strategies.
These insights not only underscore the current capabilities of computational methods but also
pave the way for future innovations in data science.

Overall, this work emphasizes the critical role of mathematics in driving data science
advancements and provides a framework for future exploration at the intersection of these
disciplines.

INDEX

S.No    Title
1       Introduction
2       Basics of Data Science
2.1     Data Science
2.2     Key Components of Data Science
2.3     Tools and Technologies in Data Science
2.4     Types of Data Science Problems
2.5     Applications of Data Science
2.6     Challenges in Data Science
3       Mathematics in Data Science
3.1     Introduction
3.2     Key Mathematical Areas in Data Science
3.3     Linear Algebra
3.4     Probability and Statistics
3.5     Calculus
3.6     Optimization in Data Science
3.7     Discrete Mathematics
3.8     Role of Mathematics in Machine Learning
3.9     Conclusion
4       Principal Component Analysis
4.1     Introduction to PCA
4.2     The Mathematics Behind PCA
4.3     Steps to Perform PCA
4.4     Visualizing PCA
4.5     Applications of PCA
4.6     Advantages and Disadvantages of PCA
4.7     Conclusion
5       Mathematical Explanation and Python Coding for PCA
5.1     Explanation for 2 Dimensions
5.2     Explanation for 3 Dimensions
5.3     Explanation for n Dimensions
6       Applications and Future Directions
6.1     Applications of Computational Mathematics in Data Science
6.2     Future Directions
6.3     Challenges and Open Problems
6.4     Conclusion
7       Conclusion

CHAPTER 1
INTRODUCTION:
In today’s data-driven world, the fusion of mathematics and data science has
revolutionized how we analyze, interpret, and leverage vast amounts of information. The project
titled "Computational Mathematics in the Era of Data Science" delves into the critical role that
mathematical techniques play in transforming raw data into actionable insights. With the rapid
advancement of computational power and the explosion of data availability, traditional
mathematical concepts have found new applications in solving complex problems across diverse
fields such as healthcare, finance, artificial intelligence, and climate science.

At its core, this project explores how computational mathematics underpins key data
science methodologies, enabling more efficient and accurate decision-making processes. From
linear algebra and calculus to probability, statistics, and optimization, the mathematical
foundations discussed herein are pivotal for developing algorithms that drive machine learning,
pattern recognition, and predictive modeling.

A focal point of the project is Principal Component Analysis (PCA), a powerful
technique used for dimensionality reduction. PCA exemplifies how mathematical computations
can simplify large datasets while preserving the most significant features, thereby enhancing
both computational efficiency and interpretability. Through a detailed examination of PCA,
including its derivation, computational steps, and real-world applications, this project highlights
the indispensable synergy between mathematics and data science.
Furthermore, the project discusses the broader implications of integrating advanced
mathematical techniques with modern computational frameworks. It addresses current
challenges, such as scalability and interpretability, and explores emerging trends and future
directions that promise to further transform the landscape of data science.
In summary, "Computational Mathematics in the Era of Data Science" presents a
comprehensive study that not only reinforces the importance of mathematical principles in data
analysis but also paves the way for future innovations at the intersection of computation and
science.

CHAPTER 2

BASICS OF DATA SCIENCE

2.1. Data Science:
Data science is an interdisciplinary field that combines statistical techniques,
mathematical models, and computational methods to analyze and interpret complex data. It aims
to extract meaningful insights, patterns, and knowledge from structured and unstructured data.

Structured Data:
Structured data refers to data that is organized in a defined manner, typically in rows and
columns, making it easy to store, query, and analyze. It follows a specific schema or format which
allows for easy access and manipulation. This type of data is often found in relational databases
or spreadsheets.
Example:

Employee ID   Name    Department   Salary
101           Alice   HR           55000
102           Bob     IT           70000

Employee data [ID, Name, Department, Salary]
Unstructured data:
Unstructured data refers to data that does not have a predefined format or structure. It is
often text-heavy and can come in various formats such as text files, images, audio and video.
Unstructured data lacks the organization of structured data, which makes it more challenging to
process and analyze directly.

Example:
Text documents (articles, blogs, social media posts)
Images and video (photos, movies, YouTube)
Audio (podcasts, music)
Web pages (HTML, JavaScript content)

Evolution:
The field has evolved from statistics and data mining to a full-fledged discipline involving
big data, machine learning, artificial intelligence (AI) and data visualization. The rise of digital
technology and the explosion of data availability have driven the growth of data science.

Importance:
Data science helps businesses, governments and scientists make informed decisions by
analyzing large datasets, identifying trends, making predictions and optimizing processes.
2.2. Key Components of Data Science:
Data science is a combination of several core components, which include:

Data collection:
 Data can come from various sources such as sensors, databases, social media,
surveys and transactional records.
 In the modern world, large-scale data generation has led to the concept of big data.

Data cleaning and preprocessing:

Raw data often contain errors, inconsistencies and missing values; data preprocessing ensures
the dataset is clean and suitable for analysis.

Common techniques include handling missing data, outlier detection, normalization and
encoding categorical variables, as illustrated in the sketch below.
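
As an illustration of these steps, a minimal pandas sketch on a small hypothetical table might look like this:

import pandas as pd

# Hypothetical raw data with a missing value, an outlier and a categorical column.
df = pd.DataFrame({
    "age": [25, None, 47, 35],                      # missing value
    "salary": [50000, 64000, 120000, 58000],        # 120000 treated as an outlier below
    "department": ["HR", "IT", "IT", "Finance"],    # categorical variable
})

df["age"] = df["age"].fillna(df["age"].median())              # handle missing data
df = df[df["salary"] < df["salary"].quantile(0.99)]           # crude outlier filter
df["salary"] = (df["salary"] - df["salary"].mean()) / df["salary"].std()   # normalization
df = pd.get_dummies(df, columns=["department"])               # encode the categorical variable
print(df)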

Data Analysis:
Data analysis involves statistical methods, hypothesis testing and exploratory data
analysis (EDA) to uncover patterns, trends and correlations.

Descriptive Statistics:
Measures of central tendency (mean, median), dispersion (variance, standard deviation)
and distribution (skewness, kurtosis).

Inferential Statistics:
Estimating population parameters from sample data, testing hypotheses and making
predictions about future trends.

Data Visualization:
Visualization tools and techniques help to communicate findings from the data analysis
effectively.
2.3. Tools and Technologies in Data Science:
Data science requires a set of powerful tools for collecting, processing and analyzing
data. Some of the commonly used tools include:

Programming languages:
Python, R and SQL are the most widely used languages in data science. Python is
particularly favored for its rich ecosystem of libraries such as pandas, NumPy and SciPy.

Statistical Analysis:
SciPy, statsmodels (for Python) and R provide extensive libraries for statistical analysis.

2.4. Types of Data Science Problems:


Data science can be applied to solve different kinds of problems such as

Classification:
Assigning data to predefined categories (e.g., spam vs. not spam email)

Regression:
Predicting continuous values based on historical data (e.g., predicting housing prices)

Clustering:
Grouping similar data points together (e.g., customer segmentation).

Recommendation:
Suggesting items to users based on their preferences (e.g., movie recommendations on
Netflix).

Anomaly Detection:

Identifying outliers or unusual patterns in data (e.g., fraud detection).

2.5. Applications of Data Science


Data science is used across various industries and domains to solve real-world
problems:

Healthcare:
Predictive modeling for disease outbreaks, patient diagnosis, personalised treatment
plans and medical image analysis.

Finance:
Fraud detection, stock market prediction, credit scoring, risk management and
algorithmic trading.

E-Commerce:
Recommendation systems, inventory management, sales forecasting and personalised
marketing.

Social Media:
Sentiment analysis, trend prediction and social network analysis.

Manufacturing:
Predictive maintenance, process optimization and quality control.

2.6. Challenges in Data Science

Data Quality:
Raw data may be noisy, inconsistent or incomplete. Data scientists must ensure the data is
clean and reliable before analysis.

Data Privacy and Ethics:


Ensuring that data collection and analysis respect user privacy and follow ethical
guidelines.

Scalability:
Managing and processing large datasets efficiently.

Interpretability:
Making machine learning models understandable to non-experts and ensuring their decisions
are transparent and explainable.

Summary of Chapter 2
This chapter introduced the fundamental concepts of data science, emphasizing the
importance of mathematical, computational, and statistical techniques in extracting value from
data. By understanding the key components, tools and techniques used in data science, students
and practitioners can appreciate how data science applies to real-world problems and industries.

Chapter 3
Mathematics in Data Science

3.1. Introduction

Overview:
Mathematics plays a central role in data science, forming the foundation for algorithms
and models used in data analysis, machine learning, and statistical inference. In this chapter, we
explore the mathematical concepts and techniques essential for data science.

Importance of Mathematics in Data Science:


Mathematical methods enable data scientists to extract meaningful insights from data,
make predictions and optimize solutions. These methods are essential for creating accurate
models and making data-driven decisions.

3.2. Key Mathematical Areas in Data Science


Data science draws upon various fields of mathematics, including:

Linear Algebra:
Provides the tools for dealing with data in matrix and vector form, which is central to
machine learning and data manipulation.

Statistics and Probability:


The foundation of data analysis and inference, helping to understand data distributions,
model uncertainty and test hypotheses.

Calculus:
Used for optimizing models, particularly in machine learning algorithms, where we need
to minimize or maximize objective functions (e.g., gradient descent).
Optimization:
Mathematical optimization techniques are crucial in finding the best solution for various
problems like classification, regression or clustering.

Discrete Mathematics:
Relevant for graph theory, combinatorics and algorithms that deal with non-continuous
data (such as network analysis and pathfinding).

3.3 Linear Algebra in Data Science
Linear algebra is one of the most fundamental branches of mathematics used in data
science, especially in machine learning, deep learning and data processing.

Some key concepts include:

Vectors and Matrices:


Represent data points, features, and relationships between data. Vectors are used for
representing individual data points and matrices represent datasets where each row corresponds
to a data point and each column corresponds to a feature.

Eigenvalues and Eigenvectors:


Used in Principal Component Analysis (PCA) for dimensionality reduction. Eigenvectors
represent the direction of data variation and eigenvalues represent the magnitude of that
variation.

Singular Value Decomposition (SVD):

Used in matrix factorization and data compression techniques.

Applications:
Linear regression, machine learning algorithms like support vector machines (SVMs) and
neural networks.

Example:
Representing data points as vectors in a high-dimensional space (e.g., a dataset with 10
features becomes a 10-dimensional vector).
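
As a small illustration, a NumPy sketch (with made-up numbers) that stores a dataset as a matrix and computes the eigenvalues and eigenvectors of its covariance matrix:

import numpy as np

# Four data points (rows) with three features (columns).
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 1.0, 0.5],
    [3.0, 4.0, 2.0],
    [0.5, 1.5, 1.0],
])

cov = np.cov(X.T)                              # 3x3 covariance matrix of the features
eigenvalues, eigenvectors = np.linalg.eig(cov)
print("Eigenvalues:", eigenvalues)             # magnitude of variation along each direction
print("Eigenvectors:\n", eigenvectors)         # directions of variation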

3.4. Probability and Statistics in Data Science


Probability and statistics form the backbone of data analysis and predictive modeling.
These concepts help data scientists quantify uncertainty, test hypotheses and make predictions
based on available data.

Descriptive Statistics:
Includes measures such as mean, median, mode, variance and standard deviation that
summarize the properties of a dataset.

Probability Distributions:
Probability distributions like Normal, Poisson and Binomial are used to model the
randomness and uncertainty inherent in data.

Bayesian Inference:
A method of statistical inference in which Bayes' theorem is used to update the probability
estimate for a hypothesis as more evidence or data becomes available.

Hypothesis Testing:
A procedure used to make inferences or decisions about a population based on sample data,
using tests like t-tests, chi-square tests and ANOVA.

Regression Analysis:

A statistical method for modeling the relationship between a dependent variable and one or
more independent variables (e.g., linear regression).

Applications:
Predictive modeling, hypothesis testing and risk assessment.

Example:
Using a normal distribution to model the heights of people in a population and
calculating the probability of selecting someone taller than 6 feet.
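
A short SciPy sketch of that example, assuming (purely for illustration) that heights have a mean of 5.75 ft and a standard deviation of 0.25 ft:

from scipy.stats import norm

mean_height, std_height = 5.75, 0.25            # assumed parameters of the height distribution
p_taller_than_6ft = 1 - norm.cdf(6.0, loc=mean_height, scale=std_height)
print(f"P(height > 6 ft) = {p_taller_than_6ft:.3f}")   # about 0.159 for these assumed values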

3.5. Calculus in Data Science


Calculus is essential for understanding how algorithms optimize themselves to fit data.
The primary role of calculus in data science is in optimization, which is the process of finding
the best parameters for a model to minimize or maximize a certain function.

Differentiation:
In machine learning algorithms like gradient descent, differentiation helps in adjusting
model parameters to minimize the loss function. The gradient represents the direction of the
steepest increase and we move against it to minimize the function.

Integration:
Helps in calculating areas under curves, which is essential in probability theory and in
determining certain aspects of data distributions.

Optimization:
Finding minima or maxima of functions (e.g., the least squares method in regression,
minimizing error in classification).

Example:
Using the derivative of the loss function to adjust weights in a neural network during
training.
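
A minimal illustration of this idea, applying gradient descent to a single weight with a simple squared-error loss (all numbers invented):

# Minimize L(w) = (w - 3)^2, whose derivative is dL/dw = 2(w - 3).
w = 0.0               # initial weight
learning_rate = 0.1
for step in range(50):
    gradient = 2 * (w - 3)            # derivative of the loss at the current weight
    w -= learning_rate * gradient     # move against the gradient
print(w)   # approaches 3, the minimizer of the loss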

3.6. Optimization in Data Science


Optimization techniques are at the heart of most machine learning algorithms, where we
aim to minimize or maximize a function to find the best model or prediction.

Linear and Nonlinear Optimization:


Linear optimization focuses on linear relationships, while nonlinear optimization deals
with more complex, real-world problems.

Convex Optimization:
A subset of optimization where the function being optimized is convex (i.e., it has a
single global minimum).

Gradient Descent:
An iterative optimization technique used to minimize a loss function in many machine
learning algorithms. It adjusts parameters (such as weights in a neural network) based on the
gradient of the function.

Stochastic Gradient Descent (SGD):


A variation of gradient descent where updates are made using random subsets of data,
making it more computationally efficient.

Example:
Using gradient descent to minimize the loss function in a logistic regression model to
predict binary outcomes.
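
A rough sketch of that example, fitting a one-feature logistic regression with stochastic gradient descent on synthetic data (all values invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)                          # one synthetic feature
y = (x + 0.5 * rng.normal(size=200) > 0) * 1.0    # synthetic binary labels

w, b, lr = 0.0, 0.0, 0.1
for epoch in range(20):
    for i in rng.permutation(len(x)):             # one randomly chosen sample at a time (SGD)
        p = 1.0 / (1.0 + np.exp(-(w * x[i] + b)))   # sigmoid prediction
        w -= lr * (p - y[i]) * x[i]               # gradient of the log-loss for this sample
        b -= lr * (p - y[i])
print("Learned weight and bias:", w, b)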

3.7. Discrete Mathematics in Data Science


Discrete mathematics plays a role in several areas of data science, especially in graph theory
and combinatorics.
Graph Theory:
Helps in analyzing networks, social media graphs, recommendation systems and
understanding relationships between entities (e.g., nodes and edges in a graph).

Combinatorics:
The study of counting, arrangement and combination of objects. It is used in algorithm
analysis, data structure design, and optimization problems.

Algorithms and Data Structures:


Understanding the mathematical properties of algorithms and choosing the appropriate
data structures for efficient data storage and retrieval.

Example:
Analyzing social media networks using graph theory to identify influencers (nodes) and
their connections (edges).
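
A small sketch of that kind of analysis using the networkx library (not otherwise used in this project) with invented user names:

import networkx as nx

# Toy social graph: nodes are users, edges are connections.
G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"), ("alice", "carol"), ("alice", "dave"),
    ("bob", "carol"), ("eve", "alice"),
])

centrality = nx.degree_centrality(G)       # fraction of other users each node is connected to
influencer = max(centrality, key=centrality.get)
print(centrality)
print("Most connected user:", influencer)  # "alice" in this toy graph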

3.8. Role of Mathematics in Machine Learning


Mathematics is the foundation of machine learning algorithms, helping in tasks like model
selection, training and validation. Key mathematical concepts such as matrices, vectors,
probability and optimization are used to train models and make predictions.

Supervised Learning:
Involves learning from labeled data. Mathematics is used to compute the best fit line (in
linear regression) or decision boundary (in SVMs).

Unsupervised Learning:
Involves finding hidden patterns or structures in data. Concepts like clustering and
dimensionality reduction (e.g., PCA) rely on mathematical foundations.

Deep Learning:
Involves complex mathematical computations, especially in neural networks, where
calculus and linear algebra are used to adjust the weights and biases.

3.9. Conclusion
In this chapter, we have explored the essential mathematical tools and concepts used in
data science. Linear algebra, probability, statistics, calculus and optimization are foundational to
the development and understanding of data science algorithms and models. A strong grasp of
these mathematical principles is essential for anyone wishing to excel in data science and its
applications.

Chapter 4
Principal Component Analysis (PCA)

4.1. Introduction to PCA

Overview:
Principal Component Analysis (PCA) is a statistical technique used for
dimensionality reduction while preserving as much variance (information) as possible.
It is widely used in data science for reducing the complexity of high dimensional data.

Purpose of PCA:
PCA helps to transform a large set of variables into a smaller one, called
principal components, without losing significant information. This is especially useful
in machine learning for improving model performance, reducing overfitting and
speeding up computation.

4.2. The Mathematics Behind PCA


PCA works by finding the directions (principal components) along which the
variance of the data is maximized. Here’s the basic process:

Step 1: Standardize the Data


The first step is to standardize the dataset so that each feature has a mean of 0
and a standard deviation of 1. This ensures that all features contribute equally to the
analysis.

Formula for standardization:

X' = (X − μ) / σ

where X is the original data, μ is the mean of the feature and σ is the standard deviation.
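
For instance, standardizing each column of a small NumPy array (illustrative values):

import numpy as np

X = np.array([[170.0, 65.0], [180.0, 80.0], [160.0, 55.0]])   # e.g. heights and weights
X_std = (X - X.mean(axis=0)) / X.std(axis=0)                  # each column: mean 0, std 1
print(X_std)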

Step 2: Covariance Matrix


PCA finds the covariance matrix of the data, which represents how the features relate
to each other.

The covariance matrix is calculated as:

C = (1 / (n − 1)) ∑_{i=1}^{n} (x_i − x̄)(x_i − x̄)ᵀ

where x_i is a data point and x̄ is the mean of the data.

Step 3: Eigenvectors and Eigenvalues


PCA involves calculating the eigenvectors and eigenvalues of the covariance matrix.

Eigenvectors represent the directions of maximum variance (principal components) and


eigenvalues indicate the magnitude of variance along each eigenvector.

Step 4: Form the Principal Components


The principal components are the eigenvectors corresponding to the largest eigenvalues.
These components form a new basis for the data.

The data is projected onto these principal components to reduce the dimensionality while
retaining as much variance as possible

4.3. Steps to Perform PCA

Step 1: Standardize the Data:


Normalize the data if the features have different scales.

Step 2: Compute the Covariance Matrix:


Calculate the covariance matrix to understand the relationships between the variables.

Step 3: Calculate Eigenvectors and Eigenvalues:


Find the eigenvectors and eigenvalues of the covariance matrix.

Step 4: Sort Eigenvectors:


Sort the eigenvectors by their corresponding eigenvalues in descending order.

Step 5: Project Data onto Principal Components:


Select the top k eigenvectors (principal components) and project the data onto these
components to reduce dimensionality, as illustrated in the sketch below.
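
A compact NumPy sketch of these five steps on a small made-up data matrix:

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2]])   # toy data

# Step 1: standardize each feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Step 2: covariance matrix of the standardized data.
C = np.cov(X_std.T)

# Step 3: eigenvectors and eigenvalues.
eigvals, eigvecs = np.linalg.eig(C)

# Step 4: sort eigenvectors by descending eigenvalue.
order = eigvals.argsort()[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: project onto the top k principal components.
k = 1
scores = X_std @ eigvecs[:, :k]
print(scores)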
4.4. Visualizing PCA
To better understand PCA, let's visualize its application using a simple 2D dataset.

Example: Consider the following dataset with two features: X and Y.

X Y
2.5 2.4
0.5 0.7
2.2 2.9
1.9 2.2
3.1 3.0
2.3 2.7
2.0 1.6
1.0 1.1

Step-by-step visual representation:

1. Original Data Points (Before PCA):

The scatter plot shows the data points in their original 2D space. (The image here is a
placeholder; the plot can be generated with matplotlib, as in the sketch below.)
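
A minimal matplotlib sketch for producing that scatter plot from the table above:

import matplotlib.pyplot as plt

x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1]

plt.scatter(x, y, color='blue')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Original Data Points (Before PCA)')
plt.grid(True)
plt.show()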

2.Standardized Data:
After standardization, each feature is rescaled to have a mean of 0 and a standard
deviation of 1.

Standardized Data Points (Illustrative example):

3.Covariance Matrix:
The covariance matrix indicates the relationship between the two features, showing if
there’s a linear relationship between them. PCA seeks to find directions that explain the
maximum variance, which in this case might be a line or vector that best captures the data's
spread.

Covariance Matrix (Illustrative table):

X Y
X 1.0 0.9
Y 0.9 1.0

4.Eigenvectors and Eigenvalues:


The eigenvector of the covariance matrix is the line that best fits the data and the
eigenvalue indicates how much variance the data has along that direction.

5.Projecting Data onto Principal Components:
By projecting the data onto the principal component (the eigenvector with the largest
eigenvalue), we can reduce the dimensionality of the data from 2D to 1D while retaining as
much of the variance as possible.

Projected Data:
The data points are now projected onto the principal component (a single line in the case
of 2D data)

4.5. Applications of PCA


PCA is widely used across many fields due to its ability to reduce the dimensionality of
data while retaining essential information. Some common applications include:

Image Compression:
Reducing the size of images while retaining key features.

Face Recognition:
Using PCA to extract key facial features and reduce the number of dimensions.

Speech Recognition:
Reducing dimensionality in audio data for more efficient processing.

Data Visualization:
Visualizing high-dimensional data in lower dimensions (e.g., 2D or 3D) for easier
interpretation.

Genetics:
Analyzing gene expression data and reducing noise for clearer insights.

4.6. Advantages and Disadvantages of PCA

Advantages:

Dimensionality Reduction:
PCA reduces the number of variables while preserving essential
information.

Noise Reduction:
By focusing on the principal components with the highest variance, PCA helps to
reduce noise in the data.

Improved Computation:
Reduced dimensions often lead to faster processing times in machine learning
models.

Disadvantages:

Interpretability:
The transformed data (principal components) may be harder to interpret because
they are linear combinations of the original features.

Assumption of Linearity:
PCA assumes that the data is linearly related, which may not be the case in all
scenarios.

Loss of Information:
Although PCA aims to preserve variance, some information is inevitably lost when
reducing dimensions.

4.7. Conclusion
Principal Component Analysis (PCA) is a powerful technique for dimensionality
reduction, enabling the extraction of key features from complex datasets. Its ability to reduce
data while retaining essential information makes it an invaluable tool in data science, particularly
in machine learning, pattern recognition and data visualization. However, it is important to
understand its limitations and the assumptions behind the method.

Chapter 5
Mathematical Explanation and Python Coding for PCA

5.1. Mathematical Explanation for 2D


Suppose we have marks for four students:

Student Math English

Student A 85 78
Student B 90 88
Student C 75 70
Student D 95 92

Each student is represented by a 2D point (x,y) where x is the Math mark and y is the
English mark.

Step 1. Data Centering


μ_Math = (85 + 90 + 75 + 95) / 4 = 86.25

μ_English = (78 + 88 + 70 + 92) / 4 = 82

Subtract the mean from each observation

 Student A: (85 − 86.25, 78 − 82) = (−1.25, −4)
 Student B: (90 − 86.25, 88 − 82) = (3.75, 6)
 Student C: (75 − 86.25, 70 − 82) = (−11.25, −12)
 Student D: (95 − 86.25, 92 − 82) = (8.75, 10)

Step 2. Form the Covariance Matrix


The sample covariance matrix S for the centered data (with n=4 samples) is given by

S = (1 / (n − 1)) ∑_{i=1}^{n} x_i x_iᵀ.

Compute Variances and Covariance:


1. Variance in Math (x):

(−1.25)² + (3.75)² + (−11.25)² + (8.75)² = 1.5625 + 14.0625 + 126.5625 + 76.5625 = 218.75

Dividing by 3 (since n − 1 = 3):

σ²_Math = 218.75 / 3 ≈ 72.92.

2. Variance in English (y):

(−4)² + (6)² + (−12)² + (10)² = 16 + 36 + 144 + 100 = 296

Thus,

σ²_English = 296 / 3 ≈ 98.67.

3. Covariance between Math and English:

(−1.25)(−4) + (3.75)(6) + (−11.25)(−12) + (8.75)(10) = 5 + 22.5 + 135 + 87.5 = 250

Dividing by 3:

σ_Math,English = 250 / 3 ≈ 83.33.

So the covariance matrix is

S = [ 72.92   83.33 ]
    [ 83.33   98.67 ]

Step 3. Eigen Decomposition

We solve for the eigenvalues λ from

det(S − λI) = 0.

That is,

det [ 72.92 − λ     83.33     ]  =  (72.92 − λ)(98.67 − λ) − (83.33)² = 0.
    [ 83.33      98.67 − λ    ]

Multiplying out (approximate values):

(72.92 − λ)(98.67 − λ) ≈ 7194.44 − 171.58 λ + λ²

and since 83.33² ≈ 6944.44, the equation becomes

λ² − 171.58 λ + 7194.44 − 6944.44 = λ² − 171.58 λ + 250 = 0.

Solve using the quadratic formula:

λ = (171.58 ± √((171.58)² − 4 × 250)) / 2.

The discriminant is

(171.58)² − 1000 ≈ 29439.5 − 1000 = 28439.5,

so

√28439.5 ≈ 168.66.

Thus, the eigenvalues are approximately:

λ1 ≈ (171.58 + 168.66) / 2 ≈ 170.12,    λ2 ≈ (171.58 − 168.66) / 2 ≈ 1.46.

Eigenvectors:

For λ1 ≈ 170.12, solve

[ 72.92 − λ     83.33     ] v = 0.
[ 83.33      98.67 − λ    ]

This becomes

[ −97.20     83.33 ] [ v1 ]  =  [ 0 ]
[  83.33    −71.45 ] [ v2 ]     [ 0 ]

From the first row:

−97.20 v1 + 83.33 v2 = 0  ⟹  v2 = (97.20 / 83.33) v1 ≈ 1.1665 v1.

Choosing v1 = 1, the (unnormalized) eigenvector is (1, 1.1665). Its norm is

‖v‖ = √((1)² + (1.1665)²) ≈ √(1 + 1.3606) ≈ 1.5365.

Thus, the normalized eigenvector (first principal component) is

v1 ≈ (1 / 1.5365, 1.1665 / 1.5365) ≈ (0.6506, 0.7588).

The second eigenvector (for λ2) will be orthogonal; for instance, it may be approximately

v2 ≈ (0.7588, −0.6506).

Here, PC1 (the direction of maximum variance) is along v1.

Step 4. Projection onto Principal Components

For each centered data point x, the coordinate along PC1 is given by the dot product

t = v1ᵀ x.

For example, for Student A’s centered data (−1.25, −4):

tA = (0.6506, 0.7588) ⋅ (−1.25, −4) ≈ 0.6506 × (−1.25) + 0.7588 × (−4) ≈ −3.85.

This projection gives the new coordinate along the principal axis that explains
most of the variance in marks.

Part 2. Python code Implementation

import numpy as np
import matplotlib.pyplot as plt

# Data: rows represent students; columns represent marks in Math and English.
# Students: A, B, C, D
X = np.array([
    [85, 78],
    [90, 88],
    [75, 70],
    [95, 92]
])

# Step 1: Center the data by subtracting the mean of each subject.
mean_marks = np.mean(X, axis=0)   # compute column means
X_centered = X - mean_marks
print("Mean Marks:", mean_marks)
print("Centered Data:\n", X_centered)

# Step 2: Compute the covariance matrix.
# Note: np.cov expects variables in rows by default, so we transpose.
cov_matrix = np.cov(X_centered.T, bias=False)
print("Covariance Matrix:\n", cov_matrix)

# Step 3: Perform eigen decomposition of the covariance matrix.
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:\n", eigenvectors)

# Sort the eigenvalues (and corresponding eigenvectors) in descending order.
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
print("Sorted Eigenvalues:", eigenvalues)
print("Sorted Eigenvectors:\n", eigenvectors)

# Step 4: Project the centered data onto the principal components.
# Here we compute the projection onto PC1.
PC1 = eigenvectors[:, 0]
projections = X_centered @ PC1   # dot product along PC1
print("Projection onto PC1:", projections)

# For visualization, we can also project onto both PCs.
X_projected = X_centered @ eigenvectors   # each column is a principal component score

# Plot the centered data and the principal component axes.
plt.figure(figsize=(6, 6))
plt.scatter(X_centered[:, 0], X_centered[:, 1], color='blue', label='Centered Data')

# Plot the mean (origin after centering).
plt.scatter(0, 0, color='black', marker='x', s=100, label='Origin')

# Plot the principal axes (scaled for visualization).
# PC1: eigenvector corresponding to the largest eigenvalue.
pc1_line = np.array([-PC1, PC1]) * 20   # scale factor for display
plt.plot(pc1_line[:, 0], pc1_line[:, 1], color='red', label='PC1')

# PC2: second principal component.
pc2_line = np.array([-eigenvectors[:, 1], eigenvectors[:, 1]]) * 20
plt.plot(pc2_line[:, 0], pc2_line[:, 1], color='green', label='PC2')

plt.xlabel('Math (centered)')
plt.ylabel('English (centered)')
plt.title("PCA of Students' Marks")
plt.legend()
plt.grid(True)
plt.axis('equal')
plt.show()

OUTPUT:

Explanation of the Code

1.Data Preparation:
The marks are stored in a NumPy array where each row is a student’s marks in Math and
English.

2.Centering:
The mean of each column is subtracted so that the data is centered at the origin.

3.Covariance Matrix:
We compute the covariance matrix using the centered data.

4.Eigen Decomposition:
We use NumPy’s np.linalg.eig to obtain the eigenvalues and eigenvectors of the
covariance matrix. We then sort them in descending order of eigenvalue magnitude.
5.Projection:
The centered data is projected onto the principal components by computing the dot product
with the eigenvectors.

6.Visualization:
The original centered data is plotted along with the PC1 and PC2 axes to show the
direction of maximum variance.
5.2. Mathematical Explanation for 3D

Consider the marks of four students in three subjects:

Student Tamil Science Social


A 80 85 75
B 90 95 85
C 70 75 65
D 85 80 80

Each student is represented as a 3-dimensional point

x = (Tamil, Science, Social)ᵀ.
Step 1. Data Centering

Compute the mean for each subject:

μ_Tamil = (80 + 90 + 70 + 85) / 4 = 81.25
μ_Science = (85 + 95 + 75 + 80) / 4 = 83.75
μ_Social = (75 + 85 + 65 + 80) / 4 = 76.25

Subtract the means from each observation:

 Student A: (80 − 81.25, 85 − 83.75, 75 − 76.25) = (−1.25, 1.25, −1.25)
 Student B: (90 − 81.25, 95 − 83.75, 85 − 76.25) = (8.75, 11.25, 8.75)
 Student C: (70 − 81.25, 75 − 83.75, 65 − 76.25) = (−11.25, −8.75, −11.25)
 Student D: (85 − 81.25, 80 − 83.75, 80 − 76.25) = (3.75, −3.75, 3.75)

Step 2. Form the Covariance Matrix

Dividing the sums of squared deviations (and cross-deviations) by n − 1 = 3 gives the variances
and covariances. For Tamil and Social:

Cov(Tamil, Social) = [(−1.25)(−1.25) + (8.75)(8.75) + (−11.25)(−11.25) + (3.75)(3.75)] / 3 ≈ 72.92

For Science and Social:

Cov(Science, Social) ≈ 60.42, and likewise Cov(Tamil, Science) ≈ 60.42, while each variance is
Var(Tamil) = Var(Science) = Var(Social) ≈ 72.92.

Thus, the covariance matrix is approximately

S ≈ [ 72.92   60.42   72.92 ]
    [ 60.42   72.92   60.42 ]
    [ 72.92   60.42   72.92 ]

Note: Because the first and third subjects (Tamil and Social) have identical centered deviations,
the matrix is singular (one eigenvalue will be nearly zero), indicating redundancy between those
subjects.

Step 3. Eigen Decomposition

Solve the equation det(S − λI) = 0 to obtain eigenvalues λ1, λ2, λ3 and corresponding
eigenvectors v1, v2, v3.

Without going through all algebraic details (which involve solving a cubic equation), one
typically finds:

A dominant eigenvalue (λ1 ≈ 202.3) whose eigenvector v1 lies approximately along the
direction of maximum variance.

A much smaller eigenvalue (λ2 ≈ 16.5) and a zero (or nearly zero) eigenvalue λ3 ≈ 0,
reflecting the redundancy between Tamil and Social.

For example, one obtains (after normalization):

v1 ≈ (0.59, 0.55, 0.59),   v2 ≈ (0.39, −0.83, 0.39),   v3 ≈ (0.71, 0, −0.71).

The first principal component (PC1) captures the maximum variance (here, nearly all the
variation), while the remaining components capture little additional information.
Step 4. Projection onto Principal Components

Each centered data point xc is projected onto the new axes by computing

t = vᵀ xc.

For instance, the projection onto PC1 is computed as:

t(1) = v1ᵀ xc.
Repeating this for each data point gives the transformed (or score) coordinates in the PCA
space.

Part 2. Python Code Implementation


Below is an annotated Python script that implements 3dimensional PCA on the student
marks dataset:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D   # for 3D plotting

# Example data: each row corresponds to a student; columns: Tamil, Science, Social.
X = np.array([
    [80, 85, 75],   # Student A
    [90, 95, 85],   # Student B
    [70, 75, 65],   # Student C
    [85, 80, 80]    # Student D
])

# Step 1: Center the data.
mean_marks = np.mean(X, axis=0)   # compute mean of each column (subject)
X_centered = X - mean_marks
print("Mean Marks:", mean_marks)
print("Centered Data:\n", X_centered)

# Step 2: Compute the covariance matrix.
# We use np.cov with rowvar=False (or transpose) because each column is a variable.
cov_matrix = np.cov(X_centered.T, bias=False)
print("Covariance Matrix:\n", cov_matrix)

# Step 3: Perfo…
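
The remaining steps of this script (eigen decomposition, sorting, projection onto the principal components, and a 3-D plot, as described in the explanation further below) could be completed along the following lines; this is a sketch that assumes the cov_matrix and X_centered variables defined above:

# Step 3: Perform eigen decomposition of the covariance matrix.
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# Sort the eigenvalues (and corresponding eigenvectors) in descending order.
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
print("Sorted Eigenvalues:", eigenvalues)
print("Sorted Eigenvectors:\n", eigenvectors)

# Step 4: Project the centered data onto the principal components.
X_projected = X_centered @ eigenvectors
print("Projected Data:\n", X_projected)

# Step 5: Plot the centered data with the principal axes drawn as arrows.
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_centered[:, 0], X_centered[:, 1], X_centered[:, 2],
           color='blue', label='Centered Data')
colors = ['red', 'green', 'orange']
for i in range(3):
    axis = eigenvectors[:, i] * 10   # scaled for visibility
    ax.quiver(0, 0, 0, axis[0], axis[1], axis[2], color=colors[i])
    ax.text(axis[0], axis[1], axis[2], f'PC{i + 1}', color=colors[i])
ax.set_xlabel('Tamil (centered)')
ax.set_ylabel('Science (centered)')
ax.set_zlabel('Social (centered)')
ax.legend()
plt.show()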

A more general scikit-learn example for an n-dimensional random dataset follows:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Example: create a random dataset with n observations and p variables.
# For illustration, let n = 100 observations and p = 5 features.
np.random.seed(42)
X = np.random.rand(100, 5)

# Step 1: Standardize the data.
# This scales each feature to have zero mean and unit variance.
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Steps 2 & 3: Compute PCA using scikit-learn.
# You can set n_components to any number <= p, or to a fraction of explained variance.
n_components = 3   # for example, reduce to 3 dimensions
pca = PCA(n_components=n_components)
T = pca.fit_transform(X_std)

print("Explained Variance Ratio:", pca.explained_variance_ratio_)
print("Principal Components (each row is one PC):\n", pca.components_)
print("Transformed Data Shape:", T.shape)

# (Optional) Visualize the explained variance.
import matplotlib.pyplot as plt
plt.figure(figsize=(6, 4))
plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by PCA Components')
plt.grid(True)
plt.show()

OUTPUT:

Explanation of the Python Code

1. Data Preparation & Centering:


The marks are stored in an array and then centered by subtracting the mean for each subject.

2. Covariance Matrix Calculation:


Using NumPy’s np.cov , we compute the 3×3 covariance matrix from the centered data.

3. Eigen Decomposition:
We extract eigenvalues and eigenvectors with np.linalg.eig and sort them so that the first
principal component corresponds to the largest eigenvalue.

4. Projection:
The centered data is projected onto the principal components by taking the dot product
with the eigenvector matrix.

5. Visualization:
A 3D scatter plot displays the centered data points. The principal axes (PC1, PC2, PC3)
are overlaid as arrows (scaled for visibility) originating at the origin.
Summary

 Mathematically:
The process starts by centering the 3-dimensional student marks data. The covariance
matrix is computed from the centered data, and its eigen decomposition yields the principal
components. The eigenvector with the highest eigenvalue (PC1) points in the direction of
greatest variance, while the other eigenvectors capture the remaining (often redundant)
variance.

 In Python:
The provided code shows how to compute the mean, covariance, eigenvalues/eigenvectors,
and finally project the data into the new PCA space. A 3D plot helps visualize both the data
and the principal directions.

5.3. Mathematical Explanation for N Dimensions

Principal Component Analysis (PCA) is a statistical technique used to reduce the
dimensionality of a dataset while preserving as much variability as possible. This is achieved by
transforming the original variables into a new set of uncorrelated variables called principal
components, which are ordered by the amount of variance they capture from the data.

Mathematical Explanation:

1. Standardize the Data:


For a dataset with n observations and p variables, construct a data matrix X of size n × p.

Standardize each variable to have a mean of zero and a standard deviation of one. This
ensures that each variable contributes equally to the analysis.

2. Compute the Covariance Matrix:

Calculate the covariance matrix Σ = (1 / (n − 1)) XᵀX, which is a p × p matrix representing the
covariances between each pair of variables.
3. Eigen Decomposition:

Perform eigen decomposition on the covariance matrix Σ to obtain eigenvalues and


eigenvectors.

The eigenvalues indicate the amount of variance captured by each principal


component, while the eigenvectors represent the directions of these components in the
feature space.

4. Select principal components


Sort the eigenvalues in descending order and select the top k eigenvalues and their
corresponding eigenvectors.

The choice of k is based on the desired level of explained variance.

5. Transform the Data:


Project the original standardized data onto the selected eigenvectors to obtain the
principal component scores.

This results in a new dataset with reduced dimensionality.

Python Coding Implementation


import pandas as pd
import numpy as np

# Here we are using an inbuilt dataset of scikit-learn.
from sklearn.datasets import load_breast_cancer

# Instantiating
cancer = load_breast_cancer(as_frame=True)

# Creating the dataframe
df = cancer.frame

# Checking the shape
print('Original Dataframe shape :', df.shape)

# Input features
X = df[cancer['feature_names']]
print('Inputs Dataframe shape :', X.shape)

# Mean
X_mean = X.mean()

# Standard deviation
X_std = X.std()

# Standardization
Z = (X - X_mean) / X_std

# Covariance
c = Z.cov()

# Plot the covariance matrix
import matplotlib.pyplot as plt
import seaborn as sns
sns.heatmap(c)
plt.show()

OUTPUT

Explanation:

Data Standardization:
StandardScaler (or the manual standardization shown in the code above) is used to standardize
the dataset so that each feature has a mean of zero and a standard deviation of one.

PCA Transformation:
The PCA class from scikit-learn is utilized to perform PCA. The n_components parameter
specifies the number of principal components to retain. The fit_transform method computes the
principal components and returns the transformed data.

Output:
explained_variance_ratio_ provides the proportion of variance explained by each principal
component.

components_ contains the principal axes in the feature space, representing the directions of
maximum variance.

This implementation can be extended to datasets of any dimensionality, with n_components
retaining the desired number of principal components. It is important to note that while PCA
reduces dimensionality, it may also lead to some loss of information. Therefore, it is crucial to
balance dimensionality reduction with the amount of variance retained in the data.
Chapter 6
Application and Future Directions

6.1. Applications of Computational Mathematics in Data Science


Computational mathematics plays a critical role in various fields of data science,
providing analytical techniques to extract insights, optimize processes and enhance
decision-making. Below are some of its key applications:

6.1.1 Healthcare
Medical Imaging: PCA is widely used to enhance medical images such as MRI scans by
reducing noise while preserving crucial features.

Disease Prediction: Machine learning models supported by computational mathematics
help predict diseases like diabetes and cancer based on patient data.

Genomics: PCA is used to analyze high-dimensional genetic data, identifying significant
genetic markers for diseases.

6.1.2 Finance
Risk Assessment: Statistical models, Monte Carlo simulations and optimization techniques are
used to assess financial risks.

Fraud Detection: Machine learning algorithms based on probability and statistics detect
fraudulent transactions by analyzing transaction patterns.

Portfolio Optimization: Techniques such as convex optimization and linear programming help
investors allocate resources efficiently to maximize returns.

6.1.3 Artificial Intelligence and Machine Learning


Feature Selection: PCA helps reduce high-dimensional datasets, allowing machine learning
models to focus on the most relevant features, improving efficiency.

Deep Learning: Computational mathematics supports neural network training by optimizing
weight updates using gradient-based methods.

Natural Language Processing (NLP): Mathematical models such as singular value
decomposition (SVD) improve text processing techniques like topic modeling and sentiment
analysis.

6.1.4 Climate Science and Environmental Monitoring
Climate Prediction: Computational mathematics helps in modeling climate change using
large-scale simulations and PCA-based data reduction techniques.

Disaster Forecasting: Data-driven models predict natural disasters like hurricanes and
earthquakes, helping mitigate their impact.

6.2 Future Directions in Computational Mathematics and Data Science


The future of computational mathematics in data science is shaped by emerging technologies and
evolving research areas. Below are some key future trends:

6.2.1 Quantum Computing


Quantum computing has the potential to revolutionize computational mathematics by
solving complex optimization and probabilistic problems exponentially faster than classical
computers. In data science, quantum algorithms could enable real-time analytics on massive
datasets.
6.2.2 Explainable AI (XAI)
As AI systems become more complex, ensuring their interpretability is crucial. Future
research will focus on developing mathematical models that make deep learning models
transparent and understandable.

6.2.3 Advanced Optimization Techniques

Optimization will continue to be a core focus, with advancements in convex optimization,
reinforcement learning and heuristic algorithms improving efficiency in solving large-scale
computational problems.

6.2.4 Scalable Data Processing

As datasets grow in size and complexity, future developments will focus on designing
scalable mathematical techniques for real-time data processing, ensuring computational
efficiency without compromising accuracy.

6.2.5 Ethical AI and Bias Reduction
Mathematical frameworks will be developed to minimize bias in AI models, ensuring
fairness in decision-making processes such as hiring, credit scoring and law enforcement
applications.

6.3 Challenges and Open Problems


Despite rapid advancements, several challenges remain in computational mathematics and
data science:

6.3.1 Computational Complexity


Many mathematical problems, such as NP-hard optimization problems, remain
computationally expensive to solve. Developing efficient approximation algorithms is a crucial
research direction.

6.3.2 Interpretability vs. Accuracy


Highly accurate AI models, such as deep neural networks, often lack interpretability.
Future work will focus on balancing model complexity with transparency.

6.3.3 Data Privacy and Security


Ensuring data security while maintaining analytical efficiency is a major challenge.
Differential privacy and homomorphic encryption are emerging mathematical techniques that
aim to tackle this issue.

6.3.4 Real-Time Decision Making

With the rise of IoT and edge computing, mathematical models need to support real-time
data processing and decision making without relying on high computational resources.

6.4 Conclusion
Computational mathematics will continue to be a driving force in data science, enabling
new technological breakthroughs across various industries. The integration of advanced
mathematical techniques with modern computational frameworks will pave the way for a more
efficient, transparent and intelligent future.

7. CONCLUSION:

In this project, we have explored the critical intersection between mathematics and data
science, demonstrating how computational techniques are essential in extracting meaningful
insights from vast datasets. The study has highlighted the role of fundamental mathematical
concepts—ranging from linear algebra and calculus to probability and optimization—in driving
innovations in data science. A focal point of the research was Principal Component Analysis
(PCA), which was examined both from a theoretical perspective and through practical
applications.

By delving into the mathematical foundations and computational processes behind PCA,
the project showcased how dimensionality reduction can simplify complex datasets without
significant loss of information. This not only enhances computational efficiency but also
improves the interpretability of data, which is crucial in various real-world applications such as
healthcare, finance, and artificial intelligence.

Furthermore, the project addressed several contemporary challenges, including computational
complexity, scalability and the balance between model accuracy and interpretability. It also
provided insights into future directions, such as the integration of quantum computing and the
development of explainable AI, which promise to further transform the landscape of data science.

Overall, this work underscores the indispensable role of mathematics in advancing data science
methodologies and sets the stage for future research. The integration of advanced mathematical
techniques with emerging computational frameworks is not only enhancing current applications
but is also paving the way for new innovations that will continue to shape our data-driven world.

REFERENCES:

1. Strang, G. (2009). Introduction to Linear Algebra.


A foundational text that explains the core concepts of linear algebra used in many data
science algorithms.

2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical
Learning.

A comprehensive resource on statistical learning methods that are critical in data science.

3. Bishop, C. M. (2006). Pattern Recognition and Machine Learning.


An essential reference for understanding machine learning techniques and the underlying
mathematics.

4. Provost, F., & Fawcett, T. (2013). Data Science for Business.


This book bridges the gap between data science theory and practical business applications.

5. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical
Learning.

An accessible guide to statistical modeling and inference, with applications in data science.

6. Deisenroth, M. P., Faisal, A. A., & Ong, C. S. (2020). Mathematics for Machine
Learning.

Focuses on the mathematical tools essential for machine learning and data analysis.

7. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization.


Provides an in-depth treatment of optimization techniques that underpin many data science
methods.

8. Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra.


Explains numerical methods in linear algebra, which are crucial for efficient computation.
