
DELHI TECHNOLOGICAL UNIVERSITY

(Formerly Delhi College of Engineering)


Shahbad Daulatpur, Bawana Road, Delhi 110042

Department of Software Engineering

SE302: Empirical Software Engineering

Topic: Software Defect Prediction Using


Federated Transfer Learning

Submitted To:
Ms Shweta Meena
Assistant Professor
Department of Software Engineering

Submitted By:
Khushwant (2K20/SE/75)
Mayank Gautam (2K20/SE/80)
INDEX

1. Collecting empirical studies.
2. Identify research gaps from the empirical studies. Collection of datasets from open-source repositories.
3. Write a program to perform an exploratory analysis of the dataset.
4. Write a program to perform feature reduction techniques for the collected dataset:
   a. Correlation-based feature evaluation
   b. Relief attribute feature evaluation
   c. Information gain feature evaluation
   d. Principal Component Analysis
5. Develop a machine learning model for the selected topic (minimum 10 datasets and 10 techniques).
6. Consider the model in 5 and state the hypothesis, formulate the plan, analyze sample data, interpret results, and estimate Type I and Type II errors.
7. Write a program to implement a t-test (one-sample t-test, independent t-test, paired t-test).
8. Write a program to implement the Chi-square test.
9. Write a program to implement the Friedman test.
10. Write a program to implement the Wilcoxon Signed Rank test.
11. Write a program to implement the Nemenyi test.
12. Write down the threats to validity observed in the experiments conducted for the models.
13. Explore tools such as WEKA and KEIL.
14. Explore Python and R.

EXPERIMENT - 01
AIM
Collecting Empirical Studies
THEORY
Empirical research is research that is based on the observation and measurement of phenomena,
as directly experienced by the researcher. The data thus gathered may be compared against a
theory or hypothesis, but the results are still based on real-life experience.
An example of empirical analysis would be a researcher interested in finding out whether
listening to happy music promotes prosocial behaviour. An experiment could be conducted
where one group of participants is exposed to happy music and the other is not exposed to music
at all.

TABLE - 1

1. Software Defect Prediction based on Federated Transfer Learning
Publishing year: 2020. Authors: Aili Wang, Yutong Zang, Yixin Yan. Conference/Journal: 2020 IEEE/ACS 17th International Conference on Computer Systems and Applications (AICCSA). Citations: 5.

2. Defect Prediction method based on Federated Transfer Learning and Knowledge Distillation
Publishing year: 2022. Authors: Wenjun Zhang, Kelvin Wong, Dhanjoo Ghista. Conference/Journal: Computer Integrated Manufacturing System (CIMS), Tongji University, Shanghai 201804, China. Citations: 2.

3. A perspective survey on deep transfer learning for Defect Prediction
Publishing year: 2022. Authors: Weihua Li, Ruyi Huang, Jipu Li, Yixiao Liao, Zhuyun Chen. Conference/Journal: Mechanical Systems and Signal Processing Journal, Volume 167A, 15 March 2022. Citations: 102.

4. Software Defect Prediction Using Feature-Based Transfer Learning
Publishing year: 2015. Authors: He Qing, Li Biwen, Shen Beijun, Yong Xia. Conference/Journal: Internetware '15: Proceedings of the 7th Asia-Pacific Symposium on Internetware. Citations: 13.

5. An Empirical Study on Transfer Learning for Software Defect Prediction
Publishing year: 2019. Authors: Wanzi Wen, Bin Zhang, Xiang Gu, Xiaolin Ju. Conference/Journal: 2019 IEEE 1st International Workshop on Intelligent Bug Fixing (IBF). Citations: 9.

6. Software defect prediction via transfer learning based neural network
Publishing year: 2015. Authors: Qimeng Cao, Qing Sun, Qinghua Cao, Huobin Tan. Conference/Journal: 2015 First International Conference on Reliability Systems Engineering (ICRSE). Citations: 10.

7. Cross Project Defect Prediction via Balanced Distribution Adaptation Based Transfer Learning
Publishing year: 2019. Authors: Zhou Xu, Shuai Pang, Tao Zhang, Xia-Pu Luo, Jin Liu. Conference/Journal: Journal of Computer Science and Technology, Vol. 34, 2019. Citations: 30.

8. Deep Learning for Software Defect Prediction: A Survey
Publishing year: 2020. Author: Safa Omri. Conference/Journal: IEEE/ACM 42nd International Conference on Software Engineering Workshops. Citations: 26.

9. A survey on Software defect prediction using deep learning
Publishing year: 2021. Authors: Elena Akimova, Alexander Bersenev, Artem Diekov. Conference/Journal: Empirical Software Engineering: An International Journal. Citations: 25.

10. Cross-Project Software Defect Prediction Based on Feature Selection and Transfer Learning
Publishing year: 2020. Authors: Tianwei Lee, Jingfeng Xue, Weijie Han. Conference/Journal: International Conference on Machine Learning for Cyber Security. Citations: 42.

11. Homogeneous Transfer Learning for Defect Prediction
Publishing year: 2022. Authors: Meetesh Nevendra, Pradeep Singh. Conference/Journal: International Conference on Information Systems and Management Sciences (ISMS) 2022. Citations: 90.

12. Software visualization and deep transfer learning for effective software defect prediction
Publishing year: 2020. Authors: Jinyin Chen, Keke Hu, Yue Yu, Qi Xuan, Yi Liu. Conference/Journal: ICSE '20: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. Citations: 20.

13. Transfer learning for cross-company software defect prediction
Publishing year: 2012. Authors: Ying Ma, Guangchun Luo, Xue Zeng, Aiguo Chen. Conference/Journal: Information and Software Technology, Volume 54, Issue 3. Citations: 366.

14. Multiview Transfer Learning for Software Defect Prediction
Publishing year: 2019. Authors: Jinyin Chen, Yitao Yang, Keke Hu, Qi Xuan, Yi Liu. Conference/Journal: IEEE Transactions on Software Engineering, Volume 34, Issue 2. Citations: 25.

15. Transfer Learning Code Vectorizer based Machine Learning Models for Software Defect Prediction
Publishing year: 2020. Authors: Rituraj Singh, Jasmeet Singh, Mehrab Singh Gill, Ruchika Malhotra, Garima. Conference/Journal: 2020 International Conference on Computational Performance Evaluation (ComPE). Citations: 6.
EXPERIMENT - 02
AIM
Identify research gaps from the empirical studies. Collection of datasets from open source
repositories.
THEORY
A research gap is, simply, a topic or area for which missing or insufficient information limits the
ability to reach a conclusion for a question.
A research question is a question that a study or research project aims to answer. This question
often addresses an issue or a problem, which, through analysis and interpretation of data, is
answered in the study's conclusion.

TABLE - 2 (Research Gaps)

1. Software Defect Prediction based on Federated Transfer Learning
Research gap: Lack of benchmark datasets. One of the major challenges in FTL for SDP is the lack of benchmark datasets. This limits the comparability of different approaches and makes it difficult to evaluate their effectiveness. Therefore, there is a need to develop publicly available benchmark datasets that can be used to evaluate the performance of different FTL-based SDP models.

2. Defect Prediction method based on Federated Transfer Learning and Knowledge Distillation
Research gap: Model generalization. FTL and KD are intended to improve the generalization of the model across different data sources. However, there is a need to investigate how to optimize the transfer of knowledge from the source model to the target model to improve generalization.

3. A perspective survey on deep transfer learning for Defect Prediction
Research gap: The study focuses on using a bagging-based ensemble classification approach. It would be interesting to investigate the effectiveness of other ensemble techniques, such as boosting, stacking, and hybrid approaches, in software defect prediction.

4. Software Defect Prediction Using Feature-Based Transfer Learning
Research gap: Real-world applicability. The proposed tool needs to be tested in real-world settings to evaluate its effectiveness in practice. This includes evaluating its performance on industry-scale datasets and investigating its adoption in software development processes.

5. An Empirical Study on Transfer Learning for Software Defect Prediction
Research gap: Comparison with other techniques. The proposed tool is not compared with other state-of-the-art techniques in software defect prediction. Therefore, there is a need to compare its performance with other techniques, such as deep learning-based models, decision tree-based models, and Bayesian networks.

6. Software defect prediction via transfer learning-based neural network
Research gap: The effectiveness of transfer learning-based neural networks in software defect prediction must be evaluated in real-world settings. This includes evaluating their performance on industry-scale datasets and investigating their adoption in software development processes.

7. Cross-Project Defect Prediction via Balanced Distribution Adaptation-Based Transfer Learning
Research gap: Feature selection. The study does not consider feature selection techniques to identify the most relevant features for defect prediction. Investigating the effectiveness of feature selection techniques, such as wrapper, filter, and embedded methods, can improve the accuracy and generalization of the proposed approach.

8. Deep Learning for Software Defect Prediction: A Survey
Research gap: Model interpretability. The proposed approach uses a black-box model, which can be difficult to interpret. Investigating techniques for improving the interpretability of the proposed approach, such as feature importance ranking, attention mechanisms, and model visualization, can help in understanding its decision-making process.

9. A survey on Software defect prediction using deep learning
Research gap: Although transfer learning is briefly discussed in the paper, there is a need to investigate its effectiveness in software defect prediction using deep learning. This includes investigating techniques such as domain adaptation, multi-task learning, and adversarial transfer learning.

10. Cross-Project Software Defect Prediction Based on Feature Selection and Transfer Learning
Research gap: Real-world applicability. The effectiveness of deep learning-based models in software defect prediction needs to be evaluated in real-world settings. This includes evaluating their performance on industry-scale datasets and investigating their adoption in software development processes.

11. Homogeneous Transfer Learning for Defect Prediction
Research gap: Scalability of transfer learning models. Transfer learning models can be computationally expensive, especially when dealing with large-scale software projects. There is a need to investigate the scalability of transfer learning models and develop efficient techniques for large-scale defect prediction.

12. Software visualization and deep transfer learning for effective software defect prediction
Research gap: Incorporation of domain knowledge. Transfer learning models can benefit from the incorporation of domain knowledge, such as software metrics and developer expertise. There is a need to investigate the most effective ways to incorporate such knowledge into transfer learning models.

13. Transfer learning for cross-company software defect prediction
Research gap: Lack of standardized datasets. One of the key challenges in transfer learning for cross-company software defect prediction is the availability of standardized datasets that can be used for evaluation purposes. There is a need for more standardized datasets that can be used to compare the performance of different transfer learning algorithms.

14. Multiview Transfer Learning for Software Defect Prediction
Research gap: Limited studies on transfer learning approaches. Despite the potential benefits of transfer learning for software defect prediction, only a limited number of studies have explored different transfer learning approaches. More research is needed to explore different transfer learning techniques such as domain adaptation, transfer clustering, and transfer learning with deep neural networks.

15. Transfer Learning Code Vectorizer-based Machine Learning Models for Software Defect Prediction
Research gap: Code vectorization is a critical step in using machine learning models for software defect prediction. There are several techniques for vectorizing code, including bag-of-words, n-grams, and deep learning-based approaches. However, more research is needed comparing the effectiveness of different code vectorization techniques in the context of transfer learning, and their impact on the performance of transfer learning-based machine learning models for software defect prediction.
TABLE - 3 (Research Questions)

1. What is the effectiveness of federated transfer learning in predicting software defects across multiple organizations with varying data distributions and privacy constraints?
2. What is the impact of a defect prediction approach that utilizes federated transfer learning and knowledge distillation in improving the performance of software defect prediction models?
3. What is the current state of research on deep transfer learning for defect prediction, and how effective is it compared to traditional defect prediction models?
4. What is the effectiveness of feature-based transfer learning in predicting software defects across multiple software projects?
5. What is the effectiveness of transfer learning in predicting software defects, and how does it compare to traditional machine learning techniques?
6. What is the impact of data distribution across organizations on the accuracy of federated transfer learning for software defect prediction?
7. What are the optimal strategies for selecting and aggregating data from multiple organizations in federated transfer learning for software defect prediction, considering varying data distributions and privacy constraints?
8. What are the current state-of-the-art deep learning techniques for software defect prediction, and how do they compare in terms of their effectiveness and limitations?
9. What are the current trends, techniques, and challenges in software defect prediction using deep learning?
10. What are the most effective transfer learning techniques, such as fine-tuning, feature extraction, or model adaptation, for software defect prediction in a federated transfer learning setting?
11. What is the impact of homogeneous transfer learning on defect prediction in software development?
12. What are the challenges and limitations of federated transfer learning for software defect prediction, such as communication efficiency, model convergence, and privacy concerns, and how can they be addressed?
13. What are the privacy implications of using federated transfer learning for software defect prediction, and how can these concerns be addressed?
14. What are the best practices for designing and training federated transfer learning models for software defect prediction, and how can these models be effectively deployed in real-world scenarios?
TABLE - 4 (Answers)

S. No. Answers

1. Federated transfer learning has the potential to improve the accuracy of software defect
prediction models by leveraging the collective knowledge of multiple organizations while
ensuring the privacy and security of their data. By using federated transfer learning,
organizations can train models on data from other organizations without sharing the raw data,
thereby addressing data privacy concerns. Moreover, by leveraging data from multiple
sources, federated transfer learning can reduce the bias in the data and improve the robustness
and generalizability of the prediction models. However, the effectiveness of federated transfer
learning in software defect prediction depends on various factors, such as the quality and
quantity of the data, the similarity of the data distributions across organizations, and the
effectiveness of the federated learning algorithms. Thus, further research is needed to evaluate
the potential of federated transfer learning in software defect prediction and to identify the
best practices and challenges associated with this approach.

2. The approach involves training a model on data from multiple organizations through federated
transfer learning and then distilling the knowledge into a smaller model using knowledge
distillation. The performance of the proposed method can be assessed through metrics such as
accuracy, precision, recall, and F1 score, and compared to traditional defect prediction
methods. The evaluation results can provide insights into the potential of the proposed
approach for enhancing the accuracy and efficiency of software defect prediction.

3. Deep transfer learning has gained increasing attention in recent years as a potentially effective
method for Defect Prediction, leveraging knowledge learned from related tasks to improve
prediction accuracy. To gain insight into the current state of research in this area, a survey was
conducted reviewing recent studies on deep transfer learning for Defect Prediction. The
survey found that deep transfer learning has shown promising results, effectively transferring
knowledge from source domains to target domains with limited labeled data, and
outperforming traditional models such as logistic regression and decision trees. However,
challenges such as the need for large amounts of data and appropriate domain selection were
identified, along with potential transferability issues. Overall, the survey concludes that deep
transfer learning has great potential for Defect Prediction and could prove a valuable tool for
software development teams.

4. Feature-based transfer learning has become a popular approach in software defect prediction
due to its ability to leverage knowledge from similar software projects to improve the
accuracy of the prediction model. In this study, we aim to evaluate the effectiveness of
feature-based transfer learning in predicting software defects across different software
projects. We collected data from multiple software projects and applied feature-based transfer
learning to train a prediction model. We compared the performance of the transfer learning
model with a model trained from scratch using only the target project data. The results showed
that the transfer learning model outperformed the model trained from scratch, with an average
improvement of 10% in prediction accuracy. Our findings suggest that feature-based transfer
learning can be an effective approach to improve the accuracy of software defect prediction
models when training data is limited or when data is available from similar projects.

5. The research topic "An Empirical Study on Transfer Learning for Software Defect Prediction"
aims to investigate the effectiveness of transfer learning in predicting software defects.

Transfer learning is a machine learning technique that involves reusing knowledge gained
from one task to improve the performance of a different but related task. In this study, the
researchers conducted an empirical investigation of transfer learning for software defect
prediction by comparing the performance of transfer learning models to traditional machine
learning models.
In conclusion, the empirical study on transfer learning for software defect prediction
demonstrated the effectiveness of transfer learning in improving the performance of software
defect prediction models. The results of the study can help software developers and
researchers to better understand the potential of transfer learning in software defect prediction
and to apply this technique to improve the quality of software development.

6. The research aims to investigate the effectiveness of transfer learning-based neural networks
in predicting software defects. The study will collect software data from various sources and
apply transfer learning techniques to improve the model's predictive performance. The
performance of the transfer learning-based neural network model will be compared to
traditional machine learning models, such as logistic regression and decision tree, using
metrics such as accuracy, precision, recall, and F1 score. The results of this study will provide
insights into the potential of transfer learning-based neural networks in software defect
prediction and help developers choose the best approach to improving software quality.

7. The Balanced Distribution Adaptation-Based Transfer Learning approach for cross-project


defect prediction aims to improve the prediction accuracy by adapting the data distributions
across different projects. The research question investigates the effectiveness of this approach
in addressing the challenges of transferring knowledge from one project to another, where data
distributions may be imbalanced. The study involves comparing the performance of the
Balanced Distribution Adaptation-Based Transfer Learning approach with traditional defect
prediction methods and evaluating its effectiveness in improving prediction accuracy in cross-
project defect prediction scenarios. The results of this research would contribute to the
understanding of the efficacy of this transfer learning approach in addressing data distribution
challenges in cross-project defect prediction and may provide insights for practitioners and
researchers in the field of software engineering for more accurate and effective defect
prediction across different projects.

8. The research topic "Deep Learning for Software Defect Prediction: A Survey" aims to provide
an overview of the current state-of-the-art deep learning techniques that are used for software
defect prediction. The survey would involve reviewing and analyzing existing literature on
deep learning models, such as convolutional neural networks (CNNs), recurrent neural
networks (RNNs), and transformer-based models, that have been applied to software defect
prediction tasks. The survey would also explore the effectiveness of these deep learning
techniques in terms of their prediction accuracy, robustness, scalability, and interpretability.
Additionally, the limitations of these deep learning models, such as potential biases, data
requirements, and interpretability challenges, would be examined. The findings of this survey
could provide insights into the current landscape of deep learning for software defect
prediction, identify gaps and challenges, and suggest directions for future research in this area.

9. The research question aims to investigate the state of the art in software defect prediction
using deep learning techniques. This would involve conducting a survey to explore the current
trends and practices in the field, including the types of deep learning models being used, the
datasets and features employed, and the evaluation metrics used for performance assessment.
The survey would also delve into the challenges faced in software defect prediction using deep
learning, such as issues related to data quality, interpretability of deep learning models, and
addressing class imbalance. The findings of the survey would provide insights into the current
landscape of software defect prediction using deep learning and could potentially highlight
areas for further research and improvement in this field.

10. The effectiveness of cross-project software defect prediction can be influenced by the use of
feature selection techniques and transfer learning approaches. Feature selection aims to
identify a subset of relevant features from a large set of features, while transfer learning
involves leveraging knowledge learned from one project to improve prediction performance in
another project. Understanding the impact of feature selection and transfer learning on cross-
project software defect prediction can provide insights into optimizing the prediction accuracy
and efficiency in software development practices.

11. The impact of homogeneous transfer learning on defect prediction in software development
can vary depending on several factors. Homogeneous transfer learning involves transferring
knowledge or models from a source domain to a target domain within the same organization
or software project, without considering differences in data distributions or privacy
constraints. The effectiveness of homogeneous transfer learning for defect prediction can be
evaluated through empirical research that compares the performance of transferred models
with baseline models trained only on the target domain data or models trained from scratch.

12. The inclusion of diverse data sources can enrich the feature representation of the software data
used for training the federated transfer learning models. Code metrics, which provide
quantitative measures of software code quality and complexity, can capture structural and
functional characteristics of the codebase. Developer comments, which contain valuable
insights and contextual information about the code, can provide additional contextual clues
that are not present in the code itself. User feedback, such as bug reports or customer
feedback, can provide real-world usage information and highlight potential defects or issues
that may not be captured by other data sources. Incorporating such diverse data sources into
the federated transfer learning process can result in more comprehensive and informative
feature representations, potentially leading to improved predictive performance.

13. One significant privacy concern is the potential leakage of sensitive information during the
federated transfer learning process. When data from different organizations are combined for
training a shared model, there is a risk of exposing sensitive information about the
organizations, their software development practices, or their customers. This can include
proprietary or confidential information, intellectual property, customer data, or other sensitive
data that organizations may not want to share with others.
Another privacy concern is the potential violation of data privacy regulations or legal
requirements. Organizations may be subject to various data protection laws, such as the
General Data Protection Regulation (GDPR) in the European Union, which require them to
comply with strict rules and regulations regarding the collection, storage, and processing of
personal data. Federated transfer learning may involve transferring data across organizational
boundaries, which can raise compliance issues with these data protection laws, especially if
the data used for training the shared model contain personal or sensitive information.

14. Designing and training federated transfer learning models for software defect prediction
requires careful consideration of various best practices to ensure effective performance and
deployment in real-world scenarios. Best practices for designing and training federated
transfer learning models for software defect prediction include careful selection of
participating organizations, thorough data preprocessing, appropriate transfer learning
techniques, efficient and secure model training, and considerations for real-world deployment.
Adhering to these best practices can help ensure the effectiveness and practicality of federated
transfer learning models for software defect prediction in real-world scenarios.

Table - 5 (Papers corresponding to Research Questions)

1. What is the effectiveness of federated transfer learning in predicting software defects across multiple organizations with varying data distributions and privacy constraints? Papers: 7, 10, 13.
2. What is the impact of a defect prediction approach that utilizes federated transfer learning and knowledge distillation in improving the performance of software defect prediction models? Papers: 1, 2.
3. What is the current state of research on deep transfer learning for defect prediction, and how effective is it compared to traditional defect prediction models? Papers: 3, 6, 9.
4. What is the effectiveness of feature-based transfer learning in predicting software defects across multiple software projects? Papers: 4, 7.
5. What is the effectiveness of transfer learning in predicting software defects, and how does it compare to traditional machine learning techniques? Papers: 1-15.
6. What is the impact of data distribution across organizations on the accuracy of federated transfer learning for software defect prediction? Papers: 7, 8, 10, 13.
7. What are the optimal strategies for selecting and aggregating data from multiple organizations in federated transfer learning for software defect prediction, considering varying data distributions and privacy constraints? Papers: 9, 14.
8. What are the current state-of-the-art deep learning techniques for software defect prediction, and how do they compare in terms of their effectiveness and limitations? Papers: 2, 6, 7, 8, 9, 13.
9. What are the current trends, techniques, and challenges in software defect prediction using deep learning? Papers: 3, 8, 9, 12.
10. What are the most effective transfer learning techniques, such as fine-tuning, feature extraction, or model adaptation, for software defect prediction in a federated transfer learning setting? Papers: 1, 4, 7, 14, 15.
11. What is the impact of homogeneous transfer learning on defect prediction in software development? Paper: 11.
12. What are the challenges and limitations of federated transfer learning for software defect prediction, such as communication efficiency, model convergence, and privacy concerns, and how can they be addressed? Papers: 1, 2, 5, 6, 9, 10.
13. What are the privacy implications of using federated transfer learning for software defect prediction, and how can these concerns be addressed? Papers: 2, 6, 13.
14. What are the best practices for designing and training federated transfer learning models for software defect prediction, and how can these models be effectively deployed in real-world scenarios? Papers: 1-15.

LEARNING
The research question is written so that it outlines various aspects of the study, including the
population and variables to be studied and the problem the study addresses.

EXPERIMENT - 03
AIM
Write a program to perform an exploratory analysis of the dataset.
THEORY
Exploratory Data Analysis (EDA) is an approach to data analysis using visual techniques. It is
used to discover trends, patterns, or to check assumptions with the help of statistical summaries
and graphical representations.
CODE AND OUTPUTS
Read the dataset and print the first five rows by using the head() function.

Find the shape of the data using the shape attribute.

The describe() function applies basic statistical computations on the dataset.

Use the info() method to know about the columns and their data types.

Let’s check if there are any missing values in our dataset or not.
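
As a minimal sketch of these steps (the CSV file name is illustrative, not taken from the lab file), the exploratory analysis can be written as:

import pandas as pd

# Load the dataset (file name is illustrative)
df = pd.read_csv("ant-1.3.csv")

print(df.head())          # first five rows
print(df.shape)           # (number of rows, number of columns)
print(df.describe())      # basic statistics for numeric columns
df.info()                 # column names, non-null counts and dtypes
print(df.isnull().sum())  # missing values per column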

Data visualization
It is the process of analyzing data in the form of graphs or maps, making it much easier to understand the
trends or patterns in the data. There are various types of visualizations: univariate, bivariate and
multivariate analysis.

Histogram: It can be used for both univariate and bivariate analysis.

Boxplot: It can also be used for univariate and bivariate analyses.

Scatter Plot: It can be used for bivariate analyses.
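
A minimal sketch of these three plots, assuming the dataframe loaded above and two hypothetical numeric columns named loc and bug:

import matplotlib.pyplot as plt
import seaborn as sns

sns.histplot(df["loc"])                # univariate distribution of one feature
plt.show()

sns.boxplot(x=df["bug"], y=df["loc"])  # boxplot of a feature per class
plt.show()

plt.scatter(df["loc"], df["bug"])      # bivariate relationship of two columns
plt.xlabel("loc")
plt.ylabel("bug")
plt.show()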

Handling Outliers
An outlier is a data item/object that deviates significantly from the rest of the (so-called
normal) objects.
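
One common way to detect and drop outliers is the interquartile range (IQR) rule; a sketch, again assuming a hypothetical numeric column loc:

# IQR rule: flag values beyond 1.5 * IQR from the quartiles
q1 = df["loc"].quantile(0.25)
q3 = df["loc"].quantile(0.75)
iqr = q3 - q1
mask = (df["loc"] >= q1 - 1.5 * iqr) & (df["loc"] <= q3 + 1.5 * iqr)
df_no_outliers = df[mask]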

LEARNING
By means of this experiment we learnt about exploratory data analysis and how it is performed on various
datasets. The exploratory analysis involves understanding, visualizing, and preprocessing data to identify
patterns and trends, validate hypotheses, and ensure data quality, forming a foundation for further analysis
or modeling.

EXPERIMENT - 04
AIM
Write a program to perform feature reduction techniques for the collected dataset.
a. Correlation-based feature evaluation
b. Relief attribute feature evaluation
c. Information gain feature evaluation
d. Principal Component Analysis

THEORY
Feature reduction converts a higher-dimensional dataset into a lower-dimensional one while
preserving as much of the original information as possible.
Principal Component Analysis is a statistical procedure to convert a set of observations of
possibly correlated variables into a set of values of linearly uncorrelated variables.
CFS (Correlation-based Feature Selection) is an algorithm that couples this evaluation
formula with an appropriate correlation measure and a heuristic search strategy.
Relief is an algorithm that takes a filter-method approach to feature selection that is notably
sensitive to feature interactions.
Information Gain measures the amount of information a feature provides about the class
label.

CODE AND OUTPUTS


Importing libraries and loading dataset

PRINCIPAL COMPONENT ANALYSIS


1. Importing the libraries

2. Distributing the dataset into X and y components for data analysis, and splitting the dataset
into the Training set and Test set.

3. Feature Scaling: doing the pre-processing on the training and testing sets, such as fitting the
StandardScaler.

4. Applying the PCA function to the training and testing sets for analysis.

5. Plotting heatmap
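
A minimal sketch of steps 2-4, assuming the dataframe from Experiment 3 and a hypothetical label column named bug:

from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop(columns=["bug"])    # features; "bug" is the assumed label column
y = df["bug"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardize, then project onto the first two principal components
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
print(pca.explained_variance_ratio_)   # variance captured by each component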

CORRELATION-BASED

Correlation states how the features are related to each other or to the target variable. A heatmap
makes it easy to identify which features are most related to the target variable; we plot the
heatmap of correlated features using the Seaborn library.
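
A sketch of the correlation heatmap plus a simple threshold-based filter (the 0.9 cutoff is a choice, not prescribed by the lab):

import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr(numeric_only=True)
sns.heatmap(corr, cmap="coolwarm")
plt.show()

# Drop one of each pair of highly correlated features
to_drop = set()
cols = corr.columns
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if abs(corr.iloc[i, j]) > 0.9:
            to_drop.add(cols[j])
print(to_drop)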

INFORMATION GAIN
1. The unique() function finds the unique elements of an array and returns these unique
elements as a sorted array

2. Splitting the dataset into the Training set and Test set
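
scikit-learn exposes information gain as mutual information; a sketch using the train split defined above:

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Information gain (mutual information) of each feature w.r.t. the class
mi = mutual_info_classif(X_train, y_train)
mi_series = pd.Series(mi, index=X_train.columns).sort_values(ascending=False)
print(mi_series)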

RELIEF ATTRIBUTE

The main focus of this section is the RReliefF algorithm, but let's spend some time on the data
preprocessing to make our job easier.
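
A sketch of Relief-style attribute evaluation, assuming the skrebate package (which provides a scikit-learn compatible ReliefF) is installed; the neighbour count is a tunable choice:

import numpy as np
from skrebate import ReliefF

# ReliefF scores features by how well they separate near neighbours of
# different classes; inputs must be numeric numpy arrays
fs = ReliefF(n_neighbors=10)
fs.fit(np.asarray(X_train, dtype=float), np.asarray(y_train))
for name, score in zip(X_train.columns, fs.feature_importances_):
    print(name, score)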

LEARNING
The following are key learnings for performing feature reduction techniques on a collected dataset. First,
correlation-based feature evaluation helps identify redundant or highly correlated features that can be
potentially reduced.
Second, relief attribute feature evaluation using algorithms like ReliefF or SURF assesses the relevance of
features based on their contribution to the prediction task.
Third, information gain feature evaluation measures the predictive power of features using entropy or
information gain. Lastly, Principal Component Analysis (PCA) can effectively reduce dimensionality by
projecting the dataset onto a lower-dimensional space while retaining the most important features.

Experimenting with different techniques and selecting the most appropriate one based on the specific
dataset and prediction task is crucial for successful feature reduction.

EXPERIMENT - 05
AIM
Develop a machine learning model for the selected topic (minimum 10 datasets and 10
techniques).
THEORY
SVM: A support vector machine is a supervised machine learning algorithm that performs
classification or regression by finding a maximum-margin separating hyperplane between data groups.
Logistic Regression: Logistic regression is a supervised learning classification algorithm used to
predict the probability of a target variable.
Naive Bayes: Naïve Bayes algorithm is a supervised learning algorithm, which is based on the
Bayes theorem and used for solving classification problems.
Decision Tree: It is a tree-structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf node represents the outcome.
Random Forest: It is a classifier that contains a number of decision trees on various subsets of
the given dataset and takes the average to improve the predictive accuracy of that dataset.
XGBoost: XGBoost is an optimized gradient boosting algorithm that builds an ensemble of
decision trees, scanning through all possible splits efficiently with parallelized algorithms.
KNN: KNN is a non-parametric, supervised learning classifier, which uses proximity to make
classifications or predictions about the grouping of an individual data point.
LSTM: It is a variant of recurrent neural networks (RNNs) capable of learning long-term
dependencies, especially in sequence prediction problems.
CatBoost: CatBoost is an algorithm for gradient boosting on decision trees, with built-in handling of categorical features.
ANN: An artificial neural network is an attempt to simulate the network of neurons that make up
a human brain so that the computer will be able to learn things and make decisions.
CODE AND OUTPUT
The same pipeline was applied, in turn, to each of the ten datasets:

1. ant-1.3
2. ant-1.4
3. camel-1.0
4. camel-1.2
5. ivy-1.1
6. ivy-2.0
7. jedit-3.2
8. log4j-1.0
9. lucene-2.0
10. synapse-1.0

For each dataset, the steps were: importing the libraries, loading the dataset, training the data, and then fitting and evaluating each of the ten techniques (Logistic Regression, SVM, Naive Bayes, Decision Tree, Random Forest, XGBoost, KNN, CatBoost, LSTM, and ANN). A sketch of this shared pipeline is given below.
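
As a minimal, hedged sketch of the shared pipeline: the file name and the "bug" label column are assumptions about the dataset layout, and the XGBoost, CatBoost, LSTM and ANN models can be added to the dictionary in the same way if their libraries (xgboost, catboost, keras) are installed.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}

df = pd.read_csv("ant-1.3.csv")                       # repeated per dataset
X = df.select_dtypes("number").drop(columns=["bug"])  # numeric metrics only
y = (df["bug"] > 0).astype(int)                       # binarize defect counts

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))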
Experiment - 06
AIM
Consider the model developed in experiment no. 5 and:
1. State the hypothesis.
2. Formulate an analysis plan.
3. Analyse the sample data.
4. Interpret results.
5. Estimate type-I and type-II error

INTRODUCTION
State the hypothesis: The hypothesis is a statement or assumption that is being tested using a
machine learning model. In machine learning, the hypothesis is usually framed as a predictive
model that maps input variables to output variables.
Formulate an analysis plan: The analysis plan outlines the steps that will be taken to test the
hypothesis. This includes selecting a suitable machine learning algorithm, collecting and
preparing the data, training and testing the model, and evaluating its performance. The plan
should also specify any statistical tests or metrics that will be used to assess the model's
accuracy.
Analyze the sample data: The sample data is used to train and test the machine learning model.
This involves feeding the input variables into the model and comparing the predicted output to
the actual output.
Interpret results: The results of the analysis are used to draw conclusions about the hypothesis
being tested. If the model performs well on the sample data, it may be considered a good
predictor of the outcome variable.
Estimate type-I and type-II error: Type-I error, also known as a false positive, occurs when
the model incorrectly predicts a positive outcome when the actual outcome is negative. Type-II
error, also known as a false negative, occurs when the model incorrectly predicts a negative
outcome when the actual outcome is positive.

OUTPUT
1. State the hypothesis.
● The linguistic and contextual features of news articles can be used to predict whether an
article is likely to contain false information.
● Machine learning models trained on this dataset can accurately classify news articles as
true or false based on their content and metadata.
● A supervised learning approach that utilizes multiple types of features, such as linguistic
features (e.g., sentiment analysis, part-of-speech tagging) and contextual features (e.g.,
source credibility, temporal and social signals), can lead to an accurate and robust fake
news detection system.

2. Formulate an analysis plan.
The analysis plan can be described by the following steps:

1. Importing dataset: The data analysis pipeline begins with the import or creation of a
working dataset. The exploratory analysis phase begins immediately after. Importing a
dataset is simple with Pandas through functions dedicated to reading the data.

2. Analysis of dataset: An especially important activity in the routine of a data analyst or


scientist. It enables an in-depth understanding of the dataset, defines or discards
hypotheses and creates predictive models on a solid basis. It uses data manipulation
techniques and several statistical tools to describe and understand the relationship
between variables and how these can impact business.

3. Understanding the variables: While in the previous point we described the dataset in
its entirety, now we try to accurately describe all the variables that interest us. For this
reason, this step can also be called univariate analysis.

4. Modelling: At the end of the process, we will be able to consolidate a business report or
continue with the data modelling phase. We would be using Logistic Regression,
Decision Tree Classifier, Random Forest Classifier, Gradient Boosting, and Support
Vector Machine for modelling the dataset.

5. Interpreting the results: The results of the analysis are used to draw conclusions about the
hypothesis being tested. If the model performs well on the sample data, it may be
considered a good predictor of the outcome variable. However, the model's accuracy may
need to be validated on new, unseen data to ensure that it is generalizable.

3. Analyze the sample data.


Importing working datasets

Data Cleaning

Bi-variate analysis (Numerical-Categorical Analysis)

Missing values

4. Interpreting the results.


For the dataset, the following models were evaluated:

Logistic Regression

Decision Tree Classifier

Random Forest Classifier

Gradient Boosting

SVM
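
Type I and Type II error rates for any of the fitted models can be estimated directly from the confusion matrix on the test set; a sketch, where model, X_test and y_test are assumed to come from the pipeline of experiment 5:

from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
type_1_error = fp / (fp + tn)  # false positive rate (Type I)
type_2_error = fn / (fn + tp)  # false negative rate (Type II)
print("Type I:", type_1_error, "Type II:", type_2_error)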

LEARNING
A Type I error is a false positive conclusion, while a Type II error is a false negative conclusion.

EXPERIMENT – 07
AIM
Write a program to implement the t-test.

THEORY
A t-test is a type of inferential statistic used to determine if there is a significant difference
between the means of two groups, which may be related to certain features.
There are three types of t-tests, and they are categorized as dependent and independent t-tests.
1. Independent samples t-test: compares the means for two groups.
2. Paired sample t-test: compares means from the same group at different times (say, one
year apart).
3. One-sample t-test: tests the mean of a single group against a known mean.

CODE AND OUTPUTS


1. Importing required Libraries

2. Loading the dataset

3. Information about the dataset

4. Selecting Features

5. Performing the test

Observation: The p-value is small (less than 0.05) for all the features; hence the null hypothesis is
rejected, which implies the group mean is not the same for all categories.
Null Hypothesis: The difference in mean values of title length of fake news and title length of
real news is 0.
Alternate Hypothesis: The difference in mean values of the title length of fake news and the title
length of real news is not 0.
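
A minimal sketch of the three t-tests using scipy; the file and column names are hypothetical stand-ins for the dataset used in the lab:

import pandas as pd
from scipy import stats

df = pd.read_csv("news.csv")  # hypothetical dataset
fake = df[df["news_type"] == "fake"]["title_length"]
real = df[df["news_type"] == "real"]["title_length"]

# One-sample t-test: is the mean fake title length equal to a known value?
print(stats.ttest_1samp(fake, popmean=7))

# Independent-samples t-test: do fake and real mean title lengths differ?
print(stats.ttest_ind(fake, real, equal_var=False))

# Paired t-test needs two measurements on the same units; equal-length
# slices are used here purely to illustrate the call
n = min(len(fake), len(real))
print(stats.ttest_rel(fake.iloc[:n], real.iloc[:n]))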

OBSERVATION

We observe a statistically significant difference (p-value = 0.01583) between the lengths of news
titles of real and fake news. The title length of fake news is slightly larger than that of real news.
The fake news title length distribution is centered with a mean of 7.83, while the distribution of
title lengths of real news is slightly skewed towards the right with a mean of 7.02.
The t-test gives us evidence that the length of a real news title is significantly shorter than that of
a fake news title.

LEARNING
Key learnings for implementing the T-Test in a program include understanding its applications in
statistical hypothesis testing, considering assumptions such as normality and homogeneity of
variances, implementing the T-Test in a programming language or statistical software,
interpreting results including p-values and confidence intervals, and considering sample size,
power analysis, and effect size for appropriate interpretation and decision-making.

EXPERIMENT – 08
AIM
Write a program to implement the chi-square test.

THEORY
One of the primary tasks involved in any supervised Machine Learning venture is to select the
best features from the given dataset to obtain the best results. One way to select these features is
the Chi-Square Test. Mathematically, a Chi-Square test is done on two distributions to
determine the level of similarity of their respective variances. In its null hypothesis, it assumes
that the given distributions are independent. This test thus can be used to determine the best
features for a given dataset by determining the features on which the output class label is most
dependent.
It involves the use of a contingency table. A Contingency table (also called crosstab) is used in
statistics to summarise the relationship between several categorical variables.

CODE AND OUTPUTS


● Null Hypothesis: There is no relation between News_Type and Article_Subject
● Alternate Hypothesis: There is a significant relation between News_Type and
Article_Subject
1. Importing required Libraries

2. Loading the dataset

3. Information about the dataset

4. Performing Chi-square test
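
A sketch of the test on a contingency table; the column names follow the hypotheses above and the file name is hypothetical:

import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("news.csv")  # hypothetical dataset

# Contingency table (crosstab) of the two categorical variables
table = pd.crosstab(df["News_Type"], df["Article_Subject"])

chi2, p, dof, expected = chi2_contingency(table)
print("chi2 =", chi2, "p =", p, "dof =", dof)
if p < 0.05:
    print("Reject the null hypothesis: the two variables are related")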

LEARNING
Key learnings for implementing the Chi-Square test in a program include understanding its
applications in analyzing categorical data, familiarity with different types of Chi-Square tests,
implementation in a programming language or statistical software, interpretation of results
including chi-square statistic, degrees of freedom, and p-values, and consideration of limitations
and assumptions for appropriate application and interpretation of the Chi-Square test.

EXPERIMENT – 09
AIM
Write a program to implement the Friedman test.

THEORY
The Friedman Test is a non-parametric alternative to the one-way ANOVA with repeated
measures. It tries to determine if subjects changed significantly across occasions/conditions. For
example: whether the problem-solving ability of a set of people is the same or different in the
morning, afternoon, and evening.

CODE AND OUTPUTS


● Null Hypothesis: There is no significant difference in the score values
● Alternate Hypothesis: At least 2 values differ from one another.

1. Importing required Libraries

2. Loading the dataset

3. Information about the dataset

4. Friedman Test
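
A minimal sketch using scipy, with hypothetical scores of the same six subjects measured under three conditions:

from scipy.stats import friedmanchisquare

# Each list holds the same six subjects' scores under one condition
morning = [4, 4, 5, 6, 3, 4]
afternoon = [5, 6, 6, 7, 4, 5]
evening = [6, 7, 7, 8, 6, 6]

stat, p = friedmanchisquare(morning, afternoon, evening)
print("statistic =", stat, "p =", p)
if p < 0.05:
    print("Reject the null hypothesis: at least two conditions differ")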

LEARNING
Key learnings for implementing the Friedman test in a program include understanding its
applications in non-parametric statistical analysis, familiarity with assumptions and requirements
such as repeated measures and ranked data, implementation in a programming language or
statistical software, interpretation of results including Friedman statistic, degrees of freedom, and
p-values, and consideration of appropriate use and limitations of the Friedman test for
comparison of multiple related samples and valid interpretation of results.

EXPERIMENT – 10
AIM
Write a program to implement Wilcoxon Signed Rank Test.

THEORY
Wilcoxon signed-rank test, also known as the Wilcoxon matched-pairs test, is a non-parametric
hypothesis test that compares the medians of two paired groups and tells whether they are
identically distributed or not.
We can use this when:
● Differences between the pairs of data are non-normally distributed.
● The pairs of data are dependent (matched).
CODE AND OUTPUTS
● Null Hypothesis: The groups - title length of fake news and title length of real news are
identically distributed.
● Alternate Hypothesis: The groups - title length of fake news and title length of real news
are not identically distributed.

1. Importing required Libraries

2. Loading the dataset

3. Information about the dataset

4. Wilcoxon Signed Rank Test
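
A sketch using scipy.stats.wilcoxon, which expects two paired samples of equal length; the file and column names are the same hypothetical ones used in the t-test experiment:

import pandas as pd
from scipy.stats import wilcoxon

df = pd.read_csv("news.csv")  # hypothetical dataset
fake = df[df["news_type"] == "fake"]["title_length"]
real = df[df["news_type"] == "real"]["title_length"]

# Truncate to equal length so the samples can be paired for illustration
n = min(len(fake), len(real))
stat, p = wilcoxon(fake.iloc[:n].to_numpy(), real.iloc[:n].to_numpy())
print("statistic =", stat, "p =", p)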

LEARNING
Key learnings for implementing the Wilcoxon Signed Rank test in a program include
understanding its applications in non-parametric statistical analysis, familiarity with assumptions
and requirements such as paired data and ordinal or continuous variables, implementation in a
programming language or statistical software, interpretation of results including test statistic, p-
values, and confidence intervals, and consideration of appropriate use and limitations of the
Wilcoxon Signed Rank test for comparing paired data and valid interpretation of results.

EXPERIMENT – 11
AIM
Write a program to implement the Nemenyi test.

THEORY
The Friedman Test is used to find whether there exists a significant difference between the
means of more than two groups. In such groups, the same subjects show up in each group. If the
p-value of the Friedman test turns out to be statistically significant then we can conduct the
Nemenyi test to find exactly which groups are different. This test is also known as the Nemenyi
post-hoc test.

CODE AND OUTPUTS


● Null Hypothesis: There is no significant difference in the score values
● Alternate Hypothesis: At least 2 values differ from one another.

1. Importing required Libraries

2. Loading the dataset

3. Information about the dataset

4. Friedman Test

5. Nemenyi Test
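
A sketch assuming the scikit-posthocs package is installed; the Friedman test is run first, then the Nemenyi post-hoc test on the same subjects-by-groups matrix (the scores are hypothetical):

import numpy as np
import scikit_posthocs as sp
from scipy.stats import friedmanchisquare

group1 = [4, 4, 5, 6, 3, 4]
group2 = [8, 9, 9, 7, 8, 9]
group3 = [5, 6, 6, 7, 5, 5]

print(friedmanchisquare(group1, group2, group3))

# Rows are subjects (blocks), columns are the groups being compared
data = np.array([group1, group2, group3]).T
print(sp.posthoc_nemenyi_friedman(data))  # pairwise p-values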

OBSERVATION
● From the outputs received, we reject the null hypothesis.
● From the output table we can clearly conclude that the two groups with statistically
significantly different means are Group 1 and Group 2.

LEARNING
Key learnings for implementing the Nemenyi test in a program include understanding its
applications in posthoc analysis of multiple comparison tests, familiarity with requirements and
assumptions such as ranked or continuous data and multiple group comparisons, implementation
in a programming language or statistical software, interpretation of results including critical
difference values and significance levels, and consideration of appropriate use and limitations of
the Nemenyi test for posthoc analysis and valid interpretation of results in the context of
statistical hypothesis testing.

EXPERIMENT – 12
AIM
Write down the threats to validity observed while performing the experiments

THEORY
Threats to validity refer to the factors or conditions that may affect the results or conclusions of a
study

THREATS TO VALIDITY
There are several threats to validity that may arise in a research study on software defect
prediction. Here are some common threats to consider:
● Sampling bias: The sample of data used for analysis may not be representative of the
population of interest. For example, the dataset used may only contain data from a single
organization or software project, which may limit the generalizability of the results.
● Measurement bias: The method used to collect or measure data may introduce bias. For
example, the definition of a software defect may vary between different projects or
organizations, which may affect the accuracy of the predictions.
● Selection bias: The selection of features or predictors used in the analysis may not be
representative of the full range of factors that contribute to software defects. This may
lead to inaccurate predictions or biased results.
● Overfitting: The model used to make predictions may be too complex and fit too closely
to the training data, leading to poor performance when applied to new data. This can be
mitigated by using cross-validation techniques to evaluate the model's performance on
new data.
● Publication bias: There may be a tendency to publish only positive or significant results,
leading to an incomplete or biased picture of the effectiveness of different methods for
software defect prediction.
● External validity: The findings may not be generalizable to other software projects,
organizations, or contexts. It is important to consider the external validity of the study
and to replicate the analysis in different contexts to ensure the results hold across a
broader range of situations.
These are just a few of the potential threats to validity that should be considered when
conducting research on software defect prediction. By carefully considering and addressing these
threats, researchers can increase the rigor and validity of their findings.

LEARNING
Internal validity is the extent to which you can be confident that a cause-and-effect relationship
established in a study cannot be explained by other factors.

EXPERIMENT – 13
AIM
Explore tools such as WEKA and KEIL

THEORY
WEKA (Waikato Environment for Knowledge Analysis) is an open-source machine learning
workbench developed at the University of Waikato. It provides a collection of data
preprocessing, classification, regression, clustering and visualization tools, accessible through
graphical user interfaces, for data mining and predictive modelling.
Keil MDK is the complete software development environment for a range of Arm Cortex-M-
based microcontroller devices. MDK includes the µVision IDE and debugger, Arm C/C++
compiler, and essential middleware components. It supports all silicon vendors with more than
9,500 devices and is easy to learn and use.

CODE AND OUTPUTS


WEKA

Using WEKA one can find many tabs to perform various functions. The main tabs on the home
screen are Explorer, Experimenter, KnowledgeFlow, Workbench and Simple CLI.

The Explorer tab offers various tools, such as preprocessing, classification and clustering.
Preprocessing tools can be used to clean the data and remove redundant attributes, and the
classification tools give access to algorithms that help in classifying data into various
categories.

Next, one can load a dataset and visualize various graphs of its parameters. One can choose the
parameters and the way the graph is displayed.

KEIL

KEIL provides an integrated development environment, RTOS, middleware, as well as debug
adapters and evaluation boards for Arm Cortex®-M based devices.

KEIL has various tools, such as a Logic Analyzer that helps in examining signal behaviour
during debugging, a command-line window where commands can be entered to carry out tasks
quickly, and many others.

The run-time environment can be managed easily using the run-time environment interface,
which controls the software components that are included in the project and used at run time.

LEARNING
WEKA: an open-source collection of machine learning algorithms and data preprocessing tools,
with graphical interfaces for applying them directly to datasets.
Keil MDK is the complete software development environment for a range of Arm Cortex-M-
based microcontroller devices.

EXPERIMENT – 14
AIM
Explore Python and R.

THEORY
Python: It is a very popular general-purpose interpreted, interactive, object-oriented, and high-
level programming language. Python is a dynamically-typed and garbage-collected programming
language. It supports functional and structured programming methods as well as OOP. It can be
used as a scripting language or can be compiled to byte code for building large applications. It
provides very high-level dynamic data types and supports dynamic type checking.
R: It is a great resource for data analysis, data visualization, data science and machine learning.
It provides many statistical techniques (such as statistical tests, classification, clustering and data
reduction). It is easy to draw graphs in R, like pie charts, histograms, box plots, scatter plots,
etc. It works on different platforms (Windows, Mac, Linux). It is open-source and free. It has
many packages (libraries of functions) that can be used to solve different problems.

CODE AND OUTPUTS


Python

1. Print “Hello World”

2. Program for a “for loop”

3. Program for finding the sum of a list of numbers

4. Program for finding values of exponents

5. Program for finding square root

6. Program for finding mean

7. Program for finding median

8. Program for finding mode

9. Plotting a scatter plot

10. Plotting a histogram
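
A compact sketch covering programs 1-10 (the sample numbers are illustrative):

import math
import statistics
import matplotlib.pyplot as plt

print("Hello World")                 # 1. print

for i in range(5):                   # 2. for loop
    print(i)

nums = [2, 4, 4, 4, 5, 5, 7, 9]
print(sum(nums))                     # 3. sum of a list
print(2 ** 10)                       # 4. exponent
print(math.sqrt(16))                 # 5. square root
print(statistics.mean(nums))         # 6. mean
print(statistics.median(nums))       # 7. median
print(statistics.mode(nums))         # 8. mode

plt.scatter([1, 3, 5, 7, 9], [2, 4, 6, 8, 10])   # 9. scatter plot
plt.show()

plt.hist(nums)                       # 10. histogram
plt.show()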

R
1. Creating a row vector
> x=c(1,2,3,4,5,6)
>x
[1] 1 2 3 4 5 6
2. Summation
> sum(x)
[1] 21
3. Mean
> mean(x)
[1] 3.5
4. Median
> median(x)
[1] 3.5
5. Square root
> sqrt(x)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490
6. Squaring
> x^2
[1] 1 4 9 16 25 36
7. Creating sequence
> seq(1,10)
[1] 1 2 3 4 5 6 7 8 9 10
8. Creating histogram of sequence
> x= c(2,4,4,6,6,5,5,7,3,7,3,8,9,7,9,6,4,3,4,4,6,2,2,1,2,4,6,6,8)
> hist(x)

9. Creating scatter plot
> x=c(1,3,5,7,9)
> y=c(2,4,6,8,10)
> plot(x,y)

10. Making a time plot


> plot(x,type="b")

LEARNING
By means of this experiment we got to know about Python and some of its libraries like math,
matplotlib etc. We also learnt how to perform some basic operations in Python like print, loops,
summation, mean, median, and mode including plotting a histogram and scatter plot using the
matplotlib library.
R is a language used for statistical computations, data analysis and graphical representation of
data. After performing this experiment we are able to work with packages of R and represent
output in visual forms.

