ESE Lab File
INDEX
S.No. Aim Date Page No.
1. Collecting Empirical Studies 3-5
EXPERIMENT - 01
AIM
Collecting Empirical Studies
THEORY
Empirical research is research that is based on the observation and measurement of phenomena,
as directly experienced by the researcher. The data thus gathered may be compared against a
theory or hypothesis, but the results are still based on real-life experience.
An example of empirical analysis would be a researcher interested in finding out whether
listening to happy music promotes prosocial behaviour. An experiment could be conducted in
which one group of participants is exposed to happy music and the other is not exposed to any
music.
TABLE - 1
S.No. | Title | Year | Authors | Venue
… | … Based Transfer Learning | | | Internetware
10 | Cross-Project Software Defect Prediction Based on Feature Selection and Transfer Learning | 2020 | Tianwei Lee, Jingfeng Xue, Weijie Han | International Conference on Machine Learning for Cyber Security
EXPERIMENT - 02
AIM
Identify research gaps from the empirical studies. Collection of datasets from open source
repositories.
THEORY
A research gap is, simply, a topic or area for which missing or insufficient information limits the
ability to reach a conclusion for a question.
A research question is a question that a study or research project aims to answer. This question
often addresses an issue or a problem, which, through analysis and interpretation of data, is
answered in the study's conclusion.
TABLE - 2 (Research Gaps)
1. Software Defect Prediction based on Federated Transfer Learning
Lack of benchmark datasets: One of the major challenges in FTL for SDP is the lack of benchmark datasets. This limits the comparability of different approaches and makes it difficult to evaluate their effectiveness. Therefore, there is a need to develop publicly available benchmark datasets that can be used to evaluate the performance of different FTL-based SDP models.
2. Defect Prediction method based on Federated Transfer Learning and Knowledge Distillation
Model generalization: FTL and KD are intended to improve the generalization of the model across different data sources. However, there is a need to investigate how to optimize the transfer of knowledge from the source model to the target model to improve the generalization.
4. Software Defect Prediction Using Feature-Based Transfer Learning
Real-world applicability: The proposed tool needs to be tested in real-world settings to evaluate its effectiveness in practice. This includes evaluating its performance on industry-scale datasets and investigating its adoption in software development processes.
5. An Empirical Study on Transfer Learning for Software Defect Prediction
Comparison with other techniques: The proposed tool is not compared with other state-of-the-art techniques in software defect prediction. Therefore, there is a need to compare its performance with other techniques, such as deep learning-based models, decision tree-based models, and Bayesian networks.
7. Cross-Project Defect Prediction via Balanced Distribution Adaptation-Based Transfer Learning
Feature selection: The study does not consider feature selection techniques to identify the most relevant features for defect prediction. Investigating the effectiveness of feature selection techniques, such as wrapper, filter, and embedded methods, can improve the accuracy and generalization of the proposed approach.
8. Deep Learning for Software Defect Prediction: A Survey
Model interpretability: The proposed approach uses a black-box model, which can be difficult to interpret. Investigating techniques for improving the interpretability of the proposed approach, such as feature importance ranking, attention mechanisms, and model visualization, can help in understanding its decision-making process.
… expertise. There is a need to investigate the most effective ways to incorporate such knowledge into transfer learning models.
13. Transfer learning for cross-company software defect prediction
Lack of standardized datasets: One of the key challenges in transfer learning for cross-company software defect prediction is the availability of standardized datasets that can be used for evaluation purposes. There is a need for more standardized datasets that can be used to compare the performance of different transfer learning algorithms.
15. Transfer Learning Code Vectorizer-based Machine Learning Models for Software Defect Prediction
Code vectorization is a critical step in the process of using machine learning models for software defect prediction. There are several different techniques for vectorizing code, including bag-of-words, n-grams, and deep learning-based approaches. However, there still needs to be more research comparing the effectiveness of varying code vectorization techniques in the context of transfer learning. Further research could investigate the impact of different code vectorization techniques on the performance of transfer learning-based machine learning models for software defect prediction.
TABLE - 3 (Research Questions)
2 What is the impact of a defect prediction approach that utilizes federated transfer
learning and knowledge distillation in improving the performance of software defect
prediction models?
3 What is the current state of research on deep transfer learning for defect prediction,
and how effective is it compared to traditional defect prediction models?
5 What is the effectiveness of transfer learning in predicting software defects, and how
does it compare to traditional machine learning techniques?
7 What are the optimal strategies for selecting and aggregating data from multiple
organizations in federated transfer learning for software defect prediction, considering
varying data distributions and privacy constraints?
8 What are the current state-of-the-art deep learning techniques for software defect
prediction, and how do they compare in terms of their effectiveness and limitations?
9 What are the current trends, techniques, and challenges in software defect prediction
using deep learning?
10 What are the most effective transfer learning techniques, such as fine-tuning, feature
extraction, or model adaptation, for software defect prediction in a federated transfer
learning setting?
12 What are the challenges and limitations of federated transfer learning for software
defect prediction, such as communication efficiency, model convergence, and privacy
concerns, and how can they be addressed?
13 What are the privacy implications of using federated transfer learning for software
defect prediction, and how can these concerns be addressed?
14 What are the best practices for designing and training federated transfer learning
models for software defect prediction, and how can these models be effectively
deployed in real-world scenarios?
TABLE - 4 (Answers)
S. No. Answers
1. Federated transfer learning has the potential to improve the accuracy of software defect
prediction models by leveraging the collective knowledge of multiple organizations while
ensuring the privacy and security of their data. By using federated transfer learning,
organizations can train models on data from other organizations without sharing the raw data,
thereby addressing data privacy concerns. Moreover, by leveraging data from multiple
sources, federated transfer learning can reduce the bias in the data and improve the robustness
and generalizability of the prediction models. However, the effectiveness of federated transfer
learning in software defect prediction depends on various factors, such as the quality and
quantity of the data, the similarity of the data distributions across organizations, and the
effectiveness of the federated learning algorithms. Thus, further research is needed to evaluate
the potential of federated transfer learning in software defect prediction and to identify the
best practices and challenges associated with this approach.
2. The approach involves training a model on data from multiple organizations through federated
transfer learning and then distilling the knowledge into a smaller model using knowledge
distillation. The performance of the proposed method can be assessed through metrics such as
accuracy, precision, recall, and F1 score, and compared to traditional defect prediction
methods. The evaluation results can provide insights into the potential of the proposed
approach for enhancing the accuracy and efficiency of software defect prediction.
3. Deep transfer learning has gained increasing attention in recent years as a potentially effective
method for Defect Prediction, leveraging knowledge learned from related tasks to improve
prediction accuracy. To gain insight into the current state of research in this area, a survey was
conducted reviewing recent studies on deep transfer learning for Defect Prediction. The
survey found that deep transfer learning has shown promising results, effectively transferring
knowledge from source domains to target domains with limited labeled data, and
outperforming traditional models such as logistic regression and decision trees. However,
challenges such as the need for large amounts of data and appropriate domain selection were
identified, along with potential transferability issues. Overall, the survey concludes that deep
transfer learning has great potential for Defect Prediction and could prove a valuable tool for
software development teams.
4. Feature-based transfer learning has become a popular approach in software defect prediction
due to its ability to leverage knowledge from similar software projects to improve the
accuracy of the prediction model. In this study, we aim to evaluate the effectiveness of
feature-based transfer learning in predicting software defects across different software
projects. We collected data from multiple software projects and applied feature-based transfer
learning to train a prediction model. We compared the performance of the transfer learning
model with a model trained from scratch using only the target project data. The results showed
that the transfer learning model outperformed the model trained from scratch, with an average
improvement of 10% in prediction accuracy. Our findings suggest that feature-based transfer
learning can be an effective approach to improve the accuracy of software defect prediction
models when training data is limited or when data is available from similar projects.
5. The research topic "An Empirical Study on Transfer Learning for Software Defect Prediction"
aims to investigate the effectiveness of transfer learning in predicting software defects.
Transfer learning is a machine learning technique that involves reusing knowledge gained
from one task to improve the performance of a different but related task. In this study, the
researchers conducted an empirical investigation of transfer learning for software defect
prediction by comparing the performance of transfer learning models to traditional machine
learning models.
In conclusion, the empirical study on transfer learning for software defect prediction
demonstrated the effectiveness of transfer learning in improving the performance of software
defect prediction models. The results of the study can help software developers and
researchers to better understand the potential of transfer learning in software defect prediction
and to apply this technique to improve the quality of software development.
6. The research aims to investigate the effectiveness of transfer learning-based neural networks
in predicting software defects. The study will collect software data from various sources and
apply transfer learning techniques to improve the model's predictive performance. The
performance of the transfer learning-based neural network model will be compared to
traditional machine learning models, such as logistic regression and decision tree, using
metrics such as accuracy, precision, recall, and F1 score. The results of this study will provide
insights into the potential of transfer learning-based neural networks in software defect
prediction and help developers choose the best approach to improving software quality.
8. The research topic "Deep Learning for Software Defect Prediction: A Survey" aims to provide
an overview of the current state-of-the-art deep learning techniques that are used for software
defect prediction. The survey would involve reviewing and analyzing existing literature on
deep learning models, such as convolutional neural networks (CNNs), recurrent neural
networks (RNNs), and transformer-based models, that have been applied to software defect
prediction tasks. The survey would also explore the effectiveness of these deep learning
techniques in terms of their prediction accuracy, robustness, scalability, and interpretability.
Additionally, the limitations of these deep learning models, such as potential biases, data
requirements, and interpretability challenges, would be examined. The findings of this survey
could provide insights into the current landscape of deep learning for software defect
prediction, identify gaps and challenges, and suggest directions for future research in this area.
9. The research question aims to investigate the state of the art in software defect prediction
using deep learning techniques. This would involve conducting a survey to explore the current
trends and practices in the field, including the types of deep learning models being used, the
datasets and features employed, and the evaluation metrics used for performance assessment.
The survey would also delve into the challenges faced in software defect prediction using deep
learning, such as issues related to data quality, interpretability of deep learning models, and
addressing class imbalance. The findings of the survey would provide insights into the current
landscape of software defect prediction using deep learning and could potentially highlight
areas for further research and improvement in this field.
10. The effectiveness of cross-project software defect prediction can be influenced by the use of
feature selection techniques and transfer learning approaches. Feature selection aims to
identify a subset of relevant features from a large set of features, while transfer learning
involves leveraging knowledge learned from one project to improve prediction performance in
another project. Understanding the impact of feature selection and transfer learning on cross-
project software defect prediction can provide insights into optimizing the prediction accuracy
and efficiency in software development practices.
11. The impact of homogeneous transfer learning on defect prediction in software development
can vary depending on several factors. Homogeneous transfer learning involves transferring
knowledge or models from a source domain to a target domain within the same organization
or software project, without considering differences in data distributions or privacy
constraints. The effectiveness of homogeneous transfer learning for defect prediction can be
evaluated through empirical research that compares the performance of transferred models
with baseline models trained only on the target domain data or models trained from scratch.
12. The inclusion of diverse data sources can enrich the feature representation of the software data
used for training the federated transfer learning models. Code metrics, which provide
quantitative measures of software code quality and complexity, can capture structural and
functional characteristics of the codebase. Developer comments, which contain valuable
insights and contextual information about the code, can provide additional contextual clues
that are not present in the code itself. User feedback, such as bug reports or customer
feedback, can provide real-world usage information and highlight potential defects or issues
that may not be captured by other data sources. Incorporating such diverse data sources into
the federated transfer learning process can result in more comprehensive and informative
feature representations, potentially leading to improved predictive performance.
13. One significant privacy concern is the potential leakage of sensitive information during the
federated transfer learning process. When data from different organizations are combined for
training a shared model, there is a risk of exposing sensitive information about the
organizations, their software development practices, or their customers. This can include
proprietary or confidential information, intellectual property, customer data, or other sensitive
data that organizations may not want to share with others.
Another privacy concern is the potential violation of data privacy regulations or legal
requirements. Organizations may be subject to various data protection laws, such as the
General Data Protection Regulation (GDPR) in the European Union, which require them to
comply with strict rules and regulations regarding the collection, storage, and processing of
personal data. Federated transfer learning may involve transferring data across organizational
boundaries, which can raise compliance issues with these data protection laws, especially if
the data used for training the shared model contain personal or sensitive information.
14. Designing and training federated transfer learning models for software defect prediction
requires careful consideration of various best practices to ensure effective performance and
deployment in real-world scenarios. Best practices for designing and training federated
transfer learning models for software defect prediction include careful selection of
participating organizations, thorough data preprocessing, appropriate transfer learning
techniques, efficient and secure model training, and considerations for real-world deployment.
Adhering to these best practices can help ensure the effectiveness and practicality of federated
transfer learning models for software defect prediction in real-world scenarios.
10. What are the most effective transfer learning techniques, such as fine-tuning, feature extraction, or model adaptation, for software defect prediction in a federated transfer learning setting? (Studies: 1, 4, 7, 14, 15)
14. What are the best practices for designing and training federated transfer learning models for software defect prediction, and how can these models be effectively deployed in real-world scenarios? (Studies: 1-15)
LEARNING
The research question is written so that it outlines various aspects of the study, including the
population and variables to be studied and the problem the study addresses.
EXPERIMENT - 03
AIM
Write a program to perform an exploratory analysis of the dataset.
THEORY
Exploratory Data Analysis (EDA) is an approach to data analysis using visual techniques. It is
used to discover trends, patterns, or to check assumptions with the help of statistical summaries
and graphical representations.
CODE AND OUTPUTS
Read the dataset and print the first five rows using the head() function.
Use the info() method to know about the columns and their data types.
Let’s check if there are any missing values in our dataset or not.
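A minimal sketch of these first steps, assuming the data are loaded from a CSV file with pandas (the file name defects.csv is only illustrative):

import pandas as pd

# Load the dataset (file name is illustrative)
df = pd.read_csv("defects.csv")

# Print the first five rows
print(df.head())

# Column names, data types and non-null counts
df.info()

# Number of missing values in each column
print(df.isnull().sum())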
Data visualization
It is the process of analyzing data in the form of graphs or maps, making it a lot easier to understand the
trends or patterns in the data. There are various types of visualizations: univariate, bivariate and
multivariate analysis.
Histogram: It can be used for both univariate and bivariate analysis.
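As an illustration, a histogram of one numeric column and a bivariate scatter plot can be drawn with matplotlib and seaborn; the column names loc, wmc and bug used below are only assumptions about the dataset:

import matplotlib.pyplot as plt
import seaborn as sns

# Univariate analysis: histogram of a single numeric feature (column name assumed)
df["loc"].plot(kind="hist", bins=30, title="Distribution of loc")
plt.xlabel("loc")
plt.show()

# Bivariate analysis: scatter plot of two features coloured by the target (names assumed)
sns.scatterplot(data=df, x="loc", y="wmc", hue="bug")
plt.show()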
Handling Outliers
An outlier is a data item/object that deviates significantly from the rest of the (so-called
normal) objects.
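A common way to detect and treat outliers is the interquartile range (IQR) rule; the sketch below applies it to one assumed numeric column of the same DataFrame:

# IQR-based outlier handling for one numeric column (column name assumed)
q1 = df["loc"].quantile(0.25)
q3 = df["loc"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df["loc"] < lower) | (df["loc"] > upper)]
print("Number of outliers:", len(outliers))

# Either drop the outlier rows ...
df_clean = df[(df["loc"] >= lower) & (df["loc"] <= upper)]

# ... or cap (winsorize) the values instead of dropping them
df["loc"] = df["loc"].clip(lower=lower, upper=upper)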
LEARNING
By means of this experiment we learnt about exploratory data analysis and how it is performed on various
datasets. The exploratory analysis involves understanding, visualizing, and preprocessing data to identify
patterns and trends, validate hypotheses, and ensure data quality, forming a foundation for further analysis
or modeling.
EXPERIMENT - 04
AIM
Write a program to perform feature reduction techniques for the collected dataset.
a. Correlation-based feature evaluation
b. Relief attribute feature evaluation
c. Information gain feature evaluation
d. Principal Component Analysis
THEORY
A feature reduction technique is a way of converting a higher-dimensional dataset into a
lower-dimensional one while ensuring that it provides similar information.
Principal Component Analysis is a statistical procedure to convert a set of observations of
possibly correlated variables into a set of values of linearly uncorrelated variables.
CFS (Correlation-based Feature Selection) is an algorithm that couples a correlation-based
merit evaluation formula with an appropriate correlation measure and a heuristic search strategy.
Relief is an algorithm that takes a filter-method approach to feature selection that is notably
sensitive to feature interactions.
Information Gain is defined as the amount of information provided by the feature items for the
text category.
2. Distributing the dataset into X and y components for data analysis. Splitting the dataset
into the Training set and Test set
3. Feature Scaling: performing the pre-processing on the training and testing sets, such as
fitting the StandardScaler.
4. Applying the PCA function to the training and testing sets for analysis.
5. Plotting heatmap
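A sketch of steps 2-5, assuming a pandas DataFrame df whose last column is the defect label (the column layout is an assumption); the heatmap of step 5 is shown in the correlation-based section below:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Step 2: split into features (X) and target (y), then into training and test sets
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 3: feature scaling (fit the scaler on the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 4: apply PCA, keeping two principal components for illustration
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
print("Explained variance ratio:", pca.explained_variance_ratio_)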
CORRELATION-BASED
Correlation states how the features are related to each other or to the target variable. A heatmap
makes it easy to identify which features are most related to the target variable; we will plot the
heatmap of correlated features using the Seaborn library, as sketched below.
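A sketch of the heatmap step, reusing the DataFrame df from above:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numeric features (and the target, if numeric)
corr = df.corr()

# Heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation heatmap")
plt.show()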
INFORMATION GAIN
1. The unique() function finds the unique elements of an array and returns these unique
elements as a sorted array
2. Splitting the dataset into the Training set and Test set
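After the split, the information gain of each feature can be estimated with mutual information; a sketch using scikit-learn, where X, X_train and y_train are assumed from the earlier split:

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Information gain (mutual information) of each feature with respect to the class label
scores = mutual_info_classif(X_train, y_train, random_state=0)
info_gain = pd.Series(scores, index=X.columns).sort_values(ascending=False)
print(info_gain)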
RELIEF ATTRIBUTE
The main focus of this section is the Relief (RReliefF) algorithm, but let's spend some time on the
data preprocessing to make our job easier.
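A sketch of Relief-based feature scoring, assuming the scikit-rebate package (skrebate) is installed and reusing the feature DataFrame X and labels y from above:

from skrebate import ReliefF

# ReliefF scores each feature by how well it separates near neighbours of different classes
relief = ReliefF(n_neighbors=10)
relief.fit(X.values, y.values)

for name, score in sorted(zip(X.columns, relief.feature_importances_),
                          key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.4f}")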
LEARNING
The following are key learnings for performing feature reduction techniques on a collected dataset. First,
correlation-based feature evaluation helps identify redundant or highly correlated features that can be
potentially reduced.
Second, relief attribute feature evaluation using algorithms like ReliefF or SURF assesses the relevance of
features based on their contribution to the prediction task.
Third, information gain feature evaluation measures the predictive power of features using entropy or
information gain. Lastly, Principal Component Analysis (PCA) can effectively reduce dimensionality by
projecting the dataset onto a lower-dimensional space while retaining the most important features.
Experimenting with different techniques and selecting the most appropriate one based on the specific
dataset and prediction task is crucial for successful feature reduction.
EXPERIMENT - 05
AIM
Develop a machine learning model for the selected topic (minimum 10 datasets and 10
techniques).
THEORY
SVM: A support vector machine is a supervised machine learning algorithm that can be used for
the classification or regression of data groups.
Logistic Regression: Logistic regression is a supervised learning classification algorithm used to
predict the probability of a target variable.
Naive Bayes: Naïve Bayes algorithm is a supervised learning algorithm, which is based on the
Bayes theorem and used for solving classification problems.
Decision Tree: It is a tree-structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf node represents the outcome.
Random Forest: It is a classifier that contains a number of decision trees on various subsets of
the given dataset and takes the average to improve the predictive accuracy of that dataset.
XGBoost: XGBoost is an optimized gradient boosting algorithm; its implementation uses fast
parallel prefix sum operations to scan through all possible splits, as well as parallel radix sorting
to repartition data.
KNN: KNN is a non-parametric, supervised learning classifier, which uses proximity to make
classifications or predictions about the grouping of an individual data point.
LSTM: It is a variety of recurrent neural networks (RNNs) that are capable of learning long-
term dependencies, especially in sequence prediction problems.
CatBoost: CatBoost is an algorithm for gradient boosting on decision trees.
ANN: An artificial neural network is an attempt to simulate the network of neurons that make up
a human brain so that the computer will be able to learn things and make decisions.
CODE AND OUTPUT
1. Dataset: ant-1.3
● Importing the Libraries
● Training the data
● LOGISTIC REGRESSION
● SVM
● NAIVE BAYES
● DECISION TREE
● RANDOM FOREST
● XGBOOST
● KNN
● CATBOOST
● LSTM
● ANN
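A sketch of the general pipeline that is repeated for every dataset and classifier, using the scikit-learn models as an example; the CSV file name, the bug label column, and the 80/20 split are assumptions, and XGBoost, CatBoost, LSTM and ANN follow the same pattern with their respective libraries (xgboost, catboost, keras):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

# Load one defect dataset and binarise the bug count into a 0/1 defect label
df = pd.read_csv("ant-1.3.csv")                    # file name assumed
X = df.select_dtypes("number").drop(columns=["bug"])
y = (df["bug"] > 0).astype(int)                    # label column name assumed

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Scale the features (fit on the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train and evaluate each classical model with the same split
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, pred):.3f}, "
          f"F1={f1_score(y_test, pred):.3f}")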
2. Dataset: ant-1.4
● Importing the Libraries
● Loading the dataset
● LOGISTIC REGRESSION
● SVM
● NAIVE BAYES
● DECISION TREE
● RANDOM FOREST
● XGBOOST
● KNN
● CATBOOST
● LSTM
● ANN
3. Dataset: camel-1.0
● Importing the Libraries
● Training the data
● LOGISTIC REGRESSION
● SVM
● NAIVE BAYES
● DECISION TREE
● RANDOM FOREST
● XGBOOST
● KNN
● CATBOOST
● LSTM
● ANN
4. Dataset: camel-1.2
● Importing the Libraries
● Loading the dataset
● LOGISTIC REGRESSION
● SVM
● NAIVE BAYES
● DECISION TREE
● RANDOM FOREST
● XGBOOST
● KNN
● CATBOOST
● LSTM
● ANN
5. Dataset: ivy-1.1
● Importing the Libraries
● LOGISTIC REGRESSION
● SVM
● NAIVE BAYES
● DECISION TREE
● RANDOM FOREST
● XGBOOST
● KNN
● CATBOOST
● LSTM
● ANN
6. Dataset: ivy-2.0
● Importing the Libraries
● LOGISTIC REGRESSION
● SVM
● NAIVE BAYES
● DECISION TREE
● RANDOM FOREST
● XGBOOST
● KNN
● CATBOOST
● LSTM
● ANN
7. Dataset: jedit-3.2
● Importing the Libraries
● Training the data
● LOGISTIC REGRESSION
● SVM
● NAIVE BAYES
● DECISION TREE
● RANDOM FOREST
● XGBOOST
● KNN
● CATBOOST
● LSTM
● ANN
8. Dataset: log4j-1.0
● Importing the Libraries
● LOGISTIC REGRESSION
● SVM
● NAIVE BAYES
● DECISION TREE
● RANDOM FOREST
● XGBOOST
● KNN
● CATBOOST
● LSTM
● ANN
9. Dataset: lucene-2.0
● Importing the Libraries
● Training the data
● LOGISTIC REGRESSION
● SVM
● NAIVE BAYES
● DECISION TREE
● RANDOM FOREST
● XGBOOST
● KNN
● CATBOOST
● LSTM
● ANN
10. Dataset: synapse-1.0
● Importing the Libraries
● LOGISTIC REGRESSION
● SVM
● NAIVE BAYES
● DECISION TREE
● RANDOM FOREST
● XGBOOST
● KNN
● CATBOOST
● LSTM
● ANN
EXPERIMENT - 06
AIM
For the model developed in Experiment No. 5:
1. State the hypothesis.
2. Formulate an analysis plan.
3. Analyse the sample data.
4. Interpret results.
5. Estimate type-I and type-II error
INTRODUCTION
State the hypothesis: The hypothesis is a statement or assumption that is being tested using a
machine learning model. In machine learning, the hypothesis is usually framed as a predictive
model that maps input variables to output variables.
Formulate an analysis plan: The analysis plan outlines the steps that will be taken to test the
hypothesis. This includes selecting a suitable machine learning algorithm, collecting and
preparing the data, training and testing the model, and evaluating its performance. The plan
should also specify any statistical tests or metrics that will be used to assess the model's
accuracy.
Analyze the sample data: The sample data is used to train and test the machine learning model.
This involves feeding the input variables into the model and comparing the predicted output to
the actual output.
Interpret results: The results of the analysis are used to draw conclusions about the hypothesis
being tested. If the model performs well on the sample data, it may be considered a good
predictor of the outcome variable.
Estimate type-I and type-II error: Type-I error, also known as a false positive, occurs when
the model incorrectly predicts a positive outcome when the actual outcome is negative. Type-II
error, also known as a false negative, occurs when the model incorrectly predicts a negative
outcome when the actual outcome is positive.
OUTPUT
1. State the hypothesis.
● The linguistic and contextual features of news articles can be used to predict whether an
article is likely to contain false information.
● Machine learning models trained on this dataset can accurately classify news articles as
true or false based on their content and metadata.
● Supervised learning approach that utilizes multiple types of features, such as linguistic
features (e.g., sentiment analysis, part-of-speech tagging) and contextual features (e.g.,
source credibility, temporal and social signals), can lead to an accurate and robust fake
news detection system.
2. Formulate an analysis plan.
The analysis plan can be described by the following steps:
1. Importing dataset: The data analysis pipeline begins with the import or creation of a
working dataset. The exploratory analysis phase begins immediately after. Importing a
dataset is simple with Pandas through functions dedicated to reading the data.
3. Understanding the variables: While in the previous point we describe the dataset in its
entirety, now we try to accurately describe each variable that interests us. For this reason,
this step can also be called univariate analysis.
4. Modelling: At the end of the process, we will be able to consolidate a business report or
continue with the data modelling phase. We would be using Logistic Regression,
Decision Tree Classifier, Random Forest Classifier, Gradient Boosting, and Support
Vector Machine for modelling the dataset.
5. Interpreting the results: The results of the analysis are used to draw conclusions about the
hypothesis being tested. If the model performs well on the sample data, it may be
considered a good predictor of the outcome variable. However, the model's accuracy may
need to be validated on new, unseen data to ensure that it is generalizable.
Data Cleaning
Missing values
Logistic Regression
Gradient Boosting
SVM
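A sketch of the modelling and error-estimation steps on a labelled fake-news dataset; the file name, the text and label column names, the assumption that the label is encoded as 1 (fake) and 0 (real), and the TF-IDF representation are all illustrative choices:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

df = pd.read_csv("news.csv")                      # file name assumed
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=0)

# Turn the raw article text into TF-IDF features
vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train one of the planned models (logistic regression shown; the others follow the same pattern)
model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)
pred = model.predict(X_test_tfidf)

# Estimate type-I (false positive) and type-II (false negative) error rates
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print("Type-I error rate (false positive rate):", fp / (fp + tn))
print("Type-II error rate (false negative rate):", fn / (fn + tp))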
LEARNING
A Type I error is a false positive conclusion, while a Type II error is a false negative conclusion.
EXPERIMENT – 07
AIM
Write a program to implement the t-test.
THEORY
A t-test is a type of inferential statistic used to determine if there is a significant difference
between the means of two groups, which may be related to certain features.
There are three types of t-tests, and they are categorized as dependent and independent t-tests.
1. Independent samples t-test: compares the means for two groups.
2. Paired sample t-test: compares means from the same group at different times (say, one
year apart).
3. One sample t-test: tests the mean of a single group against a known mean.
4. Selecting Features
Observation: The p-value is small (less than 0.05) for all the features; hence the null hypothesis is
rejected, which implies the group mean is not the same for all categories.
Null Hypothesis: The difference in mean values of title length of fake news and title length of
real news is 0.
Alternate Hypothesis: The difference in mean values of the title length of fake news and the title
length of real news is not 0.
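A sketch of the corresponding independent-samples t-test with SciPy, assuming the fake-news DataFrame df from the previous experiment, with a title column and a label column encoded as 1 (fake) and 0 (real) (these names are assumptions):

from scipy import stats

# Title length in words for each group (column names are assumptions)
df["title_len"] = df["title"].str.split().str.len()
fake_len = df.loc[df["label"] == 1, "title_len"]
real_len = df.loc[df["label"] == 0, "title_len"]

# Independent two-sample t-test on the mean title lengths
t_stat, p_value = stats.ttest_ind(fake_len, real_len, equal_var=False)
print("t-statistic:", t_stat)
print("p-value:", p_value)

# Decide at the 5% significance level
if p_value < 0.05:
    print("Reject H0: the mean title lengths differ significantly")
else:
    print("Fail to reject H0")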
OBSERVATION
We observe a statistically significant difference (p-value = 0.01583) between the lengths of news
titles of real and fake news. The title length of fake news is slightly larger than that of real news.
The fake news title length distribution is centred on a mean of 7.83, while the distribution of the
title length of real news is slightly skewed towards the right with a mean of 7.02.
The t-test gives us evidence that the length of a real news title is significantly shorter than that of
a fake news title.
LEARNING
Key learnings for implementing the T-Test in a program include understanding its applications in
statistical hypothesis testing, considering assumptions such as normality and homogeneity of
variances, implementing the T-Test in a programming language or statistical software,
interpreting results including p-values and confidence intervals, and considering sample size,
power analysis, and effect size for appropriate interpretation and decision-making.
EXPERIMENT – 08
AIM
Write a program to implement the chi-square test.
THEORY
One of the primary tasks involved in any supervised Machine Learning venture is to select the
best features from the given dataset to obtain the best results. One way to select these features is
the Chi-Square Test. Mathematically, a Chi-Square test is done on two distributions to
determine the level of similarity of their respective variances. In its null hypothesis, it assumes
that the given distributions are independent. This test thus can be used to determine the best
features for a given dataset by determining the features on which the output class label is most
dependent.
It involves the use of a contingency table. A Contingency table (also called crosstab) is used in
statistics to summarise the relationship between several categorical variables.
3. Information about the dataset
4. Performing Chi-square test
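A sketch of a chi-square test of independence on a contingency table using SciPy, assuming the fake-news DataFrame df used earlier; the two categorical columns (label and subject) are assumptions about the dataset:

import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table (crosstab) of two categorical variables
table = pd.crosstab(df["label"], df["subject"])

# Chi-square test of independence
chi2, p_value, dof, expected = chi2_contingency(table)
print("Chi-square statistic:", chi2)
print("Degrees of freedom:", dof)
print("p-value:", p_value)

if p_value < 0.05:
    print("Reject H0: the variables are not independent")
else:
    print("Fail to reject H0: no evidence against independence")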
LEARNING
Key learnings for implementing the Chi-Square test in a program include understanding its
applications in analyzing categorical data, familiarity with different types of Chi-Square tests,
implementation in a programming language or statistical software, interpretation of results
including chi-square statistic, degrees of freedom, and p-values, and consideration of limitations
and assumptions for appropriate application and interpretation of the Chi-Square test.
EXPERIMENT – 09
AIM
Write a program to implement the Friedman test.
THEORY
The Friedman Test is a non-parametric alternative to the one-way ANOVA with repeated
measures. It tries to determine whether subjects changed significantly across occasions/conditions.
For example: whether the problem-solving ability of a set of people is the same or different in the
morning, afternoon, and evening.
4. Friedman Test
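A sketch with SciPy, using the morning/afternoon/evening example from the theory (the scores are made-up illustrative data):

from scipy.stats import friedmanchisquare

# Problem-solving scores of the same ten subjects measured on three occasions (illustrative data)
morning   = [4, 6, 3, 4, 3, 2, 2, 7, 6, 5]
afternoon = [5, 6, 8, 7, 7, 8, 4, 6, 4, 5]
evening   = [2, 2, 1, 3, 2, 3, 4, 1, 3, 2]

stat, p_value = friedmanchisquare(morning, afternoon, evening)
print("Friedman statistic:", stat)
print("p-value:", p_value)

if p_value < 0.05:
    print("Reject H0: performance differs across the three occasions")
else:
    print("Fail to reject H0")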
LEARNING
Key learnings for implementing the Friedman test in a program include understanding its
applications in non-parametric statistical analysis, familiarity with assumptions and requirements
such as repeated measures and ranked data, implementation in a programming language or
statistical software, interpretation of results including Friedman statistic, degrees of freedom, and
p-values, and consideration of appropriate use and limitations of the Friedman test for
comparison of multiple related samples and valid interpretation of results.
EXPERIMENT – 10
AIM
Write a program to implement Wilcoxon Signed Rank Test.
THEORY
Wilcoxon signed-rank test, also known as Wilcoxon matched pair test is a non-parametric
hypothesis test that compares the median of two paired groups and tells if they are identically
distributed or not.
We can use this when:
● Differences between the pairs of data are non-normally distributed.
● The pairs of data are matched (dependent pairs).
CODE AND OUTPUTS
● Null Hypothesis: The groups - title length of fake news and title length of real news are
identically distributed.
● Alternate Hypothesis: The groups - title length of fake news and title length of real news
are not identically distributed.
3. Information about the dataset
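A sketch of the test with SciPy, assuming the fake-news DataFrame df used earlier; the Wilcoxon signed-rank test needs paired samples, so the two groups are truncated to a common length here purely for illustration (column names are assumptions):

from scipy.stats import wilcoxon

# Title length in words for each group
df["title_len"] = df["title"].str.split().str.len()
fake_len = df.loc[df["label"] == 1, "title_len"].to_numpy()
real_len = df.loc[df["label"] == 0, "title_len"].to_numpy()

# The signed-rank test requires pairs, so truncate both groups to the same length
n = min(len(fake_len), len(real_len))
stat, p_value = wilcoxon(fake_len[:n], real_len[:n])
print("Wilcoxon statistic:", stat)
print("p-value:", p_value)

if p_value < 0.05:
    print("Reject H0: the two groups are not identically distributed")
else:
    print("Fail to reject H0")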
LEARNING
Key learnings for implementing the Wilcoxon Signed Rank test in a program include
understanding its applications in non-parametric statistical analysis, familiarity with assumptions
and requirements such as paired data and ordinal or continuous variables, implementation in a
programming language or statistical software, interpretation of results including test statistic, p-
values, and confidence intervals, and consideration of appropriate use and limitations of the
Wilcoxon Signed Rank test for comparing paired data and valid interpretation of results.
EXPERIMENT – 11
AIM
Write a program to implement the Nemenyi test.
THEORY
The Friedman Test is used to find whether there exists a significant difference between the
means of more than two groups. In such groups, the same subjects show up in each group. If the
p-value of the Friedman test turns out to be statistically significant then we can conduct the
Nemenyi test to find exactly which groups are different. This test is also known as Nemenyi
posthoc test.
4. Friedman Test
5. Nemenyi Test
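A sketch of both steps, assuming the scikit-posthocs package is installed and using the same illustrative morning/afternoon/evening data as in the previous experiment:

import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# Illustrative scores of the same ten subjects under three conditions
morning   = [4, 6, 3, 4, 3, 2, 2, 7, 6, 5]
afternoon = [5, 6, 8, 7, 7, 8, 4, 6, 4, 5]
evening   = [2, 2, 1, 3, 2, 3, 4, 1, 3, 2]

# Step 4: Friedman test
stat, p_value = friedmanchisquare(morning, afternoon, evening)
print("Friedman p-value:", p_value)

# Step 5: Nemenyi post-hoc test (rows = subjects, columns = conditions)
data = np.array([morning, afternoon, evening]).T
print(sp.posthoc_nemenyi_friedman(data))
# Each entry is the p-value for a pairwise comparison; values below 0.05
# indicate a significant difference between that pair of groups.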
OBSERVATION
● From the outputs received, we reject the null hypothesis.
● From the output table we can clearly conclude that the two groups with statistically
significantly different means are Group 1 and Group 2.
LEARNING
Key learnings for implementing the Nemenyi test in a program include understanding its
applications in posthoc analysis of multiple comparison tests, familiarity with requirements and
assumptions such as ranked or continuous data and multiple group comparisons, implementation
in a programming language or statistical software, interpretation of results including critical
difference values and significance levels, and consideration of appropriate use and limitations of
the Nemenyi test for posthoc analysis and valid interpretation of results in the context of
statistical hypothesis testing.
EXPERIMENT – 12
AIM
Write down the threats to validity observed while performing the experiments.
THEORY
Threats to validity refer to the factors or conditions that may affect the results or conclusions of a
study.
THREATS TO VALIDITY
There are several threats to validity that may arise in a research study on software defect
prediction. Here are some common threats to consider:
● Sampling bias: The sample of data used for analysis may not be representative of the
population of interest. For example, the dataset used may only contain data from a single
organization or software project, which may limit the generalizability of the results.
● Measurement bias: The method used to collect or measure data may introduce bias. For
example, the definition of a software defect may vary between different projects or
organizations, which may affect the accuracy of the predictions.
● Selection bias: The selection of features or predictors used in the analysis may not be
representative of the full range of factors that contribute to software defects. This may
lead to inaccurate predictions or biased results.
● Overfitting: The model used to make predictions may be too complex and fit too closely
to the training data, leading to poor performance when applied to new data. This can be
mitigated by using cross-validation techniques to evaluate the model's performance on
new data.
● Publication bias: There may be a tendency to publish only positive or significant results,
leading to an incomplete or biased picture of the effectiveness of different methods for
software defect prediction.
● External validity: The findings may not be generalizable to other software projects,
organizations, or contexts. It is important to consider the external validity of the study
and to replicate the analysis in different contexts to ensure the results hold across a
broader range of situations.
These are just a few of the potential threats to validity that should be considered when
conducting research on software defect prediction. By carefully considering and addressing these
threats, researchers can increase the rigor and validity of their findings.
LEARNING
Internal validity is the extent to which you can be confident that a cause-and-effect relationship
established in a study cannot be explained by other factors.
EXPERIMENT – 13
AIM
Explore tools such as WEKA and KEIL.
THEORY
WEKA (Waikato Environment for Knowledge Analysis) is an open-source machine learning
workbench developed at the University of Waikato. It provides a collection of data preprocessing
tools, machine learning algorithms for classification, regression, clustering, and association rules,
and visualization facilities, all accessible through graphical user interfaces, which makes it well
suited for data mining and empirical studies.
Keil MDK is the complete software development environment for a range of Arm Cortex-M-
based microcontroller devices. MDK includes the µVision IDE and debugger, Arm C/C++
compiler, and essential middleware components. It supports all silicon vendors with more than
9,500 devices and is easy to learn and use.
WEKA offers several applications for performing various functions. The main options on the
home screen (the GUI Chooser) are Explorer, Experimenter, KnowledgeFlow, Workbench and
Simple CLI.
In the Explorer one can use various tools for preprocessing, classification, clustering, and more.
The preprocessing tools can be used to clean the data and remove redundant attributes, and the
classification tools give access to algorithms that help in classifying data into various categories.
Next, one can load a dataset and visualize various graphs of its attributes, choosing which
attributes to plot and how the graph is displayed.
KEIL
KEIL is a suite of development tools for embedded systems. It comprises an integrated
development environment, RTOS, middleware, as well as debug adapters and evaluation boards
for Arm Cortex®-M based devices.
KEIL has various tools, such as a Logic Analyzer, which displays the behaviour of signals and
variables during debugging, and a Command window, where commands can be entered to carry
out tasks easily, among many others.
The run-time environment can be managed easily using the Manage Run-Time Environment
interface, which controls the software components that are included in the project and configured
at build and run time.
LEARNING
WEKA: an open-source workbench that supports data preprocessing, classification, regression,
clustering, association rules, and visualization through an easy-to-use graphical interface.
Keil MDK is the complete software development environment for a range of Arm Cortex-M-
based microcontroller devices.
EXPERIMENT – 14
AIM
Explore Python and R.
THEORY
Python: It is a very popular general-purpose interpreted, interactive, object-oriented, and high-
level programming language. Python is a dynamically-typed and garbage-collected programming
language. It supports functional and structured programming methods as well as OOP. It can be
used as a scripting language or can be compiled to byte code for building large applications. It
provides very high-level dynamic data types and supports dynamic type checking.
R: It is a great resource for data analysis, data visualization, data science and machine learning.
It provides many statistical techniques (such as statistical tests, classification, clustering and data
reduction). It is easy to draw graphs in R, like pie charts, histograms, box plots, and scatter plots.
It works on different platforms (Windows, Mac, Linux). It is open-source and free. It has
many packages (libraries of functions) that can be used to solve different problems.
4. Program for finding values of exponents
8. Program for finding mode
10. Plotting a histogram
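A small sketch of items 4, 8 and 10, using the math, statistics and matplotlib libraries (the sample data are illustrative):

import math
import statistics
import matplotlib.pyplot as plt

# 4. Finding values of exponents
print(math.pow(2, 10))   # 2 raised to the power 10
print(math.exp(1))       # e raised to the power 1

# 8. Finding the mode of a list of numbers
data = [2, 4, 4, 6, 6, 5, 5, 7, 3, 7, 3, 8, 9, 7]
print(statistics.mode(data))

# 10. Plotting a histogram of the same data
plt.hist(data, bins=5)
plt.title("Histogram")
plt.show()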
R
1. Creating a row vector
> x=c(1,2,3,4,5,6)
>x
[1] 1 2 3 4 5 6
2. Summation
> sum(x)
[1] 21
3. Mean
> mean(x)
[1] 3.5
4. Median
> median(x)
[1] 3.5
5. Square root
> sqrt(x)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490
6. Squaring
> x^2
[1] 1 4 9 16 25 36
7. Creating sequence
> seq(1,10)
[1] 1 2 3 4 5 6 7 8 9 10
8. Creating histogram of sequence
> x= c(2,4,4,6,6,5,5,7,3,7,3,8,9,7,9,6,4,3,4,4,6,2,2,1,2,4,6,6,8)
> hist(x)
9. Creating scatter plot
> x=c(1,3,5,7,9)
> y=c(2,4,6,8,10)
> plot(x,y)
LEARNING
By means of this experiment we got to know about Python and some of its libraries like math,
matplotlib etc. We also learnt how to perform some basic operations in Python like print, loops,
summation, mean, median, and mode including plotting a histogram and scatter plot using the
matplotlib library.
R is a language used for statistical computations, data analysis and graphical representation of
data. After performing this experiment we are able to work with packages of R and represent
output in visual forms.