This repository has been archived on 2023-06-18. You can view files and clone it, but cannot push or open issues or pull requests.

32 KiB

Raw Permalink Blame History

author	title	geometry
Claudio Maggioni	Information Modelling & Analysis -- Project 2	margin=2cm

Code Repository

The code and result files, part of this submission, can be found at:

Repository: https://github.com/infoMA2023/project-02-bug-prediction-maggicl
Commit ID: 786d6a0684

Data Pre-Processing

I use the sources of the [Closure]{.smallcaps} repository that were already downloaded using the command:

defects4j checkout -p Closure -v 1f -w ./resources

and used the code in the following subfolder for the project:

./resources/defects4j-checkout-closure-1f/src/com/google/javascript/jscomp/

relative to the root folder of the repository. The resulting CSV of extracted, labelled feature vectors can be found in the repository at the path:

./metrics/feature_vectors_labeled.csv

relative to the root folder of the repository.

Unlabeled feature vectors can be computed by running the script ./extract_feature_vectors.py. The resulting CSV of unlabeled feature vectors is located in ./metrics/feature_vectors.csv.

Labels for feature vectors can be computed by running the script ./label_feature_vectors.py.

Feature Vector Extraction

I extracted 291 feature vectors in total. Aggregate metrics about the extracted feature vectors, i.e. the distribution of the values of each code metric, can be found in Table 1{reference-type="ref" reference="tab:metrics"}.

::: {#tab:metrics}

Metric	Average	Maximum
`BCM`	13.4124	221
`CPX`	5.8247	96
`DCM`	4.8652	176.2
`EX`	0.1134	2
`FLD`	6.5773	167
`INT`	0.6667	3
`MTH`	11.6529	209
`NML`	13.5622	28
`RET`	3.6735	86
`RFC`	107.2710	882
`SZ`	18.9966	347
`WRD`	314.4740	3133

: Distribution of values for each extracted code metric. :::

Feature Vector Labelling

After feature vectors are labeled, I determine that the dataset contains 75 buggy classes and 216 non-buggy classes.

Classifiers

In this section I explain how I define and perform training for each classifier.

Since the dataset has an unbalanced number of feature vectors of each class, in order to increase classification performance I upsample the dataset by performing sampling with replacement over the least frequent class until the number of feature vectors matches the most frequest class¹.

Other than for the GaussianNB (Naive Bayes) classifier, the classifiers chosen for the project offer to select hyperparameter values. In order to choose them, I perform a grid search over each classifier. The hyperparameter values I have considered in the grid search for each classifier are the following:

For DecisionTreeClassifier:

Parameter Values

criterion gini, entropy

splitter best, random
For SVC:

Parameter Values

kernel linear, poly, rbf, sigmoid

gamma scale, auto

Parameter	Values
criterion	gini, entropy
splitter	best, random

Parameter	Values
kernel	linear, poly, rbf, sigmoid
gamma	scale, auto

For MLPClassifier:

Parameter	Values
max_iter	500000
hidden_layer_sizes	`[5, 10, 15, ..., 100]`, `[15, 30, 45, 60, 75, 90]^2`, `[20, 40, 60, 80, 100]^3`
activation	identity, logistic, tanh, relu
solver	lbfgs, sgd, adam
learning_rate	constant, invscaling, adaptive

Note that the [...]^2 denotes a cartesian product of the array with itself, and $[...]^3$ denotes the cartesian product of [...]^2 with the array (i.e. $[...]^3 = [...]^2 \times [...] = ([...] \times [...]) \times [...]$).

Note also the high upper bound on iterations (500000). This is to allow convergence of the less optimal hyperparameter configurations and avoid ConvergenceWarning errors.

For RandomForestClassifier:

Parameter Values

criterion gini, entropy

max_features sqrt, log2

class_weight balanced, balanced_subsample

Parameter	Values
criterion	gini, entropy
max_features	sqrt, log2
class_weight	balanced, balanced_subsample

The script ./train_classifiers.py, according to the random seed 3735924759, performs upscaling of the dataset and the grid search training, by recording precision, accuracy, recall and the F1 score of each configuration of hyperparameters. These metrics are then collected and stored in ./models/models.csv.

The metrics for each classifier and each hyperparameter configuration in decreasing order of accuracy are reported in the following sections.

For each classifier, I then choose the hyperparameter configuration with highest accuracy. Namely, these configurations are:

Classifier	Hyper-parameter configuration	Precision	Accuracy	Recall	F1 Score
DecisionTreeClassifier	`criterion`: gini, `splitter`: best	0.7885	0.8506	0.9535	0.8632
GaussianNB	--	0.8	0.6782	0.4651	0.5882
MLPClassifier	`activation`: logistic, `hidden_layer_sizes`: (60, 80, 100), `learning_rate`: constant, `max_iter`: 500000, `solver`: lbfgs	0.8958	0.9425	1	0.9451
RandomForestClassifier	`class_weight`: balanced, `criterion`: gini, `max_features`: sqrt	0.8367	0.8851	0.9535	0.8913
SVC	`gamma`: scale, `kernel`: rbf	0.7174	0.7356	0.7674	0.7416

Decision Tree (DT)

criterion	splitter	precision	accuracy	recall	f1
gini	best	0.788462	0.850575	0.953488	0.863158
gini	random	0.784314	0.83908	0.930233	0.851064
entropy	random	0.736842	0.816092	0.976744	0.84
entropy	best	0.745455	0.816092	0.953488	0.836735

Naive Bayes (NB)

precision	accuracy	recall	f1
0.8	0.678161	0.465116	0.588235

Support Vector Machine (SVP)

gamma	kernel	precision	accuracy	recall	f1
scale	rbf	0.717391	0.735632	0.767442	0.741573
scale	linear	0.75	0.735632	0.697674	0.722892
auto	linear	0.75	0.735632	0.697674	0.722892
auto	rbf	0.702128	0.724138	0.767442	0.733333
scale	sigmoid	0.647059	0.678161	0.767442	0.702128
auto	sigmoid	0.647059	0.678161	0.767442	0.702128
auto	poly	0.772727	0.643678	0.395349	0.523077
scale	poly	0.833333	0.597701	0.232558	0.363636

Multi-Layer Perceptron (MLP)

For sake of brevity, only the top 100 results by accuracy are shown.

activation	hidden_layer_sizes	learning_rate	max_iter	solver	precision	accuracy	recall	f1
logistic	(60, 80, 100)	constant	500000	lbfgs	0.895833	0.942529	1	0.945055
logistic	(40, 80, 100)	adaptive	500000	lbfgs	0.86	0.91954	1	0.924731
tanh	(40, 80, 100)	invscaling	500000	adam	0.86	0.91954	1	0.924731
tanh	(60, 100, 80)	adaptive	500000	lbfgs	0.86	0.91954	1	0.924731
tanh	(100, 60, 20)	constant	500000	adam	0.86	0.91954	1	0.924731
tanh	(100, 80, 80)	constant	500000	adam	0.86	0.91954	1	0.924731
relu	(75, 30)	adaptive	500000	lbfgs	0.86	0.91954	1	0.924731
logistic	(20, 40, 60)	adaptive	500000	lbfgs	0.875	0.91954	0.976744	0.923077
logistic	(40, 60, 80)	adaptive	500000	lbfgs	0.843137	0.908046	1	0.914894
logistic	(80, 40, 20)	constant	500000	lbfgs	0.843137	0.908046	1	0.914894
tanh	30	invscaling	500000	lbfgs	0.843137	0.908046	1	0.914894
tanh	60	constant	500000	lbfgs	0.843137	0.908046	1	0.914894
tanh	85	adaptive	500000	lbfgs	0.843137	0.908046	1	0.914894
tanh	(30, 30)	constant	500000	adam	0.843137	0.908046	1	0.914894
tanh	(45, 45)	adaptive	500000	lbfgs	0.843137	0.908046	1	0.914894
tanh	(60, 60)	invscaling	500000	lbfgs	0.843137	0.908046	1	0.914894
tanh	(75, 45)	invscaling	500000	lbfgs	0.843137	0.908046	1	0.914894
tanh	(75, 75)	adaptive	500000	lbfgs	0.843137	0.908046	1	0.914894
tanh	(90, 90)	invscaling	500000	adam	0.843137	0.908046	1	0.914894
tanh	(20, 40, 60)	invscaling	500000	adam	0.843137	0.908046	1	0.914894
tanh	(20, 100, 20)	invscaling	500000	lbfgs	0.843137	0.908046	1	0.914894
tanh	(40, 20, 100)	constant	500000	adam	0.843137	0.908046	1	0.914894
tanh	(40, 80, 60)	invscaling	500000	adam	0.843137	0.908046	1	0.914894
tanh	(40, 80, 100)	adaptive	500000	adam	0.843137	0.908046	1	0.914894
tanh	(60, 20, 40)	adaptive	500000	lbfgs	0.843137	0.908046	1	0.914894
tanh	(60, 60, 80)	constant	500000	adam	0.843137	0.908046	1	0.914894
tanh	(60, 80, 80)	adaptive	500000	lbfgs	0.843137	0.908046	1	0.914894
tanh	(80, 20, 40)	adaptive	500000	lbfgs	0.843137	0.908046	1	0.914894
tanh	(80, 40, 80)	constant	500000	adam	0.843137	0.908046	1	0.914894
tanh	(80, 60, 60)	constant	500000	lbfgs	0.843137	0.908046	1	0.914894
relu	(20, 20, 80)	constant	500000	adam	0.843137	0.908046	1	0.914894
relu	(20, 40, 100)	constant	500000	lbfgs	0.843137	0.908046	1	0.914894
relu	(20, 60, 20)	adaptive	500000	adam	0.843137	0.908046	1	0.914894
relu	(20, 60, 100)	adaptive	500000	adam	0.843137	0.908046	1	0.914894
relu	(20, 100, 20)	constant	500000	adam	0.843137	0.908046	1	0.914894
relu	(20, 100, 40)	adaptive	500000	adam	0.843137	0.908046	1	0.914894
relu	(40, 20, 80)	constant	500000	lbfgs	0.843137	0.908046	1	0.914894
relu	(40, 80, 60)	constant	500000	lbfgs	0.843137	0.908046	1	0.914894
relu	(60, 20, 100)	constant	500000	adam	0.843137	0.908046	1	0.914894
relu	(80, 20, 60)	constant	500000	lbfgs	0.843137	0.908046	1	0.914894
relu	(80, 60, 20)	adaptive	500000	adam	0.843137	0.908046	1	0.914894
relu	(100, 20, 60)	invscaling	500000	adam	0.843137	0.908046	1	0.914894
logistic	(20, 60, 80)	invscaling	500000	lbfgs	0.857143	0.908046	0.976744	0.913043
logistic	(60, 20, 20)	adaptive	500000	lbfgs	0.857143	0.908046	0.976744	0.913043
tanh	(15, 45)	constant	500000	lbfgs	0.857143	0.908046	0.976744	0.913043
tanh	(45, 90)	invscaling	500000	lbfgs	0.857143	0.908046	0.976744	0.913043
tanh	(90, 30)	constant	500000	lbfgs	0.857143	0.908046	0.976744	0.913043
tanh	(20, 80, 100)	invscaling	500000	adam	0.857143	0.908046	0.976744	0.913043
tanh	(20, 80, 100)	adaptive	500000	lbfgs	0.857143	0.908046	0.976744	0.913043
tanh	(40, 40, 40)	adaptive	500000	lbfgs	0.857143	0.908046	0.976744	0.913043
tanh	(40, 60, 100)	adaptive	500000	adam	0.857143	0.908046	0.976744	0.913043
tanh	(60, 80, 60)	constant	500000	lbfgs	0.857143	0.908046	0.976744	0.913043
tanh	(100, 40, 60)	invscaling	500000	lbfgs	0.857143	0.908046	0.976744	0.913043
tanh	(100, 80, 100)	adaptive	500000	adam	0.857143	0.908046	0.976744	0.913043
relu	(30, 30)	adaptive	500000	lbfgs	0.857143	0.908046	0.976744	0.913043
relu	(20, 20, 40)	adaptive	500000	lbfgs	0.857143	0.908046	0.976744	0.913043
relu	(20, 40, 40)	adaptive	500000	adam	0.857143	0.908046	0.976744	0.913043
relu	(40, 20, 100)	adaptive	500000	adam	0.857143	0.908046	0.976744	0.913043
relu	(60, 80, 20)	invscaling	500000	lbfgs	0.857143	0.908046	0.976744	0.913043
logistic	(40, 80, 60)	adaptive	500000	lbfgs	0.87234	0.908046	0.953488	0.911111
logistic	35	adaptive	500000	lbfgs	0.826923	0.896552	1	0.905263
logistic	(15, 60)	invscaling	500000	lbfgs	0.826923	0.896552	1	0.905263
logistic	(45, 45)	constant	500000	lbfgs	0.826923	0.896552	1	0.905263
logistic	(20, 20, 60)	adaptive	500000	lbfgs	0.826923	0.896552	1	0.905263
logistic	(60, 60, 80)	adaptive	500000	lbfgs	0.826923	0.896552	1	0.905263
logistic	(80, 40, 100)	invscaling	500000	lbfgs	0.826923	0.896552	1	0.905263
logistic	(100, 100, 100)	constant	500000	lbfgs	0.826923	0.896552	1	0.905263
tanh	60	constant	500000	adam	0.826923	0.896552	1	0.905263
tanh	(15, 15)	invscaling	500000	adam	0.826923	0.896552	1	0.905263
tanh	(15, 45)	adaptive	500000	adam	0.826923	0.896552	1	0.905263
tanh	(30, 30)	invscaling	500000	adam	0.826923	0.896552	1	0.905263
tanh	(30, 60)	constant	500000	adam	0.826923	0.896552	1	0.905263
tanh	(60, 90)	invscaling	500000	lbfgs	0.826923	0.896552	1	0.905263
tanh	(75, 15)	constant	500000	lbfgs	0.826923	0.896552	1	0.905263
tanh	(75, 45)	constant	500000	lbfgs	0.826923	0.896552	1	0.905263
tanh	(90, 15)	adaptive	500000	adam	0.826923	0.896552	1	0.905263
tanh	(90, 45)	invscaling	500000	lbfgs	0.826923	0.896552	1	0.905263
tanh	(20, 40, 20)	invscaling	500000	adam	0.826923	0.896552	1	0.905263
tanh	(20, 40, 40)	invscaling	500000	adam	0.826923	0.896552	1	0.905263
tanh	(20, 60, 20)	adaptive	500000	adam	0.826923	0.896552	1	0.905263
tanh	(20, 80, 60)	constant	500000	adam	0.826923	0.896552	1	0.905263
tanh	(20, 80, 80)	constant	500000	adam	0.826923	0.896552	1	0.905263
tanh	(20, 80, 100)	constant	500000	adam	0.826923	0.896552	1	0.905263
tanh	(40, 20, 60)	invscaling	500000	lbfgs	0.826923	0.896552	1	0.905263
tanh	(40, 60, 60)	constant	500000	adam	0.826923	0.896552	1	0.905263
tanh	(40, 60, 60)	invscaling	500000	adam	0.826923	0.896552	1	0.905263
tanh	(40, 80, 20)	adaptive	500000	adam	0.826923	0.896552	1	0.905263
tanh	(40, 100, 60)	constant	500000	adam	0.826923	0.896552	1	0.905263
tanh	(60, 40, 20)	constant	500000	adam	0.826923	0.896552	1	0.905263
tanh	(60, 40, 40)	invscaling	500000	adam	0.826923	0.896552	1	0.905263
tanh	(60, 40, 80)	constant	500000	lbfgs	0.826923	0.896552	1	0.905263
tanh	(60, 60, 20)	constant	500000	adam	0.826923	0.896552	1	0.905263
tanh	(60, 80, 60)	constant	500000	adam	0.826923	0.896552	1	0.905263
tanh	(60, 80, 80)	invscaling	500000	adam	0.826923	0.896552	1	0.905263
tanh	(60, 100, 20)	invscaling	500000	adam	0.826923	0.896552	1	0.905263
tanh	(60, 100, 40)	adaptive	500000	adam	0.826923	0.896552	1	0.905263
tanh	(60, 100, 60)	constant	500000	adam	0.826923	0.896552	1	0.905263
tanh	(60, 100, 60)	adaptive	500000	adam	0.826923	0.896552	1	0.905263
tanh	(60, 100, 80)	constant	500000	adam	0.826923	0.896552	1	0.905263
tanh	(80, 40, 40)	constant	500000	adam	0.826923	0.896552	1	0.905263

Random Forest (RF)

criterion	class_weight	max_features	precision	accuracy	recall	f1
gini	balanced	sqrt	0.836735	0.885057	0.953488	0.891304
entropy	balanced	sqrt	0.807692	0.873563	0.976744	0.884211
gini	balanced_subsample	sqrt	0.807692	0.873563	0.976744	0.884211
entropy	balanced_subsample	sqrt	0.807692	0.873563	0.976744	0.884211
gini	balanced	log2	0.82	0.873563	0.953488	0.88172
entropy	balanced	log2	0.82	0.873563	0.953488	0.88172
gini	balanced_subsample	log2	0.803922	0.862069	0.953488	0.87234
entropy	balanced_subsample	log2	0.803922	0.862069	0.953488	0.87234

Evaluation

To evaluate the performance of each selected classifier, each model has trained 100 times by repeating a 5-fold cross validation procedure 20 times. A graphical and statistical analysis to compare the distribution of performance metrics (precision, recall and F1 score) follows.

Numeric tables, statistical analysis conclusions and the boxplot diagram shown in this section are obtained by running the script:

./evaluate_classifiers.py

Output Distributions

A boxplot chart to show the distribution of each of precision, recall, and F1 score for all classifiers (including the biased classifier) is shown in figure 1{reference-type="ref" reference="fig:boxplot"}. Table 2{reference-type="ref" reference="tab:meanstd"} is a numeric table summing up mean and standard deviation of each metric.

{#fig:boxplot}

::: {#tab:meanstd}

Classifier	Metric	Mean	Std. dev.
BiasedClassifier	f1	0.666238	0.00511791
BiasedClassifier	precision	0.49954	0.00575757
BiasedClassifier	recall	1	0
DecisionTreeClassifier	f1	0.888104	0.0302085
DecisionTreeClassifier	precision	0.832669	0.0425814
DecisionTreeClassifier	recall	0.953261	0.0358021
GaussianNB	f1	0.54952	0.0912106
GaussianNB	precision	0.820866	0.066323
GaussianNB	recall	0.418885	0.0950382
MLPClassifier	f1	0.884807	0.0348063
MLPClassifier	precision	0.836507	0.0489391
MLPClassifier	recall	0.941823	0.0454851
RandomForestClassifier	f1	0.91079	0.0301481
RandomForestClassifier	precision	0.870685	0.0427383
RandomForestClassifier	recall	0.956665	0.0381288
SVC	f1	0.752656	0.0537908
SVC	precision	0.7557	0.0515768
SVC	recall	0.754709	0.0819603

: Mean and standard deviation for each classifier for 20-times cross validation. :::

Comparison and Significance

Given the distribution of metrics presented in the previous section, I perform a statistical analysis using the Wilixcon paired test to determine for each pair of classifiers which one performs better according to each performance metric. When the p-value is too high, ANOVA power analysis (corrected by alpha = $= 0.05$) is performed to determine if the metrics are equally distributed or if the statistical test is inconclusive.

F1 Values

Mean F1 for DT: 0.8881, mean F1 for NB: 0.5495 \Rightarrow DT is better than NB (p-value $= 0.0$)
Mean F1 for DT: 0.8881, mean F1 for MLP: 0.8848 \Rightarrow statistical test inconclusive (p-value = 0.4711, 5% corrected ANOVA power $= 0.2987$)
Mean F1 for DT: 0.8881, mean F1 for RF: 0.9108 \Rightarrow RF is better than DT (p-value $= 0.0$)
Mean F1 for DT: 0.8881, mean F1 for SVP: 0.7527 \Rightarrow DT is better than SVP (p-value $= 0.0$)
Mean F1 for NB: 0.5495, mean F1 for MLP: 0.8848 \Rightarrow MLP is better than NB (p-value $= 0.0$)
Mean F1 for NB: 0.5495, mean F1 for RF: 0.9108 \Rightarrow RF is better than NB (p-value $= 0.0$)
Mean F1 for NB: 0.5495, mean F1 for SVP: 0.7527 \Rightarrow SVP is better than NB (p-value $= 0.0$)
Mean F1 for MLP: 0.8848, mean F1 for RF: 0.9108 \Rightarrow RF is better than MLP (p-value $= 0.0$)
Mean F1 for MLP: 0.8848, mean F1 for SVP: 0.7527 \Rightarrow MLP is better than SVP (p-value $= 0.0$)
Mean F1 for RF: 0.9108, mean F1 for SVP: 0.7527 \Rightarrow RF is better than SVP (p-value $= 0.0$)
Mean F1 for Biased: 0.6662, mean F1 for DT: 0.8881 \Rightarrow DT is better than Biased (p-value $= 0.0$)
Mean F1 for Biased: 0.6662, mean F1 for NB: 0.5495 \Rightarrow Biased is better than NB (p-value $= 0.0$)
Mean F1 for Biased: 0.6662, mean F1 for MLP: 0.8848 \Rightarrow MLP is better than Biased (p-value $= 0.0$)
Mean F1 for Biased: 0.6662, mean F1 for RF: 0.9108 \Rightarrow RF is better than Biased (p-value $= 0.0$)
Mean F1 for Biased: 0.6662, mean F1 for SVP: 0.7527 \Rightarrow SVP is better than Biased (p-value $= 0.0$)

Precision

Mean precision for DT: 0.8327, mean precision for NB: 0.8209 \Rightarrow DT is as effective as NB (p-value = 0.0893, 5% corrected ANOVA power $= 0.8498$)
Mean precision for DT: 0.8327, mean precision for MLP: 0.8365 \Rightarrow statistical test inconclusive (p-value = 0.4012, 5% corrected ANOVA power $= 0.2196$)
Mean precision for DT: 0.8327, mean precision for RF: 0.8707 \Rightarrow RF is better than DT (p-value $= 0.0$)
Mean precision for DT: 0.8327, mean precision for SVP: 0.7557 \Rightarrow DT is better than SVP (p-value $= 0.0$)
Mean precision for NB: 0.8209, mean precision for MLP: 0.8365 \Rightarrow MLP is better than NB (p-value $= 0.0348$)
Mean precision for NB: 0.8209, mean precision for RF: 0.8707 \Rightarrow RF is better than NB (p-value $= 0.0$)
Mean precision for NB: 0.8209, mean precision for SVP: 0.7557 \Rightarrow NB is better than SVP (p-value $= 0.0$)
Mean precision for MLP: 0.8365, mean precision for RF: 0.8707 \Rightarrow RF is better than MLP (p-value $= 0.0$)
Mean precision for MLP: 0.8365, mean precision for SVP: 0.7557 \Rightarrow MLP is better than SVP (p-value $= 0.0$)
Mean precision for RF: 0.8707, mean precision for SVP: 0.7557 \Rightarrow RF is better than SVP (p-value $= 0.0$)
Mean precision for Biased: 0.4995, mean precision for DT: 0.8327 \Rightarrow DT is better than Biased (p-value $= 0.0$)
Mean precision for Biased: 0.4995, mean precision for NB: 0.8209 \Rightarrow NB is better than Biased (p-value $= 0.0$)
Mean precision for Biased: 0.4995, mean precision for MLP: 0.8365 \Rightarrow MLP is better than Biased (p-value $= 0.0$)
Mean precision for Biased: 0.4995, mean precision for RF: 0.8707 \Rightarrow RF is better than Biased (p-value $= 0.0$)
Mean precision for Biased: 0.4995, mean precision for SVP: 0.7557 \Rightarrow SVP is better than Biased (p-value $= 0.0$)

Recall

Mean recall for DT: 0.9533, mean recall for NB: 0.4189 \Rightarrow DT is better than NB (p-value $= 0.0$)
Mean recall for DT: 0.9533, mean recall for MLP: 0.9418 \Rightarrow DT is better than MLP (p-value $= 0.0118$)
Mean recall for DT: 0.9533, mean recall for RF: 0.9567 \Rightarrow statistical test inconclusive (p-value = 0.3276, 5% corrected ANOVA power $= 0.2558$)
Mean recall for DT: 0.9533, mean recall for SVP: 0.7547 \Rightarrow DT is better than SVP (p-value $= 0.0$)
Mean recall for NB: 0.4189, mean recall for MLP: 0.9418 \Rightarrow MLP is better than NB (p-value $= 0.0$)
Mean recall for NB: 0.4189, mean recall for RF: 0.9567 \Rightarrow RF is better than NB (p-value $= 0.0$)
Mean recall for NB: 0.4189, mean recall for SVP: 0.7547 \Rightarrow SVP is better than NB (p-value $= 0.0$)
Mean recall for MLP: 0.9418, mean recall for RF: 0.9567 \Rightarrow RF is better than MLP (p-value $= 0.0001$)
Mean recall for MLP: 0.9418, mean recall for SVP: 0.7547 \Rightarrow MLP is better than SVP (p-value $= 0.0$)
Mean recall for RF: 0.9567, mean recall for SVP: 0.7547 \Rightarrow RF is better than SVP (p-value $= 0.0$)
Mean recall for Biased: 1.0, mean recall for DT: 0.9533 \Rightarrow Biased is better than DT (p-value $= 0.0$)
Mean recall for Biased: 1.0, mean recall for NB: 0.4189 \Rightarrow Biased is better than NB (p-value $= 0.0$)
Mean recall for Biased: 1.0, mean recall for MLP: 0.9418 \Rightarrow Biased is better than MLP (p-value $= 0.0$)
Mean recall for Biased: 1.0, mean recall for RF: 0.9567 \Rightarrow Biased is better than RF (p-value $= 0.0$)
Mean recall for Biased: 1.0, mean recall for SVP: 0.7547 \Rightarrow Biased is better than SVP (p-value $= 0.0$)

Practical Usefulness

The evaluation shows that all classifiers but the Naive Bayes classifier outperform the biased classifier in overall performance (represented by the F1 score). Given the evaluation was not performed on a dedicated test set, all classifiers may be subjected to training bias, which might yield better metrics than a similar evaluation performed on completely new data. However, given the evaluation sample size (i.e. the number of training runs) this effect is minimal.

Given this premise, I can say with reasonable confidence the DT, MLP, RP and SVP classifiers would be useful in predicting potential bugs in the Google JSComp project, i.e. the source of the used dataset. Given the literature presented during lecture, it is not certain if the same trained classifier would yield acceptable results for source code from another project. This is because differences in coding conventions, the bug tracking process, or simply the definition of what constitutes a bug may vary from project to project. A solution to this problem might be to simply train the classifiers on a dataset coming from the project where bug prediction is needed.

Upsampling due to unbalanced classes was suggested by Michele Cattaneo, who is attending this class. ↩︎

32 KiB Raw Permalink Blame History