Note that $[...]^2$ denotes the Cartesian product of the array with itself, and $[...]^3$
denotes the Cartesian product of $[...]^2$ with the array (i.e. $[...]^3 = [...]^2 \times [...] = ([...] \times [...]) \times [...]$). A sketch of this construction is given below.
Note also the high upper bound on iterations (500000): it allows the less optimal hyperparameter configurations to converge and avoids `ConvergenceWarning` errors.
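To make the construction concrete, the grid of tuples described above could be built with `itertools.product`. This is a minimal sketch only: the base array `sizes` and its use as the MLP's `hidden_layer_sizes` candidates are assumptions for illustration, not values taken from the script.

```python
from itertools import product

# Hypothetical base array of layer widths; the actual values come from the script.
sizes = [10, 50, 100]

# [...]^2: Cartesian product of the array with itself (two-layer configurations).
two_layer = list(product(sizes, repeat=2))

# [...]^3 = [...]^2 x [...]: three-layer configurations.
three_layer = [t + (s,) for t in two_layer for s in sizes]

# Candidate values for, e.g., MLPClassifier's hidden_layer_sizes parameter (assumption).
hidden_layer_sizes_grid = [(s,) for s in sizes] + two_layer + three_layer
```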
- For *RandomForestClassifier*:
| **Parameter** | **Values** |
|:-------------|:-----------------------------|
| criterion | gini, entropy |
| max_features | sqrt, log2 |
| class_weight | balanced, balanced_subsample |
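For illustration, the grid in the table above could be handed to scikit-learn's `GridSearchCV` roughly as follows. This is a sketch under assumptions: the `random_state`, `scoring`, and `cv` arguments are illustrative choices, not settings taken from the script.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hyperparameter grid matching the table above.
rf_param_grid = {
    "criterion": ["gini", "entropy"],
    "max_features": ["sqrt", "log2"],
    "class_weight": ["balanced", "balanced_subsample"],
}

# scoring="f1" and cv=5 are illustrative, not taken from the script.
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=3735924759),
    rf_param_grid,
    scoring="f1",
    cv=5,
)
# rf_search.fit(X_train, y_train)  # training data defined elsewhere
```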
The script `./train_classifiers.py`, using the random seed $3735924759$, performs upscaling of the dataset and the grid-search training, recording the precision, accuracy, recall, and F1 score of each hyperparameter configuration. These metrics are then collected and stored in `./models/models.csv`.
The metrics for each classifier and each hyperparameter configuration are reported in decreasing order of F1 score.
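The metric-collection step could look roughly like the sketch below. The `results` structure, the column names, and the sorting by F1 are assumptions used to illustrate the idea, not the actual schema of `./models/models.csv`.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical evaluation results: (classifier name, hyperparameters, true labels,
# predicted labels) for each configuration; real values come from the grid search.
results = [
    ("DT", {"criterion": "gini"}, [0, 1, 1, 0, 1], [0, 1, 0, 0, 1]),
    ("RF", {"criterion": "entropy"}, [0, 1, 1, 0, 1], [0, 1, 1, 0, 1]),
]

rows = [
    {
        "classifier": name,
        "params": str(params),
        "precision": precision_score(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }
    for name, params, y_true, y_pred in results
]

# One row per hyperparameter configuration, sorted by decreasing F1 score.
pd.DataFrame(rows).sort_values("f1", ascending=False).to_csv(
    "./models/models.csv", index=False
)
```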
A boxplot chart showing the distribution of precision, recall, and F1 score
for all classifiers (including the biased classifier) is shown in Figure [1](#fig:boxplot). Table [2](#tab:meanstd) summarizes the mean and standard deviation of each metric.
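The per-classifier mean and standard deviation in Table [2](#tab:meanstd) can be reproduced from the CSV with a short `pandas` aggregation; the column names below follow the earlier sketch and are assumptions, not the actual file schema.

```python
import pandas as pd

# Column names are assumed to match the sketch above.
models = pd.read_csv("./models/models.csv")
summary = models.groupby("classifier")[["precision", "recall", "f1"]].agg(["mean", "std"])
print(summary)
```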
Given the distribution of metrics presented in the previous section, I perform a statistical analysis
using the Wilcoxon paired test to determine, for each pair of classifiers, which one performs better
according to each performance metric. When the *p-value* is too high, an *ANOVA* power analysis (corrected
at *alpha* $= 0.05$) is performed to determine whether the metrics are equally distributed or the statistical
test is inconclusive.
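A minimal sketch of this decision procedure is shown below, using `scipy.stats.wilcoxon` and `statsmodels`' `FTestAnovaPower`. The effect-size estimate and the 0.8 power threshold are assumptions for illustration, not necessarily the exact procedure of the analysis script.

```python
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.power import FTestAnovaPower

def compare(scores_a, scores_b, alpha=0.05, power_threshold=0.8):
    """Compare two classifiers' per-configuration scores for one metric."""
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    _, p_value = wilcoxon(scores_a, scores_b)  # paired, non-parametric test
    if p_value < alpha:
        winner = "A" if scores_a.mean() > scores_b.mean() else "B"
        return f"{winner} is better (p-value = {p_value:.4f})"
    # Otherwise estimate the power of detecting the observed effect at this alpha.
    pooled = np.concatenate([scores_a, scores_b])
    effect_size = abs(scores_a.mean() - scores_b.mean()) / pooled.std()  # rough Cohen's f
    power = FTestAnovaPower().solve_power(
        effect_size=effect_size, nobs=len(pooled), alpha=alpha, k_groups=2
    )
    if power >= power_threshold:
        return f"equally distributed (p-value = {p_value:.4f}, power = {power:.4f})"
    return f"inconclusive (p-value = {p_value:.4f}, power = {power:.4f})"
```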
# F1 Values
- Mean *F1* for *DT*: 0.8881, mean *F1* for *NB*: 0.5495 $\Rightarrow$ *DT* is better than *NB* (*p-value* $= 0.0$)
- Mean *F1* for *DT*: 0.8881, mean *F1* for *MLP*: 0.8848 $\Rightarrow$ statistical test inconclusive (*p-value* $= 0.4711$, *5% corrected ANOVA power* $= 0.2987$)
- Mean *F1* for *DT*: 0.8881, mean *F1* for *RF*: 0.9108 $\Rightarrow$ *RF* is better than *DT* (*p-value* $= 0.0$)
- Mean *F1* for *DT*: 0.8881, mean *F1* for *SVP*: 0.7527 $\Rightarrow$ *DT* is better than *SVP* (*p-value* $= 0.0$)
- Mean *F1* for *NB*: 0.5495, mean *F1* for *MLP*: 0.8848 $\Rightarrow$ *MLP* is better than *NB* (*p-value* $= 0.0$)
- Mean *F1* for *NB*: 0.5495, mean *F1* for *RF*: 0.9108 $\Rightarrow$ *RF* is better than *NB* (*p-value* $= 0.0$)
- Mean *F1* for *NB*: 0.5495, mean *F1* for *SVP*: 0.7527 $\Rightarrow$ *SVP* is better than *NB* (*p-value* $= 0.0$)
- Mean *F1* for *MLP*: 0.8848, mean *F1* for *RF*: 0.9108 $\Rightarrow$ *RF* is better than *MLP* (*p-value* $= 0.0$)
- Mean *F1* for *MLP*: 0.8848, mean *F1* for *SVP*: 0.7527 $\Rightarrow$ *MLP* is better than *SVP* (*p-value* $= 0.0$)
- Mean *F1* for *RF*: 0.9108, mean *F1* for *SVP*: 0.7527 $\Rightarrow$ *RF* is better than *SVP* (*p-value* $= 0.0$)
- Mean *F1* for *Biased*: 0.6662, mean *F1* for *DT*: 0.8881 $\Rightarrow$ *DT* is better than *Biased* (*p-value* $= 0.0$)
- Mean *F1* for *Biased*: 0.6662, mean *F1* for *NB*: 0.5495 $\Rightarrow$ *Biased* is better than *NB* (*p-value* $= 0.0$)
- Mean *F1* for *Biased*: 0.6662, mean *F1* for *MLP*: 0.8848 $\Rightarrow$ *MLP* is better than *Biased* (*p-value* $= 0.0$)
- Mean *F1* for *Biased*: 0.6662, mean *F1* for *RF*: 0.9108 $\Rightarrow$ *RF* is better than *Biased* (*p-value* $= 0.0$)
- Mean *F1* for *Biased*: 0.6662, mean *F1* for *SVP*: 0.7527 $\Rightarrow$ *SVP* is better than *Biased* (*p-value* $= 0.0$)
# Precision Values
- Mean *precision* for *DT*: 0.8327, mean *precision* for *NB*: 0.8209 $\Rightarrow$ *DT* is as effective as *NB* (*p-value* $= 0.0893$, *5% corrected ANOVA power* $= 0.8498$)
- Mean *precision* for *DT*: 0.8327, mean *precision* for *MLP*: 0.8365 $\Rightarrow$ statistical test inconclusive (*p-value* $= 0.4012$, *5% corrected ANOVA power* $= 0.2196$)
- Mean *precision* for *DT*: 0.8327, mean *precision* for *RF*: 0.8707 $\Rightarrow$ *RF* is better than *DT* (*p-value* $= 0.0$)
- Mean *precision* for *DT*: 0.8327, mean *precision* for *SVP*: 0.7557 $\Rightarrow$ *DT* is better than *SVP* (*p-value* $= 0.0$)
- Mean *precision* for *NB*: 0.8209, mean *precision* for *MLP*: 0.8365 $\Rightarrow$ *MLP* is better than *NB* (*p-value* $= 0.0348$)
- Mean *precision* for *NB*: 0.8209, mean *precision* for *RF*: 0.8707 $\Rightarrow$ *RF* is better than *NB* (*p-value* $= 0.0$)
- Mean *precision* for *NB*: 0.8209, mean *precision* for *SVP*: 0.7557 $\Rightarrow$ *NB* is better than *SVP* (*p-value* $= 0.0$)
- Mean *precision* for *MLP*: 0.8365, mean *precision* for *RF*: 0.8707 $\Rightarrow$ *RF* is better than *MLP* (*p-value* $= 0.0$)
- Mean *precision* for *MLP*: 0.8365, mean *precision* for *SVP*: 0.7557 $\Rightarrow$ *MLP* is better than *SVP* (*p-value* $= 0.0$)
- Mean *precision* for *RF*: 0.8707, mean *precision* for *SVP*: 0.7557 $\Rightarrow$ *RF* is better than *SVP* (*p-value* $= 0.0$)
- Mean *precision* for *Biased*: 0.4995, mean *precision* for *DT*: 0.8327 $\Rightarrow$ *DT* is better than *Biased* (*p-value* $= 0.0$)
- Mean *precision* for *Biased*: 0.4995, mean *precision* for *NB*: 0.8209 $\Rightarrow$ *NB* is better than *Biased* (*p-value* $= 0.0$)
- Mean *precision* for *Biased*: 0.4995, mean *precision* for *MLP*: 0.8365 $\Rightarrow$ *MLP* is better than *Biased* (*p-value* $= 0.0$)
- Mean *precision* for *Biased*: 0.4995, mean *precision* for *RF*: 0.8707 $\Rightarrow$ *RF* is better than *Biased* (*p-value* $= 0.0$)
- Mean *precision* for *Biased*: 0.4995, mean *precision* for *SVP*: 0.7557 $\Rightarrow$ *SVP* is better than *Biased* (*p-value* $= 0.0$)
# Recall Values
- Mean *recall* for *DT*: 0.9533, mean *recall* for *NB*: 0.4189 $\Rightarrow$ *DT* is better than *NB* (*p-value* $= 0.0$)
- Mean *recall* for *DT*: 0.9533, mean *recall* for *MLP*: 0.9418 $\Rightarrow$ *DT* is better than *MLP* (*p-value* $= 0.0118$)
- Mean *recall* for *DT*: 0.9533, mean *recall* for *RF*: 0.9567 $\Rightarrow$ statistical test inconclusive (*p-value* $= 0.3276$, *5% corrected ANOVA power* $= 0.2558$)
- Mean *recall* for *DT*: 0.9533, mean *recall* for *SVP*: 0.7547 $\Rightarrow$ *DT* is better than *SVP* (*p-value* $= 0.0$)
- Mean *recall* for *NB*: 0.4189, mean *recall* for *MLP*: 0.9418 $\Rightarrow$ *MLP* is better than *NB* (*p-value* $= 0.0$)
- Mean *recall* for *NB*: 0.4189, mean *recall* for *RF*: 0.9567 $\Rightarrow$ *RF* is better than *NB* (*p-value* $= 0.0$)
- Mean *recall* for *NB*: 0.4189, mean *recall* for *SVP*: 0.7547 $\Rightarrow$ *SVP* is better than *NB* (*p-value* $= 0.0$)
- Mean *recall* for *MLP*: 0.9418, mean *recall* for *RF*: 0.9567 $\Rightarrow$ *RF* is better than *MLP* (*p-value* $= 0.0001$)
- Mean *recall* for *MLP*: 0.9418, mean *recall* for *SVP*: 0.7547 $\Rightarrow$ *MLP* is better than *SVP* (*p-value* $= 0.0$)
- Mean *recall* for *RF*: 0.9567, mean *recall* for *SVP*: 0.7547 $\Rightarrow$ *RF* is better than *SVP* (*p-value* $= 0.0$)
- Mean *recall* for *Biased*: 1.0, mean *recall* for *DT*: 0.9533 $\Rightarrow$ *Biased* is better than *DT* (*p-value* $= 0.0$)
- Mean *recall* for *Biased*: 1.0, mean *recall* for *NB*: 0.4189 $\Rightarrow$ *Biased* is better than *NB* (*p-value* $= 0.0$)
- Mean *recall* for *Biased*: 1.0, mean *recall* for *MLP*: 0.9418 $\Rightarrow$ *Biased* is better than *MLP* (*p-value* $= 0.0$)
- Mean *recall* for *Biased*: 1.0, mean *recall* for *RF*: 0.9567 $\Rightarrow$ *Biased* is better than *RF* (*p-value* $= 0.0$)
- Mean *recall* for *Biased*: 1.0, mean *recall* for *SVP*: 0.7547 $\Rightarrow$ *Biased* is better than *SVP* (*p-value* $= 0.0$)