report work
parent
39a6c59e2c
commit
185bee2933
4 changed files with 288 additions and 35 deletions
30
grid_search_table.py
Executable file
|
@ -0,0 +1,30 @@
#!/usr/bin/env python3

import os
import pandas as pd

from train_classifiers import get_classifiers

def main():
    i = 0
    df = pd.DataFrame(columns=['Classifier', 'Parameter', 'Values'])
    for clazz, grid in get_classifiers():
        for name, values in grid.items():
            df.loc[i, 'Classifier'] = type(clazz).__name__
            df.loc[i, 'Parameter'] = name
            df.loc[i, 'Values'] = ', '.join([str(x) for x in values])
            i += 1

    n1 = '5, 10, 15, ..., 100'
    n2 = ', '.join([str(x) for x in range(15, 101, 15)])
    n3 = ', '.join([str(x) for x in range(20, 101, 20)])

    df.loc[(df.Classifier == 'MLPClassifier') & (df.Parameter == 'hidden_layer_sizes'), 'Values'] = f'$[{n1}]$, $[{n2}]^2$, $[{n3}]^3$'

    for i in set(df['Classifier']):
        print(i)
        print(df.loc[df.Classifier == i, ['Parameter', 'Values']].to_markdown(index=False))
        print()

if __name__ == '__main__':
    main()
272
report/main.md
|
@ -1,7 +1,7 @@
|
|||
---
|
||||
author: Claudio Maggioni
|
||||
title: Information Modelling & Analysis -- Project 2
|
||||
geometry: margin=2cm,bottom=3cm
|
||||
geometry: margin=2cm
|
||||
---
|
||||
|
||||
<!--The following shows a minimal submission report for project 2. If you
|
||||
|
@ -18,7 +18,7 @@ expected info, you'll be fine.-->
|
|||
|
||||
# Code Repository
|
||||
|
||||
The code and result files, part of this submission, can be found at
|
||||
The code and result files, part of this submission, can be found at:
|
||||
|
||||
- Repository: [https://github.com/infoMA2023/project-02-bug-prediction-maggicl](https://github.com/infoMA2023/project-02-bug-prediction-maggicl)
|
||||
- Commit ID: **TBD**
|
||||
|
@ -37,8 +37,8 @@ and used the code in the following subfolder for the project:
|
|||
./resources/defects4j-checkout-closure-1f/src/com/google/javascript/jscomp/
|
||||
```
|
||||
|
||||
relative to the root folder of the repository. The resulting csv of extracted, labelled feature vectors can be found in
|
||||
the repository at the following path:
|
||||
relative to the root folder of the repository. The resulting CSV of extracted, labelled feature vectors can be found in
|
||||
the repository at the path:
|
||||
|
||||
```
|
||||
./metrics/feature_vectors_labeled.csv
|
||||
|
@ -46,6 +46,11 @@ the repository at the following path:
|
|||
|
||||
relative to the root folder of the repository.
|
||||
|
||||
Unlabeled feature vectors can be computed by running the script `./extract_feature_vectors.py`.
|
||||
The resulting CSV of unlabeled feature vectors is located in `./metrics/feature_vectors.csv`.
|
||||
|
||||
Labels for feature vectors can be computed by running the script `./label_feature_vectors.py`.
|
||||
|
||||
## Feature Vector Extraction
|
||||
|
||||
I extracted **291** feature vectors in total. Aggregate metrics
|
||||
|
@ -54,24 +59,25 @@ code metric, can be found in Table [1](#tab:metrics){reference-type="ref"
|
|||
reference="tab:metrics"}.
|
||||
|
||||
::: {#tab:metrics}
|
||||
| **Metric** | **Min** | **Average** | **Max** |
|
||||
|:----|------:|-----------:|-------:|
|
||||
| BCM | 0 | 13.4124 | 221 |
|
||||
| CPX | 0 | 5.8247 | 96 |
|
||||
| DCM | 0 | 4.8652 | 176.2 |
|
||||
| EX | 0 | 0.1134 | 2 |
|
||||
| FLD | 0 | 6.5773 | 167 |
|
||||
| INT | 0 | 0.6667 | 3 |
|
||||
| MTH | 0 | 11.6529 | 209 |
|
||||
| NML | 0 | 13.5622 | 28 |
|
||||
| RET | 0 | 3.6735 | 86 |
|
||||
| RFC | 0 | 107.2710 | 882 |
|
||||
| SZ | 0 | 18.9966 | 347 |
|
||||
| WRD | 0 | 314.4740 | 3133 |
|
||||
| **Metric** | **Minimum** | **Average** | **Maximum** |
|
||||
|:------|-:|---------:|-----:|
|
||||
| `BCM` |0| 13.4124 | 221 |
|
||||
| `CPX` |0| 5.8247 | 96 |
|
||||
| `DCM` |0| 4.8652 |176.2 |
|
||||
| `EX` |0| 0.1134 | 2 |
|
||||
| `FLD` |0| 6.5773 | 167 |
|
||||
| `INT` |0| 0.6667 | 3 |
|
||||
| `MTH` |0| 11.6529 | 209 |
|
||||
| `NML` |0| 13.5622 | 28 |
|
||||
| `RET` |0| 3.6735 | 86 |
|
||||
| `RFC` |0| 107.2710 | 882 |
|
||||
| `SZ` |0| 18.9966 | 347 |
|
||||
| `WRD` |0| 314.4740 | 3133 |
|
||||
|
||||
: Distribution of values for each extracted code metric.
|
||||
:::
|
||||
|
||||
|
||||
## Feature Vector Labelling
|
||||
|
||||
After feature vectors are labeled, I determine that the dataset contains
|
||||
|
@ -79,25 +85,221 @@ After feature vectors are labeled, I determine that the dataset contains
|
|||
|
||||
# Classifiers
|
||||
|
||||
In every subsection below, describe in a concise way which different
|
||||
<!--In every subsection below, describe in a concise way which different
|
||||
hyperparameters you tried for the corresponding classifier, and report
|
||||
the corresponding precision, recall and F1 values (for example in a
|
||||
table or an [itemize]{.smallcaps}-environment). Furthermore, for every
|
||||
type of classifiers, explicitly mention which hyperparameter
|
||||
configuration you chose (based on above reported results) to be used in
|
||||
further steps, and (in one or two sentences), explain why these
|
||||
hyperparameters may outperform the other ones you tested..
|
||||
hyperparameters may outperform the other ones you tested..-->
|
||||
|
||||
In this section I explain how I define and perform training for each classifier.
|
||||
|
||||
Since the dataset has an unbalanced number of feature vectors of each class, in order
|
||||
to increase classification performance I upsample the dataset by performing sampling with
|
||||
replacement over the least frequent class until the number of feature vectors matches the
|
||||
most frequent class[^1].
|
||||
|
||||
[^1]: Upsampling due to unbalanced classes was suggested by *Michele Cattaneo*, who is attending this class.
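
The upsampling step described above can be sketched as follows. This is a minimal illustration using only the standard library. It assumes that the minority class keeps its original feature vectors and gains extra ones drawn with replacement. The `upsample` helper and its list-based interface are hypothetical and are not taken from `./train_classifiers.py`, which may rely on a library routine instead.

```python
import random


def upsample(rows, labels, seed=3735924759):
    """Balance classes by sampling the smaller ones with replacement.

    Hypothetical helper: `rows` is a list of feature vectors and `labels`
    the matching list of class labels.
    """
    rng = random.Random(seed)

    # Group feature vectors by class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    # Every class is grown to the size of the most frequent one.
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw the missing vectors with replacement (none for the largest class).
        extra = rng.choices(members, k=target - len(members))
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


if __name__ == '__main__':
    rows = [[i] for i in range(10)]
    labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    _, balanced = upsample(rows, labels)
    print(balanced.count(0), balanced.count(1))
```

Fixing the seed makes the resampled dataset reproducible across runs, which keeps the grid search results comparable.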
|
||||
|
||||
Other than the `GaussianNB` (Naive Bayes) classifier, the classifiers chosen for the
project expose tunable hyperparameters. To choose their values, I perform a grid
search over each classifier. The hyperparameter values considered in the grid search
for each classifier are the following:
|
||||
|
||||
- For *DecisionTreeClassifier*:
|
||||
|
||||
| **Parameter** | **Values** |
|
||||
|:------------|:--------------|
|
||||
| criterion | gini, entropy |
|
||||
| splitter | best, random |
|
||||
|
||||
- For *SVC*:
|
||||
|
||||
| **Parameter** | **Values** |
|
||||
|:------------|:---------------------------|
|
||||
| kernel | linear, poly, rbf, sigmoid |
|
||||
| gamma | scale, auto |
|
||||
|
||||
- For *MLPClassifier*:
|
||||
|
||||
| **Parameter** | **Values** |
|
||||
|:------------|:---------------------------|
|
||||
| max_iter | 500000 |
|
||||
| hidden_layer_sizes | $[5, 10, 15, ..., 100]$, $[15, 30, 45, 60, 75, 90]^2$, $[20, 40, 60, 80, 100]^3$ |
|
||||
| activation | identity, logistic, tanh, relu |
|
||||
| solver | lbfgs, sgd, adam |
|
||||
| learning_rate | constant, invscaling, adaptive |
|
||||
|
||||
Note that $[...]^2$ denotes the Cartesian product of the array with itself, and $[...]^3$
denotes the Cartesian product of $[...]^2$ with the array (i.e. $[...]^3 = [...]^2 \times [...] = ([...] \times [...]) \times [...]$).
|
||||
|
||||
Note also the high upper bound on iterations (500000). It lets the less optimal hyperparameter configurations converge and avoids `ConvergenceWarning` warnings.
|
||||
|
||||
- For *RandomForestClassifier*:
|
||||
|
||||
| **Parameter** | **Values** |
|
||||
|:-------------|:-----------------------------|
|
||||
| criterion | gini, entropy |
|
||||
| max_features | sqrt, log2 |
|
||||
| class_weight | balanced, balanced_subsample |
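
The `hidden_layer_sizes` values listed for *MLPClassifier* above can be enumerated with `itertools.product`. The sketch below is an assumption about how such a grid could be built, not the code in `./train_classifiers.py`. Single-layer sizes are kept as plain integers only to mirror how they appear in the results table, and the variable names `n1`, `n2` and `n3` are borrowed from `grid_search_table.py`.

```python
from itertools import product

# One hidden layer: 5, 10, 15, ..., 100
n1 = list(range(5, 101, 5))
# Two hidden layers: every ordered pair drawn from 15, 30, ..., 90
n2 = list(range(15, 101, 15))
# Three hidden layers: every ordered triple drawn from 20, 40, ..., 100
n3 = list(range(20, 101, 20))

hidden_layer_sizes = (
    n1
    + list(product(n2, repeat=2))  # [...]^2
    + list(product(n3, repeat=3))  # [...]^3
)

if __name__ == '__main__':
    print(len(hidden_layer_sizes))
```

Under these assumptions the grid holds 20 + 36 + 125 = 181 layer configurations, each of which is then combined with every activation, solver and learning rate value.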
|
||||
|
||||
The script `./train_classifiers.py`, seeded with the random seed $3735924759$, upsamples the dataset and runs the grid search, recording the precision, accuracy, recall and F1 score of each hyperparameter configuration. These metrics are then collected and stored in `./models/models.csv`.
|
||||
|
||||
The metrics for each classifier and each hyperparameter configuration in decreasing order of
|
||||
accuracy are reported in the following sections.
|
||||
|
||||
For each classifier, I then choose the hyperparameter configuration with the highest accuracy.
|
||||
|
||||
## Decision Tree (DT)
|
||||
|
||||
| criterion | splitter | precision | accuracy | recall | f1 |
|
||||
|:------------|:-----------|------------:|-----------:|---------:|---------:|
|
||||
| gini | best | 0.788462 | 0.850575 | 0.953488 | 0.863158 |
|
||||
| gini | random | 0.784314 | 0.83908 | 0.930233 | 0.851064 |
|
||||
| entropy | random | 0.736842 | 0.816092 | 0.976744 | 0.84 |
|
||||
| entropy | best | 0.745455 | 0.816092 | 0.953488 | 0.836735 |
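
As a reading aid, the four reported columns can be derived from confusion-matrix counts. The counts used below (41 true positives, 11 false positives, 2 false negatives, 33 true negatives) are not stated in the report. They are an inference: one set of counts that reproduces the `gini` / `best` row when rounded to six decimals. The `scores` helper is hypothetical.

```python
def scores(tp, fp, fn, tn):
    """Return (precision, accuracy, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, accuracy, recall, f1


if __name__ == '__main__':
    # Inferred counts, not taken from the report.
    print([round(x, 6) for x in scores(tp=41, fp=11, fn=2, tn=33)])
```

A high recall paired with a lower precision, as in this row, means the classifier misses few positive samples at the cost of more false alarms.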
|
||||
|
||||
## Naive Bayes (NB)
|
||||
|
||||
| precision | accuracy | recall | f1 |
|
||||
|------------:|-----------:|---------:|---------:|
|
||||
| 0.8 | 0.678161 | 0.465116 | 0.588235 |
|
||||
|
||||
## Support Vector Machine (SVM)
|
||||
|
||||
| gamma | kernel | precision | accuracy | recall | f1 |
|
||||
|:--------|:---------|------------:|-----------:|---------:|---------:|
|
||||
| scale | rbf | 0.717391 | 0.735632 | 0.767442 | 0.741573 |
|
||||
| scale | linear | 0.75 | 0.735632 | 0.697674 | 0.722892 |
|
||||
| auto | linear | 0.75 | 0.735632 | 0.697674 | 0.722892 |
|
||||
| auto | rbf | 0.702128 | 0.724138 | 0.767442 | 0.733333 |
|
||||
| scale | sigmoid | 0.647059 | 0.678161 | 0.767442 | 0.702128 |
|
||||
| auto | sigmoid | 0.647059 | 0.678161 | 0.767442 | 0.702128 |
|
||||
| auto | poly | 0.772727 | 0.643678 | 0.395349 | 0.523077 |
|
||||
| scale | poly | 0.833333 | 0.597701 | 0.232558 | 0.363636 |
|
||||
|
||||
## Multi-Layer Perceptron (MLP)
|
||||
|
||||
For the sake of brevity, only the top 100 results by accuracy are shown.
|
||||
|
||||
| activation | hidden_layer_sizes | learning_rate | max_iter | solver | precision | accuracy | recall | f1 |
|
||||
|:-------------|:---------------------|:----------------|-----------:|:---------|------------:|-----------:|---------:|---------:|
|
||||
| logistic | (60, 80, 100) | constant | 500000 | lbfgs | 0.895833 | 0.942529 | 1 | 0.945055 |
|
||||
| logistic | (40, 80, 100) | adaptive | 500000 | lbfgs | 0.86 | 0.91954 | 1 | 0.924731 |
|
||||
| tanh | (40, 80, 100) | invscaling | 500000 | adam | 0.86 | 0.91954 | 1 | 0.924731 |
|
||||
| tanh | (60, 100, 80) | adaptive | 500000 | lbfgs | 0.86 | 0.91954 | 1 | 0.924731 |
|
||||
| tanh | (100, 60, 20) | constant | 500000 | adam | 0.86 | 0.91954 | 1 | 0.924731 |
|
||||
| tanh | (100, 80, 80) | constant | 500000 | adam | 0.86 | 0.91954 | 1 | 0.924731 |
|
||||
| relu | (75, 30) | adaptive | 500000 | lbfgs | 0.86 | 0.91954 | 1 | 0.924731 |
|
||||
| logistic | (20, 40, 60) | adaptive | 500000 | lbfgs | 0.875 | 0.91954 | 0.976744 | 0.923077 |
|
||||
| logistic | (40, 60, 80) | adaptive | 500000 | lbfgs | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| logistic | (80, 40, 20) | constant | 500000 | lbfgs | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| tanh | 30 | invscaling | 500000 | lbfgs | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| tanh | 60 | constant | 500000 | lbfgs | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| tanh | 85 | adaptive | 500000 | lbfgs | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| tanh | (30, 30) | constant | 500000 | adam | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| tanh | (45, 45) | adaptive | 500000 | lbfgs | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| tanh | (60, 60) | invscaling | 500000 | lbfgs | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| tanh | (75, 45) | invscaling | 500000 | lbfgs | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| tanh | (75, 75) | adaptive | 500000 | lbfgs | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| tanh | (90, 90) | invscaling | 500000 | adam | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| tanh | (20, 40, 60) | invscaling | 500000 | adam | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| tanh | (20, 100, 20) | invscaling | 500000 | lbfgs | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| tanh | (40, 20, 100) | constant | 500000 | adam | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| tanh | (40, 80, 60) | invscaling | 500000 | adam | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| tanh | (40, 80, 100) | adaptive | 500000 | adam | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| tanh | (60, 20, 40) | adaptive | 500000 | lbfgs | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| tanh | (60, 60, 80) | constant | 500000 | adam | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| tanh | (60, 80, 80) | adaptive | 500000 | lbfgs | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| tanh | (80, 20, 40) | adaptive | 500000 | lbfgs | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| tanh | (80, 40, 80) | constant | 500000 | adam | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| tanh | (80, 60, 60) | constant | 500000 | lbfgs | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| relu | (20, 20, 80) | constant | 500000 | adam | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| relu | (20, 40, 100) | constant | 500000 | lbfgs | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| relu | (20, 60, 20) | adaptive | 500000 | adam | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| relu | (20, 60, 100) | adaptive | 500000 | adam | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| relu | (20, 100, 20) | constant | 500000 | adam | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| relu | (20, 100, 40) | adaptive | 500000 | adam | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| relu | (40, 20, 80) | constant | 500000 | lbfgs | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| relu | (40, 80, 60) | constant | 500000 | lbfgs | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| relu | (60, 20, 100) | constant | 500000 | adam | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| relu | (80, 20, 60) | constant | 500000 | lbfgs | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| relu | (80, 60, 20) | adaptive | 500000 | adam | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| relu | (100, 20, 60) | invscaling | 500000 | adam | 0.843137 | 0.908046 | 1 | 0.914894 |
|
||||
| logistic | (20, 60, 80) | invscaling | 500000 | lbfgs | 0.857143 | 0.908046 | 0.976744 | 0.913043 |
|
||||
| logistic | (60, 20, 20) | adaptive | 500000 | lbfgs | 0.857143 | 0.908046 | 0.976744 | 0.913043 |
|
||||
| tanh | (15, 45) | constant | 500000 | lbfgs | 0.857143 | 0.908046 | 0.976744 | 0.913043 |
|
||||
| tanh | (45, 90) | invscaling | 500000 | lbfgs | 0.857143 | 0.908046 | 0.976744 | 0.913043 |
|
||||
| tanh | (90, 30) | constant | 500000 | lbfgs | 0.857143 | 0.908046 | 0.976744 | 0.913043 |
|
||||
| tanh | (20, 80, 100) | invscaling | 500000 | adam | 0.857143 | 0.908046 | 0.976744 | 0.913043 |
|
||||
| tanh | (20, 80, 100) | adaptive | 500000 | lbfgs | 0.857143 | 0.908046 | 0.976744 | 0.913043 |
|
||||
| tanh | (40, 40, 40) | adaptive | 500000 | lbfgs | 0.857143 | 0.908046 | 0.976744 | 0.913043 |
|
||||
| tanh | (40, 60, 100) | adaptive | 500000 | adam | 0.857143 | 0.908046 | 0.976744 | 0.913043 |
|
||||
| tanh | (60, 80, 60) | constant | 500000 | lbfgs | 0.857143 | 0.908046 | 0.976744 | 0.913043 |
|
||||
| tanh | (100, 40, 60) | invscaling | 500000 | lbfgs | 0.857143 | 0.908046 | 0.976744 | 0.913043 |
|
||||
| tanh | (100, 80, 100) | adaptive | 500000 | adam | 0.857143 | 0.908046 | 0.976744 | 0.913043 |
|
||||
| relu | (30, 30) | adaptive | 500000 | lbfgs | 0.857143 | 0.908046 | 0.976744 | 0.913043 |
|
||||
| relu | (20, 20, 40) | adaptive | 500000 | lbfgs | 0.857143 | 0.908046 | 0.976744 | 0.913043 |
|
||||
| relu | (20, 40, 40) | adaptive | 500000 | adam | 0.857143 | 0.908046 | 0.976744 | 0.913043 |
|
||||
| relu | (40, 20, 100) | adaptive | 500000 | adam | 0.857143 | 0.908046 | 0.976744 | 0.913043 |
|
||||
| relu | (60, 80, 20) | invscaling | 500000 | lbfgs | 0.857143 | 0.908046 | 0.976744 | 0.913043 |
|
||||
| logistic | (40, 80, 60) | adaptive | 500000 | lbfgs | 0.87234 | 0.908046 | 0.953488 | 0.911111 |
|
||||
| logistic | 35 | adaptive | 500000 | lbfgs | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| logistic | (15, 60) | invscaling | 500000 | lbfgs | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| logistic | (45, 45) | constant | 500000 | lbfgs | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| logistic | (20, 20, 60) | adaptive | 500000 | lbfgs | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| logistic | (60, 60, 80) | adaptive | 500000 | lbfgs | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| logistic | (80, 40, 100) | invscaling | 500000 | lbfgs | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| logistic | (100, 100, 100) | constant | 500000 | lbfgs | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | 60 | constant | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (15, 15) | invscaling | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (15, 45) | adaptive | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (30, 30) | invscaling | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (30, 60) | constant | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (60, 90) | invscaling | 500000 | lbfgs | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (75, 15) | constant | 500000 | lbfgs | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (75, 45) | constant | 500000 | lbfgs | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (90, 15) | adaptive | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (90, 45) | invscaling | 500000 | lbfgs | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (20, 40, 20) | invscaling | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (20, 40, 40) | invscaling | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (20, 60, 20) | adaptive | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (20, 80, 60) | constant | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (20, 80, 80) | constant | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (20, 80, 100) | constant | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (40, 20, 60) | invscaling | 500000 | lbfgs | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (40, 60, 60) | constant | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (40, 60, 60) | invscaling | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (40, 80, 20) | adaptive | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (40, 100, 60) | constant | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (60, 40, 20) | constant | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (60, 40, 40) | invscaling | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (60, 40, 80) | constant | 500000 | lbfgs | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (60, 60, 20) | constant | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (60, 80, 60) | constant | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (60, 80, 80) | invscaling | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (60, 100, 20) | invscaling | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (60, 100, 40) | adaptive | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (60, 100, 60) | constant | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (60, 100, 60) | adaptive | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (60, 100, 80) | constant | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
| tanh | (80, 40, 40) | constant | 500000 | adam | 0.826923 | 0.896552 | 1 | 0.905263 |
|
||||
|
||||
## Random Forest (RF)
|
||||
|
||||
| criterion | class_weight | max_features | precision | accuracy | recall | f1 |
|
||||
|:------------|:-------------------|:---------------|------------:|-----------:|---------:|---------:|
|
||||
| gini | balanced | sqrt | 0.836735 | 0.885057 | 0.953488 | 0.891304 |
|
||||
| entropy | balanced | sqrt | 0.807692 | 0.873563 | 0.976744 | 0.884211 |
|
||||
| gini | balanced_subsample | sqrt | 0.807692 | 0.873563 | 0.976744 | 0.884211 |
|
||||
| entropy | balanced_subsample | sqrt | 0.807692 | 0.873563 | 0.976744 | 0.884211 |
|
||||
| gini | balanced | log2 | 0.82 | 0.873563 | 0.953488 | 0.88172 |
|
||||
| entropy | balanced | log2 | 0.82 | 0.873563 | 0.953488 | 0.88172 |
|
||||
| gini | balanced_subsample | log2 | 0.803922 | 0.862069 | 0.953488 | 0.87234 |
|
||||
| entropy | balanced_subsample | log2 | 0.803922 | 0.862069 | 0.953488 | 0.87234 |
|
||||
|
||||
# Evaluation
|
||||
|
||||
## Output Distributions
|
||||
|
@ -121,11 +323,11 @@ subsubsections:
|
|||
::: {#tab:precision}
|
||||
| | DecisionTreeClassifier | GaussianNB | MLPClassifier | RandomForestClassifier | SVC |
|
||||
|:-----------------------|:-------------------------|:-------------|:----------------|:-------------------------|------:|
|
||||
| BiasedClassifier | 0 | 0 | 0 | 0 | 0 |
|
||||
| DecisionTreeClassifier | -- | 0.0893 | 0.4012 | 0 | 0 |
|
||||
| GaussianNB | -- | -- | 0.0348 | 0 | 0 |
|
||||
| MLPClassifier | -- | -- | -- | 0 | 0 |
|
||||
| RandomForestClassifier | -- | -- | -- | -- | 0 |
|
||||
| BiasedClassifier | 0.0000 | 0.0000 | 0.0000 | 0.0000 |0.0000|
|
||||
| DecisionTreeClassifier | -- | 0.0893 | 0.4012 | 0.0000 |0.0000|
|
||||
| GaussianNB | -- | -- | 0.0348 | 0.0000 |0.0000|
|
||||
| MLPClassifier | -- | -- | -- | 0.0000 |0.0000|
|
||||
| RandomForestClassifier | -- | -- | -- | -- |0.0000|
|
||||
|
||||
: Pairwise Wilcoxon test for precision for each combination of classifiers.
|
||||
:::
|
||||
|
@ -133,22 +335,22 @@ subsubsections:
|
|||
::: {#tab:recall}
|
||||
| | DecisionTreeClassifier | GaussianNB | MLPClassifier | RandomForestClassifier | SVC |
|
||||
|:-----------------------|:-------------------------|:-------------|:----------------|:-------------------------|------:|
|
||||
| BiasedClassifier | 0 | 0 | 0 | 0 | 0 |
|
||||
| DecisionTreeClassifier | -- | 0 | 0.0118 | 0.3276 | 0 |
|
||||
| GaussianNB | -- | -- | 0 | 0 | 0 |
|
||||
| MLPClassifier | -- | -- | -- | 0.0001 | 0 |
|
||||
| RandomForestClassifier | -- | -- | -- | -- | 0 |
|
||||
| BiasedClassifier | 0.0000 | 0.0000 | 0.0000 | 0.0000 |0.0000|
|
||||
| DecisionTreeClassifier | -- | 0.0000 | 0.0118 | 0.3276 |0.0000|
|
||||
| GaussianNB | -- | -- | 0.0000 | 0.0000 |0.0000|
|
||||
| MLPClassifier | -- | -- | -- | 0.0001 |0.0000|
|
||||
| RandomForestClassifier | -- | -- | -- | -- |0.0000|
|
||||
: Pairwise Wilcoxon test for recall for each combination of classifiers.
|
||||
:::
|
||||
|
||||
::: {#tab:f1}
|
||||
| | DecisionTreeClassifier | GaussianNB | MLPClassifier | RandomForestClassifier | SVC |
|
||||
|:-----------------------|:-------------------------|:-------------|:----------------|:-------------------------|------:|
|
||||
| BiasedClassifier | 0 | 0 | 0 | 0 | 0 |
|
||||
| DecisionTreeClassifier | -- | 0 | 0.4711 | 0 | 0 |
|
||||
| GaussianNB | -- | -- | 0 | 0 | 0 |
|
||||
| MLPClassifier | -- | -- | -- | 0 | 0 |
|
||||
| RandomForestClassifier | -- | -- | -- | -- | 0 |
|
||||
| BiasedClassifier | 0.0000 | 0.0000 | 0.0000 | 0.0000 |0.0000|
|
||||
| DecisionTreeClassifier | -- | 0.0000 | 0.4711 | 0.0000 |0.0000|
|
||||
| GaussianNB | -- | -- | 0.0000 | 0.0000 |0.0000|
|
||||
| MLPClassifier | -- | -- | -- | 0.0000 |0.0000|
|
||||
| RandomForestClassifier | -- | -- | -- | -- |0.0000|
|
||||
: Pairwise Wilcoxon test for the F1 score metric for each combination of classifiers.
|
||||
:::
|
||||
|
||||
|
|
BIN
report/main.pdf
Binary file not shown.
|
@ -165,6 +165,27 @@ def main():
    else:
        df = pd.read_csv(OUT_DIR + '/models.csv')

    for clazz in set(df['classifier']):
        dfc = df.loc[df.classifier == clazz, :].copy()
        dfc = dfc[dfc.columns.drop(list(df.filter(regex='^(mean_)|(std_)|(rank_)|(params$)|(classifier$)')))]
        dfc = dfc.rename(columns={
            "split0_test_precision": "precision",
            "split0_test_accuracy": "accuracy",
            "split0_test_recall": "recall",
            "split0_test_f1": "f1"
        })

        dfc = dfc.reindex(
            [x for x in dfc.columns if x.startswith('param_')] + \
            [x for x in dfc.columns if not x.startswith('param_')], \
            axis=1)
        dfc = dfc.rename(columns=dict([(c, c.replace('param_', '')) for c in dfc.columns]))
        dfc = dfc.loc[:, dfc.notna().any(axis=0)]

        print(clazz)
        print(dfc.head(100).to_markdown(index=False))
        print()

    find_best_and_save(df)

