Assignment 1: Automated Bug Triaging
Group 2: Baris Aksakal, Edoardo Riggio, Claudio Maggioni
Repository structure
/docs: LaTeX report code;
/out
    /csv: Cleaner output;
    /json: Scraper output;
    /model: Pickled models (model training output) and model evaluation output;
    /plots: Plots for the dataset statistical analysis;
/src
    /analysis: Notebook for the dataset statistical analysis;
    /model-dl
        /bert_medium.ipynb: Original implementation of the classifier model, now broken down into Python files;
        /model*.ipynb: Alternative model implementation by Baris Aksakal, not used in the final implementation;
    /{cleaner,modelimpl,scraper}: Python modules used for the scraper, cleaner, and model script implementations;
    /auc.py: ROC curve generation script;
    /clean.py: Cleaner script;
    /runmodel.py: Model execution script;
    /scrape.py: Scraper script;
    /trainmodel.py: Model training script;
/environment-dev.yml: Conda environment file for the development environment;
/environment-server.yml: Conda environment file for model training and execution (to be used with gym.si.usi.ch).
Setup
Conda Environment
Training and running models is only supported on a CUDA 11.6 compatible environment like gym.si.usi.ch. The following
instructions will create and activate a Conda environment with all required dependencies to scrape, clean,
train and run the model:
conda env remove -n bug-triaging-env || true # delete environment if already present
conda env create --name bug-triaging-env --file=environment-server.yml
conda activate bug-triaging-env
Development environment
(may not work on all platforms/architectures)
A PyTorch-free version of the environment can be installed for development purposes. Only the scraper and cleaner scripts can be run in this environment. To install the development environment, run:
conda env remove -n bug-triaging-env-dev || true # delete environment if already present
conda env create --name bug-triaging-env-dev --file=environment-dev.yml
conda activate bug-triaging-env-dev
GitHub API token
To run the scraper and the model executor, a GitHub API token is needed. The token must be placed in
a .env file in this directory, in a variable named GITHUB_TOKEN. The contents of the file should look like this:
GITHUB_TOKEN=<insert-token-here>
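To illustrate how a script could pick the token up from that file, here is a minimal sketch that parses a KEY=VALUE file with the standard library only. The helper name load_env_token is hypothetical, and the actual scripts may read the file differently (for example through a dotenv library).

```python
from pathlib import Path


def load_env_token(env_path=".env", name="GITHUB_TOKEN"):
    """Return the value assigned to `name` in a KEY=VALUE file, or None if absent."""
    path = Path(env_path)
    if not path.is_file():
        return None
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            # Drop optional surrounding quotes around the value.
            return value.strip().strip("'\"")
    return None
```

Returning None instead of raising lets the caller decide whether a missing token is fatal, which is convenient for scripts that only sometimes need GitHub access.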
Scraper
The scraper script is located in src/scrape.py and takes no arguments. It downloads all issues in the
microsoft/vscode repository and saves them in a gzip-compressed tar archive of JSON files, one per issue. The archive
is saved in out/json/issues.tar.gz. If the file already exists, it is deleted before the new archive is written.
To run the scraper run:
python3 src/scrape.py
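The resulting archive can be inspected without unpacking it to disk. The sketch below assumes only what is stated above (a gzip-compressed tar archive holding one JSON document per file); the function name iter_issues is hypothetical, and the layout of member names inside the archive is not assumed.

```python
import json
import tarfile


def iter_issues(archive_path="out/json/issues.tar.gz"):
    """Yield one parsed JSON document per regular file in the archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            # Directories and other non-file entries carry no issue data.
            if not member.isfile():
                continue
            with archive.extractfile(member) as handle:
                yield json.load(handle)
```

Because the function is a generator, issues are decoded one at a time, so the whole dataset never has to be held in memory at once.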
Cleaner
The cleaner script is located in src/clean.py and takes no arguments. It reads out/json/issues.tar.gz,
performs the cleaning process, and performs the train-test split according to the instructions given in the assignment
document. The output of the cleaning process is saved in three CSV files and one text file:
out/csv/issues_train_000001_170000.csv, containing all issues that belong to the complete training set;
out/csv/issues_train_recent_150000_170000.csv, containing all issues that belong to the training set made up of "recent" issues;
out/csv/issues_test_170001_180000.csv, containing all issues that belong to the test set;
out/csv/issues_removed_count.txt, containing the count of issues (excluding PRs) that were discarded by the cleaning process in the entire dataset.
The script will overwrite these files if they exist. To run the cleaner script run:
python3 src/clean.py
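The output file names suggest that the split is made on issue number ranges. The sketch below is an illustration under that assumption: the bounds are read off the file names and treated as inclusive, and the names split_issues and in_range are hypothetical. The authoritative split rules are the ones in the assignment document.

```python
# Inclusive issue-number bounds, inferred from the CSV file names (assumption).
TRAIN_RANGE = (1, 170_000)
RECENT_RANGE = (150_000, 170_000)
TEST_RANGE = (170_001, 180_000)


def in_range(issue_id, bounds):
    """Return True if issue_id lies within the inclusive bounds."""
    low, high = bounds
    return low <= issue_id <= high


def split_issues(issue_ids):
    """Group issue numbers into the three sets written by the cleaner."""
    ids = list(issue_ids)
    return {
        "train": [i for i in ids if in_range(i, TRAIN_RANGE)],
        "train_recent": [i for i in ids if in_range(i, RECENT_RANGE)],
        "test": [i for i in ids if in_range(i, TEST_RANGE)],
    }
```

Under these bounds the "recent" training set is a subset of the complete training set, and neither overlaps the test set.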
Training script
The script used to train the model is located in src/trainmodel.py. The script takes the following arguments:
usage: trainmodel.py [-h] [-r LEARNING_RATE] [-c] [-f] {all,recent} epochs
Training and evaluation script. The script will train and save the obtained model and then perform test set evaluation.
If the given parameters match with a model that was already saved, the script only runs the evaluation procedure.
positional arguments:
{all,recent} The dataset to train with
epochs Number of epochs of the training process
options:
-h, --help show this help message and exit
-r LEARNING_RATE, --learning-rate LEARNING_RATE
The learning rate fed in the Adam optimizer
-c, --force-cpu disables CUDA support. Useful when debugging
-f, --force-retraining forces training of a new model even if a matching model is already found within the saved
models
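For readers who want to see how an interface like the one above maps onto argparse, here is a minimal sketch. It is not taken from src/trainmodel.py: the function name build_parser is hypothetical, and the default learning rate is left as a None placeholder because the help text does not state it.

```python
import argparse


def build_parser():
    """Build a parser that accepts the arguments listed in the usage text."""
    parser = argparse.ArgumentParser(
        prog="trainmodel.py",
        description="Training and evaluation script.",
    )
    parser.add_argument("dataset", choices=["all", "recent"],
                        help="The dataset to train with")
    parser.add_argument("epochs", type=int,
                        help="Number of epochs of the training process")
    # The real default is not documented in the help text; None is a placeholder.
    parser.add_argument("-r", "--learning-rate", type=float, default=None,
                        help="The learning rate fed in the Adam optimizer")
    parser.add_argument("-c", "--force-cpu", action="store_true",
                        help="disables CUDA support. Useful when debugging")
    parser.add_argument("-f", "--force-retraining", action="store_true",
                        help="forces training of a new model even if a matching "
                             "model is already found within the saved models")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["all", "4", "-r", "5e-6"])
    print(args.dataset, args.epochs, args.learning_rate, args.force_cpu)
```

Declaring the learning rate with type=float means a value typed as 5e-6 on the command line is handled as a number from then on.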
The script loads the generated CSV datasets in out/csv and outputs three files in out/model:
out/model/bug_triaging_{all,recent}_{epochs}e_{LEARNING_RATE}lr_final.pt, the PyTorch "pickled" model;
out/model/bug_triaging_{all,recent}_{epochs}e_{LEARNING_RATE}lr_final.label_range.txt, a text file containing two lines which determine the numeric range of classification labels output by the model (this file is used by the ROC and model execution scripts);
out/model/bug_triaging_{all,recent}_{epochs}e_{LEARNING_RATE}lr_final.labels.csv, a CSV file matching the assignee usernames with the numeric encoding used to train and execute the model (this file is used by the ROC and model execution scripts).
({all,recent}, {epochs} and {LEARNING_RATE} are placeholders whose values match the parameters given to the
script.)
To train the configurations that were chosen for the report execute:
python3 src/trainmodel.py all 4 -r '5e-6'
python3 src/trainmodel.py recent 4 -r '5e-6'
NOTE: The pickled PyTorch model files have not been committed to this repo due to file size restrictions. They are
however saved in gym.si.usi.ch:/home/SA23-G2/bug-triaging/out/model.
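Note that a learning rate passed as 5e-6 shows up as 5e-06 in the model file names used by the ROC and execution commands in the following sections. This is consistent with the value being parsed as a float and formatted with Python's default float formatting, although that is an assumption about the script rather than something stated here. A minimal sketch of the naming scheme, with hypothetical helper names:

```python
def model_basename(dataset, epochs, learning_rate):
    """Build the file name stem shared by the three training outputs."""
    # Default float formatting turns 5e-6 into '5e-06' (assumed to match the script).
    return f"bug_triaging_{dataset}_{epochs}e_{learning_rate}lr_final"


def output_paths(dataset, epochs, learning_rate, model_dir="out/model"):
    """Return the paths of the model, label range, and labels files."""
    base = f"{model_dir}/{model_basename(dataset, epochs, learning_rate)}"
    return {
        "model": base + ".pt",
        "label_range": base + ".label_range.txt",
        "labels": base + ".labels.csv",
    }


if __name__ == "__main__":
    for path in output_paths("all", 4, float("5e-6")).values():
        print(path)
```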
ROC curve generation script
The script used to generate the ROC curves is located in src/auc.py. The script takes the following arguments:
usage: auc.py [-h] [-c] modelfile
ROC curve and AUC computation script. The script evaluates the given model against the test set and generates a OvR ROC
curve plot with one curve per class, a micro-averaged OvR ROC plot and the corresponding AUC value.
positional arguments:
modelfile Path to the pickled pytorch model to classify the issue with
options:
-h, --help show this help message and exit
-c, --force-cpu disables CUDA support. Useful when debugging
modelfile must contain a path to one of the .pt files generated with the training script. The label range text file
and the labels CSV file are assumed to be in the same directory as the pickled model.
The script outputs two PNG plots and a text file:
out/model/{model}.ovr_curves.png contains a plot of the One-vs-Rest ROC curves for each class (assignee) appearing in both the train and test sets;
out/model/{model}.ovr_avg.png contains a plot of the micro-averaged One-vs-Rest ROC curve;
out/model/{model}.auc.txt contains the AUC for the micro-averaged ROC curve.
({model} is a placeholder for the file name of the pickled PyTorch model given as argument, without its extension,
i.e. the output of the shell command basename {modelfile} .pt.)
To generate the curves for the two trained models run:
python3 src/auc.py out/model/bug_triaging_all_4e_5e-06lr_final.pt
python3 src/auc.py out/model/bug_triaging_recent_4e_5e-06lr_final.pt
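To make the micro-averaged One-vs-Rest AUC concrete, the following sketch computes it in plain Python. It is an illustration rather than the code in src/auc.py: every (issue, class) pair is treated as one binary decision, and the AUC is obtained from the equivalent rank statistic (the probability that a positive pair scores above a negative one, counting ties as one half). The function name and the toy scores are hypothetical.

```python
def micro_average_auc(y_true, scores):
    """Micro-averaged One-vs-Rest AUC.

    y_true: list of true class indices, one per sample.
    scores: list of per-class score lists, one per sample.
    """
    positives, negatives = [], []
    for label, row in zip(y_true, scores):
        for cls, score in enumerate(row):
            # One-vs-Rest binarization: only the true class is a positive.
            (positives if cls == label else negatives).append(score)
    # Pairwise comparison is quadratic, which is fine for a small illustration.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    toy_scores = [[0.9, 0.1], [0.8, 0.2]]  # hypothetical model outputs
    print(micro_average_auc([0, 1], toy_scores))
```

Micro-averaging pools all classes before computing the curve, so frequent assignees weigh more than rare ones. This differs from the per-class curves, which treat every class separately.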
Execution script
The script used to execute the model is located in src/runmodel.py. The script takes the following arguments:
usage: runmodel.py [-h] [-t TOP] [-c] modelfile issue_id
Model execution script. Downloads a given issue id from the microsoft/vscode repository, performs the cleaning process
and recommends an assignee using the given model. The script may fail if the issue title and body do not contain any
latin characters.
positional arguments:
modelfile Path to the pickled pytorch model to classify the issue with
issue_id The microsoft/vscode GitHub issue id to classify
options:
-h, --help show this help message and exit
-t TOP, --top TOP Number of recommendations to output
-c, --force-cpu disables CUDA support. Useful when debugging
The script outputs the top assignee recommendations for the given issue (five in the example below; the number can be changed with -t), and the actual assignee if the issue has already been assigned.
Alongside each assignee, the script outputs the corresponding numerical encoding. A numerical
encoding equal to -1 in the truth label denotes that the assignee does not appear in the training set
(after the train/validation split).
The script also outputs the number of commits each assignee authored in the repository.
This is an example of the script output for issue 192213:
1: 'roblourens' (44) (confidence: 16.37%) (3932 commits authored)
2: 'lramos15' (36) (confidence: 12.62%) (829 commits authored)
3: 'bpasero' (16) (confidence: 7.29%) (11589 commits authored)
4: 'jrieken' (32) (confidence: 4.53%) (9726 commits authored)
5: 'hediet' (28) (confidence: 3.84%) (1231 commits authored)
Truth: 'alexdima' (9) (6564 commits authored)
To execute the model trained on the all dataset for issue 192213 run:
python3 src/runmodel.py out/model/bug_triaging_all_4e_5e-06lr_final.pt 192213
To execute the model trained on the recent dataset for issue 192213 run:
python3 src/runmodel.py out/model/bug_triaging_recent_4e_5e-06lr_final.pt 192213
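The ranked output shown earlier can be understood as a top-k selection over per-class confidences. The sketch below assumes that the confidences come from a softmax over the model's raw scores, which this README does not state. It also shows the -1 convention for a truth label missing from the training set. All names and values are hypothetical.

```python
import math


def recommend(logits, labels, top=5):
    """Return the `top` (username, class index, confidence) triples, best first."""
    peak = max(logits)
    # Subtracting the maximum keeps the exponentials numerically stable.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(labels[i], i, probs[i]) for i in ranked[:top]]


def truth_index(assignee, labels):
    """Return the class index of the assignee, or -1 if unseen in training."""
    return labels.index(assignee) if assignee in labels else -1


if __name__ == "__main__":
    labels = ["dev_a", "dev_b", "dev_c"]  # hypothetical usernames
    for rank, (name, idx, prob) in enumerate(recommend([0.0, 2.0, 1.0], labels, top=2), 1):
        print(f"{rank}: '{name}' ({idx}) (confidence: {prob:.2%})")
    print("Truth index for an unseen assignee:", truth_index("dev_z", labels))
```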
Report
To compile the report run:
cd docs
pdflatex -interaction=nonstopmode -output-directory=. main.tex
pdflatex -interaction=nonstopmode -output-directory=. main.tex