Claudio Maggioni a4ceee8716 Final version of the project
History has been rewritten to delete large files in repo
2024-01-03 15:28:43 +01:00

Assignment 2: If statements

Group 2: Baris Aksakal, Edoardo Riggio, Claudio Maggioni

Repository Structure

  • /dataset: code and data related to scraping repositories from GitHub;
  • /models
    • /baris: code and persisted model of the original architecture built by Baris. model_0.1.ipynb and test_model.ipynb are, respectively, an earlier and a later iteration of the code used to train this model;
    • /final: persisted model for the final architecture with training and test evaluation statistics;
      • /test_outputs.csv: CSV deliverable for the evaluation on the test set we extracted;
      • /test_usi_outputs.csv: CSV deliverable for the evaluation on the provided test set.
  • /test: unit tests for the model training scripts;
  • /train: dependencies of the main model training script;
  • /train_model.py: main model training script;
  • /plot_acc.py: accuracy statistics plotting script.

Environment Setup

Both the scraping and training scripts require Python 3.10 or greater. Dependencies can be installed in a virtual environment by running:

python3 -m venv .env 
source .env/bin/activate 
pip install -r requirements.txt
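
Since the scripts need Python 3.10 or greater, it may help to check the interpreter inside the virtual environment before training. The following is a minimal sketch, not part of the repository; the helper name `python_is_supported` is hypothetical.

```python
import sys

# Minimum version stated in this README.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


# Report whether the running interpreter satisfies the requirement.
print("Python version supported:", python_is_supported())
```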

Dataset Extraction

Please refer to the README.md file in /dataset for documentation on the dataset extraction process.

Model Training

Model training can be performed by running the script:

python3 train_model.py

The script resumes from fine-tuning if a previous execution completed the pretraining phase, and it skips directly to model evaluation on the two test sets if fine-tuning was already completed.
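
The resume behaviour can be pictured as a check on what already exists on disk. The sketch below is only an illustration and is not taken from train_model.py. It assumes that the presence of the `pretrain` directory signals a completed pretraining phase and that the presence of `best.txt` signals completed fine-tuning. The function name `next_phase` is hypothetical.

```python
from pathlib import Path


def next_phase(model_dir: Path) -> str:
    """Guess which phase would run next, based on persisted artifacts.

    Assumptions (not confirmed by the training script):
    - `best.txt` exists only once fine-tuning has finished;
    - `pretrain` exists only once pretraining has finished.
    """
    if (model_dir / "best.txt").is_file():
        return "evaluate"
    if (model_dir / "pretrain").exists():
        return "finetune"
    return "pretrain"
```

For example, `next_phase(Path("models/final"))`, run from the repository root, would return one of `"pretrain"`, `"finetune"` or `"evaluate"`.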

The persisted pretrained model is located in /models/final/pretrain. Each epoch of the fine-tuning process is persisted at /models/final/<N>, where <N> is the epoch number, starting from 0. The number of the epoch selected by early stopping is stored in /models/final/best.txt.
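
Given this layout, the directory of the selected checkpoint can be derived from `best.txt`. The following sketch assumes that `best.txt` contains only the epoch number as plain text, which this README does not spell out. The helper name `best_checkpoint_dir` is hypothetical.

```python
from pathlib import Path


def best_checkpoint_dir(model_dir: Path) -> Path:
    """Return the directory of the epoch chosen by early stopping.

    Assumption: best.txt holds just the epoch number (e.g. "3").
    """
    epoch = int((model_dir / "best.txt").read_text().strip())
    # Epoch checkpoints are stored in directories named after the epoch number.
    return model_dir / str(epoch)
```

Called as `best_checkpoint_dir(Path("models/final"))` from the repository root, this returns a path of the form `models/final/<N>`.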

/models/final/stats.csv stores the training and validation loss and accuracy recorded during training. /models/final/test_outputs.csv is the CSV deliverable for the evaluation on the test set we extracted, while /models/final/test_usi_outputs.csv is the CSV deliverable for the evaluation on the provided test set.

The stdout of the training script can be found in /models/final/train_log.txt.

Plots

The train and validation loss and accuracy plots can be generated from /models/final/stats.csv with the following command:

python3 plot_acc.py

The output is stored in /models/final/training_metrics.png.
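
Besides plotting, the statistics file can be inspected directly. The sketch below is not part of plot_acc.py. It finds the row with the smallest value in a chosen column using only the standard library. The column names of stats.csv are not documented here, so the caller must pass a name that exists in the file's header row. The helper name `row_with_min` is hypothetical.

```python
import csv
from pathlib import Path


def row_with_min(stats_path: Path, column: str) -> dict:
    """Return the CSV row whose value in `column` is smallest.

    Assumption: the file has a header row and `column` holds numeric values.
    """
    with stats_path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        raise ValueError(f"{stats_path} contains no data rows")
    return min(rows, key=lambda row: float(row[column]))
```

For instance, passing the name of the validation loss column would return the statistics of the epoch with the lowest validation loss.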

Report

To compile the report, run:

cd report
pdflatex -interaction=nonstopmode -output-directory=. main.tex
pdflatex -interaction=nonstopmode -output-directory=. main.tex

The report is then located in report/main.pdf.