# Assignment 2: If statements

**Group 2: Baris Aksakal, Edoardo Riggio, Claudio Maggioni**

## Repository Structure

- `/dataset`: code and data related to scraping repositories from GitHub;
- `/models`
    - `/baris`: code and persisted model of the original architecture built by
      Baris. `model_0.1.ipynb` and `test_model.ipynb` are respectively an
      earlier and later iteration of the code used to train this model;
    - `/final`: persisted model for the final architecture with training and
      test evaluation statistics;
        - `/test_outputs.csv`: CSV deliverable for the evaluation on the test
          set we extracted;
        - `/test_usi_outputs.csv`: CSV deliverable for the evaluation on the
          provided test set;
- `/test`: unit tests for the model training scripts;
- `/train`: dependencies of the main model training script;
- `/train_model.py`: main model training script;
- `/plot_acc.py`: accuracy statistics plotting script.

## Environment Setup

Both the scraping and training scripts require Python 3.10 or greater.
Dependencies can be installed in a virtual environment by running:

```shell
python3 -m venv .env
source .env/bin/activate
pip install -r requirements.txt
```
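
Since the scripts need Python 3.10 or greater, checking the interpreter version
before installing dependencies can avoid a failed setup. The following is a
minimal sketch; the helper name `meets_minimum` is hypothetical and not part of
this repository:

```python
import sys

# Minimum version stated in this README.
MIN_VERSION = (3, 10)


def meets_minimum(version_info=None, minimum=MIN_VERSION):
    """Return True if the given (or running) interpreter is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only the (major, minor) pair.
    return tuple(version_info[:2]) >= tuple(minimum)


# Report on the interpreter that is running this snippet.
if meets_minimum():
    print("Python version OK")
else:
    print("Python 3.10 or greater is required")
```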

## Dataset Extraction

Please refer to [the README.md file in `/dataset`](dataset/README.md) for
documentation on the dataset extraction process.

## Model Training

To train the model, run:

```shell
python3 train_model.py
```

The script resumes from fine-tuning if a previous execution completed the
pretraining phase, and skips directly to model evaluation on the two test sets
if fine-tuning was already completed.

The persisted pretrained model is located in `/models/final/pretrain`. Each
epoch of the fine-tuning process is persisted at `/models/final/<N>`, where
`<N>` is the epoch number, starting from 0. The number of the epoch selected by
early stopping is stored in `/models/final/best.txt`.
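
The resume behaviour and the checkpoint layout can be pictured as a check on
which artifacts already exist on disk. The sketch below illustrates that idea
and is not the logic used in `train_model.py`: the functions `next_stage` and
`best_checkpoint` and the stage names are hypothetical. It assumes that
`pretrain` exists only once pretraining has finished, that `best.txt` exists
only once fine-tuning has finished, and that `best.txt` contains just the epoch
number.

```python
from pathlib import Path
import tempfile


def next_stage(model_dir: Path) -> str:
    """Guess which phase would run next, based on files already persisted.

    Assumption: `best.txt` appears only after fine-tuning has finished, and
    `pretrain` appears only after pretraining has finished.
    """
    if (model_dir / "best.txt").is_file():
        return "evaluate"
    if (model_dir / "pretrain").exists():
        return "finetune"
    return "pretrain"


def best_checkpoint(model_dir: Path) -> Path:
    """Return the directory of the epoch recorded in `best.txt`.

    Assumption: `best.txt` holds only the epoch number as text.
    """
    epoch = int((model_dir / "best.txt").read_text().strip())
    return model_dir / str(epoch)


# Demonstration on a throwaway directory that mimics the documented layout.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    stages = [next_stage(demo)]          # nothing persisted yet
    (demo / "pretrain").mkdir()
    stages.append(next_stage(demo))      # pretraining done
    (demo / "3").mkdir()
    (demo / "best.txt").write_text("3\n")
    stages.append(next_stage(demo))      # fine-tuning done
    best_name = best_checkpoint(demo).name

print(stages, best_name)
```

Called with `Path("models/final")`, the same functions would inspect the real
output directory, provided the assumptions above hold for it.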

`/models/final/stats.csv` stores the training and validation loss and accuracy
recorded during the training process. `/models/final/test_outputs.csv` is the
CSV deliverable for the evaluation on the test set we extracted, while
`/models/final/test_usi_outputs.csv` is the CSV deliverable for the evaluation
on the provided test set.
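
The statistics file can be inspected with Python's standard `csv` module. This
README does not state the column names of `stats.csv` or the criterion used by
early stopping, so the sketch below makes both up for illustration: the header
names `epoch` and `val_loss`, the sample values, and the helpers `load_stats`
and `best_row` are all hypothetical.

```python
import csv
import io


def load_stats(handle):
    """Parse a stats CSV into a list of per-row dicts keyed by header name."""
    return list(csv.DictReader(handle))


def best_row(rows, column, lowest=True):
    """Return the row with the lowest (or highest) numeric value in `column`."""
    pick = min if lowest else max
    return pick(rows, key=lambda row: float(row[column]))


# Made-up sample; header names and numbers are illustrative only.
sample = io.StringIO(
    "epoch,val_loss\n"
    "0,0.9\n"
    "1,0.6\n"
    "2,0.7\n"
)
rows = load_stats(sample)
selected = best_row(rows, "val_loss")
print(selected["epoch"])
```

To read the real file, pass an open file handle instead of the in-memory
sample and substitute the actual column name.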

The standard output of the training script is saved in
`/models/final/train_log.txt`.

### Plots

The training and validation loss and accuracy plots can be generated from
`/models/final/stats.csv` with the following command:

```shell
python3 plot_acc.py
```

The output is stored in `/models/final/training_metrics.png`.

## Report

To compile the report, run:

```shell
cd report
pdflatex -interaction=nonstopmode -output-directory=. main.tex
pdflatex -interaction=nonstopmode -output-directory=. main.tex
```

The report is then located in `report/main.pdf`.
