\section*{Model implementation}
The BERT model was implemented by loosely following the Medium article
``Text Classification with BERT in PyTorch'' by Ruben Winastwan, published on Towards Data Science%
\footnote{\url{https://towardsdatascience.com/text-classification-with-bert-in-pytorch-887965e5820f}}.

Our implementation uses the \texttt{BertForSequenceClassification} model from the HuggingFace \texttt{transformers}
library. The model architecture simply combines the pre-trained deep learning weights of BERT-medium with a
feed-forward output layer consisting of one neuron per class to predict.

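As an illustration, the following sketch shows how such a model can be instantiated with the
\texttt{transformers} library; the checkpoint name and the number of labels are placeholders rather than the
exact values used in our scripts.

\begin{verbatim}
# Minimal sketch: attach a classification head to pre-trained BERT weights.
# The checkpoint name and label count below are illustrative placeholders.
from transformers import AutoTokenizer, BertForSequenceClassification

checkpoint = "prajjwal1/bert-medium"   # assumed BERT-medium checkpoint
num_assignees = 42                     # one output neuron per assignee

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = BertForSequenceClassification.from_pretrained(
    checkpoint, num_labels=num_assignees)
\end{verbatim}
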
We train this model on two datasets: one contains all issues in the range $[1, 170000]$, while the other
contains a more ``recent'' set of issues, namely those in the range $[150000, 170000]$. In the training and
evaluation scripts, these datasets are named \texttt{all} and \texttt{recent} respectively. The test set
consists of issues in the range $[170001, 180000]$ and is used to evaluate the models trained on both
datasets. Each of the \texttt{all} and \texttt{recent} datasets is split chronologically into a training set
and a validation set with 90\% / 10\% proportions.

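A minimal sketch of these boundaries is given below, assuming the cleaned issues are available as a pandas
\texttt{DataFrame} with an integer \texttt{issue\_number} column; the column and file names are assumptions
made for illustration.

\begin{verbatim}
# Sketch of the dataset boundaries and the chronological 90%/10% split.
# Column and file names are assumptions, not the exact ones in our scripts.
import pandas as pd

def make_splits(df: pd.DataFrame, lo: int, hi: int):
    subset = df[(df.issue_number >= lo) & (df.issue_number <= hi)]
    subset = subset.sort_values("issue_number")   # chronological order
    cut = int(len(subset) * 0.9)                  # 90% train / 10% validation
    return subset.iloc[:cut], subset.iloc[cut:]

issues = pd.read_csv("issues_cleaned.csv")                  # hypothetical file
train_all, val_all = make_splits(issues, 1, 170_000)        # "all"
train_rec, val_rec = make_splits(issues, 150_000, 170_000)  # "recent"
test = issues[(issues.issue_number >= 170_001) &
              (issues.issue_number <= 180_000)]
\end{verbatim}
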
In order not to bias the model implementation with knowledge from ``future'' data, the classifier has as many
output neurons as there are distinct assignees in the training set. Additionally, validation instances whose
assignee does not appear in the training set are excluded. However, in order not to bias the model evaluation,
such instances are not excluded from the test set.

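Continuing the previous sketch, the restriction of the label space and the filtering of the validation set
could look as follows (again, the \texttt{assignee} column name is an assumption):

\begin{verbatim}
# Restrict the label space to assignees seen in training and drop validation
# instances with unseen assignees; the test set is left untouched.
known_assignees = set(train_rec["assignee"].unique())
val_rec = val_rec[val_rec["assignee"].isin(known_assignees)]
num_assignees = len(known_assignees)   # one output neuron per assignee
\end{verbatim}
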
The training script encodes assignees as numerical labels between 0 and the number of assignees minus 1. The
order of these labels reflects the chronological order of the first issue assigned to each assignee. The only
predictor variables considered by the model are the cleaned issue title and body, which are concatenated
without adding any additional tokens or markers, tokenized, and mapped to a 768-dimensional vector.

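A sketch of these two steps is shown below; it reuses the hypothetical column names from the previous snippets
and the tokenizer instantiated earlier.

\begin{verbatim}
# Label encoding: indices follow the chronological order of each assignee's
# first assigned issue. Column names are assumptions.
first_seen = (train_rec.sort_values("issue_number")
                       .drop_duplicates("assignee"))
label_of = {a: i for i, a in enumerate(first_seen["assignee"])}
train_rec["label"] = train_rec["assignee"].map(label_of)

# Input encoding: concatenate cleaned title and body, then tokenize.
def encode(row):
    text = row["clean_title"] + " " + row["clean_body"]   # no extra markers
    return tokenizer(text, max_length=512, truncation=True,
                     padding="max_length", return_tensors="pt")
\end{verbatim}
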
The sizes of the training, validation and test splits for each dataset are reported in table~\ref{tab:set_size}.

\begin{table}[H]
\centering
\begin{tabular}{lrr}
\toprule
Split & \texttt{recent} & \texttt{all} \\
\midrule
Training & 8303 & 91858 \\
Validation & 921 & 10167 \\
Test & 4787 & 4787 \\
\bottomrule
\end{tabular}
\caption{Number of instances in the training, validation and test sets for model training on the \texttt{recent}
and \texttt{all} datasets.}
\label{tab:set_size}
\end{table}

Our training procedure runs for 4 epochs on both datasets. In each epoch, the model is trained on a
shuffled copy of the training set while average loss and accuracy are tracked. After backward propagation,
the \textit{Adam} optimizer updates the weights of the model with a learning
rate of $5 \cdot 10^{-6}$ and \textit{beta} values equal to $(0.9, 0.9999)$.
After each epoch, validation loss and accuracy are computed.

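A minimal sketch of one such epoch is given below; the \texttt{DataLoader} construction and the loss and
accuracy bookkeeping are elided, and \texttt{train\_loader} is an assumed name.

\begin{verbatim}
# Minimal sketch of one training epoch with the optimizer settings above.
import torch
from torch.optim import Adam

optimizer = Adam(model.parameters(), lr=5e-6, betas=(0.9, 0.9999))
criterion = torch.nn.CrossEntropyLoss()

model.train()
for batch in train_loader:            # assumed DataLoader over the train set
    optimizer.zero_grad()
    output = model(input_ids=batch["input_ids"],
                   attention_mask=batch["attention_mask"])
    loss = criterion(output.logits, batch["label"])
    loss.backward()                   # backward propagation
    optimizer.step()                  # Adam update of the model weights
\end{verbatim}
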
Due to lack of time, no automatic early stopping procedure has been implemented in the model training script.
Therefore, the validation output has been used manually for hyperparameter tuning. For example, the number of
epochs has been chosen so that for both models the validation loss decreases and the validation accuracy
increases (allowing for some tolerance) between epochs, and so that these metrics do not diverge too much from
the values observed on the training set.

Another instance where the validation set has been useful is the choice of the embedding process for the issue
title and body. We choose \texttt{distilbert-base-uncased}, an uncased tokenizer, after empirically determining
that it performs better than a cased counterpart (namely \texttt{bert-base-cased}) on the validation set.
However, we do not claim that our hyperparameter tuning procedure has been completely exhaustive. For instance,
due to lack of time and computing power, both tokenizers have been tested only with a token length of 512 and
truncation enabled.

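For illustration, the two candidate tokenizers can be instantiated and applied as follows; the sample issue
title is made up.

\begin{verbatim}
# Sketch of the two candidate tokenizers, both applied with a 512-token
# limit and truncation enabled; the sample issue title is made up.
from transformers import AutoTokenizer

uncased = AutoTokenizer.from_pretrained("distilbert-base-uncased")
cased = AutoTokenizer.from_pretrained("bert-base-cased")

sample = "Crash when opening the settings editor"
ids_uncased = uncased(sample, max_length=512, truncation=True)["input_ids"]
ids_cased = cased(sample, max_length=512, truncation=True)["input_ids"]
\end{verbatim}
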
In table~\ref{tab:metrics-recent} we report loss and accuracy on the training and validation sets during
training of the model on the \texttt{recent} dataset, while in table~\ref{tab:metrics-all} we report the same
values for the model trained on the \texttt{all} dataset. By comparing the validation accuracy of the two
models, we can say that the \texttt{recent} model performs better on the validation set.

\begin{table}[H]
\centering
\begin{tabular}{lrrrr}
\toprule
Epoch & Train loss & Validation loss & Train accuracy & Validation accuracy \\
\midrule
1 & 0.204 & 0.174 & 0.171 & 0.343 \\
2 & 0.156 & 0.140 & 0.386 & 0.467 \\
3 & 0.124 & 0.125 & 0.542 & 0.545 \\
4 & 0.100 & 0.120 & 0.642 & 0.557 \\
\bottomrule
\end{tabular}
\caption{Train set and validation set loss and accuracy during model training over the \texttt{recent} dataset.}
\label{tab:metrics-recent}
\end{table}

\begin{table}[H]
\centering
\begin{tabular}{lrrrr}
\toprule
Epoch & Train loss & Validation loss & Train accuracy & Validation accuracy \\
\midrule
1 & 0.137 & 0.164 & 0.453 & 0.357 \\
2 & 0.095 & 0.154 & 0.601 & 0.405 \\
3 & 0.077 & 0.157 & 0.676 & 0.427 \\
4 & 0.060 & 0.160 & 0.751 & 0.435 \\
\bottomrule
\end{tabular}
\caption{Train set and validation set loss and accuracy during model training over the \texttt{all} dataset.}
\label{tab:metrics-all}
\end{table}

The performance of the models trained on the \texttt{all} and \texttt{recent} datasets is reported in
table~\ref{tab:test-results}. We notice that both models are significantly better at including the correct
assignee within the top 2 or top 3 recommendations than at ranking it first. For all accuracies observed, the
\texttt{recent} model still performs better than the \texttt{all} model.

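The top-$k$ accuracies can be computed with a sketch like the following, where \texttt{logits} stands for the
model outputs on the test set and \texttt{labels} for the integer truth labels (both names are ours, chosen for
illustration).

\begin{verbatim}
# Sketch of the top-k accuracy computation on the test set. "logits" has
# shape (N, num_assignees) and "labels" shape (N,); test instances whose
# assignee never appears in training can never be ranked and count as misses.
import torch

def top_k_accuracy(logits, labels, k):
    topk = logits.topk(k, dim=1).indices            # k most confident classes
    hits = (topk == labels.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

for k in range(1, 6):
    print(k, top_k_accuracy(logits, labels, k))
\end{verbatim}
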
\begin{table}[H]
\centering
\begin{tabular}{lrr}
\toprule
Truth label found & \texttt{recent} & \texttt{all} \\
\midrule
In top recommendation & 0.4980 & 0.4034 \\
Within top 2 recommendations & 0.6179 & 0.5408 \\
Within top 3 recommendations & 0.6651 & 0.5916 \\
Within top 4 recommendations & 0.6940 & 0.6359 \\
Within top 5 recommendations & 0.7174 & 0.6658 \\
\bottomrule
\end{tabular}
\caption{Model accuracy on the test set for training with the \texttt{all} and \texttt{recent} datasets. Accuracy
is reported for the recommendations given by the model output ordered by confidence.}
\label{tab:test-results}
\end{table}

The receiver operating characteristic (ROC) curves are reported according to the One-vs-Rest method by computing
one curve for each class (i.e.\ assignee) in the training set. The curves for the \texttt{recent} model are
reported in figure~\ref{fig:roc-recent}, while the curves for the \texttt{all} model are reported in
figure~\ref{fig:roc-all}. As the numeric label for each assignee is given in chronological order of first issue
assignment, we can observe a difference between long-standing and more recent contributors. Long-standing
contributors have lower AUC than recent contributors for both models. This may indicate that the models are more
effective at predicting recent contributors, as they are the most active on issues in the test set, which is by
construction made of recent issues. This may be caused by long-standing authors eventually leaving the project.

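A sketch of how such curves (and the micro-averaged curve reported further below) can be computed with
scikit-learn is given here; \texttt{scores} stands for the softmax outputs on the test set and \texttt{labels}
for the integer truth labels, both assumed names.

\begin{verbatim}
# Sketch of the One-vs-Rest ROC computation: one curve per assignee plus a
# micro-averaged curve. "scores" has shape (N, num_assignees), "labels" (N,).
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

classes = np.arange(num_assignees)
y_true = label_binarize(labels, classes=classes)    # one-hot truth matrix

per_class_auc = {}
for c in classes:
    fpr, tpr, _ = roc_curve(y_true[:, c], scores[:, c])
    per_class_auc[c] = auc(fpr, tpr)

# Micro-average: flatten all class indicators and scores together.
fpr_micro, tpr_micro, _ = roc_curve(y_true.ravel(), scores.ravel())
print("micro-averaged AUC:", auc(fpr_micro, tpr_micro))
\end{verbatim}
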
\begin{figure}
\includegraphics[width=\linewidth]{../out/model/bug_triaging_recent_4e_5e-06lr_final.ovr_curves}
\caption{One-vs-Rest ROC curves for each class in the \texttt{recent} dataset for the model trained on the same dataset.}
\label{fig:roc-recent}
\end{figure}

\begin{figure}
\includegraphics[width=\linewidth]{../out/model/bug_triaging_all_4e_5e-06lr_final.ovr_curves}
\caption{One-vs-Rest ROC curves for each class in the \texttt{all} dataset for the model trained on the same dataset.}
\label{fig:roc-all}
\end{figure}

Additionally, we report a micro-averaged ROC curve to summarize each model's overall performance, together with
the corresponding area under the curve (AUC) value. These curves can be found in figure~\ref{fig:roc-avg}. The
\texttt{recent} model is the one with the higher overall AUC.

\begin{figure}
\centering
\begin{subfigure}[t]{\linewidth}
\centering\includegraphics[width=.7\linewidth]{../out/model/bug_triaging_recent_4e_5e-06lr_final.ovr_avg}
\caption{ROC curve for the model trained on the \texttt{recent} dataset. The AUC score is $0.9228$.}
\end{subfigure}
\begin{subfigure}[t]{\linewidth}
\centering\includegraphics[width=.7\linewidth]{../out/model/bug_triaging_all_4e_5e-06lr_final.ovr_avg}
\caption{ROC curve for the model trained on the \texttt{all} dataset. The AUC score is $0.9121$.}
\end{subfigure}
\caption{Micro-averaged One-vs-Rest ROC curves for the trained models over the test set.}
\label{fig:roc-avg}
\end{figure}