\section*{Model implementation}
The BERT model was implemented by loosely following the Medium article
``Text Classification with BERT in PyTorch'' by Ruben Winastwan, published on Towards Data Science%
\footnote{\url{https://towardsdatascience.com/text-classification-with-bert-in-pytorch-887965e5820f}}.

Our implementation uses the \texttt{BertForSequenceClassification} model from the HuggingFace \texttt{transformers}
library. The model architecture simply combines the pre-trained deep learning weights of BERT-medium with a
feed-forward output layer consisting of one neuron per class to predict.

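As an illustration, the following sketch shows how such a model can be instantiated with the
\texttt{transformers} library; the checkpoint name and the number of labels are placeholders rather than the
exact values used in our scripts.

\begin{verbatim}
# Minimal sketch: attach a classification head to pre-trained BERT weights.
# The checkpoint name and label count below are illustrative placeholders.
from transformers import AutoTokenizer, BertForSequenceClassification

checkpoint = "prajjwal1/bert-medium"   # assumed BERT-medium checkpoint
num_assignees = 42                     # one output neuron per assignee

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = BertForSequenceClassification.from_pretrained(
    checkpoint, num_labels=num_assignees)
\end{verbatim}
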
We train this model on two datasets: one contains all issues in the range $[1, 170000]$, while the other
contains a more ``recent'' set of issues, namely those in the range $[150000, 170000]$. In the training and
evaluation scripts, these datasets are named \texttt{all} and \texttt{recent} respectively. The test set
consists of issues in the range $[170001, 180000]$ and is used to evaluate the models trained on both
datasets. Each of the \texttt{all} and \texttt{recent} datasets is split chronologically into a training set
and a validation set with 90\% / 10\% proportions.

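A minimal sketch of these boundaries is given below, assuming the cleaned issues are available as a pandas
\texttt{DataFrame} with an integer \texttt{issue\_number} column; the column and file names are assumptions
made for illustration.

\begin{verbatim}
# Sketch of the dataset boundaries and the chronological 90%/10% split.
# Column and file names are assumptions, not the exact ones in our scripts.
import pandas as pd

def make_splits(df: pd.DataFrame, lo: int, hi: int):
    subset = df[(df.issue_number >= lo) & (df.issue_number <= hi)]
    subset = subset.sort_values("issue_number")   # chronological order
    cut = int(len(subset) * 0.9)                  # 90% train / 10% validation
    return subset.iloc[:cut], subset.iloc[cut:]

issues = pd.read_csv("issues_cleaned.csv")                  # hypothetical file
train_all, val_all = make_splits(issues, 1, 170_000)        # "all"
train_rec, val_rec = make_splits(issues, 150_000, 170_000)  # "recent"
test = issues[(issues.issue_number >= 170_001) &
              (issues.issue_number <= 180_000)]
\end{verbatim}
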
In order not to bias the model implementation with knowledge from ``future'' data, the classifier has as many
output neurons as there are distinct assignees in the training set. Additionally, validation instances whose
assignee does not appear in the training set are excluded. However, in order not to bias the model evaluation,
such instances are not excluded from the test set.

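Continuing the previous sketch, the restriction of the label space and the filtering of the validation set
could look as follows (again, the \texttt{assignee} column name is an assumption):

\begin{verbatim}
# Restrict the label space to assignees seen in training and drop validation
# instances with unseen assignees; the test set is left untouched.
known_assignees = set(train_rec["assignee"].unique())
val_rec = val_rec[val_rec["assignee"].isin(known_assignees)]
num_assignees = len(known_assignees)   # one output neuron per assignee
\end{verbatim}
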
The training script encodes assignees as numerical labels between 0 and the number of assignees minus 1. The
order of these labels reflects the chronological order of the first issue assigned to each assignee. The only
predictor variables considered by the model are the cleaned issue title and body, which are concatenated
without adding any additional tokens or markers, tokenized, and mapped to a 768-dimensional vector.

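A sketch of these two steps is shown below; it reuses the hypothetical column names from the previous snippets
and the tokenizer instantiated earlier.

\begin{verbatim}
# Label encoding: indices follow the chronological order of each assignee's
# first assigned issue. Column names are assumptions.
first_seen = (train_rec.sort_values("issue_number")
                       .drop_duplicates("assignee"))
label_of = {a: i for i, a in enumerate(first_seen["assignee"])}
train_rec["label"] = train_rec["assignee"].map(label_of)

# Input encoding: concatenate cleaned title and body, then tokenize.
def encode(row):
    text = row["clean_title"] + " " + row["clean_body"]   # no extra markers
    return tokenizer(text, max_length=512, truncation=True,
                     padding="max_length", return_tensors="pt")
\end{verbatim}
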
The sizes of the training, validation and test splits for each dataset are reported in table~\ref{tab:set_size}.

\begin{table}[H]
\centering
\begin{tabular}{lrr}
\toprule
Split & \texttt{recent} & \texttt{all} \\
\midrule
Training & 8303 & 91858 \\
Validation & 921 & 10167 \\
Test & 4787 & 4787 \\
\bottomrule
\end{tabular}
\caption{Number of instances in the training, validation and test sets for model training on the \texttt{recent}
and \texttt{all} datasets.}
\label{tab:set_size}
\end{table}

Our training procedure runs for 4 epochs on both datasets. In each epoch, the model is trained on a
shuffled copy of the training set while average loss and accuracy are tracked. After backward propagation,
the \textit{Adam} optimizer updates the weights of the model with a learning
rate of $5 \cdot 10^{-6}$ and \textit{beta} values equal to $(0.9, 0.9999)$.
After each epoch, validation loss and accuracy are computed.

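A minimal sketch of one such epoch is given below; the \texttt{DataLoader} construction and the loss and
accuracy bookkeeping are elided, and \texttt{train\_loader} is an assumed name.

\begin{verbatim}
# Minimal sketch of one training epoch with the optimizer settings above.
import torch
from torch.optim import Adam

optimizer = Adam(model.parameters(), lr=5e-6, betas=(0.9, 0.9999))
criterion = torch.nn.CrossEntropyLoss()

model.train()
for batch in train_loader:            # assumed DataLoader over the train set
    optimizer.zero_grad()
    output = model(input_ids=batch["input_ids"],
                   attention_mask=batch["attention_mask"])
    loss = criterion(output.logits, batch["label"])
    loss.backward()                   # backward propagation
    optimizer.step()                  # Adam update of the model weights
\end{verbatim}
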
Due to lack of time, no automatic early stopping procedure has been implemented in the model training script.
Therefore, the validation output has been used manually for hyperparameter tuning. For example, the number of
epochs has been chosen so that for both models the validation loss decreases and the validation accuracy
increases (allowing for some tolerance) between epochs, and so that these metrics do not diverge too much from
the values observed on the training set.

Another instance where the validation set has been useful is the choice of the embedding process for the issue
title and body. We choose \texttt{distilbert-base-uncased}, an uncased tokenizer, after empirically determining
that it performs better than a cased counterpart (namely \texttt{bert-base-cased}) on the validation set.
However, we do not claim that our hyperparameter tuning procedure has been completely exhaustive. For instance,
due to lack of time and computing power, both tokenizers have been tested only with a token length of 512 and
truncation enabled.

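For illustration, the two candidate tokenizers can be instantiated and applied as follows; the sample issue
title is made up.

\begin{verbatim}
# Sketch of the two candidate tokenizers, both applied with a 512-token
# limit and truncation enabled; the sample issue title is made up.
from transformers import AutoTokenizer

uncased = AutoTokenizer.from_pretrained("distilbert-base-uncased")
cased = AutoTokenizer.from_pretrained("bert-base-cased")

sample = "Crash when opening the settings editor"
ids_uncased = uncased(sample, max_length=512, truncation=True)["input_ids"]
ids_cased = cased(sample, max_length=512, truncation=True)["input_ids"]
\end{verbatim}
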
In table~\ref{tab:metrics-recent} we report loss and accuracy on the training and validation sets during
training of the model on the \texttt{recent} dataset, while in table~\ref{tab:metrics-all} we report the same
values for the model trained on the \texttt{all} dataset. By comparing the validation accuracy of the two
models, we can say that the \texttt{recent} model performs better on the validation set.

\begin{table}[H]
\centering
\begin{tabular}{lrrrr}
\toprule
Epoch & Train loss & Validation loss & Train accuracy & Validation accuracy \\
\midrule
1 & 0.204 & 0.174 & 0.171 & 0.343 \\
2 & 0.156 & 0.140 & 0.386 & 0.467 \\
3 & 0.124 & 0.125 & 0.542 & 0.545 \\
4 & 0.100 & 0.120 & 0.642 & 0.557 \\
\bottomrule
\end{tabular}
\caption{Train set and validation set loss and accuracy during model training over the \texttt{recent} dataset.}
\label{tab:metrics-recent}
\end{table}

\begin{table}[H]
\centering
\begin{tabular}{lrrrr}
\toprule
Epoch & Train loss & Validation loss & Train accuracy & Validation accuracy \\
\midrule
1 & 0.137 & 0.164 & 0.453 & 0.357 \\
2 & 0.095 & 0.154 & 0.601 & 0.405 \\
3 & 0.077 & 0.157 & 0.676 & 0.427 \\
4 & 0.060 & 0.160 & 0.751 & 0.435 \\
\bottomrule
\end{tabular}
\caption{Train set and validation set loss and accuracy during model training over the \texttt{all} dataset.}
\label{tab:metrics-all}
\end{table}

The performance of the models trained on the \texttt{all} and \texttt{recent} datasets is reported in
table~\ref{tab:test-results}. We notice that both models are significantly better at including the correct
assignee within the top 2 or top 3 recommendations than at ranking it first. For all accuracies observed, the
\texttt{recent} model still performs better than the \texttt{all} model.

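The top-$k$ accuracies can be computed with a sketch like the following, where \texttt{logits} stands for the
model outputs on the test set and \texttt{labels} for the integer truth labels (both names are ours, chosen for
illustration).

\begin{verbatim}
# Sketch of the top-k accuracy computation on the test set. "logits" has
# shape (N, num_assignees) and "labels" shape (N,); test instances whose
# assignee never appears in training can never be ranked and count as misses.
import torch

def top_k_accuracy(logits, labels, k):
    topk = logits.topk(k, dim=1).indices            # k most confident classes
    hits = (topk == labels.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

for k in range(1, 6):
    print(k, top_k_accuracy(logits, labels, k))
\end{verbatim}
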
\begin{table}[H]
\centering
\begin{tabular}{lrr}
\toprule
Truth label found & \texttt{recent} & \texttt{all} \\
\midrule
In top recommendation & 0.4980 & 0.4034 \\
Within top 2 recommendations & 0.6179 & 0.5408 \\
Within top 3 recommendations & 0.6651 & 0.5916 \\
Within top 4 recommendations & 0.6940 & 0.6359 \\
Within top 5 recommendations & 0.7174 & 0.6658 \\
\bottomrule
\end{tabular}
\caption{Model accuracy on the test set for training with the \texttt{all} and \texttt{recent} datasets. Accuracy
is reported for the recommendations given by the model output ordered by confidence.}
\label{tab:test-results}
\end{table}

The receiver operating characteristic (ROC) curves are reported according to the One-vs-Rest method by computing
one curve for each class (i.e.\ assignee) in the training set. The curves for the \texttt{recent} model are
reported in figure~\ref{fig:roc-recent}, while the curves for the \texttt{all} model are reported in
figure~\ref{fig:roc-all}. As the numeric label for each assignee is given in chronological order of first issue
assignment, we can observe a difference between long-standing and more recent contributors. Long-standing
contributors have lower AUC than recent contributors for both models. This may indicate that the models are more
effective at predicting recent contributors, as they are the most active on issues in the test set, which is by
construction made of recent issues. This may be caused by long-standing authors eventually leaving the project.

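A sketch of how such curves (and the micro-averaged curve reported further below) can be computed with
scikit-learn is given here; \texttt{scores} stands for the softmax outputs on the test set and \texttt{labels}
for the integer truth labels, both assumed names.

\begin{verbatim}
# Sketch of the One-vs-Rest ROC computation: one curve per assignee plus a
# micro-averaged curve. "scores" has shape (N, num_assignees), "labels" (N,).
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

classes = np.arange(num_assignees)
y_true = label_binarize(labels, classes=classes)    # one-hot truth matrix

per_class_auc = {}
for c in classes:
    fpr, tpr, _ = roc_curve(y_true[:, c], scores[:, c])
    per_class_auc[c] = auc(fpr, tpr)

# Micro-average: flatten all class indicators and scores together.
fpr_micro, tpr_micro, _ = roc_curve(y_true.ravel(), scores.ravel())
print("micro-averaged AUC:", auc(fpr_micro, tpr_micro))
\end{verbatim}
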
\begin{figure}
\includegraphics[width=\linewidth]{../out/model/bug_triaging_recent_4e_5e-06lr_final.ovr_curves}
\caption{One-vs-Rest ROC curves for each class in the \texttt{recent} dataset for the model trained on the same dataset.}
\label{fig:roc-recent}
\end{figure}

\begin{figure}
\includegraphics[width=\linewidth]{../out/model/bug_triaging_all_4e_5e-06lr_final.ovr_curves}
\caption{One-vs-Rest ROC curves for each class in the \texttt{all} dataset for the model trained on the same dataset.}
\label{fig:roc-all}
\end{figure}

Additionally, we report a micro-averaged ROC curve to summarize each model's overall performance, together with
the corresponding area under the curve (AUC) value. These curves can be found in figure~\ref{fig:roc-avg}. The
\texttt{recent} model is the one with the higher overall AUC.

\begin{figure}
\centering
\begin{subfigure}[t]{\linewidth}
\centering\includegraphics[width=.7\linewidth]{../out/model/bug_triaging_recent_4e_5e-06lr_final.ovr_avg}
\caption{ROC curve for the model trained on the \texttt{recent} dataset. The AUC score is $0.9228$.}
\end{subfigure}
\begin{subfigure}[t]{\linewidth}
\centering\includegraphics[width=.7\linewidth]{../out/model/bug_triaging_all_4e_5e-06lr_final.ovr_avg}
\caption{ROC curve for the model trained on the \texttt{all} dataset. The AUC score is $0.9121$.}
\end{subfigure}
\caption{Micro-averaged One-vs-Rest ROC curves for the trained models over the test set.}
\label{fig:roc-avg}
\end{figure}