\section*{Model implementation}

The BERT model was implemented by loosely following a Medium article named ``Text Classification with BERT in PyTorch - Towards Data Science'' by Ruben Winastwan%
\footnote{\url{https://towardsdatascience.com/text-classification-with-bert-in-pytorch-887965e5820f}}.
Our implementation uses the \texttt{BertForSequenceClassification} model from the HuggingFace \texttt{transformers} library. The model architecture simply joins the pre-trained BERT-medium weights with a feed-forward output layer consisting of one neuron per class to predict.

We train this model on two datasets: one contains all issues in the range $[1, 170000]$, while the other contains a more ``recent'' subset, namely the issues in the range $[150000, 170000]$. In the training and evaluation scripts, these datasets are named \texttt{all} and \texttt{recent} respectively. The test set is made of the issues in the range $[170001, 180000]$ and is used to evaluate the models trained on both datasets. Each of the \texttt{all} and \texttt{recent} datasets is split chronologically into a training set and a validation set with 90\%/10\% proportions.

In order not to bias the model implementation with knowledge from ``future'' data, the classifier has as many output neurons as there are distinct assignees appearing in the training set. Additionally, instances in the validation set whose assignee does not match one of the assignees in the training set are excluded. However, in order not to bias the model evaluation, such instances are not excluded from the test set. The training script encodes assignees with numerical labels between 0 and the number of assignees minus 1; the order of these labels reflects the chronological order of the first issue assigned to each assignee. The only predictor variables considered by the model are the cleaned issue title and body, which are concatenated without adding any additional tokens or markers, tokenized, and mapped to a 768-dimensional vector. The sizes of the training, validation and test splits for each dataset are reported in table~\ref{tab:set_size}.

\begin{table}[H]
  \centering
  \begin{tabular}{lrr}
    \toprule
    Split & \texttt{recent} & \texttt{all} \\
    \midrule
    Training & 8303 & 91858 \\
    Validation & 921 & 10167 \\
    Test & 4787 & 4787 \\
    \bottomrule
  \end{tabular}
  \caption{Number of instances in the training, validation and test sets for model training on the \texttt{recent} and \texttt{all} datasets.}
  \label{tab:set_size}
\end{table}

Our training procedure runs for 4 epochs on both datasets. In each epoch, the model is trained on a shuffled copy of the training set while average loss and accuracy are tracked. After backward propagation, the \textit{Adam} optimizer updates the model weights with a learning rate of $5 \cdot 10^{-6}$ and \textit{beta} values of $(0.9, 0.9999)$. After each epoch, validation loss and accuracy are computed. Due to lack of time, no automatic early stopping procedure has been implemented in the training script; instead, the validation output has been used manually for hyperparameter tuning. For example, the number of epochs has been chosen so that, for both models, the validation loss decreases and the validation accuracy increases (allowing for some tolerance) between epochs, and so that these metrics do not diverge too much from the values observed on the training set.
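The listing below gives a minimal sketch of this training setup using the HuggingFace \texttt{transformers} and PyTorch APIs. The checkpoint names, column names, batch size and surrounding data-handling code are illustrative assumptions rather than the exact contents of our training script; the tokenizer checkpoint anticipates the choice discussed below.

\begin{verbatim}
import torch
from torch.optim import Adam
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, BertForSequenceClassification

# Assignees are encoded as integer labels 0 .. n_classes - 1, ordered by the
# date of the first issue assigned to each assignee. `train_df` is an assumed
# pandas DataFrame with "created_at" and "assignee" columns.
first_seen = train_df.sort_values("created_at").drop_duplicates("assignee")
label_of = {a: i for i, a in enumerate(first_seen["assignee"])}
n_classes = len(label_of)

# Pre-trained BERT weights joined with a feed-forward head that has one
# output neuron per assignee seen in the training set.
model = BertForSequenceClassification.from_pretrained(
    "prajjwal1/bert-medium",  # an assumed BERT-medium checkpoint
    num_labels=n_classes,
)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def encode_issue(title, body):
    # The cleaned title and body are concatenated without extra markers,
    # then tokenized and truncated to 512 tokens.
    return tokenizer(title + " " + body, max_length=512,
                     truncation=True, padding="max_length")

optimizer = Adam(model.parameters(), lr=5e-6, betas=(0.9, 0.9999))
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# `train_dataset` is assumed to yield dicts of input_ids, attention_mask and
# labels tensors built with `encode_issue` and `label_of` (not shown).
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

for epoch in range(4):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        out = model(input_ids=batch["input_ids"].to(device),
                    attention_mask=batch["attention_mask"].to(device),
                    labels=batch["labels"].to(device))
        out.loss.backward()  # cross-entropy loss from the classification head
        optimizer.step()
\end{verbatim}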
Another instance where the validation set has been useful is the choice of the tokenizer used to encode the issue title and body. We choose \texttt{distilbert-base-uncased}, an uncased tokenizer, after empirically determining that it provides better performance than a cased counterpart (namely \texttt{bert-base-cased}) on the validation set. However, we do not claim that our hyperparameter tuning procedure has been exhaustive: for instance, due to lack of time and computing power, both tokenizers have been tested only with a token length of 512 and truncation enabled.

In table~\ref{tab:metrics-recent} we report loss and accuracy on the training and validation sets during training over the \texttt{recent} dataset, while in table~\ref{tab:metrics-all} we report the same values for the model trained over the \texttt{all} dataset. Comparing the validation accuracies of the two models already suggests that the \texttt{recent} model performs better, which is confirmed by the test set results reported below.

\begin{table}[H]
  \centering
  \begin{tabular}{lrrrr}
    \toprule
    Epoch & Train loss & Validation loss & Train accuracy & Validation accuracy \\
    \midrule
    1 & 0.204 & 0.174 & 0.171 & 0.343 \\
    2 & 0.156 & 0.140 & 0.386 & 0.467 \\
    3 & 0.124 & 0.125 & 0.542 & 0.545 \\
    4 & 0.100 & 0.120 & 0.642 & 0.557 \\
    \bottomrule
  \end{tabular}
  \caption{Training set and validation set loss and accuracy during model training over the \texttt{recent} dataset.}
  \label{tab:metrics-recent}
\end{table}

\begin{table}[H]
  \centering
  \begin{tabular}{lrrrr}
    \toprule
    Epoch & Train loss & Validation loss & Train accuracy & Validation accuracy \\
    \midrule
    1 & 0.137 & 0.164 & 0.453 & 0.357 \\
    2 & 0.095 & 0.154 & 0.601 & 0.405 \\
    3 & 0.077 & 0.157 & 0.676 & 0.427 \\
    4 & 0.060 & 0.160 & 0.751 & 0.435 \\
    \bottomrule
  \end{tabular}
  \caption{Training set and validation set loss and accuracy during model training over the \texttt{all} dataset.}
  \label{tab:metrics-all}
\end{table}

The performance on the test set of the models trained on the \texttt{all} and \texttt{recent} datasets is reported in table~\ref{tab:test-results}. We notice that, for both models, the correct assignee is found within the top 2 or top 3 recommendations significantly more often than in the top recommendation alone. For all the accuracies observed, the \texttt{recent} model still performs better than the \texttt{all} model.

\begin{table}[H]
  \centering
  \begin{tabular}{lrr}
    \toprule
    Truth label found & \texttt{recent} & \texttt{all} \\
    \midrule
    In top recommendation & 0.4980 & 0.4034 \\
    Within top 2 recommendations & 0.6179 & 0.5408 \\
    Within top 3 recommendations & 0.6651 & 0.5916 \\
    Within top 4 recommendations & 0.6940 & 0.6359 \\
    Within top 5 recommendations & 0.7174 & 0.6658 \\
    \bottomrule
  \end{tabular}
  \caption{Model accuracy on the test set for training with the \texttt{all} and \texttt{recent} datasets. Accuracy is reported for the recommendations given by the model output ordered by confidence.}
  \label{tab:test-results}
\end{table}
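The top-$k$ accuracies in table~\ref{tab:test-results} can be computed from the classifier logits as sketched below. This is a hedged illustration that reuses the batch layout assumed in the training sketch above; \texttt{test\_loader} and the label convention for unseen assignees are assumptions rather than our exact evaluation code.

\begin{verbatim}
import torch

@torch.no_grad()
def top_k_accuracies(model, test_loader, device, ks=(1, 2, 3, 4, 5)):
    """Fraction of test issues whose true assignee is among the k most
    confident outputs, for each k. Assignees never seen during training are
    assumed to carry an out-of-range label, so they can never be a hit."""
    hits = {k: 0 for k in ks}
    total = 0
    model.eval()
    for batch in test_loader:
        logits = model(
            input_ids=batch["input_ids"].to(device),
            attention_mask=batch["attention_mask"].to(device)).logits
        ranked = logits.argsort(dim=-1, descending=True)  # by confidence
        labels = batch["labels"].to(device).unsqueeze(1)
        for k in ks:
            hits[k] += (ranked[:, :k] == labels).any(dim=1).sum().item()
        total += labels.size(0)
    return {k: hits[k] / total for k in ks}
\end{verbatim}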
The receiver operating characteristic (ROC) curves are reported according to the One-vs-Rest method, by computing one curve for each class (i.e.\ assignee) in the training set. The curves for the \texttt{recent} model are reported in figure~\ref{fig:roc-recent}, while the curves for the \texttt{all} model are reported in figure~\ref{fig:roc-all}. As the numeric label of each assignee is given in chronological order of first issue assignment, we can observe a difference between long-standing and more recent contributors: long-standing contributors have lower AUC than recent contributors for both models. This may indicate that the models are more effective at predicting recent contributors, as these are the most active on issues in the test set, which by construction contains only recent issues. A possible explanation is that long-standing contributors eventually leave the project.

\begin{figure}
  \includegraphics[width=\linewidth]{../out/model/bug_triaging_recent_4e_5e-06lr_final.ovr_curves}
  \caption{One-vs-Rest ROC curves for each class in the \texttt{recent} dataset for the model trained on the same dataset.}
  \label{fig:roc-recent}
\end{figure}

\begin{figure}
  \includegraphics[width=\linewidth]{../out/model/bug_triaging_all_4e_5e-06lr_final.ovr_curves}
  \caption{One-vs-Rest ROC curves for each class in the \texttt{all} dataset for the model trained on the same dataset.}
  \label{fig:roc-all}
\end{figure}

Additionally, we report a micro-averaged ROC curve to assess each model's overall performance, together with the corresponding area under the curve (AUC) value. These curves can be found in figure~\ref{fig:roc-avg}. The \texttt{recent} model is the one with the higher overall AUC.

\begin{figure}
  \centering
  \begin{subfigure}[t]{\linewidth}
    \centering
    \includegraphics[width=.7\linewidth]{../out/model/bug_triaging_recent_4e_5e-06lr_final.ovr_avg}
    \caption{ROC curve for the model trained on the \texttt{recent} dataset. The AUC score is $0.9228$.}
  \end{subfigure}
  \begin{subfigure}[t]{\linewidth}
    \centering
    \includegraphics[width=.7\linewidth]{../out/model/bug_triaging_all_4e_5e-06lr_final.ovr_avg}
    \caption{ROC curve for the model trained on the \texttt{all} dataset. The AUC score is $0.9121$.}
  \end{subfigure}
  \caption{Micro-averaged One-vs-Rest ROC curves for the trained models over the test set.}
  \label{fig:roc-avg}
\end{figure}
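The One-vs-Rest curves in figures~\ref{fig:roc-recent} and~\ref{fig:roc-all} and the micro-averaged curves in figure~\ref{fig:roc-avg} can be obtained from the test-set class probabilities roughly as sketched below. This sketch assumes scikit-learn and NumPy, which are not necessarily the libraries used by our evaluation script.

\begin{verbatim}
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def ovr_roc(y_true, y_score, n_classes):
    """y_true: integer assignee labels of the test set (assignees unseen
    during training carry out-of-range labels and count as negatives for
    every class); y_score: per-class probabilities as a NumPy array of
    shape (n_samples, n_classes). Returns per-class and micro-averaged
    ROC curves."""
    y_true = np.asarray(y_true)
    y_bin = np.zeros_like(y_score, dtype=int)
    rows = np.arange(len(y_true))
    seen = y_true < n_classes
    y_bin[rows[seen], y_true[seen]] = 1

    # One ROC curve per assignee (One-vs-Rest); assignees that never appear
    # in the test set are skipped, since their curve is undefined.
    per_class = {c: roc_curve(y_bin[:, c], y_score[:, c])
                 for c in range(n_classes) if y_bin[:, c].any()}

    # Micro-averaged curve and AUC, pooling all (sample, class) decisions.
    micro_fpr, micro_tpr, _ = roc_curve(y_bin.ravel(), y_score.ravel())
    micro_auc = roc_auc_score(y_bin, y_score, average="micro")
    return per_class, (micro_fpr, micro_tpr, micro_auc)
\end{verbatim}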