\section*{Data Cleaning}

For data cleaning, we employ a series of procedures to remove, as far as possible, noisy data that could hinder the learning process of the DL model.

The first thing we do is to convert the issue body from Markdown to HTML\@.
Thanks to this conversion, we can work directly on the HTML tags and use Python's Beautiful Soup to remove certain HTML blocks.
In particular, we remove everything that is contained inside the \verb|<details> </details>| tag.
We do this because the content of this tag holds system details of the user and is not useful for the classification task our model needs to perform.
Moreover, we remove HTML comments, since they are part of the boilerplate that is generated when an issue is submitted.
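
The exact code for this step is not shown here; the following is a minimal sketch of it, assuming the Python \verb|markdown| package for the conversion (the converter is not named in the text) and Beautiful Soup for the tag and comment removal.
\begin{verbatim}
import markdown
from bs4 import BeautifulSoup, Comment

def strip_html_blocks(body_md):
    # Convert the Markdown body to HTML so that we can work on tags.
    html = markdown.markdown(body_md)
    soup = BeautifulSoup(html, "html.parser")

    # Drop every <details> block: it only holds the user's system details.
    for details in soup.find_all("details"):
        details.decompose()

    # Drop HTML comments left over from the issue-template boilerplate.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    return soup
\end{verbatim}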

Now that all the unhelpful sections have been removed, we convert the HTML back to plain text.
From here, we use a Python library to remove all emojis contained in the body and in the title (since they would not help during training).
We also remove all URLs and newlines.
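
The emoji-removal library is likewise not named here; the sketch below therefore stands in with a simple Unicode-range regular expression and plain string handling, and should be read as an illustration of the step rather than the exact implementation.
\begin{verbatim}
import re

# Rough emoji ranges (symbols, pictographs, flags); a stand-in for the
# unnamed emoji-removal library mentioned in the text.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U0001F1E6-\U0001F1FF"
    "\u2600-\u27BF]"
)
URL_RE = re.compile(r"https?://\S+")

def to_clean_text(soup):
    # Back to plain text, continuing from the soup of the previous sketch.
    text = soup.get_text(separator=" ")
    text = EMOJI_RE.sub("", text)   # emojis would not help training
    text = URL_RE.sub("", text)     # remove URLs
    text = text.replace("\n", " ")  # remove newlines
    return re.sub(r"\s+", " ", text).strip()
\end{verbatim}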

Finally, we check whether the remaining body and title are written in a language that uses Latin characters.
If we encounter an issue written in another language (such as Russian or Chinese), the whole issue is discarded.
We do this because our DL model has been pre-trained on English documents, so it would not make sense to train it on Chinese or Russian data.
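
How the script check is implemented is not stated; one straightforward possibility, sketched below, is to discard an issue as soon as its title or body contains characters from a non-Latin script such as Cyrillic, CJK ideographs, kana, or Hangul.
\begin{verbatim}
import re

# Characters from scripts the model cannot handle (illustrative, not
# exhaustive): Cyrillic, CJK ideographs, Japanese kana, Hangul syllables.
NON_LATIN_RE = re.compile(
    "[\u0400-\u04FF\u4E00-\u9FFF\u3040-\u30FF\uAC00-\uD7AF]"
)

def is_latin_script(title, body):
    # Keep the issue only if neither field contains non-Latin characters.
    return NON_LATIN_RE.search(title + " " + body) is None
\end{verbatim}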

We also tried two techniques that are typically used in this setting: stemming and stopword removal.
By applying stemming, we noticed that the body lost the fluidity of natural language.
Since our model has been trained to recognize natural language, we felt that removing stopwords would likewise not make sense.
In addition, we had planned to use stopword removal to decrease the number of tokens fed to BERT (the limit for our base model is 512).
However, after a statistical analysis of the data, we noticed that 101468 out of 102065 issues ($99.4\%$) consist of fewer than 512 tokens (the next section describes our statistical analysis in greater detail).
In this case, too, we observed results similar to stemming: the text lost its fluidity.
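
As an illustration of how the token counts can be obtained, the snippet below uses the Hugging Face tokenizer of a BERT base model; the checkpoint name is an assumption on our part, since the text only states that the base model has a 512-token limit.
\begin{verbatim}
from transformers import AutoTokenizer

# Assumed checkpoint; the text only mentions "BERT" as the base model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def exceeds_limit(text, limit=512):
    # Count sub-word tokens, including the [CLS] and [SEP] special tokens.
    return len(tokenizer.encode(text, truncation=False)) > limit
\end{verbatim}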