soft-analytics-01/docs/sections/statistics.tex

\section*{Data Analysis}
From the CSV exported by the cleaning pipeline, we extracted several statistics about the training set (issues 1 to 170000).
In particular, we analyzed the distribution of issue word counts, the distribution of issues per author, and the distribution of opened issues across the days of the week.

For the word count distribution, we checked how many issues have fewer than 512 words (as noted earlier, 512 is the maximum number of tokens that we can pass to BERT, so word count serves here as a proxy for token count).
Out of 102065 cleaned and valid issues, $99.4\%$ (101468 issues) are shorter than 512 words.
The remaining $0.6\%$ (597 issues) are longer.
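As a quick check, these shares follow directly from the counts reported above:
\[
    \frac{101468}{102065} \approx 0.994,
    \qquad
    \frac{597}{102065} \approx 0.006 .
\]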
This result makes stopword removal unnecessary for our goal of reducing the number of tokens, since almost all issues already fit within the limit.
The image below shows the word count distribution of the issues with fewer than 512 words.

\begin{center}
    \includegraphics[width=10cm]{../out/plots/length_dist}
\end{center}

From this distribution, we can see that the most frequent length is 42 words, which is the case for $1.1\%$ of the issues (1115 issues).
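
The word count figures above could be computed along the following lines.
This is only a minimal sketch and not the exact script used for this report: the function name \texttt{length\_stats} is hypothetical, and splitting on whitespace is an assumption about how words are counted.

\begin{verbatim}
from collections import Counter

MAX_WORDS = 512  # limit quoted in the text (BERT's maximum input)


def length_stats(texts, limit=MAX_WORDS):
    """Summarize word counts for an iterable of issue texts.

    Words are split on whitespace, which is an assumption: the
    tokenization behind the reported numbers may differ.
    """
    counts = [len(text.split()) for text in texts]
    if not counts:
        raise ValueError("no issues to analyze")
    # Issues strictly shorter than the limit.
    below = sum(1 for c in counts if c < limit)
    # Most frequent word count and how many issues have it.
    mode_len, mode_freq = Counter(counts).most_common(1)[0]
    return {
        "total": len(counts),
        "below_limit": below,
        "below_limit_pct": 100 * below / len(counts),
        "mode_length": mode_len,
        "mode_issues": mode_freq,
    }
\end{verbatim}

The function takes any iterable of strings, so the issue texts can be read from the exported CSV in whatever way fits the pipeline.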

Regarding the author distribution (the number of issues per author), we found that out of a total of 102 authors, $39.3\%$ (40 authors) contributed to fewer than 10 issues.
The other $60.7\%$ (62 authors) contributed to more than 10 issues.
The number of issues per author is shown in the graph below.

\begin{center}
    \includegraphics[width=10cm]{../out/plots/author_dist}
\end{center}

From this graph, we can identify the top 5 authors by number of assigned issues:

\begin{enumerate}
    \item mjbvz: $11.6\%$ (11882 issues)
    \item bpasero: $8.11\%$ (8280 issues)
    \item Tyriar: $7.91\%$ (8075 issues)
    \item joaomoreno: $7.61\%$ (7775 issues)
    \item isidorn: $6.77\%$ (6914 issues)
\end{enumerate}
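
The per-author counts and the ranking above could be derived as in the sketch below.
It is illustrative only: the function name \texttt{author\_stats} and its parameters are hypothetical, and it assumes one author name per issue as input.

\begin{verbatim}
from collections import Counter


def author_stats(authors, threshold=10, top_n=5):
    """Summarize issues per author.

    `authors` holds one author name per issue (an assumed
    input format, not necessarily the pipeline's).
    """
    per_author = Counter(authors)
    if not per_author:
        raise ValueError("no issues to analyze")
    total_issues = sum(per_author.values())
    # Authors with strictly fewer issues than the threshold.
    below = sum(1 for n in per_author.values() if n < threshold)
    # (name, issue count, share of all issues in percent)
    top = [
        (name, n, 100 * n / total_issues)
        for name, n in per_author.most_common(top_n)
    ]
    return {
        "authors": len(per_author),
        "below_threshold": below,
        "top": top,
    }
\end{verbatim}

Each entry in \texttt{top} gives an author's issue count together with its share of all issues, which is the form used in the list above.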