This commit is contained in:
Claudio Maggioni 2023-12-28 11:27:31 +01:00
parent 9d6315152e
commit 6d44303c57
2 changed files with 21 additions and 22 deletions

@@ -85,7 +85,7 @@
250 test cases are extracted from the pool following this procedure, where each of the following outcomes occurs
with equal probability ($p=1/3$):
\begin{itemize}
\item The extracted test case may be kept as-is;
\item The extracted test case may be randomly mutated using the \textit{mutate} function. An argument will be
chosen at random, and if of type \texttt{str} a random position in the string will be replaced with a
random character. If the argument is of type \texttt{int}, a random value $\in [-10, 10]$ will be added to
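A minimal sketch of such a mutation operator is given below; the use of \texttt{string.printable} as the
random-character alphabet and the exact handling of the integer case are assumptions for illustration, not details
taken from the implementation:
\begin{verbatim}
import random
import string

def mutate_arguments(args):
    # Sketch of the mutation step described above: pick one argument at
    # random and perturb it depending on its type. The alphabet and the
    # integer handling are assumptions made for this illustration.
    args = list(args)
    i = random.randrange(len(args))
    arg = args[i]
    if isinstance(arg, str) and arg:
        pos = random.randrange(len(arg))
        repl = random.choice(string.printable)
        args[i] = arg[:pos] + repl + arg[pos + 1:]
    elif isinstance(arg, int):
        args[i] = arg + random.randint(-10, 10)
    return args
\end{verbatim}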
@@ -121,29 +121,16 @@
The genetic algorithm is run 10 times. At the end of each execution the best individuals (sorted by increasing
fitness) are selected if they cover at least one branch that has not been covered. This is the only point in the
procedure where the set of covered branches is updated\footnote{This differs from the reference implementation of
\texttt{sb\_cgi\_decode.py}, which performs the update directly in the fitness function.}.
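As an illustration, a minimal sketch of this selection step could look as follows; \texttt{run\_ga()} and
\texttt{branches\_covered()} are hypothetical helpers standing in for the actual implementation:
\begin{verbatim}
# Sketch of the archive update described above. run_ga() and
# branches_covered() are hypothetical placeholder helpers.
covered = set()      # branches covered so far (updated only here)
suite = []           # selected test cases

for _ in range(10):
    population = run_ga()  # best individuals, sorted by increasing fitness
    for individual in population:
        new_branches = branches_covered(individual) - covered
        if new_branches:
            suite.append(individual)
            covered |= new_branches
\end{verbatim}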
\subsection*{Section 4: Statistical comparison of test generators}
To compare the performance of the fuzzer and the genetic algorithm, the mutation testing tool \textit{mut.py} has
been used to measure how robust the generated test suites are. Both implementations have been executed 10 times
using different RNG seeds each time, and a statistical comparison of the resulting mutation score distributions has
been performed to determine whether one generation method performs statistically better than the other.
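A sketch of this experimental loop is shown below; \texttt{generate\_tests()} and \texttt{mutation\_score()} are
hypothetical helpers wrapping the respective test generator and the \textit{mut.py} invocation for one benchmark
file:
\begin{verbatim}
import random

# generate_tests() and mutation_score() are hypothetical placeholders
# for the actual generators and the mut.py invocation.
def run_experiment(benchmark, n_runs=10):
    scores = {"fuzzer": [], "genetic": []}
    for seed in range(n_runs):
        random.seed(seed)  # a different RNG seed for every repetition
        for method in ("fuzzer", "genetic"):
            suite = generate_tests(benchmark, method)
            scores[method].append(mutation_score(benchmark, suite))
    return scores
\end{verbatim}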
\begin{figure}[t]
\begin{center}
\includegraphics[width=\linewidth]{../out/mutation_scores}
\caption{Distributions of \textit{mut.py} mutation scores over the generated benchmark test suites
@@ -151,7 +138,7 @@
\end{center}
\end{figure}
\begin{figure}[t]
\begin{center}
\includegraphics[width=\linewidth]{../out/mutation_scores_mean}
\caption{Average \textit{mut.py} mutation score over the generated benchmark test suites
@@ -159,7 +146,10 @@
\end{center}
\end{figure}
Figure~\ref{fig:mutation-scores} shows a boxplot of the mutation score distributions for each file in the benchmark
suite, while figure~\ref{fig:mutation-scores-mean} shows the mean mutation scores.
\begin{table}[t]
\centering
\begin{tabular}{lrrp{3.5cm}r}
\toprule
@@ -182,4 +172,13 @@
means, the Wilcoxon paired test p-value and Cohen's $d$ effect size for each file in the
benchmark.}\label{tab:stats}
\end{table}
To perform a statistical comparison, the Wilcoxon paired test with a p-value threshold of 0.05 has been used to
check whether there is a statistically significant difference between the distributions. Moreover, Cohen's $d$
effect size has been used to measure the magnitude of the difference. Results of the statistical analysis are
shown in table~\ref{tab:stats}.
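For illustration, such a comparison for a single benchmark file could be sketched as follows, using
\texttt{scipy.stats.wilcoxon} and a pooled-standard-deviation formulation of Cohen's $d$ (the exact effect size
formula used in the report is an assumption):
\begin{verbatim}
import numpy as np
from scipy.stats import wilcoxon

def compare(fuzzer_scores, ga_scores, alpha=0.05):
    # Paired Wilcoxon test between the two mutation score samples.
    _, p_value = wilcoxon(fuzzer_scores, ga_scores)
    # Cohen's d computed with a pooled standard deviation.
    a, b = np.asarray(fuzzer_scores), np.asarray(ga_scores)
    pooled_std = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d = (b.mean() - a.mean()) / pooled_std
    return p_value < alpha, p_value, d
\end{verbatim}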
Only 3 benchmark files out of 10 show a statistically significant difference in performance between the two
scripts, namely \textit{check\_armstrong}, \textit{rabin\_karp} and \textit{anagram\_check}. For the first two the
genetic algorithm performs significantly better than the fuzzer, while for the last file the opposite holds.
\end{document}