250 test cases are extracted from the pool following this procedure. With equal probabilities (each with $p=1/3$):

\begin{itemize}
    \item The extracted test case may be kept as-is;
    \item The extracted test case may be randomly mutated using the \textit{mutate} function. An argument will be
          chosen at random, and if of type \texttt{str} a random position in the string will be replaced with a
          random character. If the argument is of type \texttt{int}, a random value $\in [-10, 10]$ will be added to
          it.
\end{itemize}
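The mutation step described above can be sketched in Python as follows (a hypothetical \textit{mutate}; the actual signature and character pool in the project may differ):

```python
import random
import string

def mutate(args):
    # Pick one argument at random to mutate (args is a list of str/int values).
    args = list(args)
    i = random.randrange(len(args))
    arg = args[i]
    if isinstance(arg, str) and arg:
        # Replace a random position in the string with a random character.
        pos = random.randrange(len(arg))
        char = random.choice(string.printable)
        args[i] = arg[:pos] + char + arg[pos + 1:]
    elif isinstance(arg, int):
        # Add a random value in [-10, 10] to the integer.
        args[i] += random.randint(-10, 10)
    return args
```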
The genetic algorithm is run 10 times. At the end of each execution the best individuals (sorted by increasing
fitness) are selected if they cover at least one branch that has not yet been covered. This is the only point in the
procedure where the set of covered branches is updated\footnote{This differs from the reference implementation of
\texttt{sb\_cgi\_decode.py}, which performs the update directly in the fitness function.}.
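The selection-and-update step can be sketched as follows (all names here are assumptions for illustration, not the report's actual API):

```python
def select_new_coverage(population, fitness, branches_of, covered):
    # Walk individuals by increasing fitness; keep an individual only if it
    # reaches at least one branch not yet in `covered`. This is the only
    # place where the covered-branch set is updated.
    selected = []
    for individual in sorted(population, key=fitness):
        new_branches = branches_of(individual) - covered
        if new_branches:
            selected.append(individual)
            covered |= new_branches
    return selected
```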
\subsection*{Section 4: Statistical comparison of test generators}

This section reports and comments on the results of the experimental procedure.
To compare the performance of the fuzzer and the genetic algorithm, the mutation testing tool \textit{mut.py} has
been used to measure how robust the generated test suites are. Both implementations have been executed 10 times
using different RNG seeds each time, and a statistical comparison of the resulting mutation score distributions has
been performed to determine whether one generation method is statistically more performant than the other.
\paragraph{For each benchmark program P:}
\begin{itemize}
    \item Repeat the following experiment N times (e.g., with N = 10):
    \begin{itemize}
        \item Generate search-based test cases for P using the GA generator
        \item Measure the mutation score for P
        \item Generate random test cases for P using the Fuzzer
        \item Measure the mutation score for P
    \end{itemize}
    \item Visualize the N mutation score values of Fuzzer and GA using boxplots
    \item Report the average mutation score of Fuzzer and GA
    \item Compute the effect size using Cohen's $d$ effect size measure
    \item Compare the N mutation score values of Fuzzer vs GA using the Wilcoxon statistical test
\end{itemize}
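The procedure above can be sketched as a driver loop (the helpers \texttt{generate\_ga}, \texttt{generate\_fuzzer} and \texttt{mutation\_score} are assumed names, not the project's actual API):

```python
import random

def run_experiment(program, generate_ga, generate_fuzzer, mutation_score, n=10):
    # Repeat the experiment n times, seeding the RNG differently on each run,
    # and collect one mutation score per generator per run.
    ga_scores, fuzzer_scores = [], []
    for seed in range(n):
        random.seed(seed)
        ga_scores.append(mutation_score(program, generate_ga(program)))
        fuzzer_scores.append(mutation_score(program, generate_fuzzer(program)))
    return ga_scores, fuzzer_scores
```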
\begin{figure}[t]
    \begin{center}
        \includegraphics[width=\linewidth]{../out/mutation_scores}
        \caption{Distributions of \textit{mut.py} mutation scores over the generated benchmark test
            suites}\label{fig:mutation-scores}
    \end{center}
\end{figure}
\begin{figure}[t]
    \begin{center}
        \includegraphics[width=\linewidth]{../out/mutation_scores_mean}
        \caption{Average \textit{mut.py} mutation score over the generated benchmark test
            suites}\label{fig:mutation-scores-mean}
    \end{center}
\end{figure}
Figure~\ref{fig:mutation-scores} shows a boxplot of the mutation score distributions for each file in the benchmark
suite, while figure~\ref{fig:mutation-scores-mean} shows the mean mutation scores.

\begin{table}[t]
    \centering
    \begin{tabular}{lrrp{3.5cm}r}
        \toprule
        means, the Wilcoxon paired test p-value and the Cohen's $d$ effect size for each file in the
        benchmark.}\label{tab:stats}
\end{table}
To perform a statistical comparison, the Wilcoxon paired test with a p-value threshold of 0.05 has been used to
check whether there is a statistically significant difference between the distributions. Moreover, Cohen's $d$
effect size has been used to measure the magnitude of the difference. Results of the statistical analysis are
shown in table~\ref{tab:stats}.
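Cohen's $d$ with a pooled standard deviation can be computed as in the sketch below (a minimal pure-Python helper; in practice the Wilcoxon p-value itself would come from a library routine such as \texttt{scipy.stats.wilcoxon}):

```python
import statistics

def cohens_d(a, b):
    # Effect size: difference of means divided by the pooled standard
    # deviation (sample variances, ddof=1).
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5
```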
Only 3 benchmark files out of 10 show a statistically different performance between the two scripts, namely
\textit{check\_armstrong}, \textit{rabin\_karp} and \textit{anagram\_check}. On the first two, the genetic algorithm
performs significantly better than the fuzzer, while on the last file the opposite holds.
\end{document}