250 test cases are extracted from the pool following this procedure. With equal probabilities (each with $p=1/3$):

\begin{itemize}
    \item The extracted test case may be kept as-is;
    \item The extracted test case may be randomly mutated using the \textit{mutate} function. An argument will be
          chosen at random, and if of type \texttt{str} a random position in the string will be replaced with a
          random character. If the argument is of type \texttt{int}, a random value $\in [-10, 10]$ will be added to
          it.
\end{itemize}
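The mutation step described above can be sketched in Python as follows (a hypothetical \textit{mutate}; the actual signature and character pool in the project may differ):

```python
import random
import string

def mutate(args):
    # Pick one argument at random to mutate (args is a list of str/int values).
    args = list(args)
    i = random.randrange(len(args))
    arg = args[i]
    if isinstance(arg, str) and arg:
        # Replace a random position in the string with a random character.
        pos = random.randrange(len(arg))
        char = random.choice(string.printable)
        args[i] = arg[:pos] + char + arg[pos + 1:]
    elif isinstance(arg, int):
        # Add a random value in [-10, 10] to the integer.
        args[i] += random.randint(-10, 10)
    return args
```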
The genetic algorithm is run 10 times. At the end of each execution the best individuals (sorted by increasing
fitness) are selected if they cover at least one branch that has not yet been covered. This is the only point in the
procedure where the set of covered branches is updated\footnote{This differs from the reference implementation of
\texttt{sb\_cgi\_decode.py}, which performs the update directly in the fitness function.}.
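The selection-and-update step can be sketched as follows (all names here are assumptions for illustration, not the report's actual API):

```python
def select_new_coverage(population, fitness, branches_of, covered):
    # Walk individuals by increasing fitness; keep an individual only if it
    # reaches at least one branch not yet in `covered`. This is the only
    # place where the covered-branch set is updated.
    selected = []
    for individual in sorted(population, key=fitness):
        new_branches = branches_of(individual) - covered
        if new_branches:
            selected.append(individual)
            covered |= new_branches
    return selected
```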
\subsection*{Section 4: Statistical comparison of test generators}

This section reports and comments on the results of the experimental procedure.
To compare the performance of the fuzzer and the genetic algorithm, the mutation testing tool \textit{mut.py} has
been used to measure how robust the generated test suites are. Both implementations have been executed 10 times
using different RNG seeds each time, and a statistical comparison of the resulting mutation score distributions has
been performed to determine whether one generation method is statistically more performant than the other.
\paragraph{For each benchmark program P:}
\begin{itemize}
    \item Repeat the following experiment N times (e.g., with N = 10):
    \begin{itemize}
        \item Generate search-based test cases for P using the GA generator
        \item Measure the mutation score for P
        \item Generate random test cases for P using the Fuzzer
        \item Measure the mutation score for P
    \end{itemize}
    \item Visualize the N mutation score values of Fuzzer and GA using boxplots
    \item Report the average mutation score of Fuzzer and GA
    \item Compute the effect size using Cohen's $d$ effect size measure
    \item Compare the N mutation score values of Fuzzer vs GA using the Wilcoxon statistical test
\end{itemize}
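The procedure above can be sketched as a driver loop (the helpers \texttt{generate\_ga}, \texttt{generate\_fuzzer} and \texttt{mutation\_score} are assumed names, not the project's actual API):

```python
import random

def run_experiment(program, generate_ga, generate_fuzzer, mutation_score, n=10):
    # Repeat the experiment n times, seeding the RNG differently on each run,
    # and collect one mutation score per generator per run.
    ga_scores, fuzzer_scores = [], []
    for seed in range(n):
        random.seed(seed)
        ga_scores.append(mutation_score(program, generate_ga(program)))
        fuzzer_scores.append(mutation_score(program, generate_fuzzer(program)))
    return ga_scores, fuzzer_scores
```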
\begin{figure}[t]
    \begin{center}
        \includegraphics[width=\linewidth]{../out/mutation_scores}
        \caption{Distributions of \textit{mut.py} mutation scores over the generated benchmark test
            suites}\label{fig:mutation-scores}
    \end{center}
\end{figure}
\begin{figure}[t]
    \begin{center}
        \includegraphics[width=\linewidth]{../out/mutation_scores_mean}
        \caption{Average \textit{mut.py} mutation score over the generated benchmark test
            suites}\label{fig:mutation-scores-mean}
    \end{center}
\end{figure}
Figure~\ref{fig:mutation-scores} shows a boxplot of the mutation score distributions for each file in the benchmark
suite, while figure~\ref{fig:mutation-scores-mean} shows the mean mutation scores.

\begin{table}[t]
    \centering
    \begin{tabular}{lrrp{3.5cm}r}
        \toprule
        means, the Wilcoxon paired test p-value and the Cohen's $d$ effect size for each file in the
        benchmark.}\label{tab:stats}
\end{table}
To perform a statistical comparison, the Wilcoxon paired test with a p-value threshold of 0.05 has been used to
check whether there is a statistically significant difference between the distributions. Moreover, Cohen's $d$
effect size has been used to measure the magnitude of the difference. Results of the statistical analysis are
shown in table~\ref{tab:stats}.
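Cohen's $d$ with a pooled standard deviation can be computed as in the sketch below (a minimal pure-Python helper; in practice the Wilcoxon p-value itself would come from a library routine such as \texttt{scipy.stats.wilcoxon}):

```python
import statistics

def cohens_d(a, b):
    # Effect size: difference of means divided by the pooled standard
    # deviation (sample variances, ddof=1).
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5
```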
Only 3 benchmark files out of 10 show a statistically different performance between the two scripts, namely
\textit{check\_armstrong}, \textit{rabin\_karp} and \textit{anagram\_check}. On the first two, the genetic algorithm
performs significantly better than the fuzzer, while on the last file the opposite holds.
\end{document}