%!TEX TS-program = pdflatexmk
\documentclass{scrartcl}
\usepackage{algorithm}
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{booktabs}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{microtype}
\usepackage{rotating}
\usepackage{graphicx}
\usepackage{paralist}
\usepackage{tabularx}
\usepackage{multicol}
\usepackage{multirow}
\usepackage{pbox}
\usepackage{enumitem}
\usepackage{colortbl}
\usepackage{pifont}
\usepackage{xspace}
\usepackage{url}
\usepackage{tikz}
\usepackage{fontawesome}
\usepackage{lscape}
\usepackage{listings}
\usepackage{color}
\usepackage{anyfontsize}
\usepackage{comment}
\usepackage{soul}
\usepackage{multibib}
\usepackage{float}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{hyperref}
\usepackage[margin=2.5cm]{geometry}
\usepackage{changepage}
\title{Knowledge Search \& Extraction \\ Project 02: Python Test Generator}
\author{Claudio Maggioni}
\date{}
\begin{document}
\maketitle
\begin{adjustwidth}{-4cm}{-4cm}
\centering
\begin{tabular}{cc}
\toprule
Repository URL & \url{https://github.com/kamclassroom2022/project-02-python-test-generator-maggicl} \\
Commit ID & \texttt{549b8859ffaad5391da362ed9bbe9db82b2c5c0b} \\ \bottomrule
\end{tabular}
\end{adjustwidth}
\vspace{1cm}
\subsection*{Section 1: Instrumentation}
The script \textit{instrument.py} in the main directory of the project performs instrumentation: it replaces each
condition node in the Python files of the benchmark suite with a call to \texttt{evaluate\_condition}, which
preserves program behaviour but, as a side effect, computes and stores the condition distance for each traversed branch.
Table~\ref{tab:count1} summarizes the number of Python files, function definition (\textit{FunctionDef}) nodes,
and comparison nodes (\textit{Compare} nodes not in an \texttt{assert} or \texttt{return} statement) found by the
instrumentation script.
\begin{table} [H]
\centering
\begin{tabular}{lr}
\toprule
\textbf{Type} & \textbf{Number} \\
\midrule
Python Files & 10 \\
Function Nodes & 12 \\
Comparison Nodes & 44 \\
\bottomrule
\end{tabular}
\caption{Count of files and nodes found.}
\label{tab:count1}
\end{table}
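As an illustration, the core of such an instrumentation pass can be written with Python's \texttt{ast} module. The
following is a minimal sketch rather than the exact code of \textit{instrument.py}: the name
\texttt{evaluate\_condition} comes from the description above, while the condition-ID scheme and the restriction to
single-operator comparisons are assumptions (the real script additionally skips \textit{Compare} nodes inside
\texttt{assert} and \texttt{return} statements, which this sketch does not).
\begin{lstlisting}[language=Python]
import ast

class ConditionTransformer(ast.NodeTransformer):
    """Replace Compare nodes with calls to evaluate_condition."""

    def __init__(self):
        self.next_id = 0  # assumed: sequential condition IDs

    def visit_Compare(self, node):
        self.generic_visit(node)
        if len(node.ops) != 1:  # only single comparisons, e.g. `a < b`
            return node
        self.next_id += 1
        # evaluate_condition(cond_id, op_name, lhs, rhs) is expected to
        # return the original truth value and record the branch distance.
        return ast.Call(
            func=ast.Name(id="evaluate_condition", ctx=ast.Load()),
            args=[ast.Constant(self.next_id),
                  ast.Constant(type(node.ops[0]).__name__),
                  node.left, node.comparators[0]],
            keywords=[])

tree = ast.parse("def f(x: int):\n    if x < 10:\n        return x")
tree = ast.fix_missing_locations(ConditionTransformer().visit(tree))
print(ast.unparse(tree))  # shows the instrumented source
\end{lstlisting}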
\subsection*{Section 2: Fuzzer test generator}
The script \textit{fuzzer.py} loads the instrumented benchmark suite and generates tests at random to maximize branch
coverage.
The implementation submitted with this report slightly improves on the required specification, as it is
able to deal with an arbitrary number of function parameters, each of which must be type-hinted as either
\texttt{str} or \texttt{int}. The fuzzing process generates a pool of 1000 test case inputs according to the function
signature, using randomly generated integers $\in [-1000, 1000]$ and randomly generated strings of length
$\in [0, 10]$ made of ASCII characters with codes $\in [32, 127]$. Note that test cases generated in the pool may not
satisfy the preconditions (i.e.\ the \texttt{assert} statements on the inputs) of the given function.
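A pool generator consistent with this description could look like the following sketch, where the helper names are
hypothetical and only the value ranges are taken from the implementation:
\begin{lstlisting}[language=Python]
import random

def random_int() -> int:
    return random.randint(-1000, 1000)

def random_str() -> str:
    length = random.randint(0, 10)
    return "".join(chr(random.randint(32, 127)) for _ in range(length))

def random_test_case(arg_types):
    """arg_types: parameter types from the signature, e.g. [int, str]."""
    return tuple(random_int() if t is int else random_str()
                 for t in arg_types)

pool = [random_test_case([int, str]) for _ in range(1000)]
\end{lstlisting}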
A total of 250 test cases are then extracted from the pool following this procedure, where each of the three options below is chosen with equal probability ($p=1/3$):
\begin{itemize}
\item The extracted test case may be kept as-is;
\item The extracted test case may be randomly mutated using the \textit{mutate} function. An argument will be
chosen at random: if it is of type \texttt{str}, a random position in the string will be replaced with a
random character, while if it is of type \texttt{int}, a random value $\in [-10, 10]$ will be added to
it. If the resulting test case is not already present in the pool, it will be added to the pool;
\item The extracted test case may be randomly combined with another randomly extracted test case using the
\textit{crossover} function. The function will choose an argument at random: if it is of type \texttt{int}, the
values assigned to it in the two test cases will be swapped. If the argument is of type \texttt{str}, the two
strings will each be split into two substrings at a random position, and the ``head'' substring from one test
case will be joined with the ``tail'' substring from the other. If the two resulting test cases are new, they
will be added to the pool.
\end{itemize}
If the resulting test case (or test cases) satisfies the function precondition, and if its execution covers branches
that have not yet been covered by other test cases, it will be added to the test suite. The resulting test suite is
then saved as a \textit{unittest} file, comprising one test class per function present in the benchmark file.
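A possible implementation of the two operators is sketched below. The names \textit{mutate} and \textit{crossover}
come from the report; the internal details are a plausible reading of the description above rather than the
submitted code.
\begin{lstlisting}[language=Python]
import random

def mutate(args, types):
    """Randomly perturb one argument of a test case (a tuple of values)."""
    args = list(args)
    i = random.randrange(len(args))
    if types[i] is int:
        args[i] += random.randint(-10, 10)
    else:  # str: replace a random position with a random character
        s = args[i]
        if s:
            p = random.randrange(len(s))
            s = s[:p] + chr(random.randint(32, 127)) + s[p + 1:]
        args[i] = s
    return tuple(args)

def crossover(a, b, types):
    """Recombine one randomly chosen argument of two test cases."""
    a, b = list(a), list(b)
    i = random.randrange(len(a))
    if types[i] is int:
        a[i], b[i] = b[i], a[i]  # swap the integer values
    else:  # str: join the head of one string with the tail of the other
        pa = random.randint(0, len(a[i]))
        pb = random.randint(0, len(b[i]))
        a[i], b[i] = a[i][:pa] + b[i][pb:], b[i][:pb] + a[i][pa:]
    return tuple(a), tuple(b)
\end{lstlisting}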
\subsection*{Section 3: Genetic Algorithm test generator}
The script \textit{genetic.py} loads the instrumented benchmark suite and generates tests using a genetic algorithm
to maximize branch coverage and minimize distance to condition boundary values.
The genetic algorithm is implemented via the library \textit{deap} using the \textit{eaSimple} procedure.
The algorithm is initialized with 200 individuals extracted from a pool generated in the same way as in the previous
section. The algorithm runs for 20 generations, and it implements the \textit{mate} and \textit{mutate} operators
using, respectively, the \textit{crossover} and \textit{mutate} functions described in the previous section.
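A DEAP setup matching this description might look as follows. The population size, the number of generations, and
the use of \texttt{eaSimple} come from the report; the selection operator and the crossover and mutation
probabilities are assumptions, as they are not specified above.
\begin{lstlisting}[language=Python]
from deap import algorithms, base, creator, tools

creator.create("FitnessMin", base.Fitness, weights=(-1.0,))  # minimized
creator.create("Individual", list, fitness=creator.FitnessMin)

toolbox = base.Toolbox()
# random_test_case as in the Section 2 sketch; arg_types from the signature.
toolbox.register("individual", tools.initIterate, creator.Individual,
                 lambda: list(random_test_case(arg_types)))
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", fitness)  # see the sketch below
# mate/mutate wrap the Section 2 operators; DEAP expects them to return
# a 2-tuple and a 1-tuple of individuals, respectively.
toolbox.register("mate", crossover_individuals)   # hypothetical wrapper
toolbox.register("mutate", mutate_individual)     # hypothetical wrapper
toolbox.register("select", tools.selTournament, tournsize=3)  # assumed

population = toolbox.population(n=200)
algorithms.eaSimple(population, toolbox, cxpb=0.5, mutpb=0.2,  # assumed
                    ngen=20, verbose=False)
\end{lstlisting}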
The fitness function used returns a value of $\infty$ if the test case does not satisfy the function precondition,
a value of $1000000$ if the test case does not cover any new branch,
or otherwise the sum of normalized distances, where a raw distance $x$ is normalized as $1 / (x + 1)$, over the
branches that are not yet covered by other test cases. A penalty of $2$ is added to the fitness value for every
branch that is already covered. The fitness function is minimized by the genetic algorithm.
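In code, the fitness just described could be expressed as follows; the two helper functions are hypothetical, and
DEAP expects fitness values as one-element tuples:
\begin{lstlisting}[language=Python]
INF = float("inf")

def fitness(individual):
    """Fitness as described above; lower is better."""
    if not satisfies_precondition(individual):     # assumed helper
        return (INF,)
    distances = run_and_get_distances(individual)  # assumed: {branch: dist}
    new_branches = set(distances) - covered_branches
    if not new_branches:
        return (1000000,)
    value = sum(1.0 / (distances[b] + 1.0) for b in new_branches)
    value += 2 * len(set(distances) & covered_branches)  # penalty
    return (value,)
\end{lstlisting}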
The genetic algorithm is run 10 times. At the end of each execution, the best individuals (sorted by increasing
fitness) are selected if they cover at least one branch that has not been covered yet. This is the only point in the
procedure where the set of covered branches is updated\footnote{This differs from the reference implementation of
\texttt{sb\_cgi\_decode.py}, which performs the update directly in the fitness function.}.
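Continuing the sketches above, this outer loop could be structured as follows, where \texttt{branches\_covered\_by}
is a hypothetical helper returning the set of branches a test case executes:
\begin{lstlisting}[language=Python]
covered_branches = set()
suite = []

for _ in range(10):  # 10 independent GA executions
    population = toolbox.population(n=200)
    algorithms.eaSimple(population, toolbox, cxpb=0.5, mutpb=0.2,
                        ngen=20, verbose=False)
    # Keep the best individuals that cover at least one new branch; this
    # is the only place where covered_branches is updated.
    for ind in sorted(population, key=lambda i: i.fitness.values[0]):
        new = branches_covered_by(ind) - covered_branches
        if new:
            covered_branches |= new
            suite.append(ind)
\end{lstlisting}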
\subsection*{Section 4: Statistical comparison of test generators}
To compare the performance of the fuzzer and the genetic algorithm, the mutation testing tool \textit{mut.py} has
been used to measure how robust the generated test suites are. Both implementations have been executed 10 times,
each time with a different RNG seed, and a statistical comparison of the resulting mutation score distributions has
been performed to determine whether one generation method statistically outperforms the other.
\begin{figure}[t]
\begin{center}
\includegraphics[width=\linewidth]{../out/mutation_scores}
\caption{Distributions of \textit{mut.py} mutation scores over the generated benchmark test suites
using the fuzzer and the genetic algorithm.}\label{fig:mutation-scores}
\end{center}
\end{figure}
\begin{figure}[t]
\begin{center}
\includegraphics[width=\linewidth]{../out/mutation_scores_mean}
\caption{Mean \textit{mut.py} mutation scores over the generated benchmark test suites
using the fuzzer and the genetic algorithm.}\label{fig:mutation-scores-mean}
\end{center}
\end{figure}
Figure~\ref{fig:mutation-scores} shows a boxplot of the mutation score distributions for each file in the benchmark
suite, while Figure~\ref{fig:mutation-scores-mean} shows the mean mutation scores.
\begin{table}[t]
\centering
\begin{tabular}{lrrp{3.5cm}r}
\toprule
\textbf{File} & \textbf{$E(\text{Fuzzer})$} & \textbf{$E(\text{Genetic})$} & \textbf{Cohen's $|d|$} & \textbf{Wilcoxon $p$} \\
\midrule
check\_armstrong & 58.07 & 93.50 & 2.0757 \hfill Huge & 0.0020 \\
railfence\_cipher & 88.41 & 87.44 & 0.8844 \hfill Very large & 0.1011 \\
longest\_substring & 77.41 & 76.98 & 0.0771 \hfill Small & 0.7589 \\
common\_divisor\_count & 76.17 & 72.76 & 0.7471 \hfill Large & 0.1258 \\
zellers\_birthday & 68.09 & 71.75 & 1.4701 \hfill Huge & 0.0039 \\
exponentiation & 69.44 & 67.14 & 0.3342 \hfill Medium & 0.7108 \\
caesar\_cipher & 60.59 & 61.20 & 0.3549 \hfill Medium & 0.2955 \\
gcd & 59.15 & 55.66 & 0.5016 \hfill Large & 0.1627 \\
rabin\_karp & 27.90 & 47.55 & 2.3688 \hfill Huge & 0.0078 \\
anagram\_check & 23.10 & 7.70 & $\infty$ \hfill Huge & 0.0020 \\
\bottomrule
\end{tabular}
\caption{Statistical comparison between fuzzer and genetic algorithm test case generation in terms of mutation
score as reported by \textit{mut.py} over 10 runs, sorted by decreasing genetic mutation score. The table
reports the run means, the Wilcoxon paired test p-value, and Cohen's $d$ effect size for each file in the
benchmark.}\label{tab:stats}
\end{table}
To perform the statistical comparison, the Wilcoxon paired test with a p-value threshold of 0.05 has been used to
check whether there is a statistically significant difference between the two distributions. Moreover, Cohen's $d$
effect size has been used to measure the magnitude of the difference. The results of the statistical analysis are
shown in Table~\ref{tab:stats}.
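This analysis can be reproduced for one benchmark file along the following lines. The scores below are placeholders
rather than the actual measurements; \texttt{scipy} provides the Wilcoxon test, while Cohen's $d$ is computed by
hand with the pooled standard deviation:
\begin{lstlisting}[language=Python]
import numpy as np
from scipy.stats import wilcoxon

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation of the two samples."""
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled

# Placeholders for illustration only: replace with the 10 per-run mut.py
# mutation scores of each generator for a given benchmark file.
rng = np.random.default_rng(0)
fuzzer = rng.uniform(50, 70, size=10)
genetic = rng.uniform(60, 80, size=10)

_, p = wilcoxon(fuzzer, genetic)      # paired, two-sided by default
d = abs(cohens_d(fuzzer, genetic))
print(f"p = {p:.4f}, |d| = {d:.4f}")  # significant if p < 0.05
\end{lstlisting}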
Only 4 benchmark files out of 10 show a statistically significant difference in performance between the two scripts,
namely \textit{check\_armstrong}, \textit{zellers\_birthday}, \textit{rabin\_karp} and \textit{anagram\_check}.
For the first three, the genetic algorithm performs significantly better than the fuzzer,
while for the last file the opposite holds.
\end{document}