
%!TEX TS-program = pdflatexmk
\documentclass{scrartcl}

\usepackage{algorithm}
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{booktabs}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{microtype}
\usepackage{rotating}
\usepackage{graphicx}
\usepackage{paralist}
\usepackage{tabularx}
\usepackage{multicol}
\usepackage{multirow}
\usepackage{pbox}
\usepackage{enumitem}
\usepackage{colortbl}
\usepackage{pifont}
\usepackage{xspace}
\usepackage{url}
\usepackage{tikz}
\usepackage{fontawesome}
\usepackage{lscape}
\usepackage{listings}
\usepackage{color}
\usepackage{anyfontsize}
\usepackage{comment}
\usepackage{soul}
\usepackage{multibib}
\usepackage{float}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{hyperref}
\usepackage[margin=2.5cm]{geometry}

\title{Knowledge Search \& Extraction \\ Project 02: Python Test Generator}
\author{Claudio Maggioni}
\date{}

\begin{document}

    \maketitle

    \subsection*{Section 1: Instrumentation}

    The script \textit{instrument.py} in the main directory of the project instruments the Python files in the
    benchmark suite by replacing each condition node with a call to \texttt{evaluate\_condition}. The call preserves
    program behaviour and, as a side effect, computes and stores the condition distance for each traversed branch.
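
    The replacement step described above can be illustrated with a minimal sketch based on the standard
    \texttt{ast} module. The class name \texttt{ConditionInstrumenter}, the helper \texttt{instrument} and the
    argument order passed to \texttt{evaluate\_condition} are assumptions made for illustration and may differ from
    the actual \textit{instrument.py}; chained comparisons are skipped here for brevity.

\begin{lstlisting}[language=Python, basicstyle=\small\ttfamily]
import ast


class ConditionInstrumenter(ast.NodeTransformer):
    """Hypothetical transformer: wraps comparisons in evaluate_condition."""

    def __init__(self):
        self.count = 0  # number of instrumented comparisons

    def visit_Assert(self, node):
        return node  # do not descend: comparisons in assert stay untouched

    def visit_Return(self, node):
        return node  # likewise for return statements

    def visit_Compare(self, node):
        self.generic_visit(node)
        if len(node.ops) != 1:
            return node  # chained comparisons are skipped in this sketch
        self.count += 1
        # Assumed signature: evaluate_condition(id, operator, lhs, rhs)
        call = ast.Call(
            func=ast.Name(id="evaluate_condition", ctx=ast.Load()),
            args=[
                ast.Constant(self.count),
                ast.Constant(type(node.ops[0]).__name__),
                node.left,
                node.comparators[0],
            ],
            keywords=[],
        )
        return ast.copy_location(call, node)


def instrument(source: str) -> tuple[str, int]:
    """Return the instrumented source and the number of rewritten nodes."""
    tree = ast.parse(source)
    transformer = ConditionInstrumenter()
    tree = ast.fix_missing_locations(transformer.visit(tree))
    return ast.unparse(tree), transformer.count
\end{lstlisting}

    In this sketch the counter doubles as the identifier of each condition, so the number of comparison nodes is
    available as soon as the tree has been visited.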

    Table~\ref{tab:count1} summarizes the number of Python files, function definition (\textit{FunctionDef}) nodes,
    and comparison nodes (\textit{Compare} nodes not in an \texttt{assert} or \texttt{return} statement) found by the
    instrumentation script.

    \begin{table}[H]
        \centering
        \begin{tabular}{lr}
            \toprule
            \textbf{Type}    & \textbf{Number} \\
            \midrule
            Python Files     & 10              \\
            Function Nodes   & 12              \\
            Comparison Nodes & 44              \\
            \bottomrule
        \end{tabular}
        \caption{Count of files and nodes found.}
        \label{tab:count1}
    \end{table}

    \subsection*{Section 2: Fuzzer test generator}

    The script \textit{fuzzer.py} loads the instrumented benchmark suite and generates tests at random to maximize branch
    coverage.

    The implementation submitted with this report slightly extends the required specification: it supports an
    arbitrary number of function parameters, each of which must be type-hinted as either \texttt{str} or
    \texttt{int}. The fuzzing process generates a pool of 1000 test case inputs according to the function signature,
    using randomly generated integers $\in [-1000, 1000]$ and randomly generated strings of length $\in [0, 10]$
    made of ASCII characters with code $\in [32, 127]$. Note that test cases generated in the pool may not satisfy the
    preconditions (i.e.\ the \texttt{assert} statements on the inputs) for the given function.
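
    A minimal sketch of the pool generation described above follows. The function names
    \texttt{random\_argument} and \texttt{generate\_pool}, as well as the \texttt{seed} parameter, are hypothetical
    and only serve the illustration; the value ranges are the ones stated in the text.

\begin{lstlisting}[language=Python, basicstyle=\small\ttfamily]
import inspect
import random


def random_argument(arg_type: type, rng: random.Random):
    """Draw one random value for a parameter hinted as int or str."""
    if arg_type is int:
        return rng.randint(-1000, 1000)
    if arg_type is str:
        length = rng.randint(0, 10)
        return "".join(chr(rng.randint(32, 127)) for _ in range(length))
    raise TypeError(f"unsupported parameter type: {arg_type!r}")


def generate_pool(func, size: int = 1000, seed=None) -> list[tuple]:
    """Build `size` argument tuples matching the signature of `func`.

    Assumes annotations are real types, i.e. the module does not use
    postponed (string) annotations.
    """
    rng = random.Random(seed)
    params = inspect.signature(func).parameters.values()
    hints = [p.annotation for p in params]
    return [tuple(random_argument(t, rng) for t in hints)
            for _ in range(size)]
\end{lstlisting}

    Precondition checks are deliberately absent here, matching the remark that pool entries may violate the
    \texttt{assert} statements of the function under test.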

    250 test cases are then extracted from the pool. For each extracted test case, one of the following three actions
    is chosen with equal probability ($p=1/3$):

    \begin{itemize}
        \item The extracted test case may be kept as is;
        \item The extracted test case may be randomly mutated using the \textit{mutate} function. An argument will be
        chosen at random, and if of type \texttt{str} a random position in the string will be replaced with a
        random character. If the argument is of type \texttt{int}, a random value $\in [-10, 10]$ will be added to
        the argument. If the resulting test case is not present in the pool, it will be added to the pool;
        \item The extracted test case may be randomly combined with another randomly extracted test using the
        \textit{crossover} function. The function will choose at random an argument, and if of type \texttt{int} it will
        swap the values assigned to the two tests. If the argument is of type \texttt{str}, the strings from the two test
        cases will be split in two substrings at random and they will be joined by combining the ``head'' substring from
        one test case with the ``tail'' substring from the other. If the two resulting test cases are new, they will be
        added to the pool.
    \end{itemize}

    If the resulting test case (or test cases) satisfy the function precondition, and if their execution covers branches
    that have not been covered by other test cases, they will be added to the test suite. The resulting test suite is
    then saved as a \textit{unittest} file, comprising one test class per function present in the benchmark test file.
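
    The \textit{mutate} and \textit{crossover} operators described in the list above can be sketched as follows.
    This is an illustration under assumptions, not the submitted code: test cases are modelled as tuples of
    arguments, the explicit \texttt{rng} parameter is hypothetical, empty strings are left unchanged by the
    mutation, and the two strings are cut at independently drawn positions.

\begin{lstlisting}[language=Python, basicstyle=\small\ttfamily]
import random


def mutate(test: tuple, rng: random.Random) -> tuple:
    """Mutate one randomly chosen argument of a test case."""
    args = list(test)
    i = rng.randrange(len(args))
    if isinstance(args[i], int):
        args[i] += rng.randint(-10, 10)
    elif args[i]:  # non-empty string: replace one character
        pos = rng.randrange(len(args[i]))
        new_char = chr(rng.randint(32, 127))
        args[i] = args[i][:pos] + new_char + args[i][pos + 1:]
    return tuple(args)


def crossover(t1: tuple, t2: tuple,
              rng: random.Random) -> tuple[tuple, tuple]:
    """Combine two test cases on one randomly chosen argument."""
    a, b = list(t1), list(t2)
    i = rng.randrange(len(a))
    if isinstance(a[i], int):
        a[i], b[i] = b[i], a[i]  # integers are simply swapped
    else:
        # Assumption: each string gets its own random cut point.
        cut_a = rng.randint(0, len(a[i]))
        cut_b = rng.randint(0, len(b[i]))
        head_a, tail_a = a[i][:cut_a], a[i][cut_a:]
        head_b, tail_b = b[i][:cut_b], b[i][cut_b:]
        a[i], b[i] = head_a + tail_b, head_b + tail_a
    return tuple(a), tuple(b)
\end{lstlisting}

    Both functions return new tuples instead of modifying their inputs, which keeps the pool entries they were
    derived from intact.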

    \subsection*{Section 3: Genetic Algorithm test generator}

    The script \textit{genetic.py} loads the instrumented benchmark suite and generates tests using a genetic algorithm
    to maximize branch coverage and minimize distance to condition boundary values.

    The genetic algorithm is implemented via the library \textit{deap} using the \textit{eaSimple} procedure.
    The algorithm is initialized with 200 individuals extracted from a pool generated as described in the previous
    section. The algorithm runs for 20 generations, and it implements the \textit{mate} and \textit{mutate} operators
    using the \textit{crossover} and \textit{mutate} functions respectively, as described in the previous section.

    The fitness function, which the genetic algorithm minimizes, returns $\infty$ if the test case does not satisfy
    the function precondition, and $1000000$ if the test case does not cover any new branches. Otherwise it returns
    the sum of the normalized distances (each distance $x$ normalized as $1 / (x + 1)$) for the branches that are not
    yet covered by other test cases, plus a penalty of $2$ for every branch that is already covered.
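
    The following sketch shows one possible reading of this fitness function. The names \texttt{fitness},
    \texttt{distances}, \texttt{covered} and \texttt{precondition\_ok} are hypothetical, and two points are
    assumptions: \texttt{distances} maps each branch traversed by the test case to its recorded distance, and the
    penalty is applied once per traversed branch that is already covered.

\begin{lstlisting}[language=Python, basicstyle=\small\ttfamily]
import math

NO_NEW_BRANCH = 1_000_000  # value returned when nothing new is covered
COVERED_PENALTY = 2        # added per already covered branch


def fitness(distances: dict, covered: set,
            precondition_ok: bool) -> float:
    """Fitness to be minimized, following the description in the text."""
    if not precondition_ok:
        return math.inf
    new = {b: d for b, d in distances.items() if b not in covered}
    if not new:
        return NO_NEW_BRANCH
    score = sum(1 / (d + 1) for d in new.values())
    score += COVERED_PENALTY * (len(distances) - len(new))
    return score
\end{lstlisting}

    For example, a test case that traverses one already covered branch and one new branch at distance $3$ would
    score $1/(3+1) + 2 = 2.25$ under these assumptions.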

    The genetic algorithm is run 10 times. At the end of each execution the best individuals (sorted by increasing
    fitness) are selected if they cover at least one branch that has not yet been covered. This is the only point in
    the procedure where the set of covered branches is updated\footnote{This differs from the reference implementation
    of \texttt{sb\_cgi\_decode.py}, which performs the update directly in the fitness function.}.
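
    The selection step performed after each run can be sketched as below. The function \texttt{update\_suite} and
    the representation of each individual as a \texttt{(fitness, test, branches)} triple are assumptions made for
    illustration only.

\begin{lstlisting}[language=Python, basicstyle=\small\ttfamily]
def update_suite(ranked, covered: set, suite: list) -> None:
    """Add individuals that cover new branches, best fitness first.

    `ranked` is an iterable of (fitness, test, branches) triples
    (hypothetical layout); `covered` and `suite` are updated in place.
    """
    for _, test, branches in sorted(ranked, key=lambda r: r[0]):
        new = set(branches) - covered
        if new:
            suite.append(test)
            covered |= new  # the only place where coverage is updated
\end{lstlisting}

    Keeping the coverage update out of the fitness function means that, within a single run, every individual is
    scored against the same set of covered branches.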


    \subsection*{Section 4: Statistical comparison of test generators}

    Report and comment on the results of the experimental procedure:

    \paragraph{For each benchmark program P:}
    \begin{itemize}
        \item Repeat the following experiment N times (e.g., with N = 10):
        \begin{itemize}
            \item Generate search-based test cases for P using the GA generator
            \item Measure the mutation score for P
            \item Generate random test cases for P using the Fuzzer
            \item Measure the mutation score for P
        \end{itemize}
        \item Visualize the N mutation score values of Fuzzer and GA using boxplots
        \item Report the average mutation score of Fuzzer and GA
        \item Compute the effect size using Cohen's $d$
        \item Compare the N mutation score values of Fuzzer vs GA using the Wilcoxon statistical test
    \end{itemize}
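
    As an illustration of the effect size step, the sketch below computes Cohen's $d$ using the pooled standard
    deviation. The report does not state which variant was used, so the pooled formulation, the function name
    \texttt{cohen\_d} and the handling of zero variance are assumptions.

\begin{lstlisting}[language=Python, basicstyle=\small\ttfamily]
import math
import statistics


def cohen_d(xs, ys) -> float:
    """Cohen's d with pooled standard deviation (assumed variant)."""
    nx, ny = len(xs), len(ys)
    pooled_var = (
        (nx - 1) * statistics.variance(xs)
        + (ny - 1) * statistics.variance(ys)
    ) / (nx + ny - 2)
    diff = statistics.mean(xs) - statistics.mean(ys)
    if pooled_var == 0:
        # Identical values within each sample: d is unbounded
        # unless the two means coincide.
        return math.copysign(math.inf, diff) if diff else 0.0
    return diff / math.sqrt(pooled_var)
\end{lstlisting}

    Under this formulation, two samples that each show no variation across runs but have different means yield an
    infinite effect size, which is one way an $\infty$ entry can arise in a results table.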

    \begin{figure}[H]
        \begin{center}
            \includegraphics[width=\linewidth]{../out/mutation_scores}
            \caption{Distributions of \textit{mut.py} mutation scores over the generated benchmark test suites
            using the fuzzer and the genetic algorithm.}\label{fig:mutation-scores}
        \end{center}
    \end{figure}

    \begin{figure}[H]
        \begin{center}
            \includegraphics[width=\linewidth]{../out/mutation_scores_mean}
            \caption{Average \textit{mut.py} mutation score over the generated benchmark test suites
            using the fuzzer and the genetic algorithm.}\label{fig:mutation-scores-mean}
        \end{center}
    \end{figure}

    \begin{table}[H]
        \centering
        \begin{tabular}{lrrp{3.5cm}r}
            \toprule
            \textbf{File}          & \textbf{$E(\text{Fuzzer})$} & \textbf{$E(\text{Genetic})$} & \textbf{Cohen's $|d|$} & \textbf{Wilcoxon $p$} \\
            \midrule
            check\_armstrong       & 58.07                       & 93.50                        & 2.0757  \hfill Huge        & 0.0020                \\
            railfence\_cipher      & 88.41                       & 87.44                        & 0.8844 \hfill Very large & 0.1011                \\
            longest\_substring     & 77.41                       & 76.98                        & 0.0771 \hfill Small      & 0.7589                \\
            common\_divisor\_count & 76.17                       & 72.76                        & 0.7471 \hfill Large      & 0.1258                \\
            zellers\_birthday      & 68.09                       & 71.75                        & 1.4701  \hfill Huge        & 0.0039                \\
            exponentiation         & 69.44                       & 67.14                        & 0.3342 \hfill Medium     & 0.7108                \\
            caesar\_cipher         & 60.59                       & 61.20                        & 0.3549  \hfill Medium      & 0.2955                \\
            gcd                    & 59.15                       & 55.66                        & 0.5016 \hfill Large      & 0.1627                \\
            rabin\_karp            & 27.90                       & 47.55                        & 2.3688  \hfill Huge        & 0.0078                \\
            anagram\_check         & 23.10                       & 7.70                         & $\infty$  \hfill Huge      & 0.0020                \\
            \bottomrule
        \end{tabular}
        \caption{Statistical comparison between fuzzer and genetic algorithm test case generation in terms of mutation
        score as reported by \textit{mut.py} over 10 runs, sorted by decreasing genetic mutation score. For each file
        in the benchmark, the table reports the mean over the runs, the absolute value of the Cohen's $d$ effect size,
        and the $p$-value of the paired Wilcoxon test.}\label{tab:stats}
    \end{table}
\end{document}