%!TEX TS-program = pdflatexmk
\documentclass{scrartcl}

\usepackage{algorithm}
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{booktabs}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{microtype}
\usepackage{rotating}
\usepackage{graphicx}
\usepackage{paralist}
\usepackage{tabularx}
\usepackage{multicol}
\usepackage{multirow}
\usepackage{pbox}
\usepackage{enumitem}
\usepackage{colortbl}
\usepackage{pifont}
\usepackage{xspace}
\usepackage{url}
\usepackage{tikz}
\usepackage{fontawesome}
\usepackage{lscape}
\usepackage{listings}
\usepackage{color}
\usepackage{anyfontsize}
\usepackage{comment}
\usepackage{soul}
\usepackage{multibib}
\usepackage{float}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{hyperref}
\usepackage[margin=2.5cm]{geometry}

\title{Knowledge Search \& Extraction \\ Project 02: Python Test Generator}
\author{Claudio Maggioni}
\date{}

\begin{document}
\maketitle

\subsection*{Section 1: Instrumentation}

The script \textit{instrument.py} in the main directory of the project instruments the Python files of the benchmark suite by replacing each condition node with a call to \texttt{evaluate\_condition}, which preserves program behaviour but, as a side effect, computes and stores the condition distance for each traversed branch.

Table~\ref{tab:count1} summarizes the number of Python files, function definition (\textit{FunctionDef}) nodes, and comparison nodes (\textit{Compare} nodes not in an \texttt{assert} or \texttt{return} statement) found by the instrumentation script.

\begin{table}[H]
  \centering
  \begin{tabular}{lr}
    \toprule
    \textbf{Type} & \textbf{Number} \\
    \midrule
    Python Files & 10 \\
    Function Nodes & 12 \\
    Comparison Nodes & 44 \\
    \bottomrule
  \end{tabular}
  \caption{Count of files and nodes found.}
  \label{tab:count1}
\end{table}

\subsection*{Section 2: Fuzzer test generator}

The script \textit{fuzzer.py} loads the instrumented benchmark suite and generates tests at random to maximize branch coverage. The implementation submitted with this report slightly improves on the required specification, as it is able to deal with an arbitrary number of function parameters, which must be type-hinted as either \texttt{str} or \texttt{int}.

The fuzzing process generates a pool of 1000 test case inputs according to the function signature, using randomly generated integers $\in [-1000, 1000]$ and randomly generated strings of length $\in [0, 10]$ made of ASCII characters with code $\in [32, 127]$. Note that test cases generated in the pool may not satisfy the preconditions (i.e.\ the \texttt{assert} statements on the inputs) of the given function.

From this pool, 250 test cases are then extracted. With equal probability (each option with $p = 1/3$):

\begin{itemize}
  \item the extracted test case may be kept as-is;
  \item the extracted test case may be randomly mutated using the \textit{mutate} function (see the sketch after this list). An argument is chosen at random: if it is of type \texttt{str}, a random position in the string is replaced with a random character; if it is of type \texttt{int}, a random value $\in [-10, 10]$ is added to it. If the resulting test case is not present in the pool, it is added to the pool;
  \item the extracted test case may be randomly combined with another randomly extracted test case using the \textit{crossover} function. The function chooses an argument at random: if it is of type \texttt{int}, the values assigned in the two test cases are swapped; if it is of type \texttt{str}, the two strings are split at random positions and rejoined by combining the ``head'' substring from one test case with the ``tail'' substring from the other. If the two resulting test cases are new, they are added to the pool.
\end{itemize}

If the resulting test case (or test cases) satisfies the function precondition and its execution covers branches that have not been covered by other test cases, it is added to the test suite. The resulting test suite is then saved as a \textit{unittest} file, comprising one test class per function present in the benchmark file.
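The following is a minimal sketch of how the \textit{mutate} and \textit{crossover} operators described above could be implemented. It is meant as an illustration of the procedure rather than a verbatim copy of \textit{fuzzer.py}: the function signatures and helper names are assumptions.

\begin{lstlisting}[language=Python]
import random

def mutate(args: tuple) -> tuple:
    """Illustrative sketch of the mutate operator described above."""
    args = list(args)
    i = random.randrange(len(args))
    if isinstance(args[i], str):
        if args[i]:  # replace a random position with a random ASCII character
            pos = random.randrange(len(args[i]))
            char = chr(random.randint(32, 127))
            args[i] = args[i][:pos] + char + args[i][pos + 1:]
    else:  # int argument: add a random offset in [-10, 10]
        args[i] += random.randint(-10, 10)
    return tuple(args)

def crossover(a: tuple, b: tuple) -> tuple:
    """Illustrative sketch of the crossover operator described above."""
    a, b = list(a), list(b)
    i = random.randrange(len(a))
    if isinstance(a[i], int):
        # swap the integer argument between the two test cases
        a[i], b[i] = b[i], a[i]
    else:
        # join the "head" of one string with the "tail" of the other
        cut_a = random.randint(0, len(a[i]))
        cut_b = random.randint(0, len(b[i]))
        a[i], b[i] = a[i][:cut_a] + b[i][cut_b:], b[i][:cut_b] + a[i][cut_a:]
    return tuple(a), tuple(b)
\end{lstlisting}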
\subsection*{Section 3: Genetic Algorithm test generator}

The script \textit{genetic.py} loads the instrumented benchmark suite and generates tests using a genetic algorithm to maximize branch coverage and minimize the distance to condition boundary values. The genetic algorithm is implemented with the \textit{deap} library using the \textit{eaSimple} procedure.

The algorithm is initialized with 200 individuals extracted from a pool generated in the same way as in the previous section. The algorithm runs for 20 generations and implements the \textit{mate} and \textit{mutate} operators using, respectively, the \textit{crossover} and \textit{mutate} functions described in the previous section.

The fitness function, which the genetic algorithm minimizes, returns a value of $\infty$ if the test case does not satisfy the function precondition, a value of $1000000$ if the test case does not cover any new branches, and otherwise the sum of the normalized ($1 / (x + 1)$) distances of the branches that are not yet covered by other test cases. A penalty of $2$ is added to the fitness value for every branch that is already covered. A sketch of this fitness function is given below.

The genetic algorithm is run 10 times. At the end of each execution, the best individuals (sorted by increasing fitness) are selected if they cover at least one branch that has not been covered yet. This is the only point in the procedure where the set of covered branches is updated\footnote{This differs from the reference implementation of \texttt{sb\_cgi\_decode.py}, which performs the update directly in the fitness function.}.
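The following is a minimal sketch of the fitness computation described above. It assumes that the instrumented function has already been executed on the test case and that the recorded branch distances are passed in as a dictionary; the function signature and parameter names are assumptions and may differ from the submitted \textit{genetic.py}.

\begin{lstlisting}[language=Python]
import math

def fitness(distances, satisfies_precondition, covered_branches):
    """Illustrative sketch of the fitness value (minimized by the GA).

    distances: branch id -> distance recorded while running the test case
    satisfies_precondition: True if the test case satisfies the preconditions
    covered_branches: set of branch ids already covered by the test suite
    """
    if not satisfies_precondition:
        return math.inf  # precondition violated
    new_branches = set(distances) - covered_branches
    if not new_branches:
        return 1000000  # no new branch is covered
    # sum of normalized (1 / (x + 1)) distances over the new branches ...
    value = sum(1 / (distances[b] + 1) for b in new_branches)
    # ... plus a penalty of 2 for every branch that is already covered
    value += 2 * len(set(distances) & covered_branches)
    return value
\end{lstlisting}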
\subsection*{Section 4: Statistical comparison of test generators}

To compare the performance of the fuzzer and the genetic algorithm, the mutation testing tool \textit{mut.py} has been used to measure how robust the generated test suites are. Both implementations have been executed 10 times, each time with a different RNG seed, and a statistical comparison of the resulting mutation score distributions has been performed to determine whether one generation method performs statistically better than the other.

\begin{figure}[t]
  \begin{center}
    \includegraphics[width=\linewidth]{../out/mutation_scores}
    \caption{Distributions of \textit{mut.py} mutation scores over the generated benchmark test suites using the fuzzer and the genetic algorithm.}\label{fig:mutation-scores}
  \end{center}
\end{figure}

\begin{figure}[t]
  \begin{center}
    \includegraphics[width=\linewidth]{../out/mutation_scores_mean}
    \caption{Mean \textit{mut.py} mutation score over the generated benchmark test suites using the fuzzer and the genetic algorithm.}\label{fig:mutation-scores-mean}
  \end{center}
\end{figure}

Figure~\ref{fig:mutation-scores} shows a boxplot of the mutation score distributions for each file in the benchmark suite, while Figure~\ref{fig:mutation-scores-mean} shows the mean mutation scores.

\begin{table}[t]
  \centering
  \begin{tabular}{lrrp{3.5cm}r}
    \toprule
    \textbf{File} & \textbf{$E(\text{Fuzzer})$} & \textbf{$E(\text{Genetic})$} & \textbf{Cohen's $|d|$} & \textbf{Wilcoxon $p$} \\
    \midrule
    check\_armstrong & 58.07 & 93.50 & 2.0757 \hfill Huge & 0.0020 \\
    railfence\_cipher & 88.41 & 87.44 & 0.8844 \hfill Very large & 0.1011 \\
    longest\_substring & 77.41 & 76.98 & 0.0771 \hfill Small & 0.7589 \\
    common\_divisor\_count & 76.17 & 72.76 & 0.7471 \hfill Large & 0.1258 \\
    zellers\_birthday & 68.09 & 71.75 & 1.4701 \hfill Huge & 0.0039 \\
    exponentiation & 69.44 & 67.14 & 0.3342 \hfill Medium & 0.7108 \\
    caesar\_cipher & 60.59 & 61.20 & 0.3549 \hfill Medium & 0.2955 \\
    gcd & 59.15 & 55.66 & 0.5016 \hfill Large & 0.1627 \\
    rabin\_karp & 27.90 & 47.55 & 2.3688 \hfill Huge & 0.0078 \\
    anagram\_check & 23.10 & 7.70 & $\infty$ \hfill Huge & 0.0020 \\
    \bottomrule
  \end{tabular}
  \caption{Statistical comparison between fuzzer and genetic algorithm test case generation in terms of mutation score (in percent) as reported by \textit{mut.py} over 10 runs, sorted by decreasing genetic algorithm mutation score. The table reports the run means, the Wilcoxon paired test $p$-value, and Cohen's $d$ effect size for each file in the benchmark.}\label{tab:stats}
\end{table}

To perform the statistical comparison, the Wilcoxon paired test with a $p$-value threshold of $0.05$ has been used to check whether there is a statistically significant difference between the two distributions. Moreover, Cohen's $d$ effect size has been used to measure the magnitude of the difference. Results of the statistical analysis are shown in Table~\ref{tab:stats}.

Only 3 out of the 10 benchmark files show a statistically significant difference in performance between the two scripts, namely \textit{check\_armstrong}, \textit{rabin\_karp} and \textit{anagram\_check}. The first two show that the genetic algorithm performs significantly better than the fuzzer, while the last file shows the opposite.

\end{document}