%!TEX TS-program = pdflatexmk
\documentclass{scrartcl}

\usepackage{algorithm}
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{booktabs}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{microtype}
\usepackage{rotating}
\usepackage{graphicx}
\usepackage{paralist}
\usepackage{tabularx}
\usepackage{multicol}
\usepackage{multirow}
\usepackage{pbox}
\usepackage{enumitem}
\usepackage{colortbl}
\usepackage{pifont}
\usepackage{xspace}
\usepackage{url}
\usepackage{tikz}
\usepackage{fontawesome}
\usepackage{lscape}
\usepackage{listings}
\usepackage{color}
\usepackage{anyfontsize}
\usepackage{comment}
\usepackage{soul}
\usepackage{multibib}
\usepackage{float}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{hyperref}
\usepackage[margin=2.5cm]{geometry}
\usepackage{changepage}

\title{Knowledge Search \& Extraction \\ Project 02: Python Test Generator}
\author{Claudio Maggioni}
\date{}

\begin{document}

\maketitle

\begin{adjustwidth}{-4cm}{-4cm}
\centering
\begin{tabular}{cc}
\toprule
Repository URL & \url{https://github.com/kamclassroom2022/project-02-python-test-generator-maggicl} \\
Commit ID & \texttt{549b8859ffaad5391da362ed9bbe9db82b2c5c0b} \\ \bottomrule
\end{tabular}
\end{adjustwidth}
\vspace{1cm}

\subsection*{Section 1: Instrumentation}

The script \textit{instrument.py} in the main directory of the project instruments the Python files of the
benchmark suite by replacing each condition node with a call to \texttt{evaluate\_condition}, which preserves
program behaviour but, as a side effect, computes and stores the condition distance for each traversed branch.

Table~\ref{tab:count1} summarizes the number of Python files, function definition (\textit{FunctionDef}) nodes,
and comparison nodes (\textit{Compare} nodes not inside an \texttt{assert} or \texttt{return} statement) found by
the instrumentation script.
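As an illustration, such a wrapper could compute branch distances while preserving the original boolean outcome along the lines of the following sketch. Only the name \texttt{evaluate\_condition} comes from the text above; all other identifiers and the exact distance rules are assumptions, not the actual \textit{instrument.py} code.

```python
# Hypothetical sketch of a branch-distance-recording condition wrapper.
# Only `evaluate_condition` is named in the report; the rest is illustrative.

distances_true = {}   # condition id -> smallest distance to making it True
distances_false = {}  # condition id -> smallest distance to making it False

def _update(dmap, cond_id, d):
    dmap[cond_id] = min(dmap.get(cond_id, d), d)

def evaluate_condition(cond_id, op, lhs, rhs):
    """Replaces e.g. `lhs < rhs` with evaluate_condition(i, "Lt", lhs, rhs)."""
    # Single characters are compared via their code points so that
    # distances are also defined for string comparisons.
    if isinstance(lhs, str) and isinstance(rhs, str) and len(lhs) == len(rhs) == 1:
        lhs, rhs = ord(lhs), ord(rhs)
    if op == "Eq":
        d_true, d_false = abs(lhs - rhs), (1 if lhs == rhs else 0)
    elif op == "Lt":
        d_true = lhs - rhs + 1 if lhs >= rhs else 0
        d_false = rhs - lhs if lhs < rhs else 0
    else:
        raise NotImplementedError(op)
    _update(distances_true, cond_id, d_true)
    _update(distances_false, cond_id, d_false)
    return d_true == 0  # preserve the original outcome of the condition
```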

\begin{table}[H]
\centering
\begin{tabular}{lr}
\toprule
\textbf{Type} & \textbf{Number} \\
\midrule
Python Files & 10 \\
Function Nodes & 12 \\
Comparison Nodes & 44 \\
\bottomrule
\end{tabular}
\caption{Count of files and nodes found.}
\label{tab:count1}
\end{table}

\subsection*{Section 2: Fuzzer test generator}

The script \textit{fuzzer.py} loads the instrumented benchmark suite and generates tests at random to maximize
branch coverage.

The implementation submitted with this report slightly improves on the required specification, as it is able to
deal with an arbitrary number of function parameters, each of which must be type-hinted as either \texttt{str} or
\texttt{int}. The fuzzing process generates a pool of 1000 test case inputs according to the function signature,
using randomly generated integers $\in [-1000, 1000]$ and randomly generated strings of length $\in [0, 10]$ made
of ASCII characters with code $\in [32, 127]$. Note that test cases generated in the pool may not satisfy the
preconditions (i.e.\ the \texttt{assert} statements on the inputs) of the given function.

Then, 250 test cases are extracted from the pool according to the following procedure, where each option is chosen
with equal probability ($p = 1/3$):
\begin{itemize}
\item The extracted test case may be kept as-is;
\item The extracted test case may be randomly mutated using the \textit{mutate} function. An argument is
chosen at random; if it is of type \texttt{str}, the character at a random position in the string is replaced
with a random character, while if it is of type \texttt{int}, a random value $\in [-10, 10]$ is added to it.
If the resulting test case is not already present in the pool, it is added to the pool;
\item The extracted test case may be randomly combined with another randomly extracted test case using the
\textit{crossover} function. The function chooses an argument at random; if it is of type \texttt{int}, its
values are swapped between the two test cases, while if it is of type \texttt{str}, the strings from the two
test cases are each split into two substrings at a random position and rejoined by combining the ``head''
substring of one test case with the ``tail'' substring of the other. If the two resulting test cases are new,
they are added to the pool.
\end{itemize}
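The input generation and the two operators above can be sketched as follows. The names \textit{mutate} and \textit{crossover} follow the report; the helper \texttt{random\_arg}, the argument representation as tuples, and other details are assumptions rather than the actual \textit{fuzzer.py} code.

```python
import random

# Illustrative sketch of the fuzzer's input generation, mutation, and
# crossover; ranges follow the report, other details are hypothetical.

def random_arg(type_hint):
    """Generate a random argument for an int- or str-hinted parameter."""
    if type_hint is int:
        return random.randint(-1000, 1000)
    length = random.randint(0, 10)
    return "".join(chr(random.randint(32, 127)) for _ in range(length))

def mutate(args, types):
    """Perturb one randomly chosen argument of a test case."""
    args = list(args)
    i = random.randrange(len(args))
    if types[i] is int:
        args[i] += random.randint(-10, 10)
    elif args[i]:  # replace one character at a random position
        pos = random.randrange(len(args[i]))
        s = args[i]
        args[i] = s[:pos] + chr(random.randint(32, 127)) + s[pos + 1:]
    return tuple(args)

def crossover(a, b, types):
    """Swap an int argument, or splice str arguments at random cut points."""
    a, b = list(a), list(b)
    i = random.randrange(len(a))
    if types[i] is int:
        a[i], b[i] = b[i], a[i]
    else:
        cut_a = random.randint(0, len(a[i]))
        cut_b = random.randint(0, len(b[i]))
        # head of one test case joined with tail of the other
        a[i], b[i] = a[i][:cut_a] + b[i][cut_b:], b[i][:cut_b] + a[i][cut_a:]
    return tuple(a), tuple(b)
```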

If the resulting test cases satisfy the function precondition, and if their execution covers branches that have
not been covered by other test cases, they are added to the test suite. The resulting test suite is then saved as
a \textit{unittest} file comprising one test class per function present in the benchmark file.

\subsection*{Section 3: Genetic Algorithm test generator}

The script \textit{genetic.py} loads the instrumented benchmark suite and generates tests using a genetic algorithm
to maximize branch coverage and minimize distance to condition boundary values.

The genetic algorithm is implemented with the \textit{deap} library using the \textit{eaSimple} procedure.
The algorithm is initialized with 200 individuals extracted from a pool generated in the same way as in the
previous section. The algorithm runs for 20 generations and implements the \textit{mate} and \textit{mutate}
operators using the \textit{crossover} and \textit{mutate} functions, respectively, as described in the previous
section.

The fitness function returns a value of $\infty$ if the test case does not satisfy the function precondition,
a value of $1000000$ if the test case does not cover any new branches,
or the sum of the normalized ($1 / (x + 1)$) distances for the branches that are not yet covered by other test
cases. A penalty of $2$ is added to the fitness value for every branch that is already covered. The fitness
function is minimized by the genetic algorithm.
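The three cases above can be sketched as follows. The constants and the $1/(x+1)$ normalization are taken from the description above, while the function signature and names are hypothetical and may differ from the actual \textit{genetic.py} implementation.

```python
import math

# Hypothetical sketch of the fitness function described above
# (minimized by the genetic algorithm; lower is better).

def fitness(satisfies_precondition, new_branch_distances, already_covered_count):
    """new_branch_distances: raw distances for branches not yet covered by
    other test cases (empty if the test case covers no new branch).
    already_covered_count: branches hit that are already covered elsewhere."""
    if not satisfies_precondition:
        return math.inf
    if not new_branch_distances:
        return 1_000_000
    # Normalization as stated in the report: 1 / (x + 1)
    score = sum(1 / (d + 1) for d in new_branch_distances)
    # Penalty of 2 for every branch that is already covered
    return score + 2 * already_covered_count
```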
The genetic algorithm is run 10 times. At the end of each execution, the best individuals (sorted by increasing
fitness) are selected if they cover at least one branch that has not yet been covered. This is the only point in
the procedure where the set of covered branches is updated\footnote{This differs from the reference implementation
of \texttt{sb\_cgi\_decode.py}, which performs the update directly in the fitness function.}.
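This selection step can be sketched as follows, assuming each individual carries the set of branches its execution covers; the function name and data representation are assumptions for illustration only.

```python
# Hypothetical sketch of the post-run selection step described above.

def select_new_coverage(best_individuals, covered_branches):
    """Keep individuals (given sorted by increasing fitness) that cover at
    least one branch not yet in covered_branches. The covered set is
    updated here, and only here, as described in the report."""
    selected = []
    for individual, branches in best_individuals:
        new = set(branches) - covered_branches
        if new:
            selected.append(individual)
            covered_branches |= new
    return selected
```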
\subsection*{Section 4: Statistical comparison of test generators}

To compare the performance of the fuzzer and the genetic algorithm, the mutation testing tool \textit{mut.py} has
been used to measure how robust the generated test suites are. Both implementations have been executed 10 times
using a different RNG seed each time, and a statistical comparison of the resulting mutation score distributions
has been performed to determine whether one generation method is statistically more performant than the other.
\begin{figure}[t]
\begin{center}
\includegraphics[width=\linewidth]{../out/mutation_scores}
\caption{Distributions of \textit{mut.py} mutation scores over the generated benchmark test suites
using the fuzzer and the genetic algorithm.}\label{fig:mutation-scores}
\end{center}
\end{figure}

\begin{figure}[t]
\begin{center}
\includegraphics[width=\linewidth]{../out/mutation_scores_mean}
\caption{Mean \textit{mut.py} mutation scores over the generated benchmark test suites
using the fuzzer and the genetic algorithm.}\label{fig:mutation-scores-mean}
\end{center}
\end{figure}

Figure~\ref{fig:mutation-scores} shows a boxplot of the mutation score distributions for each file in the benchmark
suite, while Figure~\ref{fig:mutation-scores-mean} shows the mean mutation scores.
\begin{table}[t]
\centering
\begin{tabular}{lrrp{3.5cm}r}
\toprule
\textbf{File} & \textbf{$E(\text{Fuzzer})$} & \textbf{$E(\text{Genetic})$} & \textbf{Cohen's $|d|$} & \textbf{Wilcoxon $p$} \\
\midrule
check\_armstrong & 58.07 & 93.50 & 2.0757 \hfill Huge & 0.0020 \\
railfence\_cipher & 88.41 & 87.44 & 0.8844 \hfill Very large & 0.1011 \\
longest\_substring & 77.41 & 76.98 & 0.0771 \hfill Small & 0.7589 \\
common\_divisor\_count & 76.17 & 72.76 & 0.7471 \hfill Large & 0.1258 \\
zellers\_birthday & 68.09 & 71.75 & 1.4701 \hfill Huge & 0.0039 \\
exponentiation & 69.44 & 67.14 & 0.3342 \hfill Medium & 0.7108 \\
caesar\_cipher & 60.59 & 61.20 & 0.3549 \hfill Medium & 0.2955 \\
gcd & 59.15 & 55.66 & 0.5016 \hfill Large & 0.1627 \\
rabin\_karp & 27.90 & 47.55 & 2.3688 \hfill Huge & 0.0078 \\
anagram\_check & 23.10 & 7.70 & $\infty$ \hfill Huge & 0.0020 \\
\bottomrule
\end{tabular}
\caption{Statistical comparison between fuzzer and genetic algorithm test case generation in terms of mutation
score (\%) as reported by \textit{mut.py} over 10 runs, sorted by decreasing genetic algorithm mutation score.
The table reports the run means, the Wilcoxon paired test $p$-value, and the Cohen's $d$ effect size for each
file in the benchmark.}\label{tab:stats}
\end{table}

To perform the statistical comparison, the Wilcoxon paired test with a $p$-value threshold of 0.05 has been used
to check whether there is a statistically significant difference between the two distributions. Moreover, the
Cohen's $d$ effect size has been used to measure the magnitude of the difference. The results of the statistical
analysis are shown in table~\ref{tab:stats}.
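The effect size can be computed as sketched below (Cohen's $d$ with pooled standard deviation; the analysis script used for the report may differ, and the guard for zero pooled variance is an assumption, presumably explaining the $\infty$ entry in table~\ref{tab:stats}). The $p$-values could be obtained with, e.g., \texttt{scipy.stats.wilcoxon}.

```python
import math
import statistics

# Sketch of Cohen's d with pooled standard deviation; the zero-variance
# guard (returning infinity) is an assumption about how the infinite
# effect size in the results table arises.

def cohens_d(xs, ys):
    nx, ny = len(xs), len(ys)
    pooled_var = (
        (nx - 1) * statistics.variance(xs) + (ny - 1) * statistics.variance(ys)
    ) / (nx + ny - 2)
    diff = statistics.mean(xs) - statistics.mean(ys)
    if pooled_var == 0:
        return math.copysign(math.inf, diff) if diff else 0.0
    return diff / pooled_var ** 0.5
```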

For only 4 out of the 10 benchmark files do the two scripts have statistically different performance, namely
\textit{check\_armstrong}, \textit{zellers\_birthday}, \textit{rabin\_karp} and \textit{anagram\_check}.
The first three show that the genetic algorithm performs significantly better than the fuzzer,
while the last file shows the opposite.

\end{document}