\documentclass[11pt,a4paper]{scrartcl}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage[utf8]{inputenc}
\usepackage[margin=2.25cm]{geometry}
\usepackage{hyperref}
\usepackage{listings}
\usepackage{xcolor}
\usepackage{lmodern}
\usepackage{booktabs}
\usepackage{multirow}
\usepackage{graphicx}
\usepackage{float}
\usepackage{multicol}
\usepackage{tikz}
\usepackage{pgfplots}
\pgfplotsset{compat=1.18}
\usepackage{subcaption}
\setlength{\parindent}{0cm}
\setlength{\parskip}{0.3em}
\hypersetup{pdfborder={0 0 0}}
%\usepackage[nomessages]{fp} no easter eggs this time
\usepackage{amsmath}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\usepackage{minted}
\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.95,0.95,0.92}
\lstdefinestyle{mystyle}{
  backgroundcolor=\color{backcolour},
  commentstyle=\color{codegreen},
  keywordstyle=\color{magenta},
  keywordstyle=[2]{\color{olive}},
  numberstyle=\tiny\color{codegray},
  stringstyle=\color{codepurple},
  basicstyle=\ttfamily\footnotesize,
  breakatwhitespace=false,
  breaklines=true,
  captionpos=b,
  keepspaces=true,
  numbers=left,
  numbersep=5pt,
  showspaces=false,
  showstringspaces=false,
  showtabs=false,
  tabsize=2,
  aboveskip=0.8em,
  belowcaptionskip=0.8em
}
\lstset{style=mystyle}
\geometry{left=2cm,right=2cm,top=2cm,bottom=3cm}

\title{
  \vspace{-5ex}
  Assignment 3 -- Software Analysis \\\vspace{0.5cm}
  \Large Extended Java Typechecking
  \vspace{-1ex}
}
\author{Claudio Maggioni}
\date{\vspace{-3ex}}

\begin{document}
\maketitle

\section{Project selection}

The assignment description requires finding a project with more than 1000 lines of code that makes significant use of arrays or strings. Given these requirements, I decide to analyze the Apache Commons Text project in the GitHub repository \href{https://github.com/apache/commons-text}{\textbf{apache/commons-text}}.

\subsection{The Apache Commons Text Project}

The Apache Commons family of libraries is an Apache Software Foundation\footnote{\url{https://apache.org/}} sponsored collection of Java libraries designed to complement the standard libraries of Java. The Apache Commons Text project focuses on text manipulation, and on encoding and decoding of \textit{String}s and \textit{CharSequence}-implementing classes in general.

All the source and test classes are contained within the package \textit{org.apache.commons.text} or in one of its sub-packages. For the sake of brevity, this prefix is omitted from now on when mentioning file paths and classes in the project.

I choose to analyze the project at the \textit{git} commit \texttt{78fac0f157f74feb804140613e4ffec449070990}, as it is the latest commit on the \textit{master} branch at the time of writing.

To verify that the project satisfies the 1000 lines of code requirement, I run the \textit{cloc} tool. Results are shown in Table \ref{tab:cloc}. As the project contains more than 26,000 lines of Java code (and more than 29,000 lines of code overall), this requirement is amply satisfied.
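For reproducibility, \textit{cloc} can be invoked from the root of the repository checkout with a command along the following lines (the exact flags originally used are not recorded in this report, so this is only an illustrative invocation):

\begin{verbatim}
cloc .
\end{verbatim}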
\begin{table}[H]
  \centering
  \begin{tabular}{lrrrr}
    \toprule
    Language & Files & Blank & Comment & Code \\
    \midrule
    Java & 194 & 5642 & 18704 & 26589 \\
    XML & 16 & 205 & 425 & 1370 \\
    Text & 6 & 194 & 0 & 667 \\
    Maven & 1 & 23 & 24 & 536 \\
    YAML & 6 & 39 & 110 & 160 \\
    Markdown & 4 & 40 & 106 & 109 \\
    Velocity Template Language & 1 & 21 & 31 & 87 \\
    CSV & 1 & 0 & 0 & 5 \\
    Properties & 2 & 2 & 28 & 5 \\
    Bourne Shell & 1 & 0 & 2 & 2 \\
    \midrule
    Total & 232 & 6166 & 19430 & 29530 \\
    \bottomrule
  \end{tabular}
  \caption{Output of the \textit{cloc} tool for the Apache Commons Text project at commit \textit{78fac0f1} (before refactoring is carried out).}
  \label{tab:cloc}
\end{table}

\section{Running the CheckerFramework Type Checker}

The relevant source code to analyze has been copied to the directory \textit{sources} in the assignment repository
\begin{center}
\href{https://gitlab.com/usi-si-teaching/msde/2022-2023/software-analysis/maggioni/assignment-3}{\textit{usi-si-teaching/msde/2022-2023/software-analysis/maggioni/assignment-3}}
\end{center}
on \textit{gitlab.com}. The Maven build specification for the project has been modified to run the CheckerFramework extended type checker (version 3.33.0) as an annotation processor on top of the Java compiler. Both source and test code are checked by the tool for violations, which are reported as compilation warnings. To run the type checker, simply run:

\begin{verbatim}
mvn clean compile
\end{verbatim}

in a suitable environment (i.e. with JDK 1.8 or greater and Maven installed). To additionally run the Apache Commons Text test suite and enable \texttt{assert} assertions (later useful for CheckerFramework \texttt{@AssumeAssertion(index)} assertions), simply run:

\begin{verbatim}
env MAVEN_OPTS="-ea" mvn clean test
\end{verbatim}

Apache Commons Text includes classes that have been deprecated. As alternatives to these classes already exist in the library, changing their interface and behaviour would be pointless; I therefore choose to ignore them while refactoring by adding a \textit{@SuppressWarnings} annotation to each of them.

The state of the assignment repository after the deprecated classes are annotated and the type checker was first run successfully is pinned by the \textit{git} tag \textit{before-refactor}. A copy of the CheckerFramework-relevant portion of the compilation output at that tag is stored in the file \textit{before-refactor.txt}.

No CheckerFramework checkers other than the index checker are used in this analysis, as the code in the project mainly manipulates strings and arrays and a significant number of warnings is generated by this checker alone.

\section{Refactoring}

\begin{table}[!ht]
  \centering
  \begin{tabular}{lrr}
    \toprule
    Warning type & Before refactoring & After refactoring \\
    \midrule
    argument & 254 & 241 \\
    array.access.unsafe.high & 130 & 117 \\
    array.access.unsafe.high.constant & 31 & 28 \\
    array.access.unsafe.high.range & 22 & 22 \\
    array.access.unsafe.low & 59 & 58 \\
    array.length.negative & 3 & 3 \\
    cast.unsafe & 2 & 2 \\
    override.return & 12 & 12 \\
    \midrule
    Total & 513 & 483 \\
    \bottomrule
  \end{tabular}
  \caption{Number of CheckerFramework Type Checker warnings by category before and after refactoring, ignoring deprecated classes.}
  \label{tab:check}
\end{table}

Table \ref{tab:check} provides a summary of the extent of the refactoring performed in response to index checker warnings across the Apache Commons Text project.
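Before discussing individual classes, the following minimal sketch illustrates the two mechanisms used throughout the refactoring: an index checker type qualifier that moves a bound into a method's contract, and a trivially true \texttt{@AssumeAssertion(index)}-annotated assertion that conveys a fact the checker cannot derive on its own. The class and its methods are hypothetical and are not part of Apache Commons Text:

\begin{minted}[breaklines]{java}
import org.checkerframework.checker.index.qual.IndexFor;
import org.checkerframework.common.value.qual.MinLen;

public final class IndexCheckerSketch {

    // Qualifier route: callers must pass a valid index for chars,
    // so the array access below needs no further justification here.
    static char charAt(final char[] chars, final @IndexFor("#1") int i) {
        return chars[i];
    }

    // Assertion route: the checker is told that the precomputed index
    // is non-negative; the assertion itself is checked whenever the
    // test suite runs with -ea.
    static char lastChar(final char @MinLen(1) [] chars) {
        final int last = chars.length - 1;
        assert last >= 0 : "@AssumeAssertion(index): the array has at least one element";
        return chars[last];
    }
}
\end{minted}

In the refactoring itself, qualifiers are preferred wherever they suffice, and assertions of this kind are introduced only where qualifiers alone do not convince the checker.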
In total, 513 warnings are found before refactoring, with 30 of them later being extinguished by introducing annotations and assertions in the following classes:

\begin{multicols}{2}
\begin{itemize}
\item AlphabetConverter % done
\item StringSubstitutor % done
\item similarity.LongestCommonSubsequence
\item translate.AggregateTranslator
\item translate.CharSequenceTranslator
\end{itemize}
\vfill\null
\columnbreak
\begin{itemize}
\item translate.CodePointTranslator
\item translate.CsvTranslators
\item translate.JavaUnicodeEscaper
\item translate.SinglePassTranslator
\item translate.UnicodeEscaper
\end{itemize}
\end{multicols}

The strategy I adopt to perform the refactoring is driven by the warnings emitted for the original code. For every flagged statement I attempt to find the root cause of the warning and eliminate it with extended type qualifier annotations, resorting to assertions only when the annotations alone are not sufficient. Instead of using \texttt{@SuppressWarnings} annotations I choose to use \texttt{@AssumeAssertion}-annotated assertions, as I aim to use the existing Commons Text test suite to aid in finding incorrectly placed annotations. As mentioned earlier in the report, I run the project's test suite with assertions enabled and verify that all tests still pass and that no \textit{AssertionError}s are thrown. In total, the refactoring consists of the placement of 16 extended type qualifiers and 14 assertions.

A more detailed description of salient refactoring decisions taken to extinguish the warnings follows.

\subsection{Class \textit{AlphabetConverter}}

\begin{minted}[linenos,firstnumber=387]{java}
for (int j = 0; j < encoded.length();) {
    final int i = encoded.codePointAt(j);
    final String s = codePointToString(i);
    if (s.equals(originalToEncoded.get(i))) {
        result.append(s);
        j++; // because we do not encode in Unicode extended the
             // length of each encoded char is 1
    } else {
        if (j + encodedLetterLength > encoded.length()) {
            throw new UnsupportedEncodingException("Unexpected end "
                    + "of string while decoding " + encoded);
        }
        final String nextGroup = encoded.substring(j,
                j + encodedLetterLength);
\end{minted}

Here the \texttt{substring(\ldots)} call in the last statement of the listing is flagged by CheckerFramework, which warns that the start and end indices may be negative and that the start index may be greater than the end index. As the attribute \texttt{encodedLetterLength} is positive according to the contract of the class constructor, and \texttt{j} is only ever incremented, either by 1 in the loop body or by \texttt{encodedLetterLength}, the code is correct. After introducing a \mintinline{java}{@Positive} annotation on the declaration of \texttt{j} and an \mintinline{java}{assert encodedLetterLength > 0} at the beginning of the \texttt{else} branch shown above, CheckerFramework agrees with my judgement.

\subsection{Class \textit{StringSubstitutor}}

\begin{minted}[linenos,breaklines,firstnumber=910]{java}
/** [...]
 * @throws IllegalArgumentException if variable is not found when its allowed to throw exception
 * @throws StringIndexOutOfBoundsException if {@code offset} is not in the
 *  range {@code 0 <= offset <= source.length()}
 * @throws StringIndexOutOfBoundsException if {@code length < 0}
 * @throws StringIndexOutOfBoundsException if {@code offset + length > source.length()}
 */
public String replace(final String source, final int offset, final int length) {
    if (source == null) {
        return null;
    }
    final TextStringBuilder buf = new TextStringBuilder(length).append(source, offset, length);
    if (!substitute(buf, 0, length)) {
        return source.substring(offset, offset + length);
    }
    return buf.toString();
}
\end{minted}

The implementation of the method \texttt{replace} is flagged by the extended type checker, as the indices \texttt{offset} and \texttt{length} are not bounds-checked against the string \texttt{source}. As the unsafe behaviour of the method is documented in its \textit{javadoc} with appropriate \texttt{@throws} clauses, I simply add these implied preconditions to the method's contract by using extended type qualifiers:

\begin{minted}[breaklines]{java}
public String replace(final String source, final @IndexOrHigh("#1") int offset, final @NonNegative @LTLengthOf(value = "#1", offset = "#2 - 1") int length)
\end{minted}

\subsection{Class \textit{translate.CharSequenceTranslator} and implementors}

Apache Commons Text provides the abstract class \textit{CharSequenceTranslator} as a template method of sorts for expressing text encoding and decoding algorithms. The class essentially provides facilities to scan UTF-16 code points sequentially, delegating the translation of each code point to the implementation of the abstract method:

\begin{minted}[breaklines]{java}
public abstract int translate(CharSequence input, int index, Writer writer) throws IOException;
\end{minted}

CheckerFramework emits warnings about some of the implementations of this method, highlighting that they assume the \texttt{input} \textit{CharSequence} is non-empty and that the \texttt{index} parameter is a valid index into it. Even though the method is public, I choose to interpret this hierarchy mainly as a template method pattern, with high coupling between the algorithm in the abstract class and each abstract method implementation. Given this, I decide to strengthen the method's precondition to make explicit the conditions already guaranteed by the calling algorithm, namely the length and index constraints pointed out by CheckerFramework. The new signature of the abstract method is:

\begin{minted}[breaklines]{java}
public abstract int translate(@MinLen(1) CharSequence input, @NonNegative @LTLengthOf("#1") int index, Writer writer) throws IOException;
\end{minted}

As some implementations are more forgiving, and an overriding method may accept a broader argument type than its more restrictive parent without breaking the rules of inheritance (thanks to the contravariance of parameter types), I choose to propagate the extended type annotations only where needed and avoid introducing additional preconditions to the more tolerant implementations of the template method.

\subsection{Class \textit{translate.SinglePassTranslator} and implementors}

\textit{SinglePassTranslator} is one of the implementor classes of the aforementioned \textit{CharSequenceTranslator} template method. However, the class is itself a template method pattern ``for processing whole input in single pass''\footnote{According to the class \textit{javadoc} description.}, i.e.
essentially performing an abstraction inversion of the codepoint-by-codepoint algorithm in \textit{CharSequenceTranslator} by implementing the encoding or decoding process in a single pass. The class immediately delegates the implementation of the translation algorithm to the abstract package-private method:

\begin{minted}[breaklines]{java}
abstract void translateWhole(CharSequence input, Writer writer) throws IOException;
\end{minted}

and requires callers of the public implementation of \texttt{translate} to call it with \texttt{index} equal to 0. I simply propagate the non-empty extended type annotation on \texttt{input} (i.e. \mintinline{java}{@MinLen(1)}) to this new abstract method and its implementors.

The \textit{translate.CsvTranslators\$CsvUnescaper} implementation of this new template method requires additional attention to extinguish all of the index checker's warnings.

\begin{minted}[breaklines,linenos,firstnumber=60]{java}
void translateWhole(final @MinLen(1) CharSequence input, final Writer writer) throws IOException {
    // is input not quoted?
    if (input.charAt(0) != CSV_QUOTE || input.charAt(input.length() - 1) != CSV_QUOTE) {
        writer.write(input.toString());
        return;
    }
    // strip quotes
    final String quoteless = input.subSequence(1, input.length() - 1).toString();
\end{minted}

Here CheckerFramework is unable to deduce that \mintinline{java}{input.length() - 1} results in a safe index for \texttt{input}, even though the \textit{CharSequence} is always non-empty (as specified by the type qualifier propagated from the abstract signature of \texttt{translateWhole}). This warning is fixed by precomputing the last index of the string and introducing a trivially true assertion on it:

\begin{minted}[breaklines,linenos,firstnumber=60]{java}
void translateWhole(final @MinLen(1) CharSequence input, final Writer writer) throws IOException {
    final int lastIndex = input.length() - 1;
    assert lastIndex >= 0 : "@AssumeAssertion(index): input.length() is >= 1 by contract";
    // is input not quoted?
    if (input.charAt(0) != CSV_QUOTE || input.charAt(lastIndex) != CSV_QUOTE) {
        writer.write(input.toString());
        return;
    }
    // strip quotes
    final String quoteless = input.subSequence(1, lastIndex).toString();
\end{minted}

This assertion is a good example of CheckerFramework's limitations in verifying properties that only hold transitively, due to its lack of deductive verification capabilities. All assertions added to the project are trivial in this way to some degree.

\section{Conclusions}

As evidenced by the Apache Commons Text test suite and the previous section of this report, no changes to the implementation behaviour were introduced by the refactoring. Only extended type annotations and assertions (which hold when executing the test suite) were added to the code. No implementation bugs were discovered during refactoring, although the introduction of additional method preconditions via extended type annotations was sometimes needed. Some questionable design choices were found in the library, namely the \textit{CharSequenceTranslator} template method hierarchy with \mintinline{java}{public} implementors and the inversion of abstraction by refused bequest in the \textit{SinglePassTranslator} partial implementation. However, I managed to introduce correct extended types without drastic alterations (which might have broken clients of the library).

It was quite easy to introduce the CheckerFramework index checker into the Maven build toolchain of the project.
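For reference, the relevant part of the build configuration is essentially of the following shape. This is a sketch based on the general CheckerFramework Maven setup rather than a verbatim excerpt of the project's \textit{pom.xml}, so element ordering and surrounding configuration may differ:

\begin{minted}[breaklines]{xml}
<!-- Sketch: run the CheckerFramework Index Checker as a javac annotation processor. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <configuration>
    <annotationProcessorPaths>
      <path>
        <groupId>org.checkerframework</groupId>
        <artifactId>checker</artifactId>
        <version>3.33.0</version>
      </path>
    </annotationProcessorPaths>
    <annotationProcessors>
      <annotationProcessor>org.checkerframework.checker.index.IndexChecker</annotationProcessor>
    </annotationProcessors>
  </configuration>
</plugin>
\end{minted}

In addition, the \texttt{org.checkerframework:checker-qual} artifact, which provides the annotations used in the source code, is typically declared as a regular compile-time dependency.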
However, compiling the project after introducing the checker was significantly slower than before, and many spurious warnings (i.e. false positives) were produced by the tool due to its lack of deductive verification capabilities.

Some benefits were gained by using the tool, namely being able to better refine the preconditions of the library's methods by documenting them in a machine-readable fashion. However, the sheer number of trivially true assertions introduced to silence some warnings is a significant downside to consider when using the index checker.

The CheckerFramework extended type checker has a cost and complexity of usage greater than the Infer static analysis tool and lower than the Dafny deductive verifier. However, based on my experience with all the tools in carrying out the assignments, this is the tool that has proven to be the least useful. The less powerful Infer static analyzer, despite a great number of false positives, was able to highlight some bugs in the project I chose for that assignment (Apache Commons Lang). As all Apache Commons libraries are quite mature and well maintained, it is expected that both tools find relatively few things to flag in them. Still, CheckerFramework found no noteworthy errors in this project other than some small imprecisions in the precondition specifications of some methods. Compared to Dafny and deductive verifiers in general, both Infer and CheckerFramework clearly pale in terms of capabilities. However, Dafny's increased cost of adoption and execution makes it suitable only for critical portions of a codebase.

Finally, I would like to mention the \href{https://github.com/JetBrains/java-annotations}{JetBrains Java Annotations} library as a much less powerful but effective extended type checking mechanism for handling nullable and non-null values in plain Java. Even though its scope of analysis is rather narrow and the type checking process is proprietary (as only JetBrains IDEs interpret the annotations), I have found it an effective development aid in my university and work experience.

\end{document}