
---
author: Claudio Maggioni
title: Information Modelling & Analysis -- Project 1
geometry: margin=2cm,bottom=3cm
---

<!--
The following shows a minimal submission report for project 1. If you
choose to use this template, replace all template instructions (the
yellow bits) with your own values. In addition, for any section, if
**and only if** anything was unclear or warnings were raised by the
code, and you had to take assumptions about the correct implementation (e.g.,
about details of a metric), describe your assumptions in one or two sentences.

You may - at your own risk - also choose not to use this template. As long as
your submission is a latex-generated, English PDF containing all expected info,
you'll be fine.  -->

# Code Repository

The code and result files that are part of this submission can be found at:

::: center
Repository:
\url{https://github.com/infoMA2023/project-01-god-classes-maggicl}

Commit ID: **TBD**
:::

# Data Pre-Processing

## God Classes

The first part of the project requires labelling some classes of the _Xerces_
project as "God classes" based on the number of methods each class has. From
here onwards the Java package prefix `org.apache.xerces` is omitted from
fully qualified class names for the sake of brevity.

Specifically, I label as "God classes" the classes whose number of methods is
more than six standard deviations above the mean number of methods, i.e.
where the condition

$$|M(C)| > \mu(M) + 6\sigma(M)$$

holds.

To scan and compute the number of methods of each class I use the Python library
`javalang`, which implements the Java AST and parser. The Python script
`./find_god_classes.py` uses this library to parse each file in the project and
compute the number of methods of each class. Note that only non-constructor
methods are counted (specifically the code counts the number of `method` nodes
in each `ClassDeclaration` node).

Then, the script computes the mean and standard deviation of the number of
methods and filters the list of classes according to the condition described
above. Finally, the script writes the file `god_classes/god_classes.csv`, which
lists all the god classes found.
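
The filtering condition can be illustrated with a short sketch. The function name `find_god_classes` and the sample data are hypothetical, and the use of the population standard deviation is an assumption, since the report does not state which variant the script uses.

```python
import statistics


def find_god_classes(method_counts: dict[str, int], n_sigma: float = 6.0) -> dict[str, int]:
    """Keep the classes whose method count exceeds mean + n_sigma * std."""
    values = list(method_counts.values())
    mean = statistics.mean(values)
    # Assumption: population standard deviation. The sample version
    # (statistics.stdev) would give a slightly higher threshold.
    std = statistics.pstdev(values)
    threshold = mean + n_sigma * std
    return {name: n for name, n in method_counts.items() if n > threshold}


# Hypothetical data: 99 small classes and one very large one.
counts = {f"Small{i}": 10 for i in range(99)}
counts["Big"] = 200
print(find_god_classes(counts))  # {'Big': 200}
```

With a threshold this strict only extreme outliers qualify, which is consistent with only four classes being selected in the whole project.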

The god classes I identified and their corresponding number of methods can be
found in Table [1](#tab:god_classes){reference-type="ref"
reference="tab:god_classes"}.

::: {#tab:god_classes}
| **Class Name**                                 | **# Methods** |
|:------------------------------------------------|------------:|
| impl.xs.traversers.XSDHandler |         118 |
| impl.dtd.DTDGrammar           |         101 |
| xinclude.XIncludeHandler      |         116 |
| dom.CoreDocumentImpl          |         125 |

  : Identified God Classes
:::


## Feature Vectors

In this part of the project we produce the feature vectors that are later used
to cluster the methods of each God class. We produce one feature vector per
non-constructor Java method in each god class.

The columns of each vector represent the fields and methods referenced by each
method, i.e. the fields and methods actively used in the method's body.

When analyzing references to fields, additional constraints need to be specified
to handle edge cases. Namely, a field's property may be referenced (e.g. an
access to array `a` may fetch its `length` property, i.e. `a.length`). In these
cases I consider the qualifier (i.e. the field itself, `a`) and not its
property. When the qualifier is a class (i.e. the code references a property of
another class, e.g. `Integer.MAX_VALUE`) I consider the class name itself (i.e.
`Integer`) and not the name of the property. Should the qualifier be a
subproperty itself (e.g. in `a.b.c`, where `a.b` would be the qualifier
according to `javalang`)

For methods, I only consider calls to methods of the class itself where the
qualifier is unspecified or `this`. Calls to parent methods (i.e. calls like
`super.something()`) are not considered.

The feature vector extraction phase is performed by the Python script
`extract_feature_vectors.py`. The script takes `god_classes/god_classes.csv` as
input and loads the AST of each class listed in it. Then, a list of all the
fields and methods in the class is built, and each method is scanned to see
which fields and methods it references in its body according to the previously
described rules. Then, a CSV per class is built storing all feature vectors.
Each file has a name matching the FQDN (fully qualified domain name) of the
class. Each CSV row refers to a method in the class, and each CSV column refers
to a field, method or referenced class. A cell has the value of 1 when the
method of that row references the field, method or class marked by that column,
and it has the value 0 otherwise. Columns with only zeros are omitted.
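
The construction of the binary feature vectors can be sketched as follows. This is only an illustration under stated assumptions: the function names, the sample references and the leading `method` column are hypothetical and may differ from what `extract_feature_vectors.py` actually writes.

```python
import csv
import io


def build_feature_rows(references: dict[str, set[str]]) -> tuple[list[str], list[list[int]]]:
    """Turn a method -> referenced-names mapping into binary feature vectors."""
    # Only names referenced by at least one method become columns, which is
    # equivalent to dropping the all-zero columns afterwards.
    columns = sorted(set().union(*references.values()))
    rows = [[int(col in refs) for col in columns] for refs in references.values()]
    return columns, rows


def write_csv(references: dict[str, set[str]], out) -> None:
    """Write one CSV row per method, one column per referenced name."""
    columns, rows = build_feature_rows(references)
    writer = csv.writer(out)
    writer.writerow(["method"] + columns)
    for method, row in zip(references, rows):
        writer.writerow([method] + row)


# Hypothetical references extracted from three methods of one class.
refs = {
    "parse": {"fBuffer", "reset", "Integer"},
    "reset": {"fBuffer"},
    "close": set(),
}
buffer = io.StringIO()
write_csv(refs, buffer)
print(buffer.getvalue())
```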

Table [2](#tab:feat_vec){reference-type="ref" reference="tab:feat_vec"} shows
aggregate numbers regarding the extracted feature vectors for the god classes.
Note that the number of attributes refers to the number of fields, methods or
classes actually referenced (i.e. the number of columns left after omitting the
all-zero columns).

::: {#tab:feat_vec}
| **Class Name**                                  | **# Feature Vectors** | **# Attributes\*** |
|:------------------------------------------------|----------------------:|-----------------:|
| impl.xs.traversers.XSDHandler |                 106 |            183 |
| impl.dtd.DTDGrammar           |                  91 |            106 |
| xinclude.XIncludeHandler      |                 108 |            143 |
| dom.CoreDocumentImpl          |                 117 |             63 |

  : Feature vector summary (\*= used at least once)
:::

# Clustering {#sec:clustering}

In this section I cover the techniques used to cluster the methods of each god
class. The project aims to use KMeans clustering and agglomerative hierarchical
clustering to group these methods together in cohesive units which could
potentially be refactored out of the god class they belong to.

## Algorithm Configurations

To perform KMeans clustering, I use the `cluster.KMeans` Scikit-Learn
implementation of the algorithm. I use the default parameters: feature vectors
are compared with Euclidean distance, centroids are used instead of medoids,
and the initial centroids are computed with the greedy algorithm `k-means++`.
The random seed is fixed to $0$ to allow for reproducibility between executions
of the clustering script.

To perform hierarchical clustering, I use the `cluster.AgglomerativeClustering`
Scikit-Learn implementation of the algorithm. Again, feature vectors are
compared with Euclidean distance, and as a linkage criterion I choose complete
linkage. As agglomerative clustering is deterministic, no random seed is needed
for this algorithm.
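
The two configurations can be sketched as follows. The wrapper functions and the toy data are hypothetical. Setting `n_init=10` explicitly is an assumption made here to pin the number of restarts, since its default value differs between Scikit-Learn versions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans


def cluster_kmeans(X: np.ndarray, k: int) -> np.ndarray:
    # init="k-means++" is Scikit-Learn's default and random_state=0 fixes
    # the seed. n_init=10 is an assumption (the default is version dependent).
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    return model.fit_predict(X)


def cluster_agglomerative(X: np.ndarray, k: int) -> np.ndarray:
    # Euclidean distance is the default metric, so only the linkage is set.
    model = AgglomerativeClustering(n_clusters=k, linkage="complete")
    return model.fit_predict(X)


# Hypothetical binary feature vectors forming two obvious groups.
X = np.array([
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 1, 0, 1, 1],
])
print(cluster_kmeans(X, 2))
print(cluster_agglomerative(X, 2))
```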

I run the two algorithms for all integer $k \in [2,65]$. If a god class has
fewer than 65 distinct feature vectors, the upper bound of $k$ is the number of
distinct feature vectors instead.
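
A minimal sketch of how this range of $k$ could be derived from the feature vectors of one class is shown below. The helper name `k_values` is hypothetical.

```python
def k_values(feature_vectors: list[list[int]], k_max: int = 65) -> range:
    """Values of k to try: 2 up to min(k_max, number of distinct vectors)."""
    n_distinct = len({tuple(v) for v in feature_vectors})
    return range(2, min(k_max, n_distinct) + 1)


# Four vectors, three of which are distinct.
print(list(k_values([[1, 0], [1, 0], [0, 1], [1, 1]])))  # [2, 3]
```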

## Testing Various K & Silhouette Scores

To find the optimal value of $k$ for both algorithms, to compute the
distribution of cluster sizes and silhouette scores across values of $k$, and to
apply the optimal clustering for each god class, I run the command:

```shell
./silhouette.py --validate --autorun
```

Feature vectors are read from the `feature_vectors` directory and all the
results are stored in the `clustering` directory.
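
The search for the optimal $k$ can be sketched as a sweep that keeps the value with the highest silhouette score. This is an illustration and not the actual code of `silhouette.py`: the function `best_k`, the toy data and the tie-breaking rule (the lowest $k$ wins) are assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score


def best_k(X, k_values, make_model):
    """Return (k, silhouette) for the k with the highest silhouette score."""
    scores = {}
    for k in k_values:
        labels = make_model(k).fit_predict(X)
        scores[k] = silhouette_score(X, labels)  # Euclidean by default
    # max() returns the first maximum, so ties go to the lowest k.
    k_opt = max(scores, key=scores.get)
    return k_opt, scores[k_opt]


# Hypothetical binary feature vectors forming two obvious groups.
X = np.array([
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 1, 0, 1, 1],
])
ks = range(2, len(X))  # the silhouette needs at most n_samples - 1 clusters

print(best_k(X, ks, lambda k: KMeans(n_clusters=k, n_init=10, random_state=0)))
print(best_k(X, ks, lambda k: AgglomerativeClustering(n_clusters=k, linkage="complete")))
```

On this toy data both algorithms should select $k = 2$, since any further split breaks up one of the two groups.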

Figures [1](#fig:xsd){reference-type="ref" reference="fig:xsd"},
[2](#fig:dtd){reference-type="ref" reference="fig:dtd"},
[3](#fig:xinc){reference-type="ref" reference="fig:xinc"}, and
[4](#fig:cimpl){reference-type="ref" reference="fig:cimpl"} show the
distributions of cluster sizes for each god class obtained by running the KMeans
and agglomerative clustering algorithms as described in the previous sections.

For all god classes, the mean number of elements in each cluster decreases
hyperbolically as $k$ increases, since it is inversely proportional to $k$.
Aside from the first values of $k$ for class `DTDGrammar` (where it was 2), the
minimum cluster size was 1 for all analyzed clusterings. Conversely, the maximum
cluster size varies a lot, almost always being monotonically non-increasing as
$k$ increases and occasionally forming wide plateaus. The silhouette metric
instead generally follows a dogleg-like path, sharply decreasing for the first
values of $k$ and slowly increasing afterwards. As a result, the optimal number
of clusters for each algorithm tends to be either very low or very high.

The figures also show the distribution of the silhouette metric per algorithm
and per value of $k$. The optimal values of $k$ and the respective silhouette
values for each implementation are reported in Table
[3](#tab:sumup){reference-type="ref" reference="tab:sumup"}.

From the values we can gather that agglomerative clustering performs overall
slightly better than KMeans for the god classes in the project. Almost all god
classes are optimally clustered with few clusters, with the exception of
`CoreDocumentImpl`, which is optimally clustered with a large number of small
clusters. This could indicate higher cohesion between implementation details of
the other classes, and lower cohesion in `CoreDocumentImpl` (given the name, it
would not be surprising if this class plays the role of a utility class of
sorts, combining lots of implementation details affecting different areas of
the code).

Agglomerative clustering with complete linkage could perform better than KMeans
due to a more urgent need for separation rather than cohesion in the classes
that were analyzed. Given the high dimensionality of the feature vectors used,
and the fact that Euclidean distance is used to compare feature vectors, the
hyper-space of method features for each god class is likely sparse, with
occasional clusters of tightly-knit features. Given the prevailing sparsity,
complete linkage could be suitable here since above all it avoids merging
distant clusters.

![Clustering metrics for class impl.xs.traversers.XSDHandler](../clustering/org.apache.xerces.impl.xs.traversers.XSDHandler_stats.png){#fig:xsd}

![Clustering metrics for class impl.dtd.DTDGrammar](../clustering/org.apache.xerces.impl.dtd.DTDGrammar_stats.png){#fig:dtd}

![Clustering metrics for class xinclude.XIncludeHandler](../clustering/org.apache.xerces.xinclude.XIncludeHandler_stats.png){#fig:xinc}

![Clustering metrics for class dom.CoreDocumentImpl](../clustering/org.apache.xerces.dom.CoreDocumentImpl_stats.png){#fig:cimpl}

::: {#tab:sumup}
| **Class Name**            | **KMeans K** |   **KMeans silhouette** |   **Hierarchical K** |   **Hierarchical silhouette** |
|:----------------------------|-----------:|--------------------:|-----------------:|--------------------------:|
| dom.CoreDocumentImpl        |         45 |0.7290 |               45 |                    0.7290 |
| impl.xs.traversers.XSDHandler |          2 |0.5986 |                3 |                    0.5989 |
| impl.dtd.DTDGrammar           |         58 |0.3980 |                2 |                    0.4355 |
| xinclude.XIncludeHandler      |          2 |0.6980 |                2 |                    0.6856 |

  : Optimal hyperparameters and corresponding silhouette metrics for the KMeans and
Hierarchical clustering algorithms.
:::

# Evaluation

## Ground Truth

I computed the ground truth using the Python script `./ground_truth.py`. The
generated files are checked into the repository with the names
`clustering/{className}_groundtruth.csv`, where `{className}` is the FQDN of each
god class.

The ground truth in this project is not given but generated according to simple
heuristics. Since no inherent structure or expert labelling exists to group the
methods in each god class, the project requires labelling methods based on
keyword matching within each method name. The list of keywords used can be
found in `keyword_list.txt`. This approach provides a ground truth with little
computational cost and labelling effort, but it assumes that the method names
and the chosen keywords are significant enough to form a meaningful clustering
of methods into refactorable, cohesive units of functionality.
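
The keyword-based labelling can be sketched as follows. The matching rules are assumptions, since they are not detailed above: here the comparison is case-insensitive, the first matching keyword in list order wins, and methods matching no keyword share the fallback label `none`. The keywords and method names in the example are hypothetical.

```python
def label_methods(method_names: list[str], keywords: list[str]) -> dict[str, str]:
    """Assign each method the first keyword contained in its name."""
    labels = {}
    for name in method_names:
        lowered = name.lower()
        # Assumption: the first matching keyword wins and unmatched methods
        # share the fallback label "none". Keywords are expected in lowercase.
        labels[name] = next((kw for kw in keywords if kw in lowered), "none")
    return labels


# Hypothetical keywords; the real ones are listed in keyword_list.txt.
keywords = ["create", "element", "parse"]
methods = ["createElement", "parseDTD", "getElementById", "reset"]
print(label_methods(methods, keywords))
```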

## Precision and Recall

::: {#tab:eval}
| **Class Name**                                      | **KMeans Precision**   | **KMeans Recall**   | **Agglomerative Precision**   | **Agglomerative Recall**   |
|:------------------------------------------------|-------------------:|----------------:|--------------------------:|-----------------------:|
| xinclude.XIncludeHandler      | 69.83%             | 97.80%           | 69.58%                    | 95.65%                 |
| dom.CoreDocumentImpl          | 64.80%              | 28.26%          | 68.11%                    | 29.70%                  |
| impl.xs.traversers.XSDHandler | 36.17%             | 97.24%          | 36.45%                    | 96.11%                 |
| impl.dtd.DTDGrammar           | 87.65%             | 6.87%           | 52.21%                    | 94.28%                 |

  : Evaluation Summary
:::

Precision and Recall, for the optimal configurations found in Section
[3](#sec:clustering){reference-type="ref" reference="sec:clustering"}, are
reported in Table [4](#tab:eval){reference-type="ref" reference="tab:eval"}.
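
How the two metrics are computed is not spelled out here. One common choice for comparing a clustering against a ground truth, assumed in the sketch below, is to work on intra-cluster pairs: precision is the fraction of method pairs grouped together by the algorithm that are also together in the ground truth, and recall is the fraction of ground truth pairs recovered by the algorithm. The function names and sample labels are hypothetical.

```python
from itertools import combinations


def intra_pairs(labels: dict[str, object]) -> set[frozenset]:
    """All unordered pairs of methods that share the same label."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(labels), 2)
        if labels[a] == labels[b]
    }


def precision_recall(clusters: dict[str, object], ground_truth: dict[str, object]) -> tuple[float, float]:
    """Pairwise precision and recall of a clustering (an assumed definition)."""
    predicted = intra_pairs(clusters)
    expected = intra_pairs(ground_truth)
    common = predicted & expected
    precision = len(common) / len(predicted) if predicted else 0.0
    recall = len(common) / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical labels for four methods.
clusters = {"a": 0, "b": 0, "c": 1, "d": 1}
truth = {"a": "x", "b": "x", "c": "x", "d": "y"}
print(precision_recall(clusters, truth))
```

Under this definition a clustering with very few clusters contains many pairs and tends towards high recall, while a clustering with many small clusters contains few pairs and tends towards low recall.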

\begin{center}
\color{red} comment precision and recall values
\end{center}

## Practical Usefulness

\begin{center}
\color{red}Discuss the practical usefulness of the obtained code refactoring assistant in a
realistic setting (1 paragraph).
\end{center}