---
author: Claudio Maggioni
title: Information Modelling & Analysis -- Project 1
geometry: margin=2.5cm,bottom=3cm
---

<!--
The following shows a minimal submission report for project 1. If you
choose to use this template, replace all template instructions (the
yellow bits) with your own values. In addition, for any section, if
**and only if** anything was unclear or warnings were raised by the
code, and you had to take assumptions about the correct implementation
(e.g., about details of a metric), describe your assumptions in one or
two sentences.

You may - at your own risk - also choose not to use this template. As
long as your submission is a latex-generated, English PDF containing all
expected info, you'll be fine.
-->

# Code Repository

The code and result files that are part of this submission can be found at:

::: center
Repository: \url{https://github.com/infoMA2023/project-01-god-classes-maggicl}

Commit ID: **TBD**
:::

# Data Pre-Processing

## God Classes

The first part of the project requires labelling some classes of the _Xerces_
project as "God classes" based on the number of methods each class has.
Specifically, I label as "God classes" the classes whose number of methods is more than
six standard deviations above the mean number of methods, i.e. where
the condition

$$|M(C)| > \mu(M) + 6\sigma(M)$$

holds.

To scan and compute the number of methods of each class I use the Python library `javalang`, which implements the Java AST and parser. The Python script
`./find_god_classes.py` uses this library to parse each file in the project and 
compute the number of methods of each class. Note that only non-constructor methods are counted (specifically the code counts the number of `method` nodes in each `ClassDeclaration` node).

Then, the script computes the mean and standard
deviation of the number of methods and filters the list of classes according to the
condition described above. Finally, the script writes the file `god_classes/god_classes.csv`,
which lists all the god classes found.
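The filtering step can be sketched as follows. This is a minimal illustration and not the code of `find_god_classes.py`: the function name `find_god_classes`, the parameter `k` and the use of the population standard deviation (`statistics.pstdev`) are assumptions, since the report does not state which estimator the script uses.

```python
from statistics import mean, pstdev


def find_god_classes(method_counts, k=6.0):
    """Return the classes whose method count exceeds mean + k * std.

    method_counts maps a class name to its number of methods.
    Assumption: the population standard deviation is used.
    """
    counts = list(method_counts.values())
    threshold = mean(counts) + k * pstdev(counts)
    # Strict inequality, as in the condition |M(C)| > mu(M) + 6 sigma(M)
    return {name: n for name, n in method_counts.items() if n > threshold}


# Hypothetical data: 99 small classes and one very large class
example_counts = {f"Small{i}": 5 for i in range(99)}
example_counts["Big"] = 120
god_classes = find_god_classes(example_counts)
```

With the population standard deviation a single outlier among $n$ classes can lie at most $\sqrt{n-1}$ standard deviations above the mean, so a threshold of $6\sigma$ can only be exceeded when the project has more than 37 classes.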

The god classes I identified and their corresponding number of methods
can be found in Table [1](#tab:god_classes){reference-type="ref"
reference="tab:god_classes"}. 

::: {#tab:god_classes}
| **Class Name**                                 | **# Methods** |
|:------------------------------------------------|------------:|
| org.apache.xerces.impl.xs.traversers.XSDHandler |         118 |
| org.apache.xerces.impl.dtd.DTDGrammar           |         101 |
| org.apache.xerces.xinclude.XIncludeHandler      |         116 |
| org.apache.xerces.dom.CoreDocumentImpl          |         125 |

  : Identified God Classes
:::


## Feature Vectors

In this part of the project I produce the feature vectors later used to cluster
the methods of each God class. I produce one feature vector per
non-constructor Java method in each god class.

The columns of each vector represent
fields and methods referenced by each method, i.e. fields and methods actively used in the method's body.

When analyzing references to fields, additional constraints need to be specified to handle edge cases. 
Namely, a field's property may be referenced (e.g. an access to array `a` may fetch its `length` property, i.e. `a.length`). In these
cases I consider the qualifier (i.e. the field itself, `a`) and not its property. When the qualifier is a class (i.e.
the code references a property of another class, e.g. `Integer.MAX_VALUE`) I consider the class name itself (i.e. `Integer`) and not
the name of the property. Should the qualifier be a subproperty itself (e.g. in `a.b.c`, where `a.b` would be the qualifier according to `javalang`)

For methods, I only consider calls to methods of the class itself where the qualifier is unspecified or `this`. Calls to parent methods
(i.e. calls like `super.something()`) are not considered.
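The rules above for single-component qualifiers can be sketched as follows. The helpers `field_feature` and `counts_as_method_feature` are hypothetical names introduced for illustration, and the sketch works on plain strings instead of `javalang` nodes. Nested qualifiers and `this`-qualified field accesses are deliberately left out, as the sketch only covers the cases stated in the text.

```python
def field_feature(qualifier, member):
    """Map a member reference to the name used as a feature column.

    qualifier is the text before the last dot ('' when absent),
    member is the name after it.
    """
    if not qualifier:
        return member  # plain reference, e.g. `count`
    if "." in qualifier or qualifier == "this":
        # Not covered by this sketch (e.g. `a.b.c` or `this.a`)
        raise NotImplementedError("qualifier form outside this sketch")
    # `a.length` -> `a`, `Integer.MAX_VALUE` -> `Integer`
    return qualifier


def counts_as_method_feature(qualifier):
    """Only unqualified calls and calls on `this` are considered."""
    return qualifier in ("", "this")
```

For example, `field_feature("a", "length")` yields `a`, while `counts_as_method_feature("super")` is false, so a call like `super.something()` does not contribute a feature.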

The feature vector extraction phase is performed by the Python script `extract_feature_vectors.py`. The script takes `god_classes/god_classes.csv` as input
and loads the AST of each class listed in it. Then, a list of all the fields and methods in the class is built, and each method is scanned to see which fields
and methods it references in its body according to the previously described rules. Then, a CSV file per class is built storing all feature vectors. Each file is named after the fully-qualified name of the class. Each CSV row refers to a method in the class, and each CSV column refers to a field, method or referenced class. A cell has the value 1 when the method of that row references the field, method or class marked by that column, and the value 0 otherwise. Columns containing only zeros are omitted.
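The construction of the binary matrix and the omission of all-zero columns can be sketched as follows. This is an illustration under assumptions and not the code of `extract_feature_vectors.py`: the function names, the `method` header of the first column and the example identifiers (`fBuffer`, `fUnused`, `parse`, `reset`, `close`) are hypothetical.

```python
import csv
import io


def build_feature_rows(references, candidates):
    """Build a 0/1 matrix with one row per method.

    references maps a method name to the set of names it references,
    candidates lists all possible columns in a fixed order.
    Columns that no method references are dropped.
    """
    used = [c for c in candidates
            if any(c in refs for refs in references.values())]
    header = ["method"] + used
    rows = [[name] + [1 if c in refs else 0 for c in used]
            for name, refs in references.items()]
    return header, rows


def to_csv(header, rows):
    """Serialize the matrix as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical references extracted from a small class
references = {
    "parse": {"fBuffer", "reset"},
    "reset": {"fBuffer"},
    "close": set(),
}
candidates = ["fBuffer", "fUnused", "reset", "Integer"]
header, rows = build_feature_rows(references, candidates)
csv_text = to_csv(header, rows)
```

Here `fUnused` and `Integer` are never referenced, so the resulting header only contains `method`, `fBuffer` and `reset`.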

Table [2](#tab:feat_vec){reference-type="ref" reference="tab:feat_vec"}
shows aggregate numbers regarding the extracted feature vectors for the
god classes. Note that the number of attributes refers to the number of fields, methods or classes actually referenced (i.e. the number of columns left after omitting the all-zero columns).

::: {#tab:feat_vec}
| **Class Name**                                  | **# Feature Vectors** | **# Attributes\*** |
|:------------------------------------------------|----------------------:|-----------------:|
| org.apache.xerces.impl.xs.traversers.XSDHandler |                 106 |            183 |
| org.apache.xerces.impl.dtd.DTDGrammar           |                  91 |            106 |
| org.apache.xerces.xinclude.XIncludeHandler      |                 108 |            143 |
| org.apache.xerces.dom.CoreDocumentImpl          |                 117 |             63 |

  : Feature vector summary (\*= used at least once)
:::

# Clustering {#sec:clustering}

## Algorithm Configurations

Report/comment the algorithm configurations (distance function, linkage
rule, etc.). You may do so in any form you feel suited, but a short
paragraph of text is probably sufficient.

## Testing Various K & Silhouette Scores

\(1\) Report data about the clusters produced by the two algorithms at
various k (#clusters, size of clusters, silhouette scores). You may use
any suitable format (table, graph, \...).

\(2\) Briefly comment your results. What is the best configuration, and
why? Anything else you observed?

# Evaluation

## Ground Truth

I computed the ground truth using the command \.... The generated files
are checked into the repository with the names \....

Comment briefly on the strengths & weaknesses of our ground truth.

## Precision and Recall

::: {#tab:eval}
  ---------------- ------------------- -------- ------------- --------
  **Class Name**    **Agglomerative**            **K-Means**  
                          Prec.         Recall      Prec.      Recall
  \...                    \...           \...       \...        \...
  ---------------- ------------------- -------- ------------- --------

  : Evaluation Summary
:::

Precision and Recall, for the optimal configurations found in Section
[3](#sec:clustering){reference-type="ref" reference="sec:clustering"},
are reported in Table [3](#tab:eval){reference-type="ref"
reference="tab:eval"}.

## Practical Usefulness

Discuss the practical usefulness of the obtained code refactoring
assistant in a realistic setting (1 paragraph).