The first part of the project requires labeling some classes of the _Xerces_
project as "God classes" based on the number of methods each class has.
Specifically, I label as "God classes" the classes whose number of methods
exceeds the mean number of methods by more than six standard deviations, i.e. where
the condition
$$|M(C)| > \mu(M) + 6\sigma(M)$$
holds, where $|M(C)|$ is the number of methods of class $C$ and $\mu(M)$ and $\sigma(M)$ are the mean and standard deviation of the number of methods per class.
To compute the number of methods of each class I use the Python library `javalang`, which implements a Java parser and AST. The Python script
`./find_god_classes.py` uses this library to parse each file in the project and
count the methods of each class. Note that only non-constructor methods are counted (specifically, the code counts the number of `method` nodes in each `ClassDeclaration` node).
The script then computes the mean and standard
deviation of the number of methods and filters the list of classes according to the
condition described above. Finally, it writes the file `god_classes/god_classes.csv`,
which lists all the god classes found.
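The filtering step can be sketched as follows. This is a minimal illustration and not the code of `find_god_classes.py`: the function name `find_god_classes`, the input dictionary and the use of the population standard deviation (`statistics.pstdev`) are assumptions, since the text does not state which estimator the script uses.

```python
from statistics import mean, pstdev


def find_god_classes(method_counts, k=6.0):
    """Return the classes whose method count exceeds mean + k * std.

    `method_counts` maps a class name to its number of non-constructor
    methods (hypothetical input format for this sketch).
    """
    counts = list(method_counts.values())
    mu = mean(counts)
    # Assumption: population standard deviation; the original script
    # may use the sample estimator instead.
    sigma = pstdev(counts)
    threshold = mu + k * sigma
    return {name: n for name, n in method_counts.items() if n > threshold}


# Hypothetical data: 49 small classes and one very large class.
example = {f"pkg.Small{i}": 10 for i in range(49)}
example["pkg.Huge"] = 200
print(find_god_classes(example))
```

With a threshold this strict, a single outlier can only exceed it when the project contains enough classes (more than 37 under the population estimator), which is why the example uses 50 entries.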
The god classes I identified and their corresponding numbers of methods
can be found in Table [1](#tab:god_classes){reference-type="ref" reference="tab:god_classes"}.
fields and methods referenced by each method, i.e. fields and methods actively used by the method in its body.
When analyzing references to fields, additional constraints need to be specified to handle edge cases.
Namely, a property of a field may be referenced (e.g. an access to array `a` may fetch its `length` property, i.e. `a.length`). In these
cases I consider the qualifier (i.e. the field itself, `a`) and not its property. When the qualifier is a class (i.e.
the code references a property of another class, e.g. `Integer.MAX_VALUE`) I consider the class name itself (i.e. `Integer`) and not
the name of the property. Should the qualifier be a subproperty itself (e.g. in `a.b.c`, where `a.b` would be the qualifier according to `javalang`)
For methods, I only consider calls to methods of the class itself where the qualifier is unspecified or `this`. Calls to parent methods
(i.e. calls like `super.something()`) are not considered.
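The rules above can be summarized with two small helper functions. This is only a sketch: the names `field_reference_name` and `is_own_method_call` are hypothetical, the inputs are plain strings rather than `javalang` nodes, and treating a `this` qualifier on a field like a missing qualifier is an assumption. Dotted qualifiers such as `a.b` are deliberately left unhandled here.

```python
def field_reference_name(qualifier, member):
    """Return the name recorded for a field access, or None if unhandled.

    `count`             -> qualifier "",        member "count"     -> "count"
    `a.length`          -> qualifier "a",       member "length"    -> "a"
    `Integer.MAX_VALUE` -> qualifier "Integer", member "MAX_VALUE" -> "Integer"
    """
    # Assumption: `this.count` is treated like an unqualified `count`.
    if not qualifier or qualifier == "this":
        return member
    if "." in qualifier:
        # Nested qualifiers (e.g. `a.b` in `a.b.c`) are not covered by this sketch.
        return None
    # A property of a field or of a class: record the qualifier, not the property.
    return qualifier


def is_own_method_call(qualifier, member, own_methods):
    """True if the call targets a method of the class itself.

    Only calls with no qualifier or with `this` count; `super.something()`
    and calls on other objects are ignored.
    """
    return qualifier in (None, "", "this") and member in own_methods
```

For example, `is_own_method_call("super", "reset", {"reset"})` is false, so a call to a parent method does not contribute to the feature vector.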
The feature vector extraction phase is performed by the Python script `extract_feature_vectors.py`. The script takes `god_classes/god_classes.csv` as input
and loads the AST of each class listed in it. It then builds a list of all the fields and methods in the class and scans each method to find which fields
and methods it references in its body, according to the rules described above. Finally, it writes one CSV file per class containing all feature vectors. Each file is named after the fully qualified name (FQN) of the class. Each CSV row corresponds to a method of the class, and each column corresponds to a field, method or referenced class. A cell has the value 1 when the method of that row references the field, method or class of that column, and 0 otherwise. Columns containing only zeros are omitted.
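The layout of each per-class CSV can be illustrated with a short sketch. It is not the code of `extract_feature_vectors.py`: the function names, the `method` header of the first column and the in-memory input format (a dictionary from method name to the set of names it references) are assumptions made for the example.

```python
import csv
import io


def build_feature_rows(references, columns):
    """Build the header and the 0/1 rows of the feature matrix.

    `references` maps each method name to the set of names it references;
    `columns` is the ordered list of candidate fields, methods and classes.
    Columns that no method references are dropped.
    """
    used = [c for c in columns if any(c in refs for refs in references.values())]
    header = ["method"] + used  # "method" is a hypothetical label for the row key
    rows = [
        [method] + [1 if c in refs else 0 for c in used]
        for method, refs in references.items()
    ]
    return header, rows


def to_csv(header, rows):
    """Serialize the matrix as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical class with three methods and four candidate columns.
refs = {"parse": {"buffer", "reset"}, "reset": {"buffer"}, "size": set()}
header, rows = build_feature_rows(refs, ["buffer", "limit", "reset", "Integer"])
print(to_csv(header, rows))
```

In this example the `limit` and `Integer` columns are omitted because no method references them, while the `size` row is kept even though it contains only zeros.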
god classes. Note that the number of attributes refers to the number of fields, methods or classes actually referenced (i.e. the number of columns after omission of the all-zero columns).