# Project 02: Multi-source code search

## About the Project

This project aims to develop a search engine that queries a large Python code repository using multiple sources of information.
It is part of the Knowledge Analysis & Management 2022 course at the Università della Svizzera italiana.

In this repository, you can find the following files:
- tensor flow: a code repository to be used during this project
- ground-truth-unique: a file containing the reference triples needed to evaluate the search engine (Part 3)

For more information, see the Project-02 slides (available on iCourse).

Note: Feel free to modify this file according to the project's needs.

## Environment setup

To install the required dependencies, make sure `python3` points to a Python 3.10 or 3.11 installation, then run:

```shell
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```
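Before creating the virtual environment, the interpreter version can be checked from Python itself. The snippet below is only a convenience sketch: the helper `is_supported` and the `SUPPORTED` set are hypothetical names that are not part of this repository, and the check mirrors the 3.10/3.11 requirement stated above.

```python
import sys

# Versions named in this README; anything else is treated as untested.
SUPPORTED = {(3, 10), (3, 11)}


def is_supported(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) pair is one of the supported versions."""
    return (version_info[0], version_info[1]) in SUPPORTED


if not is_supported():
    major, minor = sys.version_info[0], sys.version_info[1]
    print(f"Warning: Python {major}.{minor} is untested; use Python 3.10 or 3.11.")
```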

## Part 1: Data extraction

To extract the data into the file `data.csv`, run the command:

```shell
python3 extract-data.py
```

The script prints the requested counts:

```
Methods: 5817
Functions: 4565
Classes: 1882
Python Files: 2817
```
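The counts above separate methods from functions. As a rough illustration of how such a distinction could be made with the standard `ast` module, here is a minimal sketch. It is not the actual `extract-data.py`: the helper name `count_entities` is hypothetical, and the rule that only module-level classes and functions (plus functions defined directly in a class body) are counted is an assumption.

```python
import ast
from collections import Counter


def count_entities(source: str) -> Counter:
    """Count top-level functions, top-level classes, and their methods."""
    counts = Counter()
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.parse(source).body:
        if isinstance(node, func_types):
            counts["functions"] += 1
        elif isinstance(node, ast.ClassDef):
            counts["classes"] += 1
            # Assumption: a method is a function defined directly in a class body.
            counts["methods"] += sum(isinstance(n, func_types) for n in node.body)
    return counts


example = '''
def helper():
    pass

class Model:
    def fit(self):
        pass

    def predict(self):
        pass
'''
print(count_entities(example))
```

Running the same function over every `.py` file of the code repository and summing the counters would give totals comparable to those printed by the script, provided the counting rules match.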

## Part 2: Training

To train a model and retrieve the results for a given query, run the command:

```shell
python3 search-data.py [method] "[query]"
```

where `[method]` is one of `{tfidf,freq,lsi,doc2vec}`, or `all` to run all search engines, and `[query]` is the natural
language query to search for. Results are printed to stdout. For `doc2vec`, the trained model is saved to
`./doc2vec_model.dat` and loaded from that path on subsequent executions.
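To give an idea of what the `tfidf` method does conceptually, the sketch below ranks documents against a query by cosine similarity of TF-IDF vectors, using only the standard library. It is not the code in `search-data.py`, which may rely on a dedicated library; the names `tokenize` and `tfidf_search`, the whitespace tokenizer, and the `log(N / df) + 1` weighting are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase, split snake_case names, split on whitespace.
    return text.lower().replace("_", " ").split()


def tfidf_search(query: str, documents: list[str], top_n: int = 5) -> list[tuple[int, float]]:
    """Return (document index, cosine similarity) pairs, best match first."""
    docs = [Counter(tokenize(d)) for d in documents]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in doc)
    idf = {term: math.log(len(docs) / count) + 1.0 for term, count in df.items()}

    def vectorize(tf: Counter) -> dict[str, float]:
        # Terms unseen in the corpus are dropped.
        return {t: f * idf[t] for t, f in tf.items() if t in idf}

    def cosine(a: dict[str, float], b: dict[str, float]) -> float:
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
        return dot / norm if norm else 0.0

    q = vectorize(Counter(tokenize(query)))
    scored = [(i, cosine(q, vectorize(doc))) for i, doc in enumerate(docs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


corpus = ["read csv file", "train neural network model", "save model checkpoint file"]
results = tfidf_search("train model", corpus)
print(results)
```

In this toy corpus the second document matches both query terms and ranks first, the third matches only `model`, and the first has similarity zero.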

## Part 3: Evaluation

To evaluate a model, run the command:

```shell
python3 search-data.py [method] ./ground-truth-unique.txt
```

where `[method]` is one of `{tfidf,freq,lsi,doc2vec}`, or `all` to evaluate all search engines. The script outputs the
performance of each engine in terms of average precision and average recall:

| Engine   | Average Precision   | Average Recall   |
|:---------|:--------------------|:-----------------|
| tfidf    | 20.00%              | 20.00%           |
| freq     | 27.00%              | 40.00%           |
| lsi      | 4.00%               | 20.00%           |
| doc2vec  | 10.00%              | 10.00%           |
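This README does not spell out how the two metrics are computed, so the following is only a sketch under stated assumptions: for each ground-truth query, precision is taken as `1 / rank` of the expected entity in the returned list (0 if it is missing), and recall is the fraction of queries for which the expected entity is returned at all. The function name `evaluate` and the simplified query-to-name mapping (instead of full reference triples) are hypothetical.

```python
def evaluate(results: dict[str, list[str]], ground_truth: dict[str, str]) -> tuple[float, float]:
    """Return (average precision, average recall) over all ground-truth queries."""
    precisions = []
    hits = 0
    for query, expected in ground_truth.items():
        ranked = results.get(query, [])
        if expected in ranked:
            hits += 1
            # Assumed definition: reciprocal of the 1-based rank of the correct answer.
            precisions.append(1.0 / (ranked.index(expected) + 1))
        else:
            precisions.append(0.0)
    n = len(ground_truth)
    return sum(precisions) / n, hits / n


truth = {"q1": "a", "q2": "b"}
returned = {"q1": ["x", "a"], "q2": ["y", "z"]}
avg_precision, avg_recall = evaluate(returned, truth)
print(f"precision={avg_precision:.2%} recall={avg_recall:.2%}")
```

Under these definitions average precision can never exceed average recall, which is consistent with the table above.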