# Project 02: Multi-source code search
**Claudio Maggioni**
## About the Project
This project develops a search engine able to query a large Python code repository using multiple
sources of information. It is part of the Knowledge Analysis & Management 2022 course at the
Università della Svizzera italiana.
In this repository, you can find the following files:
- `tensorflow`: the code repository to be searched during this project;
- `ground-truth-unique`: a file containing the reference triples needed to evaluate the search engine (step 3).
For more information, see the Project-02 slides (available on iCourse).
## Environment setup
To install the required dependencies, make sure `python3` points to a Python 3.10 or 3.11 installation, then run:
```shell
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```
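To verify that the virtual environment is set up correctly, one can check which interpreter is in use (the exact version string depends on the local installation):
```shell
# Should report Python 3.10.x or 3.11.x
python3 --version
# After `source env/bin/activate`, this should resolve to ./env/bin/python3
which python3
```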
## Part 1: Data extraction
To extract the data into the file `data.csv`, run:
```shell
python3 extract-data.py
```
The script prints the requested counts, namely:
```
Methods: 5817
Functions: 4565
Classes: 1882
Python Files: 2817
```
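As a quick sanity check, the generated CSV can be inspected directly from the shell (the exact column layout depends on how `extract-data.py` writes the file, so the header may differ from what is shown here):
```shell
# Show the header row and the first few extracted entities
head -n 5 data.csv
# Count the number of rows written (including the header, if present)
wc -l data.csv
```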
## Part 2: Training
To train the models and retrieve results for a given query, run:
```shell
python3 search-data.py [method] "[query]"
```
where `[method]` is one of `{tfidf,freq,lsi,doc2vec}`, or `all` to run all classifiers, and `[query]` is the natural
language query to search for. Results are printed on stdout; for `doc2vec`, the trained model is saved to
`./doc2vec_model.dat` and loaded from that path on subsequent executions.
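For example, the following invocation searches the corpus with the TF-IDF engine (the query string is only an illustration of the expected quoting, not one of the ground-truth queries):
```shell
python3 search-data.py tfidf "convert a tensor into a numpy array"
```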
## Part 3: Evaluation
To evaluate a model, run:
```shell
python3 search-data.py [method] ./ground-truth-unique.txt
```
where `[method]` is one of `{tfidf,freq,lsi,doc2vec}`, or `all` to evaluate all classifiers. The script reports the
average precision and recall of each classifier, namely:
| Engine | Average Precision | Average Recall |
|:---------|:--------------------|:-----------------|
| tfidf | 90.00% | 90.00% |
| freq | 93.33% | 100.00% |
| lsi | 90.00% | 90.00% |
| doc2vec | 73.33% | 80.00% |
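For instance, to evaluate all four engines in a single run with the ground-truth file shipped in the repository root:
```shell
python3 search-data.py all ./ground-truth-unique.txt
```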
## Report
To compile the report, run (the second `pdflatex` pass resolves cross-references):
```shell
cd report
pdflatex -interaction=nonstopmode -output-directory=. main.tex
pdflatex -interaction=nonstopmode -output-directory=. main.tex
```