# Project 02: Multi-source code search

### About the Project

The goal of this project is to develop a search engine that can query a large Python code repository using multiple sources of information.
It is part of the Knowledge Analysis & Management - 2022 course at the Università della Svizzera italiana.

In this repository, you can find the following files:
- tensor flow: a code repository to be used during this project
- ground-truth-unique: a file containing the reference triples necessary to evaluate the search engine (step 3)

For more information, see the Project-02 slides (available on iCourse).

Note: Feel free to modify this file according to the project's needs.

## Environment setup

To install the required dependencies, make sure `python3` points to a Python 3.10 or 3.11 installation and then run:

```shell
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```

## Part 1: data extraction

To extract the data into the file `data.csv`, run the command:

```shell
python3 extract-data.py
```

The script prints the requested counts:

```
Methods: 5817
Functions: 4565
Classes: 1882
Python Files: 2817
```

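As a rough illustration of how such counts can be obtained, the sketch below uses Python's standard `ast` module on a single source string. It is not the code of `extract-data.py`: the names `count_entities` and `SAMPLE` are hypothetical, and it assumes that only top-level classes, their direct methods, and top-level functions are counted. Any filtering rules the real script applies (for example on names) are not described here and are not reproduced.

```python
import ast


def count_entities(source: str) -> dict[str, int]:
    """Count top-level classes, their direct methods, and top-level functions.

    Assumption: nested definitions and async functions are ignored.
    """
    counts = {"classes": 0, "methods": 0, "functions": 0}
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            counts["classes"] += 1
            # Functions defined directly in the class body count as methods.
            counts["methods"] += sum(
                isinstance(child, ast.FunctionDef) for child in node.body
            )
        elif isinstance(node, ast.FunctionDef):
            counts["functions"] += 1
    return counts


# Hypothetical sample input standing in for one file of the repository.
SAMPLE = '''
class Greeter:
    def hello(self):
        return "hi"

    def bye(self):
        return "bye"

def main():
    return Greeter().hello()
'''

print(count_entities(SAMPLE))  # {'classes': 1, 'methods': 2, 'functions': 1}
```

Applying the same function to every `.py` file and summing the results would give repository-wide totals under these assumptions.
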
## Part 2: Training

To train a model and search with a given query, run the command:

```shell
python3 search-data.py [method] "[query]"
```

where `[method]` is one of `{tfidf,freq,lsi,doc2vec}`, or `all` to run all search engines, and `[query]` is the natural
language query to search for. Results are printed to stdout. For `doc2vec`, the trained model is saved to
`./doc2vec_model.dat` and loaded from that path on subsequent executions.

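To give an idea of what a frequency-based search can look like, here is a minimal standard-library sketch. It assumes that `freq` means ranking by cosine similarity between raw term-frequency vectors; this README does not confirm that, and `search-data.py` may rely on a library instead. The names `tokenize`, `cosine`, `search`, and `CORPUS` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query: str, corpus: dict[str, str], top_k: int = 5) -> list[tuple[str, float]]:
    """Rank corpus entries by similarity to the query, best first."""
    q = Counter(tokenize(query))
    scored = [(name, cosine(q, Counter(tokenize(text)))) for name, text in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical corpus: entity name -> text describing the entity.
CORPUS = {
    "read_csv_file": "read csv file and parse rows",
    "optimizer_step": "apply gradient update to variables",
    "save_model": "save model weights to file",
}

results = search("parse a csv file", CORPUS)
print(results[0][0])  # read_csv_file
```

The other methods would replace the raw counts with differently weighted or learned vectors while keeping the same rank-by-similarity structure.
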
## Part 3: Evaluation

To evaluate a model, run the command:

```shell
python3 search-data.py [method] ./ground-truth-unique.txt
```

where `[method]` is one of `{tfidf,freq,lsi,doc2vec}`, or `all` to evaluate all search engines. The script outputs the
performance of each engine in terms of average precision and recall:

| Engine   | Average Precision   | Average Recall   |
|:---------|:--------------------|:-----------------|
| tfidf    | 20.00%              | 20.00%           |
| freq     | 27.00%              | 40.00%           |
| lsi      | 4.00%               | 20.00%           |
| doc2vec  | 10.00%              | 10.00%           |
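
This README does not define how average precision and recall are computed. The sketch below shows one possible definition, under the assumption that each ground-truth query has exactly one correct answer: precision for a query is `1 / rank` of the correct answer when it appears among the returned results (0 otherwise), and recall is 1 when it appears (0 otherwise). The names `evaluate`, `TRUTH`, and `RANKINGS` are hypothetical, and the actual script may use a different definition.

```python
def evaluate(rankings: dict[str, list[str]], truth: dict[str, str]) -> tuple[float, float]:
    """Return (average precision, average recall) over all ground-truth queries.

    Assumption: one correct answer per query; precision is the reciprocal rank.
    """
    if not truth:
        return 0.0, 0.0
    precisions, recalls = [], []
    for query, expected in truth.items():
        results = rankings.get(query, [])
        if expected in results:
            position = results.index(expected) + 1  # 1-based rank of the hit
            precisions.append(1 / position)
            recalls.append(1.0)
        else:
            precisions.append(0.0)
            recalls.append(0.0)
    n = len(truth)
    return sum(precisions) / n, sum(recalls) / n


# Hypothetical toy data: query -> expected entity, query -> ranked results.
TRUTH = {"q1": "a", "q2": "b", "q3": "c", "q4": "d"}
RANKINGS = {
    "q1": ["a", "x", "y"],  # hit at rank 1 -> precision 1.0
    "q2": ["x", "b", "y"],  # hit at rank 2 -> precision 0.5
    "q3": ["x", "y", "z"],  # miss          -> precision 0.0
    "q4": ["x", "y", "z"],  # miss          -> precision 0.0
}

avg_precision, avg_recall = evaluate(RANKINGS, TRUTH)
print(f"{avg_precision:.2%} {avg_recall:.2%}")  # 37.50% 50.00%
```

Under this definition, recall can never be lower than precision, because every hit contributes 1 to recall and at most 1 to precision.
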