Project 02: Multi-source code search
About the Project
The goal of this project is to develop a search engine that can query a large Python code repository using multiple sources of information. It is part of the Knowledge Analysis & Management 2022 course at the Università della Svizzera italiana.
In this repository, you can find the following files:
- tensorflow: the code repository used as the search corpus for this project
- ground-truth-unique.txt: a file containing the reference triples needed to evaluate the search engine (step 3)
For more information, see the Project-02 slides (available on iCourse).
Note: Feel free to modify this file according to the project's necessities.
Environment setup
To install the required dependencies, make sure python3 points to a Python 3.10 or 3.11 installation, then run:
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
Part 1: data extraction
To extract the data into the file data.csv, run:
python3 extract-data.py
The script prints the requested counts:
Methods: 5817
Functions: 4565
Classes: 1882
Python Files: 2817
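As an illustration of how such counts can be obtained, here is a minimal sketch using Python's standard ast module. It is not the code in extract-data.py: the function name count_entities is hypothetical, and it assumes that a "method" is a function defined directly in a class body and a "function" is one defined at module level. The actual script may apply other criteria (for example, skipping test or private names).

```python
import ast

def count_entities(source: str) -> dict:
    """Count classes, methods, and top-level functions in one Python source string.

    Assumption: methods are functions defined directly inside a class body;
    functions are those defined at module level. Nested definitions are ignored.
    """
    tree = ast.parse(source)
    counts = {"classes": 0, "methods": 0, "functions": 0}
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            counts["classes"] += 1
            # Every def directly in the class body counts as one method.
            counts["methods"] += sum(isinstance(child, func_types) for child in node.body)
        elif isinstance(node, func_types):
            counts["functions"] += 1
    return counts

example = '''
class Stack:
    def push(self, x): ...
    def pop(self): ...

def helper(): ...
'''
print(count_entities(example))
```

Summing these per-file counts over every .py file under tensorflow would give repository-wide totals under the same assumptions.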
Part 2: Training
To train a model and search the repository for a given query, run:
python3 search-data.py [method] "[query]"
where [method] is one of {tfidf,freq,lsi,doc2vec}, or all to run every search engine, and [query] is the natural-language query to search for. Results are printed to stdout. For doc2vec, the trained model is saved to ./doc2vec_model.dat and loaded from that path on subsequent runs.
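To give an idea of what a tfidf-style search does, the following is a minimal, standard-library-only sketch that ranks documents by cosine similarity between TF-IDF vectors. It is an assumption-laden illustration, not the implementation in search-data.py: the names tokenize, tfidf_vectors, cosine, and search are hypothetical, the tokenizer is a plain whitespace split, and the smoothed IDF formula is one common choice among several.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real preprocessing may differ.
    return text.lower().split()

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]
    return vectors, idf

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, docs, top_k=5):
    """Return the indices of the top_k documents most similar to the query."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms never seen in the corpus carry no weight and are dropped.
    q = {t: c * idf[t] for t, c in Counter(tokenize(query)).items() if t in idf}
    ranked = sorted(range(len(docs)), key=lambda i: cosine(q, vectors[i]), reverse=True)
    return ranked[:top_k]

# Hypothetical three-document corpus for illustration only.
corpus = ["parse json config file", "train neural network model", "read csv file rows"]
print(search("read csv file", corpus, top_k=1))
```

In the project, each "document" would presumably be built from an entity's name and comment in data.csv rather than from free text as in this toy corpus.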
Part 3: Evaluation
To evaluate a model against the ground truth, run:
python3 search-data.py [method] ./ground-truth-unique.txt
where [method] is one of {tfidf,freq,lsi,doc2vec}, or all to evaluate every search engine. The script prints each engine's average precision and average recall:
Engine | Average Precision | Average Recall |
---|---|---|
tfidf | 90.00% | 90.00% |
freq | 93.33% | 100.00% |
lsi | 90.00% | 90.00% |
doc2vec | 73.33% | 80.00% |
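For readers unfamiliar with these metrics, here is a minimal sketch of how averages of this kind can be computed. The exact definitions are given in the project slides and are not stated in this README, so the following is an assumption: each query has a single expected entity, precision for a query is the reciprocal of the rank at which that entity appears in the top results (0 if absent), and recall is 1 if it appears at all. The function name evaluate and the sample data are hypothetical.

```python
def evaluate(results, ground_truth, top_k=5):
    """Return (average precision, average recall) over all ground-truth queries.

    results: maps each query to its ranked list of returned entities.
    ground_truth: maps each query to the single expected entity.
    Assumed metric definitions: precision = 1 / rank of the expected entity
    within the top_k results (0 if missing); recall = 1 if found, else 0.
    """
    precisions, recalls = [], []
    for query, expected in ground_truth.items():
        ranked = results.get(query, [])[:top_k]
        if expected in ranked:
            precisions.append(1.0 / (ranked.index(expected) + 1))
            recalls.append(1.0)
        else:
            precisions.append(0.0)
            recalls.append(0.0)
    n = len(ground_truth)
    return sum(precisions) / n, sum(recalls) / n

# Hypothetical data: the first query hits at rank 1, the second at rank 2.
truth = {"q1": "a", "q2": "b"}
found = {"q1": ["a", "x"], "q2": ["y", "b"]}
print(evaluate(found, truth))
```

Under these assumed definitions, an engine that always finds the expected entity but not always in first place scores 100% recall with a precision below 100%.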
Report
To compile the report, run (pdflatex is invoked twice so that cross-references resolve):
cd report
pdflatex -interaction=nonstopmode -output-directory=. main.tex
pdflatex -interaction=nonstopmode -output-directory=. main.tex