
Project 02: Multi-source code search

Claudio Maggioni

About the Project

The goal of this project is to develop a search engine able to query a large Python code repository using multiple sources of information. It is part of the Knowledge Analysis & Management 2022 course at the Università della Svizzera italiana.

In this repository, you can find the following files:

  • tensorflow: the code repository to be queried during this project
  • ground-truth-unique.txt: a file containing the reference triples needed to evaluate the search engine (step 3)

For more information, see the Project-02 slides (available on iCourse)


Environment setup

To install the required dependencies, make sure python3 points to a Python 3.10 or 3.11 installation, then run:

python3 -m venv env
source env/bin/activate
pip install -r requirements.txt

Part 1: data extraction

To extract the data into the file data.csv, run:

python3 extract-data.py

The script also prints the number of entities extracted:

Methods: 5817
Functions: 4565
Classes: 1882
Python Files: 2817
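
The kind of counting extract-data.py performs can be sketched with Python's standard ast module. This is a minimal illustration of the idea, not the actual script; the function name count_definitions is hypothetical:

```python
import ast
from collections import Counter

def count_definitions(source: str) -> dict:
    """Count classes, methods, and free functions in one Python source string."""
    tree = ast.parse(source)
    counts = Counter(classes=0, methods=0)
    total_funcs = 0
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            counts["classes"] += 1
            # Functions defined directly inside a class body count as methods.
            counts["methods"] += sum(
                isinstance(c, (ast.FunctionDef, ast.AsyncFunctionDef))
                for c in node.body
            )
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            total_funcs += 1
    # ast.walk visits methods as FunctionDef too, so subtract them.
    counts["functions"] = total_funcs - counts["methods"]
    return dict(counts)
```

Applying such a function to every .py file in the repository and summing the results yields the totals above.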

Part 2: Training

To train a model and rank results for a given query, run:

python3 search-data.py [method] "[query]"

where [method] is one of {tfidf,freq,lsi,doc2vec}, or all to run all classifiers, and [query] is the natural-language query to search for. Results are printed to stdout. For doc2vec, the trained model is saved to ./doc2vec_model.dat and loaded from that path on subsequent runs.
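
The tfidf engine's ranking can be sketched in plain Python: weight each term by tf × idf and rank documents by cosine similarity to the query. This is an illustration of the technique only, assuming a toy corpus; the real script's tokenization and weighting may differ:

```python
import math
from collections import Counter

# Hypothetical toy corpus: name -> docstring-like text.
corpus = {
    "parse_ast": "parse python source file into an ast tree",
    "grad": "compute tensor gradients with backpropagation",
    "save_ckpt": "save model checkpoint weights to disk",
}

def build_index(docs):
    """Return (idf weights, tf-idf vectors) for a {name: text} corpus."""
    tokens = {name: text.split() for name, text in docs.items()}
    df = Counter()
    for toks in tokens.values():
        df.update(set(toks))
    n = len(docs)
    # Smoothed idf so terms occurring in every document keep a small weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    vecs = {name: {t: c * idf[t] for t, c in Counter(toks).items()}
            for name, toks in tokens.items()}
    return idf, vecs

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query, idf, vecs):
    """Rank corpus entries by cosine similarity to the query."""
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(query.split()).items()}
    return sorted(((cosine(q, v), name) for name, v in vecs.items()),
                  reverse=True)
```

A query such as "save model to disk" would rank "save_ckpt" first in this toy index.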

Part 3: Evaluation

To evaluate a model against the ground truth, run:

python3 prec-recall.py [method] ./ground-truth-unique.txt

where [method] is one of {tfidf,freq,lsi,doc2vec}, or all to evaluate all classifiers. The script reports each classifier's performance in terms of average precision and recall:

Engine    Average Precision    Average Recall
tfidf     90.00%               90.00%
freq      93.33%               100.00%
lsi       90.00%               90.00%
doc2vec   73.33%               80.00%
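
The averages above follow the usual per-query definitions: precision is the fraction of retrieved results that are relevant, recall the fraction of relevant results that are retrieved, each averaged over all queries. A minimal sketch of this computation (the function names are hypothetical, not taken from prec-recall.py):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of one ranked result list against a relevant set."""
    hits = sum(1 for r in retrieved if r in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def averages(results):
    """Mean precision and recall over (retrieved, relevant) pairs, one per query."""
    pairs = [precision_recall(ret, rel) for ret, rel in results]
    n = len(pairs)
    return (sum(p for p, _ in pairs) / n,
            sum(r for _, r in pairs) / n)
```

For example, a query returning ["a", "b"] when only "a" is relevant scores 50% precision and 100% recall.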


To compile the report, run pdflatex twice (the second pass resolves cross-references):

cd report
pdflatex -interaction=nonstopmode -output-directory=. main.tex
pdflatex -interaction=nonstopmode -output-directory=. main.tex