
Project 02: Multi-source code search

Claudio Maggioni

About the Project

This project aims to develop a search engine that can query a large Python code repository using multiple sources of information. It is part of the Knowledge Analysis & Management - 2022 course at the Università della Svizzera italiana.

In this repository, you can find the following files:

tensorflow: a code repository to be used during this project
ground-truth-unique: a file containing the reference triples necessary to evaluate the search engine (step 3)

For more information, see the Project-02 slides (available on iCourse).

Note: Feel free to modify this file according to the project's necessities.

Environment setup

To install the required dependencies, make sure python3 points to a Python 3.10 or 3.11 installation and then run:

python3 -m venv env
source env/bin/activate
pip install -r requirements.txt

Part 1: data extraction

To extract the data into the file data.csv, run the command:

python3 extract-data.py

The script prints the requested counts:

Methods: 5817
Functions: 4565
Classes: 1882
Python Files: 2817
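
As an illustration of how such counts can be obtained, here is a minimal sketch based on Python's standard ast module. It is not the code in extract-data.py: the function name count_entities is hypothetical, and the rules it applies (a def directly inside a class is a method, every other def is a function, no filtering by name) are assumptions, so it will not necessarily reproduce the figures above.

```python
import ast


def count_entities(source: str) -> dict:
    """Count classes, functions, and methods in a string of Python source.

    Assumption: a def nested inside a class body counts as a method,
    any other def counts as a function.
    """
    counts = {"classes": 0, "functions": 0, "methods": 0}

    def visit(node, in_class):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                counts["classes"] += 1
                visit(child, True)
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                counts["methods" if in_class else "functions"] += 1
                # Definitions nested in a function body count as functions.
                visit(child, False)
            else:
                visit(child, in_class)

    visit(ast.parse(source), False)
    return counts


sample = '''
class Greeter:
    def hello(self):
        return "hi"

def main():
    pass
'''

print(count_entities(sample))
# {'classes': 1, 'functions': 1, 'methods': 1}
```

To process a whole repository, the same function could be applied to the contents of every .py file found with pathlib.Path.rglob, summing the per-file counts.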

Part 2: Training

To train a model and run a given query against it, run the command:

python3 search-data.py [method] "[query]"

where [method] is one of {tfidf,freq,lsi,doc2vec}, or all to run all classifiers, and [query] is the natural-language query to search for. Results are printed to stdout. For doc2vec, the trained model is saved to ./doc2vec_model.dat and loaded from that path on subsequent executions.
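
To give an idea of what a tfidf-style search does, the following is a minimal sketch using only the standard library. It is not the implementation in search-data.py: the helper names (tokenize, build_index, cosine, search), the identifier-splitting rule, the smoothed IDF formula, and the top-5 cut-off are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Split snake_case and camelCase identifiers into lowercase words.
    words = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", text)
    return [w.lower() for w in words]


def build_index(documents):
    tokenized = [tokenize(d) for d in documents]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {t: math.log(n / c) + 1.0 for t, c in df.items()}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in tokenized
    ]
    return idf, vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, documents, top_k=5):
    idf, vectors = build_index(documents)
    # Query terms that never occur in the corpus are ignored.
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    scored = sorted(
        ((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True
    )
    return [(documents[i], score) for score, i in scored[:top_k] if score > 0]


# Hypothetical entity descriptions (name followed by a comment).
docs = [
    "read_csv_file parse a CSV file into rows",
    "AdamOptimizer gradient descent optimizer",
    "save_model write the model to disk",
]

results = search("optimizer using gradient descent", docs)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

In this example only the second document shares terms with the query, so it is the only result returned.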

Part 3: Evaluation

To evaluate a model, run the command:

python3 prec-recall.py [method] ./ground-truth-unique.txt

where [method] is one of {tfidf,freq,lsi,doc2vec}, or all to evaluate all classifiers. The script outputs the average precision and average recall of each classifier:

Engine	Average Precision	Average Recall
tfidf	90.00%	90.00%
freq	93.33%	100.00%
lsi	90.00%	90.00%
doc2vec	73.33%	80.00%
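
This README does not state how precision and recall are computed. The sketch below shows one possible definition, and every part of it is an assumption: each query has a single correct entity, only the top 5 results are considered, recall is 1 when the correct entity appears among them (0 otherwise), and precision is the reciprocal of its rank (0 when it is missing). The function name evaluate and the sample data are hypothetical.

```python
def evaluate(ranked_results, expected, top_k=5):
    """Return (average precision, average recall) over all queries.

    ranked_results: maps each query to its ranked list of entity names.
    expected: maps each query to the single correct entity name.
    Assumed metrics: precision = 1 / rank of the correct entity within
    the top_k results, recall = 1 if it is found; both 0 otherwise.
    """
    precisions, recalls = [], []
    for query, truth in expected.items():
        top = ranked_results.get(query, [])[:top_k]
        if truth in top:
            rank = top.index(truth) + 1
            precisions.append(1.0 / rank)
            recalls.append(1.0)
        else:
            precisions.append(0.0)
            recalls.append(0.0)
    n = len(expected)
    return sum(precisions) / n, sum(recalls) / n


# Hypothetical data: one hit at rank 1, one at rank 3, one miss.
expected = {"q1": "a", "q2": "b", "q3": "c"}
ranked = {
    "q1": ["a", "x", "y"],
    "q2": ["x", "y", "b"],
    "q3": ["x", "y", "z"],
}

precision, recall = evaluate(ranked, expected)
print(f"Average Precision: {precision:.2%}")  # 44.44%
print(f"Average Recall: {recall:.2%}")        # 66.67%
```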

Report

To compile the report, run:

cd report
pdflatex -interaction=nonstopmode -output-directory=. main.tex
pdflatex -interaction=nonstopmode -output-directory=. main.tex