kse-01/README.md
2023-11-07 11:48:00 +01:00

Project 02: Multi-source code search

About the Project

This project aims to develop a search engine that can query a large Python code repository using multiple sources of information. It is part of the Knowledge Analysis & Management - 2022 course at the Università della Svizzera italiana.

In this repository, you can find the following files:

  • tensorflow: a code repository to be used during this project
  • ground-truth-unique: a file containing the reference triples needed to evaluate the search engine (step 3)

For more information, see the Project-02 slides (available on iCourse).

Note: Feel free to modify this file according to the project's necessities.

Environment setup

To install the required dependencies, make sure python3 points to a Python 3.10 or 3.11 installation, then run:

python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
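
As a quick sanity check before creating the virtual environment, the interpreter version can be verified from Python itself. This is only a sketch: the helper name is_supported and the SUPPORTED set are illustrative and are not part of the repository.

```python
import sys

# Versions named in the setup instructions above (assumed to be the only supported ones).
SUPPORTED = {(3, 10), (3, 11)}


def is_supported(version_info=sys.version_info):
    """Return True if the major.minor pair is one of the supported versions."""
    return (version_info[0], version_info[1]) in SUPPORTED


print(f"Python {sys.version_info[0]}.{sys.version_info[1]} supported: {is_supported()}")
```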

Part 1: Data extraction

To extract the data into the file data.csv, run the command:

python3 extract-data.py

The script prints the requested counts:

Methods: 5817
Functions: 4565
Classes: 1882
Python Files: 2817
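
The counts above distinguish methods, functions, and classes. A minimal sketch of how such a distinction could be made with the standard ast module is shown below. It is not the logic of extract-data.py: the function name count_definitions is hypothetical, only top-level classes and functions are considered, and any filtering rules the real script applies (for example, skipping certain names) are not reproduced here.

```python
import ast


def count_definitions(source: str) -> dict:
    """Count top-level classes, their methods, and top-level functions in one source string."""
    tree = ast.parse(source)
    counts = {"classes": 0, "methods": 0, "functions": 0}
    function_nodes = (ast.FunctionDef, ast.AsyncFunctionDef)
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            counts["classes"] += 1
            # A function defined directly in a class body is counted as a method.
            counts["methods"] += sum(isinstance(child, function_nodes) for child in node.body)
        elif isinstance(node, function_nodes):
            counts["functions"] += 1
    return counts


example = '''
class Stack:
    def push(self, item): ...
    def pop(self): ...

def helper():
    pass
'''
print(count_definitions(example))
```

Running the same function over every .py file in the repository and summing the dictionaries would give repository-wide totals under these simplified rules.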

Part 2: Training

To train the models and retrieve the results for a given query, run the command:

python3 search-data.py [method] "[query]"

where [method] is one of {tfidf,freq,lsi,doc2vec}, or all to run every search engine, and [query] is the natural-language query to search for. Results are printed to stdout. For doc2vec, the trained model is saved to ./doc2vec_model.dat and loaded from that path on subsequent runs.
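
To illustrate what the tfidf method does conceptually, here is a small standard-library sketch of TF-IDF ranking with cosine similarity. It is an assumption-laden illustration, not the code in search-data.py: the names tokenize, build_index, cosine, and search are hypothetical, the tokenizer is a plain whitespace split, and the IDF formula (log(N/df) + 1) is just one common variant.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(docs):
    """Return (idf, vectors): an IDF table and one sparse TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, docs, top_k=5):
    """Return the top_k (document, score) pairs, best match first."""
    idf, vectors = build_index(docs)
    # Query terms that never occur in the corpus carry no weight and are dropped.
    q_vec = {t: c * idf[t] for t, c in Counter(tokenize(query)).items() if t in idf}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(docs[i], score) for score, i in scored[:top_k]]


docs = ["read csv file", "write json file", "parse command line arguments"]
results = search("parse arguments", docs, top_k=2)
for doc, score in results:
    print(f"{score:.3f}  {doc}")
```

In this toy corpus only the third document shares terms with the query, so it is ranked first; the other documents score zero.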

Part 3: Evaluation

To evaluate a model run the command:

python3 search-data.py [method] ./ground-truth-unique.txt

where [method] is one of {tfidf,freq,lsi,doc2vec}, or all to evaluate every search engine. The script reports each engine's average precision and average recall:

Engine    Average Precision    Average Recall
tfidf     20.00%               20.00%
freq      27.00%               40.00%
lsi       4.00%                20.00%
doc2vec   10.00%               10.00%
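
The README does not define how average precision and recall are computed. The sketch below shows one plausible definition, stated here purely as an assumption: each query has a single expected result, recall is 1 if that result appears among the top 5 hits, and precision is the reciprocal of its rank (0 when it is missing). The function name evaluate and the dictionary-based inputs are hypothetical and do not reflect the format of ground-truth-unique.txt.

```python
def evaluate(ranked_results, ground_truth, top_k=5):
    """Return (average precision, average recall) over all ground-truth queries.

    ranked_results: dict mapping a query to its ordered list of result identifiers.
    ground_truth:   dict mapping a query to the single expected identifier.
    """
    if not ground_truth:
        return 0.0, 0.0
    precision_sum = 0.0
    recall_sum = 0.0
    for query, expected in ground_truth.items():
        top = ranked_results.get(query, [])[:top_k]
        if expected in top:
            rank = top.index(expected) + 1  # ranks are 1-based
            precision_sum += 1.0 / rank
            recall_sum += 1.0
    n_queries = len(ground_truth)
    return precision_sum / n_queries, recall_sum / n_queries


truth = {"q1": "a", "q2": "b"}
ranked = {"q1": ["a", "x"], "q2": ["x", "y", "b"]}
avg_precision, avg_recall = evaluate(ranked, truth)
print(f"precision={avg_precision:.2%} recall={avg_recall:.2%}")
```

Under this definition precision can never exceed recall, and the two are equal only when every hit found is ranked first.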