# Project 02: Multi-source code search

### About the Project

This project aims to develop a search engine that can query a large Python code repository using multiple sources of information. It is part of the Knowledge Analysis & Management - 2022 course at the Università della Svizzera italiana.

This repository contains the following files:

- tensor flow: a code repository to be used during this project
- ground-truth-unique: a file containing the reference triples needed to evaluate the search engine (step 3)

For more information, see the Project-02 slides (available on iCourse).

Note: Feel free to modify this file according to the project's necessities.

## Environment setup

To install the required dependencies, make sure `python3` points to a Python 3.10 or 3.11 installation, then run:

```shell
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```

## Part 1: data extraction

To extract the data into the file `data.csv`, run the command:

```shell
python3 extract-data.py
```

The script prints the requested counts:

```
Methods: 5817
Functions: 4565
Classes: 1882
Python Files: 2817
```

## Part 2: Training

To train a model and get its results for a given query, run the command:

```shell
python3 search-data.py [method] "[query]"
```

where `[method]` is one of `{tfidf,freq,lsi,doc2vec}`, or `all` to run all classifiers, and `[query]` is the natural language query to search for.

Outputs are printed to stdout. For `doc2vec`, the trained model is saved to `./doc2vec_model.dat` and loaded from that path on subsequent executions.

## Part 3: Evaluation

To evaluate a model, run the command:

```shell
python3 search-data.py [method] ./ground-truth-unique.txt
```

where `[method]` is one of `{tfidf,freq,lsi,doc2vec}`, or `all` to evaluate all classifiers.
The script outputs the performance of each classifier in terms of average precision and average recall:

| Engine   | Average Precision   | Average Recall   |
|:---------|:--------------------|:-----------------|
| tfidf    | 20.00%              | 20.00%           |
| freq     | 27.00%              | 40.00%           |
| lsi      | 4.00%               | 20.00%           |
| doc2vec  | 10.00%              | 10.00%           |
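
The README does not spell out how the two averages are defined, so the following is only a minimal sketch of one plausible scheme, not the code in `search-data.py`. It assumes that each query has exactly one expected result, that each engine returns a short ranked list, that per-query precision is `1 / rank` of the expected result (0 if it is missing), and that per-query recall is 1 if the expected result appears at all. The function name `evaluate` and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def evaluate(
    ground_truth: Dict[str, str],
    results: Dict[str, List[str]],
) -> Tuple[float, float]:
    """Return (average precision, average recall) over all queries.

    Assumed scoring (not confirmed by the project):
    precision = 1 / rank of the expected result, or 0 if it is absent;
    recall    = 1 if the expected result is in the ranked list, else 0.
    """
    n = len(ground_truth)
    if n == 0:
        return 0.0, 0.0

    precision_sum = 0.0
    recall_sum = 0.0
    for query, expected in ground_truth.items():
        ranked = results.get(query, [])
        if expected in ranked:
            rank = ranked.index(expected) + 1  # ranks are 1-based
            precision_sum += 1.0 / rank
            recall_sum += 1.0
    return precision_sum / n, recall_sum / n


# Hypothetical data: three queries, one of which the engine misses.
ground_truth = {"q1": "a", "q2": "b", "q3": "c"}
results = {"q1": ["a", "x"], "q2": ["y", "b"], "q3": ["z"]}

avg_precision, avg_recall = evaluate(ground_truth, results)
print(f"Average Precision: {avg_precision:.2%}")  # 50.00%
print(f"Average Recall: {avg_recall:.2%}")        # 66.67%
```

Under these assumptions, an engine that finds the expected result but ranks it low scores a higher recall than precision, which is the pattern the `freq` and `lsi` rows show.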