# Dataset Download Instructions

## Project `.zip` Export

We scraped GitHub repositories using the `seart-ghs` search tool (https://seart-ghs.si.usi.ch/) to generate the `results.csv` file under this directory. In addition to the default constraints applied by the `seart-ghs` crawler, we used the following criteria:

- lines of code: >= 10000
- language: `Python`

This yielded 21,269 results. We then downloaded a `.zip` archive of the default branch of each repository using the following commands, starting the download process on 2023-11-13 at 12:00:

```shell
mkdir -p download
# Column 2 of results.csv is "owner/repo"; column 6 is the default branch.
# The sed command turns the first slash into a dash so the output file name
# becomes "owner-repo.zip", and strips the CSV field quoting.
cat results.csv | \
  awk -F, 'NR>1 { print "wget -O " $2 ".zip https://github.com/" $2 "/archive/refs/heads/" $6 ".zip" }' | \
  sed 's#/#-#;s#"##g' > download/to_download.sh
cd download
bash to_download.sh
```

### Manually Excluded Repos

We manually excluded the following repositories from our scraped dataset ("404" means that the repository was inaccessible and could not be downloaded):

- `thorn-lab/coronavirus_structural_task_force` (too large, more than 6 GiB)
- `feeicn/security-ppt` (too large, more than 9 GiB)
- `salesforce/ai-economist` (404)
- `agiliumtrade/ai-metaapi-python-sdk` (404)
- `pokemonchw/dieloli` (harmful content)
- `thesnowguru/pytrader-python-mt4-mt5-trading-api-connector-drag-n-drop` (DMCA takedown)
- `objectiv/objectiv-analytics` (404)
- `aws/solutions-aws-security-hub-automated-response-and-remediation` (404)
- `openunited/product-factory-backend` (404)
- `ibm-epbl/ibm-project-43602-1660718377` (404)
- `ibm-epbl/ibm-project-1392-1658386621` (404)
- `potatolondon/django-gcloud-connectors` (404)
- `fortwoone/oracle-project` (404)
- `iperov/deepxtools` (404)
- `frequenz/floss-frequenz-sdk-python` (404)

### Check Archive Health

The following script was used to check the integrity of each downloaded `.zip` file. It writes one line per archive to `archive_health.txt`: the file name, followed by `1` if the archive can be listed and `0` if it is corrupt:

```shell
cd download
# "unzip -l" exits non-zero if the archive cannot be read.
find . -name '*.zip' \
  -exec bash -c 'echo "$0" $(unzip -l "$0" >/dev/null 2>&1 && echo 1 || echo 0)' {} \; \
  > archive_health.txt
```

## Function Extraction

The following command builds a dataset from the archives saved in the `/download` subdirectory:

```shell
python3 ./extract.py
```

Functions are extracted with the Python `ast` module, which discards comments (but not docstrings). The script generates one Parquet archive per project in the `/functions` directory. As the dataset was large, this script was terminated early; at termination, 70 million functions had been extracted.

Due to computing-power limitations for model training, we further extracted only 500,000 of these functions to build the training set. This extraction pass reads the archives in `/functions` and stores the selected functions in the Parquet file `extracted/functions.pq`. The extraction script can be invoked with the command:

```shell
python3 extract.py
```

The extraction process guarantees that the extracted functions have valid syntax for Python 3.10+ and that the code of each function contains only ASCII characters.
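The extraction guarantees above (comments dropped, docstrings kept, valid syntax, ASCII-only) follow naturally from round-tripping each function through the `ast` module. The helper below is an illustrative sketch, not the actual `extract.py` implementation; the name `extract_functions` is an assumption:

```python
import ast


def extract_functions(source: str) -> list[str]:
    """Sketch of per-module function extraction (illustrative, not extract.py).

    Parsing with ast.parse raises SyntaxError on invalid code, so every
    extracted function is syntactically valid for the running interpreter.
    ast.unparse regenerates source from the AST, which drops comments but
    keeps docstrings (they are string constants in the tree).
    """
    tree = ast.parse(source)
    functions = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            code = ast.unparse(node)
            if code.isascii():  # enforce the ASCII-only guarantee
                functions.append(code)
    return functions
```

Note that `ast.unparse` requires Python 3.9+, and parsing with a 3.10+ interpreter is what ties the validity guarantee to that version range.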