Dataset Download Instructions

Project .zip Export

We scraped GitHub repositories using the SEART GitHub Search tool (https://seart-ghs.si.usi.ch/) to generate the results.csv file under this directory. In addition to the default constraints applied by the seart-ghs crawler, we used the following criteria:

  • lines of code: >=10000
  • language: Python
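For reference, these criteria can be re-checked against the exported CSV. A minimal sketch, using hypothetical column names (`name`, `linesOfCode`, `language` — the headers in a real seart-ghs export may differ):

```python
import csv
import io

# Hypothetical three-row excerpt of a seart-ghs export; the real file has
# more columns and possibly different header names.
sample = """name,linesOfCode,language
owner/repo-a,15000,Python
owner/repo-b,9000,Python
owner/repo-c,20000,Java
"""

def matches_criteria(row):
    # lines of code >= 10000 and language == Python
    return int(row["linesOfCode"]) >= 10000 and row["language"] == "Python"

kept = [r["name"] for r in csv.DictReader(io.StringIO(sample)) if matches_criteria(r)]
print(kept)  # ['owner/repo-a']
```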

The query returned 21,269 results. We then downloaded a .zip archive of each repository's default branch using the commands below, starting the downloads on 2023-11-13 at 12:00.

mkdir -p download
# One wget command per repository: field 2 is owner/name, field 6 the default
# branch. The sed turns the first "/" (in "owner/name.zip") into "-" so each
# archive gets a flat filename, and strips stray quotes.
awk -F, 'NR>1 { print "wget -O " $2 ".zip https://github.com/" $2 "/archive/refs/heads/" $6 ".zip" }' results.csv | \
  sed 's#/#-#;s#"##g' > download/to_download.sh
cd download
bash to_download.sh
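The awk/sed pipeline splits on bare commas and therefore assumes no field contains an embedded comma. A Python equivalent using the csv module is more robust against quoting; the column positions below mirror the command above ($2 is owner/name, $6 is the default branch) and are an assumption about the results.csv layout:

```python
import csv

def download_commands(csv_path="results.csv"):
    """Yield one wget command per repository, mirroring the awk/sed pipeline."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row (awk's NR>1)
        for row in reader:
            repo, branch = row[1], row[5]  # awk's $2 and $6, zero-based here
            out = repo.replace("/", "-", 1) + ".zip"  # owner/name -> owner-name.zip
            yield (f"wget -O {out} "
                   f"https://github.com/{repo}/archive/refs/heads/{branch}.zip")
```

Writing the yielded lines to download/to_download.sh reproduces the script generated above.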

Manually Excluded Repos

We manually excluded the following repositories from our scraped dataset ("404" means that the repository was inaccessible and could not be downloaded):

  • thorn-lab/coronavirus_structural_task_force (too large, more than 6GiB)
  • feeicn/security-ppt (too large, more than 9GiB)
  • salesforce/ai-economist (404)
  • agiliumtrade/ai-metaapi-python-sdk (404)
  • pokemonchw/dieloli (harmful content)
  • thesnowguru/pytrader-python-mt4-mt5-trading-api-connector-drag-n-drop (DMCA takedown)
  • objectiv/objectiv-analytics (404)
  • aws/solutions-aws-security-hub-automated-response-and-remediation (404)
  • openunited/product-factory-backend (404)
  • ibm-epbl/ibm-project-43602-1660718377 (404)
  • ibm-epbl/ibm-project-1392-1658386621 (404)
  • potatolondon/django-gcloud-connectors (404)
  • fortwoone/oracle-project (404)
  • iperov/deepxtools (404)
  • frequenz/floss-frequenz-sdk-python (404)
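When regenerating the dataset, the exclusions above can be applied programmatically. A sketch (the set is copied verbatim from the list, and the row index follows the download command above, which takes owner/name from field 2):

```python
# Repositories manually excluded from the scraped dataset (see list above).
EXCLUDED = {
    "thorn-lab/coronavirus_structural_task_force",
    "feeicn/security-ppt",
    "salesforce/ai-economist",
    "agiliumtrade/ai-metaapi-python-sdk",
    "pokemonchw/dieloli",
    "thesnowguru/pytrader-python-mt4-mt5-trading-api-connector-drag-n-drop",
    "objectiv/objectiv-analytics",
    "aws/solutions-aws-security-hub-automated-response-and-remediation",
    "openunited/product-factory-backend",
    "ibm-epbl/ibm-project-43602-1660718377",
    "ibm-epbl/ibm-project-1392-1658386621",
    "potatolondon/django-gcloud-connectors",
    "fortwoone/oracle-project",
    "iperov/deepxtools",
    "frequenz/floss-frequenz-sdk-python",
}

def keep_row(row):
    """True for CSV rows whose repository is not manually excluded."""
    return row[1] not in EXCLUDED  # field 2 holds owner/name
```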

Check Archive Health

The following script was used to check the integrity of each downloaded .zip file.

cd download
# Record "<path> 1" if unzip can list the archive, "<path> 0" otherwise.
find . -name '*.zip' \
    -exec bash -c 'echo "$0" $(unzip -l "$0" >/dev/null 2>&1 && echo 1 || echo 0)' {} \; \
    > archive_health.txt
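The same check can be sketched in Python with the standard-library zipfile module; unlike `unzip -l`, `testzip()` also CRC-checks every member. This is a sketch, not the script actually used:

```python
import zipfile
from pathlib import Path

def archive_ok(path):
    """Return True if the archive opens and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as zf:
            return zf.testzip() is None  # testzip() returns the first bad member
    except (zipfile.BadZipFile, OSError):
        return False

def write_health_report(root="download", out="archive_health.txt"):
    """Write "<name> 1|0" per archive, matching the format produced above."""
    with open(out, "w") as report:
        for zip_path in sorted(Path(root).glob("*.zip")):
            report.write(f"{zip_path.name} {1 if archive_ok(zip_path) else 0}\n")
```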

Function Extraction

The following command builds a dataset from the archives saved in the download subdirectory:

python3 ./extract.py

Functions are extracted with the Python ast module, which discards comments (but not docstrings). The script writes one Parquet archive of extracted functions per project to the functions subdirectory.
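The per-function extraction can be sketched as follows; `ast.unparse` (Python 3.9+) regenerates source from the tree, which is why comments disappear while docstrings, being ordinary string constants, survive. This is a sketch of the approach, not the actual extract.py:

```python
import ast

def extract_functions(source):
    """Return one (name, code) pair per function defined in `source`."""
    tree = ast.parse(source)
    return [
        (node.name, ast.unparse(node))  # unparsing drops comments, keeps docstrings
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
```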

As the dataset was large, this script was terminated early; at termination, 70 million functions had been extracted. Due to computing-power limitations for model training, we then sampled only 500,000 of the extracted functions to build the training set. This sampling step reads the Parquet archives in the functions subdirectory and stores the selected functions in the single Parquet file extracted/functions.pq; it is performed by the same extract.py invocation shown above.

The extraction process guarantees that each extracted function has valid syntax under Python 3.10+ and that its code contains only ASCII characters.
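Both guarantees amount to a per-function filter along these lines (a sketch; note that ast.parse validates against the grammar of the interpreter running it, so the script must itself run under Python 3.10+ for the syntax guarantee to hold):

```python
import ast

def is_valid_ascii_function(code):
    """True if `code` parses as Python under the running interpreter
    and contains only ASCII characters."""
    if not code.isascii():
        return False
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    return True
```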