Files in this directory:

- extract.py
- extracted/
- README.md
- results.csv
- sample.py
Dataset Download Instructions
Project .zip Export
We scraped GitHub repositories using the download tool https://seart-ghs.si.usi.ch/ to generate the results.csv file in this directory. Beyond the default constraints applied by the seart-ghs crawler, we used the following criteria:

- lines of code: >= 10000
- language: Python
We found 21269 results. We then downloaded a .zip archive of the main branch of each repository using the following commands. We started the download process on 2023-11-13 at 12:00.
# Generate one wget command per repository (name in column 2, default
# branch in column 6), then run the resulting script.
mkdir -p download
cat results.csv | \
awk -F, 'NR>1 { print "wget -O " $2 ".zip https://github.com/" $2 "/archive/refs/heads/" $6 ".zip" }' | \
sed 's#\/#-#;s#\"##g' > download/to_download.sh  # flatten owner/repo -> owner-repo.zip; drop quotes
cd download
bash to_download.sh
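For reference, the command-generation step above can also be expressed in Python. This is a sketch, not the code we ran; it assumes, as the awk command does, that column 2 of results.csv holds the `owner/repo` name and column 6 the default branch.

```python
import csv

def wget_commands(csv_path):
    """Generate one wget command per repository, mirroring the awk/sed
    pipeline: the first '/' in 'owner/repo' is replaced with '-' so the
    output file name is flat, and the header row is skipped (awk's NR>1)."""
    commands = []
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip header
        for row in reader:
            name, branch = row[1], row[5]
            out = name.replace("/", "-", 1) + ".zip"  # sed 's#/#-#'
            url = f"https://github.com/{name}/archive/refs/heads/{branch}.zip"
            commands.append(f"wget -O {out} {url}")
    return commands
```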
Manually Excluded Repos
We manually excluded the following repositories from our scraped dataset ("404" means that the repository was inaccessible and could not be downloaded):
- thorn-lab/coronavirus_structural_task_force (too large, more than 6GiB)
- feeicn/security-ppt (too large, more than 9GiB)
- salesforce/ai-economist (404)
- agiliumtrade/ai-metaapi-python-sdk (404)
- pokemonchw/dieloli (harmful content)
- thesnowguru/pytrader-python-mt4-mt5-trading-api-connector-drag-n-drop (DMCA takedown)
- objectiv/objectiv-analytics (404)
- aws/solutions-aws-security-hub-automated-response-and-remediation (404)
- openunited/product-factory-backend (404)
- ibm-epbl/ibm-project-43602-1660718377 (404)
- ibm-epbl/ibm-project-1392-1658386621 (404)
- potatolondon/django-gcloud-connectors (404)
- fortwoone/oracle-project (404)
- iperov/deepxtools (404)
- frequenz/floss-frequenz-sdk-python (404)
Check Archive Health
The following script was used to check the integrity of each downloaded .zip file.
cd download
find . -name '*.zip' \
  -exec bash -c 'unzip -l "$0" > /dev/null 2>&1 && echo "$0 1" || echo "$0 0"' {} \; \
  > archive_health.txt
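The same health check can be done without shelling out to unzip. A minimal sketch using Python's standard zipfile module (the function name is illustrative):

```python
import zipfile
from pathlib import Path

def check_archives(directory):
    """Map each .zip path to 1 (readable) or 0 (corrupt or unreadable),
    mirroring the unzip -l exit-status check above."""
    health = {}
    for path in Path(directory).rglob("*.zip"):
        try:
            with zipfile.ZipFile(path) as zf:
                # testzip() returns the first bad member name, or None
                # if every member's CRC checks out.
                health[str(path)] = 1 if zf.testzip() is None else 0
        except zipfile.BadZipFile:
            health[str(path)] = 0
    return health
```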
Function Extraction
The following command builds a dataset from the archives saved in the /download subdirectory:

python3 ./extract.py

Functions are extracted with the Python ast module, which discards comments (but not docstrings). The script generates one Parquet archive of extracted functions per project in the /functions directory.
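The core of this step can be sketched with the ast module. This is not the actual extract.py code, only an illustration of the technique: since comments never reach the AST, regenerating source with ast.unparse drops them, while docstrings survive as ordinary string constants.

```python
import ast

def extract_functions(source):
    """Return one source string per function defined in `source`,
    regenerated with ast.unparse: comments disappear (they are not
    part of the AST), docstrings are kept."""
    tree = ast.parse(source)
    return [
        ast.unparse(node)  # requires Python 3.9+
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
```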
As the dataset was large, this script was terminated early; at termination, 70 million functions had been extracted. Due to computing power limitations for model training, we further extracted only 500000 of the downloaded functions to build the training set. This second extraction reads the archives in /functions and stores the selected functions in the Parquet file extracted/functions.pq. It can be invoked with the command:

python3 extract.py

The extraction process guarantees that each extracted function has valid syntax for Python 3.10+ and that its code contains only ASCII characters.
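The two guarantees can be checked per candidate with ast.parse (syntax, under whatever grammar the running interpreter supports; the dataset used 3.10+) and str.isascii. A hedged sketch, assuming each candidate is a standalone function source string:

```python
import ast

def is_valid_function(code):
    """Accept only code that is pure ASCII and parses without error
    under the running interpreter's grammar."""
    if not code.isascii():
        return False
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    return True
```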