# Dataset Download Instructions

## Project .zip Export
We scraped GitHub repositories using the tool at https://seart-ghs.si.usi.ch/ to generate the `results.csv` file in
this directory. In addition to the default constraints applied by the `seart-ghs` crawler, we used the following
criteria:

- lines of code: >=10000
- language: `Python`

The search returned 21269 repositories. We then downloaded a `.zip` archive of the main branch of each repository
using the following commands. The download process started on 2023-11-13 at 12:00.
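```shell
# Create the target directory (no error if it already exists)
mkdir -p download
# Emit one wget command per CSV row: field 2 is used as owner/repo, field 6 as the branch name.
# sed replaces the first "/" (in the output file name) with "-" and strips the CSV quotes.
awk -F, 'NR>1 { print "wget -O " $2 ".zip https://github.com/" $2 "/archive/refs/heads/" $6 ".zip" }' results.csv | \
  sed 's#\/#-#;s#\"##g' > download/to_download.sh
cd download
bash to_download.sh
```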
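
For readers who prefer Python to `awk` and `sed`, the command generation above could be sketched roughly as follows. This is an illustration only, not the script we ran. The helper name `build_download_commands` is hypothetical, and the sketch assumes, as the `awk` command does, that field 2 of `results.csv` holds the `owner/repo` name and field 6 the branch name. Unlike the `awk -F,` approach, the `csv` module also handles quoted fields that contain commas.

```python
import csv
import io


def build_download_commands(csv_text: str) -> list[str]:
    """Return one wget command per data row of a results.csv-style table."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows, None)  # skip the header row, like NR>1 in awk
    commands = []
    for row in rows:
        if len(row) < 6:
            continue  # ignore empty or malformed rows
        # Assumed layout: field 2 = owner/repo, field 6 = branch name
        name, branch = row[1], row[5]
        # Same renaming as the sed step: the first "/" becomes "-"
        archive = name.replace("/", "-", 1) + ".zip"
        url = f"https://github.com/{name}/archive/refs/heads/{branch}.zip"
        commands.append(f"wget -O {archive} {url}")
    return commands
```

Passing the contents of `results.csv` to this function and writing the returned lines to `download/to_download.sh` would reproduce the generated script under the assumptions stated above.
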
### Manually Excluded Repos

We manually excluded the following repositories from our scraped dataset ("404" means that the repository was
inaccessible and could not be downloaded):

- `thorn-lab/coronavirus_structural_task_force` (too large, more than 6 GiB)
- `feeicn/security-ppt` (too large, more than 9 GiB)
- `salesforce/ai-economist` (404)
- `agiliumtrade/ai-metaapi-python-sdk` (404)
- `pokemonchw/dieloli` (harmful content)
- `thesnowguru/pytrader-python-mt4-mt5-trading-api-connector-drag-n-drop` (DMCA takedown)
- `objectiv/objectiv-analytics` (404)
- `aws/solutions-aws-security-hub-automated-response-and-remediation` (404)
- `openunited/product-factory-backend` (404)
- `ibm-epbl/ibm-project-43602-1660718377` (404)
- `ibm-epbl/ibm-project-1392-1658386621` (404)
- `potatolondon/django-gcloud-connectors` (404)
- `fortwoone/oracle-project` (404)
- `iperov/deepxtools` (404)
- `frequenz/floss-frequenz-sdk-python` (404)
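### Check Archive Health

The following script was used to check the integrity of each downloaded `.zip` file. For every archive it writes one
line to `archive_health.txt`: the file path followed by `1` if `unzip -l` can read the archive listing, or `0` otherwise.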
```shell
cd download
# For each archive, print its path followed by 1 if `unzip -l` succeeds, 0 otherwise
find . -name '*.zip' \
  -exec bash -c 'echo "$0" $(unzip -l "$0" 2>/dev/null 1>/dev/null && echo "1" || echo "0")' \{\} \; \
  > archive_health.txt
```
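
A comparable check can be sketched in Python with the standard `zipfile` module. This is not the script we used. The function names `archive_is_healthy` and `write_health_report` are hypothetical, and the optional `deep` mode (which verifies the CRC of every member through `ZipFile.testzip()`) is stricter than `unzip -l`, which only reads the archive listing.

```python
import zipfile
from pathlib import Path


def archive_is_healthy(path: Path, deep: bool = False) -> bool:
    """Return True if the archive listing can be read.

    With deep=True, also verify the CRC of every member.
    """
    try:
        with zipfile.ZipFile(path) as archive:
            archive.namelist()
            # testzip() returns the name of the first corrupt member, or None
            if deep and archive.testzip() is not None:
                return False
    except (zipfile.BadZipFile, OSError):
        return False
    return True


def write_health_report(directory: Path, report: Path) -> None:
    """Write '<path> <1|0>' lines, mirroring the archive_health.txt format."""
    with report.open("w") as out:
        for path in sorted(directory.rglob("*.zip")):
            out.write(f"{path} {int(archive_is_healthy(path))}\n")
```

Calling `write_health_report(Path("download"), Path("download/archive_health.txt"))` would produce a report in the same two-column layout, with paths written relative to the current directory rather than to `download`.
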
## Function Extraction

The following command builds a dataset from the archives saved in the `download/` subdirectory:
```shell
python3 ./extract.py
```
Functions are extracted with the Python `ast` module, which discards comments (but not docstrings). The script writes
one Parquet file per project to the `functions/` directory.

Because the dataset was large, we terminated this script early; at that point, 70 million functions had been extracted.
Due to computing power limitations for model training, we then selected only 500000 of these functions to build the
training set. This second extraction step reads the files in `functions/` and stores the selected functions in the
Parquet file `extracted/functions.pq`. It can be invoked with the command:
```shell
python3 extract.py
```
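
To illustrate why `ast`-based extraction drops comments but keeps docstrings, here is a minimal sketch. It is not taken from `extract.py`: the function name `extract_functions` is hypothetical, and the use of `ast.unparse` (Python 3.9+) to regenerate source text is an assumption about how such a step could work.

```python
import ast


def extract_functions(source: str) -> list[str]:
    """Return regenerated source for every function definition in `source`."""
    tree = ast.parse(source)
    functions = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # The tree holds no comments, so unparse cannot emit them;
            # docstrings are ordinary string expressions and are kept.
            functions.append(ast.unparse(node))
    return functions
```
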
The extraction process guarantees that the extracted functions have valid syntax for Python 3.10+ and that the code of
each function contains only ASCII characters.
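
The two guarantees above could be checked with a filter along the following lines. This is a sketch under assumptions, not the code in `extract.py`: the name `is_valid_ascii_function` is hypothetical, and syntax is validated against whichever interpreter runs the check, so it would need to run under Python 3.10 or later to match the stated guarantee.

```python
import ast


def is_valid_ascii_function(code: str) -> bool:
    """Return True if `code` is ASCII-only and parses without a syntax error."""
    if not code.isascii():
        return False
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as source containing null bytes
        return False
    return True
```

A function that passes this check contains only ASCII characters and is accepted by the parser of the running Python version.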