# Dataset Download Instructions

## Project .zip Export

We selected GitHub repositories with the `seart-ghs` search tool (https://seart-ghs.si.usi.ch/) and exported the result
as the `results.csv` file in this directory. In addition to the default constraints applied by the `seart-ghs` crawler,
we used the following criteria:

- lines of code: >=10000
- language: `Python`

The search returned 21269 results. We then downloaded a `.zip` archive of the main branch of each repository (the
branch name is read from the sixth column of `results.csv`) using the following command. We started the download
process on 2023-11-13 at 12:00.

```shell
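mkdir -p download
# Column 2 is the repository name (owner/repo); column 6 is the branch to download.
# The sed step turns the first "/" (in the output file name) into "-" and strips quotes.
awk -F, 'NR>1 { print "wget -O " $2 ".zip https://github.com/" $2 "/archive/refs/heads/" $6 ".zip" }' results.csv | \
  sed 's#/#-#;s#"##g' > download/to_download.sh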
cd download
bash to_download.sh
```
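
The awk pipeline above splits on every comma, so it assumes that no quoted field in `results.csv` contains one. As a
hedged alternative, the sketch below builds the same file names and URLs with Python's `csv` module, which handles
quoted fields. The column positions are taken from the awk command; the function names `download_target` and
`read_targets` are hypothetical and not part of this repository.

```python
import csv
import io


def download_target(row):
    """Return (output file name, archive URL) for one row of results.csv.

    Assumes, as the awk command does, that column 2 holds "owner/repo"
    and column 6 holds the branch to download.
    """
    name, branch = row[1], row[5]
    # Mirror the sed step: only the first slash becomes a dash.
    filename = name.replace("/", "-", 1) + ".zip"
    url = f"https://github.com/{name}/archive/refs/heads/{branch}.zip"
    return filename, url


def read_targets(csv_text):
    """Parse CSV text, skip the header row, and build one target per repository."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows, None)  # skip the header, like NR>1 in awk
    return [download_target(row) for row in rows if row]
```

Each returned pair can then be passed to any downloader, for example as the `-O` argument and URL of `wget`.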

### Manually Excluded Repos

We manually excluded the following repositories from our scraped dataset ("404" means that the repository was
inaccessible and could not be downloaded):

- `thorn-lab/coronavirus_structural_task_force` (too large, more than 6GiB)
- `feeicn/security-ppt` (too large, more than 9GiB)
- `salesforce/ai-economist` (404)
- `agiliumtrade/ai-metaapi-python-sdk` (404)
- `pokemonchw/dieloli` (harmful content)
- `thesnowguru/pytrader-python-mt4-mt5-trading-api-connector-drag-n-drop` (DMCA takedown)
- `objectiv/objectiv-analytics` (404)
- `aws/solutions-aws-security-hub-automated-response-and-remediation` (404)
- `openunited/product-factory-backend` (404)
- `ibm-epbl/ibm-project-43602-1660718377` (404)
- `ibm-epbl/ibm-project-1392-1658386621` (404)
- `potatolondon/django-gcloud-connectors` (404)
- `fortwoone/oracle-project` (404)
- `iperov/deepxtools` (404)
- `frequenz/floss-frequenz-sdk-python` (404)

### Check Archive Health

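The following script was used to check the integrity of each downloaded `.zip` file. It writes one line per archive to
`archive_health.txt`: the file path followed by `1` if `unzip` could list the archive and `0` otherwise.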

```shell
cd download
find . -name '*.zip' \
    -exec bash -c 'echo $0 $(unzip -l "$0" 2>/dev/null 1>/dev/null && echo "1" || echo "0")' \{\} \; \
    > archive_health.txt
```
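
For readers who prefer to run the check from Python, the following is a minimal sketch using the standard `zipfile`
module. It is not the script we ran: it is stricter than `unzip -l`, because `testzip()` also verifies the CRC of every
member rather than only reading the archive listing. The function names `archive_is_healthy` and `health_report` are
hypothetical.

```python
import zipfile
from pathlib import Path


def archive_is_healthy(path):
    """Return True if the file is a readable zip whose members pass their CRC check."""
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the name of the first corrupt member, or None.
            return archive.testzip() is None
    except (zipfile.BadZipFile, OSError):
        return False


def health_report(directory):
    """Map each .zip file name under `directory` to 1 (healthy) or 0 (broken)."""
    return {
        path.name: int(archive_is_healthy(path))
        for path in sorted(Path(directory).rglob("*.zip"))
    }
```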

## Function Extraction

The following command builds a dataset from the archives saved in the `/download` subdirectory:

```shell
python3 ./extract.py
```

Functions are extracted with the Python `ast` module, which discards comments (but not docstrings). The script writes
one Parquet file per project, containing that project's functions, to the directory `/functions`.
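
To illustrate why comments disappear while docstrings survive, here is a minimal sketch of `ast`-based extraction. It
assumes the function source is regenerated from the syntax tree with `ast.unparse` (Python 3.9+); whether `extract.py`
does exactly this is an assumption, and the name `extract_functions` is hypothetical.

```python
import ast


def extract_functions(source):
    """Return the regenerated source of every function definition in `source`."""
    tree = ast.parse(source)
    return [
        # Comments never reach the syntax tree, so unparse cannot emit them;
        # docstrings are ordinary string expressions and are preserved.
        ast.unparse(node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
```

Because `ast.walk` visits every node, nested functions and methods are returned as separate entries in addition to
appearing inside their enclosing function.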

Because the dataset was large, this script was terminated early; at termination, 70 million functions had been
extracted. Due to limited computing power for model training, we then selected only 500000 of the extracted functions
to build the training set. This second step reads the archives in `/functions` and stores the selected functions in
the Parquet file `extracted/functions.pq`. It can be invoked with the command:

```shell
python3 extract.py
```

The extraction process guarantees that the extracted functions have valid syntax for Python 3.10+ and that the code of
each function contains only ASCII characters.
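
The two guarantees can be expressed as a small filter. The sketch below is an illustration under stated assumptions,
not the code in `extract.py`: it checks syntax against whichever interpreter runs it, so it only matches the
"Python 3.10+" guarantee when run on Python 3.10 or later. The name `is_valid_function` is hypothetical.

```python
import ast


def is_valid_function(code):
    """Check that `code` is pure ASCII and parses under the running interpreter."""
    if not code.isascii():
        return False
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        # ValueError covers sources that some Python versions reject
        # before parsing, such as text containing null bytes.
        return False
    return True
```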