Code to fetch dbGaP files using sra-toolkit. The `Dockerfile` can be used to build a Docker image that can be used to run the `fetch.py` script.
Alternatively, to run outside of the Docker image, you must install the SRA Toolkit. The code currently uses v3.0.10; it may work with other versions, but this is not guaranteed.
Before running the script, you will need to use the dbGaP File Selector to select which files to download. From the My Requests section of the dbGaP authorized access webpage, locate the Data Access Request (DAR) for which you would like to download data. Then click on "Request Files" next to the DAR. On the new page, click on the "dbGaP File Selector" link.
Once in the dbGaP File Selector, select the files you would like to download. After you have made your selection, toggle "Selected" in the "Select" pane. You will need to download two files to use as input for the workflow:
- "Cart file": the cart file containing the list of files to download, in sratoolkit kart format.
- "Files Table": the manifest file listing the files to be downloaded.
The `fetch.py` Python script can be run locally to download dbGaP data.
Required inputs:

Argument | Description |
---|---|
`--ngc` | The path to the dbGaP project key for your dbGaP application |
`--cart` | A cart file generated by the dbGaP File Selector |
`--manifest` | A manifest file generated by the dbGaP File Selector |
`--outdir` | The output directory where the data should be saved |
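The required and optional arguments above might be wired up with Python's `argparse`; the sketch below is illustrative only and does not reproduce `fetch.py`'s actual parser.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Illustrative sketch; the real fetch.py parser may differ.
    parser = argparse.ArgumentParser(description="Fetch dbGaP files with sra-toolkit.")
    parser.add_argument("--ngc", required=True, help="Path to the dbGaP project key (.ngc) file")
    parser.add_argument("--cart", required=True, help="Cart file from the dbGaP File Selector")
    parser.add_argument("--manifest", required=True, help="Files Table (manifest) from the File Selector")
    parser.add_argument("--outdir", required=True, help="Directory where downloaded data is saved")
    parser.add_argument("--prefetch", default="prefetch", help="Path to the SRAToolkit prefetch binary")
    parser.add_argument("--untar", action="store_true", help="Untar archives and delete the originals")
    return parser

# Example invocation with placeholder file names:
args = build_parser().parse_args([
    "--ngc", "prj_12345.ngc", "--cart", "cart.krt",
    "--manifest", "manifest.txt", "--outdir", "data/",
])
```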
Optional inputs:

Argument | Description |
---|---|
`--prefetch` | The path to the SRAToolkit prefetch binary |
`--untar` | Flag that can be set if the script should untar any .tar or .tar.gz files into a directory with the same name as the archive (without extension). If set, the original .tar or .tar.gz archive will be deleted. |
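The `--untar` behavior described above can be approximated as follows. This is a sketch, not the code `fetch.py` actually uses, and the helper name `untar_and_remove` is made up for illustration.

```python
import os
import tarfile

def untar_and_remove(archive_path: str) -> str:
    """Extract a .tar or .tar.gz archive into a directory named after the
    archive (extension stripped), then delete the original archive."""
    base = archive_path
    for ext in (".tar.gz", ".tar"):
        if base.endswith(ext):
            base = base[: -len(ext)]
            break
    os.makedirs(base, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        tar.extractall(base)
    os.remove(archive_path)  # mirror the documented deletion of the original archive
    return base
```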
Because prefetch sometimes exits without error despite not downloading all requested files, the script will attempt to download the files and compare the results against the manifest; if not all files were downloaded initially, it will retry up to 3 times. Once all files are successfully downloaded, it copies them to the final requested `outdir`.
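The retry-and-verify loop described above can be sketched as below. The function name and the representation of the manifest (a set of expected file names) are assumptions for illustration, not `fetch.py`'s actual interface.

```python
import os
from typing import Callable, Set

def fetch_with_retries(download: Callable[[], None],
                       expected: Set[str],
                       outdir: str,
                       max_retries: int = 3) -> bool:
    """Run `download`, then verify the output directory against the expected
    file names; retry up to `max_retries` times, since prefetch can exit
    cleanly without having fetched everything."""
    for _ in range(1 + max_retries):
        download()
        present = set(os.listdir(outdir))
        if expected <= present:
            return True  # everything listed in the manifest is on disk
    return False
```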
Note that if the `fetch.py` script crashes for some reason, you will have to restart from the beginning.
A WDL workflow is also provided to download the files. The WDL automatically untars the files and deletes the original archives (by passing the `--untar` argument to `fetch.py` under the hood).
The inputs to the WDL are as follows:

Required inputs:

Argument | Description |
---|---|
`ngc_file` | The path to the dbGaP project key for your dbGaP application |
`cart_file` | A cart file generated by the dbGaP File Selector |
`manifest_file` | A manifest file generated by the dbGaP File Selector |
`output_directory` | The output directory where the data should be saved |
Optional inputs:

Argument | Description |
---|---|
`disk_gb` | The hard disk size, in GB, of the instance to use for downloading and untarring. If downloading a large volume of files, you may need to increase this value. (Default: 50) |
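An inputs JSON for the WDL might look like the following. The workflow name `fetch_dbgap` used in the keys and the `gs://` output path are guesses for illustration; check the actual workflow name and expected paths on Dockstore before using this.

```json
{
  "fetch_dbgap.ngc_file": "prj_12345.ngc",
  "fetch_dbgap.cart_file": "cart.krt",
  "fetch_dbgap.manifest_file": "manifest.txt",
  "fetch_dbgap.output_directory": "gs://my-bucket/dbgap-data",
  "fetch_dbgap.disk_gb": 100
}
```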
The workflow can be found on Dockstore.
Note that the project key (`--ngc` or `ngc_file`) is sensitive; do not share it with people who are not covered by your dbGaP application, as it will allow them to download data.
We recommend that you do not put the project key file in a Terra/AnVIL workspace that you are planning to share with other people.
Instead, store it in a more protected workspace that is only shared with people covered by the dbGaP application.