Downloading Large-Scale Datasets
at the GDC
23 October 2023
Bill Wysocki, Ph.D. GDC User Services Lead
Center for Translational Data Science
University of Chicago
gdc.cancer.gov
Downloading Large-Scale
Datasets at the GDC
Brief Introduction
Data Transfer Tool
API Download
Troubleshooting
Q&A
Introduction to GDC File Downloads
Genomic Data Commons File Download
The NCI's Genomic Data Commons (GDC) provides the cancer research community
with a unified repository and cancer knowledge base that enables data sharing across
cancer genomic studies in support of precision medicine.
In this webinar, large-scale downloads refers to data files larger than 5 GB
Files can be browsed and filtered from the GDC Data Repository
Options for Large File Download
Option 1: Data Transfer Tool
Standalone tool using the command line
Downloads through the GDC API and applies settings automatically
Download from:
https://gdc.cancer.gov/access-data/gdc-data-transfer-tool
Option 2: GDC API
Download directly from GDC API
Accessed through other software (curl in this presentation)
More customizable in terms of settings, less automated
Starting Point 1: One File UUID
One slide image from the TCGA-CESC project
Starting Point 2: Manifest with Many Files
All slide images from TCGA-CESC (open access)
Data Transfer Tool Demo
Token Information
The files we will be downloading today are large and open access
A simulated token will be used for demonstration purposes
Most large-scale downloads involve controlled-access data
The simulated token is not required for these files, and including it will not interfere with the download
Token file: sim_token.txt
Token string (simulated): aaabbbcccdddeeefffggg111222333444555
GDC Data Transfer Tool Commands
One UUID:
./gdc-client download 216feaac-8b0c-468d-991f-0412215e7a02 -t sim_token.txt
a) ./gdc-client runs the Data Transfer Tool
b) download uses the download function
c) 216feaac-8b0c-468d-991f-0412215e7a02 specifies the file UUID
d) -t sim_token.txt specifies the token file
Manifest with many UUIDs:
./gdc-client download -m gdc_manifest.2023-10-16.txt -t sim_token.txt
a) ./gdc-client runs the Data Transfer Tool
b) download uses the download function
c) -m gdc_manifest.2023-10-16.txt specifies the manifest file
d) -t sim_token.txt specifies the token file
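The UUID and manifest forms of the command differ only in how the files are named. The sketch below is one possible way to choose between them. The helper name dtt_command is hypothetical, it assumes gdc-client sits in the current directory as in the demo, and it treats any argument that is not shaped like a UUID as a manifest path. It only prints the command so that it can be reviewed before it is run.

```shell
# Hypothetical helper: print the gdc-client command for a UUID or a manifest.
# Assumption: anything that is not shaped like a UUID (8-4-4-4-12 characters)
# is the path to a manifest file.
dtt_command() {
    target=$1      # a file UUID or the path to a manifest file
    token_file=$2  # path to the token file
    case $target in
        ????????-????-????-????-????????????)
            printf './gdc-client download %s -t %s\n' "$target" "$token_file" ;;
        *)
            printf './gdc-client download -m %s -t %s\n' "$target" "$token_file" ;;
    esac
}

dtt_command 216feaac-8b0c-468d-991f-0412215e7a02 sim_token.txt
dtt_command gdc_manifest.2023-10-16.txt sim_token.txt
```

The two calls print the same commands shown in the demo, one for the single UUID and one for the manifest.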
GDC API Demo
GDC API Commands: Token Management
Store the token string as a variable for use with curl
export MYTOKEN=$(cat sim_token.txt)
Verify that the token string was successfully stored
echo "$MYTOKEN"
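Printing a real token to the terminal exposes it to anyone who can see the screen or the scrollback. As a possible alternative to echoing the whole string, the sketch below confirms that the variable holds something and reports only its length. The function name token_summary is hypothetical.

```shell
# Hypothetical helper: confirm a token was stored without printing the token.
token_summary() {
    token=$1
    if [ -z "$token" ]; then
        echo "token is empty"
        return 1
    fi
    echo "token stored (${#token} characters)"
}

# Usage, after running: export MYTOKEN=$(cat sim_token.txt)
# token_summary "$MYTOKEN"
```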
GDC API Commands
One UUID:
curl -H "x-auth-token: $MYTOKEN" --remote-name --remote-header-name "https://api.gdc.cancer.gov/data/216feaac-8b0c-468d-991f-0412215e7a02?related_files=true"
a) curl runs the curl software; -X GET may be added but is not needed because GET is the default request type
b) -H "x-auth-token: $MYTOKEN" specifies the header with the token string
c) --remote-name --remote-header-name saves the file under the name supplied by the API
d) https://api.gdc.cancer.gov/data/ is the main API URL with the /data endpoint
e) 216feaac-8b0c-468d-991f-0412215e7a02 specifies the UUID
f) ?related_files=true allows index files to be downloaded (BAM and VCF only)
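The pieces of the request can be assembled in a small script so that only the UUID changes between downloads. This is a minimal sketch rather than an official GDC script. The function names data_url and download_one are hypothetical, the --fail flag is an addition that makes curl return an error on HTTP failures, and MYTOKEN is assumed to have been exported as shown earlier.

```shell
GDC_DATA_URL="https://api.gdc.cancer.gov/data"

# Build the /data URL for one UUID. Pass "true" as the second argument to
# request related (index) files, which applies to BAM and VCF files only.
data_url() {
    uuid=$1
    related=${2:-false}
    if [ "$related" = "true" ]; then
        printf '%s/%s?related_files=true\n' "$GDC_DATA_URL" "$uuid"
    else
        printf '%s/%s\n' "$GDC_DATA_URL" "$uuid"
    fi
}

# Download one file, saved under the name supplied by the API.
download_one() {
    curl --fail -H "x-auth-token: $MYTOKEN" \
        --remote-name --remote-header-name "$(data_url "$1" "$2")"
}

# Usage:
# download_one 216feaac-8b0c-468d-991f-0412215e7a02 true
```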
Downloading Multiple Files Using the API
Option 1: Use the API command and loop through a list of UUIDs
Can be performed using bash or Python scripts
Option 2: Pass a JSON-formatted list of UUIDs
Uses a POST request with the header "Content-Type: application/json"
Requires converting the list of UUIDs to a JSON file
Option 3: Use a comma-delimited list to specify multiple UUIDs in one line
Same as the GET request in the demo
Limited by URL length
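Options 1 and 2 can be sketched in shell as follows. The function names and the request.json file name are hypothetical. The sketch assumes the manifest is tab-delimited with a header row and the UUID in the first column, that the POST body has the form {"ids": [...]}, and that MYTOKEN has been exported as shown earlier. Check these assumptions against the GDC API documentation before relying on them.

```shell
# Extract UUIDs from a manifest (assumed: tab-delimited, header row,
# UUID in the first column).
manifest_uuids() {
    tail -n +2 "$1" | cut -f 1
}

# Convert UUIDs read from standard input (one per line) into {"ids": [...]}.
uuids_to_json() {
    awk 'BEGIN { printf "{\"ids\": [" }
         NF    { printf "%s\"%s\"", (n++ ? ", " : ""), $1 }
         END   { print "]}" }'
}

# Option 1: one GET request per UUID.
download_each() {
    manifest_uuids "$1" | while read -r uuid; do
        [ -n "$uuid" ] || continue
        curl --fail -H "x-auth-token: $MYTOKEN" \
            --remote-name --remote-header-name \
            "https://api.gdc.cancer.gov/data/$uuid"
    done
}

# Option 2: one POST request carrying every UUID in a JSON file.
download_batch() {
    manifest_uuids "$1" | uuids_to_json > request.json
    curl --fail --request POST -H "x-auth-token: $MYTOKEN" \
        -H "Content-Type: application/json" --data @request.json \
        --remote-name --remote-header-name "https://api.gdc.cancer.gov/data"
}

# Usage:
# download_each gdc_manifest.2023-10-16.txt
# download_batch gdc_manifest.2023-10-16.txt
```

Option 1 keeps each file as a separate request, so a failed file can be retried on its own. Option 2 sends a single request and avoids the URL length limit of Option 3.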
Final Results: Downloaded Files
Data Transfer Tool
Files are downloaded into folders named after their UUIDs
The md5sum has already been verified by the tool
API Download
Files are downloaded under their respective filenames in your current directory unless otherwise specified
We recommend checking the md5sum against the value listed in the file's properties
The demonstrations in this webinar were run in a macOS terminal and apply to any other Unix-based terminal. All of these functions are also available on Windows.
Documentation and personalized assistance are available
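For API downloads, the md5sum check can be scripted. The sketch below is one way to do it, with hypothetical function names. It uses md5sum where available and falls back to the md5 command found on macOS. The expected value is whatever the GDC lists for the file, which is passed in as the second argument.

```shell
# Print the MD5 checksum of a file using whichever tool is installed.
file_md5() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum "$1" | cut -d ' ' -f 1
    else
        md5 -q "$1"   # macOS
    fi
}

# Compare a downloaded file against the expected checksum.
verify_md5() {
    actual=$(file_md5 "$1")
    if [ "$actual" = "$2" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1 (expected $2, got $actual)"
        return 1
    fi
}

# Usage (replace the placeholders with a real file name and checksum):
# verify_md5 downloaded_file expected_md5_value
```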
Troubleshooting Data Download
Troubleshooting Data Transfer Tool Errors
The GDC Data Transfer Tool can be used by researchers on a wide variety of operating
systems. However, errors can arise due to security settings, connection issues, etc.
Depending on the issue, the error message itself may point to the fix
Examples of informative error messages:
Error: ./gdc-client: No such file or directory
Solution: Run the command from the directory that contains gdc-client, or point the command at that directory
Error: Your token is invalid or expired. Please get a new token from GDC Data Portal
Solution: Investigate the token file
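The "No such file or directory" error can be caught before a download is attempted. The sketch below, with the hypothetical function name find_client, looks for an executable gdc-client in the current directory and then on the PATH, and reports where it was found.

```shell
# Hypothetical helper: locate the gdc-client executable.
find_client() {
    if [ -x ./gdc-client ]; then
        echo "./gdc-client"
    elif command -v gdc-client >/dev/null 2>&1; then
        command -v gdc-client
    else
        echo "gdc-client not found in $(pwd) or on PATH" >&2
        return 1
    fi
}

# Usage:
# client=$(find_client) && "$client" --version
```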
DTT Error: Three Step Troubleshooting Flowchart
The flowchart starts when a user receives an error that does not specify the exact problem
This series of checks allows the user to either solve the issue or narrow it down
Step 1: Check Data Transfer Tool Version
The GDC regularly releases new versions of the Data Transfer Tool that add features and fix bugs
Releases are based on user and developer feedback
The latest version is always available at gdc.cancer.gov
Command: ./gdc-client --version
Step 2: Check Authentication Token
The token is a common source of errors because several things can go wrong with it. A token is valid only when all of the following criteria are met; the check for each is listed after the colon.
The token must be current: reset the token
The token must be correctly parsed: check for spaces or a truncated token
The user must have dbGaP access to the project: check the user profile
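The parsing check can be partly automated. The sketch below, with the hypothetical function name check_token_file, flags a token file that is empty or that contains spaces, tabs, or line breaks inside the token. It cannot tell whether a token is expired or truncated, and it does not check dbGaP access.

```shell
# Hypothetical helper: report formatting problems in a token file.
check_token_file() {
    token_file=$1
    if [ ! -s "$token_file" ]; then
        echo "token file is missing or empty"
        return 1
    fi
    # A clean token is a single unbroken word; spaces, tabs, or line breaks
    # inside it show up as additional words.
    words=$(wc -w < "$token_file" | tr -d ' ')
    if [ "$words" -ne 1 ]; then
        echo "token file contains $words whitespace-separated pieces"
        return 1
    fi
    echo "token file looks well formed"
}

# Usage:
# check_token_file sim_token.txt
```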
Step 3: Download using the GDC API Directly
Download errors with the Data Transfer Tool could arise from software incompatibility but could also stem from connection issues or security settings
A successful download with the API rules out issues with your connection to the GDC
If the download completes through the API, this test may also resolve the download issue
Quick command: curl https://api.gdc.cancer.gov/status
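The quick command can be wrapped so that a script can act on the result. This is a sketch under stated assumptions: the function names are hypothetical, and the response is assumed to be JSON containing a "status" field whose healthy value is "OK". Confirm both against the actual output of the quick command before relying on it.

```shell
# Extract the "status" field from JSON read on standard input.
# Assumption: the response contains a field of the form "status": "...".
status_field() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Succeed only if the API answers and reports the assumed healthy value.
gdc_reachable() {
    status=$(curl --silent --fail --max-time 30 \
        https://api.gdc.cancer.gov/status | status_field)
    [ "$status" = "OK" ]
}

# Usage:
# if gdc_reachable; then echo "GDC API reachable"; else echo "cannot reach GDC API"; fi
```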
GDC Help Desk
Send an email to support@nci-gdc.datacommons.io for assistance with data download
Provide information you gathered from the previous steps, and we can help you diagnose the
issue
The GDC Help Desk is also happy to help walk you through any of the previous steps
outlined here
We also recommend reaching out if you are using an operating system other than Windows,
macOS, or Ubuntu
Useful Links: GDC Documentation
https://docs.gdc.cancer.gov
Useful Links: GDC Website
https://gdc.cancer.gov
Useful Links: Additional Support
Questions?
1-800-4-CANCER
U.S. Department of Health & Human Services
National Institutes of Health | National Cancer Institute
Produced October 2023
https://www.cancer.gov/