Investigate VCF files from the GDC #80

Open
agduncan94 opened this issue Aug 6, 2020 · 3 comments

agduncan94 commented Aug 6, 2020

Investigate whether VCF files downloaded from the GDC portal work with the vcf track type.

Note: VCF tracks require the VCF file to be bgzip-compressed and to have a tabix index file.

There are two parts to this.

  1. Download a VCF file and generate the index file locally (you may need to run bgzip first). Using these two files, create a VCF track. Does this work?
  2. Find a VCF file with an existing index. Set the urlTemplate of a VCF track (and the index path) to these remote locations. This requires using the authentication token.

Use tabix (part of HTSlib, the library behind samtools) to create the index file.

Post the results of the investigation as a comment here.

@agduncan94

The GDC API has the concept of related files, which might be relevant for finding associated index files.
https://docs.gdc.cancer.gov/API/Users_Guide/Downloading_Files/#related-files
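
For anyone picking this up, here is a rough sketch of what a download request with related files might look like. It assumes the public `https://api.gdc.cancer.gov/data/<uuid>` endpoint, the `related_files=true` query parameter described on the page linked above, and the `X-Auth-Token` header for controlled-access files. The function name is made up for illustration, and whether VCF index files are actually included among the related files is exactly what still needs checking.

```python
import urllib.parse
import urllib.request

# Assumed public GDC data endpoint.
GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"


def build_gdc_download_request(file_uuid, token=None, related_files=False):
    """Build (but do not send) a GDC download request for one file UUID.

    Hypothetical helper: the name and signature are illustrative only.
    """
    url = f"{GDC_DATA_ENDPOINT}/{urllib.parse.quote(file_uuid)}"
    if related_files:
        # Ask the API to bundle related files with the download.
        url += "?related_files=true"
    request = urllib.request.Request(url)
    if token:
        # Controlled-access files need the user's GDC token.
        request.add_header("X-Auth-Token", token)
    return request
```

Passing the returned request to `urllib.request.urlopen` would perform the actual download. Building the request separately makes it easy to inspect the URL and headers first.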

@agduncan94 changed the title from "Investigate VCF files form the GDC" to "Investigate VCF files from the GDC" on Aug 10, 2020
@agduncan94 self-assigned this on Aug 10, 2020
@agduncan94

Part 1 - download a VCF and generate the index locally

  • Looked at file 6abc7d24-74d1-4e62-975c-753aec620201.
  • It has no index file, so I generated one with “tabix -p vcf file.vcf.gz”.
  • Downloading from the UI produces a folder structure. Would downloading via the API return just the file?
  • If the VCF file is not already .gz, run “bgzip -c file.vcf > file.vcf.gz” before creating the index.
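
One caveat on the last bullet: tabix needs BGZF (bgzip) compression, so a file that was compressed with plain gzip also has to be recompressed even though it ends in .gz. Below is a minimal sketch for telling the two apart from the first bytes of a file. It relies on the BGZF convention of a gzip header carrying a `BC` extra subfield. The function name is hypothetical.

```python
import struct


def is_bgzf(header: bytes) -> bool:
    """Return True if the bytes start a BGZF block (gzip with a 'BC' extra subfield)."""
    if len(header) < 12 or header[:2] != b"\x1f\x8b":
        return False  # not gzip at all
    if not header[3] & 0x04:
        return False  # FEXTRA flag not set, so this is plain gzip
    (xlen,) = struct.unpack("<H", header[10:12])
    extra = header[12:12 + xlen]
    pos = 0
    # Walk the extra subfields looking for SI1='B', SI2='C' with a 2-byte payload.
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if (si1, si2) == (ord("B"), ord("C")) and slen == 2:
            return True
        pos += 4 + slen
    return False
```

Reading the first 64 bytes or so of the downloaded file and passing them in should be enough in practice, since bgzip writes the `BC` subfield first.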

I used the following configuration:
[tracks.vcf]
urlTemplate=6abc7d24-74d1-4e62-975c-753aec620201.vep.vcf.gz
storeClass=JBrowse/Store/SeqFeature/VCFTabix
type=JBrowse/View/Track/HTMLVariants

This assumes the index file is named 6abc7d24-74d1-4e62-975c-753aec620201.vep.vcf.gz.tbi and sits in the same directory (jbrowse/data) as the VCF file.
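
If we end up testing several files, a small helper could generate the stanza and the index name it implies. This is only a sketch that reproduces the configuration above and the ".tbi next to the VCF" naming assumption. The function name and the `label` parameter are made up for illustration.

```python
def vcf_track_stanza(vcf_name, label="vcf"):
    """Render a tracks.conf stanza for a VCF file and the index name it implies.

    Hypothetical helper; mirrors the hand-written stanza from this comment.
    """
    stanza = "\n".join([
        f"[tracks.{label}]",
        f"urlTemplate={vcf_name}",
        "storeClass=JBrowse/Store/SeqFeature/VCFTabix",
        "type=JBrowse/View/Track/HTMLVariants",
    ])
    # Assumption from this thread: the index sits beside the VCF with ".tbi" appended.
    expected_index = vcf_name + ".tbi"
    return stanza, expected_index
```

Calling it with `6abc7d24-74d1-4e62-975c-753aec620201.vep.vcf.gz` gives back the stanza shown above together with the index file name that has to exist in the same directory.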

@agduncan94 removed their assignment on Aug 26, 2020
@GFJHogue self-assigned this on Sep 21, 2020
@GFJHogue

Relevant points from #91 investigation:

  • 38% of the VCFs are gzipped AND have an index
  • 55% of the VCFs are gzipped AND have NO index
  • 7% of the VCFs are NOT gzipped AND have NO index
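
To make the three categories concrete, here is a rough sketch of the local preparation each one would need, using the bgzip and tabix commands from the part 1 comment. It assumes "gzipped" means bgzip-compressed, which this thread has not confirmed. The function name and the default file name are illustrative only.

```python
def prep_steps(is_gzipped, has_index, vcf="file.vcf"):
    """List the shell commands needed before a VCF can back a VCF track."""
    steps = []
    gz = vcf if is_gzipped else vcf + ".gz"
    if not is_gzipped:
        # Uncompressed VCFs must be bgzip-compressed first.
        steps.append(f"bgzip -c {vcf} > {gz}")
    if not is_gzipped or not has_index:
        # A freshly compressed file always needs a new index.
        steps.append(f"tabix -p vcf {gz}")
    return steps
```

Under that assumption the 38% group needs no preparation, the 55% group needs only the tabix step, and the 7% group needs both.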

For the remaining part 2, I'll extend the work in #98 to load remote VCFs that are gzipped and indexed, using a GDC token.
