-
Notifications
You must be signed in to change notification settings - Fork 9
/
Copy pathLecture16_VariantCalls.Rmd
95 lines (77 loc) · 3.42 KB
/
Lecture16_VariantCalls.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
---
title: "MCB536 Lecture 16 (Part 3): Read Variant Call Format (VCF) Files in R"
author: "Gavin Ha"
date: "11/30/2021"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
We will learn to read VCF files within R using a publicly available dataset of genomic variant calls for the infamous individual, NA12878. The Genome-in-a-Bottle Consortium has compiled consensus variant calls on this individual's genome and released this data for researchers to use. One of the main purposes of this data is to provide a benchmark for those to develop computational tools and analysis of human genomes. See https://github.com/genome-in-a-bottle/giab_latest_release
Variant Call Format (VCF) is a very common format for representing genomic variation data. See Lecture 16: Slides 12.
## 0. Install and load the `VariantAnnotation` Bioconductor package
Load the `VariantAnnotation` package
```{r, message=FALSE}
#BiocManager::install("VariantAnnotation")
library(VariantAnnotation)
```
## 1. Prepare parameters for reading VCF file.
There are a lot of variants in this file `GIAB_highconf_v.3.3.2.vcf.gz`, so we want to restrict to a smaller region for this example.
### a. Setup parameters for scanning the VCF file.
First, we need to set up a `ScanVcfParam` object to read within `17:35500000-36000000`.
```{r}
vcfFile <- "GIAB_highconf_v.3.3.2.vcf.gz"
vcfHead <- scanVcfHeader(vcfFile)
myGRange4 <- GRanges(seqnames = "17", ranges = IRanges(start = 35500000, end = 36000000))
vcf.param <- ScanVcfParam(which = myGRange4)
```
## 2. Read the VCF file.
```{r}
vcf <- readVcf(vcfFile, genome = "hg19", param = vcf.param)
```
The `vcf` variable is of class `CollapsedVCF` and will contain header information and data. Let's see what information has been parsed by `readVcf`.
## 3. Extract the contents of the VCF entries.
### a. Return the variants in this region as a `GRanges` object.
The `rowRanges` function will return a `GRanges` object containing the coordinates, REF/ALT bases, quality, and filtering status of the variants.
```{r}
rowRanges(vcf)
```
### b. Inspect the header information
The `INFO` column in the original VCF text file contains a semi-colon delimited set of custom fields with flexible format that algorithms will output. Here, it is parsed into usable format. First, let's look at what fields are available from the header.
```{r, eval = FALSE}
info(vcf) # returns a DataFrame object
```
The `FORMAT` column in the original VCF text file contains the format and description of the genotype fields. Let's see what these are.
```{r}
geno(header(vcf))
```
### c. Inspect the genotype, read depth, and allele depth inforation.
To see the genotype `GT`, read depth `DP`, and allele depth `AD`, we access the the list.
```{r}
geno(vcf)$GT[1:5]
geno(vcf)$DP[1:5]
geno(vcf)$AD[1:5]
```
### d. Combine all `geno` fields into a single table.
You can also combine all fields into a `data.frame` object. But this code only works if the VCF contains a single sample.
```{r}
genoData <- data.frame(do.call(cbind, geno(vcf)))
colnames(genoData) <- rownames(geno(header(vcf)))
```
## Exercise 3: Reading variants from a VCF file.
### a. Create a range for `8:128747680-128753680`.
```{r}
# GRanges()
```
### b. Setup parameters to read VCF.
```{r}
# ScanVcfParam
```
### c. Read the VCF file at `8:128747680-128753680`
```{r}
# readVcf
```
### d. What is the RS id, genotype (`GT`) and depth (`DP`) at the SNP in this locus?
```{r}
# geno()
```