-
Notifications
You must be signed in to change notification settings - Fork 9
/
Copy pathhomework07.Rmd
122 lines (86 loc) · 4.67 KB
/
homework07.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
---
title: "MCB536: Homework 7"
author: "Student Name"
date: "11/22/2022"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
***Total: 42 points***
Complete this homework by writing R code to complete the following tasks. Keep in mind:
i. Empty chunks have been included where code is required
ii. For Problem 2e, you should include a image (screen shot) instead of providing code
iii. This homework requires use of data files:
- `BRCA.genome_wide_snp_6_broad_Level_3_scna.seg` (Problems 1, 2)
- `GIAB_highconf_v.3.3.2.vcf.gz` (Problem 3)
iv. You will be graded on your code and output results (knitted .html file). The assignment is worth 50 points total; partial credit can be awarded.
This assignment is due on **Dec 1, 2022**.
For additional resources, please refer to these links:
Problems 1 & 2:
- https://www.bioconductor.org/packages/devel/bioc/vignettes/plyranges/inst/doc/an-introduction.html
- https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html
Problem 3:
- https://bioconductor.org/packages/release/bioc/vignettes/Rsamtools/inst/doc/Rsamtools-Overview.pdf
Problem 4:
- https://bioconductor.org/packages/release/bioc/vignettes/VariantAnnotation/inst/doc/VariantAnnotation.pdf
# Problem 1: Overlaps between genomic regions and copy number alterations. (14 points total)
### Preparation
Load copy number segment results as shown in *2.1 BED format* of *Lecture16_GenomicData.Rmd*. You will use the same file as in the lecture notes, `BRCA.genome_wide_snp_6_broad_Level_3_scna.seg`. Here is code to get you started.
```{r, message=FALSE}
library(tidyverse)
library(GenomicRanges)
library(plyranges)
segs <- read.delim("BRCA.genome_wide_snp_6_broad_Level_3_scna.seg", as.is = TRUE)
mode(segs$Chromosome) <- "character"
segs[segs$Chromosome == 23, "Chromosome"] <- "X"
segs.gr <- as(segs, "GRanges")
```
### a. Find the segments in `segs.gr` that have *any* overlap with the region `chr8:128,746,347-128,755,810` (4 points)
Print out the first five unique TCGA IDs.
```{r}
```
### b. Find the mean of the `Segment_Mean` values for copy number segments that have *any* overlap with the region chr17:37,842,337-37,886,915. (4 points)
```{r}
```
### c. Find the patient sample distribution of copy number for `PIK3CA` (hg19). (6 points)
Find the counts of samples with deletion (D; `Segment_Mean < -0.3`), neutral (N; `Segment_Mean >= -0.3 & Segment_Mean <= 0.3`), gain (G; `Segment_Mean > 0.3`) segments that have `any` overlap with `PIK3CA` gene coordinates.
```{r}
```
# Problem 2: Frequency of copy number alteration events within genomic regions. (12 points total)
This problem will continue to use the copy number data stored in `segs.gr`.
### a. Create a genome-wide tile of 1Mb windows for the human genome (`hg19`). (4 points)
See *3.1 Tiling the genome* of *Lecture16_GenomicData.Rmd* for hints.
```{r}
```
### b. Find the 1Mb window with the most frequent overlapping deletions. (4 points)
Find the 1Mb windows with `any` overlap with deletion copy number segments. Assume a deletion segment is defined as a segment in `segs.gr` having `Segment_Mean < -0.3`.
Return one of the 1Mb window `Granges` entry with the highest frequency (count) of deletion segments.
Hint: Subset the `segs.gr` to only rows with `Segment_Mean < -0.3`.
```{r}
```
### c. Visually inspect the deletion overlap result from part (b) using IGV. (4 points)
Provide a screen shot of IGV at the 1Mb window with the most frequent overlap with deletion segments. The image should include the segments from `BRCA.genome_wide_snp_6_broad_Level_3_scna.seg` loaded.
# Problem 3: Reading and annotating genomic variants (16 points total)
### Preparation
```{r, message=FALSE}
library(VariantAnnotation)
vcfFile <- "GIAB_highconf_v.3.3.2.vcf.gz"
```
### a. Load variant data from VCF file `GIAB_highconf_v.3.3.2.vcf.gz` for `chr8:128,700,000-129,000,000`. (4 points)
Note: use genome build `hg19`.
```{r}
```
### b. Combine the fields of the VCF genotype information into a table. (4 points)
You may use your choice of data objects (e.g. `data.frame`).
```{r}
```
### c. Retrieve the following information at chr8:128747953. (8 points)
Print out the SNP ID (i.e. "rs ID"), reference base (`REF`), alterate base (`ALT`), genotype (`GT`), depth (`DP`), allele depth (`ADALL`), phase set (`PS`).
Hints:
i. `REF` and `ALT` are in the output of `rowRanges(vcf)`. See Section `3a` in `Lecture16_VariantCalls.Rmd`
ii. To get the sequence of `DNAString`, use `as.character(x)`.
ii. To get the sequence of `DNAStringSet`, use `as.character(unlist(x))`.
iii. To expand a list of information for `geno`, use `unlist(x)`.
```{r}
```