Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bam to counts conversion #4

Merged
merged 42 commits into from
Oct 28, 2024
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
Show all changes
42 commits
Select commit Hold shift + click to select a range
7a6a06b
add changes
cansavvy Oct 24, 2024
53bf35b
document
cansavvy Oct 24, 2024
65357d2
fix example
cansavvy Oct 24, 2024
07880e1
Some fixes include paired filtered out
cansavvy Oct 24, 2024
ae84a3e
Proof of concept validation Rmd
cansavvy Oct 24, 2024
e00e752
Add a basic unit test
cansavvy Oct 24, 2024
4d23d76
Add timer
cansavvy Oct 25, 2024
9b310f2
Missing comma
cansavvy Oct 25, 2024
c3fadc6
Add missing dependencies
cansavvy Oct 25, 2024
34a7e61
rearrange anypaired steps
cansavvy Oct 25, 2024
6dcb773
Update dependencies for real
cansavvy Oct 25, 2024
329e279
missing a comma
cansavvy Oct 25, 2024
acb5856
Install from github maybe?
cansavvy Oct 25, 2024
3fe4eff
update remotes spec
cansavvy Oct 25, 2024
9bdc6aa
Add Rsamtools
cansavvy Oct 28, 2024
d3a3101
Add biocmanager
cansavvy Oct 28, 2024
3e33b4b
3.15
cansavvy Oct 28, 2024
4b5a5f7
Delete old files
cansavvy Oct 28, 2024
3fc068e
update some handling
cansavvy Oct 28, 2024
0e77040
Updates
cansavvy Oct 28, 2024
6c0b887
updates
cansavvy Oct 28, 2024
e786e53
Updates
cansavvy Oct 28, 2024
70737cc
Remotes
cansavvy Oct 28, 2024
b27495b
Get rid of "--as_cran"
cansavvy Oct 28, 2024
9de88b1
Install tidyr
cansavvy Oct 28, 2024
625e95b
It doesn't like the zip thing
cansavvy Oct 28, 2024
9d5e52e
set mirror
cansavvy Oct 28, 2024
b5535aa
Fix Rsamtools
cansavvy Oct 28, 2024
9df2d4d
Try this?
cansavvy Oct 28, 2024
25c6ca9
install tidyr
cansavvy Oct 28, 2024
42760b1
Fix install step
cansavvy Oct 28, 2024
9f18004
Try this instead
cansavvy Oct 28, 2024
9495f92
Specify cran mirror
cansavvy Oct 28, 2024
92310f9
Try to get Ubuntu tests working
cansavvy Oct 28, 2024
9de05ae
Rhtslib
cansavvy Oct 28, 2024
f1b5041
Try this
cansavvy Oct 28, 2024
c1e861d
3.15
cansavvy Oct 28, 2024
cff0937
Add devtools too
cansavvy Oct 28, 2024
ee81b30
alter gha
cansavvy Oct 28, 2024
7c37407
Fix spacing
cansavvy Oct 28, 2024
64b9403
Update
cansavvy Oct 28, 2024
0a64475
Add bioc friendly GHA
cansavvy Oct 28, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 4 additions & 12 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,22 +1,14 @@
Type: Package
Package: gimap
Title: Calculate Genetic Interactions for CRISPR Targets
Package: pgmap
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also did some package docs updates

Title: Obtain counts for a paired guided CRISPR screen
Version: 0.1.0
Authors@R: c(
person("Candace", "Savonen", ,
c("[email protected]","[email protected]"),
role = c("aut", "cre")
),
person("Howard", "Baek", ,
c("[email protected]","[email protected]"),
role = c("aut")
),
person("Kate", "Isaac", ,
c("[email protected]"),
role = c("aut")
)
)
Description: This package allows a pipeline to be built for analyzing genetic interactions of paired guide CRISPR screens. It is based off of original research from the Alice Berger lab at Fred Hutchinson Cancer Center.
Description: This package allows counts to be obtained from fastq files from a CRISPR paired guide screen. It is based off of original research from the Alice Berger lab at Fred Hutchinson Cancer Center.
License: GPL-3
URL: https://github.com/FredHutch/gimap
BugReports: https://github.com/FredHutch/gimap/issues
Expand All @@ -42,6 +34,6 @@ Suggests:
roxygen2,
Config/testthat/edition: 3
Encoding: UTF-8
RoxygenNote: 7.3.1
RoxygenNote: 7.3.2
LazyData: true
VignetteBuilder: knitr
29 changes: 7 additions & 22 deletions NAMESPACE
Original file line number Diff line number Diff line change
@@ -1,28 +1,13 @@
# Generated by roxygen2: do not edit by hand

export(calc_crispr)
export(calc_gi)
export(calc_counts)
export(example_data)
export(example_data_folder)
export(get_example_data)
export(gimap_annotate)
export(gimap_filter)
export(gimap_normalize)
export(run_qc)
export(setup_data)
export(sample_count)
import(dplyr)
import(ggplot2)
import(kableExtra)
importFrom(dplyr,across)
importFrom(dplyr,if_any)
importFrom(dplyr,mutate)
importFrom(ggplot2,ggplot)
importFrom(ggplot2,labs)
importFrom(janitor,clean_names)
importFrom(magrittr,"%>%")
importFrom(pheatmap,pheatmap)
importFrom(Rsamtools,ScanBamParam)
importFrom(Rsamtools,scanBam)
importFrom(purrr,map)
importFrom(purrr,pmap)
importFrom(purrr,reduce)
importFrom(stringr,word)
importFrom(tidyr,pivot_longer)
importFrom(tidyr,pivot_wider)
importFrom(tidyr,unite)
importFrom(tibble,rownames_to_column)
85 changes: 0 additions & 85 deletions R/00-setup_data.R

This file was deleted.

171 changes: 171 additions & 0 deletions R/bam_to_counts.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,171 @@
#' Obtain the counts for a group of samples
#' @description This function takes a forward and reverse bam files for a group of samples and returns the counts and stats.
#' @param bam_dir a file path to a directory where the bam files associated with the sample names will be stored.
#' @param sample_name a character vector that indicates the name of the samples as they are listed in the file names.
#' @importFrom purrr pmap map reduce
#' @import dplyr
#' @importFrom tibble rownames_to_column
#' @export
#' @examples \dontrun{
#'
#' # This will be the directory that contains all the file path
#' bam_dir <- file.path(example_data_folder(), "bam")
#'
#' # These MUST be names that are listed in the bam file name itself.
#' # There needs to be exactly 2 bam files per sample
#' sample_names <- c("sample1", "sample2", "sample3")
#'
#' counts_df <- calc_counts(
#' bam_dir = bam_dir,
#' sample_names = sample_names
#')
#'
#' # We can write these to CSV files like this
#' readr::write_csv(counts_df$counts, "counts_pgmap.tsv")
#' readr::write_csv(counts_df$stats, "stats_pgmap.tsv")
#'
#' }
calc_counts <- function(bam_dir, sample_names) {

# Get the file paths for each sample name
sample_files <- sapply(
sample_names, function(file_name) {

files <- list.files(path = bam_dir, pattern = file_name, full.names = TRUE)

if (length(files) < 2) stop(file_name, ": does not have 2 files associated with it.")
if (length(files) > 2) stop(file_name, ": has more than 2 files associated with it.")

return(files)

}, USE.NAMES = TRUE)

# Reformat so samples are rows and files are in columns
sample_df <- sample_files %>%
t() %>%
data.frame() %>%
dplyr::rename(forward_file = X1,
reverse_file = X2) %>%
tibble::rownames_to_column("sample_name")

# Now use the custom function to get the counts for each sample from the pair of bam files
counts <- purrr::pmap(sample_df, function(sample_name, forward_file, reverse_file) {
sample_count(bam_1 = forward_file,
bam_2 = reverse_file,
sample_name = sample_name)
})
# Bring along sample names
names(counts) <- sample_names

# Then condense all samples to one stats data frame
stats_df <- purrr::map(counts, "stats") %>%
dplyr::bind_rows(.id = "sample")

# Then condense all samples to one counts data frame
counts_df <- purrr::map(counts, "counts", right_join) %>%
purrr::reduce(dplyr::full_join)

# Return as a list of two data frames
return(list(counts = counts_df, stats = stats_df))
}

#' Obtain the counts for a single sample
#' @description This function takes a forward and reverse bam file and returns the counts
#' @param bam_1 a file path to one of the bam files for the sample
#' @param bam_2 a file path to the other one of the bam files for the sample
#' @param sample_name a single character that indicates the name of the sample "sample1"
#' @import dplyr
#' @importFrom Rsamtools ScanBamParam scanBam
#' @export
#' @examples \dontrun{
#'
#' bam_dir <- file.path(example_data_folder(), "bam")
#'
#' counts <- sample_count(
#' bam_1 = file.path(bam_dir, "pgMAP_tutorial_gRNA1_trimmed_sample1_aligned.bam"),
#' bam_2 = file.path(bam_dir, "pgMAP_tutorial_gRNA2_trimmed_sample1_aligned.bam")
#' sample_name = "sample1")
#'
#' }

sample_count <- function(bam_1, bam_2, sample_name) {

# Print out message
message(paste("Parsing reads for", sample_name))

# define parameters for reading in BAM files, then read them in:
# qname = the name of the mapped read (query template name)
# rname = the name of the reference sequence that the read aligned to (reference sequence name)
param <- ScanBamParam(
## restrict to mapped reads
flag = scanBamFlag(isUnmappedQuery = FALSE),
## only read in the necessary fields
what = c("qname", "rname")
)

# Read in the BAM files
bam_1 <- scanBam(bam_1, param = param) %>%
as.data.frame()
bam_2 <- scanBam(bam_2, param = param) %>%
as.data.frame()

# Join forward and reverse bams together.
paired_df <- dplyr::inner_join(bam_1, bam_2,
by = "qname",
relationship = "many-to-many",
suffix = c("_1", "_2")
) %>%
dplyr::mutate(paired = rname_1 == rname_2)

# If a given set of reads have one or more correct pairings, then keep the
# correct pairings and discard all incorrect pairings for those reads.
qname2anypaired <- sapply(split(paired_df$paired, f = paired_df$qname), any)

# Calculate weights
# This contains code from the original pipeline that I don't really understand but does seem to effect final numbers
weights_df <- paired_df %>%
dplyr::left_join(
tibble(
"qname" = names(qname2anypaired),
"any_paired" = qname2anypaired
),
by = "qname"
) %>%
filter(paired | !any_paired) %>%
select(-any_paired) %>%
dplyr::group_by(qname) %>%
dplyr::count() %>%
dplyr::mutate(weight = 1 / n)

# Join weights to original data
paired_df <- paired_df %>%
dplyr::left_join(weights_df, by = c("qname")) %>%
dplyr::select(qname, rname = rname_1, n, weight, paired)

# calculating the stats of how many were paired vs unpaired
stats <- sum(paired_df$paired) / nrow(paired_df)
total <- sum(paired_df$weight)
paired <- paired_df %>%
dplyr::filter(paired) %>%
dplyr::pull(weight) %>%
sum()

# Getting the counts
counts_df <- paired_df %>%
dplyr::filter(paired) %>%
dplyr::select("id" = rname, weight) %>%
dplyr::group_by(id) %>%
dplyr::summarize(sample_name = sum(weight)) %>%
dplyr::ungroup()

# rename so sample id is in the column name
colnames(counts_df) <- c("id", sample_name)

stats_df <- data.frame(
perc_paired_correctly = stats,
total_counts = total,
total_paired_counts = paired
)
return(list(counts = counts_df, stats = stats_df))
}

Loading