Improved doublet detection in call_lineages
#225
This PR makes a number of improvements to the `call_lineages` step of the preprocessing pipeline. These changes are based on my experience processing a dataset with high ambient RNA and a significant proportion of doublets.

- Adds a `min_umi_per_intbc` parameter to filter the allele table, which is useful for removing ambient intBC molecules.
- Removes the assumption in `assign_lineage_groups` that the size of lineage groups is strictly decreasing, since this may not be true with a high `kinship_thresh`.
- Changes the doublet detection algorithm to use the kinship scores calculated by `score_lineage_kinships`. I have found that these kinship scores are a more reliable way to detect doublets than the current `filter_inter_doublets` function, since they take into account UMIs instead of just the binarized intBCs.
- Adds a `keep_doublets` parameter to allow the user to keep the doublets in the allele table, which makes it much easier to tune the `doublet_kinship_thresh` parameter.

The API remains the same and the old doublet detection algorithm can still be run for now, but I've added a warning message that it will be deprecated in 2.1.0.

What this PR does not address is the issue that doublets can silently slip through `call_lineages`: the doublet alleles are filtered out by `min_intbc_thresh`, which makes the cells look like singlets. It would be better if this failure mode were avoided, but I'm not sure how to do it while still filtering.

@mattjones315 if you send me test data I can compare this algorithm to the old one. I think it's an improvement for most cases, but it would be good to test it. I'm also open to implementing a more complex doublet detection algorithm using a mixture model if needed. I'll add tests once we solidify the doublet detection algorithm.
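
For anyone reviewing or tuning the new parameters, the two filters described above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the code in this PR: the helper names `filter_by_umi` and `flag_doublets` are hypothetical, the allele table is assumed to have one row per (cell, intBC) with a `UMI` count column, the kinship table is assumed to be cells by lineage groups with normalized scores, and the rule "second-highest kinship at or above `doublet_kinship_thresh`" is an illustrative guess at how a kinship-based call could work.

```python
import numpy as np
import pandas as pd


def filter_by_umi(allele_table: pd.DataFrame, min_umi_per_intbc: int) -> pd.DataFrame:
    """Drop (cell, intBC) rows supported by fewer than `min_umi_per_intbc` UMIs.

    Assumes a `UMI` column holding the UMI count per row (an assumption).
    """
    return allele_table[allele_table["UMI"] >= min_umi_per_intbc]


def flag_doublets(kinship: pd.DataFrame, doublet_kinship_thresh: float) -> pd.Series:
    """Flag cells whose second-highest kinship score reaches the threshold.

    `kinship` is assumed to be cells x lineage groups. The decision rule is
    hypothetical and only illustrates thresholding on kinship scores.
    """
    scores = np.sort(kinship.to_numpy(dtype=float), axis=1)  # ascending per row
    if scores.shape[1] > 1:
        second = scores[:, -2]  # second-largest score per cell
    else:
        second = np.zeros(len(kinship))  # one lineage group: nothing to compare
    return pd.Series(second >= doublet_kinship_thresh, index=kinship.index)


# Toy data: cell A has one weakly supported (likely ambient) intBC.
allele_table = pd.DataFrame(
    {
        "cellBC": ["A", "A", "A", "B", "B"],
        "intBC": ["i1", "i2", "i3", "i1", "i4"],
        "UMI": [10, 8, 1, 6, 7],
    }
)
filtered = filter_by_umi(allele_table, min_umi_per_intbc=2)

# Toy kinship scores: cell B is split between two lineage groups.
kinship = pd.DataFrame(
    {"LG1": [0.90, 0.55], "LG2": [0.10, 0.45]},
    index=["A", "B"],
)
is_doublet = flag_doublets(kinship, doublet_kinship_thresh=0.35)
```

Returning the flags rather than dropping the flagged cells mirrors the intent of `keep_doublets`: the flagged cells stay available for inspection while the threshold is being tuned.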