Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Put non-count data in CDS objects #7

Open
LuckyMD opened this issue Apr 17, 2019 · 8 comments
Open

Put non-count data in CDS objects #7

LuckyMD opened this issue Apr 17, 2019 · 8 comments
Labels
bug Something isn't working

Comments

@LuckyMD
Copy link

LuckyMD commented Apr 17, 2019

Hi,

I'm afraid this is more of a question than a bug report. Thus, I am deviating from the template.

I haven't been able to find out how to load pre-processed data into cds objects without getting errors in downstream analysis.

I have patient data integrated with MNN, on which I would like to run garnett. I have run this pre-processing in scanpy, and so am transferring data from python into R to use garnett. Following the monocle and garnett tutorials, I generate the CDS object with:

pd <- AnnotatedDataFrame(data = obs_mon)
fd <- AnnotatedDataFrame(data = var_mon)
colnames(data_mat_mon) <- rownames(pd)
rownames(data_mat_mon) <- rownames(fd)
npcs_cds <- newCellDataSet(cellData=data_mat_mon, phenoData=pd, featureData=fd)

Where obs_mon, and var_mon are pandas dataframes converted to R dataframes, and data_man_mon is a numpy array converted to an R dataframe.

As the data is pre-processed I immediately load my marker gene file and check the markers, and then get the following error:
Error: Must run estimateSizeFactors() on cds before calling check_markers

After running npcs_cds <- estimateSizeFactors(npcs_cds), and I get the following error:

Error in quantile.default(newX[, i], ...) : 
  missing values and NaN's not allowed if 'na.rm' is FALSE
Calls: <Anonymous> ... aggregate_positive_markers -> apply -> FUN -> quantile.default

I assume these errors have to do with certain slots in the cds object not being filled (e.g., reduced dimensions, size factors, dispersions, etc.).

It would be great if you could tell me which slots are missing and how I can assign data to these slots. As I have processed the data in scanpy, I should be able to easily transfer this across.

@LuckyMD LuckyMD added the bug Something isn't working label Apr 17, 2019
@hpliner
Copy link
Collaborator

hpliner commented Apr 17, 2019

I am actually guessing that this is because of NAs in your expression matrix rather than a slot issue. Can you check that sum(is.na(exprs(npcs_cds)) is equal to zero?

@LuckyMD
Copy link
Author

LuckyMD commented Apr 17, 2019

Thanks for the quick reply. I just checked and it is zero.

I've run quite a bit of analysis on this data already in scanpy, so I'm pretty sure it's not a data issue. Of course something can have happened during the conversion to R via anndata2ri, but that has worked with previous datasets.

I've previously tried using pre-processed data with monocle and ran into the same issues unfortunately.

Have you run garnett without doing the pre-processing in a cds object?

@LuckyMD
Copy link
Author

LuckyMD commented Apr 26, 2019

Hi again. I've gotten garnett to work with count data input. But I was wondering if you could let me know how best to use pre-processed data. I have a strong batch effect in my data and I'm pretty sure the classifier would be better trained without this effect in the data.

Thanks in advance for your help

@LuckyMD
Copy link
Author

LuckyMD commented Apr 29, 2019

I have found some instructions on how to load pre-processed data into a cds object here. The same page says the following:

estimateSizeFactors() and estimateDispersions() will only work, and are only needed, if you are working with a CellDataSet with a negbinomial() or negbinomial.size() expression family.

However, when I set expressionFamily=gaussianff() in my call to newCellDataSet(), check_markers() still throws the following error:

Error: Must run estimateSizeFactors() on cds before calling check_markers

Is this intensional? Is garnett currently not compatible with gaussianff()?

Here is the code I'm running:

#Set up the CellDataSet data structure
pd <- AnnotatedDataFrame(data = obs_mon)
fd <- AnnotatedDataFrame(data = var_mon)
colnames(data_mat_mon) <- rownames(pd)
rownames(data_mat_mon) <- rownames(fd)
npcs_cds_norm <- newCellDataSet(cellData=data_mat_mon, phenoData=pd, featureData=fd, expressionFamily=gaussianff())

marker_file_path <- "garnett_marker_genes.txt"

marker_check_norm <- check_markers(npcs_cds_norm, marker_file_path,
                              db=org.Hs.eg.db,
                              cds_gene_id_type = "SYMBOL",
                              marker_file_gene_id_type = "SYMBOL")

Kind regards

@hpliner
Copy link
Collaborator

hpliner commented Apr 29, 2019

You should still run estimateSizeFactors which I think should still work.

I will note that Garnett has only been tested with count data... it has some internal normalization steps to help it be compatible across datasets that may hurt performance with a prenormalized dataset depending on how it was normalized. I would expect this mostly to be an issue when using pre-trained classifiers or when classifying other count data. But again, it’s gonna depend on your preprocessing steps I think.

Hopefully it will at least run once you have size factors (check that they’re not NAs - you can set NAs to 1 if you’re confident depth across cells is consistent).

Would be curious to hear how it works

@LuckyMD
Copy link
Author

LuckyMD commented Apr 30, 2019

Thanks for the response!

If I run estimateSizeFactors() before, would Garnett then use the size factors for a further normalization step while training the classifier or classifying the cells? I would prefer avoiding a further normalization step if possible. I guess I could always set size factors to 1 to avoid this. Are the size factors just stored in pData(cds)$size_factors?

So far the classification unfortunately hasn't worked particularly well for my data with several different iterations of marker gene sets. One reason is probably due to a lack of pre-processing (apart from cell/gene QC and normalization via estimateSizeFactors()) when I run it directly from count data.

@hpliner
Copy link
Collaborator

hpliner commented May 3, 2019

I think it would be fine to set the size factors to 1, yes they are stored at pData(cds)$Size_Factor, so you can just set pData(cd)$Size_Factor <- 1

Curious if this helps your classification, let me know!

@LuckyMD
Copy link
Author

LuckyMD commented May 6, 2019

I just tried what you suggested and ran estimateSizeFactors() prior to check_markers(). All the size factors are set to 1, but then check_markers() gives me the error:

Error in quantile.default(newX[, i], ...) : 
  missing values and NaN's not allowed if 'na.rm' is FALSE
Calls: <Anonymous> ... aggregate_positive_markers -> apply -> FUN -> quantile.default

any(is.na(pData(cds)$Size_Factors)) however gives me 0. And none of the expression values are NA either (but they are very low). I'm not sure the algorithm works with my MNN-corrected data annoyingly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants