Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

how to use the methylation data when to do boxplot ? #324

Open
Xiangyi-Deng opened this issue Aug 8, 2024 · 4 comments
Open

how to use the methylation data when to do boxplot ? #324

Xiangyi-Deng opened this issue Aug 8, 2024 · 4 comments

Comments

@Xiangyi-Deng
Copy link

Hello Alex,

I have a small question about RRBS methylation data.

It is my question that I post before.

My question this time is how to use RRBS methylation data to do a boxplot based on CpG sites values.

You are analysing RRBS data comparing two groups with two replicates.
You aggregated counts over* tiling windows*, which *sums up the
methylated/unmethylated
base counts o
ver the *tilling windows accross genome (see code

methylKit/R/regionalize.R

Lines 121 to 181 in 388770d

setMethod("regionCounts", signature(object="methylRaw",regions="GRanges"),
function(object,regions,cov.bases,strand.aware,save.db=FALSE,...){
#require(GenomicRanges)
# sort regions
regions <- sortSeqlevels(regions)
regions <- sort(regions,ignore.strand=TRUE)
# overlap object with regions
# convert object to GRanges
if(!strand.aware){
g.meth=as(object,"GRanges")
strand(g.meth)="*"
mat=IRanges::as.matrix( findOverlaps(regions,g.meth ) )
#mat=matchMatrix( findOverlaps(regions,g.meth ) )
}else{
mat=IRanges::as.matrix( findOverlaps(regions,as(object,"GRanges")) )
#mat=matchMatrix( findOverlaps(regions,as(object,"GRanges")) )
}
#require(data.table)
# create a temporary data.table row ids from regions and counts from object
coverage=numCs=numTs=id=covered=NULL
df=data.frame(id = mat[, 1], getData(object)[mat[, 2], c(5, 6, 7)])
dt=data.table(df)
#dt=data.table(id=mat[,1],object[mat[,2],c(6,7,8)] ) worked with data.table 1.7.7
# use data.table to sum up counts per region
sum.dt=dt[,list(coverage=sum(coverage),
numCs =sum(numCs),
numTs =sum(numTs),covered=length(numTs)),by=id]
sum.dt=sum.dt[sum.dt$covered>=cov.bases,]
temp.df=as.data.frame(regions) # get regions to a dataframe
# look for values with "name" in it, eg. "tx_name" or "name"
# valuesList = names(values(regions))
# nameid = valuesList[grep (valuesList, pattern="name")]
#create id string for the new object to be returned
#ids have to be unique and we can not assume GRanges objects will
#have a name attribute
if("name" %in% names(temp.df))
{
new.ids=paste(temp.df[sum.dt$id,"seqnames"],temp.df[sum.dt$id,"start"],
temp.df[sum.dt$id,"end"],temp.df[sum.dt$id,"name"],sep=".")
}else{
new.ids=paste(temp.df[sum.dt$id,"seqnames"],temp.df[sum.dt$id,"start"],
temp.df[sum.dt$id,"end"],sep=".")
}
#create a new methylRaw object to return
new.data=data.frame(#id =new.ids,
chr =temp.df[sum.dt$id,"seqnames"],
start =temp.df[sum.dt$id,"start"],
end =temp.df[sum.dt$id,"end"],
strand =temp.df[sum.dt$id,"strand"],
coverage=sum.dt$coverage,
numCs =sum.dt$numCs,
numTs =sum.dt$numTs)
),
so methylation level per region would be the total CpG sites percentage
(sum(freqC)/sum(Coverage)).

But if I do a boxplot, I should calculate the mean methylation level of one dmr which may contain many CpGs.

For example, below is a dmr which contains 6 CpG sites. But because here it shows us the methylation percentage not freqC.

chr1.836516 chr1 836516 F 4 100.00 0
chr1.836543 chr1 836543 F 4 0.00 100
chr1.836941 chr1 836941 F 19 68.42 31.58
chr1.836942 chr1 836942 R 1 100.00 0
chr1.836953 chr1 836953 F 19 78.95 21.05
chr1.836954 chr1 836954 R 1 100.00 0

So if I do boxplot for this dmr,

Mean Methylation Level (Group 1) = (100.00+0.00+68.42+100.00+78.95+100.00)/6 =74.56 or 0.7456

However, this value must be different compared to sum(freqC)/sum(Coverage) , right ?

For sum(freqC)/sum(Coverage) ,

it's :

(4 * 1 + 4* 0+ .... 1* 1)/(4+4+19+1+19+1)= 0.70833333

Do you think I can do boxplot this way if I want to show all CpGs methylation level ?

Can I ignore this difference in some degree.

Because I think it is the only way to show the methylation level by boxplot.

I am looking forward to your response.

image


Best regards,
Xiangyi

@alexg9010
Copy link
Collaborator

Hi Xiangyi,

I think when dealing with this low amount of coverage, a difference of 5% can safely be ignored. Your DMRs should show much higher differences to be significant.

If you are visualizing single DMRs, averaging the methylation per CpG would allow you to create a point for each CpG and is the only way to get the boxplot.

Using the mean per DMR gives you a weighted estimate of the methylation and is more biased towards higher coverage CpGs, but this seems reasonable given the low coverage. But you do not get methylation information per CpG, so this can not be used for the boxplot. However when reporting average methylation per DMR, this is probably the value you should use.

Best,
Alex

.

@Xiangyi-Deng
Copy link
Author

Dear Alex,

Thank you so much for your response.

I think when dealing with this low amount of coverage, a difference of 5% can safely be ignored. Your DMRs should show much higher differences to be significant.

Well, I understand what you mean. Yes, we used a more strict criteria to do dmr calling. Here I just show you an example.

If you are visualizing single DMRs, averaging the methylation per CpG would allow you to create a point for each CpG and is the only way to get the boxplot.

I agree with you that it is only way to show all CpG cites methylation levels in a boxplot.

Using the mean per DMR gives you a weighted estimate of the methylation and is more biased towards higher coverage CpGs, but this seems reasonable given the low coverage. But you do not get methylation information per CpG, so this can not be used for the boxplot. However when reporting average methylation per DMR, this is probably the value you should use.

But I am a little confused. the mean per DMR you mean sum(freqC)/sum(Coverage) , right ? It can not be used for the boxplot.

I think I know it must be more complicated process of dmr calling.

when reporting average methylation per DMR, this is probably the value you should use

You mean average methylation per DMR of DMR calling, or sum(freqC)/sum(Coverage).

If so, I get you.

------------------------ Last question
For our data, we did paired-comparison DMR calling among 6 groups. So we got many DMRs in each comparisons.

And in order to see how much degree these DMRs show, we load data into IGV.

In some degree, we can see the difference, for example, one dmr, low methylation of A, compared to B, but also can be compared to the rest groups.

However, I was wondering if somebody may ask how do you know it is significant that this dmr in A versus other rest groups ?
This dmr may just comes from one paired comparison, for example, A versus B and it may show us really low methylation only in A but high or full methylation in B and the other 4 groups.

So we want to do a boxplot based on the same DMRs and use t.test to show significance between all comparisons.

Could you give me some advice on my last question if you understand my description.

Thanks again.

Best,
Xiangyi

@alexg9010
Copy link
Collaborator

alexg9010 commented Aug 9, 2024 via email

@Xiangyi-Deng
Copy link
Author

Hi Xiangyi, Correct, with "mean per DMR" I am talking about sum(freqC)/sum(Coverage) per DMR, in contrast to "averaging the methylation per CpG" within the DMR which would be (numC1/Coverage1 + ... numCN/CoverageN) / N. Concerning your question, what you suggest should be fine, since you get the significance between the groups using the t.test. Best, Alex Am Fr., 9. Aug. 2024 um 10:58 Uhr schrieb Charite Learner < @.>:

Dear Alex, Thank you so much for your response. I think when dealing with this low amount of coverage, a difference of 5% can safely be ignored. Your DMRs should show much higher differences to be significant. Well, I understand what you mean. Yes, we used a more strict criteria to do dmr calling. Here I just show you an example. If you are visualizing single DMRs, averaging the methylation per CpG would allow you to create a point for each CpG and is the only way to get the boxplot. I agree with you that it is only way to show all CpG cites methylation levels in a boxplot. Using the mean per DMR gives you a weighted estimate of the methylation and is more biased towards higher coverage CpGs, but this seems reasonable given the low coverage. But you do not get methylation information per CpG, so this can not be used for the boxplot. However when reporting average methylation per DMR, this is probably the value you should use. But I am a little confused. the mean per DMR you mean sum(freqC)/sum(Coverage) , right ? It can not be used for the boxplot. I think I know it must be more complicated process of dmr calling. when reporting average methylation per DMR, this is probably the value you should use You mean average methylation per DMR of DMR calling, or sum(freqC)/sum(Coverage). If so, I get you. ------------------------ Last question For our data, we did paired-comparison DMR calling among 6 groups. So we got many DMRs in each comparisons. And in order to see how much degree these DMRs show, we load data into IGV. In some degree, we can see the difference, for example, one dmr, low methylation of A, compared to B, but also can be compared to the rest groups. However, I was wondering if somebody may ask how do you know it is significant that this dmr in A versus other rest groups ? This dmr may just comes from one paired comparison, for example, A versus B and it may show us really low methylation only in A but high or full methylation in B and the other 4 groups. So we want to do a boxplot based on the same DMRs and use t.test to show significance between all comparisons. Could you give me some advice on my last question if you understand my description. Thanks again. Best, Xiangyi — Reply to this email directly, view it on GitHub <#324 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADK7JD3VDECGH4Z2YQ5JEJDZQSAEVAVCNFSM6AAAAABMGYUAH2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENZXGQ4DKNZUGE . You are receiving this because you commented.Message ID: @.
>

Many thanks !

Have a nice weekend.

Best regards,
Xiangyi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants