Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discrepancy between coords self.coords between peak and nonpeak sequences #180

Open
zrcjessica opened this issue Feb 24, 2024 · 2 comments

Comments

@zrcjessica
Copy link

Hello,

I've been going over the code base and it looks to me like self.coords from ChromBPNetBatchGenerator() appears to represent the region start and the region summit for the peak and nonpeak sequences, respectively, after the random crop is applied to the input sequence. In the initialization of the ChromBPNetBatchGenerator objects, the first line of code calls the load_data() function from the data_utils.py script. This then calls get_coords() which appears to define the center of each region (both peak or nonpeak) as the start + summit coordinates from the BED file; i.e. the literal center of the region. However, the final line of code in the __init__() function for the generator calls crop_revcomp_data() which then applies the random_crop() function from augment.py to the peak sequences. This is where they define the new_coords after randomly cropping a sequence of length inputlen (2114bp) from the peak sequence region with jitter applied (3114bp) as:

new_coords = coords.copy()
new_coords[:,1] = new_coords[:,1].astype(int) - (seqs.shape[1]//2) + starts

It looks like a new start coordinate is being defined from what was originally a center coordinate. Even when the mode is set to test and max_jitter is subsequently set to 0, this same transformation is still being applied.

However, for nonpeak sequences, only the subsample_nonpeak_data() function is applied, which simply randomly subsamples from the original coords as returned by get_coords(), which reflect the center of the region.

Therefore, when batches of data returned by the generator, it seems to me like the batch_coords representing peak sequences will reflect the start coordinates while the coords representing nonpeak sequences will reflect the center coordinates of the region.

It's possible I've overlooked something in the code, but based on my own manipulations of the data this still appears to be the case. It might be helpful to add a disclaimer about this so that users realize the discrepancy and deal with it accordingly.

@panushri25
Copy link
Collaborator

panushri25 commented Feb 27, 2024

Hey @zrcjessica, no this is exactly right! I have noticed this recently and have a new version coming up this week that fixes this.

@panushri25
Copy link
Collaborator

This will effect the coordinates released as a part of the .h5 predictions object in chrombpnet test and wont effect anything else.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants