Improve ma_qa_metric_local_pairwise description #19

Open
aozalevsky opened this issue Sep 24, 2024 · 5 comments

@aozalevsky

Right now, the ma_qa_metric_local_pairwise description says nothing about how complete the data should be. For instance, some metrics are supposed to form symmetric square matrices, so an upper triangular matrix alone should be enough. I guess the initial definition was intentionally generic, but maybe we can extend the category description with some case-specific details (e.g. for PAE) to give software developers some guidance.

This came up as a part of the discussion in chaidiscovery/chai-lab#52

@gtauriello

Given that mmCIF generally does not guarantee that data is provided for each atom, each residue, each residue-pair or anything else, I am not sure how one would stress this in the description here.

Whether information is provided for all pairs really depends on the model itself. On the other hand, the description could benefit from a comment that the whole category can be extracted into a separate file. Here is a possible addition to the description:

In cases where the metric is symmetric, it is enough to store just one value per pair. For asymmetric metrics, the order of residues is expected to be meaningful (e.g. PAE where PAE_ij is defined by aligning residue i (label_*_1) and measuring the error on residue j (label_*_2)). In all cases, it is perfectly valid to only provide values for a subset of residue pairs.
Data in this category is expected to be very large and can hence be extracted into a separate file which is linked to the main file using the categories ma_associated_archive_file_details or ma_entry_associated_files with file_content set to "local pairwise QA scores".

Would this work?
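
For readers of such files, the symmetric and asymmetric cases in the proposed wording could be handled roughly as in the sketch below. It is only an illustration under stated assumptions: `fill_pairwise_matrix`, its arguments, and the example values are hypothetical and not part of ModelCIF or python-modelcif. Pairs that are not provided stay `None`, since supplying only a subset of pairs is valid.

```python
def fill_pairwise_matrix(n_residues, pair_values, symmetric):
    """Build an n x n matrix (list of lists) from sparse per-pair values.

    pair_values maps (i, j) -> value using 1-based residue indices, where
    i corresponds to label_*_1 and j to label_*_2 (hypothetical layout).
    """
    matrix = [[None] * n_residues for _ in range(n_residues)]
    for (i, j), value in pair_values.items():
        matrix[i - 1][j - 1] = value
        # For a symmetric metric one stored value covers both orders,
        # unless the mirrored pair was given explicitly.
        if symmetric and matrix[j - 1][i - 1] is None:
            matrix[j - 1][i - 1] = value
    return matrix


# Symmetric metric: only one value per pair is stored (made-up numbers).
contact = fill_pairwise_matrix(3, {(1, 2): 0.8, (1, 3): 0.1}, symmetric=True)

# Asymmetric metric such as PAE: (i, j) and (j, i) are distinct values.
pae = fill_pairwise_matrix(3, {(1, 2): 4.0, (2, 1): 7.5}, symmetric=False)
```

In the symmetric case `contact[1][0]` is mirrored from the stored `(1, 2)` value, while in the PAE-like case nothing is mirrored and unlisted pairs remain empty.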

@aozalevsky
Author

I think it's good! Also, as @benmwebb pointed out, to properly read an external file, its content has to be concatenated with the main file. But it has to be more complicated than that, because the external file (at least in the ma-dm-hisrep-003 example you mentioned) has an additional header:

data_ma-dm-hisrep-003
_entry.id ma-dm-hisrep-003
_entry.ma_collection_id ma-dm-hisrep

which causes the following error:

     84 def _check_residue(r):
     85     """Make sure that a residue is not out of range of its Entity"""
---> 86     if r.seq_id > len(r.entity.sequence) or r.seq_id < 1:
     87         raise IndexError("Residue %d out of range for %s (1-%d)"
     88                          % (r.seq_id, r.entity, len(r.entity.sequence)))

AttributeError: 'NoneType' object has no attribute 'sequence'

After deleting the duplicated line

data_ma-dm-hisrep-003

I was able to parse the concatenated file. I wonder if it's possible to make the process slightly more user-friendly and cover it in the input section of the python-modelcif docs.
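
The manual workaround described above could be scripted roughly as follows. This is a minimal sketch, assuming both files are text mmCIF (it makes no sense for BinaryCIF) and that every data block header starts its line with `data_`; the function name is hypothetical, and the main-file content shown is a shortened stand-in rather than the real entry.

```python
def concatenate_dropping_repeated_header(main_text, extra_text):
    """Join two mmCIF texts, skipping data_ lines that were already seen."""
    seen_headers = set()
    kept = []
    for line in main_text.splitlines() + extra_text.splitlines():
        if line.startswith("data_"):
            header = line.strip()
            if header in seen_headers:
                continue  # repeated block header: drop it
            seen_headers.add(header)
        kept.append(line)
    return "\n".join(kept) + "\n"


# Shortened stand-in for the main file (assumption, not the real content).
main_text = "data_ma-dm-hisrep-003\n_entry.id ma-dm-hisrep-003\n"
# Header of the external file as quoted in this thread.
extra_text = (
    "data_ma-dm-hisrep-003\n"
    "_entry.id ma-dm-hisrep-003\n"
    "_entry.ma_collection_id ma-dm-hisrep\n"
)
combined = concatenate_dropping_repeated_header(main_text, extra_text)
```

Only the repeated `data_` line is removed; duplicated data items such as `_entry.id` are left in place, matching what was done by hand.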

@benmwebb
Contributor

> Also, as @benmwebb pointed out, to properly read an external file, its content has to be concatenated with the main file.

You can't just glue the two files together, because the external file might be BinaryCIF for example, not mmCIF. By "concatenation" I meant that logically the two files work on the same data model; IDs in one file can refer to the other.

> the external file (at least in the ma-dm-hisrep-003 example you mentioned) has an additional header
>
> data_ma-dm-hisrep-003
> _entry.id ma-dm-hisrep-003
> _entry.ma_collection_id ma-dm-hisrep
>
> which causes the following error:

Right, the Python library assumes that a new data block corresponds to a new System object, so it'll get confused by the IDs there (e.g. any entity IDs will point to empty entities since they are defined in a different system). One simple fix would be to assume that if the names of the two data blocks are the same, it is the same system.
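
The idea of treating data blocks with the same name as one system could look something like the toy sketch below. It is not how python-modelcif is implemented and uses none of its API: `merge_blocks_by_name` is a hypothetical helper that only understands simple `_item value` lines (no `loop_`, no multi-line values), and the `_struct.title` line is a made-up placeholder.

```python
def merge_blocks_by_name(*cif_texts):
    """Collect simple key-value items per data block name.

    Blocks with the same name, even from different files, share one dict,
    so items from an accompanying file land in the same "system".
    """
    systems = {}
    for text in cif_texts:
        current = None
        for raw_line in text.splitlines():
            line = raw_line.strip()
            if line.startswith("data_"):
                # Reuse the existing dict if this block name was seen before.
                current = systems.setdefault(line[len("data_"):], {})
            elif line.startswith("_") and current is not None:
                key, _, value = line.partition(" ")
                current[key] = value.strip()
    return systems


main_text = (
    "data_ma-dm-hisrep-003\n"
    "_entry.id ma-dm-hisrep-003\n"
    "_struct.title placeholder\n"  # made-up item for illustration
)
extra_text = (
    "data_ma-dm-hisrep-003\n"
    "_entry.id ma-dm-hisrep-003\n"
    "_entry.ma_collection_id ma-dm-hisrep\n"
)
systems = merge_blocks_by_name(main_text, extra_text)
```

With this grouping, both inputs end up in a single entry keyed by `ma-dm-hisrep-003`, so IDs from the accompanying file would resolve against the same set of objects.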

@gtauriello

If I remember correctly, the idea was that the extra file for local pairwise QA scores should by itself be a valid mmCIF file (i.e. include a data block and all parent data items). That's why there is a bit of redundancy between the files.

As for reading a main ModelCIF file together with an accompanying file in python-modelcif, this may be better handled in ihmwg/python-modelcif#10?

@brindakv
Contributor

Addressed in #25.
