feat: mztab support #107

Merged: 61 commits, Nov 29, 2024

@cmdoret cmdoret commented Oct 15, 2024

Context

mzTab is a commonly used file format to represent results from mass spectrometry experiments. As we plan to support proteomics and metabolomics data, we need to enable metadata extraction and data access for this file format.

Changes

  • pin modos-schema to the metabolomics branch
  • add parsing utilities to extract metadata from mzTab contents into the modos schema
  • implement logic to populate zarr metadata from mzTab
  • add a modos enrich command to trigger metadata extraction from file contents
    • at first, I considered running this as a hook whenever a file is added, but since extraction can be very slow, users may not want it to run right away
  • make all relative imports absolute (import .helpers -> import modos.helpers)
  • add boilerplate for extracting arrays from file contents into the zarr hierarchy
    • basic implementation that nests the arrays directly in the parent group
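
As a rough illustration of the metadata extraction step, the sketch below collects the `MTD` (metadata) lines of an mzTab file into a dictionary using only the standard library. It is not the parser added in this PR, which relies on pyteomics; the function name `read_mztab_metadata` and the sample content are made up for the example.

```python
import io
from typing import Dict, TextIO


def read_mztab_metadata(handle: TextIO) -> Dict[str, str]:
    """Collect key/value pairs from the MTD (metadata) lines of an mzTab file."""
    metadata: Dict[str, str] = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        # Metadata lines look like: MTD <tab> key <tab> value
        if len(fields) >= 3 and fields[0] == "MTD":
            metadata[fields[1]] = fields[2]
    return metadata


# Hypothetical sample content, for illustration only.
SAMPLE = (
    "COM\tExample file\n"
    "MTD\tmzTab-version\t2.0.0-M\n"
    "MTD\tmzTab-ID\tEXAMPLE-1\n"
    "MTD\ttitle\tDemo study\n"
    "SMH\tSML_ID\tdatabase_identifier\n"
    "SML\t1\tCHEBI:16737\n"
)

metadata = read_mztab_metadata(io.StringIO(SAMPLE))
print(metadata["mzTab-version"])
```

Only the metadata section is read here; table rows (`SMH`/`SML` in this sample) are skipped, which mirrors the metadata-only scope described above.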

Limitations

  • we rely on pyteomics for mzTab parsing, which comes with a large dependency tree (incl. sklearn)
  • differences between the proteomics (mzTab 1.0) and metabolomics (mzTab 2.X-M) variants need to be clarified
    • this PR focuses on mzTab 2.X-M
  • the file is included as-is, and only the metadata can be accessed through S3 using zarr. If relevant, the next step would be to extract data tables into zarr arrays and provide the same flexibility.
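
One way to tell the two variants apart is the `mzTab-version` metadata field. The sketch below is a minimal classifier under the assumption that metabolomics files carry an `-M` suffix (e.g. `2.0.0-M`) while proteomics files use plain `1.x` versions; the function name `mztab_flavor` is hypothetical and not part of this PR.

```python
def mztab_flavor(version: str) -> str:
    """Classify an mzTab-version string as 'metabolomics', 'proteomics' or 'unknown'."""
    version = version.strip()
    # Assumption: mzTab-M versions end in "-M" (e.g. "2.0.0-M"),
    # while the proteomics format uses plain 1.x version numbers.
    if version.endswith("-M"):
        return "metabolomics"
    if version.startswith("1."):
        return "proteomics"
    return "unknown"


flavors = [mztab_flavor(v) for v in ("1.0.0", "2.0.0-M", "3.1")]
print(flavors)
```

Returning "unknown" rather than raising keeps the check usable as a guard before choosing which parsing path to take.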

Notes
Exploratory tests with array extraction from mzTab files revealed little benefit in exposing arrays directly in zarr:

  • zarr supports multi-dimensional arrays, not column-based lists -> better suited to fixed-size and (mostly) homogeneous datatypes
  • mzTab tables consist mostly of strings and variable datatypes
  • even integer ID columns are strings, as they can contain multiple identifiers separated by |
  • tables are relatively small, making zarr features less interesting
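
To make the point about identifier columns concrete, here is a small sketch that splits `|`-separated cells into integer lists. The column values and the helper name `split_ids` are invented for illustration, and treating `null` as a missing value is an assumption about the input.

```python
from typing import List


def split_ids(cell: str) -> List[int]:
    """Split a '|'-separated identifier cell into a list of integers."""
    cell = cell.strip()
    # Assumption: empty cells and "null" both mean "no identifier".
    if cell in ("", "null"):
        return []
    return [int(part) for part in cell.split("|")]


# Hypothetical column of identifier cells.
column = ["1", "2|3", "null", "4|5|6"]
parsed = [split_ids(cell) for cell in column]
lengths = [len(ids) for ids in parsed]
print(parsed)
print(lengths)
```

The rows end up with different lengths, so the column is ragged and cannot be stored as a fixed-shape integer array without padding or falling back to strings.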

@cmdoret cmdoret linked an issue Oct 15, 2024 that may be closed by this pull request
@cmdoret cmdoret self-assigned this Oct 15, 2024
@cmdoret cmdoret marked this pull request as ready for review November 13, 2024 14:44
@cmdoret cmdoret merged commit f709a55 into main Nov 29, 2024
4 checks passed
Linked issue: [Feature request]: Incorporate metabolomics and proteomics data