feat: mztab support #107

cmdoret · 2024-10-15T11:23:23Z

Context:

mzTab is a commonly used file format to represent results from mass spectrometry experiments. As we plan to support proteomics and metabolomics data, we need to enable metadata extraction and data access for this file format.

Changes

- Pin modos-schema to the metabolomics branch
- Add parsing utilities to extract metadata from mzTab contents into the modos schema
- Implement logic to populate zarr metadata from mzTab
- Add a modos enrich command to trigger metadata extraction from file contents
  - At first, I thought of running it as a hook whenever something is added, but since extraction can be very slow, one may not want to run it right away.
- Make all relative imports absolute (e.g. from . import helpers -> from modos import helpers)
- Add boilerplate for extracting arrays from file contents into the zarr hierarchy
  - Basic implementation that nests the arrays directly in the parent group
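
For readers unfamiliar with the format, here is a minimal sketch of what metadata extraction from mzTab involves. It is not the code in this PR (which relies on pyteomics): it only assumes that mzTab metadata rows are tab-separated lines of the form `MTD`, key, value. The function name `read_mtd` and the example rows are hypothetical and purely illustrative.

```python
def read_mtd(lines):
    """Collect key/value pairs from the MTD (metadata) rows of an mzTab file.

    Hypothetical helper for illustration; the PR itself parses mzTab
    through pyteomics.
    """
    meta = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Metadata rows look like: MTD <tab> key <tab> value
        if len(fields) >= 3 and fields[0] == "MTD":
            meta[fields[1]] = fields[2]
    return meta


# Illustrative content, not taken from the PR's test data.
example = [
    "COM\tillustrative example",
    "MTD\tmzTab-version\t2.0.0-M",
    "MTD\tmzTab-ID\tEXAMPLE-1",
    "MTD\tdescription\tMinimal example file",
    "SMH\tSML_ID\tchemical_name",
    "SML\t1\tglucose",
]

metadata = read_mtd(example)
print(metadata)
```

Only the `MTD` rows end up in the dictionary; table rows such as `SMH`/`SML` are skipped, which mirrors the scope of this PR (metadata only, tables left in the original file).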

Limitations

- We rely on pyteomics for mzTab parsing, which comes with a large dependency tree (incl. sklearn)
- Differences in mzTab between proteomics (mzTab 1.0) and metabolomics (mzTab 2.X-M) need to be clarified
  - This PR focuses on mzTab 2.X-M
- The file is included as-is and only the metadata can be accessed through S3 using zarr. If relevant, the next step would be to extract data tables into zarr arrays and provide the same flexibility.

Notes
Exploratory tests with array extraction from mztab files revealed little benefit of exposing arrays directly in zarr:

- zarr supports multi-dimensional arrays, not column-based lists -> better suited to fixed-size and (mostly) homogeneous datatypes
  - It appears that it is currently not possible to both retain column names (structured datatypes, see zarr tutorial) and use variable-size datatypes (arbitrary-length strings, see object arrays)
- mzTab arrays are composed mostly of strings and variable datatypes
- Even integer id columns are strings, as they can contain multiple identifiers separated by |
- Tables are relatively small, making zarr features less interesting
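
To illustrate the point about id columns, here is a small sketch with made-up values (not taken from an actual mzTab file). The helper `split_ids` is hypothetical. It shows that a column holding `|`-separated identifiers cannot be cast to a plain integer column, and that splitting it yields rows of different lengths, which do not fit a fixed-shape array.

```python
def split_ids(cell):
    """Split a '|'-separated identifier cell into a list of integers."""
    return [int(part) for part in cell.split("|")]


# Illustrative id column: some cells reference several identifiers.
id_column = ["1", "2|3", "4|5|6"]

# A direct integer cast fails as soon as one cell holds several ids.
try:
    [int(cell) for cell in id_column]
    direct_cast_ok = True
except ValueError:
    direct_cast_ok = False

# Splitting works, but the rows are ragged (lengths 1, 2 and 3).
parsed = [split_ids(cell) for cell in id_column]
lengths = [len(ids) for ids in parsed]
print(direct_cast_ok, parsed, lengths)
```

Keeping such columns as strings avoids padding or a separate offsets array, at the cost of losing the numeric dtype.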

Co-authored-by: supermaxiste <[email protected]>

cmdoret added 24 commits September 25, 2024 15:59

feat: code matching client

8b17610

feat(remote): register fuzon endpoint in client

42d8635

refactor(cli): prompt logic to dedicated prompt module

48cd565

refactor(cli): prompt logic to dedicated prompt module

b90072d

chore(deps): pyfuzon as extra dep

5c93ea4

chore: update deps

024bc2e

feat(cli): support code completion in modos add

fc886b2

perf(codes): limit suggestions to 50 codes

8d59c3e

fix(cli): use labels in recommendations

acac519

refactor(codes): custom Code struct

37f5150

chore(deps): bump modos-schema version

8e528a4

feat(cli): prompt autocompletes text, persists uris

1363d6c

fix(cli): disable unnecessary autocomplete on modos create

49a2ad6

test(data): use uris when required

2169165

fix(codes): fuzon-http api parameters

53dbc84

chore(make): document deploy recipe

ab0a807

feat(compose): add fuzon service

2ab52c7

feat(nginx): register fuzon in reverse proxy

85e3ea5

feat(fuzon): dockerized fuzon-http setup

653c73a

fix(compose): add envvar for fuzon service

a85dc51

fix(compose): fuzon envvars

5afc4ef

fix(nginx): typo

8df4cf3

chore(deps): add prompt-toolkit

b4eeb47

fix(deploy): pin fuzon to tag 0.2.3

8d9deb2

cmdoret linked an issue Oct 15, 2024 that may be closed by this pull request

[Feature request]: Incorporate metabolomics and proteomics data #91

Closed

cmdoret self-assigned this Oct 15, 2024

cmdoret and others added 4 commits October 16, 2024 10:51

fix(compose): env var typo

772b47c

Co-authored-by: supermaxiste <[email protected]>

docs(deploy): describe config variables

06ca00b

feat(codes): parametrize n top code matches

ac78fa1

fix(codes): update protocol signatures

4b9c8af

cmdoret added 8 commits October 18, 2024 16:37

chore: add source data for test modo

90bb370

chore: pin schema

0f72b88

docs(cli): clearer help msg

eaf17e5

feat(io): add branch for mzTab meta. extraction

57e022d

refactor(cram): simplify cram extraction logic

0d858a5

feat(mztab): end-to-end mztab metadata extraction

61762e3

fix: pass data instance when extracting metadata

20e2894

chore: typos

9a81f4b

cmdoret force-pushed the feat/mztab branch from 658372a to 9a81f4b Compare October 18, 2024 14:38

cmdoret added 9 commits October 21, 2024 11:08

feat(cli): add enrich command

6182d7f

chore(deps): relative -> absolute imports

b9cedd8

test: update reference path

f1440f3

test: update reference path (bis)

8d135c5

Merge branch 'main' into feat/mztab

10add08

chore: regen lock

3427499

feat(storage): add open method

1572536

refactor(api): reduce code duplication across methods

97bb807

feat: boilerplate for array extraction

cdace3f

cmdoret marked this pull request as ready for review November 13, 2024 14:44

cmdoret added 10 commits November 13, 2024 17:34

chore: update lock file

b5ea4bf

chore(deps): add pandas

d13a899

ci: use make recipe for poetry install

1e04022

chore(ci): drop python 3.10, test on 3.12

c1141da

test(remote): fix pydantic type in endpoint manager doctest

e2727b7

chore(deps): pin modos-schema to v0.3

0ec20bd

docs: mztab tutorial

d4da2a6

chore: metabolomics init file

336c4c0

ci(docs): bump python version

971f981

ci(docs): bump python version

2360e26

cmdoret merged commit f709a55 into main Nov 29, 2024
4 checks passed
