[builder] CxG schema 5 / Census schema 2 #1024

Merged: 31 commits, Mar 26, 2024


bkmartinjr
Contributor

@bkmartinjr bkmartinjr commented Feb 28, 2024

Builder support for Census schema 2.0.0 / CELLxGENE schema 5.0.0

Fixes: #993
Fixes: #1022
Fixes: #796

Primary changes:

  • Request H5ADs in schema 5.0.0 format
  • Update Census schema to 2.0.0, which in turn is pinned to CxG schema 5.0.0
  • Update assay filter to incorporate new RNA seq assays
  • Update UBERON release to match schema 5.0.0, used to infer tissue_general value
  • Clarify the handling of multi-species datasets and how they are included/excluded from the Census. As part of this, tighten typing on ExperimentSpecification to require a species/organism to be specified (i.e., mandate that each Experiment is associated with a single organism).
  • Add new census_info/organisms table specified in Census schema 2.0.0 (Store organism ontology term ID in census object #796)
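
As an illustration of the tightened typing mentioned in the list above, here is a minimal sketch. It is not the builder's actual code: the class `ExperimentSpec`, its fields, and the helper `rows_for_experiment` are hypothetical names, and the row filter is only one plausible way to keep cells of a single organism. The one point taken from this PR is that each Experiment must be associated with exactly one organism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSpec:
    """Hypothetical stand-in for the builder's ExperimentSpecification."""

    name: str
    # Required field with no default, so an organism can never be omitted.
    organism_ontology_term_id: str

    def __post_init__(self) -> None:
        # Reject empty values so every experiment maps to exactly one organism.
        if not self.organism_ontology_term_id:
            raise ValueError(f"experiment {self.name!r} must specify an organism")


def rows_for_experiment(spec, obs_rows):
    """Keep only the obs rows whose organism matches the experiment's organism."""
    return [
        row
        for row in obs_rows
        if row["organism_ontology_term_id"] == spec.organism_ontology_term_id
    ]


# Toy data for illustration only.
human = ExperimentSpec("homo_sapiens", "NCBITaxon:9606")
obs = [
    {"cell": "a", "organism_ontology_term_id": "NCBITaxon:9606"},
    {"cell": "b", "organism_ontology_term_id": "NCBITaxon:10090"},
]
human_rows = rows_for_experiment(human, obs)
```

Making the organism a required, validated field means a specification without one fails at construction time rather than partway through a build.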


codecov bot commented Feb 28, 2024

Codecov Report

Attention: Patch coverage is 72.34043%, with 26 lines in your changes missing coverage. Please review.

Project coverage is 81.33%. Comparing base (06ee454) to head (bcb0814).
Report is 3 commits behind head on main.

Files                                                     Patch %   Lines
...llxgene_census_builder/build_soma/validate_soma.py     0.00%     8 Missing ⚠️
...ne_census_builder/build_soma/experiment_builder.py     30.00%    7 Missing ⚠️
...src/cellxgene_census_builder/build_soma/anndata.py     75.00%    4 Missing ⚠️
...lder/src/cellxgene_census_builder/build_soma/mp.py     50.00%    4 Missing ⚠️
.../cellxgene_census_builder/build_soma/build_soma.py     89.28%    3 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1024   +/-   ##
=======================================
  Coverage   81.32%   81.33%           
=======================================
  Files          73       73           
  Lines        5553     5566   +13     
=======================================
+ Hits         4516     4527   +11     
- Misses       1037     1039    +2     
Flag Coverage Δ
unittests 81.33% <72.34%> (+<0.01%) ⬆️


@bkmartinjr
Contributor Author

@pablo-gar - thanks for the assay term changes.

Question/open issue: the Census schema also defines special handling for the normalized layer, for any "Smart-Seq" assays. Can we add the definitive list of which EFO terms are considered a Smart-Seq assay? Ideally this would be included in the Census schema (or referenced as you did with the assay filter terms).

@bkmartinjr bkmartinjr requested review from ebezzi and pablo-gar March 11, 2024 16:00
@bkmartinjr bkmartinjr marked this pull request as ready for review March 19, 2024 16:31
Member

@ebezzi ebezzi left a comment

LGTM.

@@ -339,15 +223,15 @@ An example of this `SOMADataFrame` is shown below:
<tbody>
<tr>
<td>census_schema_version</td>
-<td>1.3.0</td>
+<td>2.0.0</td>
</tr>
<tr>
<td>census_build_date</td>
<td>2022-11-30</td>
Member

Super nitpick, but it could be a good idea to move this date forward to make the example more realistic.

Contributor Author

fixed

@prathapsridharan
Contributor

@bkmartinjr - Regarding issue #993, there is this point:

Confirm no use of CL:0000003, aka native cell. These cells will now be marked as unknown in both obs.cell_type and obs.cell_type_ontology_term_id columns. See related task: https://github.com/chanzuckerberg/cellxgene-census/issues/1019

Once a test build is produced, are we to check this manually in the python interpreter by doing a count of CL:0000003 in obs.cell_type_ontology_term_id and native cell in obs.cell_type and checking that the count is zero?

Or should this be captured in some type of post-build acceptance test where some sanity checks about the data are done?
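
The manual check described in this comment could be sketched roughly as follows. This is an illustration under assumptions, not part of the builder: the function name `count_native_cell` is hypothetical, and the rows are plain dictionaries standing in for Census `obs` records (for example, what pandas `DataFrame.to_dict("records")` would give for the two columns of interest).

```python
from collections import Counter


def count_native_cell(obs_rows):
    """Count rows that still use CL:0000003 or the label 'native cell'."""
    hits = Counter()
    for row in obs_rows:
        if row["cell_type_ontology_term_id"] == "CL:0000003":
            hits["cell_type_ontology_term_id"] += 1
        if row["cell_type"] == "native cell":
            hits["cell_type"] += 1
    return hits


# Toy rows for illustration only; a real check would read obs from a test build.
clean_obs = [
    {"cell_type": "unknown", "cell_type_ontology_term_id": "unknown"},
    {"cell_type": "T cell", "cell_type_ontology_term_id": "CL:0000084"},
]
dirty_obs = clean_obs + [
    {"cell_type": "native cell", "cell_type_ontology_term_id": "CL:0000003"},
]

clean_hits = count_native_cell(clean_obs)
dirty_hits = count_native_cell(dirty_obs)
```

A build would pass this check when both counters are zero, which is the condition the question above asks about.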

@bkmartinjr
Contributor Author

@prathapsridharan re:

are we to check

This is entirely an upstream issue in the DP process, and the builder does not enforce (or check for) this level of metadata compliance with the CxG schema. Checking compliance with the CxG schema is the province of the schema validation toolkit used by Lattice, et al.

We could add this kind of check, but it would be redundant and would add linkages across layers in the system that don't add much value IMHO.

Contributor

@prathapsridharan prathapsridharan left a comment

LGTM

Contributor

@pablo-gar pablo-gar left a comment

I took a look at the schema updates and the updated list of accepted assays. Looks good!

@bkmartinjr
Contributor Author

@atarashansky - we have included your requested organisms info (#796) in Census schema 2.0.0. Current content will be:

In [4]: census['census_info']['organisms'].read().concat().to_pandas().set_index('soma_joinid')
Out[4]: 
            organism_ontology_term_id organism_label      organism
soma_joinid                                                       
0                      NCBITaxon:9606   Homo sapiens  homo_sapiens
1                     NCBITaxon:10090   Mus musculus  mus_musculus

Please feel free to leave comments on both the Schema MD file changes and the code.
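
For readers wondering how the table above might be used, here is a small sketch that resolves an experiment name to its organism ontology term ID. The rows are copied from the output shown in this comment; the helper `term_id_for` is a hypothetical name, and the rows are held in plain dictionaries rather than read from a live Census object.

```python
# Rows copied from the census_info/organisms output shown in this comment.
ORGANISMS = [
    {
        "soma_joinid": 0,
        "organism_ontology_term_id": "NCBITaxon:9606",
        "organism_label": "Homo sapiens",
        "organism": "homo_sapiens",
    },
    {
        "soma_joinid": 1,
        "organism_ontology_term_id": "NCBITaxon:10090",
        "organism_label": "Mus musculus",
        "organism": "mus_musculus",
    },
]


def term_id_for(organism, rows=ORGANISMS):
    """Return the ontology term ID for an organism name (hypothetical helper)."""
    for row in rows:
        if row["organism"] == organism:
            return row["organism_ontology_term_id"]
    raise KeyError(f"unknown organism: {organism!r}")


mouse_term = term_id_for("mus_musculus")
```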

@pablo-gar
Contributor

pablo-gar commented Mar 25, 2024

Light QC on test build shows no issues

Checks

  • Verifying fidelity in data additions, in particular changes to obs and addition to census["census_info"]["organisms"].
  • Checking new assays added based on updated list of assays, and verifying the addition of expected datasets.
  • Validating the existence of previously missing data due to fixed filter for multi-species datasets.
  • Validating calculation of normalized layer for SMART-like technologies

See notebook
https://colab.research.google.com/drive/1kb2ZR0MPxVsWBgJlIgrtJrkGPx5wqxIk
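
The last check in the list above concerns the normalized layer. The sketch below shows the kind of calculation being validated, under two assumptions that are not stated in this thread: that the normalized value is a cell's raw count divided by that cell's total, and that for Smart-Seq-like assays each raw count is divided by gene length before that rescaling. The function `normalize_row` and its parameters are hypothetical; the Census schema document is the definitive reference for the actual rule.

```python
def normalize_row(raw_counts, feature_lengths=None):
    """Return one cell's counts rescaled so they sum to 1.

    If feature_lengths is given (assumed here to apply to Smart-Seq-like
    assays), each count is divided by its gene length before rescaling.
    """
    if feature_lengths is not None:
        values = [c / length for c, length in zip(raw_counts, feature_lengths)]
    else:
        values = [float(c) for c in raw_counts]
    total = sum(values)
    if total == 0:
        # A cell with no counts stays all-zero rather than dividing by zero.
        return [0.0 for _ in values]
    return [v / total for v in values]


# Toy values for illustration only.
umi_like = normalize_row([2, 6, 2])
smart_like = normalize_row([100, 300], feature_lengths=[1000, 3000])
```

In the second call the longer gene has three times the reads but also three times the length, so after the length adjustment the two genes contribute equally.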

@bkmartinjr bkmartinjr merged commit ed07891 into main Mar 26, 2024
15 checks passed
@bkmartinjr bkmartinjr deleted the bkmartinjr/993-schema-five branch March 26, 2024 15:57
4 participants