From ffb4e9bce535cd043f2c282423178774930d56d0 Mon Sep 17 00:00:00 2001 From: j-uranic <117292295+j-uranic@users.noreply.github.com> Date: Tue, 12 Nov 2024 09:45:52 -0500 Subject: [PATCH 1/4] Update cosmx-v2.0.yaml Fixes/updates --- .../directory-schemas/cosmx-v2.0.yaml | 26 +++++++++---------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/src/ingest_validation_tools/directory-schemas/cosmx-v2.0.yaml b/src/ingest_validation_tools/directory-schemas/cosmx-v2.0.yaml index d5ab8726..c46b062d 100644 --- a/src/ingest_validation_tools/directory-schemas/cosmx-v2.0.yaml +++ b/src/ingest_validation_tools/directory-schemas/cosmx-v2.0.yaml @@ -18,23 +18,23 @@ files: required: True description: All raw data files for the experiment. - - pattern: raw\/data\/[^\/]*_exprMat_file\.csv + pattern: raw\/[^\/]*_exprMat_file\.csv required: True - description: "Cell expression matrix with raw counts of genes for each identified cell. Apart from the gene specific columns, additional columns include: the origin of the cell (fov, unique cell id); negative probes; probes associated with the fiducial frame (SystemControl)" + description: Cell expression matrix with raw counts of genes for each identified cell. Apart from the gene specific columns, additional columns include: the origin of the cell (fov, unique cell id); negative probes; probes associated with the fiducial frame (SystemControl) - - pattern: raw\/data\/[^\/]*_fov_positions_file\.csv + pattern: raw\/[^\/]*_fov_positions_file\.csv required: True description: FOV Positions file that provides an overview of the tissue locations and to help specify separate regions and/or tissues on the slide. 
This contains information about the Slide - slide number it comes from; FOV - field of view; X_mm/Y_mm - x/y coordinates of FOV positions in mm (previously in px) - - pattern: raw\/data\/[^\/]*_metadata_file\.csv + pattern: raw\/[^\/]*_metadata_file\.csv required: True description: Cell metadata file containing the following information - the origin of the cell (fov, unique cell id); physical properties of the cell (area, aspect ratio, width, height); location of the cell centroid within each FOV (center X/Y local) and global position (center X/Y global); information about the protein staining (min/max intensity); type of protein, which may be specific to each experiment but generally includes DAPI, Membrane, PanCK, CD45; other (e.g. Seurat information if that pipeline was used within AtoMx, some data quality information) - - pattern: raw\/data\/[^\/]*_tx_file\.csv + pattern: raw\/[^\/]*_tx_file\.csv required: True description: Cell transcript file - - pattern: raw\/data\/[^\/]*_config\.ini + pattern: raw\/[^\/]*_config\.ini required: True description: Needed to generate the DCC file from the fastq file. Contains pipeline processing parameters. Generated by DSP run, prior to sequencing. - @@ -44,27 +44,27 @@ files: - pattern: raw\/additional_panels_used\.csv required: False - description: 'If multiple commercial probe panels were used, then the primary probe panel should be selected in the "oligo_probe_panel" metadata field. The additional panels must be included in this file. Each panel record should include: manufacturer, model/name, product code.' + description: If multiple commercial probe panels were used, then the primary probe panel should be selected in the "oligo_probe_panel" metadata field. The additional panels must be included in this file. Each panel record should include: manufacturer, model/name, product code. - pattern: raw\/gene_panel\.csv required: True - description: 'The list of target genes. 
The expected format is: gene_id (ensembl ID), gene_name.' + description: The list of target genes. The expected format is: gene_id (ensembl ID), gene_name. - pattern: raw\/custom_probe_set\.csv required: False - description: 'This file should contain any custom probes used and must be included if the metadata field "is_custom_probes_used" is "Yes". The file should minimally include: target gene id, probe seq, probe id. The contents of this file are modeled after the 10x Genomics probe set file (see ).' + description: This file should contain any custom probes used and must be included if the metadata field "is_custom_probes_used" is "Yes". The file should minimally include: target gene id, probe seq, probe id. The contents of this file are modeled after the 10x Genomics probe set file (see ). - pattern: raw\/transcript_locations\.csv required: True - description: 'The origin of the coordinate is 0,0 at the top left corner of the image. The file should include: gene name, x, y, z (optional), quality score (optional). It is expected that the first row in the file contains the column header.' + description: The origin of the coordinate is 0,0 at the top left corner of the image. The file should include: gene name, x, y, z (optional), quality score (optional). It is expected that the first row in the file contains the column header. - pattern: raw\/custom_gene_list\.csv required: False - description: 'This describes the target genes profiled by the assay. For advanced design, this can be probes sequences for splicing or other analysis for any target of interest. The format should minimally contain: gene name, ensemble ID' + description: This describes the target genes profiled by the assay. For advanced design, this can be probes sequences for splicing or other analysis for any target of interest. The format should minimally contain: gene name, ensemble ID - pattern: raw\/probes\.csv required: False - description: A CSV file describing the probe panel used. 
This is typically what's used to specific the probe set when ordering a probe panel for a Xenium run. + description: A CSV file describing the probe panel used. This is typically what's used to specify the probe set when ordering a probe panel for a Xenium run. - pattern: raw\/images\/overlay\.(?:jpeg|tiff) required: False @@ -86,7 +86,7 @@ files: required: True description: OME-TIFF files (multichannel, multi-layered) produced by the microscopy experiment. If compressed, must use loss-less compression algorithm. For Visium this stitched file should only include the single capture area relevant to the current dataset. For GeoMx there will be one OME TIFF file per slide, with each slide including multiple AOIs. See the following link for the set of fields that are required in the OME TIFF file XML header. is_qa_qc: False - example: lab_processed/images/HBM892.MDXS.293.ome.tiff + example: HBM892.MDXS.293.ome.tiff - pattern: lab_processed\/images\/[^\/]*ome-tiff\.channels\.csv required: True From c99482058f683fe9db9a207be669f75fe465a5f0 Mon Sep 17 00:00:00 2001 From: j-uranic <117292295+j-uranic@users.noreply.github.com> Date: Tue, 12 Nov 2024 09:47:25 -0500 Subject: [PATCH 2/4] Update CHANGELOG.md --- CHANGELOG.md | 1 + 1 file changed, 1 insertion(+) diff --git a/CHANGELOG.md b/CHANGELOG.md index 1deb7b2f..44051ccc 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,6 +1,7 @@ # Changelog ## v0.0.29 (in progress) - Add CosMX directory schema +- Update CosMX directory schema ## v0.0.28 - Update Xenium directory schema From 9a79decc0a15685709059689edfd3add0b336afa Mon Sep 17 00:00:00 2001 From: Juan Puerto <=> Date: Tue, 12 Nov 2024 10:05:07 -0500 Subject: [PATCH 3/4] Docs: Update dir schema --- .../directory-schemas/cosmx-v2.0.yaml | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/src/ingest_validation_tools/directory-schemas/cosmx-v2.0.yaml b/src/ingest_validation_tools/directory-schemas/cosmx-v2.0.yaml index c46b062d..6f9c66cf 100644 
--- a/src/ingest_validation_tools/directory-schemas/cosmx-v2.0.yaml +++ b/src/ingest_validation_tools/directory-schemas/cosmx-v2.0.yaml @@ -20,7 +20,7 @@ files: - pattern: raw\/[^\/]*_exprMat_file\.csv required: True - description: Cell expression matrix with raw counts of genes for each identified cell. Apart from the gene specific columns, additional columns include: the origin of the cell (fov, unique cell id); negative probes; probes associated with the fiducial frame (SystemControl) + description: 'Cell expression matrix with raw counts of genes for each identified cell. Apart from the gene specific columns, additional columns include: the origin of the cell (fov, unique cell id); negative probes; probes associated with the fiducial frame (SystemControl)' - pattern: raw\/[^\/]*_fov_positions_file\.csv required: True @@ -44,23 +44,23 @@ files: - pattern: raw\/additional_panels_used\.csv required: False - description: If multiple commercial probe panels were used, then the primary probe panel should be selected in the "oligo_probe_panel" metadata field. The additional panels must be included in this file. Each panel record should include: manufacturer, model/name, product code. + description: 'If multiple commercial probe panels were used, then the primary probe panel should be selected in the "oligo_probe_panel" metadata field. The additional panels must be included in this file. Each panel record should include: manufacturer, model/name, product code.' - pattern: raw\/gene_panel\.csv required: True - description: The list of target genes. The expected format is: gene_id (ensembl ID), gene_name. + description: 'The list of target genes. The expected format is: gene_id (ensembl ID), gene_name.' - pattern: raw\/custom_probe_set\.csv required: False - description: This file should contain any custom probes used and must be included if the metadata field "is_custom_probes_used" is "Yes". The file should minimally include: target gene id, probe seq, probe id. 
The contents of this file are modeled after the 10x Genomics probe set file (see ). + description: 'This file should contain any custom probes used and must be included if the metadata field "is_custom_probes_used" is "Yes". The file should minimally include: target gene id, probe seq, probe id. The contents of this file are modeled after the 10x Genomics probe set file (see ).' - pattern: raw\/transcript_locations\.csv required: True - description: The origin of the coordinate is 0,0 at the top left corner of the image. The file should include: gene name, x, y, z (optional), quality score (optional). It is expected that the first row in the file contains the column header. + description: 'The origin of the coordinate is 0,0 at the top left corner of the image. The file should include: gene name, x, y, z (optional), quality score (optional). It is expected that the first row in the file contains the column header.' - pattern: raw\/custom_gene_list\.csv required: False - description: This describes the target genes profiled by the assay. For advanced design, this can be probes sequences for splicing or other analysis for any target of interest. The format should minimally contain: gene name, ensemble ID + description: 'This describes the target genes profiled by the assay. For advanced design, this can be probes sequences for splicing or other analysis for any target of interest. The format should minimally contain: gene name, ensemble ID' - pattern: raw\/probes\.csv required: False @@ -86,7 +86,7 @@ files: required: True description: OME-TIFF files (multichannel, multi-layered) produced by the microscopy experiment. If compressed, must use loss-less compression algorithm. For Visium this stitched file should only include the single capture area relevant to the current dataset. For GeoMx there will be one OME TIFF file per slide, with each slide including multiple AOIs. See the following link for the set of fields that are required in the OME TIFF file XML header. 
is_qa_qc: False - example: HBM892.MDXS.293.ome.tiff + example: lab_processed/images/HBM892.MDXS.293.ome.tiff - pattern: lab_processed\/images\/[^\/]*ome-tiff\.channels\.csv required: True From e689e7720b007eeeb4a397983af377a14899e800 Mon Sep 17 00:00:00 2001 From: Juan Puerto <=> Date: Tue, 12 Nov 2024 10:05:21 -0500 Subject: [PATCH 4/4] Docs: Update dir schema --- docs/cosmx/current/index.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/cosmx/current/index.md b/docs/cosmx/current/index.md index ab8e1020..17df8f9d 100644 --- a/docs/cosmx/current/index.md +++ b/docs/cosmx/current/index.md @@ -36,18 +36,18 @@ Related files: | extras\/microscope_hardware\.json | ✓ | **[QA/QC]** A file generated by the micro-meta app that contains a description of the hardware components of the microscope. Email HuBMAP Consortium Help Desk if help is required in generating this document. | | extras\/microscope_settings\.json | | **[QA/QC]** A file generated by the micro-meta app that contains a description of the settings that were used to acquire the image data. Email HuBMAP Consortium Help Desk if help is required in generating this document. | | raw\/.* | ✓ | All raw data files for the experiment. | -| raw\/data\/[^\/]*_exprMat_file\.csv | ✓ | Cell expression matrix with raw counts of genes for each identified cell. Apart from the gene specific columns, additional columns include: the origin of the cell (fov, unique cell id); negative probes; probes associated with the fiducial frame (SystemControl) | -| raw\/data\/[^\/]*_fov_positions_file\.csv | ✓ | FOV Positions file that provides an overview of the tissue locations and to help specify separate regions and/or tissues on the slide. 
This contains information about the Slide - slide number it comes from; FOV - field of view; X_mm/Y_mm - x/y coordinates of FOV positions in mm (previously in px) | -| raw\/data\/[^\/]*_metadata_file\.csv | ✓ | Cell metadata file containing the following information - the origin of the cell (fov, unique cell id); physical properties of the cell (area, aspect ratio, width, height); location of the cell centroid within each FOV (center X/Y local) and global position (center X/Y global); information about the protein staining (min/max intensity); type of protein, which may be specific to each experiment but generally includes DAPI, Membrane, PanCK, CD45; other (e.g. Seurat information if that pipeline was used within AtoMx, some data quality information) | -| raw\/data\/[^\/]*_tx_file\.csv | ✓ | Cell transcript file | -| raw\/data\/[^\/]*_config\.ini | ✓ | Needed to generate the DCC file from the fastq file. Contains pipeline processing parameters. Generated by DSP run, prior to sequencing. | +| raw\/[^\/]*_exprMat_file\.csv | ✓ | Cell expression matrix with raw counts of genes for each identified cell. Apart from the gene specific columns, additional columns include: the origin of the cell (fov, unique cell id); negative probes; probes associated with the fiducial frame (SystemControl) | +| raw\/[^\/]*_fov_positions_file\.csv | ✓ | FOV Positions file that provides an overview of the tissue locations and to help specify separate regions and/or tissues on the slide. 
This contains information about the Slide - slide number it comes from; FOV - field of view; X_mm/Y_mm - x/y coordinates of FOV positions in mm (previously in px) | +| raw\/[^\/]*_metadata_file\.csv | ✓ | Cell metadata file containing the following information - the origin of the cell (fov, unique cell id); physical properties of the cell (area, aspect ratio, width, height); location of the cell centroid within each FOV (center X/Y local) and global position (center X/Y global); information about the protein staining (min/max intensity); type of protein, which may be specific to each experiment but generally includes DAPI, Membrane, PanCK, CD45; other (e.g. Seurat information if that pipeline was used within AtoMx, some data quality information) | +| raw\/[^\/]*_tx_file\.csv | ✓ | Cell transcript file | +| raw\/[^\/]*_config\.ini | ✓ | Needed to generate the DCC file from the fastq file. Contains pipeline processing parameters. Generated by DSP run, prior to sequencing. | | raw\/markers\.csv | ✓ | A csv file describing any morphology markers used to guide ROI and/or AOI selection. | | raw\/additional_panels_used\.csv | | If multiple commercial probe panels were used, then the primary probe panel should be selected in the "oligo_probe_panel" metadata field. The additional panels must be included in this file. Each panel record should include: manufacturer, model/name, product code. | | raw\/gene_panel\.csv | ✓ | The list of target genes. The expected format is: gene_id (ensembl ID), gene_name. | | raw\/custom_probe_set\.csv | | This file should contain any custom probes used and must be included if the metadata field "is_custom_probes_used" is "Yes". The file should minimally include: target gene id, probe seq, probe id. The contents of this file are modeled after the 10x Genomics probe set file (see ). | | raw\/transcript_locations\.csv | ✓ | The origin of the coordinate is 0,0 at the top left corner of the image. 
The file should include: gene name, x, y, z (optional), quality score (optional). It is expected that the first row in the file contains the column header. | | raw\/custom_gene_list\.csv | | This describes the target genes profiled by the assay. For advanced design, this can be probes sequences for splicing or other analysis for any target of interest. The format should minimally contain: gene name, ensemble ID | -| raw\/probes\.csv | | A CSV file describing the probe panel used. This is typically what's used to specific the probe set when ordering a probe panel for a Xenium run. | +| raw\/probes\.csv | | A CSV file describing the probe panel used. This is typically what's used to specify the probe set when ordering a probe panel for a Xenium run. | | raw\/images\/overlay\.(?:jpeg|tiff) | | State whether an overlay image was used to guide ROI selection. If an overlay is used, then the overlay details will be provided in the protocols.io protocol. If used, this needs to be uploaded. It is not included in the OME TIFF. This can be a JPEG or TIFF file | | raw\/images\/preview_scan\.png | ✓ | Assists in selection of regions of FOVs using the grid FOV placement tool. | | lab_processed\/.* | ✓ | Experiment files that were processed by the lab generating the data. |
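Review note: the pattern change in PATCH 1/4 (dropping the `data\/` path segment) changes which upload layouts satisfy the required-file rules. The following is a minimal sketch of that effect using only Python's `re` module. The two patterns are copied from the diff; the file names are hypothetical, and the use of `re.fullmatch` is an assumption about how patterns are applied, not a statement of how the validator actually matches paths.

```python
import re

# Patterns copied verbatim from the diff (old = removed line, new = added line).
OLD_PATTERN = r"raw\/data\/[^\/]*_exprMat_file\.csv"
NEW_PATTERN = r"raw\/[^\/]*_exprMat_file\.csv"


def matches(pattern: str, path: str) -> bool:
    """Return True if the entire relative path matches the pattern.

    Assumption: whole-path matching. The real validator may anchor differently.
    """
    return re.fullmatch(pattern, path) is not None


# Hypothetical upload paths, for illustration only.
flat_path = "raw/Run1_exprMat_file.csv"
nested_path = "raw/data/Run1_exprMat_file.csv"

results = {
    ("old", "flat"): matches(OLD_PATTERN, flat_path),
    ("old", "nested"): matches(OLD_PATTERN, nested_path),
    ("new", "flat"): matches(NEW_PATTERN, flat_path),
    ("new", "nested"): matches(NEW_PATTERN, nested_path),
}

for (version, layout), ok in results.items():
    print(f"{version} pattern vs {layout} layout: {ok}")
```

Under these assumptions, `[^\/]*` cannot span a slash, so the new pattern accepts only files placed directly under `raw/`, and uploads that still use `raw/data/` would no longer satisfy the required entries.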