BREAKING CHANGES
- `velocity/scvelo`: update `scvelo` to `0.3.3`, which also removes support for using `loom` input files. The component now uses a `MuData` object as input. Several arguments were added to support selecting different inputs from the MuData file: `counts_layer`, `modality`, `layer_spliced`, `layer_unspliced`, `layer_ambiguous`. An `output_h5mu` argument has been added (PR #932).
- `src/annotate/onclass` and `src/annotate/celltypist`: Input parameter for gene name layers of input datasets has been updated to `--input_var_gene_names` and `reference_var_gene_names` (PR #919).
- Several components under `src/scgpt` (`cross_check_genes`, `tokenize_pad`, `binning`) now process the input (query) datasets differently. Instead of subsetting datasets based on genes in the model vocabulary and/or highly variable genes, these components require an input .var column with a boolean mask specifying this information. The results are written back to the original input data, preserving the dataset structure (PR #832).
- `query/cellxgene_census`: The default output layer has been changed from `.layers["counts"]` to `.X` to be more aligned with the standard OpenPipelines format (PR #933). Use argument `--output_layer_counts counts` to revert the behaviour to the previous default.
- Added cell multiplexing support to the `from_cellranger_multi_to_h5mu` component and the `cellranger_multi` workflow. For the `from_cellranger_multi_to_h5mu` component, the `output` argument now requires a value containing a wildcard character `*`, which will be replaced by the sample ID to form the final output file names. Additionally, a `sample_csv` argument is added to the `from_cellranger_multi_to_h5mu` component, which describes the sample name per output file. No change is required for the `output_h5mu` argument of the `cellranger_multi` workflow; in case of a multiplexed run, the workflow emits multiple events, one for each sample. The ID of each event (and the default output file name) is set by `--sample_ids` (in case of cell multiplexing) or, as before, by the user-provided `id` for the input (PR #803 and PR #902).
- `demux/bcl_convert`: update BCL convert from 3.10 to 4.2 (PR #774).
- `demux/cellranger_mkfastq`, `mapping/cellranger_count`, `mapping/cellranger_multi` and `reference/build_cellranger_reference`: update cellranger to `8.0.1` (PR #774 and PR #811).
- Removed `--disable_library_compatibility_check` in favour of `--check_library_compatibility` for the `mapping/cellranger_multi` component and the `ingestion/cellranger_multi` workflow (PR #818).
- `lianapy`: bumped version to `1.3.0` (PR #827 and PR #862). Additionally, `groupby` is now a required argument.
- `concat`: this component was deprecated and has now been removed. Use `concatenate_h5mu` instead (PR #796).
- The `workflows` folder in the root of the project no longer contains symbolic links to the built workflows in `target`. Using any workflow that was previously linked in this directory will now result in an error that indicates the location of the workflow to use instead (PR #796).
- `XGBoost`: bump version to `2.0.3` (PR #646).
- Several components: update anndata to `0.11.1` and mudata to `0.3.1` (PR #645 and PR #901), and scanpy to `1.10.4` (PR #901).
- `filter/filter_with_hvg`: this component was deprecated and has now been removed. Use `feature_annotation/highly_variable_features_scanpy` instead (PR #843).
- `dataflow/concat`: this component was deprecated and has now been removed. Use `dataflow/concatenate_h5mu` instead (PR #857).
- `convert/from_h5mu_to_seurat`: bump seurat to latest version (PR #850).
- `workflows/ingestion/bd_rhapsody`: Upgrade BD Rhapsody 1.x to 2.x, thereby changing the interface of the workflow (PR #846).
- `mapping/bd_rhapsody`: Upgrade BD Rhapsody 1.x to 2.x, thereby changing the interface of the component (PR #846).
- `reference/make_bdrhap_reference`: Upgrade BD Rhapsody 1.x to 2.x, thereby changing the interface of the component (PR #846).
- `reference/build_star_reference`: Rename `mapping/star_build_reference` to `reference/build_star_reference` (PR #846).
- `reference/cellranger_mkgtf`: Rename `reference/mkgtf` to `reference/cellranger_mkgtf` (PR #846).
- `labels_transfer/xgboost`: Align interface with new annotation workflow:
  - Store label probabilities instead of uncertainties.
  - Take `.h5mu` format as an input instead of `.h5ad`.
- `reference/build_cellranger_arc_reference`: a default value of "output" is now specified for the argument `--genome`, in line with the `reference/build_cellranger_reference` component. Additionally, providing a value for `--organism` is no longer required and its default value of `Homo Sapiens` has been removed (PR #864).
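The wildcard behaviour described above for the `output` argument of `from_cellranger_multi_to_h5mu` can be sketched roughly as follows. This is only an illustration of the idea, not the component's code: the helper name `expand_output_template`, the example paths and sample IDs, and the rule that exactly one `*` must be present are all assumptions.

```python
def expand_output_template(template: str, sample_ids: list[str]) -> dict[str, str]:
    """Map each sample ID to an output path by filling in the '*' wildcard.

    Hypothetical helper for illustration only; the real component may
    validate or substitute the wildcard differently.
    """
    # Assumption: the template must contain exactly one wildcard character.
    if template.count("*") != 1:
        raise ValueError(f"Expected exactly one '*' in the output template, got: {template!r}")
    return {sample_id: template.replace("*", sample_id) for sample_id in sample_ids}


if __name__ == "__main__":
    # Example sample IDs and path are made up for this sketch.
    paths = expand_output_template("results/*.h5mu", ["sample_a", "sample_b"])
    for sample_id, path in paths.items():
        print(sample_id, "->", path)
```

Under these assumptions, a multiplexed run with two samples would yield two output files, one per sample ID.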
NEW FUNCTIONALITY
Important
Workflows from the `workflows/annotation` and `workflows/integration/scgpt_leiden` namespaces, plus their newly implemented dependencies, are not yet considered to be part of the stable public API. Their functionality and interface may be subject to change.
- `velocyto_to_h5mu`: now writes counts to `.X` (PR #932).
- `qc/calculate_atac_qc_metrics`: new component for calculating ATAC QC metrics (PR #868).
- `workflows/annotation/scgpt_integration_knn` workflow: Cell-type annotation based on scGPT integration with KNN label transfer (PR #875).
- CI: Use `params.resources_test` in test workflows in order to point to an alternative location (e.g. a cache) (PR #889).
- Added `demux/cellranger_atac_mkfastq` component: demultiplex raw sequencing data for ATAC experiments (PR #726).
- `process_samples`, `process_batches` and `rna_multisample` workflows: added functionality to scale the log-normalized gene expression data to unit variance and zero mean. The scaled data is written to a separate layer, and a representation with reduced dimensions is created from it and stored in addition to the non-scaled data (PR #733).
- `transform/scaling`: add `--input_layer` and `--output_layer` arguments (PR #733).
- CI: added checking of mudata contents for multiple workflows (PR #783).
- Added multiple arguments to the `cellranger_multi` workflow in order to maintain feature parity with the `mapping/cellranger_multi` component (PR #803).
- `convert/from_cellranger_to_h5mu`: add support for antigen analysis.
- Added `reference/build_cellranger_reference` component: build reference file compatible with ATAC and ATAC+GEX experiments (PR #726).
- `demux/bcl_convert`: add support for no lane splitting (PR #804).
- `reference/cellranger_mkgtf` component: Added cellranger mkgtf as a standalone component (PR #771).
- `scgpt/cross_check_genes` component: Added a gene-model cross check component for scGPT (PR #758).
- `scgpt/embedding` component: Added scGPT embedding component (PR #761).
- `scgpt/tokenize_pad` component: Added scGPT padding and tokenization component (PR #754).
- `scgpt/binning` component: Added a scGPT pre-processing binning component (PR #765).
- `workflows/integration/scgpt_leiden` workflow: scGPT integration followed by Leiden clustering (PR #794).
- `scgpt/cell_type_annotation` component: Added scGPT cell type annotation component (PR #798).
- `resources_test_scripts/scGPT.sh`: Added script to include scGPT test resources (PR #800).
- `transform/clr` component: Added the option to set the `axis` along which to apply CLR. This can also be overridden at the workflow level (PR #767).
- `annotate/celltypist` component: Added a CellTypist annotation component (PR #825).
- `dataflow/split_h5mu` component: Added a component to split a single h5mu file into multiple h5mu files based on the values of an .obs column (PR #824).
- `workflows/test_workflows/ingestion` components & `workflows/ingestion`: Added standalone components for integration testing of ingestion workflows (PR #801).
- `workflows/ingestion/make_reference`: Add additional arguments passed through to the STAR and BD Rhapsody reference components (PR #846).
- `annotate/random_forest_annotation` component: Added a random forest cell type annotation component (PR #848).
- `dataflow/concatenate_h5mu`: data from `.uns`, both originating from the global and per-modality slots, is now retained in the final concatenated output object. Additionally, added the `uns_merge_mode` argument in order to tune the behavior when conflicting keys are detected across samples (PR #859).
- `dimred/densmap` component: Added a densMAP dimensionality reduction component (PR #748).
- `annotate/scanvi` component: Added a component to annotate cells using scANVI (PR #833).
- `transform/bpcells_regress_out` component: Added a component to regress out effects of confounding variables in the count matrix using BPCells (PR #863).
- `transform/regress_out`: Allow providing 'input' and 'output' layers for scanpy regress_out functionality (PR #863).
- `workflows/ingestion/make_reference`: add possibility to build CellRanger ARC references. Added `--motifs_file`, `--non_nuclear_contigs` and `--output_cellranger_arc` arguments (PR #864).
- Test resources (reference_gencodev41_chr1): switch reference genome for CellRanger to ARC variant (PR #864).
- Added `transform/tfidf` component: normalize ATAC data with TF-IDF (PR #870).
- Added `dimred/lsi` component (PR #552).
- `metadata/duplicate_obs` component: Added a component to make a copy from one .obs field or index to another .obs field within the same MuData object (PR #874, PR #899).
- `annotate/onclass` component: Added a component to annotate cell types using OnClass (PR #844).
- `annotate/svm` component: Added a component to annotate cell types using support vector machine (SVM) (PR #845).
- `metadata/duplicate_var` component: Added a component to make a copy from one .var field or index to another .var field within the same MuData object (PR #877, PR #899).
- `filter/subset_obsp` component: Added a component to subset an .obsp matrix by column based on the value of an .obs field. The resulting subset is moved to an .obsm field (PR #888).
- `labels_transfer/knn` component: Enable using additional distance functions for KNN classification (PR #830) and allow performing KNN classification based on a pre-calculated neighborhood graph (PR #890).
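As an illustration of the scaling functionality listed above (zero mean and unit variance, written to a separate layer so the non-scaled data is kept), here is a minimal plain-Python sketch. It is not the `transform/scaling` implementation: the function name `scale_layer`, the layer names, the dict-of-matrices stand-in for layers, and the use of the population standard deviation are all assumptions made for this example.

```python
from statistics import fmean, pstdev


def scale_layer(layers: dict[str, list[list[float]]], input_layer: str, output_layer: str) -> None:
    """Scale each gene (column) to zero mean and unit variance and store the
    result under a new layer name, leaving the input layer untouched.

    Hypothetical helper: `layers` is a plain dict standing in for a set of
    named layers, with rows as cells and columns as genes.
    """
    matrix = layers[input_layer]
    n_genes = len(matrix[0]) if matrix else 0
    scaled = [[0.0] * n_genes for _ in matrix]
    for gene in range(n_genes):
        column = [row[gene] for row in matrix]
        mean = fmean(column)
        # Assumption: population standard deviation; a real implementation
        # may use a different estimator.
        std = pstdev(column)
        for cell, value in enumerate(column):
            # Genes with zero variance are set to 0.0 to avoid dividing by zero.
            scaled[cell][gene] = (value - mean) / std if std > 0 else 0.0
    layers[output_layer] = scaled


if __name__ == "__main__":
    # Layer names are made up for this sketch.
    layers = {"log_normalized": [[1.0, 5.0], [3.0, 5.0]]}
    scale_layer(layers, "log_normalized", "scaled")
    print(layers["scaled"])
```

Writing to a separate layer mirrors the `--input_layer` and `--output_layer` arguments: downstream steps can then choose between the scaled and non-scaled representation.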
MAJOR CHANGES
- Bump popv to `0.4.2` (PR #901).
MINOR CHANGES
- Pin `scikit-learn` for `labels_transfer/xgboost` to `<1.6` (PR #931).
- `filter/filter_with_scrublet`: provide cleaner error message when running scrublet on an empty modality (PR #929).
- Several components (cleanup): remove the workaround for being able to use shared utility functions with Nextflow Fusion (PR #920).
- `scgpt/cell_type_annotation` component update: Added support for multi-processing (PR #832).
- Several annotation (`src/annotate/`) components (`onclass`, `celltypist`, `random_forest_annotation`, `scanvi`, `svm_annotation`): Updated input parameters to ensure uniformity across components, implemented functionality to cross-check the overlap of genes between query and reference (model) datasets and implemented logic to allow for subsetting of genes (PR #919).
- `workflows/annotation/scgpt_annotation` workflow: Added a scGPT transformer-based cell type annotation workflow (PR #832).
- `scgpt/cross_check_genes` component update: Highly variable genes are now cross-checked based on the boolean mask in `var_input`. The filtering information is stored in the `--output_var_filter` .var field instead of subsetting the dataset (PR #832).
- `scgpt/binning` component update: This component now requires the `--var_input` parameter to provide gene filtering information. Binned data is written to the `--output_obsm_binned_counts` .obsm field in the original input data (PR #832).
- `scgpt/pad_tokenize` component update: Genes are padded and tokenized based on filtering information in `--var_input` and `--input_obsm_binned_counts` (PR #832).
- `resources_test_scripts/scgpt.sh`: Update scGPT test resources to avoid subsetting of datasets (PR #926).
- `workflows/integration/scgpt_leiden` workflow update: Update workflow such that input dataset is not subsetted for HVG but uses boolean masks in .var field instead (PR #875).
- `dataflow/split_h5mu`: Optimize resource usage of the component (PR #913).
- Several components: bump python version (PR #901).
- `resources_test_scripts/cellranger_atac_tiny_bcl.sh` script: generate counts from fastq files using CellRanger atac count (PR #726).
- `cellbender_remove_background_v0_2`: update base image to `nvcr.io/nvidia/pytorch:23.12-py3` (PR #646).
- Bump scvelo to `0.3.2` (PR #828).
- Pin numpy<2 for several components (PR #815).
- Added `resources_test_scripts/cellranger_atac_tiny_bcl.sh` script: download tiny bcl file with an ATAC experiment, download a motifs file, demultiplex bcl files to reads in fastq format (PR #726).
- `mapping/cellranger_multi` component now outputs logs on failure of the `cellranger multi` process (PR #766).
- Bump `viash-actions` to `v6` (PR #821).
- `reference/make_reference`: Do not try to extract genome fasta and transcriptome gtf if they are not gzipped (PR #856).
- Changes related to syncing the test resources (PR #867):
  - Add `.info.test_resources` to `_viash.yaml` to specify where test resources need to be synced from.
  - `download/sync_test_resources`: Use `.info.test_resources` in `_viash.yaml` to detect where test resources need to be synced from.
  - Update CI to use `project/sync-and-cache` instead of `project/sync-and-cache-s3`.
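The scGPT updates listed above replace dataset subsetting with boolean masks stored in a .var field. A minimal sketch of that idea in plain Python follows, assuming the output mask marks genes that are both flagged in the input mask and present in the model vocabulary. The function name `build_var_filter`, the gene names and the exact combination rule are assumptions for illustration, not the `scgpt/cross_check_genes` implementation.

```python
def build_var_filter(var_names: list[str], vocabulary: set[str], var_input: list[bool]) -> list[bool]:
    """Return one boolean per gene instead of dropping genes from the dataset.

    Hypothetical helper: a gene is kept when it is flagged in `var_input`
    (for example as highly variable) and also occurs in the model vocabulary.
    """
    if len(var_names) != len(var_input):
        raise ValueError("var_names and var_input must have the same length")
    return [bool(keep) and gene in vocabulary for gene, keep in zip(var_names, var_input)]


if __name__ == "__main__":
    # Made-up gene names and masks for this sketch.
    genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
    highly_variable = [True, True, False, True]
    model_vocabulary = {"GENE_A", "GENE_C", "GENE_D"}
    mask = build_var_filter(genes, model_vocabulary, highly_variable)
    # The mask has one entry per original gene, so the dataset keeps its shape.
    print(dict(zip(genes, mask)))
```

Because the mask has the same length as the gene axis, later steps can read it from .var and apply it on the fly while the original dataset structure is preserved.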
BUG FIXES
- Fix failing tests for `ingestion/cellranger_postprocessing`, `ingestion/conversion` and `multiomics/process_batches` (PR #869).
- `convert/from_10xh5_to_h5mu`: add .uns slot to mdata root when metrics file is provided (PR #887).
- Fix ingestion components not working when optional arguments are unset (PR #894).
- `transform/normalize_total` component: pass the `target_sum` argument to `sc.pp.normalize_total()` (PR #823).
- `from_cellranger_multi_to_h5mu`: fix missing `pytest` dependency (PR #897).
- `scvi_leiden` workflow: fix the input layer argument of the workflow not being passed to the scVI component (PR #936 and PR #938).
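To make the effect of the `target_sum` fix for `transform/normalize_total` more concrete, the sketch below shows in plain Python what normalizing to a target sum means: each cell's counts are rescaled so that they add up to `target_sum`. This is not scanpy's implementation; the helper name `normalize_rows_to_target` and the handling of empty cells are assumptions for illustration.

```python
def normalize_rows_to_target(counts: list[list[float]], target_sum: float) -> list[list[float]]:
    """Rescale every row (cell) so that its values sum to `target_sum`.

    Hypothetical helper for illustration; cells with a total of zero are
    returned as all zeros instead of being divided by zero.
    """
    normalized = []
    for row in counts:
        total = sum(row)
        if total == 0:
            normalized.append([0.0 for _ in row])
            continue
        factor = target_sum / total
        normalized.append([value * factor for value in row])
    return normalized


if __name__ == "__main__":
    cells = [[1.0, 3.0], [0.0, 0.0]]
    print(normalize_rows_to_target(cells, target_sum=100.0))
```

Before the fix, a user-supplied `target_sum` was not forwarded to `sc.pp.normalize_total()`, so the function would have fallen back to its own default instead of the requested value.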