
Commit

Merge pull request #186 from aertslab/dev2
Dev
cflerin authored Jul 17, 2020
2 parents c73e44b + 107fa42 commit 5daccba
Showing 6 changed files with 15 additions and 4 deletions.
2 changes: 1 addition & 1 deletion README.rst
@@ -33,7 +33,7 @@ News and releases
0.10.0 | 2020-02-27
^^^^^^^^^^^^^^^^^^^

- * Added a helper script `arboreto_with_multiprocessing.py <https://github.com/aertslab/pySCENIC/blob/master/scripts/arboreto_with_multiprocessing.py>`_ that runs the Arboreto GRN algorithms (GRNBoost2, GENIE3) without Dask for compatibility.
+ * Added a helper script `arboreto_with_multiprocessing.py <https://github.com/aertslab/pySCENIC/blob/master/src/pyscenic/cli/arboreto_with_multiprocessing.py>`_ that runs the Arboreto GRN algorithms (GRNBoost2, GENIE3) without Dask for compatibility.

* Ability to set a fixed seed in both the AUCell step and in the calculation of regulon thresholds (CLI parameter :code:`--seed`; aucell function parameter :code:`seed`).

2 changes: 1 addition & 1 deletion docs/faq.rst
@@ -14,7 +14,7 @@ It is recommended to use the older version of Dask and Distributed for stability
But in many cases this still results in issues with the GRN step.
- An alternative is to use the multiprocessing implementation of Arboreto recently included in pySCENIC (`arboreto_with_multiprocessing.py <https://github.com/aertslab/pySCENIC/blob/master/scripts/arboreto_with_multiprocessing.py>`_).
+ An alternative is to use the multiprocessing implementation of Arboreto recently included in pySCENIC (`arboreto_with_multiprocessing.py <https://github.com/aertslab/pySCENIC/blob/master/src/pyscenic/cli/arboreto_with_multiprocessing.py>`_).
This script uses the Arboreto and pySCENIC codebase to run GRNBoost2 (or GENIE3) without Dask.
This eliminates the possibility of running the GRN step across multiple nodes, but provides additional stability.
The run time is generally equivalent to the Dask implementation using the same number of workers.
4 changes: 3 additions & 1 deletion setup.py
@@ -78,11 +78,13 @@ def read_requirements(fname):
py_modules=[os.path.splitext(os.path.basename(path))[0] for path in glob.glob('src/*.py')],
include_package_data=True,
install_requires=read_requirements('requirements.txt'),
+ scripts=['src/pyscenic/cli/arboreto_with_multiprocessing.py'],
entry_points = {
'console_scripts': ['pyscenic = pyscenic.cli.pyscenic:main',
'db2feather = pyscenic.cli.db2feather:main',
'csv2loom = pyscenic.cli.csv2loom:main',
'invertdb = pyscenic.cli.invertdb:main',
'gmt2regions = pyscenic.cli.gmt2regions:main'],
}
- )
+ )

File renamed without changes.
6 changes: 6 additions & 0 deletions src/pyscenic/transform.py
@@ -124,6 +124,11 @@ def module2features_auc1st_impl(db: Type[RankingDatabase], module: Regulon, moti
features, genes, rankings = df.index.values, df.columns.values, df.values
weights = np.asarray([module[gene] for gene in genes]) if weighted_recovery else np.ones(len(genes))

+ # include check for modules with no genes that could be mapped to the db. This can happen when including non protein-coding genes in the expression matrix.
+ if(df.empty):
+     LOGGER.warning("No genes in module {} could be mapped to {}. Skipping this module.".format(module.name, db.name))
+     return pd.DataFrame(), None, None, genes, None
+
# Calculate recovery curves, AUC and NES values.
# For fast unweighted implementation so weights to None.
aucs = calc_aucs(df, db.total_genes, weights, auc_threshold)
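
The guard added in this hunk can be illustrated in isolation. The following is a minimal sketch, assuming a hypothetical `score_module` function and a simplified two-value return; it is not pySCENIC's `module2features_auc1st_impl`, only the same early-return pattern applied to an empty rankings frame.

```python
import logging

import pandas as pd

LOGGER = logging.getLogger(__name__)


def score_module(df: pd.DataFrame, module_name: str, db_name: str):
    """Hypothetical stand-in for the enrichment step (not a pySCENIC function).

    `df` plays the role of the rankings frame: rows are features (motifs),
    columns are the module genes that could be mapped to the database.
    """
    # DataFrame.empty is True when any axis has length 0, so a frame with
    # motifs as rows but no mapped genes as columns counts as empty.
    if df.empty:
        LOGGER.warning(
            "No genes in module %s could be mapped to %s. Skipping this module.",
            module_name, db_name,
        )
        return pd.DataFrame(), None

    # Placeholder computation standing in for the AUC/NES calculation.
    return df, df.sum(axis=1)


# A module whose genes are all absent from the database yields zero columns.
no_genes = pd.DataFrame(index=["motif1"], columns=[])
result, scores = score_module(no_genes, "moduleA", "dbX")
```

Returning early keeps the downstream calculation from ever seeing a zero-width matrix; the caller then has to be prepared for the empty result.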
@@ -300,6 +305,7 @@ def df2regulons(df, save_columns=[]) -> Sequence[Regulon]:
:return: A sequence of regulons.
"""

+ assert not df.empty, 'Signatures dataframe is empty!'
print("Create regulons from a dataframe of enriched features.")
print("Additional columns saved: {}".format(save_columns))

5 changes: 4 additions & 1 deletion src/pyscenic/utils.py
@@ -263,6 +263,9 @@ def iter_modules(adjc, context):

# Add correlation column and create two disjoint set of adjacencies.
LOGGER.info("Calculating Pearson correlations.")
+ # test for genes present in the adjacencies but not present in the expression matrix:
+ unique_adj_genes = set(adjacencies[COLUMN_NAME_TF]).union(set(adjacencies[COLUMN_NAME_TARGET])) - set(ex_mtx.columns)
+ assert len(unique_adj_genes)==0, f"Found {len(unique_adj_genes)} genes present in the network (adjacencies) output, but missing from the expression matrix. Is this a different gene expression matrix?"
LOGGER.warn(f"Note on correlation calculation: the default behaviour for calculating the correlations has changed after pySCENIC version 0.9.16. Previously, the default was to calculate the correlation between a TF and target gene using only cells with non-zero expression values (mask_dropouts=True). The current default is now to use all cells to match the behavior of the R version of SCENIC. The original settings can be retained by setting 'rho_mask_dropouts=True' in the modules_from_adjacencies function, or '--mask_dropouts' from the CLI.\n\tDropout masking is currently set to [{rho_mask_dropouts}].")
adjacencies = add_correlation(adjacencies, ex_mtx,
rho_threshold=rho_threshold, mask_dropouts=rho_mask_dropouts)
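
The consistency check added in this hunk amounts to a set difference between the genes named in the network and the columns of the expression matrix. A minimal sketch, assuming the column names `"TF"` and `"target"` and a hypothetical helper `missing_genes` (neither is taken from this diff):

```python
import pandas as pd

# Assumed column names for illustration; the real code uses the constants
# COLUMN_NAME_TF and COLUMN_NAME_TARGET.
TF_COLUMN = "TF"
TARGET_COLUMN = "target"


def missing_genes(adjacencies: pd.DataFrame, ex_mtx: pd.DataFrame) -> set:
    """Return genes named in the adjacencies but absent from the expression matrix.

    Hypothetical helper: the expression matrix is cells x genes, so its
    columns are the gene names.
    """
    network_genes = set(adjacencies[TF_COLUMN]).union(adjacencies[TARGET_COLUMN])
    return network_genes - set(ex_mtx.columns)


adjacencies = pd.DataFrame(
    {"TF": ["A", "A"], "target": ["B", "C"], "importance": [1.0, 0.5]}
)
ex_mtx = pd.DataFrame([[0, 1], [2, 3]], columns=["A", "B"])  # gene C is missing

missing = missing_genes(adjacencies, ex_mtx)
if missing:
    print(f"Found {len(missing)} genes missing from the expression matrix: {sorted(missing)}")
```

The commit enforces the condition with `assert`, which Python skips when run with `-O`; code that must always validate its inputs could raise an exception explicitly instead.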
@@ -316,7 +319,7 @@ def add_motif_url(df: pd.DataFrame, base_url: str):
:param base_url:
:return:
"""
- df[("Enrichment", COLUMN_NAME_MOTIF_URL)] = list(map(partial(urljoin, base=base_url), df.index.get_level_values(COLUMN_NAME_MOTIF_ID)))
+ df[("Enrichment", COLUMN_NAME_MOTIF_URL)] = list(map(partial(urljoin, base_url), df.index.get_level_values(COLUMN_NAME_MOTIF_ID)))
return df
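
Why the one-word change in `add_motif_url` matters: `urljoin(base, url)` takes `base` as its first positional parameter, so binding it by keyword in `partial` and then letting `map` pass the motif ID positionally supplies `base` twice. The sketch below reproduces both variants with the standard library only; the base URL and motif IDs are made-up placeholders.

```python
from functools import partial
from urllib.parse import urljoin

base_url = "http://example.org/logos/"     # placeholder base URL
motif_ids = ["motif1.png", "motif2.png"]   # placeholder motif IDs

# Old form: base bound by keyword. map() passes each motif ID as the first
# positional argument, which is also `base`, so the call raises TypeError.
try:
    list(map(partial(urljoin, base=base_url), motif_ids))
    keyword_error = None
except TypeError as exc:
    keyword_error = exc

# New form: base bound positionally, so each motif ID lands in `url`.
urls = list(map(partial(urljoin, base_url), motif_ids))
```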

