Skip to content

Commit

Permalink
Merge pull request #31 from WangHong007/master
Browse files Browse the repository at this point in the history
pre-update of ibaqpy
  • Loading branch information
ypriverol authored Nov 15, 2023
2 parents 4fe81e3 + bd652cd commit 10249dc
Show file tree
Hide file tree
Showing 6 changed files with 143 additions and 247 deletions.
47 changes: 23 additions & 24 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,16 +6,17 @@
[![PyPI version](https://badge.fury.io/py/ibaqpy.svg)](https://badge.fury.io/py/ibaqpy)
![PyPI - Downloads](https://img.shields.io/pypi/dm/ibaqpy)

iBAQ (intensity Based Absolute Quantification) determines the abundance of a protein by dividing the total precursor intensities by the number of theoretically observable peptides of the protein. The TPA (Total Protein Approach) value is determined by summing peptide intensities of each protein and then dividing by the molecular mass to determine the relative concentration of each protein. By using [ProteomicRuler](https://www.sciencedirect.com/science/article/pii/S1535947620337749), it is possible to calculate the protein copy number and absolute concentration. **ibaqpy** compute IBAQ values, TPA values, copy numbers and concentration for proteins starting from a msstats input file and a SDRF experimental design file. This package provides multiple tools:
iBAQ (intensity Based Absolute Quantification) determines the abundance of a protein by dividing the total precursor intensities by the number of theoretically observable peptides of the protein. The TPA (Total Protein Approach) value is determined by summing peptide intensities of each protein and then dividing by the molecular mass to determine the relative concentration of each protein. By using [ProteomicRuler](https://www.sciencedirect.com/science/article/pii/S1535947620337749), it is possible to calculate the protein copy number and absolute concentration. **ibaqpy** compute IBAQ values, TPA values, copy numbers and concentration for proteins starting from a msstats input file (or a feature parquet from [quantmsio](https://github.com/bigbio/quantms.io)) and a SDRF experimental design file. In addition, it supports the merging of iBAQ results from multiple datasets and the elimination of outliers and batch effects. This package provides multiple tools:

- `peptide_file_generation.py`: generate a peptide file from a msstats input file and a SDRF experimental design file.

- `peptide_normalization.py`: Normalize the input output file from script `peptide_file_generation.py`. It includes multiple steps such as peptidoform normalization, peptidorom to peptide summarization, peptide intensity normalization, and imputation.
- `peptide_normalization.py`: Generate the peptides dataframe from a msstats input file and a SDRF experimental design file (or directly from a feature parquet), then normalize the peptides dataframe. It includes multiple steps such as peptidoform normalization, peptidorom to peptide summarization, peptide intensity normalization, and imputation.

- `compute_ibaq.py`: Compute IBAQ values from the output file from script `peptide_normalization.py`.

- `compute_tpa.py`: Compute TPA values, protein copy numbers and concentration from the output file from script `peptide_file_generation.py`.
- `compute_tpa.py`: Compute TPA values, protein copy numbers and concentration from the output file from script `peptide_normalization.py`.

- `datasets_merge.py`: Merge ibaq results from multiple datasets. It consists of three steps: missing value imputation, outlier removal, and batch effect removal.

**NOTE:** In all scripts and result files, *uniprot accession* is used as the protein identifier.

### How to install ibaqpy

Expand Down Expand Up @@ -61,7 +62,7 @@ E.g. http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/absolute
### Peptide Normalization - peptide_normalization.py

```asciidoc
python peptide_normalization.py --msstats PXD003947.sdrf_openms_design_msstats_in.csv --sdrf PXD003947.sdrf.tsv --remove_decoy_contaminants data/contaminants_ids.tsv --remove_low_frequency_peptides --output PXD003947-peptides-norm.tsv
python peptide_normalization.py --msstats PXD003947.sdrf_openms_design_msstats_in.csv --sdrf PXD003947.sdrf.tsv --remove_ids data/contaminants_ids.tsv --remove_decoy_contaminants --remove_low_frequency_peptides --output PXD003947-peptides-norm.csv
```

The command provides an additional `flag` for skip_normalization, pnormalization, compress, log2, violin, verbose.
Expand All @@ -71,7 +72,7 @@ Usage: peptide_normalization.py [OPTIONS]

Options:
-m, --msstats TEXT MsStats file import generated by quantms
-p, --parquet TEXT Parquet file import generated by quantms
-p, --parquet TEXT Parquet file import generated by quantmsio
-s, --sdrf TEXT SDRF file import generated by quantms
--min_aa INTEGER Minimum number of amino acids to filter
peptides
Expand Down Expand Up @@ -120,25 +121,25 @@ Peptide normalization starts from the peptides dataframe. The structure of the i
- SampleID: Sample ID `(e.g. PXD003947-Sample-3)`
- StudyID: Study ID `(e.g. PXD003947)`. In most of the cases the study ID is the same as the ProteomeXchange ID.
#### Removing Contaminants and Decoys
#### 1. Removing Contaminants and Decoys

The first step is to remove contaminants and decoys from the input file. The script `peptide_normalization.py` provides a parameter `--contaminants` for the user to provide a file with a list of protein accessions which represent each contaminant in the file. An example file can be seen in `data/contaminants.txt`. In addition to all the proteins accessions, the tool remove all the proteins with the following prefixes: `CONTAMINANT` and `DECOY`.
The first step is to remove contaminants and decoys. The script `peptide_normalization.py` provides a parameter `--remove_decoy_contaminants` as a flag to remove all the proteins with the following prefixes: `CONTAMINANT` and `DECOY`. And the user can provide a file with a list of protein accessions which represent each contaminant or high abundant protein in the file. An example file can be seen in `data/contaminants_ids.txt`.

#### Peptidoform Normalization
#### 2. Peptidoform Normalization

A peptidoform is a combination of a `PeptideSequence(Modifications) + Charge + BioReplicate + Fraction`. In the current version of the file, each row correspond to one peptidoform.

The current version of the tool uses the parackage [qnorm](https://pypi.org/project/qnorm/) to normalize the intensities for each peptidofrom. **qnorm** implements a quantile normalization method.

#### Peptidoform to Peptide Summarization
#### 3. Peptidoform to Peptide Summarization

For each peptidoform a peptide sequence (canonical) with not modification is generated. The intensity of all peptides group by biological replicate are `sum`.

Then, the intensities of the peptides across different biological replicates are summarize using the function `median`.

At the end of this step, for each peptide, the corresponding protein + the intensity of the peptide is stored.

#### Peptide Intensity Imputation and Normalization
#### 4. Peptide Intensity Imputation and Normalization

Before the final two steps (peptide normalization and imputation), the algorithm removes all peptides that are source of missing values significantly. The algorithm removes all peptides that have more than 80% of missing values and peptides that do not appear in more than 1 sample.

Expand Down Expand Up @@ -188,15 +189,15 @@ Options:
--help Show this message and exit.
```

#### Performs the Enzymatic Digestion
#### 1. Performs the Enzymatic Digestion
The current version of this tool uses OpenMS method to load fasta file, and use [ProteaseDigestion](https://openms.de/current_doxygen/html/classOpenMS_1_1ProteaseDigestion.html) to enzyme digestion of protein sequences, and finally get the theoretical peptide number of each protein.

#### Calculate the IBAQ Value
#### 2. Calculate the IBAQ Value
First, peptide intensity dataframe was grouped according to protein name, sample name and condition. The protein intensity of each group was summed. Finally, the sum of the intensity of the protein is divided by the number of theoretical peptides.

If protein-group exists in the peptide intensity dataframe, the intensity of all proteins in the protein-group is summed based on the above steps, and then multiplied by the number of proteins in the protein-group.

#### IBAQ Normalization
#### 3. IBAQ Normalization
Normalize the ibaq values using the total ibaq of the sample. The resulted ibaq values are then multiplied by 100'000'000 (PRIDE database noramalization), for the ibaq ppb and log10 shifted by 10 (ProteomicsDB)

### Compute TPA - compute_tpa.py
Expand All @@ -206,9 +207,7 @@ The total protein approach (TPA) is a label- and standard-free method for absolu

```asciidoc
python compute_tpa.py --fasta Homo-sapiens-uniprot-reviewed-contaminants-decoy-202210.fasta --contaminants contaminants_ids.tsv --peptides PXD003947-peptides.csv --ruler --ploidy 2 --cpc 200 --output PXD003947-tpa.tsv
```

Note: The peptide intensity file used in the calculation of TPA and the ProteomicRuler is not normalized!
```

```asciidoc
python compute_tpa.py --help
Expand Down Expand Up @@ -242,10 +241,10 @@ Options:
--help Show this message and exit.
```

#### Calculate the TPA Value
#### 1. Calculate the TPA Value
The OpenMS tool was used to calculate the theoretical molecular mass of each protein. Similar to the calculation of IBAQ, the TPA value of protein-group was the sum of its intensity divided by the sum of the theoretical molecular mass.

#### Calculate the Cellular Protein Copy Number and Concentration
#### 2. Calculate the Cellular Protein Copy Number and Concentration
The protein copy calculation follows the following formula:
```
protein copies per cell = protein MS-signal * (avogadro / molecular mass) * (DNA mass / histone MS-signal)
Expand Down Expand Up @@ -297,15 +296,15 @@ Options:
--help Show this message and exit.
```

#### Impute Missing Values
#### 1. Impute Missing Values
Remove proteins missing more than 30% of all samples. Users can keep tissue parts interested, and data will be converted to a expression matrix (samples as columns, proteins as rows), then missing values will be imputed with `KNNImputer`.

#### Remove Outliers
#### 2. Remove Outliers
When outliers are removed, multiple hierarchical clustering is performed using `hdbscan.HDBSCAN`, where outliers are labeled -1 in the PCA plot. When clustering is performed, the default cluster minimum value and the minimum number of neighbors of the core point are the minimum number of samples in all datasets.

**Users can skip this step and do outliers removal manually.*
*Users can skip this step and do outliers removal manually.*

#### Batch Effect Correction
#### 3. Batch Effect Correction
Using `pycombat` for batch effect correction, and batch is set to `datasets` (refers specifically to PXD ids) and the covariate should be `tissue_part`.

### Credits
Expand Down
15 changes: 8 additions & 7 deletions bin/compute_ibaq.py
Original file line number Diff line number Diff line change
Expand Up @@ -43,13 +43,13 @@ def normalize_ibaq(res: DataFrame) -> DataFrame:
return res


def parse_uniprot_name(identifier: str) -> str:
def parse_uniprot_accession(identifier: str) -> str:
"""
Parse the uniprot name from the identifier (e.g. sp|P12345|PROT_NAME)
Parse the uniprot accession from the identifier (e.g. sp|P12345|PROT_NAME)
:param identifier: Uniprot identifier
:return:
"""
return identifier.split("|")[2]
return identifier.split("|")[1]


@click.command()
Expand Down Expand Up @@ -119,7 +119,7 @@ def ibaq_compute(
exit(1)

fasta_proteins = list() # type: list[FASTAEntry]
protein_names = list()
protein_accessions = list()
FASTAFile().load(fasta, fasta_proteins)
uniquepepcounts = dict() # type: dict[str, int]
digestor = ProteaseDigestion()
Expand All @@ -144,12 +144,13 @@ def get_average_nr_peptides_unique_bygroup(pdrow: Series) -> Series:
digest = list() # type: list[str]
digestor.digest(AASequence().fromString(entry.sequence), digest, min_aa, max_aa)
digestuniq = set(digest)
protein_name = parse_uniprot_name(entry.identifier)
# TODO: We keep uniprot accessions rather than names.
protein_name = parse_uniprot_accession(entry.identifier)
uniquepepcounts[protein_name] = len(digestuniq)
protein_names.append(protein_name)
protein_accessions.append(protein_name)

data = pd.read_csv(peptides, sep=",")
data = data[data[PROTEIN_NAME].isin(protein_names)]
data = data[data[PROTEIN_NAME].isin(protein_accessions)]
print(data.head())
# next line assumes unique peptides only (at least per indistinguishable group)

Expand Down
86 changes: 16 additions & 70 deletions bin/peptide_normalization.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
#!/usr/bin/env python
import gc

import uuid
import click
import matplotlib.pyplot as plt
import numpy as np
Expand Down Expand Up @@ -35,34 +35,22 @@
TMT11plex,
TMT16plex,
get_canonical_peptide,
get_reference_name,
get_spectrum_prefix,
get_study_accession,
parquet_map,
parse_uniprot_accession,
plot_box_plot,
plot_distributions,
remove_contaminants_decoys,
remove_extension_file,
remove_protein_by_ids,
sum_peptidoform_intensities,
get_peptidoform_normalize_intensities,
average_peptide_intensities,
print_help_msg,
)


def print_dataset_size(dataset: DataFrame, message: str, verbose: bool) -> None:
if verbose:
print(message + str(len(dataset.index)))


def print_help_msg(command) -> None:
"""
Print help information
:param command: command to print helps
:return: print help
"""
with click.Context(command) as ctx:
click.echo(command.get_help(ctx))


# TODO: The following two func are useless.
def remove_outliers_iqr(dataset: DataFrame):
"""
This method removes outliers from the dataframe inplace, the variable used for the outlier removal is Intensity
Expand All @@ -88,6 +76,11 @@ def remove_missing_values(normalize_df: DataFrame, ratio: float = 0.3) -> DataFr
return normalize_df


def print_dataset_size(dataset: DataFrame, message: str, verbose: bool) -> None:
if verbose:
print(message + str(len(dataset.index)))


def intensity_normalization(
dataset: DataFrame,
field: str,
Expand Down Expand Up @@ -160,53 +153,6 @@ def intensity_normalization(
return dataset


def get_peptidoform_normalize_intensities(
dataset: DataFrame, higher_intensity: bool = True
) -> DataFrame:
"""
Select the best peptidoform for the same sample and the same replicates. A peptidoform is the combination of
a (PeptideSequence + Modifications) + Charge state.
:param dataset: dataset including all properties
:param higher_intensity: select based on normalize intensity, if false based on best scored peptide
:return:
"""
dataset = dataset[dataset[NORM_INTENSITY].notna()]
if higher_intensity:
dataset = dataset.loc[
dataset.groupby(
[PEPTIDE_SEQUENCE, PEPTIDE_CHARGE, SAMPLE_ID, CONDITION, BIOREPLICATE],
observed=True,
)[NORM_INTENSITY].idxmax()
].reset_index(drop=True)
else:
dataset = dataset.loc[
dataset.groupby(
[PEPTIDE_SEQUENCE, PEPTIDE_CHARGE, SAMPLE_ID, CONDITION, BIOREPLICATE],
observed=True,
)[SEARCH_ENGINE].idxmax()
].reset_index(drop=True)
return dataset


def average_peptide_intensities(dataset: DataFrame) -> DataFrame:
"""
Median the intensities of all the peptidoforms for a specific peptide sample combination.
:param dataset: Dataframe containing all the peptidoforms
:return: New dataframe
"""
dataset_df = dataset.groupby(
[PEPTIDE_CANONICAL, SAMPLE_ID, CONDITION], observed=True
)[NORM_INTENSITY].median()
dataset_df = dataset_df.reset_index()
dataset_df = pd.merge(
dataset_df,
dataset[[PROTEIN_NAME, PEPTIDE_CANONICAL, SAMPLE_ID, CONDITION]],
how="left",
on=[PEPTIDE_CANONICAL, SAMPLE_ID, CONDITION],
)
return dataset_df


def remove_low_frequency_peptides_(
dataset_df: DataFrame, percentage_samples: float = 0.20
):
Expand Down Expand Up @@ -411,7 +357,7 @@ def impute_peptide_intensities(dataset_df, field, class_field):
@click.option(
"--qc_report",
help="PDF file to store multiple QC images",
default="peptideNorm-QCprofile.pdf",
default=f"peptideNorm-QCprofile-{str(uuid.uuid4())}.pdf",
)
def peptide_normalization(
msstats: str,
Expand Down Expand Up @@ -469,7 +415,7 @@ def peptide_normalization(

# Read the sdrf file
sdrf_df = pd.read_csv(sdrf, sep="\t", compression=compression_method)
sdrf_df[REFERENCE] = sdrf_df["comment[data file]"].apply(remove_extension_file)
sdrf_df[REFERENCE] = sdrf_df["comment[data file]"].apply(get_spectrum_prefix)
print(sdrf_df)

if FRACTION not in feature_df.columns:
Expand All @@ -493,7 +439,7 @@ def peptide_normalization(
# Merged the SDRF with the Resulted file
labels = set(sdrf_df["comment[label]"])
if CHANNEL not in feature_df.columns:
feature_df[REFERENCE] = feature_df[REFERENCE].apply(remove_extension_file)
feature_df[REFERENCE] = feature_df[REFERENCE].apply(get_spectrum_prefix)
dataset_df = pd.merge(
feature_df,
sdrf_df[["source name", REFERENCE]],
Expand Down Expand Up @@ -522,7 +468,7 @@ def peptide_normalization(
.rename(columns={"index": "comment[label]"})
)
sdrf_df = sdrf_df.merge(choice, on="comment[label]", how="left")
feature_df[REFERENCE] = feature_df[REFERENCE].apply(get_reference_name)
feature_df[REFERENCE] = feature_df[REFERENCE].apply(get_spectrum_prefix)
dataset_df = pd.merge(
feature_df,
sdrf_df[["source name", REFERENCE, CHANNEL]],
Expand All @@ -543,7 +489,7 @@ def peptide_normalization(
.rename(columns={"index": "comment[label]"})
)
sdrf_df = sdrf_df.merge(choice, on="comment[label]", how="left")
feature_df[REFERENCE] = feature_df[REFERENCE].apply(get_reference_name)
feature_df[REFERENCE] = feature_df[REFERENCE].apply(get_spectrum_prefix)
dataset_df = pd.merge(
feature_df,
sdrf_df[["source name", REFERENCE, CHANNEL]],
Expand Down
Loading

0 comments on commit 10249dc

Please sign in to comment.