update: readme #61

Merged 1 commit on Jun 2, 2024
61 changes: 22 additions & 39 deletions README.md
@@ -18,11 +18,9 @@ As mentioned before, ibaq values are calculated by dividing the total precursor

- _Total precursor intensities_, the total intensity of a protein is calculated by summing the intensity of all peptides that belong to the protein. The intensity values are obtained from the feature parquet file in [quantms.io](https://github.com/bigbio/quantms.io).

## TODO @PingZheng: Can you double-check that for protein group we multiply the intensity or divided?
> Note: If protein-group exists in the peptide intensity dataframe, the intensity of all proteins in the protein-group is summed based on the above steps, and then divided by the number of proteins in the protein-group.

> Note: If protein-group exists in the peptide intensity dataframe, the intensity of all proteins in the protein-group is summed based on the above steps, and then multiplied by the number of proteins in the protein-group.

#### Other normalized IBAQ values
### Other normalized IBAQ values

- `IbaqNorm` - normalize the ibaq values using the total ibaq of the sample `ibaq / sum(ibaq)`, the sum is applied for proteins in the same _sample + condition_.
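The `IbaqNorm` formula above can be illustrated with a minimal pandas sketch. The column names here are hypothetical, not ibaqpy's actual schema:

```python
import pandas as pd

# Toy iBAQ table; "sample", "condition", and "ibaq" are illustrative names.
ibaq = pd.DataFrame({
    "sample": ["S1", "S1", "S2"],
    "condition": ["heart", "heart", "heart"],
    "ibaq": [10.0, 30.0, 5.0],
})

# IbaqNorm: divide each protein's iBAQ by the total iBAQ of its sample + condition.
totals = ibaq.groupby(["sample", "condition"])["ibaq"].transform("sum")
ibaq["ibaq_norm"] = ibaq["ibaq"] / totals
```

Within each sample + condition group the normalized values sum to 1, which makes proteins comparable across samples of different depths.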

@@ -52,49 +50,34 @@ In summary, each feature is the unique combination of a peptide sequence includi

#### Data preprocessing

## TODO @PingZheng: Can you confirm that the following steps are correct?

In this section, ibaqpy will do:
In this section (`features2peptides`), ibaqpy will:
- Parse the identifier of proteins and retain only unique peptides.
- Remove lines where intensity or study condition is empty. This can happen in the following cases:
  - The DIA pipeline sometimes reports intensities of 0 for some features.
  - quantms.io does not contain feature information for some conditions. This extreme case can happen when no ID/Quant was found for a given condition during the analysis.
- Filter peptides with fewer amino acids than `min_aa`.
- Remove low-confidence proteins according to a threshold on the number of unique peptides: a protein must have at least 2 unique peptides to be considered high confidence. The default value is 2; users can change this threshold with the `--min_unique` parameter.
- Filter decoys, contaminants, and entrapments: proteins with the prefixes `DECOY`, `CONTAMINANT`, or `ENTRAPMENT` can be removed. By default this filter is not applied; users can enable it with the `--remove_decoy_contaminants` parameter.
- Filter user-specified proteins: users can provide a list of protein identifiers to remove from the analysis with the `--remove_ids` parameter. This removes proteins that could potentially influence the intensity normalization. For example, ALBU_HUMAN can be overexpressed in human tissues, which is why we may want to remove it when analyzing tissue data.
- Normalize at the feature level between MS runs (technical repetitions):
  - When `MS runs > 1` in a sample, the mean of the per-run averages (`mean`, `median`, or `iqr`) is calculated (SampleMean).
  - The ratio between SampleMean and each MS run's average is used as a reference to scale the original intensities.
- Merge peptidoforms across fractions and technical repetitions: combine technical replicates and fractions from the same sample.
- Normalize the data at the sample level:
  - `globalMedian`: adjusts the median of all samples to a global median.
  - `conditionMedian`: samples under the same condition are adjusted to the median of that condition.
- Remove peptides with low frequency if `sample number > 1`: this step is controlled by the `--remove_low_frequency_peptides` parameter. The default threshold is 20% of the samples; users can change it via the same parameter.
- Assemble peptidoforms into peptides:
  A peptidoform is a combination of a `PeptideSequence(Modifications) + Charge + BioReplicate + Fraction` (among other features), and a peptide is a combination of a `PeptideSequence(Canonical) + BioReplicate`. ibaqpy will:
  - Select the peptidoform with the highest intensity across different modifications, fractions, and technical replicates.
  - Merge peptidoforms across different charges into peptides by applying the `sum` of the peptidoform intensity values.
- Intensity transformation to log: The user can specify the `--log2` parameter to transform the peptide intensity values to log2 before normalization.
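The peptidoform-to-peptide assembly described in the steps above can be sketched with a toy pandas example. Column names are assumptions for illustration, not ibaqpy's internal schema:

```python
import pandas as pd

# Toy feature table: two peptidoforms that collapse to one canonical peptide.
features = pd.DataFrame({
    "peptidoform": ["PEPT(Ox)IDE", "PEPT(Ox)IDE", "PEPTIDE"],
    "peptide": ["PEPTIDE", "PEPTIDE", "PEPTIDE"],
    "charge": [2, 2, 3],
    "fraction": [1, 2, 1],
    "intensity": [80.0, 50.0, 30.0],
})

# Step 1: keep the most intense observation of each peptidoform + charge
# across fractions / technical repetitions.
best = features.groupby(
    ["peptide", "peptidoform", "charge"], as_index=False
)["intensity"].max()

# Step 2: merge peptidoforms into canonical peptides by summing
# across modifications and charges.
peptides = best.groupby("peptide", as_index=False)["intensity"].sum()
```

Here the charge-2 peptidoform contributes its maximum (80) and the charge-3 form its only value (30), so the canonical peptide ends up with 110.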

> Note: At the moment, ibaqpy computes the ibaq values only based on unique peptides. Shared peptides are discarded. However, if a group of proteins share the same unique peptides (e.g., Pep1 -> Prot1;Prot2 and Pep2 -> Prot1;Prot2), the intensity of the proteins is summed and divided by the number of proteins in the group.
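The protein-group handling in the note above can be sketched as a toy pandas example (column names are illustrative, not ibaqpy's schema):

```python
import pandas as pd

# Hypothetical peptide-intensity table; groups are ";"-separated accessions.
peptides = pd.DataFrame({
    "protein": ["P1", "P1;P2", "P1;P2"],
    "intensity": [100.0, 60.0, 40.0],
})

# Sum peptide intensities per protein (or protein group)...
summed = peptides.groupby("protein", as_index=False)["intensity"].sum()

# ...then divide each group's total by the number of proteins in the group.
summed["n_proteins"] = summed["protein"].str.count(";") + 1
summed["intensity"] = summed["intensity"] / summed["n_proteins"]
```

The group `P1;P2` sums to 100 and, having two members, ends up at 50, while the single protein `P1` keeps its summed value.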

#### Feature normalization

## TODO @PingZheng: Can you confirm that the following steps are correct?

In this section, ibaqpy corrects the intensity of each feature across MS runs in a sample, eliminating between-run effects (including technical repetitions). It will:

- When `MS runs > 1` in a sample, the mean of the per-run averages (`mean`, `median`, or `iqr`) is calculated (SampleMean).
- The ratio between SampleMean and each MS run's average is used as a reference to scale the original intensities.
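The run-level scaling above can be sketched as follows; the column names and the use of `mean` as the per-run average are assumptions for illustration:

```python
import pandas as pd

# Toy intensity table with two MS runs of the same sample.
df = pd.DataFrame({
    "run": ["r1", "r1", "r2", "r2"],
    "intensity": [10.0, 30.0, 40.0, 80.0],
})

# Per-row average of the row's own MS run.
run_avg = df.groupby("run")["intensity"].transform("mean")

# SampleMean: the mean of all per-run averages.
sample_mean = df.groupby("run")["intensity"].mean().mean()

# Scale each intensity by the SampleMean / run-average ratio.
df["norm_intensity"] = df["intensity"] * sample_mean / run_avg
```

After scaling, every run's average equals SampleMean, so technical repetitions become directly comparable.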

#### Features to peptides

A peptidoform is a combination of a `PeptideSequence(Modifications) + Charge + BioReplicate + Fraction` (among other features), and a peptide is a combination of a `PeptideSequence(Canonical) + BioReplicate`. ibaqpy will do:

- Select the peptidoform with the highest intensity across different modifications, fractions, and technical replicates.
- Merge peptidoforms across different charges into peptides by applying the `sum` of the peptidoform intensity values.

#### Peptide Intensity Normalization

Finally, peptide intensities are normalized globally using `globalMedian` or `conditionMedian`. Within each sample, each protein is then combined with its unique peptides.
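A `globalMedian`-style normalization can be sketched like this; dividing each sample by its own median is one common reading of the step, and the column names are assumptions:

```python
import pandas as pd

# Toy peptide intensities from two samples with different overall scales.
df = pd.DataFrame({
    "sample": ["S1", "S1", "S2", "S2"],
    "intensity": [2.0, 4.0, 8.0, 16.0],
})

# Per-sample median, then divide so all samples share a comparable median.
med_map = df.groupby("sample")["intensity"].median()
df["norm_intensity"] = df["intensity"] / df["sample"].map(med_map)
```

After this step every sample's median normalized intensity is 1, which removes sample-level scale differences before proteins are assembled.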

#### Computing Ibaq and TPA

The iBAQ value is calculated as the intensity of the protein divided by the theoretical number of unique peptides produced by enzymatic digestion. This command also normalizes iBAQ to riBAQ. See details in `peptides2proteins`.

Compute TPA values, protein copy numbers, and concentrations from the output file of the `commands/features2peptides` script.

**Merge projects**: Merge ibaq results from multiple datasets. It consists of three steps: missing value imputation, outlier removal, and batch effect removal.

#### Calculate the IBAQ Value
First, the peptide intensity dataframe is grouped by protein name, sample name, and condition, and the protein intensity of each group is summed. Depending on the experiment type, the same protein may have missing peptides in different samples, resulting in variations in the number of peptides detected for the protein across samples. To handle this difference, intensities within the same group are normalized using the formula `sum(peptides) / n` (where n is the number of detected peptides). Finally, the summed protein intensity is divided by the number of theoretical peptides. See details in `peptides2proteins`.
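The iBAQ calculation described above can be sketched as a toy pandas example. The column names and the theoretical-peptide counts are hypothetical, not ibaqpy's actual schema:

```python
import pandas as pd

# Toy peptide table: P1 has two detected peptides, P2 has one.
peptides = pd.DataFrame({
    "protein": ["P1", "P1", "P2"],
    "sample": ["S1", "S1", "S1"],
    "condition": ["c", "c", "c"],
    "intensity": [100.0, 50.0, 200.0],
})

# Hypothetical number of theoretical enzymatic peptides per protein.
theoretical = {"P1": 10, "P2": 20}

# Group by protein + sample + condition, sum intensities and count peptides.
grouped = peptides.groupby(
    ["protein", "sample", "condition"], as_index=False
).agg(intensity=("intensity", "sum"), n_detected=("intensity", "size"))

# Normalize within the group: sum(peptides) / n ...
grouped["intensity"] = grouped["intensity"] / grouped["n_detected"]

# ...then divide by the theoretical number of peptides to get iBAQ.
grouped["ibaq"] = grouped["intensity"] / grouped["protein"].map(theoretical)
```

For P1 this gives (150 / 2) / 10 = 7.5 and for P2 (200 / 1) / 20 = 10, so proteins with different peptide coverage become comparable.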

> Note: In all scripts and result files, *uniprot accession* is used as the protein identifier.

### How to install ibaqpy
@@ -144,7 +127,7 @@ E.g. http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/absolute
#### Features to peptides

```asciidoc
ibaqpy features2peptides -p tests/PXD003947/PXD003947-featrue.parquet -s tests/PXD003947/PXD003947.sdrf.tsv --remove_ids data/contaminants_ids.tsv --remove_decoy_contaminants --remove_low_frequency_peptides --output tests/PXD003947/PXD003947-peptides-norm.csv --log2 --save_parquet
ibaqpy features2peptides -p tests/PXD003947/PXD003947-featrue.parquet -s tests/PXD003947/PXD003947.sdrf.tsv --remove_ids data/contaminants_ids.tsv --remove_decoy_contaminants --remove_low_frequency_peptides --output tests/PXD003947/PXD003947-peptides-norm.csv
```
```asciidoc
Usage: features2peptides.py [OPTIONS]
@@ -167,11 +150,11 @@ Options:
properties for normalization
--skip_normalization Skip normalization step
--nmethod TEXT Normalization method used to normalize
feature intensities for all samples
feature intensities for tec
(options: mean, median, iqr, none)
--pnmethod TEXT Normalization method used to normalize
peptides intensities for all samples
(options: globalMedian,conditionMedian)
(options: globalMedian,conditionMedian,none)
--log2 Transform to log2 the peptide intensity
values before normalization
--save_parquet Save normalized peptides to parquet
Binary file modified benchmarks/images/missing_peptides_by_sample.png
87 changes: 43 additions & 44 deletions ibaqpy/ibaq/peptide_normalization.py
@@ -553,56 +553,28 @@ def peptide_normalization(
med_map = F.get_median_map_to_condition()
for samples, df in F.iter_samples():
for sample in samples:
## TODO: Perform data preprocessing on every sample
# Perform data preprocessing on every sample
print(f"{str(sample).upper()}: Data preprocessing...")
dataset_df = df[df["sample_accession"] == sample].copy()
# Step1: Parse the identifier of proteins and retain only unique peptides.
dataset_df = dataset_df[dataset_df["unique"] == 1]
if not skip_normalization:
if pnmethod == "globalMedian":
dataset_df.loc[:, "intensity"] = (
dataset_df["intensity"] / med_map[sample]
)
elif pnmethod == "conditionMedian":
con = dataset_df["condition"].unique()[0]
dataset_df.loc[:, "intensity"] = (
dataset_df["intensity"] / med_map[con][sample]
)
dataset_df = dataset_df[PARQUET_COLUMNS]
dataset_df = parquet_common_process(dataset_df, label, choice)
# Step2: Remove lines where intensity or study condition is empty.
# Step3: Filter peptides with less amino acids than min_aa.
dataset_df = data_common_process(dataset_df, min_aa)
# Only proteins with unique peptides number greater than min_unique (default: 2) are retained
# Step4: Delete low-confidence proteins.
dataset_df = dataset_df.groupby(PROTEIN_NAME).filter(
lambda x: len(set(x[PEPTIDE_CANONICAL])) >= min_unique
)
dataset_df.rename(columns={INTENSITY: NORM_INTENSITY}, inplace=True)
## TODO: @PingZheng I think the removal of these features must happen before the normalization step.
## TODO: The entire point to catch any outliers and remove them before normalization.

# Remove high abundant entrapment's, contaminants, proteins and the outliers
if remove_ids is not None:
dataset_df = remove_protein_by_ids(dataset_df, remove_ids)
# Step5: Filter decoy, contaminants, entrapment
if remove_decoy_contaminants:
dataset_df = remove_contaminants_entrapments_decoys(dataset_df)

if remove_low_frequency_peptides and len(sample_names) > 1:
dataset_df.set_index(
[PROTEIN_NAME, PEPTIDE_CANONICAL], drop=True, inplace=True
)
dataset_df = dataset_df[
~dataset_df.index.isin(low_frequency_peptides)
].reset_index()
print(
f"{str(sample).upper()}: Peptides after remove low frequency peptides: {len(dataset_df.index)}"
)

if len(dataset_df[FRACTION].unique().tolist()) > 1:
print(f"{str(sample).upper()}: Merge features across fractions.. ")
dataset_df = merge_fractions(dataset_df)
print(
f"{str(sample).upper()}: Number of features after merging fractions: {len(dataset_df.index)}"
)

# TODO: Normalize at feature level between ms runs (technical repetitions)
# Step6: Filter user-specified proteins
if remove_ids is not None:
dataset_df = remove_protein_by_ids(dataset_df, remove_ids)
dataset_df.rename(columns={INTENSITY: NORM_INTENSITY}, inplace=True)
# Step7: Normalize at feature level between ms runs (technical repetitions)
if (
not skip_normalization
and nmethod != "none"
@@ -613,21 +585,48 @@
print(
f"{str(sample).upper()}: Number of features after normalization: {len(dataset_df.index)}"
)

## TODO: Assembly features to peptides
# Merge peptidoforms across fractions and technical repetitions
# Step8: Merge peptidoforms across fractions and technical repetitions
dataset_df = get_peptidoform_normalize_intensities(dataset_df)
print(
f"{str(sample).upper()}: Number of peptides after peptidofrom selection: {len(dataset_df.index)}"
)
if len(dataset_df[FRACTION].unique().tolist()) > 1:
print(f"{str(sample).upper()}: Merge features across fractions.. ")
dataset_df = merge_fractions(dataset_df)
print(
f"{str(sample).upper()}: Number of features after merging fractions: {len(dataset_df.index)}"
)
# Step9: Normalize the data.
if not skip_normalization:
if pnmethod == "globalMedian":
dataset_df.loc[:, NORM_INTENSITY] = (
dataset_df[NORM_INTENSITY] / med_map[sample]
)
elif pnmethod == "conditionMedian":
con = dataset_df[CONDITION].unique()[0]
dataset_df.loc[:, NORM_INTENSITY] = (
dataset_df[NORM_INTENSITY] / med_map[con][sample]
)

# Step10: Remove peptides with low frequency.
if remove_low_frequency_peptides and len(sample_names) > 1:
dataset_df.set_index(
[PROTEIN_NAME, PEPTIDE_CANONICAL], drop=True, inplace=True
)
dataset_df = dataset_df[
~dataset_df.index.isin(low_frequency_peptides)
].reset_index()
print(
f"{str(sample).upper()}: Peptides after remove low frequency peptides: {len(dataset_df.index)}"
)

# Assembly peptidoforms to peptides
# Step11: Assembly peptidoforms to peptides.
print(f"{str(sample).upper()}: Sum all peptidoforms per Sample...")
dataset_df = sum_peptidoform_intensities(dataset_df)
print(
f"{str(sample).upper()}: Number of peptides after selection: {len(dataset_df.index)}"
)

# Step12: Intensity transformation to log.
if log2:
dataset_df[NORM_INTENSITY] = np.log2(dataset_df[NORM_INTENSITY])
