Merge pull request #31 from WangHong007/master

pre-update of ibaqpy
bigbio · Nov 15, 2023 · 10249dc · 10249dc
2 parents 4fe81e3 + bd652cd
commit 10249dc
Show file tree

Hide file tree

Showing 6 changed files with 143 additions and 247 deletions.
diff --git a/README.md b/README.md
@@ -6,16 +6,17 @@
 [![PyPI version](https://badge.fury.io/py/ibaqpy.svg)](https://badge.fury.io/py/ibaqpy)
 ![PyPI - Downloads](https://img.shields.io/pypi/dm/ibaqpy)
 
-iBAQ (intensity Based Absolute Quantification) determines the abundance of a protein by dividing the total precursor intensities by the number of theoretically observable peptides of the protein. The TPA (Total Protein Approach) value is determined by summing peptide intensities of each protein and then dividing by the molecular mass to determine the relative concentration of each protein. By using [ProteomicRuler](https://www.sciencedirect.com/science/article/pii/S1535947620337749), it is possible to calculate the protein copy number and absolute concentration. **ibaqpy** compute IBAQ values, TPA values, copy numbers and concentration for proteins starting from a msstats input file and a SDRF experimental design file. This package provides multiple tools: 
+iBAQ (intensity Based Absolute Quantification) determines the abundance of a protein by dividing the total precursor intensities by the number of theoretically observable peptides of the protein. The TPA (Total Protein Approach) value is determined by summing peptide intensities of each protein and then dividing by the molecular mass to determine the relative concentration of each protein. By using [ProteomicRuler](https://www.sciencedirect.com/science/article/pii/S1535947620337749), it is possible to calculate the protein copy number and absolute concentration. **ibaqpy** compute IBAQ values, TPA values, copy numbers and concentration for proteins starting from a msstats input file (or a feature parquet from [quantmsio](https://github.com/bigbio/quantms.io)) and a SDRF experimental design file. In addition, it supports the merging of iBAQ results from multiple datasets and the elimination of outliers and batch effects. This package provides multiple tools: 
 
-- `peptide_file_generation.py`: generate a peptide file from a msstats input file and a SDRF experimental design file. 
-
-- `peptide_normalization.py`: Normalize the input output file from script `peptide_file_generation.py`. It includes multiple steps such as peptidoform normalization, peptidorom to peptide summarization, peptide intensity normalization, and imputation. 
+- `peptide_normalization.py`: Generate the peptides dataframe from a msstats input file and a SDRF experimental design file (or directly from a feature parquet), then normalize the peptides dataframe. It includes multiple steps such as peptidoform normalization, peptidorom to peptide summarization, peptide intensity normalization, and imputation. 
 
 - `compute_ibaq.py`: Compute IBAQ values from the output file from script `peptide_normalization.py`.
 
-- `compute_tpa.py`: Compute TPA values, protein copy numbers and concentration from the output file from script `peptide_file_generation.py`.
+- `compute_tpa.py`: Compute TPA values, protein copy numbers and concentration from the output file from script `peptide_normalization.py`.
+
+- `datasets_merge.py`: Merge ibaq results from multiple datasets. It consists of three steps: missing value imputation, outlier removal, and batch effect removal.
 
+**NOTE:** In all scripts and result files, *uniprot accession* is used as the protein identifier.
 
 ### How to install ibaqpy
 
@@ -61,7 +62,7 @@ E.g. http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/absolute
 ### Peptide Normalization - peptide_normalization.py
 
 ```asciidoc
-python peptide_normalization.py --msstats PXD003947.sdrf_openms_design_msstats_in.csv --sdrf PXD003947.sdrf.tsv --remove_decoy_contaminants data/contaminants_ids.tsv --remove_low_frequency_peptides --output PXD003947-peptides-norm.tsv
+python peptide_normalization.py --msstats PXD003947.sdrf_openms_design_msstats_in.csv --sdrf PXD003947.sdrf.tsv --remove_ids data/contaminants_ids.tsv --remove_decoy_contaminants --remove_low_frequency_peptides --output PXD003947-peptides-norm.csv
 ``` 
 
 The command provides an additional `flag` for skip_normalization, pnormalization, compress, log2, violin, verbose.
@@ -71,7 +72,7 @@ Usage: peptide_normalization.py [OPTIONS]
 
 Options:
   -m, --msstats TEXT              MsStats file import generated by quantms
-  -p, --parquet TEXT              Parquet file import generated by quantms
+  -p, --parquet TEXT              Parquet file import generated by quantmsio
   -s, --sdrf TEXT                 SDRF file import generated by quantms
   --min_aa INTEGER                Minimum number of amino acids to filter
                                   peptides
@@ -120,25 +121,25 @@ Peptide normalization starts from the peptides dataframe. The structure of the i
 - SampleID: Sample ID `(e.g. PXD003947-Sample-3)`
 - StudyID: Study ID `(e.g. PXD003947)`. In most of the cases the study ID is the same as the ProteomeXchange ID.
 
-#### Removing Contaminants and Decoys
+#### 1. Removing Contaminants and Decoys
 
-The first step is to remove contaminants and decoys from the input file. The script `peptide_normalization.py` provides a parameter `--contaminants` for the user to provide a file with a list of protein accessions which represent each contaminant in the file. An example file can be seen in `data/contaminants.txt`. In addition to all the proteins accessions, the tool remove all the proteins with the following prefixes: `CONTAMINANT` and `DECOY`.
+The first step is to remove contaminants and decoys. The script `peptide_normalization.py` provides a parameter `--remove_decoy_contaminants` as a flag to remove all the proteins with the following prefixes: `CONTAMINANT` and `DECOY`. And the user can provide a file with a list of protein accessions which represent each contaminant or high abundant protein in the file. An example file can be seen in `data/contaminants_ids.txt`.
 
-#### Peptidoform Normalization
+#### 2. Peptidoform Normalization
 
 A peptidoform is a combination of a `PeptideSequence(Modifications) + Charge + BioReplicate + Fraction`. In the current version of the file, each row correspond to one peptidoform. 
 
 The current version of the tool uses the parackage [qnorm](https://pypi.org/project/qnorm/) to normalize the intensities for each peptidofrom. **qnorm** implements a quantile normalization method. 
 
-#### Peptidoform to Peptide Summarization
+#### 3. Peptidoform to Peptide Summarization
 
 For each peptidoform a peptide sequence (canonical) with not modification is generated. The intensity of all peptides group by biological replicate are `sum`. 
 
 Then, the intensities of the peptides across different biological replicates are summarize using the function `median`. 
 
 At the end of this step, for each peptide, the corresponding protein + the intensity of the peptide is stored. 
 
-#### Peptide Intensity Imputation and Normalization
+#### 4. Peptide Intensity Imputation and Normalization
 
 Before the final two steps (peptide normalization and imputation), the algorithm removes all peptides that are source of missing values significantly. The algorithm removes all peptides that have more than 80% of missing values and peptides that do not appear in more than 1 sample. 
 
@@ -188,15 +189,15 @@ Options:
   --help                Show this message and exit.
 ```
 
-#### Performs the Enzymatic Digestion
+#### 1. Performs the Enzymatic Digestion
 The current version of this tool uses OpenMS method to load fasta file, and use [ProteaseDigestion](https://openms.de/current_doxygen/html/classOpenMS_1_1ProteaseDigestion.html) to enzyme digestion of protein sequences, and finally get the theoretical peptide number of each protein.
 
-#### Calculate the IBAQ Value
+#### 2. Calculate the IBAQ Value
 First, peptide intensity dataframe was grouped according to protein name, sample name and condition. The protein intensity of each group was summed. Finally, the sum of the intensity of the protein is divided by the number of theoretical peptides.
 
 If protein-group exists in the peptide intensity dataframe, the intensity of all proteins in the protein-group is summed based on the above steps, and then multiplied by the number of proteins in the protein-group.
 
-#### IBAQ Normalization  
+#### 3. IBAQ Normalization  
 Normalize the ibaq values using the total ibaq of the sample. The resulted ibaq values are then multiplied by 100'000'000 (PRIDE database noramalization), for the ibaq ppb and log10 shifted by 10 (ProteomicsDB)
 
 ### Compute TPA - compute_tpa.py
@@ -206,9 +207,7 @@ The total protein approach (TPA) is a label- and standard-free method for absolu
 
 ```asciidoc
 python compute_tpa.py --fasta Homo-sapiens-uniprot-reviewed-contaminants-decoy-202210.fasta --contaminants contaminants_ids.tsv --peptides PXD003947-peptides.csv --ruler --ploidy 2 --cpc 200 --output PXD003947-tpa.tsv
-``` 
-
-Note: The peptide intensity file used in the calculation of TPA and the ProteomicRuler is not normalized!
+```
 
 ```asciidoc
 python compute_tpa.py --help
@@ -242,10 +241,10 @@ Options:
   --help                Show this message and exit.
 ```
 
-#### Calculate the TPA Value
+#### 1. Calculate the TPA Value
 The OpenMS tool was used to calculate the theoretical molecular mass of each protein. Similar to the calculation of IBAQ, the TPA value of protein-group was the sum of its intensity divided by the sum of the theoretical molecular mass.
 
-#### Calculate the Cellular Protein Copy Number and Concentration
+#### 2. Calculate the Cellular Protein Copy Number and Concentration
 The protein copy calculation follows the following formula:
 ```
 protein copies per cell = protein MS-signal *  (avogadro / molecular mass) * (DNA mass / histone MS-signal)
@@ -297,15 +296,15 @@ Options:
   --help                          Show this message and exit.
 ```
 
-#### Impute Missing Values
+#### 1. Impute Missing Values
 Remove proteins missing more than 30% of all samples. Users can keep tissue parts interested, and data will be converted to a expression matrix (samples as columns, proteins as rows), then missing values will be imputed with `KNNImputer`. 
 
-#### Remove Outliers
+#### 2. Remove Outliers
 When outliers are removed, multiple hierarchical clustering is performed using `hdbscan.HDBSCAN`, where outliers are labeled -1 in the PCA plot. When clustering is performed, the default cluster minimum value and the minimum number of neighbors of the core point are the minimum number of samples in all datasets.
 
-**Users can skip this step and do outliers removal manually.*
+*Users can skip this step and do outliers removal manually.*
 
-#### Batch Effect Correction
+#### 3. Batch Effect Correction
 Using `pycombat` for batch effect correction, and batch is set to `datasets` (refers specifically to PXD ids) and the covariate should be `tissue_part`.
 
 ### Credits 

diff --git a/bin/compute_ibaq.py b/bin/compute_ibaq.py
@@ -43,13 +43,13 @@ def normalize_ibaq(res: DataFrame) -> DataFrame:
     return res
 
 
-def parse_uniprot_name(identifier: str) -> str:
+def parse_uniprot_accession(identifier: str) -> str:
     """
-    Parse the uniprot name from the identifier  (e.g. sp|P12345|PROT_NAME)
+    Parse the uniprot accession from the identifier  (e.g. sp|P12345|PROT_NAME)
     :param identifier: Uniprot identifier
     :return:
     """
-    return identifier.split("|")[2]
+    return identifier.split("|")[1]
 
 
 @click.command()
@@ -119,7 +119,7 @@ def ibaq_compute(
         exit(1)
 
     fasta_proteins = list()  # type: list[FASTAEntry]
-    protein_names = list()
+    protein_accessions = list()
     FASTAFile().load(fasta, fasta_proteins)
     uniquepepcounts = dict()  # type: dict[str, int]
     digestor = ProteaseDigestion()
@@ -144,12 +144,13 @@ def get_average_nr_peptides_unique_bygroup(pdrow: Series) -> Series:
         digest = list()  # type: list[str]
         digestor.digest(AASequence().fromString(entry.sequence), digest, min_aa, max_aa)
         digestuniq = set(digest)
-        protein_name = parse_uniprot_name(entry.identifier)
+        # TODO: We keep uniprot accessions rather than names.
+        protein_name = parse_uniprot_accession(entry.identifier)
         uniquepepcounts[protein_name] = len(digestuniq)
-        protein_names.append(protein_name)
+        protein_accessions.append(protein_name)
 
     data = pd.read_csv(peptides, sep=",")
-    data = data[data[PROTEIN_NAME].isin(protein_names)]
+    data = data[data[PROTEIN_NAME].isin(protein_accessions)]
     print(data.head())
     # next line assumes unique peptides only (at least per indistinguishable group)
 

diff --git a/bin/peptide_normalization.py b/bin/peptide_normalization.py
@@ -1,6 +1,6 @@
 #!/usr/bin/env python
 import gc
-
+import uuid
 import click
 import matplotlib.pyplot as plt
 import numpy as np
@@ -35,34 +35,22 @@
     TMT11plex,
     TMT16plex,
     get_canonical_peptide,
-    get_reference_name,
+    get_spectrum_prefix,
     get_study_accession,
     parquet_map,
     parse_uniprot_accession,
     plot_box_plot,
     plot_distributions,
     remove_contaminants_decoys,
-    remove_extension_file,
     remove_protein_by_ids,
     sum_peptidoform_intensities,
+    get_peptidoform_normalize_intensities,
+    average_peptide_intensities,
+    print_help_msg,
 )
 
 
-def print_dataset_size(dataset: DataFrame, message: str, verbose: bool) -> None:
-    if verbose:
-        print(message + str(len(dataset.index)))
-
-
-def print_help_msg(command) -> None:
-    """
-    Print help information
-    :param command: command to print helps
-    :return: print help
-    """
-    with click.Context(command) as ctx:
-        click.echo(command.get_help(ctx))
-
-
+# TODO: The following two func are useless.
 def remove_outliers_iqr(dataset: DataFrame):
     """
     This method removes outliers from the dataframe inplace, the variable used for the outlier removal is Intensity
@@ -88,6 +76,11 @@ def remove_missing_values(normalize_df: DataFrame, ratio: float = 0.3) -> DataFr
     return normalize_df
 
 
+def print_dataset_size(dataset: DataFrame, message: str, verbose: bool) -> None:
+    if verbose:
+        print(message + str(len(dataset.index)))
+
+
 def intensity_normalization(
     dataset: DataFrame,
     field: str,
@@ -160,53 +153,6 @@ def intensity_normalization(
     return dataset
 
 
-def get_peptidoform_normalize_intensities(
-    dataset: DataFrame, higher_intensity: bool = True
-) -> DataFrame:
-    """
-    Select the best peptidoform for the same sample and the same replicates. A peptidoform is the combination of
-    a (PeptideSequence + Modifications) + Charge state.
-    :param dataset: dataset including all properties
-    :param higher_intensity: select based on normalize intensity, if false based on best scored peptide
-    :return:
-    """
-    dataset = dataset[dataset[NORM_INTENSITY].notna()]
-    if higher_intensity:
-        dataset = dataset.loc[
-            dataset.groupby(
-                [PEPTIDE_SEQUENCE, PEPTIDE_CHARGE, SAMPLE_ID, CONDITION, BIOREPLICATE],
-                observed=True,
-            )[NORM_INTENSITY].idxmax()
-        ].reset_index(drop=True)
-    else:
-        dataset = dataset.loc[
-            dataset.groupby(
-                [PEPTIDE_SEQUENCE, PEPTIDE_CHARGE, SAMPLE_ID, CONDITION, BIOREPLICATE],
-                observed=True,
-            )[SEARCH_ENGINE].idxmax()
-        ].reset_index(drop=True)
-    return dataset
-
-
-def average_peptide_intensities(dataset: DataFrame) -> DataFrame:
-    """
-    Median the intensities of all the peptidoforms for a specific peptide sample combination.
-    :param dataset: Dataframe containing all the peptidoforms
-    :return: New dataframe
-    """
-    dataset_df = dataset.groupby(
-        [PEPTIDE_CANONICAL, SAMPLE_ID, CONDITION], observed=True
-    )[NORM_INTENSITY].median()
-    dataset_df = dataset_df.reset_index()
-    dataset_df = pd.merge(
-        dataset_df,
-        dataset[[PROTEIN_NAME, PEPTIDE_CANONICAL, SAMPLE_ID, CONDITION]],
-        how="left",
-        on=[PEPTIDE_CANONICAL, SAMPLE_ID, CONDITION],
-    )
-    return dataset_df
-
-
 def remove_low_frequency_peptides_(
     dataset_df: DataFrame, percentage_samples: float = 0.20
 ):
@@ -411,7 +357,7 @@ def impute_peptide_intensities(dataset_df, field, class_field):
 @click.option(
     "--qc_report",
     help="PDF file to store multiple QC images",
-    default="peptideNorm-QCprofile.pdf",
+    default=f"peptideNorm-QCprofile-{str(uuid.uuid4())}.pdf",
 )
 def peptide_normalization(
     msstats: str,
@@ -469,7 +415,7 @@ def peptide_normalization(
 
         # Read the sdrf file
         sdrf_df = pd.read_csv(sdrf, sep="\t", compression=compression_method)
-        sdrf_df[REFERENCE] = sdrf_df["comment[data file]"].apply(remove_extension_file)
+        sdrf_df[REFERENCE] = sdrf_df["comment[data file]"].apply(get_spectrum_prefix)
         print(sdrf_df)
 
         if FRACTION not in feature_df.columns:
@@ -493,7 +439,7 @@ def peptide_normalization(
         # Merged the SDRF with the Resulted file
         labels = set(sdrf_df["comment[label]"])
         if CHANNEL not in feature_df.columns:
-            feature_df[REFERENCE] = feature_df[REFERENCE].apply(remove_extension_file)
+            feature_df[REFERENCE] = feature_df[REFERENCE].apply(get_spectrum_prefix)
             dataset_df = pd.merge(
                 feature_df,
                 sdrf_df[["source name", REFERENCE]],
@@ -522,7 +468,7 @@ def peptide_normalization(
                 .rename(columns={"index": "comment[label]"})
             )
             sdrf_df = sdrf_df.merge(choice, on="comment[label]", how="left")
-            feature_df[REFERENCE] = feature_df[REFERENCE].apply(get_reference_name)
+            feature_df[REFERENCE] = feature_df[REFERENCE].apply(get_spectrum_prefix)
             dataset_df = pd.merge(
                 feature_df,
                 sdrf_df[["source name", REFERENCE, CHANNEL]],
@@ -543,7 +489,7 @@ def peptide_normalization(
                 .rename(columns={"index": "comment[label]"})
             )
             sdrf_df = sdrf_df.merge(choice, on="comment[label]", how="left")
-            feature_df[REFERENCE] = feature_df[REFERENCE].apply(get_reference_name)
+            feature_df[REFERENCE] = feature_df[REFERENCE].apply(get_spectrum_prefix)
             dataset_df = pd.merge(
                 feature_df,
                 sdrf_df[["source name", REFERENCE, CHANNEL]],