
Commit

Merge pull request #65 from zprobot/master
update: benchmark
ypriverol authored Jun 4, 2024
2 parents 9a10435 + 9789a34 commit 946c7fe
Showing 3 changed files with 90 additions and 10 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -76,7 +76,7 @@ A peptidoform is a combination of a `PeptideSequence(Modifications) + Charge + B
> Note: At the moment, ibaqpy computes the ibaq values only based on unique peptides. Shared peptides are discarded. However, if a group of proteins share the same unique peptides (e.g., Pep1 -> Prot1;Prot2 and Pep2 -> Prot1;Prot2), the intensity of the proteins is summed and divided by the number of proteins in the group.
#### Calculate the IBAQ Value
First, the peptide intensity dataframe is grouped by protein name, sample name, and condition, and the intensities in each group are summed. Depending on the experiment type, a protein may be missing peptides in some samples, so the number of peptides detected for it varies across samples. To account for this, the summed intensity is normalized within each group as `sum(peptides) / n`, where `n` is the number of detected peptides. Finally, the sum of the intensity of the protein is divided by the number of theoretical peptides. See details in `peptides2proteins`.
First, the peptide intensity dataframe is grouped by protein name, sample name, and condition, and the intensities in each group are summed. Depending on the experiment type, a protein may be missing peptides in some samples, so the number of peptides detected for it varies across samples. To account for this, the summed intensity is normalized within each group as `sum(peptides) / n`, where `n` is the number of detected peptides. Finally, the normalized intensity of the protein is divided by the number of theoretical peptides. See details in `peptides2proteins`.
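
The calculation described in this paragraph can be sketched roughly as follows. This is a minimal illustration, not the body of `peptides2proteins`: the column names (`ProteinName`, `SampleID`, `Condition`, `PeptideSequence`, `Intensity`) and the `theoretical_peptides` lookup are assumptions made for the example.

```python
import pandas as pd

# Column names below are illustrative assumptions, not ibaqpy's constants.
KEYS = ["ProteinName", "SampleID", "Condition"]


def ibaq_sketch(peptides: pd.DataFrame, theoretical_peptides: dict) -> pd.DataFrame:
    """Sum peptide intensities per group, normalize by the number of detected
    peptides, then divide by the number of theoretical peptides."""
    grouped = (
        peptides.groupby(KEYS)
        .agg(
            Total=("Intensity", "sum"),        # summed intensity per group
            N=("PeptideSequence", "nunique"),  # detected peptides per group
        )
        .reset_index()
    )
    # sum(peptides) / n
    grouped["Normalized"] = grouped["Total"] / grouped["N"]
    # Divide by the (assumed, precomputed) number of theoretical peptides.
    grouped["Ibaq"] = grouped["Normalized"] / grouped["ProteinName"].map(theoretical_peptides)
    return grouped


example = pd.DataFrame(
    {
        "ProteinName": ["P1", "P1", "P1"],
        "SampleID": ["S1", "S1", "S2"],
        "Condition": ["C1", "C1", "C1"],
        "PeptideSequence": ["PEPA", "PEPB", "PEPA"],
        "Intensity": [10.0, 30.0, 8.0],
    }
)
result = ibaq_sketch(example, {"P1": 4})
print(result)
```

In this toy input, sample `S1` detects two peptides of `P1` and sample `S2` only one, so dividing by `n` keeps the two samples comparable before the division by the theoretical peptide count.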

> Note: In all scripts and result files, *uniprot accession* is used as the protein identifier.
89 changes: 84 additions & 5 deletions benchmarks/README.md
@@ -56,7 +56,12 @@ In summary, both datasets were searched with three search engines _SAGE_, _COMET

#### Coefficient of Variation (CV)

Coefficient of variation for all samples in both experiments using `quantile`, `median`, `median-cov`. We extracted human proteins common to 11 samples from IBAQ data. The mean of the coefficient of variation of all proteins in 11 samples was then calculated.
The coefficient of variation was computed for all samples in both experiments using three methods: `quantile`, `median`, and `median-cov`.
- `quantile`: During preprocessing, the samples are adjusted so that the mean and variance of all samples are equal. Finally, the sum of the intensity of the protein is divided by the number of theoretical peptides.
- `median`: During preprocessing, the samples are adjusted so that the medians of all samples are equal. Finally, the sum of the intensity of the protein is divided by the number of theoretical peptides.
- `median-cov`: During preprocessing, the samples are adjusted so that the medians of all samples are equal. Depending on the experiment type, a protein may be missing peptides in some samples, so the number of peptides detected for it varies across samples. To account for this, the summed intensity is normalized within each group as `sum(peptides) / n`, where `n` is the number of detected peptides. Finally, the normalized intensity of the protein is divided by the number of theoretical peptides.
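
As a rough illustration of the median adjustment shared by `median` and `median-cov`, the sketch below rescales every sample so that its median matches a common target. The column names and the choice of target (the median of the per-sample medians) are assumptions for the example; ibaqpy's own scaling may differ.

```python
import pandas as pd


def median_normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Rescale intensities so that all samples share the same median."""
    sample_medians = df.groupby("SampleID")["Intensity"].median()
    target = sample_medians.median()    # assumed common target value
    factors = target / sample_medians   # one scaling factor per sample
    out = df.copy()
    out["NormIntensity"] = out["Intensity"] * out["SampleID"].map(factors)
    return out


toy = pd.DataFrame(
    {
        "SampleID": ["S1"] * 3 + ["S2"] * 3,
        "Intensity": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    }
)
normalized = median_normalize(toy)
print(normalized.groupby("SampleID")["NormIntensity"].median())
```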

We extracted human proteins common to 11 samples from IBAQ data. The mean of the coefficient of variation of all proteins in 11 samples was then calculated.
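
The CV summary described here could be computed along the following lines. This is only a sketch under assumed column names (`ProteinName`, `SampleID`, `Ibaq`), and it uses the sample standard deviation; the benchmark scripts may make different choices.

```python
import pandas as pd


def mean_cv(ibaq: pd.DataFrame) -> float:
    """Mean coefficient of variation over proteins quantified in every sample."""
    wide = ibaq.pivot_table(index="ProteinName", columns="SampleID", values="Ibaq")
    common = wide.dropna()  # keep proteins present in all samples
    cv = common.std(axis=1) / common.mean(axis=1)  # per-protein CV across samples
    return float(cv.mean())


toy_ibaq = pd.DataFrame(
    {
        "ProteinName": ["A", "A", "B", "B", "C"],
        "SampleID": ["S1", "S2", "S1", "S2", "S1"],
        "Ibaq": [1.0, 1.0, 2.0, 4.0, 5.0],
    }
)
# Protein C is missing in S2, so it is excluded from the average.
print(mean_cv(toy_ibaq))
```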

Compared to `quantile`, both `median` and `median-cov` have a smaller coefficient of variation. `median-cov` has the smallest CV in the LFQ experiment.

@@ -179,8 +184,82 @@ We will normalize the MaxLFQ values of the proteins in the DIANN report by divid
</center>

### Performance testing
The [PXD030304](https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/absolute-expression/PXD030304/) project collected mass spectrometry data from 949 cancer cell lines, which were reanalyzed using the DIANN analysis pipeline within the quantms platform. The `diann_report.tsv` file is 167 GB; after conversion to a parquet file using quantmsio, it is 15.8 GB. We conducted performance testing in an environment with 128 GB of memory.

| Project | Samples | Size (diann report) | Size (parquet file) | Run time |
|--------|---------|----------|----------|----------|
| PXD030304 | 2013 | 167G | 15.8G | 2.75h |
We ran performance tests on the three methods. Because `median` and `median-cov` differ only in how ibaq is calculated, both are referred to as `median` below. The `median` method works at the sample level: unlike `quantile`, it does not read all the data at once but reads it in batches (20 samples at a time by default), which greatly reduces memory consumption.
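
The batch-wise reading described in this paragraph can be sketched as below. The helper names (`iter_batches`, `sample_medians_in_batches`, `load_batch`) and the column names are hypothetical; `load_batch` stands in for whatever reads only the requested samples, for example a `pandas.read_parquet` call restricted with its `filters` argument.

```python
from typing import Callable, Dict, Iterator, List, Sequence

import pandas as pd


def iter_batches(samples: Sequence[str], batch_size: int = 20) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `batch_size` sample names."""
    for start in range(0, len(samples), batch_size):
        yield list(samples[start:start + batch_size])


def sample_medians_in_batches(
    load_batch: Callable[[List[str]], pd.DataFrame],
    samples: Sequence[str],
    batch_size: int = 20,
) -> Dict[str, float]:
    """Compute per-sample medians while holding only one batch in memory."""
    medians: Dict[str, float] = {}
    for batch in iter_batches(samples, batch_size):
        chunk = load_batch(batch)  # only this batch's rows are loaded
        medians.update(chunk.groupby("SampleID")["Intensity"].median().to_dict())
    return medians


# In-memory stand-in for a large file, used only to exercise the sketch.
full = pd.DataFrame(
    {
        "SampleID": [f"S{i}" for i in range(45) for _ in range(3)],
        "Intensity": [float(i + j) for i in range(45) for j in range(3)],
    }
)
names = [f"S{i}" for i in range(45)]
medians = sample_medians_in_batches(
    lambda batch: full[full["SampleID"].isin(batch)], names
)
print(len(medians))
```

Because per-sample statistics such as the median need only one sample's rows at a time, peak memory is bounded by the batch size rather than by the number of samples, at the cost of more read passes.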

<table align="center">
<thead>
<tr>
<th>Project</th>
<th>File size (original)</th>
<th>File size (transformed)</th>
<th>MS runs</th>
<th>Samples</th>
<th>Method</th>
<th>Memory</th>
<th>Run time</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan=2>PXD016999.1</td>
<td rowspan=2>5.7 G</td>
<td rowspan=2>292 M</td>
<td rowspan=2>336</td>
<td rowspan=2>280</td>
<td>quantile</td>
<td>36.4 G</td>
<td>14 min</td>
</tr>
<tr>
<td>median</td>
<td>8.4 G</td>
<td>20 min</td>
</tr>
<tr>
<td rowspan=2>PXD019909</td>
<td rowspan=2>1.9 G</td>
<td rowspan=2>171 M</td>
<td rowspan=2>43</td>
<td rowspan=2>43</td>
<td>quantile</td>
<td>7.9 G</td>
<td>30 s</td>
</tr>
<tr>
<td>median</td>
<td>4.0 G</td>
<td>1.4 min</td>
</tr>
<tr>
<td rowspan=2>PXD010154</td>
<td rowspan=2>1.9 G</td>
<td rowspan=2>287 M</td>
<td rowspan=2>1367</td>
<td rowspan=2>38</td>
<td>quantile</td>
<td>32.1 G</td>
<td>8 min</td>
</tr>
<tr>
<td>median</td>
<td>16.2 G</td>
<td>12 min</td>
</tr>
<tr>
<td rowspan=2>PXD030304</td>
<td rowspan=2>167 G</td>
<td rowspan=2>15.8 G</td>
<td rowspan=2>6862</td>
<td rowspan=2>2013</td>
<td>quantile</td>
<td>> 128 G</td>
<td>> 2 days</td>
</tr>
<tr>
<td>median</td>
<td>13.1 G</td>
<td>2.75 h</td>
</tr>
</tbody>
</table>
9 changes: 5 additions & 4 deletions ibaqpy/ibaq/peptide_normalization.py
@@ -194,9 +194,9 @@ def data_common_process(data_df: pd.DataFrame, min_aa: int) -> pd.DataFrame:
    data_df = data_df[data_df["Condition"] != "Empty"]

    # Filter peptides with less amino acids than min_aa (default: 7)
    data_df = data_df[
        data_df.apply(lambda x: len(x[PEPTIDE_CANONICAL]) >= min_aa, axis=1)
    ]
    data_df.loc[:,'len'] = data_df[PEPTIDE_CANONICAL].apply(len)
    data_df = data_df[data_df['len']>=min_aa]
    data_df.drop(['len'],inplace=True,axis=1)
    data_df[PROTEIN_NAME] = data_df[PROTEIN_NAME].apply(parse_uniprot_accession)
    if FRACTION not in data_df.columns:
        data_df[FRACTION] = 1
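
For comparison, the length filter in this hunk could also be written without a temporary column, using a vectorized string-length mask. This is an alternative sketch, not part of the commit; the value assigned to `PEPTIDE_CANONICAL` here is an assumed column name, since the constant's real value is not shown in the diff.

```python
import pandas as pd

PEPTIDE_CANONICAL = "PeptideCanonical"  # assumed value; the real constant is not shown


def filter_short_peptides(data_df: pd.DataFrame, min_aa: int = 7) -> pd.DataFrame:
    """Keep rows whose canonical peptide has at least `min_aa` amino acids."""
    keep = data_df[PEPTIDE_CANONICAL].str.len() >= min_aa  # boolean mask, no helper column
    # .copy() returns an independent frame, so later column assignments
    # do not operate on a view of the original.
    return data_df.loc[keep].copy()


peptides = pd.DataFrame({PEPTIDE_CANONICAL: ["PEPTIDE", "ABC", "ABCDEFGH"]})
filtered = filter_short_peptides(peptides)
print(filtered)
```
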
@@ -561,7 +561,8 @@ def peptide_normalization(
        technical_repetitions, label, sample_names, choice = analyse_sdrf(sdrf)
    else:
        technical_repetitions, label, sample_names, choice = feature.experimental_inference
    low_frequency_peptides = feature.low_frequency_peptides
    if remove_low_frequency_peptides:
        low_frequency_peptides = feature.low_frequency_peptides
    header = False
    if not skip_normalization and pnmethod == "globalMedian":
        med_map = feature.get_median_map()