major improvements of mzml statistics #37

ypriverol · 2024-11-29T19:20:17Z

This PR, instead of loading everything in the panda's dataframe all the data; it writes the intensity values using parquet batches.

Summary by CodeRabbit

Release Notes

New Features
- Introduced a new batch processing option for mass spectrometry data, allowing users to specify batch sizes for improved efficiency.
- Added a new class for enhanced reading and processing of Peptide Spectrum Matches (PSMs).
Improvements
- Updated version number of the quantms-utils package to 0.0.16.
- Improved error handling and logic for extracting global q-values in PSM conversion.
- Enhanced mass spectrometry file handling with new functions for better path resolution and data extraction.
Chores
- Updated .gitignore to exclude specific test data directories.

Fixed ms2rescore bugs

coderabbitai · 2024-11-29T19:20:24Z

Walkthrough

The pull request introduces updates to the quantms-utils project, primarily focusing on version increments across multiple files. The version number is updated from 0.0.15 to 0.0.16 in pyproject.toml, __init__.py, and recipe/meta.yaml. Significant enhancements are made to the mzml_statistics.py file, including new parameters and functions for batch processing mzML files. Other modifications include improvements to the convert_psm function and the addition of a new class in ms2rescore.py for enhanced PSM processing.

Changes

File Path	Change Summary
`pyproject.toml`	Version updated from `0.0.15` to `0.0.16`.
`quantmsutils/__init__.py`	Version updated from `"0.0.15"` to `"0.0.16"`.
`quantmsutils/mzml/mzml_statistics.py`	Updated `mzml_statistics` function signature to include `batch_size`. New function `batch_write_mzml_streaming` added. Replaced `parse_mzml` with `batch_write_mzml_streaming`. New function `_resolve_ms_path` added. Enhanced batch processing logic.
`quantmsutils/psm/psm_conversion.py`	Modified `convert_psm` function to include conditional logic for `global_qvalue`.
`quantmsutils/rescoring/ms2rescore.py`	Added `IDXMLReaderPatch` class with methods for PSM handling. Modified `rescore_idxml` to use new class.
`recipe/meta.yaml`	Version updated from `0.0.15` to `0.0.16`.
`.gitignore`	New entry added to ignore test data directory: `/tests/test_data/hMICAL1_coiPAnP-N2-200_3Murea-1Mthiourea-200mMtcep_14733.d/`

Possibly related PRs

upgrade the sdrf-pipelines 0.0.29 -> 0.0.30 #29: The changes in this PR involve updating the version of the sdrf-pipelines dependency, which is related to the overall versioning updates reflected in the main PR for the quantms-utils package.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 6

🧹 Outside diff range and nitpick comments (6)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between d1b7d62 and 8d3e0ad.

📒 Files selected for processing (6)

pyproject.toml (1 hunks)
quantmsutils/__init__.py (1 hunks)
quantmsutils/mzml/mzml_statistics.py (2 hunks)
quantmsutils/psm/psm_conversion.py (1 hunks)
quantmsutils/rescoring/ms2rescore.py (2 hunks)
recipe/meta.yaml (1 hunks)

✅ Files skipped from review due to trivial changes (3)

pyproject.toml
quantmsutils/init.py
recipe/meta.yaml

🔇 Additional comments (1)

quantmsutils/psm/psm_conversion.py (1)

144-145: Verify q-value data type consistency.

Let's ensure the q-value meta value maintains consistent data type with other global_qvalue sources.

✅ Verification successful

Let me gather more information about the data type handling in the codebase.

Let me check one more aspect to understand the type handling in the OpenMS interface.

Based on the gathered information, I can now provide the final response:

Q-value data type is consistently handled

The codebase shows consistent handling of q-values across different sources:

Q-values from getScore() are numeric (float) by default in pyOpenMS
Q-values from getMetaValue() for both "q-value" and "MS:1001491" are stored in global_qvalue which is initialized as np.nan (numeric)
All score-related values in the codebase are handled as floating-point numbers

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check the data types of q-values across different idXML files

# Search for q-value related code to understand type handling
rg -A 3 "q-value|MS:1001491" --type py

# Look for any type conversion or validation
ast-grep --pattern 'getMetaValue("q-value")'
ast-grep --pattern 'getMetaValue("MS:1001491")'

Length of output: 2712

Script:

#!/bin/bash
# Look for global_qvalue assignments and type handling
rg -A 2 "global_qvalue =" --type py

# Look for any type validation or conversion of q-values
ast-grep --pattern 'float($_)'

Length of output: 2835

Script:

#!/bin/bash
# Look for OpenMS type handling and initialization
rg -A 3 "import pyopenms|from pyopenms" --type py
rg -A 3 "getScore|setScore" --type py

Length of output: 3484

quantmsutils/mzml/mzml_statistics.py

coderabbitai · 2024-11-29T19:25:28Z

quantmsutils/rescoring/ms2rescore.py

+                accession.decode() for accession in peptide_hit.extractProteinAccessionsSet()
+            ],


⚠️ Potential issue

Verify the need for decoding protein accessions

In the protein_list comprehension, you are using accession.decode() on each accession obtained from peptide_hit.extractProteinAccessionsSet(). If accession is already a string, calling decode() will raise an AttributeError. Confirm whether accession is a bytes object that requires decoding. If accession is a string, you should remove the decode() call.

Apply this diff if accession is already a string:

protein_list=[ - accession.decode() for accession in peptide_hit.extractProteinAccessionsSet() + accession for accession in peptide_hit.extractProteinAccessionsSet() ],

Committable suggestion skipped: line range outside the PR's diff.

coderabbitai · 2024-11-29T19:25:28Z

quantmsutils/rescoring/ms2rescore.py

+        self.protein_ids, self.peptide_ids = self._parse_idxml()
+        self.user_params_metadata = self._get_userparams_metadata(self.peptide_ids[0].getHits()[0])
+        self.rescoring_features = self._get_rescoring_features(self.peptide_ids[0].getHits()[0])


⚠️ Potential issue

Handle potential IndexError when accessing peptide IDs and hits

In the __init__ method of IDXMLReaderPatch, you are accessing self.peptide_ids[0].getHits()[0] to initialize self.user_params_metadata and self.rescoring_features. If self.peptide_ids is empty or if self.peptide_ids[0].getHits() is empty, this will raise an IndexError. Consider adding a check to ensure that peptide IDs and hits are available before accessing them.

Apply this diff to handle the potential empty lists:

def __init__(self, filename: Union[Path, str], *args, **kwargs) -> None: super().__init__(filename, *args, **kwargs) self.protein_ids, self.peptide_ids = self._parse_idxml() + if not self.peptide_ids or not self.peptide_ids[0].getHits(): + raise ValueError("No peptide IDs or peptide hits found in the idXML file.") self.user_params_metadata = self._get_userparams_metadata(self.peptide_ids[0].getHits()[0]) self.rescoring_features = self._get_rescoring_features(self.peptide_ids[0].getHits()[0]) self.skip_invalid_psm = 0

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

self.protein_ids, self.peptide_ids = self._parse_idxml()

self.user_params_metadata = self._get_userparams_metadata(self.peptide_ids[0].getHits()[0])

self.rescoring_features = self._get_rescoring_features(self.peptide_ids[0].getHits()[0])

self.protein_ids, self.peptide_ids = self._parse_idxml()

if not self.peptide_ids or not self.peptide_ids[0].getHits():

raise ValueError("No peptide IDs or peptide hits found in the idXML file.")

self.user_params_metadata = self._get_userparams_metadata(self.peptide_ids[0].getHits()[0])

self.rescoring_features = self._get_rescoring_features(self.peptide_ids[0].getHits()[0])

quantmsutils/mzml/mzml_statistics.py

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 8d3e0ad and 5154daf.

📒 Files selected for processing (1)

quantmsutils/mzml/mzml_statistics.py (2 hunks)

🔇 Additional comments (5)

quantmsutils/mzml/mzml_statistics.py (5)

184-186: LGTM! Efficient streaming implementation

The implementation correctly uses PyOpenMS's streaming mode through MzMLFile().transform() for memory-efficient processing of mzML files.

24-24: ⚠️ Potential issue

Fix incorrect regular expression pattern

The pattern r"[scan|spectrum]=(\d+)" will not correctly match "scan=" or "spectrum=". Square brackets [] define a character class that matches any single character within them.

-        self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+        self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")

52-56: ⚠️ Potential issue

Add null check for precursors

Direct access to spectrum.getPrecursors()[0] can cause an IndexError if no precursors are available.

         if ms_level == 2:
-            precursor = spectrum.getPrecursors()[0]
-            charge_state = precursor.getCharge()
-            exp_mz = precursor.getMZ()
+            precursors = spectrum.getPrecursors()
+            if precursors:
+                precursor = precursors[0]
+                charge_state = precursor.getCharge()
+                exp_mz = precursor.getMZ()
+            else:
+                charge_state = None
+                exp_mz = None

57-63: 🛠️ Refactor suggestion

Consider streaming spectrum data instead of accumulating in memory

When id_only is True, all spectrum data is accumulated in self.psm_parts before being written at once. For large files, this could lead to memory issues.

Consider writing spectrum data in batches similar to how batch_data is handled:

             if self.id_only:
                 scan_id = self.scan_pattern.findall(spectrum.getNativeID())[0]
-                self.psm_parts.append([
-                    scan_id, ms_level,
-                    mz_array.tolist(),
-                    intensity_array.tolist()
-                ])
+                if len(self.psm_parts) >= self.batch_size:
+                    self._write_spectrum_batch()
+                self.psm_parts.append({
+                    "scan": scan_id,
+                    "ms_level": ms_level,
+                    "mz": mz_array.tolist(),
+                    "intensity": intensity_array.tolist()
+                })

209-246: ⚠️ Potential issue

Improve error handling for parquet writer

The parquet_writer is created outside the try block but closed in the finally block. This could raise an UnboundLocalError if writer creation fails.

-            # Set up parquet writer
-            parquet_writer = pq.ParquetWriter(
-                output_path,
-                parquet_schema,
-                compression='gzip'
-            )
-
+            parquet_writer = None
             try:
+                # Set up parquet writer
+                parquet_writer = pq.ParquetWriter(
+                    output_path,
+                    parquet_schema,
+                    compression='gzip'
+                )
+
                 # Stream data in batches
                 for chunk in pd.read_sql_query(
                     """
                     SELECT 
                         Id, 
                         CASE 
                             WHEN MsMsType IN (8, 9) THEN 2 
                             WHEN MsMsType = 0 THEN 1 
                             ELSE NULL 
                         END as MsMsType,
                         NumPeaks, 
                         MaxIntensity, 
                         SummedIntensities, 
                         Time,
                         Charge,
                         MonoisotopicMz
                     FROM frames
                     """,
                     conn,
                     chunksize=batch_size
                 ):
                     # Prepare batch
                     chunk['AcquisitionDateTime'] = acquisition_date_time

                     # Convert to pyarrow table and write
                     batch_table = pa.Table.from_pandas(chunk, schema=parquet_schema)
                     parquet_writer.write_table(batch_table)

             finally:
-                parquet_writer.close()
+                if parquet_writer:
+                    parquet_writer.close()

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)

quantmsutils/mzml/mzml_statistics.py (2)

139-143: Add validation for batch_size parameter

The batch_size parameter should be validated to ensure it's a positive integer.

 @click.option(
     "--batch_size",
     type=int,
     default=10000,
+    callback=lambda ctx, param, value: value if value > 0 else 
+        ctx.fail('batch_size must be positive'),
     help="Number of rows to write in each batch"
 )

218-249: Enhance error handling in batch_write_bruker_d

The function should handle potential database errors and provide more specific error messages.

             try:
                 # Stream data in batches
                 for chunk in pd.read_sql_query(
                     """
                     SELECT 
                         Id, 
                         CASE 
                             WHEN MsMsType IN (8, 9) THEN 2 
                             WHEN MsMsType = 0 THEN 1 
                             ELSE NULL 
                         END as MsMsType,
                         NumPeaks, 
                         MaxIntensity, 
                         SummedIntensities, 
                         Time,
                         Charge,
                         MonoisotopicMz
                     FROM frames
                     """,
                     conn,
                     chunksize=batch_size
                 ):
                     # Prepare batch
                     chunk['AcquisitionDateTime'] = acquisition_date_time
 
                     # Convert to pyarrow table and write
                     batch_table = pa.Table.from_pandas(chunk, schema=parquet_schema)
                     parquet_writer.write_table(batch_table)
 
+            except sqlite3.Error as e:
+                print(f"Database error: {e}")
+                raise
+            except Exception as e:
+                print(f"Error processing Bruker data: {e}")
+                raise
             finally:
                 parquet_writer.close()

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 5154daf and db2f3b9.

📒 Files selected for processing (1)

quantmsutils/mzml/mzml_statistics.py (2 hunks)

🔇 Additional comments (4)

quantmsutils/mzml/mzml_statistics.py (4)

4-11: LGTM: Required imports added for batch processing

The new imports (Optional, numpy, pyarrow) are appropriate for implementing batch processing and parquet file handling.

264-283: LGTM: Robust file path resolution implementation

The _resolve_ms_path function provides comprehensive file path resolution with proper validation of extensions and clear error messages.

24-24: ⚠️ Potential issue

Fix incorrect regular expression pattern

The pattern r"[scan|spectrum]=(\d+)" is incorrect. Square brackets [] define a character class that matches any single character within them, not the words "scan" or "spectrum".

-        self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+        self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")

52-56: ⚠️ Potential issue

Add null check for precursors

Direct access to spectrum.getPrecursors()[0] could raise an IndexError if no precursors are available.

         if ms_level == 2:
+            precursors = spectrum.getPrecursors()
+            if not precursors:
+                charge_state = None
+                exp_mz = None
+            else:
-            precursor = spectrum.getPrecursors()[0]
+                precursor = precursors[0]
                charge_state = precursor.getCharge()
                exp_mz = precursor.getMZ()

quantmsutils/mzml/mzml_statistics.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

coderabbitai

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (3)

quantmsutils/mzml/mzml_statistics.py (3)

15-22: Add type hints to improve code maintainability

Consider adding type hints to the class attributes for better code maintainability and IDE support.

-    def __init__(self, parquet_schema, output_path, batch_size=10000, id_only=False):
+    def __init__(self, parquet_schema: pa.Schema, output_path: str, batch_size: int = 10000, id_only: bool = False):
         self.parquet_schema = parquet_schema
         self.output_path = output_path
         self.batch_size = batch_size
         self.id_only = id_only
-        self.batch_data = []
-        self.psm_parts = []
-        self.parquet_writer = None
+        self.batch_data: list[dict] = []
+        self.psm_parts: list[list] = []
+        self.parquet_writer: Optional[pq.ParquetWriter] = None

264-265: Improve error message for unsupported file types

The error message could be more helpful by listing the supported file types.

-        raise RuntimeError(f"Unsupported file type: {ms_path}")
+        supported_types = [".mzml", ".d"]
+        raise RuntimeError(
+            f"Unsupported file type: {ms_path}. "
+            f"Supported file types are: {', '.join(supported_types)}"
+        )

208-212: Use parameterized query to prevent SQL injection

Even though the SQL query uses a fixed string, it's a good practice to use parameterized queries consistently.

-            acquisition_date_time = conn.execute(
-                "SELECT Value FROM GlobalMetadata WHERE key='AcquisitionDateTime'"
-            ).fetchone()[0]
+            acquisition_date_time = conn.execute(
+                "SELECT Value FROM GlobalMetadata WHERE key = ?",
+                ("AcquisitionDateTime",)
+            ).fetchone()[0]

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between db2f3b9 and 9f9c82f.

📒 Files selected for processing (1)

quantmsutils/mzml/mzml_statistics.py (2 hunks)

🔇 Additional comments (2)

quantmsutils/mzml/mzml_statistics.py (2)

24-24: ⚠️ Potential issue

Fix the regular expression pattern for scan ID extraction

The regex pattern r"[scan|spectrum]=(\d+)" is incorrect as it uses square brackets which create a character class. This means it matches any single character that is 's', 'c', 'a', 'n', '|', 'p', 'e', 't', 'r', 'u', 'm', leading to incorrect scan ID extraction.

-        self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+        self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")

52-56: ⚠️ Potential issue

Add null check for precursors to prevent IndexError

Direct access to spectrum.getPrecursors()[0] can raise an IndexError if the spectrum has no precursors.

         if ms_level == 2:
-            precursor = spectrum.getPrecursors()[0]
-            charge_state = precursor.getCharge()
-            exp_mz = precursor.getMZ()
+            precursors = spectrum.getPrecursors()
+            if precursors:
+                precursor = precursors[0]
+                charge_state = precursor.getCharge()
+                exp_mz = precursor.getMZ()
+            else:
+                charge_state = None
+                exp_mz = None

quantmsutils/mzml/mzml_statistics.py

coderabbitai

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (1)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9f9c82f and d64503d.

📒 Files selected for processing (2)

.gitignore (1 hunks)
quantmsutils/mzml/mzml_statistics.py (1 hunks)

✅ Files skipped from review due to trivial changes (1)

.gitignore

🧰 Additional context used

🪛 Ruff (0.8.0)

quantmsutils/mzml/mzml_statistics.py

4-4: typing.Dict imported but unused

Remove unused import: typing.Dict

(F401)

🔇 Additional comments (3)

quantmsutils/mzml/mzml_statistics.py (3)

299-318: Well-implemented path resolution logic

The _resolve_ms_path function is robustly implemented with:

Proper path existence checking
Case-insensitive extension handling
Clear error messages
Support for multiple file formats

198-210: Efficient streaming implementation

The implementation correctly uses pyopenms streaming mode with BatchWritingConsumer, addressing the memory efficiency concerns raised in previous reviews.

29-29: ⚠️ Potential issue

Fix incorrect regular expression pattern

The current pattern r"[scan|spectrum]=(\d+)" uses square brackets which creates a character class matching any single character in "scan|spectrum". This will not correctly match the intended patterns.

-        self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+        self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")

quantmsutils/mzml/mzml_statistics.py

coderabbitai

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (2)

quantmsutils/mzml/mzml_statistics.py (2)

4-4: Remove unused import

The Dict type hint is imported but never used in the code.

-from typing import Optional, List, Dict
+from typing import Optional, List

🧰 Tools

🪛 Ruff (0.8.0)

4-4: typing.Dict imported but unused

Remove unused import: typing.Dict

(F401)

289-301: Enhance error handling for database operations

Add specific error handling for database operations to provide more informative error messages.

             try:
                 # Stream data in batches
                 for chunk in pd.read_sql_query(query, conn, chunksize=batch_size):
                     chunk['AcquisitionDateTime'] = acquisition_date_time
                     for col in schema.names:
                         if col not in chunk.columns:
                             chunk[col] = None
                     batch_table = pa.Table.from_pandas(chunk, schema=schema)
                     parquet_writer.write_table(batch_table)
+            except sqlite3.Error as e:
+                print(f"Database error: {e}")
+                raise
+            except Exception as e:
+                print(f"Error processing data: {e}")
+                raise
             finally:
                 parquet_writer.close()

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between d64503d and ddda856.

📒 Files selected for processing (1)

quantmsutils/mzml/mzml_statistics.py (1 hunks)

🧰 Additional context used

🪛 Ruff (0.8.0)

quantmsutils/mzml/mzml_statistics.py

4-4: typing.Dict imported but unused

Remove unused import: typing.Dict

(F401)

🔇 Additional comments (4)

quantmsutils/mzml/mzml_statistics.py (4)

68-69: 🛠️ Refactor suggestion

Optimize array handling to avoid unnecessary conversions

Converting numpy arrays to Python lists creates unnecessary copies of large arrays. Keep the data in numpy arrays throughout the process.

-                        "mz": mz_array.tolist(),
-                        "intensity": intensity_array.tolist()
+                        "mz": mz_array,
+                        "intensity": intensity_array

31-31: ⚠️ Potential issue

Fix incorrect regular expression pattern

The current pattern [scan|spectrum]=(\d+) uses square brackets which creates a character class matching any single character in "scan|spectrum". This will not correctly match the intended patterns.

-        self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+        self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")

57-61: ⚠️ Potential issue

Add null check for precursors

Accessing spectrum.getPrecursors()[0] without checking if precursors exist could raise an IndexError.

         if ms_level == 2:
+            precursors = spectrum.getPrecursors()
+            if not precursors:
+                return
-            precursor = spectrum.getPrecursors()[0]
+            precursor = precursors[0]
             charge_state = precursor.getCharge()
             exp_mz = precursor.getMZ()

261-266: 🛠️ Refactor suggestion

Improve resource management for parquet writer

The parquet writer should be initialized within a try block to ensure proper cleanup in case of exceptions.

-            parquet_writer = pq.ParquetWriter(
-                output_path,
-                schema=schema,
-                compression='gzip'
-            )
+            parquet_writer = None
+            try:
+                parquet_writer = pq.ParquetWriter(
+                    output_path,
+                    schema=schema,
+                    compression='gzip'
+                )

quantmsutils/mzml/mzml_statistics.py

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ddda856 and 5999bfb.

📒 Files selected for processing (1)

quantmsutils/mzml/mzml_statistics.py (1 hunks)

🔇 Additional comments (5)

quantmsutils/mzml/mzml_statistics.py (5)

31-31: ⚠️ Potential issue

Fix incorrect regular expression pattern

The pattern [scan|spectrum]=(\d+) is incorrect as it creates a character class matching any single character in scan|pectrum, not the words "scan" or "spectrum".

Apply this fix:

-        self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+        self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")

68-69: 🛠️ Refactor suggestion

Optimize array handling to avoid unnecessary copies

Converting numpy arrays to Python lists creates unnecessary copies of large arrays.

Keep the data in numpy arrays:

-                        "mz": mz_array.tolist(),
-                        "intensity": intensity_array.tolist()
+                        "mz": mz_array,
+                        "intensity": intensity_array

129-129: ⚠️ Potential issue

Fix incorrect parquet writer usage

Using self.parquet_writer instead of self.id_parquet_writer for spectrum data is incorrect.

-                self.parquet_writer.write_table(spectrum_table)
+                self.id_parquet_writer.write_table(spectrum_table)

261-266: 🛠️ Refactor suggestion

Improve resource management for parquet writer

The parquet writer should be initialized within a try block to ensure proper cleanup in case of exceptions.

-            parquet_writer = pq.ParquetWriter(
-                output_path,
-                schema=schema,
-                compression='gzip'
-            )
+            parquet_writer = None
+            try:
+                parquet_writer = pq.ParquetWriter(
+                    output_path,
+                    schema=schema,
+                    compression='gzip'
+                )
 
                 # ... existing code ...
 
             finally:
-                parquet_writer.close()
+                if parquet_writer is not None:
+                    parquet_writer.close()

57-61: ⚠️ Potential issue

Add validation for precursor existence

Accessing spectrum.getPrecursors()[0] without checking if precursors exist could raise an IndexError.

Add validation:

         if ms_level == 2:
+            precursors = spectrum.getPrecursors()
+            if not precursors:
+                return
-            precursor = spectrum.getPrecursors()[0]
+            precursor = precursors[0]
             charge_state = precursor.getCharge()
             exp_mz = precursor.getMZ()

# Conflicts: # quantmsutils/mzml/mzml_statistics.py

coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 5999bfb and 3e4613c.

📒 Files selected for processing (1)

quantmsutils/mzml/mzml_statistics.py (1 hunks)

🔇 Additional comments (5)

quantmsutils/mzml/mzml_statistics.py (5)

68-69: 🛠️ Refactor suggestion

Optimize array handling to avoid unnecessary conversions

Converting numpy arrays to Python lists creates unnecessary copies of potentially large arrays. Consider keeping the data in numpy arrays.

-                        "mz": mz_array.tolist(),
-                        "intensity": intensity_array.tolist()
+                        "mz": mz_array,
+                        "intensity": intensity_array

31-31: ⚠️ Potential issue

Fix the regular expression pattern for scan/spectrum matching

The current pattern r"[scan|spectrum]=(\d+)" uses square brackets which create a character class matching any single character in "scan|spectrum". This will not correctly match the intended patterns.

-        self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+        self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")

261-266: 🛠️ Refactor suggestion

Improve resource management for parquet writer

The parquet writer should be initialized within a context manager or try block to ensure proper cleanup in case of exceptions.

-            parquet_writer = pq.ParquetWriter(
-                output_path,
-                schema=schema,
-                compression='gzip'
-            )
+            parquet_writer = None
+            try:
+                parquet_writer = pq.ParquetWriter(
+                    output_path,
+                    schema=schema,
+                    compression='gzip'
+                )
+            except Exception as e:
+                print(f"Error initializing parquet writer: {e}")
+                raise

57-61: ⚠️ Potential issue

Add precursor check to prevent IndexError

The code directly accesses the first precursor without checking if any precursors exist, which could raise an IndexError.

         if ms_level == 2:
+            precursors = spectrum.getPrecursors()
+            if not precursors:
+                return
-            precursor = spectrum.getPrecursors()[0]
+            precursor = precursors[0]
             charge_state = precursor.getCharge()
             exp_mz = precursor.getMZ()

136-152: 🛠️ Refactor suggestion

Improve error handling in finalize method

The finalize method should ensure proper cleanup of resources even if an exception occurs.

     def finalize(self):
+        try:
             if self.batch_data:
                 self._write_batch()

             if self.id_only and self.psm_parts:
                 self._write_batch()

+        except Exception as e:
+            print(f"Error during finalization: {e}")
+            raise
+        finally:
             if self.parquet_writer:
                 self.parquet_writer.close()
+                self.parquet_writer = None

             if self.id_parquet_writer:
                 self.id_parquet_writer.close()
+                self.id_parquet_writer = None

quantmsutils/mzml/mzml_statistics.py

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)

quantmsutils/mzml/mzml_statistics.py (1)

293-294: Improve SQL query construction

The current approach to sanitizing column names could be more robust.

-            safe_columns = [col for col in base_columns if col.replace(" ", "").isalnum()] # Remove spaces
-            query = f"""SELECT {', '.join(safe_columns)} FROM frames """
+            # Whitelist of allowed SQL expressions
+            allowed_expressions = {
+                "CASE WHEN MsMsType IN (8, 9) THEN 2 WHEN MsMsType = 0 THEN 1 ELSE NULL END as MsMsType",
+                "Id", "NumPeaks", "MaxIntensity", "SummedIntensities", "Time", "Charge", "MonoisotopicMz"
+            }
+            safe_columns = [col for col in base_columns if col in allowed_expressions]
+            query = f"""SELECT {', '.join(safe_columns)} FROM frames"""

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between dbfc89d and fe1739e.

📒 Files selected for processing (1)

quantmsutils/mzml/mzml_statistics.py (1 hunks)

🔇 Additional comments (5)

quantmsutils/mzml/mzml_statistics.py (5)

76-77: 🛠️ Refactor suggestion

Optimize array handling to avoid unnecessary conversions

Converting numpy arrays to Python lists creates unnecessary copies of large arrays.

-                            "mz": mz_array.tolist(),
-                            "intensity": intensity_array.tolist(),
+                            "mz": mz_array,
+                            "intensity": intensity_array,

38-38: ⚠️ Potential issue

Fix incorrect regular expression pattern

The current regex pattern r"[scan|spectrum]=(\d+)" uses square brackets [] which defines a character class matching any single character in the set. This will not correctly match "scan=" or "spectrum=".

-        self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+        self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")

238-238: ⚠️ Potential issue

Fix missing parameter in BatchWritingConsumer constructor

The constructor call is missing the required id_parquet_schema parameter.

-        consumer = BatchWritingConsumer(parquet_schema, output_path, batch_size, id_only)
+        consumer = BatchWritingConsumer(parquet_schema, id_parquet_schema, output_path, batch_size, id_only)

64-68: ⚠️ Potential issue

Add check for empty precursors list

Accessing spectrum.getPrecursors()[0] without checking if the list is non-empty can raise an IndexError.

         if ms_level == 2:
+            precursors = spectrum.getPrecursors()
+            if not precursors:
+                charge_state = None
+                exp_mz = None
+            else:
-            precursor = spectrum.getPrecursors()[0]
+                precursor = precursors[0]
-            charge_state = precursor.getCharge()
-            exp_mz = precursor.getMZ()
+                charge_state = precursor.getCharge()
+                exp_mz = precursor.getMZ()

153-169: 🛠️ Refactor suggestion

Improve resource management in finalize method

The finalize method should ensure proper cleanup of resources even if an exception occurs.

     def finalize(self):
+        try:
             if self.batch_data:
                 self._write_batch()

             if self.id_only and self.psm_parts:
                 self._write_batch()
+        except Exception as e:
+            print(f"Error during finalization: {e}")
+            raise
+        finally:
             if self.parquet_writer:
-                self.parquet_writer.close()
+                try:
+                    self.parquet_writer.close()
+                finally:
+                    self.parquet_writer = None

             if self.id_parquet_writer:
-                self.id_parquet_writer.close()
+                try:
+                    self.id_parquet_writer.close()
+                finally:
+                    self.id_parquet_writer = None

jpfeuffer

See my remaining change request

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

quantmsutils/mzml/mzml_statistics.py (2)

299-302: Strengthen SQL column name validation

While the current validation removes spaces, it could be more robust by using a stricter pattern for valid SQL column names.

             safe_columns = [
-                col for col in base_columns if col.replace(" ", "").isalnum()
+                col for col in base_columns 
+                if all(part.isalnum() or part in ['_', ' '] for part in col)
+                and not col.startswith(('_', ' '))
             ]

348-349: Ensure case-insensitive file extension handling

The extension comparison could be made more robust by normalizing extensions to lowercase before comparison.

-    valid_extensions = {".d", ".mzml", ".mzML"}
-    candidates = [str(c.resolve()) for c in candidates if c.suffix.lower() in valid_extensions]
+    valid_extensions = {ext.lower() for ext in {".d", ".mzml"}}
+    candidates = [str(c.resolve()) for c in candidates if c.suffix.lower() in valid_extensions]

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between fe1739e and 2ca002c.

📒 Files selected for processing (1)

quantmsutils/mzml/mzml_statistics.py (1 hunks)

🔇 Additional comments (4)

quantmsutils/mzml/mzml_statistics.py (4)

76-77: 🛠️ Refactor suggestion

Optimize array handling to avoid unnecessary conversions

Converting numpy arrays to Python lists creates unnecessary copies of large arrays, impacting performance.

-                            "mz": mz_array.tolist(),
-                            "intensity": intensity_array.tolist(),
+                            "mz": mz_array,
+                            "intensity": intensity_array,

38-38: ⚠️ Potential issue

Fix incorrect regular expression pattern

The current pattern r"[scan|spectrum]=(\d+)" is incorrect. Square brackets [] define a character class that matches any single character within the brackets, not alternation between words. This will fail to correctly extract scan IDs.

-        self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+        self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")

64-68: ⚠️ Potential issue

Add precursor existence check to prevent IndexError

Accessing spectrum.getPrecursors()[0] without checking if precursors exist can raise an IndexError.

         if ms_level == 2:
-            precursor = spectrum.getPrecursors()[0]
-            charge_state = precursor.getCharge()
-            exp_mz = precursor.getMZ()
+            precursors = spectrum.getPrecursors()
+            if precursors:
+                precursor = precursors[0]
+                charge_state = precursor.getCharge()
+                exp_mz = precursor.getMZ()
+            else:
+                charge_state = None
+                exp_mz = None

153-169: 🛠️ Refactor suggestion

Improve resource management in finalize method

The finalize method should ensure proper cleanup of resources even if an exception occurs.

     def finalize(self):
+        try:
             if self.batch_data:
                 self._write_batch()

             if self.id_only and self.psm_parts:
                 self._write_batch()
+        finally:
             if self.parquet_writer:
                 self.parquet_writer.close()
+                self.parquet_writer = None

             if self.id_parquet_writer:
                 self.id_parquet_writer.close()
+                self.id_parquet_writer = None

quantmsutils/mzml/mzml_statistics.py

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 2b5cc7f and 4d3351f.

📒 Files selected for processing (1)

quantmsutils/mzml/mzml_statistics.py (1 hunks)

🔇 Additional comments (2)

quantmsutils/mzml/mzml_statistics.py (2)

38-38: ⚠️ Potential issue

Fix incorrect regular expression pattern

The pattern r"[scan|spectrum]=(\d+)" is incorrect as square brackets [] define a character class that matches any single character within them. This will match any of the characters 's', 'c', 'a', 'n', '|', 'p', 'e', 't', 'r', 'u', 'm' followed by '='.

-        self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+        self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")

64-68: ⚠️ Potential issue

Add precursor existence check to prevent IndexError

Direct access to spectrum.getPrecursors()[0] can raise an IndexError if no precursors are available.

         if ms_level == 2:
-            precursor = spectrum.getPrecursors()[0]
-            charge_state = precursor.getCharge()
-            exp_mz = precursor.getMZ()
+            precursors = spectrum.getPrecursors()
+            if precursors:
+                precursor = precursors[0]
+                charge_state = precursor.getCharge()
+                exp_mz = precursor.getMZ()
+            else:
+                charge_state = None
+                exp_mz = None

quantmsutils/mzml/mzml_statistics.py

ypriverol and others added 8 commits November 27, 2024 09:15

improvements in speed in mzml stats

1e141c3

Fixed ms2rescore bugs

58fc54b

Update ms2rescore.py

8d6bc7d

Update ms2rescore.py

ad2ea80

Merge pull request #36 from daichengxin/dev

6d285fc

Fixed ms2rescore bugs

Merge branch 'main' of https://github.com/bigbio/pyquantms into dev

ab91de8

Merge remote-tracking branch 'origin/dev' into dev

90ed8aa

improvements in speed in mzml stats

8d3e0ad

coderabbitai bot reviewed Nov 29, 2024

View reviewed changes

jpfeuffer reviewed Nov 29, 2024

View reviewed changes

quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved

improve to streaming.

5154daf

coderabbitai bot reviewed Nov 30, 2024

View reviewed changes

improve to streaming.

db2f3b9

coderabbitai bot reviewed Nov 30, 2024

View reviewed changes

quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved

Update quantmsutils/mzml/mzml_statistics.py

9f9c82f

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

coderabbitai bot reviewed Nov 30, 2024

View reviewed changes

quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved

quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved

jpfeuffer requested changes Nov 30, 2024

View reviewed changes

quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved

improvements parsing .d

d64503d

coderabbitai bot reviewed Nov 30, 2024

View reviewed changes

quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved

quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved

ypriverol added 2 commits November 30, 2024 13:38

id_only move to batch writing.

04ba2a9

id_only move to batch writing.

ddda856

coderabbitai bot reviewed Nov 30, 2024

View reviewed changes

quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved

quantmsutils/mzml/mzml_statistics.py Show resolved Hide resolved

id_only move to batch writing.

5999bfb

coderabbitai bot reviewed Nov 30, 2024

View reviewed changes

ypriverol added 2 commits November 30, 2024 14:22

improvements in speed in mzml stats

5264eb8

Merge remote-tracking branch 'origin/dev' into dev

3e4613c

# Conflicts: # quantmsutils/mzml/mzml_statistics.py

ypriverol requested a review from jpfeuffer November 30, 2024 14:25

coderabbitai bot reviewed Nov 30, 2024

View reviewed changes

quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved

quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved

black applied.

dbfc89d

black applied.

fe1739e

coderabbitai bot reviewed Nov 30, 2024

View reviewed changes

black applied.

2ca002c

jpfeuffer requested changes Nov 30, 2024

View reviewed changes

coderabbitai bot reviewed Nov 30, 2024

View reviewed changes

black applied.

ce697c7

jpfeuffer reviewed Nov 30, 2024

View reviewed changes

quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved

jpfeuffer reviewed Nov 30, 2024

View reviewed changes

quantmsutils/mzml/mzml_statistics.py Outdated Show resolved Hide resolved

Apply suggestions from code review

2b5cc7f

jpfeuffer self-requested a review November 30, 2024 16:43

jpfeuffer approved these changes Nov 30, 2024

View reviewed changes

ypriverol added 2 commits November 30, 2024 19:49

black applied.

71784a6

black applied.

4d3351f

coderabbitai bot reviewed Nov 30, 2024

View reviewed changes

quantmsutils/mzml/mzml_statistics.py Show resolved Hide resolved

ypriverol merged commit 5620ead into main Nov 30, 2024
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

major improvements of mzml statistics #37

major improvements of mzml statistics #37

ypriverol commented Nov 29, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 29, 2024 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

coderabbitai bot Nov 29, 2024

coderabbitai bot Nov 29, 2024

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

jpfeuffer left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

		accession.decode() for accession in peptide_hit.extractProteinAccessionsSet()
		],

major improvements of mzml statistics #37

major improvements of mzml statistics #37

Conversation

ypriverol commented Nov 29, 2024 • edited by coderabbitai bot Loading

Summary by CodeRabbit

Release Notes

coderabbitai bot commented Nov 29, 2024 • edited Loading

Walkthrough

Changes

Possibly related PRs

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Nov 29, 2024

Choose a reason for hiding this comment

coderabbitai bot Nov 29, 2024

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

jpfeuffer left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

ypriverol commented Nov 29, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 29, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)