Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(core-clp): Add LZMA Compressor implementation and LZMA dependency. #614

Merged
merged 68 commits into from
Dec 23, 2024

Conversation

Bill-hbrhbr
Copy link
Contributor

@Bill-hbrhbr Bill-hbrhbr commented Nov 27, 2024

Description

  1. Port the working version of LZMA compression code into open-source
  2. Refactor the compressor-related files as a continuation of refactor(core): Fix clang-tidy warnings in the streaming compressor interface and its implementations; Modernize and refactor test-StreamingCompression for conciseness. #599
  3. Add scripts to install LZMA dependencies for all builds

Validation performed

Updated test-StreamingCompression.cpp to include LZMA compressors. Tested that compressing workflow finishes without errors.

Summary by CodeRabbit

  • New Features

    • Introduced LZMA compression support, expanding available compression options.
    • Added new constants and enumerators related to LZMA compression.
    • Enhanced testing suite to support LZMA compression functionality.
  • Bug Fixes

    • Improved error handling and validation in the LZMA compressor implementation.
  • Chores

    • Updated installation scripts to include the LZMA library and related dependencies across various platforms.
    • Updated installation scripts to install the latest version of the Boost library.
    • Added additional packages to installation scripts for various platforms.

Copy link
Contributor

coderabbitai bot commented Nov 27, 2024

Walkthrough

This pull request introduces comprehensive support for LZMA compression in the CLP project. The changes span multiple files across the project, including configuration files, CMake scripts, source code, and installation scripts. The implementation adds LZMA as a new compression option, with modifications to the build system, compression interfaces, constants, and test suites. The changes also update library installation scripts for various operating systems to ensure proper setup and dependency management.

Changes

File Change Summary
components/core/.clang-format Added lzma to library header regex pattern
components/core/CMakeLists.txt Added LZMA compression support, library finding, and build configuration
components/core/src/clp/streaming_compression/Constants.hpp Added LZMA enumerator to CompressorType
components/core/src/clp/streaming_compression/lzma/Compressor.cpp Implemented LZMA compression logic and error handling
components/core/src/clp/streaming_compression/lzma/Compressor.hpp Defined LZMA compressor class and interfaces
components/core/src/clp/streaming_compression/lzma/Constants.hpp Defined LZMA-specific compression constants
components/core/tests/test-StreamingCompression.cpp Added LZMA compression test section
components/core/tools/scripts/lib_install/liblzma.sh Added script for installing liblzma library
components/core/tools/scripts/lib_install/ubuntu-focal/install-packages-from-source.sh Updated boost version and added liblzma installation command
components/core/tools/scripts/lib_install/ubuntu-focal/install-prebuilt-packages.sh Added installation of jq and liblzma-dev packages
components/core/tools/scripts/lib_install/ubuntu-jammy/install-packages-from-source.sh Added commands for installing boost and liblzma
components/core/tools/scripts/lib_install/ubuntu-jammy/install-prebuilt-packages.sh Replaced boost packages with jq and liblzma-dev
components/core/tools/scripts/lib_install/centos-stream-9/install-prebuilt-packages.sh Added diffutils and jq packages
components/core/tools/scripts/lib_install/macos/install-all.sh Added xz package to installation list

Possibly Related PRs

Suggested Reviewers

  • gibber9809
  • kirkrodrigues

Tip

CodeRabbit's docstrings feature is now available as part of our Early Access Program! Simply use the command @coderabbitai generate docstrings to have CodeRabbit automatically generate docstrings for your pull request. We would love to hear your feedback on Discord.


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@Bill-hbrhbr
Copy link
Contributor Author

I guess two comments:

  1. Let me know if you want me to remove the decompressor files. I did a direct copy-paste so that I can include unittest for LZMA in this PR.
  2. I added lib install scripts for Ubuntu machines. Do I need to also look into CentOS and MacOS?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 14

🧹 Outside diff range and nitpick comments (18)
components/core/tools/scripts/lib_install/ubuntu-jammy/install-prebuilt-packages.sh (1)

23-23: Consider pinning the package version for reproducibility.

To ensure consistent builds across different environments and times, consider pinning the package version.

-  liblzma-dev \
+  liblzma-dev=5.2.5-2ubuntu1 \
components/core/tools/scripts/lib_install/ubuntu-jammy/install-packages-from-source.sh (2)

Line range hint 1-20: Consider adding version compatibility comments

The script structure is good, but it would be helpful to document any version interdependencies between the libraries.

Add a comment block at the top of the file:

 #!/usr/bin/env bash

+# Library versions and compatibility notes:
+# - liblzma 5.4.6: Compatible with libarchive 3.5.1
+
 # Exit on any error
 set -e

14-14: Consider error handling for LZMA installation

While the script has basic error handling with set -e, consider adding specific error handling for the LZMA installation.

-"$lib_install_scripts_dir"/liblzma.sh 5.4.6
+if ! "$lib_install_scripts_dir"/liblzma.sh 5.4.6; then
+    echo "Error: LZMA installation failed. Please check system requirements."
+    exit 1
+fi
components/core/tools/scripts/lib_install/ubuntu-focal/install-packages-from-source.sh (1)

17-17: Consider installation order implications

The LZMA library is installed after libarchive. If libarchive is built with LZMA support, it might need to be rebuilt after LZMA installation.

Consider either:

  1. Moving the LZMA installation before libarchive
  2. Adding a rebuild step for libarchive after LZMA installation if necessary
components/core/tools/scripts/lib_install/liblzma.sh (1)

1-66: Consider adding logging, dry-run mode, and uninstall capability

To improve the script's maintainability and user experience, consider these architectural improvements:

  1. Add logging functionality with different verbosity levels
  2. Implement a dry-run mode for testing
  3. Add uninstall capability
  4. Consider adding a progress indicator for long-running operations

These improvements would make the script more robust and user-friendly for both development and production use.

Would you like me to provide an implementation for any of these features?

components/core/tests/test-StreamingCompression.cpp (2)

60-64: Consider enhancing test coverage for LZMA-specific scenarios.

While the basic compression/decompression test is good, consider adding LZMA-specific test cases to validate:

  • Different compression levels
  • Dictionary size variations
  • Error handling for corrupted LZMA streams

Would you like me to provide example test cases for these scenarios?


Line range hint 1-113: Consider refactoring to use parameterized tests.

The test structure is duplicated across compression types. Consider using Catch2's test case templates to reduce duplication and make it easier to add new compression methods.

Here's a suggested approach:

TEMPLATE_TEST_CASE("StreamingCompression", "[StreamingCompression]",
    std::pair<clp::streaming_compression::zstd::Compressor, clp::streaming_compression::zstd::Decompressor>,
    std::pair<clp::streaming_compression::passthrough::Compressor, clp::streaming_compression::passthrough::Decompressor>,
    std::pair<clp::streaming_compression::lzma::Compressor, clp::streaming_compression::lzma::Decompressor>
) {
    auto compressor = std::make_unique<typename TestType::first_type>();
    auto decompressor = std::make_unique<typename TestType::second_type>();
    // ... rest of the test ...
}
components/core/CMakeLists.txt (4)

21-23: Standardize the status message format

The status message format differs from other compressors. Consider using a consistent format:

-    message(STATUS "Using Lempel–Ziv–Markov chain Algorithm compression")
+    message(STATUS "Using LZMA compression")

241-241: Use STATUS for LZMA include directory message

The include directory message should use the STATUS level for consistency with other similar messages:

-message("LZMA Include Dir: ${LIBLZMA_INCLUDE_DIRS}")
+message(STATUS "LZMA Include Dir: ${LIBLZMA_INCLUDE_DIRS}")

231-232: Address the static/shared library support TODO

The TODO comment indicates missing functionality for handling static vs. shared libraries.

Would you like me to help create a CMake module to handle static/shared library selection for LZMA? This would align with the project's existing pattern for other libraries.


573-573: Consider grouping compression-related libraries together

For better organization, consider grouping LZMA with other compression-related libraries:

-        ${LIBLZMA_LIBRARIES}
-        ZStd::ZStd
+        ZStd::ZStd
+        ${LIBLZMA_LIBRARIES}
components/core/src/clp/streaming_compression/lzma/Compressor.hpp (1)

120-120: Use type alias LzmaAction for consistency.

In the compress method, consider using the type alias LzmaAction instead of lzma_action to maintain consistency with the defined type aliases and improve readability.

Apply this diff for consistency:

-auto compress(lzma_action action) -> void;
+auto compress(LzmaAction action) -> void;
components/core/src/clp/streaming_compression/lzma/Decompressor.hpp (3)

9-9: Remove unnecessary inclusion of <zlib.h>.

The <zlib.h> header included at line 9 may be unnecessary since this decompressor uses LZMA compression. Eliminating unused headers can reduce dependencies and improve compile times.

Apply this diff to remove the unnecessary include:

-#include <zlib.h>

61-62: Complete or remove the incomplete Doxygen comment.

There is an incomplete Doxygen comment starting at line 61 without any description. To maintain clear documentation, please complete the comment or remove it if it's not needed.

Apply this diff to complete the comment:

 /**
+ * Reads an exact number of bytes from the decompressor into the buffer.
+ * @param buf The buffer to read data into.
+ * @param num_bytes_to_read The exact number of bytes to read.
+ * @param num_bytes_read The actual number of bytes read.
 */
 void exact_read(char* buf, size_t num_bytes_to_read, size_t& num_bytes_read);

89-90: Correct the reference to the appropriate method in the comment.

The comment at line 89 incorrectly references streaming_compression::zstd::Decompressor::try_seek_from_begin, which is not relevant to the LZMA decompressor. It should refer to the appropriate LZMA method or be updated to accurately reflect the functionality.

Apply this diff to correct the comment:

- * @return Same as streaming_compression::zstd::Decompressor::try_seek_from_begin
+ * @return Same as Decompressor::try_seek_from_begin
components/core/src/clp/streaming_compression/lzma/Compressor.cpp (1)

86-86: Use LZMA_STREAM_INIT for proper initialization of m_compression_stream

Instead of using memset to initialize m_compression_stream, it's recommended to use the LZMA_STREAM_INIT macro provided by the LZMA library. This ensures all fields are correctly initialized, including any internal pointers, and is the preferred method according to the LZMA documentation.

Apply this diff to initialize m_compression_stream correctly:

-    memset(m_compression_stream.get(), 0, sizeof(LzmaStream));
+    *m_compression_stream = LZMA_STREAM_INIT;
components/core/src/clp/streaming_compression/lzma/Decompressor.cpp (2)

113-113: Correct error messages to reference LZMA functions

The error messages incorrectly mention inflate(), which is associated with zlib. Since we're using LZMA, please update the messages to reference lzma_code() to accurately reflect the context.

Apply this diff to fix the error messages:

-                    SPDLOG_ERROR("streaming_compression::lzma::Decompressor inflate() ran out of memory");
+                    SPDLOG_ERROR("streaming_compression::lzma::Decompressor lzma_code() ran out of memory");

...

-                    SPDLOG_ERROR("inflate() returned an unexpected value - {}.", int(return_value));
+                    SPDLOG_ERROR("lzma_code() returned an unexpected value - {}.", int(return_value));

Also applies to: 117-117


130-130: Update comment to reflect LZMA instead of ZStd

The comment references 'ZStd', but this is an LZMA decompressor. Please correct the comment to avoid confusion.

Apply this diff to fix the comment:

-            // ZStd has no way for us to seek back to the desired position, so just reset the stream
+            // LZMA has no way for us to seek back to the desired position, so just reset the stream
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 7aea626 and 707c412.

📒 Files selected for processing (14)
  • components/core/.clang-format (1 hunks)
  • components/core/CMakeLists.txt (4 hunks)
  • components/core/src/clp/streaming_compression/Constants.hpp (1 hunks)
  • components/core/src/clp/streaming_compression/lzma/Compressor.cpp (1 hunks)
  • components/core/src/clp/streaming_compression/lzma/Compressor.hpp (1 hunks)
  • components/core/src/clp/streaming_compression/lzma/Constants.hpp (1 hunks)
  • components/core/src/clp/streaming_compression/lzma/Decompressor.cpp (1 hunks)
  • components/core/src/clp/streaming_compression/lzma/Decompressor.hpp (1 hunks)
  • components/core/tests/test-StreamingCompression.cpp (2 hunks)
  • components/core/tools/scripts/lib_install/liblzma.sh (1 hunks)
  • components/core/tools/scripts/lib_install/ubuntu-focal/install-packages-from-source.sh (1 hunks)
  • components/core/tools/scripts/lib_install/ubuntu-focal/install-prebuilt-packages.sh (1 hunks)
  • components/core/tools/scripts/lib_install/ubuntu-jammy/install-packages-from-source.sh (1 hunks)
  • components/core/tools/scripts/lib_install/ubuntu-jammy/install-prebuilt-packages.sh (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • components/core/.clang-format
  • components/core/src/clp/streaming_compression/lzma/Constants.hpp
🧰 Additional context used
🪛 cppcheck (2.10-2)
components/core/src/clp/streaming_compression/lzma/Decompressor.cpp

[error] 41-41: Rethrowing current exception with 'throw;', it seems there is no current exception to rethrow. If there is no current exception this calls std

(rethrowNoCurrentException)


[error] 45-45: Rethrowing current exception with 'throw;', it seems there is no current exception to rethrow. If there is no current exception this calls std

(rethrowNoCurrentException)

🔇 Additional comments (11)
components/core/src/clp/streaming_compression/Constants.hpp (2)

10-10: Well-structured enum value addition!

The chosen value (0x20) follows good practices:

  • Maintains consistent power-of-2 spacing with existing values
  • Leaves room for future compression types between values
  • Follows the established naming convention

10-10: Verify enum value handling across the codebase

Let's ensure the new LZMA enum value is properly handled in all relevant code paths.

✅ Verification successful

Let me gather more information about the CompressorType enum usage and switch statements.


Based on the comprehensive analysis of the codebase, I can now provide a final response:

LZMA enum value is properly integrated and handled

The new LZMA enum value (0x20) is correctly integrated into the codebase with proper handling:

  • The CompressorType enum is consistently defined across namespaces (clp::streaming_compression)
  • LZMA implementation follows the same pattern as existing compressors (ZSTD, Passthrough)
  • No switch statements directly use CompressorType enum, as the type is handled through class inheritance
  • Proper initialization of LZMA compressor/decompressor classes using CompressorType::LZMA
  • Test cases include LZMA compression alongside other compression types
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Find potential switch statements and enum handling code that might need updates

# Search for switch statements handling CompressorType
echo "Checking for switch statements handling CompressorType..."
rg -l "switch.*CompressorType" 

# Search for serialization/deserialization code
echo "Checking for serialization/deserialization code..."
rg -l "serialize.*CompressorType|deserialize.*CompressorType"

# Look for direct CompressorType comparisons
echo "Checking for CompressorType comparisons..."
rg -l "CompressorType::(ZSTD|Passthrough)"

Length of output: 989


Script:

#!/bin/bash
# Search for CompressorType enum usage with context
echo "Checking CompressorType enum usage with context..."
rg -A 5 "CompressorType" 

# Search for switch statements in compression-related files
echo "Checking switch statements in compression files..."
fd -e cpp -e hpp | grep -i "compress" | xargs rg "switch.*\{" -A 5

# Look for factory or creation methods that might use CompressorType
echo "Checking factory methods..."
rg "create.*Compressor|make.*Compressor" -A 5

Length of output: 26592

components/core/tools/scripts/lib_install/ubuntu-jammy/install-prebuilt-packages.sh (1)

23-23: LGTM! Verify package availability in Ubuntu Jammy.

The addition of liblzma-dev is correct and properly placed in alphabetical order.

✅ Verification successful

Package addition is properly integrated and verified

The liblzma-dev package addition is well-integrated into the existing LZMA compression workflow:

  • Package exists in Ubuntu Jammy repositories
  • Consistent with Ubuntu Focal configuration
  • Properly supports the LZMA compression option in CMakeLists.txt
  • Aligns with existing LZMA build infrastructure (liblzma.sh)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify liblzma-dev package availability and its dependencies in Ubuntu Jammy
# Expected: Package should be available and not have conflicting dependencies

# Check if the package exists in Ubuntu Jammy repositories
curl -s "https://packages.ubuntu.com/jammy/liblzma-dev" | grep -q "Page not found" && echo "Package not found" || echo "Package exists"

# Get package details including version and dependencies
curl -s "https://packages.ubuntu.com/jammy/liblzma-dev" | grep -A 5 "Package:"

Length of output: 300


Script:

#!/bin/bash
# Let's verify the package dependencies and version availability more thoroughly
apt-cache policy liblzma-dev 2>/dev/null || echo "Package info not available in sandbox"

# Check if the package is already used elsewhere in the codebase
rg -l "liblzma-dev" --type sh

# Look for any LZMA related build dependencies or configurations
rg -l "lzma" --type cmake --type make --type sh

Length of output: 787


Script:

#!/bin/bash
# Let's check the CMakeLists.txt to understand LZMA integration
cat components/core/CMakeLists.txt | grep -A 5 -B 5 "lzma"

# Check the content of liblzma.sh to understand existing LZMA setup
cat components/core/tools/scripts/lib_install/liblzma.sh

# Check if the package is also used in Focal
cat components/core/tools/scripts/lib_install/ubuntu-focal/install-prebuilt-packages.sh | grep "liblzma-dev"

Length of output: 3455

components/core/tools/scripts/lib_install/ubuntu-jammy/install-packages-from-source.sh (1)

14-14: Verify LZMA installation script and dependencies

The addition of LZMA library installation looks good, but let's verify a few important aspects:

✅ Verification successful

Let me gather more information about the liblzma.sh script content and its error handling:


LZMA installation script verified and properly configured

The installation script for LZMA 5.4.6 has been thoroughly verified and shows:

  • Proper error handling with set -e and set -u
  • Secure download from the official tukaani-project repository
  • Appropriate privilege management
  • CMake configuration with position-independent code
  • Parallel build support
  • Clean installation process

The script follows the project's established patterns and integrates well with the existing build system. The CMake configuration properly requires and validates LZMA presence.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify LZMA installation script and its dependencies

# Check if liblzma.sh exists and is executable
echo "Checking liblzma.sh existence and permissions..."
fd --type f "liblzma.sh" --exec ls -l {}

# Look for version constraints or dependencies in build files
echo -e "\nChecking for version constraints..."
rg -g "CMakeLists.txt" --multiline "find_package\(.*LZMA.*\)" -A 5

Length of output: 872


Script:

#!/bin/bash
# Check the content of liblzma.sh for error handling and version validation
echo "Checking liblzma.sh content..."
cat ./components/core/tools/scripts/lib_install/liblzma.sh

# Check if there are any version-specific dependencies or constraints in the codebase
echo -e "\nChecking for LZMA version references..."
rg "LZMA.*5\.4" --type cpp --type cmake

Length of output: 1777

components/core/tools/scripts/lib_install/ubuntu-focal/install-packages-from-source.sh (1)

17-17: Verify LZMA version compatibility and security

Please ensure that LZMA version 5.4.6 meets the project's requirements and has no known vulnerabilities.

✅ Verification successful

Let me gather more information about LZMA usage and dependencies in the project.


LZMA 5.4.6 is compatible with the project requirements

Based on the codebase analysis:

  • LZMA is used as an optional general-purpose compressor in the project
  • The project uses CMake's find_package(LibLZMA REQUIRED) without version constraints
  • No known vulnerabilities were found for LZMA 5.4.6
  • The installation script location appears to be incorrect as liblzma.sh was not found at the expected path

The version 5.4.6 is suitable for use, but please ensure the installation script is properly placed at components/core/tools/scripts/lib_install/ubuntu-focal/liblzma.sh before proceeding.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for security advisories and compatibility issues with LZMA 5.4.6

# Check for CVEs in the National Vulnerability Database
curl -s "https://services.nvd.nist.gov/rest/json/cves/2.0?keywordSearch=liblzma%205.4.6" | jq '.vulnerabilities[].cve.descriptions[].value'

# Check if any other components in the project specify LZMA version requirements
rg -i "lzma.*version|version.*lzma" --type cmake --type=txt

Length of output: 302


Script:

#!/bin/bash
# Check for LZMA dependencies and usage
rg -i "lzma" --type cmake -A 3 -B 3

# Check the liblzma installation script
cat components/core/tools/scripts/lib_install/ubuntu-focal/liblzma.sh

# Check if there are any specific LZMA version requirements in package configs
fd -e pc -e cmake -x cat {} \; | rg -i "lzma"

Length of output: 4152

components/core/tools/scripts/lib_install/ubuntu-focal/install-prebuilt-packages.sh (2)

23-23: LGTM! The LZMA development package is correctly added.

The addition of liblzma-dev is properly positioned in alphabetical order and aligns with the script's existing package installation pattern.


23-23: Consider documenting Ubuntu version compatibility.

Since this script is specifically for Ubuntu Focal (20.04), we should verify and document LZMA package compatibility for other Ubuntu versions that might be supported by the project.

components/core/tests/test-StreamingCompression.cpp (1)

18-19: LGTM! Headers are properly organized.

The LZMA compression headers follow the established pattern and are correctly placed with other compression-related includes.

components/core/src/clp/streaming_compression/lzma/Decompressor.hpp (2)

69-69: Verify the correctness of the method reference in the comment.

The comment at line 69 mentions ReaderInterface::try_read_exact_length, but it's unclear if this is the correct method reference. Please ensure that this comment accurately describes the return behaviour and references the appropriate method.


146-146: ⚠️ Potential issue

Ensure proper initialisation and cleanup of m_decompression_stream.

The member variable m_decompression_stream is a raw pointer initialised to nullptr. To prevent potential memory leaks, ensure that it is properly allocated during initialisation and that all resources are released in the destructor.

Would you like assistance in reviewing the resource management for m_decompression_stream?

components/core/src/clp/streaming_compression/lzma/Compressor.cpp (1)

20-75: Initialization of the LZMA encoder is correctly implemented

The init_lzma_encoder function properly initializes the LZMA encoder with the specified compression level and dictionary size. Error handling for initialization failures is thorough and follows best practices.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
components/core/CMakeLists.txt (3)

234-235: Fix inconsistent variable name in status message

The status message uses Lzma instead of the package name LibLZMA:

-    message(STATUS "Found Lzma ${LIBLZMA_VERSION_STRING}")
+    message(STATUS "Found LibLZMA ${LIBLZMA_VERSION_STRING}")

241-241: Remove redundant debug message

The include directory message is redundant since CMake already logs this information at a more appropriate debug level.

-message("Lzma Include Dir: ${LIBLZMA_INCLUDE_DIRS}")

231-232: Add support for static/shared LZMA library selection

The TODO comment indicates missing support for selecting between static and shared LZMA libraries. This should be implemented to maintain consistency with other library configurations in the project.

Would you like me to help create a FindLZMA.cmake module that supports static/shared library selection?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 89b5707 and c646cea.

📒 Files selected for processing (4)
  • components/core/CMakeLists.txt (4 hunks)
  • components/core/src/clp/streaming_compression/lzma/Compressor.cpp (1 hunks)
  • components/core/src/clp/streaming_compression/lzma/Compressor.hpp (1 hunks)
  • components/core/tools/scripts/lib_install/liblzma.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • components/core/src/clp/streaming_compression/lzma/Compressor.hpp
  • components/core/tools/scripts/lib_install/liblzma.sh
🔇 Additional comments (3)
components/core/CMakeLists.txt (1)

481-485: Missing test coverage for LZMA compression implementation

The LZMA source files have been added to the test suite in CMakeLists.txt, but there are no corresponding test cases in the test-StreamingCompression.cpp file. This needs to be addressed to ensure the reliability of the LZMA compression functionality.

components/core/src/clp/streaming_compression/lzma/Compressor.cpp (2)

173-183: ⚠️ Potential issue

Prevent potential infinite loops by handling LZMA_BUF_ERROR appropriately

In the compress function, when lzma_code returns LZMA_BUF_ERROR, the current implementation continues the loop without taking corrective action. This could lead to an infinite loop, as LZMA_BUF_ERROR indicates that no progress can be made due to insufficient input or output buffer space.

Consider handling LZMA_BUF_ERROR explicitly to prevent potential infinite loops.

Apply this diff to handle LZMA_BUF_ERROR:

             switch (rc) {
                 case LZMA_OK:
                     break;
-                case LZMA_BUF_ERROR:
-                    break;
+                case LZMA_BUF_ERROR:
+                    // No progress can be made; handle or report the error
+                    SPDLOG_ERROR("Compression stalled: LZMA_BUF_ERROR encountered.");
+                    throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
                 case LZMA_STREAM_END:
                     hit_stream_end = true;
                     break;

207-213: 🛠️ Refactor suggestion

Handle potential exceptions from m_compressed_stream_file_writer->write

Currently, the code assumes that the write operation always succeeds. To enhance robustness, consider handling any exceptions or errors that may occur during the write operation to ensure that failures are properly reported and managed.

Apply this diff to handle possible write errors:

 auto Compressor::pipe_data() -> void {
+    try {
         m_compressed_stream_file_writer->write(
             size_checked_pointer_cast<char>(m_compressed_stream_block_buffer.data()),
             cCompressedStreamBlockBufferSize - m_compression_stream.avail_out
         );
+    } catch (const std::exception& e) {
+        SPDLOG_ERROR("Failed to write compressed data: {}", e.what());
+        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+    }
     m_compression_stream.next_out = m_compressed_stream_block_buffer.data();
     m_compression_stream.avail_out = cCompressedStreamBlockBufferSize;
 }

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (4)
components/core/tests/test-StreamingCompression.cpp (2)

59-61: Add documentation for intentionally omitted decompressor

While the LZMA test section follows the established pattern, it's worth documenting why the decompressor is intentionally not initialized here, especially since it differs from other test sections.

Consider adding a comment like:

 SECTION("LZMA compression") {
+    // Note: Decompressor intentionally not initialized as it's handled separately
     compressor = std::make_unique<clp::streaming_compression::lzma::Compressor>();
 }

82-85: Consider adding test reporting for skipped decompression

While the null check correctly handles missing decompressors, it silently skips the decompression test. Consider using Catch2's INFO or WARN to report when decompression is skipped.

 if (nullptr == decompressor) {
+    WARN("Skipping decompression test - decompressor not initialized");
     boost::filesystem::remove(compressed_file_path);
     return;
 }
components/core/CMakeLists.txt (2)

241-241: Remove redundant debug message

The include directory message is redundant since CMake already logs this information when the package is found.

-message("Lzma Include Dir: ${LIBLZMA_INCLUDE_DIRS}")

231-232: Add support for static/shared LZMA library selection

The TODO comment indicates missing support for selecting between static and shared LZMA libraries. This should be implemented to maintain consistency with other compression libraries in the project.

Would you like me to help create a FindLibLZMA.cmake module that supports static/shared library selection?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between c646cea and c91e5fb.

📒 Files selected for processing (2)
  • components/core/CMakeLists.txt (4 hunks)
  • components/core/tests/test-StreamingCompression.cpp (3 hunks)
🔇 Additional comments (4)
components/core/tests/test-StreamingCompression.cpp (2)

18-18: LGTM! Header inclusion follows project conventions

The LZMA compressor header is properly included alongside other compression-related headers.


59-61: Verify LZMA-specific test scenarios

While the test covers basic compression functionality, consider whether LZMA-specific scenarios need testing:

  • Different compression levels
  • Dictionary size variations
  • Error handling for invalid LZMA streams
✅ Verification successful

Let me gather more information about the LZMA implementation and test coverage.


Test coverage appears adequate for LZMA compression

The implementation already includes essential LZMA-specific configurations and error handling:

  • Compression level configuration via init_lzma_encoder
  • Dictionary size customization
  • Error handling for LZMA stream operations
  • CRC64 checksum verification

The basic test case is sufficient as it exercises the core compression path, which internally handles these configurations. The implementation's error handling is covered through the LZMA stream workflow.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any existing LZMA-specific test cases
rg -l "LZMA" --type cpp --glob "test-*.cpp"

# Check for LZMA configuration options in source
ast-grep --pattern 'class $_ {
  $$$
  // LZMA specific configuration
  $$$
}'

Length of output: 176


Script:

#!/bin/bash
# Check the LZMA compressor implementation and available configuration options
rg -A 10 "class.*[Ll]zma.*Compressor" --type cpp

# Look for LZMA-specific test cases in more detail
rg -B 2 -A 10 "SECTION.*LZMA" components/core/tests/test-StreamingCompression.cpp

# Search for any LZMA configuration or parameter settings
rg "lzma.*\(" --type cpp

Length of output: 2117

components/core/CMakeLists.txt (2)

235-235: Fix incorrect variable name in status message

The status message uses Lzma instead of the package name LibLZMA.


481-483: LGTM! Source files and library linking look good

The LZMA source files are properly organized in the streaming_compression directory, and the library is correctly linked in the target_link_libraries command.

Also applies to: 571-571

@Bill-hbrhbr Bill-hbrhbr requested a review from haiqi96 November 27, 2024 20:48
@Bill-hbrhbr Bill-hbrhbr requested a review from haiqi96 November 30, 2024 07:18
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
components/core/src/clp/streaming_compression/lzma/Compressor.cpp (3)

31-35: Consider simplifying the comment block.

The comment about XZ Embedded could be more concise while maintaining its informative value.

-    // Initialize the encoder using a preset. Set the integrity to check
-    // to CRC64, which is the default in the xz command line tool. If
-    // the .xz file needs to be decompressed with XZ Embedded, use
-    // LZMA_CHECK_CRC32 instead.
+    // Initialize encoder with CRC64 integrity check (use LZMA_CHECK_CRC32 for XZ Embedded compatibility)

171-174: Document LZMA_BUF_ERROR handling strategy.

The code handles LZMA_BUF_ERROR but should document why it's treated the same as LZMA_OK.

             case LZMA_OK:
             case LZMA_BUF_ERROR:
+                // LZMA_BUF_ERROR indicates temporary buffer condition, safe to continue
                 break;

1-207: Consider adding performance optimizations for large datasets.

The current implementation could benefit from:

  1. Buffer size tuning based on input characteristics
  2. Parallel compression for large datasets
  3. Memory-mapped I/O for large files
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between c91e5fb and 26b0663.

📒 Files selected for processing (2)
  • components/core/src/clp/streaming_compression/lzma/Compressor.cpp (1 hunks)
  • components/core/src/clp/streaming_compression/lzma/Compressor.hpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/core/src/clp/streaming_compression/lzma/Compressor.hpp
🔇 Additional comments (5)
components/core/src/clp/streaming_compression/lzma/Compressor.cpp (5)

1-17: LGTM! Header organization and includes are well-structured.

The includes are logically organized into standard libraries, external dependencies, and project-specific headers.


70-93: LGTM! Well-structured initialization with proper validation.

The method properly validates the initialization state and compression level before setting up the compression stream and buffers.


95-108: LGTM! Proper resource cleanup and finalization.

The method ensures proper cleanup of resources and follows RAII principles.


110-131: ⚠️ Potential issue

Add error handling for the compression operation.

The write operation should handle potential compression failures.

     run_lzma(LZMA_RUN);
+    if (!m_compression_stream_is_flushed) {
+        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+    }

199-206: ⚠️ Potential issue

Add exception handling for file write operations.

The write operation could fail and should be wrapped in a try-catch block.

 auto Compressor::pipe_data() -> void {
+    try {
         m_compressed_stream_file_writer->write(
                 clp::size_checked_pointer_cast<char>(m_compressed_stream_block_buffer.data()),
                 cCompressedStreamBlockBufferSize - m_compression_stream.avail_out
         );
+    } catch (const std::exception& e) {
+        SPDLOG_ERROR("Failed to write compressed data: {}", e.what());
+        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+    }
     m_compression_stream.next_out = m_compressed_stream_block_buffer.data();
     m_compression_stream.avail_out = cCompressedStreamBlockBufferSize;
 }

@haiqi96
Copy link
Contributor

haiqi96 commented Dec 2, 2024

  • I added lib install scripts for Ubuntu machines.

Sorry for missing this question. Yes I think you need to look into CentOS and MacOS. It looks like you have already added the dependency for centos.

Actually I would expect the build of MacOS to fail because LZMA library is not installed, and the updated Unittest should require the lzma dependency. Maybe it means the MacOS already comes with LZMA dependencies?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🔭 Outside diff range comments (3)
components/core/tools/scripts/lib_install/centos-stream-9/install-packages-from-source.sh (1)

Line range hint 13-19: Add missing LZMA dependency installation

The script is missing the LZMA library installation which is required for the new LZMA compression functionality. This should be consistent with the Ubuntu installation scripts.

Add the following line after the existing package installations:

 "${lib_install_scripts_dir}/fmtlib.sh" 8.0.1
 "${lib_install_scripts_dir}/spdlog.sh" 1.9.2
 "${lib_install_scripts_dir}/mongocxx.sh" 3.10.2
 "${lib_install_scripts_dir}/msgpack.sh" 7.0.0
+"${lib_install_scripts_dir}/liblzma.sh" 5.4.6
components/core/src/clp/StringReader.cpp (1)

Line range hint 29-35: Fix type consistency in the read loop

The loop counter should use size_t to match the parameter type:

-    for (int i = 0; i < num_bytes_to_read; i++) {
+    for (size_t i = 0; i < num_bytes_to_read; i++) {
         buf[i] = input_string[i + pos];
     }
components/clp-package-utils/clp_package_utils/scripts/stop_clp.py (1)

Line range hint 37-52: Add error handling for Docker commands.

The stop_running_container function should handle potential Docker command failures gracefully.

 def stop_running_container(container_name: str, already_exited_containers: List[str], force: bool):
     if is_container_running(container_name):
         logger.info(f"Stopping {container_name}...")
-        cmd = ["docker", "stop", container_name]
-        subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
+        try:
+            cmd = ["docker", "stop", container_name]
+            subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
+        except subprocess.CalledProcessError as e:
+            logger.error(f"Failed to stop container {container_name}: {e}")
+            return
🧹 Nitpick comments (59)
components/core/src/clp/ffi/ir_stream/Deserializer.hpp (4)

118-119: Consider conditional instantiation of schema trees.
The addition of two separate shared pointers for auto-generated and user-generated schema trees appears correct. However, if there are usage scenarios where one of these structures might be optional, you might consider instantiating it only upon actual usage to reduce resource overhead.


187-193: Validate parameter usage in function call.
The function passes multiple parameters to deserialize the KV-pair log event (including both schema trees). If any of these parameters are consistently immutable or read-only, passing them by const reference may optimize performance and clarify their usage.


215-215: Clarify error response for existing node.
Returning a protocol error for an already existing node is plausible; however, it might be beneficial to provide a more specific error code or a more descriptive message to make troubleshooting easier.


225-225: Consider handling insertion outcomes.
Currently, the insertion result is ignored. If insertions can fail or become unexpectedly modified, it would be prudent to handle, assert, or log any failures for better diagnostic information.

components/core/tests/test-ffi_IrUnitHandlerInterface.cpp (1)

90-96: Consider documenting the purpose of empty brace parameters.

While the code is functionally correct, adding comments to explain what the empty braces {} represent would improve readability and make the test's intent clearer.

     auto test_log_event_result{KeyValuePairLogEvent::create(
             std::make_shared<SchemaTree>(),
             std::make_shared<SchemaTree>(),
-            {},
-            {},
+            {},  // Empty auto-generated key-value pairs
+            {},  // Empty user-generated key-value pairs
             cTestUtcOffset
     )};
components/clp-package-utils/clp_package_utils/scripts/search.py (1)

Line range hint 83-117: Consider enhancing container operation error handling

The container operations section could benefit from more robust error handling. Consider:

  1. Adding error handling for container creation failures
  2. Providing more detailed feedback when container operations fail
  3. Implementing cleanup in case of partial failures

Example improvement:

    cmd = container_start_cmd + search_cmd
-    subprocess.run(cmd, check=True)
+    try:
+        subprocess.run(cmd, check=True, capture_output=True, text=True)
+    except subprocess.CalledProcessError as e:
+        print(f"Container operation failed: {e.stderr}", file=sys.stderr)
+        return -1
+    finally:
+        # Ensure cleanup happens even if container operation fails
+        if generated_config_path_on_host.exists():
+            generated_config_path_on_host.unlink()
components/core/src/clp_s/TimestampDictionaryWriter.hpp (2)

52-54: Switch to std::string_view for ingest_entry is beneficial.
This helps reduce copies and improve performance. Ensure that upstream code does not pass ephemeral references (like temporary string data) that outlive the function.


91-96: Refactored write_timestamp_entries with stream usage.
Removing the compressor parameter in favour of a stream simplifies the logic. Confirm that compression is correctly handled elsewhere if needed.

Also applies to: 102-102

components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py (2)

18-18: Logger is declared at the global scope.
Consider fine-grained logging or prefixing messages to differentiate context if logs become large.


47-55: Robust exception handling for config loading.
Catching broad exceptions can occasionally mask specific errors. If feasible, consider narrower exception handling.

components/core/src/clp/ffi/KeyValuePairLogEvent.hpp (1)

65-70: Consider extra pointer checks for defensive programming.

Even though the create method already validates non-null pointers, re-checking in getters (such as get_user_gen_keys_schema_tree) can provide additional safety if these methods are ever called outside the normal lifecycle guarantees.

components/core/src/clp_s/ArchiveWriter.hpp (1)

181-203: Check concurrency if archives are written in parallel.

Although no concurrency usage is shown here, if future changes introduce parallel writing to single-file archives, ensure proper synchronisation. No immediate issues are found in the current code.

components/core/src/clp/streaming_compression/lzma/Compressor.cpp (3)

90-111: Watch for potential infinite loops

Within encode_lzma(), the loop continues while m_lzma_stream.avail_in() > 0. The LZMA_BUF_ERROR case is handled by throwing an exception, which is appropriate to avoid deadlocks or infinite loops. You might consider logging more context (e.g., offset, identifications) to facilitate debugging.


113-152: Ensure flush action validity

flush_lzma() checks LzmaStream::is_flush_action(flush_action) to confirm a valid flush action and throws if invalid. This is a robust measure. To improve clarity, consider capturing the flush status in a local variable or enumerated type, so the code more clearly indicates what each flush action entails.


154-170: Reset the output buffer upon flush completion

flush_stream_output_block_buffer() writes the compressed data and reattaches the output buffer. This is correct. Consider adding debug logs indicating how many bytes were written whenever you flush; it can be useful for performance monitoring and troubleshooting.

components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py (3)

7-7: Add docstring for type usage

Importing typing hints is beneficial. Consider adding or updating docstrings and type annotations for functions that interact with typed parameters or return typed dictionaries.


16-19: Clearly separate database config items

The newly imported S3Config, StorageEngine, StorageType, and WorkerConfig are introduced. It may be helpful to group them logically or add brief inline comments to clarify each import’s purpose. This helps future maintainers quickly grasp how these classes relate to compression tasks.


22-23: Revisit cross-module dependencies with caution

Reading the config with read_yaml_config_file is a powerful approach, but it can clutter the boundary between modules. Consider verifying that config reading is done in a dedicated layer to keep separation of concerns.

components/core/src/clp_s/ArchiveWriter.cpp (2)

59-76: Carefully handle offset accumulation

Within the single-file-archive branch, each file’s offset is tracked in a vector. This is a straightforward approach. If future expansions require parallel or partial writes, consider a more structured metadata approach (e.g. a dedicated offset manager).


Line range hint 282-398: Sort large schema tables logically

The logic sorts schemas in descending order of their uncompressed size. This seems optimal for chunking. If streaming or concurrency become concerns, splitting large tables across multiple streams might be needed.

components/clp-py-utils/clp_py_utils/clp_config.py (4)

3-3: Favour consistent type hints

Using from typing import Literal, Optional, Union fosters clarity. Confirm all function signatures leverage these type hints consistently, benefiting lint checks and maintainability.


312-339: Add coverage for missing region or bucket credentials

S3Config mandates region_code and bucket. Should any default region be used if region_code is not specified? The current approach of raising a ValueError is safe, but confirm that is the desired user experience.


411-420: Set directory logic for FS vs. S3

The set_directory method updates either the folder path or the staging directory, depending on the storage type. This is a succinct approach. If additional storage types appear (e.g., Azure, GCP), this method may need extension or a strategy pattern.


544-556: Storage engine vs. storage type correlation

validate_archive_output_config() verifies that S3 is only used with CLP_S. This is crucial for consistent data handling. Consider adding a more descriptive error that instructs users how to fix the mismatch.

components/core/src/clp/ffi/KeyValuePairLogEvent.cpp (2)

247-247: Consider vector sizing constructs.
Using a dynamic size for 'key_duplication_checked_node_id_bitmap' is acceptable, but be mindful of potential performance costs if this is invoked frequently.


333-364: Validate correctness of marking parent nodes in the bitmap.
The logic in 'get_schema_subtree_bitmap' to iteratively mark parent nodes is sound. However, watch for potential off-by-one or boundary errors when node IDs approach the maximum schema size.

components/core/tests/test-ffi_KeyValuePairLogEvent.cpp (2)

221-237: Ensure raw pointer usage is minimized.
The new function is well-structured, but check that the usage of raw pointers in other test areas remains safe, given previous feedback about adopting references where possible.


Line range hint 356-426: Extend coverage of mismatched types.
The loops that insert invalid node-ID-value pairs are well designed. Consider adding corner cases (e.g., extremely large or empty array structures) to ensure comprehensive coverage.

components/core/src/clp_s/CommandLineArguments.cpp (3)

9-9: Include usage rationale.
The addition of "type_utils.hpp" is not immediately obvious in this file. Include documentation or comments if it’s used for specialized type traits to clarify its necessity.


197-199: Separate single-file-archive logic.
The new 'single-file-archive' switch is helpful. Consider grouping this near other arguments that affect output style, such as 'print-archive-stats', for improved discoverability.


208-211: Validate file-type compatibility.
The check preventing 'kv-ir' usage along with '--structurize-arrays' is sensible. Evaluate whether additional constraints or warnings are needed for future compression modes.

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)

Line range hint 631-643: Mount for compression output.
Appending 'mounts.archives_output_dir' for compression workers ensures direct access to the final archive location. Validate path correctness on systems with different directory structures.

components/core/src/clp/ffi/ir_stream/IrErrorCode.hpp (2)

9-12: Consider enhancing the documentation.

While the current documentation provides a basic overview, it would be beneficial to add more detailed information about:

  • The context in which these error codes are used
  • The relationship with the error handling framework
  • Example usage scenarios

13-17: Add documentation for individual enum values.

Each enum value would benefit from documentation explaining:

  • The specific conditions that trigger this error
  • How to handle or recover from each error state

Consider adding documentation like this:

 enum class IrErrorCodeEnum : uint8_t {
+    /// Indicates a failure in the decoding method implementation
     DecodingMethodFailure,
+    /// Signals that the end of the stream has been reached
     EndOfStream,
+    /// Indicates the stream data is incomplete or truncated
     IncompleteStream,
 };
components/core/src/clp/clo/constants.hpp (1)

Line range hint 15-20: Consider documenting the SearchOutput constants.

The new OrigFileId constant and existing constants would benefit from documentation explaining their purpose and usage context.

Consider adding documentation like this:

 namespace SearchOutput {
+    /// Unique identifier for the original file
     constexpr char OrigFileId[]{"orig_file_id"};
+    /// Path to the original file in the filesystem
     constexpr char OrigFilePath[]{"orig_file_path"};
     // ... (document other constants similarly)
 }
components/core/src/clp/ffi/ir_stream/IrErrorCode.cpp (1)

13-26: Consider enhancing the default error message

While the error handling is well-implemented, the default message could be more informative by including the actual enum value.

Consider this enhancement:

-            return "Unknown error code enum.";
+            return "Unknown error code enum: " + std::to_string(static_cast<int>(error_enum));
components/core/src/clp/BoundedReader.cpp (1)

21-42: Consider optimizing the read operation

While the implementation is correct, the double condition check for EOF could be consolidated:

-    if (m_curr_pos == m_bound) {
-        num_bytes_read = 0;
-        return ErrorCode_EndOfFile;
-    }
-
     if ((m_curr_pos + num_bytes_to_read) > m_bound) {
+        if (m_curr_pos == m_bound) {
+            num_bytes_read = 0;
+            return ErrorCode_EndOfFile;
+        }
         num_bytes_to_read = m_bound - m_curr_pos;
     }
components/core/src/clp_s/SingleFileArchiveDefs.hpp (3)

17-26: Consider structure alignment for performance

The ArchiveHeader structure might benefit from explicit alignment:

+#pragma pack(push, 1)
 struct ArchiveHeader {
     uint8_t magic_number[4];
     uint32_t version;
     uint64_t uncompressed_size;
     uint64_t compressed_size;
     uint64_t reserved_padding[4];
     uint32_t metadata_section_size;
     uint16_t compression_type;
     uint16_t padding;
 };
+#pragma pack(pop)

38-43: Address the TODO comment for future fields

The TODO comment indicates missing fields in ArchiveInfoPacket. Let's track this for future implementation.

Would you like me to create a GitHub issue to track the addition of future fields to ArchiveInfoPacket?


45-50: Document the purpose of abbreviated field names

The ArchiveFileInfo structure uses abbreviated field names (n, o). Consider adding more detailed documentation.

 struct ArchiveFileInfo {
+    // File name within the archive
     std::string n;  // name
+    // Offset position in the archive where the file content begins
     uint64_t o;  // offset

     MSGPACK_DEFINE_MAP(n, o);
 };
components/clp-py-utils/clp_py_utils/s3_utils.py (2)

39-50: Consider using context manager for S3 operations

The current implementation could benefit from better resource management and more specific error handling.

-    with open(src_file, "rb") as file_data:
-        try:
-            my_s3_client.put_object(
-                Bucket=s3_config.bucket, Body=file_data, Key=s3_config.key_prefix + dest_file_name
-            )
-        except ClientError as e:
-            error_code = e.response["Error"]["Code"]
-            error_message = e.response["Error"]["Message"]
-            return Err(f"ClientError: {error_code} - {error_message}")
-        except Exception as e:
-            return Err(f"An unexpected error occurred: {e}")
+    try:
+        with open(src_file, "rb") as file_data:
+            my_s3_client.put_object(
+                Bucket=s3_config.bucket,
+                Body=file_data,
+                Key=s3_config.key_prefix + dest_file_name
+            )
+    except FileNotFoundError:
+        return Err(f"Failed to open file: {src_file}")
+    except PermissionError:
+        return Err(f"Permission denied accessing file: {src_file}")
+    except ClientError as e:
+        error_code = e.response["Error"]["Code"]
+        error_message = e.response["Error"]["Message"]
+        return Err(f"S3 ClientError: {error_code} - {error_message}")
+    except Exception as e:
+        return Err(f"Unexpected error during S3 upload: {str(e)}")

31-37: Extract S3 client creation to a separate function

Consider moving the S3 client initialization to a dedicated function for better reusability and testing.

+def create_s3_client(s3_config: S3Config, total_max_attempts: int) -> boto3.client:
+    """
+    Create an S3 client with the given configuration.
+    
+    :param s3_config: S3 configuration containing credentials and region
+    :param total_max_attempts: Maximum number of retry attempts
+    :return: Configured S3 client
+    """
+    config = Config(retries=dict(total_max_attempts=total_max_attempts, mode="adaptive"))
+    return boto3.client(
+        "s3",
+        region_name=s3_config.region_code,
+        aws_access_key_id=s3_config.access_key_id,
+        aws_secret_access_key=s3_config.secret_access_key,
+        config=config,
+    )
components/core/tests/test_log_files/test_no_floats_sorted.jsonl (1)

1-4: Add newline at end of file

The file should end with a newline character to follow POSIX standards and ensure proper text file handling.

Consider improving test file readability

While the test cases are comprehensive, consider formatting the JSON objects with proper indentation for better readability and maintenance. This can be achieved using tools like jq.

LGTM: Comprehensive test coverage

The test cases effectively cover:

  • Various primitive types
  • Edge cases for integer bounds
  • Nested objects and arrays
  • Empty and non-empty containers
components/core/src/clp_s/JsonParser.hpp (1)

33-33: Add documentation for new member variables

The newly added member variables lack documentation explaining their purpose and valid values.

+    /**
+     * Type of input file to be parsed. Defaults to JSON.
+     */
     CommandLineArguments::FileType input_file_type{CommandLineArguments::FileType::Json};
     std::string timestamp_key;
     std::string archives_dir;
     size_t target_encoded_size{};
     size_t max_document_size{};
     size_t min_table_size{};
     int compression_level{};
     bool print_archive_stats{};
     bool structurize_arrays{};
     bool record_log_order{true};
+    /**
+     * When true, creates a single archive file instead of splitting into multiple files.
+     */
     bool single_file_archive{false};

Also applies to: 43-43

components/core/src/clp_s/TimestampEntry.hpp (2)

Line range hint 47-54: Consider adding string_view validation

The constructor accepting string_view should validate the input to ensure it's not empty or contains invalid characters.

-    TimestampEntry(std::string_view key_name)
+    explicit TimestampEntry(std::string_view key_name)
             : m_encoding(UnkownTimestampEncoding),
               m_epoch_start_double(cDoubleEpochTimeMax),
               m_epoch_end_double(cDoubleEpochTimeMin),
               m_epoch_start(cEpochTimeMax),
               m_epoch_end(cEpochTimeMin),
-              m_key_name(key_name) {}
+              m_key_name(key_name) {
+        if (key_name.empty()) {
+            throw OperationFailed(ErrorCode_InvalidArgument, __FILE__, __LINE__);
+        }
+    }

70-73: Update method documentation

The documentation for write_to_stream is outdated and refers to the wrong parameter name.

-     * Write the timestamp entry to a buffered stream.
-     * @param compressor
+     * Write the timestamp entry to a string stream.
+     * @param stream The output string stream to write to
      */
     void write_to_stream(std::stringstream& stream) const;
components/core/tests/test-BoundedReader.cpp (1)

31-32: Consider following the coding guideline for condition checks

Per coding guidelines, prefer false == <expression> over !<expression>. Consider refactoring condition checks to match this style:

-        REQUIRE(clp::ErrorCode_Success == rc);
+        REQUIRE(false == (clp::ErrorCode_Success != rc));

Also applies to: 43-45, 62-63, 73-75, 84-86, 95-97

components/core/tests/test-error_handling.cpp (2)

19-44: Consider adding documentation for error categories.

The error code enums and categories are well-structured, but adding documentation would improve maintainability by explaining:

  • The purpose of AlwaysSuccessErrorCodeEnum vs BinaryErrorCodeEnum
  • The significance of the error conditions in cFailureConditions and cNoneFailureConditions

95-156: Consider adding more test cases for edge cases.

While the current tests are comprehensive for the happy path, consider adding tests for:

  • Error code construction with invalid enum values
  • Thread safety of the error category singleton
  • Boundary conditions for error code comparisons

Example test cases to add:

TEST_CASE("test_error_code_edge_cases", "[error_handling][ErrorCode]") {
    // Test invalid enum value
    BinaryErrorCode const invalid{static_cast<BinaryErrorCodeEnum>(255)};
    REQUIRE(invalid.message() == cUnrecognizedErrorCode);
    
    // Test thread safety of category singleton
    auto verify_category = []() {
        return &BinaryErrorCode::get_category() == &BinaryErrorCode::get_category();
    };
    std::vector<std::thread> threads;
    for (int i = 0; i < 10; ++i) {
        threads.emplace_back([&verify_category]() {
            REQUIRE(verify_category());
        });
    }
    for (auto& thread : threads) {
        thread.join();
    }
}
components/core/src/clp/streaming_compression/lzma/Compressor.hpp (2)

182-182: Consider making the buffer size configurable.

The hard-coded buffer size of 4KiB might not be optimal for all use cases. Consider making it configurable through the constructor or a static configuration.


193-193: Add documentation for error codes.

The encode_lzma method's documentation should specify which error codes can be thrown in the OperationFailed exception.

components/clp-package-utils/clp_package_utils/scripts/stop_clp.py (1)

34-34: Consider adding log format configuration.

The current logging setup could benefit from a consistent format across the application. Consider adding a formatter to match other scripts in the codebase.

components/clp-package-utils/clp_package_utils/scripts/decompress.py (2)

84-87: Consider enhancing error message with supported storage types

The error message could be more helpful by listing the supported storage types.

-        logger.error(f"File extraction is not supported for archive storage type: {storage_type}.")
+        logger.error(f"File extraction is not supported for archive storage type: {storage_type}. Only {StorageType.FS} is supported.")

164-169: Maintain consistency with file extraction error message

The stream extraction error message should follow the same format as the file extraction error message for consistency.

-        logger.error(
-            f"Stream extraction is not supported for archive storage type: {storage_type}."
-        )
+        logger.error(f"Stream extraction is not supported for archive storage type: {storage_type}. Only {StorageType.FS} is supported.")
components/core/src/clp/ffi/SchemaTree.hpp (1)

254-256: Consider adding noexcept specifier

Since the vector's operator== is noexcept, this operator could also be marked as noexcept.

-    [[nodiscard]] auto operator==(SchemaTree const& rhs) const -> bool {
+    [[nodiscard]] auto operator==(SchemaTree const& rhs) const noexcept -> bool {
         return m_tree_nodes == rhs.m_tree_nodes;
     }
components/core/tests/test-NetworkReader.cpp (1)

207-226: Excellent improvement to test reliability.

The addition of retry logic for handling transient HTTP errors is a great improvement. This prevents the test from failing due to temporary server issues while still maintaining its validation purpose.

Consider extracting the retry logic into a helper function for reuse in other network-dependent tests.

components/core/src/clp_s/TimestampEntry.cpp (1)

60-74: Consider adding error handling for stream operations.

While the stream-based writing implementation is good, it could benefit from error handling for potential stream failures.

Consider adding error checking:

 void TimestampEntry::write_to_stream(std::stringstream& stream) const {
+    if (!stream.good()) {
+        throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
+    }
     write_numeric_value<uint64_t>(stream, m_key_name.size());
     stream.write(m_key_name.data(), m_key_name.size());
+    if (stream.fail()) {
+        throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
+    }
     // ... rest of the method
 }
components/core/src/clp_s/clp-s.cpp (1)

117-127: Implementation placeholder for KeyValueIr compression

The code correctly handles the unsupported file type case. The comment indicates that implementation is planned for a future PR.

Would you like assistance in implementing the KeyValueIr compression functionality or creating a GitHub issue to track this task?

components/core/CMakeLists.txt (1)

230-244: Add proper CMake module for LZMA

The TODO comment indicates the need for a proper CMake module. This would improve the library detection and configuration process.

Consider creating a FindLZMA.cmake module in ./cmake/Modules/ that:

  1. Properly handles both static and dynamic linking
  2. Provides standardized variables for version, libraries, and includes
  3. Includes proper version checking
  4. Handles platform-specific library names and locations
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7cedb25 and 02f0b8f.

📒 Files selected for processing (82)
  • .github/workflows/clp-pr-title-checks.yaml (1 hunks)
  • components/clp-package-utils/clp_package_utils/__init__.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/general.py (5 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/compress.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py (4 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/del_archives.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py (0 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (2 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/search.py (0 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/search.py (2 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py (10 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/stop_clp.py (1 hunks)
  • components/clp-py-utils/clp_py_utils/clp_config.py (11 hunks)
  • components/clp-py-utils/clp_py_utils/initialize-orchestration-db.py (1 hunks)
  • components/clp-py-utils/clp_py_utils/s3_utils.py (1 hunks)
  • components/clp-py-utils/pyproject.toml (2 hunks)
  • components/core/CMakeLists.txt (10 hunks)
  • components/core/src/clp/BoundedReader.cpp (1 hunks)
  • components/core/src/clp/BoundedReader.hpp (1 hunks)
  • components/core/src/clp/StringReader.cpp (1 hunks)
  • components/core/src/clp/clo/CommandLineArguments.cpp (1 hunks)
  • components/core/src/clp/clo/CommandLineArguments.hpp (0 hunks)
  • components/core/src/clp/clo/OutputHandler.cpp (1 hunks)
  • components/core/src/clp/clo/clo.cpp (4 hunks)
  • components/core/src/clp/clo/constants.hpp (1 hunks)
  • components/core/src/clp/clp/CommandLineArguments.cpp (0 hunks)
  • components/core/src/clp/clp/CommandLineArguments.hpp (0 hunks)
  • components/core/src/clp/clp/FileDecompressor.hpp (2 hunks)
  • components/core/src/clp/clp/decompression.cpp (3 hunks)
  • components/core/src/clp/clp/utils.cpp (1 hunks)
  • components/core/src/clp/clp/utils.hpp (1 hunks)
  • components/core/src/clp/error_handling/ErrorCode.hpp (1 hunks)
  • components/core/src/clp/ffi/KeyValuePairLogEvent.cpp (9 hunks)
  • components/core/src/clp/ffi/KeyValuePairLogEvent.hpp (3 hunks)
  • components/core/src/clp/ffi/SchemaTree.hpp (2 hunks)
  • components/core/src/clp/ffi/ir_stream/Deserializer.hpp (4 hunks)
  • components/core/src/clp/ffi/ir_stream/IrErrorCode.cpp (1 hunks)
  • components/core/src/clp/ffi/ir_stream/IrErrorCode.hpp (1 hunks)
  • components/core/src/clp/ffi/ir_stream/ir_unit_deserialization_methods.cpp (2 hunks)
  • components/core/src/clp/ffi/ir_stream/ir_unit_deserialization_methods.hpp (2 hunks)
  • components/core/src/clp/ir/EncodedTextAst.cpp (1 hunks)
  • components/core/src/clp/ir/LogEvent.hpp (1 hunks)
  • components/core/src/clp/streaming_compression/lzma/Compressor.cpp (1 hunks)
  • components/core/src/clp/streaming_compression/lzma/Compressor.hpp (1 hunks)
  • components/core/src/clp_s/ArchiveWriter.cpp (7 hunks)
  • components/core/src/clp_s/ArchiveWriter.hpp (6 hunks)
  • components/core/src/clp_s/CommandLineArguments.cpp (5 hunks)
  • components/core/src/clp_s/CommandLineArguments.hpp (4 hunks)
  • components/core/src/clp_s/JsonConstructor.cpp (2 hunks)
  • components/core/src/clp_s/JsonParser.cpp (1 hunks)
  • components/core/src/clp_s/JsonParser.hpp (3 hunks)
  • components/core/src/clp_s/ParsedMessage.hpp (3 hunks)
  • components/core/src/clp_s/SingleFileArchiveDefs.hpp (1 hunks)
  • components/core/src/clp_s/TimestampDictionaryWriter.cpp (5 hunks)
  • components/core/src/clp_s/TimestampDictionaryWriter.hpp (5 hunks)
  • components/core/src/clp_s/TimestampEntry.cpp (2 hunks)
  • components/core/src/clp_s/TimestampEntry.hpp (3 hunks)
  • components/core/src/clp_s/TimestampPattern.cpp (10 hunks)
  • components/core/src/clp_s/TimestampPattern.hpp (3 hunks)
  • components/core/src/clp_s/Utils.hpp (2 hunks)
  • components/core/src/clp_s/archive_constants.hpp (2 hunks)
  • components/core/src/clp_s/clp-s.cpp (2 hunks)
  • components/core/src/clp_s/search/StringLiteral.hpp (2 hunks)
  • components/core/src/reducer/reducer_server.cpp (2 hunks)
  • components/core/tests/test-BoundedReader.cpp (1 hunks)
  • components/core/tests/test-NetworkReader.cpp (1 hunks)
  • components/core/tests/test-StreamingCompression.cpp (4 hunks)
  • components/core/tests/test-clp_s-end_to_end.cpp (1 hunks)
  • components/core/tests/test-error_handling.cpp (1 hunks)
  • components/core/tests/test-ffi_IrUnitHandlerInterface.cpp (2 hunks)
  • components/core/tests/test-ffi_KeyValuePairLogEvent.cpp (7 hunks)
  • components/core/tests/test-ir_encoding_methods.cpp (1 hunks)
  • components/core/tests/test_log_files/test_no_floats_sorted.jsonl (1 hunks)
  • components/core/tools/scripts/lib_install/centos-stream-9/install-packages-from-source.sh (1 hunks)
  • components/core/tools/scripts/lib_install/centos-stream-9/install-prebuilt-packages.sh (1 hunks)
  • components/core/tools/scripts/lib_install/install-boost.sh (1 hunks)
  • components/core/tools/scripts/lib_install/ubuntu-focal/install-packages-from-source.sh (1 hunks)
  • components/core/tools/scripts/lib_install/ubuntu-focal/install-prebuilt-packages.sh (1 hunks)
  • components/core/tools/scripts/lib_install/ubuntu-jammy/install-packages-from-source.sh (1 hunks)
  • components/core/tools/scripts/lib_install/ubuntu-jammy/install-prebuilt-packages.sh (1 hunks)
  • components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py (12 hunks)
⛔ Files not processed due to max files limit (16)
  • components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py
  • components/job-orchestration/job_orchestration/executor/query/fs_search_task.py
  • components/job-orchestration/job_orchestration/executor/query/utils.py
  • components/job-orchestration/job_orchestration/executor/utils.py
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
  • components/log-viewer-webui/client/src/api/query.js
  • components/log-viewer-webui/server/src/DbManager.js
  • components/log-viewer-webui/yscope-log-viewer
  • components/package-template/src/etc/clp-config.yml
  • components/package-template/src/sbin/admin-tools/del-archives.sh
  • deps-tasks.yml
  • docs/requirements.txt
  • docs/src/dev-guide/components-core/centos-stream-9-deps-install.md
  • docs/src/dev-guide/components-core/ubuntu-focal-deps-install.md
  • docs/src/dev-guide/components-core/ubuntu-jammy-deps-install.md
  • docs/src/user-guide/reference-json-search-syntax.md
💤 Files with no reviewable changes (5)
  • components/core/src/clp/clp/CommandLineArguments.hpp
  • components/core/src/clp/clp/CommandLineArguments.cpp
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py
  • components/core/src/clp/clo/CommandLineArguments.hpp
  • components/clp-package-utils/clp_package_utils/scripts/native/search.py
✅ Files skipped from review due to trivial changes (5)
  • components/clp-package-utils/clp_package_utils/init.py
  • components/core/src/clp/ir/EncodedTextAst.cpp
  • components/core/src/clp/ir/LogEvent.hpp
  • components/core/src/clp/clp/utils.hpp
  • components/core/src/clp/clp/utils.cpp
🚧 Files skipped from review as they are similar to previous changes (5)
  • components/core/tools/scripts/lib_install/centos-stream-9/install-prebuilt-packages.sh
  • components/core/tools/scripts/lib_install/ubuntu-focal/install-prebuilt-packages.sh
  • components/core/tools/scripts/lib_install/ubuntu-jammy/install-packages-from-source.sh
  • components/core/tools/scripts/lib_install/ubuntu-jammy/install-prebuilt-packages.sh
  • components/core/tools/scripts/lib_install/ubuntu-focal/install-packages-from-source.sh
🧰 Additional context used
📓 Path-based instructions (48)
components/core/src/clp_s/search/StringLiteral.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/StringReader.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/ffi/ir_stream/IrErrorCode.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/JsonConstructor.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/ffi/ir_stream/IrErrorCode.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/Utils.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/JsonParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/clp-s.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/clp/FileDecompressor.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/archive_constants.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ParsedMessage.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/tests/test-NetworkReader.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/clo/OutputHandler.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/CommandLineArguments.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/tests/test-BoundedReader.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/JsonParser.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/tests/test-ir_encoding_methods.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/BoundedReader.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/tests/test-ffi_IrUnitHandlerInterface.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/CommandLineArguments.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/TimestampEntry.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/TimestampEntry.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/clo/constants.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/ffi/SchemaTree.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/tests/test-ffi_KeyValuePairLogEvent.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/ffi/ir_stream/Deserializer.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/TimestampPattern.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/BoundedReader.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/tests/test-error_handling.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/clp/decompression.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/streaming_compression/lzma/Compressor.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/clo/clo.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/SingleFileArchiveDefs.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/TimestampDictionaryWriter.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/reducer/reducer_server.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/ffi/ir_stream/ir_unit_deserialization_methods.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/clo/CommandLineArguments.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/streaming_compression/lzma/Compressor.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/tests/test-clp_s-end_to_end.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/tests/test-StreamingCompression.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/ffi/ir_stream/ir_unit_deserialization_methods.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/TimestampPattern.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/TimestampDictionaryWriter.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/error_handling/ErrorCode.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/ffi/KeyValuePairLogEvent.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ArchiveWriter.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ArchiveWriter.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/ffi/KeyValuePairLogEvent.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (16)
components/core/src/clp/ffi/ir_stream/IrErrorCode.hpp (1)
Learnt from: LinZhihao-723
PR: y-scope/clp#549
File: components/core/src/clp/ffi/ir_stream/Deserializer.hpp:209-210
Timestamp: 2024-11-10T16:46:53.300Z
Learning: Until #486 is merged with customized error code support, introducing a layer from `IRErrorCode` to `std::errc` is unnecessary.
components/core/tests/test-NetworkReader.cpp (1)
Learnt from: LinZhihao-723
PR: y-scope/clp#593
File: components/core/tests/test-NetworkReader.cpp:216-219
Timestamp: 2024-11-15T03:15:45.919Z
Learning: In the `network_reader_with_valid_http_header_kv_pairs` test case in `components/core/tests/test-NetworkReader.cpp`, additional error handling for JSON parsing failures is not necessary, as the current error message is considered sufficient.
components/core/src/clp/BoundedReader.cpp (1)
Learnt from: gibber9809
PR: y-scope/clp#624
File: components/core/src/clp/BoundedReader.hpp:25-27
Timestamp: 2024-12-10T16:56:33.545Z
Learning: In the CLP project, within constructors like `BoundedReader` in `components/core/src/clp/BoundedReader.hpp`, it's acceptable to call methods like `get_pos()` that may throw exceptions without handling potential errors via `try_get_pos()`, since the constructor itself handles errors by throwing.
components/core/tests/test-ffi_KeyValuePairLogEvent.cpp (3)
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:228-236
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, helper functions like `assert_kv_pair_log_event_creation_failure` return booleans to the caller, and the caller asserts using `REQUIRE` to provide clearer failure information.
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:85-103
Timestamp: 2024-11-10T16:46:53.300Z
Learning: The function `assert_kv_pair_log_event_creation_failure` is correctly placed within the anonymous namespace in `test-ffi_KeyValuePairLogEvent.cpp`.
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.
components/core/src/clp/ffi/ir_stream/Deserializer.hpp (1)
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/src/clp/ffi/ir_stream/Deserializer.hpp:119-120
Timestamp: 2024-11-10T16:46:58.543Z
Learning: `KeyValuePairLogEvent::create` requires `std::shared_ptr<SchemaTree>`, so `m_auto_generated_schema_tree` and `m_user_generated_schema_tree` need to be `std::shared_ptr`.
components/core/src/clp/BoundedReader.hpp (1)
Learnt from: gibber9809
PR: y-scope/clp#624
File: components/core/src/clp/BoundedReader.hpp:25-27
Timestamp: 2024-12-10T16:56:33.545Z
Learning: In the CLP project, within constructors like `BoundedReader` in `components/core/src/clp/BoundedReader.hpp`, it's acceptable to call methods like `get_pos()` that may throw exceptions without handling potential errors via `try_get_pos()`, since the constructor itself handles errors by throwing.
components/core/tests/test-error_handling.cpp (1)
Learnt from: LinZhihao-723
PR: y-scope/clp#486
File: components/core/tests/test-error_handling.cpp:44-91
Timestamp: 2024-11-26T19:16:19.933Z
Learning: In `components/core/tests/test-error_handling.cpp` (C++), when specializing template methods of `ErrorCategory` via type aliases (e.g., `AlwaysSuccessErrorCategory`), it's acceptable to define these specializations in the current namespace, even though the primary template is in the `clp::error_handling` namespace, due to the use of type aliases and template instantiation.
components/core/CMakeLists.txt (1)
Learnt from: LinZhihao-723
PR: y-scope/clp#549
File: components/core/src/clp/ffi/ir_stream/Deserializer.hpp:209-210
Timestamp: 2024-11-10T16:46:53.300Z
Learning: Until #486 is merged with customized error code support, introducing a layer from `IRErrorCode` to `std::errc` is unnecessary.
components/core/src/clp/streaming_compression/lzma/Compressor.cpp (1)
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/src/clp/streaming_compression/lzma/Compressor.cpp:169-170
Timestamp: 2024-12-19T01:39:30.094Z
Learning: In the LzmaStreamOperations context (Compressor.cpp), m_p is guaranteed never to be null.
components/core/src/clp/ffi/ir_stream/ir_unit_deserialization_methods.cpp (1)
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/src/clp/ffi/ir_stream/Deserializer.hpp:119-120
Timestamp: 2024-11-10T16:46:58.543Z
Learning: `KeyValuePairLogEvent::create` requires `std::shared_ptr<SchemaTree>`, so `m_auto_generated_schema_tree` and `m_user_generated_schema_tree` need to be `std::shared_ptr`.
components/core/src/clp/streaming_compression/lzma/Compressor.hpp (1)
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/src/clp/streaming_compression/lzma/Compressor.hpp:65-75
Timestamp: 2024-12-19T01:40:13.272Z
Learning: "try_get_pos" method in `Compressor.hpp` does not throw exceptions, as clarified by Bill-hbrhbr.
components/core/tests/test-clp_s-end_to_end.cpp (1)
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
components/core/tests/test-StreamingCompression.cpp (1)
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/tests/test-StreamingCompression.cpp:45-54
Timestamp: 2024-12-04T15:50:37.827Z
Learning: In `components/core/tests/test-StreamingCompression.cpp`, within the `compress` function, compressing the same data repeatedly by passing the same `src` pointer without advancing is intentional to test the compressor with the same data multiple times.
components/core/src/clp/error_handling/ErrorCode.hpp (1)
Learnt from: LinZhihao-723
PR: y-scope/clp#486
File: components/core/src/clp/error_handling/ErrorCode.hpp:131-134
Timestamp: 2024-11-26T19:12:09.102Z
Learning: In the error handling system (`components/core/src/clp/error_handling/ErrorCode.hpp`), the `ErrorCode` class is intentionally designed to wrap the `ErrorCodeEnum`, not to use the enum directly. This wrapper provides a general interface for adding customized error enums.
components/core/src/clp/ffi/KeyValuePairLogEvent.cpp (3)
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:474-480
Timestamp: 2024-12-09T15:25:14.043Z
Learning: In `components/core/src/clp/ffi/KeyValuePairLogEvent.cpp`, node IDs are validated before accessing `child_schema_tree_node` in the function `serialize_node_id_value_pairs_to_json`, ensuring `get_node` does not throw exceptions due to invalid node IDs.
Learnt from: LinZhihao-723
PR: y-scope/clp#554
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:109-111
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `KeyValuePairLogEvent.cpp`, for the class `JsonSerializationIterator`, it's acceptable to use raw pointers for member variables like `m_schema_tree_node`, and there's no need to replace them with references or smart pointers in this use case.
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/src/clp/ffi/KeyValuePairLogEvent.hpp:62-68
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `KeyValuePairLogEvent.hpp`, the methods `get_auto_generated_schema_tree()` and `get_user_generated_schema_tree()` do not need to check for null shared pointers because the factory function ensures they are not null.
components/core/src/clp/ffi/KeyValuePairLogEvent.hpp (3)
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/src/clp/ffi/KeyValuePairLogEvent.hpp:62-68
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `KeyValuePairLogEvent.hpp`, the methods `get_auto_generated_schema_tree()` and `get_user_generated_schema_tree()` do not need to check for null shared pointers because the factory function ensures they are not null.
Learnt from: LinZhihao-723
PR: y-scope/clp#554
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:109-111
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `KeyValuePairLogEvent.cpp`, for the class `JsonSerializationIterator`, it's acceptable to use raw pointers for member variables like `m_schema_tree_node`, and there's no need to replace them with references or smart pointers in this use case.
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/src/clp/ffi/KeyValuePairLogEvent.cpp:474-480
Timestamp: 2024-12-09T15:25:14.043Z
Learning: In `components/core/src/clp/ffi/KeyValuePairLogEvent.cpp`, node IDs are validated before accessing `child_schema_tree_node` in the function `serialize_node_id_value_pairs_to_json`, ensuring `get_node` does not throw exceptions due to invalid node IDs.
🪛 cppcheck (2.10-2)
components/core/src/clp_s/ArchiveWriter.cpp

[error] 191-191: Exception thrown in function declared not to throw exceptions.

(throwInNoexceptFunction)

🔇 Additional comments (116)
components/core/tools/scripts/lib_install/centos-stream-9/install-packages-from-source.sh (1)

13-13: LGTM on Boost version update!

The update to Boost 1.87.0 aligns with the versions in the Ubuntu installation scripts.

components/core/tests/test-ffi_IrUnitHandlerInterface.cpp (1)

134-134: LGTM! The assertion correctly validates user-generated node ID value pairs.

The change properly reflects the separation between auto-generated and user-generated key-value pairs, maintaining consistency with the updated data model.

components/core/src/clp/clp/decompression.cpp (3)

10-10: LGTM: Include path update is correct

The relative path update for the IR constants header aligns with the project's file structure.


285-285: LGTM: Parameter name simplification

The simplified parameter name removes redundancy while maintaining clarity. The [[maybe_unused]] attribute is correctly applied.


313-313: Verify the output directory path change impact

The change from get_ir_temp_output_dir() to get_output_dir() needs verification to ensure it doesn't affect other operations expecting the IR temp directory.

✅ Verification successful

Output directory change is safe to proceed

The verification shows that get_ir_temp_output_dir() is not used anywhere in the codebase, while get_output_dir() is consistently used across all output directory handling in both CLP and GLT components. This confirms the change aligns with the codebase's unified approach to output directory management.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for other usages of get_ir_temp_output_dir() to ensure they're updated consistently

# Search for any remaining uses of get_ir_temp_output_dir
rg "get_ir_temp_output_dir" --type cpp

# Search for related IR output directory handling
ast-grep --pattern 'command_line_args.get_output_dir()'

Length of output: 2771

components/clp-package-utils/clp_package_utils/scripts/search.py (3)

10-10: LGTM: Clean import addition

The StorageType import is properly placed and necessary for the new storage type validation functionality.


78-82: Review logging implementation

While the storage type validation is good, there might be an issue with the logging implementation since the logging setup was removed. Consider these alternatives:

  1. Re-implement proper logging setup
  2. Use print statements for error messages
  3. Raise a specific exception instead of logging and returning -1

Let's verify the logging setup in the codebase:

✅ Verification successful

Let me analyze the logging usage in the codebase based on the search results.

The search results show extensive use of logging throughout the codebase with proper logger configuration. Specifically:

  1. The logging pattern logger.error/info/debug is consistently used across multiple components
  2. The logging setup appears to be properly configured as evidenced by:
    • Logging level configuration in extract_stream_task.py: set_logging_level(logger, clp_logging_level)
    • Proper logging handlers in reducer.py: logger.removeHandler(logging_file_handler)
    • Consistent error handling patterns with logging across the codebase

Let me generate the final response:

Logging implementation is properly configured

The logging setup is correctly implemented in the codebase. The error logging at line 80 follows the established pattern used throughout the project and will work as expected.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if logging is configured elsewhere or if other files use similar error handling
rg -g '*.py' 'logger\.(error|warning|info|debug)' -A 3 -B 3

Length of output: 68634


83-83: LGTM: Verify JobType enum usage

The explicit string conversion of JobType.SEARCH is a good practice. However, let's verify that this pattern is consistent across the codebase.

✅ Verification successful

Let's search for both JobType usage and container name generation to understand the patterns better. The previous ast-grep pattern might have failed due to syntax, so we're using ripgrep to get a broader view of the usage patterns.


Pattern is consistent across container name generation

The explicit string conversion of JobType enums is consistently applied across all container name generation calls in the codebase:

  • str(JobType.SEARCH) in search.py
  • str(JobType.FILE_EXTRACTION) in decompress.py
  • str(JobType.IR_EXTRACTION) in decompress.py
  • str(JobType.COMPRESSION) in compress.py

The pattern is uniform and type-safe across all usages, confirming that the change aligns with the codebase's practices.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check JobType usage patterns across the codebase
ast-grep --pattern 'generate_container_name($expr)'

Length of output: 53


Script:

#!/bin/bash
# Search for JobType usage and container name generation patterns
rg "JobType\." -A 2
rg "generate_container_name" -A 2

Length of output: 11598

components/core/src/clp_s/TimestampDictionaryWriter.hpp (5)

4-8: Ensure necessary includes are minimally present.
These added headers (cstdint, map, sstream, string_view, etc.) look appropriate for their usage within this file.


28-28: Empty default constructor is acceptable.
A default constructor with no initializations can be beneficial if the data members have default initializers. If additional initializations are needed, consider adding them in future.


31-34: Method signature for write is straightforward.
This new function is a clear approach for stream-based writing. Consider ensuring you handle any stream errors (e.g., fail states) upstream or downstream.


64-66: Added overload for double timestamp ingestion is well-defined.
Ensure usage scenarios clearly differentiate between double and int64_t timestamps.


84-88: clear() method for resetting internal state is helpful.
No issues noted. This is a clean approach for reuse.

components/core/src/clp/BoundedReader.hpp (8)

1-2: Header guards and includes appear correct.
They are properly set up and there are no cyclical dependencies.

Also applies to: 4-5, 7-8


10-20: Class documentation is clear and descriptive.
It effectively communicates the purpose and constraints of this reader.


23-33: Constructor checks for null and enforces bound.
Your usage of “if (nullptr == m_reader)” is consistent with the coding style, and throwing on invalid input is valid. This helps maintain a strong contract.


35-45: No concerns with doc comments for try_get_pos.
Implementation defers to m_reader, which is correct for a wrapper pattern.


53-53: Method signature for try_seek_from_begin is consistent.
Be mindful to check whether the underlying reader supports rewinding if seeking a position lower than the current.


64-65: Method signature for try_read is consistent with the interface.
Limiting read by bound is handled carefully downstream. No additional remarks.


73-80: try_read_to_delimiter is intentionally unsupported.
Returning ErrorCode_Unsupported is appropriate, preventing leakage beyond the bound.


82-86: Private member variables well declared and initialised.
No concurrency or performance issues identified.

components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py (5)

1-11: Imports are properly organised.
No potential circular dependencies or extraneous imports noted.


12-16: Use of clp_config and SQL_Adapter is consistent with the project structure.
Ensuring these modules are available in the environment is crucial.


21-46: Argument parsing is direct and user-friendly.
Providing a default config path is a helpful convenience for new users.


56-67: Verifies existence of archives directory prior to deletion.
Early return on failure is a clean approach.


138-139: Correct invocation of main via `if name == "main":
Standard practice to ensure code runs only as a script.

components/core/src/clp_s/TimestampDictionaryWriter.cpp (7)

3-6: Includes align with header declarations.
No duplication or missing headers.


12-17: write_timestamp_entries: direct write to stream is valid.
Eliminating compression may require reintroducing it at a different layer if needed.
[Architecture_advice]


21-32: New write method merges ranges before writing.
This is consistent with the removal of file-based compression. Confirm that the function is idempotent if called multiple times.


47-49: ingest_entry now uses std::string_view for key and timestamp.
This is good for performance. Ensure that key references remain valid as needed.


93-95: Overload for double timestamp ingestion.
The logic is straightforward. No concurrency concerns identified.


108-108: Overload for int64_t timestamp ingestion.
Behaviour closely parallels the double-based ingestion. No issues detected.


156-162: clear method: reset all members.
This is a neat approach for reusability. No pitfalls identified.

components/core/src/clp/error_handling/ErrorCode.hpp (3)

15-19: Ensure concept constraints are fully enforced.

The new concept "ErrorCodeEnumType" correctly ensures that only integer-based enumerations are used. This approach helps prevent misusage with incompatible types and is well-structured.


41-43: Confirm robust enum casting.

When converting error numbers back into enums, confirm that the range of valid error codes is not exceeded. Though this seldom occurs with well-defined enumerations, consider adding a verification step if the enum grows large.


128-129: Leverage the default error condition.

Using the default error condition simplifies equivalence checks. This is a straightforward and lightweight approach to error comparison. No issues or race conditions identified.

components/core/src/clp/ffi/KeyValuePairLogEvent.hpp (2)

46-49: Validate schema pointers before usage.

The factory method ensures that schema pointers are non-null before constructing the object. Continue to propagate an error code if a null pointer is encountered to prevent unexpected dereferences.


119-122: Constructor usage of std::move is correct.

The constructor safely moves parameters into member fields. This avoids unnecessary copying when constructing large containers for auto-generated and user-generated keys.

components/core/tests/test-clp_s-end_to_end.cpp (3)

30-44: Good cleanup strategy.

The TestOutputCleaner class responsibly removes all intermediate files and directories. This practice ensures that test results remain deterministic and free from leftover data.


62-87: Appropriate usage of test assertions and filesystem checks.

The logic properly ensures that the directory is created before compression and that it is not empty afterwards. Using "false == std::filesystem::is_empty(...)" matches coding guidelines.


118-143: Safe usage of std::system in a controlled test environment.

Though shelling out can be risky, the test environment is strictly controlled, and the checks for "jq" and "diff" are valid. This approach does not appear to introduce vulnerabilities, given the context and the cleanup protections.

components/core/src/clp_s/ArchiveWriter.hpp (2)

26-26: single_file_archive flag is a welcome addition.

It accommodates a more streamlined archiving process, enabling single-file versus multi-file approaches. This design choice is clear and well-documented.


125-130: Efficient usage of std::string_view.

Switching to std::string_view for timestamp ingestion helps reduce copies, improving performance in high-throughput scenarios. This aligns with modern best practices.

components/core/src/clp/streaming_compression/lzma/Compressor.cpp (2)

17-33: Validate readiness state before opening

The code checks if m_compressed_stream_file_writer is already set, and throws an exception if so. This is good defensive programming to ensure that the compressor is not opened multiple times without being closed first.


35-49: Process unhandled input at close

The logic processes unhandled input data by performing a flush before finalizing the LZMA stream. This ensures pending data is not lost. Ensure that performing flush_lzma(LZMA_FULL_FLUSH) before LZMA_FINISH is universally the intended behaviour.

components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py (2)

Line range hint 141-172: Review logic branching for S3 vs. FS pathways

The function run_clp() now checks worker_config to decide if S3 writes should be enabled. The condition that "S3 not supported by CLP engine except for CLP_S" is enforced with an error message. This ensures clarity and prevents misconfiguration.


239-283: Handle partial archive uploads gracefully

While the code attempts immediate S3 uploads once an archive is complete, it terminates the process upon encountering an error. Evaluate whether partial completion logic or retries are required for large archives. In certain network conditions, immediate termination might hamper recovery strategies.

components/core/src/clp_s/ArchiveWriter.cpp (3)

18-25: Leverage std::filesystem for portability

Switching to std::filesystem is correct for better portability. Ensure consistent usage of error codes and standard library calls.


104-118: Ensure robust checks for empty timestamps

In write_timestamp_dict(), we compress the dictionary content. Confirm that an empty or uninitialised dictionary is valid in all usage scenarios. Potentially add an early return or a log statement if the dictionary is empty.


138-175: ZstdCompressor usage for metadata

Usage of ZstdCompressor for writing metadata is consistent. Confirm that the metadata content remains feasible for Zstd and does not need an alternate algorithm for partial reading. Since metadata is likely smaller than logs, this approach is likely efficient.

components/clp-py-utils/clp_py_utils/clp_config.py (3)

51-54: Validate new storage types carefully

The new StorageType enumeration (FS, S3) broadens config options. Ensure that code gracefully handles unrecognised values, even if that means raising a clear error early.


81-82: Confirm secure password handling

Database credentials are stored in optional fields. Consider verifying that these credentials are masked or avoided in logs to maintain secrecy.


638-640: WorkerConfig fosters reusability

The WorkerConfig class inherits from BaseModel, enabling reuse for different tasks. Confirm that no extraneous fields appear in WorkerConfig for scenarios where streaming is not used.

components/core/src/clp/ffi/KeyValuePairLogEvent.cpp (2)

272-277: Ensure comprehensive sibling and ancestor checks.
The code properly checks for key uniqueness among siblings and iterates upward to catch duplication among ancestors. Review any potential performance impacts of nested loops where large schema trees might exist.

Also applies to: 278-303


Line range hint 429-506: Robustness of DFS-based serialization.
The DFS approach for partial JSON assembly is clear. If numbers of nested or child nodes grow large, confirm that this stack-based approach remains performant and doesn’t risk deep recursion limits.

components/core/tests/test-ffi_KeyValuePairLogEvent.cpp (2)

85-103: 中央ize repeated error-checking logic.
The helper function 'assert_kv_pair_log_event_creation_failure' centralizes repeating logic and clarifies expected errors. This is a good approach for test consistency.


497-517: Confirm consistent handling of sibling key conflicts.
Tests for sibling key conflicts appear thorough. If multiple sibling conflicts can occur simultaneously, consider adding a second pair of duplicate siblings for deeper coverage.

components/core/src/clp_s/CommandLineArguments.cpp (1)

151-153: Defaulting file_type to JSON.
Setting 'file_type' to JSON by default is reasonable. Ensure that calls from other modules expecting KV-IR are updated accordingly, to avoid accidental mismatches.

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (3)

32-32: Check usage of 'StorageType'.
The newly imported 'StorageType' is used below to handle mounting logic for queries. This is consistent and clearly extends existing logic for filesystem-based storage.


46-46: Good usage of 'generate_worker_config'.
Centralizing worker configuration creation promotes maintainability. Confirm that all relevant worker settings are included, especially for newly added compression features.


657-659: Conditional mount for FS-based storage.
Adding 'mounts.archives_output_dir' only when 'StorageType.FS' is used is a sensible approach. Confirm that existing monitor logic triggers an error if the path is missing or offline.

components/clp-py-utils/pyproject.toml (1)

13-13: Verify compatibility and security of new dependencies.

Please ensure:

  1. The specific version of boto3 (^1.35.81) is required, as it's quite precise
  2. The result library (^0.17.0) has no known security vulnerabilities

Let's check for any security advisories:

Also applies to: 23-23

✅ Verification successful

Dependencies are up-to-date and secure

The verification shows:

  • boto3: Latest version is 1.35.85 (very close to specified 1.35.81)
  • result: Using latest version 0.17.0
  • No security vulnerabilities found for either package
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for security advisories and latest versions

# Check PyPI for latest versions and known vulnerabilities
for package in "boto3" "result"; do
  echo "Checking $package..."
  curl -s "https://pypi.org/pypi/$package/json" | jq -r '.info | "Latest version: " + .version'
  
  # Check for security advisories
  gh api graphql -f query="{
    securityVulnerabilities(first: 5, ecosystem: PIP, package: \"$package\") {
      nodes {
        advisory {
          summary
          severity
          publishedAt
        }
        vulnerableVersionRange
        firstPatchedVersion {
          identifier
        }
      }
    }
  }"
done

Length of output: 1121

components/core/src/clp/clo/constants.hpp (1)

8-8: Verify the impact of renamed constants.

The constants have been renamed:

  • Added StreamId (likely replacing removed FileSplitId)
  • Changed IsLastIrChunk to IsLastChunk

Let's verify all usages have been updated:

Also applies to: 11-11

✅ Verification successful

Constants have been properly renamed and updated throughout the codebase

The search results confirm that:

  • No references to old constants (FileSplitId or IsLastIrChunk) exist in the codebase
  • The new constants StreamId and IsLastChunk are consistently used in both declaration and usage:
    • Declared in components/core/src/clp/clo/constants.hpp
    • Used appropriately in components/core/src/clp/clo/clo.cpp
    • Similar constants exist in a separate namespace in archive_constants.hpp with different prefixes (cStreamId, cIsLastChunk)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining references to old constant names

echo "Checking for old constant references..."
rg -l "FileSplitId|IsLastIrChunk" --type cpp --type hpp

echo "Verifying new constant usage..."
rg -A 2 "StreamId|IsLastChunk" --type cpp --type hpp

Length of output: 330


Script:

#!/bin/bash
# Description: Check for any remaining references to old and new constant names using correct file extensions

echo "Checking for old constant references..."
rg -l "FileSplitId|IsLastIrChunk" --type-add 'cpp:*.{cpp,hpp}' --type cpp

echo "Verifying new constant usage..."
rg -A 2 "StreamId|IsLastChunk" --type-add 'cpp:*.{cpp,hpp}' --type cpp

Length of output: 2373

.github/workflows/clp-pr-title-checks.yaml (2)

5-9: Well-implemented security enhancements!

The changes follow security best practices by:

  • Documenting security implications of pull_request_target
  • Following the principle of least privilege with empty default permissions

Also applies to: 13-13


Line range hint 23-24: LGTM - Explicit permission settings

Properly restricts permissions to only what's needed for the semantic PR check.

components/core/src/clp/ffi/ir_stream/IrErrorCode.cpp (1)

8-11: LGTM - Clear category identification

The implementation provides a clear, namespaced identifier for the error category.

components/core/tools/scripts/lib_install/install-boost.sh (1)

37-37: Document dependency addition and verify compatibility

While adding the 'url' library is straightforward, please:

  1. Add a comment explaining why this library is required
  2. Verify that the 'url' library is available in all supported Boost versions

Run this script to check 'url' library availability across versions:

✅ Verification successful

Let me try a different approach to find the Boost version requirements.


Boost URL library compatibility verified

From the CMakeLists.txt, I can see that the project requires Boost 1.81 or higher, which includes the URL library as a stable component. The URL library was officially added to Boost in version 1.81.0 (December 2022), so there are no compatibility concerns.

Suggested documentation to add above line 37:

# URL library required for handling URIs in network communications (added in Boost 1.81.0)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if 'url' library exists in supported Boost versions

# Get the minimum required Boost version from CMakeLists.txt
boost_version=$(fd -t f "CMakeLists.txt" | xargs grep -l "find_package.*Boost" | xargs grep "find_package.*Boost" | grep -oP "(?<=\s)\d+\.\d+\.\d+(?=\s|$)" | head -1)

echo "Checking Boost version $boost_version for 'url' library..."
curl -s "https://www.boost.org/doc/libs/$boost_version/libs/url/doc/html/index.html" -o /dev/null -w "%{http_code}\n"

Length of output: 384


Script:

#!/bin/bash
# Search for Boost version requirements in multiple potential locations
echo "=== Searching in CMakeLists.txt files ==="
rg -i "boost" --type cmake -A 3 -B 3

echo -e "\n=== Searching in build configuration files ==="
fd -e sh "build|install" | xargs rg -i "boost.*version|version.*boost" -A 2 -B 2

echo -e "\n=== Checking install-boost.sh for version info ==="
cat components/core/tools/scripts/lib_install/install-boost.sh

Length of output: 8063

components/core/src/clp/BoundedReader.cpp (1)

8-19: LGTM: Robust seek implementation with proper bounds checking

The implementation correctly handles boundary conditions and maintains consistent state even during errors.

components/core/src/clp/StringReader.cpp (1)

44-47: LGTM: Proper bounds checking added

The implementation correctly handles seeking beyond the string bounds.

components/core/src/clp_s/archive_constants.hpp (2)

7-9: LGTM: Temporary file postfix constant added

The addition of cTmpPostfix follows the established naming convention and is appropriately placed in the constants namespace.


32-32: Verify the impact of terminology changes

The renaming from file-based to stream-based terminology (cOrigFileIdcStreamId and cIsLastIrChunkcIsLastChunk) suggests an architectural shift. Ensure all dependent code has been updated accordingly.

Also applies to: 35-35

components/core/src/clp_s/search/StringLiteral.hpp (1)

72-74: Verify the implementation of StringUtils::has_unescaped_wildcards

The refactoring improves readability by moving the wildcard detection logic to a utility function. However, we should verify its implementation.

✅ Verification successful

Implementation of StringUtils::has_unescaped_wildcards is correct and well-designed

The implementation in Utils.cpp correctly checks for unescaped wildcards ('*' or '?') while properly handling escape sequences. The function:

  • Iterates through the string character by character
  • Returns true if it finds an unescaped '*' or '?'
  • Properly skips escaped characters by incrementing the index when a backslash is encountered
  • Is being used consistently across the codebase (found in Output.cpp and StringLiteral.hpp)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for the implementation of has_unescaped_wildcards
ast-grep --pattern 'bool has_unescaped_wildcards($_) {
  $$$
}'

Length of output: 65


Script:

#!/bin/bash
# Let's try searching for the implementation with ripgrep
rg -A 10 "has_unescaped_wildcards" --type cpp

# Also search for the header declaration
rg "has_unescaped_wildcards" --type hpp

Length of output: 3255

components/core/src/clp_s/ParsedMessage.hpp (3)

7-7: LGTM: Appropriate header inclusion for string_view support

The addition of string_view header supports the new method overloads and follows modern C++ practices.


39-41: LGTM: Well-implemented string_view overload

The implementation efficiently handles string_view parameters by converting to string only when storing in the map, which optimizes string handling performance.


64-66: LGTM: Consistent implementation of unordered value handling

The implementation maintains consistency with the regular add_value method and properly handles string_view conversion.

components/core/src/clp/ffi/ir_stream/ir_unit_deserialization_methods.hpp (1)

60-63: LGTM: Well-documented schema tree separation

The separation of auto-generated and user-generated schema trees is well-implemented and thoroughly documented. The parameter names are descriptive and the documentation clearly explains their purposes.

Verify schema tree parameter usage

Ensure that all callers of this method have been updated to provide both schema tree parameters.

Also applies to: 77-78

✅ Verification successful

Let me gather more context to see the full method call in the Deserializer.hpp file.


Schema tree parameter usage verified successfully

The caller in Deserializer.hpp correctly provides both schema tree parameters (m_auto_gen_keys_schema_tree and m_user_gen_keys_schema_tree) when invoking the deserialize_ir_unit_kv_pair_log_event method. This is the only caller of this method in the codebase, and it has been properly updated to match the new method signature.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for calls to deserialize_ir_unit_kv_pair_log_event
rg -A 3 "deserialize_ir_unit_kv_pair_log_event\(" --type cpp

Length of output: 1393


Script:

#!/bin/bash
# Get more context around the method call in Deserializer.hpp
rg -B 2 -A 5 "deserialize_ir_unit_kv_pair_log_event\(" components/core/src/clp/ffi/ir_stream/Deserializer.hpp

Length of output: 423

components/clp-package-utils/clp_package_utils/scripts/compress.py (1)

61-61: LGTM: Type conversion ensures compatibility

The explicit conversion of JobType.COMPRESSION to string aligns with the updated signature of generate_container_name function.

components/core/tests/test-BoundedReader.cpp (1)

12-99: LGTM: Comprehensive test coverage

The test suite thoroughly covers essential scenarios including boundary conditions, error cases, and normal operations.

components/core/src/clp_s/TimestampPattern.hpp (1)

9-10: LGTM: Performance optimization using string_view

The change from std::string to std::string_view for method parameters is a good optimization as it:

  1. Avoids unnecessary string copies for read-only operations
  2. Maintains backward compatibility with both string literals and std::string
  3. Reduces memory overhead

Also applies to: 88-91, 126-130

components/core/tests/test-StreamingCompression.cpp (4)

Line range hint 7-47: LGTM! Well-structured includes and constants.

The includes are properly ordered, and constants are appropriately scoped in an anonymous namespace. The use of constexpr for buffer and chunk sizes is a good practice for compile-time optimization.


Line range hint 49-105: LGTM! Well-designed helper functions.

The helper functions are well-structured:

  • compress: Efficiently handles compression in chunks
  • decompress_and_compare: Thoroughly validates the decompressed data against the original

Note: The repeated compression of the same data is intentional for testing purposes.


108-119: LGTM! Effective test data generation.

The test setup creates a predictable pattern using the alphabet, which is ideal for compression testing as it provides both repetition and variability.


134-141: ⚠️ Potential issue

Add decompression test for LZMA.

The LZMA section only tests compression, unlike the ZStd and Passthrough sections which test both compression and decompression. This creates an inconsistency in test coverage.

Add decompression testing similar to other sections:

 SECTION("LZMA compression") {
     compressor = std::make_unique<clp::streaming_compression::lzma::Compressor>(
             clp::streaming_compression::lzma::cDefaultCompressionLevel,
             clp::streaming_compression::lzma::cDefaultDictionarySize,
             LZMA_CHECK_CRC64
     );
     compress(std::move(compressor), uncompressed_buffer.data());
+    decompressor = std::make_unique<clp::streaming_compression::lzma::Decompressor>();
+    decompress_and_compare(std::move(decompressor), uncompressed_buffer, decompressed_buffer);
 }

Likely invalid or redundant comment.

components/core/src/clp/clp/FileDecompressor.hpp (1)

9-19: LGTM! Minor path and parameter name updates.

The changes only involve:

  • Updating include paths to use relative paths
  • Renaming parameter from is_last_ir_chunk to is_last_chunk for better clarity

Also applies to: 42-42

components/core/src/clp/clo/OutputHandler.cpp (1)

103-103: LGTM! Consistent namespace usage.

The update to use cResultsCacheKeys::SearchOutput:: namespace for BSON document keys improves code organization and maintainability.

components/clp-py-utils/clp_py_utils/initialize-orchestration-db.py (1)

55-55: Verify the necessity of increasing the VARCHAR length.

The change increases the maximum length of status messages from 255 to 512 characters. While this change is backwards compatible, it could impact storage requirements if there are many rows.

✅ Verification successful

Increasing VARCHAR length to 512 is justified

The codebase analysis reveals that status_msg is primarily used to store error messages that are concatenated from multiple sources:

  • Error messages from S3 client errors (including AWS error messages)
  • Multiple task errors concatenated with newlines (error_message += f"task {task_result.task_id}: {task_result.error_message}\n")
  • Detailed error information including log file paths (error_msgs.append(f"See logs {stderr_log_path}"))

These concatenated messages, especially when including multiple task failures and log paths, could easily exceed 255 characters, making the increase to 512 a necessary change to prevent message truncation.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if there are any status messages that would benefit from the increased length.

# Search for status message assignments in the codebase
rg -A 2 'status_msg.*=' 

# Search for any existing truncation handling
rg 'truncate.*status_msg|status_msg.*truncate'

Length of output: 765


Script:

#!/bin/bash
# Search for status message content or error messages that might be stored in status_msg
rg -A 3 'status_msg.*=.*' --type py

# Search for SQL operations involving status_msg to understand its usage
rg -A 2 'status_msg.*FROM|INSERT.*status_msg|UPDATE.*status_msg' --type py

# Look for any error message construction that might be stored in status_msg
rg -A 2 'error_message.*=|error_msg.*=' --type py

Length of output: 6970

components/core/src/clp_s/JsonConstructor.cpp (1)

125-125: Verify MongoDB queries for renamed fields.

The field names in the BSON document have been changed:

  • cOrigFileIdcStreamId
  • cIsLastIrChunkcIsLastChunk

These changes could impact existing MongoDB queries and data.

Also applies to: 137-137

components/core/tests/test-error_handling.cpp (1)

46-93: LGTM! Well-implemented error category methods.

The implementation is clean and efficient:

  • Proper use of std::any_of for condition matching
  • Comprehensive switch statements with default cases
  • Consistent error message handling
components/core/src/clp/streaming_compression/lzma/Compressor.hpp (1)

89-180: Well-designed LZMA stream wrapper.

The LzmaStream class effectively encapsulates LZMA functionality with proper:

  • Resource management
  • Error handling
  • Move semantics
  • Input/output stream management
components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)

8-8: LGTM: Clean import addition

The addition of StorageType from clp_py_utils.clp_config is well-placed and necessary for the new storage type validation.

components/core/src/clp/ffi/SchemaTree.hpp (1)

131-131: LGTM: Appropriate use of defaulted operator==

The defaulted equality operator is appropriate here as it will perform member-wise comparison of all Node members.

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)

Line range hint 1-1: Unable to review the file.

The file's code changes are not visible in the provided content, making it impossible to perform a meaningful review.

components/core/tests/test-NetworkReader.cpp (2)

199-199: Good type choice improvement.

Changed from int to size_t for better type consistency with container sizes.


227-228: Good error handling with optional.

Using std::optional for content provides clear error handling and state management.

components/core/src/clp_s/TimestampEntry.cpp (1)

4-6: LGTM: Clean header organization.

The addition of the stream header and proper ordering with local headers is correct.

components/core/src/reducer/reducer_server.cpp (2)

208-209: LGTM: Consistent timer API usage with improved readability

The change maintains consistency with the modern boost::asio timer API while improving code readability through better line formatting.


124-124: LGTM: Timer API update follows best practices

The change from expires_from_now to expires_after aligns with modern boost::asio timer API recommendations.

✅ Verification successful

Timer API update is consistently implemented

The codebase shows consistent usage of the modern expires_after API throughout, with no instances of the deprecated expires_from_now. The timer implementation is properly using boost::asio::steady_timer with the recommended API.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify consistent usage of expires_after across the codebase
rg -l "expires_from_now" 

Length of output: 24


Script:

#!/bin/bash
# Search for both expires_from_now and expires_after to understand the timer API usage
rg "expires_(from_now|after)" -A 2

Length of output: 670


Script:

#!/bin/bash
# Look for timer-related includes to verify boost::asio usage
rg "boost.*timer" --type cpp

Length of output: 245


Script:

#!/bin/bash
# Search for timer-related class and method definitions
ast-grep --pattern 'class $_ {
  $$$
  operator()($$$) {
    $$$
    expires_$_($$$);
    $$$
  }
}'

Length of output: 102

components/core/src/clp_s/clp-s.cpp (1)

91-99: LGTM: Configuration options properly initialized

The new configuration options are correctly initialized from command line arguments, enhancing the flexibility of the JSON parser.

components/clp-package-utils/clp_package_utils/general.py (3)

244-254: LGTM: Improved encapsulation through proper method usage

The change properly utilizes setter and getter methods instead of direct property access, improving encapsulation and maintainability.


273-282: LGTM: Well-structured worker configuration generation

The new function properly encapsulates worker configuration generation with appropriate deep copying to prevent shared state issues.


499-499: LGTM: Enhanced archive output validation

The change to use validate_archive_output_config suggests more comprehensive validation, which aligns with the S3 integration objectives.

✅ Verification successful

Method change verified and aligns with S3 integration requirements

The implementation confirms that validate_archive_output_config performs additional S3-specific validation, ensuring that S3 storage type is only used with the appropriate storage engine (CLP_S). This change is necessary and correctly implemented across the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify consistent validation method usage
rg "validate_archive_output" 

Length of output: 359


Script:

#!/bin/bash
# Get implementation of validate_archive_output_config
ast-grep --pattern 'def validate_archive_output_config($$$)'

Length of output: 62


Script:

#!/bin/bash
# Get implementation details using ripgrep with context
rg -A 10 "def validate_archive_output_config" components/clp-py-utils/clp_py_utils/clp_config.py

Length of output: 577

components/core/CMakeLists.txt (1)

537-539: ⚠️ Potential issue

Missing LZMA decompressor implementation

The LZMA compression implementation is incomplete. Other compression methods (zstd, passthrough) have both compressor and decompressor implementations.

Required files:

  • src/clp/streaming_compression/lzma/Decompressor.cpp
  • src/clp/streaming_compression/lzma/Decompressor.hpp

Would you like me to help create these files?

⛔ Skipped due to learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/src/clp/streaming_compression/lzma/Compressor.cpp:169-170
Timestamp: 2024-12-19T01:39:30.094Z
Learning: In the LzmaStreamOperations context (Compressor.cpp), m_p is guaranteed never to be null.
components/core/src/clp_s/JsonParser.cpp (1)

40-40: LGTM: Proper initialization of single file archive option

The initialization of m_archive_options.single_file_archive is correctly placed in the constructor's initialization list and properly copies the value from the constructor parameter.

components/core/src/clp/clo/CommandLineArguments.cpp (1)

11-11: LGTM: Added necessary include for reducer types

The addition of the reducer types header is appropriate for the functionality.

components/core/src/clp_s/TimestampPattern.cpp (4)

7-8: LGTM: Proper include and using declarations for string_view

The addition of string_view-related includes and using directive is appropriate for the string handling improvements.

Also applies to: 17-17


77-77: LGTM: Efficient string parameter passing using string_view

Changed string parameters to string_view for better performance by avoiding unnecessary string copies.

Also applies to: 95-95, 131-131, 160-160, 312-312, 348-348


1073-1073: LGTM: Consistent string conversion usage

Changed std::to_string to to_string for consistency with the using directive at the top of the file.

Also applies to: 1080-1080


833-846: Verify the floating-point timestamp parsing logic

The changes to floating-point timestamp parsing look correct but should be verified with test cases, particularly around the nanosecond precision handling.

✅ Verification successful

Floating-point timestamp parsing is well-tested

The test file shows comprehensive testing of timestamp parsing, including:

The changes to the floating-point timestamp parsing are supported by existing test cases that verify both the parsing logic and precision handling.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for test cases that verify floating-point timestamp parsing
rg -A 5 "TEST.*floating.*point.*timestamp" || rg -A 5 "TEST.*nanosecond.*precision"

Length of output: 84


Script:

#!/bin/bash
# Search for test files related to TimestampPattern
fd "Test.*\.cpp" | grep -i "timestamp"

# Then search for test cases involving timestamp parsing
rg -A 10 "TEST.*TimestampPattern" 

# Look for any test cases with decimal points or dots in timestamps
rg -A 5 "TEST.*[0-9]+\.[0-9]+" components/core/test

Length of output: 1037


Script:

#!/bin/bash
# Let's examine the test cases in test-TimestampPattern.cpp
cat components/core/tests/test-TimestampPattern.cpp

# Also search for any test cases with nanosecond timestamps
rg -A 5 "\..*[0-9]{9}" components/core/tests/test-TimestampPattern.cpp

Length of output: 23149

components/core/tests/test-ir_encoding_methods.cpp (2)

1249-1249: Method name change reflects improved specificity.

The change from get_node_id_value_pairs() to get_user_gen_node_id_value_pairs() better represents the actual data being retrieved, making the test more precise.


1254-1256: Enhanced test coverage with explicit data type validation.

The new code properly separates and validates auto-generated and user-generated data:

  1. Destructures the serialization result into auto-generated and user-generated components
  2. Verifies that auto-generated data is empty
  3. Validates that user-generated data matches the expected JSON
components/core/src/clp_s/CommandLineArguments.hpp (4)

39-42: LGTM! Well-structured enum definition.

The FileType enum is well-defined with clear values and uses space-efficient uint8_t as the underlying type.


110-110: LGTM! Clear and concise getter method.

The method follows the getter naming convention and has a straightforward implementation.


124-124: LGTM! Well-implemented getter with modern C++ practices.

The method uses [[nodiscard]] appropriately and follows the codebase's trailing return type convention.


188-188: LGTM! Well-structured member variables with appropriate defaults.

The member variables follow naming conventions and have sensible default values.

Also applies to: 194-194

components/core/src/clp/clo/clo.cpp (1)

174-174: LGTM! Consistent and well-documented changes.

The changes improve clarity through better naming and align with the updated functionality:

  • Parameter name change from is_last_ir_chunk to is_last_chunk improves readability
  • Key usage updated from OrigFileId to StreamId reflects the new schema
  • Output directory method call updated to use the new functionality

Also applies to: 198-198, 210-211, 223-223

components/core/src/clp/ffi/ir_stream/ir_unit_deserialization_methods.cpp (1)

554-555: LGTM! Changes align with shared_ptr requirements.

The modifications correctly implement the requirement for std::shared_ptr<SchemaTree> in KeyValuePairLogEvent::create, as noted in the learnings. The separation of auto-generated and user-generated schema trees improves code organization and type safety.

Also applies to: 583-585

Comment on lines 70 to 136
def _delete_archives(
archives_dir: Path,
database_config: Database,
begin_ts: int,
end_ts: int,
) -> int:
"""
Deletes all archives where `begin_ts <= archive.begin_timestamp` and
`archive.end_timestamp <= end_ts` from both the metadata database and disk.
:param archives_dir:
:param database_config:
:param begin_ts:
:param end_ts:
:return: 0 on success, -1 otherwise.
"""

archive_ids: List[str]
logger.info("Starting to delete archives from the database.")
try:
sql_adapter = SQL_Adapter(database_config)
clp_db_connection_params = database_config.get_clp_connection_params_and_type(True)
table_prefix = clp_db_connection_params["table_prefix"]
with closing(sql_adapter.create_connection(True)) as db_conn, closing(
db_conn.cursor(dictionary=True)
) as db_cursor:
db_cursor.execute(
f"""
DELETE FROM `{table_prefix}archives`
WHERE begin_timestamp >= %s AND end_timestamp <= %s
RETURNING id
""",
(begin_ts, end_ts),
)
results = db_cursor.fetchall()

if 0 == len(results):
logger.info("No archives (exclusively) within the specified time range.")
return 0

archive_ids = [result["id"] for result in results]
db_cursor.execute(
f"""
DELETE FROM `{table_prefix}files`
WHERE archive_id in ({', '.join(['%s'] * len(archive_ids))})
""",
archive_ids,
)
db_conn.commit()
except Exception:
logger.exception("Failed to delete archives from the database. Aborting deletion.")
return -1

logger.info(f"Finished deleting archives from the database.")

for archive_id in archive_ids:
archive_path = archives_dir / archive_id
if not archive_path.is_dir():
logger.warning(f"Archive {archive_id} is not a directory. Skipping deletion.")
continue

logger.info(f"Deleting archive {archive_id} from disk.")
shutil.rmtree(archive_path)

logger.info(f"Finished deleting archives from disk.")

return 0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Comprehensive _delete_archives function.
Involves database operations and file system updates. Ensure that partial successes (e.g., database updated but partial disk deletion) are either handled or documented.

Consider a transaction or rollback approach, or at least confirm partial cleanup if an error occurs during disk deletion.

Comment on lines 120 to 136
void ArchiveWriter::write_single_file_archive(std::vector<ArchiveFileInfo> const& files) {
std::string single_file_archive_path = (std::filesystem::path(m_archives_dir) / m_id).string();
FileWriter archive_writer;
archive_writer.open(single_file_archive_path, FileWriter::OpenMode::CreateForWriting);

write_archive_metadata(archive_writer, files);
size_t metadata_section_size = archive_writer.get_pos() - sizeof(ArchiveHeader);
write_archive_files(archive_writer, files);
m_compressed_size = archive_writer.get_pos();
write_archive_header(archive_writer, metadata_section_size);

archive_writer.close();
std::error_code ec;
if (false == std::filesystem::remove(m_archive_path, ec)) {
throw OperationFailed(ErrorCodeFileExists, __FILENAME__, __LINE__);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Temporary directory cleanup

In write_single_file_archive(), the code removes the temporary directory (m_archive_path). Confirm that it contains only ephemeral data or data that is fully transferred. For large archives, a partial error scenario could result in data loss if not carefully handled.

Comment on lines 26 to 27
if src_file.stat().st_size > 5 * 1024 * 1024 * 1024:
return Err(f"{src_file} is larger than the limit (5GiB) for a single PutObject operation.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Extract file size limit to a constant

The 5 GiB limit should be defined as a named constant at the module level for better maintainability.

+# Maximum file size for S3 PutObject operation (5 GiB)
+MAX_S3_PUT_OBJECT_SIZE = 5 * 1024 * 1024 * 1024
+
 def s3_put(
     s3_config: S3Config, src_file: Path, dest_file_name: str, total_max_attempts: int = 3
 ) -> Result[bool, str]:
     # ...
-    if src_file.stat().st_size > 5 * 1024 * 1024 * 1024:
+    if src_file.stat().st_size > MAX_S3_PUT_OBJECT_SIZE:

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines 103 to 104
# Remove generated files
generated_config_path_on_host.unlink()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add error handling for file cleanup

The cleanup operation should handle potential failures when removing generated files.

-    generated_config_path_on_host.unlink()
+    try:
+        if generated_config_path_on_host.exists():
+            generated_config_path_on_host.unlink()
+    except Exception as e:
+        logger.warning(f"Failed to clean up generated config file: {e}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Remove generated files
generated_config_path_on_host.unlink()
# Remove generated files
try:
if generated_config_path_on_host.exists():
generated_config_path_on_host.unlink()
except Exception as e:
logger.warning(f"Failed to clean up generated config file: {e}")

Comment on lines 58 to 60
except:
logger.exception("Failed to load config.")
return -1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add specific error handling for different failure scenarios

The bare except clause catches all exceptions without distinguishing between different error types. This could mask specific issues that require different handling.

-    except:
+    except (FileNotFoundError, PermissionError) as e:
+        logger.exception(f"Failed to access config file: {e}")
+        return -1
+    except ValueError as e:
+        logger.exception(f"Invalid config file format: {e}")
+        return -1
+    except Exception as e:
+        logger.exception(f"Unexpected error loading config: {e}")
+        return -1
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
except:
logger.exception("Failed to load config.")
return -1
except (FileNotFoundError, PermissionError) as e:
logger.exception(f"Failed to access config file: {e}")
return -1
except ValueError as e:
logger.exception(f"Invalid config file format: {e}")
return -1
except Exception as e:
logger.exception(f"Unexpected error loading config: {e}")
return -1

Comment on lines 100 to 101
cmd = container_start_cmd + del_archive_cmd
subprocess.run(cmd, check=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Add error handling for subprocess execution

The subprocess execution lacks proper error handling and logging for potential failures.

-    subprocess.run(cmd, check=True)
+    try:
+        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
+    except subprocess.CalledProcessError as e:
+        logger.error(f"Archive deletion failed: {e.stderr}")
+        return -1
+    except Exception as e:
+        logger.error(f"Failed to execute archive deletion: {e}")
+        return -1

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines 258 to 267
/**
* Writes a numeric value to a stringstream.
* @param stream
* @param value
* @tparam ValueType
*/
template <typename ValueType>
void write_numeric_value(std::stringstream& stream, ValueType value) {
stream.write(reinterpret_cast<char*>(&value), sizeof(value));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Critical: Potential undefined behavior in write_numeric_value

The current implementation has several issues:

  1. Using reinterpret_cast on potentially unaligned data can cause undefined behavior
  2. No endianness handling for cross-platform compatibility
  3. No error handling for stream write failures

Consider this safer implementation:

 template <typename ValueType>
-void write_numeric_value(std::stringstream& stream, ValueType value) {
-    stream.write(reinterpret_cast<char*>(&value), sizeof(value));
+void write_numeric_value(std::stringstream& stream, ValueType value) {
+    // Use bit_cast for safe type punning
+    std::array<char, sizeof(ValueType)> buffer;
+    std::memcpy(buffer.data(), &value, sizeof(ValueType));
+    
+    // Handle endianness if needed
+    if constexpr (sizeof(ValueType) > 1) {
+        // Convert to network byte order for consistent cross-platform behavior
+        // Implementation depends on your endianness requirements
+    }
+    
+    if (!stream.write(buffer.data(), sizeof(ValueType))) {
+        throw std::runtime_error("Failed to write numeric value to stream");
+    }
+}

Committable suggestion skipped: line range outside the PR's diff.

components/clp-package-utils/clp_package_utils/general.py Outdated Show resolved Hide resolved
components/core/CMakeLists.txt Outdated Show resolved Hide resolved
Copy link
Contributor Author

@Bill-hbrhbr Bill-hbrhbr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All comments are straightforward suggested changes.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
components/core/src/clp/streaming_compression/lzma/Compressor.cpp (3)

41-43: Improve warning message clarity.

The warning message is verbose and can be more concise while maintaining its meaning.

-        SPDLOG_WARN("Trying to close LZMA compressor with unprocessed input data. Processing and "
-                    "flushing remaining data.");
+        SPDLOG_WARN("Processing and flushing remaining data before closing LZMA compressor.");

91-93: Improve error message specificity for LZMA_BUF_ERROR.

The error message could be more specific about the actual condition causing the error.

-                SPDLOG_ERROR("LZMA compressor input stream is corrupt. No encoding "
-                             "progress can be made.");
+                SPDLOG_ERROR("LZMA compression failed: Unable to make progress with available input and output buffers.");

179-200: Refactor error message handling in LzmaStream constructor.

Consider using a map or constexpr function to handle error messages more elegantly.

-    char const* msg{nullptr};
-    switch (rc) {
-        case LZMA_MEM_ERROR:
-            msg = "Memory allocation failed";
-            break;
-        case LZMA_OPTIONS_ERROR:
-            msg = "Specified preset is not supported";
-            break;
-        case LZMA_UNSUPPORTED_CHECK:
-            msg = "Specified integrity check is not supported";
-            break;
-        case LZMA_PROG_ERROR:
-            msg = "Input arguments are not sane";
-            break;
-        default:
-            msg = "Unknown error";
-            break;
-    }
+    static constexpr auto get_error_message(lzma_ret code) -> char const* {
+        switch (code) {
+            case LZMA_MEM_ERROR:
+                return "Memory allocation failed";
+            case LZMA_OPTIONS_ERROR:
+                return "Specified preset is not supported";
+            case LZMA_UNSUPPORTED_CHECK:
+                return "Specified integrity check is not supported";
+            case LZMA_PROG_ERROR:
+                return "Input arguments are not sane";
+            default:
+                return "Unknown error";
+        }
+    }
+    auto const msg = get_error_message(rc);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fcfc73a and d139bd5.

📒 Files selected for processing (2)
  • components/core/src/clp/streaming_compression/lzma/Compressor.cpp (1 hunks)
  • components/core/src/clp/streaming_compression/lzma/Compressor.hpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/core/src/clp/streaming_compression/lzma/Compressor.hpp
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp/streaming_compression/lzma/Compressor.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)
components/core/src/clp/streaming_compression/lzma/Compressor.cpp (1)
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/src/clp/streaming_compression/lzma/Compressor.cpp:169-170
Timestamp: 2024-12-19T01:39:30.094Z
Learning: In the LzmaStreamOperations context (Compressor.cpp), m_p is guaranteed never to be null.
🔇 Additional comments (1)
components/core/src/clp/streaming_compression/lzma/Compressor.cpp (1)

148-151: 🛠️ Refactor suggestion

Add error handling for write operations.

The write operation to the file writer should be wrapped in a try-catch block to handle potential I/O errors.

+    try {
         m_compressed_stream_file_writer->write(
                 clp::size_checked_pointer_cast<char>(m_compressed_stream_block_buffer.data()),
                 cCompressedStreamBlockBufferSize - m_lzma_stream.avail_out()
         );
+    } catch (const std::exception& e) {
+        SPDLOG_ERROR("Failed to write compressed data: {}", e.what());
+        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+    }

Likely invalid or redundant comment.

davidlion
davidlion previously approved these changes Dec 20, 2024
Copy link
Contributor

@haiqi96 haiqi96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of the code's behavior looks weird to me:

  1. In write(), we always call detach_lzma_input before exiting, which will set m_lzma_stream.avail_in() to be 0.
  2. In close(), we call flush_lzma(LZMA_FULL_FLUSH) if (m_lzma_stream.avail_in() > 0). Because of (1), this should never happen.
  3. In flush(), we always call flush_lzma(LZMA_SYNC_FLUSH); Similarly, because of (1), m_lzma_stream.avail_in() = 0

It looks to me that there's some contradiction.
What I read from 2 is that flush_lzma(LZMA_SYNC_FLUSH) has something to do with avail_in. but in 3, we are not checking the avail_in at all.

I didn't read LZMA functions carefully, but my guess is that:
flush_lzma(LZMA_SYNC_FLUSH) has something to do with the data internally buffered by lzma_string. so we should call it in flush()
flush_lzma(LZMA_FINISH) may or may not handle the internally buffered data.

  • If it does not handle the internally buffered data, shouldn't close() always call flush_lzma(LZMA_SYNC_FLUSH) before calling flush_lzma(LZMA_FINISH)?
  • If it handle the internally buffered data, close() won't need to call flush_lzma(LZMA_SYNC_FLUSH).

If what my guess is wrong and there are some subtlety in the lzma flush API, then we should document it in flush_lzma()'s docstring,

@Bill-hbrhbr Bill-hbrhbr requested review from kirkrodrigues and removed request for kirkrodrigues December 20, 2024 22:11
@Bill-hbrhbr
Copy link
Contributor Author

Bill-hbrhbr commented Dec 20, 2024

Part of the code's behavior looks weird to me:

  1. In write(), we always call detach_lzma_input before exiting, which will set m_lzma_stream.avail_in() to be 0.
  2. In close(), we call flush_lzma(LZMA_FULL_FLUSH) if (m_lzma_stream.avail_in() > 0). Because of (1), this should never happen.
  3. In flush(), we always call flush_lzma(LZMA_SYNC_FLUSH); Similarly, because of (1), m_lzma_stream.avail_in() = 0

It looks to me that there's some contradiction. What I read from 2 is that flush_lzma(LZMA_SYNC_FLUSH) has something to do with avail_in. but in 3, we are not checking the avail_in at all.

I didn't read LZMA functions carefully, but my guess is that: flush_lzma(LZMA_SYNC_FLUSH) has something to do with the data internally buffered by lzma_string. so we should call it in flush() flush_lzma(LZMA_FINISH) may or may not handle the internally buffered data.

  • If it does not handle the internally buffered data, shouldn't close() always call flush_lzma(LZMA_SYNC_FLUSH) before calling flush_lzma(LZMA_FINISH)?
  • If it handle the internally buffered data, close() won't need to call flush_lzma(LZMA_SYNC_FLUSH).

If what my guess is wrong and there are some subtlety in the lzma flush API, then we should document it in flush_lzma()'s docstring,

LZMA_FINISH also flushes the internally buffered data, and potentially do some tear-down operations.
LZMA_SYNC_FLUSH only flushes but allows existing writes to continue/new writes coming in.
LZMA_FULL_FLUSH not only flushes but also processes input data until it exhausts. After that, need to have new input sources.

For (2) I'll make it a class operation error instead of spdlog error

@haiqi96
Copy link
Contributor

haiqi96 commented Dec 20, 2024

Part of the code's behavior looks weird to me:

  1. In write(), we always call detach_lzma_input before exiting, which will set m_lzma_stream.avail_in() to be 0.
  2. In close(), we call flush_lzma(LZMA_FULL_FLUSH) if (m_lzma_stream.avail_in() > 0). Because of (1), this should never happen.
  3. In flush(), we always call flush_lzma(LZMA_SYNC_FLUSH); Similarly, because of (1), m_lzma_stream.avail_in() = 0

It looks to me that there's some contradiction. What I read from 2 is that flush_lzma(LZMA_SYNC_FLUSH) has something to do with avail_in. but in 3, we are not checking the avail_in at all.
I didn't read LZMA functions carefully, but my guess is that: flush_lzma(LZMA_SYNC_FLUSH) has something to do with the data internally buffered by lzma_string. so we should call it in flush() flush_lzma(LZMA_FINISH) may or may not handle the internally buffered data.

  • If it does not handle the internally buffered data, shouldn't close() always call flush_lzma(LZMA_SYNC_FLUSH) before calling flush_lzma(LZMA_FINISH)?
  • If it handle the internally buffered data, close() won't need to call flush_lzma(LZMA_SYNC_FLUSH).

If what my guess is wrong and there are some subtlety in the lzma flush API, then we should document it in flush_lzma()'s docstring,

LZMA_FINISH also flushes the internally buffered data, and potentially do some tear-down operations. LZMA_SYNC_FLUSH only flushes but allows existing writes to continue/new writes coming in. LZMA_FULL_FLUSH not only flushes but also processes input data until it exhausts. After that, need to have new input sources.

For (2) I'll make it a class operation error instead of spdlog error

I see, yeah I think the change is fair

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
components/core/src/clp/streaming_compression/lzma/Compressor.cpp (3)

17-33: Consider adding documentation for the open method.

The method handles initialization well, but adding documentation about its purpose and possible exceptions would improve maintainability.


35-47: Consider handling remaining data in close().

The method throws an error if there's unprocessed input data. Consider processing the remaining data before closing to ensure no data is lost.

 if (m_lzma_stream.avail_in() > 0) {
-    throw OperationFailed(ErrorCode_Corrupt, __FILENAME__, __LINE__);
+    SPDLOG_WARN("Processing remaining data before closing");
+    encode_lzma();
 }

160-202: Consider separating error handling into a dedicated method.

The error handling logic in the constructor is quite lengthy. Consider extracting it into a separate private method for better maintainability.

+    private:
+        static auto get_error_message(lzma_ret rc) -> const char* {
+            switch (rc) {
+                case LZMA_MEM_ERROR:
+                    return "Memory allocation failed";
+                case LZMA_OPTIONS_ERROR:
+                    return "Specified preset is not supported";
+                case LZMA_UNSUPPORTED_CHECK:
+                    return "Specified integrity check is not supported";
+                case LZMA_PROG_ERROR:
+                    return "Input arguments are not sane";
+                default:
+                    return "Unknown error";
+            }
+        }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 68c4c36 and dcb843e.

📒 Files selected for processing (2)
  • components/core/src/clp/streaming_compression/lzma/Compressor.cpp (1 hunks)
  • components/core/src/clp/streaming_compression/lzma/Compressor.hpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/core/src/clp/streaming_compression/lzma/Compressor.hpp
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp/streaming_compression/lzma/Compressor.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)
components/core/src/clp/streaming_compression/lzma/Compressor.cpp (1)
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/src/clp/streaming_compression/lzma/Compressor.cpp:169-170
Timestamp: 2024-12-19T01:39:30.094Z
Learning: In the LzmaStreamOperations context (Compressor.cpp), m_p is guaranteed never to be null.
🔇 Additional comments (2)
components/core/src/clp/streaming_compression/lzma/Compressor.cpp (2)

1-16: LGTM! Well-organized includes and namespace declaration.

The includes are properly organized and all necessary dependencies are present.


142-158: 🛠️ Refactor suggestion

Handle potential write failures.

The write operation to m_compressed_stream_file_writer should be wrapped in a try-catch block to handle potential I/O errors.

+    try {
         m_compressed_stream_file_writer->write(
                 clp::size_checked_pointer_cast<char>(m_compressed_stream_block_buffer.data()),
                 cCompressedStreamBlockBufferSize - m_lzma_stream.avail_out()
         );
+    } catch (const std::exception& e) {
+        SPDLOG_ERROR("Failed to write compressed data: {}", e.what());
+        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+    }

Likely invalid or redundant comment.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
components/core/tests/test-StreamingCompression.cpp (1)

63-72: Consider enhancing error handling in the compress function

While the implementation is correct, consider adding error handling for the file operations and compression steps.

 auto compress(std::unique_ptr<Compressor> compressor, char const* src) -> void {
     FileWriter file_writer;
-    file_writer.open(string(cCompressedFilePath), FileWriter::OpenMode::CREATE_FOR_WRITING);
+    auto const result = file_writer.open(string(cCompressedFilePath), FileWriter::OpenMode::CREATE_FOR_WRITING);
+    REQUIRE(ErrorCode_Success == result);
     compressor->open(file_writer);
     for (auto const chunk_size : cCompressionChunkSizes) {
-        compressor->write(src, chunk_size);
+        REQUIRE(ErrorCode_Success == compressor->write(src, chunk_size));
     }
     compressor->close();
     file_writer.close();
 }
components/core/src/clp/streaming_compression/lzma/Compressor.cpp (3)

35-47: Consider error handling refinement in close().

The error code ErrorCode_Corrupt on line 41 might not accurately represent the condition of having unprocessed input data. Consider using ErrorCode_Failure or a more descriptive error code.

-        throw OperationFailed(ErrorCode_Corrupt, __FILENAME__, __LINE__);
+        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);

79-100: Enhance LZMA_BUF_ERROR handling in encode_lzma.

The error message could be more specific about the nature of the error, especially since we know both buffers are available in this context.

             case LZMA_BUF_ERROR:
                 SPDLOG_ERROR("LZMA compressor input stream is corrupt. No encoding "
-                             "progress can be made.");
+                             "progress can be made with available input and output buffers.");
                 throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);

191-194: Consider more specific error handling for LZMA_PROG_ERROR.

The error message could provide more specific information about which input arguments are invalid.

         case LZMA_PROG_ERROR:
-            msg = "Input arguments are not sane";
+            msg = "Invalid input: compression level, dictionary size, or check parameter is out of range";
             break;
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dcb843e and b20162f.

📒 Files selected for processing (2)
  • components/core/src/clp/streaming_compression/lzma/Compressor.cpp (1 hunks)
  • components/core/tests/test-StreamingCompression.cpp (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
components/core/tests/test-StreamingCompression.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/streaming_compression/lzma/Compressor.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (2)
components/core/tests/test-StreamingCompression.cpp (1)
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/tests/test-StreamingCompression.cpp:45-54
Timestamp: 2024-12-04T15:50:37.827Z
Learning: In `components/core/tests/test-StreamingCompression.cpp`, within the `compress` function, compressing the same data repeatedly by passing the same `src` pointer without advancing is intentional to test the compressor with the same data multiple times.
components/core/src/clp/streaming_compression/lzma/Compressor.cpp (1)
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/src/clp/streaming_compression/lzma/Compressor.cpp:169-170
Timestamp: 2024-12-19T01:39:30.094Z
Learning: In the LzmaStreamOperations context (Compressor.cpp), m_p is guaranteed never to be null.
🔇 Additional comments (5)
components/core/tests/test-StreamingCompression.cpp (3)

7-8: LGTM! Good use of string_view for optimization

The addition of string_view and related headers is appropriate, and the organization of includes is clean and logical.

Also applies to: 12-12, 32-33


35-61: Well-structured test constants and helper functions

The interleaved no-op (0-sized) chunks in cCompressionChunkSizes effectively test the compressor state integrity between operations. The use of string_view for constants and clear function signatures enhance code efficiency.


140-143: Document LZMA section's compression-only scope

The LZMA section differs from other compression tests as it only tests compression without decompression. This should be documented for clarity.

-    SECTION("LZMA compression") {
+    SECTION("LZMA compression (compression-only test)") {
+        // Note: This section intentionally tests only compression functionality
+        // without decompression, as per the current implementation scope.
         compressor = std::make_unique<clp::streaming_compression::lzma::Compressor>();
         compress(std::move(compressor), uncompressed_buffer.data());
     }
components/core/src/clp/streaming_compression/lzma/Compressor.cpp (2)

1-16: LGTM! Well-organized includes and namespace declaration.

The includes are logically grouped (standard library, external dependencies, and project headers) and all necessary dependencies are present.


142-158: LGTM! Robust buffer management implementation.

The method properly handles buffer checks, writes, and memory management with appropriate error handling.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
components/core/tests/test-StreamingCompression.cpp (1)

34-54: Consider adding documentation for the helper functions

The constants and function declarations are well-structured, but adding documentation would improve maintainability.

Add function documentation like this:

+/**
+ * Compresses the source data using the provided compressor with varying chunk sizes
+ * @param compressor The compression algorithm implementation
+ * @param src The source data to compress
+ */
auto compress(std::unique_ptr<Compressor> compressor, char const* src) -> void;

+/**
+ * Decompresses the data and verifies it matches the original content
+ * @param decompressor The decompression algorithm implementation
+ * @param uncompressed_buffer The original uncompressed data for comparison
+ * @param decompressed_buffer Buffer to store decompressed data
+ */
auto decompress_and_compare(
        std::unique_ptr<Decompressor> decompressor,
        Array<char> const& uncompressed_buffer,
        Array<char>& decompressed_buffer
) -> void;
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b20162f and df41b22.

📒 Files selected for processing (1)
  • components/core/tests/test-StreamingCompression.cpp (4 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
components/core/tests/test-StreamingCompression.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)
components/core/tests/test-StreamingCompression.cpp (1)
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/tests/test-StreamingCompression.cpp:45-54
Timestamp: 2024-12-04T15:50:37.827Z
Learning: In `components/core/tests/test-StreamingCompression.cpp`, within the `compress` function, compressing the same data repeatedly by passing the same `src` pointer without advancing is intentional to test the compressor with the same data multiple times.
🔇 Additional comments (4)
components/core/tests/test-StreamingCompression.cpp (4)

7-8: LGTM! Header includes and using declarations are well-organized

The additions follow the existing pattern and improve code readability.

Also applies to: 20-20, 31-32


Line range hint 56-104: LGTM! Helper functions are well-implemented

The functions demonstrate good practices:

  • RAII for resource management
  • Proper error handling
  • Comprehensive validation of decompressed data

133-136: Improve documentation for LZMA compression test section

The LZMA section differs from other compression tests as it only tests compression. This should be clearly documented.

Apply this change:

-    SECTION("LZMA compression") {
+    SECTION("LZMA compression (compression-only test)") {
+        // Note: Decompression testing is intentionally omitted as this section
+        // focuses on verifying the compression functionality
         compressor = std::make_unique<clp::streaming_compression::lzma::Compressor>();
         compress(std::move(compressor), uncompressed_buffer.data());
     }

138-138: LGTM! Proper cleanup of test artifacts

The cleanup code ensures no test files are left behind.

@Bill-hbrhbr Bill-hbrhbr requested a review from haiqi96 December 22, 2024 22:21
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🔭 Outside diff range comments (1)
components/core/tests/test-StreamingCompression.cpp (1)

LZMA decompressor implementation is missing

Based on the codebase analysis, while LZMA compression is implemented (Compressor.hpp, Compressor.cpp, Constants.hpp), there is no corresponding decompressor implementation. This is unlike the ZStd and Passthrough implementations which have both compressor and decompressor components. The test file should not include decompressor header as it doesn't exist yet.

  • components/core/src/clp/streaming_compression/lzma/ needs a decompressor implementation
  • Remove LZMA-related test cases from test-StreamingCompression.cpp until decompression support is added
🔗 Analysis chain

Line range hint 20-24: Consider adding LZMA decompressor header for consistency

While the LZMA compressor header is included, the corresponding decompressor header is missing, unlike the ZStd and Passthrough implementations which include both. This might indicate an incomplete implementation or missing functionality.

Let's verify if the LZMA decompressor exists:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for LZMA decompressor header
fd -e hpp -e cpp "Decompressor" | rg "lzma"

Length of output: 41


Script:

#!/bin/bash
# Let's first check if LZMA decompressor exists in the streaming_compression directory
rg -l "lzma.*Decompressor" --type cpp --type hpp

# Also check the directory structure to understand the implementation pattern
fd . "src/clp/streaming_compression"

Length of output: 232


Script:

#!/bin/bash
# Let's check the directory structure first to understand the correct paths
fd streaming_compression

# Then search for any LZMA-related files
rg -l "lzma" --type cpp --type c --type h

Length of output: 411

🧹 Nitpick comments (4)
components/core/tests/test-StreamingCompression.cpp (2)

58-58: Consider avoiding string conversion overhead

The string(cCompressedFilePath) conversion could be avoided by updating FileWriter::open to accept string_view.


133-136: Improve test section documentation

While the compression-only testing for LZMA is intentional (as discussed in PR comments), the test section could be more self-documenting.

Consider adding a comment explaining why decompression is not tested:

     SECTION("LZMA compression") {
+        // Note: This section only tests compression as LZMA decompression
+        // is handled separately. See PR #614 for details.
         compressor = std::make_unique<clp::streaming_compression::lzma::Compressor>();
         compress(std::move(compressor), uncompressed_buffer.data());
     }
components/core/src/clp/streaming_compression/lzma/Compressor.cpp (2)

146-149: Add error handling for write operations.

The write operation to the file writer should be wrapped in a try-catch block to handle potential I/O errors.

+    try {
         m_compressed_stream_file_writer->write(
                 clp::size_checked_pointer_cast<char>(m_compressed_stream_block_buffer.data()),
                 cCompressedStreamBlockBufferSize - m_lzma_stream.avail_out()
         );
+    } catch (const std::exception& e) {
+        SPDLOG_ERROR("Failed to write compressed data: {}", e.what());
+        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+    }

200-201: Enhance error logging with additional context.

Include the compression level, dictionary size, and check type in the error message to aid in debugging.

-    SPDLOG_ERROR("Error initializing the encoder: {} (error code {})", msg, static_cast<int>(rc));
+    SPDLOG_ERROR("Error initializing the encoder: {} (error code {}). Compression level: {}, Dictionary size: {}, Check: {}",
+                 msg, static_cast<int>(rc), compression_level, dict_size, static_cast<int>(check));
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between df41b22 and 524fe1d.

📒 Files selected for processing (2)
  • components/core/src/clp/streaming_compression/lzma/Compressor.cpp (1 hunks)
  • components/core/tests/test-StreamingCompression.cpp (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
components/core/tests/test-StreamingCompression.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp/streaming_compression/lzma/Compressor.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (2)
components/core/tests/test-StreamingCompression.cpp (1)
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/tests/test-StreamingCompression.cpp:45-54
Timestamp: 2024-12-04T15:50:37.827Z
Learning: In `components/core/tests/test-StreamingCompression.cpp`, within the `compress` function, compressing the same data repeatedly by passing the same `src` pointer without advancing is intentional to test the compressor with the same data multiple times.
components/core/src/clp/streaming_compression/lzma/Compressor.cpp (1)
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/src/clp/streaming_compression/lzma/Compressor.cpp:169-170
Timestamp: 2024-12-19T01:39:30.094Z
Learning: In the LzmaStreamOperations context (Compressor.cpp), m_p is guaranteed never to be null.
🔇 Additional comments (3)
components/core/tests/test-StreamingCompression.cpp (1)

34-54: LGTM! Well-structured constants and function declarations

The code demonstrates good practices:

  • Constants properly scoped in anonymous namespace
  • Use of constexpr for compile-time evaluation
  • Modern C++ auto return type syntax
  • Clear function signatures with appropriate parameter types
components/core/src/clp/streaming_compression/lzma/Compressor.cpp (2)

40-42: 🛠️ Refactor suggestion

Consider processing remaining data before throwing an error.

Instead of throwing an error when there's unprocessed input data during close, consider processing the remaining data to ensure no data is lost.

 if (m_lzma_stream.avail_in() > 0) {
-    throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+    encode_lzma();  // Process remaining data
 }

Likely invalid or redundant comment.


126-130: 🛠️ Refactor suggestion

Enhance error handling for different flush actions.

The current error handling for LZMA_BUF_ERROR doesn't distinguish between different flush actions. For LZMA_FULL_FLUSH and LZMA_FULL_BARRIER, a buffer error might be expected when the output buffer is full.

             case LZMA_BUF_ERROR:
-                // NOTE: this can happen if we are using LZMA_FULL_FLUSH or LZMA_FULL_BARRIER. These
-                // two actions keeps encoding input data alongside flushing buffered encoded data.
-                SPDLOG_ERROR("LZMA compressor input stream is corrupt. No encoding "
-                             "progress can be made.");
-                throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+                if (flush_action == LZMA_FULL_FLUSH || flush_action == LZMA_FULL_BARRIER) {
+                    if (m_lzma_stream.avail_in() > 0 && m_lzma_stream.avail_out() > 0) {
+                        SPDLOG_ERROR("LZMA flush failed: Unable to make progress with available buffers");
+                        throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);
+                    }
+                    break;
+                }
+                SPDLOG_ERROR("Unexpected LZMA_BUF_ERROR during flush operation");
+                throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__);

Likely invalid or redundant comment.

Copy link
Contributor

@haiqi96 haiqi96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me now. Have you guys agreed on a PR title?

@Bill-hbrhbr
Copy link
Contributor Author

Looks good to me now. Have you guys agreed on a PR title?

Yes. The PR title is changed per David and Kirk's suggestion.

@Bill-hbrhbr Bill-hbrhbr merged commit 329edf6 into y-scope:main Dec 23, 2024
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.