feat(clp-s): Add support for kv-pair-IR ingestion. #630

AVMatthews · 2024-12-09T17:53:07Z

Description

This pull request is the continuation or PR #618, adding the functionality promised. This adds the corresponding ingestion functionality to compress IRv2 into archive.

Add additional public function to JsonParser to expose parsing from IR; parse_from_ir()
Add an class implementing the clp::ffi::ir_stream::IrUnitHandlerInterface for use deserializing IRv2 files; IrUnitHandler
Add function to iterate over pairs in the KeyValuePairLogEvent converting the log event into the correct format for archive; parse_kv_log_event()
Add two helper functions to help convert IR schema nodes into archive schema nodes:
- Find the archive schema node type based on the IR schema node type and the value of the node in the current log event; get_archive_node_type
- Find or add the correct archive schema node id for a given IR schema node id and correct archive schema node type for the value of the node in the current log event; get_archive_node_id

Validation performed

For our posted JSON Datasets:

Generated the corresponding IRv2 file
Compress the IRv2 file to archive
- ./clp-s c <archive dir> <ir file or directory w/ ir file> --file-type kv-ir
Extract archive back to JSON
- ./clp-s x <archive dir> <output JSON dir>
Sort keys and compare to JSON generated by clp-s from compressing and extracting directly from original JSON (spark, elasticsearch, and postgresql)
- jq -S -c '.' <output JSON dir>/original > out.json
  - You can sort the json order if not using built in log ordering
  - jq -S -c '.' <output JSON dir>/original | sort > out_sorted.json
- diff -u out.json clp-s_out.json
  - To generate clp-s_out.json:
    - ./clp-s c <archive dir> <original JSON>
    - ./clp-s x <archive dir> <output JSON dir>
    - jq -S -c '.' <output JSON dir>/original > clp-s_out.json
Spot checked match of mongodb and cockroach database
Notes:
- I noticed some differences in numbers of significant digits for floats between both clp-s and the IRv2 to archive, and the original JSON.
- ex.) original: 1.0; clp_s: 1.00000; IRv2: 1.00000
- ex.) original: 0.9758767; clp_s: 0.9758767; IRv2: 0.975877

Summary by CodeRabbit

New Features
- Enhanced JsonParser to support parsing of Key Value IR Streams.
- Introduced new functionalities for handling log events and schema trees.
- Improved error handling in the compress function for parsing Key Value IR files.
- Added validation checks in command-line argument parsing for better user feedback.
Bug Fixes
- Enhanced error logging for parsing failures in the compress function.

…er-Functionality

coderabbitai · 2024-12-09T17:53:16Z

Walkthrough

This pull request introduces multiple new source and header files to the CMakeLists.txt for the CLP module, enhancing its capabilities in serialization, deserialization, and data handling. Key modifications are made to the JsonParser class, including the addition of a new IrUnitHandler class and several methods for processing log events and IR units. The compress function is updated to incorporate error handling for parsing Key Value IR files, replacing placeholder comments with functional code. Overall, the changes expand the functionality and robustness of the CLP module.

Changes

File Path	Change Summary
components/core/src/clp_s/CMakeLists.txt	Added multiple new source and header files to `CLP_SOURCES`, `CLP_S_SOURCES`, and `CLP_S_SEARCH_SOURCES`.
components/core/src/clp_s/JsonParser.cpp	Introduced `IrUnitHandler` class and new methods in `JsonParser` for handling log events and IR units.
components/core/src/clp_s/JsonParser.hpp	Added new public and private methods in `JsonParser` for parsing and handling key-value pairs and schemas.
components/core/src/clp_s/clp-s.cpp	Enhanced error handling in the `compress` function for parsing Key Value IR files.
components/core/src/clp_s/CommandLineArguments.cpp	Improved error handling and validation in `parse_arguments` method regarding `--file-type` and `--timestamp-key`.

Possibly related PRs

ffi: Extract IR unit deserialization methods from Deserializer into standalone functions to prepare for a templatized Deserializer. #544: This PR adds new source files related to IR unit deserialization methods, which are directly relevant to the changes in the main PR that introduce new files for serialization and deserialization in the CMakeLists.txt.
ffi: Redesign Deserializer API to deserialize key-value pair IR streams one IR unit at a time (fixes #539). #549: This PR redesigns the Deserializer API to handle key-value pair IR streams, which aligns with the main PR's focus on enhancing serialization and deserialization functionalities.
feat(ffi): Update IR stream protocol version handling in preparation for releasing the kv-pair IR stream format: #573: This PR updates the IR stream protocol version handling, which is relevant to the changes in the main PR that involve new serialization and deserialization methods.
refactor(clp-package): Unify the metadata schema for JSON and IR streams. #620: This PR unifies the metadata schema for JSON and IR streams, which is related to the main PR's focus on enhancing serialization and deserialization processes.
feat(ffi): Add initial implementation of IrErrorCode (using the ErrorCode template) which will replace the IRErrorCode enum. #623: This PR introduces the IrErrorCode system, which will replace the existing IRErrorCode enum, directly impacting error handling in the serialization and deserialization processes introduced in the main PR.

Suggested reviewers

gibber9809
kirkrodrigues

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (2)

components/core/src/clp_s/CMakeLists.txt (1)

Line range hint 11-59: Consider grouping related components into separate CMake targets.

The current structure adds all files to a single source list. Consider organizing related components (IR stream, key-value processing, compression) into separate CMake targets for better modularity and dependency management.

Example restructuring:
# IR Stream component
add_library(ir_stream STATIC
  ../clp/ffi/ir_stream/decoding_methods.cpp
  ../clp/ffi/ir_stream/encoding_methods.cpp
  # ... other IR stream files
)

# Key-Value processing component
add_library(kv_processing STATIC
  ../clp/ffi/KeyValuePairLogEvent.cpp
  ../clp/ffi/SchemaTree.cpp
  # ... other KV files
)

# Update main target to use these components
target_link_libraries(clp-s
  PRIVATE
  ir_stream
  kv_processing
  # ... other dependencies
)

components/core/src/clp_s/clp-s.cpp (1)

118-121: Refactor to eliminate code duplication in error handling

Both branches of the if-statement handle parsing errors identically. Consider refactoring to reduce duplication and enhance readability.

Apply this diff to refactor the code:

     bool success = false;
     if (CommandLineArguments::FileType::KeyValueIr == option.input_file_type) {
-        if (false == parser.parse_from_ir()) {
-            SPDLOG_ERROR("Encountered error while parsing input");
-            return false;
-        }
+        success = parser.parse_from_ir();
     } else {
-        if (false == parser.parse()) {
-            SPDLOG_ERROR("Encountered error while parsing input");
-            return false;
-        }
+        success = parser.parse();
     }
+    if (false == success) {
+        SPDLOG_ERROR("Encountered error while parsing input");
+        return false;
+    }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 7176c5e and 914b727.

📒 Files selected for processing (4)

components/core/src/clp_s/CMakeLists.txt (2 hunks)
components/core/src/clp_s/JsonParser.cpp (2 hunks)
components/core/src/clp_s/JsonParser.hpp (4 hunks)
components/core/src/clp_s/clp-s.cpp (1 hunks)

🧰 Additional context used

📓 Path-based instructions (3)

components/core/src/clp_s/clp-s.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/JsonParser.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/JsonParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (2)

components/core/src/clp_s/JsonParser.hpp (1)

Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-11-10T16:46:53.300Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

components/core/src/clp_s/JsonParser.cpp (1)

Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-11-10T16:46:53.300Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

🔇 Additional comments (8)

components/core/src/clp_s/CMakeLists.txt (3)

11-28: Well-organized IR stream and key-value processing components.

The organization of the IR stream and key-value processing components follows a clean separation of concerns and maintains consistent naming conventions. The structure aligns well with the PR objectives for IRv2 ingestion functionality.

Line range hint 11-59: Consider adding corresponding test files.

While the implementation files are well-organized, I notice that corresponding test files aren't included in the CMake configuration. Consider adding test files for:

IR stream serialization/deserialization
Key-value pair processing
Schema handling
UTF-8 utilities

Would you like me to help create a template for these test files or open a GitHub issue to track this task?

54-55: Verify ZSTD compression integration with IR stream handling.

The addition of ZSTD compression components alongside IR stream handling suggests a dependency between these features. Please ensure that the IR stream serialization properly integrates with the ZSTD compression.

Also applies to: 11-21

✅ Verification successful

ZSTD compression is properly integrated with IR stream handling

The verification shows that ZSTD compression is already properly integrated with IR stream handling:

LogEventSerializer uses ZSTD compression for writing IR data through m_zstd_compressor
The test suite confirms the integration by using Decompressor to read the ZSTD-compressed IR stream
The implementation follows a proper lifecycle (open, write, close) for compression handling

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for ZSTD compression usage in IR stream handling

# Search for ZSTD compression usage in IR stream files
echo "Checking ZSTD integration..."
rg -l "zstd|ZSTD" ../clp/ffi/ir_stream/

Length of output: 170

Script:

#!/bin/bash
# Let's first find the correct paths for IR stream and ZSTD files
echo "Finding IR stream and ZSTD related files..."
fd -t f "ir_stream|IR_Stream|zstd|ZSTD" --full-path

# Then check for any connections between IR stream and ZSTD compression
echo -e "\nChecking for ZSTD usage in IR stream related files..."
rg -l "zstd|ZSTD|Zstd" $(fd -t f "ir_stream|IR_Stream")

# Look for IR stream mentions in ZSTD files
echo -e "\nChecking for IR stream references in ZSTD files..."
rg -l "ir_stream|IR_Stream" $(fd -t f "zstd|ZSTD")

Length of output: 9711

Script:

#!/bin/bash
# Let's examine the actual integration points between IR stream and ZSTD
echo "Checking IR serialization code for compression usage..."
rg -C 3 "Compress|Decompress|zstd|ZSTD" components/core/src/clp/ir/LogEventSerializer.cpp components/core/src/clp/ir/LogEventDeserializer.cpp

echo -e "\nChecking test files for IR and ZSTD integration..."
rg -C 3 "Compress|Decompress|zstd|ZSTD" components/core/tests/test-ir_serializer.cpp

Length of output: 2862

components/core/src/clp_s/JsonParser.hpp (5)

4-4: Inclusion of necessary headers

The added headers provide the required declarations for the new functionalities implemented in the class.

Also applies to: 8-8, 15-17

74-79: Addition of parse_from_ir() method

The introduction of the parse_from_ir() public method enhances the JsonParser class by enabling parsing from IR streams, aligning with the PR objectives.

95-107: Definition of get_archive_node_type() method

The get_archive_node_type() method correctly determines the archive node type based on the IR node type and its value, which is essential for accurate data conversion.

108-123: Implementation of get_archive_node_id() method

The get_archive_node_id() method efficiently maps IR node IDs to archive node IDs, utilizing caching to optimize performance.

124-135: Creation of parse_kv_log_event() method

The parse_kv_log_event() method effectively parses key-value log events, facilitating their conversion into the archive format.

components/core/src/clp_s/JsonParser.cpp

gibber9809

Added an initial set of comments. Still need to go through parse_from_ir onwards, but should be enough comments to iterate on for now.

The other high level question I have is whether the --timestamp-key argument is supported for this ingestion path? If not we should throw in the same way we do if the --structurize-arrays argument is passed for IR ingestion.

gibber9809 · 2024-12-10T14:41:26Z

components/core/src/clp_s/JsonParser.cpp

+            [[maybe_unused]] UtcOffset utc_offset_old,
+            [[maybe_unused]] UtcOffset utc_offset_new
+    ) -> IRErrorCode {
+        return IRErrorCode::IRErrorCode_Success;


It seems like this doesn't actually handle utc offset changes, so shouldn't this return something like ErrorCode_Unsupported instead?

Similar questions for the other functions that return success without doing anything. Mostly wondering what the responsibilities/expected behaviour for these functions is supposed to be.

The deserializer expects a handler like the one laid out in components/core/src/clp/ffi/ir_stream/IrUnitHandlerInterface.hpp. In the test case for the deserializer the unused functions returned success like this, so I just followed suit. I'm not sure what the goal of the functions were. I think it exposed flexibility where if we wanted to handle each schema node one at a time, do something specific with each event, etc we could. Really we'd have to consult Zhihao to see if he had a specific thing in mind. All functions have pathways to get called by the deserializer, so we need to offer them. Is there something I'm supposed to used in the archive or parser with a UTC change?

components/core/src/clp_s/JsonParser.cpp

gibber9809 · 2024-12-10T15:56:24Z

components/core/src/clp_s/JsonParser.cpp

+                                            .decode_and_unparse()
+                                            .value();
+                }
+                auto validated_escaped_encoded_string


Again, don't want to escape the string here.

For this and other utf8 validation cases we decided offline to leave it as is for now unless the escaping changes get merged before this PR is complete.

coderabbitai

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (7)

components/core/src/clp_s/JsonParser.hpp (6)

76-80: Consider enhancing method documentation.

While the documentation is clear, it would be beneficial to add:

@throw section documenting potential exceptions

Example of expected IR Stream format

Description of failure scenarios

97-108: Fix documentation inconsistency with parameters.

The documentation mentions parameters that don't match the method signature:

ir_node_type → not in signature

node_has_value → not in signature

node_value → not in signature

Update documentation to match the actual parameters:

tree

kv_pair

110-122: Add missing documentation elements.

Please add:

@return description

Update parameter list to match signature (parent_node_id is in signature but not documented)

136-140: Enhance method documentation.

Consider adding:

Description of the parsing process

@throw section if applicable

Expected format of the KeyValuePairLogEvent

186-187: Add documentation for the new member variable.

Please add a comment explaining:

Purpose of the mapping

Relationship between IR nodes and archive nodes

Why flat_hash_map was chosen over std::unordered_map

76-80: Document error handling behaviour.

Based on previous implementation patterns, please document that errors from kv_log_event_result.error() aside from std::errc::no_message_available and std::errc::result_out_of_range are expected behaviour.
components/core/src/clp_s/JsonParser.cpp (1)
688-749: Consider converting recursive implementation to iterative.

The current recursive implementation could potentially cause stack overflow with deep trees. Additionally, profilers often have trouble attributing runtime to recursive functions.

Consider converting this to an iterative implementation using a stack in a follow-up PR.

Also, use auto const for immutable views:
-    int curr_node_archive_id{constants::cRootNodeId};
+    auto const curr_node_archive_id{constants::cRootNodeId};

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 914b727 and f76da3d.

📒 Files selected for processing (2)

components/core/src/clp_s/JsonParser.cpp (2 hunks)
components/core/src/clp_s/JsonParser.hpp (5 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

components/core/src/clp_s/JsonParser.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/JsonParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (2)

components/core/src/clp_s/JsonParser.hpp (1)

Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-11-10T16:46:53.300Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

components/core/src/clp_s/JsonParser.cpp (3)

Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.

Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.

Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-11-10T16:46:53.300Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

🔇 Additional comments (5)

components/core/src/clp_s/JsonParser.hpp (2)

4-19: LGTM! Includes are well-organized and necessary.

The new includes are properly ordered and support the added IR functionality.

36-36: LGTM! Using directive is appropriately scoped.

The using directive is specific to the needed type and well-placed after includes.

components/core/src/clp_s/JsonParser.cpp (3)

3-33: LGTM! Well-organized includes and namespace declarations.

The includes are logically grouped and the using declarations improve code readability.

35-75: LGTM! Clean implementation of IrUnitHandlerInterface.

The class provides a clear implementation of the interface with appropriate documentation. The empty implementations returning success are consistent with the test cases and expected behaviour.

842-919: LGTM! Well-structured IR parsing implementation.

The method provides comprehensive error handling and follows a clear flow of operations. The cleanup of resources is properly handled in error cases.

coderabbitai · 2024-12-14T08:28:37Z

components/core/src/clp_s/JsonParser.cpp

+void JsonParser::parse_kv_log_event(KeyValuePairLogEvent const& kv) {
+    clp::ffi::SchemaTree const& tree = kv.get_schema_tree();
+    for (auto const& pair : kv.get_node_id_value_pairs()) {
+        NodeType const archive_node_type = get_archive_node_type(tree, pair);
+        auto const node_id = get_archive_node_id(pair.first, archive_node_type, tree);
+
+        switch (archive_node_type) {
+            case NodeType::Integer: {
+                auto const i64_value
+                        = pair.second.value().get_immutable_view<clp::ffi::value_int_t>();
+                m_current_parsed_message.add_value(node_id, i64_value);
+            } break;
+            case NodeType::Float: {
+                auto const d_value
+                        = pair.second.value().get_immutable_view<clp::ffi::value_float_t>();
+                m_current_parsed_message.add_value(node_id, d_value);
+            } break;
+            case NodeType::Boolean: {
+                auto const b_value
+                        = pair.second.value().get_immutable_view<clp::ffi::value_bool_t>();
+                m_current_parsed_message.add_value(node_id, b_value);
+            } break;
+            case NodeType::VarString: {
+                auto const validated_escaped_string = clp::ffi::validate_and_escape_utf8_string(
+                        pair.second.value().get_immutable_view<std::string>()
+                );
+                std::string str;
+                if (validated_escaped_string.has_value()) {
+                    str = validated_escaped_string.value();
+                } else {
+                    SPDLOG_ERROR(
+                            "String is not utf8 compliant: \"{}\"",
+                            pair.second.value().get_immutable_view<std::string>()
+                    );
+                    throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
+                }
+                m_current_parsed_message.add_value(node_id, str);
+            } break;
+            case NodeType::ClpString: {
+                std::string encoded_str;
+                std::string decoded_value;
+                if (pair.second.value().is<clp::ir::EightByteEncodedTextAst>()) {
+                    decoded_value = pair.second.value()
+                                            .get_immutable_view<clp::ir::EightByteEncodedTextAst>()
+                                            .decode_and_unparse()
+                                            .value();
+
+                } else {
+                    decoded_value = pair.second.value()
+                                            .get_immutable_view<clp::ir::FourByteEncodedTextAst>()
+                                            .decode_and_unparse()
+                                            .value();
+                }
+                auto const validated_escaped_encoded_string
+                        = clp::ffi::validate_and_escape_utf8_string(decoded_value.c_str());
+                if (validated_escaped_encoded_string.has_value()) {
+                    encoded_str = validated_escaped_encoded_string.value();
+                } else {
+                    SPDLOG_ERROR("Encoded string is not utf8 compliant: \"{}\"", decoded_value);
+                    throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
+                }
+                m_current_parsed_message.add_value(node_id, encoded_str);
+            } break;
+            case NodeType::UnstructuredArray: {
+                std::string array_str;
+                if (pair.second.value().is<clp::ir::EightByteEncodedTextAst>()) {
+                    array_str = pair.second.value()
+                                        .get_immutable_view<clp::ir::EightByteEncodedTextAst>()
+                                        .decode_and_unparse()
+                                        .value();
+                } else {
+                    array_str = pair.second.value()
+                                        .get_immutable_view<clp::ir::FourByteEncodedTextAst>()
+                                        .decode_and_unparse()
+                                        .value();
+                }
+                m_current_parsed_message.add_value(node_id, array_str);
+                break;
+            }
+            default:
+                // Don't need to add value for obj or null
+                break;
+        }
+        m_current_schema.insert_ordered(node_id);
+    }
+
+    int32_t const current_schema_id = m_archive_writer->add_schema(m_current_schema);
+    m_current_parsed_message.set_id(current_schema_id);
+    m_archive_writer->append_message(current_schema_id, m_current_schema, m_current_parsed_message);
+}


🛠️ Refactor suggestion

Simplify string handling and remove unnecessary UTF-8 validation.

The method contains unnecessary UTF-8 validation and string escaping. These operations should be handled during compression.

Apply these changes to simplify string handling:

Remove UTF-8 validation in VarString case

Remove UTF-8 validation in ClpString case

Use auto const for immutable views

case NodeType::VarString: { - auto const validated_escaped_string = clp::ffi::validate_and_escape_utf8_string( - pair.second.value().get_immutable_view<std::string>() - ); - std::string str; - if (validated_escaped_string.has_value()) { - str = validated_escaped_string.value(); - } else { - SPDLOG_ERROR( - "String is not utf8 compliant: \"{}\"", - pair.second.value().get_immutable_view<std::string>() - ); - throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__); - } + auto const& str = pair.second.value().get_immutable_view<std::string>(); m_current_parsed_message.add_value(node_id, str); } break;

Committable suggestion skipped: line range outside the PR's diff.

components/core/src/clp_s/JsonParser.cpp

coderabbitai · 2024-12-14T08:28:38Z

components/core/src/clp_s/JsonParser.cpp

+auto JsonParser::get_archive_node_type(
+        clp::ffi::SchemaTree const& tree,
+        std::pair<clp::ffi::SchemaTree::Node::id_t, std::optional<clp::ffi::Value>> const& kv_pair
+) -> NodeType {
+    clp::ffi::SchemaTree::Node const& tree_node = tree.get_node(kv_pair.first);
+    clp::ffi::SchemaTree::Node::Type const ir_node_type = tree_node.get_type();
+    bool const node_has_value = kv_pair.second.has_value();
+    std::optional<clp::ffi::Value> node_value{};
+    if (node_has_value) {
+        node_value = kv_pair.second.value();
+    }
+    switch (ir_node_type) {
+        case clp::ffi::SchemaTree::Node::Type::Int:
+            return NodeType::Integer;
+        case clp::ffi::SchemaTree::Node::Type::Float:
+            return NodeType::Float;
+        case clp::ffi::SchemaTree::Node::Type::Bool:
+            return NodeType::Boolean;
+        case clp::ffi::SchemaTree::Node::Type::UnstructuredArray:
+            return NodeType::UnstructuredArray;
+        case clp::ffi::SchemaTree::Node::Type::Str:
+            if (node_value && node_value->is<std::string>()) {
+                return NodeType::VarString;
+            } else {
+                return NodeType::ClpString;
+            }
+        case clp::ffi::SchemaTree::Node::Type::Obj:
+            if (node_has_value) {
+                if (node_value->is_null()) {
+                    return NodeType::NullValue;
+                }
+            }
+            return NodeType::Object;
+        default:
+            SPDLOG_ERROR("Unknown IR Node Type Detected");
+            throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
+    }
+}


🛠️ Refactor suggestion

Simplify function structure and improve error handling.

Consider these improvements:

Simplify the function by directly returning types in the switch cases

Return NodeType::Unknown for unknown IR node types instead of throwing an error

Apply this diff to simplify the function:

auto JsonParser::get_archive_node_type( clp::ffi::SchemaTree const& tree, std::pair<clp::ffi::SchemaTree::Node::id_t, std::optional<clp::ffi::Value>> const& kv_pair ) -> NodeType { clp::ffi::SchemaTree::Node const& tree_node = tree.get_node(kv_pair.first); clp::ffi::SchemaTree::Node::Type const ir_node_type = tree_node.get_type(); - bool const node_has_value = kv_pair.second.has_value(); - std::optional<clp::ffi::Value> node_value{}; - if (node_has_value) { - node_value = kv_pair.second.value(); - } + auto const& node_value = kv_pair.second; switch (ir_node_type) { case clp::ffi::SchemaTree::Node::Type::Int: return NodeType::Integer; case clp::ffi::SchemaTree::Node::Type::Float: return NodeType::Float; case clp::ffi::SchemaTree::Node::Type::Bool: return NodeType::Boolean; case clp::ffi::SchemaTree::Node::Type::UnstructuredArray: return NodeType::UnstructuredArray; case clp::ffi::SchemaTree::Node::Type::Str: if (node_value && node_value->is<std::string>()) { return NodeType::VarString; } return NodeType::ClpString; case clp::ffi::SchemaTree::Node::Type::Obj: if (node_value && node_value->is_null()) { return NodeType::NullValue; } return NodeType::Object; default: - SPDLOG_ERROR("Unknown IR Node Type Detected"); - throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__); + return NodeType::Unknown; } }

Committable suggestion skipped: line range outside the PR's diff.

…rchive-JsonParser-Functionality

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (4)

components/core/src/clp_s/JsonParser.cpp (4)

624-661: 🛠️ Refactor suggestion

Simplify the get_archive_node_type function for clarity

Consider refactoring the get_archive_node_type function to directly return the node type in each case. This simplifies the function, enhances readability, and aligns with best practices.

Apply this diff to simplify the function:

auto JsonParser::get_archive_node_type(
        clp::ffi::SchemaTree const& tree,
        std::pair<clp::ffi::SchemaTree::Node::id_t, std::optional<clp::ffi::Value>> const& kv_pair
) -> NodeType {
    clp::ffi::SchemaTree::Node const& tree_node = tree.get_node(kv_pair.first);
    clp::ffi::SchemaTree::Node::Type const ir_node_type = tree_node.get_type();
    auto const& node_value = kv_pair.second;

    switch (ir_node_type) {
        case clp::ffi::SchemaTree::Node::Type::Int:
            return NodeType::Integer;
        case clp::ffi::SchemaTree::Node::Type::Float:
            return NodeType::Float;
        case clp::ffi::SchemaTree::Node::Type::Bool:
            return NodeType::Boolean;
        case clp::ffi::SchemaTree::Node::Type::UnstructuredArray:
            return NodeType::UnstructuredArray;
        case clp::ffi::SchemaTree::Node::Type::Str:
            if (node_value && node_value->is<std::string>()) {
                return NodeType::VarString;
            } else {
                return NodeType::ClpString;
            }
        case clp::ffi::SchemaTree::Node::Type::Obj:
            if (node_value && node_value->is_null()) {
                return NodeType::NullValue;
            } else {
                return NodeType::Object;
            }
        default:
            return NodeType::Unknown;
    }
}

663-686: 🛠️ Refactor suggestion

Remove unnecessary UTF-8 validation in add_node_to_archive_and_translations

UTF-8 validation is unnecessary at this stage and should be handled during compression. Removing this validation simplifies the code and reduces overhead.

Apply this diff to remove the UTF-8 validation:

auto JsonParser::add_node_to_archive_and_translations(
        uint32_t ir_node_id,
        clp::ffi::SchemaTree::Node const& ir_node_to_add,
        NodeType archive_node_type,
        int32_t parent_node_id
) -> int {
-    auto validated_escaped_key
-            = clp::ffi::validate_and_escape_utf8_string(ir_node_to_add.get_key_name());
-    std::string node_key;
-    if (validated_escaped_key.has_value()) {
-        node_key = validated_escaped_key.value();
-    } else {
-        SPDLOG_ERROR("Key is not UTF-8 compliant: \"{}\"", ir_node_to_add.get_key_name());
-        throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
-    }
+    std::string const& node_key = ir_node_to_add.get_key_name();

    int const curr_node_archive_id
            = m_archive_writer->add_node(parent_node_id, archive_node_type, node_key);

    m_ir_node_to_archive_node_id_mapping.emplace(
            std::make_pair(ir_node_id, archive_node_type),
            curr_node_archive_id
    );
    return curr_node_archive_id;
}

774-786: 🛠️ Refactor suggestion

Eliminate unnecessary UTF-8 validation in parse_kv_log_event for VarString nodes

UTF-8 validation is not required here and can be safely removed. This change reduces redundancy and enhances performance.

Apply this diff:

case NodeType::VarString: {
-    auto const validated_escaped_string = clp::ffi::validate_and_escape_utf8_string(
-            pair.second.value().get_immutable_view<std::string>()
-    );
-    std::string str;
-    if (validated_escaped_string.has_value()) {
-        str = validated_escaped_string.value();
-    } else {
-        SPDLOG_ERROR(
-                "String is not utf8 compliant: \"{}\"",
-                pair.second.value().get_immutable_view<std::string>()
-        );
-        throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
-    }
+    auto const& str = pair.second.value().get_immutable_view<std::string>();
    m_current_parsed_message.add_value(node_id, str);
} break;

790-813: 🛠️ Refactor suggestion

Remove unnecessary UTF-8 validation for encoded strings in parse_kv_log_event

UTF-8 validation of encoded strings is not needed at this point and should be deferred to the compression phase. Simplifying this code improves efficiency.

Apply this diff:

case NodeType::ClpString: {
    std::string encoded_str;
    std::string decoded_value;
    if (pair.second.value().is<clp::ir::EightByteEncodedTextAst>()) {
        decoded_value = pair.second.value()
                                            .get_immutable_view<clp::ir::EightByteEncodedTextAst>()
                                            .decode_and_unparse()
                                            .value();

    } else {
        decoded_value = pair.second.value()
                                            .get_immutable_view<clp::ir::FourByteEncodedTextAst>()
                                            .decode_and_unparse()
                                            .value();
    }
-    auto const validated_escaped_encoded_string
-            = clp::ffi::validate_and_escape_utf8_string(decoded_value.c_str());
-    if (validated_escaped_encoded_string.has_value()) {
-        encoded_str = validated_escaped_encoded_string.value();
-    } else {
-        SPDLOG_ERROR("Encoded string is not utf8 compliant: \"{}\"", decoded_value);
-        throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
-    }
+    encoded_str = decoded_value;
    m_current_parsed_message.add_value(node_id, encoded_str);
} break;

🧹 Nitpick comments (1)

components/core/src/clp_s/JsonParser.cpp (1)
728-728: Adhere to coding guidelines by replacing not with false ==

Per the coding guidelines, prefer false == <expression> rather than !<expression>. Replace while (not ir_id_stack.empty()) with while (false == ir_id_stack.empty()) for consistency.

Apply this diff:
-    while (not ir_id_stack.empty()) {
+    while (false == ir_id_stack.empty()) {

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f76da3d and 860e784.

📒 Files selected for processing (1)

components/core/src/clp_s/JsonParser.cpp (2 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

components/core/src/clp_s/JsonParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)

components/core/src/clp_s/JsonParser.cpp (3)

Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.

Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.

Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-11-10T16:46:53.300Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

coderabbitai · 2024-12-16T06:52:45Z

components/core/src/clp_s/JsonParser.cpp

+                    SPDLOG_ERROR("ERROR: Encountered error while parsing a kv log event");
+                    zd.close();
+                    return false;
+                }


🛠️ Refactor suggestion

Avoid unnecessary error logging in parse_from_ir

Encountering errors in kv_log_event_result.error() is anticipated behaviour and does not require additional logging. Removing this error log prevents cluttering the logs with expected messages.

Apply this diff:

} catch (...) { - SPDLOG_ERROR("ERROR: Encountered error while parsing a kv log event"); zd.close(); return false; }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

SPDLOG_ERROR("ERROR: Encountered error while parsing a kv log event");

zd.close();

return false;

}

zd.close();

return false;

}

gibber9809

Mostly smaller changes this time.

Still want to confirm about the --timestamp-key behaviour I asked about in the last round of review.

Also wondering how we're tracking original size of the IR stream for metadata purposes. I didn't see anything in our code, and I'm guessing that the ir stream itself doesn't provide anything, but we should probably at least use an estimate like number of decompressed bytes consumed by the Decompressor or something.

components/core/src/clp_s/JsonParser.hpp

gibber9809 · 2024-12-16T17:31:36Z

components/core/src/clp_s/JsonParser.cpp

+
+auto JsonParser::parse_from_ir() -> bool {
+    for (auto& file_path : m_file_paths) {
+        auto const fsize = std::filesystem::file_size(file_path);


Do we do this because the Deserializer treats empty streams as invalid streams? I can see why this makes sense in that case, but it might be simpler to just let it fail on empty streams.

Up to you whether you want to change this or not.

I guess I could also just continue onto the next file rather than returning. What do you think? I'll leave it as is for now, but would be a quick change to remove or to change to continue instead of return.

Let's remove it then.

components/core/src/clp_s/JsonParser.cpp

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (3)

components/core/src/clp_s/JsonParser.cpp (3)
665-673: 🛠️ Refactor suggestion

Remove unnecessary UTF-8 validation and exception throwing

Based on previous discussions and learnings, the validation and exception throwing for UTF-8 compliance of ir_node_to_add.get_key_name() are unnecessary here and should be omitted.

Apply this diff to simplify the code:
-auto validated_escaped_key
-        = clp::ffi::validate_and_escape_utf8_string(ir_node_to_add.get_key_name());
-std::string node_key;
-if (validated_escaped_key.has_value()) {
-    node_key = validated_escaped_key.value();
-} else {
-    SPDLOG_ERROR("Key is not UTF-8 compliant: \"{}\"", ir_node_to_add.get_key_name());
-    throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
-}
+std::string const& node_key = ir_node_to_add.get_key_name();
770-782: 🛠️ Refactor suggestion

Eliminate redundant UTF-8 validation and exception handling

UTF-8 validation and exception throwing for string values are unnecessary at this stage and should be removed to simplify the code.

Apply this diff to remove the validation:
-auto const validated_escaped_string = clp::ffi::validate_and_escape_utf8_string(
-        pair.second.value().get_immutable_view<std::string>()
-);
-std::string str;
-if (validated_escaped_string.has_value()) {
-    str = validated_escaped_string.value();
-} else {
-    SPDLOG_ERROR(
-            "String is not utf8 compliant: \"{}\"",
-            pair.second.value().get_immutable_view<std::string>()
-    );
-    throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
-}
+auto const& str = pair.second.value().get_immutable_view<std::string>();
801-807: 🛠️ Refactor suggestion

Remove unnecessary validation and exception throwing for encoded strings

The UTF-8 validation and exception handling for encoded strings are unnecessary and can be safely removed.

Apply this diff to simplify the code:
-auto const validated_escaped_encoded_string
-        = clp::ffi::validate_and_escape_utf8_string(decoded_value.c_str());
-if (validated_escaped_encoded_string.has_value()) {
-    encoded_str = validated_escaped_encoded_string.value();
-} else {
-    SPDLOG_ERROR("Encoded string is not utf8 compliant: \"{}\"", decoded_value);
-    throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
-}
+std::string const& encoded_str = decoded_value;

🧹 Nitpick comments (3)

components/core/src/clp_s/JsonParser.cpp (3)
892-899: Simplify exception handling in parse_from_ir

The catch block rethrows the exception after logging and clean-up, which may be unnecessary. Consider allowing the exception to propagate naturally or handle it differently to reduce code complexity.

Consider modifying the code as follows:
try {
    parse_kv_log_event(*kv_log_event);
} catch (std::exception const& e) {
-    SPDLOG_ERROR("Encountered error while parsing a kv log event - {}", e.what());
-    decompressor.try_get_pos(curr_pos);
-    m_archive_writer->increment_uncompressed_size(curr_pos - last_pos);
-    decompressor.close();
-    return false;
+    throw;
}
650-652: Simplify condition by relying on node_value

Since node_value is only meaningful when it has a value, you can simplify the condition by checking node_value directly.

Modify the condition as follows:
-if (node_has_value && node_value.is_null()) {
+if (node_value && node_value->is_null()) {
631-634: Avoid unnecessary default initialization of node_value

Initializing node_value when it may not be used can be avoided. Use a pointer or reference to the optional value to simplify the code.

Consider changing the code to:
-clp::ffi::Value node_value{};
-if (node_has_value) {
-    node_value = kv_pair.second.value();
-}
+auto const& node_value = kv_pair.second;
Then adjust the subsequent code to handle node_value appropriately.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 860e784 and 75bb9e4.

📒 Files selected for processing (3)

components/core/src/clp_s/CommandLineArguments.cpp (1 hunks)
components/core/src/clp_s/JsonParser.cpp (2 hunks)
components/core/src/clp_s/JsonParser.hpp (5 hunks)

🧰 Additional context used

📓 Path-based instructions (3)

components/core/src/clp_s/JsonParser.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/CommandLineArguments.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/JsonParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)

components/core/src/clp_s/JsonParser.cpp (2)

Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.

Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.

🔇 Additional comments (2)

components/core/src/clp_s/CommandLineArguments.cpp (1)

277-284: LGTM! Input validation enhancement.

The added validation ensures that the --timestamp-key option is not used with kv-ir file type, preventing potential issues during processing. The error message is clear and follows the existing error handling pattern.

components/core/src/clp_s/JsonParser.hpp (1)

98-103: Update function documentation to match parameters

The documentation for get_archive_node_type does not align with its current parameters. Please update the function comments to accurately reflect the parameters and their descriptions.

gibber9809

A few more minor changes. Just confirming whether it is required to support UtcOffsetChange/expected behaviour for handling it. Should be good to go after that.

components/core/src/clp_s/JsonParser.cpp

…eserializer

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

components/core/src/clp_s/JsonParser.cpp (3)

624-657: Return a safer fallback instead of throwing an exception for unknown IR node types
Throwing an exception here might cause the ingestion process to terminate abruptly for unrecognized node types. Consider returning a meaningful fallback node type (e.g., NodeType::Unknown) or providing a more precise error message that indicates the node type in question.

684-745: Refine the iterative construction of archive nodes
The logic here recursively reconstructs the path from the IR node to the root. Although it is correct, consider converting the approach to a more straightforward iterative method that stops when you first find a known ancestor. This can reduce function complexity and improve readability.

838-844: Consider skipping empty files without calling close()
Currently, encountering an empty file leads to closing the archive writer and returning false. You can opt to skip empty files gracefully, continuing ingestion for the rest of the files. This ensures partial successes for other valid files.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 75bb9e4 and 51be440.

📒 Files selected for processing (1)

components/core/src/clp_s/JsonParser.cpp (2 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

components/core/src/clp_s/JsonParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)

components/core/src/clp_s/JsonParser.cpp (3)

Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.

Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.

Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-11-10T16:46:53.300Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

🔇 Additional comments (4)

components/core/src/clp_s/JsonParser.cpp (4)

659-673: Remove unnecessary UTF-8 validation for node keys
Validation and exception throwing for UTF-8 compliance have been flagged as unnecessary for IR nodes. As previously discussed, the IR logic or compression layer should handle invalid UTF-8. This local check increases code complexity and can be removed.

-    auto validated_escaped_key
-            = clp::ffi::validate_and_escape_utf8_string(ir_node_to_add.get_key_name());
-    std::string node_key;
-    if (validated_escaped_key.has_value()) {
-        node_key = validated_escaped_key.value();
-    } else {
-        SPDLOG_ERROR("Key is not UTF-8 compliant: \"{}\"", ir_node_to_add.get_key_name());
-        throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
-    }
+    std::string const& node_key = ir_node_to_add.get_key_name();

747-782: Remove unnecessary UTF-8 validation for VarString
Mirroring previous learning, local UTF-8 validation is unwarranted in the IR ingestion stage. Eliminate the validation step to reduce code repetition and avoid unexpected failures for borderline cases.

-    auto const validated_escaped_string = clp::ffi::validate_and_escape_utf8_string(
-            pair.second.value().get_immutable_view<std::string>()
-    );
-    std::string str;
-    if (validated_escaped_string.has_value()) {
-        str = validated_escaped_string.value();
-    } else {
-        SPDLOG_ERROR(
-                "String is not utf8 compliant: \"{}\"",
-                pair.second.value().get_immutable_view<std::string>()
-        );
-        throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
-    }
-    m_current_parsed_message.add_value(node_id, str);
+    auto const& str = pair.second.value().get_immutable_view<std::string>();
+    m_current_parsed_message.add_value(node_id, str);

789-807: Remove unnecessary UTF-8 validation for ClpString
Similar to the VarString branch, the ClpString logic should avoid performing UTF-8 checks. If needed, handle them closer to the data extraction step rather than here.

-    auto const validated_escaped_encoded_string
-            = clp::ffi::validate_and_escape_utf8_string(decoded_value.c_str());
-    if (validated_escaped_encoded_string.has_value()) {
-        encoded_str = validated_escaped_encoded_string.value();
-    } else {
-        SPDLOG_ERROR("Encoded string is not utf8 compliant: \"{}\"", decoded_value);
-        throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
-    }
-    m_current_parsed_message.add_value(node_id, encoded_str);
+    m_current_parsed_message.add_value(node_id, decoded_value);

915-921: Enhance support for unhandled IR unit types
When encountering IR units that are neither log events nor schema node insertions, you return false. If partial ingestion is acceptable, consider logging a warning and skipping the unhandled unit. If partial ingestion is undesirable, this logic is fine as-is.

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (1)

components/core/src/clp_s/JsonParser.cpp (1)

683-744: Consider rewriting recursion as iteration

The recursive implementation could lead to stack overflow with deep trees and may impact performance. Consider rewriting this function iteratively using a stack.

This change would:

Improve performance

Avoid potential stack overflow issues

Make profiling more accurate

Would you like me to provide an iterative implementation in a follow-up PR?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 51be440 and 7c0e500.

📒 Files selected for processing (1)

components/core/src/clp_s/JsonParser.cpp (2 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

components/core/src/clp_s/JsonParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)

components/core/src/clp_s/JsonParser.cpp (3)

Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.

Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.

Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-11-10T16:46:53.300Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

🔇 Additional comments (2)

components/core/src/clp_s/JsonParser.cpp (2)

664-672: 🛠️ Refactor suggestion

Remove unnecessary UTF-8 validation

Per previous discussions, UTF-8 validation should be handled during compression, not at this stage.

Apply this diff:

-    auto validated_escaped_key
-            = clp::ffi::validate_and_escape_utf8_string(ir_node_to_add.get_key_name());
-    std::string node_key;
-    if (validated_escaped_key.has_value()) {
-        node_key = validated_escaped_key.value();
-    } else {
-        SPDLOG_ERROR("Key is not UTF-8 compliant: \"{}\"", ir_node_to_add.get_key_name());
-        throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
-    }
+    std::string const& node_key = ir_node_to_add.get_key_name();

Likely invalid or redundant comment.

654-655: ⚠️ Potential issue

Return NodeType::Unknown for unknown IR node types

Instead of throwing an error, return NodeType::Unknown to allow better error handling at the caller's side.

Apply this diff:

-            throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
+            return NodeType::Unknown;

Likely invalid or redundant comment.

components/core/src/clp_s/JsonParser.cpp

gibber9809

LGTM. For PR title how about feat(clp-s): Add support for kv-pair-IR ingestion.?

AVMatthews and others added 3 commits December 9, 2024 11:50

Functionality complete, linted and tidyed

14aab7a

Merge branch 'y-scope:main' into feat(clp-s)-IRv2-->-Archive-JsonPars…

e4cb2fa

…er-Functionality

Add correct parser call the clp-s.cpp

914b727

coderabbitai bot reviewed Dec 9, 2024

View reviewed changes

components/core/src/clp_s/JsonParser.cpp Outdated Show resolved Hide resolved

components/core/src/clp_s/JsonParser.cpp Outdated Show resolved Hide resolved

components/core/src/clp_s/JsonParser.cpp Outdated Show resolved Hide resolved

gibber9809 requested changes Dec 10, 2024

View reviewed changes

modify get_archive_node_id to be iterative, improve error handling

f76da3d

coderabbitai bot reviewed Dec 14, 2024

View reviewed changes

AVMatthews added 2 commits December 15, 2024 00:48

Merge remote-tracking branch 'origin/main' into feat(clp-s)-IRv2-->-A…

206032b

…rchive-JsonParser-Functionality

Fixed compile issues after merging main

860e784

coderabbitai bot reviewed Dec 16, 2024

View reviewed changes

gibber9809 requested changes Dec 16, 2024

View reviewed changes

Clean up & add command line argumetn error case

75bb9e4

coderabbitai bot reviewed Dec 17, 2024

View reviewed changes

gibber9809 self-requested a review December 18, 2024 16:53

gibber9809 requested changes Dec 18, 2024

View reviewed changes

remove unnecessary clean up in error cases and fail with error from d…

51be440

…eserializer

coderabbitai bot reviewed Dec 18, 2024

View reviewed changes

remove file size check

7c0e500

coderabbitai bot reviewed Dec 18, 2024

View reviewed changes

components/core/src/clp_s/JsonParser.cpp Show resolved Hide resolved

components/core/src/clp_s/JsonParser.cpp Show resolved Hide resolved

components/core/src/clp_s/JsonParser.cpp Show resolved Hide resolved

components/core/src/clp_s/JsonParser.cpp Show resolved Hide resolved

AVMatthews requested a review from gibber9809 December 19, 2024 15:18

gibber9809 approved these changes Dec 19, 2024

View reviewed changes

AVMatthews changed the title ~~feat(clp s): IRv2 ingestion to archive functionality~~ feat(clp-s): Add support for kv-pair-IR ingestion. Dec 19, 2024

gibber9809 merged commit 81d627b into y-scope:main Dec 19, 2024
21 checks passed

coderabbitai bot mentioned this pull request Dec 19, 2024

feat(clp-s): Add support for ingesting logs from S3. #639

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(clp-s): Add support for kv-pair-IR ingestion. #630

feat(clp-s): Add support for kv-pair-IR ingestion. #630

AVMatthews commented Dec 9, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 9, 2024 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

gibber9809 left a comment

gibber9809 Dec 10, 2024

gibber9809 Dec 10, 2024

AVMatthews Dec 10, 2024

gibber9809 Dec 10, 2024

gibber9809 Dec 16, 2024

coderabbitai bot left a comment

coderabbitai bot Dec 14, 2024

coderabbitai bot Dec 14, 2024

coderabbitai bot left a comment

coderabbitai bot Dec 16, 2024

gibber9809 left a comment

gibber9809 Dec 16, 2024

AVMatthews Dec 17, 2024

gibber9809 Dec 18, 2024

coderabbitai bot left a comment

gibber9809 left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

gibber9809 left a comment

feat(clp-s): Add support for kv-pair-IR ingestion. #630

feat(clp-s): Add support for kv-pair-IR ingestion. #630

Conversation

AVMatthews commented Dec 9, 2024 • edited by coderabbitai bot Loading

Description

Validation performed

Summary by CodeRabbit

coderabbitai bot commented Dec 9, 2024 • edited Loading

Walkthrough

Changes

Possibly related PRs

Suggested reviewers

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

gibber9809 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Dec 14, 2024

Choose a reason for hiding this comment

coderabbitai bot Dec 14, 2024

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Dec 16, 2024

Choose a reason for hiding this comment

gibber9809 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

gibber9809 left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

gibber9809 left a comment

Choose a reason for hiding this comment

AVMatthews commented Dec 9, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 9, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)