Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(clp-s): Add support for kv-pair-IR ingestion. #630

Conversation

AVMatthews
Copy link
Contributor

@AVMatthews AVMatthews commented Dec 9, 2024

Description

This pull request is the continuation or PR #618, adding the functionality promised. This adds the corresponding ingestion functionality to compress IRv2 into archive.

  • Add additional public function to JsonParser to expose parsing from IR; parse_from_ir()
  • Add an class implementing the clp::ffi::ir_stream::IrUnitHandlerInterface for use deserializing IRv2 files; IrUnitHandler
  • Add function to iterate over pairs in the KeyValuePairLogEvent converting the log event into the correct format for archive; parse_kv_log_event()
  • Add two helper functions to help convert IR schema nodes into archive schema nodes:
    • Find the archive schema node type based on the IR schema node type and the value of the node in the current log event; get_archive_node_type
    • Find or add the correct archive schema node id for a given IR schema node id and correct archive schema node type for the value of the node in the current log event; get_archive_node_id

Validation performed

For our posted JSON Datasets:

  • Generated the corresponding IRv2 file

  • Compress the IRv2 file to archive

    • ./clp-s c <archive dir> <ir file or directory w/ ir file> --file-type kv-ir
  • Extract archive back to JSON

    • ./clp-s x <archive dir> <output JSON dir>
  • Sort keys and compare to JSON generated by clp-s from compressing and extracting directly from original JSON (spark, elasticsearch, and postgresql)

    • jq -S -c '.' <output JSON dir>/original > out.json
      • You can sort the json order if not using built in log ordering
      • jq -S -c '.' <output JSON dir>/original | sort > out_sorted.json
    • diff -u out.json clp-s_out.json
      • To generate clp-s_out.json:
        • ./clp-s c <archive dir> <original JSON>
        • ./clp-s x <archive dir> <output JSON dir>
        • jq -S -c '.' <output JSON dir>/original > clp-s_out.json
  • Spot checked match of mongodb and cockroach database

  • Notes:

    • I noticed some differences in numbers of significant digits for floats between both clp-s and the IRv2 to archive, and the original JSON.
    • ex.) original: 1.0; clp_s: 1.00000; IRv2: 1.00000
    • ex.) original: 0.9758767; clp_s: 0.9758767; IRv2: 0.975877

Summary by CodeRabbit

  • New Features

    • Enhanced JsonParser to support parsing of Key Value IR Streams.
    • Introduced new functionalities for handling log events and schema trees.
    • Improved error handling in the compress function for parsing Key Value IR files.
    • Added validation checks in command-line argument parsing for better user feedback.
  • Bug Fixes

    • Enhanced error logging for parsing failures in the compress function.

Copy link
Contributor

coderabbitai bot commented Dec 9, 2024

Walkthrough

This pull request introduces multiple new source and header files to the CMakeLists.txt for the CLP module, enhancing its capabilities in serialization, deserialization, and data handling. Key modifications are made to the JsonParser class, including the addition of a new IrUnitHandler class and several methods for processing log events and IR units. The compress function is updated to incorporate error handling for parsing Key Value IR files, replacing placeholder comments with functional code. Overall, the changes expand the functionality and robustness of the CLP module.

Changes

File Path Change Summary
components/core/src/clp_s/CMakeLists.txt Added multiple new source and header files to CLP_SOURCES, CLP_S_SOURCES, and CLP_S_SEARCH_SOURCES.
components/core/src/clp_s/JsonParser.cpp Introduced IrUnitHandler class and new methods in JsonParser for handling log events and IR units.
components/core/src/clp_s/JsonParser.hpp Added new public and private methods in JsonParser for parsing and handling key-value pairs and schemas.
components/core/src/clp_s/clp-s.cpp Enhanced error handling in the compress function for parsing Key Value IR files.
components/core/src/clp_s/CommandLineArguments.cpp Improved error handling and validation in parse_arguments method regarding --file-type and --timestamp-key.

Possibly related PRs

Suggested reviewers

  • gibber9809
  • kirkrodrigues

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (2)
components/core/src/clp_s/CMakeLists.txt (1)

Line range hint 11-59: Consider grouping related components into separate CMake targets.

The current structure adds all files to a single source list. Consider organizing related components (IR stream, key-value processing, compression) into separate CMake targets for better modularity and dependency management.

Example restructuring:

# IR Stream component
add_library(ir_stream STATIC
  ../clp/ffi/ir_stream/decoding_methods.cpp
  ../clp/ffi/ir_stream/encoding_methods.cpp
  # ... other IR stream files
)

# Key-Value processing component
add_library(kv_processing STATIC
  ../clp/ffi/KeyValuePairLogEvent.cpp
  ../clp/ffi/SchemaTree.cpp
  # ... other KV files
)

# Update main target to use these components
target_link_libraries(clp-s
  PRIVATE
  ir_stream
  kv_processing
  # ... other dependencies
)
components/core/src/clp_s/clp-s.cpp (1)

118-121: Refactor to eliminate code duplication in error handling

Both branches of the if-statement handle parsing errors identically. Consider refactoring to reduce duplication and enhance readability.

Apply this diff to refactor the code:

     bool success = false;
     if (CommandLineArguments::FileType::KeyValueIr == option.input_file_type) {
-        if (false == parser.parse_from_ir()) {
-            SPDLOG_ERROR("Encountered error while parsing input");
-            return false;
-        }
+        success = parser.parse_from_ir();
     } else {
-        if (false == parser.parse()) {
-            SPDLOG_ERROR("Encountered error while parsing input");
-            return false;
-        }
+        success = parser.parse();
     }
+    if (false == success) {
+        SPDLOG_ERROR("Encountered error while parsing input");
+        return false;
+    }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 7176c5e and 914b727.

📒 Files selected for processing (4)
  • components/core/src/clp_s/CMakeLists.txt (2 hunks)
  • components/core/src/clp_s/JsonParser.cpp (2 hunks)
  • components/core/src/clp_s/JsonParser.hpp (4 hunks)
  • components/core/src/clp_s/clp-s.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
components/core/src/clp_s/clp-s.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/JsonParser.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/JsonParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (2)
components/core/src/clp_s/JsonParser.hpp (1)
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-11-10T16:46:53.300Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
components/core/src/clp_s/JsonParser.cpp (1)
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-11-10T16:46:53.300Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
🔇 Additional comments (8)
components/core/src/clp_s/CMakeLists.txt (3)

11-28: Well-organized IR stream and key-value processing components.

The organization of the IR stream and key-value processing components follows a clean separation of concerns and maintains consistent naming conventions. The structure aligns well with the PR objectives for IRv2 ingestion functionality.


Line range hint 11-59: Consider adding corresponding test files.

While the implementation files are well-organized, I notice that corresponding test files aren't included in the CMake configuration. Consider adding test files for:

  • IR stream serialization/deserialization
  • Key-value pair processing
  • Schema handling
  • UTF-8 utilities

Would you like me to help create a template for these test files or open a GitHub issue to track this task?


54-55: Verify ZSTD compression integration with IR stream handling.

The addition of ZSTD compression components alongside IR stream handling suggests a dependency between these features. Please ensure that the IR stream serialization properly integrates with the ZSTD compression.

Also applies to: 11-21

✅ Verification successful

ZSTD compression is properly integrated with IR stream handling

The verification shows that ZSTD compression is already properly integrated with IR stream handling:

  • LogEventSerializer uses ZSTD compression for writing IR data through m_zstd_compressor
  • The test suite confirms the integration by using Decompressor to read the ZSTD-compressed IR stream
  • The implementation follows a proper lifecycle (open, write, close) for compression handling
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for ZSTD compression usage in IR stream handling

# Search for ZSTD compression usage in IR stream files
echo "Checking ZSTD integration..."
rg -l "zstd|ZSTD" ../clp/ffi/ir_stream/

Length of output: 170


Script:

#!/bin/bash
# Let's first find the correct paths for IR stream and ZSTD files
echo "Finding IR stream and ZSTD related files..."
fd -t f "ir_stream|IR_Stream|zstd|ZSTD" --full-path

# Then check for any connections between IR stream and ZSTD compression
echo -e "\nChecking for ZSTD usage in IR stream related files..."
rg -l "zstd|ZSTD|Zstd" $(fd -t f "ir_stream|IR_Stream")

# Look for IR stream mentions in ZSTD files
echo -e "\nChecking for IR stream references in ZSTD files..."
rg -l "ir_stream|IR_Stream" $(fd -t f "zstd|ZSTD")

Length of output: 9711


Script:

#!/bin/bash
# Let's examine the actual integration points between IR stream and ZSTD
echo "Checking IR serialization code for compression usage..."
rg -C 3 "Compress|Decompress|zstd|ZSTD" components/core/src/clp/ir/LogEventSerializer.cpp components/core/src/clp/ir/LogEventDeserializer.cpp

echo -e "\nChecking test files for IR and ZSTD integration..."
rg -C 3 "Compress|Decompress|zstd|ZSTD" components/core/tests/test-ir_serializer.cpp

Length of output: 2862

components/core/src/clp_s/JsonParser.hpp (5)

4-4: Inclusion of necessary headers

The added headers provide the required declarations for the new functionalities implemented in the class.

Also applies to: 8-8, 15-17


74-79: Addition of parse_from_ir() method

The introduction of the parse_from_ir() public method enhances the JsonParser class by enabling parsing from IR streams, aligning with the PR objectives.


95-107: Definition of get_archive_node_type() method

The get_archive_node_type() method correctly determines the archive node type based on the IR node type and its value, which is essential for accurate data conversion.


108-123: Implementation of get_archive_node_id() method

The get_archive_node_id() method efficiently maps IR node IDs to archive node IDs, utilizing caching to optimize performance.


124-135: Creation of parse_kv_log_event() method

The parse_kv_log_event() method effectively parses key-value log events, facilitating their conversion into the archive format.

components/core/src/clp_s/JsonParser.cpp Outdated Show resolved Hide resolved
components/core/src/clp_s/JsonParser.cpp Outdated Show resolved Hide resolved
components/core/src/clp_s/JsonParser.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an initial set of comments. Still need to go through parse_from_ir onwards, but should be enough comments to iterate on for now.

The other high level question I have is whether the --timestamp-key argument is supported for this ingestion path? If not we should throw in the same way we do if the --structurize-arrays argument is passed for IR ingestion.

[[maybe_unused]] UtcOffset utc_offset_old,
[[maybe_unused]] UtcOffset utc_offset_new
) -> IRErrorCode {
return IRErrorCode::IRErrorCode_Success;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this doesn't actually handle utc offset changes, so shouldn't this return something like ErrorCode_Unsupported instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar questions for the other functions that return success without doing anything. Mostly wondering what the responsibilities/expected behaviour for these functions is supposed to be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deserializer expects a handler like the one laid out in components/core/src/clp/ffi/ir_stream/IrUnitHandlerInterface.hpp. In the test case for the deserializer the unused functions returned success like this, so I just followed suit. I'm not sure what the goal of the functions were. I think it exposed flexibility where if we wanted to handle each schema node one at a time, do something specific with each event, etc we could. Really we'd have to consult Zhihao to see if he had a specific thing in mind. All functions have pathways to get called by the deserializer, so we need to offer them. Is there something I'm supposed to used in the archive or parser with a UTC change?

components/core/src/clp_s/JsonParser.cpp Outdated Show resolved Hide resolved
components/core/src/clp_s/JsonParser.cpp Show resolved Hide resolved
components/core/src/clp_s/JsonParser.cpp Outdated Show resolved Hide resolved
components/core/src/clp_s/JsonParser.cpp Outdated Show resolved Hide resolved
components/core/src/clp_s/JsonParser.cpp Outdated Show resolved Hide resolved
components/core/src/clp_s/JsonParser.cpp Outdated Show resolved Hide resolved
components/core/src/clp_s/JsonParser.cpp Outdated Show resolved Hide resolved
components/core/src/clp_s/JsonParser.cpp Outdated Show resolved Hide resolved
.decode_and_unparse()
.value();
}
auto validated_escaped_encoded_string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, don't want to escape the string here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this and other utf8 validation cases we decided offline to leave it as is for now unless the escaping changes get merged before this PR is complete.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (7)
components/core/src/clp_s/JsonParser.hpp (6)

76-80: Consider enhancing method documentation.

While the documentation is clear, it would be beneficial to add:

  • @throw section documenting potential exceptions
  • Example of expected IR Stream format
  • Description of failure scenarios

97-108: Fix documentation inconsistency with parameters.

The documentation mentions parameters that don't match the method signature:

  • ir_node_type → not in signature
  • node_has_value → not in signature
  • node_value → not in signature

Update documentation to match the actual parameters:

  • tree
  • kv_pair

110-122: Add missing documentation elements.

Please add:

  • @return description
  • Update parameter list to match signature (parent_node_id is in signature but not documented)

136-140: Enhance method documentation.

Consider adding:

  • Description of the parsing process
  • @throw section if applicable
  • Expected format of the KeyValuePairLogEvent

186-187: Add documentation for the new member variable.

Please add a comment explaining:

  • Purpose of the mapping
  • Relationship between IR nodes and archive nodes
  • Why flat_hash_map was chosen over std::unordered_map

76-80: Document error handling behaviour.

Based on previous implementation patterns, please document that errors from kv_log_event_result.error() aside from std::errc::no_message_available and std::errc::result_out_of_range are expected behaviour.

components/core/src/clp_s/JsonParser.cpp (1)

688-749: Consider converting recursive implementation to iterative.

The current recursive implementation could potentially cause stack overflow with deep trees. Additionally, profilers often have trouble attributing runtime to recursive functions.

Consider converting this to an iterative implementation using a stack in a follow-up PR.

Also, use auto const for immutable views:

-    int curr_node_archive_id{constants::cRootNodeId};
+    auto const curr_node_archive_id{constants::cRootNodeId};
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 914b727 and f76da3d.

📒 Files selected for processing (2)
  • components/core/src/clp_s/JsonParser.cpp (2 hunks)
  • components/core/src/clp_s/JsonParser.hpp (5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
components/core/src/clp_s/JsonParser.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/JsonParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (2)
components/core/src/clp_s/JsonParser.hpp (1)
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-11-10T16:46:53.300Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
components/core/src/clp_s/JsonParser.cpp (3)
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-11-10T16:46:53.300Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
🔇 Additional comments (5)
components/core/src/clp_s/JsonParser.hpp (2)

4-19: LGTM! Includes are well-organized and necessary.

The new includes are properly ordered and support the added IR functionality.


36-36: LGTM! Using directive is appropriately scoped.

The using directive is specific to the needed type and well-placed after includes.

components/core/src/clp_s/JsonParser.cpp (3)

3-33: LGTM! Well-organized includes and namespace declarations.

The includes are logically grouped and the using declarations improve code readability.


35-75: LGTM! Clean implementation of IrUnitHandlerInterface.

The class provides a clear implementation of the interface with appropriate documentation. The empty implementations returning success are consistent with the test cases and expected behaviour.


842-919: LGTM! Well-structured IR parsing implementation.

The method provides comprehensive error handling and follows a clear flow of operations. The cleanup of resources is properly handled in error cases.

Comment on lines 751 to 840
void JsonParser::parse_kv_log_event(KeyValuePairLogEvent const& kv) {
clp::ffi::SchemaTree const& tree = kv.get_schema_tree();
for (auto const& pair : kv.get_node_id_value_pairs()) {
NodeType const archive_node_type = get_archive_node_type(tree, pair);
auto const node_id = get_archive_node_id(pair.first, archive_node_type, tree);

switch (archive_node_type) {
case NodeType::Integer: {
auto const i64_value
= pair.second.value().get_immutable_view<clp::ffi::value_int_t>();
m_current_parsed_message.add_value(node_id, i64_value);
} break;
case NodeType::Float: {
auto const d_value
= pair.second.value().get_immutable_view<clp::ffi::value_float_t>();
m_current_parsed_message.add_value(node_id, d_value);
} break;
case NodeType::Boolean: {
auto const b_value
= pair.second.value().get_immutable_view<clp::ffi::value_bool_t>();
m_current_parsed_message.add_value(node_id, b_value);
} break;
case NodeType::VarString: {
auto const validated_escaped_string = clp::ffi::validate_and_escape_utf8_string(
pair.second.value().get_immutable_view<std::string>()
);
std::string str;
if (validated_escaped_string.has_value()) {
str = validated_escaped_string.value();
} else {
SPDLOG_ERROR(
"String is not utf8 compliant: \"{}\"",
pair.second.value().get_immutable_view<std::string>()
);
throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
}
m_current_parsed_message.add_value(node_id, str);
} break;
case NodeType::ClpString: {
std::string encoded_str;
std::string decoded_value;
if (pair.second.value().is<clp::ir::EightByteEncodedTextAst>()) {
decoded_value = pair.second.value()
.get_immutable_view<clp::ir::EightByteEncodedTextAst>()
.decode_and_unparse()
.value();

} else {
decoded_value = pair.second.value()
.get_immutable_view<clp::ir::FourByteEncodedTextAst>()
.decode_and_unparse()
.value();
}
auto const validated_escaped_encoded_string
= clp::ffi::validate_and_escape_utf8_string(decoded_value.c_str());
if (validated_escaped_encoded_string.has_value()) {
encoded_str = validated_escaped_encoded_string.value();
} else {
SPDLOG_ERROR("Encoded string is not utf8 compliant: \"{}\"", decoded_value);
throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
}
m_current_parsed_message.add_value(node_id, encoded_str);
} break;
case NodeType::UnstructuredArray: {
std::string array_str;
if (pair.second.value().is<clp::ir::EightByteEncodedTextAst>()) {
array_str = pair.second.value()
.get_immutable_view<clp::ir::EightByteEncodedTextAst>()
.decode_and_unparse()
.value();
} else {
array_str = pair.second.value()
.get_immutable_view<clp::ir::FourByteEncodedTextAst>()
.decode_and_unparse()
.value();
}
m_current_parsed_message.add_value(node_id, array_str);
break;
}
default:
// Don't need to add value for obj or null
break;
}
m_current_schema.insert_ordered(node_id);
}

int32_t const current_schema_id = m_archive_writer->add_schema(m_current_schema);
m_current_parsed_message.set_id(current_schema_id);
m_archive_writer->append_message(current_schema_id, m_current_schema, m_current_parsed_message);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Simplify string handling and remove unnecessary UTF-8 validation.

The method contains unnecessary UTF-8 validation and string escaping. These operations should be handled during compression.

Apply these changes to simplify string handling:

  1. Remove UTF-8 validation in VarString case
  2. Remove UTF-8 validation in ClpString case
  3. Use auto const for immutable views
             case NodeType::VarString: {
-                auto const validated_escaped_string = clp::ffi::validate_and_escape_utf8_string(
-                        pair.second.value().get_immutable_view<std::string>()
-                );
-                std::string str;
-                if (validated_escaped_string.has_value()) {
-                    str = validated_escaped_string.value();
-                } else {
-                    SPDLOG_ERROR(
-                            "String is not utf8 compliant: \"{}\"",
-                            pair.second.value().get_immutable_view<std::string>()
-                    );
-                    throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
-                }
+                auto const& str = pair.second.value().get_immutable_view<std::string>();
                 m_current_parsed_message.add_value(node_id, str);
             } break;

Committable suggestion skipped: line range outside the PR's diff.

components/core/src/clp_s/JsonParser.cpp Show resolved Hide resolved
Comment on lines 624 to 661
auto JsonParser::get_archive_node_type(
clp::ffi::SchemaTree const& tree,
std::pair<clp::ffi::SchemaTree::Node::id_t, std::optional<clp::ffi::Value>> const& kv_pair
) -> NodeType {
clp::ffi::SchemaTree::Node const& tree_node = tree.get_node(kv_pair.first);
clp::ffi::SchemaTree::Node::Type const ir_node_type = tree_node.get_type();
bool const node_has_value = kv_pair.second.has_value();
std::optional<clp::ffi::Value> node_value{};
if (node_has_value) {
node_value = kv_pair.second.value();
}
switch (ir_node_type) {
case clp::ffi::SchemaTree::Node::Type::Int:
return NodeType::Integer;
case clp::ffi::SchemaTree::Node::Type::Float:
return NodeType::Float;
case clp::ffi::SchemaTree::Node::Type::Bool:
return NodeType::Boolean;
case clp::ffi::SchemaTree::Node::Type::UnstructuredArray:
return NodeType::UnstructuredArray;
case clp::ffi::SchemaTree::Node::Type::Str:
if (node_value && node_value->is<std::string>()) {
return NodeType::VarString;
} else {
return NodeType::ClpString;
}
case clp::ffi::SchemaTree::Node::Type::Obj:
if (node_has_value) {
if (node_value->is_null()) {
return NodeType::NullValue;
}
}
return NodeType::Object;
default:
SPDLOG_ERROR("Unknown IR Node Type Detected");
throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Simplify function structure and improve error handling.

Consider these improvements:

  1. Simplify the function by directly returning types in the switch cases
  2. Return NodeType::Unknown for unknown IR node types instead of throwing an error

Apply this diff to simplify the function:

 auto JsonParser::get_archive_node_type(
         clp::ffi::SchemaTree const& tree,
         std::pair<clp::ffi::SchemaTree::Node::id_t, std::optional<clp::ffi::Value>> const& kv_pair
 ) -> NodeType {
     clp::ffi::SchemaTree::Node const& tree_node = tree.get_node(kv_pair.first);
     clp::ffi::SchemaTree::Node::Type const ir_node_type = tree_node.get_type();
-    bool const node_has_value = kv_pair.second.has_value();
-    std::optional<clp::ffi::Value> node_value{};
-    if (node_has_value) {
-        node_value = kv_pair.second.value();
-    }
+    auto const& node_value = kv_pair.second;
     switch (ir_node_type) {
         case clp::ffi::SchemaTree::Node::Type::Int:
             return NodeType::Integer;
         case clp::ffi::SchemaTree::Node::Type::Float:
             return NodeType::Float;
         case clp::ffi::SchemaTree::Node::Type::Bool:
             return NodeType::Boolean;
         case clp::ffi::SchemaTree::Node::Type::UnstructuredArray:
             return NodeType::UnstructuredArray;
         case clp::ffi::SchemaTree::Node::Type::Str:
             if (node_value && node_value->is<std::string>()) {
                 return NodeType::VarString;
             }
             return NodeType::ClpString;
         case clp::ffi::SchemaTree::Node::Type::Obj:
             if (node_value && node_value->is_null()) {
                 return NodeType::NullValue;
             }
             return NodeType::Object;
         default:
-            SPDLOG_ERROR("Unknown IR Node Type Detected");
-            throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
+            return NodeType::Unknown;
     }
 }

Committable suggestion skipped: line range outside the PR's diff.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (4)
components/core/src/clp_s/JsonParser.cpp (4)

624-661: 🛠️ Refactor suggestion

Simplify the get_archive_node_type function for clarity

Consider refactoring the get_archive_node_type function to directly return the node type in each case. This simplifies the function, enhances readability, and aligns with best practices.

Apply this diff to simplify the function:

auto JsonParser::get_archive_node_type(
        clp::ffi::SchemaTree const& tree,
        std::pair<clp::ffi::SchemaTree::Node::id_t, std::optional<clp::ffi::Value>> const& kv_pair
) -> NodeType {
    clp::ffi::SchemaTree::Node const& tree_node = tree.get_node(kv_pair.first);
    clp::ffi::SchemaTree::Node::Type const ir_node_type = tree_node.get_type();
    auto const& node_value = kv_pair.second;

    switch (ir_node_type) {
        case clp::ffi::SchemaTree::Node::Type::Int:
            return NodeType::Integer;
        case clp::ffi::SchemaTree::Node::Type::Float:
            return NodeType::Float;
        case clp::ffi::SchemaTree::Node::Type::Bool:
            return NodeType::Boolean;
        case clp::ffi::SchemaTree::Node::Type::UnstructuredArray:
            return NodeType::UnstructuredArray;
        case clp::ffi::SchemaTree::Node::Type::Str:
            if (node_value && node_value->is<std::string>()) {
                return NodeType::VarString;
            } else {
                return NodeType::ClpString;
            }
        case clp::ffi::SchemaTree::Node::Type::Obj:
            if (node_value && node_value->is_null()) {
                return NodeType::NullValue;
            } else {
                return NodeType::Object;
            }
        default:
            return NodeType::Unknown;
    }
}

663-686: 🛠️ Refactor suggestion

Remove unnecessary UTF-8 validation in add_node_to_archive_and_translations

UTF-8 validation is unnecessary at this stage and should be handled during compression. Removing this validation simplifies the code and reduces overhead.

Apply this diff to remove the UTF-8 validation:

auto JsonParser::add_node_to_archive_and_translations(
        uint32_t ir_node_id,
        clp::ffi::SchemaTree::Node const& ir_node_to_add,
        NodeType archive_node_type,
        int32_t parent_node_id
) -> int {
-    auto validated_escaped_key
-            = clp::ffi::validate_and_escape_utf8_string(ir_node_to_add.get_key_name());
-    std::string node_key;
-    if (validated_escaped_key.has_value()) {
-        node_key = validated_escaped_key.value();
-    } else {
-        SPDLOG_ERROR("Key is not UTF-8 compliant: \"{}\"", ir_node_to_add.get_key_name());
-        throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
-    }
+    std::string const& node_key = ir_node_to_add.get_key_name();

    int const curr_node_archive_id
            = m_archive_writer->add_node(parent_node_id, archive_node_type, node_key);

    m_ir_node_to_archive_node_id_mapping.emplace(
            std::make_pair(ir_node_id, archive_node_type),
            curr_node_archive_id
    );
    return curr_node_archive_id;
}

774-786: 🛠️ Refactor suggestion

Eliminate unnecessary UTF-8 validation in parse_kv_log_event for VarString nodes

UTF-8 validation is not required here and can be safely removed. This change reduces redundancy and enhances performance.

Apply this diff:

case NodeType::VarString: {
-    auto const validated_escaped_string = clp::ffi::validate_and_escape_utf8_string(
-            pair.second.value().get_immutable_view<std::string>()
-    );
-    std::string str;
-    if (validated_escaped_string.has_value()) {
-        str = validated_escaped_string.value();
-    } else {
-        SPDLOG_ERROR(
-                "String is not utf8 compliant: \"{}\"",
-                pair.second.value().get_immutable_view<std::string>()
-        );
-        throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
-    }
+    auto const& str = pair.second.value().get_immutable_view<std::string>();
    m_current_parsed_message.add_value(node_id, str);
} break;

790-813: 🛠️ Refactor suggestion

Remove unnecessary UTF-8 validation for encoded strings in parse_kv_log_event

UTF-8 validation of encoded strings is not needed at this point and should be deferred to the compression phase. Simplifying this code improves efficiency.

Apply this diff:

case NodeType::ClpString: {
    std::string encoded_str;
    std::string decoded_value;
    if (pair.second.value().is<clp::ir::EightByteEncodedTextAst>()) {
        decoded_value = pair.second.value()
                                            .get_immutable_view<clp::ir::EightByteEncodedTextAst>()
                                            .decode_and_unparse()
                                            .value();

    } else {
        decoded_value = pair.second.value()
                                            .get_immutable_view<clp::ir::FourByteEncodedTextAst>()
                                            .decode_and_unparse()
                                            .value();
    }
-    auto const validated_escaped_encoded_string
-            = clp::ffi::validate_and_escape_utf8_string(decoded_value.c_str());
-    if (validated_escaped_encoded_string.has_value()) {
-        encoded_str = validated_escaped_encoded_string.value();
-    } else {
-        SPDLOG_ERROR("Encoded string is not utf8 compliant: \"{}\"", decoded_value);
-        throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
-    }
+    encoded_str = decoded_value;
    m_current_parsed_message.add_value(node_id, encoded_str);
} break;
🧹 Nitpick comments (1)
components/core/src/clp_s/JsonParser.cpp (1)

728-728: Adhere to coding guidelines by replacing not with false ==

Per the coding guidelines, prefer false == <expression> rather than !<expression>. Replace while (not ir_id_stack.empty()) with while (false == ir_id_stack.empty()) for consistency.

Apply this diff:

-    while (not ir_id_stack.empty()) {
+    while (false == ir_id_stack.empty()) {
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f76da3d and 860e784.

📒 Files selected for processing (1)
  • components/core/src/clp_s/JsonParser.cpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp_s/JsonParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)
components/core/src/clp_s/JsonParser.cpp (3)
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-11-10T16:46:53.300Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

Comment on lines 898 to 901
SPDLOG_ERROR("ERROR: Encountered error while parsing a kv log event");
zd.close();
return false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Avoid unnecessary error logging in parse_from_ir

Encountering errors in kv_log_event_result.error() is anticipated behaviour and does not require additional logging. Removing this error log prevents cluttering the logs with expected messages.

Apply this diff:

} catch (...) {
-    SPDLOG_ERROR("ERROR: Encountered error while parsing a kv log event");
    zd.close();
    return false;
}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
SPDLOG_ERROR("ERROR: Encountered error while parsing a kv log event");
zd.close();
return false;
}
zd.close();
return false;
}

Copy link
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly smaller changes this time.

Still want to confirm about the --timestamp-key behaviour I asked about in the last round of review.

Also wondering how we're tracking original size of the IR stream for metadata purposes. I didn't see anything in our code, and I'm guessing that the ir stream itself doesn't provide anything, but we should probably at least use an estimate like number of decompressed bytes consumed by the Decompressor or something.

components/core/src/clp_s/JsonParser.hpp Outdated Show resolved Hide resolved
components/core/src/clp_s/JsonParser.hpp Outdated Show resolved Hide resolved
components/core/src/clp_s/JsonParser.hpp Outdated Show resolved Hide resolved
components/core/src/clp_s/JsonParser.hpp Outdated Show resolved Hide resolved

auto JsonParser::parse_from_ir() -> bool {
for (auto& file_path : m_file_paths) {
auto const fsize = std::filesystem::file_size(file_path);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we do this because the Deserializer treats empty streams as invalid streams? I can see why this makes sense in that case, but it might be simpler to just let it fail on empty streams.

Up to you whether you want to change this or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I could also just continue onto the next file rather than returning. What do you think? I'll leave it as is for now, but would be a quick change to remove or to change to continue instead of return.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove it then.

components/core/src/clp_s/JsonParser.cpp Outdated Show resolved Hide resolved
components/core/src/clp_s/JsonParser.cpp Outdated Show resolved Hide resolved
components/core/src/clp_s/JsonParser.cpp Outdated Show resolved Hide resolved
components/core/src/clp_s/JsonParser.cpp Outdated Show resolved Hide resolved
components/core/src/clp_s/JsonParser.cpp Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (3)
components/core/src/clp_s/JsonParser.cpp (3)

665-673: 🛠️ Refactor suggestion

Remove unnecessary UTF-8 validation and exception throwing

Based on previous discussions and learnings, the validation and exception throwing for UTF-8 compliance of ir_node_to_add.get_key_name() are unnecessary here and should be omitted.

Apply this diff to simplify the code:

-auto validated_escaped_key
-        = clp::ffi::validate_and_escape_utf8_string(ir_node_to_add.get_key_name());
-std::string node_key;
-if (validated_escaped_key.has_value()) {
-    node_key = validated_escaped_key.value();
-} else {
-    SPDLOG_ERROR("Key is not UTF-8 compliant: \"{}\"", ir_node_to_add.get_key_name());
-    throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
-}
+std::string const& node_key = ir_node_to_add.get_key_name();

770-782: 🛠️ Refactor suggestion

Eliminate redundant UTF-8 validation and exception handling

UTF-8 validation and exception throwing for string values are unnecessary at this stage and should be removed to simplify the code.

Apply this diff to remove the validation:

-auto const validated_escaped_string = clp::ffi::validate_and_escape_utf8_string(
-        pair.second.value().get_immutable_view<std::string>()
-);
-std::string str;
-if (validated_escaped_string.has_value()) {
-    str = validated_escaped_string.value();
-} else {
-    SPDLOG_ERROR(
-            "String is not utf8 compliant: \"{}\"",
-            pair.second.value().get_immutable_view<std::string>()
-    );
-    throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
-}
+auto const& str = pair.second.value().get_immutable_view<std::string>();

801-807: 🛠️ Refactor suggestion

Remove unnecessary validation and exception throwing for encoded strings

The UTF-8 validation and exception handling for encoded strings are unnecessary and can be safely removed.

Apply this diff to simplify the code:

-auto const validated_escaped_encoded_string
-        = clp::ffi::validate_and_escape_utf8_string(decoded_value.c_str());
-if (validated_escaped_encoded_string.has_value()) {
-    encoded_str = validated_escaped_encoded_string.value();
-} else {
-    SPDLOG_ERROR("Encoded string is not utf8 compliant: \"{}\"", decoded_value);
-    throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
-}
+std::string const& encoded_str = decoded_value;
🧹 Nitpick comments (3)
components/core/src/clp_s/JsonParser.cpp (3)

892-899: Simplify exception handling in parse_from_ir

The catch block rethrows the exception after logging and clean-up, which may be unnecessary. Consider allowing the exception to propagate naturally or handle it differently to reduce code complexity.

Consider modifying the code as follows:

try {
    parse_kv_log_event(*kv_log_event);
} catch (std::exception const& e) {
-    SPDLOG_ERROR("Encountered error while parsing a kv log event - {}", e.what());
-    decompressor.try_get_pos(curr_pos);
-    m_archive_writer->increment_uncompressed_size(curr_pos - last_pos);
-    decompressor.close();
-    return false;
+    throw;
}

650-652: Simplify condition by relying on node_value

Since node_value is only meaningful when it has a value, you can simplify the condition by checking node_value directly.

Modify the condition as follows:

-if (node_has_value && node_value.is_null()) {
+if (node_value && node_value->is_null()) {

631-634: Avoid unnecessary default initialization of node_value

Initializing node_value when it may not be used can be avoided. Use a pointer or reference to the optional value to simplify the code.

Consider changing the code to:

-clp::ffi::Value node_value{};
-if (node_has_value) {
-    node_value = kv_pair.second.value();
-}
+auto const& node_value = kv_pair.second;

Then adjust the subsequent code to handle node_value appropriately.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 860e784 and 75bb9e4.

📒 Files selected for processing (3)
  • components/core/src/clp_s/CommandLineArguments.cpp (1 hunks)
  • components/core/src/clp_s/JsonParser.cpp (2 hunks)
  • components/core/src/clp_s/JsonParser.hpp (5 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
components/core/src/clp_s/JsonParser.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/CommandLineArguments.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/JsonParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)
components/core/src/clp_s/JsonParser.cpp (2)
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
🔇 Additional comments (2)
components/core/src/clp_s/CommandLineArguments.cpp (1)

277-284: LGTM! Input validation enhancement.

The added validation ensures that the --timestamp-key option is not used with kv-ir file type, preventing potential issues during processing. The error message is clear and follows the existing error handling pattern.

components/core/src/clp_s/JsonParser.hpp (1)

98-103: Update function documentation to match parameters

The documentation for get_archive_node_type does not align with its current parameters. Please update the function comments to accurately reflect the parameters and their descriptions.

@gibber9809 gibber9809 self-requested a review December 18, 2024 16:53
Copy link
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more minor changes. Just confirming whether it is required to support UtcOffsetChange/expected behaviour for handling it. Should be good to go after that.

components/core/src/clp_s/JsonParser.cpp Outdated Show resolved Hide resolved
components/core/src/clp_s/JsonParser.cpp Outdated Show resolved Hide resolved
components/core/src/clp_s/JsonParser.cpp Outdated Show resolved Hide resolved
components/core/src/clp_s/JsonParser.cpp Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
components/core/src/clp_s/JsonParser.cpp (3)

624-657: Return a safer fallback instead of throwing an exception for unknown IR node types
Throwing an exception here might cause the ingestion process to terminate abruptly for unrecognized node types. Consider returning a meaningful fallback node type (e.g., NodeType::Unknown) or providing a more precise error message that indicates the node type in question.


684-745: Refine the iterative construction of archive nodes
The logic here recursively reconstructs the path from the IR node to the root. Although it is correct, consider converting the approach to a more straightforward iterative method that stops when you first find a known ancestor. This can reduce function complexity and improve readability.


838-844: Consider skipping empty files without calling close()
Currently, encountering an empty file leads to closing the archive writer and returning false. You can opt to skip empty files gracefully, continuing ingestion for the rest of the files. This ensures partial successes for other valid files.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 75bb9e4 and 51be440.

📒 Files selected for processing (1)
  • components/core/src/clp_s/JsonParser.cpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp_s/JsonParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)
components/core/src/clp_s/JsonParser.cpp (3)
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-11-10T16:46:53.300Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
🔇 Additional comments (4)
components/core/src/clp_s/JsonParser.cpp (4)

659-673: Remove unnecessary UTF-8 validation for node keys
Validation and exception throwing for UTF-8 compliance have been flagged as unnecessary for IR nodes. As previously discussed, the IR logic or compression layer should handle invalid UTF-8. This local check increases code complexity and can be removed.

-    auto validated_escaped_key
-            = clp::ffi::validate_and_escape_utf8_string(ir_node_to_add.get_key_name());
-    std::string node_key;
-    if (validated_escaped_key.has_value()) {
-        node_key = validated_escaped_key.value();
-    } else {
-        SPDLOG_ERROR("Key is not UTF-8 compliant: \"{}\"", ir_node_to_add.get_key_name());
-        throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
-    }
+    std::string const& node_key = ir_node_to_add.get_key_name();

747-782: Remove unnecessary UTF-8 validation for VarString
Mirroring previous learning, local UTF-8 validation is unwarranted in the IR ingestion stage. Eliminate the validation step to reduce code repetition and avoid unexpected failures for borderline cases.

-    auto const validated_escaped_string = clp::ffi::validate_and_escape_utf8_string(
-            pair.second.value().get_immutable_view<std::string>()
-    );
-    std::string str;
-    if (validated_escaped_string.has_value()) {
-        str = validated_escaped_string.value();
-    } else {
-        SPDLOG_ERROR(
-                "String is not utf8 compliant: \"{}\"",
-                pair.second.value().get_immutable_view<std::string>()
-        );
-        throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
-    }
-    m_current_parsed_message.add_value(node_id, str);
+    auto const& str = pair.second.value().get_immutable_view<std::string>();
+    m_current_parsed_message.add_value(node_id, str);

789-807: Remove unnecessary UTF-8 validation for ClpString
Similar to the VarString branch, the ClpString logic should avoid performing UTF-8 checks. If needed, handle them closer to the data extraction step rather than here.

-    auto const validated_escaped_encoded_string
-            = clp::ffi::validate_and_escape_utf8_string(decoded_value.c_str());
-    if (validated_escaped_encoded_string.has_value()) {
-        encoded_str = validated_escaped_encoded_string.value();
-    } else {
-        SPDLOG_ERROR("Encoded string is not utf8 compliant: \"{}\"", decoded_value);
-        throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
-    }
-    m_current_parsed_message.add_value(node_id, encoded_str);
+    m_current_parsed_message.add_value(node_id, decoded_value);

915-921: Enhance support for unhandled IR unit types
When encountering IR units that are neither log events nor schema node insertions, you return false. If partial ingestion is acceptable, consider logging a warning and skipping the unhandled unit. If partial ingestion is undesirable, this logic is fine as-is.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
components/core/src/clp_s/JsonParser.cpp (1)

683-744: Consider rewriting recursion as iteration

The recursive implementation could lead to stack overflow with deep trees and may impact performance. Consider rewriting this function iteratively using a stack.

This change would:

  1. Improve performance
  2. Avoid potential stack overflow issues
  3. Make profiling more accurate

Would you like me to provide an iterative implementation in a follow-up PR?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 51be440 and 7c0e500.

📒 Files selected for processing (1)
  • components/core/src/clp_s/JsonParser.cpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp_s/JsonParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)
components/core/src/clp_s/JsonParser.cpp (3)
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-11-10T16:46:53.300Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
🔇 Additional comments (2)
components/core/src/clp_s/JsonParser.cpp (2)

664-672: 🛠️ Refactor suggestion

Remove unnecessary UTF-8 validation

Per previous discussions, UTF-8 validation should be handled during compression, not at this stage.

Apply this diff:

-    auto validated_escaped_key
-            = clp::ffi::validate_and_escape_utf8_string(ir_node_to_add.get_key_name());
-    std::string node_key;
-    if (validated_escaped_key.has_value()) {
-        node_key = validated_escaped_key.value();
-    } else {
-        SPDLOG_ERROR("Key is not UTF-8 compliant: \"{}\"", ir_node_to_add.get_key_name());
-        throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
-    }
+    std::string const& node_key = ir_node_to_add.get_key_name();

Likely invalid or redundant comment.


654-655: ⚠️ Potential issue

Return NodeType::Unknown for unknown IR node types

Instead of throwing an error, return NodeType::Unknown to allow better error handling at the caller's side.

Apply this diff:

-            throw OperationFailed(ErrorCodeFailure, __FILENAME__, __LINE__);
+            return NodeType::Unknown;

Likely invalid or redundant comment.

components/core/src/clp_s/JsonParser.cpp Show resolved Hide resolved
components/core/src/clp_s/JsonParser.cpp Show resolved Hide resolved
components/core/src/clp_s/JsonParser.cpp Show resolved Hide resolved
components/core/src/clp_s/JsonParser.cpp Show resolved Hide resolved
Copy link
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. For PR title how about feat(clp-s): Add support for kv-pair-IR ingestion.?

@AVMatthews AVMatthews changed the title feat(clp s): IRv2 ingestion to archive functionality feat(clp-s): Add support for kv-pair-IR ingestion. Dec 19, 2024
@gibber9809 gibber9809 merged commit 81d627b into y-scope:main Dec 19, 2024
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants