-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(clp-s): Add support for ingesting logs from S3. #639
base: main
Are you sure you want to change the base?
Conversation
…rce including network auth
…dd utilities for finding archive Id given a Path
WalkthroughThis pull request introduces significant changes to the CLP (Compressed Log Processing) project, focusing on enhancing input path handling, network authentication, and filesystem operations. The modifications primarily involve transitioning from Boost Filesystem to C++17 standard filesystem, introducing new input configuration management, and updating various components to support more flexible file and archive handling. The changes include adding new classes like Changes
Suggested Reviewers
Possibly Related PRs
Tip CodeRabbit's docstrings feature is now available as part of our Early Access Program! Simply use the command Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 9
🔭 Outside diff range comments (1)
components/core/src/clp_s/JsonFileIterator.hpp (1)
Line range hint
17-26
: Update constructor documentation to reflect parameter changes.The Doxygen comments for the constructor still refer to
file_name
, but the parameter has been changed toreader
. Please update the documentation to match the new parameter.
🧹 Nitpick comments (22)
components/core/src/clp_s/ZstdDecompressor.cpp (2)
6-8
: Consider evaluating std::filesystem's memory mapping capabilitiesWhile the transition to std::filesystem is a positive step, the code still relies on Boost for memory mapping. Consider evaluating whether std::filesystem's memory mapping capabilities could replace boost::iostreams for a more consistent standard library approach.
206-212
: Align condition with coding guidelinesAccording to the coding guidelines, prefer
false == <expression>
over!<expression>
. Consider updating the error check:- if (error_code) { + if (false == !error_code) {components/core/src/clp_s/ReaderUtils.hpp (1)
73-74
: Add documentation for the new method.The method name suggests it might fail, but there's no documentation explaining:
- The purpose of the method
- Expected failure cases
- Return value semantics (when nullptr might be returned)
- Parameter requirements
Add documentation like this:
+ /** + * Attempts to create a reader for the given path with optional network authentication + * @param path The path to the resource (local or network) + * @param network_auth Network authentication options if required + * @return A reader interface if successful, nullptr otherwise + * @throws OperationFailed if the resource exists but cannot be read + */ static std::shared_ptr<clp::ReaderInterface> try_create_reader(Path const& path, NetworkAuthOption const& network_auth);components/core/tests/test-clp_s-end_to_end.cpp (1)
103-112
: Enhance error handling for archive path operations.While the implementation correctly uses the new
Path
struct, consider adding error handling for scenarios where:
- The archive path is inaccessible
- The archive path permissions are insufficient
Consider this improvement:
for (auto const& entry : std::filesystem::directory_iterator(cTestEndToEndArchiveDirectory)) { if (false == entry.is_directory()) { // Skip non-directories continue; } + try { constructor_option.archive_path = clp_s::Path{ .source{clp_s::InputSource::Filesystem}, .path{entry.path().string()} }; clp_s::JsonConstructor constructor{constructor_option}; constructor.store(); + } catch (const std::filesystem::filesystem_error& e) { + FAIL("Failed to process archive: " << e.what()); + } }components/core/src/clp_s/Utils.hpp (1)
48-61
: Consider additional URI edge cases.The method documentation provides good examples, but consider handling these edge cases:
- Empty URIs
- URIs with query parameters but no path
- URIs with fragments
- URIs with encoded characters
Consider adding validation for these cases and updating the documentation with examples:
/** * Gets the last component of a uri. * * For example: * https://www.something.org/abc-xyz -> abc-xyz * https://www.something.org/aaa/bbb/abc-xyz?something=something -> abc-xyz + * https://www.something.org -> "" (empty URI path) + * https://www.something.org?param=value -> "" (no path component) + * https://www.something.org/abc%20xyz -> "abc xyz" (encoded characters) + * https://www.something.org/abc#section -> "abc" (with fragment) * * @param uri * @param name Returned component name * @return true on success, false otherwise */components/core/src/clp_s/CMakeLists.txt (2)
221-222
: Verify version constraints for new dependenciesThe addition of Boost::regex, Boost::url, CURL, and OpenSSL::Crypto libraries requires careful version management:
- Consider adding minimum version requirements for these dependencies
- Document any specific version requirements in the project's README
Consider adding version checks in CMake:
find_package(Boost 1.81 REQUIRED COMPONENTS regex url) find_package(CURL 7.88 REQUIRED) find_package(OpenSSL 3.0 REQUIRED)Also applies to: 228-228
82-83
: Consider separating network configurationThe new InputConfig component appears to handle both local and network paths. Consider:
- Separating network authentication configuration into a dedicated component
- Using the factory pattern for reader creation based on path type
components/core/src/clp_s/InputConfig.hpp (2)
16-16
: Typographical error in comment: "definining" should be "defining"There is a typo in the comment on line 16. The word "definining" should be corrected to "defining".
49-49
: Incomplete parameter documentation in function commentsThe
@param
annotations lack descriptions for the parameters. Please provide brief explanations for each parameter to improve code readability and maintainability.Also applies to: 55-55, 63-63, 72-72, 80-80, 89-89, 98-98
components/core/src/clp_s/ArchiveReader.hpp (1)
34-35
: Update parameter documentation.The parameter documentation should be updated to describe the new parameters:
archive_path
: Path to the archivenetwork_auth
: Network authentication optionscomponents/core/src/clp_s/clp-s.cpp (1)
355-361
: Enhance error message specificity.While the error handling is good, consider making the error message more specific by including the failure reason:
- SPDLOG_ERROR("Failed to open archive - {}", e.what()); + SPDLOG_ERROR("Failed to open archive at {} - {}", archive_path.path, e.what());components/core/src/clp_s/JsonParser.cpp (1)
427-429
: Add specific error message for reader creation failure.While the error handling exists, it would be helpful to log the specific reason for reader creation failure:
auto reader{ReaderUtils::try_create_reader(path, m_network_auth)}; if (nullptr == reader) { + SPDLOG_ERROR("Failed to create reader for path {}", path.path); m_archive_writer->close(); return false; }
components/core/src/clp_s/InputConfig.cpp (3)
11-17
: Consider logging exceptions inget_source_for_path
for better debuggingIn the function
get_source_for_path
, exceptions are caught but not logged. Logging the exception details can assist in diagnosing issues when the function defaults toInputSource::Network
due to an exception.Apply the following diff to add exception logging:
try { return std::filesystem::exists(path) ? InputSource::Filesystem : InputSource::Network; } catch (std::exception const& e) { + Logger::error("Exception occurred while checking path existence: {}", e.what()); return InputSource::Network; }
76-86
: Remove unreachable code after theswitch
statement inget_archive_id_from_path
The
return true;
statement at line 85 is unreachable because all paths in theswitch
statement end with areturn
statement. Removing this line will clean up the code and prevent any confusion.Apply the following diff:
default: return false; } - return true; }
27-47
: Refactor duplicated logic into a shared utility functionThe functions
get_input_files_for_path
andget_input_archives_for_path
share similar logic for handling input paths and populating vectors. Consider refactoring the common code into a shared helper function to reduce duplication and enhance maintainability.Also applies to: 54-74
components/core/src/clp_s/ReaderUtils.cpp (2)
163-180
: Log exceptions when URL signing failsIn
try_sign_url
, exceptions are caught but not logged. Logging exceptions can provide valuable information during debugging.Apply this diff to add error logging:
} catch (std::exception const& e) { + SPDLOG_ERROR("Exception during URL signing: {}", e.what()); return false; } return true; }
206-215
: Add logging for unhandledInputSource
typesWhen
path.source
is neitherFilesystem
norNetwork
, the function returnsnullptr
without any logging. For better debugging and maintenance, consider logging a warning when an unrecognizedInputSource
is encountered.Apply this diff:
} else { + SPDLOG_WARN("Unhandled InputSource type: {}", static_cast<int>(path.source)); return nullptr; } }
components/core/src/clp_s/JsonFileIterator.hpp (1)
6-7
: Remove unnecessary include ofFileReader.hpp
if it's no longer used.Since
FileReader
is no longer utilized in this class, consider removing the include directive forFileReader.hpp
to clean up the code.components/core/src/clp_s/Utils.cpp (2)
68-100
: Consider refactoring constant comparisons for maintainabilityIn
directory_is_multi_file_archive
, the multiple comparisons offormatted_name
against constants can be simplified for better readability and maintainability. Consider using astd::unordered_set
orstd::set
to store the constants and check for membership.Here is a possible refactor:
// Define the set of archive filenames static const std::unordered_set<std::string> archive_files = { constants::cArchiveTimestampDictFile, constants::cArchiveSchemaTreeFile, constants::cArchiveSchemaMapFile, constants::cArchiveVarDictFile, constants::cArchiveLogDictFile, constants::cArchiveArrayDictFile, constants::cArchiveTableMetadataFile, constants::cArchiveTablesFile }; //... if (archive_files.count(formatted_name) > 0) { continue; } else { // ... Remaining code }
91-97
: Avoid using exceptions for control flow when parsing integersUsing exceptions for control flow can be inefficient. Instead of using
std::stoi
inside atry-catch
block, consider checking iffile_name
is numeric before attempting to convert it.Here is a possible refactor:
// Check if file_name is numeric if (std::all_of(file_name.begin(), file_name.end(), ::isdigit)) { // It's a number, continue processing continue; } else { return false; }components/core/src/clp_s/CommandLineArguments.cpp (2)
426-451
: Refactor duplicated code for handlingarchive_path
andarchive_id
The logic for processing
archive_path
andarchive_id
is duplicated in multiple commands (Extract
andSearch
). Refactoring this code into a shared function can improve maintainability and reduce code duplication.Also applies to: 694-719
301-306
: Refactor authentication handling into a common utility functionThe authentication type handling logic is repeated across different command parsing sections (
Compress
,Extract
, andSearch
). Consider abstracting this into a shared utility function to reduce redundancy and enhance code maintainability.Also applies to: 447-451, 715-719
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (23)
components/core/CMakeLists.txt
(2 hunks)components/core/src/clp_s/ArchiveReader.cpp
(1 hunks)components/core/src/clp_s/ArchiveReader.hpp
(2 hunks)components/core/src/clp_s/ArchiveWriter.hpp
(0 hunks)components/core/src/clp_s/CMakeLists.txt
(3 hunks)components/core/src/clp_s/CommandLineArguments.cpp
(15 hunks)components/core/src/clp_s/CommandLineArguments.hpp
(3 hunks)components/core/src/clp_s/InputConfig.cpp
(1 hunks)components/core/src/clp_s/InputConfig.hpp
(1 hunks)components/core/src/clp_s/JsonConstructor.cpp
(3 hunks)components/core/src/clp_s/JsonConstructor.hpp
(2 hunks)components/core/src/clp_s/JsonFileIterator.cpp
(2 hunks)components/core/src/clp_s/JsonFileIterator.hpp
(3 hunks)components/core/src/clp_s/JsonParser.cpp
(7 hunks)components/core/src/clp_s/JsonParser.hpp
(5 hunks)components/core/src/clp_s/ReaderUtils.cpp
(2 hunks)components/core/src/clp_s/ReaderUtils.hpp
(2 hunks)components/core/src/clp_s/Utils.cpp
(2 hunks)components/core/src/clp_s/Utils.hpp
(2 hunks)components/core/src/clp_s/ZstdDecompressor.cpp
(2 hunks)components/core/src/clp_s/clp-s.cpp
(7 hunks)components/core/src/clp_s/search/kql/CMakeLists.txt
(1 hunks)components/core/tests/test-clp_s-end_to_end.cpp
(3 hunks)
💤 Files with no reviewable changes (1)
- components/core/src/clp_s/ArchiveWriter.hpp
🧰 Additional context used
📓 Path-based instructions (19)
components/core/tests/test-clp_s-end_to_end.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/ArchiveReader.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/ZstdDecompressor.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/ArchiveReader.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/JsonParser.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/JsonConstructor.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/JsonConstructor.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/JsonParser.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/JsonFileIterator.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/Utils.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/InputConfig.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/ReaderUtils.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/JsonFileIterator.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/CommandLineArguments.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/clp-s.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/ReaderUtils.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/CommandLineArguments.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/Utils.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/InputConfig.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
🔇 Additional comments (33)
components/core/src/clp_s/ReaderUtils.hpp (2)
4-8
: LGTM! Include directives are well-organized.
The new includes are properly ordered (standard library before project-specific headers) and necessary for the new functionality.
Line range hint 1-99
: Verify the impact of removing get_archives
.
The removal of get_archives
method might affect existing code that relied on listing archives from a directory.
Let's verify the usage and potential impact:
✅ Verification successful
The removal of get_archives
method appears to be safe
The verification shows that:
- There are no direct references to
ReaderUtils::get_archives
in the codebase - The functionality for getting archives is properly handled through
CommandLineArguments::get_archives_dir()
andGlobalSQLiteMetadataDB
- No test files reference this specific method
The removal aligns with the codebase's architecture where archive handling is managed through dedicated components rather than utility classes.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any remaining references to the removed get_archives method
# Search for direct method calls
echo "Searching for direct method calls to get_archives..."
rg "get_archives\(" --type cpp
# Search for potential comments or TODOs referencing the method
echo "Searching for references in comments..."
rg "get_archives" --type cpp
# Search for tests that might need updating
echo "Searching for test files that might need updating..."
fd "test.*\.(cpp|hpp)$" | xargs rg "get_archives"
Length of output: 3638
Script:
#!/bin/bash
# Let's check for any references to ReaderUtils::get_archives specifically
echo "Searching for ReaderUtils::get_archives usage..."
rg "ReaderUtils::get_archives" --type cpp
# Let's also check if there are any references in the test files specifically for ReaderUtils
echo "Searching for ReaderUtils tests..."
fd "test.*\.(cpp|hpp)$" | xargs rg "ReaderUtils"
# Let's check the git history for the specific removal
echo "Checking git history for the removal..."
git log -p --no-merges -S "get_archives" -- "components/core/src/clp_s/ReaderUtils.hpp"
Length of output: 365
components/core/tests/test-clp_s-end_to_end.cpp (2)
12-12
: LGTM!
The new include is properly placed and necessary for the updated path handling functionality.
74-76
: Consider adding test cases for network sources.
While the implementation correctly uses the new Path
struct for filesystem sources, we should also verify the S3 ingestion functionality mentioned in the PR objectives.
Let's check if there are any existing S3-related test cases:
Consider adding test cases that:
- Verify S3 path handling
- Test error scenarios with invalid S3 credentials
- Validate behaviour with non-existent S3 buckets
components/core/src/clp_s/Utils.hpp (3)
8-9
: LGTM: Required includes added and properly ordered.
The new includes for <string_view>
and <vector>
are necessary for the updated method signatures.
20-21
: LGTM: Method name is now more descriptive.
The rename from find_all_files
to find_all_files_in_directory
better describes the method's purpose.
34-45
: LGTM: Well-documented utility method with clear examples.
The get_last_non_empty_path_component
method is thoroughly documented with examples that clarify its behaviour.
components/core/src/clp_s/CommandLineArguments.hpp (3)
15-15
: LGTM: Include statement properly added
The addition of InputConfig.hpp is necessary for the new Path and NetworkAuthOption types.
55-57
: LGTM: Well-designed getter methods
The getter methods are properly implemented with:
- Const correctness
- Efficient return by const reference
- Clear naming that reflects their purpose
181-182
: LGTM: Member variables properly structured
The member variables are well-organized and properly encapsulated. Let's verify the migration from file paths to input paths across the codebase.
✅ Verification successful
Migration from m_file_paths to m_input_paths verified
The codebase shows consistent usage of the new Path type and proper initialization of NetworkAuthOption:
- The old
m_file_paths
member has been completely removed with no remaining references m_input_paths
is consistently used with the Path type across multiple components:- CommandLineArguments.hpp
- JsonParser.hpp/cpp
- InputConfig.hpp/cpp
- NetworkAuthOption is properly initialized with default construction where used
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify consistent usage of Path instead of std::string for paths
# and proper initialization of NetworkAuthOption across the codebase
# Check for any remaining uses of m_file_paths
rg "m_file_paths" --type cpp --type hpp
# Check for consistent use of Path type with m_input_paths
ast-grep --pattern 'std::vector<std::string> $path_var'
# Check for proper initialization of NetworkAuthOption
ast-grep --pattern 'NetworkAuthOption $var;'
Length of output: 172
Script:
#!/bin/bash
# Let's try with correct file extensions and more specific patterns
# Check for any remaining uses of m_file_paths
rg "m_file_paths" -t cpp -t h -t hpp
# Check for path-related vector declarations
rg "std::vector<.*Path.*>" -t cpp -t h -t hpp
# Check NetworkAuthOption usage
rg "NetworkAuthOption" -t cpp -t h -t hpp
# Look for input paths usage
rg "m_input_paths" -t cpp -t h -t hpp
Length of output: 276
Script:
#!/bin/bash
# Search using just the basic extensions and include context
# Check for any remaining uses of m_file_paths
rg "m_file_paths" -t cpp -t h --context 2
# Check for path-related vector declarations
rg "std::vector<.*Path.*>" -t cpp -t h --context 2
# Check NetworkAuthOption usage
rg "NetworkAuthOption" -t cpp -t h --context 2
# Look for input paths usage
rg "m_input_paths" -t cpp -t h --context 2
# Also try finding all header files to understand available extensions
fd -e h -e hpp -e hxx
Length of output: 36707
components/core/src/clp_s/search/kql/CMakeLists.txt (1)
28-28
: LGTM! Removal of Boost::filesystem aligns with C++17 transition.
The change correctly removes the Boost::filesystem dependency since the target already uses C++20 which includes std::filesystem.
components/core/CMakeLists.txt (1)
260-261
: LGTM! Addition of InputConfig files.
The new InputConfig files align with the PR's objective to support network resource handling.
components/core/src/clp_s/CMakeLists.txt (1)
22-23
: Review thread safety of file operations
The addition of FileReader, NetworkReader, and related utilities suggests concurrent operations. Ensure:
- Thread-safe file and network operations
- Proper resource cleanup in error cases
Also applies to: 29-30, 37-38, 43-43, 46-47, 49-49
✅ Verification successful
Thread safety and resource management are properly implemented
The NetworkReader implementation demonstrates robust thread safety and resource management through:
- Proper synchronization using
std::mutex
,std::condition_variable
, andstd::atomic
variables - RAII-based resource management with clear cleanup in the destructor
- Thread-safe buffer pool management with synchronized acquire/release operations
- Proper error handling with try-catch blocks and state management
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for thread safety mechanisms and resource cleanup
ast-grep --pattern 'class $_Reader {
$$$
std::mutex $_
$$$
}'
Length of output: 69
Script:
#!/bin/bash
# Search for FileReader implementation and mutex usage
rg -U "class .*Reader" -A 10
# Check for resource cleanup patterns
rg -U "~.*Reader|close\(\)|cleanup\(\)" -A 5
# Look for thread-safety related includes
rg "#include.*mutex|thread|atomic"
# Search for RAII patterns and resource management
rg "std::(unique_ptr|shared_ptr|lock_guard|scoped_lock)"
Length of output: 68699
Script:
#!/bin/bash
# Search for NetworkReader implementation to check thread safety
rg -U "class NetworkReader" -A 30
# Check for mutex and thread synchronization in NetworkReader
rg -U "std::mutex|std::thread|std::condition_variable" components/core/src/clp/NetworkReader
# Look for resource cleanup in NetworkReader destructor and error handling
rg -U "~NetworkReader|NetworkReader::cleanup" -A 10
Length of output: 4189
Script:
#!/bin/bash
# Search for synchronization primitives in NetworkReader implementation
rg -U "std::mutex|std::thread|std::condition_variable|std::atomic" components/core/src/clp/NetworkReader.hpp
# Look for buffer management and thread safety in NetworkReader
rg -U "class Buffer|BufferPool|synchronized|thread_safe" components/core/src/clp/NetworkReader.hpp
# Check for error handling and resource cleanup patterns
rg -U "try|catch|RAII|cleanup|release" components/core/src/clp/NetworkReader.cpp
Length of output: 1193
components/core/src/clp_s/ArchiveReader.cpp (1)
19-30
: Proper input validation in the open
method
The input validation checks in the open
method enhance robustness by ensuring that the archive path is valid, the source is the filesystem, and that the path is a directory. This is a good practice to prevent potential errors during runtime.
components/core/src/clp_s/JsonConstructor.hpp (1)
30-35
: Consistent update of JsonConstructorOption
with new input structures
The modifications to JsonConstructorOption
, replacing archives_dir
and archive_id
with archive_path
and network_auth
, are consistent with the updated input handling mechanism. This improves clarity and maintainability.
components/core/src/clp_s/JsonParser.hpp (1)
Line range hint 35-48
: Enhancement of input handling and network authentication
The replacement of file_paths
with input_paths
of type std::vector<Path>
and the addition of network_auth
enhance the flexibility and scalability of the JsonParser
. The changes are consistently applied in both the option struct and class members, aligning with the new input configuration approach.
Also applies to: 115-116
components/core/src/clp_s/ArchiveReader.hpp (2)
11-11
: LGTM! Include changes align with network resource handling.
The addition of InputConfig.hpp and removal of boost/filesystem.hpp reflects the transition to the new input configuration management system.
37-37
: LGTM! Method signature change supports network resources.
The updated signature properly supports both local and network-based archive access.
components/core/src/clp_s/clp-s.cpp (3)
15-15
: LGTM! Proper curl initialization for network operations.
The addition of CurlGlobalInstance ensures proper initialization of libcurl for network operations.
Also applies to: 287-287
310-310
: LGTM! Extract command properly handles network archives.
The changes correctly integrate network authentication and path handling for archive extraction.
Also applies to: 318-319
363-368
: LGTM! Proper function call formatting.
The search_archive function call maintains good readability with proper parameter alignment.
components/core/src/clp_s/JsonParser.cpp (2)
23-24
: LGTM! Proper member initialization.
The new member variables are correctly initialized in the constructor's initialization list.
439-439
: LGTM! Consistent path reporting in error messages.
Error messages consistently use path.path for reporting file locations.
Also applies to: 473-473, 498-498, 532-532, 542-542
components/core/src/clp_s/JsonConstructor.cpp (2)
38-38
: Function call updated with network authentication—Looks good!
The addition of m_option.network_auth
to the m_archive_reader->open
call correctly integrates network authentication into the archive opening process.
115-115
: Duplicate concern regarding sanitization of get_archive_id()
As previously mentioned, when using m_archive_reader->get_archive_id()
—here within the BSON document—it is important to validate and sanitize the archive ID to ensure data integrity and security.
components/core/src/clp_s/ReaderUtils.cpp (2)
3-9
: Includes are appropriate and necessary
The added include directives are correct and required for the new functionality.
154-161
: Proper exception handling in try_create_file_reader
The function correctly handles exceptions and logs errors when opening files. Good job.
components/core/src/clp_s/JsonFileIterator.cpp (1)
10-17
: Constructor implementation is correct.
The constructor now correctly initializes m_reader
with the provided clp::ReaderInterface& reader
.
components/core/src/clp_s/Utils.cpp (5)
23-23
: No issues found
The check for whether path
is a directory is correctly implemented using std::filesystem::is_directory
.
29-29
: No issues found
Correctly checks if the directory is empty using std::filesystem::is_empty
.
42-43
: Efficient handling of empty directories
The code correctly disables recursion into empty directories, improving traversal efficiency.
126-148
: No issues found
The function get_last_non_empty_path_component
correctly retrieves the last non-empty component of a path.
150-161
: No issues found
The function get_last_uri_component
properly parses the URI and retrieves the last component using boost::urls
.
archive_paths.push_back(iter->path().string()); | ||
bool try_sign_url(std::string& url) { | ||
clp::aws::AwsAuthenticationSigner signer{ | ||
std::getenv(cAwsAccessKeyIdEnvVar), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
noticed that if I forget to pass in the access key, clp-s will throw a quite confusing message
terminate called after throwing an instance of 'std::logic_error' what(): basic_string::_M_construct null not valid Aborted
We should let signer parse the env vars, verify they are not null and then initilaize the member variables
Or we can implement our own helper which either parse the envvar or throw a more meaningful exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (1)
components/core/src/clp_s/JsonParser.cpp (1)
474-477
: Enhance error message clarityThe error messages could be more descriptive by indicating whether the path is a network URL or local file path.
- "Encountered non-json-object while trying to parse {} after parsing {} bytes", - path.path, + "Encountered non-json-object while trying to parse {} ({}) after parsing {} bytes", + path.path, + path.is_network_path() ? "network path" : "local file",Apply similar changes to other error messages for consistency.
Also applies to: 499-502, 533-536, 543-545
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp_s/JsonParser.cpp
(7 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp_s/JsonParser.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
🔇 Additional comments (1)
components/core/src/clp_s/JsonParser.cpp (1)
24-26
: LGTM: Constructor member initialization list updated correctly
The new member variables are properly initialized in the constructor's initialization list, maintaining the class's initialization consistency.
for (auto const& path : m_input_paths) { | ||
auto reader{ReaderUtils::try_create_reader(path, m_network_auth)}; | ||
if (nullptr == reader) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add error logging for reader creation failure
When reader creation fails, we should log an error message before returning false to help with debugging.
auto reader{ReaderUtils::try_create_reader(path, m_network_auth)};
if (nullptr == reader) {
+ SPDLOG_ERROR("Failed to create reader for path: {}", path.path);
m_archive_writer->close();
return false;
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
for (auto const& path : m_input_paths) { | |
auto reader{ReaderUtils::try_create_reader(path, m_network_auth)}; | |
if (nullptr == reader) { | |
for (auto const& path : m_input_paths) { | |
auto reader{ReaderUtils::try_create_reader(path, m_network_auth)}; | |
if (nullptr == reader) { | |
SPDLOG_ERROR("Failed to create reader for path: {}", path.path); | |
m_archive_writer->close(); | |
return false; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
components/core/src/clp_s/clp-s.cpp (2)
Line range hint
318-324
: Enhance error message specificityConsider providing more specific error messages to distinguish between network authentication failures and other types of archive access failures.
- SPDLOG_ERROR("{}", e.what()); + SPDLOG_ERROR("Failed to decompress archive {}: {}", archive_path, e.what());
363-368
: Improve readability of function callThe function call formatting could be improved for better readability.
- if (false - == search_archive( - command_line_arguments, - archive_reader, - expr->copy(), - reducer_socket_fd - )) + if (false == search_archive( + command_line_arguments, + archive_reader, + expr->copy(), + reducer_socket_fd))
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp_s/clp-s.cpp
(6 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp_s/clp-s.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
🔇 Additional comments (3)
components/core/src/clp_s/clp-s.cpp (3)
15-15
: LGTM: CURL integration for S3 support
The addition of CURL support is appropriate for handling S3 operations.
Also applies to: 287-287
310-316
: LGTM: Network authentication support for extraction
The network authentication support is properly integrated into the extraction process.
91-92
: Verify network authentication handling
The addition of network authentication support aligns with S3 integration. However, we should verify the error handling for network authentication failures.
✅ Verification successful
Let me gather more information about the error handling implementation.
Let me check the network authentication implementation further.
Let me check the error handling in the network reader implementation.
Based on the review of the implementation, I can now provide a final response.
Network authentication error handling is properly implemented
The codebase shows comprehensive error handling for network authentication:
- Network reader operations are wrapped in try-catch blocks with proper error logging
- Authentication failures are caught and reported through the error handling system
- The implementation includes proper cleanup of resources in error scenarios
- Error propagation is maintained through multiple layers (ReaderUtils -> NetworkReader -> CurlDownloadHandler)
Key error handling locations:
components/core/src/clp_s/ReaderUtils.cpp
: Catches and logs network reader failurescomponents/core/src/clp_s/clp-s.cpp
: Handles archive opening failures with authenticationcomponents/core/src/clp/NetworkReader.cpp
: Implements robust error handling for network operations
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check error handling for network authentication
# Look for try-catch blocks around network operations
rg -A 5 "network_auth" --type cpp
Length of output: 9649
Script:
#!/bin/bash
# Check error handling in network reader creation and authentication
ast-grep --pattern 'try_create_network_reader($path, $auth) {
$$$
}'
# Check for network-related error handling
rg -B 2 -A 5 "catch.*exception" --type cpp
Length of output: 34622
Script:
#!/bin/bash
# Check network authentication implementation
ast-grep --pattern 'try_create_network_reader($path, $auth) {
$$$
}'
# Check network reader implementation
rg -A 10 "class.*NetworkReader" --type cpp
Length of output: 906
Script:
#!/bin/bash
# Check NetworkReader implementation and error handling
rg -B 2 -A 10 "NetworkReader::" --type cpp
# Check for network-related error codes
rg "ErrorCode.*Network" --type hpp
Length of output: 14029
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (2)
components/core/src/clp_s/JsonParser.cpp (2)
429-431
: 🛠️ Refactor suggestionAdd detailed error logging for reader creation failure
When reader creation fails, we should provide more context about the failure reason.
auto reader{ReaderUtils::try_create_reader(path, m_network_auth)}; if (nullptr == reader) { + SPDLOG_ERROR("Failed to create reader for path: {} - Type: {}", path.path, path.type); m_archive_writer->close(); return false; }
548-563
:⚠️ Potential issueImprove network error handling structure
The current implementation has several issues that were previously identified:
- The archive writer is closed inside the network error handling block
- The error message could be more specific by including the HTTP status code
Apply this diff to improve the error handling:
if (auto network_reader = std::dynamic_pointer_cast<clp::NetworkReader>(reader); nullptr != network_reader) { if (auto const rc = network_reader->get_curl_ret_code(); rc.has_value() && CURLcode::CURLE_OK != rc.value()) { auto curl_error_message = network_reader->get_curl_error_msg(); - std::string error_msg_str; - if (curl_error_message.has_value()) { - error_msg_str = curl_error_message.value(); - } - SPDLOG_ERROR("Encountered curl error during ingestion - {}", error_msg_str); - m_archive_writer->close(); + SPDLOG_ERROR( + "Encountered curl error during ingestion of {} - Code: {} - {}", + path.path, + rc.value(), + curl_error_message.value_or("Unknown error") + ); return false; } }
🧹 Nitpick comments (1)
components/core/src/clp_s/JsonParser.cpp (1)
441-441
: Enhance error messages with path type informationConsider including the path type in error messages to help with debugging network-related issues.
Example for one of the messages:
SPDLOG_ERROR( - "Encountered error - {} - while trying to parse {} after parsing {} bytes", + "Encountered error - {} - while trying to parse {} (type: {}) after parsing {} bytes", simdjson::error_message(json_file_iterator.get_error()), path.path, + path.type, bytes_consumed_up_to_prev_record );Also applies to: 475-475, 500-500, 534-534, 544-544
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp_s/JsonParser.cpp
(7 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp_s/JsonParser.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
🔇 Additional comments (1)
components/core/src/clp_s/JsonParser.cpp (1)
24-26
: LGTM: Constructor changes properly initialize new members
The initialization of new members follows the class's existing initialization pattern and aligns with the PR objectives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (6)
components/core/src/clp_s/ReaderUtils.cpp (5)
154-161
: Enhance error message clarityConsider adding the specific error code to the log message for better debugging:
- SPDLOG_ERROR("Failed to open file for reading - {} - {}", file_path, e.what()); + SPDLOG_ERROR("Failed to open file for reading - {} - Error: {} (code: {})", file_path, e.what(), e.get_error_code());
184-185
: Enhance error logging in catch blockThe catch block should log the specific error for better debugging:
- } catch (std::exception const& e) { + } catch (std::exception const& e) { + SPDLOG_ERROR("Failed to sign URL: {}", e.what()); return false;
201-203
: Add logging for unsupported auth methodConsider adding a log message when an unsupported authentication method is encountered:
default: + SPDLOG_ERROR("Unsupported authentication method: {}", static_cast<int>(auth.method)); return nullptr;
208-209
: Include sanitized URL in error messageConsider including a sanitized version of the URL in the error message for better debugging:
- SPDLOG_ERROR("Failed to open url for reading - {}", e.what()); + SPDLOG_ERROR("Failed to open url for reading - {} - {}", sanitize_url(request_url), e.what());Note: You'll need to implement a
sanitize_url
function that removes sensitive information like access keys.
214-223
: Add debug logging for input sourceConsider adding debug logging to help with troubleshooting:
std::shared_ptr<clp::ReaderInterface> ReaderUtils::try_create_reader(Path const& path, NetworkAuthOption const& network_auth) { + SPDLOG_DEBUG("Creating reader for {} source: {}", + path.source == InputSource::Filesystem ? "filesystem" : "network", + path.path); if (InputSource::Filesystem == path.source) {components/core/src/clp_s/JsonParser.cpp (1)
441-441
: Consider using a more descriptive name for the path member.The usage of
path.path
could be confusing. Consider renaming the member to something more descriptive likefile_path
orresource_path
to better indicate its purpose.Also applies to: 475-475, 500-500, 534-534, 544-544
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp_s/JsonParser.cpp
(7 hunks)components/core/src/clp_s/ReaderUtils.cpp
(2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
components/core/src/clp_s/JsonParser.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp_s/ReaderUtils.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
🔇 Additional comments (5)
components/core/src/clp_s/ReaderUtils.cpp (2)
3-9
: LGTM: Appropriate includes added
The new includes properly support the added functionality for logging, AWS authentication, and network operations.
164-172
: AWS credentials validation implemented correctly
The implementation now properly checks for null environment variables and provides clear error messages.
components/core/src/clp_s/JsonParser.cpp (3)
24-26
: LGTM: Constructor member initialization list is well-structured.
The initialization of new members follows the existing pattern and maintains consistency.
429-431
: Add specific error logging for reader creation failure.
The error handling should include specific error messages to aid in debugging.
548-564
: 🛠️ Refactor suggestion
Improve error handling structure for network operations.
While the error message formatting has improved, closing the archive writer inside the network error handling block could prevent processing remaining paths. Consider restructuring to handle cleanup after processing all paths.
if (auto network_reader = std::dynamic_pointer_cast<clp::NetworkReader>(reader);
nullptr != network_reader)
{
if (auto const rc = network_reader->get_curl_ret_code();
rc.has_value() && CURLcode::CURLE_OK != rc.value())
{
auto const curl_error_message = network_reader->get_curl_error_msg();
SPDLOG_ERROR(
"Encountered curl error while ingesting {} - Code: {} - Message: {}",
path.path,
rc.value(),
curl_error_message.value_or("Unknown error")
);
- m_archive_writer->close();
return false;
}
}
Likely invalid or redundant comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most comments focus on fixing docstrings, some are just personal preferences.
constexpr char cAwsSecretAccessKeyEnvVar[] = "AWS_SECRET_ACCESS_KEY"; | ||
|
||
/** | ||
* Enum class definining the source of some resource. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Enum class definining the source of some resource. | |
* Enum class defining the source of a resource. |
}; | ||
|
||
/** | ||
* Enum class defining the auth that needs to be performed to access some resource. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Enum class defining the auth that needs to be performed to access some resource. | |
* Enum class defining the authentication method required for accessing a resource. |
}; | ||
|
||
/** | ||
* Struct describing a path to some resource as well as its source. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Struct describing a path to some resource as well as its source. | |
* Struct representing a resource path with its source type. |
/** | ||
* Recursively records all file paths that are children of the the given Path, including the Path | ||
* itself. | ||
* @param option |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @param option | |
* @param path |
get_input_archives_for_path(Path const& path, std::vector<Path>& archives) -> bool; | ||
|
||
/** | ||
* Determines the archive id of a given archive based on a path to that archive. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Determines the archive id of a given archive based on a path to that archive. | |
* Determines the archive id of an archive based on the archive path. |
)( | ||
"auth", | ||
po::value<std::string>(&auth) | ||
->value_name("AUTH_TYPE") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use the same name for this value_name
and the class name in InputConfig? Either AUTH_METHOD
(AuthMethod
) or AUTH_TYPE
(AuthType
)
)( | ||
"output-dir", | ||
po::value<std::string>(&m_output_dir), | ||
"The output directory for the decompressed file" | ||
); | ||
// clang-format on | ||
|
||
po::options_description input_options("Input Options"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we move it to decompression options?
if (archive_path.empty()) { | ||
throw std::invalid_argument("No archive path specified"); | ||
} | ||
|
||
if (false == archive_id.empty()) { | ||
auto archive_fs_path = std::filesystem::path(archive_path) / archive_id; | ||
if (false == std::filesystem::exists(archive_fs_path)) { | ||
throw std::invalid_argument("Requested archive does not exist"); | ||
} | ||
m_input_paths.emplace_back(clp_s::Path{ | ||
.source{clp_s::InputSource::Filesystem}, | ||
.path{archive_fs_path.string()} | ||
}); | ||
} else if (false == get_input_archives_for_raw_path(archive_path, m_input_paths)) { | ||
throw std::invalid_argument("Invalid archive path"); | ||
} | ||
|
||
if (m_input_paths.empty()) { | ||
throw std::invalid_argument("No archive paths specified"); | ||
} | ||
|
||
if (cS3Auth == auth) { | ||
m_network_auth.method = AuthMethod::S3PresignedUrlV4; | ||
} else if (cNoAuth != auth) { | ||
throw std::invalid_argument(fmt::format("Invalid authentication type \"{}\"", auth) | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we deduplicate the code a bit?
throw std::invalid_argument("No archive path specified"); | ||
} | ||
|
||
if (false == archive_id.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that without archive id, the path can be either a url for a filesystem path, but with archive id, only filesystem paths are allowed?
@@ -22,7 +23,7 @@ class JsonFileIterator { | |||
* @param buf_size the initial buffer size | |||
*/ | |||
explicit JsonFileIterator( | |||
std::string const& file_name, | |||
clp::ReaderInterface& reader, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we update the description?
Description
This PR adds support for ingesting logs from S3, and majorly refactors the command line interface to simplify specifying resources that exist at some network location. A follow-up PR will build on the utilties added by this PR to implement search on archives located on s3, as well as general search on single-file archives.
The major changes are as follows
Validation performed
Summary by CodeRabbit
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Documentation
Refactor