-
-
Notifications
You must be signed in to change notification settings - Fork 676
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Store List of Fields in Segment #2279
base: main
Are you sure you want to change the base?
Conversation
Fiels may be encoded in the columnar storage or in the inverted index for JSON fields. Add a new Segment file that contains the list of fields (schema + encoded)
if let Some(list_fields_file) = self.list_fields_file.as_ref() { | ||
let file = list_fields_file.read_bytes()?; | ||
let fields_metadata = | ||
read_split_fields(file)?.collect::<io::Result<Vec<FieldMetadata>>>(); | ||
fields_metadata.map_err(|e| e.into()) | ||
} else { | ||
// Schema fallback | ||
Ok(self | ||
.schema() | ||
.fields() | ||
.map(|(_field, entry)| FieldMetadata { | ||
field_name: entry.name().to_string(), | ||
typ: entry.field_type().value_type(), | ||
indexed: entry.is_indexed(), | ||
stored: entry.is_stored(), | ||
fast: entry.is_fast(), | ||
}) | ||
.collect()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is admittedly code golfing, but I think it makes it a bit less noisier:
if let Some(list_fields_file) = self.list_fields_file.as_ref() { | |
let file = list_fields_file.read_bytes()?; | |
let fields_metadata = | |
read_split_fields(file)?.collect::<io::Result<Vec<FieldMetadata>>>(); | |
fields_metadata.map_err(|e| e.into()) | |
} else { | |
// Schema fallback | |
Ok(self | |
.schema() | |
.fields() | |
.map(|(_field, entry)| FieldMetadata { | |
field_name: entry.name().to_string(), | |
typ: entry.field_type().value_type(), | |
indexed: entry.is_indexed(), | |
stored: entry.is_stored(), | |
fast: entry.is_fast(), | |
}) | |
.collect()) | |
} | |
let fields_metadata = if let Some(list_fields_file) = self.list_fields_file.as_ref() { | |
let file = list_fields_file.read_bytes()?; | |
read_split_fields(file)?.collect::<io::Result<Vec<FieldMetadata>>>()? | |
} else { | |
// Schema fallback | |
self.schema() | |
.fields() | |
.map(|(_field, entry)| FieldMetadata { | |
field_name: entry.name().to_string(), | |
typ: entry.field_type().value_type(), | |
indexed: entry.is_indexed(), | |
stored: entry.is_stored(), | |
fast: entry.is_fast(), | |
}) | |
.collect() | |
}; | |
Ok(fields_metadata) |
@@ -24,34 +26,44 @@ impl From<u32> for OrderedPathId { | |||
|
|||
#[derive(Default)] | |||
pub(crate) struct PathToUnorderedId { | |||
map: FnvHashMap<String, u32>, | |||
/// TinySet contains the type codes of the columns in the path. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think defining a struct with named fields would improve readability here, likely making the comment unnecessary or at least attaching it to the named field directly.
let mut sorted_ids: Vec<(&str, (u32, TinySet))> = self | ||
.map | ||
.iter() | ||
.map(|(k, (id, typ_code))| (k.as_str(), (*id, *typ_code))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using the plural typ_codes
would be good here and below, or maybe all_codes
or typ_code_bitvec
as above, to visually indicate that this is not a single code as in e.g. insert_new_path
.
@@ -81,6 +84,11 @@ impl SegmentSerializer { | |||
&mut self.postings_serializer | |||
} | |||
|
|||
/// Accessor to the ``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing name inside the backticks?
serializer: &mut InvertedIndexSerializer, | ||
) -> crate::Result<()> { | ||
// Replace unordered ids by ordered ids to be able to sort | ||
let unordered_id_to_ordered_id: Vec<OrderedPathId> = | ||
ctx.path_to_unordered_id.unordered_id_to_ordered_id(); | ||
let ordered_id_to_path = ctx.path_to_unordered_id.ordered_id_to_path(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why move the binding up here if it is not used in the loop? I would suggest leaving it below to make that clear to the reader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Vec is reused in another method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it unordered_id_to_ordered_id
that is reused in serialize_segment_fields
? I was thinking about the binding for ordered_id_to_path
which moved up here from line 79 without any change to how it is used.
// In this case we need to map the potential fast field to the field name | ||
// accepted by the query parser. | ||
let create_canonical = | ||
!field_entry.is_expand_dots_enabled() && json_path.contains('.'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems dangerous to pass the name (which is derived from field_entry
) as an argument but access field_entry
as a capture. Maybe build_path
should be a free function taking field_entry
and json_path
(and map_to_canonical
) as arguments to make it obvious what it accesses and how.
write_field(field_metadata, &mut payload); | ||
} | ||
let compression_level = 3; | ||
let payload_compressed = zstd::stream::encode_all(&mut &payload[..], compression_level) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you are compressing from full buffers to full buffers, Zstd's bulk API might yield better results (as @trinity-1686a found elsewhere IIRC).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see #1946 (comment) and #1946 (comment). This might be more of an issue when decompressing than compressing, but either way the bulk API is faster, so if it's easy to use, we should use it
mut reader: R, | ||
) -> io::Result<impl Iterator<Item = io::Result<FieldMetadata>>> { | ||
let format_version = read_exact_array::<_, 1>(&mut reader)?[0]; | ||
assert_eq!(format_version, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be an error instead of a panic?
} | ||
|
||
/// Reads the Split fields from a stream of bytes | ||
fn read_split_fields_from_zstd<R: Read>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I would suggest merging this into read_split_fields
. The name was surprising to me as it does nothing specific to Zstd and I think following the logic of constructing the iterator is easier if it is not split over multiple functions.
(If you do go for Zstd's bulk API, this would probably become simpler as well due to working on a full in-memory representation (if that is not consider too memory hungry) which can simply be chunked instead of calling read_exact
repeatedly.)
) -> io::Result<impl Iterator<Item = io::Result<FieldMetadata>>> { | ||
let mut num_fields = u32::from_le_bytes(read_exact_array::<_, 4>(&mut reader)?); | ||
|
||
Ok(std::iter::from_fn(move || { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if this break w.r.t. error handling, but wouldn't something like
Ok((0..num_fields).map(move |_| read_field(&mut reader))
work as well?
Fiels may be encoded in the columnar storage or in the inverted index
for JSON fields.
Add a new Segment file that contains the list of fields (schema +
encoded)