diff --git a/CHANGELOG.md b/CHANGELOG.md index 0a19f5cb11..af181b1d3a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -165,7 +165,7 @@ > [specific guidance for users of the Zed CLI tools](https://github.com/brimdata/zed-lake-migration#zed-cli-tools). * Zed lake storage format is now at version 3 (#4386, #4415) -* Allow loading and responses in [VNG](docs/formats/vng.md) format over the lake API (#4345) +* Allow loading and responses in [VNG](docs/formats/csup.md) format over the lake API (#4345) * Fix an issue where [record spread expressions](docs/language/expressions.md#record-expressions) could cause a crash (#4359) * Fix an issue where the Zed service `/version` endpoint returned "unknown" if it had been built via `go install` (#4371) * Branch-level [meta-queries](docs/commands/zed.md#meta-queries) on the `main` branch no longer require an explicit `@main` reference (#4377, #4394) @@ -177,7 +177,7 @@ ## v1.5.0 * Add `float16` primitive type (#4301) -* Add segment compression to the [VNG](docs/formats/vng.md) format (#4299) +* Add segment compression to the [VNG](docs/formats/csup.md) format (#4299) * Add `-unbuffered` flag to `zed` and `zq` (#4320) * Add `-csv.delim` flag to `zed` and `zq` for reading CSV with non-comma delimiter (#4325) * Add `csv.delim` query parameter to lake API for reading CSV with non-comma delimiter (#4333) @@ -186,7 +186,7 @@ * Fix an issue where type decorators of union values were leaking into CSV output (#4338) ## v1.4.0 -* The ZST format is now called [VNG](docs/formats/vng.md) (#4256) +* The ZST format is now called [VNG](docs/formats/csup.md) (#4256) * Allow loading of "line" format over the lake API (#4229) * Allow loading of Parquet format over the lake API (#4235) * Allow loading of Zeek TSV format over the lake API (#4246) @@ -629,7 +629,7 @@ questions. ## v0.23.0 * zql: Add `week` as a unit for [time grouping with `every`](docs/language/functions/every.md) (#1374) * zq: Fix an issue where a `null` value in a [JSON type definition](docs/integrations/zeek/README.md) caused a failure without an error message (#1377) -* zq: Add [`zst` format](docs/formats/vng.md) to `-i` and `-f` command-line help (#1384) +* zq: Add [`zst` format](docs/formats/csup.md) to `-i` and `-f` command-line help (#1384) * zq: ZNG spec and `zq` updates to introduce the beta ZNG storage format (#1375, #1415, #1394, #1457, #1512, #1523, #1529), also addressing the following: * New data type `bytes` for storing sequences of bytes encoded as base64 (#1315) * Improvements to the `enum` data type (#1314) @@ -693,7 +693,7 @@ questions. 
* zqd: Fix an issue where starting `zqd listen` created excess error messages when subdirectories were present (#1303)
* zql: Add the [`fuse` operator](docs/language/operators/fuse.md) for unifying records under a single schema (#1310, #1319, #1324)
* zql: Fix broken links in documentation (#1321, #1339)
-* zst: Introduce the [ZST format](docs/formats/vng.md) for columnar data based on ZNG (#1268, #1338)
+* zst: Introduce the [ZST format](docs/formats/csup.md) for columnar data based on ZNG (#1268, #1338)
* pcap: Fix an issue where certain pcapng files could fail import with a `bad option length` error (#1341)
* zql: [Document the `**` operator](docs/language/README.md#search-syntax) for type-specific searches that look within nested records (#1337)
* zar: Change the archive data file layout to prepare for handling chunk files with overlapping ranges and improved S3 support (#1330)
diff --git a/docs/README.md b/docs/README.md
index 753445141c..f3f034d404 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -41,7 +41,7 @@ that underlie the super-structured data formats.
* The [super data formats](formats/README.md) are a family of
[human-readable (Super JSON, JSUP)](formats/jsup.md),
[sequential (Super Binary, BSUP)](formats/bsup.md), and
-[columnar (Super Columnar, CSUP)](formats/vng.md) formats that all adhere to the
+[columnar (Super Columnar, CSUP)](formats/csup.md) formats that all adhere to the
same abstract super data model.
* The [SuperPipe language](language/README.md) is the system's pipeline language for performing
queries, searches, analytics, transformations, or any of the above combined together.
diff --git a/docs/commands/zed.md b/docs/commands/zed.md
index 2c22969dda..81287ae2b6 100644
--- a/docs/commands/zed.md
+++ b/docs/commands/zed.md
@@ -118,7 +118,7 @@ replication easy to support and deploy.
The cloud objects that comprise a lake, e.g., data objects,
commit history, transaction journals, partial aggregations, etc.,
are stored as Zed data, i.e., either as [row-based Super Binary](../formats/bsup.md)
-or [columnar VNG](../formats/vng.md).
+or [Super Columnar](../formats/csup.md).
This makes introspection of the lake structure straightforward as many key
lake data structures can be queried with metadata queries and presented
to a client as Zed data for further processing by downstream tooling.
diff --git a/docs/commands/zq.md b/docs/commands/zq.md
index f3bc3540a8..414cb55930 100644
--- a/docs/commands/zq.md
+++ b/docs/commands/zq.md
@@ -100,7 +100,7 @@ Note here that the query `1+1` [implies](../language/pipeline-model.md#implied-o
| `line` | no | One string value per input line |
| `parquet` | yes | [Apache Parquet](https://github.com/apache/parquet-format) |
| `tsv` | yes | [TSV - Tab-Separated Values](https://en.wikipedia.org/wiki/Tab-separated_values) |
-| `vng` | yes | [VNG - Binary Columnar Format](../formats/vng.md) |
+| `csup` | yes | [Super Columnar](../formats/csup.md) |
| `zeek` | yes | [Zeek Logs](https://docs.zeek.org/en/master/logs/index.html) |
| `zjson` | yes | [ZJSON - Zed over JSON](../formats/zjson.md) |
| `bsup` | yes | [Super Binary](../formats/bsup.md) |
@@ -158,7 +158,7 @@ JSON any number that appears without a decimal point as an integer type.
:::tip note The reason `zq` is not particularly performant for ZSON is that the ZNG or -[VNG](../formats/vng.md) formats are semantically equivalent to ZSON but much more efficient and +[Super Columnar](../formats/csup.md) formats are semantically equivalent to ZSON but much more efficient and the design intent is that these efficient binary formats should be used in use cases where performance matters. ZSON is typically used only when data needs to be human-readable in interactive settings or in automated tests. @@ -186,7 +186,7 @@ typically omit quotes around field names. | `table` | (described [below](#simplified-text-outputs)) | | `text` | (described [below](#simplified-text-outputs)) | | `tsv` | [TSV - Tab-Separated Values](https://en.wikipedia.org/wiki/Tab-separated_values) | -| `vng` | [VNG - Binary Columnar Format](../formats/vng.md) | +| `csup` | [Super Columnar](../formats/csup.md) | | `zeek` | [Zeek Logs](https://docs.zeek.org/en/master/logs/index.html) | | `zjson` | [ZJSON - Zed over JSON](../formats/zjson.md) | | `bsup` | [Super Binary](../formats/bsup.md) | diff --git a/docs/formats/README.md b/docs/formats/README.md index 48eac6342c..aa95b2639f 100644 --- a/docs/formats/README.md +++ b/docs/formats/README.md @@ -271,7 +271,7 @@ documents are Super JSON values as the Super JSON format is a strict superset of * [Super Binary](bsup.md) is a row-based, binary representation somewhat like Avro but leveraging the super data model to represent a sequence of arbitrarily-typed values. -* [Super Columnar](vng.md) is columnar like Parquet or ORC but also +* [Super Columnar](csup.md) is columnar like Parquet or ORC but also embodies the super data model for heterogeneous and self-describing schemas. * [Super JSON over JSON](zjson.md) defines a format for encapsulating Super JSON inside plain JSON for easy decoding by JSON-based clients, e.g., diff --git a/docs/formats/vng.md b/docs/formats/csup.md similarity index 56% rename from docs/formats/vng.md rename to docs/formats/csup.md index 1e8973190b..55377db18d 100644 --- a/docs/formats/vng.md +++ b/docs/formats/csup.md @@ -1,60 +1,58 @@ --- sidebar_position: 4 -sidebar_label: VNG +sidebar_label: Super Columnar --- -# VNG Specification +# Super Columnar Specification -VNG, pronounced "ving", is a file format for columnar data based on -[the Zed data model](zed.md). -VNG is the "stacked" version of Zed, where the fields from a stream of -Zed records are stacked into vectors that form columns. +Super Columnar is a file format based on +the [super data model](zed.md) where data is stacked to form columns. Its purpose is to provide for efficient analytics and search over -bounded-length sequences of [Super Binary](bsup.md) data that is stored in columnar form. +bounded-length sequences of [super-structured data](./README.md#2-a-super-structured-pattern) that is stored in columnar form. Like [Parquet](https://github.com/apache/parquet-format), -VNG provides an efficient columnar representation for semi-structured data, -but unlike Parquet, VNG is not based on schemas and does not require +Super Columnar provides an efficient representation for semi-structured data, +but unlike Parquet, Super Columnar is not based on schemas and does not require a schema to be declared when writing data to a file. Instead, -VNG exploits the super-structured nature of Zed data: columns of data +it exploits the nature of super-structured data: columns of data self-organize around their type structure. 
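To make the "self-organizing" idea concrete, here is a minimal sketch, illustrative only and not part of the spec or of the brimdata/super codebase, of how a writer might bucket a stream of mixed-type values into per-type column streams keyed by each value's top-level type signature:

```go
package main

import "fmt"

func main() {
	// Hypothetical input: each value is tagged with its top-level type.
	input := []struct{ typ, val string }{
		{"{a:string,b:string}", `{a:"hello",b:"world"}`},
		{"{x:int64}", `{x:1}`},
		{"{a:string,b:string}", `{a:"goodnight",b:"gracie"}`},
	}
	// Columns "self-organize": the first time a type is seen, a new
	// column stream is allocated for it; no schema is declared up front.
	streams := map[string][]string{}
	var order []string
	for _, in := range input {
		if _, ok := streams[in.typ]; !ok {
			order = append(order, in.typ)
		}
		streams[in.typ] = append(streams[in.typ], in.val)
	}
	for _, typ := range order {
		fmt.Println(typ, "->", streams[typ])
	}
}
```

In a real implementation each stream would recursively decompose into per-field columns rather than holding whole values, as the sections below describe.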
-## VNG Files +## Super Columnar Files -A VNG file encodes a bounded, ordered sequence of Zed values. -To provide for efficient access to subsets of VNG-encoded data (e.g., columns), -the VNG file is presumed to be accessible via random access +A Super Columnar file encodes a bounded, ordered sequence of values. +To provide for efficient access to subsets of Super Columnar-encoded data (e.g., columns), +the file is presumed to be accessible via random access (e.g., range requests to a cloud object store or seeks in a Unix file system) -and VNG is therefore not intended as a streaming or communication format. +and is therefore not intended as a streaming or communication format. -A VNG file can be stored entirely as one storage object +A Super Columnar file can be stored entirely as one storage object or split across separate objects that are treated -together as a single VNG entity. While the VNG format provides much flexibility +together as a single Super Columnar entity. While the format provides much flexibility for how data is laid out, it is left to an implementation to lay out data in intelligent ways for efficient sequential read accesses of related data. ## Column Streams -The VNG data abstraction is built around a collection of _column streams_. +The Super Columnar data abstraction is built around a collection of _column streams_. There is one column stream for each top-level type encountered in the input where each column stream is encoded according to its type. For top-level complex types, the embedded elements are encoded recursively in additional column streams as described below. For example, -a record column encodes a "presence" vector encoding any null value for +a [record column](#record-column) encodes a [presence column](#presence-columns) encoding any null value for each field then encodes each non-null field recursively, whereas -an array column encodes a "lengths" vector and encodes each +an [array column](#array-column) encodes a sequence of "lengths" and encodes each element recursively. Values are reconstructed one by one from the column streams by picking values from each appropriate column stream based on the type structure of the value and its relationship to the various column streams. For hierarchical records -(i.e., records inside of records, or records inside of arrays inside of records, etc), +(i.e., records inside of records, or records inside of arrays inside of records, etc.), the reconstruction process is recursive (as described below). ## The Physical Layout -The overall layout of a VNG file is comprised of the following sections, +The overall layout of a Super Columnar file is comprised of the following sections, in this order: * the data section, * the reassembly section, and @@ -64,89 +62,99 @@ This layout allows an implementation to buffer metadata in memory while writing column data in a natural order to the data section (based on the volume statistics of each column), then write the metadata into the reassembly section along with the trailer -at the end. This allows a ZNG stream to be converted to a VNG file +at the end. This allows a stream to be converted to a Super Columnar file in a single pass. -> That said, the layout is -> flexible enough that an implementation may optimize the data layout with -> additional passes or by writing the output to multiple files then -> merging them together (or even leaving the VNG entity as separate files). 
+:::tip note
+That said, the layout is
+flexible enough that an implementation may optimize the data layout with
+additional passes or by writing the output to multiple files then
+merging them together (or even leaving the Super Columnar entity as separate files).
+:::

### The Data Section

The data section contains raw data values organized into _segments_,
where a segment is a seek offset and byte length relative to the
data section. Each segment contains a sequence of
-[primitive-type Zed values](zed.md#1-primitive-types),
+[primitive-type values](zed.md#1-primitive-types),
encoded as counted-length byte sequences
where the counted-length is variable-length encoded as in the
[Super Binary specification](bsup.md).
Segments may be compressed.

There is no information in the data section for how segments relate
to one another or how they are reconstructed into columns. They are just
-blobs of ZNG data.
-
-> Unlike Parquet, there is no explicit arrangement of the column chunks into
-> row groups but rather they are allowed to grow at different rates so a
-> high-volume column might be comprised of many segments while a low-volume
-> column must just be one or several. This allows scans of low-volume record types
-> (the "mice") to perform well amongst high-volume record types (the "elephants"),
-> i.e., there are not a bunch of seeks with tiny reads of mice data interspersed
-> throughout the elephants.
->
-> TBD: The mice/elephants model creates an interesting and challenging layout
-> problem. If you let the row indexes get too far apart (call this "skew"), then
-> you have to buffer very large amounts of data to keep the column data aligned.
-> This is the point of row groups in Parquet, but the model here is to leave it
-> up to the implementation to do layout as it sees fit. You can also fall back
-> to doing lots of seeks and that might work perfectly fine when using SSDs but
-> this also creates interesting optimization problems when sequential reads work
-> a lot better. There could be a scan optimizer that lays out how the data is
-> read that lives under the column stream reader. Also, you can make tradeoffs:
-> if you use lots of buffering on ingest, you can write the mice in front of the
-> elephants so the read path requires less buffering to align columns. Or you can
-> do two passes where you store segments in separate files then merge them at close
-> according to an optimization plan.
+blobs of Super Binary data.
+
+:::tip note
+Unlike Parquet, there is no explicit arrangement of the column chunks into
+row groups but rather they are allowed to grow at different rates so a
+high-volume column might be comprised of many segments while a low-volume
+column may be just one or several. This allows scans of low-volume record types
+(the "mice") to perform well amongst high-volume record types (the "elephants"),
+i.e., there are not a bunch of seeks with tiny reads of mice data interspersed
+throughout the elephants.
+:::
+
+:::tip TBD
+The mice/elephants model creates an interesting and challenging layout
+problem. If you let the row indexes get too far apart (call this "skew"), then
+you have to buffer very large amounts of data to keep the column data aligned.
+This is the point of row groups in Parquet, but the model here is to leave it
+up to the implementation to do layout as it sees fit.
You can also fall back
+to doing lots of seeks and that might work perfectly fine when using SSDs but
+this also creates interesting optimization problems when sequential reads work
+a lot better. There could be a scan optimizer that lays out how the data is
+read that lives under the column stream reader. Also, you can make tradeoffs:
+if you use lots of buffering on ingest, you can write the mice in front of the
+elephants so the read path requires less buffering to align columns. Or you can
+do two passes where you store segments in separate files then merge them at close
+according to an optimization plan.
+:::

### The Reassembly Section

The reassembly section provides the information needed to reconstruct
-column streams from segments, and in turn, to reconstruct the original Zed values
+column streams from segments, and in turn, to reconstruct the original values
from column streams, i.e., to map columns back to composite values.

-> Of course, the reassembly section also provides the ability to extract just subsets of columns
-> to be read and searched efficiently without ever needing to reconstruct
-> the original rows. How well this performs is up to any particular
-> VNG implementation.
->
-> Also, the reassembly section is in general vastly smaller than the data section
-> so the goal here isn't to express information in cute and obscure compact forms
-> but rather to represent data in an easy-to-digest, programmer-friendly form that
-> leverages ZNG.
-
-The reassembly section is a ZNG stream. Unlike Parquet,
+:::tip note
+Of course, the reassembly section also provides the ability to extract just subsets of columns
+to be read and searched efficiently without ever needing to reconstruct
+the original rows. How well this performs is up to any particular
+Super Columnar implementation.
+
+Also, the reassembly section is in general vastly smaller than the data section
+so the goal here isn't to express information in cute and obscure compact forms
+but rather to represent data in an easy-to-digest, programmer-friendly form that
+leverages Super Binary.
+:::
+
+The reassembly section is a Super Binary stream. Unlike Parquet,
which uses an externally described schema
(via [Thrift](https://thrift.apache.org/)) to describe
-analogous data structures, we simply reuse ZNG here.
+analogous data structures, we simply reuse Super Binary here.

#### The Super Types

-This reassembly stream encodes 2*N+1 Zed values, where N is equal to the number
-of top-level Zed types that are present in the encoded input.
-To simplify terminology, we call a top-level Zed type a "super type",
-e.g., there are N unique super types encoded in the VNG file.
+This reassembly stream encodes 2*N+1 values, where N is equal to the number
+of top-level types that are present in the encoded input.
+To simplify terminology, we call a top-level type a "super type",
+e.g., there are N unique super types encoded in the Super Columnar file.

These N super types are defined by the first N values of the reassembly stream
and are encoded as a null value of the indicated super type.
A super type's integer position in this sequence defines its identifier
-encoded in the super column (defined below). This identifier is called
+encoded in the [super column](#the-super-column). This identifier is called
the super ID.

-> Change the first N values to type values instead of nulls?
+:::tip TBD
+Change the first N values to type values instead of nulls?
+:::

The next N+1 records contain reassembly information for each of the N super types
where each record defines the column streams needed to reconstruct the original
-Zed values.
+values.

#### Segment Maps

A segment map is a list of the segments from the data area
that are concatenated to form the data for a column stream.

Each segment map that appears within the reassembly records is represented
-with a Zed array of records that represent seek ranges conforming to this
+with an array of records that represent seek ranges conforming to this
type signature:
```
[{offset:uint64,length:uint32,mem_length:uint32,compression_format:uint8}]
```
@@ -164,24 +172,26 @@ type signature:
In the rest of this document, we will refer to this type as `<segmap>` for
shorthand and refer to the concept as a "segmap".

-> We use the type name "segmap" to emphasize that this information represents
-> a set of byte ranges where data is stored and must be read from *rather than*
-> the data itself.
+:::tip note
+We use the type name "segmap" to emphasize that this information represents
+a set of byte ranges where data is stored and must be read from *rather than*
+the data itself.
+:::

#### The Super Column

The first of the N+1 reassembly records defines the "super column", where this column
-represents the sequence of super types of each original Zed value, i.e., indicating
+represents the sequence of [super types](#the-super-types) of each original value, i.e., indicating
which super type's column stream to select from to pull column values to form
the reconstructed value.
The sequence of super types is defined by each type's super ID (as defined above),
0 to N-1, within the set of N super types.

-The super column stream is encoded as a sequence of ZNG-encoded `int32` primitive values.
-While there are a large number entries in the super column (one for each original row),
+The super column stream is encoded as a sequence of Super Binary-encoded `int32` primitive values.
+While there are a large number of entries in the super column (one for each original row),
the cardinality of super IDs is small in practice so this column
will compress very significantly, e.g., in the special case that all the
-values in the VNG file have the same super ID,
+values in the Super Columnar file have the same super ID,
the super column will compress trivially.

The reassembly map appears as the next value in the reassembly section
@@ -193,13 +203,13 @@ Following the root reassembly map are N reassembly maps, one for each unique sup
Each reassembly record is a record of type `<any_column>`, as defined below,
where each reassembly record appears in the same sequence as the original N schemas.

-Note that there is no "any" type in Zed, but rather this terminology is used
+Note that there is no "any" type in the super data model, but rather this terminology is used
here to refer to any of the concrete type structures that would appear
-in a given VNG file.
+in a given Super Columnar file.

In other words, the reassembly record of the super column
combined with the N reassembly records collectively define the original sequence
-of Zed data values in the original order.
+of data values in the original order.

Taken in pieces, the reassembly records allow efficient access to sub-ranges of the rows,
to subsets of columns of the rows,
to sub-ranges of columns of the rows, and so forth.
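Because a segmap is just data, reading a column stream from one is straightforward. The following is a hedged sketch, using hypothetical types rather than the actual brimdata/super API, of fetching one column stream by concatenating its segments via random access, as the spec presumes:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// Segment mirrors one element of the <segmap> array type:
// {offset:uint64,length:uint32,mem_length:uint32,compression_format:uint8}
type Segment struct {
	Offset            uint64 // seek offset relative to the data section
	Length            uint32 // stored (possibly compressed) byte length
	MemLength         uint32 // in-memory length after decompression
	CompressionFormat uint8  // assumed 0 (uncompressed) in this sketch
}

// readColumn concatenates a segmap's segments into one column stream
// using random access, e.g., file seeks or cloud range requests.
func readColumn(data io.ReaderAt, segmap []Segment) ([]byte, error) {
	var col []byte
	for _, s := range segmap {
		buf := make([]byte, s.Length)
		if _, err := data.ReadAt(buf, int64(s.Offset)); err != nil {
			return nil, err
		}
		// A real reader would decompress here when CompressionFormat != 0.
		col = append(col, buf...)
	}
	return col, nil
}

func main() {
	data := []byte("helloworldgoodnightgracie")
	segmap := []Segment{
		{Offset: 0, Length: 10, MemLength: 10},
		{Offset: 10, Length: 15, MemLength: 15},
	}
	col, err := readColumn(bytes.NewReader(data), segmap)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(col)) // helloworldgoodnightgracie
}
```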
@@ -207,16 +217,18 @@
This simple top-down arrangement, along with the definition of the
other column structures below, is all that is needed to reconstruct
all of the original data.

-> Note that each row reassembly record has its own layout of columnar
-> values and there is no attempt made to store like-typed columns from different
-> schemas in the same physical column.
+:::tip note
+Each row reassembly record has its own layout of columnar
+values and there is no attempt made to store like-typed columns from different
+schemas in the same physical column.
+:::

The notation `<any_column>` refers to any instance of the five column types:
-* `<record_column>`,
-* `<array_column>`,
-* `<union_column>`,
-* `<map_column>`, or
-* `<primitive_column>`.
+* [`<record_column>`](#record-column),
+* [`<array_column>`](#array-column),
+* [`<union_column>`](#union-column),
+* [`<map_column>`](#map-column), or
+* [`<primitive_column>`](#primitive-column).

Note that when decoding a column, all type information is known
from the super type in question so there is no need
@@ -239,14 +251,14 @@ where
* `<fld1>` through `<fldn>` are the names of the top-level fields of the
original row record,
* the `column` fields are column stream definitions for each field, and
-* the `presence` columns are `int32` ZNG column streams comprised of a
+* the [`presence` columns](#presence-columns) are `int32` Super Binary column streams comprised of a
run-length encoding of the locations of column values in their respective rows,
-when there are null values (as described below).
+when there are null values.

If there are no null values, then the `presence` field contains an empty `<segmap>`.
If all of the values are null, then the `column` field is null (and the
`presence` contains an empty `<segmap>`). For an empty `<segmap>`, there is no
-corresponding data stored in the data section. Since a `<segmap>` is a Zed
+corresponding data stored in the data section. Since a `<segmap>` is an
array, an empty `<segmap>` is simply the empty array value `[]`.

#### Array Column

@@ -258,10 +270,10 @@ An `<array_column>` has the form:

where
* `values` represents a continuous sequence of values of the array elements
that are sliced into array values based on the length information, and
-* `lengths` encodes a Zed `int32` sequence of values that represent the length
- of each array value.
+* `lengths` encodes an `int32` sequence of values that represent the length
+of each array value.

-The `<array_column>` structure is used for both Zed arrays and sets.
+The `<array_column>` structure is used for both arrays and sets.

#### Map Column

@@ -285,7 +297,9 @@ in the same column order implied by the union type, and
* `tags` is a column of `int32` values where each subsequent value
encodes the tag of the union type indicating which column the value falls within.

-> TBD: change code to conform to columns array instead of record{c0,c1,...}
+:::tip TBD
+Change code to conform to columns array instead of record{c0,c1,...}
+:::

The number of times each value of `tags` appears must equal the number of values
in each respective column.

@@ -305,44 +319,46 @@ present so that null values are not encoded. Instead the presence column
is encoded as a sequence of alternating runs.
First, the number of values present is encoded, then the number of values not present,
then the number of values present, and so forth. These runs are then stored
-as Zed `int32` values in the presence column (which may be subject to further
+as `int32` values in the presence column (which may be subject to further
compression based on segment compression).

### The Trailer

-After the reassembly section is a ZNG stream with a single record defining
-the "trailer" of the VNG file.
The trailer provides a magic field
-indicating the "vng" format, a version number,
+After the reassembly section is a Super Binary stream with a single record defining
+the "trailer" of the Super Columnar file. The trailer provides a magic field
+indicating the file format, a version number,
the size of the segment threshold for decomposing segments into frames,
the size of the skew threshold for flushing all segments to storage
when the memory footprint roughly exceeds this threshold,
-and an array of sizes in bytes of the sections of the VNG file.
+and an array of sizes in bytes of the sections of the Super Columnar file.

The type of this record has the format
```
-{magic:string,type:string,version:int64,sections:[int64],meta:{skew_thresh:int64,segment_thresh:int64}
+{magic:string,type:string,version:int64,sections:[int64],meta:{skew_thresh:int64,segment_thresh:int64}}
```
The trailer can be efficiently found by scanning backward from the end of the
-VNG file to find a valid ZNG stream containing a single record value
+Super Columnar file to find a valid Super Binary stream containing a single record value
conforming to the above type.

## Decoding

-To decode an entire VNG file into rows, the trailer is read to find the sizes
-of the sections, then the ZNG stream of the reassembly section is read,
+To decode an entire Super Columnar file into rows, the trailer is read to find the sizes
+of the sections, then the Super Binary stream of the reassembly section is read,
typically in its entirety.

Since this data structure is relatively small compared to all of the columnar
-data in the VNG file,
+data in the file,
it will typically fit comfortably in memory
and it can be very fast to scan the entire reassembly structure for
any purpose.

-> For example, for a given query, a "scan planner" could traverse all the
-> reassembly records to figure out which segments will be needed, then construct
-> an intelligent plan for reading the needed segments and attempt to read them
-> in mostly sequential order, which could serve as
-> an optimizing intermediary between any underlying storage API and the
-> VNG decoding logic.
+:::tip Example
+For a given query, a "scan planner" could traverse all the
+reassembly records to figure out which segments will be needed, then construct
+an intelligent plan for reading the needed segments and attempt to read them
+in mostly sequential order, which could serve as
+an optimizing intermediary between any underlying storage API and the
+Super Columnar decoding logic.
+:::

To decode the "next" row, its schema index is read from the root reassembly
column stream.

@@ -354,37 +370,37 @@
The top-level reassembly fetches column values as a `<record_column>`.

For any `<record_column>`, a value from each field is read from each field's column,
accounting for the presence column indicating null,
-and the results are encoded into the corresponding ZNG record value using
-ZNG type information from the corresponding schema.
+and the results are encoded into the corresponding Super Binary record value using
+type information from the corresponding schema.

For a `<primitive_column>`, a value is determined by reading the next
value from its segmap.

For an `<array_column>`, a length is read from its `lengths` segmap as an `int32`
and that many values are read from its `values` sub-column,
-encoding the result as a ZNG array value.
+encoding the result as a Super Binary array value.

For a `<union_column>`, a value is read from its `tags` segmap
and that value is used to select the corresponding column stream
-`c0`, `c1`, etc.
The value read is then encoded as a ZNG union value +`c0`, `c1`, etc. The value read is then encoded as a Super Binary union value using the same tag within the union value. ## Examples ### Hello, world -Start with this [Super JSON](jsup.md)): +Start with this [Super JSON](jsup.md) file `hello.jsup`: ``` {a:"hello",b:"world"} {a:"goodnight",b:"gracie"} ``` -To convert to VNG format: +To convert to Super Columnar format: ``` super -f csup hello.jsup > hello.csup ``` -Segments in the VNG format would be laid out like this: +Segments in the Super Columnar format would be laid out like this: ``` === column for a hello diff --git a/docs/formats/zjson.md b/docs/formats/zjson.md index 49a1ec27c8..393346f9cd 100644 --- a/docs/formats/zjson.md +++ b/docs/formats/zjson.md @@ -9,7 +9,7 @@ sidebar_label: ZJSON The [super data model](zed.md) is based on richly typed records with a deterministic field order, -as is implemented by the [Super JSON](jsup.md), [Super Binary](bsup.md), and [Super Columnar](vng.md) formats. +as is implemented by the [Super JSON](jsup.md), [Super Binary](bsup.md), and [Super Columnar](csup.md) formats. Given the ubiquity of JSON, it is desirable to also be able to serialize super data into the JSON format. However, encoding super data values directly as JSON values would not work without loss of information. diff --git a/docs/language/search-expressions.md b/docs/language/search-expressions.md index 5349eef448..ad40078c70 100644 --- a/docs/language/search-expressions.md +++ b/docs/language/search-expressions.md @@ -132,7 +132,7 @@ When processing [Super Binary](../formats/bsup.md) data, the SuperDB runtime per Boyer-Moore scan over decompressed data buffers before parsing any data. This allows large buffers of data to be efficiently discarded and skipped when searching for rarely occurring values. For a [SuperDB data lake](../lake/format.md), -a planned feature will use [Super Columnar](../formats/vng.md) files to further accelerate searches. +a planned feature will use [Super Columnar](../formats/csup.md) files to further accelerate searches. ::: ### Search Terms diff --git a/docs/tutorials/zq.md b/docs/tutorials/zq.md index e9364a6c4c..87da9d179d 100644 --- a/docs/tutorials/zq.md +++ b/docs/tutorials/zq.md @@ -163,7 +163,7 @@ The human-readable format of Zed is called [ZSON](../formats/jsup.md) ZSON is nice because it has a comprehensive type system and you can go from ZSON to an efficient binary row format ([Super Binary](../formats/bsup.md)) -and columnar ([VNG](../formats/vng.md)) --- and vice versa --- +and columnar ([Super Columnar](../formats/csup.md)) --- and vice versa --- with complete fidelity and no loss of information. In this tour, we'll stick to ZSON (though for large data sets, [ZNG is much faster](../commands/zq.md#performance)). diff --git a/vng/object.go b/vng/object.go index 2a69850401..735ee52a17 100644 --- a/vng/object.go +++ b/vng/object.go @@ -1,5 +1,5 @@ // Package vng implements the reading and writing of VNG serialization objects. -// The VNG format is described at https://github.com/brimdata/super/blob/main/docs/formats/vng.md. +// The VNG format is described at https://github.com/brimdata/super/blob/main/docs/formats/csup.md. // // A VNG object is created by allocating an Encoder for any top-level Zed type // via NewEncoder, which recursively descends into the Zed type, allocating an Encoder
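As a closing aside, the alternating run-length scheme that the spec changes above describe for presence columns is simple enough to sketch in a few lines of Go. This is an illustration of the described algorithm under the stated encoding, not the actual brimdata/super implementation:

```go
package main

import "fmt"

// decodePresence expands a presence column's alternating run-length
// encoding: runs[0] values present, runs[1] values null, runs[2]
// present again, and so forth.
func decodePresence(runs []int32) []bool {
	var present []bool
	val := true // the first run always counts values that are present
	for _, n := range runs {
		for i := int32(0); i < n; i++ {
			present = append(present, val)
		}
		val = !val
	}
	return present
}

func main() {
	// Rows 1-3 have the field, rows 4-5 are null, row 6 has it again.
	fmt.Println(decodePresence([]int32{3, 2, 1}))
	// Output: [true true true false false true]
}
```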