[doc] Document Spec: table index and file index

apache · Aug 7, 2024 · b118e63 · b118e63
1 parent c11b950
commit b118e63

Showing 6 changed files with 197 additions and 44 deletions.
diff --git a/docs/content/concepts/spec/fileindex.md b/docs/content/concepts/spec/fileindex.md
@@ -0,0 +1,138 @@
+---
+title: "File Index"
+weight: 7
+type: docs
+aliases:
+- /concepts/spec/fileindex.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# File index
+
+When `file-index.${index_type}.columns` is defined, Paimon creates a corresponding index file for each data file. If the index
+file is small enough, it is stored directly in the manifest; otherwise it is stored in the directory of the data file. Each data file
+corresponds to one index file, which has its own file format and can contain different types of indexes for
+multiple columns.
+
+## Index File
+
+The file index file format puts all columns and offsets in the header.
+
+<pre>
+  _____________________________________    _____________________
+｜     magic    ｜version｜head length ｜
+｜-------------------------------------｜
+｜            column number            ｜
+｜-------------------------------------｜
+｜   column 1        ｜ index number   ｜
+｜-------------------------------------｜
+｜  index name 1 ｜start pos ｜length  ｜
+｜-------------------------------------｜
+｜  index name 2 ｜start pos ｜length  ｜
+｜-------------------------------------｜
+｜  index name 3 ｜start pos ｜length  ｜
+｜-------------------------------------｜            HEAD
+｜   column 2        ｜ index number   ｜
+｜-------------------------------------｜
+｜  index name 1 ｜start pos ｜length  ｜
+｜-------------------------------------｜
+｜  index name 2 ｜start pos ｜length  ｜
+｜-------------------------------------｜
+｜  index name 3 ｜start pos ｜length  ｜
+｜-------------------------------------｜
+｜                 ...                 ｜
+｜-------------------------------------｜
+｜                 ...                 ｜
+｜-------------------------------------｜
+｜  redundant length ｜redundant bytes ｜
+｜-------------------------------------｜    ---------------------
+｜                BODY                 ｜
+｜                BODY                 ｜
+｜                BODY                 ｜             BODY
+｜                BODY                 ｜
+｜_____________________________________｜    _____________________
+*
+magic:                            8 bytes long, value is 1493475289347502L, BIG_ENDIAN
+version:                          4 bytes int, BIG_ENDIAN
+head length:                      4 bytes int, BIG_ENDIAN
+column number:                    4 bytes int, BIG_ENDIAN
+column x name:                    2 bytes short BIG_ENDIAN and Java modified-utf-8
+index number:                     4 bytes int (how many index items below), BIG_ENDIAN
+index name x:                     2 bytes short BIG_ENDIAN and Java modified-utf-8
+start pos:                        4 bytes int, BIG_ENDIAN
+length:                           4 bytes int, BIG_ENDIAN
+redundant length:                 4 bytes int (for compatibility with later versions, in this version, content is zero)
+redundant bytes:                  var bytes (for compatibility with later version, in this version, is empty)
+BODY:                             column index bytes + column index bytes + column index bytes + .......
+</pre>
+
+## Column Index Bytes: BloomFilter
+
+Define `'file-index.bloom-filter.columns'`.
+
+The content of the bloom filter index is simple:
+- numHashFunctions 4 bytes int, BIG_ENDIAN
+- bloom filter bytes
+
+The bloom filter uses a 64-bit long hash. Only the number of hash functions (one integer) and the bit set bytes are stored.
+Byte-based types (like varchar, binary, etc.) are hashed with xxHash; numeric types are hashed with a [specified number hash](http://web.archive.org/web/20071223173210/http://www.concentric.net/~Ttwang/tech/inthash.htm).
+
+## Column Index Bytes: Bitmap
+
+Define `'file-index.bitmap.columns'`.
+
+Bitmap file index format (V1):
+
+<pre>
+Bitmap file index format (V1)
++-------------------------------------------------+-----------------
+｜ version (1 byte)                               ｜
++-------------------------------------------------+
+｜ row count (4 bytes int)                        ｜
++-------------------------------------------------+
+｜ non-null value bitmap number (4 bytes int)     ｜
++-------------------------------------------------+
+｜ has null value (1 byte)                        ｜
++-------------------------------------------------+
+｜ null value offset (4 bytes if has null value)  ｜       HEAD
++-------------------------------------------------+
+｜ value 1 | offset 1                             ｜
++-------------------------------------------------+
+｜ value 2 | offset 2                             ｜
++-------------------------------------------------+
+｜ value 3 | offset 3                             ｜
++-------------------------------------------------+
+｜ ...                                            ｜
++-------------------------------------------------+-----------------
+｜ serialized bitmap 1                            ｜
++-------------------------------------------------+
+｜ serialized bitmap 2                            ｜
++-------------------------------------------------+       BODY
+｜ serialized bitmap 3                            ｜
++-------------------------------------------------+
+｜ ...                                            ｜
++-------------------------------------------------+-----------------
+*
+value x:                       var bytes for any data type (as bitmap identifier)
+offset:                        4 bytes int (when it is negative, it represents that there is only one value
+                                 and its position is the inverse of the negative value)
+</pre>
+
+All integers are BIG_ENDIAN.
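
Reading aid for the `fileindex.md` spec above: a minimal Java sketch of how the HEAD of an index file could be parsed. It assumes all integers are big-endian and that names are laid out the way `DataOutputStream.writeUTF` writes them (2-byte length followed by modified UTF-8). `FileIndexHeaderSketch`, `readHead`, and `sampleHead` are hypothetical names, not Paimon classes, and the sample values (column `order_id`, index name `bloom-filter`, start pos 64, length 128) are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical reader for the file index HEAD; not Paimon's actual implementation. */
class FileIndexHeaderSketch {
    static final long MAGIC = 1493475289347502L;

    /** Returns column name -> (index name -> {start pos, length}). */
    static Map<String, Map<String, int[]>> readHead(byte[] file) {
        // DataInputStream reads multi-byte values in big-endian order.
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file))) {
            if (in.readLong() != MAGIC) {
                throw new IllegalArgumentException("magic mismatch: not a file index file");
            }
            in.readInt(); // version (ignored by this sketch)
            in.readInt(); // head length (ignored by this sketch)
            int columnNumber = in.readInt();
            Map<String, Map<String, int[]>> head = new LinkedHashMap<>();
            for (int c = 0; c < columnNumber; c++) {
                String column = in.readUTF(); // 2-byte length + modified UTF-8
                int indexNumber = in.readInt();
                Map<String, int[]> indexes = new LinkedHashMap<>();
                for (int i = 0; i < indexNumber; i++) {
                    String indexName = in.readUTF();
                    int startPos = in.readInt();
                    int length = in.readInt();
                    indexes.put(indexName, new int[] {startPos, length});
                }
                head.put(column, indexes);
            }
            int redundantLength = in.readInt(); // zero in this version of the format
            in.skipBytes(redundantLength);
            return head;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a one-column, one-index HEAD with made-up values for demonstration. */
    static byte[] sampleHead() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeLong(MAGIC);
            out.writeInt(1);              // version (illustrative value)
            out.writeInt(0);              // head length: placeholder, unused by readHead
            out.writeInt(1);              // column number
            out.writeUTF("order_id");     // column 1 name (invented)
            out.writeInt(1);              // index number
            out.writeUTF("bloom-filter"); // index name 1 (assumed spelling)
            out.writeInt(64);             // start pos (invented)
            out.writeInt(128);            // length (invented)
            out.writeInt(0);              // redundant length
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        readHead(sampleHead()).forEach((column, indexes) ->
                indexes.forEach((name, pos) ->
                        System.out.println(column + "." + name
                                + " -> start=" + pos[0] + ", length=" + pos[1])));
    }
}
```

With the `{start pos, length}` pair in hand, a reader could seek to the column index bytes in the BODY and hand them to the matching index type. Whether `start pos` counts from the start of the file or of the BODY is not stated in the text above, so the sketch leaves that interpretation to the caller.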
diff --git a/docs/content/concepts/spec/indexfile.md b/docs/content/concepts/spec/indexfile.md
diff --git a/docs/content/concepts/spec/snapshot.md b/docs/content/concepts/spec/snapshot.md
@@ -53,7 +53,7 @@ Snapshot File is JSON, it includes:
 4. baseManifestList: a manifest list recording all changes from the previous snapshots.
 5. deltaManifestList: a manifest list recording all new changes occurred in this snapshot.
 6. changelogManifestList: a manifest list recording all changelog produced in this snapshot, null if no changelog is produced.
-7. indexManifest: a manifest recording all index files of this table, null if no index file.
+7. indexManifest: a manifest recording all index files of this table, null if no table index file.
 8. commitUser: usually generated by UUID, it is used for recovery of streaming writes, one stream write job with one user.
 9. commitIdentifier: transaction id corresponding to streaming write, each transaction may result in multiple commits for different commitKinds.
 10. commitKind: type of changes in this snapshot, including append, compact, overwrite and analyze.

diff --git a/docs/content/concepts/spec/tableindex.md b/docs/content/concepts/spec/tableindex.md
@@ -0,0 +1,57 @@
+---
+title: "Table Index"
+weight: 6
+type: docs
+aliases:
+- /concepts/spec/tableindex.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Table index
+
+Table index files are in the `index` directory.
+
+## Dynamic Bucket Index
+
+Dynamic bucket index is used to store the correspondence between the hash value of the primary-key and the bucket.
+
+Its structure is very simple, only storing hash values in the file:
+
+HASH_VALUE | HASH_VALUE | HASH_VALUE | HASH_VALUE | ...
+
+HASH_VALUE is the hash value of the primary-key. 4 bytes, BIG_ENDIAN.
+
+## Deletion Vectors
+
+A deletion file stores the positions of deleted records for each data file. For a primary key table, each bucket has
+one deletion file.
+
+{{< img src="/img/deletion-file.png">}}
+
+The deletion file is a binary file, and the format is as follows:
+
+- First, record the version in one byte. The current version is 1.
+- Then, record <size of serialized bin, serialized bin, checksum of serialized bin> in sequence.
+- Size and checksum are BIG_ENDIAN integers.
+
+For each serialized bin:
+
+- First, record a constant magic number as an int (BIG_ENDIAN). Currently the magic number is 1581511376.
+- Then, record the serialized bitmap, which is a [RoaringBitmap](https://github.com/RoaringBitmap/RoaringBitmap) (org.roaringbitmap.RoaringBitmap).
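
Reading aid for the `tableindex.md` spec above: a minimal Java sketch that reads both table index layouts. It rests on assumptions the text does not spell out: that the "size of serialized bin" covers the magic number plus the bitmap bytes, and that the <size, bin, checksum> triples simply repeat until the end of the file. The checksum algorithm is not named in the spec text, so the sketch reads the checksum without verifying it. `TableIndexSketch`, `readHashes`, and `readDeletionBins` are hypothetical names, not Paimon classes.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical readers for the two table index layouts; not Paimon's actual code. */
class TableIndexSketch {
    static final int DELETION_MAGIC = 1581511376;

    /** Dynamic bucket index: a plain sequence of 4-byte big-endian hash values. */
    static int[] readHashes(byte[] file) {
        ByteBuffer buf = ByteBuffer.wrap(file); // ByteBuffer is big-endian by default
        int[] hashes = new int[file.length / Integer.BYTES];
        for (int i = 0; i < hashes.length; i++) {
            hashes[i] = buf.getInt();
        }
        return hashes;
    }

    /** Deletion file: returns the serialized bitmap bytes of each bin, magic stripped. */
    static List<byte[]> readDeletionBins(byte[] file) {
        ByteBuffer buf = ByteBuffer.wrap(file);
        byte version = buf.get();
        if (version != 1) {
            throw new IllegalArgumentException("unsupported version: " + version);
        }
        List<byte[]> bitmaps = new ArrayList<>();
        while (buf.hasRemaining()) {
            int size = buf.getInt();  // assumed to cover magic + bitmap bytes
            int magic = buf.getInt(); // first 4 bytes of the serialized bin
            if (magic != DELETION_MAGIC || size < Integer.BYTES) {
                throw new IllegalArgumentException("corrupt serialized bin");
            }
            byte[] bitmap = new byte[size - Integer.BYTES];
            buf.get(bitmap);
            buf.getInt();             // checksum: read but not verified in this sketch
            bitmaps.add(bitmap);
        }
        return bitmaps;
    }

    public static void main(String[] args) {
        byte[] hashFile = ByteBuffer.allocate(8).putInt(7).putInt(-3).array();
        System.out.println("hash values: " + readHashes(hashFile).length);

        // One bin whose "bitmap" is two made-up bytes, with a dummy checksum of 0.
        ByteBuffer deletion = ByteBuffer.allocate(1 + 4 + 4 + 2 + 4);
        deletion.put((byte) 1).putInt(6).putInt(DELETION_MAGIC).put(new byte[] {9, 8}).putInt(0);
        System.out.println("bins: " + readDeletionBins(deletion.array()).size());
    }
}
```

The returned bitmap bytes would still need to be deserialized with `org.roaringbitmap.RoaringBitmap` to recover the deleted row positions; that step is left out here to keep the sketch free of external dependencies.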
diff --git a/docs/static/img/deletion-file.png b/docs/static/img/deletion-file.png
diff --git a/...on-common/src/main/java/org/apache/paimon/fileindex/bloomfilter/BloomFilterFileIndex.java b/...on-common/src/main/java/org/apache/paimon/fileindex/bloomfilter/BloomFilterFileIndex.java
@@ -101,7 +101,7 @@ public void write(Object key) {
         public byte[] serializedBytes() {
             int numHashFunctions = filter.getNumHashFunctions();
             byte[] serialized = new byte[filter.getBitSet().bitSize() / Byte.SIZE + Integer.BYTES];
-            // little endian
+            // big endian
             serialized[0] = (byte) ((numHashFunctions >>> 24) & 0xFF);
             serialized[1] = (byte) ((numHashFunctions >>> 16) & 0xFF);
             serialized[2] = (byte) ((numHashFunctions >>> 8) & 0xFF);