[flink] Add Clustering section in doc to introduce range partition and sort
WencongLiu committed May 28, 2024
1 parent 7e4c2f5 commit 860bc0f
Showing 1 changed file with 32 additions and 0 deletions.
32 changes: 32 additions & 0 deletions docs/content/flink/sql-write.md
@@ -49,6 +49,38 @@ snapshot expiration, and even partition expiration in Flink Sink (if it is confi

For multiple jobs to write the same table, you can refer to [dedicated compaction job]({{< ref "maintenance/dedicated-compaction#dedicated-compaction-job" >}}) for more info.

### Clustering

In Paimon, clustering lets you organize the data in your [Append Table]({{< ref "append-table/append-table#Append Table" >}})
by the values of certain columns during the write process. Organizing data this way can make downstream read tasks considerably
more efficient, because it enables faster and more targeted data retrieval.

To use clustering, specify the columns to cluster by when creating or writing to a table. Here's a simple example of how to enable clustering:

```sql
CREATE TABLE my_table (
    a STRING,
    b STRING,
    c STRING
) WITH (
    'sink.clustering.by-columns' = 'a,b'
);
```

You can also use SQL hints to dynamically set clustering options:

```sql
INSERT INTO my_table /*+ OPTIONS('sink.clustering.by-columns' = 'a,b') */
SELECT * FROM source;
```

Clustering is implemented by range partitioning followed by sorting. Range partitioning relies on sampling; if the sampling process consumes too much
time in the job, you can decrease the total number of samples by setting `sink.clustering.sample-factor`. The comparison algorithm for range partitioning is chosen
automatically (ORDER, ZORDER, or HILBERT), but you can set it manually with `sink.clustering.strategy`. The sort step is enabled by default to
further improve data ordering. If sorting slows down the job, you can set `sink.clustering.sort-in-cluster` to false.
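
As a rough sketch, the tuning options described above could be combined in a single table definition. The table name `my_tuned_table` is hypothetical, and the values `'zorder'` and `'50'` are illustrative assumptions rather than recommendations; check the accepted values and defaults for each option in the configuration reference before relying on them.

```sql
-- Hypothetical table; option values below are assumptions for illustration only.
CREATE TABLE my_tuned_table (
    a STRING,
    b STRING,
    c STRING
) WITH (
    -- Cluster written data by columns a and b.
    'sink.clustering.by-columns' = 'a,b',
    -- Pick the comparison algorithm manually instead of letting it be chosen automatically.
    'sink.clustering.strategy' = 'zorder',
    -- Lower the sample factor to reduce the time spent sampling.
    'sink.clustering.sample-factor' = '50',
    -- Skip the sort step if it slows down the job.
    'sink.clustering.sort-in-cluster' = 'false'
);
```

The same keys could presumably also be passed per job through an `OPTIONS` hint on the `INSERT` statement, in the same way as `sink.clustering.by-columns` in the earlier hint example.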

You can refer to [FlinkConnectorOptions]({{< ref "maintenance/configurations#FlinkConnectorOptions" >}}) for more info about the configurations above.

## Overwriting the Whole Table

For unpartitioned tables, Paimon supports overwriting the whole table.