[doc] Minor refactor Python API doc (#4635)
yuzelin authored Dec 4, 2024
1 parent c7cd8e9 commit c235a4a
Showing 1 changed file with 46 additions and 28 deletions.
74 changes: 46 additions & 28 deletions docs/content/program-api/python-api.md
@@ -166,6 +166,8 @@ schema = Schema(
)
```

See [Data Types]({{< ref "python-api#data-types" >}}) for the full `pyarrow-to-paimon` data type mapping.

Second, if you have some Pandas data, the `pa_schema` can be extracted from `DataFrame`:

```python
@@ -217,7 +219,15 @@ table = catalog.get_table('database_name.table_name')

## Batch Read

The reading is divided into Scan Plan and Read Splits stages. A `ReadBuilder` is used to create utils for these stages.
### Set Read Parallelism

The `TableRead` interface can read multiple splits in parallel. Set `'max-workers': 'N'` in `catalog_options`
to control the number of threads used for reading splits. `max-workers` is 1 by default, which means `TableRead` reads
splits sequentially if you don't set `max-workers`.
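
As a minimal sketch of the option described above, the dictionary below adds `max-workers` to the catalog options. The `warehouse` key and its path are placeholder assumptions, and the catalog creation call appears only as a comment because it depends on the installed SDK.

```python
# Hypothetical catalog options; only 'max-workers' is taken from the text above.
catalog_options = {
    'warehouse': 'file:///tmp/paimon_warehouse',  # assumed placeholder path
    'max-workers': '4',  # read up to 4 splits concurrently (default is 1)
}

# The options would then be passed when the catalog is created, for example:
# catalog = Catalog.create(catalog_options)
print(catalog_options['max-workers'])
```

Leaving `max-workers` out keeps the default of 1, so splits are read one after another.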

### Get ReadBuilder and Perform pushdown

A `ReadBuilder` is used to build reading utils and perform filter and projection pushdown.

```python
table = catalog.get_table('database_name.table_name')
@@ -241,25 +251,7 @@ predicate_5 = predicate_builder.and_predicates([predicate3, predicate4])
read_builder = read_builder.with_filter(predicate_5)
```


| Predicate kind | Predicate method |
|:----------------------|:----------------------------------------------|
| p1 and p2 | PredicateBuilder.and_predicates([p1, p2]) |
| p1 or p2 | PredicateBuilder.or_predicates([p1, p2]) |
| f = literal | PredicateBuilder.equal(f, literal) |
| f != literal | PredicateBuilder.not_equal(f, literal) |
| f < literal | PredicateBuilder.less_than(f, literal) |
| f <= literal | PredicateBuilder.less_or_equal(f, literal) |
| f > literal | PredicateBuilder.greater_than(f, literal) |
| f >= literal | PredicateBuilder.greater_or_equal(f, literal) |
| f is null | PredicateBuilder.is_null(f) |
| f is not null | PredicateBuilder.is_not_null(f) |
| f.startswith(literal) | PredicateBuilder.startswith(f, literal) |
| f.endswith(literal) | PredicateBuilder.endswith(f, literal) |
| f.contains(literal) | PredicateBuilder.contains(f, literal) |
| f is in [l1, l2] | PredicateBuilder.is_in(f, [l1, l2]) |
| f is not in [l1, l2] | PredicateBuilder.is_not_in(f, [l1, l2]) |
| lower <= f <= upper | PredicateBuilder.between(f, lower, upper) |
See [Predicate]({{< ref "python-api#predicate" >}}) for all supported filters and building methods.

You can also pushdown projection by `ReadBuilder`:

@@ -268,16 +260,20 @@ You can also pushdown projection by `ReadBuilder`:
read_builder = read_builder.with_projection(['f3', 'f2'])
```

Then you can scan plan:
### Scan Plan

Then you can enter the Scan Plan stage to get `splits`:

```python
table_scan = read_builder.new_scan()
splits = table_scan.splits()
```

### Read Splits

Finally, you can read data from the `splits` into various data formats.

### Apache Arrow
#### Apache Arrow

This requires `pyarrow` to be installed.

@@ -300,7 +296,7 @@ You can also read data into a `pyarrow.RecordBatchReader` and iterate record bat

```python
table_read = read_builder.new_read()
for batch in table_read.to_arrow_batch_reader(splits)
for batch in table_read.to_arrow_batch_reader(splits):
    print(batch)

# pyarrow.RecordBatch
@@ -311,7 +307,7 @@ for batch in table_read.to_arrow_batch_reader(splits)
# f1: ["a","b","c"]
```

### Pandas
#### Pandas

This requires `pandas` to be installed.

@@ -330,11 +326,11 @@ print(df)
# ...
```

### DuckDB
#### DuckDB

This requires `duckdb` to be installed.

You can convert the splits into a in-memory DuckDB table and query it:
You can convert the splits into an in-memory DuckDB table and query it:

```python
table_read = read_builder.new_read()
@@ -353,7 +349,7 @@ print(duckdb_con.query("SELECT * FROM duckdb_table WHERE f0 = 1").fetchdf())
# 0 1 a
```

### Ray
#### Ray

This requires `ray` to be installed.

@@ -380,7 +376,7 @@ print(ray_dataset.to_pandas())

## Batch Write

Paimon table write is Two-Phase Commit, you can write many times, but once committed, no more data can be write.
Paimon table write is Two-Phase Commit, you can write many times, but once committed, no more data can be written.
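
The write-many, commit-once behavior can be illustrated with a toy model. The `ToyBatchWrite` class and its methods below are hypothetical and only mimic the semantics described here; they are not the Paimon write API.

```python
class ToyBatchWrite:
    """Toy model of write-many, commit-once semantics (not the Paimon API)."""

    def __init__(self):
        self._pending = []       # rows staged but not yet visible
        self._committed = False
        self.snapshot = []       # rows visible after commit

    def write(self, rows):
        # Writing is allowed any number of times before the commit.
        if self._committed:
            raise RuntimeError("already committed; no more data can be written")
        self._pending.extend(rows)

    def prepare_commit(self):
        # Phase 1: collect everything written so far.
        return list(self._pending)

    def commit(self, staged_rows):
        # Phase 2: publish the staged rows and close the writer.
        self.snapshot = list(staged_rows)
        self._committed = True


writer = ToyBatchWrite()
writer.write([1, 2])
writer.write([3])
writer.commit(writer.prepare_commit())

try:
    writer.write([4])
    rejected = False
except RuntimeError:
    rejected = True

print(writer.snapshot, rejected)
```

In this sketch, a new writer object would be needed for any data written after the commit.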

{{< hint warning >}}
Currently, Python SDK doesn't support writing primary key table with `bucket=-1`.
@@ -438,3 +434,25 @@ write_builder.overwrite({'dt': '2024-01-01'})
| pyarrow.float64() | DOUBLE |
| pyarrow.string() | STRING |
| pyarrow.boolean() | BOOLEAN |

## Predicate

| Predicate kind | Predicate method |
|:----------------------|:----------------------------------------------|
| p1 and p2 | PredicateBuilder.and_predicates([p1, p2]) |
| p1 or p2 | PredicateBuilder.or_predicates([p1, p2]) |
| f = literal | PredicateBuilder.equal(f, literal) |
| f != literal | PredicateBuilder.not_equal(f, literal) |
| f < literal | PredicateBuilder.less_than(f, literal) |
| f <= literal | PredicateBuilder.less_or_equal(f, literal) |
| f > literal | PredicateBuilder.greater_than(f, literal) |
| f >= literal | PredicateBuilder.greater_or_equal(f, literal) |
| f is null | PredicateBuilder.is_null(f) |
| f is not null | PredicateBuilder.is_not_null(f) |
| f.startswith(literal) | PredicateBuilder.startswith(f, literal) |
| f.endswith(literal) | PredicateBuilder.endswith(f, literal) |
| f.contains(literal) | PredicateBuilder.contains(f, literal) |
| f is in [l1, l2] | PredicateBuilder.is_in(f, [l1, l2]) |
| f is not in [l1, l2] | PredicateBuilder.is_not_in(f, [l1, l2]) |
| lower <= f <= upper | PredicateBuilder.between(f, lower, upper) |
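
To make the semantics in the table concrete, here is a small pure-Python sketch. The helper functions are illustrative stand-ins named after the table entries, not the `PredicateBuilder` API, and rows are modeled as plain dictionaries (an assumption for illustration).

```python
# Illustrative stand-ins that mimic the semantics in the table above.
# They are NOT the Paimon PredicateBuilder API; each returns a plain function
# that tests one row (a dict) and returns True or False.

def equal(f, literal):
    return lambda row: row[f] == literal

def between(f, lower, upper):
    # Both bounds are inclusive: lower <= f <= upper
    return lambda row: lower <= row[f] <= upper

def startswith(f, literal):
    return lambda row: row[f].startswith(literal)

def and_predicates(predicates):
    return lambda row: all(p(row) for p in predicates)

def or_predicates(predicates):
    return lambda row: any(p(row) for p in predicates)


rows = [
    {'f0': 1, 'f1': 'apple'},
    {'f0': 5, 'f1': 'banana'},
    {'f0': 9, 'f1': 'avocado'},
]

# (1 <= f0 <= 5 and f1 starts with 'a') or f0 = 9
predicate = or_predicates([
    and_predicates([between('f0', 1, 5), startswith('f1', 'a')]),
    equal('f0', 9),
])

matched = [row['f0'] for row in rows if predicate(row)]
print(matched)  # [1, 9]
```

Compound predicates nest the same way as in the `and_predicates` and `or_predicates` rows of the table: each takes a list of already-built predicates.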
