Describe This Problem
In the current write procedure:
- `RowGroup` is used for the `write` method of the `Table` trait;
- Above `Table.write`, there are two sources converted to `RowGroup`:
  - `RemoteEngineService.write_batch` receives the raw bytes of an Arrow record batch and converts the record batches to `RowGroup`;
  - `StorageService.write` receives the raw bytes of a custom protobuf struct and converts the protobuf struct to `RowGroup`;
- Under `Table.write`, the `RowGroup` is encoded into raw bytes for WAL logs and memtable rows. The WAL log payload has no special requirement for the encoding method, while the memtable requires the `RowGroup` to be encoded row by row so that all rows stay in primary key order.
Proposal
From the description above, the write procedure involves too many conversions, leading to high CPU utilization, which has been confirmed in the production environment.
Maybe we can use a single struct for the whole write procedure to avoid the extra conversions. For the WAL and memtable, I guess we can let the WAL log payload share the same encoded bytes used by the memtable. Such a struct must be designed for writing; that is to say, it does not need to include complex schema information.
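To make the idea concrete, here is a minimal sketch of what "encode once, share the bytes" could look like. Everything in it is hypothetical and not taken from the existing code base: the `WriteRow` layout (a fixed `timestamp`, `series_id`, `value` triple with no schema attached), the `Wal` and `Memtable` stand-ins, and the key encoding. It assumes the primary key is `(timestamp, series_id)` and encodes the key so that byte-wise order matches primary key order, which lets a sorted map act as the memtable.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Hypothetical write-only row: the layout is fixed by convention,
/// so no schema information travels with the data.
pub struct WriteRow {
    pub timestamp: i64,
    pub series_id: u64,
    pub value: f64,
}

/// Encode a row once: 16 bytes of primary key followed by 8 bytes of value.
/// The key is big-endian, and the sign bit of the timestamp is flipped,
/// so comparing the key bytes gives the same order as (timestamp, series_id).
pub fn encode_row(row: &WriteRow) -> Arc<[u8]> {
    let mut buf = Vec::with_capacity(24);
    let ts = (row.timestamp as u64) ^ (1u64 << 63);
    buf.extend_from_slice(&ts.to_be_bytes());
    buf.extend_from_slice(&row.series_id.to_be_bytes());
    buf.extend_from_slice(&row.value.to_be_bytes());
    Arc::from(buf)
}

/// Read the value back out of an encoded row.
pub fn decode_value(encoded: &[u8]) -> f64 {
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&encoded[16..24]);
    f64::from_be_bytes(raw)
}

/// Stand-in for the WAL: an append-only list of payloads.
#[derive(Default)]
pub struct Wal {
    pub entries: Vec<Arc<[u8]>>,
}

/// Stand-in for the memtable: rows kept sorted by encoded primary key.
#[derive(Default)]
pub struct Memtable {
    pub rows: BTreeMap<[u8; 16], Arc<[u8]>>,
}

/// Each row is encoded exactly once; the WAL and the memtable
/// hold reference-counted handles to the same bytes.
pub fn write(rows: &[WriteRow], wal: &mut Wal, memtable: &mut Memtable) {
    for row in rows {
        let encoded = encode_row(row);
        let mut key = [0u8; 16];
        key.copy_from_slice(&encoded[..16]);
        wal.entries.push(Arc::clone(&encoded));
        memtable.rows.insert(key, encoded);
    }
}
```

In this sketch the only per-row work is a single encoding pass; whether a real WAL can accept per-row payloads like this, rather than one payload per batch, is an open design question.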
Additional Context
The encoding and decoding of Arrow IPC perform very well, and I guess it should be a benchmark for the new struct designed for the write procedure.