This repository has been archived by the owner on Nov 15, 2024. It is now read-only.

outbuf.write is too slow #17

Open
izhukov1992 opened this issue Jun 15, 2023 · 3 comments · May be fixed by #18

Comments

@izhukov1992

izhukov1992 commented Jun 15, 2023

Hi!

Thanks a lot for this exciting Cython extension for Avro serialization! It makes my code roughly 2x faster.

But the next round of profiling shows a bottleneck in io.py (BytesIO.write), which is used during serialization. I may be using it in the wrong way (please correct me if so; maybe I should use something other than BytesIO):

from io import BytesIO
# FastDatumWriter and FastBinaryEncoder come from spavro; their exact import
# path is omitted here.

buff = BytesIO()
writer = FastDatumWriter(schema_object)
encoder = FastBinaryEncoder(buff)

for dict_record in dict_records:
    buff.seek(0)
    writer.write(dict_record, encoder)

If this is the correct use case, maybe you could suggest how to use native Cython data structures in fast_binary.pyx (I'm still a newbie in Cython). Then I could create my own fork and try to implement it, to avoid using BytesIO.

My own view of this "problem" is that the .write() method is invoked for each field of the schema. If the schema is quite complex, this leads to many invocations of .write() (in my profiling report it takes 50% of execution time). It may be possible to fill some internal Cython data structure (maybe a char*) instead, and convert it to a BytesIO once at the end.
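A rough sketch of that idea in plain Python (a real version inside fast_binary.pyx would use cdef types and a raw char* buffer; BufferedEncoder and its methods are made up for illustration, and this assumes the encoder only ever calls .write() on the object it wraps):

from io import BytesIO

class BufferedEncoder:
    # Accumulate encoded bytes in a bytearray and flush once at the end.
    def __init__(self):
        self._buf = bytearray()

    def write(self, data):
        # Appending to a bytearray avoids a BytesIO method call per field.
        self._buf += data

    def flush_to(self, stream):
        # A single write at the end instead of one write per field.
        stream.write(bytes(self._buf))
        self._buf.clear()

# Hypothetical usage, mirroring the loop above:
# buf = BufferedEncoder()
# encoder = FastBinaryEncoder(buf)
# writer.write(dict_record, encoder)
# buf.flush_to(BytesIO())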

I will be happy to see any answer from you!
Thank you!

@mikepk
Contributor

mikepk commented Jul 7, 2023

Hi there! I'm the original author but no longer at Pluralsight, and it's been a while since I looked at this code. The idea was to keep things in a streaming interface (file-like) and limit memory consumption: you can write directly to a binary file or a stream like stdout. Ultimately you likely want to output the serialized Avro to something other than a BytesIO, and passing in the file or stream directly lets you do that.
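For example, writing the serialized records straight to a binary file instead of a BytesIO would look roughly like this (a sketch only; it reuses the FastDatumWriter / FastBinaryEncoder names from the snippet above and assumes their imports are already in place):

# Write raw serialized records directly to a file object.
with open("records.bin", "wb") as out:
    writer = FastDatumWriter(schema_object)
    encoder = FastBinaryEncoder(out)
    for dict_record in dict_records:
        writer.write(dict_record, encoder)

# sys.stdout.buffer would work the same way for streaming to stdout.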

I would be curious whether there are time savings from collecting all of the data in memory and then doing a single write to the output buffer. I can't remember if I tested this or not. My hunch is that the complexity of managing your own buffer (pre-allocation, resizing, reuse, etc.) wouldn't buy you much, but I could be totally wrong.
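That hunch is easy to probe with the standard library alone, comparing many small BytesIO writes against batching everything in a bytearray and writing once (the payload below is synthetic, so absolute numbers will differ from real Avro records):

import timeit
from io import BytesIO

CHUNKS = [b"x" * 8] * 10_000  # stand-in for many small per-field writes

def many_small_writes():
    buf = BytesIO()
    for chunk in CHUNKS:
        buf.write(chunk)
    return buf.getvalue()

def one_big_write():
    scratch = bytearray()
    for chunk in CHUNKS:
        scratch += chunk
    buf = BytesIO()
    buf.write(bytes(scratch))
    return buf.getvalue()

print("many small writes:", timeit.timeit(many_small_writes, number=200))
print("one big write:", timeit.timeit(one_big_write, number=200))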

Pluralsight doesn't seem to be maintaining this repo, so I made a fork in my personal GitHub if you want to fork and play with the latest version: https://github.com/mikepk/spavro

@izhukov1992 izhukov1992 linked a pull request Jul 19, 2023 that will close this issue
@izhukov1992
Author

Hi @mikepk !

Thank you for the response. I've made some redesigns (using cdef functions for writing values of base types, and using Cython's array.array instead of BytesIO) and reached a +40% performance gain in my app. The link to the MR is above.

I've done some manual regression testing and it looks good. But I'm having some trouble running benchmark.py:

Generating sample avro to deserialize
Traceback (most recent call last):
  File "/opt/january-mediator/spavro/benchmark/benchmark.py", line 148, in <module>
    benchmark_results = run_benchmarks()
  File "/opt/january-mediator/spavro/benchmark/benchmark.py", line 122, in run_benchmarks
    read_functions = [("Avro", make_avro_reader(schema)),
  File "/opt/january-mediator/spavro/benchmark/benchmark.py", line 42, in make_avro_reader
    parsed_schema = avro.schema.Parse(json.dumps(schema))
AttributeError: module 'avro.schema' has no attribute 'Parse'
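That AttributeError usually comes from the installed avro package exposing a lowercase parse rather than the Parse name used by the old avro-python3 distribution; a small compatibility shim in make_avro_reader might be enough (a sketch, assuming the rest of the avro reader API is unchanged):

import json
import avro.schema

# Use whichever schema-parsing function this avro version provides.
parse_schema = getattr(avro.schema, "Parse", None) or avro.schema.parse
parsed_schema = parse_schema(json.dumps(schema))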

If you are interested and could help with the performance test, I would be glad to make an MR to your repo.

Thanks!

@mikepk
Contributor

mikepk commented Jul 19, 2023

Hi there! I'll take a look! Unfortunately I can't merge the pull request since I'm not at Pluralsight anymore and not the maintainer.
