This repository has been archived by the owner on Nov 15, 2024. It is now read-only.

outbuf.write is too slow #17

Open
izhukov1992 opened this issue Jun 15, 2023 · 3 comments · May be fixed by #18

Comments

@izhukov1992

izhukov1992 commented Jun 15, 2023

Hi!

Thanks a lot for this exciting Cython extension for Avro serialization! It makes my code roughly 2x faster.

But the next round of profiling shows a bottleneck in io.py (BytesIO.write), which is used during serialization. I may be using it in the wrong way (please correct me if so; maybe I should use something other than BytesIO):

from io import BytesIO
# FastDatumWriter and FastBinaryEncoder come from spavro; their exact import
# path is omitted here.

buff = BytesIO()
writer = FastDatumWriter(schema_object)
encoder = FastBinaryEncoder(buff)

for dict_record in dict_records:
    buff.seek(0)
    writer.write(dict_record, encoder)

If this is the correct use case, maybe you could suggest how to use native Cython data structures in fast_binary.pyx (I'm still a newbie in Cython). Then I could create my own fork and try to implement it, to avoid using BytesIO.

My own view of this "problem" is that the .write() method is invoked for each field of the schema. If the schema is quite complex, this leads to many invocations of .write() (in my profiling report it takes 50% of execution time). It may be possible to fill some internal Cython data structure (maybe a char*) instead, and convert it to a BytesIO once at the end.
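A rough sketch of that idea in plain Python (a real version inside fast_binary.pyx would use cdef types and a raw char* buffer; BufferedEncoder and its methods are made up for illustration, and this assumes the encoder only ever calls .write() on the object it wraps):

from io import BytesIO

class BufferedEncoder:
    # Accumulate encoded bytes in a bytearray and flush once at the end.
    def __init__(self):
        self._buf = bytearray()

    def write(self, data):
        # Appending to a bytearray avoids a BytesIO method call per field.
        self._buf += data

    def flush_to(self, stream):
        # A single write at the end instead of one write per field.
        stream.write(bytes(self._buf))
        self._buf.clear()

# Hypothetical usage, mirroring the loop above:
# buf = BufferedEncoder()
# encoder = FastBinaryEncoder(buf)
# writer.write(dict_record, encoder)
# buf.flush_to(BytesIO())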

I will be happy to see any answer from you!
Thank you!

@mikepk
Contributor

mikepk commented Jul 7, 2023

Hi there! I'm the original author but no longer at Pluralsight, and it's been a while since I looked at this code. The idea was to keep things in a streaming interface (file-like) and limit memory consumption: you can write directly to a binary file or a stream like stdout. Ultimately you likely want to output the serialized Avro to something other than a BytesIO, and passing in the file or stream directly lets you do that.
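For example, writing the serialized records straight to a binary file instead of a BytesIO would look roughly like this (a sketch only; it reuses the FastDatumWriter / FastBinaryEncoder names from the snippet above and assumes their imports are already in place):

# Write raw serialized records directly to a file object.
with open("records.bin", "wb") as out:
    writer = FastDatumWriter(schema_object)
    encoder = FastBinaryEncoder(out)
    for dict_record in dict_records:
        writer.write(dict_record, encoder)

# sys.stdout.buffer would work the same way for streaming to stdout.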

I would be curious whether there are time savings from collecting all of the data in memory and then doing a single write to the output buffer. I can't remember if I tested this or not. My hunch is that the complexity of managing your own buffer (pre-allocation, resizing, reuse, etc.) wouldn't buy you much, but I could be totally wrong.
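That hunch is easy to probe with the standard library alone, comparing many small BytesIO writes against batching everything in a bytearray and writing once (the payload below is synthetic, so absolute numbers will differ from real Avro records):

import timeit
from io import BytesIO

CHUNKS = [b"x" * 8] * 10_000  # stand-in for many small per-field writes

def many_small_writes():
    buf = BytesIO()
    for chunk in CHUNKS:
        buf.write(chunk)
    return buf.getvalue()

def one_big_write():
    scratch = bytearray()
    for chunk in CHUNKS:
        scratch += chunk
    buf = BytesIO()
    buf.write(bytes(scratch))
    return buf.getvalue()

print("many small writes:", timeit.timeit(many_small_writes, number=200))
print("one big write:", timeit.timeit(one_big_write, number=200))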

Pluralsight doesn't seem to be maintaining this repo, so I made a fork in my personal GitHub if you want to fork and play with the latest version: https://github.com/mikepk/spavro

@izhukov1992 izhukov1992 linked a pull request Jul 19, 2023 that will close this issue
@izhukov1992
Author

Hi @mikepk !

Thank you for the response. I've made some redesigns (using cdef functions for writing values of base types, and using Cython's array.array instead of BytesIO) and reached a +40% performance gain in my app. The link to the MR is above.

I've done some manual regression testing and it looks good. But I'm having some trouble running benchmark.py:

Generating sample avro to deserialize
Traceback (most recent call last):
  File "/opt/january-mediator/spavro/benchmark/benchmark.py", line 148, in <module>
    benchmark_results = run_benchmarks()
  File "/opt/january-mediator/spavro/benchmark/benchmark.py", line 122, in run_benchmarks
    read_functions = [("Avro", make_avro_reader(schema)),
  File "/opt/january-mediator/spavro/benchmark/benchmark.py", line 42, in make_avro_reader
    parsed_schema = avro.schema.Parse(json.dumps(schema))
AttributeError: module 'avro.schema' has no attribute 'Parse'
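That AttributeError usually comes from the installed avro package exposing a lowercase parse rather than the Parse name used by the old avro-python3 distribution; a small compatibility shim in make_avro_reader might be enough (a sketch, assuming the rest of the avro reader API is unchanged):

import json
import avro.schema

# Use whichever schema-parsing function this avro version provides.
parse_schema = getattr(avro.schema, "Parse", None) or avro.schema.parse
parsed_schema = parse_schema(json.dumps(schema))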

If you are interested and could help with the performance test, I would be glad to make an MR to your repo.

Thanks!

@mikepk
Contributor

mikepk commented Jul 19, 2023

Hi there! I'll take a look! Unfortunately I can't merge the pull request since I'm not at Pluralsight anymore and not the maintainer.
