add Gzip_io.string_lwt #25

Open

tatchi wants to merge 2 commits into master

Conversation

@tatchi (Contributor) commented May 6, 2024

Compressing a large body can block the event loop, preventing other operations from proceeding concurrently. This PR introduces a Gzip_io.string_lwt function which compresses large string data without blocking the event loop.

reference: https://git.ahrefs.com/ahrefs/monorepo/pull/18094#issuecomment-37750
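For context, a minimal usage sketch. The caller-side names (respond_gzipped, send_response) are hypothetical, and the signature Gzip_io.string_lwt : ?chunk_size:int -> ?yield:(unit -> unit Lwt.t) -> string -> string Lwt.t is an assumption based on this diff, mirroring the blocking Gzip_io.string:

(* Hypothetical usage: compress a large response body without starving the
   Lwt event loop.  [send_response] stands in for whatever sends the reply. *)
let respond_gzipped send_response body =
  let%lwt compressed = Gzip_io.string_lwt body in
  send_response ~headers:[ "Content-Encoding", "gzip" ] compressed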

@tatchi requested review from mfp, rr0gi and Khady on May 6, 2024 09:32
@rr0gi (Contributor) left a comment

I don't like this:

  1. yield is a hack
  2. copy-paste
  3. gzip + chunked would be a natural way to do this
  4. would prefer adding support for other compression methods over improving gzip usage
  5. could just add a doc note that gzip can be expensive

gzip_io.ml Outdated
Comment on lines 41 to 61
  let buff = Buffer.create chunk_size in
  let len = String.length s in
  let rec loop i =
    if i >= len then (
      (* Final flush of the buffer if there's any residue *)
      if Buffer.length buff > 0 then IO.nwrite out (Buffer.to_bytes buff);
      Lwt.return_unit)
    else begin
      let c = s.[i] in
      Buffer.add_char buff c;
      if Buffer.length buff < chunk_size then loop (i + 1)
      else (
        (* Buffer is full, write and clear it *)
        IO.nwrite out (Buffer.to_bytes buff);
        Buffer.clear buff;
        (* Yield after processing a chunk *)
        let%lwt () = yield () in
        loop (i + 1))
    end
  in
  let%lwt () = loop 0 in

There's no need to process the input string char by char (slow!), accumulate the data to be compressed in a buffer, then allocate a string corresponding to the data (ouch!). It should be possible to write substrings (not allocated, but with offset + size) from the original string to the IO.
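A minimal sketch of that approach, assuming extlib's IO.output (which writes up to the requested length and returns the number of bytes actually written, taking a Bytes.t). Parameter names mirror the PR; the default values here are illustrative, not the ones chosen in the diff:

let write_in_chunks ?(chunk_size = 4096) ?(yield = Lwt.pause) out s =
  let b = Bytes.unsafe_of_string s in  (* zero-copy cast; only read below *)
  let len = Bytes.length b in
  let rec loop offset =
    (* write up to chunk_size bytes of the original string, starting at offset *)
    let written = IO.output out b offset (min chunk_size (len - offset)) in
    if offset + written >= len then Lwt.return_unit
    else begin
      (* yield between chunks so other Lwt tasks can run *)
      let%lwt () = yield () in
      loop (offset + written)
    end
  in
  loop 0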

@tatchi (Contributor, Author) commented

Thanks, I updated the code in ba084d4 (#25)

let%lwt () = yield () in
loop (i + 1))
end
let b = Bytes.unsafe_of_string s in
@tatchi (Contributor, Author) commented

Is it ok to use unsafe_of_string here?

AFAICS it's only read from, so it's fine.

It's a bit strange IO.output expects a Bytes.t though.

@tatchi requested review from mfp and rr0gi on May 22, 2024 06:47
@mfp left a comment

While this works, I've since learned that httpev does chunked transfer encoding with Chunks, so it is possible to stream the compressed data without allocating the whole compressed response (which costs at least twice as much allocation as the size of the end result). This cannot be built on top of Gzip_stream though: essentially the logic from Gzip_stream.output would have to be replicated, with the output written to the function passed to the generator.

Since the plan is not to compress in httpev though, I'd say we can make do with what you wrote so far :-)

@@ -36,6 +36,23 @@ let string s =
IO.nwrite out (Bytes.unsafe_of_string s); (* IO wrong type *)
IO.close_out out

let string_lwt ?(chunk_size = 3000) ?(yield = Lwt.pause) s =

I've realized that Gzip_stream imposes a 1024-byte buffer size, so we should not give the illusion of control: we can use chunk_size = 4096 or whatever internally (so nothing breaks if buffer_size is changed in Gzip_stream), but we should not allow passing ?chunk_size, since it will normally be ineffective.
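A sketch of what the resulting .mli entry could look like; the return type and the yield hook's type are assumptions inferred from the diff, not confirmed by it:

(* chunk size becomes an internal constant (e.g. 4096); only the yield hook is exposed *)
val string_lwt : ?yield:(unit -> unit Lwt.t) -> string -> string Lwt.t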

    if offset + written >= len then Lwt.return_unit
    else (
      (* Yield after processing a chunk *)
      let%lwt () = yield () in

With a buffer size of 1024 we can compress maybe around 30000 chunks/s, which means yielding every ~30 µs. At that point the overhead of returning to the event loop could start to become measurable, so yielding only after processing around, say, 4096 bytes could be better. Probably not a big deal though.

(This depends on the cost of the switch, which I guesstimated at a few µs.)
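A sketch of that suggestion: keep writing in 1024-byte units (matching Gzip_stream's internal buffer), but only yield once roughly every 4096 bytes of input. The write parameter is a stand-in for the IO.output call on the gzip output, so the sketch stays self-contained; the function name is made up:

let write_with_batched_yields ?(yield = Lwt.pause) ~write s =
  let b = Bytes.unsafe_of_string s in
  let len = Bytes.length b in
  let yield_every = 4096 in              (* yield roughly every 4 KB of input *)
  let rec loop offset since_yield =
    if offset >= len then Lwt.return_unit
    else begin
      (* write in 1024-byte units, matching Gzip_stream's buffer size *)
      let written = write b offset (min 1024 (len - offset)) in
      let since_yield = since_yield + written in
      if since_yield >= yield_every then begin
        let%lwt () = yield () in
        loop (offset + written) 0
      end
      else loop (offset + written) since_yield
    end
  in
  loop 0 0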

(* TODO do not apply encoding to application/gzip *)
(* TODO gzip + chunked? *)
match body, code, c.req with
| `Body s, `Ok, Ready { encoding=Gzip; _ } when String.length s > 128 -> ("Content-Encoding", "gzip")::hdrs, `Body (Gzip_io.string s)
| _ -> hdrs, body
| `Body s, `Ok, Ready { encoding=Gzip; _ } when String.length s > 128 ->

We allocate a 1024-byte buffer in Gzip_stream, so maybe a higher threshold makes sense (it feels a bit silly to go ahead and allocate 1K for only 129 bytes of response).
