Use IO.copy_stream when possible #383

casperisfine · 2018-01-24T16:31:06Z

Fix: #66

Context

We noticed that the download performance of Google Cloud Storage's Ruby library was heavily impacted by CPU usage on the host, especially for big files. After some digging, it was clear that this is because the data has to transit through read() and write() instead of leveraging sendfile().

An experiment with a quick-and-dirty patch cut a 500MB download from 15s to 5s.

The patch

To leverage sendfile() in Ruby, the best and simplest API is IO.copy_stream, as suggested in #66.
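For readers unfamiliar with IO.copy_stream, here is a minimal sketch of the call itself. It is not code from this patch: `copy_body_to_file` is a hypothetical helper, and a local pipe stands in for the HTTP socket. Ruby chooses the copy strategy internally, so whether sendfile() is actually used depends on the platform and on the kinds of descriptors involved.

```ruby
require "tempfile"

# Hypothetical helper (not part of httpclient): copy exactly `length` bytes
# from `src_io` into the file at `path` and return the number of bytes copied.
def copy_body_to_file(src_io, path, length)
  File.open(path, "wb") do |dst|
    # The third argument caps the copy, so bytes after the body stay unread.
    IO.copy_stream(src_io, dst, length)
  end
end

# A pipe stands in for the socket: an 11-byte "body" followed by other data.
reader, writer = IO.pipe
writer.write("hello world" + "NEXT RESPONSE")
writer.close

target = Tempfile.new("copy-stream-demo")
copied = copy_body_to_file(reader, target.path, 11)
body = File.binread(target.path)
leftover = reader.read
reader.close
```

Passing the length matters on a keep-alive connection: without it, copy_stream would keep reading until EOF and swallow whatever follows the body.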

The problem is that copy_stream needs IO or IO-like objects to work with, and httpclient's API mostly deals with blocks, so I had to adapt the API somehow.
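One way to bridge the two styles, shown here only as an illustrative sketch and not as the adapter this patch actually uses, is a small object that turns a block into something copy_stream can write to. The `BlockSink` name is an assumption made for this example.

```ruby
require "stringio"

# Hypothetical adapter: exposes a block as an IO-like sink.
class BlockSink
  def initialize(&block)
    @block = block
  end

  # IO.copy_stream calls #write with each chunk and expects the number of
  # bytes written as the return value.
  def write(chunk)
    @block.call(chunk)
    chunk.bytesize
  end
end

received = []
# copy_stream may reuse its internal buffer between calls, so keep a copy.
sink = BlockSink.new { |chunk| received << chunk.dup }

source = StringIO.new("a" * 40_000)
IO.copy_stream(source, sink)
```

This keeps block-based callers working, but since the sink is not a real file descriptor, the copy goes through Ruby's generic read/write loop.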

One important thing to note is that we can only leverage sendfile() if there are no modifications to apply to the response body, e.g. no chunking and no compression.

I'll add comments on specific parts of the patch in later comments.

casperisfine · 2018-01-24T16:32:26Z

lib/httpclient.rb

@@ -651,8 +651,8 @@ def redirect_uri_callback=(redirect_uri_callback)
  # use get method.  get returns HTTP::Message as a response and you need to
  # follow HTTP redirect by yourself if you need.
  def get_content(uri, *args, &block)
-    query, header = keyword_argument(args, :query, :header)
-    success_content(follow_redirect(:get, uri, query, nil, header || {}, &block))
+    query, header, to = keyword_argument(args, :query, :header, :to)


Maybe we can do better in terms of API, but it seemed logical to expose this optimized code path through get_content only.

So: client.get_content('/big-file.bin', to: '/tmp/big-file.bin')

casperisfine · 2018-01-24T16:35:27Z

lib/httpclient/session.rb

+        @inflater = inflater
+      end
+
+      def write(chunk)


Something that is not very well documented is that copy_stream does accept fake IO objects as long as they respond to #write(). However, it's kind of a fallback code path: it won't be able to use sendfile(), so there will be no speedup.
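To illustrate that fallback path, here is a self-contained sketch of a write-only object that inflates each chunk before forwarding it. It is not the class from this patch: the `InflatingWriter` name and its `finish` method are assumptions for the example, and StringIO objects stand in for the socket and the destination.

```ruby
require "stringio"
require "zlib"

# Hypothetical wrapper: inflates each chunk and forwards it to `io`.
class InflatingWriter
  def initialize(io)
    @io = io
    @inflater = Zlib::Inflate.new
  end

  # The only method IO.copy_stream needs on the destination. It must return
  # the number of input bytes consumed, not the inflated size.
  def write(chunk)
    @io.write(@inflater.inflate(chunk))
    chunk.bytesize
  end

  # Flush whatever zlib still holds once the copy is done.
  def finish
    @io.write(@inflater.finish)
    @inflater.close
  end
end

original = "hello copy_stream " * 1_000
compressed = StringIO.new(Zlib::Deflate.deflate(original))

output = StringIO.new
writer = InflatingWriter.new(output)
IO.copy_stream(compressed, writer)
writer.finish
```

Every chunk passes through Ruby here, which is why this path cannot benefit from sendfile().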

casperisfine · 2018-01-24T16:36:28Z

test/test_httpclient.rb

@@ -829,16 +845,6 @@ def test_get_with_block_arity_2_and_redirects
    assert_nil(res.content)
  end

-  def test_get_with_block_string_recycle


I feel bad about removing that test, but unfortunately read_block_size doesn't make any sense when sendfile() is used.

But the next test checks the same behavior with a chunked response, so I think it's OK-ish to remove it.

Use IO.copy_stream when possible

5e0d02c

casperisfine force-pushed the copy-stream branch from ea04649 to 5e0d02c Compare January 24, 2018 16:57

casperisfine mentioned this pull request Jan 24, 2018

Google::Cloud::Storage is about twice slower than gsutil cp for downloading files googleapis/google-cloud-ruby#1897

Closed
