Use consistent hash function when hashing cache field #1314

Open: lgebhardt wants to merge 2 commits into master

lgebhardt (Member) commented Mar 7, 2020

Object#hash does not return the same value across Ruby invocations.

I noticed this while debugging a caching issue across app restarts: the first request after a restart would not use the cached values. On investigation, I found that String#hash does not behave the way its documentation suggests ("Returns a hash based on the string's length, content and encoding."). The linked Object#hash documentation explains why: "The hash value for an object may not be identical across invocations or implementations of Ruby. If you need a stable identifier across Ruby invocations and implementations you will need to generate one with a custom method."
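To reproduce the behavior described above, here is a minimal sketch that evaluates the same expression in separate Ruby processes and compares the results. The helper name `in_fresh_process` is made up for this illustration and is not part of the PR.

```ruby
require 'digest'
require 'rbconfig'

# Hypothetical helper: evaluate an expression in a brand-new Ruby process
# and return whatever it prints.
def in_fresh_process(expression)
  IO.popen([RbConfig.ruby, '-rdigest', '-e', "print(#{expression})"], &:read)
end

STR_EXPR = '"2020-03-06 13:46:49 UTC"'

# String#hash is seeded per process, so two invocations normally disagree.
hash_a = in_fresh_process("#{STR_EXPR}.hash")
hash_b = in_fresh_process("#{STR_EXPR}.hash")

# A digest depends only on the bytes of the input, so it stays the same.
md5_a = in_fresh_process("Digest::MD5.hexdigest(#{STR_EXPR})")
md5_b = in_fresh_process("Digest::MD5.hexdigest(#{STR_EXPR})")

puts "String#hash equal across processes?    #{hash_a == hash_b}"
puts "MD5 hexdigest equal across processes?  #{md5_a == md5_b}"
```

Any cache key built from `.hash` is therefore only meaningful within the process that produced it, which matches the cache misses seen after a restart.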

This change should result in far fewer cache misses across processes and much more efficient memory usage on the cache servers, since each process would no longer store the same entry under its own key.

My one reservation is that the MD5 or SHA1 hexdigest is about an order of magnitude slower than the current Object#hash (roughly 17 times slower in the benchmark below).

We could consider the xxhash library. It is closer to Object#hash in speed (roughly 2 to 2.5 times slower in the benchmark below), but it is one more dependency. This is of course just the default, and users are free to switch, but I'd like to choose the best overall option, since I don't think most users will bother to change it (no one else seems to have noticed the issue to begin with).

To speed up the most common case, where the updated_at field is used as the cache_key, this uses the to_f method to get the number of seconds since the Unix Epoch, including fractions, as a Float. This is almost as fast as .hash and much faster than to_r (the same value as an exact Rational).
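The idea in the paragraph above could be sketched roughly as follows. This is an illustration only, not the code in this PR: the method name `stable_cache_field_key` is hypothetical, and MD5 stands in for whichever digest ends up as the default.

```ruby
require 'digest'

# Hypothetical helper: turn a cache field value into a key component that is
# identical in every Ruby process.
def stable_cache_field_key(value)
  if value.is_a?(Time)
    # Fast path for timestamps such as updated_at: seconds since the Epoch,
    # including the fractional part, as a Float.
    value.to_f.to_s
  else
    # Fallback: a digest of the string form depends only on its bytes.
    Digest::MD5.hexdigest(value.to_s)
  end
end

# Example timestamp with microsecond precision (second argument is usec).
updated_at = Time.at(1_583_502_409, 123_456).utc

puts stable_cache_field_key(updated_at)
puts stable_cache_field_key("2020-03-06 13:46:49 UTC")
```

One trade-off of this approach is that a Float cannot represent nanosecond differences at current Epoch values, so two timestamps that differ only at that resolution may map to the same key. Whether that matters depends on the precision the database stores for updated_at.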

Any thoughts?

# bench.rb: compare candidate strategies for hashing the cache field.
require 'benchmark'
require 'digest/md5'
require 'digest/sha1'
require 'xxhash' # third-party gem

str = "2020-03-06 13:46:49 UTC"
time = Time.now

# Print one sample value per strategy.
puts ".hash #{str.hash}"
puts "Digest::SHA1.hexdigest(str) #{Digest::SHA1.hexdigest(str)}"
puts "Digest::MD5.hexdigest(str)  #{Digest::MD5.hexdigest(str)}"
puts ".to_f #{time.to_f}"
puts ".to_r #{time.to_r}"
puts "XXhash.xxh32(str)           #{XXhash.xxh32(str)}"
puts "XXhash.xxh64(str)           #{XXhash.xxh64(str)}"

# Time n calls of each strategy.
n = 500_000
puts "\n\n #{n} runs of str= \"#{str}\":"
Benchmark.bm do |x|
  x.report("Object#hash ") { n.times { str.hash } }
  x.report("Digest::SHA1") { n.times { Digest::SHA1.hexdigest(str) } }
  x.report("Digest::MD5 ") { n.times { Digest::MD5.hexdigest(str) } }
  x.report("to_f ") { n.times { time.to_f } }
  x.report("to_r ") { n.times { time.to_r } }
  x.report("XXhash.xxh32") { n.times { XXhash.xxh32(str) } }
  x.report("XXhash.xxh64") { n.times { XXhash.xxh64(str) } }
end

$ bundle exec ruby bench.rb
.hash 416998310942915654
Digest::SHA1.hexdigest(str) 9761b8ec2d32430a8db7932d88ab35827d08f44e
Digest::MD5.hexdigest(str)  a471e7688b412b64302e757842153308
.to_f 1610730885.0390248
.to_r 64429235401561/40000
XXhash.xxh32(str)           2031935804
XXhash.xxh64(str)           6959915839349833384


 500000 runs of str= "2020-03-06 13:46:49 UTC":
       user     system      total        real
Object#hash   0.027125   0.000005   0.027130 (  0.027130)
Digest::SHA1  0.467021   0.001062   0.468083 (  0.468272)
Digest::MD5   0.452565   0.000378   0.452943 (  0.453256)
to_f   0.037617   0.000011   0.037628 (  0.037669)
to_r   0.109081   0.000035   0.109116 (  0.109159)
XXhash.xxh32  0.059678   0.000111   0.059789 (  0.059912)
XXhash.xxh64  0.066721   0.000114   0.066835 (  0.066866)
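
To make the relative costs easier to compare, the following sketch divides each wall-clock ("real") time from the output above by the Object#hash time. The numbers are copied from that run; the helper name `slowdown_vs_baseline` is made up for this illustration.

```ruby
# Wall-clock seconds for 500,000 calls, copied from the benchmark output.
REAL_SECONDS = {
  'Object#hash'  => 0.027130,
  'Digest::SHA1' => 0.468272,
  'Digest::MD5'  => 0.453256,
  'to_f'         => 0.037669,
  'to_r'         => 0.109159,
  'XXhash.xxh32' => 0.059912,
  'XXhash.xxh64' => 0.066866
}.freeze

# Hypothetical helper: express every timing as a multiple of the baseline.
def slowdown_vs_baseline(timings, baseline: 'Object#hash')
  base = timings.fetch(baseline)
  timings.transform_values { |seconds| (seconds / base).round(1) }
end

slowdown_vs_baseline(REAL_SECONDS).each do |label, ratio|
  puts format('%-12s %5.1fx', label, ratio)
end
```

On this single run, the digests come out at roughly 17 times the cost of Object#hash, xxhash at a little over 2 times, and to_f at well under 2 times. These are figures from one machine and should be read as indicative only.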

All Submissions:

  • I've checked to ensure there aren't other open Pull Requests for the same update/change.
  • I've submitted a ticket for my issue if one did not already exist.
  • My submission passes all tests. (Please run the full test suite locally to cut down on noise from Travis failures.)
  • I've used GitHub auto-closing keywords in the commit message or the description.
  • I've added/updated tests for this change.

New Feature Submissions:

  • I've submitted an issue that describes this feature, and received the go ahead from the maintainers.
  • My submission includes new tests.
  • My submission maintains compliance with JSON:API.

Bug fixes and Changes to Core Features:

  • I've included an explanation of what the changes do and why I'd like you to include them.
  • I've provided test(s) that fail without the change.

Test Plan:

Reviewer Checklist:

  • Maintains compliance with JSON:API
  • Adequate test coverage exists to prevent regressions

lgebhardt requested a review from scottgonzalez, March 7, 2020 18:38
lgebhardt changed the title from "Use MD5 when hashing cache field" to "Use consistent hash function when hashing cache field", Mar 7, 2020
lgebhardt requested a review from dgeb, March 7, 2020 20:03
Object.hash does not result in the same values across ruby invocations
lgebhardt force-pushed the md5_cache_keys branch 2 times, most recently from 1a529c1 to 6699023, January 15, 2021 17:03