[PROF-10978] Require Ruby 3.1+ for heap profiling #4178

Merged
merged 5 commits into from
Dec 2, 2024

@ivoanjo ivoanjo commented Dec 2, 2024

What does this PR do?

This PR raises the minimum Ruby version required for heap profiling from >= 2.7 to >= 3.1 because of a newly discovered VM bug (see below for details).

It's mostly a revert of #3366, where we first tried to work around a Ruby 2.7/3.0 bug. It turns out we missed a spot, and we could still trigger VM crashes because of that.

Motivation:

Ruby versions prior to 3.1 had a special optimization called rb_gc_force_recycle, which allowed an object to be reclaimed immediately (i.e. without needing to wait for the GC).

It turns out that rb_gc_force_recycle did not play well with the changes in Ruby 2.7 to how object ids work. We uncovered this early in the development of the heap profiler, and put in a workaround for the bug that we thought was enough...

Unfortunately, the workaround is not enough. The following reproducer, when run on Ruby 2.7 or 3.0, shows how the Ruby VM can segfault inside id2ref due to the issue above:

puts RUBY_DESCRIPTION

require "datadog"
require "objspace"
require "pry"

NUM_OBJECTS = 10_000_000

recycled_ids = Array.new(NUM_OBJECTS) { 123 }
many_objects = Array.new(NUM_OBJECTS) { Object.new }

(0...NUM_OBJECTS).each do |i|
  recycled_ids[i] = many_objects[i].object_id
end

puts "Seeded objects!"
gets

(0...NUM_OBJECTS).each do |i|
  Datadog::Profiling::StackRecorder::Testing._native_gc_force_recycle(many_objects[i])
  many_objects[i] = nil
end

puts GC.stat

puts "Recycled objects!"
gets

many_objects = nil

10.times { GC.start }
Array.new(10_000) { Object.new }
10.times { GC.start }

puts GC.stat

puts "GC'd objects! (Ruby should have released pages?)"
gets

recycled_ids.each { |i|
  begin
    (nil == ObjectSpace._id2ref(i))
  rescue
    nil
  end
}
puts "Done!"

Crash details:

Program received signal SIGSEGV, Segmentation fault.
is_swept_object (ptr=93825033355200, objspace=<optimised out>) at gc.c:3868
3868        return page->flags.before_sweep ? FALSE : TRUE;
(gdb) bt
 #0  is_swept_object (ptr=93825033355200, objspace=<optimised out>) at gc.c:3868
 #1  is_garbage_object (objspace=0x55555555d220, objspace=0x55555555d220, ptr=93825033355200) at gc.c:3887
 #2  is_live_object (ptr=93825033355200, objspace=0x55555555d220) at gc.c:3909
 #3  is_live_object (ptr=93825033355200, objspace=0x55555555d220) at gc.c:3898
 #4  id2ref (objid=8264881) at gc.c:3999
 #5  os_id2ref (os=<optimised out>, objid=<optimised out>) at gc.c:4019

This crash happens because of two things:

  1. Ruby does not remove the object id entry for a recycled object from its internal hash map.
  2. If the memory page where the object lived is returned to the OS, calling id2ref on that id makes Ruby read invalid memory and crash.
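
The interaction between these two causes can be sketched with a toy model in plain Ruby. This is only an illustration: `ToyHeap` and all of its methods are hypothetical names invented here, not part of the Ruby VM or of dd-trace-rb, and the `:dangling_read` result stands in for what is a segfault in the real VM.

```ruby
require "set"

# Toy model only: none of these names exist in the Ruby VM or in dd-trace-rb.
class ToyHeap
  def initialize
    @live_pages = Set.new # pages that are still mapped
    @id_table = {}        # object id => [page, slot], standing in for the VM's id map
    @next_id = 8
  end

  # Allocates an object in the given page/slot and assigns it an id.
  def allocate(page, slot)
    @live_pages << page
    id = @next_id
    @next_id += 8
    @id_table[id] = [page, slot]
    id
  end

  # Models force-recycling: the slot is considered free again, but the
  # id entry is deliberately left behind (cause 1).
  def force_recycle(id)
    # Intentionally does not touch @id_table.
  end

  # Models an empty page being returned to the OS (cause 2).
  def release_page(page)
    @live_pages.delete(page)
  end

  # Models id2ref: follows the id entry without knowing the page is gone.
  def id2ref(id)
    entry = @id_table[id]
    return :unknown_id if entry.nil?

    page, _slot = entry
    # In the real VM this is where invalid memory would be read.
    return :dangling_read unless @live_pages.include?(page)

    entry
  end
end

heap = ToyHeap.new
id = heap.allocate(:page_a, 0)
heap.force_recycle(id)
heap.release_page(:page_a)
puts heap.id2ref(id).inspect
```

In this model the stale entry is only dangerous once the page has been released, which matches why the reproducer above forces several GC runs before calling id2ref.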

Change log entry:

Require Ruby 3.1+ for heap profiling

Additional Notes:

I've chosen to disable heap profiling on 2.7 and 3.0 because I can't think of a good workaround for the bug above, especially one that would not increase the overhead of heap profiling.

How to test the change?

This PR updates the test coverage to expect Ruby 3.1+ as the minimum for the feature.

You can also quickly validate it doesn't get enabled on the older Rubies using:

$ DD_PROFILING_ENABLED=true DD_PROFILING_EXPERIMENTAL_HEAP_ENABLED=true bundle exec ddprofrb exec ruby -e "puts RUBY_DESCRIPTION"
W, [2024-12-02T10:42:28.771611 #112585]  WARN -- datadog: [datadog] Current Ruby version
(3.0.5) cannot support heap profiling due to VM bugs/limitations. Please upgrade to Ruby
>= 3.1 in order to use this feature. Heap profiling has been disabled.
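
As a rough illustration of the behaviour shown in that output, a version gate could look like the sketch below. This is not the actual dd-trace-rb code: `heap_profiling_supported?`, `resolve_heap_profiling`, and the constant are hypothetical names, and the warning text is only modelled on the log line above.

```ruby
require "rubygems" # provides Gem::Version; usually already loaded

# Hypothetical helper names; not the actual dd-trace-rb implementation.
MINIMUM_HEAP_PROFILING_RUBY = Gem::Version.new("3.1")

# True when the given Ruby version meets the minimum for heap profiling.
def heap_profiling_supported?(ruby_version = RUBY_VERSION)
  Gem::Version.new(ruby_version) >= MINIMUM_HEAP_PROFILING_RUBY
end

# Returns whether heap profiling should end up enabled. When it was
# requested on an unsupported Ruby, logs a warning and disables it.
def resolve_heap_profiling(requested, ruby_version: RUBY_VERSION, logger: ->(msg) { warn(msg) })
  return false unless requested
  return true if heap_profiling_supported?(ruby_version)

  logger.call(
    "Current Ruby version (#{ruby_version}) cannot support heap profiling. " \
    "Please upgrade to Ruby >= #{MINIMUM_HEAP_PROFILING_RUBY}. Heap profiling has been disabled."
  )
  false
end

puts resolve_heap_profiling(true, ruby_version: "3.0.5", logger: ->(msg) { puts msg })
```

Comparing `Gem::Version` objects rather than raw strings avoids lexicographic surprises (for example, "3.10" sorting before "3.2" as a string).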

On Ruby 3.0, the default for `gc_enabled = true` emits a warning
stating it can't be used. This warning is a bit annoying in our tests;
let's disable it and only rely on it being enabled on the specs that
are actually testing this feature.

It's no longer true that we don't use the new Ruby profiler on Ruby 2.5;
what's true is that we enable the "no signals workaround" to avoid the
potentially-problematic code path.
@ivoanjo ivoanjo requested review from a team as code owners December 2, 2024 10:47
@github-actions github-actions bot added the profiling Involves Datadog profiling label Dec 2, 2024
datadog-datadog-prod-us1 bot commented Dec 2, 2024

Datadog Report

Branch report: ivoanjo/prof-10978-drop-heap-profiling-legacy-ruby
Commit report: 39ffa38
Test service: dd-trace-rb

✅ 0 Failed, 22049 Passed, 1459 Skipped, 5m 41.04s Total Time

ivoanjo added a commit to DataDog/documentation that referenced this pull request Dec 2, 2024
Due to a Ruby VM bug in older Ruby versions, we're going to require
Ruby 3.1+ as a minimum version for heap profiling, as per
DataDog/dd-trace-rb#4178 .

This PR updates the docs to match this raised requirement.
codecov-commenter commented Dec 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.76%. Comparing base (ca7cc9d) to head (39ffa38).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #4178   +/-   ##
=======================================
  Coverage   97.76%   97.76%           
=======================================
  Files        1357     1357           
  Lines       81950    81891   -59     
  Branches     4168     4164    -4     
=======================================
- Hits        80117    80060   -57     
+ Misses       1833     1831    -2     

pr-commenter bot commented Dec 2, 2024

Benchmarks

Benchmark execution time: 2024-12-02 14:15:36

Comparing candidate commit 39ffa38 in PR branch ivoanjo/prof-10978-drop-heap-profiling-legacy-ruby with baseline commit ca7cc9d in branch master.

Found 0 performance improvements and 2 performance regressions! Performance is the same for 29 metrics, 2 unstable metrics.

scenario:line instrumentation - targeted

  • 🟥 throughput [-9605.082op/s; -9076.484op/s] or [-5.817%; -5.497%]

scenario:method instrumentation

  • 🟥 throughput [-12720.572op/s; -12151.822op/s] or [-7.211%; -6.888%]

@Strech Strech left a comment

I think it's ok to raise the bar for Ruby version due to costs of patches.

Review thread on lib/datadog/profiling/component.rb (outdated, resolved)
@AlexJF AlexJF left a comment

Nice spot. Sad about our smart workaround not being enough 🥲

dd-mergequeue bot pushed a commit to DataDog/documentation that referenced this pull request Dec 2, 2024
Due to a Ruby VM bug in older Ruby versions, we're going to require
Ruby 3.1+ as a minimum version for heap profiling, as per
DataDog/dd-trace-rb#4178 .

This PR updates the docs to match this raised requirement.
@ivoanjo ivoanjo merged commit 5705775 into master Dec 2, 2024
318 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/prof-10978-drop-heap-profiling-legacy-ruby branch December 2, 2024 14:26
@github-actions github-actions bot added this to the 2.8.0 milestone Dec 2, 2024
@ivoanjo ivoanjo added the bug Involves a bug label Dec 3, 2024