Sometimes one worker goes rogue and doesn't stop at the end, making my build fail #128

pelletencate · 2020-02-23T16:00:17Z

I'm running ci-queue in a bit of an alternative setting. I've got a Ruby on Rails app which has a long test suite, and I'm running 8 concurrent instances of RSpec on Heroku CI, but on one and the same dyno, which also runs Redis in-dyno.

I start the whole chain with the following script

export CI_QUEUE_URL=$([ -z "$REDIS_URL" ] && echo "redis://127.0.0.1:6379" || echo "$REDIS_URL")

for num in $(seq 1 $PARALLEL_COUNT); do
  RUBYOPT="-W0" \
  CI_NODE_INDEX=$(expr $num - 1) \
  DATABASE_URL=$([ "$num" -ne "1" ] && echo $DATABASE_URL$num || echo $DATABASE_URL) \
  rspec-queue \
    --timeout 180 \
    --max-consecutive-failures 10 \
    --max-requeues 5 \
    --requeue-tolerance 5 $@ &
done

wait
rspec-queue --report

Every now and then when this runs on Heroku CI, 7 out of the 8 workers end at the same time, but it seems the 8th one keeps on running for hours until Heroku kills my build after 2 hours that usually passes in about 10-15 minutes. It's as if it doesn't understand that the queue is done.

Randomized with seed 57482

.

Finished in 9 minutes 51 seconds (files took 10.22 seconds to load)

85 examples, 0 failures, 1 pending

Randomized with seed 15517

< THIS IS WHERE I EXPECT THE REPORT >

.......................................................................................................................-----> test command `bin/test` failed with signal: terminated

While there are 8 instances running, the word Finished only shows up 7 times in the log, so I assume what I see here is the finishing of workers 6 and 7, and the following dots being part of worker 8.

I have no idea how to debug this further, but I'd love to add more details if someone can help me in the right direction.

A little more research

I've compared the output of a successful build and a failed build.

A successful build had 8 workers finish with a total of 1216 examples, an average number of 152.
A failed build had 7 workers finish with a total of 1194 examples reported.
- This means that at the time they finished, the 8th worker only processed 22 examples, which is remarkebly few compared to the average of 170 examples processed by the other 7. (This could potentially corroborate your theory). However:
- After that, the 8th worker drew 130 dots, which as I can only imagine refer to specs that already passed in different workers.

The text was updated successfully, but these errors were encountered:

casperisfine · 2020-02-24T11:14:19Z

It's possible that one of your tests is stuck, and never complete.

What you can do to debug is to get a shell in there and inspect the redis instance. There is a key named <build_identifier>:running that contains the key of the tests being processed.

Note that the RSpec test identifiers are a bit opaque, but it gives you the spec file and test index.

pelletencate · 2020-02-24T17:28:24Z

It's possible that one of your tests is stuck, and never complete.

I doubt it, it doesn't explain the fact that the last worker is still drawing multiple dots (see last line of output above).

What you can do to debug is to get a shell in there and inspect the redis instance. There is a key named <build_identifier>:running that contains the key of the tests being processed.

Heroku CI workers are notoriously hard to jack into (except for heroku ci:debug), but I'll definitely try getting some regular redis output going while the specs are running so I get some idea what's going on here.

pelletencate · 2020-02-24T17:31:51Z

I've added some additional research to the original issue.

casperisfine · 2020-02-25T10:07:12Z

I doubt it, it doesn't explain the fact that the last worker is still drawing multiple dots

You've set the requeue count very high, so it might be running the same test in repetition ?

casperisfine · 2020-02-25T10:08:18Z

Another likely possibility is that the workers shutdown because they reached the limit of consecutive failures: --max-consecutive-failures 10

pelletencate · 2020-02-27T19:22:42Z

I doubt it, it doesn't explain the fact that the last worker is still drawing multiple dots

You've set the requeue count very high, so it might be running the same test in repetition ?

Well, it should be drawing asterisks then, instead of dots, right? (A dot in RSpec's compact formatter indicates a spec that passed, asterisk would indicate PENDING or SKIPPED [and, in this case, requeued]).

Incidentally, the output of both the complete and "broken" builds do not indicate a single faulty spec.

I'll try the following:

Dump some redis info in between specs to see if I can see something
Use a formatter that gives me the full path to the example that just passed

Thanks so far at least! I'll report back when I have some more info

pelletencate · 2020-02-27T21:40:57Z

I wrote this formatter that gives all of the output you'd ever want. As soon as I see a dangling build, I'll report its output here.

https://gist.github.com/pelletencate/80cbcc24ce988223a8e3a4ac84d930cf

casperisfine · 2020-02-27T21:49:23Z

👍

casperisfine · 2020-02-27T21:52:12Z

By the way, something I forgot to mention about your script:

for num in $(seq 1 $PARALLEL_COUNT); do
  RUBYOPT="-W0" \
  CI_NODE_INDEX=$(expr $num - 1) \
  DATABASE_URL=$([ "$num" -ne "1" ] && echo $DATABASE_URL$num || echo $DATABASE_URL) \
  rspec-queue ...
done

wait
rspec-queue --report

rspec-queue --report is meant to be called before the workers complete, this way it exit as soon as all tests were ran, it doesn't need to wait for the workers to shutdown or anything, and it has an option to specify a global timeout.

pelletencate · 2020-02-28T22:57:33Z

rspec-queue --report is meant to be called before the workers complete,

I know, I believe it was meant to make sense to run it once the first worker is finished, so it waits for the rest. Since I fork all of my workers, in my case it basically runs immediately, meaning I'd have the timeout to be set high enough to cover my entire test suite, in which case the only time saved is the startup of rspec-queue itself, which is less than a second. I figured that little time outweighs the risk of setting a too low timeout. (And setting a too high timeout would still practically "hang" my dyno).

OK, so here's what I found out.

Whenever this happens, it's because in an early stage, one of the workers downright stops responding. It seems to be something specific to the Heroku VMs, I have no recollection of ever seeing this happen on my local machine. I believe it happens in a very early stage, because the --timeout 180 doesn't reschedule the spec, literally once the test is started on the faulty worker, it does not show up in my output ever again.

Here's the relevant output:

First example shows there are 1182 in the queue. Time is in seconds.

START: worker=1 errors=0 warnings=0 requeues= master=ready workers=8 running=8 processed=0 worker-queue=1 total-queue=1182 time=15 path=./spec/system/customer_system_spec.rb:59

Then, the last line ever reported by each worker.

START: worker=5 errors=0 warnings=0 requeues= master=ready workers=8 running=8 processed=42 worker-queue=3 total-queue=1137 time=37 path=./spec/models/time_slot_spec.rb:237

START: worker=2 errors=0 warnings=0 requeues= master=ready workers=8 running=8 processed=1072 worker-queue=158 total-queue=109 time=478 path=./spec/system/modern_checkout_experience_spec.rb:512

START: worker=0 errors=0 warnings=0 requeues= master=ready workers=8 running=8 processed=1138 worker-queue=131 total-queue=43 time=18465 path=./spec/system/modern_checkout_experience_spec.rb:512

START: worker=4 errors=0 warnings=0 requeues= master=ready workers=8 running=8 processed=1166 worker-queue=192 total-queue=15 time=527 path=./spec/system/widget_system_spec.rb:147

START: worker=6 errors=0 warnings=0 requeues= master=ready workers=8 running=8 processed=1174 worker-queue=147 total-queue=7 time=531 path=./spec/requests/api/v1/orders_spec.rb:1076

START: worker=1 errors=0 warnings=0 requeues= master=ready workers=8 running=8 processed=1175 worker-queue=145 total-queue=6 time=532 path=./spec/models/order_spec.rb:286

START: worker=7 errors=0 warnings=0 requeues= master=ready workers=8 running=8 processed=1179 worker-queue=219 total-queue=2 time=535 path=./spec/requests/api/v1/products_spec.rb:325

START: worker=3 errors=0 warnings=0 requeues= master=ready workers=8 running=8 processed=1181 worker-queue=196 total-queue=0 time=535 path=./spec/requests/api/v1/products_spec.rb:351

So, what does this tell me?

Worker 5 started working on its 3rd spec after 37 seconds, and never gave a sign of life ever after. 180 seconds later – my configured --timeout, the very same spec was picked up by Worker 0 and completed in 2 seconds:

START: worker=0 errors=0 warnings=0 requeues= master=ready workers=8 running=8 processed=436 worker-queue=64 total-queue=744 time=217 path=./spec/models/time_slot_spec.rb:237

[...]

START: worker=0 errors=0 warnings=0 requeues= master=ready workers=8 running=8 processed=442 worker-queue=65 total-queue=738 time=219 path=./spec/requests/api/v1/orders_spec.rb:549

First line shows total-queue + processed == 1182. Last line shows total-queue + processed == 1181. Someohw, one spec got lost track of, even though the hanging one was completed later by worker 0.

I'm not sure if this hanging process is an Heroku platform issue (maybe @schneems could chime in here, I believe he's been playing around with ci-queue on Heroku CI), but I would suggest that a solution could be to not only reschedule a test after --timeout, but also signal the worker (kill? restart? Not sure if RSpec has something cool for this) so that at least it doesn't hang everything.

Unfortunately, the requeues in my formatter weren't working during this build, I fixed that later (see above gist).

I've now set up my suite to use Timeout.timeout in an around :each clause to throw an exception after a configurable amount of seconds of inactivity. Unfortunately, I haven't seen any requeues since, but I'll keep a close look during next week and immediately report if I have news.

Also, at the end of this faulty spec, there's still a line of dots being drawn. The answer to the question which process is drawing these dots, is still a complete mystery to me; I thought initially rspec-queue --report was misbehaving by running specs on its own, but it may simply be Heroku indicating that the dyno is still alive?

To be continued!

schneems · 2020-03-04T18:32:52Z

On Heroku, I do think Heroku will eventually TERM your suite.

Whenever this happens, it's because in an early stage, one of the workers downright stops responding. It seems to be something specific to the Heroku VMs

That does sound troubling, can you reproduce this behavior? If so you can open a ticket https://help.heroku.com and search CI queue. Have you tried using ci:debug at all? This is a tough scenario since it only seems to be happening with a distributed system.

One thing I've done to debug "stuck" systems is to put a backtrace heartbeat into my code something like this:

Thread.new do
  loop do
    sleep 15 # seconds
    Thread.list.each { |t| puts "=" * 80; puts t.backtrace }
  end
end

That way if anything gets stuck you should at least know where it's getting stuck in the code.

pelletencate changed the title ~~Sometimes one worker doesn't stop at the end~~ Sometimes one worker goes rogue and doesn't stop at the end, making my build fail Feb 23, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sometimes one worker goes rogue and doesn't stop at the end, making my build fail #128

Sometimes one worker goes rogue and doesn't stop at the end, making my build fail #128

pelletencate commented Feb 23, 2020 •

edited

Loading

casperisfine commented Feb 24, 2020

pelletencate commented Feb 24, 2020

pelletencate commented Feb 24, 2020

casperisfine commented Feb 25, 2020

casperisfine commented Feb 25, 2020

pelletencate commented Feb 27, 2020

pelletencate commented Feb 27, 2020

casperisfine commented Feb 27, 2020

casperisfine commented Feb 27, 2020

pelletencate commented Feb 28, 2020 •

edited

Loading

schneems commented Mar 4, 2020

Sometimes one worker goes rogue and doesn't stop at the end, making my build fail #128

Sometimes one worker goes rogue and doesn't stop at the end, making my build fail #128

Comments

pelletencate commented Feb 23, 2020 • edited Loading

A little more research

casperisfine commented Feb 24, 2020

pelletencate commented Feb 24, 2020

pelletencate commented Feb 24, 2020

casperisfine commented Feb 25, 2020

casperisfine commented Feb 25, 2020

pelletencate commented Feb 27, 2020

pelletencate commented Feb 27, 2020

casperisfine commented Feb 27, 2020

casperisfine commented Feb 27, 2020

pelletencate commented Feb 28, 2020 • edited Loading

schneems commented Mar 4, 2020

pelletencate commented Feb 23, 2020 •

edited

Loading

pelletencate commented Feb 28, 2020 •

edited

Loading