Sometimes one worker goes rogue and doesn't stop at the end, making my build fail #128
Comments
It's possible that one of your tests is stuck and never completes. To debug, you can get a shell in there and inspect the Redis instance. There is a key named
Note that the RSpec test identifiers are a bit opaque, but they give you the spec file and test index. |
I doubt it, it doesn't explain the fact that the last worker is still drawing multiple dots (see last line of output above).
Heroku CI workers are notoriously hard to jack into (except for heroku ci:debug), but I'll definitely try getting some regular redis output going while the specs are running so I get some idea what's going on here. |
I've added some additional research to the original issue. |
You've set the requeue count very high, so it might be running the same test repeatedly? |
Another likely possibility is that the workers shut down because they reached the limit of consecutive failures: |
Well, it should be drawing asterisks then, instead of dots, right? (A dot in RSpec's compact formatter indicates a spec that passed; an asterisk would indicate PENDING or SKIPPED [and, in this case, requeued].) Incidentally, neither the complete nor the "broken" build output shows a single faulty spec. I'll try the following:
Thanks so far at least! I'll report back when I have some more info |
I wrote this formatter that gives all of the output you'd ever want. As soon as I see a dangling build, I'll report its output here. https://gist.github.com/pelletencate/80cbcc24ce988223a8e3a4ac84d930cf |
👍 |
By the way, something I forgot to mention about your script:
for num in $(seq 1 $PARALLEL_COUNT); do
RUBYOPT="-W0" \
CI_NODE_INDEX=$(expr $num - 1) \
DATABASE_URL=$([ "$num" -ne "1" ] && echo $DATABASE_URL$num || echo $DATABASE_URL) \
rspec-queue ...
done
wait
rspec-queue --report
|
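For readers following the loop quoted above, its per-worker environment can be restated in Ruby. This is only a sketch of what the shell loop computes: `worker_env` and `all_worker_envs` are hypothetical helper names, not part of ci-queue, and the example database URL in the usage note is made up.

```ruby
# Restates the quoted shell loop: worker 1 keeps the base DATABASE_URL,
# workers 2..N get the worker number appended; CI_NODE_INDEX is zero-based.
def worker_env(num, base_database_url)
  {
    "RUBYOPT" => "-W0",
    "CI_NODE_INDEX" => (num - 1).to_s,
    "DATABASE_URL" => num == 1 ? base_database_url : "#{base_database_url}#{num}",
  }
end

# One environment hash per worker, numbered 1..parallel_count as in the loop.
def all_worker_envs(parallel_count, base_database_url)
  (1..parallel_count).map { |num| worker_env(num, base_database_url) }
end
```

Each hash could be handed to `Process.spawn` as its leading environment argument, e.g. `all_worker_envs(8, "postgres://localhost/test")`. The remaining `rspec-queue` arguments are elided in the thread and are not reproduced here.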
I know, I believe it was meant to make sense to run it once the first worker is finished, so it waits for the rest. Since I fork all of my workers, in my case it basically runs immediately, meaning I'd have to set the timeout high enough to cover my entire test suite, in which case the only time saved is the startup of
OK, so here's what I found out. Whenever this happens, it's because at an early stage one of the workers downright stops responding. It seems to be something specific to the Heroku VMs; I have no recollection of ever seeing this happen on my local machine. I believe it happens at a very early stage, because the
Here's the relevant output: the first example shows there are 1182 in the queue. Time is in seconds.
Then, the last line ever reported by each worker.
So, what does this tell me?
I'm not sure if this hanging process is a Heroku platform issue (maybe @schneems could chime in here; I believe he's been playing around with ci-queue on Heroku CI), but I would suggest that a solution could be to not only reschedule a test after
Unfortunately, the
I've now set up my suite to use
Also, at the end of this faulty spec, there's still a line of dots being drawn. Which process is drawing these dots is still a complete mystery to me; I thought initially
To be continued! |
On Heroku, I do think Heroku will eventually TERM your suite.
That does sound troubling. Can you reproduce this behavior? If so, you can open a ticket at https://help.heroku.com and search CI queue. Have you tried using
One thing I've done to debug "stuck" systems is to put a backtrace heartbeat into my code, something like this:
Thread.new do
  loop do
    sleep 15 # seconds
    Thread.list.each { |t| puts "=" * 80; puts t.backtrace }
  end
end
That way, if anything gets stuck, you should at least know where it's getting stuck in the code. |
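A slightly more reusable variant of the heartbeat above, as a minimal sketch. The helper names `dump_thread_backtraces` and `start_backtrace_heartbeat`, and the `interval:` and `io:` parameters, are made up for illustration and are not part of ci-queue or RSpec; the code only relies on `Thread.list`, `Thread#status` and `Thread#backtrace` from Ruby core.

```ruby
# Hypothetical helper (not part of ci-queue): print the backtrace of every
# live thread so a hung worker shows where it is blocked.
def dump_thread_backtraces(io = $stderr)
  Thread.list.each do |thread|
    io.puts "=" * 80
    io.puts "Thread #{thread.object_id} (#{thread.status})"
    # backtrace can be nil if the thread has died in the meantime
    io.puts(thread.backtrace || ["<no backtrace>"])
  end
end

# Hypothetical helper: start a background thread that dumps all backtraces
# every `interval` seconds. Returns the heartbeat thread so it can be killed.
def start_backtrace_heartbeat(interval: 15, io: $stderr)
  Thread.new do
    loop do
      sleep interval
      dump_thread_backtraces(io)
    end
  end
end
```

Calling `start_backtrace_heartbeat` once early in the spec setup would make every worker emit its own dumps. Since all 8 workers here share one log, including something like `Process.pid` in the header line may help tell them apart.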
I'm running ci-queue in a bit of an alternative setting. I've got a Ruby on Rails app which has a long test suite, and I'm running 8 concurrent instances of RSpec on Heroku CI, but on one and the same dyno, which also runs Redis in-dyno.
I start the whole chain with the following script
Every now and then when this runs on Heroku CI, 7 out of the 8 workers end at the same time, but the 8th one seems to keep running for hours, until Heroku kills the build after 2 hours. The build usually passes in about 10-15 minutes. It's as if the worker doesn't understand that the queue is done.
While there are 8 instances running, the word Finished only shows up 7 times in the log, so I assume what I see here is the finishing of workers 6 and 7, and the following dots being part of worker 8.
I have no idea how to debug this further, but I'd love to add more details if someone can point me in the right direction.
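The check described above (counting how many workers printed Finished) can be scripted. A minimal sketch, assuming the combined build log is available as plain text and that each worker prints RSpec's `Finished in ...` summary exactly once; `finished_count` and `missing_workers` are hypothetical helper names.

```ruby
# Hypothetical helper: count "Finished in" summaries in a combined build log.
# scan is used instead of a per-line check because interleaved worker output
# can put a summary in the middle of a line of dots.
def finished_count(log_text)
  log_text.scan(/Finished in/).size
end

# Number of workers that were started but never printed a summary.
def missing_workers(log_text, expected_workers)
  expected_workers - finished_count(log_text)
end
```

With a log from the failing build, `missing_workers(File.read(path), 8)` would be expected to return 1 for the situation described here, where `path` is wherever the build output was saved.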
A little more research
I've compared the output of a successful build and a failed build.