-
Notifications
You must be signed in to change notification settings - Fork 378
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cache_vcl: Do the temperature dance for backend creation races #4205
base: master
Are you sure you want to change the base?
cache_vcl: Do the temperature dance for backend creation races #4205
Conversation
8686eca
to
62db805
Compare
Sorry, if anyone was real quick they might have caught a little glitch. I force-pushed a correction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if @AlveElde could also have a look since he spent time on the temperature state machine.
Dynamic backends may get created while we transition a VCL to the WARM temperature: Because we first set the new temperature and then post the event, backends might get created when the temperature is already WARM, but before the VCL_EVENT_WARM gets posted, which leads to double warming. This can be demonstrated with an appropriate delay during vcl_send_event(). The solution to this problem is simple: Only post WARM events to backends from before the start of the transition. In code, it looks a little more involved, but it is not complicated: Under the vcl mutex, we take all directors onto a separate "cold" list and change the temperature, and send the warm event only to the "cold" directors. When all are warm, we concatenate. To undo, we basically do the inverse: Save the warm directors, put the cold ones back, send a cold event to all warm ones and concatenate again. Fixes varnishcache#4199
62db805
to
f2ceb35
Compare
bugwash: leave more elaborate explanation on the race here |
@bsdphk as promised, the more elaborate explanation of the race: varnish-cache/bin/varnishd/cache/cache_vcl.c Lines 632 to 634 in 305be79
when, after this point, varnish-cache/bin/varnishd/cache/cache_vrt_vcl.c Lines 242 to 245 in 305be79
Then, the second event gets posted from here: varnish-cache/bin/varnishd/cache/cache_vcl.c Line 637 in 305be79
Regarding your question on IRC:
Yes, I think this would be possible, if we added the current temperature to directors. Yet it would conflict with another goal, namely to not hold the So, in short: Yes, we could avoid this problem dancing less but deadlocking more. |
Lck_Lock(&vcl_mtx); | ||
VTAILQ_SWAP(&vcl->director_list, &colddirs, vcldir, list); | ||
vcl->temp = VCL_TEMP_WARM; | ||
Lck_Unlock(&vcl_mtx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it safe to empty the vcl->director_list
at any point? If a VMOD makes a VRT_DelDirector()
call for a director that existed before this list swap, the VTAILQ_REMOVE()
in vcldir_retire()
does nothing and we get a use-after-free when swapping the list back:
varnish-cache/bin/varnishd/cache/cache_vrt_vcl.c
Lines 279 to 282 in 5cb8571
Lck_Lock(&vcl_mtx); | |
temp = vdir->vcl->temp; | |
VTAILQ_REMOVE(&vdir->vcl->director_list, vdir, list); | |
Lck_Unlock(&vcl_mtx); |
Maybe I am missing a step in the temperature dance that would make this scenario impossible, but I think it at least breaks the assumption that once a director is added to the director_list
it will be there until you remove it (and its refcount reaches 0).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @AlveElde, you are right that this is brittle.
Because VTAILQ_REMOVE(head, elm, field)
only touches head
if elm
is last in the list, this would only affect the case where the last director added before the swap is removed after the concat, but yes, it is another race. Less relevant, but still a race.
This is not the only place where removing directors is a problem, which is why I suggested a solution in #4142.
Maybe it's best to bring that one in first and then update this PR accordingly.
Set to draft because #4142 should be done first. |
Dynamic backends may get created while we transition a VCL to the WARM temperature: Because we first set the new temperature and then post the event, backends might get created when the temperature is already WARM, but before the
VCL_EVENT_WARM
gets posted, which leads to double warming. This can be demonstrated with an appropriate delay duringvcl_send_event()
.The solution to this problem is simple: Only post WARM events to backends from before the start of the transition.
In code, it looks a little more involved, but it is not complicated:
Under the vcl mutex, we take all directors onto a separate "cold" list and change the temperature, and send the warm event only to the "cold" directors. When all are warm, we concatenate.
To undo, we basically do the inverse: Save the warm directors, put the cold ones back, send a cold event to all warm ones and concatenate again.
Fixes #4199