Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propagation delay issues #9

Open
kentonv opened this issue Apr 26, 2020 · 5 comments
Open

Propagation delay issues #9

kentonv opened this issue Apr 26, 2020 · 5 comments

Comments

@kentonv
Copy link

kentonv commented Apr 26, 2020

Thanks for writing this plugin! I've used it successfully (along with ACME.js in general) in sandstorm-io/sandstorm#3299.

I had some thoughts on propagation verification.

So, currently, if I pass verifyPropagation: true to the options, the plugin will block until it can actually see the DNS entry show up. That's a nice feature! But it struck me that it's really only telling you if the DNS entry has shown up in the Cloudflare colo closest to you, which is probably not the colo Let's Encrypt itself will be querying. So in practice it's still an estimate, and it's not clear to me if it's all that much better than waiting for a fixed delay period.

A more subtle problem with the current logic is that the plugin makes the first verification query right away after writing the DNS entry to the Cloudflare API. This might paradoxically cause verification to take longer, because this first query will almost always fail, and this negative result will then be cached (perhaps in the local machine's DNS cache, perhaps in caches on Cloudflare's side, etc.). So, the DNS entry will likely arrive at the colo a few seconds later, but the plugin may keep seeing it as missing for a while due to caching, causing verification to take longer than it needs to.

Meanwhile it looks like ACME.js does a verification step of its own. It expects the plugin to have a property called propagationDelay specifying how much time to wait before checking if the DNS entry exists. The Cloudflare plugin doesn't appear to set this property currently, which cases ACME.js to print a warning and then use 5 seconds as a default. After the delay, ACME.js does its own DNS lookup and bails out immediately if it can't find the DNS entry. In my experimentation, it seems that 5 seconds is not often long enough for Cloudflare, so when verifyPropagation is disabled, the plugin commonly fails.

With all that in mind, some specific ideas for improvements:

  • Perhaps acme-dns-01-cloudflare should set propagationDelay to something like 15 seconds when verifyPropagation is off, and zero when it is on. That'll silence the warning and should provide reasonable behavior either way. I am currently doing exactly this as a hack in Sandstorm: sandstorm-io/sandstorm@a931920

  • When verifyPropagation is on, wait 5-10 seconds before the first verification attempt. This should make verification take less time overall since it won't pollute caches in the common case that propagation happens quickly.

  • When verifyPropagation is on, the stderr output is pretty verbose and makes it look like things are going horribly wrong when in fact it's working as intended. Specifically right now it repeatedly prints things like:

      Error: Could not verify DNS for _acme-challenge.zero.kentonshouse.com
          at Function.verifyPropagation (node_modules/acme-dns-01-cloudflare/index.js:146:12)
          at Challenge.set (node_modules/acme-dns-01-cloudflare/index.js:46:5)
      Waiting for 10000 ms before attempting propagation verification retry 1 / 30.
    

    This could maybe be reduced to a single line like: "DNS not propagated yet. Will check again in 10 seconds. (attempt 1/30)"

  • I noticed that even when verifyPropagation is off, the plugin seems to verify deletions. But deletions only happen as the very last step of the ACME.js process and there's really no need to verify them, so this just adds a lot of stderr spam for no reason. Maybe verification of deletions should be skipped?

@Cherry
Copy link
Member

Cherry commented Apr 26, 2020

Thanks so much for the detailed feedback, this is really great! I'm glad it's working well in Sandstorm for you.

As you noted, ACME.js does have its own propagationDelay, but this doesn't seem to work with greenlock.js (how we use it internally), thus the need for the verificationProgation option. We don't use this directly with ACME.js, so never ran into the double propagation issue. There's an open PR to fix this for greenlock, but it's been pending for months. The main reason we added this feature was just due to the sheer number of DNS related failures we were having without it, and whilst this doesn't (and really can't) solve them all, it does greatly reduce them.

You're also absolutely right regarding the DNS caching problems, and it's very much an estimate, since there's no way to really know where/when/how Let's Encrypt are going to query for that record, or if it's going to be cached. DNS is "fun" in this regard, and a better way to guarantee this would be awesome, but I'm just not sure there is one. Even if this plugin reports everything as 100% successfully propagated, there's still a chance that when Let's Encrypt query for the record content, it's going to be stale, and fail.

Regarding the specific ideas for improvement:

Perhaps acme-dns-01-cloudflare should set propagationDelay to something like 15 seconds when verifyPropagation is off, and zero when it is on. That'll silence the warning and should provide reasonable behavior either way.

This is a good suggestion, I'll implement this.

When verifyPropagation is on, wait 5-10 seconds before the first verification attempt. This should make verification take less time overall since it won't pollute caches in the common case that propagation happens quickly.

Great feedback. I'll definitely implement this.

When verifyPropagation is on, the stderr output is pretty verbose and makes it look like things are going horribly wrong when in fact it's working as intended.

Great feedback. The latest version in git actually does hide these logs behind a verbose config option, but this hasn't hit npm yet. I think reducing the output as suggested, and the new verbose logging option should make this much nicer.

I noticed that even when verifyPropagation is off, the plugin seems to verify deletions. But deletions only happen as the very last step of the ACME.js process and there's really no need to verify them, so this just adds a lot of stderr spam for no reason. Maybe verification of deletions should be skipped?

This should only be happening when verifyPropagation is on, but I'm not sure about skipping it entirely. I'll think about this and any possible side-effects this may cause.

Cherry added a commit that referenced this issue Apr 26, 2020
fix: wait 10 seconds before attempting first propagation check, to prevent poluting the DNS cache with an invalid result

Work towards implementing suggestions in #9
@Cherry
Copy link
Member

Cherry commented Apr 26, 2020

Published most of these suggested changes in 1.2.0. Thanks for the fantastic feedback!

@kentonv
Copy link
Author

kentonv commented Apr 27, 2020

Huh, doesn't greenlock.js itself use ACME.js?

This should only be happening when verifyPropagation is on

Hmm, this morning when I tried turning the option off I still saw it verifying deletions, but not creations. Though maybe what actually happened was that the option was still on somehow but the creation succeeded on the first verification attempt? Still, I was pretty sure I had turned it off... I guess I'll try testing it some more later.

@Cherry
Copy link
Member

Cherry commented Apr 27, 2020

Huh, doesn't greenlock.js itself use ACME.js?

It does, but the propagationDelay gets stripped somewhere in the middle of the challenge module being passed around greenlock to ACME.js.

Hmm, this morning when I tried turning the option off I still saw it verifying deletions, but not creations.

I should have clarified, sorry. It definitely used to be broken in the way you described and was only intended to run when verifyPropagation was on - this was fixed in a8e4d40#diff-168726dbe96b3ce427e7fedce31bb0bcR111-R116

@kentonv
Copy link
Author

kentonv commented Apr 27, 2020

Awesome, thanks for the changes!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants