In Geoff Huston’s recent ISP Column “Roll Over and Die?”, Roy Arends made a thorough analysis of the behavior of Unbound in the face of increased traffic towards authoritative servers after a failed key-rollover.
Key to Roy’s analysis is the observation that Unbound holds back after finding a bogus DNSKEY, but does so on a per-query instead of a per-zone basis.
The default value of 60 seconds causes UNBOUND to restrain itself. However, since it's a per-message cache, it only restrains itself for that qname/qclass/qtype tuple. Hence, if a different query is asked, UNBOUND needs to validate the response, sees a bogus DNSKEY in the cache and starts to re-fetch the dnskey keyset. In other words, a lame root key will cause DNSKEY queries for every unique query seen per 60 second window.
We will address this with a caching mechanism that treats DNSSEC validation failures on a zone-wide basis instead of treating them as intermittent RR-set failures. That should significantly reduce the traffic to authoritative servers.
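To make the difference concrete, here is a minimal sketch that counts DNSKEY re-fetches under both caching strategies. It is an illustration only and not Unbound's actual implementation: the `BogusCache` class, the `count_fetches` helper, the query names, and the half-second query spacing are all hypothetical. The only figure taken from the text above is the 60 second hold-down.

```python
BOGUS_TTL = 60  # seconds; the default hold-down value mentioned above


class BogusCache:
    """Toy model of a cache that remembers DNSSEC validation failures.

    Hypothetical illustration, not Unbound's real data structure.
    """

    def __init__(self, per_zone):
        self.per_zone = per_zone   # True: key on the zone; False: key on the query tuple
        self.bogus_until = {}      # cache key -> time until which we hold back
        self.dnskey_fetches = 0    # DNSKEY queries sent to authoritative servers

    def handle_query(self, now, zone, qname, qtype, qclass="IN"):
        key = zone if self.per_zone else (qname, qclass, qtype)
        if self.bogus_until.get(key, float("-inf")) > now:
            return  # failure is still cached for this key: hold back
        # No cached failure for this key: re-fetch the DNSKEY set.
        # In this model the key is stale, so validation fails again.
        self.dnskey_fetches += 1
        self.bogus_until[key] = now + BOGUS_TTL


def count_fetches(per_zone, n_queries):
    """Send n_queries unique queries, 0.5 s apart, below a zone with a stale key."""
    cache = BogusCache(per_zone)
    for i in range(n_queries):
        cache.handle_query(now=i * 0.5, zone=".",
                           qname=f"host{i}.example.", qtype="A")
    return cache.dnskey_fetches


print("per-message cache:", count_fetches(per_zone=False, n_queries=100))
print("per-zone cache:   ", count_fetches(per_zone=True, n_queries=100))
```

With 100 unique queries inside a single 60 second window, the per-message model sends one DNSKEY query per unique query, while the per-zone model sends one in total. A real resolver has more state than this, so the numbers only indicate the shape of the effect.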
This particular problem is interesting because, as developers, we are constantly trying to make the tradeoff between the ability to recover from failure and the costs that those recovery mechanisms impose on third parties. Failure to validate a signature can have many causes, ranging from misconfiguration or synchronization failure at the authoritative side, to on-path failure or attack, to misconfiguration at the receiving side. In this case we have not been conservative enough when making the trade-offs.
The fact that these sorts of issues are being identified is a healthy sign in what is still early deployment, and we are eager to learn from these experiences. We use two resources for gathering experience that can help us make implementation choices: the IETF DNSOP working group and OARC. OARC is an organization where data is collected and shared so that the impact of certain implementation behavior can be quantified. We would like to ask people to contribute measurement data and share experiences.
Back to the particular issue of stale keys. The column points out that there is a mechanism to prevent stale keys from being retained after a key rollover: the one described in RFC 5011. As of version 1.4.0, Unbound has native support for maintaining the trust anchor across key rollovers based on RFC 5011. We have also made “autotrust” available for cases where trust anchors need to be maintained and Unbound is not used.
In the particular case described in the column, the RFC 5011 methodology might not have worked; an old OS distribution carrying a stale key that is several generations old cannot be tracked using RFC 5011 techniques. Wijngaards and Kolkman have been working on a proposal to fix that particular issue: “DNSSEC Trust Anchor History Service