Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

serve expired fails when DNSSEC is enabled #994

Closed
myohannes opened this issue Jan 16, 2024 · 8 comments · Fixed by #1143
Closed

serve expired fails when DNSSEC is enabled #994

myohannes opened this issue Jan 16, 2024 · 8 comments · Fixed by #1143
Assignees
Milestone

Comments

@myohannes
Copy link

myohannes commented Jan 16, 2024

Describe the bug
Unbound's serv-expired feature seems to break when DNSSEC is enabled.

To reproduce
Steps to reproduce the behavior:

  1. Use default Unbound configuration in /etc/unbound/unbound.conf and ensure DNSSEC and serving stale is enabled. Also reduce cache max ttl for reproduce. See sample config below.
        # Serve Stale enabled  (RFC 8767)
	serve-expired: yes
	serve-expired-ttl: 86400
	serve-expired-reply-ttl: 30
	serve-expired-client-timeout: 1000
       
       # DNSSEC enabled
        module-config: "validator iterator"
	trusted-keys-file: /etc/unbound/keys.d/*.key
	auto-trust-anchor-file: "/var/lib/unbound/root.key"
	prefetch-key: yes
	val-clean-additional: yes
	val-permissive-mode: no
       
       # lower cache max TTL to trigger the issue
       cache-max-ttl: 10
       
        # Misc configs 
        log-queries: yes
        log-servfail: yes
        prefetch: yes
	rrset-roundrobin: yes
	minimal-responses: yes
	qname-minimisation: no
        harden-glue: yes

  1. restart unbound unbound-control -c /etc/unbound/unbound.conf
  2. Send a query this ensure the cache is populated for this query
    dig s3.amazonaws.com
  3. Block upstream name servers. E.g. this iptable rule blocks all .COM servers which are in the 192/8 range
# Note: change -A to -D to remove blocks
/sbin/iptables -o eth0 -A OUTPUT -p udp -d 192.0.0.0/8 -j DROP
/sbin/iptables -o eth0 -A OUTPUT -p tcp -d 192.0.0.0/8 -j DROP
  1. keep sending dig and observe that after several seconds it returns SERVFAIL
while true 
do 
        dig s3.amazonaws.com; 
        sleep 1;
done

i.e.
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.amzn2.13.5 <<>> s3.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 23877
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;s3.amazonaws.com.		IN	A

;; Query time: 1 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Jan 16 16:03:18 UTC 2024
;; MSG SIZE  rcvd: 45

Expected behavior

Repeat steps from 1 - 5 with DNSSEC disabled. Change module-config to:
module-config: "iterator"

With DNSSEC disabled Unbound responds from stale cache. This is confirmed by the 30 second TTL on answers.

dig s3.amazonaws.com 

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.amzn2.13.5 <<>> s3.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44358
;; flags: qr rd ra; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;s3.amazonaws.com.		IN	A

;; ANSWER SECTION:
s3.amazonaws.com.	30	IN	A	16.182.68.56
s3.amazonaws.com.	30	IN	A	52.217.4.62
s3.amazonaws.com.	30	IN	A	52.216.94.205
s3.amazonaws.com.	30	IN	A	16.182.37.24
s3.amazonaws.com.	30	IN	A	52.216.209.112
s3.amazonaws.com.	30	IN	A	52.217.169.8
s3.amazonaws.com.	30	IN	A	52.217.202.128
s3.amazonaws.com.	30	IN	A	52.217.115.24

;; Query time: 1 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Jan 16 16:03:26 UTC 2024
;; MSG SIZE  rcvd: 173

System:

  • Unbound version: 1.12.0 and 1.16.0
  • OS: Linux 5.10.201-191.748.amzn2.aarch64 GNU/Linux
  • unbound -V output:

Both Unbound 1.12.0 and 1.160.0 seems to have the same issue.

/usr/sbin/unbound  -V
Version 1.12.0

Configure line: --prefix=/usr --sysconfdir=/etc --libdir=/usr/lib64 --with-libevent
Linked libs: libevent 2.0.21-stable (it uses epoll), OpenSSL 1.0.2k-fips  26 Jan 2017
Linked modules: dns64 respip validator iterator

BSD licensed, see LICENSE in source package for details.
Report bugs to [email protected] or https://github.com/NLnetLabs/unbound/issues
/usr/local/sbin/unbound -V
Version 1.16.0

Configure line: --with-libevent
Linked libs: libevent 2.0.21-stable (it uses epoll), OpenSSL 1.0.2k-fips  26 Jan 2017
Linked modules: dns64 respip validator iterator

BSD licensed, see LICENSE in source package for details.
Report bugs to [email protected] or https://github.com/NLnetLabs/unbound/issues

Additional information

@gthess
Copy link
Member

gthess commented Jan 19, 2024

After reproducing the issue I can clearly see that the issue is not related to DNSSEC.
You would need to have serve-expired-client-timeout in order to easily expose the issue but you can see it in any configuration where you:

  1. Query Unbound to populate the cache
  2. Wait for the A records to expire (6 seconds)
  3. Unbound will fetch new information from upstream
  4. Cut the upstream connection
  5. Note the intermediate SERVFAIL response around the 11th second from step 1. This can be made more visible by lowering the sleep timeout to 0.2 for example.

This happens because on step 3 Unbound will get the new message from upstream which includes both A and NS records.
The NS records have a TTL of 10 (configuration) while the A records have a TTL of 5.
At this point the cached A records have expired, but the cached NS records have not.
Unbound will not overwrite the still good RRSETs with new information.
But it will not update the message's TTL to the minimun value and that is the bug.

Coming to the last step, SERVFAIL is happening because of the upstream broken connection.
Unbound tries to replace the SERVFAIL with a cached answer (serve-expired-client-timeout).
It finds the previously cached answer which is now in a peculiar state:

  • the message has TTL of 0
  • the A rrset has TTL of 0
  • the NS rrset has TTL of -1

(The TTL values are relative to current time; not the actual values)
Based on the message's TTL Unbound thinks that this message is not expired and proceeds with sanity checks. One of the tests complains because one of the RRSETs is expired on a non-expired message.
The result is that Unbound cannot replace the SERVFAIL which is returned to the client.

The next second all this is resolved since the message is considered expired and the sanity check does not complain.

@gthess gthess self-assigned this Jan 19, 2024
@myohannes
Copy link
Author

gthess@ thanks for identifying the transient servfail issue which recovers quickly. But i don't think it's related to this issue.

It seems like the manual reproduce i provided above may not give consistent failures sometime it succeeds. Therefore i uploaded some supporting data here to show a new one line command that has better chance of reproducing it.

unbound_994_reproduce.zip

Inside the zip you will find unbound-serv-expired-failure-reproduce.txt which has the entire reproduce i did along with the script used. I have also included the associated unbound.log that was generated from the reproduce see unbound-serv-expired-failure-log.txt

Looking at the verbose unbound log, the issue is likely related to DNSSEC (validator module). It seems like when delegations expire and have to be refreshed due to incoming query and/or could be prefetch-key, if there is an outage to upstream (i.e. .com servers in this case) the DNSSEC is unable to do the validation because of reaching single delegation point limit

request com. has exceeded the maximum number of glue fetches 129 to a single delegation point
...
...
Jan 19 23:55:37 unbound[16237:0] debug: out of query targets -- returning SERVFAIL
Jan 19 23:55:37 unbound[16237:0] debug: store error response in message cache
Jan 19 23:55:37 unbound[16237:0] debug: return error response SERVFAIL
Jan 19 23:55:37 unbound[16237:0] debug: mesh_run: iterator module exit state is module_finished
Jan 19 23:55:37 unbound[16237:0] debug: validator[module 0] operate: extstate:module_state_initial event:module_event_moddone
Jan 19 23:55:37 unbound[16237:0] info: validator operate: query com. DNSKEY IN
Jan 19 23:55:37 unbound[16237:0] debug: validator: nextmodule returned
Jan 19 23:55:37 unbound[16237:0] debug: not validating response, is valrec(validation recursion lookup)
Jan 19 23:55:37 unbound[16237:0] debug: mesh_run: validator module exit state is module_finished
Jan 19 23:55:37 unbound[16237:0] error: SERVFAIL <com. DNSKEY IN>: misc failure

Because the validator module comes before iterator it doesn't give the query a chance to be answered from stale cache. This explains why turning off validator module seems to be consistently working.

Can you please take a look and confirm this theory when you get a chance.

@gthess
Copy link
Member

gthess commented Jan 22, 2024

I am afraid I still can't reproduce with the provided script.
Could you also share the configuration file you were using for this? From the log output I don't see the cache-max-ttl used, neither the expected "Serve expired:" log when the reply would have been a SERVFAIL.

The serve-expired logic (looking for expired records before replying) when serve-expired-client-timeout is used, happens after all the modules are done; it is part of the core mechanism in Unbound, not tied to any particular module.

I am digging through the logfile in the mean time in case something pops up.

@myohannes
Copy link
Author

You may need to run the script multiple times if you haven't already.

This is the unbound config i used.
unbound.txt

These are the configs i modified from the default config that comes with the install.

	chroot: ""
	verbosity: 4
	cache-max-ttl: 10
	log-queries: yes
        log-replies: yes
 	log-servfail: yes
 	prefetch: yes
	prefetch-key: yes
	module-config: "validator iterator"
	logfile: "/var/log/unbound.log"
	log-time-ascii: yes
        use-syslog: no
unbound -V
Version 1.16.0

Configure line: --with-libevent
Linked libs: libevent 2.0.21-stable (it uses epoll), OpenSSL 1.0.2k-fips  26 Jan 2017
Linked modules: dns64 respip validator iterator

BSD licensed, see LICENSE in source package for details.
Report bugs to [email protected] or https://github.com/NLnetLabs/unbound/issues

@gthess
Copy link
Member

gthess commented Jan 22, 2024

I am confused now :) This config has no serve-expired activated options

@myohannes
Copy link
Author

myohannes commented Jan 23, 2024

Ah my mistake! The unbound config i provided was hand jammed and got mixed up with multiple installations of unbound in my box. Was the unbound log i provided earlier helpful?

I think i found a possible reason why it's hard to reproduce. I got a new box and installed unbound and was unable to reproduce it as well. When comparing with past boxes i noticed a subtle difference in the environments. In the older boxes, unbound was getting many queries in the background from other on host clients, whereas the new box was not. So here is another attempt at reproduce that seem to work for me on the new box consistently which i presume is more similar to your testing environment.

This time i happened to use unbound 1.12.0
unbound-1.12.0.txt

  1. I installed Unbound 1.12.0
$ pwd
/tmp/unbound-release-1.12.0
  1. Save attached unbound conf file to /etc/unbound/unbound-1.12.0.conf

  2. In first terminal keep sending queries every 1 or 2 seconds to simulate continous queries sent locally

 while true ; do dig ec2messages.us-west-2.amazonaws.com; dig logs.us-west-2.amazonaws.com. ; dig dynamodb.us-west-2.amazonaws.com; dig sts.us-west-2.amazonaws.com.; sleep 2; done
  1. After a few seconds in second terminal Run this reproduce script. The script ensures DNSSEC is turned on by querying dnssec-failed.org first before the upstream block to make sure dnssec validation is enabled correctly.
    Note: Assumes there is a /var/log/unbound.log already and it prunes the log on every run.
sudo rm /var/log/unbound.log && \
sudo touch /var/log/unbound.log && \
sudo chown unbound:unbound /var/log/unbound.log && \
echo $(date) && \
(sudo /sbin/iptables -o eth0 -D OUTPUT -p udp -d 192.0.0.0/8 -j DROP 2>/dev/null || true) && \
(sudo /sbin/iptables -o eth0 -D OUTPUT -p tcp -d 192.0.0.0/8 -j DROP 2>/dev/null || true) && \
(sudo ./unbound-control -c /etc/unbound/unbound-1.12.0.conf stop 2>/dev/null || true) && \
sleep 5 && \
sudo ./unbound -c /etc/unbound/unbound-1.12.0.conf && \
sleep 5 && \
(dig dnssec-failed.org || true) && \
(dig s3.amazonaws.com.|| true) && \
(sudo ./unbound-control -c /etc/unbound/unbound-1.12.0.conf dump_cache | grep "s3.amazonaws.com." || true) && \
sleep 2 && \
sudo /sbin/iptables -o eth0 -A OUTPUT -p udp -d 192.0.0.0/8 -j DROP && \
sudo /sbin/iptables -o eth0 -A OUTPUT -p tcp -d 192.0.0.0/8 -j DROP && \
sleep 5 && \
while true 
do
        dig s3.amazonaws.com.;
        sleep 1;
done;
Tue Jan 23 00:25:14 UTC 2024

ok
Jan 23 00:25:19 unbound[6052:0] debug: creating udp4 socket 127.0.0.1 53
Jan 23 00:25:19 unbound[6052:0] debug: creating tcp4 socket 127.0.0.1 53
Jan 23 00:25:19 unbound[6052:0] debug: creating udp4 socket 127.0.0.1 53
Jan 23 00:25:19 unbound[6052:0] debug: creating tcp4 socket 127.0.0.1 53
Jan 23 00:25:19 unbound[6052:0] debug: creating tcp4 socket 127.0.0.1 8953

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.amzn2.13.5 <<>> dnssec-failed.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 14549
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;dnssec-failed.org.		IN	A

;; Query time: 246 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Jan 23 00:25:24 UTC 2024
;; MSG SIZE  rcvd: 46


; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.amzn2.13.5 <<>> s3.amazonaws.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12280
;; flags: qr rd ra; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;s3.amazonaws.com.		IN	A

;; ANSWER SECTION:
s3.amazonaws.com.	5	IN	A	54.231.131.136
s3.amazonaws.com.	5	IN	A	16.182.104.128
s3.amazonaws.com.	5	IN	A	54.231.199.208
s3.amazonaws.com.	5	IN	A	16.182.37.216
s3.amazonaws.com.	5	IN	A	52.216.113.157
s3.amazonaws.com.	5	IN	A	52.216.56.224
s3.amazonaws.com.	5	IN	A	52.217.139.64
s3.amazonaws.com.	5	IN	A	52.217.228.192

;; Query time: 20 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Jan 23 00:25:24 UTC 2024
;; MSG SIZE  rcvd: 173

s3.amazonaws.com.	10	IN	NS	ns-1300.awsdns-34.org.
s3.amazonaws.com.	10	IN	NS	ns-1579.awsdns-05.co.uk.
s3.amazonaws.com.	10	IN	NS	ns-63.awsdns-07.com.
s3.amazonaws.com.	10	IN	NS	ns-771.awsdns-32.net.
s3.amazonaws.com.	5	IN	A	52.216.113.157
s3.amazonaws.com.	5	IN	A	52.216.56.224
s3.amazonaws.com.	5	IN	A	52.217.139.64
s3.amazonaws.com.	5	IN	A	52.217.228.192
s3.amazonaws.com.	5	IN	A	54.231.131.136
s3.amazonaws.com.	5	IN	A	16.182.104.128
s3.amazonaws.com.	5	IN	A	54.231.199.208
s3.amazonaws.com.	5	IN	A	16.182.37.216
msg s3.amazonaws.com. IN A 32896 1 5 3 1 1 0
s3.amazonaws.com. IN A 0
s3.amazonaws.com. IN NS 0

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.amzn2.13.5 <<>> s3.amazonaws.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 58059
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;s3.amazonaws.com.		IN	A

;; Query time: 212 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Jan 23 00:25:31 UTC 2024
;; MSG SIZE  rcvd: 45


; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.amzn2.13.5 <<>> s3.amazonaws.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 27344
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;s3.amazonaws.com.		IN	A

;; Query time: 38 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Jan 23 00:25:32 UTC 2024
;; MSG SIZE  rcvd: 45


; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.amzn2.13.5 <<>> s3.amazonaws.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 14796
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;s3.amazonaws.com.		IN	A

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Jan 23 00:25:33 UTC 2024
;; MSG SIZE  rcvd: 45


; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.amzn2.13.5 <<>> s3.amazonaws.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 14377
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;s3.amazonaws.com.		IN	A

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Jan 23 00:25:34 UTC 2024
;; MSG SIZE  rcvd: 45


; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.amzn2.13.5 <<>> s3.amazonaws.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 8511
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;s3.amazonaws.com.		IN	A

;; Query time: 19 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Jan 23 00:25:35 UTC 2024
;; MSG SIZE  rcvd: 45


; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.amzn2.13.5 <<>> s3.amazonaws.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 28141
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;s3.amazonaws.com.		IN	A

;; Query time: 19 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Jan 23 00:25:36 UTC 2024
;; MSG SIZE  rcvd: 45


; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.amzn2.13.5 <<>> s3.amazonaws.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 43336
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;s3.amazonaws.com.		IN	A

;; Query time: 20 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Jan 23 00:25:37 UTC 2024
;; MSG SIZE  rcvd: 45


; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.amzn2.13.5 <<>> s3.amazonaws.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 49799
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;s3.amazonaws.com.		IN	A

;; Query time: 19 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Jan 23 00:25:38 UTC 2024
;; MSG SIZE  rcvd: 45


; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.amzn2.13.5 <<>> s3.amazonaws.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 65125
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;s3.amazonaws.com.		IN	A

;; Query time: 20 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Jan 23 00:25:39 UTC 2024
;; MSG SIZE  rcvd: 45


; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.amzn2.13.5 <<>> s3.amazonaws.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 49522
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;s3.amazonaws.com.		IN	A

;; Query time: 19 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Jan 23 00:25:40 UTC 2024
;; MSG SIZE  rcvd: 45


; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.amzn2.13.5 <<>> s3.amazonaws.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 5745
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;s3.amazonaws.com.		IN	A

;; Query time: 36 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Jan 23 00:25:41 UTC 2024
;; MSG SIZE  rcvd: 45

^C
..
..
..

<did not recover to stale answers>

Unbound log file for this run
unbound-log-01-23.zip

@gthess
Copy link
Member

gthess commented Jan 30, 2024

I was able to reproduce it at the end with the new information and I discovered the following when using serve-expired-client-timeout:

  • The iterator should not overwrite an expired cached entry with lower security (in this case insecure in the cache, unchecked fresh).
  • The validator should not overwrite an expired cached entry with lower security (in this case insecure in cache, bogus when revalidating).
  • Probably needs an extra ttl on the message itself to indicate that any resolution will probably servfail if we try to resolve immediately for the next X seconds. Instead use the cached entry straight away.
  • The serve-expired-client-timeout logic should also work for DNSSEC failures, not just module errors. Bogus turns into SERVFAIL just before replying to a client, way past the serve-expired-client-timeout logic that tries to replace SERVFAIL answers (i.e., module errors) with (expired) cached data.

For this specific issue what was happening based on timing was either:

  • The iterator would generate SERVFAIL because it couldn't reach any nameservers and that would be replaced with an expired answer. The validator would not act on the SERVFAIL and a cache entry would be used, or
  • The iterator would get the new answer and overwrite the valid expired RRSETs in the cache with new unchecked ones. The validator would try to validate them, fail and return bogus. The bogus response would not trigger a cache replacement and even if it would, there is no usable cache entry at this point. The continuous response for these queries would be SERVFAIL (bogus) then until the upstream issue is resolved.

I'll need to schedule and address all the above since more scenarios could be problematic based on timing and combination of modules.
In general be very conservative while updating cache entries when serve expired is used and treat DNSSEC errors as an additional trigger for replying with cached entries if possible.

@gthess
Copy link
Member

gthess commented Sep 24, 2024

#1143 fixed the generic case of preferring expired answers to resolution/validation errors. The specific case mentioned here (affected insecure zone because of parent DNSSEC errors) is also introduced as an explicit test case for this behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants