A few days ago, we ran into a strange phenomenon on our Oracle RAC servers: TNS listener timeouts.
EM Event: Critical:LISTENER_SCAN3_rac-scan - Listener response to a TNS ping is 5,010 msecs
After some research and many hours of analysis, we finally found the cause. When the server resolves a name, it requests both the A record and the AAAA record (quad-A record). You can see that in this tcpdump output:
2016-10-13 15:22:54.247889 IP (tos 0x0, ttl 64, id 11169, offset 0, flags [DF], proto UDP (17), length 75)
    188.8.131.52.23397 > 184.108.40.206.domain: [bad udp cksum 5c97!] 7836+ A? rac-scan.domain.local. (47)
2016-10-13 15:22:54.247895 IP (tos 0x0, ttl 64, id 11170, offset 0, flags [DF], proto UDP (17), length 75)
    220.127.116.11.23397 > 18.104.22.168.domain: [bad udp cksum 6ac8!] 53901+ AAAA? rac-scan.domain.local. (47)
2016-10-13 15:22:54.248504 IP (tos 0x0, ttl 127, id 30819, offset 0, flags [none], proto UDP (17), length 123)
    22.214.171.124.domain > 126.96.36.199.23397: [udp sum ok] 7836* q: A? rac-scan.domain.local. 3/0/0 rac-scan.domain.local. [1h] A 188.8.131.52, rac-scan.domain.local. [1h] A 10.60.22.105, rac-scan.domain.local. [1h] A 184.108.40.206 (95)
2016-10-13 15:22:59.251573 IP (tos 0x0, ttl 64, id 11171, offset 0, flags [DF], proto UDP (17), length 75)
    220.127.116.11.23397 > 18.104.22.168.domain: [bad udp cksum 5c97!] 7836+ A? rac-scan.domain.local. (47)
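Okay, sending both queries is normal behavior in dual-stack environments. The problem is that the resolver sends both requests over the same socket, and some operating systems and some hardware treat the exchange as finished after the first response, so the second reply never arrives. Because of that, the client waits and waits and waits for it (until the timeout). The capture shows exactly this: the A query is answered within a millisecond, no AAAA answer ever comes back, and about five seconds later the client sends the A query again.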
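To see the delay from the client side without a packet capture, you can time a lookup through the system resolver. The following is only a minimal sketch, assuming Python is available on the host: the helper names `time_lookup` and `looks_like_resolver_timeout` and the 4 second threshold are my own illustrative choices, not part of any Oracle or glibc tooling. Caching layers such as nscd may hide the effect.

```python
import socket
import time


def time_lookup(hostname, resolver=socket.getaddrinfo):
    """Time one lookup and return (elapsed_seconds, sorted_addresses).

    The resolver argument is injectable (hypothetical convenience) so the
    function can be exercised without real DNS traffic.
    """
    start = time.monotonic()
    # AF_UNSPEC asks for IPv4 and IPv6 results, which is what makes the
    # system resolver issue both the A and the AAAA query.
    results = resolver(hostname, None, socket.AF_UNSPEC, socket.SOCK_STREAM)
    elapsed = time.monotonic() - start
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    addresses = sorted({entry[4][0] for entry in results})
    return elapsed, addresses


def looks_like_resolver_timeout(elapsed, threshold=4.0):
    """Heuristic only: the default resolver timeout is 5 seconds per attempt,
    so a lookup in that range hints at a lost reply. The threshold is an
    assumption chosen for illustration."""
    return elapsed >= threshold


# Example usage on an affected host (performs a real DNS lookup):
#   elapsed, addresses = time_lookup("rac-scan.domain.local")
#   print(f"{elapsed:.3f}s", addresses, looks_like_resolver_timeout(elapsed))
```

A healthy lookup should return in a few milliseconds, while an affected one takes roughly as long as the 5,010 msecs reported in the EM event.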
So… you can deactivate IPv6 if you don't need it, or you can set one of the following options for the resolver.
man resolv.conf
[... abridged output ...]
...
single-request (since glibc 2.10)
       sets RES_SNGLKUP in _res.options. By default, glibc performs IPv4 and IPv6 lookups in parallel since version 2.9. Some appliance DNS servers cannot handle these queries properly and make the requests time out. This option disables the behavior and makes glibc perform the IPv6 and IPv4 requests sequentially (at the cost of some slowdown of the resolving process).
single-request-reopen (since glibc 2.9)
       The resolver uses the same socket for the A and AAAA requests. Some hardware mistakenly only sends back one reply. When that happens the client system will sit and wait for the second reply. Turning this option on changes this behavior so that if two requests from the same port are not handled correctly it will close the socket and open a new one before sending the second request.
...
We decided to go with the second approach and set single-request-reopen. This is how our resolv.conf looks now.
[root@rac01 ~]# cat /etc/resolv.conf
options single-request-reopen
search domain.local
nameserver 22.214.171.124
nameserver 126.96.36.199
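If you have several RAC nodes, it can help to check that the option is really present on each of them. The snippet below is a rough sketch under my own assumptions: the function name `resolver_options` is hypothetical, and it only does simple text parsing of a resolv.conf-style file rather than asking glibc what it actually uses.

```python
def resolver_options(resolv_conf_text):
    """Collect all tokens found on 'options' lines of resolv.conf-style text."""
    options = []
    for raw_line in resolv_conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines (starting with '#' or ';').
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] == "options":
            options.extend(fields[1:])
    return options


# Example usage on a node (reads the real file):
#   with open("/etc/resolv.conf") as handle:
#       print("single-request-reopen" in resolver_options(handle.read()))
```

Keep in mind that tools which regenerate /etc/resolv.conf can overwrite a manual change, so a check like this is worth repeating after network configuration changes.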
Now the machines are running like a charm 😉