Linux test-dns resolveTxt failure
squeek502 opened this issue · 11 comments
I was getting this locally and now it's happening on the CI too:
# Starting Test: dns - resolveTxt
/home/runner/work/luvit/luvit/tests/libs/tap.lua:81: /home/runner/work/luvit/luvit/tests/test-dns.lua:86: assertion failed!
stack traceback:
[C]: in function 'error'
/home/runner/work/luvit/luvit/tests/libs/tap.lua:81: in function </home/runner/work/luvit/luvit/tests/libs/tap.lua:64>
[C]: in function 'xpcall'
/home/runner/work/luvit/luvit/tests/libs/tap.lua:64: in function 'run'
/home/runner/work/luvit/luvit/tests/libs/tap.lua:165: in function 'tap'
/home/runner/work/luvit/luvit/tests/run.lua:42: in function 'fn'
[string "bundle:deps/require.lua"]:310: in function 'require'
/home/runner/work/luvit/luvit/main.lua:128: in function </home/runner/work/luvit/luvit/main.lua:20>
not ok 21 dns - resolveTxt
EDIT: Locally the error I'm getting is Maximum attempts reached
Something weird is going on here:
- The test right after also does a dns.resolveTxt('google.com') and that works fine
- Changing the test from using google.com to using nodejs.org fixes it (this is the domain the node test-dns uses)
- It only happens on Linux, not Mac (it's skipped on the Windows CI)
- It also happens when using older luvi versions (tested with 2.7.6 and it fails there too)
- EDIT: It gets fixed if I change the order of the resolveTxt test (i.e. move it to the bottom of the file)
I'm not sure I have the knowledge necessary for debugging this one properly. A quick fix would be to change the domain that it looks up the TXT records for.
For me, I am getting the following when building Luvit on Linux:
Uncaught Error: /mnt/bilal/home/Desktop/luvit/deps/dns.lua:690: attempt to perform arithmetic on local 'len_lo' (a nil value)
stack traceback:
/mnt/bilal/home/Desktop/luvit/deps/dns.lua:690: in function 'handler'
/mnt/bilal/home/Desktop/luvit/deps/core.lua:248: in function 'emit'
...bilal/home/Desktop/luvit/deps/stream/stream_readable.lua:172: in function 'push'
/mnt/bilal/home/Desktop/luvit/deps/net.lua:123: in function </mnt/bilal/home/Desktop/luvit/deps/net.lua:117>
[builtin#37]: at 0x004e1840
/mnt/bilal/home/Desktop/luvit/init.lua:49: in function </mnt/bilal/home/Desktop/luvit/init.lua:47>
[C]: in function 'xpcall'
/mnt/bilal/home/Desktop/luvit/init.lua:47: in function 'fn'
[string "bundle:deps/require.lua"]:310: in function <[string "bundle:deps/require.lua"]:266>
make: *** [Makefile:12: test] Error 255
I have traced a tiny bit of this, and found that (line 114 from net):
function Socket:_read(n)
  local onRead
  function onRead(err, data)
    timer.active(self)
    if err then
      return self:destroy(err)
    elseif data then
      p(3, n, data) -- data = '\000'
      self:push(data)
    else
      self:push(nil)
      self:emit('_socketEnd')
    end
  end
We notice here that the data getting passed is \000. Now back to dns (line 685):
function onData(msg)
  local len_hi, len_lo, len, answers
  len_hi = byte(msg, 1)
  len_lo = byte(msg, 2)
  len = lshift(len_hi, 8) + len_lo -- len_lo == nil
Since msg is \000, string.byte('\000', 2) == nil, so this will fail. I have tested the Readable stream class a bit, and it works fine.
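Whatever the origin of the one-byte chunk, a TCP read is not guaranteed to deliver a whole message, so a handler that indexes byte 2 unconditionally can hit this error. Below is a minimal sketch of buffering chunks until a full length-prefixed message is available, assuming the 2-byte big-endian length prefix used for DNS over TCP. The newFramer helper is hypothetical and is not luvit's actual code.

```lua
-- Hypothetical helper (not part of luvit): buffers incoming chunks and
-- returns only complete messages framed by a 2-byte big-endian length.
local function newFramer()
  local buffer = ''
  return function(chunk)
    buffer = buffer .. chunk
    local messages = {}
    while #buffer >= 2 do
      local len_hi, len_lo = string.byte(buffer, 1, 2)
      local len = len_hi * 256 + len_lo
      if #buffer < 2 + len then break end -- body incomplete, wait for more data
      messages[#messages + 1] = string.sub(buffer, 3, 2 + len)
      buffer = string.sub(buffer, 3 + len)
    end
    return messages
  end
end

local feed = newFramer()
local first = feed('\000')      -- a lone byte: nothing can be decoded yet
local second = feed('\003abc')  -- second length byte plus a 3-byte body
print(#first, #second, second[1])
```

With this shape, a chunk like '\000' simply stays in the buffer instead of reaching the arithmetic on len_lo.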
Now the data seems to be coming directly from luv (net line 133):
uv.read_start(self._handle, onRead)
so I am not entirely sure why this single test is the one getting this kind of data.
I've also confirmed this point:
"Changing the test from using google.com to using nodejs.org fixes it (this is the domain the node test-dns uses)"
With that change it somehow works just fine, without getting this kind of weird chunk.
I have just noticed that it doesn't have to be a different domain; just changing it to www.google.com seems to work. That's quite weird, but I guess it is not totally broken.
Worth mentioning: requesting google.com without www will return a 301 - Moved. I think all tests that use google.com should be changed to www.google.com until we find the exact reason behind this weirdness.
Update: I've changed all tests to use www.google.com and that made Test dns - resolveTxtTimeout Order fail with
/mnt/bilal/home/Desktop/luvit/tests/libs/tap.lua:81: /mnt/bilal/home/Desktop/luvit/tests/test-dns.lua:98: assertion failed!
stack traceback:
[C]: in function 'error'
/mnt/bilal/home/Desktop/luvit/tests/libs/tap.lua:81: in function </mnt/bilal/home/Desktop/luvit/tests/libs/tap.lua:64>
[C]: in function 'xpcall'
/mnt/bilal/home/Desktop/luvit/tests/libs/tap.lua:64: in function 'run'
/mnt/bilal/home/Desktop/luvit/tests/libs/tap.lua:165: in function 'tap'
/mnt/bilal/home/Desktop/luvit/tests/run.lua:42: in function 'fn'
[string "bundle:deps/require.lua"]:310: in function 'require'
/mnt/bilal/home/Desktop/luvit/main.lua:128: in function </mnt/bilal/home/Desktop/luvit/main.lua:20>
... oh lol. The rest of the tests seem fine with it now, including resolveTxt.
Update 2: Changing the tests that use google.com to www.google.com, except resolveTxtTimeout, seems like a working workaround.
Update 3: With some help from Nameless, we found that the failing request (google.com) looks like:
XL\131\128\000\001\000\a\000\000\000\000\006google\003com\000\000\016\000\001\192\f\000\016\000\001\000\000\f\148\000<;facebook-domain-verification=22rm551cu4k0ab0bxsw536tlds4h95\192\f\000\016\000\001\000\000\f\148\000$#v=spf1 include:_spf.google.com ~all\192\f\000\016\000\001\000\000\f\148\000+*apple-domain-verification=30afIBcvSuDV2PLX\192\f\000\016\000\001\000\000\f\148\000EDgoogle-site-verification=TV9-DBe4R80X4v0M4U_bd_J9cpOJM0nikft0jAgjmsQ\192\f\000\016\000\001\000\000\f\148\000A@globalsign-smime-dv=CDYX+XFHUw2wml6/Gb8+59BsH31KzUr6c1l2BPvqKX8=\192\f\000\016\000\001\000\000\f\148\000EDgoogle-site-verification=wD8N7i1JTNTkezJ49swvWW48f8_9xveREV4oB-0Hf5o\192\f\000\016\000\001\000\000\f\148\000.-docusign=1b0a6754-49b1-4db5-8540-d2c12664b289
while a successful one (www.google.com) is similar to:
\175{\129\128\000\001\000\001\000\001\000\000\003www\006google\003com\000\000\016\000\001\192\f\000\005\000\001\000\000\000\000\000\018\015forcesafesearch\192\016\192\016\000\006\000\001\000\000\000<\000&\003ns1\192\016\tdns-admin\192\016\022\1885a\000\000\003\132\000\000\003\132\000\000\a\b\000\000\000<
@squeek502 do you think we can use that as a workaround for now, or should we figure out the weirdness happening here first? Or maybe just change only the resolveTxt domain to something else, perhaps www.google.com, and leave the rest as google.com?
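For anyone comparing the two dumps quoted above, here is a minimal sketch that decodes the standard 12-byte DNS header (RFC 1035, section 4.1.1) from each one. The parseHeader helper is hypothetical and uses plain arithmetic so it runs without a bit library. One visible difference is that the google.com reply has the TC (truncated) bit set and seven answers, while the www.google.com reply does not. That may be related to the resolver ending up on the TCP path seen in the earlier traceback, but nothing in this thread confirms it.

```lua
-- Hypothetical helper: decode the 12-byte DNS header with plain arithmetic.
local function parseHeader(msg)
  local b = { string.byte(msg, 1, 12) }
  assert(#b == 12, 'need at least 12 bytes')
  local function u16(i) return b[i] * 256 + b[i + 1] end
  local flags = u16(3)
  return {
    id = u16(1),
    qr = math.floor(flags / 32768) % 2 == 1, -- bit 15: this is a response
    tc = math.floor(flags / 512) % 2 == 1,   -- bit 9: message was truncated
    rcode = flags % 16,                      -- low 4 bits: response code
    qdcount = u16(5),
    ancount = u16(7),
    nscount = u16(9),
    arcount = u16(11),
  }
end

-- First 12 bytes of each dump quoted in this thread.
local failing = parseHeader('XL\131\128\000\001\000\a\000\000\000\000')
local working = parseHeader('\175{\129\128\000\001\000\001\000\001\000\000')

print(failing.tc, failing.ancount, failing.rcode) -- true, 7, 0
print(working.tc, working.ancount, working.rcode) -- false, 1, 0
```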
Ok, so the servers being used are essentially global, so each test affects the next.
In its current place:
# Starting Test: resolveTxt
'udp_iter' { port = 53, host = '127.0.0.53' }
'udp_iter' { port = 53, host = '127.0.0.53' }
'udp_iter' { port = 53, host = '127.0.0.53' }
'udp_iter' { port = 53, host = '127.0.0.53' }
'udp_iter' { port = 53, host = '127.0.0.53' }
Maximum attempts reached
not ok 9 resolveTxt
When moved to the bottom:
# Starting Test: resolveTxt
'udp_iter' { port = 53, tcp = false, host = '8.8.8.8' }
ok 15 resolveTxt
The servers get set to DEFAULT_SERVERS, but on Luvit init dns.loadResolvers() gets called, which sets the servers to the system's DNS resolver (hence the 127.0.0.53:53 server).
So, one quick fix would be to just call dns.setDefaultServers() in that test. It's still strange that this is failing, but maybe that's a Linux bug? Or a Libuv bug? I'll look more into it.
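To illustrate why the reset helps, here is a small self-contained sketch of shared server state leaking into a test. The mockDns table is a stand-in written for this example and is not luvit's dns module; only the names setDefaultServers and DEFAULT_SERVERS are borrowed from the discussion above, and setServers is assumed here for illustration.

```lua
-- Mock stand-in for a module with shared resolver state (NOT luvit's dns).
local DEFAULT_SERVERS = { { host = '8.8.8.8', port = 53 } }
local mockDns = { servers = DEFAULT_SERVERS }

function mockDns.setServers(servers) mockDns.servers = servers end
function mockDns.setDefaultServers() mockDns.servers = DEFAULT_SERVERS end

-- Simulates init: the system resolver replaces the defaults for everyone.
mockDns.setServers({ { host = '127.0.0.53', port = 53 } })

local function currentHost() return mockDns.servers[1].host end

local before = currentHost()  -- state inherited from earlier code
mockDns.setDefaultServers()   -- explicit reset at the start of the test
local after = currentHost()

print(before, after)
```

Because the state is module-wide, a test that does not reset it runs against whatever the previous code left behind.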
EDIT: In the meantime, #1149
Sounds good, will test if this works on my machine now
This is getting weirder: resolveTxt is indeed now working on my machine, though it is failing at dns - resolveMx with:
oh sorry, that looks like my bad. I just refetched the test file and it successfully passed all tests.
This should do for now
Some more weirdness:
- When I first boot I get Server fault (IIRC) as the error, but then can never get that again, and instead get Maximum attempts reached. Clearing the DNS cache doesn't change anything either
- systemd-resolve --type=TXT google.com gives google.com: resolve call failed: Query timed out after a long while, so this might be a more general systemd DNS resolver issue
This might be due to Google not handling DNS of type TXT well enough:
google.com seems to be always invalid, and always returns a 301:
systemd-resolve --type=TXT google.com
google.com: resolve call failed: Received invalid reply
www.google.com seems to not support, at the very least, TXT:
systemd-resolve --type=TXT www.google.com
www.google.com: resolve call failed: Name 'forcesafesearch.google.com' does not have any RR of the requested type
I suggest we could change Google to something else everywhere in the tests maybe?
I think it's fine to work around whatever weirdness they are doing on the main domain. It could easily be a custom DNS server that responds differently depending on any number of factors (time of day, geolocation, version of client, etc.) at their scale.