luvit/luvit

Linux test-dns resolveTxt failure

squeek502 opened this issue · 11 comments

I was getting this locally and now it's happening on the CI too:

# Starting Test: dns - resolveTxt
  /home/runner/work/luvit/luvit/tests/libs/tap.lua:81: /home/runner/work/luvit/luvit/tests/test-dns.lua:86: assertion failed!
  stack traceback:
  	[C]: in function 'error'
  	/home/runner/work/luvit/luvit/tests/libs/tap.lua:81: in function </home/runner/work/luvit/luvit/tests/libs/tap.lua:64>
  	[C]: in function 'xpcall'
  	/home/runner/work/luvit/luvit/tests/libs/tap.lua:64: in function 'run'
  	/home/runner/work/luvit/luvit/tests/libs/tap.lua:165: in function 'tap'
  	/home/runner/work/luvit/luvit/tests/run.lua:42: in function 'fn'
  	[string "bundle:deps/require.lua"]:310: in function 'require'
  	/home/runner/work/luvit/luvit/main.lua:128: in function </home/runner/work/luvit/luvit/main.lua:20>
not ok 21 dns - resolveTxt

EDIT: Locally the error I'm getting is Maximum attempts reached

Something weird is going on here:

  • The test right after also does a dns.resolveTxt('google.com') and that works fine
  • Changing the test from using google.com to using nodejs.org fixes it (this is the domain the node test-dns uses)
  • It only happens on Linux, not Mac (it's skipped on the Windows CI)
  • It also happens when using older luvi versions (tested with 2.7.6 and it fails there too)
  • EDIT: It gets fixed if I change the order of the resolveTxt test (i.e. move it to the bottom of the file)

I'm not sure I have the knowledge necessary for debugging this one properly. A quick fix would be to change the domain that it looks up the TXT records for.

For me, I am getting the following when building Luvit on Linux:

Uncaught Error: /mnt/bilal/home/Desktop/luvit/deps/dns.lua:690: attempt to perform arithmetic on local 'len_lo' (a nil value)
stack traceback:
        /mnt/bilal/home/Desktop/luvit/deps/dns.lua:690: in function 'handler'
        /mnt/bilal/home/Desktop/luvit/deps/core.lua:248: in function 'emit'
        ...bilal/home/Desktop/luvit/deps/stream/stream_readable.lua:172: in function 'push'
        /mnt/bilal/home/Desktop/luvit/deps/net.lua:123: in function </mnt/bilal/home/Desktop/luvit/deps/net.lua:117>
        [builtin#37]: at 0x004e1840
        /mnt/bilal/home/Desktop/luvit/init.lua:49: in function </mnt/bilal/home/Desktop/luvit/init.lua:47>
        [C]: in function 'xpcall'
        /mnt/bilal/home/Desktop/luvit/init.lua:47: in function 'fn'
        [string "bundle:deps/require.lua"]:310: in function <[string "bundle:deps/require.lua"]:266>
make: *** [Makefile:12: test] Error 255

I have traced a tiny bit of this, and found that (line 114 from net):

function Socket:_read(n)
  local onRead

  function onRead(err, data)
    timer.active(self)
    if err then
      return self:destroy(err)
    elseif data then
      p(3, n, data) -- data = '\000'
      self:push(data)
    else
      self:push(nil)
      self:emit('_socketEnd')
    end
  end

Note that the data being passed here is \000. Now back to dns (line 685):

    function onData(msg)
      local len_hi, len_lo, len, answers

      len_hi = byte(msg, 1)
      len_lo = byte(msg, 2)
      len = lshift(len_hi, 8) + len_lo -- len_lo == nil

Since msg is the single byte \000, string.byte('\000', 2) returns nil, so the addition on the last line fails. I have tested the Readable stream class a bit, and it works fine.
The data seems to be coming directly from luv (net line 133):

uv.read_start(self._handle, onRead)

so I am not entirely sure why this single test is the one getting this kind of data.
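If the one-byte chunk is simply the first half of the two-byte length prefix (an assumption, nothing above confirms it), then the handler would need to buffer chunks until a whole message has arrived instead of reading bytes 1 and 2 of every chunk. A minimal sketch of that idea in plain Lua follows; newFramer and feed are hypothetical names, not part of luvit's dns.lua, and this is not the fix that was applied:

```lua
-- Hypothetical helper: reassembles length-prefixed messages from arbitrary chunks.
-- Each message is preceded by a 2-byte big-endian length, as in onData above.
local function newFramer(onMessage)
  local buffer = ''
  return function(chunk)
    buffer = buffer .. chunk
    -- Only look at the prefix once both of its bytes are available.
    while #buffer >= 2 do
      local len_hi, len_lo = string.byte(buffer, 1, 2)
      local len = len_hi * 256 + len_lo
      if #buffer < 2 + len then
        break -- payload incomplete, wait for the next chunk
      end
      onMessage(string.sub(buffer, 3, 2 + len))
      buffer = string.sub(buffer, 3 + len)
    end
  end
end

local messages = {}
local feed = newFramer(function(msg) messages[#messages + 1] = msg end)
feed('\000')     -- first half of the length prefix only: nothing is emitted
feed('\003abc')  -- second half of the prefix plus a 3-byte payload
print(#messages, messages[1])
```

With this shape a lone '\000' is just held in the buffer, so len_lo is never nil.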

I've also confirmed this:

Changing the test from using google.com to using nodejs.org fixes it (this is the domain the node test-dns uses)

With that change the test works fine and never receives this kind of weird chunk.

I have just noticed that it doesn't have to be a different domain; just changing it to www.google.com seems to work. That's quite weird, but I guess it is not totally broken.

Worth mentioning: requesting google.com without www returns a 301 (Moved). I think all tests that use google.com should be changed to www.google.com until we find the exact reason behind this weirdness.

Update: I've changed all tests to use www.google.com and that made Test dns - resolveTxtTimeout Order fail with

  /mnt/bilal/home/Desktop/luvit/tests/libs/tap.lua:81: /mnt/bilal/home/Desktop/luvit/tests/test-dns.lua:98: assertion failed!
  stack traceback:
        [C]: in function 'error'
        /mnt/bilal/home/Desktop/luvit/tests/libs/tap.lua:81: in function </mnt/bilal/home/Desktop/luvit/tests/libs/tap.lua:64>
        [C]: in function 'xpcall'
        /mnt/bilal/home/Desktop/luvit/tests/libs/tap.lua:64: in function 'run'
        /mnt/bilal/home/Desktop/luvit/tests/libs/tap.lua:165: in function 'tap'
        /mnt/bilal/home/Desktop/luvit/tests/run.lua:42: in function 'fn'
        [string "bundle:deps/require.lua"]:310: in function 'require'
        /mnt/bilal/home/Desktop/luvit/main.lua:128: in function </mnt/bilal/home/Desktop/luvit/main.lua:20>

... oh lol. The rest of the tests seem fine with it now, including resolveTxt.

Update 2: Changing every test that uses google.com, except resolveTxtTimeout, to www.google.com seems to be a working workaround.

Update 3: With some help from Nameless, we found that the response to the failing request (google.com) looks like:

XL\131\128\000\001\000\a\000\000\000\000\006google\003com\000\000\016\000\001\192\f\000\016\000\001\000\000\f\148\000<;facebook-domain-verification=22rm551cu4k0ab0bxsw536tlds4h95\192\f\000\016\000\001\000\000\f\148\000$#v=spf1 include:_spf.google.com ~all\192\f\000\016\000\001\000\000\f\148\000+*apple-domain-verification=30afIBcvSuDV2PLX\192\f\000\016\000\001\000\000\f\148\000EDgoogle-site-verification=TV9-DBe4R80X4v0M4U_bd_J9cpOJM0nikft0jAgjmsQ\192\f\000\016\000\001\000\000\f\148\000A@globalsign-smime-dv=CDYX+XFHUw2wml6/Gb8+59BsH31KzUr6c1l2BPvqKX8=\192\f\000\016\000\001\000\000\f\148\000EDgoogle-site-verification=wD8N7i1JTNTkezJ49swvWW48f8_9xveREV4oB-0Hf5o\192\f\000\016\000\001\000\000\f\148\000.-docusign=1b0a6754-49b1-4db5-8540-d2c12664b289

while a successful one (www.google.com) looks similar to:

\175{\129\128\000\001\000\001\000\001\000\000\003www\006google\003com\000\000\016\000\001\192\f\000\005\000\001\000\000\000\000\000\018\015forcesafesearch\192\016\192\016\000\006\000\001\000\000\000<\000&\003ns1\192\016\tdns-admin\192\016\022\1885a\000\000\003\132\000\000\003\132\000\000\a\b\000\000\000<
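One thing that can be read straight off the two dumps is the fixed 12-byte DNS header. The sketch below decodes it using the layout from RFC 1035 (section 4.1.1); decodeHeader is a hypothetical helper, not something in luvit's dns.lua, and only the first 12 bytes of each dump are used:

```lua
-- Hypothetical helper: decode a few fields of the fixed DNS header.
-- Byte layout: ID (2), flags (2), QDCOUNT (2), ANCOUNT (2), NSCOUNT (2), ARCOUNT (2).
local function decodeHeader(msg)
  local id_hi, id_lo, flags_hi, flags_lo, qd_hi, qd_lo, an_hi, an_lo =
    string.byte(msg, 1, 8)
  return {
    id = id_hi * 256 + id_lo,
    -- TC (truncated) is the 0x02 bit of the first flags byte.
    truncated = math.floor(flags_hi / 2) % 2 == 1,
    -- RCODE is the low nibble of the second flags byte.
    rcode = flags_lo % 16,
    qdcount = qd_hi * 256 + qd_lo,
    ancount = an_hi * 256 + an_lo,
  }
end

-- First 12 bytes of the google.com and www.google.com dumps above.
local failing = decodeHeader("XL\131\128\000\001\000\a\000\000\000\000")
local working = decodeHeader("\175{\129\128\000\001\000\001\000\001\000\000")

print(failing.truncated, failing.ancount) -- true, 7
print(working.truncated, working.ancount) -- false, 1
```

So the google.com reply carries seven answers and has the truncation bit set, while the www.google.com reply is a single untruncated answer. A truncated UDP reply normally means the client should retry over TCP, which would be consistent with the failure showing up in the TCP onData handler, though nothing in this thread confirms that this is the path dns.lua takes.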

@squeek502 do you think we can use that as a workaround for now? Or should we figure out the weirdness happening here first? Or maybe just change only the resolveTxt domain to something else, perhaps www.google.com, and leave the rest as google.com?

Ok, so the servers being used are essentially global, so each test affects the next.

In its current place:

# Starting Test: resolveTxt
'udp_iter'	{ port = 53, host = '127.0.0.53' }
'udp_iter'	{ port = 53, host = '127.0.0.53' }
'udp_iter'	{ port = 53, host = '127.0.0.53' }
'udp_iter'	{ port = 53, host = '127.0.0.53' }
'udp_iter'	{ port = 53, host = '127.0.0.53' }
  Maximum attempts reached
not ok 9 resolveTxt

When moved to the bottom:

# Starting Test: resolveTxt
'udp_iter'	{ port = 53, tcp = false, host = '8.8.8.8' }
ok 15 resolveTxt

The servers get set to DEFAULT_SERVERS, but on Luvit init dns.loadResolvers() gets called which sets servers to the system's dns resolver (hence the 127.0.0.53:53 server).

So, one quick fix would be to just call dns.setDefaultServers() in that test. It's still strange that this is failing, but maybe that's a Linux bug? Or a Libuv bug? I'll look more into it.
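To make the order dependence concrete, here is a toy model of the shared server list in plain Lua. It is NOT luvit's dns.lua: the table named dns, the SYSTEM_SERVERS list and the whichServer function are stand-ins invented for illustration, and only the names loadResolvers, setDefaultServers and DEFAULT_SERVERS and the two addresses come from the output above.

```lua
-- Toy model only: module-level state shared by every test in the file.
local DEFAULT_SERVERS = { { host = '8.8.8.8', port = 53, tcp = false } }
-- Stand-in for whatever loadResolvers() would read from the system config.
local SYSTEM_SERVERS = { { host = '127.0.0.53', port = 53 } }

local dns = { servers = DEFAULT_SERVERS }

function dns.loadResolvers()
  dns.servers = SYSTEM_SERVERS
end

function dns.setDefaultServers()
  dns.servers = DEFAULT_SERVERS
end

-- Hypothetical stand-in for a lookup: reports which server would be queried.
local function whichServer()
  return dns.servers[1].host
end

dns.loadResolvers()            -- happens on init, so it affects the first tests
local before = whichServer()   -- the system resolver
dns.setDefaultServers()        -- the proposed reset at the start of the test
local after = whichServer()    -- back to the default server
print(before, after)
```

Because the list is shared, any test that resets it changes which server the following tests talk to, which would match resolveTxt passing once it is moved to the bottom of the file.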

EDIT: In the meantime, #1149

Sounds good, will test if this works on my machine now

This is getting weirder, resolveTxt is indeed now working on my machine, though it is failing at dns - resolveMx with:

oh sorry, that looks like my bad. I just refetched the test file and it successfully passed all tests.
This should do for now

Some more weirdness:

  • When I first boot I get Server fault (IIRC) as the error, but then I can never get that again and instead get Maximum attempts reached. Clearing the DNS cache doesn't change anything either
  • systemd-resolve --type=TXT google.com gives google.com: resolve call failed: Query timed out after a long while, so this might be a more general systemd dns resolver issue

This might be due to Google not handling DNS queries of type TXT well enough:

  • google.com seems to be always invalid, and always returns a 301:
systemd-resolve --type=TXT google.com
google.com: resolve call failed: Received invalid reply
  • www.google.com seems not to support TXT, at the very least:
systemd-resolve --type=TXT www.google.com
www.google.com: resolve call failed: Name 'forcesafesearch.google.com' does not have any RR of the requested type

Maybe we should change google.com to something else everywhere in the tests?

I think it's fine to work around whatever weirdness they are doing on the main domain. At their scale it could easily be a custom DNS server that responds differently depending on any number of factors (time of day, geolocation, client version, etc.).