tarantool/avro-schema

Attempt to get element -1 from stack of size -1

olegrok opened this issue · 8 comments

I have a problem that happens only under load.

The problem occurs only on macOS (GC64 seems to be enabled there).
Avro-schema version: 3.0.3
Tarantool version: 2.2.1
We use only the "validate" method (no flatten, unflatten, etc.).
The problem appears when I load a large amount of data (more than 2 GB) into Tarantool.

[string "avro.utils.fstack"]:30: Attempt to get element -1 from stack of size -1
stack traceback:
	[string "avro.utils.fstack"]:30: in function 'get'
	...ects/tdg/.rocks/share/tarantool/avro_schema/frontend.lua:946: in function 'copy_data_eh'
	...ects/tdg/.rocks/share/tarantool/avro_schema/frontend.lua:965: in function 'validate'

my object:

{
    "id": 1,
    "value": 1,
    "body": "tdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdg"
}

schema:

{
    "type": "record",
    "name": "TestObject",
    "logicalType": "Aggregate",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "value", "type": "int"},
        {"name": "body", "type": "string*"}
    ]
}

The problem does not appear if jit.off() is called.
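Since a global jit.off() hides the error, a narrower switch might help locate the code whose traces misbehave. The sketch below assumes the standard LuaJIT call jit.off(func, true), which disables compilation for one function and for the functions defined inside it (not for the functions it merely calls). The helper name disable_jit_for and the stand-in function suspect are hypothetical; in the application one would pass avro_schema.validate or an application-level wrapper instead.

```lua
-- The 'jit' module exists only under LuaJIT (and therefore Tarantool);
-- pcall keeps the sketch loadable on plain Lua as well.
local has_jit, jit = pcall(require, 'jit')

-- Hypothetical helper: turn the JIT compiler off for `func` and for the
-- functions lexically nested in it. Returns true if the switch was applied.
local function disable_jit_for(func)
    if not has_jit then
        return false
    end
    jit.off(func, true)
    return true
end

-- Stand-in for a suspected hot function (hypothetical).
local function suspect(n)
    local sum = 0
    for i = 1, n do
        sum = sum + i
    end
    return sum
end

local applied = disable_jit_for(suspect)
local result = suspect(1000)  -- still runs; only compilation is disabled
```

If the error disappears after disabling the JIT for a single function, that function is probably a good starting point for a reduced test case.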

Can you share a reproducer?

It happens inside my application. The avro-schema code that I extracted from it does not reproduce the problem.

I think the JIT traces are broken inside my application, and the root of the problem could be somewhere else.

I propose to work on a reproducer (at least one that uses avro-schema; ideally reduced to plain Lua code) for a fixed amount of time (say, two working days) and then:

  • If it succeeds, file an issue against tarantool/tarantool or tarantool/luajit regarding GC64 / macOS.
  • If it fails, close this issue (or what else can we do?).

The logic looks like this:

local fiber = require('fiber')
local json = require('json')
local avro_schema = require('avro_schema')

local json_schema = [[
[
    {
        "type": "record",
        "name": "TestObject",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "value", "type": "int"},
            {"name": "body", "type": "string*"}
        ]
    }
]
]]

local schema = json.decode(json_schema)

local ok, handle = avro_schema.create(schema)

assert(ok, handle)

local object = {
    TestObject = {
        id = 1,
        value = 1,
        body = "tdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdgtdg",
    }
}

local function validate_object(obj)
    local ok, err = avro_schema.validate(handle, obj)
    assert(ok, err)
end

validate_object(object)

box.cfg{memtx_memory = 4 * 2^30}

local space = box.schema.space.create('test_space', {if_not_exists = true})
space:create_index('pk', {if_not_exists = true})


local function insert_object(obj)
    space:replace({obj.id, obj.value, obj.body})
end

local worker_count = 1e3

for i = 1, worker_count do
    local obj = table.deepcopy(object)
    obj['TestObject']['id'] = i
    fiber.new(function()
        while true do
            validate_object(obj)
            insert_object(obj['TestObject'])
            obj['TestObject']['id'] = obj['TestObject']['id'] + worker_count
        end
    end)
end

But it doesn't seem to reproduce the problem.
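Because the standalone script stays clean, one option is to instrument the failing call inside the application and record the context when the error fires. The following is only a sketch under assumptions: call_with_report is a hypothetical helper, and the stand-in functions at the bottom replace the real call, which in the application would presumably be call_with_report(avro_schema.validate, handle, obj). It relies only on xpcall, debug.traceback, and the LuaJIT fields jit.status(), jit.version, jit.os, and jit.arch.

```lua
-- 'jit' is available under LuaJIT/Tarantool; guard it for plain Lua.
local has_jit, jit = pcall(require, 'jit')
local unpack = unpack or table.unpack

-- Hypothetical helper: call fn(...) and, on error, return false plus a
-- report table (message, traceback, JIT state) instead of raising.
local function call_with_report(fn, ...)
    local n = select('#', ...)
    local args = {...}
    local function handler(err)
        return {
            error = tostring(err),
            traceback = debug.traceback(),
            -- jit.status() returns a boolean first; keep only that value.
            jit_on = has_jit and jit.status() or false,
            jit_version = has_jit and jit.version or nil,
            jit_os = has_jit and jit.os or nil,
            jit_arch = has_jit and jit.arch or nil,
        }
    end
    return xpcall(function() return fn(unpack(args, 1, n)) end, handler)
end

-- Stand-ins for the real call (hypothetical).
local ok, report = call_with_report(function() error('boom') end)
local ok2, value = call_with_report(function(a, b) return a + b end, 1, 2)
```

A report captured this way could be attached to the issue even without an isolated test case, since it records which LuaJIT build, OS, and architecture produced the failure.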

Can you share a reproducer based on your application (it is okay to do so privately; preferably via an issue in the application repository)?

I haven't run into this issue for a long time. As far as I remember, it happened during a perf test for TDG1. I'm not sure I can reproduce it again, but I should probably try.

Also, feel free to close this issue. I'm not sure it's an avro-schema issue; it looks like a LuaJIT bug. The worst part is that I don't have an isolated test case.

Even with a non-isolated test case we can at least make a guess about similarity to other known problems and try to bisect over tarantool and/or luajit commits.