patrickfrey/strus

Problems inserting big positions?

andreasbaumann opened this issue · 2 comments

DEBUG: field 482:11'html_meta_file': './data/etext/10556.html', @192151315
DEBUG: lookup expression for field 'html_meta_file'
DEBUG: got expression number for 'html_meta_file' to be 0
token positions of document '1055610556' are out or range (document too big, 150199 token positions assigned)
DEBUG: buffer reset, rest: 1055710557   Brooke, L. Leslie (Leonard Leslie), 1862-1940   Johnny Crow's Party             English PZ: Language and Literatures: 
failed to process document 'gutenberg.tsv': failed to process document 'gutenberg.tsv': error closing document in transaction: corrupt data (unpackInt32_ 1)

done

The positions produced by the experimental TSV segmenter with the ZIP-file @zipinclude function are
quite big, because each one is basically the position within the TSV file and the position of the
file within the uncompressed ZIP stream.

See

https://github.com/andreasbaumann/strusExamples/tree/master/gutenberg

and

https://github.com/andreasbaumann/strusAnalyzer/tree/tsv_extensions

I'm actually never resetting the position in the segmenter. Is it possible to reset it to 0 when we
start a new document section?
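
To illustrate what I mean by resetting, here is a minimal sketch. The class `SectionRelativePosition` and its methods are hypothetical names made up for this example. They are not part of strus or of the TSV segmenter. The sketch assumes the segmenter knows the absolute offset at which each document section starts.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of strus): maps absolute stream offsets
// to offsets relative to the start of the current document section.
class SectionRelativePosition
{
public:
    // Remember the absolute offset at which the new document section begins.
    void startSection( std::uint64_t abspos)
    {
        m_base = abspos;
    }

    // Convert an absolute offset into a section-relative one.
    // Offsets before the section start are a caller error.
    std::uint64_t relative( std::uint64_t abspos) const
    {
        assert( abspos >= m_base);
        return abspos - m_base;
    }

private:
    std::uint64_t m_base = 0;
};
```

With something like this, an absolute offset of the magnitude seen in the debug output above (192151315) would map to 0 for the first element of its section, and later elements would get small offsets counted from there.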

For reference the gdb stacktrace:

(gdb) bt
#0  __cxxabiv1::__cxa_throw (obj=obj@entry=0x7fffb7376600, 
    tinfo=0x7ffff6260ab0 <typeinfo for std::runtime_error>, 
    dest=0x7ffff5f8b050 <std::runtime_error::~runtime_error()>)
    at /build/gcc-multilib/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:62
#1  0x00007ffff51b8d4f in unpackInt32_ (end=end@entry=0x7fffbf44d591 "", 
    itr=@0x7fffe7ffc760: 0x7fffbf44d590 "\335")
    at /home/abaumann/strus/strus/src/storage/indexPacker.cpp:32
#2  strus::unpackIndex (itr=@0x7fffe7ffc760: 0x7fffbf44d590 "\335", 
    end=end@entry=0x7fffbf44d591 "")
    at /home/abaumann/strus/strus/src/storage/indexPacker.cpp:81
#3  0x00007ffff51b3590 in strus::ForwardIndexBlock::position_at (
    ref=0x7fffbf44d590 "\335", this=0x7fffe7ffc840)
    at /home/abaumann/strus/strus/src/storage/forwardIndexBlock.cpp:22
#4  strus::ForwardIndexBlock::append (this=this@entry=0x7fffe7ffc840, 
    pos=@0x7fffe0197f90: 4072, item="\225")
    at /home/abaumann/strus/strus/src/storage/forwardIndexBlock.cpp:71
#5  0x00007ffff51b3f1b in strus::ForwardIndexMap::closeCurblock (
    this=this@entry=0x7fffe0000bf0, typeno=@0x7fffbd78f370: 5, 
    elemlist=std::vector of length 128, capacity 128 = {...})
    at /home/abaumann/strus/strus/src/storage/forwardIndexMap.cpp:29
#6  0x00007ffff51b5206 in strus::ForwardIndexMap::defineForwardIndexTerm (
    this=0x7fffe0000bf0, typeno=@0x7fffbd78f370: 5, 
    typeno@entry=@0x7fffbd78f370: <optimized out>, pos=@0x7fffbd78f374: 4175, 
    pos@entry=@0x7fffbd78f374: <optimized out>, termstring="\273")
    at /home/abaumann/strus/strus/src/storage/forwardIndexMap.cpp:159
#7  0x00007ffff519e6ec in strus::StorageTransaction::defineForwardIndexTerm (
    this=<optimized out>, typeno=@0x7fffbd78f370: 5, 
    pos=@0x7fffbd78f374: 4175, termstring="\273")
    at /home/abaumann/strus/strus/src/storage/storageTransaction.cpp:164
#8  0x00007ffff51d60d9 in strus::StorageDocument::done (this=0x7fffe0de1980)
    at /home/abaumann/strus/strus/src/storage/storageDocument.cpp:157
#9  0x00000000004206c2 in strus::InsertProcessor::run (this=0x65e8a0)
    at /home/abaumann/strus/strusUtilities/src/strusInsert/insertProcessor.cpp:230
#10 0x00007ffff6cb098d in ?? () from /usr/lib/libboost_thread.so.1.63.0
#11 0x00007ffff57a52e7 in start_thread () from /usr/lib/libpthread.so.0
#12 0x00007ffff54e654f in clone () from /usr/lib/libc.so.6

and

(gdb) 
#7  0x00007ffff519e6ec in strus::StorageTransaction::defineForwardIndexTerm (this=<optimized out>, typeno=@0x7fffbd78f370: 5, 
    pos=@0x7fffbd78f374: 4175, termstring="\273") at /home/abaumann/strus/strus/src/storage/storageTransaction.cpp:164
164		m_forwardIndexMap.defineForwardIndexTerm( typeno, pos, termstring);
(gdb) p typeno
$3 = (const strus::Index &) @0x7fffbd78f370: 5
(gdb) p pos
$4 = (const strus::Index &) @0x7fffbd78f374: 4175
(gdb) p termstring
$5 = "\273"