marqo-ai/marqo

[BUG] Vectorise error when content is \n\n or \r\r

vicilliar opened this issue · 1 comment

Describe the bug
Adding a document in which a tensor field consists only of newlines or carriage returns, with no other characters, causes a vectorise failure.

Error message:

  File "/home/joshua/work/marqo_main/marqo/src/marqo/tensor_search/throttling/redis_throttle.py", line 34, in wrapper
    return function(*args, **kwargs)
  File "/home/joshua/work/marqo_main/marqo/src/marqo/tensor_search/api.py", line 195, in add_or_replace_documents
    return tensor_search.add_documents_orchestrator(
  File "/home/joshua/work/marqo_main/marqo/src/marqo/tensor_search/tensor_search.py", line 255, in add_documents_orchestrator
    return add_documents(config=config, add_docs_params=add_docs_params_with_device)
  File "/home/joshua/work/marqo_main/marqo/src/marqo/tensor_search/tensor_search.py", line 604, in add_documents
    vector_chunks = s2_inference.vectorise(
  File "/home/joshua/work/marqo_main/marqo/src/marqo/s2_inference/s2_inference.py", line 77, in vectorise
    raise RuntimeError(f"Vectorise created an empty list of batches! Content: {content}")
RuntimeError: Vectorise created an empty list of batches! Content: []

To Reproduce
Steps to reproduce the behavior:

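  1. Try indexing any of the following content:
import marqo

# Assumes a Marqo instance is reachable at http://localhost:8882 (adjust the URL as needed).
mq = marqo.Client(url="http://localhost:8882")
mq.index("test-index-1").add_documents([
    {"test": "\r\r"},
    {"test": "\r\r\r"},
    {"test": "\n\n"},
    {"test": "\n\n\n"},
    {"test": "\r\n "},
])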

Expected behavior
The vectorise function should handle this scenario without raising an error.
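
Until the vectorise path handles such input, one possible client-side workaround is to drop empty or whitespace-only string fields before calling add_documents. The following is a minimal sketch, not part of Marqo: `drop_blank_fields` is a hypothetical helper name, and it assumes that simply omitting the affected field is acceptable for the index.

```python
from typing import Any, Dict, List


def drop_blank_fields(docs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of docs without empty or whitespace-only string fields.

    Hypothetical helper for illustration; not part of the Marqo client.
    """
    cleaned = []
    for doc in docs:
        kept = {
            key: value
            for key, value in doc.items()
            # str.strip() removes \r, \n, spaces and tabs, so "\r\r" becomes "".
            if not (isinstance(value, str) and value.strip() == "")
        }
        cleaned.append(kept)
    return cleaned


docs = [
    {"_id": "a", "test": "\r\r"},
    {"_id": "b", "test": "\r\n "},
    {"_id": "c", "test": "Other websites "},
]
cleaned_docs = drop_blank_fields(docs)
```

Here `cleaned_docs` keeps all three documents, but only document "c" still carries the `test` field. Non-string fields such as `docDate` pass through unchanged.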

Sample of simplewiki docs that initially led to the discovery of this bug:

Sample of 6 docs.
[{'_id': '5589',
  'content_0': 'Reykjavík is the capital city of the island country of '
               'Iceland. It is also the largest city in that country. The '
               'population of Reykjavík is over 117,000 people. There is a '
               'geothermal bath, both natural and unnatural in appearance. It '
               'is in the capital and people relax in this hot spring during '
               'the cooler months. Björk, an Icelandic singer, is from '
               'Reykjavik.',
  'content_1': 'Other websites ',
  'content_2': '\r\r',
  'docDate': 1514612186000,
  'domain': 's.wikipedia.org',
  'title': 'Reykjavík ',
  'url': 'http://s.wikipedia.org/wiki/Reykjav%C3%ADk'},
 {'_id': '5960',
  'content_0': 'Nonmetals or non-metals are chemical elements that does not '
               'have the properties of a metal. Some are gases including: '
               'hydrogen, helium, oxygen, nitrogen, fluorine, neon or radon '
               'and many others. An example of a solid that is a nonmetal is '
               'sulfur. It is yellow and not shiny at all. An example of a '
               'liquid that is a nonmetal is bromine. It is red. A non metal '
               'is also a good insulator for heat and cold. Usually, gases or '
               'brittle solids are non-metals. Elements on the periodic table '
               'can be classified as metal, semimetal, or non-metal.',
  'content_1': 'Five times more elements are metals than nonmetals. However, '
               'nonmetals are abundant and important. Two of the '
               'nonmetals—hydrogen and helium—make up over 99 per cent of the '
               'observable Universe, and one—oxygen—makes up close to half of '
               "the Earth's crust, oceans and atmosphere. Living organisms are "
               'also composed almost entirely of nonmetals, and nonmetals form '
               'many more compounds than metals.',
  'content_2': '\r\r',
  'docDate': 1613073138000,
  'domain': 's.wikipedia.org',
  'title': 'Nonmetal ',
  'url': 'http://s.wikipedia.org/wiki/Nonmetal'},
 {'_id': '6172',
  'content_0': 'Events \r'
               'Geoffrey of Monmouth produces the Historia Regum Britanniae.\r'
               'Construction of the Durham Cathedral is completed in England.\r'
               'Construction of Exeter Cathedral begins in England.\r'
               'June 4 – Lothair III is crowned Holy Roman Emperor by '
               'Pope Innocent II.',
  'content_1': 'Births \rMarch 5 – Henry II of England (d. 1189\rSources',
  'content_2': '\r\r\r',
  'docDate': 1544978303000,
  'domain': 's.wikipedia.org',
  'title': '1133 ',
  'url': 'http://s.wikipedia.org/wiki/1133'},
 {'_id': '136543',
  'content_0': ' \r'
               'Fars Province (, Ostān-e Fārs ) is one of the 31 provinces of '
               'Iran. Its capital is Shiraz.',
  'content_1': 'References',
  'content_2': '\r\r\r',
  'docDate': 1622233430000,
  'domain': 's.wikipedia.org',
  'title': 'Fars Province ',
  'url': 'http://s.wikipedia.org/wiki/Fars_Province'},
 {'_id': '136544',
  'content_0': ' \r'
               'Kurdistan Province (, Ostān-e Kurdistān ) is one of the 31 '
               'provinces of Iran. Its capital is Sanandaj.',
  'content_1': 'References',
  'content_2': '\r\r',
  'docDate': 1622259381000,
  'domain': 's.wikipedia.org',
  'title': 'Kurdistan Province ',
  'url': 'http://s.wikipedia.org/wiki/Kurdistan_Province'},
 {'_id': '143612',
  'content_0': 'West Nusa Tenggara ( – NTB) is a province of Indonesia. It is '
               'the west part of the Lesser Sunda Islands except Bali. Bali is '
               'its own province. Mataram, on Lombok, is the capital and '
               'largest city of the province. In the 2010 census the '
               'population was 4,496,855. Estimasi Penduduk Mennurat Jenis '
               'Kelamin dan Provinsi di Indonesia Tahun 2014. The area of the '
               'province is 19,708.79 km2.\r'
               'The two largest islands in the province are Lombok in the west '
               'and Sumbawa in the east.',
  'content_1': 'References \rOther websites ',
  'content_2': '\r\r',
  'docDate': 1622281533000,
  'domain': 's.wikipedia.org',
  'title': 'West Nusa Tenggara ',
  'url': 'http://s.wikipedia.org/wiki/West_Nusa_Tenggara'}]
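
To see which documents in a dataset like the sample above would trigger the error, the fields can be scanned before indexing. This is a minimal sketch, assuming each document is a dict with an `_id` key as in the sample. `find_blank_fields` is a hypothetical name, not a Marqo function.

```python
from typing import Any, Dict, List, Tuple


def find_blank_fields(docs: List[Dict[str, Any]]) -> List[Tuple[Any, str]]:
    """List (_id, field name) pairs whose value is an empty or whitespace-only string."""
    hits = []
    for doc in docs:
        for key, value in doc.items():
            if isinstance(value, str) and value.strip() == "":
                hits.append((doc.get("_id"), key))
    return hits


# Abbreviated versions of two of the sample documents.
sample = [
    {"_id": "5589", "content_1": "Other websites ", "content_2": "\r\r"},
    {"_id": "6172", "content_1": "Births \rMarch 5", "content_2": "\r\r\r"},
]
blank = find_blank_fields(sample)
```

For the two abbreviated documents, `blank` is `[('5589', 'content_2'), ('6172', 'content_2')]`. In the full sample, `content_2` is the affected field in every document.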