jina-ai/examples

Wikipedia: ConnectionClosedError('code = 1006 (connection closed abnormally [internal]), no reason')

alexcg1 opened this issue · 12 comments

I'm testing the Wikipedia example with different datasets. Each time I:

  • Delete workspace
  • Create a new folder like data.products containing data.txt
  • ln -s data.products data
  • Set path in app.py to data/data.txt
  • Run python app.py -t index
    It worked the first few times. But now, after indexing 100 Documents (out of ~3000), I'm getting this error: `Got following error while streaming requests via websocket: ConnectionClosedError('code = 1006 (connection closed abnormally [internal]), no reason')`
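The preparation steps above can be scripted so that every run starts from the same state. This is only a minimal sketch and not part of the examples repo: `prepare_dataset` is a hypothetical helper, and it assumes the workspace directory is called `workspace` and sits next to `app.py`. The final `python app.py -t index` step is left to be run by hand.

```python
import os
import shutil
import tempfile
from pathlib import Path


def prepare_dataset(root: Path, dataset_dir: str, source_file: Path) -> Path:
    """Hypothetical helper mirroring the manual steps; not part of jina-ai/examples."""
    # Step 1: delete the old workspace (the name "workspace" is an assumption).
    shutil.rmtree(root / "workspace", ignore_errors=True)

    # Step 2: create a folder such as data.products containing data.txt.
    target = root / dataset_dir
    target.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source_file, target / "data.txt")

    # Step 3: point the "data" symlink at the new folder (ln -s data.products data).
    link = root / "data"
    if link.is_symlink():
        link.unlink()
    os.symlink(dataset_dir, link, target_is_directory=True)

    # Step 4: this is the path that app.py is then pointed at.
    return link / "data.txt"


# Dry run in a throwaway directory so a real checkout is never touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "workspace").mkdir()
    sample = root / "sample.txt"
    sample.write_text("toy product one\ntoy product two\n")
    data_path = prepare_dataset(root, "data.products", sample)
    line_count = len(data_path.read_text().splitlines())
    print(line_count)
```

Replacing the symlink on each call, rather than failing when `data` already exists, makes it easy to swap datasets between runs.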

I've tried with several different datasets and get the same thing each time. The only thing I did was swap out the dataset.

I mean, I'm pretty sure that if I cloned the examples repo from scratch and created a new venv this problem would go away. But I thought it was important to bring it up.

Sample dataset I used attached (adapted from https://www.kaggle.com/PromptCloudHQ/toy-products-on-amazon)
data.txt

Using

  • Jina 1.1.0
  • Python 3.8.8
  • Manjaro 21

So I re-cloned the repo, started with a clean venv, and everything worked okay after that. Closing for now

Correction: Error popped up again.

It worked fine indexing the Amazon toy dataset with JINA_MAX_DOCS unset (thus indexing only 50, as specified in app.py). But using export JINA_MAX_DOCS=30000 caused the error.
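To make the two cases concrete: the behaviour described here is consistent with app.py reading the limit from the environment and falling back to 50. Below is a minimal sketch of that pattern. `max_docs` is a hypothetical helper, and the real app.py may read the variable differently.

```python
import os
from typing import Mapping


def max_docs(env: Mapping[str, str] = os.environ, default: int = 50) -> int:
    """Hypothetical helper: read JINA_MAX_DOCS, falling back to a default of 50."""
    raw = env.get("JINA_MAX_DOCS")
    return int(raw) if raw else default


# Variable unset: the default of 50 applies (the run that worked).
unset_limit = max_docs({})

# Equivalent to `export JINA_MAX_DOCS=30000` (the run that hit the error).
export_limit = max_docs({"JINA_MAX_DOCS": "30000"})

print(unset_limit, export_limit)
```

Passing the environment in as a mapping keeps the sketch from modifying the real process environment.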

I'll try changing:

  restful: True

to

  restful: False

in flows/index.yml to see if that fixes things as @deepankarm suggested
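That one-line change could also be scripted when switching between the two modes repeatedly. This is a rough sketch, assuming the key sits on its own line as `restful: True`. `set_restful` is a hypothetical helper, and the YAML string is a small stand-in, not the real contents of flows/index.yml.

```python
import re

# Matches a line such as "  restful: True", keeping indentation and any trailing comment.
RESTFUL_LINE = re.compile(r"^([ \t]*restful:[ \t]*)(\S+)(.*)$", flags=re.MULTILINE)


def set_restful(yaml_text: str, enabled: bool) -> str:
    """Hypothetical helper: rewrite the value of every `restful:` line."""
    return RESTFUL_LINE.sub(
        lambda m: f"{m.group(1)}{enabled}{m.group(3)}", yaml_text
    )


# Stand-in snippet (assumed layout); a real file would be read from flows/index.yml.
sample = "with:\n  restful: True\n"
patched = set_restful(sample, False)
print(patched)
```

A plain text substitution is used instead of a YAML parser so that comments and custom tags in the file are left exactly as they were.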

After doing this I'm now at 300 Documents indexed and it's proceeding smoothly. I'll update as I go along.

I suggest someone tests RESTful indexing using the full Wikipedia dataset (check the README for howto), with JINA_MAX_DOCS set pretty high. That way we can see if it's the dataset itself (it shouldn't be, since the Docker image is pre-indexed with 30k docs) or if the indexing is choking after a certain number - @rutujasurve94 ?

@alexcg1 @rutujasurve94 With restful: true and f.index(...), we use websockets for streaming. This is not highly used/tested (as the frontend still doesn't stream requests to Jina). If you face issues with it, feel free to create issues in core.

Hey @alexcg1

If the Core issue #2343 is closed, does that mean this issue is closed? I'm in a burn & clean up mood this morning.

Alas no, @FionnD. This issue is about index_restful crapping out after 100 or so Documents and Jina crashing. Core issue #2343 was a whole other bug, where Jina kept spitting out terminal output even after the Flow said it was complete.

ok... let me see if I can get someone to try and reproduce it

I had it occur multiple times, in multiple virtual environments, multiple Python versions, multiple datasets, from multiple clones of repo (to ensure I hadn't accidentally polluted with my own prior changes)

I've fixed the tests and code in #559. Except for query_restful, all the other CLI arguments are tested during CI. This should be fixed now.

Confirmed it works with 1,000 docs. Trying now with more just in case

3,000 docs works. Tested on AWS ec2

Please close if you're happy it works :)