Wikipedia: ConnectionClosedError('code = 1006 (connection closed abnormally [internal]), no reason')
alexcg1 opened this issue · 12 comments
I'm testing the Wikipedia example with different datasets. Each time I:
- Delete `workspace`
- Create a new folder like `data.products` containing `data.txt`
- `ln -s data.products data`
- Set path in `app.py` to `data/data.txt`
- Run `python app.py -t index`
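For anyone trying to reproduce this, the dataset-swap steps above could be scripted roughly as follows. This is only a sketch: `swap_dataset` and its parameters are hypothetical names, not part of the example repo, and it assumes `data` is either absent or already a symlink.

```python
import shutil
from pathlib import Path


def swap_dataset(example_dir: Path, dataset_dir: str, source_txt: Path) -> Path:
    """Reset the example folder so that data/data.txt points at a new dataset.

    Hypothetical helper: names and layout follow the steps in this issue,
    not any function shipped with the examples repo.
    """
    # Delete the old index so documents from a previous run are not reused.
    shutil.rmtree(example_dir / "workspace", ignore_errors=True)

    # Create e.g. data.products/ and copy the dataset in as data.txt.
    target = example_dir / dataset_dir
    target.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source_txt, target / "data.txt")

    # Equivalent of `ln -s data.products data`, replacing any previous link.
    # If `data` is a real directory, symlink_to raises FileExistsError.
    link = example_dir / "data"
    if link.is_symlink():
        link.unlink()
    link.symlink_to(dataset_dir, target_is_directory=True)

    return link / "data.txt"
```

After calling it, `python app.py -t index` still has to be run from the example folder as before.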
It worked the first few times. But now, after indexing 100 Documents (out of ~3000), I'm getting this error:
Got following error while streaming requests via websocket: ConnectionClosedError('code = 1006 (connection closed abnormally [internal]), no reason')
I've tried with several different datasets, same thing each time. Only thing I did was swap out dataset.
I mean, I'm pretty sure if I clone examples repo from scratch and create new venv this problem would go away. But thought important to bring it up.
Sample dataset I used attached (adapted from https://www.kaggle.com/PromptCloudHQ/toy-products-on-amazon)
data.txt
Using
- Jina 1.1.0
- Python 3.8.8
- Manjaro 21
So I re-cloned the repo, started with a clean venv, and everything worked okay after that. Closing for now
Correction: Error popped up again.
It worked fine indexing the Amazon toy dataset with `JINA_MAX_DOCS` unset (thus indexing only 50 as specified in `app.py`). But using `export JINA_MAX_DOCS=30000` caused the error.
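To make the role of `JINA_MAX_DOCS` concrete, here is a minimal sketch of how a document cap like this is typically applied. It assumes `app.py` reads the variable with a default of 50, as described above; `max_docs` and `read_docs` are hypothetical names, not the example's actual code.

```python
import os
from itertools import islice
from typing import Iterable, Iterator, Mapping, Optional


def max_docs(env: Optional[Mapping[str, str]] = None, default: int = 50) -> int:
    """Return the document cap: JINA_MAX_DOCS if set, otherwise the default."""
    env = os.environ if env is None else env
    return int(env.get("JINA_MAX_DOCS", default))


def read_docs(lines: Iterable[str], limit: int) -> Iterator[str]:
    """Yield at most `limit` non-empty lines, one document per line."""
    stripped = (line.strip() for line in lines)
    non_empty = (line for line in stripped if line)
    return islice(non_empty, limit)
```

With the variable unset only the first 50 lines would be sent for indexing, whereas `export JINA_MAX_DOCS=30000` lets the whole ~3000-line file through, which matches the point where the error shows up.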
I'll try changing `restful: True` to `restful: False` in `flows/index.yml` to see if that fixes things, as @deepankarm suggested.
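If the setting needs to be flipped back and forth while testing, a small text-level helper can do it without hand-editing the file. This is a sketch under assumptions: `set_restful` is a hypothetical name, and it simply rewrites any `restful: True`/`restful: False` line rather than parsing the YAML.

```python
import re

# Matches a line such as "  restful: True", keeping the indentation and key.
_RESTFUL_LINE = re.compile(
    r"^([ \t]*restful:[ \t]*)(true|false)\b", re.IGNORECASE | re.MULTILINE
)


def set_restful(yaml_text: str, enabled: bool) -> str:
    """Return yaml_text with every `restful: <bool>` line set to `enabled`."""
    return _RESTFUL_LINE.sub(lambda m: m.group(1) + str(enabled), yaml_text)
```

It would be applied to the contents of `flows/index.yml`, for example by reading the file, calling `set_restful(text, False)`, and writing the result back.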
After doing this I'm now at 300 Documents indexed and it's proceeding smoothly. I'll update as I go along.
I suggest someone tests RESTful indexing using the full Wikipedia dataset (check the README for how-to), with `JINA_MAX_DOCS` set pretty high. That way we can see if it's the dataset itself (it shouldn't be, since the Docker image is pre-indexed with 30k docs) or if the indexing is choking after a certain number - @rutujasurve94 ?
@alexcg1 @rutujasurve94 With `restful: true` and `f.index(...)`, we use websockets for streaming. This is not highly used/tested (as the frontend still doesn't stream requests to Jina). If you face issues with it, feel free to create issues in core.
Hey @alexcg1
If Core issue #2343 is closed, does that mean this issue is closed? I'm in a burn & clean up mood this morning.
Alas no @FionnD. This issue is about `index_restful` crapping out after 100 or so Documents, at which point Jina crashes. Core issue #2343 was a whole other bug where Jina kept spitting out terminal output even after the Flow said it was complete.
ok... let me see if I can get someone to try and reproduce it
I had it occur multiple times, in multiple virtual environments, multiple Python versions, multiple datasets, from multiple clones of repo (to ensure I hadn't accidentally polluted with my own prior changes)
I've fixed the tests and code in #559. Except for `query_restful`, all the other CLI arguments are tested during CI. This should have been fixed.
Confirmed it works with 1,000 docs. Trying now with more just in case
3,000 docs works. Tested on AWS ec2
Please close if you're happy it works :)