load INSTRUCTOR_Transformer max_seq_length 512

Question

Closed this issue a month ago · 22 comments

Error when loading a PDF (either by drag and drop or by browsing for the file):
"load INSTRUCTOR_Transformer
max_seq_length 512........"

Answer 1 · 2024-03-31T12:35:13.000Z

This text is expected to appear. Is there an actual error message? If not, what exactly is the problem?

Answer 2 · 2024-03-31T17:37:56.000Z

I hit a bug in chroma-hnswlib, which chromadb depends on. Downgrading the chromadb package to 0.4.2 did the trick.

Answer 3 · 2024-04-01T08:43:34.000Z

Could you elaborate on what the actual bug was? Did it just not index the pdf files or did you get an error message?
This might help someone else.

Answer 4 · 2024-04-03T10:51:15.000Z

I had a bug in chroma-hnswlib, which chromadb depends on; I downgraded the chromadb package to 0.4.2 and it did the trick.

Explain it better please

Answer 5 · 2024-04-18T22:21:59.000Z

Yes, there are a few bugs with the PDF feature:

1. Reinstall sentence-transformers (pip install sentence-transformers==2.2.2).
2. Downgrade chromadb to 0.4.3 (pip install chromadb==0.4.3).
3. There is a bug in the code that I am not sure how to fix: @Leon-Sander
   - When you upload a PDF file, it gets processed and everything looks OK.
   - Once you type a question/prompt and press Enter, the PDF you uploaded disappears, which means you can NOT chat with it. Also, the PDF toggle above is set to true after you click the send button.
   If anyone fixes this bug, ping me :))

Use case:
If the user toggles PDF chat on, they can chat with the hover.pdf that ships with the application.
If the user uploads a PDF, it should stay in the PDF session so they can chat with it.
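
Before reinstalling anything, it may help to see which versions are actually installed. The following is a minimal sketch using only the standard library; the pinned versions are simply the ones suggested in this comment, not official requirements, and `check_versions` and `SUGGESTED` are hypothetical names made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions suggested in this thread; they are NOT official requirements.
SUGGESTED = {"sentence-transformers": "2.2.2", "chromadb": "0.4.3"}


def check_versions(suggested, get_version=version):
    """Return {package: (installed_or_None, suggested)} for every mismatch."""
    mismatches = {}
    for package, wanted in suggested.items():
        try:
            installed = get_version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[package] = (installed, wanted)
    return mismatches


if __name__ == "__main__":
    for package, (installed, wanted) in check_versions(SUGGESTED).items():
        print(f"{package}: installed={installed}, suggested={wanted}")
```

If the script prints nothing, both packages already match the versions mentioned above.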

Answer 6 · 2024-04-18T22:43:36.000Z

joaquindev23 commented 6 months ago

Answer 7 · 2024-04-18T22:44:00.000Z

joaquindev23 commented 6 months ago

Answer 8 · 2024-04-19T08:48:24.000Z

@B7-9414

3. There is a bug in the code that I am not sure how to fix: @Leon-Sander
   - When you upload a PDF file, it gets processed and everything looks OK.
   - Once you type a question/prompt and press Enter, the PDF you uploaded disappears, which means you can NOT chat with it. Also, the PDF toggle above is set to true after you click the send button.
   If anyone fixes this bug, ping me :))

This is intended behavior, not a bug. After you upload the PDF, it gets ingested into the vectordb, so there is no reason for the PDF to remain in the uploader. If it stayed there, it would trigger the upload function again and again, populating your vectordb with the same document over and over. And since you just uploaded a PDF, you presumably want to chat with it, so the PDF toggle is set to true.
All PDFs you have uploaded over time are inside the vectordb, and you can chat with them at any time as long as the PDF toggle is true.
If the PDF toggle is false, you chat with the model without PDF access.
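
The flow described above can be sketched roughly as follows. This is not the repository's actual code: `Session` and `handle_rerun` are hypothetical stand-ins for the UI state and the upload handler, and the vectordb is modelled as a plain list. The sketch only illustrates why the uploaded file has to be cleared after ingestion.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Session:
    """Hypothetical stand-in for the app's per-user UI state."""
    pending_upload: Optional[str] = None  # file currently shown in the uploader
    pdf_chat: bool = False                # state of the PDF chat toggle
    vectordb: List[str] = field(default_factory=list)  # ingested documents


def handle_rerun(session: Session) -> None:
    """Called on every UI rerun; ingests a pending upload exactly once."""
    if session.pending_upload is not None:
        session.vectordb.append(session.pending_upload)  # ingest the document
        session.pending_upload = None  # clear the uploader so it is not re-ingested
        session.pdf_chat = True        # the user presumably wants to chat with it
```

Without the line that resets `pending_upload`, every rerun (for example, each sent message) would append the same document again.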

Answer 9 · 2024-04-19T08:52:53.000Z

Solution to @joaquindev23 in #23

Answer 10 · 2024-04-19T14:35:33.000Z

@Leon-Sander
I have tested this: the PDF does not get ingested into the vectordb when you upload it. Instead, you can drop it manually into the pdfs folder and then set the toggle to on/True.

Also, if you toggle this off, you can still chat with the PDF, even after you hit clear cache and refresh the page. We should clear it from the Streamlit UI settings.

Answer 11 · 2024-04-19T14:55:19.000Z

@B7-9414
In my case it gets ingested into the vectordb; that's how I designed the code. Also, dropping the PDF into the folder does nothing, since I didn't program the app to read the pdfs folder while running.

also, if you toggle this off still you can chat with the pdf

I just changed the code, so that toggling the pdf chat off also clears the cache, now it should work.
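
To illustrate why a stale cache can keep the PDF chain alive after the toggle is switched off, here is a minimal sketch. It is an assumption-laden toy, not the change that was actually committed: `ChatApp`, `set_pdf_chat`, and the chain names are hypothetical. The comment above only says that toggling off clears the cache; clearing on every toggle change is an extra assumption of this sketch.

```python
class ChatApp:
    """Hypothetical model of a cached chat chain plus a PDF toggle."""

    def __init__(self):
        self.pdf_chat = False
        self._chain_cache = None  # whichever chain was built last

    def _build_chain(self):
        # Stand-in for the expensive chain construction.
        return "pdf_chain" if self.pdf_chat else "plain_chain"

    def get_chain(self):
        # Reuse the cached chain if there is one.
        if self._chain_cache is None:
            self._chain_cache = self._build_chain()
        return self._chain_cache

    def set_pdf_chat(self, enabled, clear_cache=True):
        changed = enabled != self.pdf_chat
        self.pdf_chat = enabled
        if changed and clear_cache:
            self._chain_cache = None  # force a rebuild on the next question
```

With `clear_cache=False` the app keeps answering with the PDF chain after the toggle is turned off, which matches the behavior reported above.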

Answer 12 · 2024-04-19T19:48:59.000Z

@Leon-Sander
As you can see, when I upload a PDF file containing Bassam's info, the toggle is set to on/True, and the Bassam PDF I uploaded disappears once I send the question. Meanwhile, I have a healthcare PDF file in the pdfs folder, as shown.

When I ask a question about the Bassam PDF that I uploaded, I get no answer, because it is NOT in the vector db.
When I ask healthcare-related questions about the PDF that I have locally in the pdfs folder, I get a valid answer, which means it is stored in the vector db.

Answer 13 · 2024-04-19T20:15:10.000Z

@B7-9414 It seems that you made some changes to the code, maybe that's the reason.

Without seeing your code version or a video demonstration, I can't understand your problem, since on my end it works exactly as intended.

Answer 14 · 2024-04-19T20:19:50.000Z

@Leon-Sander
I re-cloned the main branch without changing anything, and I still have the same issue :(( I tried to fix it but could not. Please test it on your end! Thanks.

Answer 15 · 2024-04-19T20:27:49.000Z

@B7-9414 I just cloned the repo and tested it again. Here is the command line output when I drag a PDF into the Streamlit frontend; as you can see, the adding worked. The PDF contained my resume, so it is information the model could not have known beforehand.
Then I asked a question, and the pdf chain got loaded and answered my question with contents from the pdf.
Works like a charm, no problems.

Answer 16 · 2024-04-19T20:32:24.000Z

@Leon-Sander
Yes, I get the same thing, but when you ask questions about the PDF you uploaded, you won't get a valid answer. We may need a separate chain to handle that, rather than a single one.

Then try adding it locally to the pdfs folder and toggling PDF chat on; you will get a valid answer.

Answer 17 · 2024-04-19T20:34:58.000Z

@B7-9414 The problem might be the way your pdf files are structured. I provided the hover pdf in this repository, please add it in the frontend and then ask "what is the HoVer dataset about?". If you get an answer, it worked, and your problem probably lies in your pdf files.

Answer 18 · 2024-04-22T15:52:15.000Z

It gives me this error when returning an answer to my chat.

Answer 19 · 2024-04-22T15:53:22.000Z

These run fine, though; I am getting all the print output.

Answer 20 · 2024-04-22T20:08:33.000Z

Hi Leon, I am still learning so forgive me if this is a silly question but is the vectordb stored in the vram or is it stored on the drive and loaded based on the context of the question? I assume the former but I can see the VRAM getting maxed out if the ingested data is a library vs a smaller set of files and was hoping there was a way to implement the latter.

What are your thoughts on mass ingested file storage?

Answer 21 · 2024-04-29T15:57:49.000Z

@JTMarsh556 vectordb is stored on the drive, and document search is performed in RAM. If you have enough RAM you can have a vectordb with a large amount of text ingested.

VRAM is used if the models are loaded on GPU.
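
As a rough illustration of "stored on the drive, searched in RAM", here is a toy sketch using only the standard library. It is not how the vectordb in this project works internally: the JSON file format, the function names, and the brute-force cosine search are all assumptions made for illustration. A real vector store would typically use an approximate index rather than scanning every record.

```python
import json
import math
import os
import tempfile


def save_index(path, records):
    """Persist records of the form {"text": ..., "vector": [...]} to disk."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)


def load_index(path):
    """Read the whole index from disk into RAM."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query_vec, k=1):
    """Brute-force search over the in-memory index; returns the top-k texts."""
    ranked = sorted(index, key=lambda r: cosine(r["vector"], query_vec), reverse=True)
    return [r["text"] for r in ranked[:k]]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "index.json")
    save_index(path, [{"text": "hover", "vector": [1.0, 0.0]},
                      {"text": "resume", "vector": [0.0, 1.0]}])
    print(search(load_index(path), [0.9, 0.1]))
```

In this toy, memory use grows with the number of stored vectors because the whole file is loaded at once, and no GPU memory is involved in the search.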

Answer 22 · 2024-04-29T16:01:58.000Z

Thank you. That clarified things for me. I appreciate you taking the time to explain that to me.