PDF failed to process
jschulman opened this issue · 3 comments
In LibreChat (pulled from git this morning, with updated .env and librechat.yml files), I attach a PDF and submit a prompt. I get the error "An error occurred while processing your request." These are the log files:
rag_api | 2024-05-19 01:58:23,615 - root - INFO - Request POST http://rag_api:8000/embed - 200
rag_api | 2024-05-19 01:58:32,233 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
rag_api | 2024-05-19 01:58:32,353 - root - ERROR - list index out of range
rag_api | 2024-05-19 01:58:32,353 - root - INFO - Request POST http://rag_api:8000/query - 500
LibreChat | 2024-05-19 01:58:32 error: Error creating context: Request failed with status code 500
LibreChat | 2024-05-19 01:58:32 error: [handleAbortError] AI response error; aborting request: Request failed with status code 500
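For context, "list index out of range" is Python's IndexError message; at the /query route it is the kind of failure you get when code takes the first retrieval result without checking that anything came back. A minimal illustrative sketch of that failure mode, not the actual rag_api code:

```python
# Illustrative only -- not the actual rag_api implementation.
def build_context(results):
    """Build a prompt context from (document, score) pairs."""
    # If the vector store returns nothing, results == [] and the next
    # line raises IndexError("list index out of range"), which the API
    # then reports as a 500.
    top_doc, top_score = results[0]
    return top_doc


build_context([])  # IndexError: list index out of range
```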
Can you set DEBUG_RAG_API=True in your .env file and see if you can recreate the error?
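For reference, that's a single line in the project's .env; depending on your setup, the rag_api container may need to be recreated to pick up the change:

```
# .env
DEBUG_RAG_API=True
```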
I've run a wide variety of PDFs through it, and there is something unique about this particular PDF that it doesn't like. Debug logs are below. Here is the PDF metadata:
_kMDItemDisplayNameWithExtensions = "name.pdf"
com_apple_metadata_modtime = 735652485
kMDItemContentCreationDate = 2024-04-24 11:54:45 +0000
kMDItemContentCreationDate_Ranking = 2024-05-15 00:00:00 +0000
kMDItemContentModificationDate = 2024-04-24 11:54:45 +0000
kMDItemContentType = "com.adobe.pdf"
kMDItemContentTypeTree = (
"com.adobe.pdf",
"public.data",
"public.item",
"public.composite-content",
"public.content"
)
kMDItemDateAdded = 2024-05-15 04:06:30 +0000
kMDItemDisplayName = "name.pdf"
kMDItemDocumentIdentifier = 415425
kMDItemFSContentChangeDate = 2024-04-24 11:54:45 +0000
kMDItemFSCreationDate = 2024-04-24 11:54:45 +0000
kMDItemFSCreatorCode = ""
kMDItemFSFinderFlags = 0
kMDItemFSHasCustomIcon = (null)
kMDItemFSInvisible = 0
kMDItemFSIsExtensionHidden = 0
kMDItemFSIsStationery = (null)
kMDItemFSLabel = 0
kMDItemFSName = "name.pdf"
kMDItemFSNodeCount = (null)
kMDItemFSOwnerGroupID = 20
kMDItemFSOwnerUserID = 501
kMDItemFSSize = 488604
kMDItemFSTypeCode = ""
kMDItemInterestingDate_Ranking = 2024-05-18 00:00:00 +0000
kMDItemKind = "PDF document"
kMDItemLastUsedDate = 2024-05-18 17:52:12 +0000
kMDItemLastUsedDate_Ranking = 2024-05-18 00:00:00 +0000
kMDItemLogicalSize = 488604
kMDItemPhysicalSize = 488604
kMDItemUseCount = 9
kMDItemUsedDates = (
"2024-05-12 05:00:00 +0000",
"2024-05-18 05:00:00 +0000"
)
LOGS:
rag_api | 2024-05-19 20:00:47,000 - root - DEBUG - /query - {'id': 'x', 'username': 'x', 'provider': 'local', 'email': 'x', 'iat': x, 'exp': x}
rag_api | 2024-05-19 20:00:47,032 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): openaipublic.blob.core.windows.net:443
rag_api | 2024-05-19 20:00:47,307 - urllib3.connectionpool - DEBUG - https://openaipublic.blob.core.windows.net:443 "GET /encodings/cl100k_base.tiktoken HTTP/1.1" 200 1681126
rag_api | 2024-05-19 20:00:47,839 - openai._base_client - DEBUG - Request options: {'method': 'post', 'url': '/embeddings', 'files': None, 'post_parser': <function Embeddings.create..parser at 0x14fc1c5b4c10>, 'json_data': {'input': [[1264, 5730, 553]], 'model': 'text-embedding-3-small', 'encoding_format': 'base64'}}
rag_api | 2024-05-19 20:00:48,061 - openai._base_client - DEBUG - Sending HTTP Request: POST https://api.openai.com/v1/embeddings
rag_api | 2024-05-19 20:00:48,062 - httpcore.connection - DEBUG - connect_tcp.started host='api.openai.com' port=443 local_address=None timeout=None socket_options=None
rag_api | 2024-05-19 20:00:48,360 - httpcore.connection - DEBUG - connect_tcp.complete return_value=<httpcore._backends.sync.SyncStream object at 0x14fc198fbb50>
rag_api | 2024-05-19 20:00:48,360 - httpcore.connection - DEBUG - start_tls.started ssl_context=<ssl.SSLContext object at 0x14fc1c9f3e40> server_hostname='api.openai.com' timeout=None
rag_api | 2024-05-19 20:00:48,381 - httpcore.connection - DEBUG - start_tls.complete return_value=<httpcore._backends.sync.SyncStream object at 0x14fc198fbb80>
rag_api | 2024-05-19 20:00:48,381 - httpcore.http11 - DEBUG - send_request_headers.started request=<Request [b'POST']>
rag_api | 2024-05-19 20:00:48,381 - httpcore.http11 - DEBUG - send_request_headers.complete
rag_api | 2024-05-19 20:00:48,381 - httpcore.http11 - DEBUG - send_request_body.started request=<Request [b'POST']>
rag_api | 2024-05-19 20:00:48,381 - httpcore.http11 - DEBUG - send_request_body.complete
rag_api | 2024-05-19 20:00:48,381 - httpcore.http11 - DEBUG - receive_response_headers.started request=<Request [b'POST']>
rag_api | 2024-05-19 20:00:48,542 - httpcore.http11 - DEBUG - receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Date', b'Sun, 19 May 2024 20:00:48 GMT'), (b'Content-Type', b'application/json'), (b'Transfer-Encoding', b'chunked'), (b'Connection', b'keep-alive'), (b'access-control-allow-origin', b''), (b'openai-model', b'text-embedding-3-small'), (b'openai-organization', b'one37'), (b'openai-processing-ms', b'25'), (b'openai-version', b'2020-10-01'), (b'strict-transport-security', b'max-age=15724800; includeSubDomains'), (b'x-ratelimit-limit-requests', b'5000'), (b'x-ratelimit-limit-tokens', b'5000000'), (b'x-ratelimit-remaining-requests', b'4999'), (b'x-ratelimit-remaining-tokens', b'4999996'), (b'x-ratelimit-reset-requests', b'12ms'), (b'x-ratelimit-reset-tokens', b'0s'), (b'x-request-id', b'req_x'), (b'CF-Cache-Status', b'DYNAMIC'), (b'Set-Cookie', b'__cf_bm=x-1.0.1.1-x; path=/; expires=Sun, 19-May-24 20:30:48 GMT; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), (b'Set-Cookie', b'_cfuvid=x-0.0.1.1-x; path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), (b'Server', b'cloudflare'), (b'CF-RAY', b'8866acde7cb52d4c-ORD'), (b'Content-Encoding', b'gzip'), (b'alt-svc', b'h3=":443"; ma=86400')])
rag_api | 2024-05-19 20:00:48,543 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
rag_api | 2024-05-19 20:00:48,543 - httpcore.http11 - DEBUG - receive_response_body.started request=<Request [b'POST']>
rag_api | 2024-05-19 20:00:48,543 - httpcore.http11 - DEBUG - receive_response_body.complete
rag_api | 2024-05-19 20:00:48,543 - httpcore.http11 - DEBUG - response_closed.started
rag_api | 2024-05-19 20:00:48,543 - httpcore.http11 - DEBUG - response_closed.complete
rag_api | 2024-05-19 20:00:48,544 - openai._base_client - DEBUG - HTTP Response: POST https://api.openai.com/v1/embeddings "200 OK" Headers([('date', 'Sun, 19 May 2024 20:00:48 GMT'), ('content-type', 'application/json'), ('transfer-encoding', 'chunked'), ('connection', 'keep-alive'), ('access-control-allow-origin', ''), ('openai-model', 'text-embedding-3-small'), ('openai-organization', 'x'), ('openai-processing-ms', '25'), ('openai-version', '2020-10-01'), ('strict-transport-security', 'max-age=15724800; includeSubDomains'), ('x-ratelimit-limit-requests', '5000'), ('x-ratelimit-limit-tokens', '5000000'), ('x-ratelimit-remaining-requests', '4999'), ('x-ratelimit-remaining-tokens', '4999996'), ('x-ratelimit-reset-requests', '12ms'), ('x-ratelimit-reset-tokens', '0s'), ('x-request-id', 'req_7eaa45631d94004341818ccd734162c6'), ('cf-cache-status', 'DYNAMIC'), ('set-cookie', '__cf_bm=x-1.0.1.1-x; path=/; expires=Sun, 19-May-24 20:30:48 GMT; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), ('set-cookie', '_cfuvid=x-0.0.1.1-x; path=/; domain=.api.openai.com; HttpOnly; Secure; SameSite=None'), ('server', 'cloudflare'), ('cf-ray', '8866acde7cb52d4c-ORD'), ('content-encoding', 'gzip'), ('alt-svc', 'h3=":443"; ma=86400')])
rag_api | 2024-05-19 20:00:48,544 - openai._base_client - DEBUG - request_id: req_7eaa45631d94004341818ccd734162c6
rag_api | 2024-05-19 20:00:48,557 - root - ERROR - list index out of range
rag_api | 2024-05-19 20:00:48,557 - root - INFO - Request POST http://rag_api:8000/query - 500
LibreChat | 2024-05-19 20:00:48 error: Error creating context: Request failed with status code 500
LibreChat | 2024-05-19 20:00:48 error: [handleAbortError] AI response error; aborting request: Request failed with status code 500
I've "fixed" this issue and it seems that MongoDB Atlas reliably produces it by not returning any results. They are now handled but mongodb integration will have to go through more extensive review.