data privacy
Issue you'd like to raise.
Will this project leak any data from local documents?
Not at all. Once the installation is done, it does not require an internet connection at all. It's as air-gapped as it gets.
If needed, I could package an installation that runs from a USB drive, like Tails OS, so that no connection is ever needed in the first place.
If any questions remain, feel free to respond here and I'll reopen the issue.
The test folder in this project already contains more than 10 different file types, each of which can be ingested within milliseconds thanks to multithreaded ingestion. We beat PrivateGPT on performance. We also chose Qdrant, which should be considerably faster for MMR (maximal marginal relevance) search.
This repo does not have "GPT" in its name, so people less familiar with the field tend to skip over it.
Either use Docker or follow the installation from source, which also takes about a minute. I can't offer any guidance for conda. I would recommend installing from source.
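For readers unfamiliar with MMR: it re-ranks search hits so that each new result is relevant to the query but not a near-duplicate of results already chosen. The following is a minimal pure-Python sketch of the general technique, not Qdrant's or this project's implementation; the names `cosine`, `mmr`, and `lambda_mult` are chosen here for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query, docs, k=2, lambda_mult=0.5):
    """Return indices of up to k docs chosen by maximal marginal relevance.

    lambda_mult=1.0 ranks purely by relevance to the query;
    lower values penalise similarity to already selected docs more strongly.
    """
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query, docs[i])
            # Highest similarity to anything already picked (0.0 for the first pick).
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1.0 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    query = [1.0, 0.0]
    docs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
    print(mmr(query, docs, k=2, lambda_mult=1.0))  # relevance only
    print(mmr(query, docs, k=2, lambda_mult=0.3))  # favours diversity
```

With `lambda_mult=1.0` the two most similar documents win; lowering it trades relevance for diversity in the returned set.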
Thanks for the info. I want a local GPT that I can feed my own documents, without any outside API or internet access, to keep the data safe.
I am seeing the error below. Can you please help?
(casalioy-py3.11) root@47c87b8ed509:/srv/CASALIOY# python3.11 casalioy/ingest.py
found local model dir at models/sentence-transformers/all-MiniLM-L6-v2
found local model file at models/eachadea/ggml-vicuna-7b-1.1/ggml-vic7b-q5_1.bin
Delete current database?(Y/N): Y
Deleting db...
Scanning files
found local model dir at models/sentence-transformers/all-MiniLM-L6-v2 ] 0/ 8 eta [?:??:??]
found local model file at models/eachadea/ggml-vicuna-7b-1.1/ggml-vic7b-q5_1.bin
found local model dir at models/sentence-transformers/all-MiniLM-L6-v2
found local model file at models/eachadea/ggml-vicuna-7b-1.1/ggml-vic7b-q5_1.bin
found local model dir at models/sentence-transformers/all-MiniLM-L6-v2
found local model file at models/eachadea/ggml-vicuna-7b-1.1/ggml-vic7b-q5_1.bin
[nltk_data] Downloading package punkt to /root/nltk_data... ] 2/ 8 eta [00 14
[nltk_data] Downloading package punkt to /root/nltk_data...=====================================> 4 05
[nltk_data] Downloading package punkt to /root/nltk_data... 6
[nltk_data] Unzipping tokenizers/punkt.zip. 7
[nltk_data] Error with downloaded zip file
50.0% [=======================================================================================> ] 4/ 8 eta [00:00]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/srv/CASALIOY/casalioy/ingest.py", line 125, in process_one_doc
document = self.load_one_doc(filepath)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/CASALIOY/casalioy/ingest.py", line 74, in load_one_doc
return self.file_loaders[filepath.suffix[1:]](str(filepath)).load()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py", line 70, in load
elements = self._get_elements()
^^^^^^^^^^^^^^^^^^^^
File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/langchain/document_loaders/markdown.py", line 25, in _get_elements
return partition_md(filename=self.file_path, **self.unstructured_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/partition/md.py", line 52, in partition_md
return partition_html(
^^^^^^^^^^^^^^^
File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/partition/html.py", line 91, in partition_html
layout_elements = document_to_element_list(document, include_page_breaks=include_page_breaks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/partition/common.py", line 73, in document_to_element_list
num_pages = len(document.pages)
^^^^^^^^^^^^^^
File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/documents/xml.py", line 52, in pages
self._pages = self._read()
^^^^^^^^^^^^
File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/documents/html.py", line 116, in _read
element = _parse_tag(tag_elem)
^^^^^^^^^^^^^^^^^^^^
File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/documents/html.py", line 222, in _parse_tag
return _text_to_element(text, tag_elem.tag, ancestortags)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/documents/html.py", line 237, in _text_to_element
elif is_narrative_tag(text, tag):
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/documents/html.py", line 265, in is_narrative_tag
return tag not in HEADING_TAGS and is_possible_narrative_text(text)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 76, in is_possible_narrative_text
if exceeds_cap_ratio(text, threshold=cap_threshold):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 273, in exceeds_cap_ratio
if sentence_count(text, 3) > 1:
^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 222, in sentence_count
sentences = sent_tokenize(text)
^^^^^^^^^^^^^^^^^^^
File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 38, in sent_tokenize
return _sent_tokenize(text)
^^^^^^^^^^^^^^^^^^^^
File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 106, in sent_tokenize
tokenizer = load(f"tokenizers/punkt/{language}.pickle")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/nltk/data.py", line 750, in load
opened_resource = _open(resource_url)
^^^^^^^^^^^^^^^^^^^
File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/nltk/data.py", line 876, in _open
return find(path_, path + [""]).open()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/nltk/data.py", line 583, in find
raise LookupError(resource_not_found)
LookupError:
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
import nltk
nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/root/nltk_data'
- '/srv/CASALIOY/.venv/nltk_data'
- '/srv/CASALIOY/.venv/share/nltk_data'
- '/srv/CASALIOY/.venv/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/srv/CASALIOY/casalioy/ingest.py", line 170, in <module>
main(sources_directory, cleandb)
File "/srv/CASALIOY/casalioy/ingest.py", line 164, in main
ingester.ingest_from_directory(sources_directory, chunk_size, chunk_overlap)
File "/srv/CASALIOY/casalioy/ingest.py", line 144, in ingest_from_directory
for embeddings in pb(pool.imap_unordered(self.process_one_doc, all_items), total=len(all_items)):
File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/prompt_toolkit/shortcuts/progress_bar/base.py", line 353, in __iter__
for item in self.data:
File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 873, in next
raise value
LookupError:
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
import nltk
nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/root/nltk_data'
- '/srv/CASALIOY/.venv/nltk_data'
- '/srv/CASALIOY/.venv/share/nltk_data'
- '/srv/CASALIOY/.venv/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
(casalioy-py3.11) root@47c87b8ed509:/srv/CASALIOY# /usr/local/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
(casalioy-py3.11) root@47c87b8ed509:/srv/CASALIOY#
'punkt' is not available or was not downloaded in your environment. Did you try starting python3 and then running import nltk followed by nltk.download('punkt')?
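The log above shows "Downloading package punkt" printed several times, followed by "Error with downloaded zip file", which would be consistent with several worker processes trying to download the same archive at once (this is a guess from the log, not a confirmed diagnosis). A minimal sketch of one way around that is to check for the resource and download it once, in the main process, before ingestion starts. The helper name `ensure_punkt` and its injectable `find`/`download` parameters are hypothetical; `nltk.data.find` and `nltk.download` are the real NLTK calls it defaults to.

```python
def ensure_punkt(find=None, download=None):
    """Return True if the NLTK 'punkt' tokenizer is available locally.

    Downloads it once if it is missing. `find` and `download` default to
    nltk.data.find and nltk.download; they are parameters only so the
    helper can be exercised without network access.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the helper can be tested with fakes

        find = find or nltk.data.find
        download = download or nltk.download
    try:
        # Raises LookupError when the resource is not in any search path.
        find("tokenizers/punkt")
        return True
    except LookupError:
        # nltk.download returns True on success and False on failure.
        return bool(download("punkt"))


if __name__ == "__main__":
    # Run this once, with internet access, before starting ingest.py.
    print("punkt available:", ensure_punkt())
```

Once the download has succeeded, the data lives in one of the directories listed under "Searched in" (for example /root/nltk_data), so later runs should not need a connection for it.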
As I said, you should use the installation from source. Docker may cause issues on anything other than Windows 11H2; on Ubuntu it is unstable.
To install, please refer to the README:
git clone https://github.com/su77ungr/CASALIOY && cd CASALIOY/
python -m pip install poetry
python -m poetry config virtualenvs.in-project true
python -m poetry install
. .venv/bin/activate
python -m pip install --force streamlit sentence_transformers # Temporary bandaid fix, waiting for streamlit >=1.23
pre-commit install
Urgent:
Is Docker bypassing the UFW firewall on my Ubuntu host? Docker is able to access the internet. Can you please help me understand why? I want to block internet access after downloading the Docker images.
If I put documents in the source directory inside Docker, can they be reached from the internet? If the container has internet access, can the source documents be accessed from outside?
Just install it on a VM and disable the internet connection. DON'T USE DOCKER if you don't know how to firewall it or how to use it.
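Whichever route is taken (a VM with networking disabled, or a container started with Docker's `--network none` option), it can be worth verifying from inside the environment that outbound connections really fail. The following is a minimal sketch using only the Python standard library; the function name `can_reach` and the default probe target (1.1.1.1 on port 53) are assumptions made for illustration, and any address you expect to be reachable from an online machine would do.

```python
import socket


def can_reach(host="1.1.1.1", port=53, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        return False


if __name__ == "__main__":
    if can_reach():
        print("WARNING: outbound network access is available")
    else:
        print("No outbound connection could be made")
```

A single failed probe does not prove full isolation (other hosts, ports, or protocols may still be open), so treat it as a quick sanity check rather than an audit.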