[BUG] AutoRetriever raises an error on Windows

Question

histmeisah opened this issue 4 months ago · 1 comment

histmeisah commented 4 months ago

Required prerequisites

I have read the documentation https://camel-ai.github.io/camel/camel.html.
I have searched the Issue Tracker and Discussions to confirm that this hasn't already been reported. (+1 or comment there if it has.)
Consider asking first in a Discussion.

What version of camel are you using?

0.1.6.4

System information

Windows 11
Python 3.10
camel 0.1.6.4

Problem description

Bug Report: Invalid Collection Name Generation in AutoRetriever on Windows OS

Demo

I ran this demo on my own Windows PC: https://colab.research.google.com/drive/1qs5zqQ3LrTTaPqa6ykShklmKps8fycmY?usp=sharing

Description

The AutoRetriever class in the CAMEL library is generating invalid collection names when processing URLs, leading to a WinError 123 (The filename, directory name, or volume label syntax is incorrect) when trying to create or access the vector storage.

Steps to Reproduce

Initialize an AutoRetriever instance.
Call the run_vector_retriever method with a list of URLs as the contents parameter.
The method fails when trying to create a Qdrant collection with an invalid name.

Expected Behavior

The _collection_name_generator method should create a valid collection name for any input, including URLs with special characters.

Actual Behavior

The method creates invalid collection names for some URLs, causing the QdrantStorage initialization to fail with a WinError 123.

Error Message

OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'local_data\\collection\\![](https:'
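
The last path component in the error, `![](https:`, appears to be the start of a Markdown image link that ended up being used as a directory name. Windows rejects the characters `< > : " / \ | ? *` (and control characters) in file and directory names, so the `:` alone is enough to trigger WinError 123. A minimal sketch for checking a candidate name follows; the helper `has_windows_invalid_chars` is hypothetical and not part of CAMEL.

```python
# Characters that Windows does not allow in a file or directory name.
WINDOWS_INVALID_CHARS = set('<>:"/\\|?*')


def has_windows_invalid_chars(name: str) -> bool:
    """Return True if `name` contains a character Windows rejects in a path component.

    Hypothetical helper for illustration only; it is not part of CAMEL.
    """
    # Control characters (code points below 32) are rejected as well.
    return any(ch in WINDOWS_INVALID_CHARS or ord(ch) < 32 for ch in name)


# The last path component from the error message above.
print(has_windows_invalid_chars("![](https:"))          # True
print(has_windows_invalid_chars("camel_ai_github_io"))  # False
```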

Proposed Solution

Modify the _collection_name_generator method in the AutoRetriever class to ensure it always produces a valid collection name:

# Module-level imports required by the method
import os
import re
from pathlib import Path
from urllib.parse import urlparse

def _collection_name_generator(self, content: str) -> str:
    parsed_url = urlparse(content)
    is_url = all([parsed_url.scheme, parsed_url.netloc])

    if is_url:
        # Use a stricter character replacement for URLs
        collection_name = re.sub(
            r'[^a-zA-Z0-9]+',
            '_',
            parsed_url.netloc + parsed_url.path
        )
    elif os.path.exists(content):
        # Local file: use the file name without its extension
        collection_name = re.sub(r'[^a-zA-Z0-9]+', '_', Path(content).stem)
    else:
        # Plain text: use the first 30 characters
        collection_name = re.sub(r'[^a-zA-Z0-9]+', '_', content[:30])

    # Ensure the name starts with a letter
    collection_name = re.sub(r'^[^a-zA-Z]+', '', collection_name)

    # Limit length first, then remove trailing underscores,
    # so that truncation cannot leave an underscore at the end
    collection_name = collection_name[:30].strip('_')

    # Use a default name if empty
    if not collection_name:
        collection_name = 'default_collection'

    return collection_name
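
To try the sanitizing logic outside the class, here is a minimal standalone sketch. The free function `make_collection_name` and its `max_len` parameter are hypothetical names for illustration, not CAMEL's API, and the sketch truncates before stripping underscores so the result cannot end in `_`.

```python
import os
import re
from pathlib import Path
from urllib.parse import urlparse


def make_collection_name(content: str, max_len: int = 30) -> str:
    """Build a filesystem-safe name from a URL, a file path, or plain text.

    Hypothetical standalone version of the proposed method.
    """
    parsed = urlparse(content)
    if parsed.scheme and parsed.netloc:
        raw = parsed.netloc + parsed.path   # URL: host plus path
    elif os.path.exists(content):
        raw = Path(content).stem            # local file: name without extension
    else:
        raw = content                       # anything else: the text itself

    # Collapse every run of non-alphanumeric characters into one underscore.
    name = re.sub(r"[^a-zA-Z0-9]+", "_", raw)
    # Drop leading characters until the name starts with a letter.
    name = re.sub(r"^[^a-zA-Z]+", "", name)
    # Truncate, then strip underscores left at either end.
    name = name[:max_len].strip("_")
    return name or "default_collection"


print(make_collection_name("https://camel-ai.github.io/camel/camel.html"))
# camel_ai_github_io_camel_camel
print(make_collection_name("![](https:"))
# https
```

The second call uses the fragment from the error message: it is not a valid URL, so it falls through to the plain-text branch and still yields a name that is safe as a directory name.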

Additionally, add error handling in the run_vector_retriever method:

for content in contents:
    try:
        collection_name = self._collection_name_generator(content)
        print(f"Generated collection name: {collection_name}")  # For debugging
        vector_storage_instance = self._initialize_vector_storage(collection_name)
        # ... rest of the method
    except Exception as e:
        print(f"Error processing content: {content}")
        print(f"Error details: {e}")
        continue  # Skip this content and continue with the next
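
The same skip-and-continue pattern can be sketched independently of CAMEL, using `logging` instead of `print` and collecting the failures so the caller can inspect them afterwards. The names `process_all`, `process_one`, and `fake_process` are hypothetical and only stand in for the per-item work done inside `run_vector_retriever`.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger(__name__)


def process_all(
    contents: Iterable[str],
    process_one: Callable[[str], object],
) -> Tuple[List[object], List[Tuple[str, Exception]]]:
    """Apply `process_one` to each item, skipping items that raise."""
    results: List[object] = []
    failures: List[Tuple[str, Exception]] = []
    for content in contents:
        try:
            results.append(process_one(content))
        except Exception as exc:
            # Record the failure and move on to the next item.
            logger.warning("Skipping %r: %s", content, exc)
            failures.append((content, exc))
    return results, failures


def fake_process(content: str) -> str:
    """Stand-in for the real work: fails on names containing ':'."""
    if ":" in content:
        raise OSError(f"invalid name: {content!r}")
    return content.upper()


results, failures = process_all(["docs", "![](https:"], fake_process)
print(results)                     # ['DOCS']
print([c for c, _ in failures])    # ['![](https:']
```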

Answer 1 · 2024-09-03T11:25:29.000Z

Hey @histmeisah, thanks for the issue! The bug has been fixed in #872. You can try it out in version 0.1.6.7 or later.