Future-House/paper-qa

Indexing 50 files reaching semantic scholar rate limit

Opened this issue · 4 comments

I tried to index a folder of 50 files and hit the Semantic Scholar rate limit. The indexing aborted, and rerunning the command starts indexing again from the beginning.

I wonder if there is an easy way to slow down the indexing to avoid the rate limit. Or is it possible to resume from a checkpoint rather than restarting from scratch?

Hey @sanxing-chen! I'm here to help you out with any bugs, questions, or contributions you have. Let's tackle this indexing issue together!

To address the rate limit issue and avoid restarting the indexing process from scratch, you can adjust the rate of requests and utilize the checkpointing mechanism.

  1. Adjusting the Rate of Requests:
    You can control the rate of requests by modifying the max_concurrent_requests field in the AnswerSettings class located in the paperqa/settings.py file. This field allows you to set the maximum number of concurrent requests to LLMs.

    class AnswerSettings(BaseModel):
        model_config = ConfigDict(extra="forbid")
    
        # Other fields...
    
        max_concurrent_requests: int = Field(
            default=4, description="Max concurrent requests to LLMs"
        )

    Additionally, you can adjust the index_concurrency field in the AgentSettings class to control the number of concurrent filesystem reads for indexing; a combined example of both settings is sketched further below.

    class AgentSettings(BaseModel):
        model_config = ConfigDict(extra="forbid")
    
        # Other fields...
    
        index_concurrency: int = Field(
            default=30,
            description="Number of concurrent filesystem reads for indexing",
        )

    [1]

  2. Checkpointing Mechanism:
    The SearchIndex class has a method called save_index which saves the current state of the index files. This method is called at various points in the indexing process to ensure progress is saved. For example, the process_file function calls save_index after processing each file:

    await search_index.save_index()

    This allows the indexing process to resume from the last saved state rather than starting from scratch.

    async def save_index(self) -> None:
        file_index_path = await self.file_index_filename
        async with await anyio.open_file(file_index_path, "wb") as f:
            await f.write(zlib.compress(pickle.dumps(await self.index_files)))

    [2][3]
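
For reference, the reverse of save_index (reading the compressed pickle back so a rerun can skip files that were already processed) could look roughly like the sketch below. Note that load_index_files is an illustrative name, not paper-qa's actual API; the real resume logic lives inside SearchIndex itself.

    import pickle
    import zlib

    import anyio

    async def load_index_files(file_index_path) -> dict:
        # Hypothetical counterpart to save_index: restore the saved file map so a
        # rerun can compare it against the files on disk and skip anything that
        # was already indexed.
        async with await anyio.open_file(file_index_path, "rb") as f:
            return pickle.loads(zlib.decompress(await f.read()))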

By adjusting these settings and utilizing the checkpointing mechanism, you can slow down the indexing process to avoid hitting rate limits and resume from the last checkpoint if the process is interrupted.
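
If you want to set both values programmatically, a minimal sketch could look like the following. It assumes Settings, AnswerSettings, and AgentSettings are importable from paperqa.settings and that Settings exposes them under answer and agent fields; check your installed version, as the exact layout may differ.

    from paperqa.settings import AgentSettings, AnswerSettings, Settings

    # Lower both concurrency knobs before building the index; pass this settings
    # object to whichever paper-qa entry point you already use for indexing.
    settings = Settings(
        answer=AnswerSettings(max_concurrent_requests=1),
        agent=AgentSettings(index_concurrency=1),
    )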

To continue talking to Dosu, mention @dosu.

Try

pqa --agent.index_concurrency=1 ...

That will slow down the indexing process. @mskarlin may have an answer for more direct control of the Semantic Scholar (S2) rate limits, though.

If I have an entire Zotero library stored locally (with zotero.sqlite), can't I skip the Semantic Scholar API?

@swairshah I am also trying to figure this out. Let me know if you have any luck! I'll add anything I learn too.