Indexing 50 files hits Semantic Scholar rate limit
Opened this issue · 4 comments
Tried to index a folder of 50 files and hit the rate limit. The indexing aborted, and rerunning the command starts over from the beginning.
I wonder if there is an easy way to slow down the indexing so as to avoid the rate limit. Or is it possible to resume from a checkpoint rather than restarting fresh?
Hey @sanxing-chen! I'm here to help you out with any bugs, questions, or contributions you have. Let's tackle this indexing issue together!
To address the rate limit issue and avoid restarting the indexing process from scratch, you can adjust the rate of requests and utilize the checkpointing mechanism.
- **Adjusting the Rate of Requests:**

You can control the rate of requests by modifying the `max_concurrent_requests` field in the `AnswerSettings` class located in the `paperqa/settings.py` file. This field sets the maximum number of concurrent requests to LLMs.

```python
class AnswerSettings(BaseModel):
    model_config = ConfigDict(extra="forbid")

    # Other fields...
    max_concurrent_requests: int = Field(
        default=4, description="Max concurrent requests to LLMs"
    )
```
Additionally, you can adjust the `index_concurrency` field in the `AgentSettings` class to control the number of concurrent filesystem reads during indexing.

```python
class AgentSettings(BaseModel):
    model_config = ConfigDict(extra="forbid")

    # Other fields...
    index_concurrency: int = Field(
        default=30,
        description="Number of concurrent filesystem reads for indexing",
    )
```
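As a rough illustration of what these concurrency settings do under the hood, here is a minimal stdlib-only sketch (not paper-qa's actual code) that caps in-flight work with an `asyncio.Semaphore`:

```python
import asyncio

async def index_file(name: str, sem: asyncio.Semaphore) -> str:
    # The semaphore caps how many files are in flight at once, which is
    # the role settings like index_concurrency / max_concurrent_requests play.
    async with sem:
        await asyncio.sleep(0)  # stand-in for the real network/disk work
        return name

async def index_all(files: list[str], concurrency: int) -> list[str]:
    sem = asyncio.Semaphore(concurrency)
    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(index_file(f, sem) for f in files))

# concurrency=1 processes files strictly one at a time, keeping request
# bursts to external APIs as small as possible.
results = asyncio.run(index_all([f"doc{i}.pdf" for i in range(5)], concurrency=1))
```

Lowering the cap trades indexing speed for fewer simultaneous requests, which is usually enough to stay under an external API's rate limit.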
- **Checkpointing Mechanism:**
The `SearchIndex` class has a method called `save_index` which saves the current state of the index files. This method is called at various points in the indexing process to ensure progress is saved. For example, the `process_file` function calls `save_index` after processing each file:

```python
await search_index.save_index()
```
This allows the indexing process to resume from the last saved state rather than starting from scratch.
```python
async def save_index(self) -> None:
    file_index_path = await self.file_index_filename
    async with await anyio.open_file(file_index_path, "wb") as f:
        await f.write(zlib.compress(pickle.dumps(await self.index_files)))
```
By adjusting these settings and utilizing the checkpointing mechanism, you can slow down the indexing process to avoid hitting rate limits and resume from the last checkpoint if the process is interrupted.
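For illustration only, here is a standalone sketch of the same checkpoint-and-resume pattern: persist the set of processed files with `pickle` + `zlib` (the same serialization `save_index` uses) and skip them on the next run. The file name and function names here are hypothetical, not paper-qa's API:

```python
import pickle
import zlib
from pathlib import Path

CHECKPOINT = Path("index_files.zpkl")  # hypothetical checkpoint path

def save_checkpoint(done: set[str]) -> None:
    # Same scheme as save_index: pickle the state, then zlib-compress it.
    CHECKPOINT.write_bytes(zlib.compress(pickle.dumps(done)))

def load_checkpoint() -> set[str]:
    if not CHECKPOINT.exists():
        return set()
    return pickle.loads(zlib.decompress(CHECKPOINT.read_bytes()))

def index_folder(files: list[str]) -> set[str]:
    done = load_checkpoint()
    for f in files:
        if f in done:
            continue  # resume: skip files indexed on a previous run
        # ... real indexing work would happen here ...
        done.add(f)
        save_checkpoint(done)  # checkpoint after every file
    return done

CHECKPOINT.unlink(missing_ok=True)  # start fresh for this demo
first = index_folder(["a.pdf", "b.pdf", "c.pdf"])
# A second run (e.g. after a crash) skips a/b/c and only indexes d.pdf.
second = index_folder(["a.pdf", "b.pdf", "c.pdf", "d.pdf"])
```

Because the checkpoint is written after every file, an interrupted run loses at most the file that was in progress.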
Try

```shell
pqa --agent.index_concurrency=1 ...
```

that will slow down the indexing process. @mskarlin may have an answer for more direct control of S2 rate limits though.
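Until there is direct control over S2 rate limits, a generic client-side workaround is to retry with exponential backoff whenever the API returns HTTP 429. This is a hedged, stdlib-only sketch; `flaky_fetch` is a hypothetical stand-in for the Semantic Scholar call, not paper-qa code:

```python
import time

def with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Call `fetch`, sleeping twice as long after each rate-limit error."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except RuntimeError:  # stand-in for catching an HTTP 429 response
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2**attempt)  # 1s, 2s, 4s, ...

# Fake endpoint that rejects its first two calls, like a rate-limited API.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return {"title": "some paper"}

result = with_backoff(flaky_fetch, base_delay=0.01)
```

Real HTTP clients would catch the specific 429 status (and honor a `Retry-After` header if present) instead of a generic `RuntimeError`.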
If I have an entire Zotero library stored locally (with `zotero.sqlite`), can I skip the Semantic Scholar API?
@swairshah I am also trying to figure this out. Let me know if you have any luck! I'll add anything I learn too.