bitcoinsearch/scraper

Handling Edits and AI-Generated Transcripts in Bitcoin Transcripts Repository

Closed this issue · 4 comments

I've identified an issue in the current scraping logic for the Bitcoin Transcripts repository, along with potential improvements, that we need to address to better handle edits and the inclusion of AI-generated transcripts. My suggestions aim to improve the accuracy and efficiency of our indexing process for Bitcoin Transcripts.

Current Issue
The existing mechanism for generating unique IDs for transcripts is based solely on the transcript's file path.

const path = require("path");

// Derive the transcript ID from its file path: strip the repository
// prefix, then replace path separators with "+".
const id = pathWithoutExtension
  .replace(path.join(process.env.DATA_DIR, "bitcointranscripts", folder_name), "")
  .replaceAll("\\", "+")
  .replaceAll("/", "+");

This approach doesn't account for edits to transcripts. Once a transcript is indexed, any future edits won't trigger a re-indexing due to the ID remaining unchanged. This limitation wasn't a major concern until our recent update related to AI-generated transcripts. These AI-generated transcripts are added to the repository for review, meaning they're indexed before being finalized by human review. Consequently, the final, reviewed versions aren't re-indexed, as they produce the same ID as the previous AI-generated version.

Possible Solutions

  1. Exclude AI-generated transcripts in need of review from parsing. We can identify these transcripts through their metadata to ensure they're not indexed until after the review process.
  2. Revise the ID generation logic to incorporate the transcript's body. This change would help capture edits to transcripts by generating a new ID for the updated content. However, it introduces challenges, including the difficulty of locating and removing the original transcript from the index and increased processing time due to the MD5 hashing of the transcript body.
  3. Limit the range of documents the scraper processes by leveraging repository metadata. Currently, the scraper evaluates all documents, regardless of their modification status. Focusing on documents added or edited within the last week could streamline the process. This approach not only enhances efficiency but also supports the ID regeneration strategy for edited transcripts, making it easier to identify and update modified files.

This approach doesn't account for edits to transcripts. Once a transcript is indexed, any future edits won't trigger a re-indexing due to the ID remaining unchanged.

@kouloumos, As far as I know, a transcript of an already published video or podcast will not be edited or updated again. Can you confirm whether they are ever edited after publication?

Consequently, the final, reviewed versions aren't re-indexed, as they produce the same ID as the previous AI-generated version.

@kouloumos Which AI-generated transcripts are you referring to? Is it the one we are generating using GPT-4 in the mailing-list-summaries?

@urvishp80 this issue is about the Bitcoin Transcripts project. I believe that reading this will give you context on what I am describing.

This has been solved by #64. The file path for each transcript is still used as the unique identifier (ID) but the logic now uses an upsert function that updates a document if it exists for the given ID; otherwise, inserts a new one.
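A minimal sketch of that upsert logic, assuming an Elasticsearch-style `update` call with `doc_as_upsert` (the `upsertTranscript` helper is hypothetical, not the actual code from #64):

```javascript
// Hypothetical sketch of the upsert approach: `doc_as_upsert: true` tells
// the index to update the document stored under `id` if it exists, or to
// insert `doc` as a new document if it does not. This way an edited
// transcript with an unchanged path-based ID overwrites the old version.
function upsertTranscript(client, index, id, doc) {
  return client.update({
    index,
    id,
    body: { doc, doc_as_upsert: true },
  });
}
```

With this in place the path-based ID becomes a feature rather than a bug: re-scraping an edited transcript simply overwrites the previously indexed version.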

There are still changes that need to happen in the scraper for Bitcoin Transcripts in order to better account for AI-generated transcripts as well as additional metadata for each transcript. Those will be described in a different issue.