StampyAI/alignment-research-dataset

Track subsets in larger dataset


We have consolidated lots of smaller subsets into larger, logically grouped subsets like blogs. However, it'd still be nice to be able to pull sources from a smaller subset, which could also be used with Pinecone metadata. Consider adding a 'domain' column in MySQL, derived from the 'url', to make the smaller subsets easy to find.
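
As a rough sketch of how the 'domain' value could be derived from 'url', something like the following might work. The function name `extract_domain` and the choice to strip a leading `www.` are assumptions for illustration, not existing code in this repo; only the standard library's `urllib.parse` is used.

```python
from urllib.parse import urlparse


def extract_domain(url):
    """Return the lowercased host of `url` without a leading 'www.', or None.

    Hypothetical helper: the name and the 'www.' stripping are assumptions.
    """
    if not url:
        return None
    # urlparse only fills in the host when a scheme or a leading '//' is
    # present, so prefix bare values such as 'arxiv.org/abs/...'.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    host = urlparse(url).hostname  # already lowercased, port removed
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host
```

The result could be written to the proposed 'domain' column when an article is stored, and the same string could be passed along as a metadata field for filtering. Whether subdomains should be kept or collapsed is left open here.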

Will this be:

  • the domain of the url that is displayed to the user
  • the domain of the source url (if provided)
  • the domain of the place where the article was first found (e.g. from the alignment newsletter)