Issues
Add arxiv papers from Slack
#195 opened by ccstan99 - 0
Check pinecone for deleted sources
#142 opened by ccstan99 - 1
Fix agentmodels URLs by using GitHub URLs when the website ones are broken
#184 opened by Thomas-Lemoine - 0
Fix daily dataset updates
#183 opened by ccstan99 - 1
Add parsers and blogs
#169 opened by ccstan99 - 1
Improve YouTube transcripts
#172 opened by ccstan99 - 7
Improve catching duplicate urls
#163 opened by ccstan99 - 0
Automatic indices marking
#156 opened by markovial - 0
Deduplicate alignmentforum & lesswrong
#160 opened by ccstan99 - 1
Track subsets in larger dataset
#173 opened by ccstan99 - 0
Make embedding_utils.py cleaner by adding a generic process-in-batches function.
#177 opened by henri123lemoine - 4
Pinecone metadata to include confidence & summary
#168 opened by ccstan99 - 0
Handle a(g)isafetyfundamentals.com
#165 opened by ccstan99 - 0
Fix YouTube authors from playlists
#170 opened by ccstan99 - 0
Add 80,000 Hours AI Archive
#153 opened by ccstan99 - 0
Add governance.ai
#155 opened by markovial - 0
Add table for storing pinecone metadata
#101 opened by mruwnik - 0
Missing text should be autoscraped
#164 opened by ccstan99 - 0
Updating datasets for modified content sources
#96 opened by Mishaall - 1
Update readme.md
#91 opened by Thomas-Lemoine - 1
Fix titles
#158 opened by ccstan99 - 0
Reorganize source and source_type
#140 opened by ccstan99 - 0
Add transformer-circuits.pub
#141 opened by ccstan99 - 0
Import rest of special docs to SQL
#116 opened by ccstan99 - 0
Provide way to update metadata
#126 opened by ccstan99 - 0
Properly handle arxiv papers
#125 opened by ccstan99 - 0
Fix NULL authors
#122 opened by ccstan99 - 0
Deduplicate by content
#128 opened by ccstan99 - 0
Decide on special docs workflow
#127 opened by ccstan99 - 0
Add command to setup index
#121 opened by mruwnik - 0
Finetune embeddings model
#119 opened by henri123lemoine - 0
Consistent naming
#117 opened by Thomas-Lemoine - 0
Remove gdocs metadata magic docs
#102 opened by mruwnik - 0
Use whisper.ai for youtube transcripts
#112 opened by mruwnik - 0
Remove old jsonl default flow
#100 opened by mruwnik - 0
Fix audio transcripts
#98 opened by mruwnik - 0
add confidence column to articles table
#104 opened by mruwnik - 0
Properly handle authors in the database
#99 opened by mruwnik - 0
add blog.eleuther.ai
#85 opened by mruwnik - 0
add deepmind technical-blogs
#86 opened by mruwnik - 0
Add dataset for openai research
#87 opened by mruwnik - 1
General data cleaning
#95 opened by Mishaall - 0
Validation checks for url and revisions
#84 opened by ccstan99