I have 20 years of public content that I've produced and presented over my career, and I'd like to have an LLM that is trained to answer questions and generate summaries of my opinions, in my "voice". At this point, this LLM doesn't exist, but to encourage development I have organized my public content and references to sources in this repo. For now, if you want to know my opinions on things, you have to watch all my videos and listen to my podcasts yourself!
My own content is stored or linked to in authors/virtual_adrianco and consists of:
- 4 published books (PDFs of two provided), ~10 forewords to books, ~100 blog posts (text)
- Twitter archive 2008-2022 (conversation text)
- Mastodon.social - 2021-now https://mastodon.social/@adrianco (RSS at https://mastodon.social/@adrianco.rss)
- GitHub projects (code)
- Blog posts mostly at https://adrianco.medium.com
- ~100 presentation decks (images), with the greatest hits at https://github.com/adrianco/slides/tree/master/Greatest%20Hits
- ~20 podcasts (audio conversations, should be good Q&A training material)
- ~50 videos of talks and interviews (audio/video/YouTube playlists)
If another author wants to use this repo as a starting point, clone it and add your own directory of content under authors. If you want to contribute it freely for other people to use as a training data set, then send a pull request and I'll include it here.
Creative Commons - attribution share-alike. Permission explicitly granted for anyone to use as a training set to develop the meGPT concept. Free for use by any author/speaker/expert resulting in a chatbot that can answer questions as if it was the author, with reference to published content. I have called my own build of this virtual_adrianco, with opinions on cloud computing, sustainability, performance tools, microservices, speeding up innovation, Wardley mapping, open source, chaos engineering, resilience, Sun Microsystems, Netflix, AWS, etc. I'm happy to share any models that are developed. I don't need to monetize this. I'm semi-retired, I have managed to monetize this content well enough already, and I don't work for a big corporation any more.
All the code in this repo has been written by the free version of ChatGPT 4 based on short prompts, with no subsequent edits, in a few minutes of my time here and there. I can read Python and mostly make sense of it, but I'm not an experienced Python programmer. Look in the relevant issue for a public link to the chat thread that generated the code. This is a ridiculously low-friction and easy way to write code.
To use this repo, clone it to a local disk, set up the Python environment, and run the build.py script for an author. It will walk through the published content table for that author, processing each line in turn. The build script will create a downloads/ directory and a state.json file in it, which records successful processing steps so that incremental runs of build.py will not re-run the same downloads. Each kind of data needs a corresponding script in the processors directory.
git clone https://github.com/adrianco/megpt.git
cd megpt
python -m venv venv
Windows:
venv\Scripts\activate
macOS/Linux:
source venv/bin/activate
pip install -r requirements.txt
Run the build script
Usage: build.py <author>
python build.py virtual_adrianco
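The incremental behavior of the build script, where successful steps are recorded in state.json and skipped on later runs, could look roughly like the sketch below. This is not the code in build.py: the layout of the state file, the `URL` column name, and the `load_state` and `process_incrementally` helpers are all assumptions made for illustration.

```python
import json
from pathlib import Path


def load_state(state_path):
    """Return the recorded state, or an empty dict on the first run."""
    if state_path.exists():
        return json.loads(state_path.read_text())
    return {}


def process_incrementally(rows, state_path, process):
    """Call process(row) only for rows not yet recorded as done.

    rows is a list of dicts with a hypothetical "URL" key. The state file
    is rewritten after every success, so an interrupted run can resume.
    """
    state = load_state(state_path)
    newly_done = []
    for row in rows:
        url = row["URL"]
        if state.get(url) == "done":
            continue  # already processed in an earlier run
        process(row)  # an exception here leaves the row unrecorded
        state[url] = "done"
        state_path.parent.mkdir(parents=True, exist_ok=True)
        state_path.write_text(json.dumps(state, indent=2))
        newly_done.append(url)
    return newly_done
```

Writing the state after each row, rather than once at the end, means a failed download part way through the table does not lose the record of the rows that already succeeded.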
For test purposes, process a single kind of data from an arbitrary URL and output to downloads without updating the state.json file.
Usage: python process.py <author> <Kind> <SubKind> <URL>
build.py and process.py appear to be operating correctly, and book_processor.py correctly downloaded PDFs of books. Any other raw file download can be handled by cloning this processor. Each website download is going to need customized extraction. The correct div name for The New Stack (thenewstack.io) has been added as a SubKind, and text content download is working correctly for stories.
I have been assembling my content for a while, and will update the references table now and again: https://github.com/adrianco/meGPT/blob/main/authors/virtual_adrianco/published_content.csv
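As an illustration of per-site extraction keyed on a div name, here is a minimal sketch that pulls the text out of one named div using only the Python standard library. It is not the processor in this repo. The `DivTextExtractor` class, the `extract_div_text` function, and the `post-content` class name are hypothetical, and the real div name used for The New Stack is whatever is recorded as the SubKind in the content table.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect text found inside the first-level div carrying a given class."""

    def __init__(self, div_class):
        super().__init__()
        self.div_class = div_class
        self.depth = 0  # div nesting depth inside the target, 0 when outside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # nested div inside the target
        elif self.div_class in (dict(attrs).get("class") or "").split():
            self.depth = 1  # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_div_text(html, div_class):
    """Return the text inside divs with div_class, one chunk per line."""
    parser = DivTextExtractor(div_class)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

For example, `extract_div_text(page_html, "post-content")` would skip navigation and footer divs and keep only the story text, assuming the site wraps its stories in a div with that class.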
YouTube videos have transcripts with index offsets into the video itself but the transcript quality isn't good, and they can only be read via API by the owner of the video. It's easier to download videos with pytube and process them with whisper to generate more curated transcripts that identify when the author is talking if there is more than one speaker.
Twitter archive - the raw archive files were over 100MB and too big for GitHub. The extract_conversations script was used to pull out only the tweets that were part of a conversation, so they can be further analyzed to find questions and answers. The code to do this was written by ChatGPT and worked first time, but if there are any problems with the output I'm happy to share the raw tweets. File an issue.
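The idea of keeping only conversation tweets can be sketched as follows. This is not the extract_conversations script itself. It assumes each tweet is a dict with `id_str` and `in_reply_to_status_id_str` fields, and it treats a tweet as part of a conversation if it is a reply or if another tweet in the same archive replies to it. The function name is hypothetical.

```python
def extract_conversation_tweets(tweets):
    """Keep tweets that are replies, or that are replied to within the archive.

    tweets is a list of dicts. The field names are assumptions about the
    archive format and may need adjusting for a real export.
    """
    # IDs that some tweet in the archive is replying to
    replied_to = {
        t["in_reply_to_status_id_str"]
        for t in tweets
        if t.get("in_reply_to_status_id_str")
    }
    return [
        t
        for t in tweets
        if t.get("in_reply_to_status_id_str") or t["id_str"] in replied_to
    ]
```

Standalone tweets that nobody in the archive replied to are dropped, which is what shrinks the data set while keeping likely question and answer pairs together.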
Mastodon archive - available as an RSS feed. Medium blog platform - also available as an RSS feed. An RSS feed importer is needed for both. It would also be good to make the import incremental, so that the training material can be updated efficiently as new blog posts and toots appear.
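A minimal sketch of an incremental RSS import, using only the Python standard library, is shown below. It is an illustration and not code from this repo: the `new_rss_items` function and the idea of tracking a set of seen item IDs are assumptions, and it expects plain RSS 2.0 with `guid` or `link` elements on each item. Fetching the feed and persisting the seen set between runs are left out.

```python
import xml.etree.ElementTree as ET


def new_rss_items(rss_xml, seen_ids):
    """Return items from an RSS 2.0 document that are not in seen_ids.

    seen_ids is a set of item identifiers from earlier runs. It is updated
    in place, so the caller can save it and pass it in again next time.
    """
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        # Prefer the guid, fall back to the link when the guid is missing
        item_id = item.findtext("guid") or item.findtext("link")
        if item_id is None or item_id in seen_ids:
            continue
        fresh.append(
            {
                "id": item_id,
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
                "description": item.findtext("description", default=""),
            }
        )
        seen_ids.add(item_id)
    return fresh
```

Because only unseen items are returned, re-running the import after a new post or toot appears processes just that one item.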
Issues have been created to track development of ingestion processing code.