Why use ReadItToMe rather than a screen reader? I built this tool for two major use cases:
- Reading research papers and large web content in a smart way (skipping ads, menus, and other page clutter).
- Reading large forum threads and summarizing the findings, consensus, and insights.
In these cases, it blows a standard screen reader out of the water.
Supported models
- Support for OpenAI models (GPT-3/GPT-4)
- Support for Anthropic models (Claude-2/3)
- Support for Ollama models (Mistral, Llama2, etc.)
Usage
Specify a URL
```
py main.py --url "https://example.com/page"
```
Specify a filename, to reuse the same file (by default, one file is generated per webpage)
```
py main.py --fixed-filename "summary.mp3"
```
Specify a 'playlist' file containing multiple URLs, one per line, to process. Can be combined with --silent and --download-only to set up a playlist for later listening.
```
py main.py --playlist \your\directory\playlist.txt
```
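A playlist file is just plain text with one URL per line. A hypothetical example (these URLs are placeholders, not real content):
```
https://example.com/research-paper
https://news.ycombinator.com/item?id=12345678
https://example.org/long-forum-thread
```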
Save the AI-generated summaries for later viewing
```
py main.py --save-summaries \output\dir
```
Flags
- --silent (don't vocalize the actions being performed)
- --download-only (only download the audio files without playing them back; useful for bulk-creating a playlist)
```
py main.py --playlist C:\git\HNplaylist.txt --download-only --silent
```
Setup
- Copy or rename config.example.json to config.json
- Add your keys for the models you plan to use. An OpenAI key is required for OpenAI's natural text-to-speech, which is the main feature of this app (other platforms may be supported in the future).
- Add your output directory - this is where the audio files generated for playback will be stored
- Add your selected model and model type for text summarization (openai, claude, ollama)
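For illustration only, a filled-in config.json might look like the sketch below. The key names here are my assumptions, not taken from the repo - treat config.example.json as the authoritative reference (MAX_RESPONSE_TOKENS, discussed in the notes below, may live in code rather than in config):
```jsonc
// Hypothetical sketch - key names are assumptions, values are placeholders.
// (Comments are annotations only; a real config.json must be plain JSON.)
{
  "openai_api_key": "sk-...",
  "anthropic_api_key": "sk-ant-...",
  "output_directory": "C:\\ReadItToMe\\audio",
  "model_type": "openai",
  "model": "gpt-4-turbo",
  "max_response_tokens": 3072
}
```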
Disclaimer: I'm not a daily Python coder; ironically, the core implementation came together in Python through experimentation and was then ported to C# via Claude 3.0 and hand fixup.
- Opted to use Pygame for audio playback in Python, as it provided the most seamless user experience (other approaches required convoluted FFmpeg setup on Windows); a minimal playback sketch follows these notes.
- Opted for OpenAI's voices - I personally enjoy how natural they sound, including the vocal mannerisms.
- MAX_RESPONSE_TOKENS has a very strong effect on how thorough or concise the summary is. At 720, you'll get a reasonable, detailed overview if the story is brief. I personally use 3072 since I use it for large stories or HackerNews threads. Expand this if you prefer a deeper dive, up to the limits of your model. Of course, this has a direct effect on cost per query.
- In general, this requires models with 16k+ context sizes to be useful (GPT-3.5-turbo, GPT-4-Turbo, Claude-2, Claude-3)
- Not all Ollama models support large context sizes.
- In practice, Mistral was passable, but most small/medium models (7B or less) did poorly or required tweaking to deliver useful summaries. YMMV!
- Claude-3 and GPT-4 did exceptionally well due to their large context sizes and recall quality
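For reference, the Pygame playback mentioned above boils down to something like this minimal sketch (file name and polling interval are illustrative, not taken from this repo). Pygame bundles its own SDL-based audio decoding, which is what avoids a separate FFmpeg setup on Windows:
```python
import time
import pygame

# Audio only - initialize just the mixer, no game window required.
pygame.mixer.init()

# Load and start a generated MP3 (placeholder path).
pygame.mixer.music.load("summary.mp3")
pygame.mixer.music.play()

# Block until playback finishes.
while pygame.mixer.music.get_busy():
    time.sleep(0.1)
```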
Future plans
- Chromium- and Firefox-based browser plugins (investigating)
- Better-tested support for specific local models (Ollama, oobabooga, or anything that supports the OpenAI API)
- Support for multiple audio generation models