Scrape articles from Substack
- download articles as HTML
- download TTS audios as MP3
- cache raw files
- Substack handle, e.g. from the page URL `https://foo.substack.com/`
- (for paywalled articles) Substack access token from an account with a subscription, e.g. from the `substack_sid` cookie
- Deno
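
As a rough illustration of the two inputs above, here is a minimal sketch of how the handle could be read from a page URL and how the token might be attached to a request. The helper names `handleFromUrl` and `authHeaders` are hypothetical, and sending the token back as a `substack_sid` cookie is an assumption, not necessarily what this scraper does.

```typescript
// Hypothetical helpers for illustration; the names are not taken from this project.

// Extract the handle (the subdomain) from a Substack page URL, or null if there is none.
function handleFromUrl(pageUrl: string): string | null {
  const { hostname } = new URL(pageUrl); // throws on a malformed URL
  const suffix = ".substack.com";
  if (!hostname.endsWith(suffix)) return null; // e.g. a custom domain
  const handle = hostname.slice(0, -suffix.length);
  // Reject empty or nested subdomains such as "a.b.substack.com".
  return handle.length > 0 && !handle.includes(".") ? handle : null;
}

// Assumption: the access token is sent back as the `substack_sid` cookie.
function authHeaders(token?: string): Record<string, string> {
  return token ? { Cookie: `substack_sid=${token}` } : {};
}

console.log(handleFromUrl("https://foo.substack.com/")); // foo
console.log(authHeaders("TOKEN"));
```

Without a token, `authHeaders()` returns an empty object, which matches the case where only free articles are scraped.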
- set environment variables in the `.env` file
- scrape articles: `deno task articles`
- scrape TTS audios: `deno task audios`
- optionally, convert the HTML to Markdown using pandoc and Nushell:
ls out/handle/ | where type == "dir" | get name | par-each { cd $in; pandoc --wrap=none --strip-comments -f html-native_divs-native_spans -t gfm-tex_math_dollars-raw_html -o article.md article.html }
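
The pandoc flags in the one-liner above are dense, so the following sketch spells them out one per line as an argument list. `pandocArgs` is a hypothetical helper name, not something this project provides; the flags themselves are copied from the command above.

```typescript
// Sketch: the pandoc arguments from the one-liner above, one flag per line.
// `pandocArgs` is a hypothetical helper name, not part of this project.
function pandocArgs(input = "article.html", output = "article.md"): string[] {
  return [
    "--wrap=none", // do not hard-wrap lines in the output
    "--strip-comments", // drop HTML comments
    // read HTML with the native_divs and native_spans extensions disabled
    "-f", "html-native_divs-native_spans",
    // write GitHub-Flavored Markdown without $-delimited math or raw HTML
    "-t", "gfm-tex_math_dollars-raw_html",
    "-o", output,
    input,
  ];
}

console.log(["pandoc", ...pandocArgs()].join(" "));
```

In a Deno script, an array like this could be passed as the `args` option of `Deno.Command`, with `cwd` set to each article directory, as an alternative to running the conversion from Nushell.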