DingDocs Crawler

A high-performance batch downloader for DingTalk documents built with TypeScript, Puppeteer, and Crawlee.

Features

⚡ High Performance: Built on the Crawlee framework for efficient web scraping and file downloading
📄 Multi-format Support: Currently handles various DingTalk document types:
- Documents
- Spreadsheets
- Mind Maps
- AI Tables
- Uploaded files (PDF, images, etc.)
- Nested folders
🛡️ Stable & Reliable: Stealth mode, retry mechanism, and comprehensive error handling

The crawler runs on Bun. If you have asdf installed:

# Install bun plugin
asdf plugin add bun

# Install bun (version specified in .tool-versions)
asdf install bun

Otherwise, visit bun.sh for installation instructions.

git clone https://github.com/imyelo/dingdocs-crawler.git
cd dingdocs-crawler

bun install

The crawler uses environment variables for configuration. Create a .env.local file in the project root:

APP_ENTRY_URL=https://your-dingtalk-docs-url-with-folder-page

Supported variables:

| Variable | Description | Default | Required |
| --- | --- | --- | --- |
| `APP_ENTRY_URL` | Starting URL for crawling; should be a folder page | - | ✅ |
| `APP_CRAWLER_TIMEOUT_SECONDS` | Total crawler timeout, in seconds | 4500 | ❌ |
| `APP_REQUEST_TIMEOUT_SECONDS` | Individual request timeout, in seconds | 1800 | ❌ |
| `APP_VISIBLE` | Show the browser window | true | ❌ |
| `APP_MAX_CONCURRENCY` | Maximum concurrent requests | 1 | ❌ |
| `APP_MAX_REQUEST_RETRIES` | Retry attempts for failed requests | 10 | ❌ |
| `APP_PROXY_URLS` | Comma-separated proxy URLs | - | ❌ |
| `APP_LOG_PATH` | Log file path | ./output.log | ❌ |
| `APP_DOWNLOAD_PATH` | Download directory | ./downloads | ❌ |
| `APP_LOGTAIL_SOURCE_TOKEN` | Logtail integration token (leave empty if you don't use it) | - | ❌ |
| `APP_HEALTHY_UUID` | Health check UUID (leave empty if you don't use it) | - | ❌ |
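
To make the defaults above concrete, here is a minimal sketch of how these variables could be read and validated in TypeScript. It is not the project's actual loader: `CrawlerConfig`, `readInt`, and `loadConfig` are hypothetical names introduced for illustration, and the parsing rules (for example, treating only `false` as disabling `APP_VISIBLE`) are assumptions.

```typescript
// Hypothetical config loader; names and parsing rules are assumptions,
// not taken from the dingdocs-crawler source.
interface CrawlerConfig {
  entryUrl: string;
  crawlerTimeoutSeconds: number;
  requestTimeoutSeconds: number;
  visible: boolean;
  maxConcurrency: number;
  maxRequestRetries: number;
  proxyUrls: string[];
  logPath: string;
  downloadPath: string;
}

type Env = Record<string, string | undefined>;

// Read an integer variable, falling back to the documented default when unset.
function readInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number.parseInt(raw, 10);
  if (Number.isNaN(value)) {
    throw new Error(`${name} must be an integer, got "${raw}"`);
  }
  return value;
}

function loadConfig(env: Env): CrawlerConfig {
  // APP_ENTRY_URL is the only required variable in the table.
  const entryUrl = env.APP_ENTRY_URL?.trim();
  if (!entryUrl) {
    throw new Error("APP_ENTRY_URL is required");
  }
  return {
    entryUrl,
    crawlerTimeoutSeconds: readInt(env, "APP_CRAWLER_TIMEOUT_SECONDS", 4500),
    requestTimeoutSeconds: readInt(env, "APP_REQUEST_TIMEOUT_SECONDS", 1800),
    // Assumption: anything other than "false" keeps the browser visible.
    visible: (env.APP_VISIBLE ?? "true").trim().toLowerCase() !== "false",
    maxConcurrency: readInt(env, "APP_MAX_CONCURRENCY", 1),
    maxRequestRetries: readInt(env, "APP_MAX_REQUEST_RETRIES", 10),
    // Split the comma-separated list and drop empty entries.
    proxyUrls: (env.APP_PROXY_URLS ?? "")
      .split(",")
      .map((url) => url.trim())
      .filter((url) => url.length > 0),
    logPath: env.APP_LOG_PATH ?? "./output.log",
    downloadPath: env.APP_DOWNLOAD_PATH ?? "./downloads",
  };
}
```

In a Bun or Node process this could be called as `loadConfig(process.env)`. Failing fast on a missing `APP_ENTRY_URL` surfaces a misconfigured `.env.local` before the browser is launched.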

Start the crawler:

bun start

Monitor logs in real-time:

bun run log