A high-performance batch downloader for DingTalk documents built with TypeScript, Puppeteer, and Crawlee.
- High Performance: Built with Crawlee framework for efficient web scraping and file downloading
- Multi-format Support: Currently handles various DingTalk document types:
  - Documents
  - Spreadsheets
  - Mind Maps
  - AI Tables
  - Uploaded files (PDF, images, etc.)
  - Nested folders
- Stable & Reliable: Stealth mode, retry mechanism, and comprehensive error handling
- Bun >= 1.2.20
If you have asdf installed:
# Install bun plugin
asdf plugin add bun
# Install bun (version specified in .tool-versions)
asdf install bun
Otherwise, visit bun.sh for installation instructions.
- Clone the repository:
git clone https://github.com/imyelo/dingdocs-crawler.git
cd dingdocs-crawler
- Install dependencies:
bun install
The crawler uses environment variables for configuration. Create a .env.local
file in the project root:
APP_ENTRY_URL=https://your-dingtalk-docs-url-with-folder-page
Available variables:
Variable | Description | Default | Required |
---|---|---|---|
APP_ENTRY_URL | Starting URL for crawling; should be a folder page | - | Yes |
APP_CRAWLER_TIMEOUT_SECONDS | Total crawler timeout, in seconds | 4500 | No |
APP_REQUEST_TIMEOUT_SECONDS | Individual request timeout, in seconds | 1800 | No |
APP_VISIBLE | Show browser window | true | No |
APP_MAX_CONCURRENCY | Maximum concurrent requests | 1 | No |
APP_MAX_REQUEST_RETRIES | Retry attempts for failed requests | 10 | No |
APP_PROXY_URLS | Comma-separated proxy URLs | - | No |
APP_LOG_PATH | Log file directory | ./output.log | No |
APP_DOWNLOAD_PATH | Download directory | ./downloads | No |
APP_LOGTAIL_SOURCE_TOKEN | Logtail integration token (keep empty if you don't know what it is) | - | No |
APP_HEALTHY_UUID | Health check UUID (keep empty if you don't know what it is) | - | No |
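To make the defaults above concrete, here is a minimal sketch of how these variables could be parsed into a typed object. It is not the project's actual loader: the `CrawlerConfig` interface, the `loadConfig` and `intFrom` names, and the validation rules are all hypothetical, and it assumes `APP_ENTRY_URL` is the only mandatory variable.

```typescript
// Hypothetical config shape; field names are illustrative, not taken from the project.
interface CrawlerConfig {
  entryUrl: string;
  crawlerTimeoutSeconds: number;
  requestTimeoutSeconds: number;
  visible: boolean;
  maxConcurrency: number;
  maxRequestRetries: number;
  proxyUrls: string[];
  logPath: string;
  downloadPath: string;
}

type Env = Record<string, string | undefined>;

// Parse a non-negative integer, falling back to the documented default when unset or empty.
function intFrom(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${key} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

function loadConfig(env: Env): CrawlerConfig {
  // Assumption: the entry URL is the only variable without a usable default.
  const entryUrl = (env["APP_ENTRY_URL"] ?? "").trim();
  if (entryUrl === "") throw new Error("APP_ENTRY_URL is required");

  return {
    entryUrl,
    crawlerTimeoutSeconds: intFrom(env, "APP_CRAWLER_TIMEOUT_SECONDS", 4500),
    requestTimeoutSeconds: intFrom(env, "APP_REQUEST_TIMEOUT_SECONDS", 1800),
    // Assumption: any value other than "false" keeps the browser window visible.
    visible: (env["APP_VISIBLE"] ?? "true").trim().toLowerCase() !== "false",
    maxConcurrency: intFrom(env, "APP_MAX_CONCURRENCY", 1),
    maxRequestRetries: intFrom(env, "APP_MAX_REQUEST_RETRIES", 10),
    // Split the comma-separated list and drop empty entries.
    proxyUrls: (env["APP_PROXY_URLS"] ?? "")
      .split(",")
      .map((url) => url.trim())
      .filter((url) => url.length > 0),
    logPath: env["APP_LOG_PATH"] ?? "./output.log",
    downloadPath: env["APP_DOWNLOAD_PATH"] ?? "./downloads",
  };
}
```

Bun loads `.env.local` into `process.env` automatically, so a loader like this could be called as `loadConfig(process.env)` at startup.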
Start the crawler:
bun start
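How the settings reach the crawler at startup is not described here. As a rough sketch only, the snippet below maps a few of them onto option names that Crawlee's `PuppeteerCrawler` accepts (`maxConcurrency`, `maxRequestRetries`, `requestHandlerTimeoutSecs`, `launchContext.launchOptions.headless`). The `Settings` and `CrawlerOptionsSketch` types and the `toCrawlerOptions` function are hypothetical, and pairing `APP_REQUEST_TIMEOUT_SECONDS` with `requestHandlerTimeoutSecs` is an assumption rather than something the project confirms.

```typescript
// Hypothetical settings object; in practice the values would come from the APP_* variables.
interface Settings {
  visible: boolean;
  maxConcurrency: number;
  maxRequestRetries: number;
  requestTimeoutSeconds: number;
}

// Plain-object shape mirroring a small subset of Crawlee's PuppeteerCrawler options.
interface CrawlerOptionsSketch {
  maxConcurrency: number;
  maxRequestRetries: number;
  requestHandlerTimeoutSecs: number;
  launchContext: { launchOptions: { headless: boolean } };
}

function toCrawlerOptions(settings: Settings): CrawlerOptionsSketch {
  return {
    maxConcurrency: settings.maxConcurrency,
    maxRequestRetries: settings.maxRequestRetries,
    // Assumption: the per-request timeout corresponds to the request handler timeout.
    requestHandlerTimeoutSecs: settings.requestTimeoutSeconds,
    // A visible browser window means the browser is not headless.
    launchContext: { launchOptions: { headless: !settings.visible } },
  };
}
```

With Crawlee installed, an object like this could be spread into the crawler constructor, for example `new PuppeteerCrawler({ ...toCrawlerOptions(settings), requestHandler })`.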
Monitor logs in real-time:
bun run log
Apache-2.0 © yelo, 2025 - present