A powerful and flexible web crawler that converts website content into structured JSON. Perfect for building training datasets, migrating content, scraping sites, or any other task that needs structured web content extraction.
Just two commands to crawl a website and save the content in a structured JSON file.
npx crawltojson config
npx crawltojson crawl
- 🌐 Crawl any website with customizable patterns
- 📦 Export to structured JSON
- 🎯 CSS selector-based content extraction
- 🔄 Automatic retry mechanism for failed requests
- 🌲 Depth-limited crawling
- ⏱️ Configurable timeouts
- 🚫 URL pattern exclusion
- 💾 Stream-based processing for memory efficiency
- 🎨 Beautiful CLI interface with progress indicators
- Installation
- Quick Start
- Configuration Options
- Advanced Usage
- Output Format
- Use Cases
- Development
- Troubleshooting
- Contributing
- License
# Global installation
npm install -g crawltojson

# Run without installing (via npx)
npx crawltojson

# Local installation in a project
npm install crawltojson
- Generate a configuration file:
crawltojson config
- Start crawling:
crawltojson crawl
- url
  - Starting URL to crawl
  - Example: "https://example.com/blog"
  - Must be a valid HTTP/HTTPS URL
- match
  - URL pattern to match (supports glob patterns)
  - Example: "https://example.com/blog/**"
  - Use ** for wildcard matching
  - Default: the starting URL with /** appended
- selector
  - CSS selector used to extract content
  - Example: "article.content"
  - Default: "body"
  - Supports any valid CSS selector
- maxPages
  - Maximum number of pages to crawl
  - Default: 50
  - Range: 1 to unlimited
  - Helps control crawl scope
- maxRetries
  - Maximum number of retries for failed requests
  - Default: 3
  - Useful for handling temporary network issues
  - Exponential backoff between retries (see the sketch after this option list)
- maxLevels
  - Maximum depth level for crawling
  - Default: 3
  - Controls how deep the crawler goes from the starting URL
  - Level 0 is the starting URL
  - Helps prevent infinite crawling
- timeout
  - Page load timeout in milliseconds
  - Default: 7000 (7 seconds)
  - Prevents hanging on slow-loading pages
  - Adjust based on site performance
- excludePatterns
  - Array of URL patterns to ignore
  - Default patterns:
    [
      "**/tag/**",      // Ignore tag pages
      "**/tags/**",     // Ignore tag listings
      "**/#*",          // Ignore anchor links
      "**/search**",    // Ignore search pages
      "**.pdf",         // Ignore PDF files
      "**/archive/**"   // Ignore archive pages
    ]
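As noted under maxRetries, failed requests are retried with exponential backoff. The exact delays are internal to the crawler; the sketch below, with an assumed fetchPage function and a 1-second base delay, only illustrates the general pattern of doubling the wait between attempts.

// Illustrative only -- the actual retry schedule inside crawltojson may differ.
// "fetchPage" is a stand-in for whatever function loads a page and may throw.
async function fetchWithRetry(fetchPage, url, maxRetries = 3, baseDelayMs = 1000) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fetchPage(url);
    } catch (err) {
      if (attempt === maxRetries) throw err;       // out of retries: give up
      const delayMs = baseDelayMs * 2 ** attempt;  // 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}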
The configuration is stored in crawltojson.config.json. Example:
{
  "url": "https://example.com/blog",
  "match": "https://example.com/blog/**",
  "selector": "article.content",
  "maxPages": 100,
  "maxRetries": 3,
  "maxLevels": 3,
  "timeout": 7000,
  "outputFile": "crawltojson.output.json",
  "excludePatterns": [
    "**/tag/**",
    "**/tags/**",
    "**/#*"
  ]
}
The selector option supports any valid CSS selector. Examples:
# Single element
article.main-content
# Multiple elements
.post-content, .comments
# Nested elements
article .content p
# Complex selectors
main article:not(.ad) .content
The match pattern supports glob-style matching:
# Match exact path
https://example.com/blog/
# Match all blog posts
https://example.com/blog/**
# Match specific sections
https://example.com/blog/2024/**
https://example.com/blog/*/technical/**
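If you want to sanity-check a pattern before running a full crawl, you can preview it with a glob library. The sketch below uses picomatch purely for previewing; crawltojson's internal matcher is not guaranteed to behave identically.

// Pattern preview using picomatch (npm install picomatch).
// Approximation only -- the crawler's own matching may differ in edge cases.
const picomatch = require("picomatch");

const matchesBlog = picomatch("https://example.com/blog/**");

console.log(matchesBlog("https://example.com/blog/2024/my-post")); // true
console.log(matchesBlog("https://example.com/about"));             // false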
Customize excludePatterns for your needs:
{
  "excludePatterns": [
    "**/tag/**",        // Tag pages
    "**/category/**",   // Category pages
    "**/page/*",        // Pagination
    "**/wp-admin/**",   // Admin pages
    "**?preview=true",  // Preview pages
    "**.pdf",           // PDF files
    "**/feed/**",       // RSS feeds
    "**/print/**"       // Print pages
  ]
}
The crawler generates a JSON file with the following structure:
[
  {
    "url": "https://example.com/page1",
    "content": "Extracted content...",
    "timestamp": "2024-11-02T12:00:00.000Z",
    "level": 0
  },
  {
    "url": "https://example.com/page2",
    "content": "More content...",
    "timestamp": "2024-11-02T12:00:10.000Z",
    "level": 1
  }
]
- url: The normalized URL of the crawled page
- content: Text content extracted using the configured selector
- timestamp: ISO timestamp of when the page was crawled
- level: Depth level from the starting URL (0-based)
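Because the output is plain JSON, it is easy to post-process. As a minimal sketch (the script name is arbitrary; it assumes the default output file name and the fields listed above), this Node.js snippet prints a per-level page count:

// summarize.js -- count crawled pages per depth level (illustrative sketch).
const fs = require("fs");

const pages = JSON.parse(fs.readFileSync("crawltojson.output.json", "utf8"));

const byLevel = {};
for (const page of pages) {
  byLevel[page.level] = (byLevel[page.level] || 0) + 1;
}

console.log(`Total pages: ${pages.length}`);
for (const [level, count] of Object.entries(byLevel)) {
  console.log(`Level ${level}: ${count} page(s)`);
}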
- Content Migration
  - Crawl existing website content
  - Export to structured format
  - Import into new platform
- Training Data Collection (see the JSONL sketch after this list)
  - Gather content for ML models
  - Create datasets for NLP
  - Build content classifiers
- Content Archival
  - Archive website content
  - Create backups
  - Document snapshots
- SEO Analysis
  - Extract meta content
  - Analyze content structure
  - Track content changes
- Documentation Collection
  - Crawl documentation sites
  - Create offline copies
  - Generate searchable indexes
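For the training-data use case, a common first step is converting the crawl output to JSONL (one JSON object per line). A minimal sketch, assuming the default output file name and a hypothetical dataset.jsonl target:

// to-jsonl.js -- convert crawl output to JSONL for dataset pipelines (illustrative sketch).
const fs = require("fs");

const pages = JSON.parse(fs.readFileSync("crawltojson.output.json", "utf8"));

const lines = pages
  .filter((page) => page.content && page.content.trim().length > 0) // skip empty pages
  .map((page) => JSON.stringify({ text: page.content, source: page.url }));

fs.writeFileSync("dataset.jsonl", lines.join("\n") + "\n");
console.log(`Wrote ${lines.length} records to dataset.jsonl`);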
- Clone the repository:
git clone https://github.com/yourusername/crawltojson.git
cd crawltojson
- Install dependencies:
npm install
- Build the project:
npm run build
- Link for local testing:
npm link
# Run build
npm run build
# Clean build
npm run clean
# Run tests
npm test
# Watch mode
npm run dev
- Update version:
npm version patch|minor|major
- Build and publish:
npm run build
npm publish
- Browser Installation Failed
# Manual installation
npx playwright install chromium
- Permission Errors
# Fix CLI permissions
chmod +x ./dist/cli.js
- Build Errors
# Clean install
rm -rf node_modules dist package-lock.json
npm install
npm run build
Set the DEBUG environment variable:
DEBUG=crawltojson* crawltojson crawl
- Fork the repository
- Create feature branch
- Commit changes
- Push to branch
- Create Pull Request
- Use ESLint configuration
- Add tests for new features
- Update documentation
- Follow semantic versioning
MIT License - see LICENSE for details.
- Built with Playwright
- CLI powered by Commander.js
- Inspired by web scraping communities
Made with ❤️ by Vivek M. Agarwal