crawltojson

A powerful and flexible web crawler that converts website content into structured JSON. Perfect for creating training datasets, content migration, web scraping, or any task requiring structured web content extraction.

🎯 Intended Use

It takes just two commands to crawl a website and save its content to a structured JSON file.

npx crawltojson config
npx crawltojson crawl

🚀 Features

  • 🌐 Crawl any website with customizable patterns
  • 📦 Export to structured JSON
  • 🎯 CSS selector-based content extraction
  • 🔄 Automatic retry mechanism for failed requests
  • 🌲 Depth-limited crawling
  • ⏱️ Configurable timeouts
  • 🚫 URL pattern exclusion
  • 💾 Stream-based processing for memory efficiency
  • 🎨 Beautiful CLI interface with progress indicators

🔧 Installation

Global Installation (Recommended)

npm install -g crawltojson

Using npx (No Installation)

npx crawltojson

Local Project Installation

npm install crawltojson

🚀 Quick Start

  1. Generate configuration file:
crawltojson config
  2. Start crawling:
crawltojson crawl

⚙️ Configuration Options

Basic Options

  • url - Starting URL to crawl

  • match - URL pattern to match (supports glob patterns)

  • selector - CSS selector to extract content

    • Example: "article.content"
    • Default: "body"
    • Supports any valid CSS selector
  • maxPages - Maximum number of pages to crawl

    • Default: 50
    • Range: 1 to unlimited
    • Helps control crawl scope

Advanced Options

  • maxRetries - Maximum number of retries for failed requests

    • Default: 3
    • Useful for handling temporary network issues
    • Exponential backoff between retries (see the sketch after this list)
  • maxLevels - Maximum depth level for crawling

    • Default: 3
    • Controls how deep the crawler goes from the starting URL
    • Level 0 is the starting URL
    • Helps prevent infinite crawling
  • timeout - Page load timeout in milliseconds

    • Default: 7000 (7 seconds)
    • Prevents hanging on slow-loading pages
    • Adjust based on site performance
  • excludePatterns - Array of URL patterns to ignore

    • Default patterns:
      [
        "**/tag/**",    // Ignore tag pages
        "**/tags/**",   // Ignore tag listings
        "**/#*",        // Ignore anchor links
        "**/search**",  // Ignore search pages
        "**.pdf",       // Ignore PDF files
        "**/archive/**" // Ignore archive pages
      ]
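
For illustration, here is a minimal sketch of how a retry loop with exponential backoff around a page load can look. The helper name, delay values, and Playwright usage are assumptions for the example, not the crawler's actual internals:

// Sketch only: retry page.goto() with exponential backoff (1s, 2s, 4s, ...).
async function gotoWithRetry(page, url, { maxRetries = 3, timeout = 7000 } = {}) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await page.goto(url, { timeout });
    } catch (err) {
      if (attempt === maxRetries) throw err; // out of retries, give up
      const delay = 1000 * 2 ** attempt;     // double the wait on each failure
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}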

Configuration File

The configuration is stored in crawltojson.config.json. Example:

{
  "url": "https://example.com/blog",
  "match": "https://example.com/blog/**",
  "selector": "article.content",
  "maxPages": 100,
  "maxRetries": 3,
  "maxLevels": 3,
  "timeout": 7000,
  "outputFile": "crawltojson.output.json",
  "excludePatterns": [
    "**/tag/**",
    "**/tags/**",
    "**/#*"
  ]
}

🎯 Advanced Usage

Selecting Content

The selector option supports any valid CSS selector. Examples:

# Single element
article.main-content

# Multiple elements
.post-content, .comments

# Nested elements
article .content p

# Complex selectors
main article:not(.ad) .content
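
The crawler uses Playwright with Chromium under the hood (see the Troubleshooting section), so selector-based extraction conceptually works like the sketch below. The function name is illustrative and the real implementation may differ:

import { chromium } from "playwright";

// Sketch only: open a page and collect the text of every element
// matching the configured CSS selector.
async function extractContent(url, selector) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);
  const chunks = await page.$$eval(selector, (nodes) =>
    nodes.map((node) => node.innerText.trim())
  );
  await browser.close();
  return chunks.join("\n");
}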

URL Pattern Matching

The match pattern supports glob-style matching:

# Match exact path
https://example.com/blog/

# Match all blog posts
https://example.com/blog/**

# Match specific sections
https://example.com/blog/2024/**
https://example.com/blog/*/technical/**
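
Assuming a glob library such as minimatch, the decision of whether to follow a discovered link conceptually combines the match pattern with the exclude patterns, roughly like this (a sketch, not the actual source):

import { minimatch } from "minimatch";

// Sketch: a URL is crawled only if it matches the "match" pattern
// and none of the excludePatterns.
function shouldCrawl(url, config) {
  if (!minimatch(url, config.match)) return false;
  return !config.excludePatterns.some((pattern) => minimatch(url, pattern));
}

// Usage: call shouldCrawl(link, config) before queueing a discovered link.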

Exclude Patterns

Customize excludePatterns for your needs:

{
  "excludePatterns": [
    "**/tag/**",        // Tag pages
    "**/category/**",   // Category pages
    "**/page/*",        // Pagination
    "**/wp-admin/**",   // Admin pages
    "**?preview=true",  // Preview pages
    "**.pdf",           // PDF files
    "**/feed/**",       // RSS feeds
    "**/print/**"       // Print pages
  ]
}

📄 Output Format

The crawler generates a JSON file with the following structure:

[
  {
    "url": "https://example.com/page1",
    "content": "Extracted content...",
    "timestamp": "2024-11-02T12:00:00.000Z",
    "level": 0
  },
  {
    "url": "https://example.com/page2",
    "content": "More content...",
    "timestamp": "2024-11-02T12:00:10.000Z",
    "level": 1
  }
]

Fields:

  • url: The normalized URL of the crawled page
  • content: Extracted text content based on selector
  • timestamp: ISO timestamp of when the page was crawled
  • level: Depth level from the starting URL (0-based)
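
Because the output is plain JSON, it is straightforward to post-process. A small example (assuming Node.js 18+ with ES modules) that keeps only pages near the top of the site:

import { readFile } from "node:fs/promises";

// Load the crawl output and keep only pages at depth 0 or 1.
const pages = JSON.parse(await readFile("crawltojson.output.json", "utf8"));
const topLevel = pages.filter((page) => page.level <= 1);
console.log(`${topLevel.length} of ${pages.length} pages are at depth 0-1`);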

🎯 Use Cases

  1. Content Migration

    • Crawl existing website content
    • Export to structured format
    • Import into new platform
  2. Training Data Collection

    • Gather content for ML models
    • Create datasets for NLP
    • Build content classifiers
  3. Content Archival

    • Archive website content
    • Create backups
    • Document snapshots
  4. SEO Analysis

    • Extract meta content
    • Analyze content structure
    • Track content changes
  5. Documentation Collection

    • Crawl documentation sites
    • Create offline copies
    • Generate searchable indexes

🛠️ Development

Local Setup

  1. Clone the repository:
git clone https://github.com/yourusername/crawltojson.git
cd crawltojson
  2. Install dependencies:
npm install
  3. Build the project:
npm run build
  4. Link for local testing:
npm link

Development Commands

# Run build
npm run build

# Clean build
npm run clean

# Run tests
npm test

# Watch mode
npm run dev

Publishing

  1. Update version:
npm version patch|minor|major
  2. Build and publish:
npm run build
npm publish

❗ Troubleshooting

Common Issues

  1. Browser Installation Failed
# Manual installation
npx playwright install chromium
  2. Permission Errors
# Fix CLI permissions
chmod +x ./dist/cli.js
  3. Build Errors
# Clean install
rm -rf node_modules dist package-lock.json
npm install
npm run build

Debug Mode

Set the DEBUG environment variable:

DEBUG=crawltojson* crawltojson crawl

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to your branch
  5. Open a Pull Request

Coding Standards

  • Use ESLint configuration
  • Add tests for new features
  • Update documentation
  • Follow semantic versioning

📜 License

MIT License - see LICENSE for details.

Made with ❤️ by Vivek M. Agarwal