crawltojson

A powerful and flexible web crawler that converts website content into structured JSON. Perfect for creating training datasets, content migration, web scraping, or any task requiring structured web content extraction.

🎯 Intended Use

It takes just two commands to crawl a website and save its content to a structured JSON file.

npx crawltojson config
npx crawltojson crawl

🚀 Features

  • 🌐 Crawl any website with customizable patterns
  • 📦 Export to structured JSON
  • 🎯 CSS selector-based content extraction
  • 🔄 Automatic retry mechanism for failed requests
  • 🌲 Depth-limited crawling
  • ⏱️ Configurable timeouts
  • 🚫 URL pattern exclusion
  • 💾 Stream-based processing for memory efficiency
  • 🎨 Beautiful CLI interface with progress indicators

🔧 Installation

Global Installation (Recommended)

npm install -g crawltojson

Using npx (No Installation)

npx crawltojson

Local Project Installation

npm install crawltojson

🚀 Quick Start

  1. Generate configuration file:
crawltojson config
  2. Start crawling:
crawltojson crawl

⚙️ Configuration Options

Basic Options

  • url - Starting URL to crawl

  • match - URL pattern to match (supports glob patterns)

  • selector - CSS selector to extract content

    • Example: "article.content"
    • Default: "body"
    • Supports any valid CSS selector
  • maxPages - Maximum number of pages to crawl

    • Default: 50
    • Range: 1 to unlimited
    • Helps control crawl scope

Advanced Options

  • maxRetries - Maximum number of retries for failed requests

    • Default: 3
    • Useful for handling temporary network issues
    • Exponential backoff between retries (see the sketch after this list)
  • maxLevels - Maximum depth level for crawling

    • Default: 3
    • Controls how deep the crawler goes from the starting URL
    • Level 0 is the starting URL
    • Helps prevent infinite crawling
  • timeout - Page load timeout in milliseconds

    • Default: 7000 (7 seconds)
    • Prevents hanging on slow-loading pages
    • Adjust based on site performance
  • excludePatterns - Array of URL patterns to ignore

    • Default patterns:
      [
        "**/tag/**",    // Ignore tag pages
        "**/tags/**",   // Ignore tag listings
        "**/#*",        // Ignore anchor links
        "**/search**",  // Ignore search pages
        "**.pdf",       // Ignore PDF files
        "**/archive/**" // Ignore archive pages
      ]
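
For illustration, here is a minimal sketch of how a retry loop with exponential backoff around a page load can look. The helper name, delay values, and Playwright usage are assumptions for the example, not the crawler's actual internals:

// Sketch only: retry page.goto() with exponential backoff (1s, 2s, 4s, ...).
async function gotoWithRetry(page, url, { maxRetries = 3, timeout = 7000 } = {}) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await page.goto(url, { timeout });
    } catch (err) {
      if (attempt === maxRetries) throw err; // out of retries, give up
      const delay = 1000 * 2 ** attempt;     // double the wait on each failure
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}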

Configuration File

The configuration is stored in crawltojson.config.json. Example:

{
  "url": "https://example.com/blog",
  "match": "https://example.com/blog/**",
  "selector": "article.content",
  "maxPages": 100,
  "maxRetries": 3,
  "maxLevels": 3,
  "timeout": 7000,
  "outputFile": "crawltojson.output.json",
  "excludePatterns": [
    "**/tag/**",
    "**/tags/**",
    "**/#*"
  ]
}

🎯 Advanced Usage

Selecting Content

The selector option supports any valid CSS selector. Examples:

# Single element
article.main-content

# Multiple elements
.post-content, .comments

# Nested elements
article .content p

# Complex selectors
main article:not(.ad) .content
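
The crawler uses Playwright with Chromium under the hood (see the Troubleshooting section), so selector-based extraction conceptually works like the sketch below. The function name is illustrative and the real implementation may differ:

import { chromium } from "playwright";

// Sketch only: open a page and collect the text of every element
// matching the configured CSS selector.
async function extractContent(url, selector) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);
  const chunks = await page.$$eval(selector, (nodes) =>
    nodes.map((node) => node.innerText.trim())
  );
  await browser.close();
  return chunks.join("\n");
}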

URL Pattern Matching

The match pattern supports glob-style matching:

# Match exact path
https://example.com/blog/

# Match all blog posts
https://example.com/blog/**

# Match specific sections
https://example.com/blog/2024/**
https://example.com/blog/*/technical/**
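
Assuming a glob library such as minimatch, the decision of whether to follow a discovered link conceptually combines the match pattern with the exclude patterns, roughly like this (a sketch, not the actual source):

import { minimatch } from "minimatch";

// Sketch: a URL is crawled only if it matches the "match" pattern
// and none of the excludePatterns.
function shouldCrawl(url, config) {
  if (!minimatch(url, config.match)) return false;
  return !config.excludePatterns.some((pattern) => minimatch(url, pattern));
}

// Usage: call shouldCrawl(link, config) before queueing a discovered link.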

Exclude Patterns

Customize excludePatterns for your needs:

{
  "excludePatterns": [
    "**/tag/**",        // Tag pages
    "**/category/**",   // Category pages
    "**/page/*",        // Pagination
    "**/wp-admin/**",   // Admin pages
    "**?preview=true",  // Preview pages
    "**.pdf",           // PDF files
    "**/feed/**",       // RSS feeds
    "**/print/**"       // Print pages
  ]
}

📄 Output Format

The crawler generates a JSON file with the following structure:

[
  {
    "url": "https://example.com/page1",
    "content": "Extracted content...",
    "timestamp": "2024-11-02T12:00:00.000Z",
    "level": 0
  },
  {
    "url": "https://example.com/page2",
    "content": "More content...",
    "timestamp": "2024-11-02T12:00:10.000Z",
    "level": 1
  }
]

Fields:

  • url: The normalized URL of the crawled page
  • content: Extracted text content based on selector
  • timestamp: ISO timestamp of when the page was crawled
  • level: Depth level from the starting URL (0-based)
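
Because the output is plain JSON, it is straightforward to post-process. A small example (assuming Node.js 18+ with ES modules) that keeps only pages near the top of the site:

import { readFile } from "node:fs/promises";

// Load the crawl output and keep only pages at depth 0 or 1.
const pages = JSON.parse(await readFile("crawltojson.output.json", "utf8"));
const topLevel = pages.filter((page) => page.level <= 1);
console.log(`${topLevel.length} of ${pages.length} pages are at depth 0-1`);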

🎯 Use Cases

  1. Content Migration

    • Crawl existing website content
    • Export to structured format
    • Import into new platform
  2. Training Data Collection

    • Gather content for ML models
    • Create datasets for NLP
    • Build content classifiers
  3. Content Archival

    • Archive website content
    • Create backups
    • Document snapshots
  4. SEO Analysis

    • Extract meta content
    • Analyze content structure
    • Track content changes
  5. Documentation Collection

    • Crawl documentation sites
    • Create offline copies
    • Generate searchable indexes

🛠️ Development

Local Setup

  1. Clone the repository:
git clone https://github.com/yourusername/crawltojson.git
cd crawltojson
  2. Install dependencies:
npm install
  3. Build the project:
npm run build
  4. Link for local testing:
npm link

Development Commands

# Run build
npm run build

# Clean build
npm run clean

# Run tests
npm test

# Watch mode
npm run dev

Publishing

  1. Update version:
npm version patch|minor|major
  2. Build and publish:
npm run build
npm publish

❗ Troubleshooting

Common Issues

  1. Browser Installation Failed
# Manual installation
npx playwright install chromium
  2. Permission Errors
# Fix CLI permissions
chmod +x ./dist/cli.js
  3. Build Errors
# Clean install
rm -rf node_modules dist package-lock.json
npm install
npm run build

Debug Mode

Set the DEBUG environment variable:

DEBUG=crawltojson* crawltojson crawl

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to your branch
  5. Open a Pull Request

Coding Standards

  • Use ESLint configuration
  • Add tests for new features
  • Update documentation
  • Follow semantic versioning

📜 License

MIT License - see LICENSE for details.

Made with ❤️ by Vivek M. Agarwal