🐦 X-Scraper - Enterprise Twitter/X.com Data Collection Tool

A high-performance, enterprise-grade scraping solution built exclusively for Twitter/X.com data collection. It efficiently handles 10,000+ Twitter users with intelligent batch processing, advanced anti-detection, and rate limiting.

🚀 Features

  • 🎯 Twitter/X.com Exclusive: Built specifically for the Twitter/X.com platform
  • 📈 Massive Scale: Handles 10,000+ Twitter users efficiently
  • 🛡️ Anti-Detection: Advanced stealth features to bypass bot detection
  • ⚡ High Performance: Concurrent processing with intelligent batching
  • 🔄 Rate Limiting: Smart delays and request management
  • 📊 Multiple Formats: JSON and CSV output support
  • 🔧 Flexible Configuration: JSON configs and command-line options
  • 📝 Comprehensive Logging: Detailed progress tracking and error handling
  • 🔄 Resume Capability: Continue interrupted scraping sessions

📋 Table of Contents

  • 🚀 Features
  • 🛠️ Installation
  • 🚀 Quick Start
  • 📖 Usage Examples
  • ⚙️ Configuration
  • 📊 Output Formats
  • 🎯 Best Practices
  • 🔧 Troubleshooting
  • 📁 Project Structure
  • 🤝 Contributing
  • ⚠️ Legal and Ethical Considerations
  • 📄 License
  • 📞 Support

🛠️ Installation

Prerequisites

  • Python 3.8 or higher
  • Chrome browser installed
  • Git (for cloning)

Step 1: Clone the Repository

git clone https://github.com/AryanVBW/x-scraper.git
cd x-scraper

Step 2: Create Virtual Environment

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

Step 3: Install Dependencies

pip install -r requirements.txt

Step 4: Verify Installation

# Test basic scraper
python src/advanced_twitter_scraper.py --username elonmusk --count 3 --method selenium

# Test enterprise scraper
echo "elonmusk" > test_user.txt
python src/enterprise_batch_scraper.py --users test_user.txt --tweet-count 3 --workers 1

🚀 Quick Start

Single User Scraping

# Scrape 10 tweets from a single user
python src/advanced_twitter_scraper.py --username elonmusk --count 10 --method selenium --headless

Batch Scraping (Multiple Users)

# Create user list
echo -e "elonmusk\nbillgates\ntim_cook" > users.txt

# Scrape 5 tweets from each user
python src/enterprise_batch_scraper.py --users users.txt --tweet-count 5 --workers 3 --headless

Using Configuration Files

# Use predefined configuration
python src/enterprise_batch_scraper.py --config config/twitter_exclusive_config.json --headless

📖 Usage Examples

Example 1: Basic Single User Scraping

# Scrape 20 tweets from @elonmusk with visible browser
python src/advanced_twitter_scraper.py \
  --username elonmusk \
  --count 20 \
  --method selenium \
  --output-format json

Example 2: Enterprise Batch Processing

# Scrape 1000 users with 10 tweets each using 5 workers
python src/enterprise_batch_scraper.py \
  --users config/twitter_accounts.txt \
  --tweet-count 10 \
  --workers 5 \
  --headless \
  --format csv \
  --log-level INFO

Example 3: Custom Configuration

# Use custom JSON configuration for complex setups
python src/enterprise_batch_scraper.py \
  --config config/enterprise_users.json \
  --headless \
  --delay-min 1.0 \
  --delay-max 3.0

⚙️ Configuration

Command Line Options

Advanced Twitter Scraper

python src/advanced_twitter_scraper.py [OPTIONS]

Options:
  --username TEXT        Twitter username (without @)
  --count INTEGER        Number of tweets to scrape [default: 10]
  --method TEXT          Scraping method: selenium [default: selenium]
  --headless             Run in headless mode
  --output-format TEXT   Output format: json [default: json]
  --delay-min FLOAT      Minimum delay between requests [default: 2.0]
  --delay-max FLOAT      Maximum delay between requests [default: 5.0]

Enterprise Batch Scraper

python src/enterprise_batch_scraper.py [OPTIONS]

Options:
  --users TEXT           Path to file containing usernames
  --config TEXT          Path to JSON configuration file
  --tweet-count INTEGER  Number of tweets per user [default: 10]
  --workers INTEGER      Number of concurrent workers [default: auto]
  --headless             Run browsers in headless mode
  --format TEXT          Output format: json, csv [default: json]
  --delay-min FLOAT      Minimum delay between requests [default: 0.5]
  --delay-max FLOAT      Maximum delay between requests [default: 1.5]
  --log-level TEXT       Logging level: DEBUG, INFO, WARNING, ERROR [default: INFO]

Configuration Files

JSON Configuration Example

{
  "users": [
    {"username": "elonmusk", "tweet_count": 15},
    {"username": "billgates", "tweet_count": 10},
    {"username": "tim_cook", "tweet_count": 8}
  ],
  "settings": {
    "max_workers": 3,
    "headless": true,
    "delay_range": [1.0, 2.5],
    "output_format": "json",
    "log_level": "INFO"
  }
}

Text File Format

elonmusk
billgates
tim_cook
jeffbezos
sundarpichai
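
For very large runs, both input formats can be generated programmatically rather than by hand. A minimal sketch (the username list and output paths are placeholders, not files shipped with this repo):

# build_inputs.py - hypothetical helper, not part of this repository
import json

usernames = ["elonmusk", "billgates", "tim_cook"]  # replace with your own list

# Plain-text user list: one username per line
with open("users.txt", "w") as f:
    f.write("\n".join(usernames) + "\n")

# JSON configuration matching the schema documented above
config = {
    "users": [{"username": u, "tweet_count": 10} for u in usernames],
    "settings": {
        "max_workers": 3,
        "headless": True,
        "delay_range": [1.0, 2.5],
        "output_format": "json",
        "log_level": "INFO",
    },
}
with open("users.json", "w") as f:
    json.dump(config, f, indent=2)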

📊 Output Formats

JSON Output Structure

{
  "metadata": {
    "total_users": 3,
    "successful_scrapes": 3,
    "failed_scrapes": 0,
    "total_tweets": 25,
    "scraped_at": "2025-01-18T10:30:00.000Z"
  },
  "results": [
    {
      "username": "elonmusk",
      "success": true,
      "tweet_count": 10,
      "tweets": [
        {
          "text": "Tweet content here...",
          "created_at": "2025-01-18T09:15:00.000Z",
          "metrics": {
            "replies": 1250,
            "retweets": 3400,
            "likes": 15600
          },
          "url": "https://x.com/elonmusk/status/1234567890",
          "id": "1234567890",
          "hashtags": ["#AI", "#Technology"],
          "mentions": ["@openai"]
        }
      ]
    }
  ]
}
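
Since the output is plain JSON, results can be post-processed in a few lines of Python. A minimal sketch (the input path assumes the default data/batch_results.json location shown in the project structure below):

# summarize_results.py - hypothetical post-processing sketch
import json

with open("data/batch_results.json") as f:
    data = json.load(f)

meta = data["metadata"]
print(f"{meta['total_tweets']} tweets from {meta['total_users']} users "
      f"({meta['failed_scrapes']} failures)")

# Walk every successful user and print basic engagement info per tweet
for result in data["results"]:
    if not result["success"]:
        continue
    for tweet in result["tweets"]:
        print(f"@{result['username']}: {tweet['metrics']['likes']} likes - {tweet['url']}")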

CSV Output

The CSV format includes columns: username, tweet_text, created_at, replies, retweets, likes, url, tweet_id, hashtags, mentions.
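
These rows can be consumed directly with Python's standard csv module. A minimal sketch (the file name is an assumption; substitute your actual output path):

# read_csv_output.py - hypothetical reader sketch
import csv

with open("batch_results.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # Column names follow the CSV layout documented above
        print(row["username"], row["likes"], row["url"])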

🎯 Best Practices

For High Volume Scraping (1000+ users)

  1. Use JSON Configuration: Better control over individual user settings
  2. Reasonable Tweet Counts: 5-15 tweets per user to avoid rate limits
  3. Monitor Logs: Watch for rate limiting and adjust delays accordingly
  4. Use CSV Format: More efficient for large datasets
  5. Run During Off-Peak Hours: Better success rates
  6. Always Use Headless Mode: Faster and more stable

Rate Limiting Guidelines

  • Small Scale (1-50 users): Default delays (0.5-1.5s) are sufficient
  • Medium Scale (50-500 users): Increase delays to 1.0-3.0s
  • Large Scale (500+ users): Use 2.0-5.0s delays and fewer workers

Error Handling

  • Monitor logs for suspended/private accounts
  • Implement retry logic for failed requests (a minimal sketch follows this list)
  • Use appropriate worker counts based on system resources
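
A minimal retry sketch with exponential backoff is shown below; scrape_user is a hypothetical stand-in for whatever scraping call you are wrapping, not an API exported by this project:

# retry_sketch.py - hypothetical retry wrapper, not part of this repository
import logging
import random
import time

def scrape_with_retry(scrape_user, username, attempts=3):
    """Call scrape_user(username), retrying failures with backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return scrape_user(username)
        except Exception as exc:
            logging.warning("Attempt %d for @%s failed: %s", attempt, username, exc)
            if attempt == attempts:
                raise
            # Exponential backoff plus jitter so retries don't hammer the site
            time.sleep(2 ** attempt + random.uniform(0, 1))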

🔧 Troubleshooting

Common Issues

1. "No tweets found" Error

# Solution: Check if account exists and is public
# Try with a known public account first
python src/advanced_twitter_scraper.py --username elonmusk --count 3

2. Chrome Driver Issues

# Solution: Update Chrome and reinstall webdriver-manager
pip uninstall webdriver-manager
pip install webdriver-manager

3. Rate Limiting

# Solution: Increase delays and reduce workers
python src/enterprise_batch_scraper.py --users users.txt --delay-min 2.0 --delay-max 5.0 --workers 1

4. Memory Issues

# Solution: Reduce batch size and workers
python src/enterprise_batch_scraper.py --users users.txt --workers 2 --tweet-count 5

Debug Mode

# Enable debug logging for detailed information
python src/enterprise_batch_scraper.py --users users.txt --log-level DEBUG

📁 Project Structure

x-scraper/
├── src/
│   ├── advanced_twitter_scraper.py    # Single-user scraper
│   └── enterprise_batch_scraper.py    # Batch scraper for multiple users
├── config/
│   ├── twitter_exclusive_config.json  # Sample configuration
│   ├── enterprise_users.json          # Enterprise user list
│   ├── twitter_accounts.txt           # Simple user list
│   └── users.json                     # JSON user configuration
├── data/
│   └── batch_results.json             # Output directory
├── docs/
│   ├── README.md                      # Additional documentation
│   └── SCRAPING_GUIDE.md              # Detailed scraping guide
├── requirements.txt                    # Python dependencies
├── .env.example                       # Environment variables template
└── README.md                          # This file

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Setup

# Clone your fork
git clone https://github.com/yourusername/x-scraper.git
cd x-scraper

# Create development environment
python -m venv dev-env
source dev-env/bin/activate  # On Windows: dev-env\Scripts\activate
pip install -r requirements.txt

# Run tests
python -m pytest tests/  # If tests are available

⚠️ Legal and Ethical Considerations

  • Respect Rate Limits: Don't overwhelm Twitter's servers
  • Public Data Only: Only scrape publicly available tweets
  • Terms of Service: Ensure compliance with Twitter's ToS
  • Data Privacy: Handle scraped data responsibly
  • Attribution: Credit original tweet authors when using data

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📞 Support

If you encounter any issues or have questions:

  1. Check the Troubleshooting section
  2. Search existing GitHub Issues
  3. Create a new issue with detailed information

⭐ Star this repository if you find it helpful!

🔗 Connect with us: GitHub | Issues


Last updated: January 2025