A high-performance, enterprise-grade scraping solution designed exclusively for Twitter/X.com data collection. It efficiently handles 10,000+ Twitter users with intelligent batch processing, advanced anti-detection, and rate limiting.
- 🎯 Twitter/X.com Exclusive: Built specifically for the Twitter/X.com platform
- 📈 Massive Scale: Handle 10,000+ Twitter users efficiently
- 🛡️ Anti-Detection: Advanced stealth features to bypass bot detection
- ⚡ High Performance: Concurrent processing with intelligent batching
- 🔄 Rate Limiting: Smart delays and request management
- 📊 Multiple Formats: JSON and CSV output support
- 🔧 Flexible Configuration: JSON configs and command-line options
- 📝 Comprehensive Logging: Detailed progress tracking and error handling
- 🔄 Resume Capability: Continue interrupted scraping sessions
- Installation
- Quick Start
- Usage Examples
- Configuration
- API Reference
- Best Practices
- Troubleshooting
- Contributing
- License
- Python 3.8 or higher
- Chrome browser installed
- Git (for cloning)
git clone https://github.com/AryanVBW/x-scraper.git
cd x-scraper
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
pip install -r requirements.txt
# Test basic scraper
python src/advanced_twitter_scraper.py --username elonmusk --count 3 --method selenium
# Test enterprise scraper
echo "elonmusk" > test_user.txt
python src/enterprise_batch_scraper.py --users test_user.txt --tweet-count 3 --workers 1
# Scrape 10 tweets from a single user
python src/advanced_twitter_scraper.py --username elonmusk --count 10 --method selenium --headless
# Create user list
echo -e "elonmusk\nbillgates\ntim_cook" > users.txt
# Scrape 5 tweets from each user
python src/enterprise_batch_scraper.py --users users.txt --tweet-count 5 --workers 3 --headless
# Use predefined configuration
python src/enterprise_batch_scraper.py --config config/twitter_exclusive_config.json --headless
# Scrape 20 tweets from @elonmusk with visible browser
python src/advanced_twitter_scraper.py \
--username elonmusk \
--count 20 \
--method selenium \
--output-format json
# Scrape 1000 users with 10 tweets each using 5 workers
python src/enterprise_batch_scraper.py \
--users config/twitter_accounts.txt \
--tweet-count 10 \
--workers 5 \
--headless \
--format csv \
--log-level INFO
# Use custom JSON configuration for complex setups
python src/enterprise_batch_scraper.py \
--config config/enterprise_users.json \
--headless \
--delay-min 1.0 \
--delay-max 3.0
python src/advanced_twitter_scraper.py [OPTIONS]
Options:
--username TEXT Twitter username (without @)
--count INTEGER Number of tweets to scrape [default: 10]
--method TEXT Scraping method: selenium [default: selenium]
--headless Run in headless mode
--output-format TEXT Output format: json [default: json]
--delay-min FLOAT Minimum delay between requests [default: 2.0]
--delay-max FLOAT Maximum delay between requests [default: 5.0]
python src/enterprise_batch_scraper.py [OPTIONS]
Options:
--users TEXT Path to file containing usernames
--config TEXT Path to JSON configuration file
--tweet-count INTEGER Number of tweets per user [default: 10]
--workers INTEGER Number of concurrent workers [default: auto]
--headless Run browsers in headless mode
--format TEXT Output format: json, csv [default: json]
--delay-min FLOAT Minimum delay between requests [default: 0.5]
--delay-max FLOAT Maximum delay between requests [default: 1.5]
--log-level TEXT Logging level: DEBUG, INFO, WARNING, ERROR [default: INFO]
{
"users": [
{"username": "elonmusk", "tweet_count": 15},
{"username": "billgates", "tweet_count": 10},
{"username": "tim_cook", "tweet_count": 8}
],
"settings": {
"max_workers": 3,
"headless": true,
"delay_range": [1.0, 2.5],
"output_format": "json",
"log_level": "INFO"
}
}
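For larger runs, this configuration file can be generated programmatically instead of written by hand. A minimal sketch, assuming the same schema as the sample above (the username list here is a hypothetical placeholder for your own source):

```python
import json

# Hypothetical list of target accounts; replace with your own source.
usernames = ["elonmusk", "billgates", "tim_cook"]

config = {
    # One entry per account, using the schema from the sample config above.
    "users": [{"username": name, "tweet_count": 10} for name in usernames],
    "settings": {
        "max_workers": 3,
        "headless": True,
        "delay_range": [1.0, 2.5],
        "output_format": "json",
        "log_level": "INFO",
    },
}

with open("config/enterprise_users.json", "w") as f:
    json.dump(config, f, indent=2)
```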
elonmusk
billgates
tim_cook
jeffbezos
sundarpichai
{
"metadata": {
"total_users": 3,
"successful_scrapes": 3,
"failed_scrapes": 0,
"total_tweets": 25,
"scraped_at": "2025-01-18T10:30:00.000Z"
},
"results": [
{
"username": "elonmusk",
"success": true,
"tweet_count": 10,
"tweets": [
{
"text": "Tweet content here...",
"created_at": "2025-01-18T09:15:00.000Z",
"metrics": {
"replies": 1250,
"retweets": 3400,
"likes": 15600
},
"url": "https://x.com/elonmusk/status/1234567890",
"id": "1234567890",
"hashtags": ["#AI", "#Technology"],
"mentions": ["@openai"]
}
]
}
]
}
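The JSON output can be post-processed with a few lines of Python. A minimal sketch, assuming results were written to `data/batch_results.json` (the output location shown in the project structure below) and follow the schema above:

```python
import json

# Load a batch result file produced by enterprise_batch_scraper.py.
with open("data/batch_results.json") as f:
    data = json.load(f)

meta = data["metadata"]
print(f"Scraped {meta['total_tweets']} tweets from {meta['total_users']} users")

# Find each user's most-liked tweet.
for result in data["results"]:
    if not result["success"] or not result["tweets"]:
        continue
    top = max(result["tweets"], key=lambda t: t["metrics"]["likes"])
    print(f"@{result['username']}: {top['metrics']['likes']} likes ({top['url']})")
```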
The CSV format includes the columns `username`, `tweet_text`, `created_at`, `replies`, `retweets`, `likes`, `url`, `tweet_id`, `hashtags`, and `mentions`.
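For CSV output, the standard library is enough to filter rows. A minimal sketch, assuming a file named `batch_results.csv` with the columns listed above (the exact filename may differ depending on your run):

```python
import csv

# Read the CSV export and keep tweets above an engagement threshold.
with open("batch_results.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    popular = [row for row in reader if int(row["likes"] or 0) > 1000]

for row in popular:
    print(row["username"], row["likes"], row["url"])
```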
- Use JSON Configuration: Better control over individual user settings
- Reasonable Tweet Counts: 5-15 tweets per user to avoid rate limits
- Monitor Logs: Watch for rate limiting and adjust delays accordingly
- Use CSV Format: More efficient for large datasets
- Run During Off-Peak Hours: Better success rates
- Always Use Headless Mode: Faster and more stable
- Small Scale (1-50 users): Default delays (0.5-1.5s) are sufficient
- Medium Scale (50-500 users): Increase delays to 1.0-3.0s
- Large Scale (500+ users): Use 2.0-5.0s delays and fewer workers
- Monitor logs for suspended/private accounts
- Implement retry logic for failed requests (see the sketch after this list)
- Use appropriate worker counts based on system resources
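The scrapers do not document a built-in retry flag, so failed usernames can be retried at the wrapper level. A minimal sketch with exponential backoff; `scrape_user` is a hypothetical stand-in for whatever per-user call you use:

```python
import time

def with_retries(func, *args, attempts=3, base_delay=2.0, **kwargs):
    """Call func, retrying with exponential backoff on any exception."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as exc:  # narrow this to the errors you actually see
            if attempt == attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1))
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)

# Usage (hypothetical): result = with_retries(scrape_user, "elonmusk", tweet_count=5)
```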
# Solution: Check if account exists and is public
# Try with a known public account first
python src/advanced_twitter_scraper.py --username elonmusk --count 3
# Solution: Update Chrome and reinstall webdriver-manager
pip uninstall webdriver-manager
pip install webdriver-manager
# Solution: Increase delays and reduce workers
python src/enterprise_batch_scraper.py --users users.txt --delay-min 2.0 --delay-max 5.0 --workers 1
# Solution: Reduce batch size and workers
python src/enterprise_batch_scraper.py --users users.txt --workers 2 --tweet-count 5
# Enable debug logging for detailed information
python src/enterprise_batch_scraper.py --users users.txt --log-level DEBUG
x-scraper/
├── src/
│ ├── advanced_twitter_scraper.py # Single-user scraper
│ └── enterprise_batch_scraper.py # Batch scraper for multiple users
├── config/
│ ├── twitter_exclusive_config.json # Sample configuration
│ ├── enterprise_users.json # Enterprise user list
│ ├── twitter_accounts.txt # Simple user list
│ └── users.json # JSON user configuration
├── data/
│ └── batch_results.json # Output directory
├── docs/
│ ├── README.md # Additional documentation
│ └── SCRAPING_GUIDE.md # Detailed scraping guide
├── requirements.txt # Python dependencies
├── .env.example # Environment variables template
└── README.md # This file
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
# Clone your fork
git clone https://github.com/yourusername/x-scraper.git
cd x-scraper
# Create development environment
python -m venv dev-env
source dev-env/bin/activate # On Windows: dev-env\Scripts\activate
pip install -r requirements.txt
# Run tests
python -m pytest tests/ # If tests are available
- Respect Rate Limits: Don't overwhelm Twitter's servers
- Public Data Only: Only scrape publicly available tweets
- Terms of Service: Ensure compliance with Twitter's ToS
- Data Privacy: Handle scraped data responsibly
- Attribution: Credit original tweet authors when using data
This project is licensed under the MIT License - see the LICENSE file for details.
- Selenium for web automation
- WebDriver Manager for driver management
- BeautifulSoup for HTML parsing
If you encounter any issues or have questions:
- Check the Troubleshooting section
- Search existing GitHub Issues
- Create a new issue with detailed information
⭐ Star this repository if you find it helpful!
🔗 Connect with us: GitHub | Issues
Last updated: January 2025