CrawlPy is a Python package for scraping web pages and YouTube videos.
- Web Scraping: Easily scrape web pages using requests and BeautifulSoup.
- YouTube Scraping: Scrape YouTube videos and download them.
- Audio Transcription: Transcribe audio from videos using the Deepgram API.
- Selenium Support: Scrape websites that require JavaScript rendering.
You can install CrawlPy using pip:
pip install crawlpy
To start using the WebScraper, create an instance of it. You can customize headers, timeout, retries, and other settings.
from crawlpy import WebScraper
scraper = WebScraper(headers={'User-Agent': 'Mozilla/5.0'}, timeout=15, retries=5, use_selenium=False)
You can fetch the text content of a single URL or multiple URLs. For a single URL the method returns the page text; for a list of URLs it returns the text of each page.
# Single URL
text = scraper.get_page_text("https://google.com")
print(text)
# Multiple URLs
texts = scraper.get_page_text(["https://google.com", "https://wikipedia.com"])
for text in texts:
    print(text)
You can save the scraped content to a file in different formats: txt, json, or csv. You can also provide column names for the CSV format.
# Save as plain text
scraper.save_to_file("https://google.com", "output.txt", file_type='txt')
# Save as JSON
scraper.save_to_file(["https://google.com", "https://wikipedia.com"], "output.json", file_type='json')
# Save as CSV with column names
scraper.save_to_file(["https://google.com", "https://wikipedia.com"], "output.csv", file_type='csv', column_names=['Content'])
You can extract content from specific HTML tags. The function returns the text content of all occurrences of the specified tag.
tags_content = scraper.get_tag_content("https://google.com", "p")
for content in tags_content:
    print(content)
You can also extract all the links from a page:
links = scraper.extract_links("https://example.com")
for link in links:
    print(link)
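If some of the returned links are relative, a common follow-up is to resolve them against the page URL with urllib.parse.urljoin. This sketch assumes extract_links returns plain strings, as the loop above suggests:
from urllib.parse import urljoin

base_url = "https://example.com"
absolute_links = [urljoin(base_url, link) for link in links]  # resolve relative paths against the base URL
for link in absolute_links:
    print(link)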
You can also capture a screenshot of a page:
screenshot_file = scraper.take_screenshot("https://google.com", filename="screenshot.png")
If you need to scrape content from websites that require JavaScript rendering, enable Selenium when initializing the WebScraper.
scraper = WebScraper(use_selenium=True)
# Now all scraping functions will use Selenium
text = scraper.get_page_text("https://example.com")
print(text)
The YouTube scraper in CrawlPy allows you to download YouTube videos and transcribe their audio content using the Deepgram API. To use this functionality, make sure CrawlPy is installed and your environment is set up with the required API keys.
pip install crawlpy
You need to set up your Deepgram API key as an environment variable. Create a .env file in your project directory and add your API key:
DEEPGRAM_API_KEY=your_deepgram_api_key
Alternatively, you can set the key directly in your code:
import os
os.environ["DEEPGRAM_API_KEY"] = "your_deepgram_api_key"
Create a YouTubeScraper instance and download a video:
from crawlpy import YouTubeScraper
youtube_scraper = YouTubeScraper()
video_url = "https://www.youtube.com/watch?v=oHg5SJYRHA0"
file_path = youtube_scraper.download_video(video_url)
print(f"Video downloaded to {file_path}")
You can transcribe the audio content of a YouTube video. The transcribe_video method accepts either a URL or the path to a previously downloaded video file.
# Transcribe using a video URL
transcript = youtube_scraper.transcribe_video(video_url)
print("Transcript:", transcript)
# Optionally, save the transcript to a file
transcript = youtube_scraper.transcribe_video(video_url, save=True, filename="transcript.txt")
print(f"Transcript saved to {transcript}")