Welcome to the GitBook Scraper project! This tool is designed to scrape content from GitBook sites and save it in Markdown format. 📄✨

Dependencies:
- Puppeteer: For headless browser automation.
- fs-extra: For file system operations.
- path: For handling and transforming file paths.
First, clone the repository to your local machine. Then, navigate to the project directory and install the necessary dependencies:
npm install

To run the scraper, execute the following command:
node scraper.js

This starts the scraper, which navigates to the configured GitBook site, extracts the content, and saves it to result.md in the project directory.

Project structure:
- scraper.js: The main file that runs the web scraper.

Configuration:
- BASE_URL: The URL of the GitBook site to be scraped.
- OUTPUT_FILE: The path to the file where the scraped content will be saved.
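
As an illustration of the two settings above, here is a minimal sketch of how they might be declared. The example URL and the `BASE_URL` environment-variable override are assumptions made for this sketch; the README only names the constants and the default output file, result.md.

```javascript
const path = require('node:path');

// Hypothetical values: the README names these constants but not their contents.
// The environment-variable override is an assumption, not a project feature.
const BASE_URL = process.env.BASE_URL || 'https://docs.example.com/';

// Resolved against the current working directory, which is the project
// directory when the scraper is started from there.
const OUTPUT_FILE = path.resolve('result.md');

console.log(`Scraping ${BASE_URL} into ${OUTPUT_FILE}`);
```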

How it works:
- Browser Initialization: Launch a headless browser using Puppeteer.
- Navigate to URL: Go to the specified URL and wait until the page is fully loaded.
- Expand Menus: Click on any expandable menu items to ensure all content is visible.
- Collect Links: Gather all the links to the pages within the GitBook site.
- Scrape Content: Visit each page, extract the relevant content, and format it in Markdown.
- Save to File: Write the collected content to the specified output file.
- Close Browser: Shut down the browser once the scraping is complete.
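
The Expand Menus step has no snippet of its own in this README, so the following is only a sketch of how it could look. The function name `expandMenus` and the default selector are hypothetical; the selector in particular depends on the markup of the GitBook site being scraped. It relies on Puppeteer's `page.$$` and `ElementHandle.click`.

```javascript
// Clicks every element matching `selector` so collapsed sidebar groups
// reveal their child links. `page` is a Puppeteer Page (or any object with
// a compatible `$$` method). The selector is an assumption, not taken from
// the project.
async function expandMenus(page, selector = '[aria-expanded="false"]') {
  const toggles = await page.$$(selector);
  let clicked = 0;
  for (const toggle of toggles) {
    try {
      await toggle.click();
      clicked += 1;
    } catch (error) {
      // A toggle may be hidden or detached; skip it and keep going.
      console.error('Could not expand menu item:', error.message);
    }
  }
  return clicked;
}
```

A single pass only opens the toggles present when it runs. If expanding a group reveals further collapsed groups, calling the function again until it returns 0 is one way to reach them.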
Browser Initialization:
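const browser = await puppeteer.launch({
  headless: "new",
  args: ['--no-sandbox', '--disable-setuid-sandbox']
});
// Open the tab that the following snippets refer to as `page`.
const page = await browser.newPage();

Navigating to URL: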
await page.goto(BASE_URL, { waitUntil: 'networkidle0', timeout: 60000 });

Collecting Links:
const links = await page.evaluate(() => {
  const anchors = Array.from(document.querySelectorAll('a'));
  return anchors.map(anchor => anchor.href);
});

Scraping Content:
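
Selecting every `a` element also picks up external links, in-page anchors, and duplicates, so the list usually needs narrowing before each page is visited. The helper below is a sketch of one way to do that; `filterLinks` is a hypothetical name and is not part of the project.

```javascript
// Keeps only links under the site's base URL, drops #fragments, and removes
// duplicates while preserving first-seen order.
function filterLinks(links, baseUrl) {
  const base = new URL(baseUrl);
  const seen = new Set();
  for (const href of links) {
    let url;
    try {
      url = new URL(href);
    } catch {
      continue; // skip empty or malformed hrefs
    }
    if (url.origin !== base.origin) continue;
    if (!url.pathname.startsWith(base.pathname)) continue;
    url.hash = '';
    seen.add(url.href);
  }
  return [...seen];
}
```

Each remaining URL is then visited and its content extracted: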
const content = await page.evaluate(() => {
  return document.querySelector('.page-inner').innerText;
});

Saving to File:
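
Before anything is written, the text from the individual pages has to be combined into one Markdown string. The README does not show how `markdownContent` is built, so the following is only a sketch: the function name `buildMarkdown` and the `{ title, url, text }` page shape are assumptions.

```javascript
// Joins scraped pages into one Markdown document. Each page is assumed to be
// { title, url, text }; that shape is hypothetical, not the project's own.
function buildMarkdown(pages) {
  return pages
    .map(({ title, url, text }) => {
      const body = text.trim();
      return `## ${title}\n\nSource: ${url}\n\n${body}\n`;
    })
    .join('\n---\n\n');
}
```

The assembled string is what gets written: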
await fs.writeFile(OUTPUT_FILE, markdownContent);

If an error occurs during scraping, it will be logged to the console, and the scraper will continue processing the remaining pages.
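
The log-and-continue behavior described above can be sketched as a loop with a `try`/`catch` around each page. `scrapeAll` and the `scrapePage` callback are hypothetical names used for illustration; the project's own loop may be structured differently.

```javascript
// Visits each URL with the supplied `scrapePage` function. A failure on one
// page is logged and recorded, and the loop moves on to the next page.
async function scrapeAll(urls, scrapePage) {
  const pages = [];
  const failed = [];
  for (const url of urls) {
    try {
      pages.push(await scrapePage(url));
    } catch (error) {
      console.error(`Failed to scrape ${url}:`, error);
      failed.push(url);
    }
  }
  return { pages, failed };
}
```

Returning the failed URLs alongside the results makes it possible to retry or report them after the run. The `catch` fragment that follows shows the logging call on its own.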
catch (error) {
  console.error('Unexpected error:', error);
}

If you have any questions or suggestions, feel free to reach out to fastuptime.