/PHP-Component-Spider

This is a web crawler developed for internal use of Brand Websites

Primary LanguagePHPMIT LicenseMIT

PHP Component Spider

License: MIT CodeFactor Latest Stable Version License PHAR Build

This PHP Component Spider is designed to scrape websites for specific components or search criteria defined by XPath filters. It uses the PHPScraper library to fetch and process web pages, and the League\Csv library to log the results in CSV files. This tool is easy to extend with custom XPath filters to meet various scraping needs.

Features

  • Scrape websites for specific components or text based on XPath filters.
  • Log results into CSV files for further analysis.
  • Configurable timeout and maximum redirects.
  • Easy to extend with additional filters.

Requirements

  • PHP 8.1 or higher
  • Composer

Build & Run from Source Code

  1. Clone the repository:
git clone https://github.com/marioungui/PHP-Component-Spider.git
  1. Navigate to the project directory:
cd PHP-Component-Spider
  1. Install the dependencies using Composer:
composer install
  1. Build the Phar package:
php -d phar.readonly=0 phar-creator.php
  1. Run the batch spider.bat
  2. Follow the on-screen instructions to select the component to search for and the domain to scrape.

Filters

The filters are defined in filters.php and use XPath to identify specific components on the web pages. Here are the current filters available:

Component Index Filter
MVP Block 1 //*[@class='mvp-block']
Smart Question Search Engine Block 2 //*[@class='sqe-block']
Related Articles Block 3 //h2[text()='Artigos relacionados' or text()='Artigos Relacionados' or text()='Articulos Relacionados' or text()='Articulos relacionados' ]
Related Products Block 4 //h2[text()='Produtos Relacionados' or text()='Produtos Relacionados' or text()='Productos relacionados' or text()='Productos Relacionados']
Brands Block 5 //*[starts-with(@id, 'brands_block')]/@id
Stages Block 6 //*[starts-with(@id, 'stages_block')]
String Search 7 //*[contains(text(),'word')]
Action Bar 8 //div[contains(@class, 'action-bar__wrapper')]
Links Containing 9 //a[contains(@href, 'word')]
Stages Block using From Library 10 //div[contains(@class, 'paragraph--type--stages-block')]//div[contains(@class, 'grid-col-10')]

Extending with Custom Filters

Extending the tool with new filters is simple:

  1. Open the filters.php file.
  2. Add a new case in the switch statement with your component name or index.
  3. Define the $component and $filter variables with your custom XPath.

Example:

case 'new-component':
case 11:
    $component = "New Component";
    $filter = "//*[@class='new-component-class']";
    break;

Contributing

Feel free to submit issues or pull requests if you have any improvements or new features you'd like to add.

License

This project is licensed under the MIT License.