/web-scraper

:microscope: Scraper for the web

Web Scraper

Intro

The web is a messy place, so scraping it to extract structured data is a difficult process. The current application uses DOMDocument and XPath to traverse XML resources. There are other options for extracting data from these resources, such as regular expressions.

Currently, IMDB movies are supported as a model.
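
As a rough illustration of the DOMDocument and XPath approach mentioned above, here is a minimal sketch. The HTML snippet, the `extractFirst` function and the XPath expressions are made up for the example; they are not taken from this project's code or from IMDB's actual markup.

```php
<?php
// Hypothetical HTML standing in for a downloaded movie page.
$html = <<<'HTML'
<html>
  <head><title>Example Movie (1999)</title></head>
  <body>
    <h1 class="title">Example Movie</h1>
    <span class="rating">8.7</span>
  </body>
</html>
HTML;

/**
 * Return the trimmed text of the first node matching an XPath expression,
 * or null when nothing matches. (Illustrative helper, not part of the project.)
 */
function extractFirst(string $html, string $expression): ?string
{
    $document = new DOMDocument();

    // Real-world markup is rarely valid, so collect libxml parse errors
    // internally instead of letting them surface as PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);
    $nodes = $xpath->query($expression);

    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return trim($nodes->item(0)->textContent);
}

$title   = extractFirst($html, '//h1[@class="title"]');
$rating  = extractFirst($html, '//span[@class="rating"]');
$missing = extractFirst($html, '//div[@id="does-not-exist"]');
```

Returning `null` for a missing node keeps the caller in charge of deciding whether an absent field is an error, which tends to matter when page layouts change.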


Development

  • Using TDD approach (Unit & Feature testing)
  • Using RESTful API
  • Using polymorphism for the Link DB entity
  • Using Scraper Helper class for the sake of dependency injection (parseUrl, downloadResource, processHtml)
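
The dependency injection idea in the last bullet could look roughly like the sketch below. Only the three method names (parseUrl, downloadResource, processHtml) come from this README; the interface, the signatures, the fake implementation and the `MovieScraper` consumer are assumptions made for illustration.

```php
<?php
// Hypothetical contract: method names are from the README, signatures are assumed.
interface ScraperHelper
{
    public function parseUrl(string $url): array;
    public function downloadResource(string $url): string;
    public function processHtml(string $html): array;
}

// Test double that never touches the network.
final class FakeScraperHelper implements ScraperHelper
{
    public function parseUrl(string $url): array
    {
        // parse_url() returns false on seriously malformed input.
        return parse_url($url) ?: [];
    }

    public function downloadResource(string $url): string
    {
        // Canned response instead of an HTTP request.
        return '<html><head><title>Example Movie</title></head></html>';
    }

    public function processHtml(string $html): array
    {
        preg_match('/<title>(.*?)<\/title>/', $html, $matches);

        return ['title' => $matches[1] ?? null];
    }
}

// Hypothetical consumer: it receives the helper through its constructor,
// so tests can pass a fake and production code can pass a real one.
final class MovieScraper
{
    private ScraperHelper $helper;

    public function __construct(ScraperHelper $helper)
    {
        $this->helper = $helper;
    }

    public function scrape(string $url): array
    {
        $parts = $this->helper->parseUrl($url);
        if (!isset($parts['host'])) {
            throw new InvalidArgumentException("Not an absolute URL: {$url}");
        }

        $html = $this->helper->downloadResource($url);

        return $this->helper->processHtml($html);
    }
}

$scraper = new MovieScraper(new FakeScraperHelper());
$movie   = $scraper->scrape('https://www.imdb.com/title/tt0000000/');
```

Because the scraper depends only on the interface, unit tests can run without network access, which fits the TDD approach listed above.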

Documenting

Using PHPDoc comments.


Testing

Using PHPUnit. Run from the project root:

  • ./vendor/bin/phpunit

Continuous Integration

Using Travis-CI: config file is ./.travis.yml


Deployment

  • system requirements
    • cURL is required as well
  • install composer
  • clone the repo
  • copy .env.example to .env, then configure the DB info inside it (name, username, password, and the driver if you want to use something other than MySQL)
  • run:
    • composer install
    • php artisan key:generate
    • php artisan migrate
    • php artisan serve
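
Before running the commands above, a quick preflight check can confirm the requirements are in place. This is a minimal sketch that assumes the cURL requirement refers to PHP's `curl` extension and that the default MySQL setup uses `pdo_mysql`; the `missingExtensions` function is hypothetical and not part of the project.

```php
<?php
/**
 * Return the names from $required that are not loaded in this PHP runtime.
 * (Illustrative helper, not part of the project.)
 */
function missingExtensions(array $required): array
{
    return array_values(array_filter(
        $required,
        static fn (string $name): bool => !extension_loaded($name)
    ));
}

// 'curl' is named in the README; 'pdo_mysql' is an assumption for the MySQL default.
$missing = missingExtensions(['curl', 'pdo_mysql']);

echo $missing === []
    ? "All required extensions are loaded.\n"
    : 'Missing extensions: ' . implode(', ', $missing) . "\n";
```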

Usage

There are currently 5 actions (an OpenAPI specification is on the roadmap):

  • List the IMDB movies => GET {host:port}/api/imdb-movie
  • Create an IMDB movie => POST {host:port}/api/imdb-movie url={imdb_movie_url}
  • Get a specific IMDB movie => GET {host:port}/api/imdb-movie/{id}
  • Update an existing IMDB movie => PATCH {host:port}/api/imdb-movie/{id} url={imdb_movie_url}
  • Delete an IMDB movie => DELETE {host:port}/api/imdb-movie/{id}

Note that the current version does not implement OAuth or any other authentication system, so requests can be sent without authenticating.
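
The five actions above could be called from PHP roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the base URL `http://localhost:8000` is a placeholder for {host:port}, and sending `url` as a form-encoded field is a guess at what the API accepts.

```php
<?php
/**
 * Describe the HTTP request for one of the five actions as
 * [method, url, fields]. (Illustrative helper, not part of the project.)
 */
function imdbMovieRequest(string $action, string $base, ?int $id = null, ?string $movieUrl = null): array
{
    $collection = rtrim($base, '/') . '/api/imdb-movie';
    $item       = $collection . '/' . $id;

    switch ($action) {
        case 'list':
            return ['GET', $collection, []];
        case 'create':
            return ['POST', $collection, ['url' => $movieUrl]];
        case 'show':
            return ['GET', $item, []];
        case 'update':
            return ['PATCH', $item, ['url' => $movieUrl]];
        case 'delete':
            return ['DELETE', $item, []];
    }

    throw new InvalidArgumentException("Unknown action: {$action}");
}

/**
 * Send a described request with the cURL extension and return the raw body.
 * No authentication header is set, since the API currently has none.
 */
function sendRequest(array $request): string
{
    [$method, $url, $fields] = $request;

    $handle = curl_init($url);
    curl_setopt_array($handle, [
        CURLOPT_CUSTOMREQUEST  => $method,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER     => ['Accept: application/json'],
    ]);
    if ($fields !== []) {
        // Assumption: the API reads "url" from a form-encoded body.
        curl_setopt($handle, CURLOPT_POSTFIELDS, http_build_query($fields));
    }

    $body  = curl_exec($handle);
    $error = curl_error($handle);
    curl_close($handle);

    if ($body === false) {
        throw new RuntimeException("Request failed: {$error}");
    }

    return $body;
}

// Build (but do not send) an update request; the IMDB URL is a placeholder.
$request = imdbMovieRequest('update', 'http://localhost:8000', 3, 'https://www.imdb.com/title/tt0000000/');
```

Passing `$request` to `sendRequest()` performs the call once the application is running, for example after `php artisan serve`.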