Web Scraper

Intro
Development
Documenting
Testing
Continuous Integration
Deployment
Usage

Intro

The web is a messy place, so scraping it to find out what's what is a difficult process. The application currently uses DOMDocument and XPath to go through XML resources. There are other options for extracting data from these resources, such as regular expressions.
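As a rough illustration of the DOMDocument and XPath approach mentioned above, here is a minimal sketch. The HTML snippet, the class names, and the XPath expressions are made up for this example; they are not the project's actual selectors, and real IMDB markup differs.

```php
<?php
// Hypothetical HTML snippet; real IMDB markup differs and changes over time.
$html = <<<'HTML'
<html><body>
  <h1 class="title">Example Movie</h1>
  <span class="year">1999</span>
</body></html>
HTML;

$document = new DOMDocument();
// Real-world HTML is rarely well formed, so collect parser warnings
// internally instead of letting them be emitted.
libxml_use_internal_errors(true);
$document->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($document);

// query() returns a DOMNodeList; item(0) is null when nothing matches.
$titleNode = $xpath->query('//h1[@class="title"]')->item(0);
$yearNode  = $xpath->query('//span[@class="year"]')->item(0);

$movie = [
    'title' => $titleNode !== null ? trim($titleNode->textContent) : null,
    'year'  => $yearNode !== null ? (int) trim($yearNode->textContent) : null,
];

print_r($movie);
```

Checking for null before reading textContent keeps the sketch from failing when a page's layout changes and a selector stops matching.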

Currently, IMDB movies are supported as a model.

Development

Using a TDD approach (unit and feature testing)
Using a RESTful API
Using polymorphism for the Link DB entity
Using a Scraper Helper class for the sake of dependency injection (parseUrl, downloadResource, processHtml)
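To show why a helper class helps with dependency injection, here is a minimal sketch. Only the three method names (parseUrl, downloadResource, processHtml) come from this README; the interface, the signatures, the return types, and the MovieImporter and FakeScraperHelper classes are assumptions made for the example and may not match the project's code.

```php
<?php
// Hypothetical contract; only the three method names come from the README.
interface ScraperHelper
{
    public function parseUrl(string $url): array;
    public function downloadResource(string $url): string;
    public function processHtml(string $html): array;
}

// Test double: returns canned HTML so no network access is needed.
final class FakeScraperHelper implements ScraperHelper
{
    public function parseUrl(string $url): array
    {
        // parse_url() returns false for seriously malformed URLs.
        return parse_url($url) ?: [];
    }

    public function downloadResource(string $url): string
    {
        return '<html><head><title>Example Movie</title></head></html>';
    }

    public function processHtml(string $html): array
    {
        $document = new DOMDocument();
        libxml_use_internal_errors(true);
        $document->loadHTML($html);
        libxml_clear_errors();
        $node = (new DOMXPath($document))->query('//title')->item(0);

        return ['title' => $node !== null ? trim($node->textContent) : null];
    }
}

// Hypothetical consumer: it receives the helper instead of constructing it.
final class MovieImporter
{
    private ScraperHelper $helper;

    public function __construct(ScraperHelper $helper)
    {
        $this->helper = $helper;
    }

    public function import(string $url): array
    {
        $parts = $this->helper->parseUrl($url);
        if (!isset($parts['host'])) {
            throw new InvalidArgumentException("Not an absolute URL: $url");
        }
        $html = $this->helper->downloadResource($url);

        return $this->helper->processHtml($html);
    }
}

$importer = new MovieImporter(new FakeScraperHelper());
$result = $importer->import('https://www.imdb.com/title/tt0000000/');
print_r($result);
```

Because the consumer depends only on the contract, a test can pass in a fake that never touches the network, while production code passes in a helper that really downloads the page.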

Documenting

Using PHPDoc comments.

Testing

Using PHPUnit. Run from the project root:

./vendor/bin/phpunit

Continuous Integration

Using Travis CI; the config file is ./.travis.yml

Deployment

Check the system requirements
- cURL is required as well
Install Composer
Clone the repo
Copy .env.example to .env, then configure the DB info inside it (name, username, password, and the driver if you want to use something other than MySQL)
Run:
- composer install
- php artisan key:generate
- php artisan migrate
- php artisan serve

Usage

There are currently 5 actions (an OpenAPI specification is on the roadmap):

List the IMDB movies => GET {host:port}/api/imdb-movie
Create an IMDB movie => POST {host:port}/api/imdb-movie url={imdb_movie_url}
Get a specific IMDB movie => GET {host:port}/api/imdb-movie/{id}
Update an existing IMDB movie => PATCH {host:port}/api/imdb-movie/{id} url={imdb_movie_url}
Delete an IMDB movie => DELETE {host:port}/api/imdb-movie/{id}

Note that the current version does not implement OAuth or any other authentication system, so you can send requests without going through any authentication process.
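The actions listed above could be called from PHP roughly as follows. This is a sketch under assumptions: the base URL assumes the default address of `php artisan serve` (127.0.0.1:8000), the action names and both function names are hypothetical, and sending `url` as form-encoded data is a guess based on the `url={imdb_movie_url}` notation, since the accepted request format is not documented here.

```php
<?php
// Assumed base URL: `php artisan serve` listens on 127.0.0.1:8000 by default.
const BASE_URL = 'http://127.0.0.1:8000';

/**
 * Map an action to the HTTP method and URL listed in this README.
 * The action names are hypothetical labels for this sketch.
 */
function imdbMovieEndpoint(string $action, ?int $id = null): array
{
    $collection = BASE_URL . '/api/imdb-movie';
    $needsId = in_array($action, ['show', 'update', 'delete'], true);
    if ($needsId && $id === null) {
        throw new InvalidArgumentException("Action '$action' requires an id");
    }

    switch ($action) {
        case 'list':
            return ['GET', $collection];
        case 'create':
            return ['POST', $collection];
        case 'show':
            return ['GET', "$collection/$id"];
        case 'update':
            return ['PATCH', "$collection/$id"];
        case 'delete':
            return ['DELETE', "$collection/$id"];
    }

    throw new InvalidArgumentException("Unknown action: $action");
}

/** Send the request with the cURL extension and return the raw response body. */
function callImdbMovieApi(string $action, ?int $id = null, ?string $movieUrl = null): string
{
    [$method, $url] = imdbMovieEndpoint($action, $id);

    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_CUSTOMREQUEST, $method);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    if ($movieUrl !== null) {
        // Assumption: the API accepts form-encoded fields.
        curl_setopt($handle, CURLOPT_POSTFIELDS, http_build_query(['url' => $movieUrl]));
    }

    $body = curl_exec($handle);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($handle));
    }

    return $body;
}

// Example call (requires the server to be running):
// echo callImdbMovieApi('create', null, 'https://www.imdb.com/title/tt0000000/');
```

No authentication header is set, which matches the note above that the current version has no authentication system.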

techieforfun/web-scraper