CaptainScraper is a NodeJs web scraper framework. It allows developers to build simple or complex scrapers in very little time. Take the time to discover its features!
To install CaptainScraper manually, first install the following:
- NodeJs (>=5)
- MongoDb
- Typescript (npm)
- ts-node (npm)
Clone the repository and install the required modules:
git clone git@github.com:andrewdsilva/CaptainScraper.git
cd vendor/
npm install
cd ..
Alternatively, to run CaptainScraper with Docker, install the following:
- Docker: https://docs.docker.com/engine/installation/
- Docker Compose: https://docs.docker.com/compose/install/
Build an image of CaptainScraper from the Dockerfile. At the command line, make sure the current directory is the root of the CaptainScraper project, where docker-compose.yml is located:
docker-compose build
Now you can open a terminal in the container with Docker Compose:
docker-compose run app bash
# Manually start mongo database
bash app/startDatabase.sh
# Execute a script located at /src/Sample/Allocine/Controller/AllocineCinemas.ts
ts-node app/console script:run Sample/Allocine/AllocineCinemas
# Equivalent
ts-node app/console script:run Sample/Allocine/Controller/AllocineCinemas
# Execute a script using docker-compose
docker-compose run app script:run Sample/Allocine/AllocineCinemas
# Use docker-compose in dev environment (with no entrypoint)
docker-compose -f docker-compose.dev.yml run app bash
A controller is a class with an execute function that contains the main logic of your program. Every scraper has a controller. Here is an example of a controller declaration:
import { Controller } from '../../../../app/importer';

class MyFirstController extends Controller {

    public execute(): void {
        console.log( 'Hello world!' );
    }

}

export { MyFirstController };
A parser is a class you create that reads information from a web page. There are several kinds of parsers; for example, HtmlParser lets you parse the page with cheerio, a jQuery-like library.
import { HtmlParser } from '../../../../app/importer';

class MyFirstParser extends HtmlParser {

    public name: string = 'MyFirstParser';

    public parse( $: any, parameters: any ): void {
        /* Finding users on the page */
        $( 'div.user' ).each( function() {
            console.log( 'User found: ' + $( this ).text() );
        } );
    }

}

export { MyFirstParser };
To load a page, we use the addPage function of the Scraper module. In a controller, you can get a module like this:
let scraperModule: any = this.get( 'Scraper' );
In a parser, you can get the Scraper module through the parent attribute of the class. This attribute references the Scraper instance that called the parser.
let scraperModule: any = this.parent;
Then you can call the addPage function with some parameters. This operation will be queued!
let listPageParameters: any = {
    url   : 'https://www.google.fr',
    parser: MyParser
};

scraperModule.addPage( listPageParameters );
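Putting it together, here is a minimal sketch of a controller that queues a page for the MyFirstParser defined above (the URL is a placeholder, and the relative import paths depend on your folder depth):

import { Controller } from '../../../../app/importer';
import { MyFirstParser } from '../Parser/MyFirstParser';

class UserListController extends Controller {

    public execute(): void {
        /* Get the Scraper module from the controller */
        let scraperModule: any = this.get( 'Scraper' );

        /* Queue the page; MyFirstParser.parse() will be called with its content */
        scraperModule.addPage( {
            url   : 'https://www.example.com/users',
            parser: MyFirstParser
        } );
    }

}

export { UserListController };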
To handle a form, make sure the FormHandler module is imported in app/config/config.json.
First, load the page that contains the form you want to submit. Then, in the parser, you can get the FormHandler module like this:
let formHandler: any = this.get('FormHandler');
Use the getForm function from the formHandler module to create a new Form object based on the form present in the page. The Form is automatically filled with all the inputs present in the HTML form.
let form: any = formHandler.getForm( '.auth-form form', $ );
Then you can set your values in the Form like this:
form.setInput( 'login', 'Robert1234' );
Call the submit function from the formHandler module to send your form. The second parameter is the Parser that will be called with the server answer.
formHandler.submit( form, LoggedParser );
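As a recap, here is a sketch of a complete login parser; the .auth-form selector, the input names, and the LoggedParser class are assumptions about your target site and project:

import { HtmlParser } from '../../../../app/importer';
import { LoggedParser } from './LoggedParser';

class LoginParser extends HtmlParser {

    public name: string = 'LoginParser';

    public parse( $: any, parameters: any ): void {
        /* Get the FormHandler module (enabled in app/config/config.json) */
        let formHandler: any = this.get( 'FormHandler' );

        /* Build a Form from the page, pre-filled with the HTML form's inputs */
        let form: any = formHandler.getForm( '.auth-form form', $ );

        /* Override the fields we care about (input names are assumptions) */
        form.setInput( 'login', 'Robert1234' );
        form.setInput( 'password', 'MyPassword' );

        /* Submit the form; LoggedParser will be called with the server answer */
        formHandler.submit( form, LoggedParser );
    }

}

export { LoginParser };

Remember to set the Scraper module's enableCookies parameter to true, since cookies are necessary for form handling (see the parameter list below).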
This is a suggestion for organizing your project, with separate folders for controllers and parsers.
captainscraper/
├─ app/
├─ src/
│  └─ MyProject/
│     ├─ Controller/
│     │  └─ MyController.ts
│     └─ Parser/
│        ├─ MyFirstParser.ts
│        └─ MySecondParser.ts
└─ vendor/
You can create your own custom parameters file and access its values from your scripts. First, create the JSON file app/config/parameters.json and initialize it with a JSON object:
{
    "sample" : {
        "github" : {
            "login"    : "MyLogin",
            "password" : "MyPassword"
        }
    }
}
Then call the get function of the Parameters class to read a value:
Parameters.get('sample').github.password
You can import the Parameters class in your Controller or Parser like this:
import { Parameters } from '../../../../app/importer';
I know it's a little bit tricky; it will be simplified.
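For example, a controller could read the GitHub credentials defined above; this is a sketch assuming the parameters.json shown earlier:

import { Controller, Parameters } from '../../../../app/importer';

class GithubController extends Controller {

    public execute(): void {
        /* Read values from app/config/parameters.json */
        let github: any = Parameters.get( 'sample' ).github;

        console.log( 'Logging in as ' + github.login );
    }

}

export { GithubController };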
Module parameters can be modified like this:
let scraperModule: any = this.get('Scraper');
scraperModule.param.websiteDomain = 'https://github.com';
Scraper module parameters:
- websiteDomain: domain name of the website you want to scrape; this parameter is important because it is used to complete relative URLs
- basicAuth: if your website needs basic authentication, set this parameter like this: user:password
- enableCookies: enable cookies like a real browser, necessary for form handling (default: false)
- frequency: maximum page loading frequency
- maxLoadingPages: maximum number of pages loaded at the same time
- maxFailPerPage: number of times loading the same page can fail before giving up
- timeout: request timeout in milliseconds
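For example, inside a controller you might tune several of these parameters before queuing pages (the values below are illustrative):

let scraperModule: any = this.get( 'Scraper' );

/* Used to complete relative URLs found on pages */
scraperModule.param.websiteDomain = 'https://github.com';

/* Necessary for form handling (default: false) */
scraperModule.param.enableCookies = true;

/* Give up on a page after 3 failed loads; time out requests after 10 seconds */
scraperModule.param.maxFailPerPage = 3;
scraperModule.param.timeout        = 10000;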
Parameters for the addPage function:
- url: requested URL
- header: request headers (Object)
- param: data transmitted to the Parser
- parser: Parser class used for this page
- noDoublon: check for duplicate requests (default: false)
- form: form data for POST requests
- method: request method (GET, POST...), default GET
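Here is a sketch of an addPage call combining several of these parameters; the URL, header, and param payload are illustrative, and MyDetailParser is a hypothetical Parser class:

let detailPageParameters: any = {
    url      : '/cinemas/paris',            /* relative URL, completed with websiteDomain */
    header   : { 'Accept-Language': 'fr' }, /* extra request headers */
    param    : { city: 'Paris' },           /* data forwarded to the Parser */
    parser   : MyDetailParser,
    noDoublon: true,                        /* check for duplicate requests */
    method   : 'GET'
};

scraperModule.addPage( detailPageParameters );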
FormHandler methods:
- createEmptyForm(): create and return an empty Form
- getForm( selector, $ ): create a Form from HTML
- submit( form, parser ): submit a Form and call a Parser
Form object methods:
- setInput( key, value ): set the value of key in the Form

Form object parameters:
- inputs: all inputs and values of the form
- method: form method (GET, POST...)
- action: form action URL
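You can also build a form from scratch with createEmptyForm, then set its method, action, and inputs yourself. In this sketch, the endpoint, field name, and SearchResultParser class are assumptions:

let formHandler: any = this.get( 'FormHandler' );

/* Create an empty Form and fill it manually */
let form: any = formHandler.createEmptyForm();
form.method = 'POST';
form.action = '/search'; /* hypothetical endpoint */

form.setInput( 'query', 'cinema' );

formHandler.submit( form, SearchResultParser );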
Sample usage of the Logs module:
let logger: any = this.get('Logs');
logger.log( 'My log !' );
Methods:
- log( message, [ display = true ] ): save your log in a file and display it on the console
Logs are saved in app/logs/{ CONTROLLER_NAME }.log.
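To save a message in the file without echoing it to the console, pass false as the second argument:

let logger: any = this.get( 'Logs' );

/* Written to app/logs/{ CONTROLLER_NAME }.log but not displayed */
logger.log( 'Silent log entry', false );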