A lightweight crawler framework

Usage

Add the Maven dependency, then build and start a spider:

<dependency>
    <groupId>io.loli.nekocat</groupId>
    <artifactId>nekocat-core</artifactId>
    <version>0.0.5</version>
</dependency>

NekoCatSpider.builder()
    .name("spiderName")
    .startUrl("http://www.example.com/")
    .url(NekoCatProperties.builder()
            // deal with the start-url
            .regex("http://www.example.com/")
            .pipline(resp -> {
                resp.asDocument()
                    .select("css-select")
                    .forEach(a ->
                        // url that should be downloaded
                        resp.getContext().next(a.attr("href"))
                    );
            })
            .build())
    .url(NekoCatProperties.builder().regex("http://www.example.com/.+")
            .pipline(resp -> {
                // select all images
                resp.asDocument().select("img")
                .forEach(img->{
                    resp.getContext().next(img.attr("src"));
                });
            })
            .build())
     .build()
     .start();
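
The rules above pass raw URLs as regular expressions, so an unescaped `.` matches any character. The following is a minimal sketch, using only `java.util.regex`, of how a literal URL could be escaped before being handed to `.regex(...)`. The class `UrlRules` and its methods are hypothetical helpers, not part of NekoCat, and whether NekoCat applies full or partial matching is not stated here; the sketch assumes full matching.

```java
import java.util.regex.Pattern;

public class UrlRules {
    // Hypothetical helper: a regex that matches exactly this URL and nothing else.
    public static String exact(String url) {
        return Pattern.quote(url);
    }

    // Hypothetical helper: a regex that matches any URL that starts with the
    // given prefix and has at least one more character after it.
    public static String under(String prefix) {
        return Pattern.quote(prefix) + ".+";
    }

    public static void main(String[] args) {
        String raw = "http://www.example.com/";
        // The unescaped dots in the raw pattern also match other characters.
        System.out.println(Pattern.matches(raw, "http://wwwXexample.com/"));
        // The quoted pattern only matches the literal URL.
        System.out.println(Pattern.matches(exact(raw), "http://wwwXexample.com/"));
        System.out.println(Pattern.matches(under(raw), "http://www.example.com/page"));
    }
}
```

The returned strings can be used wherever the examples pass a pattern, for instance `.regex(UrlRules.under("http://www.example.com/"))`.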

Logging

NekoCat provides two simple logging interceptors: LoggingInterceptor and ErrorLoggingInterceptor.

ErrorLoggingInterceptor logs only exceptions, whereas LoggingInterceptor logs everything.

NekoCatProperties.builder()
    ...
    .log()

NekoCatProperties.builder()
    ...
    .logError()

Thread pool

NekoCatProperties.builder()
    .regex(".*\\.jpg")
    ...
    .downloadPoolSize(1)
    .downloadMaxQueueSize(1024)
    .piplinePoolSize(1)
    .piplineMaxQueueSize(1024)
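
How NekoCat maps these four settings onto its executors is not described here. As a rough illustration of what a pool size and a maximum queue size usually mean in Java, here is a minimal sketch with the standard `ThreadPoolExecutor`. The class `BoundedPool` and the choice of `CallerRunsPolicy` for a full queue are assumptions made for the example, not NekoCat's implementation.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedPool {
    // Fixed number of worker threads plus a bounded queue of waiting tasks.
    public static ThreadPoolExecutor create(int poolSize, int maxQueueSize) {
        return new ThreadPoolExecutor(
                poolSize, poolSize,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(maxQueueSize),
                // Assumption: when the queue is full, the submitting thread runs the task itself.
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = create(1, 1024);
        pool.execute(() -> System.out.println("downloading on " + Thread.currentThread().getName()));
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

With a pool size of 1, tasks run one at a time and up to 1024 more can wait in the queue.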

Exit when no URLs are emitted

NekoCatSpider.builder()
    .name("spiderName")
    ...
    .stopAfterNoRequestEmmitMillis(3600 * 1000L)

Get the result of the next pipeline

NekoCatSpider.builder()
    .name("spiderName")
    .startUrl("http://www.example.com/")
    .url(NekoCatProperties.builder().regex("http://www.example.com/")
            .pipline(resp -> {
                // select all images
                resp.asDocument().select("img")
                .forEach(img->{
                    CompletableFuture<Object> result = resp.getContext().next(img.attr("src")).getPiplineResult();
                    // get the file returned by the next pipline
                    File imgFile = (File)result.get();
                    
                });
            })
            .build())
    .url(NekoCatProperties.builder().regex(".*\\.jpg")
            .pipline(resp -> {
                // read the downloaded image bytes
                byte[] bytes = resp.asBytes();
                // write img to filesystem and return this file
                writeBytesToFile(bytes);
                return yourFile;
            })
            .build())
    .build()
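
Calling `get()` on the returned `CompletableFuture` blocks until the next pipeline finishes and throws checked exceptions. The sketch below shows one way to wait with a timeout and cast the result, using only `java.util.concurrent`. The class `PipelineResultDemo` and the method `awaitFile` are hypothetical, and the future in `main` is a stand-in for the one returned by `getPiplineResult()`.

```java
import java.io.File;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PipelineResultDemo {
    // Waits up to timeoutMillis for the result and returns it as a File.
    // Returns null on timeout, on failure, or if the result is not a File.
    public static File awaitFile(CompletableFuture<Object> result, long timeoutMillis) {
        try {
            Object value = result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return (value instanceof File) ? (File) value : null;
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException | TimeoutException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Stand-in for resp.getContext().next(...).getPiplineResult()
        CompletableFuture<Object> result =
                CompletableFuture.supplyAsync(() -> new File("image.jpg"));
        System.out.println(awaitFile(result, 1000));
    }
}
```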

Pass object to next request

NekoCatSpider.builder()
    .name("spiderName")
    .startUrl("http://www.example.com/")
    .url(NekoCatProperties.builder().regex("http://www.example.com/")
            .pipline(resp -> {
                // select all images
                resp.asDocument().select("img")
                .forEach(img->{
                    resp.getContext().addNextAttribute("storeFolder", "/tmp");
                    resp.getContext().next(img.attr("src"));
                });
            })
            .build())
    .url(NekoCatProperties.builder().regex(".*\\.jpg")
            .pipline(resp -> {
                String storeFolder = resp.getContext().getAttribute("storeFolder");
                // read the downloaded image bytes
                byte[] bytes = resp.asBytes();
                // write the image into storeFolder
                writeBytesToFile(storeFolder, bytes);
                return null;
            })
            .build())
    .build()
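
The examples call `writeBytesToFile` without defining it. One possible version, written with `java.nio.file` only, is sketched below. The class name `ImageFiles`, the random file name, and the `.jpg` extension are all assumptions for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;

public class ImageFiles {
    // Hypothetical helper: writes the bytes to a new file under storeFolder
    // (creating the folder if needed) and returns the written file.
    public static File writeBytesToFile(String storeFolder, byte[] bytes) {
        try {
            Path dir = Files.createDirectories(Paths.get(storeFolder));
            // Assumption: a random name is good enough; a real crawler might derive it from the URL.
            Path target = dir.resolve(UUID.randomUUID() + ".jpg");
            Files.write(target, bytes);
            return target.toFile();
        } catch (IOException e) {
            // Wrapped so the helper can be called from a lambda without checked exceptions.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File f = writeBytesToFile(System.getProperty("java.io.tmpdir"), new byte[] {1, 2, 3});
        System.out.println(f.getAbsolutePath());
    }
}
```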

HTTP POST

// form
// keys and values must be URL-encoded
request.setMethod("POST");
request.setRequestBody("param1=value1&param2=value2");
...

// json
request.setMethod("POST");
request.addHeader("content-type", "application/json");
request.setRequestBody(your_json_str);
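
Since the form body has to be URL-encoded by the caller, a small helper can build it from a map. This is a minimal sketch using the standard `URLEncoder` (the `Charset` overload requires Java 10 or later). The class `FormBody` is a hypothetical name, not part of NekoCat.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class FormBody {
    // Joins the parameters as key=value pairs separated by '&',
    // URL-encoding every key and value with UTF-8.
    public static String encode(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("param1", "value 1");
        params.put("param2", "a&b=c");
        System.out.println(encode(params)); // prints param1=value+1&param2=a%26b%3Dc
    }
}
```

The resulting string can be passed to `request.setRequestBody(...)` as in the form example above.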

Additional headers

request.addHeader("header-name", "header-value");

Scheduled

// the spider downloads the startUrl every 10 minutes (interval in milliseconds)
NekoCatSpider.builder()
    .name("spiderName")
    .startUrl("http://www.example.com")
    ...
    .loopInterval(1000 * 60 * 10)
    ...

// interval between downloads
NekoCatProperties.builder()
    .regex(".*\\.jpg")
    .interval(1000)
    ...

Filter duplicate URLs

NekoCatProperties.builder()
    ...
    .interceptor(new FilterDownloadedUrlInterceptor(1024))
    ...
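
The meaning of the `1024` argument is not documented here. One plausible reading is a cap on how many URLs are remembered. The sketch below illustrates that idea with a bounded set built on `LinkedHashMap`; it is an assumption about the general technique, not the actual code of `FilterDownloadedUrlInterceptor`, and the class `SeenUrls` is hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class SeenUrls {
    private final Set<String> seen;

    // Remembers at most `capacity` URLs; the oldest one is forgotten first.
    public SeenUrls(final int capacity) {
        this.seen = Collections.newSetFromMap(new LinkedHashMap<String, Boolean>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        });
    }

    // Returns true the first time a URL is offered and false while it is still remembered.
    public synchronized boolean firstVisit(String url) {
        return seen.add(url);
    }

    public static void main(String[] args) {
        SeenUrls seen = new SeenUrls(1024);
        System.out.println(seen.firstVisit("http://www.example.com/a.jpg")); // prints true
        System.out.println(seen.firstVisit("http://www.example.com/a.jpg")); // prints false
    }
}
```

A bounded set keeps memory use predictable, at the cost of possibly downloading a URL again once it has been evicted.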

Retry

NekoCatProperties.builder()
    ...
    .downloadRetry(1)
    ...
    .piplineRetry(1)
    ...
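
Whether the argument counts additional attempts or total attempts is not stated here. As a generic illustration of the retry idea, the sketch below runs a task once and then up to `retries` more times if it throws. The class `Retry` is hypothetical and does not reflect how NekoCat implements `downloadRetry` or `piplineRetry`.

```java
import java.util.function.Supplier;

public class Retry {
    // Runs the task once, then up to `retries` more times if it throws.
    // Rethrows the last failure when every attempt fails.
    public static <T> T call(Supplier<T> task, int retries) {
        if (retries < 0) {
            throw new IllegalArgumentException("retries must be >= 0");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt <= retries; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = call(() -> {
            // Fails on the first call, succeeds on the second.
            if (calls[0]++ == 0) {
                throw new IllegalStateException("temporary failure");
            }
            return "ok";
        }, 1);
        System.out.println(result + " after " + calls[0] + " calls"); // prints ok after 2 calls
    }
}
```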

TODO

JSON export
Redis queue / DB queue
Thread Pool Factory

chocotan/nekocat
