/nekocat

A lightweight crawler framework

Primary LanguageJava

A lightweight crawler framework

FOSSA Status

Usage

Add maven dependency

<dependency>
    <groupId>io.loli.nekocat</groupId>
    <artifactId>nekocat-core</artifactId>
    <version>0.0.5</version>
</dependency>
NekoCatSpider.builder()
    .name("spiderName")
    .startUrl("http://www.example.com/")
    .url(NekoCatProperties.builder()
            // deal with the start-url
            .regex("http://www.example.com/")
            .pipline((resp)->{
                response.asDocument()
                    .select("css-select")
                    .forEach(a ->
                        // url that should be downloaded
                        resp.getContext().next(a.attr("href"));
                    );
            })
            .build())
    .url(NekoCatProperties.builder().regex("http://www.example.com/.+")
            .pipline(resp -> {
                // select all images
                resp.adDocument().select("img")
                .forEach(img->{
                    resp.getContext().next(img.attr("src"));
                });
            })
            .build())
     .build()
     .start();

Logging

Nekocat provides two simple logging interceptors LoggingInterceptor and ErrorLoggingInterceptor

ErrorLoggingInterceptor only log exceptions but LoggingInterceptor log all.

NekoCatProperties.builder()
    ...
    .log()
NekoCatProperties.builder()
    ...
    .logError()

Thread pool

NekoCatProperties.builder()
    .regex(".*\\.jpg")
    ...
    .downloadPoolSize(1)
    .downloadMaxQueueSize(1024)
    .piplinePoolSize(1)
    .piplineMaxQueueSize(1024)

Exit while no urls emitted

NekoCatSpider.builder()
    .name("spiderName")
    ...
    .stopAfterNoRequestEmmitMillis(3600 * 1000L)

Get next pipline result

NekoCatSpider.builder()
    .name("spiderName")
    .startUrl("http://www.example.com/")
    .url(NekoCatProperties.builder().regex("http://www.example.com/")
            .pipline(resp -> {
                // select all images
                resp.asDocument().select("img")
                .forEach(img->{
                    CompletableFuture<Object> result = resp.getContext().next(img.attr("src")).getPiplineResult();
                    // get the file returned by the next pipline
                    File imgFile = (File)result.get();
                    
                });
            })
            .build())
    .url(NekoCatProperties.builder().regex(".*\\.jpg")
            .pipline(resp -> {
                // select all images
                byte[] bytes = resp.asBytes();
                // write img to filesystem and return this file
                writeBytesToFile(bytes);
                return yourFile;
            })
            .build())
    .build()

Pass object to next request

NekoCatSpider.builder()
    .name("spiderName")
    .startUrl("http://www.example.com/")
    .url(NekoCatProperties.builder().regex("http://www.example.com/")
            .pipline(resp -> {
                // select all images
                resp.asDocument().select("img")
                .forEach(img->{
                    resp.getContext().addNextAttribute("storeFolder", "/tmp");
                    resp.getContext().next(img.attr("src"));
                });
            })
            .build())
    .url(NekoCatProperties.builder().regex(".*\\.jpg")
            .pipline(resp -> {
                String storeFolder = resp.getContext().getAttribute("storeFolder");
                // select all images
                byte[] bytes = resp.asBytes();
                // write img to filesystem and return this file
                writeBytesToFile(storeFolder, bytes);
                return null;
            })
            .build())
    .build()

Http POST

// form
// value must be urlencoded
request.setMethod("POST");
request.setRequestBody("param1=value1&param2=value2");
...

// json
request.setMethod("POST");
request.addHeader("content-type", "application/json");
request.setRequestBody(your_json_str);

Additional headers

request.addHeader(yourAdditionalHeader);

Scheduled

// spider will download the startUrl every 10 mins
NekoCatSpider.builder()
    .name("spiderName")
    .startUrl("http://www.example.com")
    ...
    .loopInterval(1000 * 60 * 10)
    ...
// interval of each download 
NekoCatProperties.builder()
    .regex(".*\\.jpg")
    .interval(1000)
    ...

Filter duplicate url

NekoCatProperties.builder()
    ...
    .interceptor(new FilterDownloadedUrlInterceptor(1024))
    ...

Retry

NekoCatProperties.builder()
    ...
    downloadRetry(1)
    ...
    piplineRetry(1)
    ...

TODO

  1. json export
  2. redis queue/db queue
  3. Thread Pool Factory

License

FOSSA Status