
Web Crawler with some Pepper

Primary language: Java. License: GNU General Public License v3.0 (GPL-3.0).

crawlerizer

This project is a simple REST API. A client posts a JSON object containing a URL; the backend parses it, crawls the resulting page, and tries to qualify its content.

List of Endpoints

/crawlOne

{
  "url": "cde.com.ar",
  "rank": 834987
}

/crawl

[    
    {
        "url": "cde.com.ar",
        "rank": 834987
    },
    {
        "url": "clarin.com",
        "rank": 834987
    }
]
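
As a rough illustration of how a client might build requests for the two endpoints above, here is a minimal sketch using the JDK's built-in java.net.http API (Java 11+). The base URL (http://localhost:8080/crawlerizer), the use of POST for /crawl, and the class and helper names are assumptions made for this example, not taken from the project. The requests are only built and printed, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class CrawlRequestSketch {
    // Assumption: adjust to wherever the war file is actually deployed.
    static final String BASE_URL = "http://localhost:8080/crawlerizer";

    // Builds one {"url": ..., "rank": ...} entry.
    // Naive concatenation: no JSON escaping, fine for plain host names only.
    static String entryJson(String url, long rank) {
        return "{\"url\": \"" + url + "\", \"rank\": " + rank + "}";
    }

    // Builds (but does not send) a POST request carrying a JSON body.
    static HttpRequest post(String endpoint, String jsonBody) {
        return HttpRequest.newBuilder()
                .uri(URI.create(BASE_URL + endpoint))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) {
        String one = entryJson("cde.com.ar", 834987);
        String many = "[" + one + ", " + entryJson("clarin.com", 834987) + "]";

        HttpRequest single = post("/crawlOne", one);
        HttpRequest batch = post("/crawl", many);

        System.out.println(single.method() + " " + single.uri());
        System.out.println(one);
        System.out.println(batch.method() + " " + batch.uri());
        System.out.println(many);
    }
}
```

To actually send one of these requests, pass it to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) against a running instance.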

/getAll

Qualifier

A basic component that looks for matching text inside the tags of the HTML content. It can be extended by adding a new implementation of the IQualifier interface.

By default, it applies the TitleQualifier, which searches the <title> tag for keyword matches.
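
To give an idea of what a title-based qualifier could look like, here is a minimal sketch. The method declared on IQualifier here, the TitleKeywordQualifier class, and the keyword list are all hypothetical; the project's real IQualifier and TitleQualifier may be shaped differently. The sketch extracts the <title> text with a regular expression and checks it against keywords, ignoring case.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical shape: the project's real IQualifier may declare other methods.
interface IQualifier {
    boolean qualifies(String html);
}

// Illustrative qualifier that matches keywords against the <title> text.
class TitleKeywordQualifier implements IQualifier {
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private final List<String> keywords;

    TitleKeywordQualifier(List<String> keywords) {
        this.keywords = keywords;
    }

    @Override
    public boolean qualifies(String html) {
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return false; // no <title> tag, nothing to qualify
        }
        String title = m.group(1).toLowerCase(Locale.ROOT);
        for (String keyword : keywords) {
            if (title.contains(keyword.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }
}

class QualifierSketch {
    public static void main(String[] args) {
        IQualifier q = new TitleKeywordQualifier(List.of("news", "noticias"));
        // prints "true"
        System.out.println(q.qualifies("<html><head><title>Clarin - Noticias</title></head></html>"));
        // prints "false"
        System.out.println(q.qualifies("<html><head><title>Shop</title></head></html>"));
    }
}
```

A regular expression is enough for this sketch; a real crawler would more likely rely on an HTML parser to cope with malformed markup.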

Instructions

Clone the repo and run mvn install inside the project's folder. If the build succeeds, it produces a war file in the target folder, which can be deployed to a server. Alternatively, import the project into your favorite IDE and deploy it to the embedded server to run it.

Advice

Since I ran into conflicts when trying to add integration tests for the REST API, a Postman file is included in /crawlerizer/src/test/resources. Import it to run basic requests against the application.