DWH module to detect (near) duplicates from news articles (assignment in course Information Systems)

Duplicate Detector for News Articles

Usage

Querying for the number of pages

First, you will want to retrieve the total number of pages for our document collections via:

GET /articles/pages

or

GET /filteredArticles/pages

A response will look like this:

GET /filteredArticles/pages HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: cmts4.f4.htw-berlin.de:8080
User-Agent: HTTPie/0.9.8



HTTP/1.1 200
Content-Type: application/json;charset=UTF-8
Date: Wed, 28 Jun 2017 20:38:23 GMT
Transfer-Encoding: chunked

{
    "pages": 239
}

The first page is always 0.
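With Java 11's built-in `HttpClient`, fetching the page count could look roughly like this (a minimal sketch: the base URL is taken from the example above, and the tiny regex extractor is just a stand-in for a real JSON library such as Jackson or Gson):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PageCountClient {
    static final String BASE = "http://cmts4.f4.htw-berlin.de:8080";

    // GET /filteredArticles/pages and return the raw JSON body.
    static String fetchPagesJson(HttpClient client) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(BASE + "/filteredArticles/pages"))
                .header("Accept", "application/json")
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Pull the number out of a body like {"pages": 239}.
    static int parsePages(String json) {
        Matcher m = Pattern.compile("\"pages\"\\s*:\\s*(\\d+)").matcher(json);
        if (!m.find()) throw new IllegalArgumentException("no pages field in: " + json);
        return Integer.parseInt(m.group(1));
    }
}
```

Usage: `int pages = PageCountClient.parsePages(PageCountClient.fetchPagesJson(HttpClient.newHttpClient()));`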

Querying for specific pages

You can query the API via HTTP like this:

GET /articles/page/<pageNumber>

or this:

GET /filteredArticles/page/<pageNumber>

to retrieve a specific page of articles or filtered articles respectively. A response will look like this:

GET /filteredArticles/page/100 HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: cmts4.f4.htw-berlin.de:8080
User-Agent: HTTPie/0.9.8

HTTP/1.1 200
Content-Type: application/json;charset=UTF-8
Date: Mon, 26 Jun 2017 19:02:56 GMT
Transfer-Encoding: chunked

which will serve a JSON list of a hundred (or so) objects like this:

    [
        {
            "DuplicateIDs": ["74d98c2f-3f7b-2c50-a897-30ca49393fa5", "f84b2e95-f4fe-026c-7c58-0875be120f83"],
            "adresse": "some street",
            "bezirk": "some precinct",
            "einsatzkraefte": {"object": "of some relief forces"},
            "ereigniszeitpunkt": "maybe a timestamp",
            "id": "67a1591b-7e71-effd-fa5d-772173cf7c12",
            "inhalt": "some content",
            "kategorie": "a category",
            "meldungszeitpunkt": "another timestamp",
            "ortsteil": "district",
            "referenzmeldungen": ["a reference", "another reference"],
            "titel": "a title",
            "url": "a url",
            "zeitung": "a journal"
        },
        {
            "DuplicateIDs": [],
            "...": "..."
        }
    ]
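Combined with the page count from above, a client can walk the whole collection page by page. A minimal sketch (the base URL comes from the examples above; the helper names are illustrative, and parsing the returned JSON array is left to whatever JSON library the client uses):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PageWalker {
    static final String BASE = "http://cmts4.f4.htw-berlin.de:8080";

    // URL for one page of the filtered collection; pages start at 0.
    static String pageUrl(int page) {
        return BASE + "/filteredArticles/page/" + page;
    }

    // Fetch one page and return the raw JSON array as a string.
    static String fetchPage(HttpClient client, int page) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(pageUrl(page)))
                .header("Accept", "application/json")
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

From there, `for (int page = 0; page < pages; page++)` pulls every page, with `pages` obtained from `GET /filteredArticles/pages` as shown above.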

ToDo

  • build REST client to fetch article collection
  • add persistence layer (MongoDB?)
  • load initial articles into MongoDB
  • implement REST API w/ pagination to serve filtered data set (Spring Boot)
  • implement MinHash filter
  • implement LSH filter
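The MinHash filter from the list above could be sketched like this (an illustrative, self-contained version, not the project's actual implementation; it uses random linear hash functions modulo a Mersenne prime):

```java
import java.util.Arrays;
import java.util.Random;
import java.util.Set;

public class MinHash {
    private static final int PRIME = 2147483647; // Mersenne prime 2^31 - 1
    private final int[] a, b;                    // coefficients of h_i(x) = (a_i * x + b_i) mod PRIME

    public MinHash(int numHashes, long seed) {
        Random rnd = new Random(seed);
        a = new int[numHashes];
        b = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rnd.nextInt(PRIME - 1);
            b[i] = rnd.nextInt(PRIME);
        }
    }

    // Signature = per hash function, the minimum hash over all shingles of the set.
    public int[] signature(Set<String> shingles) {
        int[] sig = new int[a.length];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String s : shingles) {
            int x = s.hashCode() & 0x7fffffff;
            for (int i = 0; i < a.length; i++) {
                int h = (int) (((long) a[i] * x + b[i]) % PRIME);
                if (h < sig[i]) sig[i] = h;
            }
        }
        return sig;
    }

    // The fraction of agreeing signature positions estimates the Jaccard similarity.
    public static double estimate(int[] sig1, int[] sig2) {
        int equal = 0;
        for (int i = 0; i < sig1.length; i++) {
            if (sig1[i] == sig2[i]) equal++;
        }
        return (double) equal / sig1.length;
    }
}
```

An LSH filter would then band these signatures (e.g. 128 hashes split into 32 bands of 4 rows) and only compare articles that collide in at least one band.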

Open Questions

  • runtime comparison for MinHash, LSH and maybe even simple Jaccard similarity
  • efficiency and accuracy of k-shingles for different k-values
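For the last question, a straightforward baseline for k-shingling and exact Jaccard similarity might look like this (a sketch; character-level shingles are used here, word-level shingles would be the obvious variant to compare against):

```java
import java.util.HashSet;
import java.util.Set;

public class Shingler {
    // All contiguous character k-grams of a text.
    public static Set<String> shingles(String text, int k) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= text.length(); i++) {
            result.add(text.substring(i, i + k));
        }
        return result;
    }

    // Exact Jaccard similarity: |A ∩ B| / |A ∪ B|.
    public static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }
}
```

Running this for several values of k over the article collection would give the exact similarities against which the MinHash/LSH estimates can be measured.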