Task: PDF file (in Ukrainian)
News feed: http://brovary-rada.gov.ua/documents/
To start the application, run the command: docker-compose up
Ports 8000 and 3306 must be free.
Optional parameter: page_limit (int), the number of pages to scan (starting with the newest news).
By default, all pages are scanned.
- Request
$ curl -X GET http://0.0.0.0:8000/api/run_checker?page_limit=3
- Response
HTTP/1.1 200
{"status": "ok"}
- Error Response
HTTP/1.1 500
{"status": "Parsing error. See logs output."}
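The request above could also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `run_checker_url` and `run_checker` are hypothetical and not part of the project, and the host and port are simply copied from the curl example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Host and port copied from the curl example; adjust for your setup.
BASE_URL = "http://0.0.0.0:8000"


def run_checker_url(page_limit=None):
    """Build the /api/run_checker URL; omit page_limit to scan all pages."""
    url = BASE_URL + "/api/run_checker"
    if page_limit is not None:
        url += "?" + urlencode({"page_limit": int(page_limit)})
    return url


def run_checker(page_limit=None):
    """Trigger a scan and return the decoded JSON body.

    Requires the service to be running. urlopen raises
    urllib.error.HTTPError for the HTTP 500 error response.
    """
    with urlopen(run_checker_url(page_limit)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `run_checker(3)` against a running instance should correspond to the curl request shown above.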
Optional parameters (for pagination):
- limit (int): number of news items in the response (default: 20)
- after (int): the element after which to show the next news items
- before (int): the element before which to show news items
- Request
$ curl -X GET http://0.0.0.0:8000/api/articles/?limit=10
- Response
HTTP/1.1 200
{
"paging": {
"previous": "http://0.0.0.0:8000/api/articles/?limit=3&before=4",
"cursors": {
"after": 6,
"before": 4
},
"next": "http://0.0.0.0:8000/api/articles/?limit=3&after=6"
},
"data": [
{
"status": "no changes | updated | deleted",
"title": "Page title",
"created_at": 1496241466,
"updated_at": 1496241466,
"content": "html content",
"link": "http://brovary-rada.gov.ua/documents/27297.html",
"id": 4
}
]
}
To go to the next or previous page, use the URL in paging->next or paging->previous.
- Error Response
HTTP/1.1 400
{"status": "Error. See logs output."}
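The cursor pagination used by this endpoint can be illustrated with a small sketch. It is not the project's implementation: the function name `paginate` is hypothetical, and it assumes that items are sorted by ascending id, that `after` is the id of the last item on the page, and that `before` is the id of the first item, which is consistent with the example response above.

```python
def paginate(items, limit=20, after=None, before=None,
             base_url="http://0.0.0.0:8000/api/articles/"):
    """Return one page of `items` (dicts with an "id" key, sorted by id)."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    if after is not None:
        # Next page: the first `limit` items with an id greater than the cursor.
        page = [item for item in items if item["id"] > after][:limit]
    elif before is not None:
        # Previous page: the last `limit` items with an id below the cursor.
        page = [item for item in items if item["id"] < before][-limit:]
    else:
        page = items[:limit]
    if not page:
        return {"paging": {}, "data": []}
    first_id, last_id = page[0]["id"], page[-1]["id"]
    return {
        "paging": {
            "previous": f"{base_url}?limit={limit}&before={first_id}",
            "cursors": {"after": last_id, "before": first_id},
            "next": f"{base_url}?limit={limit}&after={last_id}",
        },
        "data": page,
    }
```

With nine items whose ids run from 1 to 9, `paginate(items, limit=3, after=3)` returns ids 4 to 6 and the same `paging` block as the example response.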
Takes the same parameters and returns the same response as Get news list.
- Request
$ curl -X GET http://0.0.0.0:8000/api/articles/one/1
- Response
HTTP/1.1 200
{
"status": "no changes | updated | deleted",
"title": "Page title",
"created_at": 1496241466,
"updated_at": 1496241466,
"content": "html content",
"link": "http://brovary-rada.gov.ua/documents/27297.html",
"id": 1
}
- Error Response
HTTP/1.1 400
{"status": "Error. See logs output."}
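A client may want to turn the `created_at` and `updated_at` fields into dates. The sketch below assumes these values are Unix timestamps in seconds (the source does not say so explicitly); the helper name `parse_article` is hypothetical, and the sample body reuses the example response with one concrete status.

```python
import json
from datetime import datetime, timezone

# Sample body based on the example response above, with one concrete status.
SAMPLE_BODY = """
{
  "status": "no changes",
  "title": "Page title",
  "created_at": 1496241466,
  "updated_at": 1496241466,
  "content": "html content",
  "link": "http://brovary-rada.gov.ua/documents/27297.html",
  "id": 1
}
"""


def parse_article(raw):
    """Decode an article and convert its timestamps to UTC datetimes.

    Assumption: created_at and updated_at are Unix timestamps in seconds.
    """
    article = json.loads(raw)
    for key in ("created_at", "updated_at"):
        article[key] = datetime.fromtimestamp(article[key], tz=timezone.utc)
    return article
```

Under that assumption, the example value 1496241466 corresponds to 2017-05-31 14:37:46 UTC.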
- The task is implemented with Tornado Web Server
- Database: MySQL
- HTML page parsing: Python library Beautiful Soup
Each saved news item has one of three statuses (no changes, updated, deleted).
Each time the parser is started (/api/run_checker), the content checksums are compared. If there is a difference, a new version of the document is saved and the parent's status changes to updated.
When a deleted document is detected, the saved status changes to deleted.
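The status logic described above can be sketched as follows. This is an illustration, not the project's code: the function names are hypothetical, and SHA-256 is an assumption, since the source does not name the checksum algorithm.

```python
import hashlib


def checksum(content):
    """Hex digest of an article's HTML content (SHA-256 is an assumption)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def detect_status(saved_checksum, fetched_content):
    """Compare a stored checksum with freshly fetched content.

    fetched_content is None when the article is no longer on the site.
    """
    if fetched_content is None:
        return "deleted"
    if checksum(fetched_content) != saved_checksum:
        return "updated"
    return "no changes"
```

For example, `detect_status(checksum("<p>old</p>"), "<p>new</p>")` returns `"updated"`, at which point a new version of the document would be saved.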
The parsing function is recursive: it runs until it has loaded the specified number of pages or, if this option is not specified, until it has scanned all the news.
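A recursive scan of this kind might look like the sketch below. The names `scan_pages` and `fetch_page` are hypothetical: `fetch_page(page)` stands in for whatever downloads and parses one listing page, and it is assumed to return an empty list once there are no more pages.

```python
def scan_pages(fetch_page, page=1, page_limit=None, collected=None):
    """Recursively collect news from listing pages, newest page first.

    Stops after page_limit pages, or when fetch_page returns an empty
    list if no limit is given.
    """
    if collected is None:
        collected = []
    if page_limit is not None and page > page_limit:
        return collected
    items = fetch_page(page)
    if not items:
        return collected
    collected.extend(items)
    return scan_pages(fetch_page, page + 1, page_limit, collected)
```

Because Python limits recursion depth (about 1000 frames by default), a very long news feed might call for a loop instead of recursion.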