WIP
This project aims to create a platform for developing ranking algorithms for news. I tried to maximize modularity to keep modifications easy.
Currently it only keeps track of scientific news, but it is trivial to add new news targets.
Check the TODO file for known problems.
The front end uses Semantic UI.
The project has seed links by default, but if you want to use your own links, edit link_list.csv, rss_list.csv and category_list.txt.
Make sure you have MongoDB and Python installed.
- Install the Python requirements
$ pip install -r requirements.txt
- Confirm the settings in config.py
- Run insert_links.py (provides seed links to other URL feeds for the crawlers; these are meant to be crawled regularly)
- Run insert_categories.py (provides the possible category names)
- Run cronjobs.py (sets the cron job for the crawler and the ranker; check config.py for the path)
- Run get_news.py (starts calling each crawler and collects data)
- Run rank_db.py (queries the collected news data and ranks it with the available rankers; the query has a specific date range, check config.py for the date range)
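The run steps above can be chained in a small helper script. This is only a minimal sketch: it assumes the scripts sit in the current working directory and take no arguments, and the names SETUP_SCRIPTS, build_commands and run_setup are hypothetical, not part of the project.

```python
import subprocess
import sys

# Script names are taken from the steps above, in the order they are listed.
SETUP_SCRIPTS = [
    "insert_links.py",
    "insert_categories.py",
    "cronjobs.py",
    "get_news.py",
    "rank_db.py",
]


def build_commands(python=sys.executable):
    """Return one command (an argument list) per setup script."""
    return [[python, script] for script in SETUP_SCRIPTS]


def run_setup():
    """Run each script in order and stop at the first one that fails."""
    for command in build_commands():
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)
```

Calling run_setup() from the project root, after confirming config.py, would execute the five scripts one after another.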
Now you are ready to run the server!
$ python server.py
rank_db.py
Explanation
get_news.py
Explanation
Not every field the crawlers collect is required; which fields are required can be changed in the config.py file.
After a crawler parses the data, validate.py checks the data for the existence of the specified keys.
Field Key | Required ? | Comment |
---|---|---|
title | YES | |
category | YES | |
url | YES | |
page_type | NO | determines which crawler to use |
date | NO | UTC format |
subtitle | NO | description |
author | NO | |
domain | NO | URL's domain |
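A key-existence check of the kind validate.py performs could look roughly like the sketch below. It is not the project's actual code: REQUIRED_FIELDS, missing_fields and is_valid are hypothetical names, and the required keys are hard-coded here from the table, whereas the project reads them from config.py.

```python
# Required keys as listed in the table above. In the project they are
# configured in config.py, so this constant is only a stand-in.
REQUIRED_FIELDS = ("title", "category", "url")


def missing_fields(item, required=REQUIRED_FIELDS):
    """Return the required keys that do not exist in a news item."""
    return [key for key in required if key not in item]


def is_valid(item, required=REQUIRED_FIELDS):
    """A news item is valid when no required key is missing."""
    return not missing_fields(item, required)
```

For example, an item that has only title and url would be rejected, and missing_fields would report category as the missing key.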
The only requirement for a ranker package is that it accepts and returns a dict object.
You have to check for field existence, since the keys of the news data can vary.
A ranker package should be located in the rank folder, and its name should start with the prefix rank_
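As an illustration, a "shortest title" ranker might be sketched as follows. The layout of the dict is an assumption (an identifier mapped to a news item dict), as is the function name rank; the real structure passed by rank_db.py may differ.

```python
def rank(news):
    """Order news items by title length, shortest first.

    Assumption: `news` maps an identifier to a news item (itself a dict).
    Items without a title are placed last, since keys can vary.
    """
    def sort_key(pair):
        _, item = pair
        title = item.get("title")
        # (False, n) sorts before (True, 0), so titled items come first.
        return (title is None, len(title) if title is not None else 0)

    # Dicts keep insertion order, so the result is a dict in ranked order.
    return dict(sorted(news.items(), key=sort_key))
```

The sketch accepts a dict and returns a dict, and it uses item.get rather than indexing so that a missing key does not raise an error.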
- project_root/rank/__init__.py (links the package)
- project_root/static/ranker_data.js (add JSON to link your ranker to the front end)
{
"text":"Shortest Title",
"value":"shortest_title",
"icon":"eye",
},
key | purpose |
---|---|
text | what the user sees |
value | must match the __init__.py file in your package |
icon | possible values: Semantic UI Icons |
If you haven't started the server yet, add the link to link_list.csv or rss_list.csv.
If you have already started the server, use add_link.py inside the utils folder.
It handles category creation and inserts the link into the MongoDB collection crawl_target.
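The behaviour described above (create the category if needed, then insert the link) can be sketched as below. This is not the code of add_link.py: the function signature, the "category" collection name and the document fields are all assumptions; only the crawl_target collection name comes from the text. The db argument is expected to behave like a pymongo Database.

```python
def add_link(db, url, category):
    """Register a crawl target, creating its category when needed.

    The "category" collection name and the document fields are
    assumptions made for this sketch.
    """
    # Upsert so that an existing category is not duplicated.
    db["category"].update_one(
        {"name": category},
        {"$set": {"name": category}},
        upsert=True,
    )
    # Upsert on the URL so adding the same link twice keeps one document.
    db["crawl_target"].update_one(
        {"url": url},
        {"$set": {"url": url, "category": category}},
        upsert=True,
    )
```

With pymongo installed, db could be a database obtained from a MongoClient; the database name to use is whatever config.py specifies.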