Web pages similarity checking

The project consists of four modules: Page Crawler, Page Extractor, Similarity Checker, and Web Service.

Page Crawler: Fetches the raw HTML of a web page. Currently, we use a remote crawler cluster provided by SeoClarity.
Page Extractor: Extracts the main content from the raw HTML (removes HTML tags, unnecessary content, etc.). Currently, we use the Python dragnet library.
Similarity Checker: Calculates the similarity between the content of web pages. The content is first tokenized, then a set of n-gram (shingle) tokens is generated, and finally a distance metric is applied to compute the similarity. Currently, three metrics are supported: jaccard, cosine, and fuzzy.
Web Service: Exposes the RESTful API services. Currently, we use Python Flask (a micro web framework) and Swagger UI for presenting the API documentation.
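
The Similarity Checker pipeline described above (tokenize, build shingles, apply a metric) can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the regex tokenizer, and the default shingle size of 2 are all assumptions. The fuzzy metric is left out because the source does not say which library or algorithm it uses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def shingles(tokens, n=2):
    """Return the word n-grams (shingles) of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def jaccard_similarity(grams_a, grams_b):
    """Size of the intersection divided by size of the union of shingle sets."""
    set_a, set_b = set(grams_a), set(grams_b)
    if not set_a and not set_b:
        return 1.0  # two empty documents are treated as identical
    return len(set_a & set_b) / float(len(set_a | set_b))


def cosine_similarity(grams_a, grams_b):
    """Cosine of the angle between the shingle count vectors."""
    counts_a, counts_b = Counter(grams_a), Counter(grams_b)
    dot = sum(counts_a[g] * counts_b[g] for g in counts_a if g in counts_b)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical registry of the metrics covered by this sketch.
METRICS = {"jaccard": jaccard_similarity, "cosine": cosine_similarity}


def page_similarity(text_a, text_b, n=2, metric="jaccard"):
    """Compare two already-extracted page texts with the chosen metric."""
    grams_a = shingles(tokenize(text_a), n)
    grams_b = shingles(tokenize(text_b), n)
    return METRICS[metric](grams_a, grams_b)
```

For example, page_similarity("the quick brown fox", "the quick brown dog") shares two of four distinct bigrams, so the jaccard score is 0.5. Larger values of n make the comparison stricter about word order.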

Technologies

Python 2.7
Flask web framework and RESTful API

Development

Install system dependencies on Ubuntu

sudo apt-get install -y git python-dev python-pip build-essential libxml2-dev libxslt1-dev zlib1g-dev

Set up python virtual environment

virtualenv -p python2.7 venv
source ./venv/bin/activate
for lib in $(cat requirements.txt); do pip install $lib; done

Configure your IDE (e.g. PyCharm) to use the created virtual environment: go to Preferences -> Project -> Project Interpreter

Deployment

There are 2 options:

Docker swarm cluster (recommended): supports scaling the application and load balancing
Standalone docker container

Install Docker (for more options, please refer to the official Docker documentation):

curl -fsSL https://get.docker.com -o get-docker.sh && sudo sh get-docker.sh

Clone the repository and cd into the project:

git clone https://bitbucket.org/diepdt/webpages-duplicated-checking.git
cd webpages-duplicated-checking

Option 1: Using docker swarm cluster (recommended)

Initialize the Docker swarm cluster (this machine becomes the master node, which Docker calls the manager):

docker swarm init
# Or with --advertise-addr
docker swarm init --advertise-addr [IP_ADDRESS]

[Optional] Add more nodes to the swarm cluster

docker swarm join --token [TOKEN] [MASTER_HOST:PORT]

Update configs:

vi docker-compose.yml
# Update `CRAWLER_URL`, `CRAWLER_ACCESS_KEY`: do NOT wrap the values in single or double quotes; put the bare value only
# Update `services.web.deploy.replicas`: scale web/api to the given number of instances
# Update web/api port, default is `8888`

Deploy services: Web + Redis + Monitor

docker stack deploy -c docker-compose.yml sim-check

Useful commands

docker service ls   # list all services
docker service logs -f sim-check_web    # view service logs
docker stack ps sim-check   # view all containers/processes of sim-check
docker stack rm sim-check   # remove sim-check
docker node ls  # list all nodes in the swarm cluster
docker stack deploy -c docker-compose.yml sim-check # update service
docker swarm leave --force  # remove the current node from the swarm cluster
docker stack ls    # list stacks or apps
docker inspect <task or container>      # inspect task or container

Option 2: Using standalone docker container

Start redis

docker run -d --name redis -p 6379:6379 redis

Start the web app (UI + REST API)

Note: remember to update CRAWLER_URL, CRAWLER_ACCESS_KEY, and REDIS_HOST

docker run -d \
           --name sim-check \
           -p 8888:8888 \
           -e CRAWLER_URL= \
           -e CRAWLER_ACCESS_KEY= \
           -e REDIS_HOST=192.168.1.118 \
           -e REDIS_PORT=6379 \
           -v `pwd`:/code \
           diepdao12892/webpages-duplicated-checking:1.0 \
           gunicorn -k tornado -w 2 -b 0.0.0.0:8888 main:app --max-requests 10000
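
The settings passed with -e above (and the note in Option 1 about not quoting values) suggest the app reads its configuration from environment variables. The sketch below shows one way such values could be loaded and sanity-checked before starting the container. It is an assumption-laden illustration: load_settings, find_problems, the default values, and the choice of required variables are hypothetical and not taken from the project's code.

```python
import os

# Assumed to be required, based on the note above; not confirmed by the project.
REQUIRED = ("CRAWLER_URL", "CRAWLER_ACCESS_KEY", "REDIS_HOST")


def is_quoted(value):
    """True if the value is wrapped in a matching pair of quotes."""
    return len(value) >= 2 and value[0] == value[-1] and value[0] in "'\""


def find_problems(environ):
    """Return human-readable problems for the required variables."""
    problems = []
    for name in REQUIRED:
        value = environ.get(name, "")
        if not value:
            problems.append("%s is empty or missing" % name)
        elif is_quoted(value):
            # Quotes written in the config end up as part of the value itself.
            problems.append("%s is wrapped in quotes" % name)
    return problems


def load_settings(environ=None):
    """Read the settings from the environment (defaults are assumptions)."""
    environ = os.environ if environ is None else environ
    return {
        "crawler_url": environ.get("CRAWLER_URL", ""),
        "crawler_access_key": environ.get("CRAWLER_ACCESS_KEY", ""),
        "redis_host": environ.get("REDIS_HOST", "localhost"),
        "redis_port": int(environ.get("REDIS_PORT", "6379")),
    }
```

Calling find_problems(os.environ) before startup would flag an empty CRAWLER_URL or a quoted access key early, instead of letting the crawler request fail later.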

Useful commands