similar-page-checker

A service that measures the similarity score between two web pages.

Web pages similarity checking

The project has four main modules: Page Crawler, Page Extractor, Similarity Checker, and Web Service.

  • Page Crawler: Crawls the HTML content of a web page. Currently, we use a remote crawler cluster provided by SeoClarity.
  • Page Extractor: Extracts the main content from raw HTML (removes HTML tags, unnecessary content, etc.). Currently, we use the Python dragnet library.
  • Similarity Checker: Calculates the similarity between the content of two web pages. The content is first tokenized, then a set of n-gram (shingle) tokens is generated, and finally a distance metric is applied to calculate the similarity. Currently, three distance metrics are supported: jaccard, cosine, and fuzzy.
  • Web Service: Exposes RESTful API services. Currently, we use Python Flask (a micro web framework) and Swagger UI for presenting the API documentation.
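The Similarity Checker steps above (tokenize, build shingles, apply a metric) can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the regex tokenizer, and the default shingle size `n=3` are assumptions. The fuzzy metric is left out because its exact definition is not described here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def shingles(tokens, n=3):
    """Return the list of n-gram (shingle) tuples built from a token list."""
    if not tokens:
        return []
    if len(tokens) < n:
        # Text shorter than n tokens: treat the whole text as one shingle.
        return [tuple(tokens)]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of the shingle sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # both pages are empty; defined as 0.0 in this sketch
    return len(set_a & set_b) / float(len(union))


def cosine_similarity(a, b):
    """Cosine of the angle between the shingle count vectors."""
    count_a, count_b = Counter(a), Counter(b)
    dot = sum(count_a[k] * count_b[k] for k in count_a if k in count_b)
    norm_a = math.sqrt(sum(v * v for v in count_a.values()))
    norm_b = math.sqrt(sum(v * v for v in count_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def page_similarity(text_a, text_b, n=3, metric="jaccard"):
    """Run the full pipeline on two already-extracted page texts."""
    sh_a = shingles(tokenize(text_a), n)
    sh_b = shingles(tokenize(text_b), n)
    if metric == "jaccard":
        return jaccard_similarity(sh_a, sh_b)
    if metric == "cosine":
        return cosine_similarity(sh_a, sh_b)
    raise ValueError("unsupported metric: %s" % metric)
```

For example, with `n=2` the texts "the quick brown fox jumps" and "the quick brown fox sleeps" share 3 of 5 distinct bigrams, so the jaccard score is 0.6 and the cosine score is 0.75.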

Technologies

  • Python 2.7
  • Flask web framework and RESTful API

Development

Install the required system packages on Ubuntu

sudo apt-get install -y git python-dev python-pip build-essential libxml2-dev libxslt1-dev zlib1g-dev

Set up python virtual environment

virtualenv -p python2.7 venv
source ./venv/bin/activate
for lib in $(cat requirements.txt); do pip install $lib; done

Configure your IDE (e.g. PyCharm) to use the created Python venv: go to Preferences -> Project -> Project Interpreter

Deployment

There are 2 options:

  • Docker swarm cluster (recommended): supports application scaling and load balancing
  • Standalone docker container

Install Docker (for other installation options, please refer to the official Docker documentation):

curl -fsSL get.docker.com -o get-docker.sh && sudo sh get-docker.sh

Clone the repository and cd into the project:

git clone https://bitbucket.org/diepdt/webpages-duplicated-checking.git
cd webpages-duplicated-checking

Option 1: Using docker swarm cluster (recommended)

  • Initialize the Docker swarm cluster: this machine becomes the master (manager) node
docker swarm init
# Or with --advertise-addr
docker swarm init --advertise-addr [IP_ADDRESS]
  • [Optional] Add more nodes to the swarm cluster
docker swarm join --token [TOKEN] [MASTER_HOST:PORT]
  • Update the configuration in docker-compose.yml
vi docker-compose.yml
# Update `CRAWLER_URL`, `CRAWLER_ACCESS_KEY`: put the raw value only, do NOT wrap it in single or double quotes
# Update `services.web.deploy.replicas`: the number of web/api instances to run
# Update the web/api port if needed (default is `8888`)
  • Deploy services: Web + Redis + Monitor
docker stack deploy -c docker-compose.yml sim-check
  • Useful commands
docker service ls   # list all services
docker service logs -f sim-check_web    # view service logs
docker stack ps sim-check   # view all containers/processes of sim-check
docker stack rm sim-check   # remove sim-check
docker node ls  # list all nodes in the swarm cluster
docker stack deploy -c docker-compose.yml sim-check # update the services
docker swarm leave --force  # remove the current node from the swarm cluster
docker stack ls    # list stacks or apps
docker inspect <task or container>      # inspect task or container

Option 2: Using standalone docker container

  • Start redis
docker run -d --name redis -p 6379:6379 redis
  • Start the web app (UI + REST API)

Note: remember to set CRAWLER_URL, CRAWLER_ACCESS_KEY, and REDIS_HOST to match your environment

docker run -d \
           --name sim-check \
           -p 8888:8888 \
           -e CRAWLER_URL= \
           -e CRAWLER_ACCESS_KEY= \
           -e REDIS_HOST=192.168.1.118 \
           -e REDIS_PORT=6379 \
           -v `pwd`:/code \
           diepdao12892/webpages-duplicated-checking:1.0 \
           gunicorn -k tornado -w 2 -b 0.0.0.0:8888 main:app --max-requests 10000
  • Useful commands
docker ps   # list all containers
docker logs -f sim-check    # view container sim-check logs
docker stop sim-check   # stop container/process sim-check
docker start sim-check  # start container sim-check
docker restart sim-check    # restart container sim-check
docker rm sim-check     # remove container sim-check
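
Both deployment options pass configuration through the environment variables `CRAWLER_URL`, `CRAWLER_ACCESS_KEY`, `REDIS_HOST`, and `REDIS_PORT`. The sketch below shows one way an application could read and validate them at startup, including a check for values accidentally wrapped in quotes. It is only an illustration: `load_settings`, the `localhost` and `6379` defaults, and the error messages are hypothetical and not taken from the project's code.

```python
import os

# Variables that must be set explicitly (assumption for this sketch).
REQUIRED = ("CRAWLER_URL", "CRAWLER_ACCESS_KEY")


def load_settings(env=None):
    """Read the service settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    settings = {
        "CRAWLER_URL": env.get("CRAWLER_URL", ""),
        "CRAWLER_ACCESS_KEY": env.get("CRAWLER_ACCESS_KEY", ""),
        "REDIS_HOST": env.get("REDIS_HOST", "localhost"),  # assumed default
        "REDIS_PORT": env.get("REDIS_PORT", "6379"),       # assumed default
    }

    for name, value in settings.items():
        # A value like 'abc' or "abc" means the quotes became part of it.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            raise ValueError("%s must not be wrapped in quotes" % name)

    missing = [name for name in REQUIRED if not settings[name]]
    if missing:
        raise ValueError("missing required settings: %s" % ", ".join(missing))

    settings["REDIS_PORT"] = int(settings["REDIS_PORT"])
    return settings
```

Passing a plain dict instead of relying on `os.environ` keeps the function easy to check without starting a container.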