strainer: A Python repository from hmakkapati

Strainer is a HipChat service that identifies mentions, emoticons and
URLs in chat messages.


Installation
============
Strainer API requires the following python packages:
    * eventlet
    * flask
    * lxml
    * beautifulsoup4
    * requests
    * nose
    * requests-mock

To install Strainer in a virtualenv:
    * Create virtual environment
        ``virtualenv strainer-venv``
    * Activate virtual environment
        ``source strainer-venv/bin/activate``
    * Install dependencies
        ``pip install eventlet Flask lxml beautifulsoup4 requests nose requests-mock``


Running Tests
=============
To run tests, execute the ``run_tests.sh`` script in the top-level directory:
    ``./run_tests.sh``


Running Server
==============
To run the server, execute ``server.py`` script in ``strainer`` package.
    ``./strainer/server.py``


Notes
=====

# Improve tests
    * There can be more tests around timing the timeouts while fetching titles.
    * There can be a lot more tests around URL straining testing for various
      kinds of valid and invalid URLs
    * Similarly, testing for title fetching can be improved for a diverse set
      of valid and invalid URLs
    * There can be more tests around unicode characters
    * Functional test coverage may be improved as well. In the interest of
      time, since there is decent unit test coverage for straining and
      fetching, some functional tests that could potentially be testing the
      same thing are skipped here.
    * Documentation in tests can be improved. However, the test names should be
      self-explanatory

# This currently developed and tested on Python 2. May need more effort to make
  it Python 3 compatible.

# If there's no limit on how long a chat message can be, it could be possible
  the API may time out. Is there scope for parallelizing message straining?

# Currently, there are three different regular expressions. One regular
  expression each for straining emoticons, mentions and URLs specifically. Is
  there a way to devise one regular expression that does it all? If so, is
  that going to efficient than the current three?

# Logging could be improved further. Generating and logging request ids could
  prove very useful while troubleshooting

# The use of 'POST' vs 'GET' maybe debated. With GET, the input chat message
  may have to be sent as a body assuming the message can be of any length.
  Sending body with 'GET' is much debated and usually discouraged as there no
  standards defined around processing request body with 'GET'. Hence, the use
  of 'POST' here. Usage of 'POST' could be debated too from a strict RESTful
  sense as the Strainer API doesn't create any resources. 'POST' just happens
  to a better option for this scenario.

# There is probably scope for caching the titles if some URLs appear more
  often than others in chat messages. If the URLs are fairly random, then
  adding cache may not be a big gain. This needs to be studied further.
hmakkapati/strainer