
Redisearch based cross-language fuzzy search engine

Primary language: Python. License: MIT.

ASHEN: Area SearcH ENgine

Redisearch based full text fuzzy search engine

Description

This is an implementation of a dead simple in-memory search engine built with Redis and the RediSearch module. A fuzzy area-description dataset is used here to demonstrate the process of indexing and querying data. However, it can serve as a quick template for building any sort of search engine where the entire indexed dataset lives primarily in memory and query responses need to be fast. While performing queries, this implementation applies Levenshtein distance based full text fuzzy matching. It also automatically backs up the entire index periodically to the ./redisearch-data folder; this behavior can be configured through the docker-compose.yml file. The entire stack consists of:

Running the Engine

  • Before running the engine, install docker and docker-compose on your machine.

  • Clone the repo and go to the root folder.

  • In the ./settings.toml file, provide your internal IP as host = <your-internal-ip> under the production section.

  • Run

    docker-compose up -d

Making Index

To make the engine functional, you need to provide data in a specific format, which the engine then indexes. Your dataset should be named area.csv; you'll find a sample dataset in the index-data folder. In this case, the area-description dataset looks like this:

index, areaId, areaTitle, areaBody
0    , 1     , Azimpur  , Example area in Azimpur
1    , 2     , Lalbagh  , Some area in Lalbagh
2    , 3     , Feni     , Sadar road, Feni
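
Before indexing, it can help to check that a file matches the expected layout. The following is a minimal sketch using only the standard library, not the project's own loader. It assumes the four columns are named index, areaId, areaTitle and areaBody (areaTitle as it appears in the API response), and that free-text fields containing commas are quoted; `validate_area_csv` is a hypothetical helper name.

```python
import csv
import io

# Assumed column names, taken from the sample table and the API response.
EXPECTED_COLUMNS = ["index", "areaId", "areaTitle", "areaBody"]


def validate_area_csv(text):
    """Return a list of problems found in the CSV text (empty if none)."""
    # skipinitialspace tolerates padding after each comma, as in the sample.
    rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))
    if not rows:
        return ["file is empty"]

    problems = []
    header = [name.strip() for name in rows[0]]
    if header != EXPECTED_COLUMNS:
        problems.append(f"unexpected header: {header}")

    # Row numbers start at 2 because row 1 is the header.
    for row_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(EXPECTED_COLUMNS):
            problems.append(
                f"line {row_no}: expected {len(EXPECTED_COLUMNS)} fields, got {len(row)}"
            )
    return problems


good = (
    "index,areaId,areaTitle,areaBody\n"
    "0,1,Azimpur,Example area in Azimpur\n"
    '2,3,Feni,"Sadar road, Feni"\n'
)
# The unquoted comma in the last field splits it into two fields.
bad = "index,areaId,areaTitle,areaBody\n2,3,Feni,Sadar road, Feni\n"

print(validate_area_csv(good))
print(validate_area_csv(bad))
```

The second call illustrates why an address such as "Sadar road, Feni" is safer when quoted: without quotes, a CSV parser sees five fields instead of four.
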
  • In the root folder, create a Python 3.8 virtual environment, activate it, and install the dependencies by running the following commands one by one:

    python3.8 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
  • Place your data (formatted as shown above) in the index-data folder and run:

    python -m index.insert_data

    This should start the indexing process. It takes around a minute to insert one million key-value pairs into Redis.

  • You can explore your dataset by going to the following URL, which opens a RedisInsight dashboard:

    <yourhost>:8001


Running Queries

  • Queries can be performed on the following POST API:

    <yourhost>/area-search/
  • Headers:

    Content-Type: application/json
    x-api-key: 1234ABCD
    
  • The payload should be sent as JSON:

    {"query": "West Shaorapara,around Mirpur 10,\nShapla sharani.\nHouse no:438/3"}
  • Response:

    {
    "matchedArea": [
        {
        "areaBody": "House5,road1,block E,cholontica more,mirpur6,dhaka1216",
        "areaId": "315",
        "areaTitle": "Mirpur",
        "score": 48.0
        },
        {
        "areaBody": "House3 Road9 Block c Mirpur6",
        "areaId": "315",
        "areaTitle": "Mirpur",
        "score": 48.0
        },
        {
        "areaBody": "House3 Road9 Block c Mirpur6",
        "areaId": "315",
        "areaTitle": "Mirpur",
        "score": 48.0
        }
    ],
    "query": "West Shaorapara,around Mirpur 10,\nShapla sharani.\nHouse no:438/3",
    "verdictArea": "Mirpur",
    "verdictAreaId": "315"
    }
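
The request described above could be issued from Python roughly as follows. This is a sketch using only the standard library; the http scheme, the default port, and the helper names `build_search_request` and `search_area` are assumptions made for illustration, and the API key is the placeholder value from the header example.

```python
import json
import urllib.request


def build_search_request(host, query, api_key):
    """Build the POST request for the area-search endpoint."""
    url = f"http://{host}/area-search/"  # scheme and port are assumptions
    body = json.dumps({"query": query}).encode("utf-8")
    headers = {"Content-Type": "application/json", "x-api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def search_area(host, query, api_key, timeout=10):
    """Send the query and return the decoded JSON response as a dict."""
    request = build_search_request(host, query, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (requires the engine to be running on <yourhost>):
# result = search_area("<yourhost>", "West Shaorapara, around Mirpur 10", "1234ABCD")
# print(result["verdictArea"], result["verdictAreaId"])
# for match in result["matchedArea"]:
#     print(match["areaTitle"], match["score"], match["areaBody"])
```

Separating request construction from sending keeps the payload and headers easy to inspect without a running server.
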

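The fuzzy matching mentioned in the Description is based on Levenshtein distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. RediSearch computes this internally (its query syntax marks a fuzzy term by wrapping it in percent signs, e.g. %mirpor%); the pure-Python sketch below only illustrates the metric and is not how this engine or RediSearch implements it. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Return the Levenshtein (edit) distance between strings a and b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def fuzzy_match(term, candidates, max_distance=1):
    """Return candidates within max_distance edits of term (case-insensitive)."""
    term = term.lower()
    return [c for c in candidates if levenshtein(term, c.lower()) <= max_distance]


print(levenshtein("mirpur", "mirpor"))
print(fuzzy_match("mirpor", ["Mirpur", "Lalbagh", "Feni"]))
```

With a maximum distance of 1, the misspelling "mirpor" still matches "Mirpur" while unrelated area names are rejected.
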
Architecture

.
├── app                       [flask-application]
│   ├── __init__.py
│   └── search_api
│       ├── __init__.py
│       ├── search_data.py
│       ├── utils.py
│       └── views.py
├── docker-compose.yml
├── Dockerfile
├── flask_run.py
├── index                     [This module should be run to insert new data]
│   ├── __init__.py
│   ├── index_data.py
│   └── insert_data.py
├── index-data                [Index module pulls data from here]
│   ├── area.csv
│   └── placeholder-area.csv
├── LICENSE
├── README.md
├── redisearch-data           [Redis backup lives here]
│   ├── dump.rdb
│   └── placeholder.rdb
├── requirements.txt
└── settings.toml

Remarks

This application is built and tested on:

  • Python 3.8
  • Ubuntu 18.04
  • Redis stable 5.0
  • Redisearch 1.6.10
  • Flask 1.1.x
  • Pandas 1.x