
Phantom Search Engine


Phantom Search Engine is a lightweight, distributed web search engine designed to provide fast and relevant search results.

Phantom Demo

Features

  • Distributed crawler system for efficient web crawling
  • Multithreaded crawling for concurrent processing
  • TF-IDF-based indexing for fast search and retrieval
  • Query engine for processing user queries and returning relevant results
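
To make the TF-IDF feature concrete, here is a minimal, self-contained sketch of how term weights can be computed and used to rank documents. This is toy code illustrating the general technique, not Phantom's actual indexer:

import math
from collections import Counter

def build_index(documents):
    """Build a toy TF-IDF index mapping term -> {doc_id: weight}."""
    n_docs = len(documents)
    df = Counter()   # document frequency of each term
    term_freqs = {}
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        term_freqs[doc_id] = counts
        df.update(counts.keys())

    index = {}
    for doc_id, counts in term_freqs.items():
        total = sum(counts.values())
        for term, count in counts.items():
            tf = count / total                 # term frequency
            idf = math.log(n_docs / df[term])  # inverse document frequency
            index.setdefault(term, {})[doc_id] = tf * idf
    return index

def search(index, query):
    """Rank documents by the summed TF-IDF weight of the query terms."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    return scores.most_common()

docs = {
    "a": "phantom is a lightweight search engine",
    "b": "distributed crawlers fetch pages for the search index",
}
print(search(build_index(docs), "search engine"))

Note that a term appearing in every document gets an IDF of 0, so a document matched only on such terms scores 0 (compare the 0.9.2 note below about hiding zero-score results).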

Getting Started

Prerequisites

  • Python 3.8 or higher
  • pip

Installation

  1. Clone the repository:
git clone https://github.com/AnsahMohammad/Phantom.git
cd Phantom
  2. Create a virtual environment and activate it:
python3 -m venv .env
source .env/bin/activate
  3. Install the necessary dependencies:
pip install -r requirements.txt
  4. Build the files:
./build.sh
  5. Open the Search Engine GUI:
python phantom.py

Building from Source

  1. Run the build.sh script:
./build.sh

This script performs the following actions:

  • Creates a virtual environment and activates it.
  • Installs the necessary dependencies from the requirements.txt file.
  • Runs the Phantom crawler with the specified parameters.
  • Downloads the necessary NLTK packages, stopwords and punkt (see the snippet after this list).
  • Runs the Phantom indexing module.
  2. Start the query engine locally in the terminal by running the search.sh file:
./search.sh
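
build.sh fetches the required NLTK data automatically. If you ever need to fetch it by hand (for example, in a fresh environment), the standard NLTK downloader works; this snippet assumes nothing beyond the stock nltk package:

import nltk

# Fetch the corpora that build.sh relies on
nltk.download("stopwords")
nltk.download("punkt")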

Alternative Method

  1. Update the necessary parameters in the crawl.sh file and run it
  2. Run local_search.sh to index the crawled sites and start the query engine on the result

Note: Read the documentation here

Contributing

We welcome contributions! Please see our CONTRIBUTING.md for details on how to contribute to this project.

License

This project is licensed under the terms of the Apache License 2.0. See the LICENSE file for details.

Development and Maintenance

0.9.2

  • Do not send pages to the DB when both title and content are empty
  • Do not show results with a score of 0

0.9.1

  • Error handling
  • Consistency in logs
  • Enable local DB

0.10+

  • Distributed query processing
  • Caching locally
  • Two layer crawling
  • Optimize the scheduler by storing visited nodes
  • Use a unified crawler system in a master-slave architecture
  • Create Storage abstraction classes for local and remote clients (see the sketch after this list)
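
As a rough illustration of the last item, the Storage abstraction could follow the pattern sketched below. The class and method names are hypothetical, not taken from the codebase:

from abc import ABC, abstractmethod
from typing import Optional

class Storage(ABC):
    """Common interface for local and remote persistence (hypothetical sketch)."""

    @abstractmethod
    def save(self, key: str, value: dict) -> None:
        ...

    @abstractmethod
    def load(self, key: str) -> Optional[dict]:
        ...

class LocalStorage(Storage):
    """In-memory stand-in for the local DB side of the abstraction."""

    def __init__(self):
        self._data = {}

    def save(self, key: str, value: dict) -> None:
        self._data[key] = value

    def load(self, key: str) -> Optional[dict]:
        return self._data.get(key)

A remote counterpart (for example, one wrapping the Supabase client mentioned under 0.6) would implement the same two methods, letting the crawler and indexer stay agnostic about where data lives.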

0.9

  • TF-IDF only on title
  • Better similarity measure on content
  • Generalize Storage Class
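
A "better similarity measure on content" usually means comparing whole term-weight vectors rather than summing per-term scores, with cosine similarity as the standard choice. The following is a sketch of that general technique, not necessarily what 0.9 shipped:

import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Example: two TF-IDF vectors for a query and a document
print(cosine_similarity({"search": 0.4, "engine": 0.7}, {"search": 0.5, "index": 0.2}))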

0.8

  • Optimize the deployment
  • Remove the nltk processing
  • Refactor the codebase
  • Migrate from local_db to cloud Phase-1
  • Optimize the user interface

0.7

  • Replace content with metadata (perhaps?)
  • Extract background worker sites from env
  • AI support Beta
  • Template optimizations

0.6

  • Extract timestamp and sort accordingly
  • Remote crawler service (use background workers)
  • Analyze the extractable metadata
  • Error Logger to supabase for analytics

0.5 and earlier

  • Don't download every time the query engine is started
  • Crawler doesn't follow the schema of remote_db
  • Tracking variables on the server
  • UI Re-org
  • Title TF-IDF
  • Join contents with .join(" ")
  • Optimize parser to extract data effectively
  • Add tests

Track uptime here: Uptime