
Phantom Search Engine


Phantom Search Engine is a lightweight, distributed web search engine designed to provide fast and relevant search results.

Phantom Demo

Features

  • Distributed crawler system for efficient web crawling
  • Multithreaded crawling for concurrent processing
  • TF-IDF-based indexing for fast search and retrieval
  • Query engine for processing user queries and returning relevant results
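
To make the TF-IDF feature concrete, here is a minimal, self-contained sketch of how term weights can be computed and used to rank documents. This is toy code illustrating the general technique, not Phantom's actual indexer:

import math
from collections import Counter

def build_index(documents):
    """Build a toy TF-IDF index mapping term -> {doc_id: weight}."""
    n_docs = len(documents)
    df = Counter()   # document frequency of each term
    term_freqs = {}
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        term_freqs[doc_id] = counts
        df.update(counts.keys())

    index = {}
    for doc_id, counts in term_freqs.items():
        total = sum(counts.values())
        for term, count in counts.items():
            tf = count / total                 # term frequency
            idf = math.log(n_docs / df[term])  # inverse document frequency
            index.setdefault(term, {})[doc_id] = tf * idf
    return index

def search(index, query):
    """Rank documents by the summed TF-IDF weight of the query terms."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    return scores.most_common()

docs = {
    "a": "phantom is a lightweight search engine",
    "b": "distributed crawlers fetch pages for the search index",
}
print(search(build_index(docs), "search engine"))

Note that a term appearing in every document gets an IDF of 0, so a document matched only on such terms scores 0 (compare the 0.9.2 note below about hiding zero-score results).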

Getting Started

Prerequisites

  • Python 3.8 or higher
  • pip

Installation

  1. Clone the repository:
git clone https://github.com/AnsahMohammad/Phantom.git
cd Phantom
  2. Create a virtual environment and activate it:
python3 -m venv .env
source .env/bin/activate
  3. Install the necessary dependencies:
pip install -r requirements.txt
  4. Build the files:
./build.sh
  5. Open the Search Engine GUI:
python phantom.py

Building from Source

  1. Run the build.sh script:
./build.sh

This script performs the following actions:

  • Creates a virtual environment and activates it.
  • Installs the necessary dependencies from the requirements.txt file.
  • Runs the Phantom crawler with the specified parameters.
  • Downloads the necessary NLTK packages, stopwords and punkt (see the snippet after this list).
  • Runs the Phantom indexing module.
  2. Start the query engine locally in the terminal by running the search.sh file:
./search.sh
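
build.sh fetches the required NLTK data automatically. If you ever need to fetch it by hand (for example, in a fresh environment), the standard NLTK downloader works; this snippet assumes nothing beyond the stock nltk package:

import nltk

# Fetch the corpora that build.sh relies on
nltk.download("stopwords")
nltk.download("punkt")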

Alternative Method

  1. Update the necessary parameters in the crawl.sh file and run it
  2. Run local_search.sh to index the crawled sites and start the query engine on the result

Note: Read the documentation here

Contributing

We welcome contributions! Please see our CONTRIBUTING.md for details on how to contribute to this project.

License

This project is licensed under the terms of the Apache License 2.0. See the LICENSE file for details.

Development and Maintenance

0.9.2

  • Do not send pages to the DB when both title and content are empty
  • Do not show results with a score of 0

0.9.1

  • Error handling
  • Consistency in logs
  • Enable local DB

0.10+

  • Distributed query processing
  • Caching locally
  • Two layer crawling
  • Optimize the scheduler by storing visited nodes
  • Use a unified crawler system in a master-slave architecture
  • Create Storage abstraction classes for local and remote clients (see the sketch after this list)
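
As a rough illustration of the last item, the Storage abstraction could follow the pattern sketched below. The class and method names are hypothetical, not taken from the codebase:

from abc import ABC, abstractmethod
from typing import Optional

class Storage(ABC):
    """Common interface for local and remote persistence (hypothetical sketch)."""

    @abstractmethod
    def save(self, key: str, value: dict) -> None:
        ...

    @abstractmethod
    def load(self, key: str) -> Optional[dict]:
        ...

class LocalStorage(Storage):
    """In-memory stand-in for the local DB side of the abstraction."""

    def __init__(self):
        self._data = {}

    def save(self, key: str, value: dict) -> None:
        self._data[key] = value

    def load(self, key: str) -> Optional[dict]:
        return self._data.get(key)

A remote counterpart (for example, one wrapping the Supabase client mentioned under 0.6) would implement the same two methods, letting the crawler and indexer stay agnostic about where data lives.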

0.9

  • TF-IDF only on title
  • Better similarity measure on content
  • Generalize Storage Class
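
A "better similarity measure on content" usually means comparing whole term-weight vectors rather than summing per-term scores, with cosine similarity as the standard choice. The following is a sketch of that general technique, not necessarily what 0.9 shipped:

import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Example: two TF-IDF vectors for a query and a document
print(cosine_similarity({"search": 0.4, "engine": 0.7}, {"search": 0.5, "index": 0.2}))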

0.8

  • Optimize the deployment
  • Remove the nltk processing
  • Refactor the codebase
  • Migrate from local_db to cloud Phase-1
  • Optimize the user interface

0.7

  • Replace content with metadata (perhaps?)
  • Extract background worker sites from env
  • AI support Beta
  • Template optimizations

0.6

  • Extract timestamp and sort accordingly
  • Remote crawler service (use background workers)
  • Analyze the extractable metadata
  • Error Logger to supabase for analytics

0.5 and earlier

  • Don't download every time the query engine is started
  • Crawler doesn't follow the schema of remote_db
  • Tracking variables on the server
  • UI Re-org
  • Title TF-IDF
  • Join contents with .join(" ")
  • Optimize parser to extract data effectively
  • Add tests

Track uptime here: Uptime