This project is a search engine that indexes a corpus of web pages and provides efficient search capabilities. It ranks search results by relevance using several techniques, including the PageRank algorithm, proximity scoring, and cosine similarity.
- 🌐 Efficient indexing of web pages
- 🔍 Advanced search capabilities with multi-word query handling
- 📄 Support for HTML content extraction, including titles and descriptions
- 📊 Proximity scoring to enhance search relevance
- 📦 Batch processing for indexing multiple documents
- 🔄 PageRank algorithm for ranking search results
- 🧠 Cosine similarity calculation for improved result accuracy
- 🔒 Easy setup with Streamlit for a user-friendly interface
- Python 3.9 or higher
- Required Python packages: Streamlit, BeautifulSoup, NLTK, SpaCy, NumPy
You can install the required packages using pip:
pip install streamlit beautifulsoup4 nltk spacy numpy
python -m spacy download en_core_web_sm
- Clone the repository:

  git clone https://github.com/yourusername/search-engine.git
  cd search-engine
- Run the application:

  streamlit run main.py
Once the server is running, you can access the search engine in your web browser at http://localhost:8501. You can generate an index and perform searches directly from the interface.
To index documents, place your HTML files in the webpages/WEBPAGES_RAW/ directory. The application will read and index these files when you click the "Generate Index" button.
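Extracting a page's title and description can be sketched with BeautifulSoup, which is listed in the requirements. The extract_metadata helper below is hypothetical; the project's actual extraction logic lives in index.py.

```python
from bs4 import BeautifulSoup  # provided by the beautifulsoup4 package

def extract_metadata(html):
    # Illustrative sketch: pull the <title> text and the meta description.
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    meta = soup.find("meta", attrs={"name": "description"})
    description = meta.get("content", "") if meta else ""
    return title, description

page = ('<html><head><title>UCI Home</title>'
        '<meta name="description" content="Campus portal"></head>'
        '<body></body></html>')
print(extract_metadata(page))  # ('UCI Home', 'Campus portal')
```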
The search engine provides a simple web interface for interaction. For detailed API documentation, refer to the source code in search.py and index.py.
The search engine can be configured by modifying the source code. Key parameters include:
- NUM_DOCS_TO_INDEX: the number of documents to index (default: 50).
- documents: a JSON file mapping document IDs to URLs.
To set up the development environment:
- Install Python and pip: https://www.python.org/downloads/
- Clone the repository and navigate to the project directory.
- Install dependencies:
pip install -r requirements.txt
- Run the application:
streamlit run main.py
To run the test suite, you can create unit tests in a separate test file and execute them using:
pytest
The relevance of search terms is influenced by their proximity within the text. When words are located close to each other, they are deemed more relevant to the search query. The proximity score is calculated as the inverse of the average distance between consecutive occurrences of the same term, ensuring that closely placed terms receive higher relevance.
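The rule above can be sketched in a few lines. The proximity_score helper and its positions-list input are illustrative, not the exact code in search.py:

```python
def proximity_score(positions):
    # positions: sorted token offsets of one term within a document.
    # Score = inverse of the average gap between consecutive occurrences,
    # so tightly clustered occurrences score higher.
    if len(positions) < 2:
        return 0.0
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return len(gaps) / sum(gaps)  # 1 / mean gap

print(proximity_score([10, 11, 12]))  # 1.0   (mean gap of 1)
print(proximity_score([10, 50, 90]))  # 0.025 (mean gap of 40)
```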
The search engine utilizes an index to identify documents containing the search terms. Documents are ranked based on the frequency of term occurrences, with the PageRank algorithm further refining the ranking to prioritize more authoritative sources.
For multi-word queries, the search engine breaks down the input into individual words and searches for each in the index. Documents are ranked based on the number of matches for each word, with higher scores for documents containing all query terms.
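A minimal sketch of this match-counting scheme, with hypothetical names (the real ranking in search.py also folds in proximity, PageRank, and cosine similarity):

```python
from collections import defaultdict

def rank_documents(query, inverted_index):
    # Each document scores one point per distinct query word it contains,
    # so documents matching every term sort first.
    scores = defaultdict(int)
    for word in query.lower().split():
        for doc_id in inverted_index.get(word, ()):
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

index = {"irvine": {"doc1", "doc2"}, "campus": {"doc1"}}
print(rank_documents("Irvine campus", index))  # doc1 (2 matches) outranks doc2 (1)
```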
When a document contains links to another, the anchor text (the clickable text in a hyperlink) is added to the target document's index. This enhances the relevance of the linked document, as it reflects how other pages describe it.
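The idea can be illustrated with a small helper. index_anchor_text is hypothetical, not the project's exact API; the point is that the *target* document, not the linking one, gets credited with the anchor words:

```python
def index_anchor_text(inverted_index, anchor_text, target_doc_id):
    # Add each anchor-text word to the target document's postings.
    for term in anchor_text.lower().split():
        inverted_index.setdefault(term, set()).add(target_doc_id)

inverted_index = {}
# Some other page links to doc42 as: <a href="...">UCI admissions page</a>
index_anchor_text(inverted_index, "UCI admissions page", "doc42")
print(inverted_index["admissions"])  # {'doc42'}
```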
The PageRank algorithm assesses the importance of web pages based on the number of incoming links from other pages. The more links a page receives, the higher its rank, indicating greater authority. The algorithm constructs an adjacency matrix to represent links between documents and iteratively calculates scores until convergence.
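The iterative calculation described above can be sketched with NumPy power iteration. Function and variable names here are illustrative, and the project's implementation may handle damping and dangling pages differently:

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-9):
    # adj[i, j] = 1 means document i links to document j.
    n = adj.shape[0]
    out_degree = adj.sum(axis=1, keepdims=True)
    out_degree[out_degree == 0] = 1          # guard against dangling pages
    transition = adj / out_degree            # row-stochastic link matrix
    ranks = np.full(n, 1.0 / n)
    while True:                              # iterate until convergence
        updated = (1 - damping) / n + damping * transition.T @ ranks
        if np.abs(updated - ranks).sum() < tol:
            return updated
        ranks = updated

# Pages 0 and 1 both link to page 2, so page 2 ranks highest.
adj = np.array([[0, 0, 1], [0, 0, 1], [1, 0, 0]], dtype=float)
print(pagerank(adj))
```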
Cosine similarity measures the similarity between two vectors, typically representing term frequencies in documents. It is calculated as the dot product of the vectors divided by the product of their magnitudes. This metric helps determine how closely related a document is to a search query.
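The formula is direct to write down; this sketch assumes both vectors are aligned to the same term ordering:

```python
import math

def cosine_similarity(a, b):
    # Dot product of the term-frequency vectors over the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    magnitude = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / magnitude if magnitude else 0.0

print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # ~1.0: same direction
print(cosine_similarity([1, 0], [0, 3]))        # 0.0: no shared terms
```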
The project uses zlib for data compression, significantly reducing the index's storage footprint. A user-friendly interface built with Streamlit allows easy interaction with the search engine, and multithreading speeds up index generation.
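Because a term-to-postings index is highly repetitive text, zlib shrinks it substantially. The sample index_text below is made up for illustration; the project compresses index.txt into index.txt.zz the same way:

```python
import zlib

index_text = "irvine:doc1,doc3\ncampus:doc1\n" * 500
raw = index_text.encode("utf-8")
compressed = zlib.compress(raw)
print(len(raw), "->", len(compressed))  # repetitive text compresses well
assert zlib.decompress(compressed).decode("utf-8") == index_text  # lossless
```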
- Install Streamlit and other dependencies:

  pip install streamlit spacy
- Run the program:

  streamlit run main.py
- In your browser:
  - Click "Generate Index" to create the index.
  - Enter your query in the search bar and click "Search".
- Ensure you have SpaCy installed and the English language model downloaded:

  pip install spacy
  python -m spacy download en_core_web_sm
The program generates the following files in the index/ directory:
- index.txt — the uncompressed index of terms.
- index.txt.zz — the zlib-compressed index of terms.
- docs_metadata.json — the title and description of each document, extracted during indexing.
- document_tfs.json — the term frequency of each document, used for cosine similarity calculations.
- Lemmatization: "run" and "ran" are reduced to the same term.
- Case sensitivity: "IrviNe" and "irvine" are treated as different terms.
This project is released under the MIT License. See the LICENSE file for details.