Semantic Search Engine

Authors: Deepak Shanmugam, Mohanakrishna V H, Vidya Sri Mani

Overview

This project implements an advanced semantic search engine that goes beyond traditional keyword-based search by understanding the meaning and context of queries. Built on a BBC News corpus, the system employs sophisticated Natural Language Processing (NLP) techniques to deliver more relevant search results by analyzing semantic relationships between words and concepts.

The search engine progressively enhances its capabilities through multiple implementation phases, starting from basic keyword matching to advanced semantic understanding using deep NLP features.

Dataset Description

The project utilizes the BBC News Dataset, a well-curated collection of news articles:

  • Total Documents: 2,225 articles
  • Time Period: 2004-2005
  • Categories: 5 distinct domains
    • Business
    • Entertainment
    • Politics
    • Sports
    • Technology
  • Source: BBC News Website
  • Dataset Link: http://mlg.ucd.ie/datasets/bbc.html

This multi-domain corpus provides diverse linguistic patterns and vocabulary, making it ideal for testing semantic search capabilities across different subject areas.

Problem Statement

Traditional keyword-based search engines often fail to understand the semantic intent behind user queries, leading to irrelevant results when exact keyword matches are absent. This project addresses this limitation by implementing a semantic search engine that:

  1. Understands the contextual meaning of queries
  2. Identifies semantic relationships between words (synonyms, hypernyms, hyponyms, etc.)
  3. Ranks documents based on semantic similarity rather than just keyword frequency
  4. Provides more accurate and contextually relevant search results

System Architecture

The system is built in a progressive manner with four distinct implementation tasks:

Task 1: Corpus Building

  • Collection and preprocessing of BBC News articles
  • Segmentation of documents into processable units
  • Creation of index mappings for efficient retrieval

Task 2: Keyword Search Index (Baseline)

  • Techniques: Segmentation, Tokenization
  • Indexing: Apache SOLR
  • Features: Basic word-level matching
  • Establishes baseline performance for comparison

Task 3: Semantic Search Index

  • Advanced NLP Features:
    • Lemmatization: Reducing words to their base forms
    • Stemming: Extracting word stems using Porter Stemmer
    • POS Tagging: Part-of-speech identification
    • Syntactic Parsing: Understanding sentence structure
    • Semantic Relations:
      • Hypernyms (general terms)
      • Hyponyms (specific terms)
      • Meronyms (part-of relationships)
      • Holonyms (whole-of relationships)
    • Head Word Extraction: Identifying the main concept using dependency parsing

Task 4: Enhanced Semantic Search (Optimized)

  • Improvements over Task 3:
    • POS-aware lemmatization for better accuracy
    • Improved head word extraction with synset resolution
    • Context-aware semantic relation extraction
    • Weighted boosting for different features:
      • Lemmas: 10.0x boost (highest priority)
      • Hypernyms: 7.0x boost
      • Stems: 6.0x boost
      • Words, POS tags, hyponyms, meronyms, holonyms: 1.0x boost
    • Optimized query construction for better relevance

System Flow Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                         SEMANTIC SEARCH ENGINE FLOW                          │
└─────────────────────────────────────────────────────────────────────────────┘

                              ┌──────────────────┐
                              │  BBC News Corpus │
                              │  (2,225 articles)│
                              └────────┬─────────┘
                                       │
                                       ▼
                        ┌──────────────────────────┐
                        │   Index Creation Module  │
                        │  (IndexCreation.py)      │
                        └──────────────────────────┘
                                       │
                ┌──────────────────────┼──────────────────────┐
                │                      │                      │
                ▼                      ▼                      ▼
        ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
        │  TASK 2 INDEX │     │  TASK 3 INDEX │     │  TASK 4 INDEX │
        │  (Baseline)   │     │  (Semantic)   │     │  (Enhanced)   │
        └───────────────┘     └───────────────┘     └───────────────┘
        │               │     │               │     │               │
        │ • Words       │     │ • Words       │     │ • Words       │
        │ • Lemmas      │     │ • Lemmas      │     │ • Lemmas^10x  │
        │ • Stems       │     │ • Stems       │     │ • Stems^6x    │
        │               │     │ • POS Tags    │     │ • POS+Words   │
        │               │     │ • Head Word   │     │ • Head Word   │
        │               │     │ • Hypernyms   │     │ • Hypernyms^7x│
        │               │     │ • Hyponyms    │     │ • Hyponyms    │
        │               │     │ • Meronyms    │     │ • Meronyms    │
        │               │     │ • Holonyms    │     │ • Holonyms    │
        └───────┬───────┘     └───────┬───────┘     └───────┬───────┘
                │                     │                     │
                └─────────────────────┼─────────────────────┘
                                      │
                                      ▼
                            ┌──────────────────┐
                            │  Apache SOLR     │
                            │  Search Platform │
                            └─────────┬────────┘
                                      │
                                      │
                    ┌─────────────────┴─────────────────┐
                    │                                   │
                    ▼                                   ▼
        ┌────────────────────────┐        ┌────────────────────────┐
        │   User Query Input     │        │   NLP Processing       │
        │  "latest technology"   │───────▶│  (Query Analysis)      │
        └────────────────────────┘        └────────┬───────────────┘
                                                   │
                        ┌──────────────────────────┼──────────────────────────┐
                        │                          │                          │
                        ▼                          ▼                          ▼
              ┌──────────────────┐      ┌──────────────────┐      ┌──────────────────┐
              │   Query for      │      │   Query for      │      │   Query for      │
              │    TASK 2        │      │    TASK 3        │      │    TASK 4        │
              │                  │      │                  │      │                  │
              │ • Tokenization   │      │ • Words          │      │ • Words          │
              │ • Basic Lemmas   │      │ • Lemmas         │      │ • POS Lemmas     │
              │ • Stems          │      │ • Stems          │      │ • Stems          │
              │                  │      │ • POS Tags       │      │ • Head+Synset    │
              │                  │      │ • Head Word      │      │ • POS Hypernyms  │
              │                  │      │ • All Relations  │      │ • All Relations  │
              └────────┬─────────┘      └────────┬─────────┘      └────────┬─────────┘
                       │                         │                         │
                       └─────────────────────────┼─────────────────────────┘
                                                 │
                                                 ▼
                                    ┌─────────────────────────┐
                                    │   SOLR Search Engine    │
                                    │   (Ranking & Scoring)   │
                                    └────────────┬────────────┘
                                                 │
                                                 ▼
                                    ┌─────────────────────────┐
                                    │   Search Results        │
                                    │   (Top 10 Documents)    │
                                    │                         │
                                    │ Rank | Doc ID | Text    │
                                    │  1   | A123S5 | ...     │
                                    │  2   | A456S2 | ...     │
                                    │  3   | A789S1 | ...     │
                                    └─────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│                           NLP FEATURE PIPELINE                               │
└─────────────────────────────────────────────────────────────────────────────┘

Query Text ───▶ Tokenization ───▶ POS Tagging ───┬───▶ Lemmatization ───┐
                     │                            │                       │
                     │                            └───▶ Head Word ────────┤
                     │                                  Extraction        │
                     ▼                                                    │
                 Stemming ──────────────────────────────────────────────┐ │
                     │                                                  │ │
                     ▼                                                  │ │
              WordNet Lookup ───┬───▶ Hypernyms ────────────────────┐  │ │
                                │                                    │  │ │
                                ├───▶ Hyponyms ─────────────────────┤  │ │
                                │                                    │  │ │
                                ├───▶ Meronyms ─────────────────────┤  │ │
                                │                                    │  │ │
                                └───▶ Holonyms ─────────────────────┤  │ │
                                                                     │  │ │
                                                                     ▼  ▼ ▼
                                                           ┌──────────────────┐
                                                           │ Feature Vector   │
                                                           │ for SOLR Query   │
                                                           └──────────────────┘

Weighted Boosting Strategy (Task 4)

The system uses intelligent feature weighting to prioritize the most semantically relevant features:

| Feature Type | Boost Factor | Rationale |
|--------------|--------------|-----------|
| Lemmas | 10.0x | Base forms provide the strongest semantic match |
| Hypernyms | 7.0x | General terms broaden search scope effectively |
| Stems | 6.0x | Root forms capture word variations |
| Words | 1.0x | Original terms maintain query intent |
| POS Tags | 1.0x | Grammatical context adds precision |
| Head Word | 1.0x | Main concept anchors the search |
| Hyponyms | 1.0x | Specific terms add detail |
| Meronyms | 1.0x | Part-of relationships add context |
| Holonyms | 1.0x | Whole-of relationships add context |
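In SOLR's standard query syntax, these weights translate into `field:(terms)^boost` clauses. A minimal sketch of how the boosted query might be assembled (field names and the exact clause layout are assumptions; the real query construction in SemanticSearchEngine.py may differ):

```python
# Hedged sketch: turn extracted features into a boosted SOLR query string.
# Boost values follow the table above; field names are assumptions.
BOOSTS = {
    "lemmas": 10.0, "hypernyms": 7.0, "stems": 6.0,
    "words": 1.0, "pos_tags": 1.0, "head_word": 1.0,
    "hyponyms": 1.0, "meronyms": 1.0, "holonyms": 1.0,
}

def build_boosted_query(features):
    """features: dict mapping field name -> list of query terms."""
    clauses = []
    for field, terms in features.items():
        if not terms:
            continue
        boost = BOOSTS.get(field, 1.0)
        clauses.append("%s:(%s)^%.1f" % (field, " OR ".join(terms), boost))
    return " ".join(clauses)
```

For instance, `build_boosted_query({"lemmas": ["latest", "technology"]})` produces `lemmas:(latest OR technology)^10.0`.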

Implementation Details

Index Creation (IndexCreation.py)

Key Functions:

  • preprocess_corpus(path): Reads and preprocesses all articles from the corpus
  • read_articles(filePath): Reads individual article files
  • remove_article_title(data): Cleans article titles from content
  • create_index_map(listOfArticles): Creates word and sentence index mappings
  • extract_features(indexWordsMap, indexSentenceMap): Extracts NLP features for Task 3
  • extract_improvised_features(...): Extracts enhanced features for Task 4
  • lemmatize_words(posList): Basic lemmatization of words
  • improved_lemmatize_words(posList): POS-aware lemmatization
  • stem_words(wordsList): Applies Porter Stemmer to words
  • tag_pos_words(wordsList): Part-of-speech tagging
  • find_head_word(sentence): Basic head word extraction using dependency parsing
  • find_improvised_head_word(sentence): Enhanced head word extraction with synset resolution
  • extract_hypernyms(words): Extracts general terms from WordNet
  • extract_hyponyms(words): Extracts specific terms from WordNet
  • extract_meronyms(words): Extracts part-of relationships
  • extract_holonyms(words): Extracts whole-of relationships
  • index_features_with_solr(jsonFileName, inputChoice): Indexes features into SOLR
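As a rough illustration of `create_index_map`, the result table in the flow diagram suggests composite document IDs of the form `A<article>S<sentence>`. The sketch below assumes articles have already been segmented into sentences; the actual ID scheme and map layout in IndexCreation.py may differ.

```python
# Illustrative sketch of create_index_map: build a sentence-level index keyed
# by composite "A<article>S<sentence>" IDs (scheme inferred from the flow
# diagram's result table; the project's actual mapping may differ).
def create_index_map(list_of_articles):
    """list_of_articles: list of articles, each a list of sentence strings."""
    index_sentence_map = {}
    for a_idx, sentences in enumerate(list_of_articles, start=1):
        for s_idx, sentence in enumerate(sentences, start=1):
            index_sentence_map["A%dS%d" % (a_idx, s_idx)] = sentence
    return index_sentence_map
```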

Feature Extraction:

  • Tokenization and word extraction
  • Lemmatization using WordNet Lemmatizer
  • Stemming using Porter Stemmer
  • POS tagging using NLTK
  • Dependency parsing using Stanford CoreNLP
  • WordNet-based semantic relation extraction

Search Engine (SemanticSearchEngine.py)

Query Processing Functions:

  • process_query_to_extract_words(query): Tokenizes query into words
  • process_query_to_do_lemmatization(words): Basic lemmatization
  • process_query_to_do_improved_lemmatization(posTags): POS-aware lemmatization
  • process_query_to_do_stemming(words): Extracts word stems
  • process_query_to_do_pos_tagging(words): Identifies part-of-speech tags
  • process_query_to_extract_head_word(query): Dependency-based head word extraction
  • process_query_to_extract_improvised_head_word(query): Enhanced head word with synset resolution

Semantic Relation Extractors:

  • process_query_to_extract_hypernyms(words): Extracts general terms
  • process_query_to_extract_hyponyms(words): Extracts specific terms
  • process_query_to_extract_meronyms(words): Extracts part-of relationships
  • process_query_to_extract_holonyms(words): Extracts whole-of relationships
  • process_query_to_extract_improvised_hypernyms(posTags): POS-aware hypernym extraction
  • process_query_to_extract_improvised_hyponyms(posTags): POS-aware hyponym extraction
  • process_query_to_extract_improvised_meronyms(posTags): POS-aware meronym extraction
  • process_query_to_extract_improvised_holonyms(posTags): POS-aware holonym extraction

Search Functions:

  • search_in_solr(query, indexSentenceMap): Basic keyword search (Task 2)
  • search_in_solr_with_multiple_features(...): Multi-feature search (Task 3)
  • search_in_solr_with_multiple_improvised_features(...): Weighted search (Task 4)
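A hedged sketch of how the baseline search might hit SOLR through PySOLR (the `words` field name and the `task2` core name are assumptions based on the setup notes below):

```python
# Hedged sketch of search_in_solr (Task 2): OR the query tokens across the
# indexed field and ask SOLR for the top 10 hits. The "words" field and the
# task2 core name are assumptions.
def build_keyword_query(tokens):
    return "words:(%s)" % " OR ".join(tokens)

def search_task2(query_text, rows=10, solr_url="http://localhost:8983/solr/task2"):
    import pysolr  # imported lazily so the sketch loads without a SOLR setup
    solr = pysolr.Solr(solr_url, timeout=10)
    return solr.search(build_keyword_query(query_text.lower().split()), rows=rows)
```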

Features

Multi-level Search Capabilities

  • Keyword-based search
  • Semantic search with multiple NLP features
  • Weighted semantic search with feature boosting

Advanced NLP Processing

  • Lemmatization (basic and POS-aware)
  • Stemming
  • Part-of-Speech tagging
  • Dependency parsing
  • Semantic relation extraction using WordNet

Comprehensive Semantic Understanding

  • Hypernym/Hyponym relationships
  • Meronym/Holonym relationships
  • Head word identification
  • Context-aware processing

Efficient Indexing and Retrieval

  • Apache SOLR integration
  • JSON-based feature storage
  • Optimized query construction

Prerequisites

  • Python 3.8+
  • Apache SOLR (running on localhost:8983)
  • Stanford CoreNLP Server (running on localhost:9000)
  • NLTK with required corpora:
    • WordNet
    • Averaged Perceptron Tagger
    • Punkt Tokenizer

Installation

Option 1: Automated Setup (Recommended)

The project includes an automated setup script that creates a virtual environment and installs all dependencies:

# 1. Clone the repository
git clone https://github.com/mohanakrishnavh/Semantic-Search-Engine.git
cd Semantic-Search-Engine

# 2. Run the automated setup script
./setup.sh

# 3. Activate the virtual environment
source venv/bin/activate  # On macOS/Linux
# or
venv\Scripts\activate     # On Windows

The setup script will:

  • Check Python version (requires 3.8+)
  • Create a virtual environment
  • Install all Python dependencies from requirements.txt
  • Download necessary NLTK data packages

Option 2: Manual Setup

If you prefer manual installation:

1. Clone the Repository

git clone https://github.com/mohanakrishnavh/Semantic-Search-Engine.git
cd Semantic-Search-Engine

2. Create Virtual Environment

python3 -m venv venv
source venv/bin/activate  # On macOS/Linux
# or
venv\Scripts\activate     # On Windows

3. Install Dependencies

pip install --upgrade pip
pip install -r requirements.txt

4. Download NLTK Data

Run the following in a Python interpreter:

import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('stopwords')

External Services Setup

Setup Apache SOLR

  • Download and install Apache SOLR 8.x or higher
  • Create cores for each task:
    • task2 for keyword search
    • task3 for semantic search
    • task4 for improved semantic search
  • Configure schemas according to the features used in each task
  • Start SOLR: bin/solr start -p 8983

Setup Stanford CoreNLP

  • Download Stanford CoreNLP 4.x
  • Start the server:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-port 9000 -timeout 15000

Configuration Notes

  • File Paths: The code uses relative paths, so no manual path updates are needed.
  • Virtual Environment: Always activate the virtual environment before running the scripts.
  • Dependencies: All Python dependencies are specified in requirements.txt.

Usage

Activating the Environment

Before running any scripts, activate the virtual environment:

source venv/bin/activate  # On macOS/Linux
# or
venv\Scripts\activate     # On Windows

Creating Indices

Run the index creation script:

python pkg/IndexCreation.py

The script will prompt you to choose an option:

  • 1 - Task 2: Create keyword search index (basic tokenization)
  • 2 - Task 3: Create semantic search index (with NLP features)
  • 3 - Task 4: Create improved semantic search index (POS-aware with feature weighting)

Example:

Enter the option to continue with
 1. Task2 
 2. Task3
 3. Task4
> 3

Processing corpus...
Extracting features...
Creating index...
Index created successfully!

Running Searches

Execute the search engine:

python pkg/SemanticSearchEngine.py

Steps:

  1. Select the task (1 for Task 2, 2 for Task 3, or 3 for Task 4)
  2. Enter your search query
  3. View top 10 matching documents with their IDs and text excerpts

Example:

=== Semantic Search Engine ===

Available search modes:
1. Task 2: Keyword Search (baseline)
2. Task 3: Semantic Search (with NLP features)
3. Task 4: Enhanced Semantic Search (POS-aware with weighted features)

Enter the task number (1-3): 3

Enter your search query: What are the latest developments in technology?

Search Results:
------------------------------
Rank 1 - Document ID: 1523
Text: Technology companies unveiled new innovations...
...

Deactivating the Environment

When you're done, deactivate the virtual environment:

deactivate

Results and Analysis

Evaluation Metrics

  • Rank: Position of relevant documents in search results
  • Mean Reciprocal Rank (MRR): Average of reciprocal ranks across queries
  • Overall Accuracy: 63%
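For reference, MRR over a query set can be computed from the rank of the first relevant document per query. This is a minimal sketch, not the project's evaluation code:

```python
# Minimal MRR computation: reciprocal of the first relevant document's rank,
# averaged over queries; queries with no relevant hit contribute 0.
def mean_reciprocal_rank(first_relevant_ranks):
    """first_relevant_ranks: iterable of 1-based ranks, or None for a miss."""
    ranks = list(first_relevant_ranks)
    if not ranks:
        return 0.0
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)
```

For example, ranks of 1, 2, and a miss give (1 + 0.5 + 0) / 3 = 0.5.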

Performance Comparison

| Task | Approach | Features | Performance |
|------|----------|----------|-------------|
| Task 2 | Keyword Search | Tokenization | Baseline |
| Task 3 | Semantic Search | Lemmas, Stems, POS, Head Word, Semantic Relations | Improved |
| Task 4 | Weighted Semantic | POS-aware features + Weighted Boosting | Best (63% accuracy) |

Key Insights

  • Lemmatization with POS context provides the most significant improvement (10x boost)
  • Hypernyms contribute substantially to semantic understanding (7x boost)
  • Combining multiple semantic features yields better results than individual features
  • Head word extraction helps focus on main concepts
  • Weighted feature boosting significantly improves relevance ranking

Technology Stack

| Component | Technology |
|-----------|------------|
| Programming Language | Python 3.x |
| NLP Library | NLTK (Natural Language Toolkit) |
| Search Platform | Apache SOLR |
| Python-SOLR Interface | PySOLR |
| Dependency Parsing | Stanford CoreNLP |
| Lemmatization | WordNet Lemmatizer |
| Stemming | Porter Stemmer |
| Semantic Relations | WordNet |
| Data Processing | Pandas |
| POS Tagging | NLTK POS Tagger |

Project Structure

Semantic-Search-Engine/
├── README.md                    # Project documentation
├── requirements.txt             # Python dependencies
├── setup.sh                     # Automated setup script
├── .gitignore                   # Git exclusions
├── data/                        # BBC News corpus (2225 articles)
│   ├── 1.txt
│   ├── 2.txt
│   └── ...
├── raw_data/                    # Original unprocessed data
├── venv/                        # Virtual environment (created by setup.sh)
├── pkg/                         # Main source code package
│   ├── IndexCreation.py         # Index creation and feature extraction
│   ├── SemanticSearchEngine.py  # Search engine implementation
│   ├── Task2.json              # Keyword search index
│   ├── Task3.json              # Semantic search index
│   └── Task4.json              # Improved semantic search index
└── .project                     # Eclipse project file

Code Quality

This project follows Python best practices:

  • PEP 8 Compliance: All functions use snake_case naming convention
  • Comprehensive Documentation: All modules and functions have detailed docstrings
  • Relative Paths: Code uses relative paths for portability
  • Virtual Environment: Isolated dependency management
  • Version Control: Git with appropriate .gitignore patterns

Troubleshooting

Common Issues

Issue: ModuleNotFoundError when running scripts

  • Solution: Ensure virtual environment is activated: source venv/bin/activate

Issue: NLTK data not found errors

  • Solution: Run the setup script again or manually download NLTK data:
    import nltk
    nltk.download('all')

Issue: Cannot connect to SOLR

  • Solution: Verify SOLR is running on port 8983: curl http://localhost:8983/solr/

Issue: Stanford CoreNLP connection errors

  • Solution: Ensure CoreNLP server is running on port 9000 and has sufficient memory

Issue: Python version errors

  • Solution: This project requires Python 3.8 or higher. Check version: python3 --version

Getting Help

If you encounter issues not covered here:

  1. Check that all prerequisites are properly installed
  2. Verify that external services (SOLR, CoreNLP) are running
  3. Ensure the virtual environment is activated
  4. Review error messages carefully; they often indicate missing dependencies or configuration issues

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

License

This project is part of an academic assignment. Please check with the authors regarding usage and distribution.

Acknowledgments

  • BBC for providing the news corpus
  • NLTK and Stanford CoreNLP teams for excellent NLP tools
  • Apache SOLR community for the search platform

Contact

For questions or collaboration:

  • Deepak Shanmugam
  • Mohanakrishna Vanamala Hariprasad
  • Vidya Sri Mani

Note: This project demonstrates the evolution from basic keyword search to advanced semantic search, showcasing how NLP techniques can significantly improve information retrieval relevance.