Semantic Search Engine

Authors: Deepak Shanmugam, Mohanakrishna V H, Vidya Sri Mani

Overview

This project implements an advanced semantic search engine that goes beyond traditional keyword-based search by understanding the meaning and context of queries. Built on a BBC News corpus, the system employs sophisticated Natural Language Processing (NLP) techniques to deliver more relevant search results by analyzing semantic relationships between words and concepts.

The search engine progressively enhances its capabilities through multiple implementation phases, starting from basic keyword matching to advanced semantic understanding using deep NLP features.

Dataset Description

The project utilizes the BBC News Dataset, a well-curated collection of news articles:

  • Total Documents: 2,225 articles
  • Time Period: 2004-2005
  • Categories: 5 distinct domains
    • Business
    • Entertainment
    • Politics
    • Sports
    • Technology
  • Source: BBC News Website
  • Dataset Link: http://mlg.ucd.ie/datasets/bbc.html

This multi-domain corpus provides diverse linguistic patterns and vocabulary, making it ideal for testing semantic search capabilities across different subject areas.

Problem Statement

Traditional keyword-based search engines often fail to understand the semantic intent behind user queries, leading to irrelevant results when exact keyword matches are absent. This project addresses this limitation by implementing a semantic search engine that:

  1. Understands the contextual meaning of queries
  2. Identifies semantic relationships between words (synonyms, hypernyms, hyponyms, etc.)
  3. Ranks documents based on semantic similarity rather than just keyword frequency
  4. Provides more accurate and contextually relevant search results

System Architecture

The system is built in a progressive manner with four distinct implementation tasks:

Task 1: Corpus Building

  • Collection and preprocessing of BBC News articles
  • Segmentation of documents into processable units
  • Creation of index mappings for efficient retrieval

Task 2: Keyword Search Index (Baseline)

  • Techniques: Segmentation, Tokenization
  • Indexing: Apache SOLR
  • Features: Basic word-level matching
  • Establishes baseline performance for comparison

Task 3: Semantic Search Index

  • Advanced NLP Features:
    • Lemmatization: Reducing words to their base forms
    • Stemming: Extracting word stems using Porter Stemmer
    • POS Tagging: Part-of-speech identification
    • Syntactic Parsing: Understanding sentence structure
    • Semantic Relations:
      • Hypernyms (general terms)
      • Hyponyms (specific terms)
      • Meronyms (part-of relationships)
      • Holonyms (whole-of relationships)
    • Head Word Extraction: Identifying the main concept using dependency parsing

Task 4: Enhanced Semantic Search (Optimized)

  • Improvements over Task 3:
    • POS-aware lemmatization for better accuracy
    • Improved head word extraction with synset resolution
    • Context-aware semantic relation extraction
    • Weighted boosting for different features:
      • Lemmas: 10.0x boost (highest priority)
      • Hypernyms: 7.0x boost
      • Stems: 6.0x boost
      • Words, POS tags, hyponyms, meronyms, holonyms: 1.0x boost
    • Optimized query construction for better relevance

System Flow Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                         SEMANTIC SEARCH ENGINE FLOW                          │
└─────────────────────────────────────────────────────────────────────────────┘

                              ┌──────────────────┐
                              │  BBC News Corpus │
                              │  (2,225 articles)│
                              └────────┬─────────┘
                                       │
                                       ▼
                        ┌──────────────────────────┐
                        │   Index Creation Module  │
                        │  (IndexCreation.py)      │
                        └──────────────────────────┘
                                       │
                ┌──────────────────────┼──────────────────────┐
                │                      │                      │
                ▼                      ▼                      ▼
        ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
        │  TASK 2 INDEX │     │  TASK 3 INDEX │     │  TASK 4 INDEX │
        │  (Baseline)   │     │  (Semantic)   │     │  (Enhanced)   │
        └───────────────┘     └───────────────┘     └───────────────┘
        │               │     │               │     │               │
        │ • Words       │     │ • Words       │     │ • Words       │
        │ • Lemmas      │     │ • Lemmas      │     │ • Lemmas^10x  │
        │ • Stems       │     │ • Stems       │     │ • Stems^6x    │
        │               │     │ • POS Tags    │     │ • POS+Words   │
        │               │     │ • Head Word   │     │ • Head Word   │
        │               │     │ • Hypernyms   │     │ • Hypernyms^7x│
        │               │     │ • Hyponyms    │     │ • Hyponyms    │
        │               │     │ • Meronyms    │     │ • Meronyms    │
        │               │     │ • Holonyms    │     │ • Holonyms    │
        └───────┬───────┘     └───────┬───────┘     └───────┬───────┘
                │                     │                     │
                └─────────────────────┼─────────────────────┘
                                      │
                                      ▼
                            ┌──────────────────┐
                            │  Apache SOLR     │
                            │  Search Platform │
                            └─────────┬────────┘
                                      │
                                      │
                    ┌─────────────────┴─────────────────┐
                    │                                   │
                    ▼                                   ▼
        ┌────────────────────────┐        ┌────────────────────────┐
        │   User Query Input     │        │   NLP Processing       │
        │  "latest technology"   │───────▶│  (Query Analysis)      │
        └────────────────────────┘        └────────┬───────────────┘
                                                   │
                        ┌──────────────────────────┼──────────────────────────┐
                        │                          │                          │
                        ▼                          ▼                          ▼
              ┌──────────────────┐      ┌──────────────────┐      ┌──────────────────┐
              │   Query for      │      │   Query for      │      │   Query for      │
              │    TASK 2        │      │    TASK 3        │      │    TASK 4        │
              │                  │      │                  │      │                  │
              │ • Tokenization   │      │ • Words          │      │ • Words          │
              │ • Basic Lemmas   │      │ • Lemmas         │      │ • POS Lemmas     │
              │ • Stems          │      │ • Stems          │      │ • Stems          │
              │                  │      │ • POS Tags       │      │ • Head+Synset    │
              │                  │      │ • Head Word      │      │ • POS Hypernyms  │
              │                  │      │ • All Relations  │      │ • All Relations  │
              └────────┬─────────┘      └────────┬─────────┘      └────────┬─────────┘
                       │                         │                         │
                       └─────────────────────────┼─────────────────────────┘
                                                 │
                                                 ▼
                                    ┌─────────────────────────┐
                                    │   SOLR Search Engine    │
                                    │   (Ranking & Scoring)   │
                                    └────────────┬────────────┘
                                                 │
                                                 ▼
                                    ┌─────────────────────────┐
                                    │   Search Results        │
                                    │   (Top 10 Documents)    │
                                    │                         │
                                    │ Rank | Doc ID | Text    │
                                    │  1   | A123S5 | ...     │
                                    │  2   | A456S2 | ...     │
                                    │  3   | A789S1 | ...     │
                                    └─────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│                           NLP FEATURE PIPELINE                               │
└─────────────────────────────────────────────────────────────────────────────┘

Query Text ───▶ Tokenization ───▶ POS Tagging ───┬───▶ Lemmatization ───┐
                     │                            │                       │
                     │                            └───▶ Head Word ────────┤
                     │                                  Extraction        │
                     ▼                                                    │
                 Stemming ──────────────────────────────────────────────┐ │
                     │                                                  │ │
                     ▼                                                  │ │
              WordNet Lookup ───┬───▶ Hypernyms ────────────────────┐  │ │
                                │                                    │  │ │
                                ├───▶ Hyponyms ─────────────────────┤  │ │
                                │                                    │  │ │
                                ├───▶ Meronyms ─────────────────────┤  │ │
                                │                                    │  │ │
                                └───▶ Holonyms ─────────────────────┤  │ │
                                                                     │  │ │
                                                                     ▼  ▼ ▼
                                                           ┌──────────────────┐
                                                           │ Feature Vector   │
                                                           │ for SOLR Query   │
                                                           └──────────────────┘

Weighted Boosting Strategy (Task 4)

The system uses intelligent feature weighting to prioritize the most semantically relevant features:

| Feature Type | Boost Factor | Rationale |
|--------------|--------------|-----------|
| Lemmas | 10.0x | Base forms provide the strongest semantic match |
| Hypernyms | 7.0x | General terms broaden search scope effectively |
| Stems | 6.0x | Root forms capture word variations |
| Words | 1.0x | Original terms maintain query intent |
| POS Tags | 1.0x | Grammatical context adds precision |
| Head Word | 1.0x | Main concept anchors the search |
| Hyponyms | 1.0x | Specific terms add detail |
| Meronyms | 1.0x | Part-of relationships add context |
| Holonyms | 1.0x | Whole-of relationships add context |
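In SOLR's standard query syntax, these weights translate into `field:(terms)^boost` clauses. A minimal sketch of how the boosted query might be assembled (field names and the exact clause layout are assumptions; the real query construction in SemanticSearchEngine.py may differ):

```python
# Hedged sketch: turn extracted features into a boosted SOLR query string.
# Boost values follow the table above; field names are assumptions.
BOOSTS = {
    "lemmas": 10.0, "hypernyms": 7.0, "stems": 6.0,
    "words": 1.0, "pos_tags": 1.0, "head_word": 1.0,
    "hyponyms": 1.0, "meronyms": 1.0, "holonyms": 1.0,
}

def build_boosted_query(features):
    """features: dict mapping field name -> list of query terms."""
    clauses = []
    for field, terms in features.items():
        if not terms:
            continue
        boost = BOOSTS.get(field, 1.0)
        clauses.append("%s:(%s)^%.1f" % (field, " OR ".join(terms), boost))
    return " ".join(clauses)
```

For instance, `build_boosted_query({"lemmas": ["latest", "technology"]})` produces `lemmas:(latest OR technology)^10.0`.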

Implementation Details

Index Creation (IndexCreation.py)

Key Functions:

  • preprocess_corpus(path): Reads and preprocesses all articles from the corpus
  • read_articles(filePath): Reads individual article files
  • remove_article_title(data): Cleans article titles from content
  • create_index_map(listOfArticles): Creates word and sentence index mappings
  • extract_features(indexWordsMap, indexSentenceMap): Extracts NLP features for Task 3
  • extract_improvised_features(...): Extracts enhanced features for Task 4
  • lemmatize_words(posList): Basic lemmatization of words
  • improved_lemmatize_words(posList): POS-aware lemmatization
  • stem_words(wordsList): Applies Porter Stemmer to words
  • tag_pos_words(wordsList): Part-of-speech tagging
  • find_head_word(sentence): Basic head word extraction using dependency parsing
  • find_improvised_head_word(sentence): Enhanced head word extraction with synset resolution
  • extract_hypernyms(words): Extracts general terms from WordNet
  • extract_hyponyms(words): Extracts specific terms from WordNet
  • extract_meronyms(words): Extracts part-of relationships
  • extract_holonyms(words): Extracts whole-of relationships
  • index_features_with_solr(jsonFileName, inputChoice): Indexes features into SOLR
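As a rough illustration of `create_index_map`, the result table in the flow diagram suggests composite document IDs of the form `A<article>S<sentence>`. The sketch below assumes articles have already been segmented into sentences; the actual ID scheme and map layout in IndexCreation.py may differ.

```python
# Illustrative sketch of create_index_map: build a sentence-level index keyed
# by composite "A<article>S<sentence>" IDs (scheme inferred from the flow
# diagram's result table; the project's actual mapping may differ).
def create_index_map(list_of_articles):
    """list_of_articles: list of articles, each a list of sentence strings."""
    index_sentence_map = {}
    for a_idx, sentences in enumerate(list_of_articles, start=1):
        for s_idx, sentence in enumerate(sentences, start=1):
            index_sentence_map["A%dS%d" % (a_idx, s_idx)] = sentence
    return index_sentence_map
```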

Feature Extraction:

  • Tokenization and word extraction
  • Lemmatization using WordNet Lemmatizer
  • Stemming using Porter Stemmer
  • POS tagging using NLTK
  • Dependency parsing using Stanford CoreNLP
  • WordNet-based semantic relation extraction

Search Engine (SemanticSearchEngine.py)

Query Processing Functions:

  • process_query_to_extract_words(query): Tokenizes query into words
  • process_query_to_do_lemmatization(words): Basic lemmatization
  • process_query_to_do_improved_lemmatization(posTags): POS-aware lemmatization
  • process_query_to_do_stemming(words): Extracts word stems
  • process_query_to_do_pos_tagging(words): Identifies part-of-speech tags
  • process_query_to_extract_head_word(query): Dependency-based head word extraction
  • process_query_to_extract_improvised_head_word(query): Enhanced head word with synset resolution

Semantic Relation Extractors:

  • process_query_to_extract_hypernyms(words): Extracts general terms
  • process_query_to_extract_hyponyms(words): Extracts specific terms
  • process_query_to_extract_meronyms(words): Extracts part-of relationships
  • process_query_to_extract_holonyms(words): Extracts whole-of relationships
  • process_query_to_extract_improvised_hypernyms(posTags): POS-aware hypernym extraction
  • process_query_to_extract_improvised_hyponyms(posTags): POS-aware hyponym extraction
  • process_query_to_extract_improvised_meronyms(posTags): POS-aware meronym extraction
  • process_query_to_extract_improvised_holonyms(posTags): POS-aware holonym extraction

Search Functions:

  • search_in_solr(query, indexSentenceMap): Basic keyword search (Task 2)
  • search_in_solr_with_multiple_features(...): Multi-feature search (Task 3)
  • search_in_solr_with_multiple_improvised_features(...): Weighted search (Task 4)
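A hedged sketch of how the baseline search might hit SOLR through PySOLR (the `words` field name and the `task2` core name are assumptions based on the setup notes below):

```python
# Hedged sketch of search_in_solr (Task 2): OR the query tokens across the
# indexed field and ask SOLR for the top 10 hits. The "words" field and the
# task2 core name are assumptions.
def build_keyword_query(tokens):
    return "words:(%s)" % " OR ".join(tokens)

def search_task2(query_text, rows=10, solr_url="http://localhost:8983/solr/task2"):
    import pysolr  # imported lazily so the sketch loads without a SOLR setup
    solr = pysolr.Solr(solr_url, timeout=10)
    return solr.search(build_keyword_query(query_text.lower().split()), rows=rows)
```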

Features

Multi-level Search Capabilities

  • Keyword-based search
  • Semantic search with multiple NLP features
  • Weighted semantic search with feature boosting

Advanced NLP Processing

  • Lemmatization (basic and POS-aware)
  • Stemming
  • Part-of-Speech tagging
  • Dependency parsing
  • Semantic relation extraction using WordNet

Comprehensive Semantic Understanding

  • Hypernym/Hyponym relationships
  • Meronym/Holonym relationships
  • Head word identification
  • Context-aware processing

Efficient Indexing and Retrieval

  • Apache SOLR integration
  • JSON-based feature storage
  • Optimized query construction

Prerequisites

  • Python 3.8+
  • Apache SOLR (running on localhost:8983)
  • Stanford CoreNLP Server (running on localhost:9000)
  • NLTK with required corpora:
    • WordNet
    • Averaged Perceptron Tagger
    • Punkt Tokenizer

Installation

Option 1: Automated Setup (Recommended)

The project includes an automated setup script that creates a virtual environment and installs all dependencies:

# 1. Clone the repository
git clone https://github.com/mohanakrishnavh/Semantic-Search-Engine.git
cd Semantic-Search-Engine

# 2. Run the automated setup script
./setup.sh

# 3. Activate the virtual environment
source venv/bin/activate  # On macOS/Linux
# or
venv\Scripts\activate     # On Windows

The setup script will:

  • Check Python version (requires 3.8+)
  • Create a virtual environment
  • Install all Python dependencies from requirements.txt
  • Download necessary NLTK data packages

Option 2: Manual Setup

If you prefer manual installation:

1. Clone the Repository

git clone https://github.com/mohanakrishnavh/Semantic-Search-Engine.git
cd Semantic-Search-Engine

2. Create Virtual Environment

python3 -m venv venv
source venv/bin/activate  # On macOS/Linux
# or
venv\Scripts\activate     # On Windows

3. Install Dependencies

pip install --upgrade pip
pip install -r requirements.txt

4. Download NLTK Data

Run the following in a Python interpreter:

import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('stopwords')

External Services Setup

Setup Apache SOLR

  • Download and install Apache SOLR 8.x or higher
  • Create cores for each task:
    • task2 for keyword search
    • task3 for semantic search
    • task4 for improved semantic search
  • Configure schemas according to the features used in each task
  • Start SOLR: bin/solr start -p 8983

Setup Stanford CoreNLP

  • Download Stanford CoreNLP 4.x
  • Start the server:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-port 9000 -timeout 15000

Configuration Notes

  • File Paths: The code uses relative paths, so no manual path updates are needed.
  • Virtual Environment: Always activate the virtual environment before running the scripts.
  • Dependencies: All Python dependencies are specified in requirements.txt.

Usage

Activating the Environment

Before running any scripts, activate the virtual environment:

source venv/bin/activate  # On macOS/Linux
# or
venv\Scripts\activate     # On Windows

Creating Indices

Run the index creation script:

python pkg/IndexCreation.py

The script will prompt you to choose an option:

  • 1 - Task 2: Create keyword search index (basic tokenization)
  • 2 - Task 3: Create semantic search index (with NLP features)
  • 3 - Task 4: Create improved semantic search index (POS-aware with feature weighting)

Example:

Enter the option to continue with
 1. Task2 
 2. Task3
 3. Task4
> 3

Processing corpus...
Extracting features...
Creating index...
Index created successfully!

Running Searches

Execute the search engine:

python pkg/SemanticSearchEngine.py

Steps:

  1. Select the task (1 for Task 2, 2 for Task 3, or 3 for Task 4)
  2. Enter your search query
  3. View top 10 matching documents with their IDs and text excerpts

Example:

=== Semantic Search Engine ===

Available search modes:
1. Task 2: Keyword Search (baseline)
2. Task 3: Semantic Search (with NLP features)
3. Task 4: Enhanced Semantic Search (POS-aware with weighted features)

Enter the task number (1-3): 3

Enter your search query: What are the latest developments in technology?

Search Results:
------------------------------
Rank 1 - Document ID: 1523
Text: Technology companies unveiled new innovations...
...

Deactivating the Environment

When you're done, deactivate the virtual environment:

deactivate

Results and Analysis

Evaluation Metrics

  • Rank: Position of relevant documents in search results
  • Mean Reciprocal Rank (MRR): Average of reciprocal ranks across queries
  • Overall Accuracy: 63%
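For reference, MRR over a query set can be computed from the rank of the first relevant document per query. This is a minimal sketch, not the project's evaluation code:

```python
# Minimal MRR computation: reciprocal of the first relevant document's rank,
# averaged over queries; queries with no relevant hit contribute 0.
def mean_reciprocal_rank(first_relevant_ranks):
    """first_relevant_ranks: iterable of 1-based ranks, or None for a miss."""
    ranks = list(first_relevant_ranks)
    if not ranks:
        return 0.0
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)
```

For example, ranks of 1, 2, and a miss give (1 + 0.5 + 0) / 3 = 0.5.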

Performance Comparison

| Task | Approach | Features | Performance |
|------|----------|----------|-------------|
| Task 2 | Keyword Search | Tokenization | Baseline |
| Task 3 | Semantic Search | Lemmas, Stems, POS, Head Word, Semantic Relations | Improved |
| Task 4 | Weighted Semantic | POS-aware features + Weighted Boosting | Best (63% accuracy) |

Key Insights

  • Lemmatization with POS context provides the most significant improvement (10x boost)
  • Hypernyms contribute substantially to semantic understanding (7x boost)
  • Combining multiple semantic features yields better results than individual features
  • Head word extraction helps focus on main concepts
  • Weighted feature boosting significantly improves relevance ranking

Technology Stack

| Component | Technology |
|-----------|------------|
| Programming Language | Python 3.x |
| NLP Library | NLTK (Natural Language Toolkit) |
| Search Platform | Apache SOLR |
| Python-SOLR Interface | PySOLR |
| Dependency Parsing | Stanford CoreNLP |
| Lemmatization | WordNet Lemmatizer |
| Stemming | Porter Stemmer |
| Semantic Relations | WordNet |
| Data Processing | Pandas |
| POS Tagging | NLTK POS Tagger |

Project Structure

Semantic-Search-Engine/
├── README.md                    # Project documentation
├── requirements.txt             # Python dependencies
├── setup.sh                     # Automated setup script
├── .gitignore                   # Git exclusions
├── data/                        # BBC News corpus (2225 articles)
│   ├── 1.txt
│   ├── 2.txt
│   └── ...
├── raw_data/                    # Original unprocessed data
├── venv/                        # Virtual environment (created by setup.sh)
├── pkg/                         # Main source code package
│   ├── IndexCreation.py         # Index creation and feature extraction
│   ├── SemanticSearchEngine.py  # Search engine implementation
│   ├── Task2.json              # Keyword search index
│   ├── Task3.json              # Semantic search index
│   └── Task4.json              # Improved semantic search index
└── .project                     # Eclipse project file

Code Quality

This project follows Python best practices:

  • PEP 8 Compliance: All functions use snake_case naming convention
  • Comprehensive Documentation: All modules and functions have detailed docstrings
  • Relative Paths: Code uses relative paths for portability
  • Virtual Environment: Isolated dependency management
  • Version Control: Git with appropriate .gitignore patterns

Troubleshooting

Common Issues

Issue: ModuleNotFoundError when running scripts

  • Solution: Ensure virtual environment is activated: source venv/bin/activate

Issue: NLTK data not found errors

  • Solution: Run the setup script again or manually download NLTK data:
    import nltk
    nltk.download('all')

Issue: Cannot connect to SOLR

  • Solution: Verify SOLR is running on port 8983: curl http://localhost:8983/solr/

Issue: Stanford CoreNLP connection errors

  • Solution: Ensure CoreNLP server is running on port 9000 and has sufficient memory

Issue: Python version errors

  • Solution: This project requires Python 3.8 or higher. Check version: python3 --version

Getting Help

If you encounter issues not covered here:

  1. Check that all prerequisites are properly installed
  2. Verify that external services (SOLR, CoreNLP) are running
  3. Ensure the virtual environment is activated
  4. Review error messages carefully; they often indicate missing dependencies or configuration issues

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

License

This project is part of an academic assignment. Please check with the authors regarding usage and distribution.

Acknowledgments

  • BBC for providing the news corpus
  • NLTK and Stanford CoreNLP teams for excellent NLP tools
  • Apache SOLR community for the search platform

Contact

For questions or collaboration:

  • Deepak Shanmugam
  • Mohanakrishna Vanamala Hariprasad
  • Vidya Sri Mani

Note: This project demonstrates the evolution from basic keyword search to advanced semantic search, showcasing how NLP techniques can significantly improve information retrieval relevance.