Authors: Deepak Shanmugam, Mohanakrishna V H, Vidya Sri Mani
- Overview
- Dataset Description
- Problem Statement
- System Architecture
- Implementation Details
- Features
- Prerequisites
- Installation
- Usage
- Results and Analysis
- Technology Stack
- Project Structure
This project implements an advanced semantic search engine that goes beyond traditional keyword-based search by understanding the meaning and context of queries. Built on a BBC News corpus, the system employs sophisticated Natural Language Processing (NLP) techniques to deliver more relevant search results by analyzing semantic relationships between words and concepts.
The search engine progressively enhances its capabilities through multiple implementation phases, starting from basic keyword matching to advanced semantic understanding using deep NLP features.
The project utilizes the BBC News Dataset, a well-curated collection of news articles:
- Total Documents: 2,225 articles
- Time Period: 2004-2005
- Categories: 5 distinct domains
- Business
- Entertainment
- Politics
- Sports
- Technology
- Source: BBC News Website
- Dataset Link: http://mlg.ucd.ie/datasets/bbc.html
This multi-domain corpus provides diverse linguistic patterns and vocabulary, making it ideal for testing semantic search capabilities across different subject areas.
Traditional keyword-based search engines often fail to understand the semantic intent behind user queries, leading to irrelevant results when exact keyword matches are absent. This project addresses this limitation by implementing a semantic search engine that:
- Understands the contextual meaning of queries
- Identifies semantic relationships between words (synonyms, hypernyms, hyponyms, etc.)
- Ranks documents based on semantic similarity rather than just keyword frequency
- Provides more accurate and contextually relevant search results
The system is built progressively through four distinct implementation tasks:
- Task 1 (Corpus Preparation):
  - Collection and preprocessing of BBC News articles
  - Segmentation of documents into processable units
  - Creation of index mappings for efficient retrieval
- Task 2 (Keyword Search, baseline):
  - Techniques: segmentation, tokenization
  - Indexing: Apache SOLR
  - Features: basic word-level matching
  - Establishes the baseline performance for comparison
- Task 3 (Semantic Search):
  - Advanced NLP features (see the feature-extraction sketch after this list):
    - Lemmatization: reducing words to their base forms
    - Stemming: extracting word stems using the Porter Stemmer
    - POS tagging: part-of-speech identification
    - Syntactic parsing: understanding sentence structure
  - Semantic relations from WordNet:
    - Hypernyms (more general terms)
    - Hyponyms (more specific terms)
    - Meronyms (part-of relationships)
    - Holonyms (whole-of relationships)
  - Head word extraction: identifying the main concept using dependency parsing
- Task 4 (Enhanced Semantic Search):
  - Improvements over Task 3:
    - POS-aware lemmatization for better accuracy
    - Improved head word extraction with synset resolution
    - Context-aware semantic relation extraction
    - Weighted boosting for different features:
      - Lemmas: 10.0x boost (highest priority)
      - Stems: 6.0x boost
      - Hypernyms: 7.0x boost
      - Words, POS tags, hyponyms, meronyms, holonyms: 1.0x boost
    - Optimized query construction for better relevance
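A minimal sketch of how sentence segmentation and these per-sentence features can be produced with NLTK and WordNet is shown below. It assumes the relevant NLTK data packages (punkt, averaged_perceptron_tagger, wordnet) are installed; the function names and the crude "first synset" sense choice are illustrative, not the project's actual implementation.

```python
# Illustrative feature extraction for Tasks 3/4 using NLTK and WordNet.
from nltk import pos_tag, sent_tokenize, word_tokenize
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer, WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS class (POS-aware lemmatization, Task 4)."""
    return {"J": wn.ADJ, "V": wn.VERB, "R": wn.ADV}.get(treebank_tag[:1], wn.NOUN)

def extract_sentence_features(sentence):
    tokens = word_tokenize(sentence)
    tagged = pos_tag(tokens)
    features = {
        "words": tokens,
        "pos": [tag for _, tag in tagged],
        "lemmas": [lemmatizer.lemmatize(w, to_wordnet_pos(t)) for w, t in tagged],
        "stems": [stemmer.stem(w) for w in tokens],
        "hypernyms": [], "hyponyms": [], "meronyms": [], "holonyms": [],
    }
    for word in tokens:
        for synset in wn.synsets(word)[:1]:  # crude sense choice: first synset only
            features["hypernyms"] += [l.name() for s in synset.hypernyms() for l in s.lemmas()]
            features["hyponyms"]  += [l.name() for s in synset.hyponyms() for l in s.lemmas()]
            features["meronyms"]  += [l.name() for s in synset.part_meronyms() for l in s.lemmas()]
            features["holonyms"]  += [l.name() for s in synset.part_holonyms() for l in s.lemmas()]
    return features

# Documents are first segmented into sentences (Task 1), then each sentence is featurized:
for sentence in sent_tokenize("Technology companies unveiled new gadgets. Markets reacted."):
    print(extract_sentence_features(sentence))
```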
┌─────────────────────────────────────────────────────────────────────────────┐
│ SEMANTIC SEARCH ENGINE FLOW │
└─────────────────────────────────────────────────────────────────────────────┘
┌──────────────────┐
│ BBC News Corpus │
│ (2,225 articles)│
└────────┬─────────┘
│
▼
┌──────────────────────────┐
│ Index Creation Module │
│ (IndexCreation.py) │
└──────────────────────────┘
│
┌──────────────────────┼──────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ TASK 2 INDEX │ │ TASK 3 INDEX │ │ TASK 4 INDEX │
│ (Baseline) │ │ (Semantic) │ │ (Enhanced) │
└───────────────┘ └───────────────┘ └───────────────┘
│ │ │ │ │ │
│ • Words │ │ • Words │ │ • Words │
│ • Lemmas │ │ • Lemmas │ │ • Lemmas^10x │
│ • Stems │ │ • Stems │ │ • Stems^6x │
│ │ │ • POS Tags │ │ • POS+Words │
│ │ │ • Head Word │ │ • Head Word │
│ │ │ • Hypernyms │ │ • Hypernyms^7x│
│ │ │ • Hyponyms │ │ • Hyponyms │
│ │ │ • Meronyms │ │ • Meronyms │
│ │ │ • Holonyms │ │ • Holonyms │
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘
│ │ │
└─────────────────────┼─────────────────────┘
│
▼
┌──────────────────┐
│ Apache SOLR │
│ Search Platform │
└─────────┬────────┘
│
│
┌─────────────────┴─────────────────┐
│ │
▼ ▼
┌────────────────────────┐ ┌────────────────────────┐
│ User Query Input │ │ NLP Processing │
│ "latest technology" │───────▶│ (Query Analysis) │
└────────────────────────┘ └────────┬───────────────┘
│
┌──────────────────────────┼──────────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Query for │ │ Query for │ │ Query for │
│ TASK 2 │ │ TASK 3 │ │ TASK 4 │
│ │ │ │ │ │
│ • Tokenization │ │ • Words │ │ • Words │
│ • Basic Lemmas │ │ • Lemmas │ │ • POS Lemmas │
│ • Stems │ │ • Stems │ │ • Stems │
│ │ │ • POS Tags │ │ • Head+Synset │
│ │ │ • Head Word │ │ • POS Hypernyms │
│ │ │ • All Relations │ │ • All Relations │
└────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘
│ │ │
└─────────────────────────┼─────────────────────────┘
│
▼
┌─────────────────────────┐
│ SOLR Search Engine │
│ (Ranking & Scoring) │
└────────────┬────────────┘
│
▼
┌─────────────────────────┐
│ Search Results │
│ (Top 10 Documents) │
│ │
│ Rank | Doc ID | Text │
│ 1 | A123S5 | ... │
│ 2 | A456S2 | ... │
│ 3 | A789S1 | ... │
└─────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ NLP FEATURE PIPELINE │
└─────────────────────────────────────────────────────────────────────────────┘
Query Text ───▶ Tokenization ───▶ POS Tagging ───┬───▶ Lemmatization ───┐
│ │ │
│ └───▶ Head Word ────────┤
│ Extraction │
▼ │
Stemming ──────────────────────────────────────────────┐ │
│ │ │
▼ │ │
WordNet Lookup ───┬───▶ Hypernyms ────────────────────┐ │ │
│ │ │ │
├───▶ Hyponyms ─────────────────────┤ │ │
│ │ │ │
├───▶ Meronyms ─────────────────────┤ │ │
│ │ │ │
└───▶ Holonyms ─────────────────────┤ │ │
│ │ │
▼ ▼ ▼
┌──────────────────┐
│ Feature Vector │
│ for SOLR Query │
└──────────────────┘
The system uses intelligent feature weighting to prioritize the most semantically relevant features:
| Feature Type | Boost Factor | Rationale |
|---|---|---|
| Lemmas | 10.0x | Base forms provide strongest semantic match |
| Hypernyms | 7.0x | General terms broaden search scope effectively |
| Stems | 6.0x | Root forms capture word variations |
| Words | 1.0x | Original terms maintain query intent |
| POS Tags | 1.0x | Grammatical context adds precision |
| Head Word | 1.0x | Main concept anchors the search |
| Hyponyms | 1.0x | Specific terms add detail |
| Meronyms | 1.0x | Part-of relationships add context |
| Holonyms | 1.0x | Whole-of relationships add context |
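To make the weighting concrete, the sketch below attaches these boosts to query terms using SOLR's `field:term^boost` syntax. The field names (lemmas, stems, hypernyms, and so on) are assumptions about the index schema, not guaranteed to match the actual cores.

```python
# Hedged sketch of Task 4 weighted query construction using SOLR's field:term^boost syntax.
BOOSTS = {
    "lemmas": 10.0, "hypernyms": 7.0, "stems": 6.0,
    "words": 1.0, "pos": 1.0, "headword": 1.0,
    "hyponyms": 1.0, "meronyms": 1.0, "holonyms": 1.0,
}

def build_boosted_query(features):
    """features maps a field name to the list of query terms extracted for it."""
    clauses = []
    for field, terms in features.items():
        for term in terms:
            clauses.append(f"{field}:{term}^{BOOSTS.get(field, 1.0)}")
    return " OR ".join(clauses)

print(build_boosted_query({"lemmas": ["technology", "development"], "stems": ["technolog"]}))
# lemmas:technology^10.0 OR lemmas:development^10.0 OR stems:technolog^6.0
```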
Key Functions:
- preprocess_corpus(path): Reads and preprocesses all articles from the corpus
- read_articles(filePath): Reads individual article files
- remove_article_title(data): Cleans article titles from content
- create_index_map(listOfArticles): Creates word and sentence index mappings
- extract_features(indexWordsMap, indexSentenceMap): Extracts NLP features for Task 3
- extract_improvised_features(...): Extracts enhanced features for Task 4
- lemmatize_words(posList): Basic lemmatization of words
- improved_lemmatize_words(posList): POS-aware lemmatization
- stem_words(wordsList): Applies Porter Stemmer to words
- tag_pos_words(wordsList): Part-of-speech tagging
- find_head_word(sentence): Basic head word extraction using dependency parsing
- find_improvised_head_word(sentence): Enhanced head word extraction with synset resolution
- extract_hypernyms(words): Extracts general terms from WordNet
- extract_hyponyms(words): Extracts specific terms from WordNet
- extract_meronyms(words): Extracts part-of relationships
- extract_holonyms(words): Extracts whole-of relationships
- index_features_with_solr(jsonFileName, inputChoice): Indexes features into SOLR
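For orientation, a simplified sketch of what corpus reading in the spirit of preprocess_corpus, read_articles, and remove_article_title might look like is given below. It assumes one article per .txt file under data/ with the title on the first line (consistent with the layout shown in Project Structure); the actual functions may differ.

```python
# Simplified corpus reading: one BBC article per .txt file, title on the first line.
import os

def read_article(file_path):
    with open(file_path, encoding="utf-8", errors="ignore") as fh:
        return fh.read()

def remove_title(text):
    # Keep only the article body; the first line is treated as the title.
    _, _, body = text.partition("\n")
    return body.strip()

def preprocess_corpus(corpus_dir="data"):
    articles = {}
    for name in sorted(os.listdir(corpus_dir)):
        if name.endswith(".txt"):
            articles[name] = remove_title(read_article(os.path.join(corpus_dir, name)))
    return articles

print(len(preprocess_corpus()))  # expected: 2225 articles
```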
Feature Extraction:
- Tokenization and word extraction
- Lemmatization using WordNet Lemmatizer
- Stemming using Porter Stemmer
- POS tagging using NLTK
- Dependency parsing using Stanford CoreNLP
- WordNet-based semantic relation extraction
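Once features are extracted, pushing them into SOLR (the index_features_with_solr step) can be sketched with PySOLR as below. The core name (task4), the layout of Task4.json, and the always_commit setting are assumptions about the setup rather than the exact behaviour of the project's code.

```python
# Hedged sketch of indexing feature documents into a SOLR core with pysolr.
import json
import pysolr

# Assumes a core named "task4" exists and SOLR runs on localhost:8983.
solr = pysolr.Solr("http://localhost:8983/solr/task4", always_commit=True, timeout=10)

# Assumes Task4.json holds a list of documents such as
# {"id": "A123S5", "words": [...], "lemmas": [...], "stems": [...], ...}
with open("pkg/Task4.json", encoding="utf-8") as fh:
    documents = json.load(fh)

solr.add(documents)  # send the feature documents to SOLR and commit
print(f"Indexed {len(documents)} documents")
```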
Query Processing Functions:
- process_query_to_extract_words(query): Tokenizes the query into words
- process_query_to_do_lemmatization(words): Basic lemmatization
- process_query_to_do_improved_lemmatization(posTags): POS-aware lemmatization
- process_query_to_do_stemming(words): Extracts word stems
- process_query_to_do_pos_tagging(words): Identifies part-of-speech tags
- process_query_to_extract_head_word(query): Dependency-based head word extraction
- process_query_to_extract_improvised_head_word(query): Enhanced head word extraction with synset resolution
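One possible head-word extraction, using NLTK's CoreNLP wrapper against the server described in Prerequisites, is sketched below; the actual logic of process_query_to_extract_head_word() may differ.

```python
# Hedged sketch: the head word is taken to be the ROOT of the dependency parse.
# Requires a Stanford CoreNLP server on localhost:9000 (see Prerequisites).
from nltk.parse.corenlp import CoreNLPDependencyParser

parser = CoreNLPDependencyParser(url="http://localhost:9000")

def head_word(query):
    parse, = parser.raw_parse(query)            # one DependencyGraph for the query
    for node in parse.nodes.values():
        if node["head"] == 0 and node["word"]:  # token attached to the artificial root
            return node["word"]

print(head_word("What are the latest developments in technology?"))
```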
Semantic Relation Extractors:
- process_query_to_extract_hypernyms(words): Extracts general terms
- process_query_to_extract_hyponyms(words): Extracts specific terms
- process_query_to_extract_meronyms(words): Extracts part-of relationships
- process_query_to_extract_holonyms(words): Extracts whole-of relationships
- process_query_to_extract_improvised_hypernyms(posTags): POS-aware hypernym extraction
- process_query_to_extract_improvised_hyponyms(posTags): POS-aware hyponym extraction
- process_query_to_extract_improvised_meronyms(posTags): POS-aware meronym extraction
- process_query_to_extract_improvised_holonyms(posTags): POS-aware holonym extraction
Search Functions:
- search_in_solr(query, indexSentenceMap): Basic keyword search (Task 2)
- search_in_solr_with_multiple_features(...): Multi-feature search (Task 3)
- search_in_solr_with_multiple_improvised_features(...): Weighted search (Task 4)
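Running a weighted query against SOLR and printing the top 10 hits could look roughly like the sketch below; the core name and the stored field names ("id", "text") are assumptions, not the project's confirmed schema.

```python
# Hedged sketch of executing a weighted query and listing the top 10 ranked documents.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/task4", timeout=10)  # assumed core name

def search_top_10(boosted_query):
    results = solr.search(boosted_query, rows=10)  # SOLR handles scoring and ranking
    for rank, doc in enumerate(results, start=1):
        print(f"Rank {rank} - Document ID: {doc.get('id')}")
        print(f"Text: {doc.get('text')}")

search_top_10("lemmas:technology^10.0 OR stems:technolog^6.0")
```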
✅ Multi-level Search Capabilities
- Keyword-based search
- Semantic search with multiple NLP features
- Weighted semantic search with feature boosting
✅ Advanced NLP Processing
- Lemmatization (basic and POS-aware)
- Stemming
- Part-of-Speech tagging
- Dependency parsing
- Semantic relation extraction using WordNet
✅ Comprehensive Semantic Understanding
- Hypernym/Hyponym relationships
- Meronym/Holonym relationships
- Head word identification
- Context-aware processing
✅ Efficient Indexing and Retrieval
- Apache SOLR integration
- JSON-based feature storage
- Optimized query construction
- Python 3.8+
- Apache SOLR (running on localhost:8983)
- Stanford CoreNLP Server (running on localhost:9000)
- NLTK with required corpora:
- WordNet
- Averaged Perceptron Tagger
- Punkt Tokenizer
The project includes an automated setup script that creates a virtual environment and installs all dependencies:
# 1. Clone the repository
git clone https://github.com/mohanakrishnavh/Semantic-Search-Engine.git
cd Semantic-Search-Engine
# 2. Run the automated setup script
./setup.sh
# 3. Activate the virtual environment
source venv/bin/activate # On macOS/Linux
# or
venv\Scripts\activate    # On Windows

The setup script will:
- Check Python version (requires 3.8+)
- Create a virtual environment
- Install all Python dependencies from requirements.txt
- Download necessary NLTK data packages
If you prefer manual installation:
git clone https://github.com/mohanakrishnavh/Semantic-Search-Engine.git
cd Semantic-Search-Engine
python3 -m venv venv
source venv/bin/activate # On macOS/Linux
# or
venv\Scripts\activate    # On Windows
pip install --upgrade pip
pip install -r requirements.txt

Then download the required NLTK data from a Python session:

import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('stopwords')

- Download and install Apache SOLR 8.x or higher
- Create cores for each task:
  - task2 for keyword search
  - task3 for semantic search
  - task4 for improved semantic search
- Configure schemas according to the features used in each task
- Start SOLR:
bin/solr start -p 8983
- Download Stanford CoreNLP 4.x
- Start the server:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
-port 9000 -timeout 15000

- File Paths: The code now uses relative paths automatically. No path updates needed.
- Virtual Environment: Always activate the virtual environment before running the scripts.
- Dependencies: All Python dependencies are specified in requirements.txt.
Before running any scripts, activate the virtual environment:
source venv/bin/activate # On macOS/Linux
# or
venv\Scripts\activate    # On Windows

Run the index creation script:
python pkg/IndexCreation.py

The script will prompt you to choose an option:
1. Task 2: Create keyword search index (basic tokenization)
2. Task 3: Create semantic search index (with NLP features)
3. Task 4: Create improved semantic search index (POS-aware with feature weighting)
Example:
Enter the option to continue with
1. Task2
2. Task3
3. Task4
> 3
Processing corpus...
Extracting features...
Creating index...
Index created successfully!
Execute the search engine:
python pkg/SemanticSearchEngine.py

Steps:
- Select the task (1 for Task 2, 2 for Task 3, or 3 for Task 4)
- Enter your search query
- View top 10 matching documents with their IDs and text excerpts
Example:
=== Semantic Search Engine ===
Available search modes:
1. Task 2: Keyword Search (baseline)
2. Task 3: Semantic Search (with NLP features)
3. Task 4: Enhanced Semantic Search (POS-aware with weighted features)
Enter the task number (1-3): 3
Enter your search query: What are the latest developments in technology?
Search Results:
------------------------------
Rank 1 - Document ID: 1523
Text: Technology companies unveiled new innovations...
...
When you're done, deactivate the virtual environment:
deactivate

Evaluation metrics:
- Rank: Position of relevant documents in search results
- Mean Reciprocal Rank (MRR): Average of reciprocal ranks across queries
- Overall Accuracy: 63%
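As a worked illustration of the MRR metric (not the project's actual evaluation script), consider:

```python
# Mean Reciprocal Rank: average over queries of 1 / rank of the first relevant result.
def mean_reciprocal_rank(first_relevant_ranks):
    """first_relevant_ranks: rank of the first relevant document per query (0 if none found)."""
    return sum(1.0 / r for r in first_relevant_ranks if r > 0) / len(first_relevant_ranks)

# Three queries whose first relevant documents appeared at ranks 1, 2, and 4:
print(round(mean_reciprocal_rank([1, 2, 4]), 3))  # (1 + 0.5 + 0.25) / 3 = 0.583
```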
| Task | Approach | Features | Performance |
|---|---|---|---|
| Task 2 | Keyword Search | Tokenization | Baseline |
| Task 3 | Semantic Search | Lemmas, Stems, POS, Head Word, Semantic Relations | Improved |
| Task 4 | Weighted Semantic | POS-aware features + Weighted Boosting | Best (63% accuracy) |
- Lemmatization with POS context provides the most significant improvement (10x boost)
- Hypernyms contribute substantially to semantic understanding (7x boost)
- Combining multiple semantic features yields better results than individual features
- Head word extraction helps focus on main concepts
- Weighted feature boosting significantly improves relevance ranking
| Component | Technology |
|---|---|
| Programming Language | Python 3.x |
| NLP Library | NLTK (Natural Language Toolkit) |
| Search Platform | Apache SOLR |
| Python-SOLR Interface | PySOLR |
| Dependency Parsing | Stanford CoreNLP |
| Lemmatization | WordNet Lemmatizer |
| Stemming | Porter Stemmer |
| Semantic Relations | WordNet |
| Data Processing | Pandas |
| POS Tagging | NLTK POS Tagger |
Semantic-Search-Engine/
├── README.md # Project documentation
├── requirements.txt # Python dependencies
├── setup.sh # Automated setup script
├── .gitignore # Git exclusions
├── data/ # BBC News corpus (2225 articles)
│ ├── 1.txt
│ ├── 2.txt
│ └── ...
├── raw_data/ # Original unprocessed data
├── venv/ # Virtual environment (created by setup.sh)
├── pkg/ # Main source code package
│ ├── IndexCreation.py # Index creation and feature extraction
│ ├── SemanticSearchEngine.py # Search engine implementation
│ ├── Task2.json # Keyword search index
│ ├── Task3.json # Semantic search index
│ └── Task4.json # Improved semantic search index
└── .project # Eclipse project file
This project follows Python best practices:
- PEP 8 Compliance: All functions use snake_case naming convention
- Comprehensive Documentation: All modules and functions have detailed docstrings
- Relative Paths: Code uses relative paths for portability
- Virtual Environment: Isolated dependency management
- Version Control: Git with appropriate .gitignore patterns
Issue: ModuleNotFoundError when running scripts
- Solution: Ensure the virtual environment is activated:
source venv/bin/activate
Issue: NLTK data not found errors
- Solution: Run the setup script again or manually download NLTK data:
import nltk
nltk.download('all')
Issue: Cannot connect to SOLR
- Solution: Verify SOLR is running on port 8983:
curl http://localhost:8983/solr/
Issue: Stanford CoreNLP connection errors
- Solution: Ensure CoreNLP server is running on port 9000 and has sufficient memory
Issue: Python version errors
- Solution: This project requires Python 3.8 or higher. Check version:
python3 --version
If you encounter issues not covered here:
- Check that all prerequisites are properly installed
- Verify that external services (SOLR, CoreNLP) are running
- Ensure the virtual environment is activated
- Review error messages carefully - they often indicate missing dependencies or configuration issues
Contributions are welcome! Please feel free to submit issues or pull requests.
This project is part of an academic assignment. Please check with the authors regarding usage and distribution.
- BBC for providing the news corpus
- NLTK and Stanford CoreNLP teams for excellent NLP tools
- Apache SOLR community for the search platform
For questions or collaboration:
- Deepak Shanmugam
- Mohanakrishna Vanamala Hariprasad
- Vidya Sri Mani
Note: This project demonstrates the evolution from basic keyword search to advanced semantic search, showcasing how NLP techniques can significantly improve information retrieval relevance.