This project provides an advanced text summarization system that supports multiple types of documents (news articles, scientific papers, legal documents) and offers extractive, abstractive, and hybrid summarization. It features a Streamlit-based web interface, multi-language support, and evaluation with ROUGE, BLEU, and METEOR scores.
The project is deployed live with Streamlit, so you can try the summarization capabilities in real time. Check the deployed, interactive project here.
- Upload a document through the Streamlit interface.
- Select the type of summarization (extractive, abstractive, or hybrid).
- Generate the summary and view the results along with evaluation metrics.
- Customize the summary length and focus as needed.
- Use the interactive visualization tools to explore key sentences and their importance.
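The idea of "key sentences and their importance" in the last step can be illustrated with a small, dependency-free sketch. It is an assumption for illustration only, not the project's visualization code: the function names (`score_sentences`, `top_sentences`), the tiny stop-word list, and the word-frequency scoring are all hypothetical stand-ins for whatever the models in this repository actually compute.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "on", "the", "to"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def score_sentences(text):
    """Return (sentence, importance) pairs with importance scaled to [0, 1]."""
    sentences = split_sentences(text)
    tokenized = [
        [w for w in re.findall(r"[a-z0-9']+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]
    freq = Counter(w for words in tokenized for w in words)
    # Average word frequency, so long sentences are not favored automatically.
    raw = [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]
    top = max(raw, default=0.0)
    return [(s, r / top if top else 0.0) for s, r in zip(sentences, raw)]


def top_sentences(text, k=2):
    """Pick the k highest-scoring sentences, kept in document order."""
    scored = score_sentences(text)
    ranked = sorted(range(len(scored)), key=lambda i: scored[i][1], reverse=True)[:k]
    return [scored[i][0] for i in sorted(ranked)]


if __name__ == "__main__":
    doc = (
        "Summarization models compress long documents. "
        "Extractive models select sentences from the documents. "
        "The weather was pleasant."
    )
    for sentence, importance in score_sentences(doc):
        print(f"{importance:.2f}  {sentence}")
```

A highlighting view could map each importance value to a background color intensity, and the summary length setting could map to `k`.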
- Multi-Type Document Handling: Summarize various document types with adaptable preprocessing pipelines.
- Hybrid Summarization Approach: Integrates extractive (BERTSUM) and abstractive (T5, BART, Pegasus) techniques.
- Contextual Understanding: Incorporates topic modeling for better contextual understanding.
- User Interface: Streamlit-based web interface for document upload and summary generation.
- Evaluation Metrics: Detailed evaluation using ROUGE, BLEU, and METEOR scores.
- Multi-Language Support: Summarization in multiple languages using multilingual models.
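One common way to combine the two families of techniques is to run an extractive step first and then pass the selected sentences to an abstractive step. The sketch below shows only that composition and is not the project's implementation: `hybrid_summarize`, `lead_k`, and `truncate_words` are hypothetical names, and the two default stages are deliberately trivial placeholders where a BERTSUM-style selector and a T5, BART, or Pegasus generator would be plugged in.

```python
import re
from typing import Callable, List

# Both stages are plain callables so real models can be swapped in later.
Extractor = Callable[[List[str], int], List[str]]
Abstractor = Callable[[str], str]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def lead_k(sentences: List[str], k: int) -> List[str]:
    """Placeholder extractor: keep the first k sentences (a lead-k baseline)."""
    return sentences[:k]


def truncate_words(text: str, max_words: int = 20) -> str:
    """Placeholder abstractor: cap the word count. A real one would paraphrase."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words]) + " ..."


def hybrid_summarize(
    text: str,
    extractor: Extractor = lead_k,
    abstractor: Abstractor = truncate_words,
    k: int = 3,
) -> str:
    """Extract k sentences, then rewrite the selection with the abstractor."""
    selected = extractor(split_sentences(text), k)
    return abstractor(" ".join(selected))


if __name__ == "__main__":
    doc = "First point. Second point. Third point. Fourth point."
    print(hybrid_summarize(doc, k=2))
```

Keeping the stages as interchangeable callables means the extractive step can shorten the input before it reaches an abstractive model with a limited input length.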
- data/: Contains raw and processed datasets.
  - raw/: Original datasets categorized by document type.
  - processed/: Preprocessed datasets ready for model training.
- models/: Stores pre-trained and fine-tuned models.
  - extractive/: Models for extractive summarization.
  - abstractive/: Models for abstractive summarization.
  - hybrid/: Combined models for hybrid summarization.
- notebooks/: Jupyter notebooks for data preprocessing and model training.
- src/: Source code for data processing, model training, evaluation, and interface.
  - data/: Scripts for data loading, preprocessing, and topic modeling.
  - models/: Summarization model scripts.
  - evaluation/: Scripts for evaluation metrics.
  - interface/: Streamlit app and utility scripts.
  - visualization/: Scripts for visualizing summaries and highlighting sentences.
- tests/: Unit tests for data processing, models, evaluation, and interface.
- requirements.txt: List of dependencies.
- README.md: Project documentation.
- .gitignore: Git ignore file.
- Clone the repository:
git clone https://github.com/yourusername/advanced_text_summarization.git
cd advanced_text_summarization
- Create a virtual environment and activate it:
python3 -m venv venv
source venv/bin/activate
- Install the required dependencies:
pip install -r requirements.txt
- Data Preprocessing:
- Run the data preprocessing notebook to prepare the datasets:
jupyter notebook notebooks/data_preprocessing.ipynb
- Model Training:
- Train the extractive, abstractive, and hybrid models using the respective notebooks:
jupyter notebook notebooks/extractive_model_training.ipynb
jupyter notebook notebooks/abstractive_model_training.ipynb
jupyter notebook notebooks/hybrid_model_training.ipynb
- Running the Interface:
- Start the Streamlit app to interact with the summarization system:
streamlit run src/interface/app.py
- Evaluation:
- Evaluate the performance of the summarization models using the provided evaluation scripts:
python src/evaluation/rouge.py
python src/evaluation/bleu.py
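To make the ROUGE numbers easier to interpret, here is a minimal sketch of ROUGE-1, which measures unigram overlap between a generated summary and a reference. It is written from the metric's general definition and is an assumption about what such a score looks like, not a copy of `src/evaluation/rouge.py`. The function name `rouge_1` and the simple lowercase tokenizer are hypothetical, and no stemming is applied, so values may differ from those reported by a full ROUGE package.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs (simplified; no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall and F1 between two texts."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Counter intersection clips each word's count to the smaller of the two.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    scores = rouge_1("the cat sat on the mat", "the cat lay on the mat")
    print({name: round(value, 3) for name, value in scores.items()})
```

Recall rewards covering the reference, precision penalizes extra words in the summary, and F1 balances the two.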