A comprehensive guide and codebase for text summarization harnessing the capabilities of Large Language Models (LLMs). Delve deep into techniques, from chunking to clustering, and maximize the potential of LLMs like GPT-3.5 and GPT-4.
Article: I highly recommend reading this article before diving into the code.
- Clone the Repository
- Install Dependencies:
  python3 -m pip install -r requirements.txt
- Install spaCy's English Model:
  python3 -m spacy download en_core_web_sm
- Set Up OpenAI API Key:
  export OPENAI_API_KEY='sk-...'
- Configure IO: Navigate to `src/config.yaml` and update the `input_file` and `output_file` parameters under `io_config`.
- File Handling: For the input file, only `.txt` is accepted. For the output, `.json` is preferred. Place the input file in the `input` folder. The generated summary will be in the `output` folder.
- Run the Program:
  cd src/
  python3 main.py
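The file-handling rules and the API key requirement above can be checked before a run. The following is a minimal sketch, not part of the repository: the `preflight` helper and its arguments are hypothetical names introduced here for illustration, and it only validates the API key variable and the file extensions described in the steps above.

```python
import os
from pathlib import Path


def preflight(input_file, output_file, env=None):
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper: it mirrors the setup rules in this README and does
    not read src/config.yaml itself.
    """
    env = os.environ if env is None else env
    problems = []
    # The program needs an OpenAI key exported in the environment.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    # Only .txt input is accepted.
    if Path(input_file).suffix.lower() != ".txt":
        problems.append(f"input file must be .txt, got {input_file!r}")
    # .json output is preferred, so anything else is reported as well.
    if Path(output_file).suffix.lower() != ".json":
        problems.append(f"output file should be .json, got {output_file!r}")
    return problems
```

Calling `preflight("article.txt", "summary.json")` with the key exported returns an empty list; any returned strings can be printed as warnings before starting `main.py`.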
- `summary_type_token_limit`: Determines how to categorize the input text: short, medium, or long.
- `sentence_splitter`: Adjust `approx_total_doc_tokens`. Keep it around 1000 for medium-sized texts and up to 6000 for longer texts.
- `cod`: Configuration for Chain of Density (CoD) prompting.
- `map_reduce`: To further condense the final summary with CoD, set `final_dense` to `true`.
- `cluster_summarization`: Adjust `num_closest_points_per_cluster` (max value: 3) for the top-k best chunks. Vary `num_clusters` (hyper-parameter for k-means) to optimize results.
- Remaining configs are self-explanatory.
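To make the role of `num_closest_points_per_cluster` more concrete, here is a rough sketch of how the top-k chunks per cluster could be picked once chunk embeddings have been clustered. It is an illustration under assumptions, not the repository's implementation: the function name and its arguments are hypothetical, and the cluster labels and centroids are assumed to come from a k-means run done elsewhere (for example scikit-learn's `KMeans`, via `labels_` and `cluster_centers_`).

```python
import math


def closest_chunks_per_cluster(embeddings, labels, centroids, k=3):
    """For each cluster, return the indices of the k chunks nearest its centroid.

    embeddings: one vector per chunk
    labels:     cluster id assigned to each chunk (assumed k-means output)
    centroids:  one vector per cluster, indexed by cluster id
    k:          plays the role of num_closest_points_per_cluster
    """
    selected = {}
    for cluster_id, centroid in enumerate(centroids):
        # Collect the chunks assigned to this cluster.
        members = [i for i, label in enumerate(labels) if label == cluster_id]
        # Rank them by Euclidean distance to the centroid, nearest first.
        members.sort(key=lambda i: math.dist(embeddings[i], centroid))
        # Keep the k nearest, restored to document order.
        selected[cluster_id] = sorted(members[:k])
    return selected


embeddings = [[0, 0], [0.1, 0], [1, 1], [5, 5], [5.2, 5], [9, 9]]
labels = [0, 0, 0, 1, 1, 1]
centroids = [[0, 0], [5, 5]]
picked = closest_chunks_per_cluster(embeddings, labels, centroids, k=2)
# picked == {0: [0, 1], 1: [3, 4]}
```

A larger `num_clusters` spreads the selection over more topics, while a larger k pulls in more chunks per topic, so both increase the amount of text sent to the model.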
The output JSON comprises:
{
  "summary": "Descriptive final summary...",
  "keywords": ["Keyword1", "Keyword2", "..."],
  "metadata": {
    "total_tokens": 3625,
    "total_cost": 0.082,
    "total_time": 86.23
  }
}
- `summary`: The final summary output.
- `keywords`: Important keywords and phrases.
- `metadata`: The total time (in seconds) taken to generate the summary, the total OpenAI cost (in USD), and the total token count across the whole process.
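As a small illustration of consuming this output, the sketch below parses a report with the structure shown above and condenses its metadata into one line. The `describe_run` helper is a hypothetical name introduced here, and the per-1K-token figure is simply derived from `total_cost` and `total_tokens`; it is not a field the program writes.

```python
import json


def describe_run(report_json):
    """Condense a summary report (as a JSON string) into a one-line description."""
    report = json.loads(report_json)
    meta = report["metadata"]
    tokens = meta["total_tokens"]
    # Derived value: average cost per 1,000 tokens, guarded against zero tokens.
    cost_per_1k = meta["total_cost"] / tokens * 1000 if tokens else 0.0
    return (
        f"{len(report['keywords'])} keywords, {tokens} tokens, "
        f"${meta['total_cost']:.3f} total (${cost_per_1k:.4f} per 1K tokens), "
        f"{meta['total_time']:.1f}s"
    )


sample = """
{
  "summary": "Descriptive final summary...",
  "keywords": ["Keyword1", "Keyword2", "..."],
  "metadata": {"total_tokens": 3625, "total_cost": 0.082, "total_time": 86.23}
}
"""
line = describe_run(sample)
print(line)
```

To apply it to a real run, read the file from the `output` folder and pass its contents to `describe_run`.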
If this repository helps, please star and share it!
If you also found the article informative and think it could benefit others, I'd be grateful if you could like, follow, and share it.
Happy coding!