Text Summarization with Sumy: A Complete Guide
Text summarization has become increasingly important as massive amounts of textual data are generated daily, and the ability to extract key information quickly matters more than ever. Sumy is a Python library designed specifically for automatic text summarization; it provides multiple algorithms to tackle this challenge effectively.
Sumy for Text Summarization
Sumy brings several advantages that make it useful for a wide range of text summarization tasks. The library supports multiple summarization algorithms, including Luhn, Edmundson, LSA, LexRank and KL-Sum, giving you the flexibility to choose the approach that best fits your data. It integrates with other NLP libraries such as NLTK and requires minimal setup, making it accessible even for beginners. The library handles large documents efficiently and can be customized to meet specific summarization requirements.
Setting Up Sumy
Getting Sumy up and running is straightforward. We can install it through PyPI using pip:
pip install sumy
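A quick import check confirms the installation worked. A minimal sketch; the punkt download fetches the NLTK sentence-tokenizer data Sumy's English tokenizer relies on and is only needed once:
Python
# Confirm Sumy imports and fetch the NLTK data its tokenizer uses
import nltk
nltk.download('punkt')

from sumy.nlp.tokenizers import Tokenizer

tokenizer = Tokenizer("english")
print(tokenizer.to_sentences("Sumy is installed. The tokenizer works."))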
Text Preprocessing
Before summarizing, let's look at the text preprocessing techniques required to prepare a document or text. Sumy provides built-in capabilities to prepare text for effective summarization.
Tokenization with Sumy
Tokenization breaks down text into manageable units such as sentences or words. This process helps the summarization algorithms understand text structure and meaning more effectively.
- Tokenizer splits text into sentences first, then words
- Punctuation is automatically handled and removed
- Language-specific tokenization rules are applied
Python
from sumy.nlp.tokenizers import Tokenizer
import nltk
nltk.download('punkt')

# Create tokenizer for English (use the full language name, not "en")
tokenizer = Tokenizer("english")

# Sample text
text = """Machine learning is transforming industries worldwide.
Companies are investing heavily in AI research and development.
The future of technology depends on these advancements."""

# Tokenize into sentences
sentences = tokenizer.to_sentences(text)

# Display tokenized words for each sentence
for sentence in sentences:
    words = tokenizer.to_words(sentence)
    print(words)
Output:
('Machine', 'learning', 'is', 'transforming', 'industries', 'worldwide')
('Companies', 'are', 'investing', 'heavily', 'in', 'AI', 'research', 'and', 'development')
('The', 'future', 'of', 'technology', 'depends', 'on', 'these', 'advancements')
Stemming for Word Normalization
Stemming reduces words to their root forms, helping algorithms recognize that words like "running", "runs" and "ran" are variations of the same concept.
- Stemming normalizes word variations
- Improves algorithm accuracy by grouping related terms
- Essential for frequency-based summarization methods
Python
from sumy.nlp.stemmers import Stemmer

# Create stemmer for English (use the full language name, not "en")
stemmer = Stemmer("english")

# Test stemming on various words
test_words = ["programming", "developer", "coding", "algorithms"]
for word in test_words:
    stemmed = stemmer(word)
    print(f"{word} -> {stemmed}")
Output:
programming -> program
developer -> develop
coding -> code
algorithms -> algorithm
Summarization Algorithms in Sumy
Sumy provides several algorithms, each with different approaches to identifying important sentences. Let's explore the most effective ones.
1. Luhn Summarizer: Frequency-Based Approach
The Luhn algorithm ranks sentences based on the frequency of significant words. It identifies important terms by filtering out stop words and focuses on sentences containing these high-frequency terms.
- Sets up the Luhn summarizer using the Sumy library with English stemming and stop words.
- Defines a function luhn_summarize() that takes text and returns a short summary.
- Demonstrates the function with a sample paragraph and prints the top 2 sentences that capture the meaning of the paragraph.
Python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
import nltk
nltk.download('punkt_tab')

def luhn_summarize(text, sentence_count=2):
    # Parse the input text
    parser = PlaintextParser.from_string(text, Tokenizer("english"))

    # Initialize summarizer with stemmer
    summarizer = LuhnSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    # Generate summary
    summary = summarizer(parser.document, sentence_count)
    return summary
# Test with sample text
sample_text = """
Artificial intelligence represents a paradigm shift in how machines process information.
Modern AI systems can learn from data, recognize patterns, and make decisions with minimal human intervention.
Machine learning algorithms form the backbone of most AI applications today.
Deep learning, a subset of machine learning, uses neural networks to solve complex problems.
These technologies are revolutionizing industries from healthcare to finance.
The potential applications of AI seem limitless as research continues to advance.
"""
summary = luhn_summarize(sample_text, 2)
for sentence in summary:
    print(sentence)
Output:
Machine learning algorithms form the backbone of most AI applications today. Deep learning, a subset of machine learning, uses neural networks to solve complex problems.
2. Edmundson Summarizer: Customizable Word Weighting
The Edmundson algorithm allows fine-tuned control over summarization by using bonus words (emphasized), stigma words (de-emphasized) and null words (ignored).
- Uses the Edmundson summarizer from the Sumy library with English stemmer and stop words.
- Allows custom emphasis through bonus_words and stigma_words to guide what content gets prioritized or downplayed.
- Runs a summarization example with a focus on AI-related terms and prints the sentences containing the most heavily weighted words.
Python
from sumy.summarizers.edmundson import EdmundsonSummarizer

def edmundson_summarize(text, sentence_count=2, bonus_words=None, stigma_words=None):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))

    # Initialize summarizer
    summarizer = EdmundsonSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    # Set null words (ignored entirely)
    summarizer.null_words = get_stop_words("english")

    # Edmundson requires bonus and stigma words to be set before summarizing,
    # so fall back to harmless placeholders when none are provided
    summarizer.bonus_words = bonus_words if bonus_words else [""]
    summarizer.stigma_words = stigma_words if stigma_words else [""]

    summary = summarizer(parser.document, sentence_count)
    return summary
# Customize summarization focus
bonus_words = ["intelligence", "learning", "algorithms"]
stigma_words = ["simple", "basic"]
sample_text = """
Artificial intelligence represents a paradigm shift in how machines process information.
Modern AI systems can learn from data, recognize patterns, and make decisions with minimal human intervention.
Machine learning algorithms form the backbone of most AI applications today.
Deep learning, a subset of machine learning, uses neural networks to solve complex problems.
These technologies are revolutionizing industries from healthcare to finance.
The potential applications of AI seem limitless as research continues to advance.
"""
summary = edmundson_summarize(sample_text, 2, bonus_words, stigma_words)
for sentence in summary:
    print(sentence)
Output:
Artificial intelligence represents a paradigm shift in how machines process information. Machine learning algorithms form the backbone of most AI applications today.
3. LSA Summarizer: Semantic Understanding
Latent Semantic Analysis (LSA) goes beyond simple word frequency by capturing relationships and context between terms. This approach often produces more coherent and contextually accurate summaries. The code below:
- Uses the LSA (Latent Semantic Analysis) summarizer from the Sumy library.
- Converts the input text into a format suitable for summarization using a parser and tokenizer.
- Applies LSA to extract key sentences based on underlying semantic structure.
- Prints a 2-sentence summary from the given text.
Python
from sumy.summarizers.lsa import LsaSummarizer

def lsa_summarize(text, sentence_count=2):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))

    # Initialize LSA summarizer
    summarizer = LsaSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    summary = summarizer(parser.document, sentence_count)
    return summary
sample_text = """
Artificial intelligence represents a paradigm shift in how machines process information.
Modern AI systems can learn from data, recognize patterns, and make decisions with minimal human intervention.
Machine learning algorithms form the backbone of most AI applications today.
Deep learning, a subset of machine learning, uses neural networks to solve complex problems.
These technologies are revolutionizing industries from healthcare to finance.
The potential applications of AI seem limitless as research continues to advance.
"""
summary = lsa_summarize(sample_text, 2)
for sentence in summary:
    print(sentence)
Output:
Artificial intelligence represents a paradigm shift in how machines process information. Modern AI systems can learn from data, recognize patterns and make decisions with minimal human intervention.
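Sumy's other summarizers mentioned at the start, LexRank and KL-Sum, follow the same pattern, so swapping algorithms is a one-line change. A minimal sketch reusing the sample text and setup from the examples above:
Python
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.kl import KLSummarizer

# Both follow the same parser/stemmer/stop-words pattern as the summarizers above
parser = PlaintextParser.from_string(sample_text, Tokenizer("english"))

for summarizer_class in (LexRankSummarizer, KLSummarizer):
    summarizer = summarizer_class(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")
    print(f"--- {summarizer_class.__name__} ---")
    for sentence in summarizer(parser.document, 2):
        print(sentence)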
Time Complexity:
- Luhn: O(n²) where n is the number of sentences
- Edmundson: O(n²) with additional overhead for custom word processing
- LSA: O(n³) due to matrix decomposition operations
Space Complexity:
- All algorithms: O(n×m) where n is sentences and m is vocabulary size
- LSA requires additional space for matrix operations
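These differences show up in practice. A minimal sketch timing Luhn against LSA on the same parsed document, reusing the imports and sample_text from above (Edmundson is omitted since it needs its word lists configured; absolute numbers will vary by machine and text length):
Python
import time

# Repeat the sample text so the timing difference is measurable
parser = PlaintextParser.from_string(sample_text * 20, Tokenizer("english"))

for summarizer_class in (LuhnSummarizer, LsaSummarizer):
    summarizer = summarizer_class(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")
    start = time.perf_counter()
    summarizer(parser.document, 2)
    print(f"{summarizer_class.__name__}: {time.perf_counter() - start:.4f}s")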
Practical Applications and Limitations
Sumy works well for:
- News articles and blog posts (see the HtmlParser sketch after this list)
- Research paper abstracts
- Technical documentation
- Legal document summaries
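For news articles and other web pages, Sumy's HtmlParser can fetch and parse a URL directly instead of requiring pre-extracted plain text. A minimal sketch (the URL is a placeholder; any article page works):
Python
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.utils import get_stop_words

# Fetch and parse the page directly (placeholder URL)
url = "https://en.wikipedia.org/wiki/Automatic_summarization"
parser = HtmlParser.from_url(url, Tokenizer("english"))

summarizer = LsaSummarizer(Stemmer("english"))
summarizer.stop_words = get_stop_words("english")

for sentence in summarizer(parser.document, 3):
    print(sentence)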
But it has limitations too:
- Might struggle with highly technical or domain-specific content
- Performance depends on text structure and sentence quality
- Limited effectiveness on very short texts
Choosing the Right Algorithm
| Algorithm | Best For | Advantages | Disadvantages | When to Use |
|---|---|---|---|---|
| LSA | General-purpose summarization | Captures semantics, produces coherent summaries, handles synonyms | Computationally intensive and memory-heavy | Default choice for most applications |
| Luhn | Quick, frequency-based summaries | Fast, lightweight and easy to implement | Limited semantic understanding and may overlook context | Resource-constrained environments |
| Edmundson | Domain-specific content | Offers customizable weighting and adapts well to specific domains | Requires manual tuning and is complex to set up | Specialized domains with predefined key terms |
The key to effective summarization with Sumy lies in understanding your text's characteristics and choosing the algorithm that best matches your requirements. Experimenting with different approaches and sentence counts will help you find the optimal configuration for your use case.
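A small helper makes that experimentation concrete. A minimal sketch that prints each algorithm's picks side by side for the same text (sample_text is the paragraph defined in the examples above):
Python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer

def compare_summarizers(text, sentence_count=2):
    # Run several algorithms over the same document for side-by-side review
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    for summarizer_class in (LuhnSummarizer, LsaSummarizer, LexRankSummarizer):
        summarizer = summarizer_class(Stemmer("english"))
        summarizer.stop_words = get_stop_words("english")
        print(f"--- {summarizer_class.__name__} ---")
        for sentence in summarizer(parser.document, sentence_count):
            print(sentence)

compare_summarizers(sample_text)  # reuses sample_text defined earlier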