
Text Summarization with Sumy: A Complete Guide

Last Updated : 21 Jul, 2025

Text summarization has become increasingly important as massive amounts of textual data are generated daily, making the ability to extract key information quickly essential. Sumy is a Python library designed specifically for automatic text summarization, providing multiple algorithms to tackle this challenge effectively.

Sumy for Text Summarization

Sumy brings several advantages that make it useful for various text summarization tasks. The library supports multiple summarization algorithms, including Luhn, Edmundson, LSA, LexRank and KL-sum, which gives you the flexibility to choose the approach that best fits your data. It integrates with other NLP libraries and requires minimal setup, making it accessible even for beginners. The library handles large documents efficiently and can be customized to meet specific summarization requirements.

Setting Up Sumy

Getting Sumy up and running is straightforward. We can install it through PyPI using pip:

pip install sumy

Text Preprocessing

Before summarization, let's look at the text preprocessing techniques required to summarize a document. Sumy provides built-in capabilities to prepare text for effective summarization.

Tokenization with Sumy

Tokenization breaks down text into manageable units like sentences or words. This process helps the summarization algorithms understand text structure and meaning more effectively.

  • Tokenizer splits text into sentences first, then words
  • Punctuation is automatically handled and removed
  • Language-specific tokenization rules are applied
Python
from sumy.nlp.tokenizers import Tokenizer
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # required by newer NLTK versions

# Create a tokenizer for English (Sumy expects the full language name)
tokenizer = Tokenizer("english")

# Sample text
text = """Machine learning is transforming industries worldwide. 
          Companies are investing heavily in AI research and development. 
          The future of technology depends on these advancements."""

# Tokenize into sentences
sentences = tokenizer.to_sentences(text)

# Display tokenized words for each sentence
for sentence in sentences:
    words = tokenizer.to_words(sentence)
    print(words)

Output:

('Machine', 'learning', 'is', 'transforming', 'industries', 'worldwide')
('Companies', 'are', 'investing', 'heavily', 'in', 'AI', 'research', 'and', 'development')
('The', 'future', 'of', 'technology', 'depends', 'on', 'these', 'advancements')

Stemming for Word Normalization

Stemming reduces words to their root forms, helping algorithms recognize that words like "running", "runs" and "ran" are variations of the same concept.

  • Stemming normalizes word variations
  • Improves algorithm accuracy by grouping related terms
  • Essential for frequency-based summarization methods
Python
from sumy.nlp.stemmers import Stemmer

# Create a stemmer for English (Sumy expects the full language name)
stemmer = Stemmer("english")

# Test stemming on various words
test_words = ["programming", "developer", "coding", "algorithms"]

for word in test_words:
    stemmed = stemmer(word)
    print(f"{word} -> {stemmed}")

Output:

programming -> program
developer -> develop
coding -> code
algorithms -> algorithm
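To see why stemming matters for frequency-based summarization, here is a minimal pure-Python sketch. It uses a deliberately naive suffix-stripping stemmer (not Sumy's Snowball stemmer) purely to illustrate how stemming collapses surface variants into shared counts:

```python
from collections import Counter

def naive_stem(word):
    # Toy stemmer for illustration only: strip a few common English suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["talking", "talked", "talks", "talk", "jumps", "jumped"]

raw_counts = Counter(words)                       # six distinct surface forms
stemmed_counts = Counter(naive_stem(w) for w in words)  # collapses to two stems

print(raw_counts)
print(stemmed_counts)  # Counter({'talk': 4, 'jump': 2})
```

With stemming, the four variants of "talk" contribute to a single count of 4 instead of four counts of 1, so frequency-based scorers like Luhn's see the term as genuinely significant.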

Summarization Algorithms in Sumy

Sumy provides several algorithms, each with different approaches to identifying important sentences. Let's explore the most effective ones.

1. Luhn Summarizer: Frequency-Based Approach

The Luhn algorithm ranks sentences based on the frequency of significant words. It identifies important terms by filtering out stop words and focuses on sentences containing these high-frequency terms.

  • Sets up the Luhn summarizer using the Sumy library with English stemming and stop words.
  • Defines a function luhn_summarize() that takes text and returns a short summary.
  • Demonstrates the function with a sample paragraph and prints the top 2 sentences that capture the meaning of the paragraph.
Python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
import nltk
nltk.download('punkt_tab')

def luhn_summarize(text, sentence_count=2):
    # Parse the input text
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    
    # Initialize summarizer with stemmer
    summarizer = LuhnSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")
    
    # Generate summary
    summary = summarizer(parser.document, sentence_count)
    return summary

# Test with sample text
sample_text = """
Artificial intelligence represents a paradigm shift in how machines process information. 
Modern AI systems can learn from data, recognize patterns, and make decisions with minimal human intervention. 
Machine learning algorithms form the backbone of most AI applications today. 
Deep learning, a subset of machine learning, uses neural networks to solve complex problems. 
These technologies are revolutionizing industries from healthcare to finance. 
The potential applications of AI seem limitless as research continues to advance.
"""

summary = luhn_summarize(sample_text, 2)
for sentence in summary:
    print(sentence)

Output:

Machine learning algorithms form the backbone of most AI applications today. Deep learning, a subset of machine learning, uses neural networks to solve complex problems.
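The core idea behind Luhn's method can be sketched in plain Python. This toy version (with a made-up stop-word list and simplified scoring, not Sumy's actual implementation) counts significant-word frequencies and scores each sentence by how many of the top terms it contains:

```python
from collections import Counter

STOP_WORDS = {"the", "is", "a", "of", "to", "and", "in", "are", "from", "was"}

def luhn_style_scores(sentences, top_n=3):
    # Count significant (non-stop) word frequencies across the whole text.
    words = [w.lower().strip(".,") for s in sentences for w in s.split()]
    freq = Counter(w for w in words if w not in STOP_WORDS)
    significant = {w for w, _ in freq.most_common(top_n)}

    # Score each sentence by how many significant words it contains.
    scores = []
    for s in sentences:
        tokens = [w.lower().strip(".,") for w in s.split()]
        scores.append(sum(1 for w in tokens if w in significant))
    return scores

sentences = [
    "Machine learning drives modern AI systems.",
    "Learning from data is central to machine learning.",
    "The weather was pleasant yesterday.",
]
print(luhn_style_scores(sentences))
```

The two machine-learning sentences share high-frequency terms and score highly, while the off-topic weather sentence scores zero; a Luhn-style summarizer would then keep the top-scoring sentences.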

2. Edmundson Summarizer: Customizable Word Weighting

The Edmundson algorithm allows fine-tuned control over summarization by using bonus words (emphasized), stigma words (de-emphasized) and null words (ignored).

  • Uses the Edmundson summarizer from the Sumy library with English stemmer and stop words.
  • Allows custom emphasis through bonus_words and stigma_words to guide what content gets prioritized or downplayed.
  • Runs a summarization example with a focus on AI-related terms and prints the sentences containing the most heavily weighted words.
Python
from sumy.summarizers.edmundson import EdmundsonSummarizer

def edmundson_summarize(text, sentence_count=2, bonus_words=None, stigma_words=None):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))

    # Initialize summarizer
    summarizer = EdmundsonSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    # Set null words
    summarizer.null_words = get_stop_words("english")

    # Set custom word weights
    if bonus_words:
        summarizer.bonus_words = bonus_words
    if stigma_words:
        summarizer.stigma_words = stigma_words

    summary = summarizer(parser.document, sentence_count)
    return summary

# Customize summarization focus
bonus_words = ["intelligence", "learning", "algorithms"]
stigma_words = ["simple", "basic"]

sample_text = """
Artificial intelligence represents a paradigm shift in how machines process information.
Modern AI systems can learn from data, recognize patterns, and make decisions with minimal human intervention.
Machine learning algorithms form the backbone of most AI applications today.
Deep learning, a subset of machine learning, uses neural networks to solve complex problems.
These technologies are revolutionizing industries from healthcare to finance.
The potential applications of AI seem limitless as research continues to advance.
"""

summary = edmundson_summarize(sample_text, 2, bonus_words, stigma_words)
for sentence in summary:
    print(sentence)

Output:

Artificial intelligence represents a paradigm shift in how machines process information. Machine learning algorithms form the backbone of most AI applications today.
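The bonus/stigma/null weighting idea can be illustrated with a small standalone sketch. This scoring function is a toy approximation for intuition only, not Sumy's actual Edmundson formula:

```python
def edmundson_style_score(sentence, bonus_words, stigma_words, null_words):
    score = 0
    for word in sentence.lower().split():
        word = word.strip(".,")
        if word in null_words:
            continue          # null words are ignored entirely
        if word in bonus_words:
            score += 1        # bonus words raise the sentence score
        elif word in stigma_words:
            score -= 1        # stigma words lower it
    return score

bonus = {"intelligence", "learning", "algorithms"}
stigma = {"simple", "basic"}
null = {"the", "a", "of", "is"}

s1 = "Machine learning algorithms power intelligence research."
s2 = "This is a simple and basic overview."

print(edmundson_style_score(s1, bonus, stigma, null))  # 3
print(edmundson_style_score(s2, bonus, stigma, null))  # -2
```

Sentences rich in bonus terms float to the top while sentences dominated by stigma terms sink, which is why Edmundson suits domains where you already know which vocabulary matters.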

3. LSA Summarizer: Semantic Understanding

Latent Semantic Analysis (LSA) goes beyond simple word frequency by modeling the relationships and context between terms. This approach often produces more coherent and contextually accurate summaries. The code below:

  • Uses the LSA (Latent Semantic Analysis) summarizer from the Sumy library.
  • Converts the input text into a format suitable for summarization using a parser and tokenizer.
  • Applies LSA to extract key sentences based on underlying semantic structure.
  • Prints a 2-sentence summary from the given text.
Python
from sumy.summarizers.lsa import LsaSummarizer

def lsa_summarize(text, sentence_count=2):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    
    # Initialize LSA summarizer
    summarizer = LsaSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")
    
    summary = summarizer(parser.document, sentence_count)
    return summary

sample_text = """
Artificial intelligence represents a paradigm shift in how machines process information.
Modern AI systems can learn from data, recognize patterns, and make decisions with minimal human intervention.
Machine learning algorithms form the backbone of most AI applications today.
Deep learning, a subset of machine learning, uses neural networks to solve complex problems.
These technologies are revolutionizing industries from healthcare to finance.
The potential applications of AI seem limitless as research continues to advance.
"""

summary = lsa_summarize(sample_text, 2)
for sentence in summary:
    print(sentence)

Output:

Artificial intelligence represents a paradigm shift in how machines process information. Modern AI systems can learn from data, recognize patterns and make decisions with minimal human intervention.
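A rough sketch of what LSA does under the hood, using NumPy: build a term-sentence matrix, factor it with singular value decomposition, and rank sentences by their weight in the strongest latent "concept". The matrix here is a made-up toy example, and Sumy's implementation differs in weighting and selection details:

```python
import numpy as np

# Toy term-sentence count matrix: rows = terms, columns = sentences.
# Sentences 0 and 1 discuss AI/learning; sentence 2 is off-topic.
terms = ["ai", "learning", "data", "weather"]
A = np.array([
    [2, 1, 0],   # "ai"
    [1, 2, 0],   # "learning"
    [1, 1, 0],   # "data"
    [0, 0, 2],   # "weather"
], dtype=float)

# SVD factors the matrix into term-concept, concept-strength,
# and concept-sentence components.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank sentences by their weight in the strongest latent concept.
concept_weights = np.abs(Vt[0])
ranking = np.argsort(concept_weights)[::-1]
print(ranking)  # the off-topic sentence (index 2) ranks last
```

Because the weather sentence shares no terms with the dominant concept, its weight in the top singular vector is essentially zero, so an LSA-style summarizer would discard it first.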

Performance Considerations

Time Complexity:

  • Luhn: O(n²) where n is the number of sentences
  • Edmundson: O(n²) with additional overhead for custom word processing
  • LSA: O(n³) due to matrix decomposition operations

Space Complexity:

  • All algorithms: O(n×m) where n is sentences and m is vocabulary size
  • LSA requires additional space for matrix operations

Practical Applications and Limitations

Sumy works well for:

  • News articles and blog posts
  • Research paper abstracts
  • Technical documentation
  • Legal document summaries

However, it also has limitations:

  • Might struggle with highly technical or domain-specific content
  • Performance depends on text structure and sentence quality
  • Limited effectiveness on very short texts

Choosing the Right Algorithm

| Algorithm | Best For | Advantages | Disadvantages | When to Use |
|-----------|----------|------------|---------------|-------------|
| LSA | General-purpose summarization | Captures semantics, produces coherent summaries, handles synonyms | Computationally intensive and memory-heavy | Default choice for most applications |
| Luhn | Quick, frequency-based summaries | Fast, lightweight and easy to implement | Limited semantic understanding and may overlook context | Resource-constrained environments |
| Edmundson | Domain-specific content | Offers customizable weighting and adapts well to specific domains | Requires manual tuning and is complex to set up | Specialized domains with predefined key terms |

The key to effective summarization with Sumy lies in understanding your text's characteristics and choosing the algorithm that best matches your requirements. Experimenting with different approaches and sentence counts will help you find the optimal configuration for your use case.

