A Two-Stage Long Text Summarization Method Based On Extraction-Generation

Abstract—Long text summarization aims to extract key information from lengthy texts and generate concise and accurate summaries. However, due to the complexity of long text information and the limitations of text length, existing summarization methods encounter hurdles such as semantic degradation and redundant information. To tackle these problems, this article proposes a two-stage long text summarization method based on extraction-generation. This method jointly trains an extractor and a generator, effectively combining the strengths of extractive and abstractive methods. The extractor is utilized to extract key information, tackling the challenge of handling long inputs, while the generator applies dynamic sentence-level attention weights during decoding. This enables the generator to dynamically adjust the importance of sentences based on contextual information, leading to more precise summary generation. Extensive experiments conducted on the arXiv dataset demonstrate the superior performance of our proposed model in the task of long text summarization.

Keywords- Long text summarization; extraction-generation; deep learning

I. INTRODUCTION

The exponential growth of textual data in the era of big data presents a formidable challenge for efficiently processing information within the realm of natural language processing. Text summarization approaches have garnered substantial interest due to their capacity to automatically generate text summaries and extract key information[1]. While Transformer-based models[2][3][4] currently excel in short text summarization tasks, they encounter difficulties in summarizing longer texts. Short texts have a higher information density and are relatively simple, which makes them more straightforward to summarize. In contrast, summarizing long texts necessitates a profound comprehension of the content, extraction of extensive information, and incorporation of additional details and context to encapsulate the core essence of the text.

Traditional text summarization models[5][6] struggle to balance model efficiency and summarization quality when handling lengthy inputs, because the high memory complexity of the self-attention mechanism requires the model to capture important information dispersed throughout long inputs while maintaining low computational costs. Additionally, due to the complexity of the information and the length constraints in long texts, long text summarization still suffers from issues such as semantic loss and information redundancy[7][8], further increasing the difficulty of these methods.

To tackle the problems mentioned above, this paper introduces a two-stage summarization model that jointly trains an extractor and a generator. Initially, the model extracts the most relevant sentences from a long document, thereby extending the input length it can handle while providing the generator with the most crucial information. These extracted sentences are then treated as latent variables. During the decoding phase, dynamic sentence-level attention weights are employed, allowing the generator to flexibly adjust the significance of sentences based on contextual information, thereby filtering out noise introduced in the extraction phase and enhancing the precision of summary generation.

The major contributions of our work are summarized as follows:

(1) By extracting the most relevant sentences from lengthy documents, we extend the input length the model can handle while providing the most crucial information to the generator.

(2) Employing dynamic sentence-level attention weights in the decoding phase enables the generator to flexibly regulate the significance of sentences according to contextual cues, resulting in more accurate summary creation.

(3) Extensive experiments were conducted on the arXiv dataset, demonstrating that our method achieves excellent performance in long text summarization.

II. METHOD

A. Overview of the Method
Figure 1. Overall architecture of our proposed method.
The framework of the proposed method is depicted in Fig. 1. Given an input document X, we jointly train an extractor module and a generator module. Initially, the extractor model is utilized to extract the most relevant clauses from the input text, forming an extracted sentence set. Subsequently, the generator model produces the final summary using the sentences in this set. Specifically, in the extractor model, we encode all sentences in the document with the pre-trained language model RoBERTa and map them to scalar scores through an MLP. The TOP-K algorithm is then applied to select the K most relevant clauses from all data blocks, forming the extracted sentence set $\hat{X}$. In the generator model, a key feature is the dynamic assignment of weights to the extracted sentences at every time step of the decoding process, adjusting the importance of sentences in real time. This strategy aids in removing noise introduced by the extraction process, improving the precision of the generated summary and increasing interpretability during decoding.
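As a minimal sketch of this two-stage flow (our illustration, not the authors' released code), the helper below wires together an arbitrary sentence scorer and an arbitrary generator; both callables are placeholders standing in for the RoBERTa+MLP extractor and the weighted generator described in the following subsections.

```python
# Minimal sketch of the two-stage flow: score sentences, keep the top-K as X_hat,
# then summarize only X_hat. The two callables are placeholders, not the paper's code.
from typing import Callable, List

def two_stage_summarize(sentences: List[str],
                        score_sentences: Callable[[List[str]], List[float]],
                        generate_summary: Callable[[List[str]], str],
                        k: int = 5) -> str:
    """Stage 1: extract the K most relevant sentences. Stage 2: generate from them."""
    scores = score_sentences(sentences)                       # one relevance score per sentence
    top_idx = sorted(range(len(sentences)),                   # indices of the K highest scores
                     key=lambda i: scores[i], reverse=True)[:k]
    x_hat = [sentences[i] for i in sorted(top_idx)]           # keep original document order
    return generate_summary(x_hat)                            # abstractive summary from X_hat
```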
B. Extraction-Generation Framework

To address the challenges models face when dealing with extremely long inputs, this paper introduces a two-stage framework based on extraction and generation for long text summarization. In this framework, we first utilize an extractor to select the K most relevant sentences from the input text. Subsequently, the generator model produces a summary based on these K sentences.

Specifically, the input $X = \{x_1, \dots, x_N\}$ consists of N text sentences; in long text summarization, N can be very large. The output is a summary of length T. Initially, the extractor model calculates a relevance score $s_\eta(x_i)$ for each text sentence $x_i$, followed by the selection of the top K sentences based on these scores. Following this, the generator model generates a summary based on the extracted set of K sentences.

The objective is to train a model that produces a sequence of summary tokens y, based on the input text and the tokens $y_{<t}$ generated previously. The formulas for computing the output probability are as follows:

$s_i = s_\eta(x_i), \quad x_i \in X$  (1)

$\hat{X} = \operatorname{Top-}K\bigl(\{s_i\}_{i=1}^{N}\bigr)$  (2)

$p(y_t \mid X, y_{<t}) \approx p(y_t \mid \hat{X}, y_{<t})$  (3)

where $s_\eta(x_i)$ is the relevance score of each sentence $x_i$, $\eta$ denotes the parameters of the extractor, and $\hat{X}$ is the set of the K highest-scoring text clauses extracted from the document X by the extractor. The extractive-generative framework approximates the output probability by using $\hat{X}$ in place of X. This method enables the generator to focus on a smaller set of key information, thereby enhancing the quality and accuracy of summary generation.
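As a toy illustration with made-up scores: for N = 5 sentences with $s = (0.9, 0.2, 0.7, 0.1, 0.4)$ and K = 2, equation (2) selects $\hat{X} = \{x_1, x_3\}$, and equation (3) then lets the generator condition on only these two sentences instead of the full document.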
C. Extractor for Long Text Extraction

In the two-stage long text summarization task, the design of the extractor responsible for extracting and compressing long texts is particularly crucial[9]. The extractor needs to pull essential information from the lengthy input text while maintaining semantic consistency. However, due to the complexity and length of long texts, traditional methods that encode the entire text at once may run into memory constraints.

To tackle this challenge, we propose a grouping strategy that divides consecutive sentences into several data blocks, enabling independent computation of encoding vectors for each sentence within each data block. This approach not only mitigates memory constraints but also maintains semantic consistency and information integrity.

Specifically, we employ the pre-trained language model RoBERTa to independently encode all sentences within each data block. RoBERTa, an improved version of the BERT model, addresses certain issues present in BERT, ensuring encoding accuracy and semantic consistency while enhancing overall system performance and efficiency. Subsequently, a multi-layer perceptron (MLP) is utilized to map the encoding vectors onto scalar scores $s_i$, reflecting the importance of each sentence within the entire text. The TOP-K algorithm is then employed to extract the K most relevant clauses across all data blocks as the extracted sentence set: it ranks all sentences by score and selects the top K. This guarantees that the extracted sentence set contains the most critical information from the text while avoiding interference from redundant information. As illustrated in Figure 2, our extractor effectively compresses a long text into a representative set of sentences, achieving a concise summary of the text.

Figure 2. Structure diagram of the extractor.
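The following PyTorch sketch illustrates the block-wise RoBERTa scoring and TOP-K selection described above. It assumes the Hugging Face roberta-base checkpoint and a small MLP head; the block size, hidden width, and K below are illustrative choices rather than values reported in the paper.

```python
# Block-wise sentence scoring with RoBERTa + MLP, followed by TOP-K selection.
# All hyperparameters here are illustrative, not the paper's settings.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SentenceScorer(nn.Module):
    def __init__(self, name: str = "roberta-base", hidden: int = 256):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(name)
        self.encoder = AutoModel.from_pretrained(name)
        dim = self.encoder.config.hidden_size
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, sentences, block_size: int = 32) -> torch.Tensor:
        """Encode sentences block by block and return one relevance score per sentence."""
        scores = []
        for start in range(0, len(sentences), block_size):     # grouping strategy: data blocks
            block = sentences[start:start + block_size]
            enc = self.tokenizer(block, padding=True, truncation=True,
                                 max_length=128, return_tensors="pt")
            out = self.encoder(**enc).last_hidden_state[:, 0]   # <s> embedding per sentence
            scores.append(self.mlp(out).squeeze(-1))            # scalar score s_i per sentence
        return torch.cat(scores)

def extract_top_k(sentences, scorer: SentenceScorer, k: int = 8):
    """TOP-K selection over all blocks, returning X_hat in document order."""
    with torch.no_grad():
        scores = scorer(sentences)
    k = min(k, len(sentences))
    top_idx = torch.topk(scores, k).indices.sort().values
    return [sentences[i] for i in top_idx.tolist()]
```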
D. Generator for Summary Generation

Due to the non-differentiable nature of the Top-K operation in extraction, gradient information is lost during training, making it impossible to directly optimize the parameters of the extractor with gradient descent methods.
Gradient descent is a widely used optimization method in deep learning; it relies on computing gradients of the objective function with respect to the model parameters, which are then used to update those parameters so as to minimize the loss. To address this issue, researchers have experimented with alternative methods such as reinforcement learning to optimize the extractor parameters. However, when the number of sentences N in the input document is very large, the number of possible ways to select $\hat{X}$ becomes combinatorially large, leading to high variance for reinforcement learning methods in such environments. Furthermore, there are certain issues with using ROUGE as a training reward. When sentence-level ROUGE is used as the reward, the model tends to select sentences with highly overlapping content, since the reward focuses on the overlap between individual sentences in the generated summary and the reference summary; this can result in redundant summaries. On the other hand, when summary-level ROUGE is used as the reward, the model emphasizes the overlap between the entire generated summary and the reference summary, leading to sparse training signals and limited useful feedback, which makes training more challenging.

To further address this issue, the paper introduces a generator that dynamically allocates weights to the extracted sentences at every time step of the decoding process. The primary characteristic of this generator is the dynamic allocation of importance to the extracted sentences based on contextual information. By adjusting the importance of sentences in real time, the generator can effectively mitigate noise from the extraction process: it reduces the weight of sentences that are irrelevant to the current decoding step, thereby minimizing their impact on the final generated summary.

Figure 3 illustrates the structure of the generator. For every extracted sentence x, our proposed generator predicts both the generation probability $p(y_t \mid x, y_{<t})$ and the dynamic weight $w(x \mid \hat{X}, y_{<t})$. The generation probability indicates the likelihood of generating the next summary token given the extracted sentence and the previously generated summary $y_{<t}$; it reflects the generator's predictive capability for the summary content based on the available information.

To calculate the generation probability and the dynamic weight, we first map the input $(x, y_{<t})$ to a context representation vector $h$. The generation probability $p(y_t \mid x, y_{<t})$ is obtained by feeding $h$ into the language model head. To calculate the dynamic weight $w(x \mid \hat{X}, y_{<t})$, a distinct MLP is employed to map each $h$ to a scalar logit. Finally, we normalize the logits of all sentences in the extracted set with the softmax function to obtain the dynamic weight $w(x \mid \hat{X}, y_{<t})$ of each sentence. The generator's output probability $p(y_t \mid \hat{X}, y_{<t})$ is obtained by multiplying the dynamic weight and the generation probability and summing over all extracted sentences to marginalize out the sentence choice, as illustrated in the following formula:

$p(y_t \mid \hat{X}, y_{<t}) = \sum_{x \in \hat{X}} p(y_t \mid x, y_{<t})\, w(x \mid \hat{X}, y_{<t})$  (4)

At every decoding time step t, the dynamic weight $w(x \mid \hat{X}, y_{<t})$ provides insight into how the generator leverages the extracted sentences: a higher weight indicates greater significance of that sentence at the current decoding step. Through this mechanism, the model learns to select the sentences that matter most to the generator, thereby optimizing the entire summary generation process. The generation loss is defined as the negative log-likelihood of the gold summary, as shown in the following formula:

$\mathcal{L}_{\mathrm{gen}} = -\log p(y \mid \hat{X})$  (5)
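A minimal PyTorch sketch of the marginalization in equation (4) and the loss in equation (5) is given below; it assumes the per-sentence token logits and weight logits have already been computed, and the tensor shapes and names are our own illustration rather than the paper's implementation.

```python
# Dynamic-weight marginalization (Eq. 4) and generation loss (Eq. 5), done in log space.
import torch
import torch.nn.functional as F

def marginal_token_logprob(token_logits: torch.Tensor,
                           weight_logits: torch.Tensor) -> torch.Tensor:
    """token_logits:  (K, vocab) -- next-token logits given each extracted sentence.
    weight_logits: (K,)        -- one scalar logit per extracted sentence at this step.
    Returns log p(y_t | X_hat, y_<t) over the vocabulary, marginalized over sentences."""
    log_p_tok = F.log_softmax(token_logits, dim=-1)           # log p(y_t | x, y_<t)
    log_w = F.log_softmax(weight_logits, dim=-1)              # log w(x | X_hat, y_<t)
    # Eq. (4): sum_x p(y_t | x, y_<t) * w(x | X_hat, y_<t), computed stably via logsumexp.
    return torch.logsumexp(log_p_tok + log_w.unsqueeze(-1), dim=0)

def generation_loss(all_token_logits, all_weight_logits, target_ids):
    """Eq. (5): negative log-likelihood of the gold summary, summed over time steps."""
    loss = 0.0
    for token_logits, weight_logits, y_t in zip(all_token_logits, all_weight_logits, target_ids):
        loss = loss - marginal_token_logprob(token_logits, weight_logits)[y_t]
    return loss
```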
Within the extractive-generative model introduced in this study, the dynamic weights are also used to convey training signals to the extractor. This fosters a collaborative synergy between the extractor and the generator, improving the extractor's capacity to capture the importance of textual sentences during training. The dynamic weight of each sentence represents its significance at a specific time step, so we average the dynamic weights over all decoding steps and treat the result as the cumulative importance of the sentence. We then introduce a consistency alignment loss that measures the distance between this average dynamic weight distribution and the extractor's distribution. Specifically, this loss uses the dynamic weights to adjust the extractor's distribution so that it stays close to the average dynamic weight distribution. In this way, the extractor can more accurately identify important information when selecting sentences, gradually learning to extract better during training and thereby helping the generator produce high-quality summaries. For simplicity, we define the consistency alignment loss as:

$\mathcal{L}_{\mathrm{consist}} = \mathrm{KL}\!\left(\frac{1}{T}\sum_{t=1}^{T} w(\cdot \mid \hat{X}, y_{<t}) \,\Big\|\, \mathrm{softmax}\bigl(s_\eta(x)\bigr),\ x \in \hat{X}\right)$  (6)

Here, $\eta$ represents the parameters of the extractor, indicating that the consistency alignment loss is used solely to optimize the extractor without affecting the generator; the emphasis is on ensuring consistency between the two distributions rather than adjusting the generator.
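The snippet below is a minimal PyTorch sketch of equation (6); the tensor names, the detach on the generator-side weights, and the use of F.kl_div are our own illustrative choices, not the paper's released implementation.

```python
# Consistency alignment loss (Eq. 6): KL between the average dynamic weight
# distribution and the extractor's softmax distribution over the extracted set.
import torch
import torch.nn.functional as F

def consistency_loss(step_weights: torch.Tensor, extractor_scores: torch.Tensor) -> torch.Tensor:
    """step_weights:     (T, K) -- dynamic weights w(x | X_hat, y_<t) at each decoding step.
    extractor_scores: (K,)   -- scores s_eta(x) for the K extracted sentences.
    Returns KL( mean_t w(. | X_hat, y_<t) || softmax(s_eta) ), training only the extractor."""
    avg_w = step_weights.detach().mean(dim=0)                 # average dynamic weight per sentence;
                                                              # detached so gradients reach only the extractor
    log_q = F.log_softmax(extractor_scores, dim=-1)           # extractor distribution over X_hat
    # F.kl_div(input=log q, target=p, reduction="sum") computes KL(p || q).
    return F.kl_div(log_q, avg_w, reduction="sum")
```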
Figure 3. Structure diagram of the generator.
III. EXPERIMENT
A. Comparison With the State-of-the-Art Methods
To validate the effectiveness and superiority of the
joint extractive-generative two-stage text summarization
model proposed in this paper, we compared it with several
classic text summarization models. We utilized the arXiv
dataset to evaluate model performance and employed
common evaluation metrics, including ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L). Our comparative models include LexRank[10], PGN[11], Discourse-aware[12], Bottom-Up[13], Long-sum[14], ICSI[15], TLM-I+E[16], LED[17], and Dancer[18].
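For reference, ROUGE scores of this form can be computed with the open-source rouge-score package; the snippet below only illustrates the metric setup with placeholder texts and is not the paper's evaluation code.

```python
# Minimal example of computing R-1, R-2, and R-L with the rouge-score package
# (pip install rouge-score); the reference and candidate strings are placeholders.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "the reference summary of the paper"
candidate = "the summary produced by the model"
scores = scorer.score(reference, candidate)
print({k: round(v.fmeasure, 4) for k, v in scores.items()})
```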
The experimental results are presented in Table 1. Our proposed method surpasses the other comparative models across all ROUGE metrics, namely ROUGE-1, ROUGE-2, and ROUGE-L. Notably, on the arXiv dataset, our model achieved scores of 43.36, 16.62, and 38.21 for ROUGE-1, ROUGE-2, and ROUGE-L, respectively. Compared with the least effective method, our model showed improvements of 11.3, 7.58, and 13.05 on these three metrics. In comparison to the previous best method, although our model slightly lags behind in ROUGE-L, it outperforms it in ROUGE-1 and ROUGE-2 by 0.66 and 0.08, respectively. These results demonstrate that our model is better suited to long-document summarization tasks than previous methods. Its superior performance can be attributed to reduced information loss between the extraction and generation steps and to its ability to handle longer inputs.

B. Visual Results

Figure 4 visually presents the comparative results of our experiments. It is evident that our proposed joint extractive-generative two-stage text summarization model exhibits significant advantages across all evaluation metrics, which demonstrates the effectiveness and superiority of the proposed method in long text summarization.

Figure 4. Visualization of the model comparison experiments.

IV. CONCLUSION

This paper introduces a two-stage approach for long text summarization that integrates extraction and generation techniques. To tackle the challenges inherent in long text summarization, we introduce a method that jointly trains an extractor and a generator. By selecting the most pertinent sentences from lengthy documents, our model not only expands its input capacity but also delivers critical information to the generator. The generator then produces the final summary by leveraging these extracted sentences, dynamically adjusting their weights throughout the decoding process to reflect sentence importance in real time. This strategy helps to eliminate noise from the extraction process, making the generated summaries more precise. Comparative experiments with existing methods demonstrate the effectiveness of our proposed method in the task of long text summarization.
REFERENCES
[1] Wang, Hong, et al. "Cross-modal knowledge guided model for
abstractive summarization." Complex & Intelligent Systems 10.1
(2024): 577-594.
[2] Xie, Qianqian, Prayag Tiwari, and Sophia Ananiadou.
"Knowledge-enhanced graph topic transformer for explainable
biomedical text summarization." IEEE Journal of Biomedical and Health Informatics (2023).
[3] Zhang, Xiliang, et al. "Trajectory prediction of seagoing ships in
dynamic traffic scenes via a gated spatio-temporal graph
aggregation network." Ocean Engineering 287 (2023): 115886.
[4] Su, Ming-Hsiang, Chung-Hsien Wu, and Hao-Tse Cheng. "A
two-stage transformer-based approach for variable-length
abstractive summarization." IEEE/ACM Transactions on Audio,
Speech, and Language Processing 28 (2020): 2061-2072.
[5] Gidiotis, Alexios, and Grigorios Tsoumakas. "A divide-and-
conquer approach to the summarization of long documents."
IEEE/ACM Transactions on Audio, Speech, and Language
Processing 28 (2020): 3029-3040.
[6] Zhu, Chenguang, et al. "A hierarchical network for abstractive
meeting summarization with cross-domain pretraining." arXiv
preprint arXiv:2004.02016 (2020).
[7] Li, Xingye, et al. "Magdra: a multi-modal attention graph network
with dynamic routing-by-agreement for multi-label emotion
recognition." Knowledge-Based Systems 283 (2024): 111126.
[8] Huang, Ying, et al. "Sentence salience contrastive learning for
abstractive text summarization." Neurocomputing 593 (2024):
127808.
[9] Deng, Zhenrong, et al. "A two-stage Chinese text summarization
algorithm using keyword information and adversarial learning."
Neurocomputing 425 (2021): 117-126.
[10] Erkan, Günes, and Dragomir R. Radev. "Lexrank: Graph-based
lexical centrality as salience in text summarization." Journal of
artificial intelligence research 22 (2004): 457-479.
[11] See, Abigail, Peter J. Liu, and Christopher D. Manning. "Get to
the point: Summarization with pointer-generator networks."
arXiv preprint arXiv:1704.04368 (2017).
[12] Cohan, Arman, et al. "A discourse-aware attention model for
abstractive summarization of long documents." arXiv preprint
arXiv:1804.05685 (2018).
[13] Kornilova, Anastassia, and Vlad Eidelman. "BillSum: A corpus
for automatic summarization of US legislation." arXiv preprint
arXiv:1910.00523 (2019).
[14] Liu, Yinhan, et al. "RoBERTa: A robustly optimized BERT pretraining approach." arXiv preprint arXiv:1907.11692 (2019).
[15] Boudin, Florian, Hugo Mougard, and Benoit Favre. "Concept-
based summarization using integer linear programming: From
concept pruning to multiple optimal solutions." Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015.
[16] Pilault, Jonathan, et al. "On extractive and abstractive neural
document summarization with transformer language models."
Proceedings of the 2020 conference on empirical methods in
natural language processing (EMNLP). 2020.
[17] Beltagy, Iz, Matthew E. Peters, and Arman Cohan. "Longformer:
The long-document transformer." arXiv preprint
arXiv:2004.05150 (2020).
[18] Gidiotis, Alexios, and Grigorios Tsoumakas. "A divide-and-
conquer approach to the summarization of academic articles."
arXiv preprint arXiv:2004.06190 (2020).