But what are Extractive and Abstractive Summarization in Machine Learning?

Let’s understand & compare Abstractive and Extractive summarization in Machine Learning.

Karan Kaul | カラン
4 min read · May 31, 2023

The Need: Why is summarization needed at all?

Summarization plays a crucial role in extracting key points or ideas from text. Summaries enhance the information retrieval process by providing an overview of the content, enabling users to search and access specific information more efficiently.

Summaries also facilitate faster analysis of multiple documents by comparing their summaries instead of the individual documents themselves.

Extractive Summarization

This technique involves extracting the most important or useful sentences from a document and presenting them as they are in the summary. It is similar to highlighting important sentences in a paragraph without modifying the words or sentences.

Overview of how Extractive Summarization works —

  1. Preprocessing: Sentence tokenization, removal of stopwords, lemmatization, stemming, etc.
  2. Sentence scoring: Sentences are assigned scores based on their value within the document using methods like frequency count of rare words, TF-IDF, etc.
  3. Filtering Sentences: Top-scoring sentences are selected to form the summary.
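The scoring step can use TF-IDF instead of raw frequency counts. Here is a minimal, pure-Python sketch of TF-IDF weighting (the function name `tf_idf_scores` and the toy documents are illustrative, not from any library):

```python
import math

def tf_idf_scores(documents):
    """Compute TF-IDF weights for each word in each document.

    documents: list of tokenized documents (lists of words).
    Returns a list of {word: tf-idf} dicts, one per document.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each word appear?
    df = {}
    for doc in documents:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1

    scores = []
    for doc in documents:
        doc_scores = {}
        for word in set(doc):
            tf = doc.count(word) / len(doc)      # term frequency
            idf = math.log(n_docs / df[word])    # inverse document frequency
            doc_scores[word] = tf * idf
        scores.append(doc_scores)
    return scores

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]
scores = tf_idf_scores(docs)
# "the" appears in every document, so its IDF (and hence TF-IDF) is zero,
# while rarer words like "dog" get a positive weight.
```

Words that appear in every sentence or document contribute nothing, which is exactly why TF-IDF favors rare, informative words over common ones.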

Here’s an example of extractive summarization code in Python:

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.probability import FreqDist

def extractive_summarization(text, num_sentences):
    # Tokenize the text into sentences
    sentences = sent_tokenize(text)

    # Tokenize the sentences into words
    words = [word_tokenize(sentence) for sentence in sentences]

    # Filter out stop words
    stop_words = set(stopwords.words("english"))
    words = [[word for word in sentence if word.lower() not in stop_words]
             for sentence in words]

    # Calculate word frequencies across the whole document
    word_frequencies = FreqDist([word for sentence in words for word in sentence])

    # Score each sentence by the frequencies of its (non-stop) words,
    # pairing each original sentence with its filtered word list
    sentence_scores = {
        sentence: sum(word_frequencies[word] for word in sentence_words)
        for sentence, sentence_words in zip(sentences, words)
    }

    # Sort the sentences by score in descending order
    sorted_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)

    # Select the top 'num_sentences' sentences for the summary
    summary_sentences = sorted_sentences[:num_sentences]

    # Combine the selected sentences into the summary text
    summary = ' '.join(summary_sentences)

    return summary
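The same three steps (tokenize, score by word frequency, select top sentences) can be sketched with only the standard library, using naive regex tokenization in place of NLTK. This is a simplified stand-in for the code above, with an illustrative stop-word list, not a production tokenizer:

```python
import re
from collections import Counter

STOP_WORDS = frozenset({"the", "a", "an", "is", "of", "to", "and", "in"})

def simple_extractive_summary(text, num_sentences):
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())

    # Word frequencies over the whole text, ignoring stop words
    tokens = [w for w in re.findall(r'\w+', text.lower()) if w not in STOP_WORDS]
    freqs = Counter(tokens)

    # Score each sentence by the summed frequency of its words
    def score(sentence):
        return sum(freqs[w] for w in re.findall(r'\w+', sentence.lower()))

    # Keep the top-scoring sentences, preserving their original order
    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return ' '.join(s for s in sentences if s in top)

text = ("Summarization shortens text. Cats are nice. "
        "Summarization helps readers skim long text quickly.")
print(simple_extractive_summary(text, 1))
# → Summarization helps readers skim long text quickly.
```

The sentence about cats scores lowest because its words occur only once in the text, so it is the first to be dropped.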

Now, let’s move on to understanding “Abstractive” summarization and how it differs from extractive summarization.

Abstractive Summarization

Abstractive summarization involves generating a summary by understanding the key points and context of the given text. Unlike extractive techniques, abstractive methods generate new sentences that represent the idea of the original text.

Overview of How Abstractive Summarization Works —

  1. Preprocessing: Sentence tokenization, removal of stopwords, lemmatization, stemming, etc.
  2. Understanding the context: NLP techniques like Part of Speech tagging, NER, semantic analysis, etc., are used to grasp the meaning and relationships between different sentences.
  3. Summary Generation: Based on the captured meaning and relationships, new text is generated to represent the abstract of the given text. Neural Networks are often employed for such tasks.
  4. Post Processing: The generated summary undergoes further processing to fix grammatical errors and awkward phrasing.

Here’s an example of abstractive summarization code in Python using the transformers library:

!pip install transformers
from transformers import pipeline

def abstractive_summarization(text, max_length):
    # Load the summarization pipeline (downloads a default model on first use)
    summarizer = pipeline("summarization")

    # Generate the summary
    summary = summarizer(text, max_length=max_length,
                         min_length=10, do_sample=True)[0]["summary_text"]

    return summary

Comparing the results: Extractive vs Abstractive summarization

Here is the text that was used as the input:

The boss of the company behind ChatGPT has said it has no plans to leave Europe. OpenAI CEO Sam Altman U-turned on a threat he made earlier this week to leave the bloc if it becomes too hard to comply with upcoming laws on artificial intelligence (AI).

The EU’s planned legislation could be the first to regulate AI, which the tech boss said was “over-regulating”. But he backtracked after widespread coverage of his comments. “We are excited to continue to operate here and of course have no plans to leave,” he tweeted.

The proposed law could require generative AI companies to reveal which copyrighted material had been used to train their systems to create text and images. Many in the creative industries accuse AI companies of using the work of artists, musicians and actors to train systems to imitate their work.

But Mr Altman is worried it would be technically impossible for OpenAI to comply with some of the AI Act’s safety and transparency requirements, according to Time magazine.

Extractive Summary (num_sentences = 2)

We are excited to continue to operate here and of course have no plans to leave,” he tweeted. Many in the creative industries accuse AI companies of using the work of artists, musicians and actors to train systems to imitate their work.

Abstractive Summary (max_length = 100)

Sam Altman says he has no plans to leave Europe if AI laws become too hard . He had threatened to leave the block if it becomes too hard to comply with EU AI laws . But he backtracked after wide-spread coverage of his comments.

Conclusion

Abstractive summarization generally produces more natural, fluent summaries than the extractive method. For such problems, machine learning and deep learning models tend to surpass traditional techniques: they can capture context and relationships between sentences, approaching human-level performance.

However, despite the significant improvements offered by abstractive methods, it is not always practical to rely solely on them. These models typically require more computational resources to run, and their outputs may not always be grammatically correct.

In contrast, extractive summaries are only as grammatical as the input sentences they copy, since nothing is rewritten. These methods are also far less resource-intensive, making them more practical in many scenarios.


Karan Kaul | カラン

Writes about Programming & Technology (mainly Machine Learning). Connect with me on LinkedIn - https://www.linkedin.com/in/krnk97/