But what are Extractive and Abstractive Summarization in Machine Learning?

Let’s understand & compare Abstractive and Extractive summarization in Machine Learning.

Karan Kaul | カラン
4 min read · May 31, 2023

The Need: Why is summarization needed at all?

Summarization plays a crucial role in extracting key points or ideas from text. Summaries enhance the information retrieval process by providing an overview of the content, enabling users to search and access specific information more efficiently.

Summaries also facilitate faster analysis of multiple documents by comparing their summaries instead of the individual documents themselves.

Extractive Summarization

This technique involves extracting the most important or useful sentences from a document and presenting them as they are in the summary. It is similar to highlighting important sentences in a paragraph without modifying the words or sentences.

Overview of how Extractive Summarization works —

  1. Preprocessing: Sentence tokenization, removal of stopwords, lemmatization, stemming, etc.
  2. Sentence scoring: Sentences are assigned scores based on their value within the document using methods like frequency count of rare words, TF-IDF, etc.
  3. Filtering Sentences: Top-scoring sentences are selected to form the summary.
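The scoring step can use TF-IDF instead of raw frequency counts. Here is a minimal, pure-Python sketch of TF-IDF weighting (the function name `tf_idf_scores` and the toy documents are illustrative, not from any library):

```python
import math

def tf_idf_scores(documents):
    """Compute TF-IDF weights for each word in each document.

    documents: list of tokenized documents (lists of words).
    Returns a list of {word: tf-idf} dicts, one per document.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each word appear?
    df = {}
    for doc in documents:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1

    scores = []
    for doc in documents:
        doc_scores = {}
        for word in set(doc):
            tf = doc.count(word) / len(doc)      # term frequency
            idf = math.log(n_docs / df[word])    # inverse document frequency
            doc_scores[word] = tf * idf
        scores.append(doc_scores)
    return scores

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]
scores = tf_idf_scores(docs)
# "the" appears in every document, so its IDF (and hence TF-IDF) is zero,
# while rarer words like "dog" get a positive weight.
```

Words that appear in every sentence or document contribute nothing, which is exactly why TF-IDF favors rare, informative words over common ones.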

Here’s an example of extractive summarization code in Python:

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.probability import FreqDist

def extractive_summarization(text, num_sentences):
    # Tokenize the text into sentences
    sentences = sent_tokenize(text)

    # Tokenize the sentences into words
    words = [word_tokenize(sentence) for sentence in sentences]

    # Filter out stop words
    stop_words = set(stopwords.words("english"))
    words = [[word for word in sentence if word.lower() not in stop_words]
             for sentence in words]

    # Calculate word frequencies across the whole document
    word_frequencies = FreqDist([word for sentence in words for word in sentence])

    # Score each sentence by the frequencies of its (non-stop) words,
    # pairing each original sentence with its filtered word list
    sentence_scores = {
        sentence: sum(word_frequencies[word] for word in sentence_words)
        for sentence, sentence_words in zip(sentences, words)
    }

    # Sort the sentences by score in descending order
    sorted_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)

    # Select the top 'num_sentences' sentences for the summary
    summary_sentences = sorted_sentences[:num_sentences]

    # Combine the selected sentences into the summary text
    summary = ' '.join(summary_sentences)

    return summary
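The same three steps (tokenize, score by word frequency, select top sentences) can be sketched with only the standard library, using naive regex tokenization in place of NLTK. This is a simplified stand-in for the code above, with an illustrative stop-word list, not a production tokenizer:

```python
import re
from collections import Counter

STOP_WORDS = frozenset({"the", "a", "an", "is", "of", "to", "and", "in"})

def simple_extractive_summary(text, num_sentences):
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())

    # Word frequencies over the whole text, ignoring stop words
    tokens = [w for w in re.findall(r'\w+', text.lower()) if w not in STOP_WORDS]
    freqs = Counter(tokens)

    # Score each sentence by the summed frequency of its words
    def score(sentence):
        return sum(freqs[w] for w in re.findall(r'\w+', sentence.lower()))

    # Keep the top-scoring sentences, preserving their original order
    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return ' '.join(s for s in sentences if s in top)

text = ("Summarization shortens text. Cats are nice. "
        "Summarization helps readers skim long text quickly.")
print(simple_extractive_summary(text, 1))
# → Summarization helps readers skim long text quickly.
```

The sentence about cats scores lowest because its words occur only once in the text, so it is the first to be dropped.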

Now, let’s move on to understanding “Abstractive” summarization and how it differs from extractive summarization.

Abstractive Summarization

Abstractive summarization involves generating a summary by understanding the key points and context of the given text. Unlike extractive techniques, abstractive methods generate new sentences that represent the idea of the original text.

Overview of How Abstractive Summarization Works —

  1. Preprocessing: Sentence tokenization, removal of stopwords, lemmatization, stemming, etc.
  2. Understanding the context: NLP techniques like Part of Speech tagging, NER, semantic analysis, etc., are used to grasp the meaning and relationships between different sentences.
  3. Summary Generation: Based on the captured meaning and relationships, new text is generated to represent the abstract of the given text. Neural Networks are often employed for such tasks.
  4. Post Processing: The generated summary undergoes further processing to fix grammatical errors and awkward phrasing.

Here’s an example of abstractive summarization code in Python using the transformers library:

!pip install transformers
from transformers import pipeline

def abstractive_summarization(text, max_length):
    # Load the summarization pipeline (downloads a default model on first use)
    summarizer = pipeline("summarization")

    # Generate the summary
    summary = summarizer(text, max_length=max_length,
                         min_length=10, do_sample=True)[0]["summary_text"]

    return summary

Comparing the results: Extractive vs Abstractive summarization

Here is the text that was used as the input:

The boss of the company behind ChatGPT has said it has no plans to leave Europe. OpenAI CEO Sam Altman U-turned on a threat he made earlier this week to leave the bloc if it becomes too hard to comply with upcoming laws on artificial intelligence (AI).

The EU’s planned legislation could be the first to regulate AI, which the tech boss said was “over-regulating”. But he backtracked after widespread coverage of his comments. “We are excited to continue to operate here and of course have no plans to leave,” he tweeted.

The proposed law could require generative AI companies to reveal which copyrighted material had been used to train their systems to create text and images. Many in the creative industries accuse AI companies of using the work of artists, musicians and actors to train systems to imitate their work.

But Mr Altman is worried it would be technically impossible for OpenAI to comply with some of the AI Act’s safety and transparency requirements, according to Time magazine.

Extractive Summary (num_sentences = 2)

We are excited to continue to operate here and of course have no plans to leave,” he tweeted. Many in the creative industries accuse AI companies of using the work of artists, musicians and actors to train systems to imitate their work.

Abstractive Summary (max_length = 100)

Sam Altman says he has no plans to leave Europe if AI laws become too hard . He had threatened to leave the block if it becomes too hard to comply with EU AI laws . But he backtracked after wide-spread coverage of his comments.

Conclusion

Abstractive summarization generally produces more natural, fluent summaries than the extractive method. For such problems, machine learning and deep learning models tend to surpass traditional techniques: they can capture context and relationships between sentences, approaching human-level performance.

However, despite the significant improvements offered by abstractive methods, it is not always practical to rely solely on them. These models typically require more computational resources to run, and their outputs may not always be grammatically correct.

In contrast, extractive summaries are only as grammatical as the input sentences they copy, since nothing is rewritten. These methods are also far less resource-intensive, making them more practical in many scenarios.


Karan Kaul | カラン

Writes about Programming & Technology (mainly Machine Learning). Connect with me on LinkedIn - https://www.linkedin.com/in/krnk97/