Overview
The objective of this project is to summarize a given document using SentencePiece and Transformers. We use PEGASUS, a pre-trained deep learning model that we further train to summarize texts from the SAMSum dataset. Here is what we will learn:
- We download the SAMSum dataset, which contains dialogue-based texts.
- We fine-tune the PEGASUS model, a cutting-edge model focused on summarization tasks.
- We generate summaries that capture the most important information while staying short and understandable.
- We apply evaluation metrics such as ROUGE to ensure that our model is accurate.
Prerequisites
Before jumping into this exciting project, you need to know a few things. Let’s keep it simple and get you ready:
- Basic understanding of Python programming.
- A Google Colab account to run the project.
- Knowledge of how to use Google Drive for storing data.
- Familiarity with Transformer models such as BERT, GPT, or PEGASUS.
- A Hugging Face account, because we use pre-trained models from Hugging Face.
- Knowledge of the basics of PyTorch; it’ll help you run and tweak the models.
- A CUDA-enabled GPU.
Approach
In this project, we first load the SAMSum dataset, which is full of conversations. Then, we fine-tune the PEGASUS model using Hugging Face Transformers to handle these dialogues and create short summaries. We tokenize the input text using SentencePiece to break it down into pieces that the model can understand. After that, the model is trained to generate summaries while keeping the key information intact. Once the model is fine-tuned, we evaluate it using ROUGE metrics to ensure our summaries are concise but accurate. Finally, we put it to the test with some real-world examples and save the trained model for future use!
Workflow and Methodology
Install Necessary Packages: Before starting the project, install the required dependencies with pip. Transformers, SentencePiece, and the datasets library must be installed for this project to run well.
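For example, a Colab cell along the following lines pulls in everything the project needs (the exact package list is an assumption, so add or pin versions to suit your environment):

```python
# Colab cell: install the libraries used in this project.
# The package list is an assumption; py7zr may be needed to unpack the SAMSum archive.
!pip install transformers sentencepiece datasets evaluate rouge_score py7zr
```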
Load the Dataset: Load the SAMSum dataset using the Hugging Face datasets library.
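A minimal sketch of this step, assuming the "samsum" dataset id on the Hugging Face Hub (newer versions of datasets may require py7zr or the community mirror "knkarthick/samsum"):

```python
from datasets import load_dataset

# Load SAMSum from the Hugging Face Hub.
# If "samsum" fails on your datasets version, try "knkarthick/samsum" instead.
dataset = load_dataset("samsum")

print(dataset)  # DatasetDict with train / validation / test splits
```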
Tokenization: Pass the text to the SentencePiece-based tokenizer and encode it. In this stage, each conversation is broken into smaller pieces so that the model can understand and process the information without any obstacles.
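As a rough illustration, here is how a single dialogue can be encoded with the SentencePiece-based PEGASUS tokenizer; the checkpoint name google/pegasus-cnn_dailymail is an assumption, and any PEGASUS checkpoint exposes the same interface:

```python
from transformers import AutoTokenizer

# PEGASUS ships with a SentencePiece tokenizer under the hood.
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-cnn_dailymail")

dialogue = "Amanda: Are we still meeting at 5?\nTom: Yes, see you at the cafe."
encoded = tokenizer(dialogue, truncation=True, max_length=1024)

print(encoded["input_ids"][:8])                                   # integer token ids
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"])[:8])  # the matching subword pieces
```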
Fine-tune the Model: The next step is to fine-tune the PEGASUS model so that it learns to produce summaries of the provided texts.
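A condensed fine-tuning sketch using the Seq2SeqTrainer API is shown below; the preprocessing helper, hyperparameters, and checkpoint name are illustrative assumptions rather than the exact settings of this project:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

checkpoint = "google/pegasus-cnn_dailymail"   # assumed starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

dataset = load_dataset("samsum")

def preprocess(batch):
    # Encode dialogues as model inputs and summaries as labels.
    model_inputs = tokenizer(batch["dialogue"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="pegasus-samsum",        # assumed output directory
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,     # effective batch size of 16
    num_train_epochs=1,
    learning_rate=5e-5,
    logging_steps=50,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

The tiny per-device batch size combined with gradient accumulation is one common way to fit PEGASUS into a single Colab GPU.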
Evaluate the Model: After training, we evaluate the trained model's performance using ROUGE metrics.
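A sketch of computing ROUGE with the evaluate library; the prediction and reference strings here are placeholders, and in practice they come from model.generate() on the SAMSum test split and the reference summaries:

```python
import evaluate

rouge = evaluate.load("rouge")

# Placeholder texts; replace with generated summaries and SAMSum references.
predictions = ["Amanda and Tom will meet at the cafe at 5."]
references = ["Amanda and Tom are meeting at the cafe at 5 pm."]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # rouge1, rouge2, rougeL, rougeLsum scores
```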
Test and Inference: Once trained, the model is loaded and its ability to summarize previously unseen dialogues is assessed on example inputs.
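For a quick test, the summarization pipeline works well; "pegasus-samsum" below is the assumed directory where the fine-tuned model was saved:

```python
from transformers import pipeline

# Use device=0 for a GPU; omit the argument to run on CPU.
summarizer = pipeline("summarization", model="pegasus-samsum", device=0)

dialogue = (
    "Hannah: Hey, do you have Betty's number?\n"
    "Amanda: Lemme check... Sorry, I don't have it.\n"
    "Hannah: Ok, I'll ask Larry then. Thanks anyway!"
)

result = summarizer(dialogue, max_length=64, min_length=10, num_beams=4)
print(result[0]["summary_text"])
```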
Save the Model: Finally, we save the model and the tokenizer so that the trained model can be reused for similar tasks in the future.
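Continuing from the fine-tuning sketch above (where model and tokenizer are already defined), saving and reloading takes one call each; the directory name is an assumption:

```python
# Save the fine-tuned model and its tokenizer to a local directory (name is an assumption).
model.save_pretrained("pegasus-samsum")
tokenizer.save_pretrained("pegasus-samsum")

# Later, reload them for reuse (or copy the directory to Google Drive first).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("pegasus-samsum")
tokenizer = AutoTokenizer.from_pretrained("pegasus-samsum")
```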
Methodology
Data Preparation: Download and load the SAMSum dataset, then prepare the dialogues for training by tokenizing them with SentencePiece.
Model Setup: We use the PEGASUS model available from the Hugging Face library.
Fine-Tuning: The model is trained on the tokenized SAMSum dataset.
Evaluation: When training is done, we measure performance using ROUGE metrics.
Testing and Deployment: After evaluation, we test the model with new dialogues. Finally, we save the fine-tuned model and tokenizer.
Data Collection and Preparation Workflow
Data Collection Workflow
- Collect the Dataset: First, we collect the SAMSum dataset.
- Load the Dataset: We load the SAMSum dataset into our environment using the Hugging Face datasets library.
- Explore the Data: We take a closer look at the dataset, which includes two main columns: dialogue and summary.
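A quick way to inspect those columns, assuming the dataset is loaded as in the previous step:

```python
from datasets import load_dataset

dataset = load_dataset("samsum")

print(dataset["train"].column_names)      # includes 'dialogue' and 'summary' (plus an 'id' field)
sample = dataset["train"][0]
print("DIALOGUE:\n", sample["dialogue"])  # the chat-style conversation
print("SUMMARY:\n", sample["summary"])    # its human-written summary
```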
Data Preparation Workflow
- Tokenization: We tokenize the data using SentencePiece, which breaks the text into smaller tokens so the model can process it properly.
- Truncation and Padding: Long texts are truncated and short ones are padded, so that every dialogue has a fixed input size.
- Batch the Data: After tokenization, we split the data into smaller batches, as illustrated in the sketch after this list.
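Here is a rough sketch of what truncation, padding, and batching look like in practice; the maximum length, the example dialogues, and the checkpoint name are all assumptions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/pegasus-cnn_dailymail")  # assumed checkpoint

# One long dialogue (will be truncated) and one short dialogue (will be padded).
long_dialogue = "Amanda: We really need to talk about the project. " * 100
short_dialogue = "Tom: See you at 5."

batch = tokenizer(
    [long_dialogue, short_dialogue],
    max_length=256,        # truncate anything longer than 256 tokens
    truncation=True,
    padding="max_length",  # pad anything shorter up to 256 tokens
    return_tensors="pt",
)

print(batch["input_ids"].shape)    # (2, 256): both dialogues now share a fixed length
print(batch["attention_mask"][1])  # trailing zeros mark the padding of the short dialogue
```

During actual training, a data collator such as DataCollatorForSeq2Seq typically pads dynamically to the longest sequence in each batch instead of a fixed maximum, which saves memory.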