Overview
The objective of this project is to summarize a given document using SentencePiece and Transformers. We use PEGASUS, a pre-trained deep learning model that we further train to summarize texts from the SAMSum dataset. Here is what we will learn:
- We download the SAMSum dataset, which contains dialogue-based texts.
- We fine-tune the PEGASUS model, a cutting-edge model focused on summarization tasks.
- We generate summaries that capture the most important information while staying short and understandable.
- We apply evaluation metrics such as ROUGE to ensure that our model is accurate.
Prerequisites
Before jumping into this exciting project, you need to know a few things. Let’s keep it simple and get you ready:
- Basic understanding of Python programming.
- A Google Colab account to run the project.
- Knowledge of how to use Google Drive for storing data.
- Familiarity with Transformer models such as BERT, GPT, or PEGASUS.
- A Hugging Face account, because we use pre-trained models from Hugging Face.
- Knowledge of the basics of PyTorch; it’ll help you run and tweak the models.
- A CUDA-enabled GPU.
Approach
In this project, we first load the SAMSum dataset, which is full of conversations. Then, we fine-tune the PEGASUS model using Hugging Face Transformers to handle these dialogues and create short summaries. We tokenize the input text using SentencePiece to break it down into pieces that the model can understand. After that, the model is trained to generate summaries while keeping the key information intact. Once the model is fine-tuned, we evaluate it using ROUGE metrics to ensure our summaries are concise but accurate. Finally, we put it to the test with some real-world examples and save the trained model for future use!
Workflow and Methodology
Install Necessary Packages: Before starting the project, install the required dependencies with pip. Transformers, SentencePiece, and the datasets library must be installed for this project to run well.
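For example, a Colab cell along the following lines pulls in everything the project needs (the exact package list is an assumption, so add or pin versions to suit your environment):

```python
# Colab cell: install the libraries used in this project.
# The package list is an assumption; py7zr may be needed to unpack the SAMSum archive.
!pip install transformers sentencepiece datasets evaluate rouge_score py7zr
```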
Load the Dataset: Load the SAMSum dataset using the Hugging Face datasets library.
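A minimal sketch of this step, assuming the "samsum" dataset id on the Hugging Face Hub (newer versions of datasets may require py7zr or the community mirror "knkarthick/samsum"):

```python
from datasets import load_dataset

# Load SAMSum from the Hugging Face Hub.
# If "samsum" fails on your datasets version, try "knkarthick/samsum" instead.
dataset = load_dataset("samsum")

print(dataset)  # DatasetDict with train / validation / test splits
```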
Tokenization: Pass the text to the SentencePiece-based tokenizer and encode it. In this stage, each conversation is broken into smaller pieces so that the model can understand and process the information without any obstacles.
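As a rough illustration, here is how a single dialogue can be encoded with the SentencePiece-based PEGASUS tokenizer; the checkpoint name google/pegasus-cnn_dailymail is an assumption, and any PEGASUS checkpoint exposes the same interface:

```python
from transformers import AutoTokenizer

# PEGASUS ships with a SentencePiece tokenizer under the hood.
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-cnn_dailymail")

dialogue = "Amanda: Are we still meeting at 5?\nTom: Yes, see you at the cafe."
encoded = tokenizer(dialogue, truncation=True, max_length=1024)

print(encoded["input_ids"][:8])                                   # integer token ids
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"])[:8])  # the matching subword pieces
```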
Fine-tune the Model: The next step is to fine-tune the PEGASUS model so that it learns to produce summaries of the provided texts.
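A condensed fine-tuning sketch using the Seq2SeqTrainer API is shown below; the preprocessing helper, hyperparameters, and checkpoint name are illustrative assumptions rather than the exact settings of this project:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

checkpoint = "google/pegasus-cnn_dailymail"   # assumed starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

dataset = load_dataset("samsum")

def preprocess(batch):
    # Encode dialogues as model inputs and summaries as labels.
    model_inputs = tokenizer(batch["dialogue"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="pegasus-samsum",        # assumed output directory
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,     # effective batch size of 16
    num_train_epochs=1,
    learning_rate=5e-5,
    logging_steps=50,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

The tiny per-device batch size combined with gradient accumulation is one common way to fit PEGASUS into a single Colab GPU.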
Evaluate the Model: After training, we evaluate the trained model's performance using ROUGE metrics.
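A sketch of computing ROUGE with the evaluate library; the prediction and reference strings here are placeholders, and in practice they come from model.generate() on the SAMSum test split and the reference summaries:

```python
import evaluate

rouge = evaluate.load("rouge")

# Placeholder texts; replace with generated summaries and SAMSum references.
predictions = ["Amanda and Tom will meet at the cafe at 5."]
references = ["Amanda and Tom are meeting at the cafe at 5 pm."]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # rouge1, rouge2, rougeL, rougeLsum scores
```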
Test and Inference: Once trained, the model is loaded and its ability to summarize previously unseen dialogues is assessed on example inputs.
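For a quick test, the summarization pipeline works well; "pegasus-samsum" below is the assumed directory where the fine-tuned model was saved:

```python
from transformers import pipeline

# Use device=0 for a GPU; omit the argument to run on CPU.
summarizer = pipeline("summarization", model="pegasus-samsum", device=0)

dialogue = (
    "Hannah: Hey, do you have Betty's number?\n"
    "Amanda: Lemme check... Sorry, I don't have it.\n"
    "Hannah: Ok, I'll ask Larry then. Thanks anyway!"
)

result = summarizer(dialogue, max_length=64, min_length=10, num_beams=4)
print(result[0]["summary_text"])
```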
Save the Model: Finally, we save the model and the tokenizer so that the trained model can be reused for similar tasks in the future.
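Continuing from the fine-tuning sketch above (where model and tokenizer are already defined), saving and reloading takes one call each; the directory name is an assumption:

```python
# Save the fine-tuned model and its tokenizer to a local directory (name is an assumption).
model.save_pretrained("pegasus-samsum")
tokenizer.save_pretrained("pegasus-samsum")

# Later, reload them for reuse (or copy the directory to Google Drive first).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("pegasus-samsum")
tokenizer = AutoTokenizer.from_pretrained("pegasus-samsum")
```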
Methodology
Data Preparation: Download and load the SAMSum dataset, then prepare the dialogues for training by tokenizing them with SentencePiece.
Model Setup: We use the PEGASUS model available from the Hugging Face library.
Fine-Tuning: The model is trained on the tokenized SAMSum dataset.
Evaluation: When training is done, we measure performance using ROUGE metrics.
Testing and Deployment: After evaluation, we test the model with new dialogues. Finally, we save the fine-tuned model and tokenizer.
Data Collection and Preparation Workflow
Data Collection Workflow
- Collect the Dataset: First, we collect the SAMSum dataset.
- Load the Dataset: We load the SAMSum dataset into our environment using the Hugging Face datasets library.
- Explore the Data: We take a closer look at the dataset, which includes two main columns: dialogue and summary.
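A quick way to inspect those columns, assuming the dataset is loaded as in the previous step:

```python
from datasets import load_dataset

dataset = load_dataset("samsum")

print(dataset["train"].column_names)      # includes 'dialogue' and 'summary' (plus an 'id' field)
sample = dataset["train"][0]
print("DIALOGUE:\n", sample["dialogue"])  # the chat-style conversation
print("SUMMARY:\n", sample["summary"])    # its human-written summary
```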
Data Preparation Workflow
- Tokenization: We tokenize the data using SentencePiece, which breaks the text into smaller tokens so the model can process it properly.
- Truncation and Padding: Long texts are truncated and short ones are padded, so that every dialogue has a fixed input size.
- Batch the Data: After tokenization, we split the data into smaller batches, as illustrated in the sketch after this list.
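Here is a rough sketch of what truncation, padding, and batching look like in practice; the maximum length, the example dialogues, and the checkpoint name are all assumptions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/pegasus-cnn_dailymail")  # assumed checkpoint

# One long dialogue (will be truncated) and one short dialogue (will be padded).
long_dialogue = "Amanda: We really need to talk about the project. " * 100
short_dialogue = "Tom: See you at 5."

batch = tokenizer(
    [long_dialogue, short_dialogue],
    max_length=256,        # truncate anything longer than 256 tokens
    truncation=True,
    padding="max_length",  # pad anything shorter up to 256 tokens
    return_tensors="pt",
)

print(batch["input_ids"].shape)    # (2, 256): both dialogues now share a fixed length
print(batch["attention_mask"][1])  # trailing zeros mark the padding of the short dialogue
```

During actual training, a data collator such as DataCollatorForSeq2Seq typically pads dynamically to the longest sequence in each batch instead of a fixed maximum, which saves memory.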