Punctuation Restoration for Speech Transcripts using seq2seq Transformers

Aviv Melamud¹ and Alina Duran#

¹Cresskill High School, Cresskill, NJ, USA
#Mentor
ABSTRACT
When creating text transcripts from spoken audio, Automatic Speech Recognition (ASR) systems need to infer
appropriate punctuation in order to make the transcription more readable. This task, known as punctuation restoration,
is challenging since punctuation is not explicitly stated in speech. Most recent work has framed punctuation restoration
as a classification task and used pre-trained encoder-based transformers, such as BERT, to perform it. In this work,
we present an alternative approach, framing punctuation restoration as a sequence-to-sequence task and using T5, a
pretrained encoder-decoder transformer model, as the basis of our implementation. Training our model on IWSLT
2012, a common punctuation restoration benchmark, we find that it achieves an F1 score of 80.7 on the test set, comparable
to state-of-the-art classification-based systems. Furthermore, we argue that our approach might
be more flexible in its ability to adapt to more complex types of outputs, such as predicting more than one punctuation
mark in a row.
Introduction
Automatic speech recognition (ASR) is used in various applications, such as the preparation of medical reports, hands-
free typing of documents, implementation of voice-based user interfaces and virtual assistants, automatic transcription
of videos/lectures, and accessibility tools. The field of ASR has seen massive progress in recent years. State of the art
(SOTA) models, such as Wav2Vec 2.0 (Baevski et al. 2020), have achieved extremely low word error rates (WER)
on ASR benchmarks such as “TIMIT” (Garofolo et al. 1993) and “LibriSpeech” (Panayotov et al. 2015), measured at
8% and 1.4%, respectively. WER counts the percentage of erroneous substitutions,
deletions, and insertions of words in an ASR-generated transcript. While word accuracy is important for a quality
transcript, missing punctuation, not evaluated in WER metrics, has been shown to impact readability just as much as
word errors (Tündik et al. 2018). Because punctuation is not explicitly indicated in speech, it must be inferred from
context. This task, known as punctuation restoration, has been addressed using a variety of approaches.
Previous research on punctuation restoration has utilized Recurrent Neural Network (RNN) architectures
(Tilk and Alumäe 2016), and LSTMs (Tilk and Alumäe 2015), while most recent attempts have focused on using
transformer models (Nagy et al. 2021), due to their superior performance. Specifically, pretrained contextualized
language models, such as BERT (Devlin et al. 2018), fine-tuned on the punctuation restoration task, have yielded state
of the art performance (Courtland et al. 2020).
Common to most of the recent work is the framing of punctuation restoration as a classification task, where
a single punctuation mark is predicted for every position in the input text. In contrast, in this paper, we frame the task
as a textual sequence-to-sequence task, applying pre-trained seq2seq transformer models, specifically Google’s T5
(Raffel et al. 2020), to the punctuation restoration of speech transcripts.
Related Work
Early punctuation restoration systems relied on classical machine learning methods, such as decision trees (Kolár et
al. 2004) and n-gram models (Gravano et al. 2009). However, these have largely been outperformed by large neural
network models. Since the task requires a model well suited to processing sequences, with an understanding of context,
Recurrent Neural Network (RNN) architectures have been used to restore punctuation, including RNNs with attention
(Tilk and Alumäe 2016), and LSTMs (Tilk and Alumäe 2015). Most recent literature focuses on transformers, due to
their superior contextual understanding and accuracy as compared to LSTMs and RNNs. Transformer encoders pre-
trained on large corpora of text, such as BERT (Devlin et al. 2018) and RoBERTa (Liu et al. 2019), have obtained state-
of-the-art results in a variety of natural language tasks, and have thus been applied to punctuation restoration (Nagy
et al. 2021, Alam et al. 2020).
Common to most of the aforementioned models is the framing of punctuation restoration as a classification
task, where for every position in the input text, the model makes a classification decision between one of several
predefined classes (period, comma, question mark, none, etc.). Figure 1 illustrates this approach, where every position
in the input is encoded by the encoder and then outputs are predicted based on that encoding.
Encoder-only architectures (Nagy et al. 2021) pass each position’s encoding to a classifier, which labels each input
token with the punctuation type (if any) that follows it, as shown in Figure 1.
Figure 1. Illustration of encoder-classifier model for punctuation restoration.
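To make the classification framing concrete, the following is a minimal sketch of an encoder-classifier setup built with Huggingface’s transformers library. It illustrates the general approach rather than the exact configuration of Nagy et al. (2021); the checkpoint name, label set, and example sentence are our own illustrative assumptions.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Illustrative punctuation label set: "what mark (if any) follows this token?"
LABELS = ["NONE", "PERIOD", "COMMA", "QUESTION"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)

text = "we know that right we've experienced that"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, num_tokens, num_labels)
predicted = logits.argmax(dim=-1)[0]         # one punctuation class per token
print([LABELS[i] for i in predicted.tolist()])  # head is untrained here, so labels are arbitrary

Note that in this framing the output is rigidly tied to the input: exactly one label is predicted per input token.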
In this paper, we extend this line of research by framing punctuation restoration as a sequence-to-sequence task rather
than classification. To do this, instead of using a single encoder, we train an encoder-decoder sequence-to-sequence
transformer model. Our goal is to determine the efficacy of using an encoder-decoder architecture instead of an
encoder-only system. Figure 2 illustrates the sequence-to-sequence encoder-decoder architecture used in this paper.
Figure 2. Illustration of encoder-decoder seq2seq model for punctuation restoration.
We note that a couple of previous works used machine translation approaches to address punctuation
restoration (Peitz et al. 2011, Vāravs and Salimbajevs 2018). Like this work, they frame punctuation restoration as a
text-to-text task. However, our work is based on a plain sequence-to-sequence architecture and on newer pre-trained
models that are more directly comparable to those used in recent state-of-the-art work.
Methods
Model
We chose to use Google’s T5 (Raffel et al. 2020), a sequence-to-sequence encoder-decoder model, to perform
punctuation restoration. T5 uses a bi-directional transformer, similar to BERT, as its encoder, and an autoregressive
transformer decoder. It was pre-trained with seq2seq objectives on the “Colossal Clean Crawled Corpus” (C4), a large
cleaned web-text dataset derived from Common Crawl (Raffel et al. 2020).
As illustrated in Figure 2, when fed a sequence of tokens (unpunctuated text), T5 passes the input through an
encoder which generates a vector representation of the sequence. Then it passes that vector representation to a decoder,
which generates an output sequence of tokens. This architecture has been shown to be a good fit for a variety of tasks,
from machine translation to text summarization and question answering. In our case, to perform punctuation restoration,
the input sequence is the speech transcription without punctuation, and the output sequence is a fully punctuated
version of that input. Unlike the encoder-classifier model described in the Related Work section, the seq2seq architecture
does not impose any explicit constraints on the relation between the input and the output structures. Specifically, the
output can be of arbitrary length, seamlessly allowing the generation of two or more punctuation marks one after the
other (as in “That is incredible!!!”).
While T5 was originally implemented in the Mesh TensorFlow library (Shazeer et al. 2018), we utilize the
PyTorch implementation provided in Huggingface’s transformers library (Wolf et al. 2019).
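As a rough illustration of how this looks in code, the snippet below loads a T5 checkpoint with Huggingface’s transformers library and generates a punctuated version of an unpunctuated input. The “t5-base” checkpoint name and the generation settings are assumptions for illustration only; meaningful punctuation output requires the fine-tuning described in the next section.

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Illustrative checkpoint; in practice this would be the fine-tuned model.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

unpunctuated = "we know that right we've experienced that"
input_ids = tokenizer(unpunctuated, return_tensors="pt").input_ids

# The decoder generates the punctuated text token by token, so the output
# length is not tied to the input length (allowing, e.g., several marks in a row).
output_ids = model.generate(input_ids, max_length=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))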
Methodology
We fine-tune the pre-trained T5 model on a dataset of TED talk transcripts (Federico et al. 2012). The problem is
formulated as a text-to-text task, where the input is an uncased segment of a TED transcript devoid of punctuation, and
the ground-truth output is the original punctuated text (also uncased), as illustrated in Table 1. The text was broken
into 256-token sequences. In order to most closely resemble the data T5 was pre-trained on (and thereby leverage what
it learned during pre-training), we frame the task as sequence-to-sequence and keep the original text tokens for each
punctuation character.
Table 1. A shortened example of the input-output pairs fed into the model during training.
Input:               we know that right we've experienced that
Ground truth output: we know that, right? we've experienced that.
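The following sketch shows one way such input-output pairs can be constructed. It follows the description above (lowercasing, stripping punctuation, 256-token sequences), but the helper function, checkpoint name, and punctuation set are our own illustrative choices, not the paper’s exact preprocessing code.

import re
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")  # illustrative checkpoint

def make_pair(punctuated_text, max_tokens=256):
    # Target: the original punctuated text, uncased.
    target = punctuated_text.lower()
    # Input: the same text with punctuation marks removed.
    source = re.sub(r"[.,?!]", "", target)
    # Both sides are truncated to the 256-token sequence length.
    source_ids = tokenizer(source, truncation=True, max_length=max_tokens).input_ids
    target_ids = tokenizer(target, truncation=True, max_length=max_tokens).input_ids
    return source_ids, target_ids

src_ids, tgt_ids = make_pair("We know that, right? We've experienced that.")

Pairs of this form can then be fed to a standard seq2seq fine-tuning loop, with the target token IDs serving as the labels for T5’s cross-entropy loss.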
Experimental Details