
Speech Generator: Main AI technologies

Published on 21 September 2020
Updated on 05 April 2024

On 21 September, DiploFoundation launched the humAInism Speech Generator as part of its humAInism project. By combining artificial intelligence (AI) algorithms and the expertise of Diplo’s cybersecurity team, this tool is meant to help diplomats and practitioners write speeches on the topic of cybersecurity. 

Given the research nature of the project, the main purpose of the generator was to explore various new AI technologies and examine their usability in the field of diplomacy. To this end, we used several state-of-the-art algorithms, which serve three main functions.

  1. Semantic similarity search: Finding sentences with similar semantics from DiploFoundation’s corpuses of books and transcripts.
  2. Generation of long-form answers: Given a question, the algorithm finds relevant paragraphs from Diplo’s corpuses of books and transcripts, and generates new paragraphs with explanatory answers.
  3. Text generation: The algorithm is fine-tuned on diplomatic texts, and is used for the generation of new texts.

1. Semantic similarity search

We use the DistilBERT language representation model to encode sentences into 512-dimensional vectors. An approximate nearest neighbor search algorithm then compares the vectors and scores their similarity according to their angular distance. The technologies used are described below.
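As an illustration of scoring by angular distance, the following minimal sketch ranks toy vectors against a query by exact (brute-force) search. It is not the project's code: the three-dimensional vectors are made-up stand-ins for real sentence embeddings, the function names are hypothetical, and the distance formula is the one the Annoy documentation gives for its angular metric, sqrt(2 * (1 - cos)).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angular_distance(u, v):
    """Euclidean distance between the normalised vectors: sqrt(2 * (1 - cos))."""
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))  # guard against rounding
    return math.sqrt(2.0 * (1.0 - cos))

def most_similar(query, corpus_vectors, k=2):
    """Exact search: indices of the k vectors closest to the query."""
    order = sorted(range(len(corpus_vectors)),
                   key=lambda i: angular_distance(query, corpus_vectors[i]))
    return order[:k]

# Toy "sentence embeddings" (hypothetical values, 3 dimensions instead of 512).
corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
query = [1.0, 0.05, 0.0]
print(most_similar(query, corpus))
```

An approximate search trades a little accuracy for speed by examining only a subset of the vectors instead of all of them.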

1.1. DistilBERT model

DistilBERT is a transformer model, smaller and faster than BERT (Bidirectional Encoder Representations from Transformers), which was pretrained on the same corpuses in a self-supervised fashion using the BERT base model as a teacher. This means it was pretrained on raw texts only, with no humans labelling them in any way (which is why it can use a lot of publicly available data), with an automatic process that generates inputs and labels from those texts using the BERT base model. More precisely, it was pretrained with three objectives.

  • Distillation loss: The model was trained to return the same probabilities as the BERT base model.
  • Masked language modeling (MLM): This is part of the original training loss of the BERT base model. When taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model, and predicts the masked words. This is different from the traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from the autoregressive models, like the generative pretrained transformer (GPT), which internally mask future tokens; it allows the model to learn a bidirectional representation of a sentence.
  • Cosine embedding loss: The model was also trained to generate hidden states as close as possible to the BERT base model.

In this way, the model learns the same inner representation of the English language as its teacher model, while being faster for inference and downstream tasks.
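The three pretraining objectives can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the DistilBERT training code: the temperature value, the function names, and the toy inputs are hypothetical, and the masking helper always substitutes `[MASK]`, whereas BERT's actual procedure sometimes keeps the word or swaps in a random one.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities, optionally softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the teacher's."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

def cosine_embedding_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity: zero when the hidden states point the same way."""
    dot = sum(a * b for a, b in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(a * a for a in student_hidden))
    norm_t = math.sqrt(sum(b * b for b in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide roughly mask_rate of the tokens; return the masked input and the labels."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = ["[MASK]" if i in positions else tok for i, tok in enumerate(tokens)]
    labels = {i: tokens[i] for i in positions}  # what the model must predict
    return masked, labels

teacher = [2.0, 1.0, 0.1]
print(distillation_loss([1.8, 1.1, 0.2], teacher))      # student close to teacher
print(cosine_embedding_loss([1.0, 2.0], [2.0, 4.0]))    # same direction, near zero
print(mask_tokens("states should cooperate on cyber norms".split(), seed=1))
```

In training, the three terms would be combined into one weighted loss; the weights are not given in this article, so none are assumed here.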

Reference: Sanh V et al. (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv, 1 March. Available at https://arxiv.org/abs/1910.01108 [accessed 20 September 2020].

1.2. Approximate Nearest Neighbors Oh Yeah (Annoy) search algorithm

Annoy is a C++ library with Python bindings which searches for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mapped into the memory so that many processes may share the same data.
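To give a feel for how an approximate search of this kind can work, here is a small pure-Python sketch of random-projection trees, the general idea the Annoy documentation describes: space is split repeatedly by a hyperplane placed between two randomly chosen points, and a query only inspects the leaves it falls into. This is a teaching sketch with hypothetical function names and parameters, using plain Euclidean splits; it does not use or reproduce the Annoy library itself.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def build_tree(points, indices, rng, leaf_size=4):
    """Recursively split the indices by a hyperplane between two random points."""
    if len(indices) <= leaf_size:
        return {"leaf": indices}
    a, b = rng.sample(indices, 2)
    normal = [x - y for x, y in zip(points[a], points[b])]
    midpoint = [(x + y) / 2 for x, y in zip(points[a], points[b])]
    offset = dot(normal, midpoint)
    left = [i for i in indices if dot(normal, points[i]) < offset]
    right = [i for i in indices if dot(normal, points[i]) >= offset]
    if not left or not right:  # degenerate split, e.g. duplicate points
        return {"leaf": indices}
    return {"normal": normal, "offset": offset,
            "left": build_tree(points, left, rng, leaf_size),
            "right": build_tree(points, right, rng, leaf_size)}

def leaf_candidates(tree, query):
    """Walk down one tree and return the indices stored in the query's leaf."""
    node = tree
    while "leaf" not in node:
        side = "left" if dot(node["normal"], query) < node["offset"] else "right"
        node = node[side]
    return node["leaf"]

def build_forest(points, n_trees=5, seed=0):
    rng = random.Random(seed)
    all_indices = list(range(len(points)))
    return [build_tree(points, all_indices, rng) for _ in range(n_trees)]

def approximate_nearest(forest, points, query, k=3):
    """Pool the candidates from every tree, then rank only those exactly."""
    pool = set()
    for tree in forest:
        pool.update(leaf_candidates(tree, query))
    return sorted(pool, key=lambda i: squared_distance(points[i], query))[:k]

data_rng = random.Random(42)
points = [[data_rng.random(), data_rng.random()] for _ in range(200)]
forest = build_forest(points)
print(approximate_nearest(forest, points, points[0], k=3))
```

Using several trees raises the chance that a true neighbour lands in at least one inspected leaf, at the cost of more candidates to rank.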

Reference: The Annoy Python module on GitHub. Available at https://github.com/spotify/annoy [accessed 20 September 2020].

2. Generation of long-form answers

For this task, we used models that were pretrained on Wikipedia (Wiki-40B) and the Explain Like I’m Five (ELI5) questions datasets. We applied models on our custom Diplo dataset, consisting of Diplo books and Internet Governance Forum (IGF) transcripts. The process of generating answers is done in two stages. 

  1. At the retrieval stage, a pretrained custom-made embedder projects a 512-dimensional BERT embedding vector into a 128-dimensional space, such that the inner product of the projected question vector and the projected vector of its matching answer is higher than the inner product of the projected question vector and the projection of any other answer vector. Document retrieval is conducted by Max Inner Product Search (MIPS) over the dense 128-dimensional embeddings with Faiss.
  2. At the generation stage, the pretrained BART sequence-to-sequence model is used for generating answers.
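The requirement placed on the retrieval embedder (a question should score higher with its own answer than with any other answer) is commonly trained with an in-batch contrastive loss. The sketch below assumes that formulation; the article does not state the exact loss, and the tiny dimensions, the projection matrix, and the function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project(vector, matrix):
    """Linear projection; the matrix has one row per output dimension."""
    return [dot(row, vector) for row in matrix]

def in_batch_contrastive_loss(question_vecs, answer_vecs):
    """Mean cross-entropy where answer i is the correct match for question i."""
    total = 0.0
    for i, question in enumerate(question_vecs):
        scores = [dot(question, answer) for answer in answer_vecs]
        peak = max(scores)
        log_normaliser = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_normaliser - scores[i]  # small when the own answer scores highest
    return total / len(question_vecs)

# Toy example: 3-dimensional inputs projected to 2 dimensions
# (standing in for the 512 -> 128 projection described above).
projection = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
questions = [project(v, projection) for v in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])]
answers = [project(v, projection) for v in ([5.0, 0.0, 0.0], [0.0, 5.0, 0.0])]

print(in_batch_contrastive_loss(questions, answers))        # matched pairs: low loss
print(in_batch_contrastive_loss(questions, answers[::-1]))  # mismatched: high loss
```

Minimising this loss pushes each question's projection towards its own answer and away from the other answers in the batch, which is what makes the later inner product search meaningful.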

The applied algorithms are listed below. 

2.1. BERT 

The language representation model BERT is short for ‘Bidirectional Encoder Representations from Transformers’. Unlike other recent language representation models, BERT is designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers. As a result, the pretrained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering (QA) and language inference, without substantial task-specific architecture modifications.
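The difference between a bidirectional encoder and a left-to-right model can be pictured as an attention mask that says which positions each token may look at. The following is a conceptual sketch with hypothetical helper names, not code from either model.

```python
def bidirectional_mask(n):
    """Every position may attend to every other position (BERT-style encoder)."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Position i may attend only to positions 0..i (left-to-right, GPT-style)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def visible_context(tokens, mask, position):
    """Tokens that the given position is allowed to see under the mask."""
    return [tok for tok, allowed in zip(tokens, mask[position]) if allowed]

tokens = ["cyber", "norms", "need", "cooperation"]
print(visible_context(tokens, bidirectional_mask(4), 1))  # sees left and right context
print(visible_context(tokens, causal_mask(4), 1))         # sees only itself and the left
```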

Reference: Devlin J et al. (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv, 11 October. Available at https://arxiv.org/abs/1810.04805 [accessed 20 September 2020]. 

2.2. Faiss search algorithm

Facebook AI Similarity Search (Faiss) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, including sets that may not fit into random-access memory (RAM). It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++, with complete wrappers for Python/NumPy. Some of the most useful algorithms are implemented on the graphics processing unit (GPU). Faiss was developed by Facebook Artificial Intelligence Research (FAIR).
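What a Max Inner Product Search computes can be shown with an exact, brute-force version in plain Python. This sketch only illustrates the operation; it does not call Faiss, the function name is hypothetical, and the article does not say which Faiss index type the project used.

```python
import heapq

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def max_inner_product_search(query, passage_vectors, k=2):
    """Return (index, score) for the k passages with the largest inner product."""
    scored = [(dot(query, vec), idx) for idx, vec in enumerate(passage_vectors)]
    return [(idx, score) for score, idx in heapq.nlargest(k, scored)]

# Toy 2-dimensional passage embeddings (hypothetical values).
passages = [[1.0, 0.0], [0.5, 0.5], [0.0, 3.0]]
query = [0.0, 1.0]
print(max_inner_product_search(query, passages))
```

Unlike angular distance, the inner product is sensitive to vector length, so a long vector such as `[0.0, 3.0]` can outrank a shorter one that points in the same direction.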

Reference: Faiss on GitHub. Available at https://github.com/facebookresearch/faiss/wiki [accessed 20 September 2020].

2.3. BART

BART is a denoising autoencoder for pretraining sequence-to-sequence models. It is trained by: (1) corrupting text with an arbitrary noising function, and (2) teaching a model to reconstruct the original text. It uses a standard transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalising BERT (due to the bidirectional encoder), GPT (with a left-to-right decoder), and many other more recent pretraining schemes. BART is particularly effective when fine-tuned for generating text, but it also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE (General Language Understanding Evaluation) and SQuAD (Stanford Question Answering Dataset), and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarisation tasks, with gains of up to 6 ROUGE (Recall-Oriented Understudy for Gisting Evaluation). BART also provides a 1.1 BLEU (bilingual evaluation understudy) increase over a back-translation system for machine translation, with only target language pretraining.
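The corrupt-then-reconstruct recipe can be illustrated with two of the noising operations described in the BART paper, text infilling and token deletion. The sketch below is a simplification with hypothetical function names: it picks the span length uniformly, whereas the paper samples it from a Poisson distribution, and it only builds the training pairs rather than training a model.

```python
import random

def text_infilling(tokens, span_start, span_length, mask_token="<mask>"):
    """Replace a contiguous span of tokens with a single mask token."""
    return tokens[:span_start] + [mask_token] + tokens[span_start + span_length:]

def token_deletion(tokens, positions):
    """Drop the tokens at the given positions without leaving a marker."""
    drop = set(positions)
    return [tok for i, tok in enumerate(tokens) if i not in drop]

def make_training_pair(tokens, rng):
    """Corrupt the input; the target the model must reconstruct is the original."""
    start = rng.randrange(len(tokens) - 1)
    length = rng.randint(1, min(3, len(tokens) - start))
    return text_infilling(tokens, start, length), list(tokens)

sentence = "states agreed on voluntary norms of responsible behaviour".split()
corrupted, target = make_training_pair(sentence, random.Random(0))
print(corrupted)
print(target)
```

Because a whole span collapses into one mask token, the model has to infer how many words are missing as well as which ones.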

Reference: Lewis M et al. (2019) BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv, 29 October. Available at https://arxiv.org/abs/1910.13461 [accessed 20 September 2020].

3. Text generation

For the task of generating introductory sentences, we used a pretrained GPT-2 model, which we fine-tuned on a dataset built from the first three sentences of speeches in the UN General Debates Dataset (UN General Debates).
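A fine-tuning set of this kind could be assembled roughly as follows. This is a sketch under assumptions: it reads 'the first three sentences' as the first three sentences of each speech, the function names are hypothetical, and the regular-expression splitter is deliberately naive (it would break on abbreviations such as 'Mr. President'), so it should not be taken as the preprocessing actually used.

```python
import re

def first_sentences(text, n=3):
    """Return the first n sentences, splitting after '.', '!' or '?'."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n])

def build_intro_dataset(speeches, n=3):
    """One training example per non-empty speech: its opening sentences."""
    return [first_sentences(speech, n) for speech in speeches if speech.strip()]

# Made-up speech text for illustration only.
speeches = [
    "I thank the Assembly. We meet at a critical time. Cooperation is needed. "
    "My delegation will elaborate later.",
    "",
]
print(build_intro_dataset(speeches))
```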

Elements of the applied algorithm are listed below. 

3.1. GPT-2 

‘GPT-2 is a large transformer-based language model trained using the simple task of predicting the next word in 40GB of high-quality text from the internet. This simple objective proves sufficient to train the model to learn a variety of tasks due to the diversity of the dataset. In addition to its incredible language generation capabilities, it is also capable of performing tasks like question answering, reading comprehension, summarisation, and translation. While GPT-2 does not beat the state-of-the-art in these tasks, its performance is impressive nonetheless considering that the model learns these tasks from raw text only.’ (Rajapakse, 2020) 
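The 'simple task of predicting the next word' can be illustrated with a toy model that counts which word follows which and then generates text one word at a time. This is only an analogy with hypothetical function names and a made-up corpus: GPT-2 is a neural network that conditions on the whole preceding context, not a table of word-pair counts.

```python
from collections import Counter, defaultdict

def train_bigram_counts(corpus_sentences):
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    for sentence in corpus_sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, prompt, max_new_words=5):
    """Greedy decoding: repeatedly append the most frequent next word."""
    words = prompt.lower().split()
    for _ in range(max_new_words):
        followers = counts.get(words[-1])
        if not followers:
            break  # the last word was never followed by anything in training
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

corpus = [
    "we thank the president",
    "we thank the assembly",
    "we thank the president for his leadership",
]
counts = train_bigram_counts(corpus)
print(generate(counts, "we", max_new_words=3))
```

Fine-tuning follows the same principle: continued training on domain text shifts the next-word probabilities towards the vocabulary and phrasing of that domain.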


Reference: Radford A et al. (2019) Language Models are Unsupervised Multitask Learners. OpenAI. Available at https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf [accessed 20 September 2020].

Reference: Rajapakse T (2020) Learning to Write: Language Generation With GPT-2. Medium, 27 April. Available at https://medium.com/swlh/learning-to-write-language-generation-with-gpt-2-2a13fa249024 [accessed 20 September 2020].
