
Speech Generator: Main AI technologies

Published on 21 September 2020
Updated on 05 April 2024

On 21 September, DiploFoundation launched the humAInism Speech Generator as part of its humAInism project. By combining artificial intelligence (AI) algorithms and the expertise of Diplo’s cybersecurity team, this tool is meant to help diplomats and practitioners write speeches on the topic of cybersecurity. 

Given the research nature of the project, the main aim of the generator was to explore various new AI technologies and examine their usability in the field of diplomacy. To this end, the generator combines several state-of-the-art algorithms, which serve three main purposes. 

  1. Semantic similarity search: Finding sentences with similar semantics from DiploFoundation’s corpuses of books and transcripts.
  2. Generation of long-form answers: Given a question, the algorithm finds relevant paragraphs from Diplo’s corpuses of books and transcripts, and generates new paragraphs with explanatory answers.
  3. Text generation: The algorithm is fine-tuned on diplomatic texts, and is used for the generation of new texts.

1. Semantic similarity search

We use the DistilBERT language representation model to encode sentences into 512-dimensional vectors. An approximate nearest neighbor search algorithm then compares these vectors and calculates a similarity score based on their angular distance. The technologies used for this purpose are described below. 

1.1. DistilBERT model

DistilBERT is a transformer model, smaller and faster than BERT (Bidirectional Encoder Representations from Transformers), which was pretrained on the same corpuses in a self-supervised fashion using the BERT base model as a teacher. This means it was pretrained on raw texts only, with no human labelling of any kind (which is why it can use a lot of publicly available data), with inputs and labels generated automatically from those texts using the BERT base model. More precisely, it was pretrained with three objectives. 

  • Distillation loss: The model was trained to return the same probabilities as the BERT base model.
  • Masked language modeling (MLM): This is part of the original training loss of the BERT base model. When taking a sentence, the model randomly masks 15% of the words in the input, runs the entire masked sentence through the model, and predicts the masked words. This differs from traditional recurrent neural networks (RNNs), which usually see words one after the other, and from autoregressive models such as the generative pretrained transformer (GPT), which internally mask future tokens; it allows the model to learn a bidirectional representation of the sentence.
  • Cosine embedding loss: The model was also trained to generate hidden states as close as possible to the BERT base model.

In this way, the model learns the same inner representation of the English language as its teacher model, while being faster for inference and downstream tasks.
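
As a rough illustration of how such an encoder is used, the minimal sketch below embeds two sentences with a generic DistilBERT checkpoint from the Hugging Face transformers library and compares them with cosine similarity. The checkpoint name, mean-pooling step, and 768-dimensional output are illustrative assumptions, not the exact 512-dimensional encoder used in the generator.

```python
# Minimal sketch: sentence embeddings with DistilBERT (illustrative only).
# Assumes the Hugging Face 'transformers' and 'torch' packages are installed;
# the generator's actual encoder and 512-dimensional projection may differ.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

sentences = [
    "Cybersecurity is a shared responsibility of all states.",
    "States share the responsibility for a secure cyberspace.",
]

# Tokenise and run the model; take the mean of the token embeddings
# (weighted by the attention mask) as a simple sentence vector.
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state          # (batch, tokens, 768)
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentence vectors.
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"similarity: {sim.item():.3f}")
```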

Reference: Sanh V et al. (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv, 1 March. Available at https://arxiv.org/abs/1910.01108 [accessed 20 September 2020].

1.2. Approximate Nearest Neighbors Oh Yeah (Annoy) search algorithm

Annoy is a C++ library with Python bindings which searches for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mapped into the memory so that many processes may share the same data.
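
A minimal sketch of building and querying an Annoy index with angular distance follows, assuming 512-dimensional sentence embeddings as above; the random vectors, number of trees, and number of neighbours are placeholders rather than the generator's actual settings.

```python
# Minimal sketch: building and querying an Annoy index with angular distance.
# Random vectors stand in for the 512-dimensional sentence embeddings;
# in the generator these would come from the DistilBERT encoder.
import random
from annoy import AnnoyIndex

DIM = 512
index = AnnoyIndex(DIM, "angular")   # angular distance ~ cosine similarity

for i in range(1000):
    vector = [random.gauss(0, 1) for _ in range(DIM)]
    index.add_item(i, vector)

index.build(10)               # 10 trees; more trees -> better recall, larger index
index.save("sentences.ann")   # read-only, memory-mapped, shareable across processes

query = [random.gauss(0, 1) for _ in range(DIM)]
ids, distances = index.get_nns_by_vector(query, 5, include_distances=True)
print(ids, distances)
```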

Reference: The Annoy Python module on GitHub. Available at https://github.com/spotify/annoy [accessed 20 September 2020].

2. Generation of long-form answers

For this task, we used models that were pretrained on the Wikipedia (Wiki-40B) and the Explain Like I’m Five (ELI5) questions datasets. We applied the models to our custom Diplo dataset, consisting of Diplo books and Internet Governance Forum (IGF) transcripts. Answers are generated in two stages. 

  1. At the retrieval stage, a pretrained custom-made embedder projects 512-dimensional BERT embedding vectors into a 128-dimensional space, so that the inner (dot) product between the projection of the question vector and the projection of the relevant answer vector is higher than its inner product with the projection of any other answer vector. Document retrieval is then carried out by Max Inner Product Search (MIPS) over the dense 128-dimensional embeddings with Faiss.
  2. At the generation stage, a pretrained BART sequence-to-sequence model generates the answers.

The applied algorithms are listed below. 

2.1. BERT 

The language representation model BERT is short for ‘Bidirectional Encoder Representations from Transformers’. Unlike earlier language representation models, BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pretrained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering (QA) and language inference, without substantial task-specific architecture modifications.
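
As an illustration of the ‘one additional output layer’ idea, the sketch below attaches a classification head to a pretrained BERT encoder via the Hugging Face transformers library; the two-class relevance task and its labels are hypothetical, not part of the speech generator itself.

```python
# Minimal sketch: BERT plus one task-specific output layer (illustrative task).
# BertForSequenceClassification adds a single classification head on top of
# the pretrained encoder, which is then fine-tuned end to end.
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(
    ["Cyber norms are voluntary.", "This sentence is off topic."],
    padding=True, return_tensors="pt",
)
labels = torch.tensor([1, 0])  # hypothetical relevance labels

outputs = model(**batch, labels=labels)
print(outputs.loss.item(), outputs.logits.shape)  # training loss, (2, 2) logits
```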

Reference: Devlin J et al. (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv, 11 October. Available at https://arxiv.org/abs/1810.04805 [accessed 20 September 2020]. 

2.2. Faiss search algorithm

The Facebook AI Similarity Search (Faiss) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, including sets that may not fit into random-access memory (RAM). It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++, with complete wrappers for Python/NumPy. Some of the most useful algorithms are implemented on the graphics processing unit (GPU). Faiss was developed by Facebook Artificial Intelligence Research (FAIR).
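
A minimal sketch of the maximum inner product search step with Faiss follows, assuming 128-dimensional embeddings as in the retrieval stage described above; the random vectors stand in for the projected question and passage embeddings.

```python
# Minimal sketch: Max Inner Product Search with Faiss over 128-d vectors.
# Random vectors stand in for the projected passage embeddings; the real
# retriever would index Diplo paragraphs and query with a projected question.
import faiss
import numpy as np

DIM = 128
passages = np.random.random((10000, DIM)).astype("float32")

index = faiss.IndexFlatIP(DIM)   # exact search by inner (dot) product
index.add(passages)

question = np.random.random((1, DIM)).astype("float32")
scores, ids = index.search(question, 5)   # top-5 passages by dot product
print(ids[0], scores[0])
```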

Reference: Faiss on GitHub. Available at https://github.com/facebookresearch/faiss/wiki [accessed 20 September 2020].

2.3. BART

BART is a denoising autoencoder for pretraining sequence-to-sequence models. It is trained by (1) corrupting text with an arbitrary noising function, and (2) teaching a model to reconstruct the original text. It uses a standard transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalising BERT (due to the bidirectional encoder), GPT (with a left-to-right decoder), and many other more recent pretraining schemes. BART is particularly effective when fine-tuned for generating text, but it also works well for comprehension tasks. It matches the performance of RoBERTa, with comparable training resources, on GLUE (General Language Understanding Evaluation) and SQuAD (Stanford Question Answering Dataset), and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarisation tasks, with gains of up to 6 ROUGE (Recall-Oriented Understudy for Gisting Evaluation) points. BART also provides a 1.1 BLEU (bilingual evaluation understudy) increase over a back-translation system for machine translation, with only target-language pretraining.
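
The sketch below shows BART used as a sequence-to-sequence generator via the Hugging Face transformers library; the public facebook/bart-large-cnn summarisation checkpoint and the toy input are stand-ins for the ELI5-style answer generator, which conditions on the question together with the passages retrieved by Faiss.

```python
# Minimal sketch: sequence-to-sequence generation with BART.
# The public 'facebook/bart-large-cnn' summarisation checkpoint stands in for
# the fine-tuned answer generator; the real model conditions on the question
# concatenated with the retrieved passages.
from transformers import AutoTokenizer, BartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

context = (
    "Question: What are cyber norms? Context: Cyber norms are voluntary, "
    "non-binding rules of responsible state behaviour in cyberspace ..."
)
inputs = tokenizer(context, return_tensors="pt", truncation=True)
output_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=80)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```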

Reference: Lewis M et al. (2019) BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv, 29 October. Available at https://arxiv.org/abs/1910.13461 [accessed 20 September 2020].

3. Text generation

For the task of generating introductory sentences, we used a pretrained GPT-2 model, which we fine-tuned on a dataset built from the first three sentences of the speeches in the UN General Debates dataset. 

Elements of the applied algorithm are listed below. 

3.1. GPT-2 

‘GPT-2 is a large transformer-based language model trained using the simple task of predicting the next word in 40GB of high-quality text from the internet. This simple objective proves sufficient to train the model to learn a variety of tasks due to the diversity of the dataset. In addition to its incredible language generation capabilities, it is also capable of performing tasks like question answering, reading comprehension, summarisation, and translation. While GPT-2 does not beat the state-of-the-art in these tasks, its performance is impressive nonetheless considering that the model learns these tasks from raw text only.’ (Rajapakse, 2020) 
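
A minimal sketch of generating text with GPT-2 via the Hugging Face transformers library follows; the stock gpt2 checkpoint, prompt, and sampling settings are illustrative stand-ins for the model fine-tuned on UN General Debates openings.

```python
# Minimal sketch: prompting GPT-2 to continue a speech opening.
# The stock 'gpt2' checkpoint stands in for the model fine-tuned on the first
# sentences of UN General Debates speeches; sampling settings are illustrative.
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Mr President, distinguished delegates, cybersecurity is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

output_ids = model.generate(
    input_ids,
    max_length=60,
    do_sample=True,          # sample instead of greedy decoding
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```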

Reference: 

Radford A et al. (2019) Language Models are Unsupervised Multitask Learners. OpenAI. Available at https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf [accessed 20 September 2020].

Rajapakse T (2020) Learning to Write: Language Generation With GPT-2. Medium, 27 April. Available at https://medium.com/swlh/learning-to-write-language-generation-with-gpt-2-2a13fa249024 [accessed 20 September 2020].
