
Word frequency is the most basic, informative, and commonly used measure in quantitative language analysis. The more occurrences a word has in a given text corpus, the more important the concepts related to it have to be – such a simple inference does not even call for further elaboration.

Word frequency analysis does not include so-called stop-words, such as could, did, had, got etc. Rather, this study includes only those words with a semantic relevance, the so-called semantic words. Semantic words carry most of the information in a sentence and encompass four groups: nouns, verbs, adjectives, and adverbs. In a later phase, this study will focus on the specific set of words used in the Internet governance process.
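As a rough illustration of the counting step described above, the following minimal Python sketch tallies word occurrences while skipping stop-words. The stop-word list, the tokenising rule, and the function name word_frequencies are assumptions made for this example; the study's actual list and tooling are not given here, and the restriction to nouns, verbs, adjectives, and adverbs would require part-of-speech tagging, which this sketch does not attempt.

```python
import re
from collections import Counter

# Illustrative stop-word list only; the list actually used in the study is not given.
STOP_WORDS = {"could", "did", "had", "got", "the", "a", "of", "and", "to", "is", "i", "we"}

def word_frequencies(text, stop_words=STOP_WORDS):
    """Count lower-cased word tokens, skipping any token in stop_words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

# Invented sample sentence, not taken from the IGF transcripts.
sample = "I think the Internet is open. I think we had a good debate."
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

Sorting the resulting counts from highest to lowest gives the kind of frequency table on which the rankings in the following sections are based.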

 

The most frequently used words

Think was the most frequently used word in the IGFs from 2006 to 2011. In the IGF text corpus, the word think occurs 20 590 times, followed by the word very with 19 854 occurrences, Internet with 18 185 occurrences, much with 17 809 repetitions, and thank with 14 874 uses. The word governance comes in 54th place among the most frequently used words, with 4729 occurrences; the word person is highly ranked in 5th position with 14 874 occurrences; while IGF was used 10 101 times and is the 11th most frequently used word. Figure 1 depicts the frequencies of the 30 most frequently used words in the IGF meetings from 2006 to 2011.

Figure 1. The 30 most frequently used words in the IGF 2006–2011.
The frequency scale is given in thousands of word occurrences; the exact number of occurrences is given next to each word.

Politeness is important in diplomacy, says the IGF text corpus: the word thank is the 6th most used word overall, with 14 813 occurrences (note that the word much is ranked 3rd with 18 185 uses and the word very is 4th with 17 809 occurrences, although their usage is not necessarily related to the usage of the word thank).

IGF 2012 Baku update: The distribution of word frequency did not change much after the data generated during the Baku 2012 meeting were included in the IGF Text Corpus. The most frequent words (with the corresponding frequencies cited in parentheses) were the following: think (8501), internet (7441), very (6867), much (6852), one (5657), person (5483), thank (5212), go (4919), it's (4504), know (3990).

Word usage dynamics: 2006 – 2012

A look at the temporal evolution of the usage of specific words might be interesting. Let’s take a look at the changes in word frequencies for some of the most often used – and other interesting – words in the IGF 2006–2012. Figure 2a represents the number of occurrences for the words think (the most frequently used word overall), Internet, and IGF (the words that are obviously of central importance in this specific discourse), as well as person (the 5th most used word), country (to see when and whether we speak of countries more than of individuals), and governance (again of central importance in this diplomatic language):

 

Figure 2a. The temporal evolution of selected word usage, IGF 2006–2012.
The frequency scale is given in thousands of word occurrences.

As we now discover, the word think was not the most frequently used word in the IGF discourse from the beginning of its formation. In fact, the word Internet – as might have been expected – was most often used in 2006, 2007, 2008, and even 2009 (by a tiny advantage over think), ceding the lead to think only in 2010.
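A year-by-year comparison of this kind can be sketched as follows. This is only an illustration: the function name yearly_counts, the tokenising rule, and the two toy snippets standing in for yearly transcripts are assumptions made for the example, not the data or code behind Figure 2a.

```python
import re
from collections import Counter

def yearly_counts(transcripts_by_year, tracked_words):
    """Map each year to the counts of the (lower-case) tracked words in that year's text."""
    result = {}
    for year, text in transcripts_by_year.items():
        tokens = Counter(re.findall(r"[a-z']+", text.lower()))
        result[year] = {word: tokens[word] for word in tracked_words}
    return result

# Tiny invented snippets standing in for per-year transcripts (not IGF data).
toy = {
    2006: "The Internet matters. Internet access, I think.",
    2010: "I think so. We think the Internet will grow. I think.",
}
counts = yearly_counts(toy, ["think", "internet"])

# For each year, pick the tracked word with the highest count.
leaders = {year: max(c, key=c.get) for year, c in counts.items()}
print(leaders)
```

Plotting the per-year counts of each tracked word against the year would give curves of the same kind as those in Figure 2a.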

The reader might note that the overall word frequency/volume is significantly higher in the years 2010–2012. The reason is the inclusion of the transcripts from all workshops held in Vilnius (IGF 2010), Nairobi (IGF 2011) and Baku (IGF 2012), which increased the volume of contribution of these three meetings in the IGF text corpus. In comparison to the very high frequencies of words such as think, Internet, and person, words like country, IGF, and governance obviously fall in a lower stratum of those most often used. It could be that the comparison is not completely fair: the usage of the word governance, for example, might have been absorbed by the usage of IGF; but again, even taken together, these two terms would not rank as high as the words think and Internet did. Figure 2a uncovers another interesting fact: during the previous six years of the IGF, we spoke much less of countries than of persons, with the word person being among the most frequently used words in the IGF discourse.

To give a glimpse of what will follow in the next months of reporting on our results, Figure 2b shows the temporal evolution of word usage for a few more important, selected words from the IGF text corpus.

 

Figure 2b. The temporal evolution of selected word usage, IGF 2006–2012.
The frequency scale is given in thousands of word occurrences.

All of the words whose frequencies are represented in Figure 2b fall in the region of less commonly used words, in comparison with the range of frequency presented in Figure 2a. The most frequently used word in Figure 2b, information, can perhaps be compared to country or IGF from Figure 2a in respect to usage. The word stakeholder, highly typical of Internet governance discourse since the introduction of the principle of multistakeholderism, tends to be among the 200 most frequently used words across all six years in our analysis, but it does not compare to the more frequently used information, world, development, or policy in Figure 2b. Later on, we will discuss the patterns of word usage by various stakeholders in the IGF discourse – certainly a more informative approach than the one based on observations of single word usage.

Figure 3 represents the change in the 30 most frequently used words in the IGF, and it is provided for those interested in more precise comparisons among the six IGF meetings. The sudden jump in overall word frequency resulting from the inclusion of the transcripts from workshop sessions held in Vilnius (IGF 2010), Nairobi (IGF 2011) and Baku (IGF 2012) is obvious in the last three panels in Figure 3.

Figure 3. The change in the 30 most frequently used words, IGF 2006–2012.
The frequency scale is given in thousands of word occurrences.

Figure 4 represents a word cloud for the 100 most frequently used words in the IGF text corpus 2006–2011.

Figure 4. IGF 2006–2011: Word Cloud.
Generated with Wordle.net

 

Linguistic analysis of the IGF text corpus: Power laws and Zipf’s law

The distribution of word frequency in any natural language corpus tends to exhibit a regularity similar to the distribution of wealth in any national economic system. According to the popular 80-20 rule (which is a misnomer, since it does not necessarily involve any proportion similar to 80:20), very few individuals in any economy control a huge proportion of wealth, and vice versa: a huge proportion of individuals control a small proportion of the wealth. We owe this observation to the famous Swiss-Italian economist, sociologist and mathematician Vilfredo Pareto, who probably formulated the regularity around 1897.[1] The fact is that natural languages exhibit the very same kind of regularity as economies: a huge proportion of word usage is carried by a few most frequently used words, with the remainder being divided among the large number of less frequently used words. This finding could almost be used as a signature of any natural language corpus.
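The concentration described above can be quantified as the share of all word occurrences accounted for by the k most frequent words. The sketch below is a minimal illustration under stated assumptions: the function name top_share is hypothetical and the counts are made up, not taken from the IGF text corpus.

```python
def top_share(counts, k):
    """Fraction of all word occurrences accounted for by the k most frequent words."""
    ordered = sorted(counts, reverse=True)
    return sum(ordered[:k]) / sum(ordered)

# Made-up counts for ten word types, for illustration only (not IGF data).
toy_counts = [50, 20, 10, 5, 5, 4, 3, 1, 1, 1]
share = top_share(toy_counts, 2)
print(share)
```

In this toy list, two of the ten word types carry most of the usage, which is the qualitative pattern the paragraph above describes.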

More specifically, the IGF text corpus follows Zipf’s law. The law, owing its name to George Kingsley Zipf, who seems to be the first linguist to have discovered it (1935 [2]), states that in any natural language corpus the frequency of any given word is inversely proportional to its rank in the frequency table. For example, imagine the list of the first 100 most frequently used words in the IGF text corpus. Zipf’s law predicts that the word frequency will decline sharply, in a non-linear fashion, beginning from the most frequent word and progressing towards less frequent words. The left panel of Figure 5 depicts exactly this relationship for the IGF text corpus. It is based on the observation of word frequency for 24 325 different English words found in our corpus.
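In symbols, the inverse proportionality can be written as below. The notation is introduced here for illustration only: f(r) is the frequency of the word at rank r, C is a constant that depends on the corpus, and s is an exponent that equals 1 in the classical statement of the law.

```latex
f(r) = \frac{C}{r^{s}}, \qquad s \approx 1
\quad\Longrightarrow\quad
\log f(r) = \log C - s \log r
```

Taking logarithms of both sides turns the sharply declining curve into a straight line with slope -s.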

Figure 5. The rank-frequency plot of the IGF text corpus.
Left panel: the rank-frequency plot. Right panel: the rank-frequency plot in logarithmic coordinates (essentially, a test of Zipf’s law as explained in the text)

The left panel in Figure 5 presents the rank-frequency plot of the IGF text corpus. The x-axis represents the rank of words – their positions in the frequency tables, sorted from the most frequently used word (rank 1) towards the less frequently used words (rank 2, 3, 4, ...). The y-axis represents word frequency – the number of occurrences of the respective words in the text corpus. The sharp, non-linear decline which is obvious on the left panel is a typical expression of Zipf’s law. The right panel in Figure 5 shows the same plot in logarithmic coordinates: both axes are the same, except that the values are log(rank) and log(frequency) on the x and y axes, respectively. If the complete distribution of word frequency in our corpus were to follow Zipf’s law, then the points on the right panel – the logarithms of rank and frequency – would fall almost perfectly on a straight line.

The non-linear expression of this important regularity in natural languages becomes linear when plotted in logarithmic coordinates, and this fact is often used to test whether a word frequency distribution actually obeys Zipf’s law. Most often this is not the case: obviously, only a part of the word frequency distribution of the IGF text corpus (the right, more linear part of the plot on the right panel) is approximately linear when plotted in logarithmic coordinates. The reader can compare this finding with the result obtained for the word frequency distribution of Wikipedia in Figure 6a and the word frequency distribution for the 5000 most frequently used words from the well-known Brown corpus in Figure 6b.
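The linearity check described above can be sketched as an ordinary least-squares fit of log(frequency) against log(rank). This is a minimal illustration, not the procedure used to produce the figures: the function name loglog_fit is hypothetical, and the input is synthetic, exactly Zipfian data rather than IGF counts. A slope close to -1 corresponds to the classical form of the law.

```python
import math

def loglog_fit(frequencies):
    """Least-squares slope and intercept of log(frequency) against log(rank).

    Frequencies must be positive; they are ranked from highest to lowest.
    """
    ordered = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(f) for f in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic, exactly Zipfian frequencies (1000 / rank); not IGF data.
ideal = [1000 / rank for rank in range(1, 51)]
slope, intercept = loglog_fit(ideal)
print(round(slope, 3), round(math.exp(intercept), 1))
```

A fit of this kind is only a visual aid: a roughly straight log-log plot is not by itself strong evidence for a power law, which is why more demanding statistical tests are usually preferred.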

Figure 6a. The log-log rank-frequency plot of the Wikipedia word frequency list. The x-axis represents log(word rank), and the y-axis represents log(word frequency). Various linear models are fitted to these datasets with the green line approximating the right part of the distribution falling most closely to the typical formulation of Zipf’s law. Source: http://en.wikipedia.org/wiki/Zipf%27s_law

 

Figure 6b. The rank-frequency plot for the 5000 most frequent words in the Brown corpus. The x-axis represents word rank, and the y-axis represents word frequency; the right panel is in log-log coordinates. Data set source: http://www.edict.biz/textanalyser/wordlists.htm

The word frequency list of Wikipedia exhibits qualitatively the same pattern as the IGF text corpus. Only its right part, where less frequently used words are found, seems to follow Zipf’s law. One has to be careful when it comes to statements of the empirical validity of Zipf’s law for various datasets: recent research efforts show how this and similar regularities – all falling in the family of power laws – are found less often than previously thought, when more powerful statistical tests are used for purposes of estimation.[3] Another possibility is that parts of the frequency distributions obey Zipf’s law with different characteristics, and this possibility cannot be assessed if one chooses to analyse the complete word frequency distribution from a given corpus.

Figure 6b shows how the 5000 most frequently used words from the Brown corpus almost perfectly satisfy Zipf’s law (the log-log plot on the right panel of Figure 6b is approximately linear). Again, more powerful statistical tests (i.e. maximum-likelihood estimation of power-law behavior, [3]) show that only observations of word frequency higher than 56 – the class encompasses only the 1971 most frequently used words – follow Zipf’s law among the top 5000 words from the Brown corpus. We have verified that Zipf’s law holds for the part of the IGF text corpus where lower word frequencies are found (see the right panel in Figure 5).
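To give an idea of what maximum-likelihood estimation involves, the sketch below implements the standard continuous estimator of the power-law exponent for observations at or above a chosen lower cut-off. It rests on several assumptions and is not the analysis reported here: the function name powerlaw_alpha_mle is hypothetical, the cut-off x_min is supplied by hand rather than selected from the data as in [3], word counts are treated as continuous values, and the test data are synthetic. The estimated exponent alpha describes the distribution of frequency values; a rank-frequency exponent s corresponds to alpha = 1 + 1/s, so the classical Zipf case s = 1 gives alpha = 2.

```python
import math
import random

def powerlaw_alpha_mle(values, x_min):
    """Continuous maximum-likelihood estimate of the exponent for values >= x_min.

    Returns (alpha, approximate standard error, number of tail observations).
    """
    tail = [v for v in values if v >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(v / x_min) for v in tail)
    alpha = 1.0 + len(tail) / log_sum
    std_err = (alpha - 1.0) / math.sqrt(len(tail))
    return alpha, std_err, len(tail)

# Synthetic sample drawn from a power law with exponent 2 (not IGF or Brown data),
# generated by inverse-transform sampling with a fixed seed.
rng = random.Random(0)
true_alpha, x_min = 2.0, 1.0
samples = [x_min * (1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0))
           for _ in range(20000)]

alpha_hat, std_err, n_tail = powerlaw_alpha_mle(samples, x_min)
print(round(alpha_hat, 2), round(std_err, 3), n_tail)
```

In a full analysis the cut-off itself is estimated, and a goodness-of-fit test is then used to decide whether the power law is a plausible description of the tail.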

Again, we are interested in testing the possibility that Zipf’s law holds for the most frequently used words in our corpus. To perform this test, we need to isolate the most frequently used words from the complete word frequency distribution. We chose to analyse the 500 most frequently used words from the IGF text corpus; our results show that the 175 most frequently used words (with a frequency of 2093 and higher) from the IGF text corpus follow Zipf’s law. Figure 7 illustrates the Zipfian behaviour of these 175 most frequently used words: similar to the case of the Brown corpus, the most frequently used words obey this well-known regularity. The reader should bear in mind that stop-words are not included in the word frequency distribution of the IGF text corpus, which affects exactly the range of the most frequently used words.

Figure 7. The rank-frequency plot for the 175 most frequent words in the IGF text corpus. Left panel: the rank-frequency plot. Right panel: the rank-frequency plot in logarithmic coordinates. The x-axis represents word rank (log(rank) on the right panel), and the y-axis represents word frequency (log(frequency) on the right panel).

 

References

[1] Johnson NL, Kotz S and Balakrishnan N (1994) Continuous univariate distributions Vol 1. Wiley Series in Probability and Statistics.

[2] Zipf GK (1935) The Psychobiology of Language. Houghton-Mifflin.

[3] Clauset A, Shalizi CR and Newman MEJ (2009) Power-law distributions in empirical data. SIAM Review 51(4), pp 661–703. (arXiv:0706.1062, doi:10.1137/070710111)

 

 
