The Importance of Coverage in Vocabulary Learning

The Relationship Between Lexical Size, Coverage, and Reading Ability

One of the main reasons we created the NGSL Project is our understanding of how important knowledge of high-frequency vocabulary is to reading ability and overall language proficiency. The NGSL Project is all about efficiency, and systematic study of the words in our lists can work as a shortcut for second language learners of English. With 600,000 words in the English language and native speakers knowing only between 25,000 and 30,000 of them, it is clear that not all words are created equal.

The most frequent word in written English is “the”, which represents between 6% and 7% of the words in most written texts. Knowing just the 10 most frequent words of English covers approximately 25% of the words you would meet, and the top 100 words cover close to 50% of the words you would encounter. In the graphic above you can see a randomly selected 10 million word sample from the Cambridge English Corpus. The blue bar shows the coverage you would get if you knew the 2000 most frequent words in that corpus: 83%. This means that 8.3 million of the 10 million words are the same 2000 words occurring over and over again. Knowledge of the next 2000 adds only another 5%, and the next 2000 another 3%. This dramatic drop-off after the first 2000 words is related to a mathematical principle known as Zipf’s Law (Crystal, 1987), and it indicates the extremely high importance of knowing those blue words. These are core, foundational words that are the basis of proficiency in a foreign language.
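The band-by-band coverage figures above can be reproduced for any text with a short script. The following is a minimal sketch, not the method used to build the NGSL lists: it assumes the text has already been split into lowercase word tokens, and the function name `coverage_by_band` and its parameters are hypothetical names chosen for illustration.

```python
from collections import Counter


def coverage_by_band(tokens, band_size=2000, bands=3):
    """Return the share of running words covered by each successive
    frequency band (ranks 1-2000, 2001-4000, ...) of the text itself.

    Illustrative helper; the name and parameters are assumptions.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * bands
    # Frequencies ordered from the most to the least frequent word.
    ranked = [freq for _, freq in counts.most_common()]
    shares = []
    for i in range(bands):
        band = ranked[i * band_size:(i + 1) * band_size]
        shares.append(sum(band) / total)
    return shares


# Toy example: a band size of 2 stands in for the 2000-word bands.
sample = "the cat sat on the mat and the dog sat on the rug".split()
print(coverage_by_band(sample, band_size=2, bands=3))
```

On a large corpus the first band should dominate and each later band should add less, which is the pattern the graphic describes.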

Though 83% is an excellent and important start, it is not enough, and all of our NGSL wordlists provide far more than 83% coverage. There is a long history of research studies on what are known as “vocabulary thresholds”. The short version is this: if learners of English know fewer than 80% of the words on the page, reading comprehension is next to impossible (Laufer, 1990). The minimum point at which there will be more readers than non-readers is 90%, and the level at which most learners will be able to read and guess from context is 95% (Laufer, 1989; Laufer, 1992; Liu & Nation, 1985). More recent research points to another threshold, 98%, which is the level at which learners can read for pleasure (Hu & Nation, 2000).
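To make the thresholds concrete, the sketch below measures what fraction of a text's running words a learner knows and maps that figure onto the bands described above. It is only an illustration: `known_word_coverage` and `reading_band` are hypothetical names, the band labels paraphrase the text, and the 80% to 90% range is not characterised by the studies cited here, so it is labelled simply as below the minimum.

```python
def known_word_coverage(tokens, known_words):
    """Fraction of running words (tokens) found in the learner's
    known-word set. Illustrative helper, not part of any NGSL tool."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)


def reading_band(coverage):
    """Map a coverage fraction (0.0 to 1.0) to the threshold bands
    described in the text."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0.0 and 1.0")
    if coverage >= 0.98:
        return "can read for pleasure"
    if coverage >= 0.95:
        return "can read and guess from context"
    if coverage >= 0.90:
        return "minimum threshold: more readers than non-readers"
    if coverage < 0.80:
        return "comprehension next to impossible"
    # 80% to 90%: not described in the text, so only labelled as below minimum.
    return "below the 90% minimum threshold"


tokens = "the cat sat on the mat".split()
coverage = known_word_coverage(tokens, {"the", "cat", "on"})
print(coverage, reading_band(coverage))
```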

This is the reason we have set 90% as the minimum threshold for our three core vocabulary lists (the NGSL for general English, the NGSL-S for spoken English and the NDL for children’s English). Each is seen as the most important first step in a learner’s vocabulary development, with the second step being study of the appropriate SP list to quickly bring students to 95% or higher coverage (97% for our business service list, 98% for our fitness English list and 98.5% for our TOEIC list).

An important corollary of Zipf’s Law relevant to this discussion is that if the core words are not fully learned (the blue ones in the graphic above, or the NGSL, NGSL-S and NDL word lists for EFL learners), it is mathematically impossible to reach the all-important vocabulary thresholds of 90%, 95% and 98%. Unfamiliarity with these words almost guarantees that students will not be able to use top-down skills, activate schema, guess from context, score well on reading exams, or develop reading fluency.
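The arithmetic behind this corollary is simple: whatever share of a text's running words comes from core words the learner does not know is lost, so coverage can never exceed 100% minus that share, no matter how many rarer words are learned. The sketch below works this through; the function names and the example shares are assumptions chosen for illustration, not measured values.

```python
THRESHOLDS = (0.90, 0.95, 0.98)


def max_reachable_coverage(unknown_core_share):
    """Upper bound on coverage when unknown core words account for
    `unknown_core_share` of the running words (a fraction, 0.0 to 1.0)."""
    if not 0.0 <= unknown_core_share <= 1.0:
        raise ValueError("share must be between 0.0 and 1.0")
    return 1.0 - unknown_core_share


def reachable_thresholds(unknown_core_share):
    """List the thresholds that are still attainable under that ceiling."""
    ceiling = max_reachable_coverage(unknown_core_share)
    return [t for t in THRESHOLDS if ceiling >= t]


# Hypothetical learner whose unknown core words make up 15% of a text:
# the ceiling is 85%, so none of the three thresholds can be reached.
print(reachable_thresholds(0.15))
# With only 3% of running words lost to unknown core words, the 90% and
# 95% thresholds remain possible, but 98% does not.
print(reachable_thresholds(0.03))
```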