12dicts is a collection of English word lists. It differs in several important ways from most of the other free word lists you can download.
Originally, 12dicts was composed of lists derived from a specific set of 12 source dictionaries. In addition to these "classic" lists, 12dicts now includes lists derived from other sources. It would perhaps be appropriate to rename 12dicts to something more generic, such as BAWL (Beale's Assorted Word Lists), but I have not done so in order to preserve continuity.
A quick summary of the 12dicts lists and their characteristics is as follows:
The remainder of this document is organized as follows:
This is release 4.0 of 12dicts, released Jan. 18, 2003. It differs from previous versions by containing three additional lists which are not derived from the "classic" 12dicts sources. Changes to the classic lists are limited to error corrections.
The 12dicts project began as the n-dicts project, n being a variable whose value finally stabilized as 12. The purpose of the project was to create a list of words approximating the common core of the vocabulary of American English.
The methodology of the project was to record and correlate the words listed in a number of small dictionaries. The number of dictionaries so recorded is now 12, comprising 8 ESL (English as a Second Language) dictionaries and 4 "desk dictionaries". The dictionaries chosen vary widely by publisher, by style, by completeness and by depth. In this version of 12dicts, all of them are dictionaries of American English (three from British publishers). The smallest of them contains about 20,000 entries, and the largest 46,000. (All totaled, there are about 75,000 entries, many of which appear in only a single dictionary.) All but two of them were published in the last seven years.
I initially tried two different ways of winnowing the 12dicts data to produce lists of common words. Both produced interesting results. One list, the 6of12 list, contains all words and phrases listed in 6 of the 12 dictionaries. One way of describing this list is that it contains those words and phrases which a (seeming) majority of lexicographers believe are relevant to people learning English, and/or to everyday usage. This list contains about 32,000 words and phrases. The other list, the 2of12 list, is more inclusive in that it includes words listed in as few as two of the source dictionaries, but less inclusive in that it excludes items of various sorts, including multiword phrases, proper names and abbreviations. This list contains about 41,000 words. It is perhaps more suitable for use in areas like spell checking or word games than the 6of12 list. (Honesty compels me to admit that neither of these lists is, by itself, a good choice for spell checking, due to the absence of inflections, proper names, Roman numerals, etc.)
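The "listed in k of the n dictionaries" idea can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used to build the lists: the function name and the toy data are hypothetical, "6 of 12" is read here as "at least 6", and the further exclusions applied to 2of12 (phrases, proper names, abbreviations) are not modeled.

```python
from collections import Counter

def words_in_at_least(dictionaries, k):
    """Return, sorted, the words listed in at least k of the given dictionaries."""
    counts = Counter()
    for headwords in dictionaries:
        # set() ensures a dictionary is counted at most once per word
        counts.update(set(headwords))
    return sorted(word for word, n in counts.items() if n >= k)

# Toy data standing in for the source dictionaries (hypothetical contents).
toy_dictionaries = [
    {"cat", "dog", "esophagus"},
    {"cat", "dog"},
    {"cat", "ogee"},
]
common = words_in_at_least(toy_dictionaries, 2)
```

With real data, each element of `dictionaries` would be the set of headwords recorded from one source, and `k` would be 6 or 2.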
A third list, 2of12inf.txt, developed later, is of a rather different character, and is discussed in a later section.
A more precise description of the criteria by which the above lists were composed is as follows:
A much smaller set of words (49) was added to the 2of12 list. These were of two sorts:
These annotations are:
| Suffix | Meaning |
|---|---|
| : | The word is an otherwise unmarked abbreviation. This suffix may appear in combination with another suffix. |
| & | The word is primarily a non-American usage. |
| # | The word is generally held to be a variant or less preferred form of another word. |
| < | This form of a word is held to be the primary form by fewer dictionaries than some other form of the word. |
| ^ | This form of the word was selected arbitrarily from a set of variants, none of which was clearly preferred. |
| = | Roughly, this indicates a "second class" word, as described below. |
| + | The word is a signature word. |
The words in the 2of12 list are not annotated.
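For anyone who wants the bare words from an annotated list, a minimal Python sketch for separating an entry from its trailing annotation characters follows. It assumes one entry per line and that the seven characters listed above are the complete set of markers; the function name and the sample entries are hypothetical.

```python
# Annotation characters taken from the table above.
ANNOTATION_CHARS = ":&#<^=+"

def split_annotations(entry):
    """Split a list entry into (word, markers).

    Markers are the annotation characters found at the end of the entry;
    more than one may be present, since ':' can combine with another suffix.
    """
    text = entry.rstrip()
    end = len(text)
    while end > 0 and text[end - 1] in ANNOTATION_CHARS:
        end -= 1
    return text[:end], text[end:]

# Hypothetical sample entries, for illustration only.
samples = ["colour&", "front page", "etc:#"]
parsed = [split_annotations(s) for s in samples]
```

Keeping the markers rather than discarding them makes it easy to filter afterwards, for instance to drop every entry carrying "&".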
The 2of12inf list is of a rather different character from the two original "classic" lists. Conceptually, it is simple. It consists of all the words in the 2of12 list, plus their inflections, amounting to about 81,000 words. This list may be more useful than the other lists for applications like word games. It was created to help Kevin Atkinson in his Aspell and SCOWL projects (for which, follow this link). Unlike the 6of12 and 2of12 lists, this list is not based exclusively on the contents of my 12 source dictionaries, and for this reason it has, I feel, less authority than the other classic 12dicts lists. It also probably has a significantly higher error rate than the other lists, for reasons explained below.
The criteria defining the 2of12inf list are as follows:
Though the 2of12inf list still consists mostly of very common words, criteria 3 through 5 above cause the 2of12inf list to contain a greater proportion of unfamiliar and unusual words than the other classic 12dicts lists.
The 2of12inf list was not derived directly from the 12 source dictionaries. The starting point was a subset of Kevin Atkinson's AGID list, a list of words, parts of speech and inflections derived from public-domain sources, notably Moby Words and WordNet. (See the file agid.txt in the 12dicts archive, which is a copy of the AGID "readme", for more information on the antecedents of AGID.) 2of12inf was created by a process of editing the AGID subset to remove spurious entries and those which reflected a more esoteric English vocabulary than the other 12dicts lists, and to add inflections which AGID failed to identify. This process required significantly less effort than would have been needed to derive the list directly from the source dictionaries. Unfortunately, a side effect of the process is that the result is likely to be somewhat less reliable than the other 12dicts lists. In particular, Moby Words is notoriously unreliable, and I find it unlikely that I have successfully identified all the spurious inflections its use has introduced. It is my hope in the future to release another edition of 2of12inf which is not derived from AGID, and therefore not "infected" by Moby Words.
After the first version of the 2of12inf list was released, I replaced one of the source dictionaries, officially an international dictionary but in actuality rather British in its orientation, with a more American dictionary by the same publisher. It was not practical (nor necessarily desirable) for me to go through the list removing inflections endorsed only by the superseded dictionary. For this reason, the 2of12inf list has a slightly more international character than the other 12dicts lists. It is not altogether clear that this is a bad thing.
Ideally, the 2of12inf list would contain only inflections listed in one of the 12dicts source dictionaries. This proved not to be practical. The reason for this has to do with the nature of these sources, which are mostly ESL dictionaries. An ESL dictionary might well list the word esophagus, but, because an English learner is unlikely to need to talk about this organ in the plural, it will probably not bother to list the plural form esophagi. For words of this sort, I therefore needed to obtain their inflections from other sources. Obviously, the decisions on when to include additional inflections were judgment calls, as were the choices of which inflections to add.
Adjectival inflections (comparatives and superlatives) proved to be an especially annoying problem. Only 2 of my 12 source dictionaries provided remotely reliable information of this sort. In fact, such information is sparse and inconsistent in most dictionaries of any size. I relied on a small set of additional dictionaries for this information, which was mostly disjoint from the sources for plurals and verb forms. Several of these sources were Scrabble®-related, and therefore inclined to include forms of little plausibility such as iller/illest or fertiler/fertilest. Accordingly, I ended up rejecting some of the documented inflections on grounds of implausibility. I have no doubt that, in the process, I made a number of errors of both inclusion and exclusion and, in any case, many of the forms listed have no connection with any of the 12dicts source dictionaries.
One additional problem in the creation of the 2of12inf list was that of "uncountable" nouns and their plurals. Some English dictionaries, especially ESL dictionaries, as well as other linguistic sources attest to the existence of nouns which cannot be counted, or used in the plural. Examples of such nouns include mud, rayon, oregano, chess, fairness, wisdom, aluminum, training, materialism and chickenpox. This is an entirely commonsense notion, but a difficulty is the fact that the boundary between the countable and the uncountable is extremely vague and ill-defined. For example, the word coffee is ordinarily uncountable, but not when ordering in a restaurant, as is the word symmetry, except in physics or math. In general, it is possible to contrive a context where use of the plural of any noun whatsoever is reasonable.
An alternate position, therefore, is that in fact no nouns are uncountable, and that any noun which is not already plural possesses a plural. This position is especially useful in the context of word games, where words such as zeals and anthraxes may produce large scores. For this reason, the official Scrabble dictionaries list words such as thens, onces and mankinds, which most people find rather implausible. The fact that the 2of12inf list might well be useful in gaming contexts, together with the fact that the boundary between countable and uncountable nouns is so ill-defined, served as a powerful argument for inclusion of all plural forms, whether commonly used or not, while its derivation from ESL sources argued for including only the plurals of countable nouns, however distinguished.
In the end, I was unable to resolve this dilemma, and adopted a compromise. The 2of12inf list includes all plurals, but with the plurals of uncountable nouns marked, making it easy to remove them if they are not wanted. That left the issue of how to establish countability. Six of my source dictionaries included information on countability, which was adequate to decide the status of most of the included nouns. As for the rest, as usual, I used my best judgment. I will confess to occasionally overriding the source dictionaries when I believed they were clearly incorrect. (For instance, I chose not to mark the word hatreds as an uncountable plural, in defiance of the opinion of all my sources, on the grounds that it has been used in too many news stories from Bosnia to be considered unusual.) It is interesting to note that most of the plurals I added from auxiliary sources were of words considered uncountable.
The difficulties listed above, and the fact that I was forced to exercise personal judgment frequently in creating it, emphasize a fundamental difference between this list and the other classic 12dicts lists. I have tried to make the 6of12 and 2of12 lists reflect only the source dictionaries, and to keep my own judgments and opinions out of the picture (except for my addition of signature words). This has proved impossible to achieve for the 2of12inf list, which accordingly represents a less authoritative and more arbitrary collection. Additionally, the 2of12inf list has undergone less proofreading and validation than the other lists, and I suspect the error rate is considerably higher than the idealistic goal of 0.02 % I advocate elsewhere in this document. Nevertheless, I hope it may prove to be of some use and interest.
I wish to offer my special thanks to Kevin Atkinson, for supplying me with the AGID list, and for encouraging me to add the inflections. Of course, any errors that remain in the 2of12inf list are my own responsibility, and should not be blamed on Kevin, AGID, or even on Moby.
The 3esl list represents another attempt to produce an English "core vocabulary" list. It is about 2/3 of the size of the 6of12 list, which it resembles in terms of the sorts of words included.
The 3esl list is a far more subjective list than any of the classic 12dicts lists. It was compiled from 3 small ESL dictionaries, using the same criteria for eligibility as the 6of12 list. I started with a list composed of all words from the smallest of the 3 sources, plus all words contained in both of the others. This list was then edited in the following ways:
All of these changes were quite subjective in nature, and quite numerous. Probably more than 10 % of the candidate words were added or removed in this way. For this reason, it is pointless to speak of signature words for this list; the composition of the list is too arbitrary for the term to make any sense. (I will note that the list is still not entirely arbitrary, as I added only words found in some form in one of the sources, and removed no words present in two of the sources other than duplicates. Thus, words like front page were not added, no matter how familiar, and words such as lugubrious were not removed, despite clearly not being part of any "core vocabulary".)
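The mechanical starting point for 3esl (before any of the subjective edits) amounts to a simple set expression: everything in the smallest source, plus whatever the other two sources have in common. A sketch in Python, with a hypothetical function name and toy data:

```python
def candidate_list(smallest, second, third):
    """All words from the smallest source, plus words found in both of the others."""
    return sorted(set(smallest) | (set(second) & set(third)))

# Toy data standing in for the three ESL dictionaries (hypothetical contents).
smallest = {"cat", "dog"}
second = {"cat", "house", "lugubrious"}
third = {"dog", "house", "ogee"}
candidates = candidate_list(smallest, second, third)
```

Here "house" is admitted because both larger sources list it, while "lugubrious" and "ogee" each appear in only one source and are left out.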
Like the 6of12 list, the 3esl list marks lower-case abbreviations with a ":" suffix, to prevent them from being mistaken for regular English words.
One final note on this list. The 3esl list contains about 1500 words not present in the 6of12 list. Because these two lists have the same rules for the kinds of words included, one could easily combine the two to produce a slightly larger list including a number of words whose omission from 6of12 is rather surprising. Be warned that in a few cases, the spelling chosen for words with multiple spellings is different in the two lists, and I would recommend that the duplicates be removed. (I'll be happy to provide a list of the duplicates if anyone wants one.)
All of the classic 12dicts lists are unabashedly oriented towards American English. I've received a few expressions of interest in a British English list. The result is the 2of4brif list. This list was compiled from 4 large "international" ESL dictionaries, published by British publishers. To this American, they are more British than they are international; quite possibly, they seem more American than international to British readers. It is interesting to note that, although there were only a third as many sources for this list as for the 12dicts lists, these dictionaries resembled each other far more closely than their American counterparts, which could mean that the 2of4brif list is as good an approximation of a "core" British English vocabulary as the 6of12 list is for American English. (Or, alternately, it may simply mean that my choice of sources was too narrow.)
The criteria for inclusion in this list were basically those of the 2of12inf list. In particular, inflections are included for all words, but hyphenated words, contractions, phrases, proper names and abbreviations are all excluded. One important difference between the two is the way in which inflections were determined for inclusion. The 2of12inf list includes some inflections found in one (or even none) of its sources. Further, as discussed in detail above, it includes plurals for words which are not normally considered to have plurals. The 2of4brif list differs in both of these regards. It includes only inflections endorsed by two or more of the sources, specifically excluding any plural forms for nouns listed as uncountable.
The 2of4brif list includes no signature words as such. I made a small number of adjustments for consistency, such as making sure that -ise and -ize spellings were equally represented, and adding plurals for ordinal numbers. (Why fourteenth would be defined as a fraction, but not seventeenth, I must simply regard as a mystery.) These edits were so few, and so clearly harmless, that I have not marked them.
Prospective users of the 2of4brif list should realize that it was compiled by an American. If my sources contained any glaring errors (and most dictionaries have a few), I might well not have noticed, and perpetuated them in the list. The fact that two citations were required is some protection against such an event, but no guarantee.
As the 2of4brif list is very similar in makeup to the 2of12inf list, a user who wants a larger, more international list than either could reasonably merge the two. If you do this, you should remove the unusual plurals (marked with a "%") from the 2of12inf list in the process, for consistency.
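A merge along these lines might look like the following Python sketch. It assumes each list has one entry per line and that the "%" marker is a suffix on the entry; the function name and sample entries are hypothetical, and any other markers a file may carry are left untouched.

```python
def merge_inflected_lists(us_entries, british_entries):
    """Merge two word lists, dropping entries from the first that end in '%'."""
    stripped = (entry.strip() for entry in us_entries)
    # Discard blank lines and the plurals marked as unusual.
    us_words = {e for e in stripped if e and not e.endswith("%")}
    british_words = {e.strip() for e in british_entries if e.strip()}
    return sorted(us_words | british_words)

# Hypothetical sample entries, for illustration only.
us_sample = ["cat", "cats", "zeal", "zeals%"]
british_sample = ["colour", "colours", "cat"]
merged = merge_inflected_lists(us_sample, british_sample)
```

With real files, the two arguments could simply be open file objects, since iterating over a file yields its lines.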
I created the 5desk list in an attempt to do a better /usr/dict/words (about which I offer many harsh criticisms elsewhere in this document). The sorts of words admitted are the same sorts that /usr/dict/words contains. Though somewhat larger in size than most versions of /usr/dict/words, this is still a short word list, striving for inclusion of words one is likely to encounter rather than the complete jargon of every possible scientific, artistic or occult endeavor.
5desk was assembled primarily from five "desk dictionaries". It was augmented by words from five minor sources, including a "vocabulary builder" and a collection of proper names. The list excludes prefixes, suffixes, phrases, hyphenated words, contractions and most abbreviations and acronyms. There was no requirement for multiple listings; all qualifying words from each of the sources were included. Inflections of included words were not included themselves except when irregular, or separately defined. Variant and non-American spellings were not excluded, and no signature words were added.
Words commonly considered to be abbreviations/acronyms were included if they contained at least one upper case character, and were defined with an explicit part of speech. This excluded items like Mr and Feb, which are abbreviations in the classic sense, but allowed words like DNA and ATM, which are used far more frequently than that which they abbreviate. While there is a trend in modern dictionaries to list such words as nouns (or occasionally verbs, adverbs, etc.), it is a trend in progress, and rather inconsistently applied. For this reason, the set of such words in the 5desk list is somewhat incoherent, including SPCA but not PETA, AIDS but not SIDS, KGB but not CIA, and PDQ but not ASAP.
One class of commonly-used words is regrettably absent from the 5desk list, because I was unable to find a satisfactory source for them. This is the class of commercial names such as Exxon, Tylenol, Pepsi and Chevy. This is probably forgivable, as this class of names is as ephemeral and transitory as teenage slang. The one-time household words Kool, Ovaltine, Philco and Ipana serve now only as answers to trivia questions, with modern wonders like Starbucks, Google, Ritalin and TiVo taking their place on the tongues of the trendy.
The 5desk list has clearly moved beyond any "core vocabulary" concept. It includes quite esoteric words (ogee, pleonastic), very uncommon spellings (thiamine, yuppy), and obscure geographical and historical names (Paricutin, Nevelson). Like /usr/dict/words, it is frequently inconsistent and arbitrary, but I hope at the least I have avoided including spelling errors, and overlooking the stuff of everyday conversation. Perhaps it will be useful as a compromise between basic lists such as 3esl, and truly massive lists like Mendel Cooper's ENABLE.
It may have occurred to some to wonder about how something like the n-dicts project came to be (though I assume that anyone who bothers to download this archive must already have some idea that such a project could be of interest).
Some years ago, there was a post to the sci.crypt Usenet newsgroup, on the subject of creating PGP passphrases using randomly selected entries from a supplied list of very short words. (If this sounds interesting, follow this link for an expanded version of the post.) The word list, which was extracted from /usr/dict/words on some UNIX system, seemed to me ill-suited to its intended purpose. It included arcane acronyms (bstj, fmc), misspellings (diety, ouvre) and words of amazing obscurity (bhoy, kombu). I decided I could do better (and eventually did). This caused me to start downloading English word lists, of which there are many, from the Internet. I was not impressed by the overall quality of these lists, and the few which were high-quality were all-inclusive, burying the everyday words under a mountain of archaisms and esoterica. The flaws of the vast majority of these lists are worth recounting:
One result of my frustration with this situation was my working with Mendel Cooper on ENABLE (for further information, check out this link), which was close to unique in having an active caretaker, one clearly concerned with quality, and in being oriented towards American rather than British English. But ENABLE is an all-encompassing list and, even if it had been complete at the time I started my search for a list of common words, it would not have been what I wanted for that reason.
I finally decided that only starting from scratch with a systematic approach was likely to get me what I was looking for, and that dictionaries intended for non-native speakers of English were the best possible source for words that are in some cases so familiar that we never think of them. This has led to the 12dicts lists, which I hope have managed to avoid the flaws recited above. (I should acknowledge one form of inconsistency exhibited by the 12dicts lists, which is that sometimes related words are spelled inconsistently. For instance, the 2of12 list contains both broadminded and broad-mindedness. This generally occurs as a result of the methodology used to build the lists. In the case of broadminded, only one source dictionary listed broadmindedness, which was therefore excluded. I felt unequal to trying to correct these inconsistencies, some of which are real and not mere artifacts of 12dicts, such as the contrast between self-conscious and unselfconscious.)
When I released the first version of 12dicts in 1999, I assumed I was done with it. It hasn't worked out that way. Before I declare it finished for a second time, there are a few more things I'd like to accomplish.
The 12dicts lists were compiled by Alan Beale. I explicitly release them to the public domain, but request acknowledgment of their use. (Actually, the dependency of the 2of12inf list on AGID prevents its release into the public domain. However, I do not impose any additional requirements on its use beyond those imposed by AGID and its sources, as described in agid.txt.) Feel free to send comments, suggestions, inquiries and/or large sums of money to me at firstname.lastname@example.org. If you find 12dicts useful, I'd love to hear about it.