CEL

Common English Lexicon (CEL), a word list with common words, for word games.

Background

What is the CEL?
- The CEL stands for Common English Lexicon, created by Eric Smith (Fj00) and Kenji Matsumoto (strataji). The CEL focuses on the most common words that exist in the international Scrabble dictionary (CSW). As such, the CEL is about 1/4 the size of CSW, with just over 69,200 words in total and over 35,900 words under 9 letters. Additionally, CEL now contains just under 1,100 words from 16-45 letters.
How was the CEL compiled?
- The CEL was compiled by counting the number of times specific words were used in Internet articles, books, webpages, and Google searches. Those that made it past pre-specified thresholds had made it into CEL, as well as related words necessary for consistency.
What does it mean for a word to be "common"? Why are there still some weird words in here?
- We wanted the CEL to be an inclusive, diverse dictionary that reflected people throughout the world with a variety of interests and expertise. As such, we included entries that may span beyond basic vocabulary, and included various plants, animals, foods, scientific terms, etc. that might not be part of even the most educated person's vocabulary. Such terms are supported by millions of Google hits and sufficient usage in books, internet articles, etc. but may not be familiar to the majority of players using the CEL. Knowing all entries in the CEL would require an absurd amount of information about almost every subject.
What does ________ mean?
- Some words may not be initially intuitive. For example, the word 'mac' might not look like a word to many people, and it may confuse people that 'macs' is not a valid string, but will make sense once they understand the intended definition (short for macaroni, i.e. mac and cheese).
- All words in CEL have common definitions. The overwhelming number of words can be defined through dictionary or google, although in a few cases, using an advanced search in one of our sources may help players looking to define an entry.
Why are there so few -S/-IER/-ED/RE- etc. words?
- Scrabble players may be accustomed to a lexicon with every conceivable conjugation, but we feel that many of these conjugations are unwarranted and do not represent common English. As a result, we have removed conjugations that are both uncommonly used and not unessential to the word's common definition. This may make it more difficult to play a game of Scrabble, but restricting conjugations allows us to better reflect the English language.
Why isn't (insert common word) found in CEL?
- There are a variety of reasons why a common word may not have made it into CEL. The most common reasons are:
  - It didn't make it into the Scrabble dictionary. For example, KEYCHAIN is not valid in CSW.
  - It exists primarily as a proper noun. For example, alaska did not make it into CEL, since its primary form is baked alaska, it's still often capitalized, and it's not all that common of a dessert.
  - It exists in English but primarily from another language. For example, alma (alma mater) did not make it since we deemed it as being primarily Latin.
  - It failed to get the requisite number of hits in books, webpages, or web articles. There are some words that didn't qualify that mystify us, too.

Community

CEL is currently implemented on the Woogles platform for playing OMGWords (Orthogonal Morphemes Game). https://woogles.io

Woogles is a community whose mission is to:

Create a great place to play word games online.
Create a tool that lets people of all skill levels from all over the world improve at our favorite board game.
Build the best AI our community has ever seen.

Woogles achieved a successful Kickstarter in October last year, and was created by a team led by César Del Solar, one of the top North American Scrabble players. The Woogles team is also made up of several other top players.

Getting Started

The repository contains cel_2-45.txt, the master CEL list, which has words ranging from 2-45 letters. cel_2-15.txt contains subsets of CEL by word length, from 2-15 letters. cel_16-21.txt contains subsets of CEL by word length, from 16-21 letters. cel_2-21.txt contains subsets of CEL by word length, from 2-21 letters. cel_22-45.txt contains subsets of CEL by word length, from 22-45 letters.

The 2-15 folder contains subsets of CEL by word length, from 2-15 letters. The 16-21 folder contains subsets of CEL by word length, from 16-21 letters. The a-z folder contains subsets of CEL by starting letter, from A-Z, 2-15 and 16-21. More subset lists can be generated if desired. In the future, these lists will be updated to stay consistent with cel.txt.

Methodology

The working spreadsheet is public. It currently contains 3 main tabs. CEL is the current working list from 2-15 letters, CEL 16-21 is the current working list from 16-21 letters, and CEL 22-45 is the current working list from 22-45 letters. Other tabs: 2s is just 2 letter word selection, 3s is 3 letter word selection, and movies is another list described below. The spreadsheet can be viewed at: https://docs.google.com/spreadsheets/d/1N-LfFbuuVA176f6pI8oTHYtpUGOy7t2i_ovsCTalq4A/edit?usp=sharing

The starter list chosen was 2of12, which contains a list of common words and their inflections, created by finding words that were in at least 2 out of 12 dictionaries that were selected. More information on 2of12 here: http://wordlist.aspell.net/12dicts-readme/#2of12inf

The uncountable nouns (%) in 2of12 were removed almost completely, and it was filtered by a competitive word list. This resulted in a total of 80775 words. There were less than 100 words in 2of12 that were not in the competitive word list. The list was a good start, but due to the methodology and lack of oversight, many words remained that were either obscure or unused. Thus the great purge began.

The list was audited, and initially only the 4-9 letter words and their inflections were considered. I was going to handle the 2-3 letter words at a later point. I started going through the list the first time, but it was apparent that my strataji was evolving over time, even early on. I decided to finish one pass and then make another pass to try to correct the inconsistencies. The second pass, I ended up deciding to clean up the 10-15 letter words as well, since many of them would look odd to keep in the list.

Definitions were checked on https://www.dictionary.com/, https://www.collinsdictionary.com/, and Google dictionary.

When necessary, Google search was helpful for figuring out words. Results with less than 1M hits were typically removed.

After the second pass, it was determined that the process was too subjective and there needed to be more objectivity when deciding on words. I had access to a data set containing the cleaned full text of 8.4M internet articles (34 GB of text). That data set was processed for word frequency and then filtered by the same list.

A 0-10 score for each word was computed using this formula: (10/log(1 + freq(THE)))*log(1 + freq(word))

THE had the highest frequency of any 2-15 letter word, and is used to normalize the score.

Words above a score of 3 almost always make the cut, whereas words below 2 almost always miss the cut.

Names and proper nouns can confound the scores, so those words had to be recognized and handled on a case-by-case basis.

For 2 letter words, a cutoff of 7 was able to split the very common words from the rest. For 3 letter words, a cutoff of 5 was helpful in determining how common they were.

Google Books Ngram Viewer was also helpful for 2-3 letter words. It provided a percent score for each of the words based on their frequency within books. A 0-10 score was computed from it using the following formula: 10*(log(ngram) - min(log(ngram)))/(max(log(ngram)) - min(log(ngram)))

A cutoff of 4 was used for 2s, and a cutoff of 3.5 was used for 3s.

For the most part, onomatopoeia are left out, especially for 2-3 letter words with consonant clusters. I made that determination mainly due to my PTSD from playing HM in a game with classmates in school and getting mocked for it. There are some instances of onomatopoeia which are common enough to keep, such as various animal sounds used as verbs.

Expletives, slurs and pejorative words are not in CEL. There is a good case that such words could scare off some audiences. I am maintaining a separate list that contains such words.

Comparatives and superlatives are a gray area. Internet articles usually avoid using them, and their scores are low in general, whereas conversational English would be more likely to use them. Similarly, adverbs are likely used less in writing. These might need to be handled on a case-by-case basis.

Loanwords are generally avoided, unless they have a significant degree of commonality. Historical, archaic or outmoded words are also generally avoided.

Words that are common but limited to specific fields are avoided. Some instances are words only related to high-level chemistry, physics or mathematics.

Elements 1-118 in the periodic table are all included for completeness. It would be hard to determine what counts as a common element.

Obscure medicines, medical terms and chemicals are left out. Common medicines such as ASPIRIN or IBUPROFEN are left in.

Letters from the Greek alphabet have been included for completeness, regardless of commonality. Letters from the Hebrew alphabet have been omitted. However, spellings of letters from the English alphabet have also been omitted since they aren't typically spelled out in writing. Greek letters typically are treated as uncountable nouns, however a few have auxiliary definitions that require inflections, ALPHA(S), BETA(S), DELTA(S), OMEGA(S).

Uncountable nouns have the plural omitted in almost all cases, mainly thanks to the 2of12 list having labeled them. However, there were still many more that needed to be classified. Often there is a little bit of gray area, so some plurals are included. It mostly depended on the ratio between the frequency of the singular and plural forms.

Inflections are usually all included, however oftentimes a group of inflections will have a subset that are very common, whereas one or more words are not. For example, a noun and its plural along with an adjectival form might be included, where as the present participle of the verb is excluded. Another example is where only the adjectival form and the gerund form are common.

Even though all the 9-15 letter words were examined, they are a lot less important for some games where the majority of words employed are less than 9 letters. They were left in mainly for extending 7-8 letter words using inflections, and also as an attempt to slow Nigel Richards down when he learns the dictionary.

Variants of words are only included if they are common enough. Verbs ending in L often had both single and double L inflections, and there was a lot of unmanageable inconsistency with frequency. This might be a reason to look into it making it more consistent later.

Agent nouns (-ER, -OR, -IST) are avoided, unless they are significantly common, or a profession, machine or tool.

The only major currencies included are DOLLAR(S), EURO(S), [Swiss] FRANC(S), POUND(S), YEN. PESO(S), RENMINBI, RUBLE(S), RUPEE(S), SHEKEL(S), YUAN are also included due to commonality.

This is just a disclaimer that it is impossible to be objective in every instance, and it is fine that people disagree with some of the choices that were made. Hopefully the goal is achieved that people who don't play word games competitively will enjoy playing more with a word list that contains only words they know.

Stats

Again, the initial list contained 80775 words. The number of words removed was 24688, making 56087. There were 150 additions that weren't 2 or 3 letters. The total number of 2 letter words is 65 (out of 127 competitive), and the total number of 3 letter words is 640 (out of 1347 competitive). Combining the purged list with the additions and 2-3 letter words makes 56311 words, and here is the list broken down by word length:

Length	CEL Count	Competitive Count	Percent CEL/Competitive
2	65	127	51.18%
3	640	1347	47.51%
4	2246	5638	39.84%
5	4234	12973	32.64%
6	6593	23035	28.62%
7	8657	34345	25.21%
8	8961	42153	21.26%
9	8065	42935	18.78%
10	6543	37749	17.33%
11	4541	29591	15.35%
12	2932	21506	13.63%
13	1658	14703	11.28%
14	794	9652	8.23%
15	382	6103	6.26%
Total	56311	281857	19.98%

Unsurprisingly, the proportion of words in CEL is strictly decreasing as the number of letters increase. The number of CEL words with 2-8 letters is 31396 for reference. Here is the same table for words starting with letters A-Z:

Letter	CEL Count	Competitive Count	Percent CEL/Competitive
A	3278	16315	20.09%
B	3205	15380	20.84%
C	5282	25223	20.94%
D	3467	16719	20.74%
E	2336	11397	20.5%
F	2404	10676	22.52%
G	1712	9410	18.19%
H	1997	10619	18.81%
I	2320	9684	23.96%
J	448	2315	19.35%
K	360	3411	10.55%
L	1622	8105	20.01%
M	2897	16034	18.07%
N	1121	6782	16.53%
O	1533	9017	17.0%
P	4504	24560	18.34%
Q	279	1413	19.75%
R	3825	15168	25.22%
S	6657	32194	20.68%
T	2954	14656	20.16%
U	1554	9695	16.03%
V	817	4609	17.73%
W	1466	5950	24.64%
X	13	309	4.21%
Y	161	1040	15.48%
Z	99	1176	8.42%
Total	56311	281857	19.98%

After making the first revision, the total number of words is 60133. Here is what the same charts look like:

Length	CEL Count	Competitive Count	Percent CEL/Competitive
2	63	127	49.61%
3	645	1347	47.88%
4	2296	5638	40.72%
5	4320	12973	33.3%
6	6773	23035	29.4%
7	8948	34345	26.05%
8	9431	42153	22.37%
9	8676	42935	20.21%
10	7206	37749	19.09%
11	5103	29591	17.25%
12	3328	21506	15.47%
13	1917	14703	13.04%
14	954	9652	9.88%
15	474	6103	7.77%
Total	60133	281857	21.33%

Letter	CEL Count	Competitive Count	Percent CEL/Competitive
A	3511	16315	21.52%
B	3371	15380	21.92%
C	5694	25223	22.57%
D	3679	16719	22.0%
E	2401	11397	21.07%
F	2492	10676	23.34%
G	1787	9410	18.99%
H	2097	10619	19.75%
I	2412	9684	24.91%
J	472	2315	20.39%
K	388	3411	11.37%
L	1688	8105	20.83%
M	3082	16034	19.22%
N	1186	6782	17.49%
O	1659	9017	18.4%
P	4725	24560	19.24%
Q	311	1413	22.01%
R	4161	15168	27.43%
S	7345	32194	22.81%
T	3203	14656	21.85%
U	1673	9695	17.26%
V	894	4609	19.4%
W	1614	5950	27.13%
X	10	309	3.24%
Y	167	1040	16.06%
Z	111	1176	9.44%
Total	60133	281857	21.33%

After making a second and third revision, the total number of words is 67520. Here is what the same charts look like:

Length	CEL Count	Competitive Count	Percent CEL/Competitive
2	66	127	51.97%
3	635	1347	47.14%
4	2442	5638	43.31%
5	4665	12973	35.96%
6	7407	23035	32.16%
7	9901	34345	28.83%
8	10700	42153	25.38%
9	9834	42935	22.9%
10	8140	37749	21.56%
11	5823	29591	19.68%
12	3810	21506	17.72%
13	2292	14703	15.59%
14	1185	9652	12.28%
15	620	6103	10.16%
Total	67520	281857	23.96%

Letter	CEL Count	Competitive Count	Percent CEL/Competitive
A	3808	16315	23.34%
B	3829	15380	24.9%
C	6290	25223	24.94%
D	4190	16719	25.06%
E	2685	11397	23.56%
F	2872	10676	26.9%
G	2081	9410	22.11%
H	2400	10619	22.6%
I	2645	9684	27.31%
J	527	2315	22.76%
K	488	3411	14.31%
L	1988	8105	24.53%
M	3646	16034	22.74%
N	1396	6782	20.58%
O	1850	9017	20.52%
P	5369	24560	21.86%
Q	323	1413	22.86%
R	4483	15168	29.56%
S	8016	32194	24.9%
T	3572	14656	24.37%
U	1947	9695	20.08%
V	1047	4609	22.72%
W	1739	5950	29.23%
X	11	309	3.56%
Y	184	1040	17.69%
Z	134	1176	11.39%
Total	67520	281857	23.96%

After making a fourth major revision which includes 16-45 letter words, the total number of words is 69223. Here is what the same charts look like from 2-21 letters:

Length	CEL Count	Competitive Count	Percent CEL/Competitive
2	66	127	51.97%
3	635	1340	47.39%
4	2450	5621	43.59%
5	4681	12920	36.23%
6	7421	22971	32.31%
7	9932	34270	28.98%
8	10725	42103	25.47%
9	9854	42899	22.97%
10	8166	37715	21.65%
11	5860	29577	19.81%
12	3861	21499	17.96%
13	2348	14700	15.97%
14	1330	9651	13.78%
15	806	6103	13.21%
16	544	5810	9.36%
17	290	4549	6.38%
18	113	3505	3.22%
19	63	2285	2.76%
20	39	1367	2.85%
21	14	816	1.72%
Total	69223	301578	22.95%

Letter	CEL Count	Competitive Count	Percent CEL/Competitive
A	3903	18428	21.18%
B	3880	16092	24.11%
C	6490	27658	23.47%
D	4376	17746	24.66%
E	2788	12187	22.88%
F	2891	10847	26.65%
G	2107	9702	21.72%
H	2464	11784	20.91%
I	2825	10914	25.88%
J	533	2325	22.92%
K	493	3451	14.29%
L	2006	8401	23.88%
M	3757	16941	22.18%
N	1467	8199	17.89%
O	1913	9965	19.2%
P	5532	26941	20.53%
Q	325	1453	22.37%
R	4567	15773	28.95%
S	8104	33291	24.34%
T	3643	15329	23.77%
U	2020	10910	18.52%
V	1052	4724	22.27%
W	1751	5972	29.32%
X	13	317	4.1%
Y	186	1040	17.88%
Z	137	1188	11.53%
Total	69223	301578	22.95%

ddugovic/CEL

CEL

Background

Community

Getting Started

Methodology

Stats