Grab

Question

Grab

Closed this issue 5 years ago · 1 comments

12/09/2019

Given a dictionary of words, every letter has a distribution by a probability. E.g.

ape
apple
bear

In this example dictionary, letter a has a 2/3 probability to be followed by letter p and 1/3 probability to be followed by letter r.
We also need to use the probability to start and end, e.g. There is a 2/3 probability to start with letter a and 1/3 probability to start with letter b.
There is a 2/3 probability to end after letter e and 100% probability to end after letter r.

Implement a method generateRandomWord() using the dictionary distribution above.
Note: no word length is restricted.

Answer 1 · 2019-12-10T01:52:55.000Z

Intuition:

Generate a dict of next letter and its occurrence probability for each letter including "" as the start letter placeholder. In this example, it should be:

{ 
  "": {"a": 0.667, "b": 0.333}, // start placeholder
  "a": {"p": 0.667, "r": 0.333},
  "b": {"e": 1},
  "e": {"": 0.667, "a": 0.333},
  "l": {"e": 1},
  "p": {"e": 0.333, "l": 0.333, "p": 0.333},
  "r": {"": 1} // ending placeholder
}

Use that probability dictionary to calculate the next letter, we need to implement a method: randomNextLetter(). The trickiest part is how to leverage the distribution probability dictionary to generate the next letter. We can use Reservoir Sampling to decide whether a randomed float f generated by random.random(), in the semi-open range [0.0, 1.0). Basic idea is to accumulatively sum up the probability to the previous probability sum. E.g. from "p": {"e": 0.333, "l": 0.333, "p": 0.333}, we can get "p": {"e": 0.333, "l": 0.667, "p": 1}, and then test whether the randomed float f < current_sum_of_probability.
Overall logic will be like:

def generateRandomWord(wordList):
    distDict = generateDistDict(wordList)
    curLetter = ''
    letters = [curLetter]
    while True:
        curLetter = randomNextLetter(curLetter, distDict)
        if curLetter == '':
            return ''.join('', letters)
        letters += curLetter,