tmills/uhhmm

num_tokens?

lifengjin opened this issue · 3 comments

in uhhmm.py, line 68, it says

    num_tokens = np.sum(sent_lens) - num_sents

and down below we have

        assert pos_counts == lex_counts
        if not pos_counts == num_tokens:
            logging.warn("This iteration has %d pos counts for %d tokens" % (pos_counts, num_tokens) )

Why is this the case? We didn't secretly append anything to the word sequences, right? And if the two counts can disagree, why is the lex_counts check a hard assert while the num_tokens check is only a soft warning? Shouldn't they both be assertions?
Empirically, num_tokens tells me there should only be 3000 tokens when there are 5000 words in 2000 sentences, so maybe this is a relic from an ancient era?

So it is OK to dismiss what it says?

FYI, the warning turns out to have been triggered by an accounting error introduced by one of my recent changes, related to the fact that we now temporarily append the EOS token to each sentence while incrementing counts. I just fixed this in master, but the erroneous warning will still appear in Lifeng's branch for now. It can be safely ignored.
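
For anyone hitting this later, here is a minimal sketch of the accounting mismatch described above. The variable names and the assumption that `sent_lens` includes one boundary position per sentence are hypothetical, chosen to reproduce the empirical numbers from the thread (2000 sentences, 5000 positions, 3000 reported tokens):

```python
import numpy as np

# Hypothetical sentence lengths: 2000 sentences totalling 5000 positions,
# matching the empirical example in the thread.
sent_lens = np.array([2] * 1000 + [3] * 1000)

num_sents = len(sent_lens)                  # 2000
# The formula from uhhmm.py excludes one boundary position per sentence:
num_tokens = np.sum(sent_lens) - num_sents  # 3000

# If EOS is temporarily appended to each sentence while counts are being
# incremented, the tag counts end up num_sents too large, which is what
# tripped the soft warning:
pos_counts = np.sum(sent_lens)              # 5000, EOS tokens included
assert pos_counts == num_tokens + num_sents
```

Once the EOS tokens are excluded from the count pass (as in the fix on master), `pos_counts` and `num_tokens` agree again.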