tmills/uhhmm

num_tokens?

lifengjin opened this issue · 3 comments

in uhhmm.py, line 68, it says

    num_tokens = np.sum(sent_lens) - num_sents

and down below we have

        assert pos_counts == lex_counts
        if not pos_counts == num_tokens:
            logging.warn("This iteration has %d pos counts for %d tokens" % (pos_counts, num_tokens) )

Why is this the case? We didn't secretly append anything to the word sequences, right? And if the two counts can disagree, why is the lex_counts check a hard assert while the num_tokens check is only a soft warning? Shouldn't they both be assertions?
Empirically, num_tokens tells me there should only be 3000 tokens when there are 5000 words in 2000 sentences, so maybe this is a relic from an ancient era?

So it is OK to dismiss what it says?

FYI, the warning turns out to have been triggered by an accounting error introduced by one of my recent changes, related to the fact that we now temporarily append the EOS token to each sentence while incrementing counts. I just fixed this in master, but the erroneous warning will still appear in Lifeng's branch for now. It can be safely ignored.
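
For anyone hitting this later, here is a minimal sketch of the accounting mismatch described above. The variable names and the assumption that `sent_lens` includes one boundary position per sentence are hypothetical, chosen to reproduce the empirical numbers from the thread (2000 sentences, 5000 positions, 3000 reported tokens):

```python
import numpy as np

# Hypothetical sentence lengths: 2000 sentences totalling 5000 positions,
# matching the empirical example in the thread.
sent_lens = np.array([2] * 1000 + [3] * 1000)

num_sents = len(sent_lens)                  # 2000
# The formula from uhhmm.py excludes one boundary position per sentence:
num_tokens = np.sum(sent_lens) - num_sents  # 3000

# If EOS is temporarily appended to each sentence while counts are being
# incremented, the tag counts end up num_sents too large, which is what
# tripped the soft warning:
pos_counts = np.sum(sent_lens)              # 5000, EOS tokens included
assert pos_counts == num_tokens + num_sents
```

Once the EOS tokens are excluded from the count pass (as in the fix on master), `pos_counts` and `num_tokens` agree again.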