allenai/bilm-tf

How to dump_token_embeddings?


I have trained a model on token input and deleted the 'char_cnn' parameters from the options file. Now I want to extract features from the trained ELMo, but I can't produce the 'embedding_weight_file' that the 'BidirectionalLanguageModel' constructor requires when 'use_character_inputs' is False (because I deleted 'char_cnn' and never used a char embedding in my training graph).

I tried to use the function 'dump_token_embeddings()' to create the 'embedding_weight_file', but internally this function still needs the 'char_cnn' parameters:
```python
def dump_token_embeddings(vocab_file, options_file, weight_file, outfile):
    '''
    Given an input vocabulary file, dump all the token embeddings to the
    outfile.  The result can be used as the embedding_weight_file when
    constructing a BidirectionalLanguageModel.
    '''
    with open(options_file, 'r') as fin:
        options = json.load(fin)
    max_word_length = options['char_cnn']['max_characters_per_token']

    vocab = UnicodeCharsVocabulary(vocab_file, max_word_length)
    batcher = Batcher(vocab_file, max_word_length)

    ids_placeholder = tf.placeholder(
        'int32', shape=(None, None, max_word_length)
    )
    model = BidirectionalLanguageModel(options_file, weight_file)
    ...
```
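Concretely, the very first lookup in that function fails with my options file (a minimal repro; 'options.json' is a placeholder for my options file):

```python
import json

with open('options.json', 'r') as fin:  # placeholder path
    options = json.load(fin)

# dump_token_embeddings does this unconditionally, but I deleted the
# 'char_cnn' section from options, so this raises KeyError: 'char_cnn'
max_word_length = options['char_cnn']['max_characters_per_token']
```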
Is this function only written for models with char embeddings?
And if I want to dump token embeddings, how do I need to modify it? My guess:

  1. change UnicodeCharsVocabulary to Vocabulary
  2. change Batcher to TokenBatcher
  3. change the ids_placeholder shape from (None, None, max_word_length) to (None, None)
  4. construct the BidirectionalLanguageModel with use_character_inputs=False (see the sketch after this list)
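
Here is a minimal sketch with those four changes applied; 'dump_token_embeddings_from_tokens' is just a name I made up, and the per-token loop plus HDF5 writing are carried over from the original function:

```python
import h5py
import numpy as np
import tensorflow as tf

from bilm.data import Vocabulary, TokenBatcher
from bilm.model import BidirectionalLanguageModel


def dump_token_embeddings_from_tokens(vocab_file, options_file,
                                      weight_file, outfile):
    vocab = Vocabulary(vocab_file, validate_file=True)  # 1. Vocabulary
    batcher = TokenBatcher(vocab_file)                  # 2. TokenBatcher

    # 3. token ids are 2-D (batch, time), no max_word_length axis
    ids_placeholder = tf.placeholder('int32', shape=(None, None))

    # 4. build the model on token inputs -- with the unmodified library
    # this constructor raises the ValueError described below
    model = BidirectionalLanguageModel(
        options_file, weight_file, use_character_inputs=False
    )
    embedding_op = model(ids_placeholder)['token_embeddings']

    n_tokens = vocab.size
    embed_dim = int(embedding_op.shape[2])
    embeddings = np.zeros((n_tokens, embed_dim), dtype='float32')

    config = tf.ConfigProto(allow_soft_placement=True)
    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        for k in range(n_tokens):
            token = vocab.id_to_word(k)
            # batch_sentences adds <S>/</S>; index 1 is the token itself
            token_ids = batcher.batch_sentences([[token]])[:, 1].reshape(1, 1)
            embeddings[k, :] = sess.run(
                embedding_op, feed_dict={ids_placeholder: token_ids}
            )

    with h5py.File(outfile, 'w') as fout:
        fout.create_dataset(
            'embedding', embeddings.shape, dtype='float32', data=embeddings
        )
```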

But if I create the BidirectionalLanguageModel without 'char_cnn', it requires 'embedding_weight_file' to be non-None... This is a deadlock: I need the model to produce the embedding file, and the embedding file to build the model.

OK, I have fixed this problem by modifying the code of BidirectionalLanguageModel.
Comment out the check below in BidirectionalLanguageModel.__init__:

```python
# if not use_character_inputs:
#     if embedding_weight_file is None:
#         raise ValueError(
#             "embedding_weight_file is required input with "
#             "not use_character_inputs"
#         )
```

and change which file _pretrained_initializer reads the embedding from (embedding_weight_file -> weight_file):

```python
def _pretrained_initializer(varname, weight_file, embedding_weight_file=None):
    ...
    if varname_in_file == 'embedding':
        # was: with h5py.File(embedding_weight_file, 'r') as fin:
        with h5py.File(weight_file, 'r') as fin:
```

and modify BidirectionalLanguageModelGraph; previously the else branch set self._n_tokens_vocab to None:

```python
if embedding_weight_file is not None:
    ...
else:
    # was: self._n_tokens_vocab = None
    # +1 for padding
    self._n_tokens_vocab = options['n_tokens_vocab'] + 1
```
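
With these three patches, the flow I use looks like this (file names are placeholders, and dump_token_embeddings_from_tokens is the sketch from above):

```python
import tensorflow as tf

from bilm.model import BidirectionalLanguageModel

# 1. dump the token embeddings once, using the patched model code
dump_token_embeddings_from_tokens(
    'vocab.txt', 'options.json', 'weights.hdf5', 'token_embeddings.hdf5'
)

tf.reset_default_graph()

# 2. the dumped file can then serve as the embedding_weight_file that the
#    original (unpatched) code requires, as in the repo's usage_token.py
token_ids = tf.placeholder('int32', shape=(None, None))
model = BidirectionalLanguageModel(
    'options.json', 'weights.hdf5',
    use_character_inputs=False,
    embedding_weight_file='token_embeddings.hdf5',
)
ops = model(token_ids)  # ops['lm_embeddings'] holds the biLM layers
```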