google/patents-public-data

BERT for Patents yields a 1024-element array, but embedding_v1 is 64 elements


How should I generate an embedding equivalent to embedding_v1? BERT for Patents produces a 1024-element embedding, but embedding_v1 is a 64-element embedding.

The model used to generate embedding_v1 has not been released, and we also haven't released patents pre-embedded with the BERT model in BigQuery.

You could experiment with learning a mapping from BERT to embedding_v1 with a linear layer; the two should line up reasonably well, since both are derived from the patent text. embedding_v1 is a set-of-words unigram model. A rough sketch of that mapping is below.
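
A minimal sketch of the linear-mapping idea, assuming you already have paired vectors for the same patents: 1024-d BERT for Patents embeddings on one side and the 64-d embedding_v1 vectors pulled from BigQuery on the other. The variable names and the random stand-in data are illustrative only, not part of this repo.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Stand-in data; replace with real paired embeddings for the same patents.
rng = np.random.default_rng(0)
bert_vecs = rng.normal(size=(10_000, 1024)).astype(np.float32)  # BERT for Patents
v1_vecs = rng.normal(size=(10_000, 64)).astype(np.float32)      # embedding_v1

X_train, X_test, y_train, y_test = train_test_split(
    bert_vecs, v1_vecs, test_size=0.2, random_state=0)

# Ridge regression is effectively a single linear layer with L2 regularization,
# and it handles the 64-dimensional multi-output target directly.
mapper = Ridge(alpha=1.0)
mapper.fit(X_train, y_train)

# Project held-out BERT embeddings into the 64-d embedding_v1 space.
pred = mapper.predict(X_test)

# Cosine similarity between predicted and true embedding_v1 vectors.
cos = np.sum(pred * y_test, axis=1) / (
    np.linalg.norm(pred, axis=1) * np.linalg.norm(y_test, axis=1))
print("mean cosine similarity:", cos.mean())
```

If cosine similarity on held-out patents is high, the learned projection can be applied to new BERT embeddings to approximate embedding_v1; a small neural layer trained with a cosine loss would be the next thing to try if a purely linear map falls short.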

Can you give some insight into how you dealt with the limited window size for BERT?
E.g., did you choose between the abstract, full patent text, etc.? Pool things? Something else?
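
For reference, one common way to handle the 512-token limit (a generic sketch, not necessarily what was done for the released BERT for Patents model) is to split long patent text into overlapping windows, embed each window, and mean-pool. The checkpoint name below is the community-hosted BERT for Patents on Hugging Face; swap in whichever checkpoint you actually use.

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "anferico/bert-for-patents"  # assumed checkpoint, for illustration
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed_long_text(text, max_length=512, stride=128):
    # Tokenize into overlapping windows so no part of the text is dropped.
    enc = tokenizer(
        text,
        max_length=max_length,
        stride=stride,
        truncation=True,
        padding=True,
        return_overflowing_tokens=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        out = model(input_ids=enc["input_ids"],
                    attention_mask=enc["attention_mask"])
    # Mean-pool tokens within each window, then average across windows.
    mask = enc["attention_mask"].unsqueeze(-1)
    window_vecs = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
    return window_vecs.mean(0)  # single vector for the whole document
```

Whether to feed the abstract, claims, or full text (and whether to pool or pick one section) is a separate choice from the windowing itself.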

Thanks for that quick response.
This repo is a great resource.

This repo is great. Thank you! Any plans to release the model that generated embedding_v1 or the BERT pre-embedded patents?