google/patents-public-data

BERT for Patents yields a 1024-element array, but embedding_v1 is 64 elements


How should I generate an embedding equivalent to embedding_v1? BERT for Patents produces a 1024-element embedding, but embedding_v1 is a 64-element embedding.

The model used to generate embedding_v1 has not been released, and we also haven't released patents pre-embedded with the BERT model in BigQuery.

You could experiment with learning a mapping from BERT to embedding_v1 with a linear layer; the two should line up reasonably well, since both are derived from the patent text. embedding_v1 is a set-of-words unigram model. A rough sketch of that mapping is below.
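
A minimal sketch of the linear-mapping idea, assuming you already have paired vectors for the same patents: 1024-d BERT for Patents embeddings on one side and the 64-d embedding_v1 vectors pulled from BigQuery on the other. The variable names and the random stand-in data are illustrative only, not part of this repo.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Stand-in data; replace with real paired embeddings for the same patents.
rng = np.random.default_rng(0)
bert_vecs = rng.normal(size=(10_000, 1024)).astype(np.float32)  # BERT for Patents
v1_vecs = rng.normal(size=(10_000, 64)).astype(np.float32)      # embedding_v1

X_train, X_test, y_train, y_test = train_test_split(
    bert_vecs, v1_vecs, test_size=0.2, random_state=0)

# Ridge regression is effectively a single linear layer with L2 regularization,
# and it handles the 64-dimensional multi-output target directly.
mapper = Ridge(alpha=1.0)
mapper.fit(X_train, y_train)

# Project held-out BERT embeddings into the 64-d embedding_v1 space.
pred = mapper.predict(X_test)

# Cosine similarity between predicted and true embedding_v1 vectors.
cos = np.sum(pred * y_test, axis=1) / (
    np.linalg.norm(pred, axis=1) * np.linalg.norm(y_test, axis=1))
print("mean cosine similarity:", cos.mean())
```

If cosine similarity on held-out patents is high, the learned projection can be applied to new BERT embeddings to approximate embedding_v1; a small neural layer trained with a cosine loss would be the next thing to try if a purely linear map falls short.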

Can you give some insight into how you dealt with the limited window size for BERT?
E.g., did you choose between the abstract, full patent text, etc.? Pool things? Something else?
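
For reference, one common way to handle the 512-token limit (a generic sketch, not necessarily what was done for the released BERT for Patents model) is to split long patent text into overlapping windows, embed each window, and mean-pool. The checkpoint name below is the community-hosted BERT for Patents on Hugging Face; swap in whichever checkpoint you actually use.

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "anferico/bert-for-patents"  # assumed checkpoint, for illustration
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed_long_text(text, max_length=512, stride=128):
    # Tokenize into overlapping windows so no part of the text is dropped.
    enc = tokenizer(
        text,
        max_length=max_length,
        stride=stride,
        truncation=True,
        padding=True,
        return_overflowing_tokens=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        out = model(input_ids=enc["input_ids"],
                    attention_mask=enc["attention_mask"])
    # Mean-pool tokens within each window, then average across windows.
    mask = enc["attention_mask"].unsqueeze(-1)
    window_vecs = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
    return window_vecs.mean(0)  # single vector for the whole document
```

Whether to feed the abstract, claims, or full text (and whether to pool or pick one section) is a separate choice from the windowing itself.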

Thanks for that quick response.
This repo is a great resource.

This repo is great. Thank you! Any plans to release the model that generated embedding_v1 or the BERT pre-embedded patents?