This is the 4th place solution for the USPTO - Explainable AI for Patent Professionals Kaggle competition.
- detailed write-up: https://www.kaggle.com/competitions/uspto-explainable-ai/discussion/522200
- submission code (with magic): https://www.kaggle.com/code/ryotayoshinobu/uspto-4th-place-solution-w-magic-lb0-98
- submission code (without magic): https://www.kaggle.com/code/ryotayoshinobu/uspto-4th-place-solution-w-o-magic-lb0-91
- CPU: Intel Core i9 13900KF (24 cores, 32 threads)
- GPU: NVIDIA GeForce RTX 4090
- RAM: 64GB
- WSL2 (version 2.0.9.0, Ubuntu 22.04.2 LTS)
Please check the Dockerfile in /kaggle/.devcontainer.
- put the competition data in
input/uspto-boolean-search-optimization
- ex. input/uspto-boolean-search-optimization/test.csv
- put the whoosh wheel in
input/whoosh-wheel-2-7-4
- ex. input/whoosh-wheel-2-7-4/Whoosh-2.7.4-py2.py3-none-any.whl
- save tokenized text data
input/all-index/gen.ipynb
input/all-index-per-patent/split.ipynb
- Too many files in one directory may cause errors along the way. In that case, run it again (the remaining files will be saved in a separate directory).
input/all-index-per-patent/split/gen_db.ipynb
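The actual tokenization lives in the notebooks above; as a rough illustration only, saving tokenized text per patent might look like the sketch below (the regex tokenizer, the field names, and the one-JSON-file-per-patent layout are my assumptions, not the notebook's actual code):

```python
import json
import re
from pathlib import Path

def tokenize(text: str) -> list[str]:
    # Lowercase word tokenizer; the real notebooks may use a different
    # analyzer (e.g. Whoosh's), so this is only illustrative.
    return re.findall(r"[a-z0-9]+", text.lower())

def save_tokenized(patents: dict[str, str], out_dir: Path) -> None:
    # One JSON token list per patent, keyed by publication number.
    out_dir.mkdir(parents=True, exist_ok=True)
    for pub_number, text in patents.items():
        (out_dir / f"{pub_number}.json").write_text(json.dumps(tokenize(text)))
```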
- count cpc frequency
input/cpc-counts/cpc_freq.ipynb
- count token frequency
input/token-counts/word_freq.ipynb
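The two frequency-counting steps above boil down to one pass with a counter; a minimal sketch, assuming each record carries `cpc_codes` and `tokens` fields (both names are assumptions):

```python
from collections import Counter

def count_frequencies(records: list[dict]) -> tuple[Counter, Counter]:
    # One pass over the corpus, counting every CPC code and every token.
    # These counts feed the rare-token selection in a later step.
    cpc_freq: Counter = Counter()
    token_freq: Counter = Counter()
    for rec in records:
        cpc_freq.update(rec["cpc_codes"])
        token_freq.update(rec["tokens"])
    return cpc_freq, token_freq
```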
- generate DB for mapping cpc to patents
input/cpc2patents/save_db.ipynb
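A sketch of what such a cpc-to-patents mapping could look like, using the stdlib sqlite3 module as a stand-in for whatever database the notebook actually builds (table layout and field names are assumptions):

```python
import json
import sqlite3
from collections import defaultdict

def build_cpc2patents(records: list[dict], db_path: str) -> None:
    # Invert patent -> cpc_codes into cpc -> [patents], then persist the
    # mapping so a query's CPC codes can be resolved to candidate patents.
    mapping = defaultdict(list)
    for rec in records:
        for cpc in rec["cpc_codes"]:
            mapping[cpc].append(rec["publication_number"])
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS cpc2patents (cpc TEXT PRIMARY KEY, patents TEXT)"
    )
    con.executemany(
        "INSERT OR REPLACE INTO cpc2patents VALUES (?, ?)",
        ((cpc, json.dumps(pats)) for cpc, pats in mapping.items()),
    )
    con.commit()
    con.close()
```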
- generate DB for mapping rare_token to patents
input/rare-tokens/gen.ipynb
input/rare-tokens/save_db.ipynb
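"rare_token" presumably means a token whose corpus frequency falls below some threshold; a hedged sketch of selecting such tokens from the counts produced earlier and mapping them to patents (the cutoff value and field names are assumptions):

```python
from collections import Counter, defaultdict

def rare_token2patents(records: list[dict], token_freq: Counter,
                       max_freq: int = 5) -> dict[str, list[str]]:
    # A token seen in very few documents is highly discriminative, so a
    # direct token -> patents mapping lets a query containing it retrieve
    # candidates without scanning the full index. The cutoff of 5 is a
    # made-up illustration, not the value used in the solution.
    rare = {t for t, c in token_freq.items() if c <= max_freq}
    mapping = defaultdict(list)
    for rec in records:
        for tok in set(rec["tokens"]) & rare:
            mapping[tok].append(rec["publication_number"])
    return dict(mapping)
```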
- generate DB for mapping patent to cpcs
input/patent2cpc/save_db.ipynb
- complete all six steps above before proceeding to the steps below
- generate .json for mapping cpc to patents
input/cpc-mapping/cpc_mapping.ipynb
- generate DB for (cpc, token) to patents
input/preprocess-complete/save_cpc_token_lists.ipynb
input/preprocess-complete/save_patent_wise.ipynb
input/preprocess-complete/split/gen_leveldb.ipynb
input/preprocess-complete-v2/save_cpc_token_lists.ipynb
input/preprocess-complete-v2/save_patent_wise.ipynb
input/preprocess-complete-v2/split/gen_leveldb.ipynb
- this step will take at least 2 weeks in my environment.
- save_cpc_token_lists may stop due to OOM or other reasons. In that case, please execute it again (it will resume from where it left off).
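A sketch of the (cpc, token)-to-patents layout, using the stdlib dbm module as a stand-in for the LevelDB the notebooks generate (the tab-separated composite key is an assumption). Writing key-by-key and skipping keys that already exist is one way to get the resume-after-OOM behavior described above:

```python
import dbm
import json

def save_cpc_token_lists(records: list[dict], db_path: str) -> None:
    # Key is "cpc\ttoken"; value is the JSON list of patents carrying both
    # that CPC code and that token. Skipping keys that already exist makes
    # a re-run after a crash pick up roughly where it stopped.
    mapping: dict[str, list[str]] = {}
    for rec in records:
        for cpc in rec["cpc_codes"]:
            for tok in set(rec["tokens"]):
                mapping.setdefault(f"{cpc}\t{tok}", []).append(
                    rec["publication_number"]
                )
    with dbm.open(db_path, "c") as db:
        for key, pats in mapping.items():
            if key not in db:
                db[key] = json.dumps(pats)
```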
- generate DB for publication_number (row id) to token
input/preprocess-all-token-single/save.ipynb
input/preprocess-all-token-single/gen_leveldb.ipynb
NOTE:
- To avoid OOM, free memory when each notebook is finished executing.
- The submission notebooks will delete all data in /kaggle/working when they are executed.
- Sufficient disk space is required.
  - At least 2TB is required.
  - If you do not have enough space, please delete unnecessary intermediate files as appropriate.
- A Dockerfile is used instead of B4. requirements.txt.
- src/config.yaml is used instead of B6. SETTINGS.json.
- There is no B7. Serialized copy of the trained model.
- B8. entry_points.md is not included because all my code is in .ipynb format.