This is the official repo for GROWN+UP and its accompanying benchmarks, published in the CIKM '22 proceedings. The latest preprint can be found on arXiv.
The hardware / software requirements are:
- Ubuntu 18.04 or newer
- Docker 19.03 or newer with GPU support
- (Optional but recommended) NVIDIA GPU with CUDA 11.2.1 support (GPU driver version: >=460.32.03)
  - Typically, you don't need to worry about installing the CUDA libraries if you use Docker
- Git LFS
  - This is super important and is required when you clone / pull from this repo, since some large data (e.g., pretrained weights) is stored in LFS.
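The requirements above can be sanity-checked before building anything. The following is only a minimal sketch and not part of this repo: the helper names (`version_at_least`, `is_lfs_pointer`, `missing_tools`) are hypothetical, and it assumes the Git LFS executable is available on `PATH` as `git-lfs`. It compares dotted version strings against the minimums listed above and detects files that are still Git LFS pointer stubs, which is what you get for the pretrained weights if you clone without LFS.

```python
import re
import shutil

# Minimum versions copied from the requirements list above.
MIN_DOCKER = "19.03"
MIN_NVIDIA_DRIVER = "460.32.03"

# Every Git LFS pointer file begins with this line (Git LFS pointer spec).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def parse_version(text):
    """Turn a leading dotted version such as '460.32.03' into (460, 32, 3)."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"not a dotted version: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def version_at_least(found, minimum):
    """Return True if dotted version `found` is >= `minimum`."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '19.03' and '19.03.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def is_lfs_pointer(path):
    """Return True if `path` is an un-fetched Git LFS pointer stub."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def missing_tools(tools=("git", "git-lfs", "docker")):
    """Return the required executables that are not found on PATH.

    The executable names are assumptions; adjust them for your system.
    """
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    print("Missing tools:", missing_tools() or "none")
    # Replace the example strings with the versions reported on your machine.
    print("Docker OK:", version_at_least("20.10.7", MIN_DOCKER))
    print("Driver OK:", version_at_least("460.32.03", MIN_NVIDIA_DRIVER))
```

If `is_lfs_pointer` returns `True` for a weights file, running `git lfs pull` inside the clone should replace the stub with the real data.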
The top level folder structure of this repo consists of:
- boilerplate-removal: Webpage boilerplate removal benchmarks for GROWN+UP as well as other baselines mentioned in the paper.
- genre-classification: Webpage genre classification benchmarks for GROWN+UP.
- klassterfork: A subset of an ML framework containing GROWN+UP model components and other ML training necessities to reproduce results, built on TensorFlow v2.5.
- pre-training: TODO
For more details, please consult the README.md in the appropriate folders.
To cite, please use this BibTeX:
@inproceedings{10.1145/3511808.3557340,
author = {Yeoh, Benedict and Wang, Huijuan},
title = {GROWN+UP: A ``Graph Representation Of a Webpage'' Network Utilizing Pre-Training},
year = {2022},
isbn = {9781450392365},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3511808.3557340},
doi = {10.1145/3511808.3557340},
booktitle = {Proceedings of the 31st ACM International Conference on Information \& Knowledge Management},
pages = {2372–2382},
numpages = {11},
keywords = {web genre classification, webpage, boilerplate removal, feature extractor, self-supervised, graph neural network, backbone, pre-training},
location = {Atlanta, GA, USA},
series = {CIKM '22}
}