NVlabs/RADIO

What is the rtx-translate adaptor?

javiabellan opened this issue · 3 comments

My question is: what is rtx-translate, and how can it be useful?

Steps to reproduce:

import torch

# Load RADIO v2 with several adaptor heads attached
radioModel = torch.hub.load('NVlabs/RADIO', 'radio_model', version='radio_v2', progress=True,
                            adaptor_names=['clip', 'openai_clip', 'dino_v2', 'sam', 'rtx-translate'])

inp = torch.rand(1, 3, 256, 256)
out = radioModel(inp)

# Each entry maps an adaptor name to a (summary, spatial_features) pair
for out_name, (summary, features) in out.items():
    print(f"{out_name:<10}\t{summary.shape}\t{features.shape}")

This prints:

backbone  	torch.Size([1, 2560])	torch.Size([1, 256, 1280])
clip      	torch.Size([1, 1024])	torch.Size([1, 256, 1280])
openai_clip	torch.Size([1, 768])	torch.Size([1, 256, 1024])
dino_v2   	torch.Size([1, 1536])	torch.Size([1, 256, 1536])
sam       	torch.Size([1, 1280])	torch.Size([1, 256, 1280])
rtx-translate	torch.Size([1, 128])	torch.Size([1, 256, 2048])

UPDATE

Looking at the adaptor config, I can see some OCR datasets:

{'type': 'rtx_translate',
'name': 'rtx-translate',
'model': 'quality',
'feature_distillation': True,
'fd_normalize': False,
'fd_loss_fn': 'MSE',
'input_size': 1024,
'use_summary': False,
'fd_ohem': True,
'amp': True,
'data_dir': [
    ['/lustre/fsw/portfolios/llmservice/projects/llmservice_nlp_fm/datasets/ocr/publaynet/webdataset', 0.4], 
    ['/lustre/fsw/portfolios/llmservice/projects/llmservice_nlp_fm/datasets/ocr/staging/arxiv/hocr', 0.4], 
    ['/lustre/fsw/portfolios/llmservice/projects/llmservice_nlp_fm/datasets/ocr/scene-text/scene-text/text_ocr/webdataset', 0.15],
    ['/lustre/fsw/portfolios/llmservice/projects/llmservice_nlp_fm/datasets/ocr/scene-text/scene-text/hiertext/webdataset', 0.05]
],
'batch_size': 2,
'sample_rate': 2,
'summary_loss_weight': 1e-05,
'fd_loss_weight': 0.13,
'vitdet_prob': 0.99,
'vitdet_window_sizes': [8, 16, 16],
'student_resolution': 1024,
'fd_upsample_factor': 4} 

New question: how can I use the rtx_translate features to do OCR?

Hello,

So it's an internal OCR model that we have, and RADIO was trained to match its intermediate features. We don't have an integration API for it, but the idea is that it helps the backbone explicitly model text features. We have unpublished results suggesting that, at high resolution (>= 1024), RADIOv2 does indeed capture really strong text features. The way you'd use it is by connecting the backbone to some other OCR system and training at least the non-backbone part of that model. It should be compatible with the usual suspects, such as Faster-RCNN, or even with a transformer decoder that reads out text in a manner similar to Pix2Struct.
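
Here's a rough, untested sketch of that second idea: freeze the RADIO backbone and train a small decoder head that cross-attends to its spatial features. The vocabulary size, decoder dimensions, and the head itself are illustrative placeholders and not part of RADIO; only the torch.hub.load call and the (summary, features) output come from the repo.

import torch
import torch.nn as nn

radio = torch.hub.load('NVlabs/RADIO', 'radio_model', version='radio_v2', progress=True)
radio.eval()
for p in radio.parameters():
    p.requires_grad_(False)  # keep the backbone frozen; only the head gets trained

VOCAB_SIZE = 1000   # hypothetical character/token vocabulary
D_MODEL = 512       # hypothetical decoder width
FEAT_DIM = 1280     # backbone feature width for radio_v2 (see the shapes above)

class TextReadoutHead(nn.Module):
    """Tiny Pix2Struct-style decoder that cross-attends to backbone features."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(FEAT_DIM, D_MODEL)
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerDecoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, features, tokens):
        memory = self.proj(features)             # [B, N, D_MODEL]
        tgt = self.embed(tokens)                 # [B, T, D_MODEL]
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(out)                 # [B, T, VOCAB_SIZE]

head = TextReadoutHead()
img = torch.rand(1, 3, 1024, 1024)               # high resolution is where text features get strong
with torch.no_grad():
    summary, features = radio(img)               # features: [B, N, FEAT_DIM]
tokens = torch.randint(0, VOCAB_SIZE, (1, 32))   # placeholder target text tokens
logits = head(features, tokens)                  # train with cross-entropy against shifted targets

Only the head's parameters receive gradients here, which is what "training at least the non-backbone part" means in practice.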

OK, thanks for the response. So the 2048-dim features from the rtx-translate head were only for OCR training purposes, and therefore I can't use that head for inference, right?

Yeah, I don't think you'll get very much use out of them. The backbone features should have pretty strong OCR priors though, so if you're looking to do that sort of thing, give that a try.
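
For example, something along these lines (untested sketch; the 16-pixel patch size is inferred from the shapes above, where a 256x256 input produces 256 tokens, i.e. a 16x16 grid):

import torch

radio = torch.hub.load('NVlabs/RADIO', 'radio_model', version='radio_v2', progress=True)
radio.eval()

img = torch.rand(1, 3, 1024, 1024)          # >= 1024 is where the text features are strongest
with torch.no_grad():
    summary, features = radio(img)          # features: [1, 4096, 1280]

patch = 16                                   # assumed patch size
h, w = img.shape[-2] // patch, img.shape[-1] // patch
fmap = features.permute(0, 2, 1).reshape(1, -1, h, w)   # [1, 1280, 64, 64]
# fmap can now stand in for a CNN backbone stage in a detection/recognition head.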