Unable to use TPU on GKE using on-demand quota
samos123 commented
Currently, axlearn always adds one of two nodeSelectors: either one for spot (cloud.google.com/gke-spot=true) or one for a reservation (cloud.google.com/reservation-name):
    if tier == "0" and cfg.reservation is not None:
        logging.info("Found tier=%s in env. Using reservation=%s", tier, cfg.reservation)
        selector.update({"cloud.google.com/reservation-name": cfg.reservation})
    else:
        logging.info("Found tier=%s in env. Using spot quota", tier)
        selector.update({"cloud.google.com/gke-spot": "true"})
        tolerations.append(
            {
                "key": "cloud.google.com/gke-spot",
                "operator": "Equal",
                "value": "true",
                "effect": "NoSchedule",
            }
        )
It should be possible to launch a job on on-demand TPUs. Today that is not possible unless you remove this line:
    selector.update({"cloud.google.com/gke-spot": "true"})
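
One possible direction, as a rough sketch only: make spot opt-in, so that a job with no reservation and no spot request gets neither selector. The function name `build_scheduling` and the `enable_spot` flag below are hypothetical and are not existing axlearn options; the label keys are the ones from the snippet above.

```python
from typing import Any, Dict, List, Optional, Tuple


def build_scheduling(
    tier: Optional[str],
    reservation: Optional[str],
    enable_spot: bool,  # Hypothetical flag, not an existing axlearn setting.
) -> Tuple[Dict[str, str], List[Dict[str, Any]]]:
    """Returns (node_selector, tolerations) for reserved, spot, or on-demand capacity."""
    selector: Dict[str, str] = {}
    tolerations: List[Dict[str, Any]] = []
    if tier == "0" and reservation is not None:
        # Reserved capacity: pin the pod to the named reservation.
        selector["cloud.google.com/reservation-name"] = reservation
    elif enable_spot:
        # Spot capacity: select spot nodes and tolerate the spot taint.
        selector["cloud.google.com/gke-spot"] = "true"
        tolerations.append(
            {
                "key": "cloud.google.com/gke-spot",
                "operator": "Equal",
                "value": "true",
                "effect": "NoSchedule",
            }
        )
    # Otherwise: on-demand. Neither selector is added, which matches the
    # effect of removing the gke-spot selector line by hand.
    return selector, tolerations
```

With this shape, `build_scheduling("1", None, enable_spot=False)` returns an empty selector and no tolerations, while the existing reservation and spot behaviour is kept for the other two cases.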