apple/axlearn

Unable to use TPU on GKE using on-demand quota

Currently axlearn adds a nodeSelector either for spot (cloud.google.com/gke-spot=true) or for a reservation (cloud.google.com/reservation-name), so every job is pinned to one of the two:

        if tier == "0" and cfg.reservation is not None:
            logging.info("Found tier=%s in env. Using reservation=%s", tier, cfg.reservation)
            selector.update({"cloud.google.com/reservation-name": cfg.reservation})
        else:
            logging.info("Found tier=%s in env. Using spot quota", tier)
            selector.update({"cloud.google.com/gke-spot": "true"})
            tolerations.append(
                {
                    "key": "cloud.google.com/gke-spot",
                    "operator": "Equal",
                    "value": "true",
                    "effect": "NoSchedule",
                }
            )

It should be possible to launch a job on on-demand TPUs. Today that is not possible unless you remove this line from the else branch:

        selector.update({"cloud.google.com/gke-spot": "true"})
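
One possible shape for a fix, as a minimal sketch rather than a proposed patch: add a third branch that sets neither the spot selector nor the reservation selector. The function name `build_scheduling` and the `on_demand` flag are hypothetical and do not exist in axlearn; only the label keys and the tier/reservation logic are taken from the snippet above. Whether on-demand node pools need any additional selector is not covered here.

```python
from typing import Optional


def build_scheduling(
    tier: Optional[str],
    reservation: Optional[str],
    on_demand: bool = False,  # Hypothetical flag, not an existing axlearn option.
) -> tuple[dict, list]:
    """Returns (node_selector, tolerations) for a TPU pod spec."""
    selector: dict = {}
    tolerations: list = []
    if tier == "0" and reservation is not None:
        # Same as today: pin the pod to the named reservation.
        selector["cloud.google.com/reservation-name"] = reservation
    elif on_demand:
        # Assumed new branch: add no spot or reservation selector, so the
        # pod is not restricted to spot or reserved nodes.
        pass
    else:
        # Same as today: select spot nodes and tolerate the spot taint.
        selector["cloud.google.com/gke-spot"] = "true"
        tolerations.append(
            {
                "key": "cloud.google.com/gke-spot",
                "operator": "Equal",
                "value": "true",
                "effect": "NoSchedule",
            }
        )
    return selector, tolerations


if __name__ == "__main__":
    print(build_scheduling("0", "my-reservation"))
    print(build_scheduling("1", None, on_demand=True))
    print(build_scheduling("1", None))
```

With this sketch, the default behaviour is unchanged and on-demand has to be requested explicitly, so existing spot and reservation jobs would not be affected.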