[GENERAL SUPPORT]: Error in Running Early Stopping Program from Ax Documentation
vinaysaini94 opened this issue · 2 comments
vinaysaini94 commented
Question
Hi Community,
I am having trouble running the early stopping tutorial at https://ax.dev/tutorials/early_stopping/early_stopping.html. Specifically, I encounter an error when executing the following cell:
"%%time
scheduler.run_all_trials()"
The error message I get is:
"{
"name": "OSError",
"message": "[WinError 87] The parameter is incorrect",
"stack": "---------------------------------------------------------------------------
OSError Traceback (most recent call last)
File <timed eval>:1
File c:\\Users\\Vinay Saini\\anaconda3\\Lib\\site-packages\\ax\\service\\scheduler.py:1124, in Scheduler.run_all_trials(self, timeout_hours, idle_callback)
1117 if self.options.total_trials is None:
1118 # NOTE: Capping on number of trials will likely be needed as fallback
1119 # for most stopping criteria, so we ensure `num_trials` is specified.
1120 raise ValueError(
1121 \"Please either specify `num_trials` in `SchedulerOptions` input \"
1122 \"to the `Scheduler` or use `run_n_trials` instead of `run_all_trials`.\"
1123 )
-> 1124 return self.run_n_trials(
1125 max_trials=not_none(self.options.total_trials),
1126 timeout_hours=timeout_hours,
1127 idle_callback=idle_callback,
1128 )
File c:\\Users\\Vinay Saini\\anaconda3\\Lib\\site-packages\\ax\\service\\scheduler.py:1071, in Scheduler.run_n_trials(self, max_trials, ignore_global_stopping_strategy, timeout_hours, idle_callback)
1036 \"\"\"Run up to ``max_trials`` trials; will run all ``max_trials`` unless
1037 completion criterion is reached. For base ``Scheduler``, completion criterion
1038 is reaching total number of trials set in ``SchedulerOptions``, so if that
(...)
1068 3
1069 \"\"\"
1070 self.poll_and_process_results()
-> 1071 for _ in self.run_trials_and_yield_results(
1072 max_trials=max_trials,
1073 ignore_global_stopping_strategy=ignore_global_stopping_strategy,
1074 timeout_hours=timeout_hours,
1075 idle_callback=idle_callback,
1076 ):
1077 pass
1078 return self.summarize_final_result()
File c:\\Users\\Vinay Saini\\anaconda3\\Lib\\site-packages\\ax\\service\\scheduler.py:964, in Scheduler.run_trials_and_yield_results(self, max_trials, ignore_global_stopping_strategy, timeout_hours, idle_callback)
958 # Run new trial evaluations until `run` returns `False`, which
959 # means that there was a reason not to run more evaluations yet.
960 # Also check that `max_trials` is not reached to not exceed it.
961 n_remaining_to_generate = self._num_remaining_requested_trials - len(
962 self.candidate_trials
963 )
--> 964 while self._num_remaining_requested_trials > 0 and self.run(
965 max_new_trials=n_remaining_to_generate
966 ):
967 # Not checking `should_abort_optimization` on every trial for perf.
968 # reasons.
969 n_already_run_by_scheduler = (
970 len(self.experiment.trials)
971 - n_existing
972 - len(self.candidate_trials)
973 )
974 self._num_remaining_requested_trials = (
975 max_trials - n_already_run_by_scheduler
976 )
File c:\\Users\\Vinay Saini\\anaconda3\\Lib\\site-packages\\ax\\service\\scheduler.py:1192, in Scheduler.run(self, max_new_trials)
1190 self.logger.info(f\"Running trials {idcs_str}...\")
1191 # TODO: Add optional timeout between retries of `run_trial(s)`.
-> 1192 metadata = self.run_trials(trials=all_trials)
1193 self.logger.debug(f\"Ran trials {idcs_str}.\")
1194 if self.options.debug_log_run_metadata:
File c:\\Users\\Vinay Saini\\anaconda3\\Lib\\site-packages\\ax\\utils\\common\\executils.py:163, in retry_on_exception.<locals>.func_wrapper.<locals>.actual_wrapper(*args, **kwargs)
159 wait_interval = min(
160 MAX_WAIT_SECONDS, initial_wait_seconds * 2 ** (i - 1)
161 )
162 time.sleep(wait_interval)
--> 163 return func(*args, **kwargs)
165 # If we are here, it means the retries were finished but
166 # The error was suppressed. Hence return the default value provided.
167 return default_return_on_suppression
File c:\\Users\\Vinay Saini\\anaconda3\\Lib\\site-packages\\ax\\service\\scheduler.py:639, in Scheduler.run_trials(self, trials)
617 @retry_on_exception(retries=3, no_retry_on_exception_types=NO_RETRY_EXCEPTIONS)
618 def run_trials(self, trials: Iterable[BaseTrial]) -> Dict[int, Dict[str, Any]]:
619 \"\"\"Deployment function, runs a single evaluation for each of the
620 given trials.
621
(...)
637 process.
638 \"\"\"
--> 639 return self.runner.run_multiple(trials=trials)
File c:\\Users\\Vinay Saini\\anaconda3\\Lib\\site-packages\\ax\\core\\runner.py:70, in Runner.run_multiple(self, trials)
50 def run_multiple(
51 self, trials: Iterable[core.base_trial.BaseTrial]
52 ) -> Dict[int, Dict[str, Any]]:
53 \"\"\"Runs a single evaluation for each of the given trials. Useful when deploying
54 multiple trials at once is more efficient than deploying them one-by-one.
55 Used in Ax ``Scheduler``.
(...)
68 process.
69 \"\"\"
---> 70 return {trial.index: self.run(trial=trial) for trial in trials}
File c:\\Users\\Vinay Saini\\anaconda3\\Lib\\site-packages\\ax\\runners\\torchx.py:159, in TorchXRunner.run(self, trial)
156 parameters[\"tracker_base\"] = self._tracker_base
158 appdef = self._component(**parameters)
--> 159 app_handle = self._torchx_runner.run(appdef, self._scheduler, self._cfg)
160 return {
161 TORCHX_APP_HANDLE: app_handle,
162 TORCHX_RUNNER: self._torchx_runner,
163 TORCHX_TRACKER_BASE: self._tracker_base,
164 }
File c:\\Users\\Vinay Saini\\anaconda3\\Lib\\site-packages\\torchx\\runner\\api.py:262, in Runner.run(self, app, scheduler, cfg, workspace, parent_run_id)
252 with log_event(
253 api=\"run\", runcfg=json.dumps(cfg) if cfg else None, workspace=workspace
254 ) as ctx:
255 dryrun_info = self.dryrun(
256 app,
257 scheduler,
(...)
260 parent_run_id=parent_run_id,
261 )
--> 262 handle = self.schedule(dryrun_info)
263 ctx._torchx_event.scheduler = none_throws(dryrun_info._scheduler)
264 ctx._torchx_event.app_image = none_throws(dryrun_info._app).roles[0].image
File c:\\Users\\Vinay Saini\\anaconda3\\Lib\\site-packages\\torchx\\runner\\api.py:308, in Runner.schedule(self, dryrun_info)
301 with log_event(
302 \"schedule\",
303 scheduler,
304 app_image=app_image,
305 runcfg=json.dumps(cfg) if cfg else None,
306 ) as ctx:
307 sched = self._scheduler(scheduler)
--> 308 app_id = sched.schedule(dryrun_info)
309 app_handle = make_app_handle(scheduler, self._name, app_id)
310 app = none_throws(dryrun_info._app)
File c:\\Users\\Vinay Saini\\anaconda3\\Lib\\site-packages\\torchx\\schedulers\\local_scheduler.py:805, in LocalScheduler.schedule(self, dryrun_info)
802 replica_log_dir = role_log_dirs[replica_id]
804 os.makedirs(replica_log_dir)
--> 805 replica = self._popen(
806 role_name,
807 replica_id,
808 replica_params,
809 )
810 local_app.add_replica(role_name, replica)
811 self._apps[app_id] = local_app
File c:\\Users\\Vinay Saini\\anaconda3\\Lib\\site-packages\\torchx\\schedulers\\local_scheduler.py:693, in LocalScheduler._popen(self, role_name, replica_id, replica_params)
682 def _popen(
683 self,
684 role_name: RoleName,
685 replica_id: int,
686 replica_params: ReplicaParam,
687 ) -> _LocalReplica:
688 \"\"\"
689 Same as ``subprocess.Popen(**popen_kwargs)`` but is able to take ``stdout`` and ``stderr``
690 as file name ``str`` rather than a file-like obj.
691 \"\"\"
--> 693 stdout_, stderr_, combined_ = self._get_replica_output_handles(replica_params)
695 args_pfmt = pprint.pformat(asdict(replica_params), indent=2, width=80)
696 log.debug(f\"Running {role_name} (replica {replica_id}):\
{args_pfmt}\")
File c:\\Users\\Vinay Saini\\anaconda3\\Lib\\site-packages\\torchx\\schedulers\\local_scheduler.py:731, in LocalScheduler._get_replica_output_handles(self, replica_params)
729 combined_file = self._get_file_io(replica_params.combined)
730 if combined_file:
--> 731 combined_ = Tee(
732 combined_file,
733 none_throws(replica_params.stdout),
734 none_throws(replica_params.stderr),
735 )
736 return stdout_, stderr_, combined_
File c:\\Users\\Vinay Saini\\anaconda3\\Lib\\site-packages\\torchx\\schedulers\\streams.py:35, in Tee.__init__(self, out, *sources)
33 for source in sources:
34 r = io.open(source, \"rb\", buffering=0)
---> 35 os.set_blocking(r.fileno(), False)
36 self.streams.append(r)
38 self._closed = False
OSError: [WinError 87] The parameter is incorrect"
}"
I am running this code on a Windows 11 system using Jupyter Notebook.
Any advice on how to resolve this issue or modify the code to work on my setup would be greatly appreciated!
Thanks in advance for your help!
Code of Conduct
- I agree to follow Ax's Code of Conduct
Balandat commented
Hmm, interesting. This failure is very deep in the torchx stack. The fact that it's an OSError makes me think this is not really an Ax issue, but rather an issue with torchx on Windows. There are a few similar issues out there (a minimal repro sketch follows the links):
- https://discuss.pytorch.org/t/oserror-when-importing-torch-in-python-script/203520 (recommends downgrading Python)
- coherent-oss/coherent.deps#1
- https://www.reddit.com/r/learnprogramming/comments/zvdvl1/new_to_using_kivy_keep_getting_error/ (suggests it might be permissions-related).
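To make the root cause concrete, here is a minimal sketch of the call that fails at the bottom of the traceback. This is my own construction, not code from Ax or the tutorial. torchx's Tee opens each replica log file unbuffered and switches its descriptor to non-blocking mode; as far as I can tell, os.set_blocking() only exists on Windows since Python 3.12 and is limited to pipes there, so calling it on a regular file raises [WinError 87], while the same call succeeds on Linux:

import io
import os
import tempfile

# Stand-in for one of torchx's replica log files.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"dummy log line\n")
    log_path = f.name

# Same pattern as Tee.__init__ in torchx/schedulers/streams.py:
# open the file unbuffered, then put its descriptor in non-blocking mode.
r = io.open(log_path, "rb", buffering=0)
os.set_blocking(r.fileno(), False)  # Linux: succeeds; Windows: OSError [WinError 87]
r.close()

If that's what's going on, the tutorial should run as-is in a POSIX environment, e.g. WSL2, a Linux machine, or Colab, until torchx's LocalScheduler supports combined log streaming on Windows.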
bernardbeckerman commented
Closing this out since it's been a while, but please feel free to reopen if you still need assistance, @vinaysaini94!