microsoft/hi-ml

Creating a new data asset uses the default datastore by default

kenza-bouzid opened this issue · 1 comments

We've encountered this issue recently:

Dataset initialization failed: UserErrorException:
	Message: Failed with visit error: Failed with execution error: error in streaming from input data sources
	VisitError(ExecutionError(StreamError(NotFound)))
=> Failed with execution error: error in streaming from input data sources
	ExecutionError(StreamError(NotFound))
	InnerException None
	ErrorResponse 
{
    "error": {
        "code": "UserError",
        "message": "Failed with visit error: Failed with execution error: error in streaming from input data sources\n\tVisitError(ExecutionError(StreamError(NotFound)))\n=> Failed with execution error: error in streaming from input data sources\n\tExecutionError(StreamError(NotFound))"
    }
}

This happened because the data asset was created using the workspace's default datastore. On shared workspaces such as hai4 and mprr1ws, the default datastore doesn't contain our private data.

One has to specify --datastore when submitting the very first job that creates the data asset, because data assets cannot be deleted or edited once they are created.

Using the workspace's default datastore by default is not always desired. Can we come up with another strategy for default datastores?

Possible options:

  • The user has to specify a datastore for the first job: in get_or_create_data_asset, only create the data asset if a datastore is specified. This would force the user to be careful the first time and specify the right datastore.
  • Before using the workspace's default datastore, check whether the dataset already exists there. If not, throw an error: "Input Dataset doesn't exist in default datastore, please specify the right datastore".
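To make the two options concrete, here is a rough sketch of what the checks could look like. It does not touch the Azure ML SDK: the registry dict, the exists_in_datastore callable and both function names are hypothetical stand-ins, not hi-ml code, so treat it as an illustration of the control flow only.

```python
from typing import Callable, Dict, Optional


class MissingDatastoreError(ValueError):
    """Raised when a new data asset would be created without an explicit datastore."""


def get_or_create_data_asset_sketch(
    registry: Dict[str, str], name: str, datastore: Optional[str] = None
) -> str:
    """Option 1: only create a new asset if a datastore was given explicitly.

    `registry` is a hypothetical stand-in for the workspace: asset name -> datastore name.
    """
    if name in registry:
        # The asset already exists, so reuse whatever datastore it was created with.
        return registry[name]
    if not datastore:
        raise MissingDatastoreError(
            f"Data asset '{name}' does not exist yet, please specify the right datastore"
        )
    registry[name] = datastore
    return datastore


def resolve_datastore_sketch(
    name: str,
    datastore: Optional[str],
    default_datastore: str,
    exists_in_datastore: Callable[[str, str], bool],
) -> str:
    """Option 2: fall back to the default datastore only if the dataset is really there.

    `exists_in_datastore(datastore_name, dataset_name)` is a hypothetical lookup.
    """
    if datastore:
        return datastore
    if exists_in_datastore(default_datastore, name):
        return default_datastore
    raise FileNotFoundError(
        f"Input Dataset '{name}' doesn't exist in default datastore, "
        "please specify the right datastore"
    )
```

Either check fails before anything is registered, which matters here because a wrongly created asset cannot be fixed afterwards.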

cc @ant0nsc @peterhessey

After some more investigation, it seems that creating new assets via _create_v2_data_assets in datasets.py creates all new assets in the default datastore, even if a datastore is specified on the command line.
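One possible direction, assuming the asset path is where the datastore gets lost: build the path from the datastore name that was actually requested, using the azureml://datastores/<name>/paths/<path> URI form. The helper below is hypothetical and is not the current hi-ml implementation; it only shows the string that would need to reach the asset definition.

```python
def build_datastore_uri(datastore_name: str, relative_path: str) -> str:
    """Return an Azure ML datastore URI that names the datastore explicitly.

    Hypothetical helper: it refuses an empty datastore name so that nothing
    silently falls back to the workspace default.
    """
    if not datastore_name:
        raise ValueError("An explicit datastore name is required.")
    # Strip leading slashes so the path is not doubled up after 'paths/'.
    return f"azureml://datastores/{datastore_name}/paths/{relative_path.lstrip('/')}"


# Example with made-up names for the datastore and the dataset folder.
uri = build_datastore_uri("private_store", "my_dataset/")
```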