
Creating a new data asset uses the default datastore by default

kenza-bouzid opened this issue · 1 comments

We've encountered this issue recently:

Dataset initialization failed: UserErrorException:
	Message: Failed with visit error: Failed with execution error: error in streaming from input data sources
=> Failed with execution error: error in streaming from input data sources
	InnerException None
    "error": {
        "code": "UserError",
        "message": "Failed with visit error: Failed with execution error: error in streaming from input data sources\n\tVisitError(ExecutionError(StreamError(NotFound)))\n=> Failed with execution error: error in streaming from input data sources\n\tExecutionError(StreamError(NotFound))"

This happened because the data asset was created using the workspace's default datastore. On shared workspaces such as hai4 and mprr1ws, the default datastore doesn't contain our private data.

For now, one has to specify --datastore when submitting the very first job that creates the data asset, because data assets cannot be deleted or edited once they are created.

Using the workspace's default datastore by default is not always desired. Can we come up with another strategy for default datastores?

Possible options:

  • The user has to specify a datastore for the first job: in get_or_create_data_asset, only create the data asset if a datastore is specified. This would force the user to be careful the first time and specify the right datastore.
  • Before using the workspace's default datastore, check whether the dataset already exists there. If it does not, throw an error such as: "Input dataset doesn't exist in default datastore, please specify the right datastore".
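
To make the first option concrete, here is a rough sketch of the behaviour I have in mind. It assumes a hypothetical in-memory FakeWorkspace instead of the real workspace client, and the signature of get_or_create_data_asset shown here is illustrative only, not the current one in the codebase.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FakeWorkspace:
    """Hypothetical in-memory stand-in for a workspace (not the Azure ML API)."""

    default_datastore: str
    # Maps a data asset name to the datastore it was registered against.
    assets: Dict[str, str] = field(default_factory=dict)


def get_or_create_data_asset(
    ws: FakeWorkspace, name: str, datastore: Optional[str] = None
) -> str:
    """Return the datastore backing `name`; create the asset only if a datastore is given."""
    if name in ws.assets:
        # Existing assets are reused as-is, so no datastore is needed.
        return ws.assets[name]
    if not datastore:
        # Refuse to fall back silently to ws.default_datastore on first creation.
        raise ValueError(
            f"Data asset '{name}' does not exist yet; "
            "please specify the datastore to create it in."
        )
    ws.assets[name] = datastore
    return datastore
```

With this, the first job fails fast when no datastore is given, and later jobs can omit it because the asset already exists.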

cc @ant0nsc @peterhessey

After some more investigation, it seems that creating new assets via _create_v2_data_assets puts all new assets in the default datastore, even if a datastore is specified on the command line.