Creating a new data asset uses the default datastore by default
kenza-bouzid opened this issue · 1 comments
We've encountered this issue recently:
Dataset initialization failed: UserErrorException:
Message: Failed with visit error: Failed with execution error: error in streaming from input data sources
VisitError(ExecutionError(StreamError(NotFound)))
=> Failed with execution error: error in streaming from input data sources
ExecutionError(StreamError(NotFound))
InnerException None
ErrorResponse
{
"error": {
"code": "UserError",
"message": "Failed with visit error: Failed with execution error: error in streaming from input data sources\n\tVisitError(ExecutionError(StreamError(NotFound)))\n=> Failed with execution error: error in streaming from input data sources\n\tExecutionError(StreamError(NotFound))"
}
}
This happened because the data asset was created using the workspace's default datastore. On shared workspaces such as hai4 and mprr1ws, the default datastore doesn't contain our private data.
One should specify --datastore when submitting the very first job that creates the data asset, because data assets cannot be deleted or edited once they are created.
Falling back to the workspace's default datastore is not always desired. Can we come up with another strategy for choosing the datastore?
Possible options:
- The user has to specify a datastore for the first job: in get_or_create_data_asset, only create the data asset if a datastore is specified. This would force the user to be careful the first time and specify the right datastore.
- Before using the workspace's default datastore, can we check that the dataset already exists there? If not, throw an error such as: "Input Dataset doesn't exist in default datastore, please specify the right datastore".
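To make the first option concrete, here is a minimal sketch of the guard, not the actual implementation. Everything in it is hypothetical: `get_or_create_data_asset_strict`, `MissingDatastoreError`, and the `registry` dict (an in-memory stand-in for the workspace's registered data assets) are illustration names only and do not exist in the codebase or in the Azure ML SDK.

```python
from typing import Dict, Optional


class MissingDatastoreError(ValueError):
    """Raised when a new data asset would be created without an explicit datastore."""


def get_or_create_data_asset_strict(
    registry: Dict[str, str],
    asset_name: str,
    datastore_name: Optional[str] = None,
) -> str:
    """Return the datastore backing `asset_name`, registering the asset if needed.

    `registry` is a hypothetical stand-in for the workspace: it maps each
    data asset name to the datastore it was created in.
    """
    if asset_name in registry:
        # Existing assets cannot be edited, so reuse whatever was registered.
        return registry[asset_name]
    if not datastore_name:
        # Refuse to fall back silently to the workspace's default datastore.
        raise MissingDatastoreError(
            f"Data asset '{asset_name}' does not exist yet; "
            "please specify the right datastore with --datastore."
        )
    registry[asset_name] = datastore_name
    return datastore_name
```

With this shape, the first job fails fast unless a datastore is given, while later jobs can keep omitting it because the asset already exists.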
After some more investigation, it seems that _create_v2_data_assets in datasets.py creates all new assets in the default datastore, even if a datastore is specified on the command line.
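One way the fix could look is to thread the requested datastore name into the path the asset is registered with. The sketch below is an assumption about the shape of such a fix, not what _create_v2_data_assets currently does: `build_asset_path` and its parameters are hypothetical names, and the only thing taken from Azure ML itself is the `azureml://datastores/<name>/paths/<path>` URI form.

```python
from typing import Optional


def build_asset_path(
    dataset_path: str,
    datastore_name: Optional[str],
    default_datastore: str,
) -> str:
    """Build a datastore URI for a data asset (hypothetical helper).

    The datastore given on the command line wins; the workspace default
    is used only when no datastore was specified.
    """
    chosen = datastore_name or default_datastore
    # Strip surrounding slashes so the URI has exactly one separator.
    return f"azureml://datastores/{chosen}/paths/{dataset_path.strip('/')}"
```

A unit test that passes a non-default datastore and checks the resulting path would catch the regression described above.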