MLTable - AzureML - Cache Environment variables
FrsECM opened this issue · 4 comments
Operating System
Linux
Version Information
mltable-1.6.1
azureml-dataprep-rslex~=2.22.2dev0
Steps to reproduce
- Run a job on a compute whose disk size is S
- Mount a datastore as a folder with mltable, where the datastore's total size > S
- Wait...
- Crash
For example, in Azure Machine Learning:
import mltable

storage_paths = [
    {'folder': 'azureml://subscriptions/$sub/resourcegroups/$rg/workspaces/$ws/datastores/$ds/paths/'}
]
tbl = mltable.from_paths(storage_paths)
mount_context = tbl._mount()
mount_context.start()
# Iterate over files
To fix my issue, I need to add extra mount settings:
https://learn.microsoft.com/en-us/azure/machine-learning/how-to-read-write-data-v2?view=azureml-api-2&tabs=python#available-mount-settings
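For reference, those mount settings are plain process environment variables that must be set before the mount starts. A minimal sketch (the values here are illustrative; DATASET_MOUNT_CACHE_SIZE and DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED appear in this thread, the prune threshold/target names come from the linked page):

```python
import os

# Illustrative values, set BEFORE calling tbl._mount().
# A negative DATASET_MOUNT_CACHE_SIZE means "keep at least this much disk free".
os.environ['DATASET_MOUNT_CACHE_SIZE'] = '-40GB'
os.environ['DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED'] = 'True'
# Start pruning the file cache at the threshold fraction of the cache size,
# and prune it down to the target fraction.
os.environ['DATASET_MOUNT_FILE_CACHE_PRUNE_THRESHOLD'] = '1.0'
os.environ['DATASET_MOUNT_FILE_CACHE_PRUNE_TARGET'] = '0.7'
```

Note that, as described below, the mount process does not always pick these variables up, which is the subject of this issue.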
I use a wrapper class to do this across multiple storage accounts / containers:
import os
from dataclasses import dataclass, field
from typing import Any, List

import mltable

@dataclass
class MyStorage:
    mount_paths: List[dict] = field(init=False, default_factory=list)
    _is_mounted: bool = field(init=False, default=False)
    _mount_context: Any = field(init=False, default=None)

    def __post_init__(self):
        # "-40GB" means: keep at least 40 GB free on the cluster disk.
        os.environ['DATASET_MOUNT_CACHE_SIZE'] = "-40GB"
        os.environ['DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED'] = "True"

    def mount(self):
        print('Start Mounting storage...')
        for path in self.mount_paths:
            print(f"- {path['folder']}")
        tbl = mltable.from_paths(self.mount_paths)
        self._mount_context = tbl._mount()
        self._mount_context.start()
        self._is_mounted = True
        print(f'Mount Done - {self._mount_context.mount_point}')

    def umount(self):
        if self._is_mounted:
            print(f'Start UnMounting - {self._mount_context.mount_point}')
            self._mount_context.stop()
            self._mount_context = None
            self._is_mounted = False
            print('UnMount Done...')

    def __del__(self):
        self.umount()
storage = MyStorage()
storage.mount_paths = storage_paths
storage.mount()
# Do stuff
del storage
I also tried to set the environment variables in the job YAML:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
experiment_name: LARGE-JOB
display_name: Large Job
environment_variables:
DATASET_MOUNT_CACHE_SIZE: "-40 GB"
DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: "True"
DATASET_MOUNT_FILE_CACHE_PRUNE_TARGET: "0.0"
....
But none of these solutions works.
Expected behavior
I expect the disk cache to be pruned once it reaches the -40GB limit (i.e. less than 40 GB free) on the compute machine.
Actual behavior
Currently, the cache continues to grow, even when I set the environment variables in the YAML. I can confirm the environment variables are present in the job, but it seems mltable ignores them.
Additional information
No response
For people who may have this problem, I found a fix:
import os
import re

# MountOptions comes from the azureml-dataprep package; the exact import
# path may vary with the SDK version.
from azureml.dataprep.fuse.dprepfuse import MountOptions

def mount_options() -> MountOptions:
    max_size = None
    free_space_required = None
    cache_param = os.getenv('DATASET_MOUNT_CACHE_SIZE', None)
    if cache_param:
        CACHE_SIZE_PATTERN = r'^(?P<sign>-?)(?P<val>\d+).*(?P<size>[A-Z]{2})$'
        match = re.match(CACHE_SIZE_PATTERN, cache_param)
        if match:
            size = match.group('size')
            if size == 'GB':
                coeff = 1024**3
            elif size == 'MB':
                coeff = 1024**2
            else:
                raise NotImplementedError(f'Not implemented for size {size}')
            value = int(match.group('val')) * coeff
            if match.group('sign') == '-':
                # A negative value means "free_space_required" mode
                free_space_required = value
                print(f'MountOption : {value} Max Free Space')
            else:
                # A positive value means "max_size" mode
                max_size = value
                print(f'MountOption : {value} Max Size')
    return MountOptions(max_size=max_size, free_space_required=free_space_required)
###### You can now consume your mltable
storage_paths = [
    {'folder': 'azureml://subscriptions/$sub/resourcegroups/$rg/workspaces/$ws/datastores/$ds/paths/'}
]
tbl = mltable.from_paths(storage_paths)
mount_context = tbl._mount(mount_options=mount_options())
mount_context.start()
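To sanity-check the size parsing without an AzureML workspace, the same regex logic can be exercised standalone. This is a hypothetical helper that mirrors the parsing inside mount_options() above, returning raw byte counts instead of a MountOptions object:

```python
import re

# Same pattern as in mount_options(): optional leading minus, digits,
# anything in between (e.g. a space), then a two-letter unit.
CACHE_SIZE_PATTERN = r'^(?P<sign>-?)(?P<val>\d+).*(?P<size>[A-Z]{2})$'

def parse_cache_size(cache_param: str):
    """Return (max_size, free_space_required) in bytes for a
    DATASET_MOUNT_CACHE_SIZE value, or (None, None) if it does not parse."""
    match = re.match(CACHE_SIZE_PATTERN, cache_param)
    if not match:
        return None, None
    coeff = {'GB': 1024**3, 'MB': 1024**2}[match.group('size')]
    value = int(match.group('val')) * coeff
    if match.group('sign') == '-':
        # Negative sign selects "free_space_required" mode
        return None, value
    # Otherwise "max_size" mode
    return value, None

print(parse_cache_size('-40GB'))   # free_space_required mode, 40 GiB in bytes
print(parse_cache_size('512MB'))   # max_size mode, 512 MiB in bytes
```

Note the `.*` in the pattern also accepts a space between the number and the unit, so both "-40GB" and "-40 GB" (as written in the YAML above) parse the same way.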
If I do it this way, it works, but it still ignores the prune target.
Anyway, it's a bug to me; the behaviour should be consistent with the documentation.
I have the same bug. Data caching eats up all the space on the 64 GB disk, so I can't store training checkpoints.
I tried setting DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: true, but an error arises: a boolean value can't be set.
When I set DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED: "true", nothing happens and data keeps getting cached.
Normally you can use the fix I posted above: just set the DATASET_MOUNT_CACHE_SIZE env variable with a size and it should work.
But anyway, it should be fixed on the library side...