astronomer/astronomer-cosmos

Permission issue with Cosmos cache in some restricted environments

Closed this issue · 5 comments

The Cosmos caching mechanism can trigger permission errors when the user does not have permission to chmod files in the cache location.

This can happen, for example, when using an Azure Files volume with AKS, Azure Container Instances or Azure Container Apps. A known (unfortunate) limitation of this system is that you can't change a file's permissions; permissions are defined at mount time: https://stackoverflow.com/questions/58301985/permissions-on-azure-file.

I get the following error when trying to use Cosmos cache with such a volume:

[2024-05-29, 16:57:16 CEST] {taskinstance.py:2905} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 465, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 432, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/baseoperator.py", line 400, in wrapper
    return func(self, *args, **kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/cosmos/operators/base.py", line 266, in execute
    self.build_and_run_cmd(context=context, cmd_flags=self.add_cmd_flags())
  File "/home/airflow/.local/lib/python3.10/site-packages/cosmos/operators/local.py", line 499, in build_and_run_cmd
    result = self.run_command(cmd=dbt_cmd, env=env, context=context)
  File "/home/airflow/.local/lib/python3.10/site-packages/cosmos/operators/local.py", line 366, in run_command
    cache._update_partial_parse_cache(partial_parse_file, self.cache_dir)
  File "/home/airflow/.local/lib/python3.10/site-packages/cosmos/cache.py", line 110, in _update_partial_parse_cache
    shutil.copy(str(latest_partial_parse_filepath), str(cache_path))
  File "/usr/local/lib/python3.10/shutil.py", line 418, in copy
    copymode(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/local/lib/python3.10/shutil.py", line 307, in copymode
    chmod_func(dst, stat.S_IMODE(st.st_mode))
PermissionError: [Errno 1] Operation not permitted: '/cache/cosmos/dbt/target/partial_parse.msgpack'

The problem is a consequence of using shutil.copy, which, according to the documentation, "copies the file data and the file’s permission mode" and therefore fails in this context:

shutil.copy(str(latest_partial_parse_filepath), str(cache_path))
shutil.copy(str(latest_manifest_filepath), str(manifest_path))

Using shutil.copyfile should solve this issue.
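A minimal sketch of the difference, assuming that patching os.chmod to raise the same EPERM error is a fair stand-in for a volume whose permissions are fixed at mount time (such as the Azure Files volume above). shutil.copy writes the data and then fails inside shutil.copymode, while shutil.copyfile never calls chmod. The function and file names are illustrative only.

```python
import shutil
import tempfile
from pathlib import Path
from unittest import mock


def compare_on_no_chmod_mount() -> dict:
    """Try shutil.copy and shutil.copyfile while chmod is forbidden.

    Assumption: patching os.chmod to raise EPERM emulates a mount
    (e.g. Azure Files) where permissions cannot be changed.
    """
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "partial_parse.msgpack"
        src.write_bytes(b"dummy partial parse payload")
        denied = PermissionError(1, "Operation not permitted")
        with mock.patch("os.chmod", side_effect=denied):
            try:
                # shutil.copy = copyfile + copymode; copymode calls os.chmod
                shutil.copy(src, Path(tmp) / "via_copy.msgpack")
                results["copy"] = "ok"
            except PermissionError:
                results["copy"] = "PermissionError"
            # copyfile only copies the bytes, so it never reaches os.chmod
            dst = shutil.copyfile(src, Path(tmp) / "via_copyfile.msgpack")
            results["copyfile"] = "ok"
        results["same_bytes"] = Path(dst).read_bytes() == src.read_bytes()
    return results


if __name__ == "__main__":
    print(compare_on_no_chmod_mount())
    # {'copy': 'PermissionError', 'copyfile': 'ok', 'same_bytes': True}
```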

To resolve the PermissionError caused by shutil.copy in the Cosmos caching mechanism when using an Azure Files volume with AKS, Azure Container Instances or Azure Container Apps, you can modify the code to use shutil.copyfile instead of shutil.copy. After copying the data, shutil.copy calls shutil.copymode, and the resulting chmod is rejected on filesystems such as Azure Files, where permissions are fixed at mount time.

Here is the modified code:

import shutil
from pathlib import Path

# get_partial_parse_path, patch_partial_parse_content, DBT_MANIFEST_FILE_NAME and
# DBT_TARGET_DIR_NAME are assumed to come from the existing Cosmos modules.


def _update_partial_parse_cache(latest_partial_parse_filepath: Path, cache_dir: Path) -> None:
    """
    Update the cache to have the latest partial parse file contents.

    :param latest_partial_parse_filepath: Path to the most up-to-date partial parse file
    :param cache_dir: Path to the Cosmos project cache directory
    """
    cache_path = get_partial_parse_path(cache_dir)
    manifest_path = cache_path.parent / DBT_MANIFEST_FILE_NAME
    latest_manifest_filepath = latest_partial_parse_filepath.parent / DBT_MANIFEST_FILE_NAME

    # copyfile copies only the file contents (no copymode/chmod), so it also works
    # on mounts such as Azure Files where permissions are fixed at mount time.
    shutil.copyfile(str(latest_partial_parse_filepath), str(cache_path))
    shutil.copyfile(str(latest_manifest_filepath), str(manifest_path))


def _copy_partial_parse_to_project(partial_parse_filepath: Path, project_path: Path) -> None:
    """
    Update target dbt project directory to have the latest partial parse file contents.

    :param partial_parse_filepath: Path to the most up-to-date partial parse file
    :param project_path: Path to the target dbt project directory
    """
    target_partial_parse_file = get_partial_parse_path(project_path)
    tmp_target_dir = project_path / DBT_TARGET_DIR_NAME
    tmp_target_dir.mkdir(exist_ok=True)

    source_manifest_filepath = partial_parse_filepath.parent / DBT_MANIFEST_FILE_NAME
    target_manifest_filepath = target_partial_parse_file.parent / DBT_MANIFEST_FILE_NAME
    # Contents only: no attempt to copy the permission bits to the project directory.
    shutil.copyfile(str(partial_parse_filepath), str(target_partial_parse_file))

    patch_partial_parse_content(target_partial_parse_file, project_path)

    # The manifest is optional here, so only copy it when it exists.
    if source_manifest_filepath.exists():
        shutil.copyfile(str(source_manifest_filepath), str(target_manifest_filepath))

With shutil.copyfile, only the file contents are copied and no chmod is attempted on the destination, which should avoid the PermissionError on Azure Files volumes [1].

About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

I believe we are using shutil.copy because we want to keep file metadata such as the modification time. Did you try using the cache-dir config? You can set it to a location where you have permission.

Sorry for the late reply @pankajastro!

If that is the intention, then shutil.copy is probably not achieving what you expect it to do. According to the documentation:

copy() copies the file data and the file’s permission mode (see os.chmod()). Other metadata, like the file’s creation and modification times, is not preserved. To preserve all file metadata from the original, use copy2() instead.

Looking at the source code of shutil.copy, you can see that it just runs shutil.copyfile followed by shutil.copymode, and I don't see why the second call would be relevant here. If you really need to preserve this metadata, shutil.copystat seems more appropriate.
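The sketch below, assuming a regular POSIX filesystem where chmod and utime are allowed, checks which metadata each approach carries over: shutil.copy keeps the permission bits but not the modification time, while shutil.copyfile followed by shutil.copystat keeps both. The function name and the OLD_MTIME value are arbitrary choices for the demo.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path

OLD_MTIME = 1_000_000_000  # arbitrary past timestamp (2001-09-09 UTC)


def compare_copy_variants() -> dict:
    """Report whether mode bits and mtime survive each copy strategy."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "manifest.json"
        src.write_text("{}")
        os.chmod(src, 0o640)
        os.utime(src, (OLD_MTIME, OLD_MTIME))  # (atime, mtime) in seconds
        src_st = src.stat()

        via_copy = Path(tmp) / "via_copy.json"
        shutil.copy(src, via_copy)  # copyfile + copymode

        via_copystat = Path(tmp) / "via_copystat.json"
        shutil.copyfile(src, via_copystat)
        shutil.copystat(src, via_copystat)  # mode bits, atime/mtime, flags

        def mode(p: Path) -> int:
            return stat.S_IMODE(p.stat().st_mode)

        src_mode = stat.S_IMODE(src_st.st_mode)
        return {
            "copy_keeps_mode": mode(via_copy) == src_mode,
            "copy_keeps_mtime": via_copy.stat().st_mtime_ns == src_st.st_mtime_ns,
            "copystat_keeps_mode": mode(via_copystat) == src_mode,
            "copystat_keeps_mtime": via_copystat.stat().st_mtime_ns == src_st.st_mtime_ns,
        }
```

Because shutil.copystat also sets the permission bits (and the timestamps), it would presumably run into the same restriction on a mount that forbids chmod or utime.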

I am using the cache-dir config, but it doesn't really help, because my intention is to store the cache on this volume so that it can be shared across containers.

If that is the intention, then shutil.copy is probably not achieving what you expect it to do. According to the documentation

While that might be true, note that we do compare file ages in this context:

if age_project_partial_parse > age_cosmos_cached_partial_parse_filepath:

shutil.copystat appears more promising for preserving this metadata, but I'm curious whether it would resolve your permission issue, given that it also copies the permission bits.

I ran some tests today, and you're correct: I run into a similar permission issue. In fact, even modifying the file modification timestamp is not permitted.
However, I don't believe that is a functional problem: the cached file's mtime will be set to the time when it was cached, and if the manifest is regenerated later, its timestamp will be newer than the cache file's timestamp, which triggers a cache refresh anyway. Assuming we use shutil.copyfile instead of shutil.copy, the only situation I can think of that would exhibit a functional difference is the following sequence of events:

  1. The dbt manifest is generated at time 0, with mtime 0
  2. Cosmos caches the file at time 2, and the cache file has therefore mtime 2
  3. The dbt manifest is regenerated, and the updated mtime is 1
  4. The next time Cosmos runs, it would not refresh the cache, because the cache's mtime (2) is greater than the manifest's mtime (1)

I don't see how that could happen unless there is a clock mismatch between the process generating the manifest and Cosmos' caching process. That seems highly unlikely!
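To make the timeline above concrete, here is a sketch built on a hypothetical needs_refresh rule (refresh when the project file's mtime is newer than the cached copy's). It only mirrors the idea behind the age comparison quoted earlier and is not the actual Cosmos code. The mtimes are pinned with os.utime to the values from the steps: the cache is written at t=2, and the manifest is regenerated either at t=3 (normal case) or at t=1 (the clock-mismatch case).

```python
import os
import shutil
import tempfile
from pathlib import Path


def needs_refresh(project_file: Path, cached_file: Path) -> bool:
    """Hypothetical rule: refresh when the project file is newer than the cache."""
    return project_file.stat().st_mtime > cached_file.stat().st_mtime


def scenario(regenerated_mtime: float) -> bool:
    """Replay steps 1-4 and return whether a refresh would be triggered."""
    with tempfile.TemporaryDirectory() as tmp:
        manifest = Path(tmp) / "manifest.json"
        cache = Path(tmp) / "cached_manifest.json"

        manifest.write_text("v1")
        os.utime(manifest, (0, 0))        # step 1: generated at t=0
        shutil.copyfile(manifest, cache)  # step 2: copyfile gives the cache "now"...
        os.utime(cache, (2, 2))           # ...pinned to t=2 for the demo

        manifest.write_text("v2")         # step 3: manifest regenerated
        os.utime(manifest, (regenerated_mtime, regenerated_mtime))
        return needs_refresh(manifest, cache)  # step 4


if __name__ == "__main__":
    print(scenario(3), scenario(1))  # True False
```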

I also tested on a regular filesystem and can confirm that shutil.copy does not behave as you expect: it does not preserve the modification time, whereas shutil.copy2 does:

>>> from pathlib import Path
>>> import shutil
>>> 
>>> source = Path('source.txt')
>>> source.write_text("Hello, World!")
>>> dest_dir = Path("/dest")
>>> source.stat()
os.stat_result(st_mode=33188, st_ino=5667023, st_dev=2096, st_nlink=1, st_uid=1000, st_gid=1000, st_size=13, st_atime=1718698252, st_mtime=1718698198, st_ctime=1718698198)
>>> # Test 1: shutil.copy
>>> dest1 = dest_dir / "test1.txt"
>>> shutil.copy(source, dest1)
>>> dest1.stat()
os.stat_result(st_mode=33188, st_ino=5667073, st_dev=2096, st_nlink=1, st_uid=1000, st_gid=1000, st_size=13, st_atime=1718698252, st_mtime=1718698252, st_ctime=1718698252)
>>> # Test 2: shutil.copy2
>>> dest2 = dest_dir / "test2.txt"
>>> shutil.copy2(source, dest2)
>>> dest2.stat()
os.stat_result(st_mode=33188, st_ino=5667040, st_dev=2096, st_nlink=1, st_uid=1000, st_gid=1000, st_size=13, st_atime=1718698252, st_mtime=1718698198, st_ctime=1718698278)
>>> # Test 3: shutil.copyfile
>>> dest3 = dest_dir / "test3.txt"
>>> shutil.copyfile(source, dest3)
>>> dest3.stat()
os.stat_result(st_mode=33188, st_ino=5667062, st_dev=2096, st_nlink=1, st_uid=1000, st_gid=1000, st_size=13, st_atime=1718699621, st_mtime=1718699625, st_ctime=1718699625)

The only effect of using copy instead of copyfile is that the file permissions are copied as well, which seems rather undesirable here 😉
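If keeping the timestamp is still desirable wherever the filesystem allows it, one option (a hypothetical helper, not something Cosmos provides) is to copy the data with shutil.copyfile and then attempt shutil.copystat, ignoring a PermissionError. The demo again assumes that patching os.chmod is a fair stand-in for a mount that forbids permission changes.

```python
import shutil
import tempfile
from pathlib import Path
from unittest import mock


def copy_best_effort(src: Path, dst: Path) -> None:
    """Hypothetical helper: copy the data, then try to copy metadata too."""
    shutil.copyfile(src, dst)  # contents only; never calls chmod
    try:
        # copystat sets atime/mtime and permission bits; on mounts where
        # utime/chmod are forbidden this raises PermissionError.
        shutil.copystat(src, dst)
    except PermissionError:
        pass  # keep the copied data; metadata stays as the mount defines it


def demo_on_locked_mount() -> bytes:
    """Run copy_best_effort with os.chmod patched to fail (assumption)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "partial_parse.msgpack"
        dst = Path(tmp) / "cached_partial_parse.msgpack"
        src.write_bytes(b"payload")
        denied = PermissionError(1, "Operation not permitted")
        with mock.patch("os.chmod", side_effect=denied):
            copy_best_effort(src, dst)
        return dst.read_bytes()
```

On a regular filesystem this behaves like copyfile plus copystat; on a restricted mount it degrades to a plain content copy instead of failing the task.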