apache/airflow

Windows support for Airflow

shachibista opened this issue · 22 comments

Description

Currently, the Airflow project uses PEP 3143-style daemons to launch tasks (as implemented in https://pypi.org/project/python-daemon/); however, this targets Unix daemons. As a result, running Airflow on Windows requires multiple levels of abstraction, each with its own problems. Would it be possible to use something like daemoniker (https://daemoniker.readthedocs.io/en/latest/) to launch tasks? What are the challenges and issues?

In machine learning workflows with large datasets, it is a huge time-saver if the pipeline tasks can run on the GPU. WSL 1 does not support GPU passthrough. Docker through WSL 2 supports GPU passthrough only with the Insiders build, and it also has networking issues when connected to a VPN (microsoft/WSL#5068).

Use case / motivation

Running Airflow natively on Windows, without WSL 1/2 or Docker. This is helpful in cases where the company ecosystem is Windows-based.

Possible implementation

The daemon module is only used to daemonize the scheduler and webserver. Here is some sample code that runs the scheduler (airflow origin/v1-10-stable) using daemoniker; comments are welcome:

# airflow/bin/cli.py
from daemoniker import Daemonizer

...

if args.daemon:
    with Daemonizer() as (is_setup, daemonizer):
        if is_setup:
            # Runs before daemonization: resolve the pid, stdout, stderr and log file paths.
            pid, stdout, stderr, log_file = setup_locations("scheduler",
                                                            args.pid,
                                                            args.stdout,
                                                            args.stderr,
                                                            args.log_file)

        # Daemonize, redirecting the daemon's stdout and stderr to the resolved files.
        _is_parent = daemonizer(
            pid,
            stdout_goto=stdout,
            stderr_goto=stderr
        )

    job.run()

Thanks for opening your first issue here! Be sure to follow the issue template!

Have you encountered other problems with running Airflow on Windows? Windows support is highly anticipated by our users, but no one has dealt with this topic intensively yet. Personally, I use macOS, but I support the idea of adding support for Windows.

Yes. Following the installation manual on the homepage, pip install apache-airflow installs the airflow command, but it is not a Windows executable and Windows does not recognize the #! ..../python3.exe shebang.

@shachibista Have you tried installing the development version from source? I think this change should fix this problem.
https://github.com/apache/airflow/pull/7808/files#r396126977

I think it would be great if someone could invest in Windows support. I believe there are a few things involved: not only the daemon model, but also the Local Executor, which uses fork mechanisms that are not available on Windows. There might also be problems if you want to use the Celery Executor on Windows: https://www.distributedpython.com/2018/08/21/celery-4-windows/ A few of the packages we use are POSIX-only and might not work on Windows either. And automated testing might be a problem, since we are using Docker. It looks like quite a big effort to invest.
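To make the fork point concrete, here is a minimal sketch using only the standard library (it is not Airflow code, and `pick_start_method` and `make_context` are hypothetical helper names). It shows how code can ask which process start methods the current platform offers and fall back to "spawn", which is the only one available on Windows:

```python
import multiprocessing as mp


def pick_start_method(preferred="fork"):
    """Return `preferred` if this platform supports it, else fall back to "spawn"."""
    available = mp.get_all_start_methods()  # "fork" is missing from this list on Windows
    return preferred if preferred in available else "spawn"


def make_context():
    # A context object scopes the start method without changing the global default.
    return mp.get_context(pick_start_method())


if __name__ == "__main__":
    print(pick_start_method())
```

Note that "spawn" starts a fresh interpreter, so anything the child needs has to be picklable and importable. Code written with fork semantics in mind may need changes beyond swapping the start method.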

@mik-laj No, I haven't tried installing the development version from source. Is there a simple way to do it on Windows?

I am afraid not. We know Airflow works in WSL2, but we also know it does not work on Windows. Unless you can convince someone to make it work on Windows, I am afraid it's not going to happen.

You can install the application from local sources by cloning the repository and then running the pip install -e . command.

@mik-laj Yes, the development version fixes the issue with the airflow command, at least. But, I cannot start the scheduler due to the aforementioned issues.

@potiuk Are you sure there are no fork-like mechanisms for windows? I would really like to get this working at least using Local/SequentialExecutor.

There are different mechanisms - here is the whole discussion about it: https://docs.python.org/3/library/subprocess.html#popen-constructor - but they work differently, and Airflow relies on some of the properties of Popen and on passing open file handles (for example, to open log files). I think there are also a number of other dependencies and possibly hard-coded UNIX "/" path separators across the code. Also, Windows is not POSIX-compliant, and I think there are many places where we rely on tools or binaries that are part of the POSIX standard.

I am not saying it's impossible, I just think it's quite an effort, and unless you make all the tests pass on Windows we can't even start thinking about it. You can start by forking Airflow and trying to make the tests work on Windows. GitHub Actions supports Windows runners, so this should be easy to enable.

We rely heavily on Bash scripts for executing the tests and building Docker images, and all our tests run in an Ubuntu Docker image. If you want to run them on Windows, it has to be done differently and likely without Docker images: simply creating a virtualenv and installing everything.

Maybe you can find others who have time and would like to take a look at that together with you? Simply start a discussion on our devlist and ask for help. I am afraid that at this stage, for the community, the fact that it works in WSL2 for Windows users is quite enough.

I know there were some changes implemented by @evgenyshulman from DataBand to make Airflow work in a very limited way on Windows, so maybe, rather than running a full set of tests on Windows, just getting very simple support for the Local Executor is possible? Still, starting from a GitHub Actions step that installs Airflow on Windows is a good start. We cannot accept code that is not tested, so being able to test it automatically is a prerequisite.
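As an illustration of the file-handle point, here is a minimal sketch, assuming only the standard library (`run_logged` is a hypothetical helper, not something from Airflow). Explicitly redirecting a child's stdout and stderr to an already-open log file through `subprocess.Popen` is supported on both POSIX and Windows, unlike relying on a forked child to inherit whatever the parent happened to have open:

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_logged(cmd, log_path):
    """Run `cmd`, sending stdout and stderr to `log_path`; return the exit code."""
    with open(log_path, "ab") as log:
        # The open file object is handed to the child as its stdout;
        # stderr is merged into the same stream.
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        return proc.wait()


if __name__ == "__main__":
    log_file = Path(tempfile.mkdtemp()) / "task.log"
    code = run_logged([sys.executable, "-c", "print('hello from child')"], log_file)
    print(code, log_file.read_text().strip())
```

This only covers the simple case of redirecting output. Passing arbitrary extra descriptors to a child is where the two platforms diverge more sharply.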

Happy to review any changes if you come up with tests running on Windows :).

In our company we now have a setup where we use an Ubuntu server to host Airflow (Web-Server, Dask-Scheduler) and a Windows Server as the Dask-Worker. We need the tasks to run on Windows since they have some dependencies that cannot easily be ported to other platforms. Since the Dask-Worker also needs to have Airflow installed, we had to clone the repository and add some extensions to deal with all the POSIX-only Python functions that are not available on Windows. We ended up adding a platform check in certain files and "mimicking" POSIX behavior where necessary.
This approach works really well in the limited manner we need it to work, but it would be great if such a custom solution could be replaced by something more official. We would be willing to share our insights, if the devs are interested in pursuing this.
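For readers wondering what such a platform check might look like, here is a small sketch. It is an assumption about the general pattern, not the actual changes described above, and `IS_WINDOWS`, `force_kill_signal` and `process_group_id` are hypothetical names:

```python
import os
import signal
import sys

IS_WINDOWS = sys.platform == "win32"


def force_kill_signal():
    # signal.SIGKILL is not defined on Windows; SIGTERM is the closest available.
    return signal.SIGTERM if IS_WINDOWS else signal.SIGKILL


def process_group_id():
    # os.getpgid is POSIX-only; return None where it is not available.
    if IS_WINDOWS:
        return None
    return os.getpgid(0)


if __name__ == "__main__":
    print(force_kill_signal(), process_group_id())
```

Callers then have to handle the "not available" case explicitly, which is part of why a shared abstraction layer tends to be cleaner than scattered checks.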

Absolutely! I think that might be great thing to add to Airflow. Maybe you would like to open a PR about this (cc: me) with your changes and we can discuss how to approach it.

Great to hear! We will need a bit of time since we only cloned the repository and have not forked it yet. I will check with my team mates and create a PR as soon as I have the time. Thanks.

Looking forward to it. Today we've merged official MSSQL support so seems we are getting friendlier for Microsoft :)

We have a go. I will create a fork and CC you @potiuk in the PR. There are probably a lot of things we still need to do, since the only goal was to implement enough functionality for Dask to run properly.

We can do it in stages as well. Happy to introduce some parts and see if this needs/can be replicated elsewhere.

We probably also want to add some tests to our CI that run on Windows. GitHub supports Windows runners as well, so I am happy to work on incrementally adding more tests and running them in our CI.

Glad to hear it. We work mainly on Azure-DevOps so we are not very familiar with testing and CI tools on GitHub, but happy to learn. I have created the fork and started with implementing the changes. How would you go about step-wise integration?

Just split the changes up. Maybe start with a small part of only a few lines, and I could then add the Windows CI tests on top of it. We could add other PRs afterwards. Generally, the smaller the PR, the better :)

I've added the PR. After playing around a bit, I now know that while this works fine for a Dask-Worker, it does not work if you want to run the Web-Server or a Scheduler on Windows, simply because of the process handling. There probably needs to be something like one more layer of abstraction to handle the execution of processes in a platform-agnostic way.

Any updates on this?

If there is no update here, then there is no update. Everything here happens only if someone does it. Airflow is created by > 2100 users, many of them like you @pforai: users who contribute stuff if they need it.

Maybe you would like to take the lead on it and move it forward? We need users like you, who apparently have both the need and the capability (and in this case use Windows), and who would like to move things forward and improve compatibility.