insitro/redun

redun does not execute any code for a simple example


I'm trying a simple example with redun == 0.8.7:

from redun import task, File
import pandas as pd

PATH = File("input.csv")

@task
def load_data(path):
    return pd.read_csv(path)


@task
def main(path: File = PATH) -> File:
    data = load_data(path)
    data.to_csv("data.csv")
    return File("./data.csv")

where input.csv is simply

$ cat input.csv 
x,
1,
2,
3,

Running redun then gives:

$ redun run cli.py main
[redun] redun :: version 0.8.7
[redun] config dir: .redun
[redun] Upgrading db from version -1.0 to 3.1...
[redun] Tasks will require namespace soon. Either set namespace in the `@task` decorator or with the module-level variable `redun_namespace`.
tasks needing namespace: cli.py:load_data, cli.py:main
[redun] Start Execution f6e622ee-907c-4500-8673-464f8e5a12b6:  redun run cli.py main
[redun] Run    Job c3c581e5:  main(path=File(path=input.csv, hash=98c594bd)) on default
[redun] 
[redun] | JOB STATUS 2022/04/28 00:18:12
[redun] | TASK    PENDING RUNNING  FAILED  CACHED    DONE   TOTAL
[redun] | 
[redun] | ALL           0       0       0       0       1       1
[redun] | main          0       0       0       0       1       1
[redun] 
[redun] Execution duration: 0.14 seconds
File(path=./data.csv, hash=65b8e975)

But no file is produced and the load_data() task never runs.

Is this a bug, or am I doing something wrong?

FWIW: modifying the code as below (to properly handle File, I think?) does not help:

from redun import task, File
import pandas as pd


INPUT = File("input.csv")


@task
def load_data(input: File) -> pd.DataFrame:
    return pd.read_csv(input.path) + 1


@task
def main(input: File = INPUT) -> File:
    data = load_data(input)
    data.to_csv("data.csv")
    return File("./data.csv")

Hey @elanmart. If you reformat your code like this, it works:

from redun import task, File
import pandas as pd

INPUT = File("input.csv")

@task
def load_data(input: File) -> File:
    df = pd.read_csv(input.path) + 1
    df.to_csv("data.csv", index=False)
    return File("data.csv")


@task
def main(data: File = INPUT) -> File:
    data = load_data(data)
    return data

Thanks @ricomnl for the suggestion.

@elanmart Thanks for the question. It's not a bug, but I agree it's surprising, and it's likely hard to see why load_data didn't run. This is probably a good example of something to highlight better in the docs (perhaps in the FAQ).

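Calling a task such as load_data(path) inside another task returns a lazy expression, not a DataFrame. If we take your example and add a print statement, we see that data.to_csv("data.csv") is therefore a lazy expression as well.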

@task
def main(path: File = PATH) -> File:
    data = load_data(path)
    x = data.to_csv("data.csv")
    print(x)
    return File("./data.csv")

which prints:

SimpleExpression('call', (SimpleExpression('getattr', (TaskExpression('load_data', (File(path=input.csv, hash=3c1348a5),), {}), 'to_csv'), {}), ('data.csv',), {}), {})

redun will only evaluate that lazy expression if you "use it", which in this case means returning it from main. However, in your example above, you discard it. It may feel like you "used it" because you do return File("./data.csv"), but that value is not derived from x (in my example), so redun has no way to know that it depends on the expression.

This is why @ricomnl's change works. Let me know if this explanation helps.
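To make the "use it or lose it" behavior concrete, here is a minimal sketch in plain Python. It is not redun's implementation: the Lazy class and the evaluate function are hypothetical stand-ins for a lazy expression and the scheduler, and load_data and save_data are toy functions that only record that they were called.

```python
class Lazy:
    """Hypothetical stand-in for a lazy expression (not redun's class)."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args


def evaluate(value):
    """Toy scheduler: evaluate a Lazy and anything it returns; pass plain values through."""
    if isinstance(value, Lazy):
        args = [evaluate(arg) for arg in value.args]
        return evaluate(value.func(*args))
    return value


calls = []  # records which functions actually ran


def load_data(path):
    calls.append("load_data")
    return [1, 2, 3]


def save_data(data):
    calls.append("save_data")
    return "data.csv"


def main_discard():
    data = Lazy(load_data, "input.csv")
    Lazy(save_data, data)  # built, then thrown away: nothing refers to it
    return "data.csv"      # a plain value, unrelated to the expression above


def main_return():
    data = Lazy(load_data, "input.csv")
    return Lazy(save_data, data)  # returned, so the scheduler sees it


result_discard = evaluate(Lazy(main_discard))
calls_discard = list(calls)

calls.clear()
result_return = evaluate(Lazy(main_return))
calls_return = list(calls)

print(calls_discard)  # no work was done
print(calls_return)   # both functions ran
```

Both versions return the same path, but only main_return causes load_data and save_data to run, because only there does the returned value carry the expression back to the evaluator.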

Thanks @ricomnl, @mattrasmus, I think it makes perfect sense now.

I think it would indeed be great if this was included in the docs / FAQ! Or perhaps it would be possible for redun to show a warning if there are unused outputs in the graph?
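As a rough illustration of what such a warning could look like, here is a sketch in plain Python. It does not use redun and is not a proposal for its internals: TrackedLazy, evaluate, and run are all hypothetical names, and the idea is simply to remember every lazy call that was created and report the ones that were never evaluated.

```python
class TrackedLazy:
    """Hypothetical lazy call that remembers whether it was ever evaluated."""

    created = []  # every instance built during the current run

    def __init__(self, func, *args):
        self.func = func
        self.args = args
        self.evaluated = False
        TrackedLazy.created.append(self)


def evaluate(value):
    """Toy scheduler: evaluate a TrackedLazy recursively; pass plain values through."""
    if isinstance(value, TrackedLazy):
        value.evaluated = True
        args = [evaluate(arg) for arg in value.args]
        return evaluate(value.func(*args))
    return value


def run(func):
    """Run func as the top-level call; return (result, names of unused calls)."""
    TrackedLazy.created.clear()
    result = evaluate(TrackedLazy(func))
    unused = [e.func.__name__ for e in TrackedLazy.created if not e.evaluated]
    for name in unused:
        print(f"warning: lazy call to {name}() was created but never evaluated")
    return result, unused


def load_data(path):
    return [1, 2, 3]


def save_data(data):
    return "data.csv"


def main():
    data = TrackedLazy(load_data, "input.csv")
    TrackedLazy(save_data, data)  # discarded, as in the original example
    return "data.csv"


result, unused = run(main)
```

In this toy model the discarded save_data call, and the load_data call it depended on, both show up in the unused list.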

FWIW, here is how I would update the original sample:

from redun import task, File
import pandas as pd


INPUT = File("input.csv")


@task
def load_data(input: File) -> pd.DataFrame:
    return pd.read_csv(input.path) + 1


@task
def save_data(data: pd.DataFrame) -> File:
    data.to_csv("data.csv")
    return File("data.csv")


@task
def main(input: File = INPUT) -> File:
    data = load_data(input)
    ret = save_data(data)

    return ret