pyiron/pyiron_base

`Project.wrap_python_function` does not play nicely with name lengths


Because the hash is pretty long by itself, the new `wrap_python_function` functionality bumps into our 50-character limit on job name lengths and fails in an ungraceful manner:

```python
from pyiron_base import Project

pr = Project("test")
pr.remove_jobs(recursive=True, silently=True)

def f(x):
    return 42

job = pr.wrap_python_function(f)
job.run()  # Works fine


def fffffffffffffffffff(x):
    return 42

job = pr.wrap_python_function(fffffffffffffffffff)
job.run()
# ValueError: Invalid name for a PyIron object: must be less then or equal to 50 characters
```

If we're sure we know the maximum length of the hash (I don't know whether this is fixed), then `PythonFunctionContainerJob.python_function` could either warn that the name will be truncated, or fail hard. In the former case, `PythonFunctionContainerJob.save` also needs to truncate.

Otherwise, `PythonFunctionContainerJob.save` should fail more gracefully, e.g. by breaking the failing name apart into the function-name length and the hash length, so the user can see whether it is possible to truncate the function name enough and try again (sketched below).
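A minimal sketch of that second option, assuming the generated job name is the function name plus a separator and a fixed-width hash suffix. The 50-character limit is from the issue, but `HASH_LENGTH`, `check_job_name`, and the name layout are assumptions, not pyiron_base's actual API:

```python
# Hypothetical helper, not pyiron_base's actual API. Assumes the job
# name is built as "<function_name>_<hash>" with a fixed-width hash.
MAX_JOB_NAME_LENGTH = 50  # current limit from the issue
HASH_LENGTH = 32          # e.g. an MD5 hex digest

def check_job_name(function_name: str) -> None:
    total = len(function_name) + 1 + HASH_LENGTH  # +1 for the separator
    if total > MAX_JOB_NAME_LENGTH:
        budget = MAX_JOB_NAME_LENGTH - 1 - HASH_LENGTH
        raise ValueError(
            f"Job name would be {total} characters: "
            f"{len(function_name)} from the function name plus "
            f"{HASH_LENGTH + 1} from the hash suffix, but the limit is "
            f"{MAX_JOB_NAME_LENGTH}. Shorten the function name to at "
            f"most {budget} characters and try again."
        )
```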

pmrv commented

Hashes are of fixed size, so issuing a warning on instantiation would be good. A hexadecimal representation of an MD5 digest takes 32 characters; with base64 we could get it down to 24.
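For reference, a quick check of those sizes (standard library only; `urlsafe_b64encode` is used here since plain base64 can contain `/`, which is not valid in a filename):

```python
import base64
import hashlib

digest = hashlib.md5(b"fffffffffffffffffff").digest()  # 16 raw bytes

print(len(digest.hex()))                               # 32 hex characters
print(len(base64.urlsafe_b64encode(digest).decode()))  # 24 (22 without '=' padding)
```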

Alternatively, we could increase the limit from 50 to 256; that should keep us safe. Or @XzzX, do you have an idea whether it is possible to remove the length constraint entirely, and what the performance penalty of doing so would be?

XzzX commented

We can remove the length constraint. However, that would let users store arbitrarily large amounts of data in that field. From a performance perspective I think we should be fine.

> Alternatively, we could increase the limit from 50 to 256; that should keep us safe.

Only heuristically -- albeit with very high probability. Nonetheless, in this case I would still want a check so that we fail early if the assumption ever gets broken.

> We can remove the length constraint. However, that would let users store arbitrarily large amounts of data in that field. From a performance perspective I think we should be fine.

This is also a fine solution from my perspective 🚀

Either way, we're going to need a PR to either (a) fail early, or (b) stop using 50 both in the database code and in the hard failure in `pyiron_base.jobs.job.util`.
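A minimal sketch of option (a), assuming the limit is raised to 256 as suggested above; the constant and function names here are hypothetical:

```python
# Hypothetical fail-early guard, not pyiron_base's actual code. If the
# raised limit is ever exceeded anyway (e.g. the hash format changes),
# this fails at name-generation time rather than deep in the database layer.
NEW_MAX_JOB_NAME_LENGTH = 256

def make_job_name(function_name: str, digest_hex: str) -> str:
    name = f"{function_name}_{digest_hex}"
    if len(name) > NEW_MAX_JOB_NAME_LENGTH:
        raise ValueError(
            f"Generated job name is {len(name)} characters, which exceeds "
            f"the {NEW_MAX_JOB_NAME_LENGTH}-character limit."
        )
    return name
```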

pmrv commented

According to the PostgreSQL docs there's no performance difference between fixed-length and variable-length strings (modulo storage, which I don't think matters: at 256 bytes per name, a hundred million jobs would only be ~25.6 GB).

We'll still need to cap it at some maximum because of filesystem limitations (ext4 limits filenames to 255 bytes).
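Since the ext4 limit is 255 bytes rather than 255 characters, a non-ASCII job name hits it sooner than its character count suggests. A minimal sketch of a check against that limit; the helper name is an assumption:

```python
# Hypothetical validation against the ext4 filename limit. The limit
# applies to the UTF-8 encoded byte length, not the character count.
def validate_filesystem_name(name: str, max_bytes: int = 255) -> None:
    n_bytes = len(name.encode("utf-8"))
    if n_bytes > max_bytes:
        raise ValueError(
            f"Job name is {n_bytes} bytes when UTF-8 encoded "
            f"({len(name)} characters); the filesystem limit is "
            f"{max_bytes} bytes."
        )
```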