Rahix/tbot

Compatibility of `machine.linux.Path` with unaware code

schlimmchen opened this issue · 10 comments

I have a need [1] to convert a tbot Path object to a PurePosixPath or to a string to pass onto other code. Currently I use super(machine.linux.Path, obj).__str__() which is "less than ideal" IMHO.

Is there another way? What does not work:

  • p.parent / p.name: The parent is also a tbot Path (makes sense) so that result is the same.
  • p.as_posix(): Has the name attribute of the host prepended to the string.

Shouldn't PurePath.as_posix() be implemented in machine.linux.Path to provide the expected behavior? Currently it is only
implemented in PurePath and calls __str__(), which is overridden by machine.linux.Path and which prepends the name of the host the path is associated with. Implementing machine.linux.Path.__str__() that way makes perfect sense, but it breaks as_posix().

Maybe as_posix() shall raise an Exception if the associated host is not isinstance(self.host, connector.SubprocessConnector) but otherwise return the expected string?

[1] My device under test (DUT) runs a webserver. The tests shall talk to it using Python requests over HTTPS. To avoid HTTPS certificate warnings, a known certificate and private key is deployed onto the DUT using SSH. It is convenient to use a tbot Path object for this in the context of (1) checking the certificate validity period using localhost.exec0(openssl...), (2) creating the certificate using localhost.exec0(openssl...), and deploying the certificate using tbot's shell.copy(). But finally the path must be handed over to the requests module, and it cannot be a tbot Path object for obvious reasons.

Hm, there is _local_str()...

I guess my issue with as_posix() is still relevant. And _local_str() is not "public", not part of the API, as indicated by the leading underscore.

Rahix commented

Yeah, this is a messy situation. I've always resorted to using _local_str() for this, but you are completely right - this is more of a crutch than a proper solution.

The original intention behind this design was to make it hard to accidentally use a Path with the wrong host by converting it to a string implicitly somewhere and the string version then loosing information about the associated host. Simplest example:

block_device = target_machine.rootfs / "dev" / "sda"
host_machine.exec0("dd", "if=/dev/zero", f"of={block_device}")

Because tbot code deals with many different hosts (machines) simultaneously, I think protecting against these kind of mistakes is quite important. The problem is that the cost of doing so is high. Suppose you want to do want to pass a tbot-path to dd, there is no pretty solution for this right now.


There is also a second problem with the current implementation: Because the tbot-path inherits from pathlib, you can pass tbot-paths to other python libraries which operate on the localhost. Nulling out __fspath__ at least makes most attempts at this fail but certainly not all.

I think this is a classic OOP code smell: This really shouldn't be inheritance, instead tbot's Path should be a standalone implementation which just keeps a PurePosixPath internally to handle the path juggling. As in

# machine/linux/path.py
class Path:
    def __init__(self, host: ..., path: str) -> None:
        self._host = host
        self._path = pathlib.PurePosixPath(path)

    # ...

    def __truediv__(self, key: typing.Any) -> "Path[H]":
        return Path(self._host, self._path.__truediv__(key))

This would solve problem number two for the most part, but not your original problem of course. I think there are two parts to it:

  1. When a path really is on the localhost, letting people use it like normal. Nor sure what's best here, either implement __fspath__() & friends conditionally or make people assert this intention explicitly by introducing an as_local_path() method which raises an exception when called on a remote path.
  2. The dd usecase I showed above. A long time ago I had a custom formatting implementation that looked like this: linux.F("of={}", block_device) (similar to str.format). With f-strings becoming increasingly popular I am not sure if this is a viable route anymore. I am very hesitant to just make str() return the local representation as I explained above, but the other solution of introducing an official version of _local_str() is also not pretty...

What do you think?

The original intention behind this design was to make it hard to accidentally use a Path with the wrong host

Understood. And that is absolutely a good reason to introduce tbot's Path. From my perspective this becomes very elegant in the context of shell.copy(). And it is indeed convenient to manipulate paths on remote hosts this way.

I think protecting against these kind of mistakes is quite important.

Unfortunately, tbot's Path does not protect you in the context of your example using dd. Instead a file named host:filename would be created and written to -- but I guess you are already aware of that? It would only break with absolute paths. For "real" protection, the __str__() method would need to be deleted/abstract to really prevent misuse. However, that's a breaking change for sure, and it would make reasonable use of str(p) in outputs impossible. It would be possible to introduce an explicit method achieving the same and deprecating __str__() for at least one tbot release before actually removing it?!

How about __str__() converts the path to a file:// URI? This would not break use of str() and humans would still be able to read messages that include such URIs, e.g., file://host/path/to/file. Maybe exec() could parse for such URIs, match the host name of the URI, test it against the expected host's name, replace the whole URI with the POSIX path or raise an Exception depending on the host name comparison? (What if two hosts have the same name?) To avoid cofusion with actual file-URI args passed to exec() (maybe used with curl?) a custom URI scheme could be used, like tbot://host:/path/to/file or similar (must support relative paths)? This would be a fallback-security-measure only, for cases where a string conversion happens, as we want users to pass tbot Path objects as arguments to exec() whenever possible.


Methods could be written such that they expect a host argument, e.g., as_posix(host), as in "I want the POSIX version of that path but please throw an error on me if the path is not associated with that host. But people will cheat and shoot themselves in the foot with p.as_posix(p.host). That would be a great addition for the internal use of _local_str() at least.


The dd situation could be improved by adding an as_parameter() method. p.as_parameter("if={}") yields a new object with a private formatstring if={} which formats itself to if=/path/to/file when exec() and family actually consume this object. Before that it could be checked if the path object is associated with the expected host because the input is still an object, not (yet) a string. This would require to add and document a new class machine.linux.Parameter and to update all methods that shall be compatible to this new class (updated type signatures, else if to handle new class). Oh well, we can avoid this if the as_parameter() method returns a new class that is derived from machine.linux.Special. It already is handled by all the methods that shall be able to consume this new object and it's _to_string() method already receives the host on which the parameter shall be used -- beautiful! (?)

That does not make sense in all cases, though: shell.copy() receives two tbot Paths, both of which are used in exec() on the source host, but the second one, even though prepared with as_paremeter(), will still be relative to the second host, which makes perfect sense but breaks the host check in exec(). Can we assume this is the only context in which an exception would be required?


Yes, composition rather than inheritance. Very good idea. The concepts of pathlib.Path and machine.linux.Path are so different that it makes no sense to say "a tbot Path is also a PurePosixPath", because it is not, it only behaves similarly. This change would also make discussions about is_relative_to(), joinpath(), relative_to(), with_name(), with_stem(), with_suffix(), as_uri(), and as_posix() (which are all broken for tbot Path) obsolete, as the class simply does not provide them any more until any of them is properly implemented with tbot's Path concept in mind. It also makes no sense, now that I think about it, that tbot's Path is a PurePosixPath, because of all the filesystem operations that are added to tbot's Path. This is contradicting the original idea to distunguish between PurePosixPath and (PosixPath or Path) where exactly the filesystem operations are not provided in the Pure* variants by design.


Regarding my original problem: Assuming tbot's Path will no longer be a PurePosixPath by inheritance, it makes it "clear" that it cannot be used interchangeably. I like the idea of implementing a as_local_path() method that returns a new PurePosixPath, which is nothing else than a copy of the internal PurePosixPath tbot's Path will manage. The method however raises an Exception if the associated host is not of the subprocess variety, i.e., is not the localhost.


Trying to summarize:

  1. Refactor tbot's Path from using inheritance to using composition: Add a note in CHANGELOG about this being a possibly breaking change (methods will "disappear", tbot Paths can no longer be handed to methods expecting a pathlib.Path -- which is intended).
  2. Implement a machine.linux.Path.as_parameter(formatstring) method: Return a class instance derived from machine.linux.Special that implements _to_string(host) such that the host is checked against the host associated with the path and that formats the path according to the formatstring.
  3. Implement a as_local_path() method: Raise an exception if tbot Path is not associated with localhost, otherwise return a PurePosixPath.
  4. Add a host parameter to machine.linux.Path._local_str() method and compare the path's host with the given host. Raise an exception on mismatch. This shall remove a whole bunch of duplicated code (the host check is performed most times before _local_str() is used).
  5. Make machine.linux.Path.__str__() return a custom URI à la tbot://host:path which allows machine.linux.Shell.escape() implementations to scan for such URIs and issue a warning ((optionally?) raise an exception?) if detected (prevent users from accidentially providing str(machine.linux.Path()) arguments to exec() and similar methods).

I feel very positive about 1 - 4. For 4 I'd like to propose a pull request (tomorrow). 5 is only an idea and I am not sure whether you agree and I don't think I addressed the "f-strings are becoming more popular" concern well.

BTW thank you for your extensive feedback and hope you are not put off by my very long ramble...

Rahix commented

Nice, thanks a lot for the thoughts!

  1. Refactor tbot's Path from using inheritance to using composition: Add a note in CHANGELOG about this being a possibly breaking change (methods will "disappear", tbot Paths can no longer be handed to methods expecting a pathlib.Path -- which is intended).

Totally agree. I'd say we should try to fix the remaining methods which are currently broken at the same time as doing this.

  1. Implement a machine.linux.Path.as_parameter(formatstring) method: Return a class instance derived from machine.linux.Special that implements _to_string(host) such that the host is checked against the host associated with the path and that formats the path according to the formatstring.

That's what my linux.F was doing in the past... The problem I have with it is that it feels so wrong to need a "special magic" formatting method for tbot. I think for now it is good enough to have that as_posix() method from your point 4:

m.exec0("dd", "if=/dev/zero", f"of={block_dev.as_posix(m)}")

While this can be misused (m appears twice in the line above and there is no check that it's the same both times), I think it at least makes the programmer think for a second which machine they intend to use here.

  1. Implement a as_local_path() method: Raise an exception if tbot Path is not associated with localhost, otherwise return a PurePosixPath.

Yes, this is perfect for the few times you need it. Indeed, we can even return a PosixPath here because we know it is meant for the current host.

  1. Add a host parameter to machine.linux.Path._local_str() method and compare the path's host with the given host. Raise an exception on mismatch. This shall remove a whole bunch of duplicated code (the host check is performed most times before _local_str() is used).

Great idea! Although I'd rather create a new "public" method and keep _local_str() the same for backwards compatibility for now. I like the path.as_posix(host) you suggested, but if you have a more descriptive name, go for it.

  1. Make machine.linux.Path.__str__() return a custom URI à la tbot://host:path which allows machine.linux.Shell.escape() implementations to scan for such URIs and issue a warning ((optionally?) raise an exception?) if detected (prevent users from accidentially providing str(machine.linux.Path()) arguments to exec() and similar methods).

Interesting... I am wary of in-band information transfer, I feel like there are too many ways something like this can go wrong. For now I think with 1, 3, and 4 in place we've got a good enough solution for the time being - I'd rather see how they fare in the field before attempting someting like this. But the idea is an interesting one nonetheless, I think.

The question that remains is what we should do with __str__()... As long as we have it, people will be able to accidentally pass Paths to code which str()s its inputs. On the other hand, dropping it will remove any reasonable means of just printing the path, except for repr(path) (= f"{path!r}") which is a) not known by a lot of people and b) looks ugly for situations where you just want to quickly print out the path.

I like the path.as_posix(host) you suggested, but if you have a more descriptive name, go for it.

I suggest to not use as_posix(host) as the name. The other methods in tbot's Path are kept compatible to pathlib.Path as well as reasonably possible, but as_posix(host) breaks that.

I am wary of in-band information transfer

Yes, of course, this is insanity, but so is Python's type safety... And this approach could, I guess, at least be considered a mitigation (attempt).

The question that remains is what we should do with str()...

Can we come up with a format that is (almost) never a valid path and (almost) always leads to an error downstream if a user unwillingly uses it on the command line or forwards it to unaware code (which str()s it)? Of course, this is also just a mitigation, and my guess is that there is no such string.

Emit a warning every time machine.linux.Path.__str__() is used?

Or maybe accept that a non-ideal __str__() is a cost for gaining the benefits of the class.


I am not pushing the as_parameter() thing, actually I am not very interested yet because I don't have the need yet, but consider that if two people come up with basically the same solution to the same problem, it might be the proper solution.

m.exec0("dd", "if=/dev/zero", block_dev.as_parameter("of={}"))
Rahix commented

I suggest to not use as_posix(host) as the name. The other methods in tbot's Path are kept compatible to pathlib.Path as well as reasonably possible, but as_posix(host) breaks that.

Good point. Any idea for an alternative? as_str(host) maybe?

Of course, this is also just a mitigation, and my guess is that there is no such string.

I agree, I don't see a robust "always invalid" format either...

Emit a warning every time machine.linux.Path.str() is used?

Instead of a warning I'd just remove the method entirely (or make it raise a descriptive exception). What is the value of a method which cannot be called without producing a warning?

Or maybe accept that a non-ideal str() is a cost for gaining the benefits of the class.

Right, I think the best way forward is to stick with what we have right now and maybe revisit this in the future, if new ideas (or language features) come up.

BTW, do you want to work on this or should I go ahead and start with the refactoring we discussed in this issue so far?

Good point. Any idea for an alternative? as_str(host) maybe?

I am thinking at_host() with a rational like "give me a string representation of that path at the given host, knowing that you'll throw an exception if the string is not associated with that host". 👍 ? 👎 ?

What is the value of a method which cannot be called without producing a warning?

(Temporary) Backwards compatibility. The chance to tell the user why the method is not there although the user rightfully expects it to be there. The latter can be achieved with an exception as you said.

BTW, do you want to work on this or should I go ahead and start with the refactoring we discussed in this issue so far?

I was working on the at_host() method when I had to shift priorities. Please go ahead, I am not afraid to rebase. Unless you are faster you can leave the at_host() and as_local_path() to me.

Rahix commented

at_host() is great, exactly on point. Alright, I'll start working on this, I'll ping you once I'm done.

@Rahix Do you agree that as_local_path() is not warranted any more? One can use at_host(localhost) instead. Code that processes a tbot Path and wants to use it locally should also have access to the localhost instance (and can always cheat by calling p.at_host(p.host))?

Rahix commented

I think for the time being, yes, the alternatives you showed should be enough. If we find that pathlib.Path(p.at_host(local)) is becoming a common pattern, we can rethink this decision, but right now I don't think it is going to be that useful. Even more so because we're adding more and more of the pathlib.Path API to tbot's Path which means there will be even less reasons to need the upstream version.