Refactor filepath equality checks
dcastro opened this issue · 8 comments
Clarification and motivation
In the past, we've had many bugs related to path equality. A few examples (there are probably more):
We've solved them mainly by sprinkling the codebase with normalise
, normaliseWithNoTrailing
, expandIndirections
, dropTrailingPathSeparator
, canonizeLocalRef
etc.
But this approach is extremely error-prone.
A more principled approach might be to canonicalize filepaths using canonicalizePath
.
We should canonicalize:
- All filepaths returned by
git ls-files
- All local references
Here's a suggestion (haven't actually tried it):
-newtype RepoInfo = RepoInfo (Map FilePath (Maybe FileInfo))
+newtype RepoInfo = RepoInfo (Map CanonicalPath (Maybe FileInfo))
data FileInfo = FileInfo
{ _fiReferences :: [Reference]
, _fiAnchors :: [Anchor]
}
data Reference = Reference
{ rName :: Text
, rLink :: Text
, rAnchor :: Maybe Text
, rPos :: Position
+ , rInfo :: ReferenceInfo
}
+ -- We probably won't need the type `LocationType` after this.
+data ReferenceInfo
+ = RICurrentFile
+ | RIOtherFile CanonicalPath
+ | RIExternal
+ | RIOtherProtocol
+newtype CanonicalPath = UnsafeCanonicalPath { unCanonicalPath :: FilePath }
+ deriving newtype (Eq, Show, Ord)
+
+canonicalizePath :: FilePath -> IO CanonicalPath
+canonicalizePath = UnsafeCanonicalPath . Directory.canonicalizePath
Then we should be able to simply use the Eq
and Ord
instances as usual for comparing CanonicalPath
s, checking if a CanonicalPath
exists in a Map
/Set
, etc.
Let's review the uses of normalise
and similar functions, and try to get rid of as many as possible.
Acceptance criteria
- We're canonicalizing paths at the boundaries (i.e. after reading filepaths from
git ls-files
and from scanning markdown files) - We're avoiding manually "massaging" paths using
normalise/dropTrailingPathSeparator/etc
throughout the codebase
I have been trying to tackle this refactor with the proposed approach and I think that the result looks promising. Just some details:
With this change, files in the output which used to be relative become absolute (canonical) unless we store also their non canonical representation or manage to get the relative path again before printing. Should we add this overhead in order to keep relative paths in the program output? Or let the output to show canonical paths from now on?
Also, currently Reference
s are constructed with no (pure) access to the repository root path, so there is no way to canonicalize absolute links at that point, which is required to safely check if the link corresponds to the same file or to other one. I have tried this and currently xrefcheck does not report a file as 'current file' when the reference is for example /README.md#anchor
or ./README.md#anchor
from inside the same README.md file. I can try to rearrange things in order to fix this with this refactor, or open a new issue. What do you think @Martoon-00?
Should we add this overhead in order to keep relative paths in the program output?
Good concern. I would go with printing the old, non-canonicalized filepaths.
Let me read through the code for a while, I'll respond to your last paragraph then.
I have tried this and currently xrefcheck does not report a file as 'current file' when the reference is for example /README.md#anchor or ./README.md#anchor from inside the same README.md file
You are referring to "(current file)" next to the link in the xrefcheck's output?
I guess something went wrong here. Initially it was named "local" and was intended to be a property related to the link's format (i.e. denoted whether the link contained the part with filepath or not). And in fd96731 it was renamed, and you well spotted that "current file" is likely to be interpreted differently now 🤔. It was not meant to be smart in understanding whether we point to the same file or not.
Let's probably take that commit as example and rename everything related to CurrentFileLoc
again to "file-local" or something like that which would let us be more unambigous. Could you make a separate PR for this?
It makes sense, so that tag just classifies file links into relative (./file.md#example), absolute (/file.md#example) or file-local (#example). As you said, I misunderstood it because of the "current file" text.
I will change it before applying the refactor, "file-local" sounds good and less unambiguous to me.
I have been applying the refactor in a branch. My main concern regarding the result is currently that I think that we should try to specify how we want to show filepaths in the program output, because the current behaviour may be difficult to exactly replicate after the refactor. These are some details that I have noticed by observing the golden test expected outputs:
- The output is different if we execute xrefcheck -r some-path compared with executing xrefcheck inside some-path.
- When reporting errors, files are shown with a leading "./" when the link is absolute and with no leading "./" when the link is relative.
I will try to think about a possible strategy and propose it here.
The output is different if we execute xrefcheck -r some-path compared with executing xrefcheck inside some-path.
I think, let's not care about preserving this. Running from within the repository can produce the same output as if run from the repository root, this may be even considered as a benign behaviour.
Although to make sure I get it correctly, I would be glad to see the diff myself. Could you create a pull request, maybe in a draft state, so that we could continue the discussion there?
When reporting errors, files are shown with a leading "./" when the link is absolute and with no leading "./" when the link is relative.
Not vice versa?
I think, let's not care about preserving this
Could you create a pull request, maybe in a draft state
👍
Not vice versa?
Right, not vice versa. These are examples from the current expected outputs in golden tests. Look at how the "file that does not exist" is shown:
➥ In file dir1/dir2/d2f1.md
bad reference (relative) at src:28:1-31:
- text: "bad-casing-file-rel"
- link: D2F2.md/
- anchor: -
File does not exist:
dir1/dir2/D2F2.md/
➥ In file dir1/dir2/d2f1.md
bad reference (absolute) at src:42:1-22:
- text: "file-abs-2"
- link: /d1f1.md
- anchor: -
File does not exist:
./d1f1.md
That's hilarious. I believe displaying it like this can be considered reasonable from several points of view, but anyway there is no need to stick to it, let's consider on a PR basis whether the change in the output format is worth it or not.