sebastianbergmann/diff

Clarification of the unified output format

morozov opened this issue · 19 comments

The unified output format looks a bit different from the one implemented in GNU diff. Consider the two files:

↪  cat a.txt 
1
2
3
4
5
6
7
8

↪  cat b.txt 
1
2
3
4
5
6'
7
8

If compared by GNU diff, the comparison result is the following:

--- a.txt	2017-10-11 12:40:59.304512847 -0700
+++ b.txt	2017-10-11 12:41:14.776322246 -0700
@@ -3,6 +3,6 @@
 3
 4
 5
-6
+6'
 7
 8

The changed line is surrounded by unchanged lines (3 at most, by default) to represent the context.

If compared by the Differ, the result is the following:

--- Original
+++ New
@@ @@
 1
 2
 3
 4
 5
-6
+6'

The context is represented by 5 lines before and none after.

If the number of non-changed lines before the change is greater than 5 (must be controlled by AbstractChunkOutputBuilder::getCommonChunks(int $lineThreshold)), then they are entirely dropped:

--- Original
+++ New
@@ @@
-7
+7'

At the same time, the non-changed lines after the change are dropped independently on the size of the chunk.

Is this behavior by design? If yes, what's the exact meaning of $lineThreshold.

@SpacePossum What do you think?

Hi,

This is by design (I think) as between 1.4 and 2.0 I've made only minor tweaks to the output itself.
The meaning of $lineThreshold is to state how many of the same lines may be repeated before a new hunk must be created.
However this no option to state how many context lines should be added before and after each hunk.
The are of few other details about the output (1.4 ~ 2.0) that make it non-compat with patch and/or git apply and differs from the udiff GNU as generated for example by diff.

If you want to generate diff's that are patch, git apply compat and follows udiff I would suggest using the 3rd party output builder for diff ^2.0;
https://github.com/GeckoPackages/GeckoDiffOutputBuilder
disclaimer, I wrote that output builder so I'm biased ;)

@SpacePossum I only brought up GNU diff to illustrate how context lines can be displayed in a human-friendly way. In my particular case, it's not required that the output is compatible with patch, but it's important that it's well understood by a human (I believe the same is true for PHPUnit).

Prior to 2.0, comparing the same files as in the description would yield the following result:

--- Original
+++ New
@@ @@
 1
 2
 3
 4
 5
-6
+6'
 7
 8

and

--- Original
+++ New
@@ @@
 6
-7
+7'
 8

In that case, at least a few common leading and trailing common lines were always displayed. Wouldn't it be useful if worked like this (or even better like in GNU diff) in 2.0 too? Or it's irrelevant in the case of PHPUnit?

The GeckoDiffOutputBuilder looks awesome but using it in infection/infection would require some workaround because GeckoDiffOutputBuilder requires sebastian/diff:^2.0 while infection/infection supports PHPUnit ^6.0 and thus has to support sebastian/diff:^1.4.

I see the issue, however I see no easy solution for resolving it with a diff 1.4 constraint.
The main reason for making diff 2.0 was (together with adding line numbers by default which ^1.0 lacks) to provide a way of tweaking the output of the differ. This would help your case as well.

I'm not familiar with infection/infection TBH so it is hard for me to advice here. If you are stuck, for what ever reason to diff ^1.0 there are the following paths to resolve the issue (but I would still advice strongly to update to diff 2.0 as soon as you can):

  • open an PR towards 1.4 to (re)add context lines. It wouldn't be very hard to do, however I feel 1.4 is considered closed for new features (basically what happened between 1.3 and 1.4 is that 1.3 added a lot of context lines without control over it, while 1.4 adds limited context lines but also without control over it)
  • use diff 1.3 until you can use diff 2.0, which seems plausible on first sight as your package requires diff and as dev requirement phpunit, but you would be stuck with a lot trailing same lines on a diff

All maybe not very satisfying, but other means like I did for a package like PHP-CS-Fixer are doable, yet require a bunch or work and still be fragile in the end.

@SpacePossum there's no need to make any updates to diff 1.4, it already displays the context lines. My only question is, can diff 2.0 be fixed in a way that it also displays them?

It could, the easiest way to do it is make diff 2.x requiring the GeckoDiffOutputBuilder, than you would get the builder you need. Question is if @sebastianbergmann is open for that suggestion?

My suggestion is to update the composer.json of this package by adding GeckoDiffOutputBuilder as a requirement. This would help for @morozov case.

I'm not a big fan of moving all of the code from the GeckoDiffOutputBuilder directly into this repo. This is mainly because I've more plans with it.

The main difference between the two output builders:

  • diff 2 add "comments" like ''#Warning: Strings contain different line endings!", Gecko doesn't
  • diff 2 always creates minimal diffs, Gecko can be configured to add context lines before and after a hunk
  • diff 2 has an option to add line numbers, Gecko always adds the linenumbers
  • diff 2 has a "free text" header option (any string is allowed), Gecko requires valid header lines
  • Gecko is designed to be patch and git apply compatible, diff 2 is a more open format to suits PHPUnit better
  • Gecko has two other options, commonLineThreshold and collapseRanges

Thank you, @SpacePossum. Correct me if I'm wrong, but would @morozov's case not also be solved by him updating his composer.json?

Yes, well, the issue is he has diff 1.4 || 2.0 as requirement, making it not possible to optionally require packages that require diff 2.0 but don't support diff 1.4. This is a common limitation of composer (and most package managers).

diff 2 always creates minimal diffs, Gecko can be configured to add context lines before and after a hunk

Does it mean that there is a bug in diff 2 since it preserves up to 5 leading common lines? It's not the minimal diff.

To make sure we're on the same page, I'm only trying to understand what's the expected output of diff 2 and would it make sense for PHPUnit (in the first place) to consistently (like GNU diff and Gecko do) display common lines around the changed ones. Right now, these lines are displayed inconsistently. Not sure if this should be solved by introducing a dependency.

Does it mean that there is a bug in diff 2 since it preserves up to 5 leading common lines? It's not the minimal diff.

you give the answer here:

I'm only trying to understand what's the expected output of diff 2 and would it make sense for PHPUnit (in the first place) to consistently (like GNU diff and Gecko do) display common lines around the changed ones. Right now, these lines are displayed inconsistently.

:)
The output by diff (by default) is not defined anywhere AFAIK. It has some history about it of course. The question you've was raised here: #64 when building the current output. diff 1.4 has the same inconsistent output btw.

Not sure if this should be solved by introducing a dependency.

With the dependency you can configure the number of common lines, so I think it will help for you?

Then according to the announcement in #64:

the output is now minimal around the chunks

the current behavior of displaying up to 5 leading common lines is a bug in diff 2.0 (diff 1.4 was not expected to produce the minimal output). And the fact that the other common lines are not displayed (unlike 1.4) is by design.

Do you think this bug should be fixed in diff 2.0?

With the dependency you can configure the number of common lines, so I think it will help for you?

Not sure if it's going to happen given the benefit/complexity ratio. Another reason against using particularly Gecko is that infection may need the line numbers to be not displayed. The reason is that infection compares the file contents reconstructed from the PHP AST, so depending on the formatting of the source file, the line numbers may be incorrect and misleading. In this case, the context lines would much more helpful.

Key is that diff never promised a formalized output, so it is near impossible what to consider unexpected.
TBH I'm kind of a bit lost about what you want to like to see changed and in which package.

I want a consistent output of the common lines (e.g. 3 before and after the change) in diff 2.

@morozov , that's the thing repeated in each answering comment here. Internal diff (both v1 and v2) doesn't have consistent nor standardized output. If you need it, you need to align on some specification. There is no such one for internal diff used by PHPUnit. Doing it is way more than "eg adding lines".
Proposing specification like "unified diff but no line numbers" is weird, big 👎 for custom variations of serious formats.

I must admit that I don't fully get your use case. Eg, in your scenario, input file could be 100kb one-liner of php (extra ugly, yes), and reconstructed content of code generated from PHP AST will have few hundreds probably, and PHP AST doesn't care about whitespaces at all.
I would rather get number of modified line not post-factum, basing on diff output between real code and regenerated one, but before mutating - during applying the mutation, you could figure out line number still in original code, without any differ->diff(orig, new) execution yet

that's the thing repeated in each answering comment here

@keradus what do you mean?

I must admit that I don't fully get your use case. […] I would rather get number of modified line […] before mutating

This would help reliable determine the line in which mutation happened but how then can we display the mutation? The diff is mostly a visualization tool which tells, for example: "I replaced greater than with less than in this condition":

--- Original
+++ New
@@ @@
-    if ($x > 0) {
+    if ($x < 0) {

Internal diff (both v1 and v2) doesn't have consistent nor standardized output. If you need it, you need to align on some specification. There is no such one for internal diff used by PHPUnit. Doing it is way more than "eg adding lines".

Please do not quote me on what I didn't say. I didn't say "eg adding lines" and I didn't ask for any changes. I'm saying that the output of common lines should be consistent and predictable (right now it's not). A reasonable default could be 3 lines around the change (same as in GNU diff).

Proposing specification like "unified diff but no line numbers" is weird, big 👎 for custom variations of serious formats.

Please do not quote me on what I didn't say. Right now sebastian/diff doesn't show line numbers by default but it's configurable. It's perfectly fine. I'm only asking if it makes sense to display context lines in a consistent manner. Right now it's inconsistent.

If maintainers think this conversation doesn't contribute to the value of the project, please feel free to close it.

I think it boils down to what the default for diff 2.0 should be, as it is already possible to make any other output by passing a custom builder.
What the default should be is up to the @sebastianbergmann / PHPUnit community,
I've no strong preference on it as I use diff only with custom (strict udiff) output builders.

With #75 merged, I think this can be closed.