fmang/opustags

Support for newlines in tags

hackerb9 opened this issue · 11 comments

Request

Please add support for newlines in tags. While I do not usually add newlines to the tags, it is extremely common to find them already existing in .opus files that have been downloaded, especially from YouTube via youtube-dl -x.

The problem

Currently, when using opustags -e on a file with newlines in the comments, opustags will not give any warning that it cannot preserve the newlines. Instead, when the file is saved, opustags dies badly, incorrectly stating that a tag is malformed.

warning: Leaving foobird.opus.BejwAL.opustags on the disk.
Stravinsky - The Firebird _ Gergiev · Vienna Philarmonic · Salzburg Festival 2000.opus: error: Malformed tag: Great presentation of the Vienna Philharmonic conducted by the russian Maestro Valery Gergiev, in one of the most powerful and greatest presentation of The Firebird (L'Oiseau de feu) of Igor Stravinsky at Salzburg Festival 2000.

Possible Solution

I suggest automatically converting a tag with multiple paragraphs into a multiple instances of the same tag. For example:

title=Stravinsky: The Firebird / Gergiev · Vienna Philarmonic · Salzburg Festival 2000
date=20111106
DESCRIPTION=Gran presentación de la Orquesta Filarmónica de Viena, conducida por el director ruso Valery Gergiev en [a mi juicio personal] una de las más grandes y magníficas interpretaciones del Pájaro de Fuego (L'Oiseau de feu) de Igor Stravinsky, que se tenga conocimiento, durante el Festival de Salzburgo 2000.

Great presentation of the Vienna Philharmonic conducted by the russian Maestro Valery Gergiev, in one of the most powerful and greatest presentation of The Firebird (L'Oiseau de feu) of Igor Stravinsky at Salzburg Festival 2000.

would become

title=Stravinsky: The Firebird / Gergiev · Vienna Philarmonic · Salzburg Festival 2000
date=20111106
DESCRIPTION=Gran presentación de la Orquesta Filarmónica de Viena, conducida por el director ruso Valery Gergiev en [a mi juicio personal] una de las más grandes y magníficas interpretaciones del Pájaro de Fuego (L'Oiseau de feu) de Igor Stravinsky, que se tenga conocimiento, durante el Festival de Salzburgo 2000.
DESCRIPTION=
DESCRIPTION=Great presentation of the Vienna Philharmonic conducted by the russian Maestro Valery Gergiev, in one of the most powerful and greatest presentation of The Firebird (L'Oiseau de feu) of Igor Stravinsky at Salzburg Festival 2000.

Reuse of a tag like this is explicitly allowed in the Vorbis comment specification, so I believe this is a good solution.

Example Kludge

I have a kludge which kind of works to add repeated tags, but only as long as the contents of a tag does not have an equals sign (=).

opustags foo.opus | awk -F= '
    $2 != "" { tag=$1; }; 
    $2 == "" { printf("%s=", tag) }; 
    {print}
 ' | opustags --set-all foo.opus -o bar.opus

If I'm lucky and there were no equals signs in the data, I can then use opustags -e bar.opus without error. I present this fragile kludge only as an example. Any generally applicable solution will require modifying opustags.

fmang commented

The multilinetag to multiple one-line tags conversion is not correct. Tag reuse is allowed for cases like songs with multiple artists, and is not semantically equivalent to multiline tags. While not as convenient as --edit, you should be able to robustly edit files with multiline tags using --add and --set.

The current format used for input and output is not ready for multiline tags, but maybe it could be extended to support escape sequences, like adding a backslash at the end of a line to indicate continuation, and supporting escaped backslashes too in case the tag does terminate with a literal backslash. Standard escaping stuff.

Alternatively, more formats could be supported, like YAML or JSON, so that we get the benefits of a robust format without reinventing the wheel. Plus, opustags would get even easier to integrate into external programs. However, I believe the current format is more user-friendly with --edit.

Would you be interested in contributing a patch for either option?

Okay, scratch that first idea. I didn't know that tag reuse was not semantically equivalent. Too bad as that would have made editing with a text editor easier than backslashes. (I've edited enough .Xresources files to know that that is not fun. Especially when you've got an invisible space after the backslash!)

I do think the current output format is nice because a text editor makes sense for editing multiple tags at once. Having to ensure I add a backslash at the end of each line can be tedious.

How about this as a proposal for the format: since tag names should not begin with whitespace, declare that any line that begins with a HORIZONTAL TAB character is a continuation of the previous tag. On input, the initial TAB is removed from continuation lines. On output, a TAB is added after an embedded LINEFEED.

Side note

Unimportant thoughts that occur to me as I think about implementing this:

  • For DOS CR-LF support, leaving CARRIAGE RETURN unchanged on input or output should suffice.
  • As a convenience to people editing an opustags file, on input a completely empty line can be treated the same as a line containing only a single TAB.

Yes, there are problems with the idea: for example, a person may try to add spaces instead of a single TAB and end up with an inscrutable error, or they may use multiple TABs for indentation and be surprised when only the first one is removed. However, it would be a better situation than the current state.

  • There would be no loss of information as `opustags foo.opus | opustags foo.opus --set-all -y' would be byte-for-byte the same.
  • Most text editors already handle such indentation nicely (as opposed to the backslash escape before newlines).
  • And the format is human friendly as people are familiar with seeing indentation for continuation. In fact, it would make the output from opustags easier to read in the case of tags with embedded newlines.

I might be able to make a patch to do such a thing, but I'd want to know if it's something you would even want or if there's some fatal flaw in my logic that I'm overlooking.

fmang commented

Your proposal to use exactly one tab as a prefix for continuation is intuitive, convenient, robust and simple. I like it.

Concerning the edge case, I suggest that:

  1. If opustags detects a line starting with spaces, it fails with an explicit error.

  2. Empty lines are implictly treated like a continuation (your idea is good), unless it’s the last continuation line. Basically, empty lines separating tags should be ignored:

    X=123
            Continuation.
    
    Y=456
    
  3. If the equal sign is immediately followed by a line feed, then the value should begin at the first continuation line, like this:

    X=
            First line.
            Second line.
    

    I think this is more readable, and should be preferred when outputting the tags.

These three points are nice to haves, so I suggest you ignore them for your first PR.

Overall, I think this is a good enhancement and would be glad to merge it if you make a patch.

Okay, I have a basic implementation working which roundtrips correctly for my files. I appear to have an error in my memory allocation logic, though. (Core dumps when I test read_comments with varying lengths of data). It probably has a simple fix, but it's past midnight and I don't see it right now. Do you?

I used realloc() to fit with the existing code that uses C's getline() and free(), but maybe getline should be replaced with whatever the C++ equivalent is so we don't have to worry about these sort of memory errors.

fmang commented

Let’s focus on the printing first. I notice that you use insert() to add the tab characters, which is costly. You’d better reuse the first scan and compute newline_count instead of has_newlines, then allocate an std::string of size comment.size() + newline_count, and fill it character by character using operator [].

That sounds like a reasonable optimization to make.

I made the optimization.

The worst case scenario for the std::string insert method is a comment with 64K of newlines. Runtime for that improved by 15x, but the absolute gain was small (went from 47 ms to 3 ms). Was it worth it given that copying by hand is a little bit uglier and worst case scenarios are rare? Maybe not yet. I believe it will be more obviously a good thing once comments larger than 64K are allowed.

fmang commented

Note that the performance multiplier you measured is not constant. The complexity went from O(n^2) to O(n). I don’t think opustags will suffer from performance issues anytime soon, but writing linear-time implementations is easy enough to be worth the effort I think.

Could you please squash your commits and make a pull request for the printing part only? It’s gonna be easier for me to review.

fmang commented

@hackerb9 I plan to make a new release soon and consider solving this issue too. Are you interested in making the pull request, or would you rather I finalize this myself?

fmang commented

I’ve started looking further into it in order to finalize the feature. I’ll probably keep your implementation proposal for history but began reworking it.

fmang commented

Ready for the next release.