mliebelt/pgn-parser

Improvement: Handling dirty pgn files

MichaelLeeHobbs opened this issue · 14 comments

Thanks for you help with the issue I posted. I wrote the function to clean pgn's.

const clean = (pgn) => {
    const regex = /[\d*a-hxRNBQKPO[\]{};.=/ \r\n+-]/
    let isOpen = false
    return pgn.split('').map((c, i) => {
        if (c === ']' && isOpen) isOpen = false
        else if (c === '[') isOpen = true
        const isValid = regex.test(c)
        if (!isValid && !isOpen) {
            console.log(`cleaning: ${c} at index: ${i}`)
            return ''
        }
        return c
    }).join('')
}

Some documentation would help here. If I understand the code correct, it does the following:

  • Ensure that only PGN is checked, so don't check tags.
  • Removes all characters that are not in the regular expression.

How about comments that may include any characters? How about NAGs like ∞, ⩲, ...?

I understand that a tool like that is needed, but I think that your approach is too simplistic. I have started expanding the pure parsing by doing a post parsing step. This may include validations, to deny accepting wrong PGN, or corrections (like yours) to fix known deficits. So I think that something like your code could be used, but only if it does not remove valid content.

I think that your approach is too simplistic.

Totally agree. Just sharing what I'm using for my use. Consider it a starting point or something alone those lines. If nothing else a seed for someone else who comes along with the same issue.

Edit: I'm willing to help and will take a look at the PGN spec latter. If there is a spot in the code where you think something like this should go let me know and I'll start on adding this with test to the branch I made.

I think the problem is more difficult than we think at the moment. I though about a PGN checker / converter that ensures that PGN complies to standard PGN, and this will be a part of it.

The problem we face at the moment is, that the parser has to be strict on some parts, and will break if there are additional character in it. So cleaning up beforehand may help, but only if you don't delete something that is allowed. I thought lately about having an (automatic) splitting of big files into game-chunks, where the parser is then used. This would allow more possibilities, to react on ParseErrors ...

So let us keep that as idea, and perhaps at some point in time, we are able to build something that cleans up, and does not change what is not needed.

I'm taking a look at https://www.openingtree.com/ ie https://github.com/openingtree/openingtree. It parses the BenkoGambit.pgn without issue. It also uses peg. Not sure how it handles dirty files yet.

Do that. In that special example, that was the one result that contained an additional back tick (both in the tag "Result" and in the pgn text. And both are not ok, and should lead to an error.

After a little digging I found that openingtree splits the pgn into individual games and parses each separately. It sort of silently fails/skips bad games. If you open the console you can see log messages related to the bad games.

Edit: Will be taking a deeper look at this issue this weekend. I'm working on a bot that will only use stockfish if it cannot find a move in the db for the given position. The idea is you can select an elo range and opening and the bot will randomly select a move from a game a human played as close to the range and opening. Obviously, I'm going to need millions of games in the DB. Finally, putting all these enterprise scaling skill to something a little more fun than the average business app.

OpeningTree splits pgn's into individual pgn's and parses them one at a time. See: https://github.com/openingtree/openingtree/blob/master/src/app/iterator/IteratorUtils.js If one fails it logs the event and moves to the next. I'm assuming this slows down the overall parse. I'm thinking you could do something similar as a option.

...
parse(pgnStr, {individual: true}) // where individual defaults to false and could be true/false

Could have a fallback option where it first tried to parse the whole pgn string then falls back to individual. I'll come back to this latter. Would like to get the bot working first to see if the concept is viable.

I download all my games from chess.com and noticed the cleaner was removing @. After investigation I found the following

  1. @f4 {[%clk 0:04:34.7]} 6... N@a6 {[%clk 0:04:10.9]}

So the cleaner was doing it's job correctly as I can find nothing on that notation in http://www.saremba.de/chessgml/standards/pgn/pgn-complete.htm or Chess.com's own notation guide.

Edit: I found the reason [Variant "Crazyhouse"]. Maybe have the parser check for the Variant tag and if found it skips?

As described in #73 the job would be much easier if things would not be globally mixed up:

  1. Split games, each game into tags and pgn.
  2. Use the parser to read what you need (the parser is able to read the N@f3 notation already).
  3. Only clean up what is necessary to do.

I first would like to see the many different things that break, before doing a cleanup that may break much more without a need.

See #74 how to implement this idea.

Parser throws on @Nf3 btw.

Sorry, that is the wrong notation. The notation looks like this: 1. e4 e5 2. Nf3 Nf6 3. Nxe5 d6 4. Nxf7 Kxf7 5. Bc4+ d5 6. exd5 N@d6

So first figure, then drop symbol, then column and row.

That is annoying. Didn't think about the notation being out of order.

I've dropped the pull request. I'm ending up with a lot of pgn tools. Going to make a pgn tools repo later this week and put it in there.

Ok, the proposal of cleaning "dirty" pgn files should be the following:

  1. Split the games first.
  2. Parse the games individually, and react accordingly.
  3. Make the grammar as robust as possible, and allow to read from the result problematic parts of the pgn (works with tags well, does not work well with the SAN notation itself).