niklasf/python-chess

`chess.pgn.read_headers` inserts empty header entries related to newlines and empty movetext

MatijaSi opened this issue · 6 comments

I am trying to parse a largeish (7,000,000 games) PGN file using read_headers. However, I only managed to scan 84,039 games before it stopped as if it had finished (no error message).

I managed to narrow it down to this testcase:

testcase = """
[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Li Chao2"]
[Black "Gashimov,V"]
[Result "0-1"]
[WhiteElo "2596"]
[BlackElo "2758"]
[EventDate "2009.11.21"]
[HashCode "00000000"]
[TotalPlyCount "0"]


{ Both Chinese players were late to the board for game two and were
defaulted }


0-1


[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Naiditsch,A"]
[Black "Svidler,P"]
[Result "1/2-1/2"]
[WhiteElo "2689"]
[BlackElo "2754"]
[ECO "C97"]
[Opening "Ruy Lopez"]
[Variation "closed, Chigorin defence"]
[EventDate "2009.11.21"]
[HashCode "312a25cf"]
[TotalPlyCount "121"]


1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 O-O 8.
c3 d6 9. h3 Na5 10. Bc2 c5 11. d4 Qc7 12. d5 c4 13. Nbd2 Nb7 14. Nf1 Nc5
15. Kh1 Bd7 16. Ng3 Ne8 17. Nh2 Bh4 18. Rf1 Qd8 19. Nf3 Bxg3 20. fxg3 f5
21. Ng5 h6 22. Ne6 Bxe6 23. dxe6 fxe4 24. Rxf8+ Kxf8 25. Qh5 Qf6 26. Be3
Nd3 27. Qg4 Nc7 28. Qxe4 Qxe6 29. Qh7 Qg8 30. Bxd3 cxd3 31. Qxd3 Qe6 32.
Qh7 Qg8 33. Rf1+ Ke7 34. Qg6 Rf8 35. Rd1 Rf6 36. Qg4 g5 37. Kh2 Qxa2 38.
Qc8 Ne8 39. Bb6 Kf8 40. Bd8 Rf7 41. h4 gxh4 42. Bxh4 Qc4 43. Qxa6 Kg7 44.
Qa8 Qg4 45. Ra1 Qd7 46. Qb8 Qc6 47. b4 Kg6 48. Ra8 Ng7 49. Ra2 Qd5 50. Ra6
Rd7 51. Rb6 Nf5 52. Rxb5 Qf7 53. Rb6 Kh7 54. Qa8 Nxh4 55. gxh4 Qf4+ 56. Kh1
Qxh4+ 57. Kg1 Rg7 58. Qf3 Qe1+ 59. Kh2 Qh4+ 60. Kg1 Qe1+ 61. Kh2 1/2-1/2
"""

import io

import chess.pgn

f = io.StringIO(testcase)

while True:
    headers = chess.pgn.read_headers(f)
    print(headers)

    # Stops on any falsy result, including an empty Headers()
    if not headers:
        break

Which prints:

Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Li Chao2', Black='Gashimov,V', Result='0-1', WhiteElo='2596', BlackElo='2758', EventDate='2009.11.21', HashCode='00000000', TotalPlyCount='0')
Headers()

Investigating a bit further, there seems to be some issue related to newlines between games.

For example:

testcase = """
[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Li Chao2"]
[Black "Gashimov,V"]
[Result "0-1"]
[WhiteElo "2596"]
[BlackElo "2758"]
[EventDate "2009.11.21"]
[HashCode "00000000"]
[TotalPlyCount "0"]

{ Both Chinese players were late to the board for game two and were
defaulted }

0-1

[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Naiditsch,A"]
[Black "Svidler,P"]
[Result "1/2-1/2"]
[WhiteElo "2689"]
[BlackElo "2754"]
[ECO "C97"]
[Opening "Ruy Lopez"]
[Variation "closed, Chigorin defence"]
[EventDate "2009.11.21"]
[HashCode "312a25cf"]
[TotalPlyCount "121"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 O-O 8.
c3 d6 9. h3 Na5 10. Bc2 c5 11. d4 Qc7 12. d5 c4 13. Nbd2 Nb7 14. Nf1 Nc5
15. Kh1 Bd7 16. Ng3 Ne8 17. Nh2 Bh4 18. Rf1 Qd8 19. Nf3 Bxg3 20. fxg3 f5
21. Ng5 h6 22. Ne6 Bxe6 23. dxe6 fxe4 24. Rxf8+ Kxf8 25. Qh5 Qf6 26. Be3
Nd3 27. Qg4 Nc7 28. Qxe4 Qxe6 29. Qh7 Qg8 30. Bxd3 cxd3 31. Qxd3 Qe6 32.
Qh7 Qg8 33. Rf1+ Ke7 34. Qg6 Rf8 35. Rd1 Rf6 36. Qg4 g5 37. Kh2 Qxa2 38.
Qc8 Ne8 39. Bb6 Kf8 40. Bd8 Rf7 41. h4 gxh4 42. Bxh4 Qc4 43. Qxa6 Kg7 44.
Qa8 Qg4 45. Ra1 Qd7 46. Qb8 Qc6 47. b4 Kg6 48. Ra8 Ng7 49. Ra2 Qd5 50. Ra6
Rd7 51. Rb6 Nf5 52. Rxb5 Qf7 53. Rb6 Kh7 54. Qa8 Nxh4 55. gxh4 Qf4+ 56. Kh1
Qxh4+ 57. Kg1 Rg7 58. Qf3 Qe1+ 59. Kh2 Qh4+ 60. Kg1 Qe1+ 61. Kh2 1/2-1/2
"""

import io

import chess.pgn

f = io.StringIO(testcase)

games = []
while True:
    headers = chess.pgn.read_headers(f)
    games.append(headers)

    # None marks the end of input
    if headers is None:
        break

Leads to games being (note the empty Headers() between both "real" games):

[Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Li Chao2', Black='Gashimov,V', Result='0-1', WhiteElo='2596', BlackElo='2758', EventDate='2009.11.21', HashCode='00000000', TotalPlyCount='0'),
 Headers(),
 Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Naiditsch,A', Black='Svidler,P', Result='1/2-1/2', WhiteElo='2689', BlackElo='2754', ECO='C97', Opening='Ruy Lopez', Variation='closed, Chigorin defence', EventDate='2009.11.21', HashCode='312a25cf', TotalPlyCount='121'),
 None]

While the file from the original issue:

testcase = """
[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Li Chao2"]
[Black "Gashimov,V"]
[Result "0-1"]
[WhiteElo "2596"]
[BlackElo "2758"]
[EventDate "2009.11.21"]
[HashCode "00000000"]
[TotalPlyCount "0"]


{ Both Chinese players were late to the board for game two and were
defaulted }


0-1


[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Naiditsch,A"]
[Black "Svidler,P"]
[Result "1/2-1/2"]
[WhiteElo "2689"]
[BlackElo "2754"]
[ECO "C97"]
[Opening "Ruy Lopez"]
[Variation "closed, Chigorin defence"]
[EventDate "2009.11.21"]
[HashCode "312a25cf"]
[TotalPlyCount "121"]


1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 O-O 8.
c3 d6 9. h3 Na5 10. Bc2 c5 11. d4 Qc7 12. d5 c4 13. Nbd2 Nb7 14. Nf1 Nc5
15. Kh1 Bd7 16. Ng3 Ne8 17. Nh2 Bh4 18. Rf1 Qd8 19. Nf3 Bxg3 20. fxg3 f5
21. Ng5 h6 22. Ne6 Bxe6 23. dxe6 fxe4 24. Rxf8+ Kxf8 25. Qh5 Qf6 26. Be3
Nd3 27. Qg4 Nc7 28. Qxe4 Qxe6 29. Qh7 Qg8 30. Bxd3 cxd3 31. Qxd3 Qe6 32.
Qh7 Qg8 33. Rf1+ Ke7 34. Qg6 Rf8 35. Rd1 Rf6 36. Qg4 g5 37. Kh2 Qxa2 38.
Qc8 Ne8 39. Bb6 Kf8 40. Bd8 Rf7 41. h4 gxh4 42. Bxh4 Qc4 43. Qxa6 Kg7 44.
Qa8 Qg4 45. Ra1 Qd7 46. Qb8 Qc6 47. b4 Kg6 48. Ra8 Ng7 49. Ra2 Qd5 50. Ra6
Rd7 51. Rb6 Nf5 52. Rxb5 Qf7 53. Rb6 Kh7 54. Qa8 Nxh4 55. gxh4 Qf4+ 56. Kh1
Qxh4+ 57. Kg1 Rg7 58. Qf3 Qe1+ 59. Kh2 Qh4+ 60. Kg1 Qe1+ 61. Kh2 1/2-1/2
"""

import io

import chess.pgn

f = io.StringIO(testcase)

games = []
while True:
    headers = chess.pgn.read_headers(f)
    games.append(headers)

    # None marks the end of input
    if headers is None:
        break

Leads to games being (again plenty of empties):

[Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Li Chao2', Black='Gashimov,V', Result='0-1', WhiteElo='2596', BlackElo='2758', EventDate='2009.11.21', HashCode='00000000', TotalPlyCount='0'),
 Headers(),
 Headers(),
 Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Naiditsch,A', Black='Svidler,P', Result='1/2-1/2', WhiteElo='2689', BlackElo='2754', ECO='C97', Opening='Ruy Lopez', Variation='closed, Chigorin defence', EventDate='2009.11.21', HashCode='312a25cf', TotalPlyCount='121'),
 Headers(),
 None]

So my code in the original issue is slightly wrong: it checks whether headers is falsy:

if not headers:
    break

instead of comparing them to None:

if headers is None:
    break
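As a stopgap on the caller's side, the loop can stop only on None and skip the empty entries. The following is a minimal sketch, not part of python-chess: iter_headers is a hypothetical helper name, and it takes the reader function as a parameter so that it can be tried with any callable that returns None at end of input.

```python
from typing import Callable, Iterator, Optional, TypeVar

H = TypeVar("H")


def iter_headers(read: Callable[[object], Optional[H]], handle: object) -> Iterator[H]:
    """Yield each non-empty result of read(handle) until it returns None.

    Hypothetical helper: `read` is assumed to return None at end of input
    and a (possibly empty, hence falsy) mapping otherwise.
    """
    while True:
        headers = read(handle)
        if headers is None:
            break  # only None signals the end of input
        if not headers:
            continue  # skip empty placeholder entries
        yield headers
```

With python-chess installed, this would presumably be called as iter_headers(chess.pgn.read_headers, f). It only hides the empty entries; it does not change how the parser splits games.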

However, this is probably still a bug in the library, since an empty line probably shouldn't count as an empty game. Additionally, it's somehow related to the movetext being empty, since if we provide movetext we get a different result:

testcase = """
[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Li Chao2"]
[Black "Gashimov,V"]
[Result "0-1"]
[WhiteElo "2596"]
[BlackElo "2758"]
[EventDate "2009.11.21"]
[HashCode "00000000"]
[TotalPlyCount "0"]


1. e4 e5 0-1


[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Naiditsch,A"]
[Black "Svidler,P"]
[Result "1/2-1/2"]
[WhiteElo "2689"]
[BlackElo "2754"]
[ECO "C97"]
[Opening "Ruy Lopez"]
[Variation "closed, Chigorin defence"]
[EventDate "2009.11.21"]
[HashCode "312a25cf"]
[TotalPlyCount "121"]


1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 O-O 8.
c3 d6 9. h3 Na5 10. Bc2 c5 11. d4 Qc7 12. d5 c4 13. Nbd2 Nb7 14. Nf1 Nc5
15. Kh1 Bd7 16. Ng3 Ne8 17. Nh2 Bh4 18. Rf1 Qd8 19. Nf3 Bxg3 20. fxg3 f5
21. Ng5 h6 22. Ne6 Bxe6 23. dxe6 fxe4 24. Rxf8+ Kxf8 25. Qh5 Qf6 26. Be3
Nd3 27. Qg4 Nc7 28. Qxe4 Qxe6 29. Qh7 Qg8 30. Bxd3 cxd3 31. Qxd3 Qe6 32.
Qh7 Qg8 33. Rf1+ Ke7 34. Qg6 Rf8 35. Rd1 Rf6 36. Qg4 g5 37. Kh2 Qxa2 38.
Qc8 Ne8 39. Bb6 Kf8 40. Bd8 Rf7 41. h4 gxh4 42. Bxh4 Qc4 43. Qxa6 Kg7 44.
Qa8 Qg4 45. Ra1 Qd7 46. Qb8 Qc6 47. b4 Kg6 48. Ra8 Ng7 49. Ra2 Qd5 50. Ra6
Rd7 51. Rb6 Nf5 52. Rxb5 Qf7 53. Rb6 Kh7 54. Qa8 Nxh4 55. gxh4 Qf4+ 56. Kh1
Qxh4+ 57. Kg1 Rg7 58. Qf3 Qe1+ 59. Kh2 Qh4+ 60. Kg1 Qe1+ 61. Kh2 1/2-1/2
"""

import io

import chess.pgn

f = io.StringIO(testcase)

games = []
while True:
    headers = chess.pgn.read_headers(f)
    games.append(headers)

    # None marks the end of input
    if headers is None:
        break

Leads to (note that now there is no Headers() between games, but one extra still got appended):

[Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Li Chao2', Black='Gashimov,V', Result='0-1', WhiteElo='2596', BlackElo='2758', EventDate='2009.11.21', HashCode='00000000', TotalPlyCount='0'),
 Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Naiditsch,A', Black='Svidler,P', Result='1/2-1/2', WhiteElo='2689', BlackElo='2754', ECO='C97', Opening='Ruy Lopez', Variation='closed, Chigorin defence', EventDate='2009.11.21', HashCode='312a25cf', TotalPlyCount='121'),
 Headers(),
 None]

I have also had this problem.

If I put a blank line between the games, it works. So:

Example 1, BAD: only the first game is parsed (no blank line between games):

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "*"]

*
[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "*"]

*

Example 2, GOOD: both games are parsed (a blank line between the games):

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "*"]

*

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "*"]

*

I think that both examples should work.

This is tricky to deal with for chess.pgn.read_game() with its current interface: it reads the file line by line, without being able to look ahead. And so, with the parser at the position marked <-

[Header "A"]


1. e4

<-
[Header "B"]

a decision has to be made:

  • Guess that the game contains consecutive empty lines (not allowed!) and will continue. In this example, it would incorrectly consume the first header of the second game, which is bad.
  • Guess that the game is terminated by consecutive empty lines and is just missing a result marker like * or 1-0 (not allowed!). This terminates the game too early in your examples, which is bad.

Currently the parser always does the latter. This is not a bug, because the PGN is invalid anyway, but maybe some heuristics can be added to better deal with it.
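One possible heuristic, sketched here purely for illustration and not how python-chess behaves: never treat blank lines as terminators, and start a new game at the first tag-like line (one starting with "[") that appears after movetext. split_games is a hypothetical name. The sketch assumes that no movetext or comment line starts with "[", and it would merge consecutive games that have headers but no movetext at all.

```python
from typing import Iterable, List


def split_games(lines: Iterable[str]) -> List[List[str]]:
    """Group PGN lines into games without relying on blank lines.

    Hypothetical helper, not part of python-chess. A new game starts at
    the first line beginning with "[" that is seen after movetext.
    """
    games: List[List[str]] = []
    current: List[str] = []
    seen_movetext = False

    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines never end a game on their own
        is_tag = line.startswith("[")
        if is_tag and seen_movetext:
            # A tag after movetext: assume the next game begins here.
            games.append(current)
            current = []
            seen_movetext = False
        if not is_tag:
            seen_movetext = True
        current.append(line)

    if current:
        games.append(current)
    return games
```

Under these assumptions, both the input with consecutive empty lines and the input with no blank line between games would be split into two games.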

Robustly handling all of this would require changing the API, so that the parser can look ahead one line, without necessarily consuming it. Pushing this back to 2.x, for that reason.
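To illustrate what one line of lookahead could look like, here is a minimal sketch of a wrapper around a text handle. PeekableLines and its peek() method are hypothetical names and not an existing or planned python-chess API; the only assumption about the handle is that it has a readline() method that returns an empty string at end of input.

```python
import io
from typing import Optional, TextIO


class PeekableLines:
    """Wrap a text handle so the next line can be inspected without consuming it."""

    def __init__(self, handle: TextIO) -> None:
        self._handle = handle
        self._buffered: Optional[str] = None

    def peek(self) -> str:
        """Return the next line without consuming it ("" at end of input)."""
        if self._buffered is None:
            self._buffered = self._handle.readline()
        return self._buffered

    def readline(self) -> str:
        """Return the next line and consume it."""
        line = self.peek()
        self._buffered = None
        return line


# Usage: decide what to do with the upcoming line before consuming it.
lines = PeekableLines(io.StringIO('1. e4\n\n[Header "B"]\n'))
lines.readline()                         # consumes "1. e4\n"
lines.readline()                         # consumes the blank line
next_is_tag = lines.peek().startswith("[")  # header line is still unread
```

A parser built on such a wrapper could leave the first header of the next game unconsumed when it decides that the current game has ended.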