Get all game headers in find command
sshivaji opened this issue · 8 comments
It's a common need to sort games by date, etc. I thought of using separate code for this. However, the results need to line up with the returned game offsets.
Can we have an API to get all game headers, so that the find command can return them? I don't see any other performant way to get all game headers and align them with the returned chess_db game offsets. This could also be done during book generation.
Code in Python to get all game headers using all of the offsets is really slow.
On second thought, and after some testing, it looks like I can use https://python-chess.readthedocs.io/en/v0.16.1/pgn.html#chess.pgn.scan_headers to get all headers. It's not as fast as chess_db book building, but perhaps sufficient.
This is sufficient if the offset logic matches. I think it should.
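To make the offset question concrete, here is a minimal standard-library sketch of a header scan that records where each game starts. It is not python-chess's implementation, and `scan_pgn_headers` is a hypothetical name. It assumes the offset of a game is the byte position of its first tag line; whether chess_db and scoutfish use the same convention would have to be checked against their actual output.

```python
import io
import re

# A PGN tag pair on its own line, e.g. [White "Carlsen, Magnus"]
TAG_RE = re.compile(rb'^\[(\w+)\s+"(.*)"\]\s*$')


def scan_pgn_headers(handle):
    """Yield (offset, headers) for each game in a binary PGN stream.

    The offset is the byte position of the game's first tag line
    (an assumption; other tools may count offsets differently).
    """
    headers = None      # tags of the game currently being read
    start = None        # byte offset where that game began
    in_moves = False    # True once movetext of the current game was seen
    while True:
        pos = handle.tell()
        line = handle.readline()
        if not line:
            break
        match = TAG_RE.match(line)
        if match:
            # A tag line after movetext (or the very first tag) starts a new game.
            if headers is None or in_moves:
                if headers is not None:
                    yield start, headers
                headers, start, in_moves = {}, pos, False
            name = match.group(1).decode("ascii")
            headers[name] = match.group(2).decode("utf-8", "replace")
        elif line.strip() and headers is not None:
            in_moves = True
    if headers is not None:
        yield start, headers
```

Usage would be along the lines of `for offset, headers in scan_pgn_headers(open("games.pgn", "rb"))`, where `games.pgn` is a placeholder file name. The file is opened in binary mode so that `tell()` reports plain byte positions.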
The thing is, we now have two tools that create different kinds of databases from .pgn move sections, and they store differently calculated game offsets. Maybe it would be more practical if scoutfish calculated offsets the same way as chess_db (8 byte aligned). Then chess.pgn.scan_headers could be used to produce a MongoDB/SQLite database with chess_db compatible offsets as primary keys and other field(s) to store header data. This way, offsets coming from both chess_db and scoutfish could be used to query this SQLite file for header data.
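A rough sketch of that idea using Python's built-in `sqlite3` module follows. The table name `game_headers`, its columns, and both function names are made up for illustration, and the sketch assumes the offsets passed in already follow one shared convention.

```python
import sqlite3


def _to_int(value):
    """Convert an Elo tag to int; missing or '?' values become None."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def build_header_db(conn, games):
    """Store (offset, headers) pairs with the game offset as primary key."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS game_headers (
               game_offset INTEGER PRIMARY KEY,
               white TEXT, black TEXT,
               white_elo INTEGER, black_elo INTEGER,
               result TEXT, date TEXT)"""
    )
    rows = [
        (offset, h.get("White"), h.get("Black"),
         _to_int(h.get("WhiteElo")), _to_int(h.get("BlackElo")),
         h.get("Result"), h.get("Date"))
        for offset, h in games
    ]
    conn.executemany(
        "INSERT OR REPLACE INTO game_headers VALUES (?, ?, ?, ?, ?, ?, ?)", rows
    )
    conn.commit()


def headers_for_offsets(conn, offsets):
    """Return header rows for the given offsets, sorted by PGN date."""
    offsets = list(offsets)
    if not offsets:
        return []
    # One placeholder per offset; very long lists would need batching
    # because SQLite limits the number of bound parameters per statement.
    marks = ",".join("?" * len(offsets))
    cursor = conn.execute(
        "SELECT game_offset, white, black, result, date FROM game_headers "
        f"WHERE game_offset IN ({marks}) ORDER BY date",
        offsets,
    )
    return cursor.fetchall()
```

PGN dates in `YYYY.MM.DD` form sort correctly as plain text, so `ORDER BY date` covers the sort-by-date case without any date parsing.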
Another solution would be to extend the parser to parse header tags too and output them to, say, a MongoDB or SQLite file.
I talked about this before: if we are going to output header tags, I think JSON format is sufficient. Other programs can then process it as they wish.
This is related to the scoutfish issue. I will close it for now but will re-open with a real UI issue and screenshot. :)
I was struggling to make this work fast enough with python-chess; it took about 4 minutes to get all headers from the 2.2 million game PGN file. I decided to modify the chess_db code and create a separate repo that only supports header and offset extraction. It is now about 8 times faster than the python-chess solution (not surprising, as the code is in C++): https://github.com/sshivaji/chess_pgn_headers
@gbtami, the output is in JSON and you can put it in SQLite or wherever you desire.
This solves the issue of executing many typical queries with scoutfish (and chess_db). For example: find the winning percentage of 2 bishops vs bishop and knight endgames in games where both players are rated above 2400 Elo, 2500 Elo, 2600 Elo, and so on. How does Elo affect the winning percentage? I think the greater the Elo, the greater the winning percentage in the 2 bishops vs bishop and knight endgame (but I will have to check).
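As a rough illustration of that kind of query, the sketch below combines a list of matching game offsets with header data keyed by the same offsets. The function name and input shapes are hypothetical. It also assumes the position query was written so that White is always the side with the two bishops, which is why it counts `1-0` results; the headers alone cannot tell which side holds the bishop pair.

```python
def win_percentage_by_elo(headers_by_offset, offsets,
                          thresholds=(2400, 2500, 2600)):
    """Map each Elo threshold to (games, white_win_percentage).

    Only games where BOTH players are rated above the threshold count.
    Games with missing or non-numeric Elo tags are skipped.
    """
    stats = {}
    for threshold in thresholds:
        games = wins = 0
        for offset in offsets:
            headers = headers_by_offset.get(offset)
            if headers is None:
                continue  # offset with no header record
            try:
                white_elo = int(headers["WhiteElo"])
                black_elo = int(headers["BlackElo"])
            except (KeyError, ValueError):
                continue  # unrated or malformed entry
            if white_elo > threshold and black_elo > threshold:
                games += 1
                if headers.get("Result") == "1-0":
                    wins += 1
        percentage = 100.0 * wins / games if games else None
        stats[threshold] = (games, percentage)
    return stats
```

Here `headers_by_offset` would be a dict built from the extracted header data and `offsets` the list of matches from the position search, both using the same offset convention.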
Now I think we can mostly focus on the UI integration; all the key backend tools are ready.
@mcostalba, I hope you don't mind the fork. I think your code base is the fastest way to extract headers that I am aware of today, and reusing it helps maintain the speed.