In the middle of adding files for the Lc0 Artemis directory that I had started, I switched projects back to Bookstacks. This is the eBook site that I started in 2002, and it has always averaged about two hundred people wandering in per day.
I’ve always wanted to automate the whole process, but never could. Now, with the help of ChatGPT and tons of drafts of PHP and Python code, it finally works the way I had always pictured it. This is the fourth day in a row that I’ve been working on it. So I will get back to Chess Nerd, but in the meantime, I’m getting some work done that I’ve always needed to do.
For what it’s worth, the process is this:
A Python script prompts me on the command line for an author name. It then downloads all the English-language books by that author from Project Gutenberg, renames them consistently according to the API data, and cleans the files by grabbing the book content from the DOM and reassembling the books with a consistent h1, h2, h3 heading scheme.
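The post doesn’t say which API the real script talks to; as a minimal sketch, here is how the download step could work against Gutendex, a public index of the Project Gutenberg catalog:

```python
# Minimal sketch of the download step, assuming the Gutendex API
# (https://gutendex.com); the actual script's API may differ.
import os
import re
import requests

def download_author(author, dest="books"):
    os.makedirs(dest, exist_ok=True)
    url, params = "https://gutendex.com/books", {"search": author, "languages": "en"}
    while url:
        data = requests.get(url, params=params).json()
        for book in data["results"]:
            # Prefer an HTML format, since the cleanup step works on the DOM
            link = next((v for k, v in book["formats"].items()
                         if k.startswith("text/html")), None)
            if link:
                # Build a consistent filename from the API's title field
                name = re.sub(r"\W+", "-", book["title"].lower()).strip("-")
                with open(os.path.join(dest, f"{name}.html"), "wb") as f:
                    f.write(requests.get(link).content)
        url, params = data.get("next"), None  # follow pagination
```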
I take a cleaned book and polish it up, making sure it’s entirely consistent. Then another Python script splits the book into separate HTML files, one per chapter — followed by another one to fix the mistakes of the one before it. 🙂
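The splitting step is conceptually simple. Here is a rough sketch, under my own assumption that each chapter in the cleaned file begins with an h2 heading:

```python
# Rough sketch of the chapter split, assuming every chapter starts
# with an <h2> heading (my assumption; the real script's rules may differ).
import re

def split_chapters(path, prefix="chapter"):
    text = open(path, encoding="utf-8").read()
    # Split just before each <h2>, keeping the headings attached;
    # parts[0] is whatever precedes the first chapter (title page, etc.)
    parts = re.split(r"(?=<h2)", text)
    for i, chunk in enumerate(parts[1:], start=1):
        with open(f"{prefix}-{i:02d}.html", "w", encoding="utf-8") as f:
            f.write(chunk)
```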
I use two more Python scripts to automatically generate an EPUB and PDF version. Then I upload it all to the server.
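The post doesn’t say what those conversion scripts use under the hood; one easy way to get both formats in a few lines is to shell out to Pandoc (an assumption on my part, and the PDF output needs a LaTeX engine installed):

```python
# Sketch: convert the cleaned HTML to EPUB and PDF via Pandoc.
# (An assumption; the actual scripts may use other tools entirely.)
import subprocess

def convert(html_path, title, author):
    base = html_path.rsplit(".", 1)[0]
    meta = ["--metadata", f"title={title}", "--metadata", f"author={author}"]
    for ext in ("epub", "pdf"):  # PDF requires a LaTeX engine
        subprocess.run(["pandoc", html_path, "-o", f"{base}.{ext}", *meta],
                       check=True)
```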
The single, medium-length PHP file that runs the entire site scans the server on each page load, and whatever it finds it displays as lists, navigation, and a viewer that shows you the books one chapter at a time, with a dropdown menu linking to the other chapters and pagination links for previous, next, and the list of books.
It should be noted that there is intentionally no attribution given to Project Gutenberg. That comes from their boilerplate, as I read it about twenty years ago: if you use the Project Gutenberg name at all, you have to insert the boilerplate at the beginning and end of every book; if you don’t want to do that, then you can’t use the Project Gutenberg name at all.
Ultimately, the table of Lc0 match games is about downloads. So this is the simple list of links to each of the Lc0 match games, split up by day and taking into account the different runs that overlapped. I’m currently working on a script to automagically improve these files.
Regular expressions are not limited to software development or text editors; they’re also useful in office apps, search forms, and so on. But this is just a quick heads-up on how to run a regular-expression search and replace in Notepad++, a Windows text editor.
Say you have a file containing just one PGN game in total. We can experiment with this.
Perhaps you would like to strip out the comments. As you can see, all PGN comments are surrounded by curly brackets. These appear nowhere else in the PGN.
Using a regular expression means searching for more than just one literal term. In this case, you would search on \{.*?\} with a single space in front of it.
Just look at part of the file: 88. Kg1 {-299.92/34 0.69s} Kg5 … The search takes out the space before the comment, then the comment and everything inside it. When I run that on the PGN included above, I get this result:
The space is self-explanatory. The backslash “escapes” the left curly bracket, keeping it from being treated as a regex metacharacter instead of a literal search character. The next three always go together: a dot meaning “any one character”, an asterisk modifying the dot to match as many characters as needed, and a question mark telling the search not to get overambitious (match as few characters as possible). The right curly bracket is then escaped for the same reason the left one was, and that is really all there is to it.
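Incidentally, the same substitution works anywhere regular expressions do. In Python, for example (a sketch, assuming the game is saved as game.pgn):

```python
# Same search-and-replace as in Notepad++: remove each {...} comment
# along with the single space that precedes it.
import re

pgn = open("game.pgn", encoding="utf-8").read()
cleaned = re.sub(r" \{.*?\}", "", pgn)
open("game-clean.pgn", "w", encoding="utf-8").write(cleaned)
```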
Say you want to change the Event value to Big Tournament. You would use the following search text: ^\[Event ".*?"\]$ to indicate the beginning of a line, the tag you’re looking for, arbitrary content that you aren’t capturing (no parentheses needed), and then the rest of the tag and the end-of-line marker.
^ means the beginning of a line. $ means the end of it. Backslashes escape the square brackets, and you can just replace the whole thing with \[Event "Big Tournament"\]
You don’t need to specify the beginning or end of the line in the replace text, and since it’s always the same value, you just type it in literally.
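And here is the Python equivalent of this second example, for the record (the MULTILINE flag makes ^ and $ match at each line, the way Notepad++ does by default):

```python
import re

pgn = open("game.pgn", encoding="utf-8").read()
# Replace the whole [Event "..."] tag line with a fixed value
fixed = re.sub(r'^\[Event ".*?"\]$', '[Event "Big Tournament"]',
               pgn, flags=re.MULTILINE)
open("game-fixed.pgn", "w", encoding="utf-8").write(fixed)
```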
These are two good examples to give you an idea of how to make this system work for you. It’s mostly a matter of looking things up and asking the various robots how to write the command-line scripts, things like that.
Still not totally convinced this is finished, since it took such a long time. I started with the 2025-01 FIDE XML players list and culled it with a ChatGPT-generated Python script so that only players rated 2000+ were included. Then I converted that to XLSX (amongst other formats) and copied out the spreadsheet column holding those players’ FIDE IDs. With another Python script, I was able to download a JSON file for each player from the FIDE API, using a wrapper from a GitHub repository. These JSON files have all the information for those players: not just the general information, but the ratings and number of games for every month they have a rating for, so the files turned out to be pretty long. The next step was converting all of those to a new XML players list, which includes all the history as well as the general information. Even though the number of players is drastically reduced, to only a bit more than 19,000, the new XML players list is still about twice the size of the old one. I did make sure to streamline the elements so that, as much as possible, they resemble the elements in the regular players list.
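The culling step, roughly, could look like this (a sketch only; the element and file names are my guesses at FIDE’s XML, which the real script would match exactly):

```python
# Stream-filter the big FIDE list so the whole file never has to sit
# in memory; element names ("player", "rating") are my assumption.
import xml.etree.ElementTree as ET

with open("players_2000plus.xml", "w", encoding="utf-8") as out:
    out.write("<playerslist>\n")
    for _, elem in ET.iterparse("players_list_xml.xml"):
        if elem.tag == "player":
            rating = elem.findtext("rating")
            if rating and int(rating) >= 2000:
                out.write(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # discard each record once handled
    out.write("</playerslist>\n")
```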
The size of the ZIP is about 40 MB, yet the size of the XML file is about 1.2 GB. I’m not sure how it compressed so well, though XML is extremely repetitive (the same element names appear in every single record), which presumably helps.
The players list is provided each month, but it isn’t kept in the archives for previous months. While it would be somewhat useful to have an archive of them, there’s no reliable way to build one. However, all three rating lists are provided as XML, which can be combined with a script. So, using a Python script written by ChatGPT, I’m able to create combined XML files which don’t contain the inactive players but could otherwise easily stand in for the players list. And these can be provided for every month for which there are XML files. Here is December 2024.
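For the curious, the merging idea looks roughly like this (element and file names are again my guesses; the real script matches FIDE’s actual XML):

```python
# Merge the standard, rapid, and blitz XML lists into one combined
# players list keyed on FIDE ID. Element and file names are guesses.
import xml.etree.ElementTree as ET

players = {}
for kind in ("standard", "rapid", "blitz"):
    for _, elem in ET.iterparse(f"{kind}_dec24_xml.xml"):
        if elem.tag == "player":
            fid = elem.findtext("fideid")
            entry = players.setdefault(fid, {"fideid": fid,
                                             "name": elem.findtext("name")})
            entry[f"{kind}_rating"] = elem.findtext("rating")
            elem.clear()

root = ET.Element("playerslist")
for data in players.values():
    p = ET.SubElement(root, "player")
    for tag, value in data.items():
        if value is not None:
            ET.SubElement(p, tag).text = value
ET.ElementTree(root).write("combined_dec24.xml", encoding="utf-8")
```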
I’ve made many attempts now to create the perfect Python script for SF commits (via ChatGPT, of course), and finally realized that the problem was that almost everything necessary is already at Abrok. Official builds are over at the official site anyway. But all the commits are here, going back to 2018 or so. Here is a list of those commits, with direct links to the executables.
The FIDE player list is a combination of the three ratings lists (standard, rapid, and blitz) but is very big and includes lots of players who are registered but have no rating. After running a Python script to keep only the players rated 2000 or above, I can provide a much smaller list that might also be more useful. In the MediaFire directory for 2025-01, you can find it in XML, CSV, JSON, XLSX, ODS, and TXT.
These are tables of data related to the Leela Chess Zero self-test matches and NNUE networks. Aside from all the relevant data in the matches table, you can also turn the first value in each record — the ID field — into a URL to download the PGN itself if you put it in the following form:
It should be noted that while most of the links come from run 1, some don’t, so the /1/ before the ID-NUM is variable data as well; it’s listed in the table along with the ID.
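As a sketch of that step in Python (the path segment here is purely my guess, patterned on the networks URL below; use whatever form the table actually gives):

```python
# Hypothetical sketch: build a PGN download URL from the run and ID
# columns of the matches table. The "match_pgns" path is my assumption.
def match_pgn_url(run, match_id):
    return f"https://storage.lczero.org/files/match_pgns/{run}/{match_id}"

print(match_pgn_url(1, 123456))
```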
The data for the networks, similarly relevant, cannot so easily be turned into URLs. The links are present in the page’s table, but converting it strips them out, and so far I don’t have a way around that. So instead I have the networks page itself, which is available at: https://storage.lczero.org/files/networks/
The above link is to a CSV I just made of the networks page itself. This is of limited use, so it’s static. In the next iteration of the script, I’ll try and put all this functionality together. Or, rather, get ChatGPT to do so, as I couldn’t code my way out of a sack of pythons.
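The conversion idea itself fits in a few lines of Python. Here is a sketch of the concept, not ChatGPT’s actual script; it assumes the pages are plain HTML tables and that pandas and lxml are installed:

```python
# Fetch a page and save its first HTML <table> as CSV.
import io
import requests
import pandas as pd

def html_table_to_csv(url, csv_name):
    html = requests.get(url).text
    table = pd.read_html(io.StringIO(html))[0]  # first table on the page
    table.to_csv(csv_name, index=False)

# e.g. the networks page mentioned above:
html_table_to_csv("https://storage.lczero.org/files/networks/", "networks.csv")
```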
NOTE: Those initial HTML links should be saved directly to your hard drive. Attempting to view either link will crash your browser tab, as they’re both too large to display. Hence downloading them directly via the browser and then converting them to CSV, which is what I did, and what these files are:
And here is the Python script that ChatGPT wrote to generate the CSVs. Using PyInstaller (and with ChatGPT’s help, of course) I managed to get it into EXE form. All you have to do is double-click it, and it will automagically download both HTML files, convert each to CSV, and put everything into whatever folder it happens to be in: