There’s Only One Real Way To Sort A PGN

Believe it or not, for something so common, it’s actually really difficult to sort big PGN databases. ChessBase does it, but to get your PGN in there you have to convert it to CBH or 2CBH, and that takes maybe half an hour per million games, maybe a bit less. Then you sort, and then you export back to PGN. With a 10M-game database, the whole round trip would take at least a couple of hours.

Scid vs. PC will do it, but its limit is 5 GB for a PGN, so that cuts out half of the big projects.

Scid will do it, but only if you right click over the games list and select to Copy — Export all the games to PGN. If you go in through the regular menu system, it won’t honor the way you’ve sorted the games. It will use the original sort.

So, believe it or not, that right-click export in Scid is the only real option.

That being said, if you’re ever in a total pinch, and need to sort any PGN up to maybe 5 GB without the use of anything other than a text editor, there is a way.

Notepad++ will allow you to open files up to its memory limit, and for most purposes that maxes out around 6 or 7 GB (on my system).

All you have to do is replace the newlines inside the individual games with double tildes (~~). That sequence essentially never appears in a PGN, so it works as a placeholder.
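
If you’d rather confirm that than take it on faith, a quick check is easy to write. This is only a sketch in Python; the function name and the file name in the usage comment are made up for illustration.

```python
def marker_is_unused(lines, marker="~~"):
    """Return True if `marker` appears in none of the given lines."""
    return not any(marker in line for line in lines)


# Usage (hypothetical file name), streaming the file line by line:
#
#     with open("games.pgn", encoding="utf-8", errors="replace") as handle:
#         print(marker_is_unused(handle))
```

If it comes back False, pick a different placeholder before you start replacing anything.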

Before you begin the long process, though, you have to make sure that your movelists are each one line. This is easy to do with pgn-extract and the -w9999 argument, or in Scid vs. PC via the regular export to PGN feature. Or you can get a script that does it. But the following process only works if every movelist is just one line long.
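
As for the script option, here is roughly what one could look like in Python. It is a minimal sketch, not a replacement for pgn-extract: it assumes header lines start with a left square bracket, that a blank line separates headers from moves, and that no wrapped movelist line happens to start with a bracket (a wrapped comment could). The function name is hypothetical.

```python
def join_movetext(lines):
    """Yield the lines of a PGN with each movelist collapsed onto one line.

    Assumption: any line starting with "[" is a header line. A wrapped
    comment that begins a line with "[" would be misread as a header.
    """
    pending = []  # pieces of the movelist currently being collected
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("[") or not line.strip():
            # A header or blank line ends any movelist in progress.
            if pending:
                yield " ".join(pending)
                pending = []
            yield line
        else:
            pending.append(line.strip())
    if pending:
        yield " ".join(pending)
```

The output lines come back without line endings, so whatever writes them out should add the \r\n that the rest of this process expects.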

The main thing is not to put double tildes between games, so the replacement has to be a bit selective. First you deal with the blank line between the headers and the movelist. In Notepad++, set the Search Mode to “Regular expression” in the Replace dialog, then use this pair:

\]$\r\n\r\n^1\.
\]~~~~1.

The first line is the search text, the second is the replace text. The search says: look for a right square bracket at the end of a line, then two newlines (the line break plus the blank line), then a one and a dot at the start of the next line.

The replace text says: if you find that, turn it into a right square bracket, then four tildes (two for each newline), then a one and a dot.
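
For anyone who wants to try this step outside the editor, here is a rough Python equivalent. It is a sketch under a couple of assumptions: the whole file fits in memory as one string, and the function name is mine. Python’s `$` only matches right before a \n, so it would not match between the bracket and a \r; the sketch drops the anchors and spells the newlines out instead.

```python
import re

# "]" + newline + blank line, only when the next line starts with "1.".
# \r?\n accepts both Windows and Unix line endings; the lookahead keeps
# the "1." in place so only the bracket and newlines are rewritten.
STEP_ONE = re.compile(r"\]\r?\n\r?\n(?=1\.)")


def join_headers_to_movelist(text):
    """Replace the blank line between the last header and "1." with ~~~~."""
    return STEP_ONE.sub("]~~~~", text)
```

A game whose movelist is only a result, such as 1-0, is left alone by this version because the dot is matched literally.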

Now you can do the bigger operation. You can connect everything. You do this with:

\]$\r\n^\[
\]~~\[

As you might guess, the search text is looking for a right square bracket at the end of a line, then one newline, then a left square bracket. Keep in mind that the movelist is already taken care of. So all we’re doing is connecting the rest of the lines in every game.
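
The same step can be sketched in Python, again assuming the text is held in one string and using a function name of my own. A lookahead stands in for the left square bracket so that every header line in a row gets joined in a single pass.

```python
import re

# "]" at the end of a header line, followed directly by another header.
STEP_TWO = re.compile(r"\]\r?\n(?=\[)")


def join_header_lines(text):
    """Join consecutive header lines with a double tilde."""
    return STEP_TWO.sub("]~~", text)
```

The blank line between games is untouched, because the line before it ends with a result rather than a bracket followed by a header.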

Now every game is one line, with a blank line in between. Just run a third regular expression to remove these:

\r\n\r\n
\r\n

Simple as pie. This says look for two consecutive newlines and replace them with just one. (The r stands for “carriage return” and the n stands for “line feed”. Together, \r\n is a newline on Windows. On Linux and modern macOS it’s just \n; only classic Mac OS, before OS X, used a lone \r. Of course that holds for all these regular expressions.)
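
In Python, one way to sidestep the line-ending question is to accept either style and put back whichever one was found. This is just a sketch with a made-up function name, and like the editor version it makes a single pass.

```python
import re

# Two consecutive newlines of either style; the first one is captured
# so the replacement keeps the file's own line-ending convention.
BLANK_LINE = re.compile(r"(\r?\n)\r?\n")


def drop_blank_lines(text):
    """Collapse each pair of consecutive newlines into a single newline."""
    return BLANK_LINE.sub(r"\1", text)
```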

Now if you aren’t sorting by Event (which is probably the first header) then you need to make the one you want to sort by the first header in the games. So you have to move it to the front. Let’s say you want to sort by Date.

^(.*?)\[Date "(.*?)"\]~~
\[Date "\2"\]~~\1

This one says to look at the beginning of every line, then remember what it saw up until it found what it was looking for. Then, when finding it, to remember what was in the quotation marks. Then to move that to the front, and put everything else behind it.
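
Here is a small Python sketch of the same move, written so that it works for any header name. The function name is hypothetical, and it assumes the header value contains no double quotes and that the header is not the last one before the movelist. If the header is missing, the line comes back unchanged.

```python
import re


def move_tag_to_front(line, tag):
    """Move the first `[tag "..."]~~` in a one-line game to the front."""
    # Group 1: everything before the header. Group 2: the header itself,
    # including its trailing double tilde.
    pattern = re.compile(r'^(.*?)(\[' + re.escape(tag) + r' "[^"]*"\]~~)')
    return pattern.sub(r"\2\1", line, count=1)
```

For example, a line that starts with the Event header:

```python
game = '[Event "X"]~~[Site "S"]~~[Date "2020.01.01"]~~[Result "1-0"]~~~~1. e4 1-0'
by_date = move_tag_to_front(game, "Date")
# by_date now starts with [Date "2020.01.01"]~~[Event "X"]~~
```

Calling it again with "Event" puts Event back in front, which leaves Date second rather than in its original spot.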

Then, in whatever text editor you’re doing this in, sort the lines in ascending or descending order, as you like. (In Notepad++ that’s under Edit > Line Operations.) PGN dates are written YYYY.MM.DD, so a plain text sort puts them in chronological order. Now the games are sorted, but each one is still a single line with the wrong header at the front. First, to put the header back, repeat the previous step with the text Event in place of Date. Then replace all the double tildes with newlines, like so:

~~
\r\n

So it’s complicated, and yet it requires nothing other than a text editor.
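
If a scripting language does happen to be available, the sort and the final tilde-to-newline step can be sketched in a few lines of Python. This assumes the games are already one per line with the sort header in front, and that the header has been moved back if you want it back. The function name is mine. One deliberate difference from the editor recipe: the sketch writes a blank line after each game, which the plain ~~ replacement does not do.

```python
def sort_and_restore(flat_lines, newline="\r\n", marker="~~", reverse=False):
    """Sort one-line games as plain text, then expand them back to PGN."""
    # Strip line endings and skip empty lines before sorting.
    games = sorted(
        (line.rstrip("\r\n") for line in flat_lines if line.strip()),
        reverse=reverse,
    )
    # Each marker becomes a newline again; a blank line follows each game
    # (an addition of this sketch, not part of the editor steps).
    return "".join(
        game.replace(marker, newline) + newline + newline for game in games
    )
```

Python’s sorted compares strings character by character, which is comparable to a lexicographic line sort in an editor.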
