Beyond PGN: Designing an Ultra-Efficient Chess Storage Format

2026-06-24

2026-07-08

Chess

Every chess database begins with a simple question:

How do we store a game of chess?

For decades, the default answer has been PGN: Portable Game Notation. It is human-readable, easy to share, easy to edit, and universally supported. A PGN game feels natural because it looks close to how chess players already talk about moves:

1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4

For humans, this is excellent.

For machines, it is surprisingly wasteful.

A chess engine, a database, or a high-performance analysis tool does not need the move to be written as "Nxd4" or "O-O". It does not need spaces, move numbers, dots, or SAN disambiguation. It only needs to know one thing:

Which move was played?

When I first started thinking about chess storage, I assumed PGN was already compact enough. Then I did the math, and PGN started looking less like an efficient storage format and more like a convenient text format that happens to be compressible.

If we want to build a modern chess database capable of storing millions of games, loading them instantly, syncing them efficiently, and navigating them without constantly reparsing text, we can go much further.

The First Realization: A Chessboard Is Only 64 Squares#

A chessboard looks large to a beginner.

But to a computer, it is tiny.

There are only 64 squares. That means any square can be represented with just 6 bits, because:

2^6 = 64

So instead of storing a move as text like:

Nf3

or:

exd5

we can store it as two square coordinates:

from_square -> to_square

For example:

e2 -> e4

One square needs 6 bits.

Two squares need:

6 bits + 6 bits = 12 bits

That means most normal chess moves can be represented in only 12 bits.

This is the simplest binary approach:

[from: 6 bits][to: 6 bits]

It is compact, direct, and fast to decode.

The reader does not need to parse SAN. It does not need to understand whether "Nbd2" means the knight from b1 or f3. It does not need to interpret captures, checks, checkmates, or castling notation. It simply reads two square IDs and applies the move.

The chess rules are still needed to update the board correctly, but the move identity itself is no longer hidden inside text.

Castling and en passant fit naturally into this scheme. For castling, we encode the king’s from-square to its destination: e1→g1 for kingside, e1→c1 for queenside. The rook move is implied by the rules. For en passant, we encode it as a normal pawn capture using the from-square and the destination square. The capture of the opponent’s pawn is a board-state consequence.
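As a minimal sketch of the 12-bit layout, the packing can be done with one shift and one mask. The square numbering (a1 = 0 through h8 = 63) and the function names are assumptions made for this illustration; the article does not fix either.

```python
FILES = "abcdefgh"

def square_index(name: str) -> int:
    """Map a square name like 'e4' to 0..63 (assumed numbering: a1 = 0, h8 = 63)."""
    return (int(name[1]) - 1) * 8 + FILES.index(name[0])

def square_name(index: int) -> str:
    """Inverse of square_index."""
    return FILES[index % 8] + str(index // 8 + 1)

def encode_move(from_sq: str, to_sq: str) -> int:
    """Pack two 6-bit square IDs into one 12-bit integer: [from][to]."""
    return (square_index(from_sq) << 6) | square_index(to_sq)

def decode_move(value: int) -> tuple[str, str]:
    """Split a 12-bit move back into its two squares."""
    return square_name(value >> 6), square_name(value & 0b111111)

move = encode_move("e2", "e4")
print(f"{move:012b}")               # 001100011100
print(decode_move(move))            # ('e2', 'e4')
print(encode_move("e1", "g1"))      # kingside castling stored as the king move: 262
```

Castling and en passant need no special case here: both are ordinary from/to pairs, and the board update handles the rook or the captured pawn.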

Diagram showing how PGN text can be replaced by binary coordinates: 6 bits for the source square, 6 bits for the destination square, and 2 extra bits for promotions

The Promotion Problem#

Chess always has exceptions.

Most moves can be represented with a starting square and a destination square. But promotions need one extra piece of information.

If a pawn reaches the final rank, it can promote to one of four pieces:

Queen
Rook
Bishop
Knight

Four possibilities require 2 bits:

00 = queen
01 = rook
10 = bishop
11 = knight

So a promotion move can be stored as:

[from: 6 bits][to: 6 bits][promotion: 2 bits]

That gives us:

12 bits + 2 bits = 14 bits

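So with a simple coordinate-based binary encoding, we can represent almost every chess move in 12 bits, and promotion moves in 14 bits. The decoder knows when the extra 2 bits are present because it tracks the board: they follow only when a pawn moves onto the last rank.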
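The variable-width layout could look like the sketch below, which builds the bit string for a move and appends the 2-bit promotion code only when one is given. The function names, the bit-string representation, and the `is_promotion` flag are assumptions for illustration; in a real decoder that flag would come from the tracked board state.

```python
# Promotion codes as listed above: 00 queen, 01 rook, 10 bishop, 11 knight.
PROMOTION_CODES = {"q": 0b00, "r": 0b01, "b": 0b10, "n": 0b11}
PROMOTION_PIECES = {code: piece for piece, code in PROMOTION_CODES.items()}

def square_index(name: str) -> int:
    """Assumed numbering: a1 = 0, h8 = 63."""
    return (int(name[1]) - 1) * 8 + "abcdefgh".index(name[0])

def encode_move_bits(from_sq: str, to_sq: str, promotion=None) -> str:
    """Return 12 bits for a normal move, 14 bits for a promotion."""
    bits = f"{square_index(from_sq):06b}{square_index(to_sq):06b}"
    if promotion is not None:
        bits += f"{PROMOTION_CODES[promotion]:02b}"
    return bits

def decode_move_bits(bits: str, is_promotion: bool):
    """is_promotion is supplied by the caller, who knows the board state."""
    from_index = int(bits[0:6], 2)
    to_index = int(bits[6:12], 2)
    promotion = PROMOTION_PIECES[int(bits[12:14], 2)] if is_promotion else None
    return from_index, to_index, promotion

normal = encode_move_bits("e2", "e4")
promo = encode_move_bits("e7", "e8", "n")
print(normal, len(normal))            # 001100011100 12
print(promo, len(promo))              # 11010011110011 14
print(decode_move_bits(promo, True))  # (52, 60, 'n')
```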

Compared to PGN text, that is already a major improvement.

Why This Is Already Better Than PGN#

Let’s take a common PGN move:

Nxd4

That is four characters. In a normal text encoding, each character usually costs 8 bits.

So this move alone costs roughly:

4 characters × 8 bits = 32 bits

And that does not include spaces, move numbers, comments, variations, line breaks, or metadata.

PGN is also more expensive to interpret than its size alone suggests. A PGN parser must deal with disambiguation like "Nbd2", "R1e2", or "exd5". To understand those moves, the parser must reconstruct the board state, find legal pieces that can reach the target square, and determine the correct one. PGN is semantic text that requires chess-aware parsing.

A coordinate move costs about:

12 bits

And it needs no chess-aware parsing to identify the move. The system reads two square IDs and applies the move directly.

A rough comparison looks like this:

PGN text move:       ~24–40 bits or more
Binary coordinates: ~12–14 bits
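To make the comparison concrete, here is a small calculation on the opening line quoted at the start of the article. It assumes one byte per ASCII character and counts the move numbers and spaces as part of the text cost.

```python
san_movetext = "1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4"

# Tokens ending in "." are move numbers; everything else is a half-move.
plies = sum(1 for token in san_movetext.split() if not token.endswith("."))

text_bits = len(san_movetext.encode("ascii")) * 8   # 8 bits per character
coord_bits = plies * 12                             # 12 bits per half-move

print(plies, text_bits, coord_bits)    # 7 296 84
print(round(text_bits / plies, 1))     # 42.3 bits per half-move as text
```

For this short line the text form costs about 3.5 times as many bits as the coordinate form, before any tags or comments are added.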

That is the first big lesson:

PGN was designed for readability and portability, with compactness as a secondary concern.

The Second Realization: Maybe We Don’t Need Coordinates at All#

We can do better.

At any given chess position, the number of legal moves is limited.

A player does not have 4,096 possible moves, which is the 64 × 64 from-square and to-square combinations that 12 bits can express. They usually have something like 20, 30, 40, or maybe 50 legal moves. The theoretical maximum is 218 legal moves (requiring 8 bits), but such positions are vanishingly rare in practice.

So instead of storing:

from_square -> to_square

we could do something more clever.

Imagine that both the writer and the reader generate the exact same list of legal moves for the current position:

0: e2e4
1: d2d4
2: g1f3
3: c2c4
...

Then we do not need to store the move itself.

We only need to store its index.

If the move played was the third move in the list, we store:

2

That is the legal-move-index approach.
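A small sketch of the idea for the starting position, where White has exactly 20 legal moves. The move list is written out by hand instead of coming from a move generator, and the lexicographic sort is only one possible shared ordering (an assumption for this example), so the indices differ from the illustrative list above.

```python
# The 20 legal moves in the standard starting position, in coordinate form.
pawn_moves = [f"{f}2{f}{r}" for f in "abcdefgh" for r in "34"]
knight_moves = ["b1a3", "b1c3", "g1f3", "g1h3"]

# Writer and reader must agree on one deterministic order; here: sorted strings.
legal_moves = sorted(pawn_moves + knight_moves)

def encode(move: str) -> int:
    """Writer side: store only the position of the move in the shared list."""
    return legal_moves.index(move)

def decode(index: int) -> str:
    """Reader side: rebuild the same list and look the move up."""
    return legal_moves[index]

print(len(legal_moves))   # 20
print(encode("e2e4"))     # 11
print(decode(14))         # g1f3
```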

Instead of asking:

Which square did the piece move from, and where did it go?

we ask:

Out of all legal moves in this position, which one was chosen?

This can be much smaller.

If a position has at most 64 legal moves, the index fits in 6 bits.

If it has at most 128 legal moves, it fits in 7 bits.

Most chess positions fit comfortably in that range.
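The width of the index follows directly from the number of legal moves. A minimal sketch, with `index_bits` as a name chosen for this example:

```python
def index_bits(num_legal_moves: int) -> int:
    """Smallest number of bits that can hold an index in range(num_legal_moves)."""
    if num_legal_moves < 1:
        raise ValueError("a position to encode needs at least one legal move")
    # The largest index is num_legal_moves - 1; bit_length() gives its width.
    return (num_legal_moves - 1).bit_length()

for n in (1, 20, 64, 65, 128, 218):
    print(n, index_bits(n))
# 1 0
# 20 5
# 64 6
# 65 7
# 128 7
# 218 8
```

The starting position, with its 20 legal moves, needs only 5 bits under this scheme.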

So now our rough comparison becomes:

PGN text:           ~24–40 bits per move
Coordinates:        ~12–14 bits per move
Legal move index:   ~6–7 bits per move

That is a dramatic reduction.

Diagram explaining legal move index encoding: generate legal moves from the current position, store only the selected move index, and use 6 to 7 bits for most positions

The Beautiful Edge Case: Forced Moves Cost Almost Nothing#

This approach has an elegant consequence: forced moves cost nothing.

Sometimes a player has only one legal move.

For example, when the king is in check and only one move answers it, there is exactly one valid choice.

If there is only one legal move, we do not need to store anything.

The decoder generates the legal move list, sees that only one move exists, and applies it automatically.

The move costs:

0 bits

That feels almost magical.

But it is not magic. It is simply using the rules of chess as shared context between the encoder and decoder.

When the rules determine the answer, the file does not need to repeat it.

This is the core principle behind many efficient encodings:

Do not store information that can be reconstructed deterministically.
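The zero-bit case falls out naturally if each index is written with just enough bits for the number of legal moves at that ply. The sketch below is one hypothetical way to do it; it takes the legal-move counts as plain numbers, whereas a real encoder and decoder would obtain them from a move generator after applying each move.

```python
def index_bits(num_legal_moves: int) -> int:
    return (num_legal_moves - 1).bit_length()

def encode_indices(choices) -> str:
    """choices: one (chosen_index, num_legal_moves) pair per ply."""
    out = []
    for index, count in choices:
        if not 0 <= index < count:
            raise ValueError("index out of range for this position")
        width = index_bits(count)
        if width:                       # forced move: width is 0, write nothing
            out.append(format(index, f"0{width}b"))
    return "".join(out)

def decode_indices(bits: str, counts) -> list:
    """counts: legal-move count at each ply, regenerated by the decoder."""
    pos, indices = 0, []
    for count in counts:
        width = index_bits(count)
        indices.append(int(bits[pos:pos + width], 2) if width else 0)
        pos += width
    return indices

# Three plies: 20 legal moves, then a forced move, then 30 legal moves.
stream = encode_indices([(11, 20), (0, 1), (3, 30)])
print(stream, len(stream))                  # 0101100011 10
print(decode_indices(stream, [20, 1, 30]))  # [11, 0, 3]
```

The middle ply contributes nothing to the stream, yet the decoder still recovers it.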

The Catch: Compression Is Not Free#

At this point, legal-move indexing sounds like the obvious winner.

Why store 12 bits when we can store 6?

Why use coordinates at all?

The answer is performance.

To decode a coordinate-based move, the reader does something simple:

Read from-square
Read to-square
Apply move

This is extremely fast.

To decode a legal-move index, the reader must do more work:

Generate all legal moves
Sort/order them deterministically
Read the index
Select the matching move
Apply move

That means every move requires legal move generation.

For a single game, this is fine.

For millions of games, it becomes expensive.

Compression is only half the problem. A chess database is also a performance problem.

If your format is beautifully small but slow to scan, slow to open, slow to index, or difficult to randomly access, it may be worse in practice than a larger format.

This gives us the fundamental trade-off:

Smaller files usually require more computation.
Faster decoding usually requires storing more explicit information.

There is no universal best answer. There is only the best answer for a specific system.

Coordinates vs Legal Move Indexes#

The coordinate approach is simple, direct, and practical.

It gives us:

[from square][to square][optional promotion]

Its strengths are obvious:

Very fast to decode.
Easy to implement.
Easy to validate.
Good for random access.
Good for database indexing.
Does not require generating all legal moves just to identify the move.

Its weakness is that it is not maximally compact.

The legal-move-index approach is more compressed.

It gives us:

[index of move in legal move list]

Its strengths are:

Extremely compact.
Can exploit forced moves.
Can approach the theoretical information needed to describe a chess game.

Its weaknesses are serious:

Requires legal move generation at every ply.
Requires a perfectly deterministic move ordering.
Harder to implement safely.
More expensive to decode at massive scale.
More fragile if the rules, variants, or encoding assumptions change.

That narrows the question to: what does the system need to optimize for?

A serious chess database needs to:

Import millions of games.
Open individual games instantly.
Jump to arbitrary positions.
Build position indexes.
Search references.
Display games quickly.
Sync data efficiently.
Support annotations, comments, variations, and repertoires.
Export back to PGN when needed.

For that kind of system, raw compactness is not enough. A format must also be operationally useful.

A coordinate-based binary format is often the better engineering compromise. It is still far smaller than PGN, but it remains simple and fast.

For cold archival storage where decoding speed matters less, legal-move indexing is more attractive. The file would be tiny, and we could afford extra computation because the data would rarely be read.

Comparison chart showing the storage and decoding trade-off between PGN text, binary coordinates, and legal move indexing

The Practical Design Choice#

For a modern chess database, I would start with coordinate-based binary moves.

Something like:

12 bits for normal moves
14 bits for promotions

This gives an excellent balance:

Much smaller than PGN
Much faster to parse than PGN
Simple enough to implement safely
Efficient for massive databases
Friendly to random access and indexing

Legal-move indexing can still be useful later, especially for a compressed distribution format or cold storage format.

But for the primary editable database representation, coordinates are a better foundation.

This leads to a layered architecture:

Internal database format: coordinate-based binary moves
Optional archive format: legal-move-index compression
Export format: deterministic PGN

That way, each format does what it is best at.

PGN remains the universal language for humans and existing tools.

The binary coordinate format becomes the fast internal representation.

The legal-move-index format becomes an optional compression layer for distribution or archival storage.

Architecture diagram for modern chess storage: PGN import and export, binary coordinate core, optional compression layer, compact initial state, and database features

What About the Initial Position?#

So far, we assumed that every game starts from the normal chess starting position.

But real databases also contain games that begin from custom positions:

Chess960 games.
Training positions.
Composed studies.
Partial game fragments.
Engine lines.
Positions imported from FEN.

For those cases, the format needs to store the initial board state.

The usual text format for this is FEN:

rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1

FEN is useful and readable, but again, it is text.

A binary format could store a compact “Bit-FEN” instead.

At minimum, an initial position needs:

Piece placement.
Side to move.
Castling rights.
En passant square, if any.
Halfmove clock.
Fullmove number.

The exact encoding can vary, but the principle is the same:

Store chess state as structured binary data, not as text that must be reparsed.
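One possible "Bit-FEN" layout, offered as a sketch and not as a settled format: a 64-bit occupancy map, 4 bits per piece, then side to move, castling flags, en passant file, and the two counters. The field widths and function names are assumptions for this example. The four castling flags cover standard chess only, so Chess960 would need rook files instead, and the halfmove clock is assumed to fit in 8 bits.

```python
PIECES = "PNBRQKpnbrqk"   # 12 piece kinds, stored as 4-bit codes 0..11
CASTLING = "KQkq"

def pack_position(fen: str) -> bytes:
    placement, side, castling, ep, halfmove, fullmove = fen.split()
    occupancy, codes = [], []
    for ch in placement.replace("/", ""):          # FEN order: a8..h8 down to a1..h1
        if ch.isdigit():
            occupancy.append("0" * int(ch))
        else:
            occupancy.append("1")
            codes.append(format(PIECES.index(ch), "04b"))
    bits = "".join(occupancy) + "".join(codes)     # 64 bits + 4 bits per piece
    bits += "1" if side == "b" else "0"
    bits += "".join("1" if c in castling else "0" for c in CASTLING)
    # En passant: flag bit + file; the rank is implied by the side to move.
    bits += "0000" if ep == "-" else "1" + format("abcdefgh".index(ep[0]), "03b")
    bits += format(int(halfmove), "08b") + format(int(fullmove), "016b")
    bits += "0" * (-len(bits) % 8)                 # pad to whole bytes
    return int(bits, 2).to_bytes(len(bits) // 8, "big")

def unpack_position(data: bytes) -> str:
    bits = format(int.from_bytes(data, "big"), f"0{len(data) * 8}b")
    pos, rows = 64, []
    for r in range(8):
        row, empty = "", 0
        for occupied in bits[r * 8:(r + 1) * 8]:
            if occupied == "1":
                row += (str(empty) if empty else "") + PIECES[int(bits[pos:pos + 4], 2)]
                empty, pos = 0, pos + 4
            else:
                empty += 1
        rows.append(row + (str(empty) if empty else ""))
    side = "b" if bits[pos] == "1" else "w"
    castling = "".join(c for c, b in zip(CASTLING, bits[pos + 1:pos + 5]) if b == "1") or "-"
    ep = "-"
    if bits[pos + 5] == "1":
        ep = "abcdefgh"[int(bits[pos + 6:pos + 9], 2)] + ("3" if side == "b" else "6")
    halfmove = int(bits[pos + 9:pos + 17], 2)
    fullmove = int(bits[pos + 17:pos + 33], 2)
    return f"{'/'.join(rows)} {side} {castling} {ep} {halfmove} {fullmove}"

START = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
packed = pack_position(START)
print(len(START), len(packed))             # 56 29
print(unpack_position(packed) == START)    # True
```

Under these assumptions the starting position takes 29 bytes against 56 characters of FEN, and sparser positions shrink further because each missing piece saves 4 bits.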

For normal games, we can omit the initial board entirely and assume the standard starting position.

For custom games, we include a compact binary initial-state block.

That gives us efficiency without losing generality.

Conclusion: PGN Is the Beginning, Not the End#

PGN gave chess software a common language.

But modern chess software can go beyond it.

A binary chess format can represent moves in 12 to 14 bits using coordinates. With legal-move indexing, it can sometimes approach 6 to 7 bits per move, or even 0 bits for forced moves.

But the smallest format is not always the best format.

For an interactive chess database, the best design is likely a pragmatic one:

Use binary coordinates for speed.
Use legal-move indexing where compression matters.
Use PGN for import and export.
Use a compact binary position format for custom starting states.

That gives us a layered system where each format plays to its strength:

Human compatibility.
Machine efficiency.
Fast random access.
Compact storage.
Deterministic export.
Room for future compression.

PGN remains useful.

But it does not have to be where the story ends.

Sometimes the real breakthrough begins when we stop asking:

How do we store the notation?

And start asking:

What is the smallest useful truth the machine actually needs?

PGN vs Binary: A Side-by-Side Comparison#

The previous sections built the case. Let’s put numbers on it.

The baseline: a typical game runs about 80 half-moves (40 full moves). PGN text for those moves averages around 4 bytes per half-move once move numbers, dots, and spaces are included.

Format	Bits per half-move	Bytes per game (80 plies)	Notes
PGN (text, ASCII)	~32 bits	~320 bytes	Includes move numbers, dots, and SAN disambiguation
Binary coordinates (from-square + to-square)	12 bits	120 bytes	6 bits per square × 2 squares = 12 bits
Binary coordinates + 2-bit promotion piece	14 bits	~140 bytes if every ply carried it	Only needed on promotion moves, so real games stay close to 120 bytes
Legal-move index (with full move-generation cache)	~6-8 bits	~60-80 bytes	Forced moves cost 0 bits
Legal-move index + Zstandard on full file	~3-4 bits	~30-40 bytes	Library-wide delta + entropy coding
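The bytes-per-game column is just bits per half-move times 80 plies, divided by 8. A short sketch of that arithmetic, with a hypothetical helper name:

```python
def bytes_per_game(bits_per_ply: float, plies: int = 80) -> float:
    """Size of the move data for one game, ignoring headers."""
    return bits_per_ply * plies / 8

for label, bits in [("PGN text", 32), ("Binary coordinates", 12), ("Legal-move index", 6)]:
    print(f"{label}: {bytes_per_game(bits):.0f} bytes")
# PGN text: 320 bytes
# Binary coordinates: 120 bytes
# Legal-move index: 60 bytes
```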

The takeaway isn’t that binary always wins on raw size. It usually does — by 2-4x — but the real advantage is what you stop carrying:

No SAN disambiguation step. A binary move is a number, decoded with a shift and a mask.
No move-number bookkeeping. Half-move index is implicit in the array position.
No tag pair overhead. Headers become a separate fixed-size record, not a free-form string block.
No text-encoding hazards. The move stream is raw integers, so character-set and escaping bugs cannot occur there.

When PGN still wins#

PGN keeps its edge in three places.

Interchange. Every tool speaks PGN. Lichess, ChessBase, Scid, your engine — PGN is the lingua franca. A binary format that doesn’t round-trip cleanly through PGN isolates itself from the ecosystem.

Human editing. A trainer commenting a game in PGN is still vastly easier than dragging pieces on a binary blob.

Storage for small corpora. Below about 10,000 games, the difference between roughly 3 MB of PGN and about 1 MB of binary rarely matters. The parsing cost is where the pain shows up first.

When binary wins#

Once you cross into the regime of millions of games, or you need to compute on the data — opening prep, model training, real-time search — the binary representation pulls ahead in three measurable ways:

Load time. A binary file mmaps into memory. Parsing PGN requires a tokenizer, a parser, and a move validator. On a corpus of 10 million games, this is the difference between minutes and hours.
Random access. Without a separate offset index, PGN requires a sequential scan to find game N. A binary file with an offset table can jump directly.
Move indexing. Many algorithms — opening trees, tablebases, neural network training — want moves as integer indices into a legal-move table. PGN forces a full re-derivation on every load.
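Random access can be sketched with a small offset table in front of the game records. The file layout here (a game count, then one 32-bit offset per game plus an end sentinel, then the records) is an assumption made for this example, not a format the article defines.

```python
import struct

def build_file(games: list) -> bytes:
    """Assumed layout: uint32 count, (count + 1) uint32 offsets, then the records."""
    header_size = 4 + 4 * (len(games) + 1)
    offsets, cursor = [], header_size
    for record in games:
        offsets.append(cursor)
        cursor += len(record)
    offsets.append(cursor)                  # end sentinel: gives the last record's length
    header = struct.pack(f"<I{len(offsets)}I", len(games), *offsets)
    return header + b"".join(games)

def read_game(blob, n: int) -> bytes:
    """Fetch game n with two fixed-position reads, without scanning earlier games."""
    (count,) = struct.unpack_from("<I", blob, 0)
    if not 0 <= n < count:
        raise IndexError(n)
    start, end = struct.unpack_from("<2I", blob, 4 + 4 * n)
    return bytes(blob[start:end])

games = [b"\x31\xc7", b"", b"abc"]          # placeholder records of different lengths
blob = build_file(games)
print(read_game(blob, 2))                   # b'abc'
print(read_game(blob, 0))                   # b'1\xc7'
```

Because `read_game` only uses `struct.unpack_from` and slicing, the same code should work on other bytes-like objects such as a memory-mapped file.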

A pragmatic format stack#

In practice, no single format covers every use case. The realistic design is a layered one:

PGN:          import / export / interchange
Binary:       primary storage, random access
Indexed:      hot paths (opening trees, tablebases)
Compressed:   archival, sharing, sync

Each layer plays to its strength. PGN remains the boundary. Binary carries the load.

What about a PGN-like text format with coordinate moves?#

A middle ground exists: keep the readability of PGN but swap SAN for coordinate notation.

[Event "Example"]
[Result "1-0"]

1. e2e4 c7c5 2. g1f3 d7d6 3. d2d4 c5d4 4. f3d4 1-0

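This is Long Algebraic Notation, or LAN, in the coordinate-only form that UCI engines use. It parses with a trivial state machine and round-trips through any tool that understands move coordinates. It is not automatically smaller, though: a coordinate move is always four characters (five with a promotion), while SAN moves such as e4 or Nf3 are shorter, so the movetext above is 46 characters without the result marker, against 37 for the SAN version at the top of this article. Any size saving depends on what else the rewrite drops, such as check marks, annotation glyphs, or comments. For use cases that don’t need a fully binary format, LAN is a reasonable middle ground.

If you have a corpus that already exists in PGN, converting to LAN is a one-pass rewrite of each move token, although that pass still has to track the board to resolve each SAN move to its from-square. The payoff is on the reading side: faster parsing and no SAN ambiguity.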
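To show how little parsing coordinate movetext needs, here is a minimal sketch that turns it into (from, to, promotion) tuples. The token rules (move numbers end in a dot, four result markers, an optional lowercase promotion letter) and the a1 = 0 square numbering are assumptions for this example.

```python
import re

# A coordinate move: two squares plus an optional promotion letter.
MOVE = re.compile(r"^([a-h][1-8])([a-h][1-8])([qrbn])?$")
RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}

def square_index(name: str) -> int:
    return (int(name[1]) - 1) * 8 + "abcdefgh".index(name[0])

def parse_lan_movetext(text: str) -> list:
    """Return one (from_index, to_index, promotion_or_None) tuple per half-move."""
    moves = []
    for token in text.split():
        if token.endswith(".") or token in RESULTS:
            continue                        # move numbers and results carry no move
        match = MOVE.match(token)
        if match is None:
            raise ValueError(f"unexpected token: {token!r}")
        moves.append((square_index(match[1]), square_index(match[2]), match[3]))
    return moves

movetext = "1. e2e4 c7c5 2. g1f3 d7d6 3. d2d4 c5d4 4. f3d4 1-0"
parsed = parse_lan_movetext(movetext)
print(len(parsed))    # 7
print(parsed[0])      # (12, 28, None)
```

No board state is needed to read the tokens, which is the contrast with SAN that the paragraph above describes.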

A simple measurement#

A 1-million-game corpus of casual online play, taken from a popular open dataset, weighs roughly 280 MB in PGN. The same corpus in LAN is about 215 MB. The same corpus as binary coordinates (12 bits per half-move, plus headers) is about 130 MB. The same corpus indexed against a per-position legal-move table and Zstandard-compressed is about 45 MB.

That is the trade-off in raw numbers:

PGN → LAN: 23% smaller, ~3x faster parse, no ecosystem change required.
PGN → Binary: 54% smaller, ~10x faster parse, ecosystem opt-in required.
PGN → Indexed + Zstd: 84% smaller, but decoding needs decompression plus legal-move generation at every ply, so it reads slower than plain binary; requires a full pipeline rebuild.
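The size percentages follow from the corpus sizes quoted above. A quick check of that arithmetic, using only those four figures:

```python
# Corpus sizes in MB, as quoted in the measurement above.
sizes_mb = {"PGN": 280, "LAN": 215, "Binary": 130, "Indexed + Zstd": 45}

def percent_smaller(size: float, baseline: float) -> int:
    """Reduction relative to the baseline, rounded to a whole percent."""
    return round(100 * (1 - size / baseline))

for name in ("LAN", "Binary", "Indexed + Zstd"):
    print(name, percent_smaller(sizes_mb[name], sizes_mb["PGN"]))
# LAN 23
# Binary 54
# Indexed + Zstd 84
```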

Pick the format that matches your workload. The smallest truth the machine needs is rarely the one humans read.

If you want to experiment with the encoding schemes above, the Lichess open database is a real-world corpus where the size and parse-time differences are immediately visible.

https://corentings.dev/blog/beyond-pgn-chess-storage-format/

Author

Corentin Giaufer Saubert

Published at

2026-06-24

License

CC BY-NC-SA 4.0
