Beyond PGN: Designing an Ultra-Efficient Chess Storage Format
Every chess database begins with a simple question:
How do we store a game of chess?
For decades, the default answer has been PGN: Portable Game Notation. It is human-readable, easy to share, easy to edit, and universally supported. A PGN game feels natural because it looks close to how chess players already talk about moves:
1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4For humans, this is excellent.
For machines, it is surprisingly wasteful.
A chess engine, a database, or a high-performance analysis tool does not need the move to be written as "Nxd4" or "O-O". It does not need spaces, move numbers, dots, or SAN disambiguation. It only needs to know one thing:
Which move was played?
When I first started thinking about chess storage, I assumed PGN was already compact enough. Then I did the math, and PGN started looking less like an efficient storage format and more like a convenient text format that happens to be compressible.
If we want to build a modern chess database capable of storing millions of games, loading them instantly, syncing them efficiently, and navigating them without constantly reparsing text, we can go much further.
The First Realization: A Chessboard Is Only 64 Squares
A chessboard looks large to a beginner.
But to a computer, it is tiny.
There are only 64 squares. That means any square can be represented with just 6 bits, because:
2^6 = 64So instead of storing a move as text like:
Nf3or:
exd5we can store it as two square coordinates:
from_square -> to_squareFor example:
e2 -> e4One square needs 6 bits.
Two squares need:
6 bits + 6 bits = 12 bitsThat means most normal chess moves can be represented in only 12 bits.
This is the simplest binary approach:
[from: 6 bits][to: 6 bits]It is compact, direct, and fast to decode.
The reader does not need to parse SAN. It does not need to understand whether "Nbd2" means the knight from b1 or f3. It does not need to interpret captures, checks, checkmates, or castling notation. It simply reads two square IDs and applies the move.
The chess rules are still needed to update the board correctly, but the move identity itself is no longer hidden inside text.
Castling and en passant fit naturally into this scheme. For castling, we encode the king’s from-square to its destination: e1→g1 for kingside, e1→c1 for queenside. The rook move is implied by the rules. For en passant, we encode it as a normal pawn capture using the from-square and the destination square. The capture of the opponent’s pawn is a board-state consequence.

The Promotion Problem
Chess always has exceptions.
Most moves can be represented with a starting square and a destination square. But promotions need one extra piece of information.
If a pawn reaches the final rank, it can promote to one of four pieces:
QueenRookBishopKnightFour possibilities require 2 bits:
00 = queen01 = rook10 = bishop11 = knightSo a promotion move can be stored as:
[from: 6 bits][to: 6 bits][promotion: 2 bits]That gives us:
12 bits + 2 bits = 14 bitsSo with a simple coordinate-based binary encoding, we can represent almost every chess move in 12 bits, and promotion moves in 14 bits.
Compared to PGN text, that is already a major improvement.
Why This Is Already Better Than PGN
Let’s take a common PGN move:
Nxd4That is four characters. In a normal text encoding, each character usually costs 8 bits.
So this move alone costs roughly:
4 characters × 8 bits = 32 bitsAnd that does not include spaces, move numbers, comments, variations, line breaks, or metadata.
PGN is also more expensive to interpret than its size alone suggests. A PGN parser must deal with disambiguation like "Nbd2", "R1e2", or "exd5". To understand those moves, the parser must reconstruct the board state, find legal pieces that can reach the target square, and determine the correct one. PGN is semantic text that requires chess-aware parsing.
A coordinate move costs about:
12 bitsAnd it needs no chess-aware parsing to identify the move. The system reads two square IDs and applies the move directly.
A rough comparison looks like this:
PGN text move: ~24–40 bits or moreBinary coordinates: ~12–14 bitsThat is the first big lesson:
PGN was designed for readability and portability, with compactness as a secondary concern.
The Second Realization: Maybe We Don’t Need Coordinates at All
We can do better.
At any given chess position, the number of legal moves is limited.
A player does not have 4,096 possible moves. They usually have something like 20, 30, 40, or maybe 50 legal moves. The theoretical maximum is 218 legal moves (requiring 8 bits), but such positions are vanishingly rare in practice.
So instead of storing:
from_square -> to_squarewe could do something more clever.
Imagine that both the writer and the reader generate the exact same list of legal moves for the current position:
0: e2e41: d2d42: g1f33: c2c4...Then we do not need to store the move itself.
We only need to store its index.
If the move played was the third move in the list, we store:
2That is the legal-move-index approach.
Instead of asking:
Which square did the piece move from, and where did it go?
we ask:
Out of all legal moves in this position, which one was chosen?
This can be much smaller.
If a position has fewer than 64 legal moves, the index fits in 6 bits.
If it has fewer than 128 legal moves, it fits in 7 bits.
Most chess positions fit comfortably in that range.
So now our rough comparison becomes:
PGN text: ~24–40 bits per moveCoordinates: ~12–14 bits per moveLegal move index: ~6–7 bits per moveThat is a dramatic reduction.

The Beautiful Edge Case: Forced Moves Cost Almost Nothing
This approach has an elegant consequence: forced moves cost nothing.
Sometimes a player has only one legal move.
For example, in a forced checkmate sequence or a position where every move except one is illegal, there may be exactly one valid choice.
If there is only one legal move, we do not need to store anything.
The decoder generates the legal move list, sees that only one move exists, and applies it automatically.
The move costs:
0 bitsThat feels almost magical.
But it is not magic. It is simply using the rules of chess as shared context between the encoder and decoder.
When the rules determine the answer, the file does not need to repeat it.
This is the core principle behind many efficient encodings:
Do not store information that can be reconstructed deterministically.
The Catch: Compression Is Not Free
At this point, legal-move indexing sounds like the obvious winner.
Why store 12 bits when we can store 6?
Why use coordinates at all?
The answer is performance.
To decode a coordinate-based move, the reader does something simple:
Read from-squareRead to-squareApply moveThis is extremely fast.
To decode a legal-move index, the reader must do more work:
Generate all legal movesSort/order them deterministicallyRead the indexSelect the matching moveApply moveThat means every move requires legal move generation.
For a single game, this is fine.
For millions of games, it becomes expensive.
Compression is only half the problem. A chess database is also a performance problem.
If your format is beautifully small but slow to scan, slow to open, slow to index, or difficult to randomly access, it may be worse in practice than a larger format.
This gives us the fundamental trade-off:
Smaller files usually require more computation.Faster decoding usually requires storing more explicit information.There is no universal best answer. There is only the best answer for a specific system.
Coordinates vs Legal Move Indexes
The coordinate approach is simple, direct, and practical.
It gives us:
[from square][to square][optional promotion]Its strengths are obvious:
- Very fast to decode.
- Easy to implement.
- Easy to validate.
- Good for random access.
- Good for database indexing.
- Does not require generating all legal moves just to identify the move.
Its weakness is that it is not maximally compact.
The legal-move-index approach is more compressed.
It gives us:
[index of move in legal move list]Its strengths are:
- Extremely compact.
- Can exploit forced moves.
- Can approach the theoretical information needed to describe a chess game.
Its weaknesses are serious:
- Requires legal move generation at every ply.
- Requires a perfectly deterministic move ordering.
- Harder to implement safely.
- More expensive to decode at massive scale.
- More fragile if the rules, variants, or encoding assumptions change.
That narrows the question to: what does the system need to optimize for?
A serious chess database needs to:
- Import millions of games.
- Open individual games instantly.
- Jump to arbitrary positions.
- Build position indexes.
- Search references.
- Display games quickly.
- Sync data efficiently.
- Support annotations, comments, variations, and repertoires.
- Export back to PGN when needed.
For that kind of system, raw compactness is not enough. A format must also be operationally useful.
A coordinate-based binary format is often the better engineering compromise. It is still far smaller than PGN, but it remains simple and fast.
For cold archival storage where decoding speed matters less, legal-move indexing is more attractive. The file would be tiny, and we could afford extra computation because the data would rarely be read.

The Practical Design Choice
For a modern chess database, I would start with coordinate-based binary moves.
Something like:
12 bits for normal moves14 bits for promotionsThis gives an excellent balance:
Much smaller than PGNMuch faster to parse than PGNSimple enough to implement safelyEfficient for massive databasesFriendly to random access and indexingLegal-move indexing can still be useful later, especially for a compressed distribution format or cold storage format.
But for the primary editable database representation, coordinates are a better foundation.
This leads to a layered architecture:
Internal database format: coordinate-based binary movesOptional archive format: legal-move-index compressionExport format: deterministic PGNThat way, each format does what it is best at.
PGN remains the universal language for humans and existing tools.
The binary coordinate format becomes the fast internal representation.
The legal-move-index format becomes an optional compression layer for distribution or archival storage.

What About the Initial Position?
So far, we assumed that every game starts from the normal chess starting position.
But real databases also contain games that begin from custom positions:
- Chess960 games.
- Training positions.
- Composed studies.
- Partial game fragments.
- Engine lines.
- Positions imported from FEN.
For those cases, the format needs to store the initial board state.
The usual text format for this is FEN:
rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1FEN is useful and readable, but again, it is text.
A binary format could store a compact “Bit-FEN” instead.
At minimum, an initial position needs:
- Piece placement.
- Side to move.
- Castling rights.
- En passant square, if any.
- Halfmove clock.
- Fullmove number.
The exact encoding can vary, but the principle is the same:
Store chess state as structured binary data, not as text that must be reparsed.
For normal games, we can omit the initial board entirely and assume the standard starting position.
For custom games, we include a compact binary initial-state block.
That gives us efficiency without losing generality.
Conclusion: PGN Is the Beginning, Not the End
PGN gave chess software a common language.
But modern chess software can go beyond it.
A binary chess format can represent moves in 12 to 14 bits using coordinates. With legal-move indexing, it can sometimes approach 6 to 7 bits per move, or even 0 bits for forced moves.
But the smallest format is not always the best format.
For an interactive chess database, the best design is likely a pragmatic one:
Use binary coordinates for speed.Use legal-move indexing where compression matters.Use PGN for import and export.Use a compact binary position format for custom starting states.That gives us a layered system where each format plays to its strength:
Human compatibility.Machine efficiency.Fast random access.Compact storage.Deterministic export.Room for future compression.PGN remains useful.
But it does not have to be where the story ends.
Sometimes the real breakthrough begins when we stop asking:
How do we store the notation?
And start asking:
What is the smallest useful truth the machine actually needs?
Related Posts
How to Merge PGN Files in F#: Streaming, Performance, and Discriminated Unions
How I built a CLI tool to merge chess PGN files using F#'s type system, streaming I/O, and functional patterns — merging gigabytes of games with 64 KB of memory.
TDD Isn't About Bugs — It's Your Permission to Refactor
Learn why test-driven development is really about permission to refactor, not catching bugs. With TypeScript examples, Result<T, E> patterns, and behavior-based testing from 3 years in production.
Why I Built ZaString
On zero allocations, Span<T>, and the pursuit of performance without sacrifice.