10 min left
1956 words
10 minutes

Beyond PGN: Designing an Ultra-Efficient Chess Storage Format

Every chess database begins with a simple question:

How do we store a game of chess?

For decades, the default answer has been PGN: Portable Game Notation. It is human-readable, easy to share, easy to edit, and universally supported. A PGN game feels natural because it looks close to how chess players already talk about moves:

1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4

For humans, this is excellent.

For machines, it is surprisingly wasteful.

A chess engine, a database, or a high-performance analysis tool does not need the move to be written as "Nxd4" or "O-O". It does not need spaces, move numbers, dots, or SAN disambiguation. It only needs to know one thing:

Which move was played?

When I first started thinking about chess storage, I assumed PGN was already compact enough. Then I did the math, and PGN started looking less like an efficient storage format and more like a convenient text format that happens to be compressible.

If we want to build a modern chess database capable of storing millions of games, loading them instantly, syncing them efficiently, and navigating them without constantly reparsing text, we can go much further.


The First Realization: A Chessboard Is Only 64 Squares#

A chessboard looks large to a beginner.

But to a computer, it is tiny.

There are only 64 squares. That means any square can be represented with just 6 bits, because:

2^6 = 64

So instead of storing a move as text like:

Nf3

or:

exd5

we can store it as two square coordinates:

from_square -> to_square

For example:

e2 -> e4

One square needs 6 bits.

Two squares need:

6 bits + 6 bits = 12 bits

That means most normal chess moves can be represented in only 12 bits.

This is the simplest binary approach:

[from: 6 bits][to: 6 bits]

It is compact, direct, and fast to decode.

The reader does not need to parse SAN. It does not need to understand whether "Nbd2" means the knight from b1 or f3. It does not need to interpret captures, checks, checkmates, or castling notation. It simply reads two square IDs and applies the move.

The chess rules are still needed to update the board correctly, but the move identity itself is no longer hidden inside text.

Castling and en passant fit naturally into this scheme. For castling, we encode the king’s from-square to its destination: e1→g1 for kingside, e1→c1 for queenside. The rook move is implied by the rules. For en passant, we encode it as a normal pawn capture using the from-square and the destination square. The capture of the opponent’s pawn is a board-state consequence.

Diagram showing how PGN text can be replaced by binary coordinates: 6 bits for the source square, 6 bits for the destination square, and 2 extra bits for promotions


The Promotion Problem#

Chess always has exceptions.

Most moves can be represented with a starting square and a destination square. But promotions need one extra piece of information.

If a pawn reaches the final rank, it can promote to one of four pieces:

Queen
Rook
Bishop
Knight

Four possibilities require 2 bits:

00 = queen
01 = rook
10 = bishop
11 = knight

So a promotion move can be stored as:

[from: 6 bits][to: 6 bits][promotion: 2 bits]

That gives us:

12 bits + 2 bits = 14 bits

So with a simple coordinate-based binary encoding, we can represent almost every chess move in 12 bits, and promotion moves in 14 bits.

Compared to PGN text, that is already a major improvement.


Why This Is Already Better Than PGN#

Let’s take a common PGN move:

Nxd4

That is four characters. In a normal text encoding, each character usually costs 8 bits.

So this move alone costs roughly:

4 characters × 8 bits = 32 bits

And that does not include spaces, move numbers, comments, variations, line breaks, or metadata.

PGN is also more expensive to interpret than its size alone suggests. A PGN parser must deal with disambiguation like "Nbd2", "R1e2", or "exd5". To understand those moves, the parser must reconstruct the board state, find legal pieces that can reach the target square, and determine the correct one. PGN is semantic text that requires chess-aware parsing.

A coordinate move costs about:

12 bits

And it needs no chess-aware parsing to identify the move. The system reads two square IDs and applies the move directly.

A rough comparison looks like this:

PGN text move: ~24–40 bits or more
Binary coordinates: ~12–14 bits

That is the first big lesson:

PGN was designed for readability and portability, with compactness as a secondary concern.


The Second Realization: Maybe We Don’t Need Coordinates at All#

We can do better.

At any given chess position, the number of legal moves is limited.

A player does not have 4,096 possible moves. They usually have something like 20, 30, 40, or maybe 50 legal moves. The theoretical maximum is 218 legal moves (requiring 8 bits), but such positions are vanishingly rare in practice.

So instead of storing:

from_square -> to_square

we could do something more clever.

Imagine that both the writer and the reader generate the exact same list of legal moves for the current position:

0: e2e4
1: d2d4
2: g1f3
3: c2c4
...

Then we do not need to store the move itself.

We only need to store its index.

If the move played was the third move in the list, we store:

2

That is the legal-move-index approach.

Instead of asking:

Which square did the piece move from, and where did it go?

we ask:

Out of all legal moves in this position, which one was chosen?

This can be much smaller.

If a position has fewer than 64 legal moves, the index fits in 6 bits.

If it has fewer than 128 legal moves, it fits in 7 bits.

Most chess positions fit comfortably in that range.

So now our rough comparison becomes:

PGN text: ~24–40 bits per move
Coordinates: ~12–14 bits per move
Legal move index: ~6–7 bits per move

That is a dramatic reduction.

Diagram explaining legal move index encoding: generate legal moves from the current position, store only the selected move index, and use 6 to 7 bits for most positions


The Beautiful Edge Case: Forced Moves Cost Almost Nothing#

This approach has an elegant consequence: forced moves cost nothing.

Sometimes a player has only one legal move.

For example, in a forced checkmate sequence or a position where every move except one is illegal, there may be exactly one valid choice.

If there is only one legal move, we do not need to store anything.

The decoder generates the legal move list, sees that only one move exists, and applies it automatically.

The move costs:

0 bits

That feels almost magical.

But it is not magic. It is simply using the rules of chess as shared context between the encoder and decoder.

When the rules determine the answer, the file does not need to repeat it.

This is the core principle behind many efficient encodings:

Do not store information that can be reconstructed deterministically.


The Catch: Compression Is Not Free#

At this point, legal-move indexing sounds like the obvious winner.

Why store 12 bits when we can store 6?

Why use coordinates at all?

The answer is performance.

To decode a coordinate-based move, the reader does something simple:

Read from-square
Read to-square
Apply move

This is extremely fast.

To decode a legal-move index, the reader must do more work:

Generate all legal moves
Sort/order them deterministically
Read the index
Select the matching move
Apply move

That means every move requires legal move generation.

For a single game, this is fine.

For millions of games, it becomes expensive.

Compression is only half the problem. A chess database is also a performance problem.

If your format is beautifully small but slow to scan, slow to open, slow to index, or difficult to randomly access, it may be worse in practice than a larger format.

This gives us the fundamental trade-off:

Smaller files usually require more computation.
Faster decoding usually requires storing more explicit information.

There is no universal best answer. There is only the best answer for a specific system.

The coordinate approach is simple, direct, and practical.

It gives us:

[from square][to square][optional promotion]

Its strengths are obvious:

  • Very fast to decode.
  • Easy to implement.
  • Easy to validate.
  • Good for random access.
  • Good for database indexing.
  • Does not require generating all legal moves just to identify the move.

Its weakness is that it is not maximally compact.

The legal-move-index approach is more compressed.

It gives us:

[index of move in legal move list]

Its strengths are:

  • Extremely compact.
  • Can exploit forced moves.
  • Can approach the theoretical information needed to describe a chess game.

Its weaknesses are serious:

  • Requires legal move generation at every ply.
  • Requires a perfectly deterministic move ordering.
  • Harder to implement safely.
  • More expensive to decode at massive scale.
  • More fragile if the rules, variants, or encoding assumptions change.

That narrows the question to: what does the system need to optimize for?

A serious chess database needs to:

  • Import millions of games.
  • Open individual games instantly.
  • Jump to arbitrary positions.
  • Build position indexes.
  • Search references.
  • Display games quickly.
  • Sync data efficiently.
  • Support annotations, comments, variations, and repertoires.
  • Export back to PGN when needed.

For that kind of system, raw compactness is not enough. A format must also be operationally useful.

A coordinate-based binary format is often the better engineering compromise. It is still far smaller than PGN, but it remains simple and fast.

For cold archival storage where decoding speed matters less, legal-move indexing is more attractive. The file would be tiny, and we could afford extra computation because the data would rarely be read.

Comparison chart showing the storage and decoding trade-off between PGN text, binary coordinates, and legal move indexing


The Practical Design Choice#

For a modern chess database, I would start with coordinate-based binary moves.

Something like:

12 bits for normal moves
14 bits for promotions

This gives an excellent balance:

Much smaller than PGN
Much faster to parse than PGN
Simple enough to implement safely
Efficient for massive databases
Friendly to random access and indexing

Legal-move indexing can still be useful later, especially for a compressed distribution format or cold storage format.

But for the primary editable database representation, coordinates are a better foundation.

This leads to a layered architecture:

Internal database format: coordinate-based binary moves
Optional archive format: legal-move-index compression
Export format: deterministic PGN

That way, each format does what it is best at.

PGN remains the universal language for humans and existing tools.

The binary coordinate format becomes the fast internal representation.

The legal-move-index format becomes an optional compression layer for distribution or archival storage.

Architecture diagram for modern chess storage: PGN import and export, binary coordinate core, optional compression layer, compact initial state, and database features


What About the Initial Position?#

So far, we assumed that every game starts from the normal chess starting position.

But real databases also contain games that begin from custom positions:

  • Chess960 games.
  • Training positions.
  • Composed studies.
  • Partial game fragments.
  • Engine lines.
  • Positions imported from FEN.

For those cases, the format needs to store the initial board state.

The usual text format for this is FEN:

rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1

FEN is useful and readable, but again, it is text.

A binary format could store a compact “Bit-FEN” instead.

At minimum, an initial position needs:

  • Piece placement.
  • Side to move.
  • Castling rights.
  • En passant square, if any.
  • Halfmove clock.
  • Fullmove number.

The exact encoding can vary, but the principle is the same:

Store chess state as structured binary data, not as text that must be reparsed.

For normal games, we can omit the initial board entirely and assume the standard starting position.

For custom games, we include a compact binary initial-state block.

That gives us efficiency without losing generality.


Conclusion: PGN Is the Beginning, Not the End#

PGN gave chess software a common language.

But modern chess software can go beyond it.

A binary chess format can represent moves in 12 to 14 bits using coordinates. With legal-move indexing, it can sometimes approach 6 to 7 bits per move, or even 0 bits for forced moves.

But the smallest format is not always the best format.

For an interactive chess database, the best design is likely a pragmatic one:

Use binary coordinates for speed.
Use legal-move indexing where compression matters.
Use PGN for import and export.
Use a compact binary position format for custom starting states.

That gives us a layered system where each format plays to its strength:

Human compatibility.
Machine efficiency.
Fast random access.
Compact storage.
Deterministic export.
Room for future compression.

PGN remains useful.

But it does not have to be where the story ends.

Sometimes the real breakthrough begins when we stop asking:

How do we store the notation?

And start asking:

What is the smallest useful truth the machine actually needs?

Beyond PGN: Designing an Ultra-Efficient Chess Storage Format
https://corentings.dev/blog/beyond-pgn-chess-storage-format/
Author
Corentin Giaufer Saubert
Published at
2026-06-24
License
CC BY-NC-SA 4.0
Share this post