Positional Encoding · Deep Dive

Why sine & cosine
come in pairs

A first-principles walkthrough of why dimensions (0,1), (2,3), (4,5) … are always paired — and why you need both sin and cos at the same frequency.

01 — The goal

Every word needs a unique
position fingerprint

After embedding, each word is a 512-number vector encoding its meaning. But the model sees all words at once — it has no idea which came first. We need to inject position information directly into those 512 numbers.

What a good fingerprint requires
Bounded (−1 to +1): never drowns the word's meaning
Unique (pos A ≠ pos B): no two positions look identical
Smooth (near → similar): nearby positions resemble each other
Generalisable (any pos): works even for unseen lengths

The naive idea — put position 0, 1, 2 directly into dimension 0 — fails immediately. At position 999 the value is 999, completely overwhelming the embedding. Sine and cosine waves satisfy all four requirements simultaneously.


02 — One wave fails

The hill problem:
sin alone is ambiguous

A sine wave is symmetric about its peak. Every value it produces on the way up (0 → π/2) is produced again on the way down (π/2 → π). Two completely different positions land on the same height, like standing on opposite slopes of a hill at the same altitude.

[Figure: sin(x) plotted from 0 to π. Positions A and B sit on opposite slopes at the same height, sin = 0.5.]
sin(π/6) = 0.500 ← position 0.52 — "je"
sin(5π/6) = 0.500 ← position 2.62 — different word!

sin alone → model cannot tell them apart.
This is called phase ambiguity. A slower wave fixes the repetition problem, but then nearby positions become nearly identical (0.0001, 0.0002…). You cannot win with just one wave at one frequency.
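
The ambiguity can be checked numerically. The following is a minimal sketch using only Python's standard `math` module; the two angles are the ones from the example above, while the slow-wave divisor of 10000 is an assumed value chosen purely for illustration.

```python
import math

# Two different angles on either side of the sine peak (pi/2).
angle_a = math.pi / 6        # about 0.52
angle_b = 5 * math.pi / 6    # about 2.62

sin_a = math.sin(angle_a)
sin_b = math.sin(angle_b)

# Both land at the same height, so sin alone cannot separate them.
print(round(sin_a, 3), round(sin_b, 3))

# Slowing the wave down avoids the repeat, but neighbouring positions
# then differ only around the fourth decimal place.
slow_divisor = 10000.0  # assumed value, for illustration only
slow_1 = math.sin(1 / slow_divisor)
slow_2 = math.sin(2 / slow_divisor)
print(slow_1, slow_2)
```

The first print shows the collision between the two angles; the second shows how little a very slow wave moves between positions 1 and 2.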

03 — The unit circle

Every angle maps to
a unique point

The unit circle (radius = 1, centred at origin) is the key. When you rotate from the starting point by angle θ, you land on coordinates (cos θ, sin θ). No two angles in the range 0 ≤ θ < 2π land on the same point, so within one full turn the pair is always unique.

[Interactive figure: unit circle with an adjustable angle θ (radians). At θ = 0 the point is (cos θ, sin θ) = (1.00, 0.00), and the Pythagorean identity gives cos² + sin² = 1.00.]
Key insight: when sin = 0, cos = ±1. When sin peaks, cos = 0. They are never both zero — the pair always has enough information to specify the angle uniquely.
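
One way to see that the pair pins down the angle is to invert it. The sketch below uses the standard library's `math.atan2`; the helper names `point_on_circle` and `angle_from_point` are hypothetical, introduced here only for illustration.

```python
import math

def point_on_circle(theta):
    """Return the (cos, sin) coordinates reached by rotating through theta."""
    return math.cos(theta), math.sin(theta)

def angle_from_point(x, y):
    """Recover the angle in [0, 2*pi) from a point on the unit circle."""
    return math.atan2(y, x) % (2 * math.pi)

for theta in (0.0, math.pi / 6, 5 * math.pi / 6, 4.0):
    x, y = point_on_circle(theta)
    recovered = angle_from_point(x, y)
    print(f"theta={theta:.3f}  point=({x:+.2f}, {y:+.2f})  recovered={recovered:.3f}")
```

For any angle within one full turn, the recovered value matches the original up to floating-point rounding, which is not possible from sin alone.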

04 — The pair fix

cos resolves the
ambiguity completely

The two positions that share the same sin value always have opposite cos values. Adding the cosine column breaks the tie — the pair (sin, cos) at the same frequency corresponds to a unique point on the unit circle.

sin alone — fails
pos A = 0.52
  sin = 0.500

pos B = 2.62
  sin = 0.500

indistinguishable ✗
sin + cos pair — works
pos A = 0.52
  (sin, cos) = (0.50, +0.87)

pos B = 2.62
  (sin, cos) = (0.50, −0.87)

unique point on circle ✓
The Pythagorean guarantee
sin²(θ) + cos²(θ) = 1

This identity holds for every angle. It means (cos θ, sin θ) is always exactly on the unit circle, and since each angle within one full turn has its own point on the circle, the pair identifies the angle without ambiguity. This guarantee only holds when sin and cos are at the same frequency.
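
A small numerical check of both claims, as a sketch under stated assumptions: the helper `clock` is a hypothetical name, and the divisor of 3 used for the mixed-frequency case is an arbitrary illustrative choice.

```python
import math

def clock(pos, divisor):
    """(sin, cos) of one pair; both use the same divisor, i.e. the same frequency."""
    angle = pos / divisor
    return math.sin(angle), math.cos(angle)

pos_a, pos_b = math.pi / 6, 5 * math.pi / 6   # the 0.52 and 2.62 from above

sin_a, cos_a = clock(pos_a, 1.0)
sin_b, cos_b = clock(pos_b, 1.0)

# Same sin, opposite cos: the pair separates the two positions.
print((round(sin_a, 2), round(cos_a, 2)))
print((round(sin_b, 2), round(cos_b, 2)))

# Same frequency: the squares sum to 1.
same_freq = sin_a ** 2 + cos_a ** 2

# Mixed frequencies (sin at divisor 1, cos at an arbitrary divisor 3):
# the point generally leaves the unit circle.
mixed_freq = math.sin(pos_a / 1.0) ** 2 + math.cos(pos_a / 3.0) ** 2
print(same_freq, mixed_freq)
```

When the two components share a frequency the sum of squares stays at 1; once they do not, the point drifts off the circle and the uniqueness argument no longer applies.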

[Figure: wave preview at pair i = 0, showing sin (dim 2i) and cos (dim 2i+1).]
Why consecutive dimensions (2i, 2i+1)? Keeping each sin+cos pair in adjacent dimensions is mathematically clean: the two values form one self-contained "clock". In this layout, dimensions far apart (e.g. 0 and 256) carry different frequencies, so pairing them would lose the unit-circle guarantee.

05 — The formula, dissected

Every symbol
explained

Positional encoding formula
PE(pos, 2i)   = sin( pos / 10000^(2i/512) )
PE(pos, 2i+1) = cos( pos / 10000^(2i/512) )
Symbol | Meaning | Range / Value
pos | Word position in the sentence | 0, 1, 2, 3 …
i | Pair index (which clock?) | 0 to 255
2i | Even dimension → holds sin value | 0, 2, 4 … 510
2i+1 | Odd dimension → holds cos value | 1, 3, 5 … 511
512 | Total embedding dimensions | fixed
10000 | Scaling constant, controls frequency range | chosen empirically
10000^(2i/512) | Frequency divisor: grows as i grows → slower wave | 1.00 (i=0) → ≈9,647 (i=255)
Why 10000? It is large enough that the slowest wave (i=255) completes one cycle only after ≈60,600 positions (2π × 9,647), so even the slowest clock does not repeat within sentences up to ~60,000 words long. The exact value is not critical; any comparably large constant works.
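
The formula can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the names `positional_encoding`, `D_MODEL`, and `BASE` are hypothetical, and a real model would typically compute all positions at once with an array library.

```python
import math

D_MODEL = 512    # embedding size used throughout this walkthrough
BASE = 10000.0   # scaling constant from the formula

def positional_encoding(pos, d_model=D_MODEL, base=BASE):
    """Return the d_model-long encoding for one position as a list of floats."""
    encoding = [0.0] * d_model
    for i in range(d_model // 2):              # i = pair index, 0 to 255
        divisor = base ** (2 * i / d_model)    # 1.00 at i = 0, grows with i
        angle = pos / divisor
        encoding[2 * i] = math.sin(angle)      # even dimension holds sin
        encoding[2 * i + 1] = math.cos(angle)  # odd dimension holds cos
    return encoding

pe0 = positional_encoding(0)
pe3 = positional_encoding(3)
print(pe0[:4])                          # position 0: each sin is 0, each cos is 1
print([round(v, 3) for v in pe3[:4]])   # first two pairs at position 3
```

Every value stays between −1 and +1, and each adjacent (even, odd) pair lies on the unit circle, so the resulting vector can be added to the word embedding without overwhelming it.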

06 — Interactive explorer

256 clocks,
each at a different speed

Fast clocks distinguish nearby positions. Slow clocks distinguish distant positions. Together, all 256 pairs cover every distance, from position 1 vs 2 to position 1 vs 60,000.
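
The division of labour between fast and slow clocks can be measured directly. In this sketch the helpers `pair`, `gap`, and `wavelength` are hypothetical names, and the comparison position 20001 is an arbitrary choice for illustration.

```python
import math

D_MODEL = 512
BASE = 10000.0

def pair(pos, i):
    """(sin, cos) for pair index i at position pos."""
    angle = pos / BASE ** (2 * i / D_MODEL)
    return math.sin(angle), math.cos(angle)

def gap(pos_a, pos_b, i):
    """Straight-line distance between two positions on clock i."""
    return math.dist(pair(pos_a, i), pair(pos_b, i))

def wavelength(i):
    """Number of positions clock i needs for one full turn."""
    return 2 * math.pi * BASE ** (2 * i / D_MODEL)

# Neighbouring positions: the fastest clock moves a lot, the slowest barely at all.
print(gap(1, 2, 0), gap(1, 2, 255))

# Distant positions: the slowest clock has now moved a clearly visible amount.
print(gap(1, 20001, 255))

# Positions per full turn for the fastest and slowest clocks.
print(wavelength(0), wavelength(255))
```

The fastest clock separates neighbours by a large fraction of the circle's diameter but wraps around every few positions, while the slowest clock barely moves between neighbours and needs tens of thousands of positions to complete one turn.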

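[Interactive pair explorer: choose a pair index i (0 = fastest · 255 = slowest). At i = 0 the pair fills dims 0 and 1, the frequency divisor is 1.00, and the wave speed is the fastest. For each position the explorer lists sin (dim 0), cos (dim 1), and the resulting point on the circle.]
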
The one-sentence answer

sin and cos at the same frequency are the two coordinates of one point on a unit circle — and no two positions ever land on the same point.