A first-principles walkthrough of why dimensions (0,1), (2,3), (4,5) … are always paired — and why you need both sin and cos at the same frequency.
After embedding, each word is a 512-number vector encoding its meaning. But the model sees all words at once — it has no idea which came first. We need to inject position information directly into those 512 numbers.
The naive idea — put position 0, 1, 2 directly into dimension 0 — fails immediately. At position 999 the value is 999, completely overwhelming the embedding. Sine and cosine waves satisfy all four requirements simultaneously.
A sine wave is symmetric about its peak. Every value it produces on the way up (0 → π/2) is produced again on the way down (π/2 → π), because sin θ = sin(π − θ). Two different positions land on the same height, like standing on opposite slopes of a hill at the same altitude.
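The ambiguity can be checked numerically. This is a minimal sketch using only the standard library; the angle 0.5 is an arbitrary illustrative choice, not a value from the text.

```python
import math

theta = 0.5                 # arbitrary angle on the rising slope (radians)
mirror = math.pi - theta    # its mirror image on the falling slope

# Both angles give the same sine value, so sine alone cannot tell them apart.
same_height = math.isclose(math.sin(theta), math.sin(mirror))
print(same_height)
```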
The unit circle (radius = 1, centred at the origin) is the key. When you rotate from the starting point by angle θ, you land on coordinates (cos θ, sin θ). No two distinct angles between 0 and 2π land on the same point, so the pair identifies the angle uniquely.
The two positions that share the same sin value always have opposite cos values. Adding the cosine column breaks the tie — the pair (sin, cos) at the same frequency corresponds to a unique point on the unit circle.
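As a rough illustration of the tie-break, the sketch below takes two angles with equal sine and shows that the (sin, cos) pair still recovers each one. The helper `angle_from_pair` and the angle 0.5 are hypothetical, introduced only for this example.

```python
import math

def angle_from_pair(s, c):
    """Recover the angle in [0, 2*pi) from its (sin, cos) pair."""
    return math.atan2(s, c) % (2 * math.pi)

theta = 0.5                 # arbitrary illustrative angle
mirror = math.pi - theta    # shares the same sine value

# Same sine, opposite cosine: the cosine is what separates the two angles.
opposite_cos = math.isclose(math.cos(theta), -math.cos(mirror))

# Feeding each pair back through atan2 returns the original angle.
recovered = [angle_from_pair(math.sin(a), math.cos(a)) for a in (theta, mirror)]
print(opposite_cos, recovered)
```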
The identity sin²θ + cos²θ = 1 holds for every angle. It means (cos θ, sin θ) always lies exactly on the unit circle, and since each angle within one full turn has its own point on the circle, the pair pins down the angle and with it the position. This guarantee only holds when sin and cos are at the same frequency.
| Symbol | Meaning | Range / Value |
|---|---|---|
| pos | Word position in the sentence | 0, 1, 2, 3 … |
| i | Pair index — which clock? | 0 to 255 |
| 2i | Even dimension → holds sin value | 0, 2, 4 … 510 |
| 2i+1 | Odd dimension → holds cos value | 1, 3, 5 … 511 |
| 512 | Total embedding dimensions | fixed |
| 10000 | Scaling constant — controls frequency range | chosen empirically |
| 10000^(2i/512) | Frequency divisor: grows as i grows → slower wave | 1.00 (i=0) → ≈ 9,647 (i=255) |
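The symbols in the table can be put together as a short sketch. The function names `divisor` and `positional_encoding` are assumptions made for this example, and the code uses plain Python lists rather than any particular tensor library.

```python
import math

D_MODEL = 512   # total embedding dimensions, as in the table
BASE = 10000    # scaling constant, as in the table

def divisor(i, d_model=D_MODEL, base=BASE):
    """Frequency divisor for pair index i: base ** (2i / d_model)."""
    return base ** (2 * i / d_model)

def positional_encoding(pos, d_model=D_MODEL, base=BASE):
    """Return the d_model-long encoding for one position."""
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / divisor(i, d_model, base)
        pe[2 * i] = math.sin(angle)       # even dimension holds sin
        pe[2 * i + 1] = math.cos(angle)   # odd dimension holds cos
    return pe

pe0 = positional_encoding(0)
pe3 = positional_encoding(3)
print(len(pe3), divisor(0), round(divisor(255)))
```

Every value stays between -1 and 1 regardless of position, and each (even, odd) pair shares one frequency.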
Fast clocks distinguish nearby positions. Slow clocks distinguish distant positions. Together, the 256 pairs span distances from adjacent positions (1 vs 2) up to roughly 60,000 positions, about one full turn of the slowest clock.
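One way to see the range of clock speeds is to compute how many positions each pair needs to complete one full turn, which is 2π times its frequency divisor. The helper name `wavelength` is an assumption made for this sketch.

```python
import math

D_MODEL = 512
BASE = 10000

def wavelength(i, d_model=D_MODEL, base=BASE):
    """Number of positions pair i needs to complete one full turn."""
    return 2 * math.pi * base ** (2 * i / d_model)

fastest = wavelength(0)      # about 6.28 positions per turn
slowest = wavelength(255)    # roughly 60,600 positions per turn
print(round(fastest, 2), round(slowest))
```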
| Position | sin (dim 0) | cos (dim 1) | Point on circle |
|---|---|---|---|
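Rows for a table with these columns can be generated with a short script. This is only a sketch: the helper `circle_rows` and the choice of positions 0 to 3 are assumptions, and it relies on the frequency divisor for dimensions 0 and 1 being 1, so the angle equals the position.

```python
import math

def circle_rows(positions):
    """Format one table row per position for dims 0 (sin) and 1 (cos)."""
    rows = []
    for pos in positions:
        s, c = math.sin(pos), math.cos(pos)   # divisor is 1 for pair i = 0
        rows.append(f"| {pos} | {s:.4f} | {c:.4f} | ({c:.4f}, {s:.4f}) |")
    return rows

for row in circle_rows(range(4)):
    print(row)
```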
sin and cos at the same frequency are the two coordinates of one point on a unit circle — and no two positions ever land on the same point.