Okay, this is a minor hack, but I thought it was fun, so I thought I’d write it up here.
My original code for simulating the Chaocipher machine proceeded as follows: it found the character in the plaintext wheel (by linear search), then rotated each wheel to get the plain and cipher text entry to the zenith minus one and zenith positions respectively, then did the short cycle rotates. It used a temporary array to perform the first rotate, and then used memcpy to copy the results back. The short rotates were done in position. Hey, it was easy to write, but kind of ugly. In particular, it did twice as many read/write operations as necessary, and I hypothesized that it was a bottleneck.
So, I wrote a little code generator that instead generated “compiled” code for each of the 26 possible paths through the encoding code. Each entry in the cipher and plaintext wheels is read and written only once, and I use only an additional char of storage to hold temp values. Here’s the fragment of the code that executes if the cipher text is found in position number zero:
case 0: tmp = m->cw[1]; m->cw[1] = m->cw[2]; m->cw[2] = m->cw[3]; m->cw[3] = m->cw[4]; m->cw[4] = m->cw[5]; m->cw[5] = m->cw[6]; m->cw[6] = m->cw[7]; m->cw[7] = m->cw[8]; m->cw[8] = m->cw[9]; m->cw[9] = m->cw[10]; m->cw[10] = m->cw[11]; m->cw[11] = m->cw[12]; m->cw[12] = m->cw[13]; m->cw[13] = tmp; tmp = m->pw[0]; m->pw[0] = m->pw[1]; m->pw[1] = m->pw[2]; m->pw[2] = m->pw[4]; m->pw[4] = m->pw[6]; m->pw[6] = m->pw[8]; m->pw[8] = m->pw[10]; m->pw[10] = m->pw[12]; m->pw[12] = m->pw[14]; m->pw[14] = m->pw[15]; m->pw[15] = m->pw[16]; m->pw[16] = m->pw[17]; m->pw[17] = m->pw[18]; m->pw[18] = m->pw[19]; m->pw[19] = m->pw[20]; m->pw[20] = m->pw[21]; m->pw[21] = m->pw[22]; m->pw[22] = m->pw[23]; m->pw[23] = m->pw[24]; m->pw[24] = m->pw[25]; m->pw[25] = tmp; tmp = m->pw[3]; m->pw[3] = m->pw[5]; m->pw[5] = m->pw[7]; m->pw[7] = m->pw[9]; m->pw[9] = m->pw[11]; m->pw[11] = m->pw[13]; m->pw[13] = tmp; break;
You should notice that the cipher wheel doesn’t bother to move element zero at all (it doesn’t move if the plaintext letter is found at index 0). It also doesn’t touch the upper half of the cipher wheel, since those letters are left in place by this particular path. It then uses each element as the source and destination precisely once. Since there are two cycles, uses the temporary variable twice.
The resulting code is about 1450 lines long, but the resulting program runs almost 2.5x faster overall, searching about 10.8M nodes per second in my backtracking search. Worth the 20 minutes or so it took me to write the code generator.
Addendum: In theory at least, you could safely issue reads and writes which occur in different cycles in parallel, since you know they don’t target the same addresses. But I don’t know how to hint to the C compiler to perform that optimization. I also suspect that SSE instructions might allow you to load and swizzle faster.
It might also be faster to eliminate the subscripting by walking a pointer along the arrays.
The code generated by the compiler uses addressing modes which precompile all the offsets into the fetches and stores. I suspect those operate slightly slower in theory than simple pointer incrementing, but the fetches don’t occur in order, and I suspect memory latency swamps the actual instruction timing.
Maybe this algorithm can be easily parallelized and run in a GPU