Efficient FFT Algorithm and Programming Tricks

I do like programming “for fun”, and this includes writing programs that, well, have been written hundreds of times before, but which I never have bothered doing. Often, this is just an exercise: freely available libraries often exist for many tasks which are considerably better than anything that I write. Nevertheless, I don’t consider it pointless. There are situations where having simple, self-contained code that is free from any particular licensing issues.

For instance, I’ve never really bothered coding up an FFT implementation. (Actually, this isn’t entirely true: I helped work on an i860 based assembly language version twenty years ago when I was at Princeton, but I must admit that I didn’t do much of the heavy lifting.) When forced to use one, I typically just fire up FFTW and use it. It’s a fast, relatively simple and flexible FFT library, and well worth using. But there are some circumstances where it seems like overkill. So, I sat down and implemented a basic radix-2 in-place FFT in C using the algorithm from Crandall and Pomerance’s Prime Numbers. Running it on my MacBook, for an 8192 length complex forward transform, I get a time of about 3.95 ms per transform. The same size transform running with FFTW is almost a factor of ten faster: about .413ms per transform. Of course, my code is a big stomping 43 comfortable, loafing lines long.

But this got me thinking about how it might be possible to speed it up. I have a couple of books on my shelf at work that will probably useful (I looted them from people who were more knowledgeable than me at DSP who, for some reason, decided to give them away, but I haven’t read them). Even doubling the speed of this “naive” implementation would make it significantly more useful. And, it seems like it could be a reasonably fun thing to tinker with.

It’s too late today to make any significant progress, but I did dig up a useful link.

Efficient FFT Algorithm and Programming Tricks

Addendum: Updating the so-called “twiddle factors” in the inner loop was being done in a stupid way in my code before. It really only requires a complex multiply in the inner loop, which speeds it up considerably. Now, my 8192 element FFT takes a mere 1.16ms on my MacBook, a speedup of nearly 4x, and now it’s within a factor of 3 of the speed of FFTW. I don’t like the way it really looks yet, but it’s better than most. I’ll tidy it up soon and archive it here for future consumption.

Addendum: Now that it’s basically working, I’m left wondering about numerical accuracy. Accuracy of the discrete Fourier transform and the fast Fourier transform would appear to be relevent. It appears that my choice of recurrence relationship for the twiddle factors probably results in rather significant errors (although I also suspect that it’s more than entirely adequate for the normal audio tasks which I use these things for typically).