FFTW with threads #48

Open
ka9q opened this issue Dec 31, 2024 · 4 comments

ka9q commented Dec 31, 2024

I see that by default dumphfdl builds with FFTW's internal multithreading option enabled, with 4 threads specified. Have you benchmarked this?

I also use FFTW heavily to perform fast convolution in ka9q-radio, and I found that internal multithreading didn't buy me much, at least with the huge FFTs I use (e.g., 1,620,000 points). Although it reduced the wall-clock time for a single FFT, overall CPU utilization went up. Since I already perform a lot of independent FFTs in parallel in separate application threads, I found it's better (for me) to have FFTW use only a single thread. Because I'm running lots of parallel copies of dumphfdl (fed by ka9q-radio channel threads), I went into my copy of fft_fftw.c and changed the number of threads to 1, leaving multithreading enabled.

I did this because of a gotcha: wisdom files written with threads = 1 are NOT compatible with those written with multithreading completely turned off. This means you can't share a system-wide wisdom file (e.g., /etc/fftw/wisdomf) unless everybody agrees to use the same FFTW thread settings. FFTW also supports per-application wisdom files, but it doesn't look like you're using one. If you like, I could have dumphfdl create one in /var/lib/hfdl/wisdom and send a pull request. I've already placed systable.conf in that directory, as this is the standard place in Linux to hold application-specific data files.
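
To make that concrete, here's a rough sketch of the arrangement I'm describing, assuming single-precision FFTW; the /var/lib/hfdl/wisdom path is just my proposal above, not anything dumphfdl currently does:

```c
/* Sketch only: keep the threaded FFTW library initialized, plan with 1 thread,
 * and use a per-application wisdom file.
 * Build: gcc sketch.c -lfftw3f_threads -lfftw3f -lpthread -lm */
#include <fftw3.h>
#include <stdio.h>

#define WISDOM_PATH "/var/lib/hfdl/wisdom"  /* proposed location, not current dumphfdl behavior */
#define FFT_SIZE 8192

int main(void) {
    fftwf_init_threads();                   /* threading stays linked in...          */
    fftwf_plan_with_nthreads(1);            /* ...but each plan uses a single thread */

    if (!fftwf_import_wisdom_from_filename(WISDOM_PATH))
        fprintf(stderr, "no wisdom at %s yet, planning from scratch\n", WISDOM_PATH);

    fftwf_complex *in  = fftwf_alloc_complex(FFT_SIZE);
    fftwf_complex *out = fftwf_alloc_complex(FFT_SIZE);
    fftwf_plan p = fftwf_plan_dft_1d(FFT_SIZE, in, out, FFTW_FORWARD, FFTW_MEASURE);

    /* Wisdom written this way stays compatible with other programs that also
     * call fftwf_init_threads(), whatever thread count they pick. */
    if (!fftwf_export_wisdom_to_filename(WISDOM_PATH))
        fprintf(stderr, "could not write wisdom to %s\n", WISDOM_PATH);

    fftwf_destroy_plan(p);
    fftwf_free(in);
    fftwf_free(out);
    fftwf_cleanup_threads();
    return 0;
}
```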

dg9bja commented Jan 1, 2025

Thank you for that information. I am running 8 instances (72 frequencies) of dumphfdl at a sample rate of 192000, fed by a Red Pitaya (HPSDR). Reducing the thread count to 1 lowered the CPU usage on my virtual machine.

szpajder (Owner) commented Jan 2, 2025 via email

ka9q commented Jan 9, 2025

So leave threading enabled but make the number of threads a runtime option. Don't turn it off completely, as that will generate wisdom files incompatible with programs that do use threads. Setting the number of threads to 1 has essentially the same effect while still allowing a system wisdom file to be shared with programs that want to use more.
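
Roughly what I have in mind (the option name and plumbing here are hypothetical, not dumphfdl's actual code):

```c
/* Sketch of a runtime FFT thread-count option; the --fft-threads name and
 * this wiring are hypothetical illustrations only. */
#include <fftw3.h>

void fft_init(int fft_threads) {           /* e.g., from a --fft-threads option, default 1 */
    if (fft_threads < 1)
        fft_threads = 1;                   /* never disable threading outright */
    fftwf_init_threads();                  /* keep the threaded library initialized */
    fftwf_plan_with_nthreads(fft_threads); /* 1 thread still writes thread-compatible wisdom */
}
```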

6 Ms/s seemed high, but then I realized you're doing your own multichannel downconversion internally. I am currently running a separate copy of dumphfdl for every channel, 106 in total, with ka9q-radio doing the downconversion. This works well, and it certainly creates a lot of parallelism, but I need to compare total CPU use against fewer but wider channels, perhaps one per band. I use a 12 ks/s IQ input for each SSB signal (8 ks/s didn't work), which means a bunch of nearby HFDL channels needs a higher total sample rate than one wider channel covering them all.

BTW, if you need an analytic signal you can create one with a half-plane filter using fast convolution. This is how I do it: start with a real-input FFT to create a complex spectrum with Hermitian symmetry (the negative-frequency half is the complex-conjugate mirror of the positive half). Then remove the negative frequencies (with windowing to prevent time-domain ripples) and convert back to the time domain with a complex-output inverse FFT. This would permit feeding dumphfdl from a conventional SSB receiver.
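
A rough sketch of that half-plane trick in FFTW terms (single precision, even block size assumed; this is not dumphfdl or ka9q-radio code, and the windowing/overlap handling is left out):

```c
/* Sketch: build an analytic signal from a real block by zeroing the
 * negative-frequency half of the spectrum. Assumes n is even. */
#include <fftw3.h>
#include <string.h>

void make_analytic(const float *real_in, fftwf_complex *analytic_out, int n) {
    float *in = fftwf_alloc_real(n);
    fftwf_complex *half = fftwf_alloc_complex(n / 2 + 1); /* r2c output (positive half) */
    fftwf_complex *spec = fftwf_alloc_complex(n);         /* full N-bin spectrum */

    fftwf_plan fwd = fftwf_plan_dft_r2c_1d(n, in, half, FFTW_ESTIMATE);
    fftwf_plan inv = fftwf_plan_dft_1d(n, spec, analytic_out, FFTW_BACKWARD, FFTW_ESTIMATE);

    memcpy(in, real_in, n * sizeof(float));
    fftwf_execute(fwd);                   /* real input -> Hermitian spectrum */

    /* Keep DC and Nyquist, double the positive frequencies, zero the negative
     * half. In practice, taper the band edges to limit time-domain ripple. */
    memset(spec, 0, n * sizeof(fftwf_complex));
    spec[0][0] = half[0][0];      spec[0][1] = half[0][1];
    for (int k = 1; k < n / 2; k++) {
        spec[k][0] = 2.0f * half[k][0];
        spec[k][1] = 2.0f * half[k][1];
    }
    spec[n/2][0] = half[n/2][0];  spec[n/2][1] = half[n/2][1];

    fftwf_execute(inv);                   /* back to the time domain: analytic signal */
    /* FFTW's backward transform is unnormalized; scale by 1/n if amplitude matters. */

    fftwf_destroy_plan(fwd);  fftwf_destroy_plan(inv);
    fftwf_free(in);  fftwf_free(half);  fftwf_free(spec);
}
```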

ka9q-radio uses fast convolution with a shared forward FFT to implement a multichannel digital downconverter. Even at a 64.8 Ms/s (all of HF) or 129.6 Ms/s (HF through 6 m) A/D input sample rate I still use single-threaded FFTs, though I give the option to run several threads each performing independent FFTs, which is faster than multithreading individual FFTs. Usually even this isn't necessary; I have a NUC with an i5-8260U @ 1.60 GHz doing 1.62-megapoint real-input FFTs 50 times/sec while using only ~40% of a single core. FFTW is amazing.
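
For what it's worth, the "several threads, each running its own single-threaded FFT" arrangement looks roughly like this (a simplified sketch, not ka9q-radio's actual code; fftwf_execute() is thread-safe, but planning isn't, so the plans are created up front):

```c
/* Sketch: independent single-threaded FFTs in worker threads. */
#include <fftw3.h>
#include <pthread.h>
#include <string.h>

#define NWORKERS 4
#define N 16384

struct worker {
    pthread_t tid;
    fftwf_complex *in, *out;
    fftwf_plan plan;
};

static void *run(void *arg) {
    struct worker *w = arg;
    fftwf_execute(w->plan);   /* thread-safe: each worker has its own plan and buffers */
    return NULL;
}

int main(void) {
    struct worker w[NWORKERS];

    /* Plan creation is serialized here because the FFTW planner is not thread-safe. */
    for (int i = 0; i < NWORKERS; i++) {
        w[i].in  = fftwf_alloc_complex(N);
        w[i].out = fftwf_alloc_complex(N);
        memset(w[i].in, 0, N * sizeof(fftwf_complex));   /* dummy input for the sketch */
        w[i].plan = fftwf_plan_dft_1d(N, w[i].in, w[i].out, FFTW_FORWARD, FFTW_ESTIMATE);
    }

    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&w[i].tid, NULL, run, &w[i]);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(w[i].tid, NULL);

    for (int i = 0; i < NWORKERS; i++) {
        fftwf_destroy_plan(w[i].plan);
        fftwf_free(w[i].in);
        fftwf_free(w[i].out);
    }
    return 0;
}
```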

ka9q commented Jan 10, 2025

I just ran the experiment with per-band dumphfdl instances fed from ka9q-radio. It's faster than one instance per channel, but not dramatically so. I'm using 12 bands with a total sample rate of 1.388 Ms/s, versus 106 individual channels at 12 ks/s per channel = 1.272 Ms/s. I guess it works both ways.
