Bit by Bit Signal Processing

The BxBFFT: An outstanding high-speed streaming FFT

The BxBFFT is an amazing high-speed streaming Fast Fourier Transform. Some of its advantages are these:

Power savings: Half the power of other FFTs.
Resource savings: Half the FPGA LUTs.
Throughput: Highly parallel processing at the highest clock rates. Also the lowest latency.
Features: Non-power-of-2 FFTs, real-to-complex FFTs, myriad included options.
Productivity: Faster synthesis and simulation. Meets timing easier. Easier FFT controls.

The BxBFFT is designed for digital signal processing applications where the sample rate is many times the FPGA clock rate, such as cellular backhaul, radar systems, spectral switches, high-bandwidth beamformers, radio telescopes, test and measurement systems, analog system simulators, high-speed radios, and communication satellites. For these types of applications, it is second to none.

The sections below describe in more detail the BxBFFT's advantages, as well as what's included with purchase of a BxBFFT and BxBFFT pricing.

Power Savings

It is not uncommon for an FPGA design to approach either power limits or resource limits. Even when this is not true of a baseline design, it often becomes true because of the introduction of new product features. Power consumption of an FFT can thus make or break a design, or allow or disallow product upgrades. Power consumption also affects product life and reliability, as high consumption puts extra stress on the power supply, and high temperatures and large temperature swings increase the rate of component degradation.

The BxBFFT is highly optimized for power consumption. Multiple customers have found that a switch to the BxBFFT saved significant amounts of power in their designs, making those designs viable where before they were not.

Below are results from Xilinx Vivado synthesis for power consumption of the BxBFFT vs several other FFTs. They show that BxBFFT power is typically lower than that of other FFTs by a factor of 1.5X to 2X.

The thermal image below was taken by a customer, J. Smith of UCSB, showing how BxBFFT power is lower in a real-world design. Her design would not process the desired 2048 tones with the Xilinx SSR FFT; only 1424 could be processed before the chip would reset itself from excess current draw. Replacing the Xilinx SSR FFT with a BxBFFT lowered the die temperature from 110 degrees Celsius to 75.7 degrees Celsius when running the 1424 tones, a drop of 34.3 degrees. The design then worked properly with the full load of 2048 tones.

Resource Savings

FPGA resources are another common design limitation. Designs that use fewer resources have more margin for initial implementation and for future upgrades. For the same functionality, they can use fewer or smaller FPGAs and be cheaper to manufacture.

The BxBFFT uses substantially fewer FPGA LUTs than competing FFTs, as shown in the graph below. Required DSPs and memory are not significantly different among the best FFTs, although for very large FFTs the BxBFFT can save memory by autogeneration of twiddle coefficients rather than storing them in ROM. The BxBFFT also offers memory savings for the case where a scrambled output data order is acceptable.

Throughput and Latency Advantages

Sometimes designs need to meet strict real-time requirements, either in throughput or in latency. Both of these improve when an FFT runs faster. An FFT can be made faster by raising the achieved FPGA clock rate (Fmax) or by increasing parallelism. Parallelism is measured by the processed complex data Points Per Clock (PPC), also called SuperSample Rate (SSR). Throughput is Fmax * PPC.
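As a quick illustration of the Fmax * PPC relation, here is a minimal Python sketch. The function names and the example numbers are hypothetical, chosen only for illustration; they are not BxBFFT measurements.

```python
def throughput_msps(fmax_mhz, ppc):
    """Complex samples per second (in Msps) of a streaming FFT: Fmax * PPC."""
    return fmax_mhz * ppc


def required_clock_mhz(sample_rate_msps, ppc):
    """FPGA clock rate (MHz) needed to sustain a sample rate at a given PPC."""
    return sample_rate_msps / ppc


# Hypothetical example: a 500 MHz clock processing 8 points per clock.
print(throughput_msps(500.0, 8))       # 4000.0
# The same 4000 Msps stream at PPC=16 needs only a 250 MHz clock.
print(required_clock_mhz(4000.0, 16))  # 250.0
```

The second call shows the trade-off discussed in this section: doubling PPC halves the clock rate needed for the same throughput, at the cost of more parallel logic.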

One issue is that as PPC increases, more resources are used, there is more resource contention, and thus the achieved Fmax of an FFT goes down. This may make the desired throughput unachievable.

For BxBFFTs, Fmax degrades less from resource contention, so a high Fmax and a high PPC are achievable at the same time. The graph below shows this: the BxBFFT reaches high PPC and high Fmax simultaneously, where the other FFTs do not. Thus the BxBFFT provides the best throughput and latency.

BxBFFT Features

Occasionally a particular FFT feature is critical to an application. The BxBFFT supports the widest variety of features, out of the box. Here's a comparison:

Non-Power-of-2 FFTs

BxBFFTs support FFT sizes that are products of powers of 2, 3, 5, and 7, not just powers of 2. Non-power-of-2 BxBFFTs use extensive optimizations not available for power-of-2 cases. Although power-of-2 BxBFFTs are usually most efficient, non-power-of-2 BxBFFTs are not far behind. Occasionally, non-power-of-2 BxBFFT performance is even superior to the performance of the closest power-of-2 BxBFFT.
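To make the size rule concrete, the Python sketch below tests whether a size factors completely into 2, 3, 5, and 7. It assumes that reading of the rule; whether every such size is actually offered is not stated here, and the function names are hypothetical.

```python
def is_supported_size(n, primes=(2, 3, 5, 7)):
    """True if n factors completely into the given primes (n = 2^a * 3^b * 5^c * 7^d)."""
    if n < 1:
        return False
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1


def sizes_up_to(limit):
    """All candidate sizes from 2 to limit that pass the factor test."""
    return [n for n in range(2, limit + 1) if is_supported_size(n)]


print(sizes_up_to(20))
# [2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 15, 16, 18, 20]
print(is_supported_size(1536), is_supported_size(1100))
# True False   (1536 = 2^9 * 3; 1100 contains the factor 11)
```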

One of the most important advantages of non-power-of-2 BxBFFTs is that they allow non-power-of-2 Points Per Clock (PPC, aka SSR). This gives more options to make a design close. For example, to get a desired FFT throughput, PPC=4 might have too high an FPGA clock rate, but PPC=8 might require too much power or too many resources. In these cases, PPC=5 with a non-power-of-2 BxBFFT may lead to design closure. This factor becomes more significant as ADC and DAC rates increase. For example, a design that can close with PPC=36 will use significantly less logic than the next power-of-2 step up of PPC=64.
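The PPC trade-off can be sketched as a small search for the lowest PPC that keeps the clock under a limit. This is an illustration only. It assumes that allowed PPC values follow the same 2, 3, 5, 7 factor rule as FFT sizes, which is not stated explicitly, and it ignores any required relationship between PPC and FFT size. The names and the sample rates are hypothetical.

```python
import math


def is_7_smooth(n):
    """True if n has no prime factors other than 2, 3, 5, and 7."""
    for p in (2, 3, 5, 7):
        while n % p == 0:
            n //= p
    return n == 1


def smallest_ppc(sample_rate_msps, fmax_limit_mhz, power_of_2_only=False):
    """Smallest allowed PPC such that sample_rate / PPC <= fmax_limit."""
    ppc = max(1, math.ceil(sample_rate_msps / fmax_limit_mhz))
    while True:
        if power_of_2_only:
            allowed = (ppc & (ppc - 1)) == 0
        else:
            allowed = is_7_smooth(ppc)
        if allowed:
            return ppc
        ppc += 1


# Hypothetical 2400 Msps stream with a 500 MHz clock limit.
print(smallest_ppc(2400, 500))                         # 5
print(smallest_ppc(2400, 500, power_of_2_only=True))   # 8
# Hypothetical 18000 Msps stream with the same limit.
print(smallest_ppc(18000, 500))                        # 36
print(smallest_ppc(18000, 500, power_of_2_only=True))  # 64
```

In both cases the power-of-2 restriction forces a jump to considerably more parallel hardware than the throughput target strictly requires.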

In addition, non-power-of-2 BxBFFTs have system advantages. They can more easily match a design to frequencies of existing equipment or match it to frequency standards. They can allow single-clock operation of some designs, where power-of-2 FFTs would require multiple synchronous clock sources. These factors can make designs close that otherwise would not, or they can reduce FPGA logic and external part count.

The graph below shows available BxBFFTs below size 10,000 and from PPC=2 to PPC=10. It shows that although power-of-2 BxBFFTs generally have the lowest power consumption, non-power-of-2 BxBFFTs are often quite good. The graph following it shows power consumption of the Xilinx SSR FFT in the same range. Comparison of these two graphs shows the richness of BxBFFT offerings compared to the Xilinx SSR FFT. It also shows that carefully chosen non-power-of-2 BxBFFTs beat power-of-2 Xilinx SSR FFTs for power consumption.

Real FFTs

The BxBFFT supports FFTs with real inputs and complex outputs. These give the most accurate spectra, because no real-to-complex conversion is needed between the ADC data and the FFT; such conversions create artifacts and impose filter rolloff. As usual, the BxBFFT also ships with the Real FFT inverse.
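For background on why a real-input FFT can be cheaper than a full complex one, the Python sketch below uses a plain textbook DFT (not the BxBFFT algorithm) to show that the spectrum of real data is conjugate-symmetric, so only N/2 + 1 bins carry independent information. How the BxBFFT formats its real-FFT output is not described here; the function names are hypothetical.

```python
import cmath


def dft(x):
    """Direct O(N^2) discrete Fourier transform, for illustration only."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def real_dft(x):
    """The N/2 + 1 non-redundant bins of a real-input DFT (N even)."""
    return dft(x)[: len(x) // 2 + 1]


def max_symmetry_error(x):
    """Largest deviation from X[k] == conj(X[N-k]) over all bins."""
    spectrum = dft(x)
    n = len(x)
    return max(abs(spectrum[k] - spectrum[(n - k) % n].conjugate())
               for k in range(n))


samples = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 1.5]
print(len(real_dft(samples)))              # 5 bins for N = 8
print(max_symmetry_error(samples) < 1e-9)  # True
```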

Background Reset (for Space Applications)

The BxBFFT supports a feature where it can be fully reset while operating, without interrupting processing. This feature supports high-reliability operation in space environments, which have natural radiation. Radiation causes Single-Event Upsets (SEUs), which can cause transient errors (such as resetting counters) or persistent errors (such as altering the logic programmed into the FPGA). Frequent periodic background resets of the BxBFFT fix the transient errors caused by SEUs without affecting normal operation. It is not necessary to detect that an SEU occurred.

Competing FFTs often cannot fix SEU errors in the background. As a consequence, competing FFTs often can't fix SEU errors periodically at all. This is because the continued interruptions would adversely affect required system availability. However, the system must fix SEU errors, because leaving an SEU in place corrupts processing and also affects availability. One solution is to detect SEUs, so that the FFT is reset only when it needs to be reset. This leads to complicated detection schemes that aren't fully reliable. Another solution is to use algorithms that allow FFT idle time in which SEUs can be repaired. However, idle time is not natural for many applications. The BxBFFT avoids these issues and these complications with its background reset.

In the case where an SEU makes a persistent alteration to FPGA logic, the standard approach is to have a "scrubbing" operation that reads back the FPGA configuration, checks for changes to the logic, and repairs them. This makes the persistent SEU transient. The BxBFFT's background reset works well with this, automatically restoring operation as soon as the logic is repaired.

For the highest reliability, Triple Modular Redundancy (TMR) triplicates logic into three legs and then votes on the answer. This means that even when one set of logic is affected by an SEU, proper operation is not affected because the other two legs outvote the incorrect answer. The full SEU-protection scheme has TMR, then scrubbing, then a background reset of the BxBFFT to automatically finish the SEU repair. Each of these operations is independent and decoupled, for easy implementation. The background reset doesn't just restore BxBFFT operation; it also restores proper BxBFFT sync to match the other two operating BxBFFTs, so that system operation is fully and automatically restored.
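The voting step of TMR can be sketched in a few lines of Python. This is the generic bitwise 2-of-3 majority function, modeled in software for illustration; it is not taken from any BxBFFT deliverable, and the names are hypothetical.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority: each output bit takes the value held by at least two legs."""
    return (a & b) | (a & c) | (b & c)


def disagreeing_legs(a, b, c):
    """Indices of legs whose word differs from the voted result."""
    voted = majority_vote(a, b, c)
    return [i for i, leg in enumerate((a, b, c)) if leg != voted]


good = 0b10110010
upset = good ^ 0b00001000  # one bit flipped in leg 1, as an SEU might do

print(majority_vote(good, upset, good) == good)  # True
print(disagreeing_legs(good, upset, good))       # [1]
```

The voter masks the upset but does not repair the affected leg, which is why the scheme above still needs scrubbing and a reset to bring that leg back into sync.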

The BxBFFT's Ease of Use and Productivity Enhancements

The BxBFFT was designed to get you running quickly. It has features to make configuration, synthesis, and simulation faster and easier, saving NRE.

The BxBFFT also includes a Xilinx "IP Integrator" model for quick interoperation of the BxBFFT with Xilinx IP.

Configuration

The BxBFFT ships with an IP Integrator model, for getting it working fast with a Xilinx block design. It has easy controls for managing amplitude gain, but also allows precise stage-by-stage shift control. To handle the most demanding applications, amplitude can be managed with dynamic run-time monitoring and shifting controls.

The BxBFFT also has controls to select whether memory gets implemented as URAM, BRAM, or distributed RAM. These controls can be asserted globally, or they can be specifically targeted to individual BxBFFT stages. This helps fit a design in the FPGA, and also helps prevent overly tight resources of one memory type that might lead to longer routes and make it more difficult to meet timing.

Other memory-related controls can eliminate ROM twiddle tables at specific stages in favor of on-the-fly sine/cosine generation. This can save significant amounts of memory for large FFT sizes.
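To illustrate the memory trade-off, the Python sketch below compares a precomputed twiddle table with a generator that produces the same sequence from a single stored constant. The recurrence used here is just one simple textbook way to generate sines and cosines on the fly; the method the BxBFFT actually uses is not described here, and the function names are hypothetical.

```python
import cmath


def twiddle_rom(n):
    """Precomputed table: n complex twiddle factors held in memory."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]


def twiddle_stream(n):
    """Same sequence from one stored constant and one complex multiply per step."""
    step = cmath.exp(-2j * cmath.pi / n)
    w = 1 + 0j
    for _ in range(n):
        yield w
        w *= step


n = 4096
worst = max(abs(a - b) for a, b in zip(twiddle_rom(n), twiddle_stream(n)))
print(worst < 1e-9)  # True in double precision
```

The table costs n stored entries, while the generator costs arithmetic instead. In fixed-point hardware the rounding error of such a recurrence accumulates, so a practical generator needs more care than this floating-point sketch suggests.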

Pipelining can also be configured. The default pipelining works well when the BxBFFT is implemented in isolation, but in situations where there is external resource contention more pipelining may improve timing. For such a case, BxBFFT pipelining can be increased globally or at specific stages.

Another thing that can be configured is input and output order. Typically "Fully Natural" order is preferred at input and output, but occasionally "Scrambled" order on BxBFFT output is of benefit, since it can save a significant amount of memory. The zero point of FFT data can also be selected at input and at output. It can be either the first data point (a typical FFT standard) or in the data center.
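The two ordering ideas can be illustrated in Python. The first function moves the zero point from the first position to the center of the data. The second lists bit-reversed order, the classic scrambled order of a textbook radix-2 FFT. Both are generic sketches with hypothetical names; the exact "Scrambled" order produced by a given BxBFFT is not specified here and may differ, particularly for non-power-of-2 sizes.

```python
def center_zero(bins):
    """Rotate a list so the zero point moves from index 0 to index N // 2."""
    half = len(bins) // 2
    return bins[-half:] + bins[:-half]


def bit_reversed_indices(n):
    """Bit-reversed index order for a power-of-2 length n."""
    bits = n.bit_length() - 1
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]


print(center_zero([0, 1, 2, 3, 4, 5, 6, 7]))
# [4, 5, 6, 7, 0, 1, 2, 3]
print(bit_reversed_indices(8))
# [0, 4, 2, 6, 1, 5, 3, 7]
```

Restoring natural order from a scrambled stream requires buffering the data, which is where the memory saving of a scrambled output comes from.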

Whether the BxBFFT is a forward FFT or inverse FFT is another selection.

BxBFFT data width can also be selected between 18 bits and 27 bits, to trade off between resources, FFT numerical accuracy, and ease-of-use. It is generally of benefit to start a design at 27 bits, which brings up a design easily with good numerical performance and no risk of overflow. The design can then be optimized to lower numbers of bits to reduce resources and power, while observing the effect on numerical accuracy.
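The effect of bit width on accuracy can be estimated with a simple Python experiment. The sketch below quantizes a test sine wave to a given signed fixed-point width and measures the resulting signal-to-noise ratio. It models input quantization only, not rounding inside the FFT stages, and the function names, test signal, and scaling are assumptions made for illustration.

```python
import math


def quantize(x, bits):
    """Round x in [-1, 1) to signed fixed point with the given total bit width."""
    scale = 1 << (bits - 1)
    q = round(x * scale)
    q = max(-scale, min(scale - 1, q))  # saturate instead of wrapping
    return q / scale


def quantization_snr_db(bits, n=4096):
    """SNR in dB of a 0.9-amplitude sine after quantization to the given width."""
    signal = [0.9 * math.sin(2 * math.pi * 37 * t / n) for t in range(n)]
    noise_power = sum((quantize(s, bits) - s) ** 2 for s in signal)
    signal_power = sum(s * s for s in signal)
    return 10 * math.log10(signal_power / noise_power)


for width in (18, 27):
    print(width, round(quantization_snr_db(width), 1))
```

Each added bit buys roughly 6 dB, so the 9 bits between the two widths are worth about 54 dB of quantization headroom in this simplified model.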

Synthesis

BxBFFTs are faster in Vivado implementation than competitors, which can save significant engineering time during product development. In part this is because BxBFFT code is written in a SystemVerilog style that is direct and easy to parse, which reduces time in Vivado synthesis.

Part of the savings is also in place and route, because BxBFFTs have more timing margin than competitors. This additional timing margin is what allows BxBFFTs to achieve high Fmax and thus high throughput. Timing margin also means that the Vivado place and route steps don't need to work as hard to meet desired timing constraints. As a result, Vivado implementation time is shorter.

The graph below shows that other FFTs take a factor of 1.5 to 2 more implementation time than the BxBFFT.

Simulation

Simulation of the BxBFFT is also faster than simulation of competitors, which can save significant engineering time in product design and testing. Even more important is the time it might save in long verification runs. The fast simulation speed is due to the simple and direct nature of the BxBFFT's SystemVerilog code.

The BxBFFT is tested with several simulators, including Xilinx XSim, Icarus Verilog, and Verilator. Verilator support is especially important, since it can provide immense speed increases for SystemVerilog simulations.

Below is a graph showing simulation time of various FFTs relative to the BxBFFT. In this case, simulation was with Icarus Verilog for SystemVerilog FFTs, and Xilinx XSim for VHDL FFTs. In most cases the FFTs simulate significantly slower than the BxBFFT, and in some cases immensely slower.

Comprehensive BxBFFT Delivery Package

The BxBFFT ships as a very comprehensive package, intended to foresee all customer needs.

A customer ordering a BxBFFT chooses an FFT size, chooses the parallelism in Points Per Clock (PPC), and chooses whether the BxBFFT is fully complex or real-to-complex. Sometimes the customer adds additional constraints, such as minimizing LUT usage or minimizing memory usage. Bit by Bit Signal Processing finds the combination of radix stages and optimizations that gives the lowest power and resources for those parameters, and generates and delivers the BxBFFT. One of the reasons for the BxBFFT's high performance is that these parameters are set at delivery time, so optimizations can be performed specific to a BxBFFT's size and PPC. Other FFTs that use the same code or the same design for all FFT sizes miss out on these size-specific optimizations; the BxBFFT does not.

Most other settings are user-alterable, as parameters at the BxBFFT's top level. These include forward/inverse, input/output data order, whether input/output zero position is at left or in the center, the data bit width, settings to manage and control signal gain, pipelining control, memory implementation control, selection of AXIS-standard I/O interface or simpler BxB I/O interface, and a stage-by-stage selection of using normal ROM twiddles or on-the-fly generated twiddles.

The code for a BxBFFT is a single SystemVerilog file with several associated data files for twiddle ROM tables. The small number of files keeps the delivery neat and file management easy. Internal names are mangled to prevent name conflicts with other BxBFFTs, with different BxBFFT versions, or with other customer IP. Since the code is standard SystemVerilog, it is readily usable in customer development flows and is friendly to third-party tools.

The delivery also includes C++ and MATLAB BxBFFT models, which are faster to simulate.

The delivery includes many tests of the SystemVerilog, C++, and MATLAB simulation models. These tests verify that all models work and that they give identical results. The tests also serve as examples of how to connect to the model, configure it, and get data in and out.

Tests are also included to show that Vivado synthesizes the core correctly. Simulations of the Vivado-produced post-route netlist verify that Vivado has correctly synthesized the BxBFFT's code. The synthesis runs also give other information, such as the quantity of FPGA resources used by the BxBFFT and the achieved Fmax.

A Xilinx IP Integrator model is also included. For those using Xilinx block designs, this is the fastest way to instantiate and configure a BxBFFT.

Finally, there is extensive documentation regarding how to set up and configure the BxBFFT.

ASICs and non-Xilinx FPGAs

The BxBFFT was optimized first to be an excellent FFT, and then Xilinx optimizations were added on top of this. Thus many of the BxBFFT advantages will carry over to ASICs or to other FPGA product lines, especially with additional optimization work for those product lines.

Exploratory work for an Intel FPGA port was performed several years ago, and looked promising. However, a salable port does not currently exist.

Bit by Bit Signal Processing is interested in ports to other FPGA lines or to ASICs, if a sufficient business case exists. If this would significantly benefit your business, contact us.

Pricing

BxBFFT pricing is intended to make FFTs available for all professional uses at reasonable cost. If you think prices are unreasonable for your project, send an email with a justification for a different pricing scheme, and we'll discuss it.

Academic / Educational

The BxBFFT is available for small academic projects for US$1000 per BxBFFT. License terms will require that the BxBFFT be cited in papers to which the BxBFFT contributed, and that Bit by Bit Signal Processing receive copies of any performance measurements made that are related to the BxBFFT. Distribution rights are not included with academic pricing. Bit by Bit Signal Processing will have rights to use information from academic projects to make BxBFFT advantages known for marketing purposes. Support for academic projects is likely, but not guaranteed.

Commercial

Commercial companies can get access to the entire range of BxBFFTs, with binary distribution rights and support, for US$15000 per year. Rights are purchased 3 years ahead, so the first-year cost is US$45000, and then it is US$15000 each year thereafter. Support ends after payments cease, and distribution rights end 3 years after payments cease. (These prices may be increased periodically to match inflation.) A wide range of power-of-2 BxBFFTs is immediately available after purchase. Non-power-of-2 BxBFFTs are generated at customer request with modest lead time, since there are too many to have them all pre-generated.

Other arrangements are possible to match your business needs. If you would like to propose an alternate arrangement, please do so.

Military

Purchases that could see applications with non-U.S. militaries will need to be reviewed for compliance with U.S. export law. Otherwise, this is the same as commercial applications.