This page is moving to https://bxbsp.com/BxBFFT.html
The BxBFFT: An outstanding high-speed streaming FFT
The BxBFFT is an amazing high-speed streaming Fast Fourier Transform. It is specifically targeted at jobs that require the highest FFT speeds. It is cross-platform, to support Xilinx Ultrascale/Ultrascale+ FPGAs, Xilinx Versal FPGAs, Altera FPGAs, and also implementation in ASICs. It has a large feature set, including many features not found in other FFTs. It has easy-to-use controls of FFT numerical performance, controls for resource utilization tradeoffs, controls to obtain the highest timing margins, and options supporting many special requirements.
The BxBFFT is designed for digital signal processing applications where the sample rate is many times the FPGA clock rate, such as radar systems, lidar systems, spectral switches, high-bandwidth beamformers, radio telescopes, test and measurement systems, analog system simulators, high-speed radios, cellular backhaul, and communication satellites. For these types of applications, it is second to none.
A PDF presentation is linked below. Alternatively, see the sections below on BxBFFT features and the web pages on Xilinx Ultrascale FPGAs, Xilinx Versal FPGAs, and Altera Agilex7 FPGAs. Contact us for other FPGA families or for ASICs.
PDF Presentation on the BxBFFT
BxBFFT Features
Occasionally a particular FFT feature is critical to an application. The BxBFFT supports the widest variety of features, out of the box. Below is a comparison. Note that some vendors, such as Altera and Xilinx, have multiple FFT offerings. Only their highest-speed FFT compares with the BxBFFT. So the FFT features and performance shown here and elsewhere are for the vendor's highest-speed FFT, which often has fewer supported features than slower FFTs from that vendor.
Non-Power-of-2 FFTs
BxBFFTs support FFT sizes that are products of powers of 2, 3, 5, and 7, not just powers of 2. Non-power-of-2 BxBFFTs use extensive optimizations not available for power-of-2 cases. Although power-of-2 BxBFFTs are usually most efficient, non-power-of-2 BxBFFTs are not far behind. Occasionally, non-power-of-2 BxBFFT performance is even superior to the performance of the closest power-of-2 BxBFFT.
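As a minimal sketch of which sizes this covers, the Python below checks whether a size has no prime factors other than 2, 3, 5, and 7. The function names are hypothetical and only illustrate the arithmetic; the actual set of sizes offered is decided by Bit by Bit Signal Processing, not by this check.

```python
def is_supported_size(n: int) -> bool:
    """Return True if n is a product of powers of 2, 3, 5, and 7 only."""
    if n < 1:
        return False
    for p in (2, 3, 5, 7):
        # Divide out every factor of p; anything left over is another prime.
        while n % p == 0:
            n //= p
    return n == 1


def candidate_sizes(limit: int) -> list:
    """List all sizes from 2 up to and including limit that pass the check."""
    return [n for n in range(2, limit + 1) if is_supported_size(n)]


print(is_supported_size(1536))  # 1536 = 2**9 * 3
print(is_supported_size(22))    # 22 = 2 * 11, and 11 is not an allowed factor
```

Calling `candidate_sizes(10000)` lists the candidate sizes in the range shown in the graphs on this page, assuming every such size is actually offered.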
One of the most important advantages of non-power-of-2 BxBFFTs is that they allow non-power-of-2 Points Per Clock, which is the number of complex data points that the FFT processes in parallel each clock. (This is abbreviated PPC, and is sometimes called Super Sample Rate or SSR.) Having more options for parallelism gives more options to make a design close. For example, to get a desired FFT throughput, PPC=4 might require too high an FPGA clock rate, but PPC=8 might require too much power or too many resources. In these cases, PPC=5 with a non-power-of-2 BxBFFT may lead to design closure. This factor becomes more significant as ADC and DAC rates increase. For example, a design that can close with PPC=36 will use significantly less logic than the next power-of-2 step up of PPC=64.
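The tradeoff can be put in numbers. Since PPC complex points are processed each clock, the clock rate needed for a given throughput is the sample rate divided by PPC. The sketch below assumes an illustrative 2500 MS/s complex sample rate, which is not a figure from this page, and the function name is hypothetical.

```python
def required_clock_mhz(sample_rate_msps: float, ppc: int) -> float:
    """Clock rate in MHz needed to sustain sample_rate_msps at ppc points per clock."""
    return sample_rate_msps / ppc


# Assumed example rate: 2500 mega complex samples per second.
for ppc in (4, 5, 8):
    print(f"PPC={ppc}: {required_clock_mhz(2500.0, ppc):.1f} MHz")
```

With these assumed numbers, PPC=5 sits between the 625 MHz needed at PPC=4 and the extra parallel hardware needed at PPC=8.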
In addition, non-power-of-2 BxBFFTs have system advantages. They can more easily match a design to frequencies of existing equipment or match it to frequency standards. They can allow single-clock operation of some designs, where power-of-2 FFTs would require multiple synchronous clock sources. These factors can make designs close that otherwise would not, or they can reduce FPGA logic and external part count.
The graphs below show available BxBFFTs and Xilinx SSR FFTs below size 10,000 and from PPC=2 to PPC=10. The first thing to note is the richness of BxBFFT offerings compared to a power-of-2-limited FFT such as the Xilinx SSR FFT. The BxBFFT has thousands of options to make designs close, where power-of-2-limited FFTs have only 21 options.
These graphs also show that although power-of-2 BxBFFTs generally have the lowest power consumption, non-power-of-2 BxBFFTs are also quite good. For example, non-power-of-2 BxBFFTs are often better in power consumption than the closest Xilinx SSR FFTs.
Real FFTs
The BxBFFT supports FFTs with real inputs and complex outputs. These obtain spectrums with the highest accuracy, because ADC data feeds the FFT directly: there is no intermediate real-to-complex conversion, which would create artifacts and impose filter rolloff. As usual, the BxBFFT also ships with the Real FFT inverse.
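For background, a real-input transform is worthwhile because the spectrum of real data is conjugate-symmetric, so only bins 0 through N/2 carry independent information. The sketch below demonstrates that general DFT property with a slow direct DFT in plain Python. It says nothing about BxBFFT internals or its output format, and the sample values are arbitrary.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, for illustration only."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


samples = [0.5, 1.0, -0.25, 0.75, 0.0, -1.0, 0.25, 0.5]  # arbitrary real data
spectrum = dft(samples)
n = len(samples)

# For real input, bin N-k is the complex conjugate of bin k.
symmetric = all(
    abs(spectrum[n - k] - spectrum[k].conjugate()) < 1e-9 for k in range(1, n)
)
unique_bins = n // 2 + 1  # bins 0..N/2 are enough to rebuild the rest
print(symmetric, unique_bins)
```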
Background Reset (for Space Applications)
The BxBFFT supports a feature where it can be fully reset while operating, without interrupting processing. This feature supports high-reliability operation in space environments, which have natural radiation. Radiation causes Single-Event Upsets (SEUs), which can cause transient errors (such as resetting counters) or persistent errors (such as altering the logic programmed into the FPGA). Frequent periodic background resets of the BxBFFT fix the transient errors caused by SEUs without affecting normal operation. It is not necessary to detect that an SEU occurred.
Competing FFTs often cannot fix SEU errors in the background. As a consequence, competing FFTs often can't fix SEU errors periodically at all. This is because the continued interruptions would adversely affect required system availability. However, the system must fix SEU errors, because leaving an SEU in place corrupts processing and also affects availability. One solution is to detect SEUs, so that the FFT is reset only when it needs to be reset. This leads to complicated detection schemes that aren't fully reliable. Another solution is to use algorithms that allow FFT idle time in which SEUs can be repaired. However, idle time is not natural for many applications. The BxBFFT avoids these issues and these complications with its background reset.
In the case where an SEU makes a persistent alteration to FPGA logic, the standard approach is to have a "scrubbing" operation that reads back the FPGA configuration, checks for changes to the logic, and repairs them. This makes the persistent SEU transient. The BxBFFT's background reset works well with this, automatically restoring operation as soon as the logic is repaired.
For the highest reliability, Triple Modular Redundancy (TMR) triplicates logic into three legs and then votes on the answer. This means that even when one set of logic is affected by an SEU, proper operation is not affected because the other two legs outvote the incorrect answer. The full SEU-protection scheme has TMR, then scrubbing, then a background reset of the BxBFFT to automatically finish the SEU repair. Each of these operations is independent and decoupled, for easy implementation. The background reset doesn't just restore BxBFFT operation; it also restores proper BxBFFT sync to match the other two operating BxBFFTs, so that system operation is fully and automatically restored.
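The voting step can be sketched as a bitwise 2-of-3 majority function. This is the generic TMR voter, shown on plain integers for illustration; it is not taken from the BxBFFT delivery, and the names and bit patterns are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit matches at least two inputs."""
    return (a & b) | (a & c) | (b & c)


good = 0b10110010
upset = good ^ 0b00001000  # one leg with a single bit flipped by an SEU

# The two healthy legs outvote the upset leg, whichever position it is in.
print(majority_vote(good, upset, good) == good)
print(majority_vote(upset, good, good) == good)
```

The voter only masks the error. The upset leg still has to be repaired and resynchronized, which is where scrubbing and the background reset come in.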
The BxBFFT's Ease of Use and Productivity Enhancements
The BxBFFT was designed to get you running quickly. It has features to make configuration, synthesis, and simulation faster and easier, saving NRE.
Configuration
The BxBFFT has easy controls for managing amplitude gain, but also allows precise stage-by-stage shift control. To handle the most demanding applications, amplitude can be managed with dynamic run-time monitoring and shifting controls.
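To see why gain management matters, note that the magnitude of any output bin of an unscaled N-point DFT can be up to N times the largest input magnitude, which is roughly log2(N) bits of growth. The sketch below works through that arithmetic. The function names and the per-stage shift list are hypothetical and do not reflect the BxBFFT's actual parameters or stage structure.

```python
def worst_case_growth_bits(fft_size: int) -> int:
    """Bits of magnitude growth with no scaling: ceil(log2(fft_size))."""
    return (fft_size - 1).bit_length()


def net_growth_bits(fft_size: int, stage_shifts: list) -> int:
    """Worst-case growth left after right-shifting by the given bits per stage."""
    return worst_case_growth_bits(fft_size) - sum(stage_shifts)


# Assumed example: a 4096-point FFT with six stages, each shifting right by 2 bits.
print(worst_case_growth_bits(4096))
print(net_growth_bits(4096, [2, 2, 2, 2, 2, 2]))
```

Shifting at every stage removes the overflow risk but discards low-order bits, so lighter shifting keeps more precision for signals that are known not to reach the worst case.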
The BxBFFT also has controls to select whether memory gets implemented as URAM, BRAM, or distributed RAM. These controls can be asserted globally, or they can be specifically targeted to individual BxBFFT stages. This helps fit a design in the FPGA, and also helps prevent overly tight resources of one memory type that might lead to longer routes and make it more difficult to meet timing.
Other memory-related controls can eliminate ROM twiddle tables at specific stages in favor of on-the-fly sine/cosine generation. This can save significant amounts of memory for large FFT sizes.
Pipelining can also be configured. The default pipelining works well when the BxBFFT is implemented in isolation, but in situations where there is external resource contention more pipelining may improve timing. For such a case, BxBFFT pipelining can be increased globally or at specific stages.
Another thing that can be configured is input and output order. Typically "Fully Natural" order is preferred at input and output, but occasionally "Scrambled" order on BxBFFT output is of benefit, since it can save a significant amount of memory. A "Partially Natural" order is also available, which has limited use but saved one customer significant processing in a zero-pad operation. The zero point of FFT data can also be selected at input and at output. It can be either the first data point (a typical FFT standard) or in the data center.
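The two zero-point conventions differ only by a rotation of the data. The sketch below converts between them, assuming that "data center" means index N // 2, which is the same convention as the common `fftshift` operation. That index is an assumption, as the page does not state it, and the function names are hypothetical. The "Scrambled" and "Partially Natural" orders are not covered here because their definitions are not given.

```python
def zero_first_to_center(data: list) -> list:
    """Rotate so the element at index 0 moves to index len(data) // 2."""
    half = len(data) // 2
    if half == 0:
        return list(data)
    return data[-half:] + data[:-half]


def zero_center_to_first(data: list) -> list:
    """Inverse rotation: the element at index len(data) // 2 moves to index 0."""
    half = len(data) // 2
    return data[half:] + data[:half]


bins = list(range(8))  # bin 0 is the zero point
print(zero_first_to_center(bins))
print(zero_center_to_first(zero_first_to_center(bins)) == bins)
```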
Whether the BxBFFT is a forward FFT or inverse FFT is another selection.
BxBFFT data width can also be selected between 18 bits and 27 bits, to trade off between resources, FFT numerical accuracy, and ease-of-use. It is generally of benefit to start a design at 27 bits, which brings up a design easily with good numerical performance and no risk of overflow. The design can then be optimized to lower numbers of bits to reduce resources and power, while observing the effect on numerical accuracy.
Synthesis
BxBFFTs have more timing margin than competitors. This additional timing margin is what allows BxBFFTs to achieve high Fmax and thus high throughput. Timing margin also means that place and route steps don't need to work as hard to meet desired timing constraints. As a result, FPGA implementation time is shorter.
Simulation
Simulation of the BxBFFT is faster than competitors, which can save significant engineering time in product design and testing. Even more important is the time it might save in long verification runs. The fast simulation speed is due to the simple and direct nature of the BxBFFT's System Verilog code.
The BxBFFT is tested with several simulators, including Xilinx XSim, Altera Quartus, Icarus Verilog, and Verilator. Verilator support is especially important, since it can provide immense speed increases for System Verilog simulations.
Below is a graph showing simulation time of various FFTs relative to the BxBFFT. In this case, simulation was with Icarus Verilog for System Verilog FFTs, and Xilinx XSim for VHDL FFTs. In most cases the FFTs simulate significantly slower than the BxBFFT, and in some cases immensely slower.
Comprehensive BxBFFT Delivery Package
The BxBFFT ships as a very comprehensive package, intended to foresee all customer needs.
A customer ordering a BxBFFT chooses an FFT size, chooses the parallelism in Points Per Clock (PPC), and chooses whether the BxBFFT is fully complex or real-to-complex. Sometimes the customer adds constraints, such as minimizing LUT usage or minimizing memory usage. Bit by Bit Signal Processing finds the combination of radix stages and optimizations that gives the lowest power and resources for those parameters, and generates and delivers the BxBFFT. One of the reasons for the BxBFFT's high performance is that these parameters are set at delivery time, so optimizations can be performed specific to a BxBFFT's size and PPC. Other FFTs that use the same code or the same design for all FFT sizes miss out on these size-specific optimizations.
Most other settings are user-alterable, as parameters at the BxBFFT's top level. These include forward/inverse, input/output data order, whether input/output zero position is at left or in the center, the data bit width, settings to manage and control signal gain, pipelining control, memory implementation control, selection of AXIS-standard I/O interface or simpler BxB I/O interface, and a stage-by-stage selection of using normal ROM twiddles or on-the-fly generated twiddles.
The code for a BxBFFT is a single System Verilog file with several associated data files for twiddle ROM tables. The small number of files keeps the delivery neat and file management easy. Internal names are mangled to prevent name conflicts with other BxBFFTs, with different BxBFFT versions, or with other customer IP. Since the code is standard System Verilog, it is readily usable in customer development flows and is friendly to third-party tools.
The delivery also includes C++ and MATLAB BxBFFT models, which are faster to simulate.
The delivery includes many tests of the System Verilog, C++, and MATLAB simulation models. These tests verify that all models work and that they give identical results. The tests also serve as examples of how to connect to the model, configure it, and get data in and out.
Tests are also included to show that Vivado synthesizes the core correctly. Simulations of the Vivado-produced post-route netlist verify that Vivado has correctly synthesized the BxBFFT's code. The synthesis runs also give other information, such as the quantity of FPGA resources used by the BxBFFT and the achieved Fmax.
For Xilinx FPGAs, a Xilinx IP Integrator model is also included. For those using Xilinx block designs, this is the fastest way to instantiate and configure a BxBFFT.
Finally, there is extensive documentation regarding how to set up and configure the BxBFFT.
ASICs
The BxBFFT was optimized first to be an excellent FFT, and then Xilinx optimizations were added on top of this. Thus many of the BxBFFT advantages carry over not just to other FPGA product lines but also to ASICs. Porting the BxBFFT to ASICs requires re-optimization of low-level functional elements such as memories, real multiplies, and complex multiplies to the libraries that come with the ASIC process. The BxBFFT's capability to be targeted to ASICs in this way was shown when the same techniques were used to re-optimize the Xilinx implementation to support Altera FPGAs.
The BxBFFT design is fully pipelined, and this pipelining allows the highest timing margins, and therefore the highest ASIC clock rates, to be achieved. Alternatively, high timing margins allow ASIC voltage to be reduced while still meeting timing, for the lowest ASIC power consumption.
Bit by Bit Signal Processing is interested in ports to other FPGA lines or to ASICs, if a sufficient business case exists. If this would significantly benefit your business, contact us.
Pricing
BxBFFT pricing is intended to make FFTs available for all professional uses at reasonable cost. If you think prices are unreasonable for your project, send an email with a justification for a different pricing scheme, and we'll discuss it.
Academic / Educational
The BxBFFT is available for small academic projects for US$1000 per BxBFFT. License terms will require that the BxBFFT be cited in papers to which it contributed, and that Bit by Bit Signal Processing receive copies of any performance measurements related to the BxBFFT. Distribution rights are not included with academic pricing. Bit by Bit Signal Processing will have rights to use information from academic projects to make BxBFFT advantages known for marketing purposes. Support for academic projects is at a lower priority than commercial jobs.
Commercial
Commercial companies can get access to the entire range of BxBFFTs, with binary distribution rights and support, for US$15000 per year. Rights are purchased 3 years ahead, so the first-year cost is US$45000, and then it is US$15000 each year thereafter. Support ends after payments cease, and distribution rights end 3 years after payments cease. (These prices may be increased periodically to match inflation.) A wide range of power-of-2 BxBFFTs is immediately available after purchase. Non-power-of-2 BxBFFTs are generated at customer request with modest lead time, since there are too many to have them all pre-generated.
Alternately, BxBFFTs for a specific FFT size and speed can be purchased individually for commercial development or distribution. With no distribution rights, BxBFFTs in a selected FPGA family are US$2000 each, with 1 year support. The price drops to US$1500 each for 3 to 8 FFT sizes, and US$1200 each for 9 or more BxBFFT sizes.
With never-ending distribution rights for unlimited products in the selected FPGA family, BxBFFTs are US$15000 each for 1 or 2 sizes, US$10000 each for 3 to 8 sizes, and US$8000 each for 9 or more sizes.
Other arrangements are possible to match your business needs. If you would like to propose an alternate arrangement, please do so.
Military
Purchases that could see applications with non-U.S. militaries will need to be reviewed for compliance with U.S. export law. Otherwise, this is the same as commercial applications.