Introduction
Infinite Impulse Response (IIR) filters are the workhorses of real-time DSP systems. They offer superior frequency selectivity with fewer coefficients compared to FIR filters, making them ideal for resource-constrained embedded applications. However, every DSP engineer who has implemented IIR filters in fixed-point systems has encountered the dreaded instability: filters that work perfectly in floating-point simulation suddenly oscillate, saturate, or produce garbage output when deployed to hardware.
This isn’t just academic—it’s a production-stopping problem that has derailed countless projects. The issue stems from the fundamental tension between IIR filters’ recursive nature and the limited precision of fixed-point arithmetic. In this article, we’ll dissect exactly why this happens and provide practical solutions you can implement today.
Problem Analysis: The Perfect Storm of Recursion and Quantization
The Root Cause: Pole Migration
Consider a simple second-order IIR filter (biquad) in direct form II:
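// Floating-point implementation (stable)
// coeffs = {b0, b1, a1, a2, b2}; state = {w[n-1], w[n-2]}
float biquad_float(float x, float* state, const float* coeffs) {
    // Feedback path (poles): w[n] = x[n] - a1*w[n-1] - a2*w[n-2]
    float w = x - coeffs[2]*state[0] - coeffs[3]*state[1];
    // Feedforward path (zeros): y[n] = b0*w[n] + b1*w[n-1] + b2*w[n-2]
    float y = coeffs[0]*w + coeffs[1]*state[0] + coeffs[4]*state[1];
    state[1] = state[0];
    state[0] = w;
    return y;
}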
When we convert this to fixed-point, we quantize the coefficients and state variables. This quantization causes the filter’s poles to migrate from their designed positions. Even tiny movements can push poles outside the unit circle, making the filter unstable.
The Overflow Domino Effect
Fixed-point arithmetic has limited dynamic range. In recursive structures, intermediate results can overflow, causing wrap-around or saturation. This nonlinearity introduces distortion that can trigger instability:
// Naive fixed-point implementation (dangerous!)
int16_t biquad_naive(int16_t x, int16_t* state, const int16_t* coeffs) {
    // Q15 coefficients cannot represent |a1| >= 1, and w can exceed the 16-bit range
    int32_t w = (int32_t)x
        - ((int32_t)coeffs[2] * state[0] >> 15)
        - ((int32_t)coeffs[3] * state[1] >> 15);
    int32_t y = ((int32_t)coeffs[0] * w >> 15)
        + ((int32_t)coeffs[1] * state[0] >> 15)
        + ((int32_t)coeffs[4] * state[1] >> 15);
    // Narrowing to 16 bits with no saturation: out-of-range values wrap around
    state[1] = state[0];
    state[0] = (int16_t)w; // Direct truncation!
    return (int16_t)y;
}
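The difference between wrap-around and saturation is easiest to see on a single addition. The following sketch uses two hypothetical helpers, add_wrap16 and add_sat16, which are not part of the filter code above and exist only to contrast the two overflow behaviors.

```c
#include <stdint.h>

/* Two's-complement wrap-around: what a plain 16-bit register does. */
int16_t add_wrap16(int16_t a, int16_t b) {
    uint16_t sum = (uint16_t)((uint16_t)a + (uint16_t)b);
    /* Reinterpret the 16-bit pattern as a signed value without relying
     * on implementation-defined narrowing conversions. */
    return (int16_t)(sum >= 0x8000u ? (int32_t)sum - 0x10000 : (int32_t)sum);
}

/* Saturating add: clamp to the int16_t range instead of wrapping. */
int16_t add_sat16(int16_t a, int16_t b) {
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}
```

With these helpers, 30000 + 10000 wraps to -25536, a sign flip that is fed straight back into the recursion, while the saturating version clamps to 32767. Saturation is still a nonlinearity, but it keeps the error bounded and of the correct sign.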
Coefficient Sensitivity
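IIR filters with poles near the unit circle are extremely sensitive to coefficient quantization. Rounding a coefficient to the nearest representable value shifts the poles slightly, and for a filter designed with a pole radius such as 0.9999 that shift can be enough to land the pole on or outside the unit circle, crossing the stability boundary. The effect grows with filter order when a high-order filter is implemented as a single direct-form section.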
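One inexpensive safeguard is to check the quantized coefficients themselves rather than the floating-point design. For a second-order denominator 1 + a1*z^-1 + a2*z^-2, both poles lie strictly inside the unit circle exactly when |a2| < 1 and |a1| < 1 + a2 (the so-called stability triangle). The sketch below applies that test directly to Q14 integers; the function name biquad_q14_is_stable and the Q14 scaling are assumptions made for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define Q14_ONE 16384  /* 1.0 in Q14 */

/* Returns true if the poles of 1 + a1*z^-1 + a2*z^-2 are strictly
 * inside the unit circle, using the quantized Q14 coefficients.
 * Conditions: |a2| < 1 and |a1| < 1 + a2. */
bool biquad_q14_is_stable(int32_t a1_q14, int32_t a2_q14) {
    int32_t abs_a1 = a1_q14 < 0 ? -a1_q14 : a1_q14;
    int32_t abs_a2 = a2_q14 < 0 ? -a2_q14 : a2_q14;
    return abs_a2 < Q14_ONE && abs_a1 < Q14_ONE + a2_q14;
}
```

Running this check at coefficient-load time catches a design whose rounding has pushed it onto the boundary before any samples are processed. It says nothing about overflow or limit cycles, which depend on the arithmetic inside the filter loop.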
Solution: Practical Stabilization Techniques
1. Coefficient Scaling and Q-Format Optimization
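The first line of defense is proper coefficient scaling. Q15 covers only [-1, 1), yet the feedback coefficient a1 of a biquad can approach ±2. Using Q14, which covers [-2, 2), represents every coefficient without clipping, and wider accumulators and state variables supply guard bits against overflow: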
// Optimized fixed-point biquad with guard bits (requires <stdint.h>)
// Safe saturation helpers, defined before their first use
static inline int32_t saturate32(int64_t x, int bits) {
    const int64_t max_pos = (1LL << (bits - 1)) - 1;
    const int64_t max_neg = -(1LL << (bits - 1));
    if (x > max_pos) return (int32_t)max_pos;
    if (x < max_neg) return (int32_t)max_neg;
    return (int32_t)x;
}
static inline int16_t saturate16(int32_t x) {
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}
typedef struct {
    int32_t state[2];  // Extended-precision states (input scaled by 2^14)
    int32_t coeffs[5]; // Q14 coefficients: {b0, b1, a1, a2, b2}
    uint8_t shift;     // Output scaling shift (14 returns to the input scale)
} BiquadFixed;
int16_t biquad_optimized(int16_t x, BiquadFixed* filter) {
    // 64-bit accumulator: products of 32-bit states and Q14 coefficients need more than 32 bits
    int64_t accumulator;
    // Compute feedback with extended precision
    accumulator = (int64_t)filter->state[0] * filter->coeffs[2];
    accumulator += (int64_t)filter->state[1] * filter->coeffs[3];
    accumulator = (accumulator + (1 << 13)) >> 14; // Round back to the state scale
    // Multiply instead of shifting: left-shifting a negative value is undefined in C.
    // Saturate w before it is used so the output and the stored state agree.
    int32_t w = saturate32((int64_t)x * 16384 - accumulator, 30); // Keep 30 bits for headroom
    // Compute output
    accumulator = (int64_t)w * filter->coeffs[0];
    accumulator += (int64_t)filter->state[0] * filter->coeffs[1];
    accumulator += (int64_t)filter->state[1] * filter->coeffs[4];
    accumulator = (accumulator + (1 << 13)) >> 14;
    // Update states (w is already saturated)
    filter->state[1] = filter->state[0];
    filter->state[0] = w;
    // Scale output with rounding and saturation; skip rounding when shift == 0
    int64_t y = accumulator;
    if (filter->shift > 0) {
        y = (y + (1LL << (filter->shift - 1))) >> filter->shift;
    }
    return saturate16(saturate32(y, 32));
}
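The Q14 coefficients stored in BiquadFixed have to be produced from the floating-point design values at some point. A minimal conversion sketch follows; the names float_to_q14 and q14_to_float are hypothetical, and rounding half away from zero is just one reasonable choice of rounding mode.

```c
#include <stdint.h>

#define Q14_FRAC_BITS 14

/* Convert a float in roughly [-2, 2) to Q14 with rounding and
 * saturation (hypothetical helper for illustration). */
int16_t float_to_q14(float c) {
    float scaled = c * (float)(1 << Q14_FRAC_BITS);
    /* Round half away from zero before truncating. */
    scaled += (scaled >= 0.0f) ? 0.5f : -0.5f;
    /* Clamp instead of letting the conversion overflow. */
    if (scaled >= 32767.0f) return INT16_MAX;
    if (scaled <= -32768.0f) return INT16_MIN;
    return (int16_t)scaled;  /* truncation toward zero completes the rounding */
}

/* Convert back, e.g. to measure the quantization error of a coefficient. */
float q14_to_float(int16_t q) {
    return (float)q / (float)(1 << Q14_FRAC_BITS);
}
```

Comparing q14_to_float(float_to_q14(c)) with c gives the per-coefficient quantization error, which is worth logging during design. Note that +2.0 is not representable in Q14 and clamps to 32767.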
2. Use Cascade of First-Order Sections
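Biquads with poles close to the unit circle are sensitive to quantization. Where the response only needs real poles (smoothing and simple low-pass stages, for example), a cascade of first-order sections is more robust: each section below is stable for any 0 < alpha < 1, so coefficient rounding cannot push it across the stability boundary. Complex-conjugate (resonant) pole pairs still require a second-order section.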
typedef struct {
    int32_t state;  // Q15 state
    int16_t alpha;  // Q15 coefficient (0 < alpha < 1)
    uint8_t shift;  // Scaling parameter (0 = no scaling)
} FirstOrderSection;
// saturate32() is defined above; saturate16() clamps to the int16_t range
int16_t first_order_cascade(int16_t x, FirstOrderSection* sections, int num_sections) {
    // Convert to Q15 (multiply: left-shifting a negative value is undefined in C)
    int32_t y = (int32_t)x * 32768;
    for (int i = 0; i < num_sections; i++) {
        // y[n] = alpha * x[n] + (1-alpha) * y[n-1]
        // Implemented as: y[n] = y[n-1] + alpha * (x[n] - y[n-1])
        int64_t error = (int64_t)y - sections[i].state;
        int64_t update = (error * sections[i].alpha) >> 15;
        // Update state with saturation protection
        // (31 bits covers the full Q15-scaled input range; 30 would clip at half scale)
        sections[i].state = saturate32(sections[i].state + update, 31);
        y = sections[i].state;
        // Scale between sections if needed; skip rounding when shift == 0
        if (sections[i].shift > 0) {
            y = (int32_t)(((int64_t)y + (1LL << (sections[i].shift - 1))) >> sections[i].shift);
        }
    }
    return saturate16(y >> 15);
}
3. Implement Stability Monitoring and Recovery
Add runtime stability checks and automatic recovery mechanisms:
typedef struct {
    BiquadFixed filter;
    uint32_t instability_counter;
    int32_t backup_state[2]; // Same width as filter.state, so no truncation on backup
    bool needs_reset;        // Set when a recovery occurred; cleared by the caller (requires <stdbool.h>)
} RobustBiquad;
int16_t robust_biquad(int16_t x, RobustBiquad* rb) {
    // Check for instability (rapid state growth)
    // 64-bit math avoids overflow in the sum and does not depend on the width of int
    int64_t s0 = rb->filter.state[0];
    int64_t s1 = rb->filter.state[1];
    int64_t state_magnitude = (s0 < 0 ? -s0 : s0) + (s1 < 0 ? -s1 : s1);
    const int64_t stability_threshold = 0x20000000; // 2^29, 25% of the signed 32-bit range
    if (state_magnitude > stability_threshold) {
        rb->instability_counter++;
        if (rb->instability_counter > 10) {
            // Persistent instability - restore the last known-good states
            rb->filter.state[0] = rb->backup_state[0];
            rb->filter.state[1] = rb->backup_state[1];
            rb->instability_counter = 0;
            rb->needs_reset = true;
        }
    } else {
        // Stable operation - backup states
        rb->backup_state[0] = rb->filter.state[0];
        rb->backup_state[1] = rb->filter.state[1];
        rb->instability_counter = 0;
    }
    return biquad_optimized(x, &rb->filter);
}
4. Pre-warping Coefficient Design
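Design coefficients with quantization in mind. The adjustment shown here (called pre-warping in this article, not to be confused with the frequency pre-warping of the bilinear transform) pulls the poles slightly toward the origin during the design phase, so that rounding error cannot carry them across the unit circle: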
// Design coefficients with stability margin
// fc is normalized to the sample rate (cycles per sample, 0 < fc < 0.5)
// Requires <math.h> and <stdint.h>; M_PI is POSIX, not ISO C
void design_stable_biquad(float fc, float Q, int16_t* coeffs_q14, float stability_margin) {
    // Move poles inward by stability_margin (e.g., 0.99)
    float r = expf(-(float)M_PI * fc / Q) * stability_margin;
    float theta = 2.0f * (float)M_PI * fc;
    // Direct form II feedback coefficients
    float a1 = -2.0f * r * cosf(theta);
    float a2 = r * r;
    // Do not rescale a1 and a2: scaling the denominator moves the poles.
    // With r < 1, |a1| < 2 and a2 < 1, so both fit in Q14 (range [-2, 2)).
    coeffs_q14[2] = (int16_t)lrintf(a1 * 16384.0f); // Q14, rounded to nearest
    coeffs_q14[3] = (int16_t)lrintf(a2 * 16384.0f);
    // Gain scaling with headroom belongs in the forward coefficients (indices 0, 1, 4)
    // ... (forward path design depends on filter type)
}
Engineering Takeaways
Always simulate with fixed-point precision during the design phase. Floating-point simulation lies about stability.
Maintain guard bits throughout the computation chain. A good rule: use at least 4 extra bits beyond your signal’s dynamic range.
Implement saturation at every critical point, not just the output. Overflow in intermediate calculations is often the instability trigger.
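Consider cascade structures for demanding filters. Cascaded low-order sections trade some efficiency for greatly improved robustness; first-order sections suit real poles, while resonant high-Q responses still need carefully scaled biquads.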
Add monitoring and recovery in production code. Even well-designed filters can become unstable with pathological inputs.
Test with worst-case signals, including DC, Nyquist, and random noise. Many filters pass sine tests but fail on real-world signals.
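The worst-case signals named in the last takeaway are simple to generate for a fixed-point test bench. The sketch below fills an int16_t buffer with DC, a Nyquist-rate alternation, or pseudo-random noise. The function names are hypothetical, and the noise source is a basic linear congruential generator chosen for reproducibility, not for statistical quality.

```c
#include <stddef.h>
#include <stdint.h>

/* Constant (DC) input: exposes accumulating offsets and slow overflow. */
void fill_dc(int16_t *buf, size_t n, int16_t level) {
    for (size_t i = 0; i < n; i++) {
        buf[i] = level;
    }
}

/* Alternating +amp, -amp: a full-scale tone at the Nyquist frequency. */
void fill_nyquist(int16_t *buf, size_t n, int16_t amp) {
    if (amp == INT16_MIN) {
        amp = -INT16_MAX;  /* -INT16_MIN is not representable */
    }
    for (size_t i = 0; i < n; i++) {
        buf[i] = (i % 2 == 0) ? amp : (int16_t)-amp;
    }
}

/* Reproducible full-scale noise from a 32-bit LCG (same seed, same data). */
void fill_noise(int16_t *buf, size_t n, uint32_t seed) {
    uint32_t s = seed;
    for (size_t i = 0; i < n; i++) {
        s = s * 1664525u + 1013904223u;
        /* Upper 16 bits mapped from [0, 65535] to [-32768, 32767]. */
        buf[i] = (int16_t)((int32_t)(s >> 16) - 32768);
    }
}
```

Feeding each buffer through both the floating-point and the fixed-point filter and comparing the outputs sample by sample shows where the two implementations diverge. Following any of these signals with a long run of zeros is also a quick way to spot limit cycles, since a stable filter's output should decay to zero.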
Conclusion
IIR filter instability in fixed-point systems isn’t a design flaw—it’s a fundamental challenge of implementing infinite-precision mathematics in finite-precision hardware. The solutions presented here aren’t academic exercises; they’re battle-tested techniques from embedded audio processing, telecommunications, and control systems.
The key insight is that stability isn’t just about coefficient quantization—it’s about managing the entire signal flow through the recursive structure. By combining careful coefficient design, extended precision arithmetic, and runtime monitoring, you can deploy IIR filters with confidence in even the most resource-constrained systems.
Remember: in fixed-point DSP, stability is something you build, not something you assume. For more on managing precision in embedded systems, see our article on Fixed-Point Arithmetic Patterns for Embedded DSP.
Engineering Summary: IIR filter instability in fixed-point systems arises from pole migration due to coefficient quantization and overflow in recursive calculations. Mitigate through: 1) Extended precision with guard bits, 2) Cascade of first-order sections for critical filters, 3) Runtime stability monitoring with automatic reset, 4) Coefficient pre-warping during design phase, and 5) Systematic saturation at all computation stages. Always validate with fixed-point simulation before hardware deployment.