Static Timing Analysis

What is STA: STA (Static Timing Analysis) is a method to validate the timing performance, and hence the functionality, of a design. It is based upon calculating the minimum and maximum delay bounds of logic elements through timing models. Using these calculated delays and a set of timing equations, it is then determined whether the design will pass or fail.

An interesting thing to note about STA is that it gives no importance to the actual functionality or the state-machine model of the design. The only concern is how fast and how accurately the maximum and minimum delay bounds can be calculated.

Why is STA important: An SoC is expected to operate over a range of temperatures and voltages, and there are variations in process parameters while manufacturing chips. To guarantee performance and functionality across all these combinations, it is important to analyze timing and check for any possible timing failures. STA is a very fast method to do this compared to dynamic timing simulations (spice simulations). In other words, STA is one of the most important steps of the chip design flow for checking design performance against timing constraints.

How STA works: As stated earlier, STA works by calculating timing bounds and validating them against a set of timing equations. One of the most important aspects of timing is the delay of individual elements and the overall delay between sequential elements. Let us consider a flip-flop sending a signal to another flip-flop through combinational logic, as shown in figure 1 below.


Figure 1: A sample signal propagation between two sequential elements

For the above to work properly, the signal launched from FLOP1 on a clock edge should reach FLOP2 only after the hold time of FLOP2 has passed following that same clock edge (the definition of a hold check). Thus, the sum of the minimum delay values of all the elements from FLOP1 to FLOP2 must be greater than the hold time of FLOP2, giving the below equation for the minimum delay limit.

FLOP1_delay(CK_to_Q_min) + NET1_delay(min) + CELL1_delay(A_to_Z_min) + NET2_delay(min) + CELL2_delay(A_to_Z_min) + NET3_delay(min) - FLOP2_hold > 0

Similarly, the signal launched from FLOP1 on a clock edge should reach FLOP2 at least a setup time before the next clock edge (the definition of a setup check). Thus, the sum of the maximum delay values of all the elements from FLOP1 to FLOP2 must be less than the clock period (of the clock received by both flops) minus the setup time of FLOP2, giving the below equation for the maximum delay limit.

FLOP1_delay(CK_to_Q_max) + NET1_delay(max) + CELL1_delay(A_to_Z_max) + NET2_delay(max) + CELL2_delay(A_to_Z_max) + NET3_delay(max) < CLK_period - FLOP2_setup

Of course, we can differentiate the max delay as rise_max/fall_max and the min delay as rise_min/fall_min, but for simplicity we have chosen not to. We have also assumed an ideal scenario wherein the clock arrives at the same time at both flip-flops and there are no crosstalk effects.
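
To make these two checks concrete, below is a small Python sketch that plugs per-element path delays into the same hold and setup equations and reports the resulting slack. All delay numbers, the clock period and the setup/hold values are made up purely for illustration.

```python
# Minimal sketch of the hold and setup checks above, with made-up values (ns).
# Stage names mirror figure 1: FLOP1 -> NET1 -> CELL1 -> NET2 -> CELL2 -> NET3 -> FLOP2.

CLK_PERIOD  = 2.0    # ns, hypothetical clock period
FLOP2_SETUP = 0.10   # ns, setup requirement of the capturing flop
FLOP2_HOLD  = 0.05   # ns, hold requirement of the capturing flop

# (min_delay, max_delay) per element, in ns -- purely illustrative numbers
path = {
    "FLOP1_CK_to_Q": (0.12, 0.25),
    "NET1":          (0.02, 0.05),
    "CELL1_A_to_Z":  (0.08, 0.20),
    "NET2":          (0.02, 0.06),
    "CELL2_A_to_Z":  (0.07, 0.18),
    "NET3":          (0.01, 0.04),
}

min_arrival = sum(d_min for d_min, _ in path.values())
max_arrival = sum(d_max for _, d_max in path.values())

# Hold check: the earliest arrival must come after the hold window of the same edge.
hold_slack = min_arrival - FLOP2_HOLD

# Setup check: the latest arrival must fit before the next edge minus the setup time.
setup_slack = (CLK_PERIOD - FLOP2_SETUP) - max_arrival

print(f"hold slack  = {hold_slack:+.3f} ns ({'PASS' if hold_slack >= 0 else 'FAIL'})")
print(f"setup slack = {setup_slack:+.3f} ns ({'PASS' if setup_slack >= 0 else 'FAIL'})")
```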

Now the question arises: how are all the delays mentioned above calculated? If you observe carefully, there are three kinds of values involved: cell delays, net delays and setup/hold check values. For cell delays and setup/hold check values, there are cell timing models, in Liberty format in most cases. The Liberty format implements a lookup-table based delay model, which is a set of delay values varying with input transition and output load. These values are interpolated based upon the actual load and slew values to calculate the cell delay. For net delays, tools implement a delay calculation engine based upon the parasitic values of the nets. There is a different set of such models for each corner-case scenario, and STA is run separately for each scenario to provide complete coverage of the design across all use-case scenarios.
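
As an illustration of the lookup-table model, the following Python sketch performs the kind of bilinear interpolation described above on a small delay table indexed by input transition and output load. The table axes and delay entries are invented values, not taken from any real library.

```python
from bisect import bisect_right

# Hypothetical NLDM-style lookup table: rows indexed by input transition (ns),
# columns by output load (pF); entries are cell delays (ns). Values are invented.
SLEW_AXIS = [0.05, 0.10, 0.30]
LOAD_AXIS = [0.01, 0.05, 0.20]
DELAY_TABLE = [
    [0.040, 0.075, 0.180],   # slew = 0.05
    [0.055, 0.090, 0.200],   # slew = 0.10
    [0.095, 0.140, 0.260],   # slew = 0.30
]

def interpolate_delay(slew, load):
    """Bilinear interpolation between the four surrounding table points."""
    def bracket(axis, x):
        # Index of the lower bracketing point and the fractional position within it.
        i = min(max(bisect_right(axis, x) - 1, 0), len(axis) - 2)
        t = (x - axis[i]) / (axis[i + 1] - axis[i])
        return i, t

    i, ts = bracket(SLEW_AXIS, slew)
    j, tl = bracket(LOAD_AXIS, load)

    # Interpolate along the load axis on the two bracketing slew rows...
    low  = DELAY_TABLE[i][j]     * (1 - tl) + DELAY_TABLE[i][j + 1]     * tl
    high = DELAY_TABLE[i + 1][j] * (1 - tl) + DELAY_TABLE[i + 1][j + 1] * tl
    # ...then along the slew axis between those two partial results.
    return low * (1 - ts) + high * ts

print(f"delay at slew=0.08 ns, load=0.03 pF: {interpolate_delay(0.08, 0.03):.4f} ns")
```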


How is STA different from dynamic simulation: Dynamic timing analysis needs a set of input vectors to work. It works by propagating actual values and solving the differential equations provided in spice models, which is quite effort intensive. Moreover, even a design with just 50 inputs has 2^50 possible input combinations, so it is not possible to run dynamic simulations for every set of input vectors across all corner-case scenarios. Static timing analysis, on the other hand, works on delay bounds without the need for any input vectors, and hence is very fast. That is why static timing analysis is the more popular method of timing analysis, although dynamic analysis is, of course, more accurate. So, paths that pass with very small margins can additionally be run through spice simulations to be extra cautious about the robustness of the design against failures. In all, the overall approach can be to use both static and dynamic analysis for timing, with static timing analysis providing complete coverage and dynamic simulations acting as a confidence booster for design robustness by checking real, application-specific input vectors.

How clock gating reduces power dissipation

As discussed in clock gating - basics, the enable signal in the data path is moved into the clock path in order to save dynamic power. But the question is: exactly how is this power saved? In this post, we will discuss just that.

A flip-flop implemented as a standard cell mostly has two internal inverters to generate the clk' and clk_delay signals. So, even if the flip-flop input is kept constant, these inverters still toggle every clock cycle, thereby dissipating dynamic power. In addition to this, there is internal power dissipation inside the flip-flop due to the repetitive charging and discharging of transistor gates caused by clock toggling, but this component is not a significant factor compared to the dynamic power of the inverters. Figure 2 below shows the internal structure of a flip-flop, which has two latches in master-slave configuration and two inverters in the clock path.

Figure 2: Flip-flop internal structure

Every clock cycle, these two inverters toggle regardless of whether the flip-flop output toggles. However, implementing clock gating prohibits the toggling of these inverters when the data is not toggling. Let us assume that a latch-based ICG is inserted, so a mux in the data path is replaced by an ICG in the clock path. But there is a difference here. If there are, say, 1000 flip-flops with the same enable signal, a single common ICG is inserted for all of them. Thus, instead of 2000 inverters (inside 1000 flops) toggling while the flip-flop outputs are constant, we have only the 2 inverters inside the ICG consuming dynamic power. This is how dynamic power is saved. However, if only 1 flop had been clock gated in this manner, there would have been no dynamic power saving, since the 2 inverters inside the ICG would toggle instead of the 2 inside the flop; and with an ICG added in place of just a mux, it may result in an overall loss in terms of area and power.

Whether there is any net saving is therefore governed by how many flip-flops have been clock gated using a single ICG.
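
A rough back-of-the-envelope sketch of this break-even point is shown below: it counts the clock inverters that toggle every cycle with and without a shared ICG while the data is constant, and converts that count into dynamic power using the standard C*Vdd^2*f relation for a node toggling every cycle. The capacitance, voltage, frequency and the assumption of two inverters inside the ICG are all illustrative numbers, not real silicon data.

```python
# Rough break-even sketch: how many clock inverters keep toggling every cycle
# while the flop data is constant, with and without a shared ICG.
# All numbers below (inverter cap, voltage, frequency) are invented for illustration.

C_INV = 1.5e-15    # F, assumed switched capacitance per clock inverter
VDD   = 0.9        # V, assumed supply
FREQ  = 500e6      # Hz, assumed clock frequency
ICG_INVERTERS = 2  # clock inverters assumed inside the shared ICG

def clock_node_power(num_inverters):
    # Dynamic power of nodes that charge and discharge once per clock cycle.
    return num_inverters * C_INV * VDD**2 * FREQ

for flops_per_icg in (1, 2, 10, 1000):
    ungated = clock_node_power(2 * flops_per_icg)   # 2 inverters per ungated flop
    gated   = clock_node_power(ICG_INVERTERS)       # only the ICG inverters toggle
    saving  = ungated - gated
    print(f"{flops_per_icg:5d} flops/ICG: ungated {ungated*1e6:8.3f} uW, "
          f"gated {gated*1e6:8.3f} uW, saving {saving*1e6:+8.3f} uW")
```

For the single-flop case the toggling inverter count is the same on both sides, matching the point above that gating one flop alone gives no dynamic power saving.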

Also, as discussed, many muxes in the data path with the same enable are replaced by a single ICG in the clock path. Thus, there are advantages in terms of area and leakage power too, in addition to dynamic power.