Static Timing Analysis

What is STA: STA (Static Timing Analysis) is a method to validate the timing performance, and hence the functionality, of a design. STA is based upon calculating the minimum and maximum delay bounds of logic elements through timing models. Using these calculated delays, and based upon a set of equations, it is then determined whether the design will meet timing or not.

An interesting thing to note about STA is that it gives no importance to the actual functionality or state-machine model of the design. The only concern is how fast and how accurately the maximum and minimum delay bounds can be calculated.

Why is STA important: An SoC is supposed to run over a range of temperatures and voltages. Also, there are variations in process parameters while manufacturing chips. To guarantee performance and functionality across all combinations, it is important to analyze timing and check for any possible timing failures. STA is a very fast method to achieve this as opposed to dynamic timing simulations (SPICE simulations). In other words, STA is one of the most important steps of the chip design flow for checking design performance with respect to timing constraints.

How STA works: As stated earlier, STA works by calculating timing bounds and validating them against a set of timing equations. One of the most important aspects of timing is the delay of individual elements and the overall delay between sequential elements. Let us consider a flip-flop sending a signal to another flip-flop through combinational logic, as shown in figure 1 below.


Figure 1: A sample signal propagation between two sequential elements

For the above to work properly, the signal launched from FLOP1 on a clock edge should reach FLOP2 only after the hold time has passed after the clock edge (the definition of a hold check). Thus, the sum of the minimum delay values of all the elements from FLOP1 to FLOP2 must be greater than the hold time of FLOP2, giving the below equation for the minimum delay limit.

FLOP1_delay(CK_to_Q_min) + NET1_delay(min) + CELL1_delay(A_to_Z_min) + NET2_delay(min) + CELL2_delay(A_to_Z_min) + NET3_delay(min) - FLOP2_hold > 0

Similarly, the signal launched from FLOP1 on a clock edge should reach FLOP2 at least the setup time before the next clock edge (the definition of a setup check). Thus, the sum of the maximum delay values of all the elements from FLOP1 to FLOP2 must be less than the clock period minus the setup time of FLOP2, giving the below equation for the maximum delay limit.

FLOP1_delay(CK_to_Q_max) + NET1_delay(max) + CELL1_delay(A_to_Z_max) + NET2_delay(max) + CELL2_delay(A_to_Z_max) + NET3_delay(max) < CLK_period - FLOP2_setup

Of course, we can differentiate max delay as rise_max/fall_max and min delay as rise_min/fall_min, but for simplicity we chose not to. Also, we considered an ideal scenario wherein the clock arrives at the same time at both flip-flops, and there are no crosstalk effects.
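To make these equations concrete, consider a purely illustrative set of numbers (assumptions for demonstration, not taken from any real library). Let CLK_period = 10 ns, CK_to_Q_max of FLOP1 = 0.5 ns, each net contribute 0.2 ns and each cell 1.0 ns at maximum, and FLOP2_setup = 0.1 ns. The setup equation then reads:

0.5 + 0.2 + 1.0 + 0.2 + 1.0 + 0.2 = 3.1 ns < 10 - 0.1 = 9.9 ns

so the check passes with 6.8 ns of margin. Similarly, with minimum delays of 0.3 ns for CK_to_Q, 0.1 ns per net and 0.5 ns per cell, and FLOP2_hold = 0.08 ns, the hold equation gives 0.3 + 0.1 + 0.5 + 0.1 + 0.5 + 0.1 - 0.08 = 1.52 > 0, which also passes.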

Now the question arises: how are all the delays mentioned calculated? If you observe carefully, there are three kinds of values mentioned above: cell delays, net delays and setup/hold check values. For cell delays and setup/hold check values, there are cell timing models, in Liberty format in most cases. The Liberty format implements a lookup-table based delay model, which is a set of values varying with transition (slew) and load values. These values are interpolated based upon the actual load and slew values to calculate cell delays. For net delays, tools implement a delay calculation engine based upon the parasitic values of the nets. There is a different set of such models for each of the corner-case scenarios, and STA is run separately for each such scenario to provide complete coverage of the design across all use-case scenarios.
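As an illustration of the lookup-table model, suppose (with purely hypothetical numbers) the table characterizes delay at slews s1 and s2 and loads c1 and c2, with entries d11, d12, d21 and d22. For an actual slew s and load c falling between the characterized points, a common approach is bilinear interpolation:

x = (s - s1)/(s2 - s1), y = (c - c1)/(c2 - c1)

delay(s, c) = d11.(1-x).(1-y) + d12.(1-x).y + d21.x.(1-y) + d22.x.y

The same scheme applies to the output transition tables and to setup/hold constraint tables.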


How is STA different from dynamic simulations: Dynamic timing analysis needs a set of input vectors to work. It works by propagating actual values and solving the actual differential equations provided in SPICE models, which is quite effort intensive. Moreover, the set of input vectors for a design with even 50 inputs is so big that it is not possible to run dynamic simulations at all possible corner-case scenarios for every set of input vectors. On the other hand, static timing analysis works on delay bounds without the need of any input vectors, and hence is pretty fast. That is why static timing analysis is the more popular way of timing analysis. Dynamic analysis, of course, is more accurate. So, the paths passing with very small margins can be run through SPICE simulations as well, in order to be extra cautious about the robustness of the design against failures. Thus, the overall approach can be to have both static and dynamic analysis for timing, with static timing analysis providing complete coverage and dynamic simulations being a confidence booster for design robustness by checking real, application-specific input vectors.

How clock gating reduces power dissipation

As discussed in clock gating - basics, the enable signal in the data path is moved into the clock path in order to save dynamic power. But the question is: how exactly is this power saved? In this post, we will discuss the same.




A flip-flop implemented as a standard cell mostly has two internal inverters to generate the clk' and clk_delay signals. So, even if the flip-flop input is kept constant, these inverters still toggle, thereby dissipating dynamic power. In addition to this, there is internal power dissipation inside the flip-flop due to the repetitive charging and discharging of transistors' gates because of clock toggling, but this component is not a significant factor compared to the dynamic power of the inverters. Figure 2 below shows the internal structure of a flip-flop, which has two latches in master-slave configuration and two inverters in the clock path.

Figure 2: Flip-flop internal structure

Every clock cycle, these two inverters toggle regardless of whether the flip-flop output toggles. However, implementing clock gating prohibits the toggling of these inverters when data is not toggling. Let us assume that a latch-based ICG is inserted; thus, a mux in the data path is replaced by an ICG in the clock path. But there is a difference here. If there are, say, 1000 flip-flops with the same enable signal, a common ICG is inserted for all of them. Thus, instead of 2000 inverters (inside 1000 flops) toggling while the flip-flop outputs stay constant, we have only the 2 inverters inside the ICG consuming dynamic power. This is how dynamic power is saved. However, if only 1 flop had been clock gated in this manner, there would not have been any dynamic power saving; instead, we would have an ICG in place of a mux, which may result in an overall loss in terms of area and power.

Whether there is any net saving is governed by how many flip-flops have been clock gated using a single ICG.

Also, as discussed, many muxes in the data path with the same enable are replaced by one ICG in the clock path. Thus, there are advantages in terms of area and leakage power too, in addition to dynamic power. An RTL sketch contrasting the two styles follows.
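For illustration, below is a minimal RTL sketch (signal names are our own, not from the original discussion) contrasting the two styles. In practice, the ICG is a library cell inserted by the synthesis tool rather than hand-written RTL.

// Style 1: enable as a recirculating mux in the data path
always @(posedge clk)
    if (en)
        q <= d;       // synthesizes to a mux: q <= en ? d : q

// Style 2: enable moved into the clock path through an ICG;
// gclk is the output of a latch-based ICG driven by clk and en
always @(posedge gclk)
    q <= d;           // no feedback mux; clock toggles only when needed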

Design query: How can we construct a non-overlapping "101" counter using only combinational circuitry for a 32-bit input? For example, for the input 10101001, the output should be 1, since there is only one non-overlapping "101" sequence.

Solution: The design in question is a combinational design with a 32-bit input (given) and a 4-bit output, as shown in figure 1 below. Why the output is 4-bit is a bit tricky. For this, we have to understand the problem: we have to count the number of non-overlapping "101" sequences in the 32-bit input. Thus, "10101" counts as only a single occurrence, and "101101" counts as two occurrences. So, the maximum number of such patterns occurs when "101101" is repeated, which comes out to be 10 in a 32-bit number. 10 can be represented by a 4-bit number.

Figure 1: Design representation
One of the solutions, of course, is to make a truth table and then find a solution by solving logic equations. But the number of input combinations here (2^32) is huge, making it practically impossible to find a solution this way. So, we need to follow a modular approach here.

We can divide the problem into two parts: detecting the required pattern, and then counting how many patterns were actually detected. We introduce an intermediate 32-bit output, each bit (the Nth bit) detecting whether the pattern was found with the Nth bit of the input as the middle symbol of the pattern. To detect a non-overlapping "101" pattern, we need to look at 2 additional bits on each side of the 3-bit pattern, thereby making a combinational function of 7 bits. There will be special cases for the terminal bits (here bit-0, bit-1, bit-30 and bit-31), where we know that there are fewer bits on one side. So, we need special logic for these bits.

The overall combinational logic will look as shown in the figure below. Int-N (the Nth bit of the intermediate output) is a function of Bit-N-3 to Bit-N+3. The number of 1's in the intermediate output tells how many patterns were detected, which, as discussed earlier, will be at most 10.
Figure 2: Block diagram representation of design
Let us now proceed to a generic logic for the Nth intermediate output. As discussed, the Nth output (On) depends upon 7 bits: 3 bits above and 3 bits below. The Nth bit of output will show "1" only for the following cases: X1101XX and 00101XX, as 10101XX will be detected as "1" for the (N+2)th bit of output. If we denote the 7 bits involved as G,F,E,D,C,B,A (G being the uppermost), then the output expression becomes

On = FED'C + G'F'ED'C

which simplifies to

On = ED'C(F + G')

For bit 31, the upper two bits do not exist; 101XX is the expression for getting the output as 1. Thus, the equation, on a similar note, can be expressed as:
O31 = ED'C

For bit 30, the expression comes out to be X101XX. So, O30 also has the same logic as O31.

For O1 and O0, we get the same expression as we get for On. The logic diagram for obtaining the intermediate outputs is shown in figure 3 below:

Figure 3: Circuit with logic for intermediate output


Now we have the intermediate outputs showing which bit positions have the specified pattern detected. The number of "1"s in the intermediate output is our final answer. So, we need a circuit to combinationally count the number of 1's in a bus, as shown in figure 3 above. "Combinationally count number of 1's in a bus" explains how we can do this.
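For readers who prefer RTL, below is a small behavioral sketch of the same counter (module and signal names are our own, and this is one possible formulation, not the only one). Instead of the parallel per-bit logic derived above, it performs a greedy MSB-first scan, which matches the convention used here of counting "10101" at the upper position; the for loop unrolls into purely combinational logic.

module count_101 (
    input      [31:0] d,
    output reg [3:0]  count
);
    integer i;
    reg [1:0] skip; // bits still consumed by an already-counted pattern
    always @(*) begin
        count = 4'd0;
        skip  = 2'd0;
        for (i = 31; i >= 2; i = i - 1) begin
            if (skip != 0)
                skip = skip - 1'b1;           // inside a counted pattern
            else if (d[i] & ~d[i-1] & d[i-2]) begin
                count = count + 1'b1;         // found a non-overlapping "101"
                skip  = 2'd2;                 // consume its remaining 2 bits
            end
        end
    end
endmodule

For the example input 10101001 (zero-extended to 32 bits), the scan counts exactly one occurrence, as required.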

This post is in response to a query posted on our "post your query" page. In case you want to have an answer to your query, you can post a comment. We will try our best to answer.

Design query: Combinationally count number of 1's in a 32-bit bus

Solution: The design in question is a combinational design with a 32-bit input and a 6-bit output, as there can be a maximum of 32 1's, and 32 is "100000" in binary. Making a truth table or K-map for this problem is not practical, so we have to take a modular approach. Let us divide the problem into detecting the number of 1's among 4 bits, and then adding the resulting numbers together to get the total count.

Let us first create a truth table converting the number of 1's in a 4-bit stream into a 3-bit number (the count can be up to 4, which is "100" in binary). The resulting truth table is shown in figure 1.

Figure 1: Truth table for 4-bit count 1's circuit

Solving the above for O2, O1 and O0 using K-maps, we get the expressions as shown in figures 2, 3 and 4 below.

Figure 2: Expression for O2

Figure 3: Expression for O1



Figure 4: Expression for O0

Thus, we have 8 instances, each counting the number of ones in its respective 4 bits. The next thing we need is to add these 8 three-bit numbers to obtain the total number of 1's in the 32-bit input. For this, we can again follow a modular approach, adding two numbers at a time until we are left with a single number. The block diagram of the complete solution is shown below in figure 5.

Figure 5: Complete block diagram of counting number of 1's
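A compact RTL sketch of this structure is given below (module and signal names are our own). Each generate iteration is one 4-bit count-ones block from figure 1, and the wire additions form the pairwise adder tree of figure 5.

module count_ones_32 (
    input  [31:0] d,
    output [5:0]  count
);
    // stage 1: number of 1's in each 4-bit group (0 to 4 fits in 3 bits)
    wire [23:0] n; // eight 3-bit partial counts, packed
    genvar i;
    generate
        for (i = 0; i < 8; i = i + 1) begin : g_nib
            assign n[3*i +: 3] = d[4*i] + d[4*i+1] + d[4*i+2] + d[4*i+3];
        end
    endgenerate
    // stages 2 and 3: add two numbers at a time
    wire [3:0] s0 = n[0 +: 3]  + n[3 +: 3];
    wire [3:0] s1 = n[6 +: 3]  + n[9 +: 3];
    wire [3:0] s2 = n[12 +: 3] + n[15 +: 3];
    wire [3:0] s3 = n[18 +: 3] + n[21 +: 3];
    wire [4:0] t0 = s0 + s1;
    wire [4:0] t1 = s2 + s3;
    assign count = t0 + t1; // final stage: 0 to 32
endmodule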


On-chip bus power reduction techniques



The process of data transmission on an on-chip bus leads to switching activity on the bus wires, which charges and discharges the capacitance associated with the wires and consequently leads to dynamic power dissipation.
Bus encoding is a widely used technique to reduce dynamic switching power. For any encoding scheme, the sender's encoder encodes the signal, while the receiver's decoder decodes the signal with the inverse function. The power reduction encoding techniques can be divided into 2 categories: a) self-switching power reduction, b) coupling power reduction.
Self-switching is bit toggling between the 0 and 1 levels on a wire over time, causing the wire's capacitance to its surrounding metal layers to charge and discharge. The following techniques are used to address this power dissipation:
1.      Address bus encoding. It exploits the high regularity associated with address streams, which are characterized by spatial and temporal locality.
a.       Gray code encoding. This scheme guarantees only one bit flip in the case of sequential address accesses.
b.      T0 code. It uses an extra signal on the bus which indicates whether the currently accessed address is sequential to the previously accessed one. If yes, the address bus is not toggled, and the receiver is responsible for calculating the address based on the previous one.
c.       T0-C code. Here the extra signal is eliminated, and instead a new address is sent to indicate that the address regularity has finished.
2.      Data bus encoding. A data bus, contrary to an address bus, does not possess any regularity and can rather be considered random. Therefore, no spatial or temporal locality can be effectively exploited.
a.       Bus-invert code. It computes the Hamming distance (the number of changed bits) between the current value and the next value on the bus, and inverts the value if the distance is greater than half the bus width. An additional signal is used to indicate that the value is inverted (see the RTL sketch after this list).
b.      Transition signaling. In this scheme, logical 1 is indicated by a level transition from 0 to 1 or from 1 to 0, while logical 0 causes no transition. This scheme ensures the number of transitions on the bus is equal to the number of 1s, and is effective with data where the number of 1s is less than the number of 0s.
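A minimal sketch of a bus-invert encoder is shown below, assuming a parameterized width and our own signal names; the decoder at the receiver simply computes data_out = invert ? ~bus : bus.

module bus_invert_enc #(parameter W = 8) (
    input                  clk,
    input                  rst_n,
    input      [W-1:0]     data_in,
    output reg [W-1:0]     bus,    // encoded value driven on the wires
    output reg             invert  // extra line: 1 means bus is inverted
);
    // Hamming distance between the next value and what is on the bus now
    wire [W-1:0] diff = data_in ^ bus;
    integer i;
    reg [7:0] dist;
    always @(*) begin
        dist = 0;
        for (i = 0; i < W; i = i + 1)
            dist = dist + diff[i];
    end
    always @(posedge clk or negedge rst_n)
        if (!rst_n) begin
            bus    <= {W{1'b0}};
            invert <= 1'b0;
        end else if (dist > W/2) begin
            bus    <= ~data_in; // inverted value causes fewer transitions
            invert <= 1'b1;
        end else begin
            bus    <= data_in;
            invert <= 1'b0;
        end
endmodule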
Coupling power is dissipated when crosstalk happens between different wires of the bus. The following techniques are used to address this power dissipation:
1.      Address bus encoding.
a.       Permutation of address bus lines is done at the physical design stage to reduce coupling. It can be achieved by orthogonal layout of the wires or by routing them through different metal layers.
2.      Data bus encoding.
a.       CBI (coupling bus-invert). It is very similar to the previously explained bus-invert code scheme, but inverts the data to reduce the cross-coupling effect.
b.      Transition pattern coding scheme (TPC). It adds a signal to the bus to encode codeword patterns in which neighboring lines change in phase.

For more power reduction schemes you can refer to On-chip Communication Architectures book.

Courtesy www.shellbr.com.

Metastability tolerant designs

We discussed, in our posts metastability and how a flip-flop goes metastable, the basics of metastability and what causes metastability failures in designs. We also discussed the impacts of metastability failures on our designs. So, we are bound to think of ways of preventing metastability in designs. The only way to do so is to not let the input toggle during the setup-hold window. This can be done if we have completely synchronous designs and setup-hold timing is met for all timing paths. But every design is bound to have asynchronous signals, as everything in this world cannot run on a single clock. For example, when you press the reset button on a device, this has to be an asynchronous event, since the event is generated by your body, which runs on a different clock than the device. :-)

Thus, in reality, we cannot prevent metastability. We can only reduce its occurrence, or make our designs such that the occurrence of metastability does not affect the state machine.
  • Avoid metastability to as great an extent as possible: As discussed earlier, we can try to make our designs as synchronous as possible. Thus, by virtue of setup-hold requirements being met, metastability will not be much of an issue. Another possible solution is to decrease the frequency of the system: fewer clock edges mean a lower probability of data being captured during the setup/hold window. However, do we really want to limit our designs' performance just for the sake of metastability? So, we must make our designs metastability tolerant.
  • Make our designs metastability tolerant to as great an extent as possible: The most common way to make designs metastability tolerant is to add synchronizer stages (a minimal sketch follows this list). Doing this, we allow certain flip-flops in the design to go metastable, but do not let their metastability impact the design by propagating to later stages. This does not guarantee perfect immunity from design failures either, but, if carefully designed, it reduces the occurrence of design failures due to metastability to almost nil.
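Below is a minimal sketch of the classic double-flop synchronizer for a single-bit signal (module and signal names are illustrative):

module sync_2ff (
    input  clk,      // destination clock
    input  rst_n,
    input  async_in, // signal coming from another clock domain
    output sync_out
);
    reg meta, stable;
    always @(posedge clk or negedge rst_n)
        if (!rst_n) begin
            meta   <= 1'b0;
            stable <= 1'b0;
        end else begin
            meta   <= async_in; // this stage is allowed to go metastable
            stable <= meta;     // gets a full cycle to resolve before use
        end
    assign sync_out = stable;
endmodule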

Asynchronous FIFO


An asynchronous FIFO is a bus synchronization technique agnostic of the frequency relationship between the two clock domains, and can therefore be considered practically universal.

It is convenient to choose the write/read pointers one bit wider than needed to address the FIFO; the MSB then plays the role of a wrap indicator. The pointer (bus) synchronization is performed with the help of Gray encoding. Gray code is a popular technique to synchronize a bus because only one bit changes at a time. This ensures we always sample either the old or the new value on the bus, and never an inconsistent one. "g2b" and "b2g" are the logic blocks that convert Gray code to binary and vice versa. It is out of the scope of this article to depict their design.
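For reference, a commonly used formulation of these conversions (a sketch with our own names, not necessarily the exact logic referred to above, writing PTR_W for $clog2(FIFO_SIZE)) is:

// binary to Gray
assign gray = bin ^ (bin >> 1);

// Gray to binary: each binary bit is the XOR of all Gray bits above it
reg [PTR_W:0] bin_dec;
integer k;
always @(*) begin
    bin_dec[PTR_W] = gray[PTR_W];
    for (k = PTR_W - 1; k >= 0; k = k - 1)
        bin_dec[k] = bin_dec[k+1] ^ gray[k];
end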



//write pointer (source clock domain)
always @(posedge src_clk or negedge rst_n)
      if (!rst_n)
                  wr_ptr <= 'd0;
      else if (push)
                  wr_ptr <= wr_ptr + 1'b1;
//read pointer (destination clock domain)
always @(posedge dst_clk or negedge rst_n)
      if (!rst_n)
                  rd_ptr <= 'd0;
      else if (pop)
                  rd_ptr <= rd_ptr + 1'b1;
//full: MSBs differ (the write pointer has wrapped once more) while all
//lower bits are equal; rd_ptr_synch is rd_ptr brought into the source
//domain through b2g -> double-flop synchronizer -> g2b
assign full = (wr_ptr[$clog2(FIFO_SIZE)] ^ rd_ptr_synch[$clog2(FIFO_SIZE)]) &&
              (wr_ptr[$clog2(FIFO_SIZE)-1:0] == rd_ptr_synch[$clog2(FIFO_SIZE)-1:0]);
//empty: all pointer bits, including the MSB, are equal; wr_ptr_synch is
//wr_ptr synchronized into the destination domain in the same way
assign empty = (wr_ptr_synch[$clog2(FIFO_SIZE):0] == rd_ptr[$clog2(FIFO_SIZE):0]);

The important thing to remember is that the size of the FIFO has to be exactly a power of two. This is because in any other case the pointer wrap will cause multiple bit transitions even with Gray code encoding, and thus bus synchronization with only one bit changing at a time is violated.


Half-handshake synchronization scheme

Synchronization questions are among the favorites of VLSI job interviewers. This is because they check not just the general intellectual abilities of the potential candidate, but also very specific professional knowledge which is usually acquired only by experience. When it comes to synchronization, there are plenty of schemes. During the interview, it often comes down to the "ultimate" solution: the synchronizer which is tolerant of any source-destination conditions (relative frequencies, duration of signals, etc.). The expected answer is the very well known full-handshake scheme. It is definitely the "ultimate" solution, but its extra-generic nature comes at the cost of a very long processing cycle (6 source + 6 destination cycles).

Less known is the half-handshake synchronization scheme, which differs from the full-handshake scheme in that it utilizes signal toggling, rather than signal level, as the indication to transfer synchronization information from side to side.



At the source and destination sides, it is a toggle (a 0-to-1 signal change or vice versa) of the synchronized valid or ack signal which becomes the indication that the synchronized output may be issued and/or the state changed. The toggle may be detected by comparison (XOR) of the next and current signal values; the current signal value needs to be latched at each processing cycle.

The half-handshake scheme provides a 2-times better processing cycle than the full handshake, because it consists of only a synchronization-acknowledge cycle, rather than synchronization, acknowledge, synchronization de-assertion and acknowledge de-assertion.
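A minimal sketch of the destination-side toggle detection (our own names; the source side mirrors it for the ack) is shown below. req_toggle is assumed to flip once per transfer in the source domain.

reg [1:0] sync_ff; // double-flop synchronizer
reg       seen;    // last sampled level
wire      new_req = sync_ff[1] ^ seen; // one-cycle pulse per toggle

always @(posedge dst_clk or negedge rst_n)
    if (!rst_n) begin
        sync_ff <= 2'b00;
        seen    <= 1'b0;
    end else begin
        sync_ff <= {sync_ff[0], req_toggle};
        seen    <= sync_ff[1];
    end
// on new_req, the destination flips its own ack_toggle flop,
// which the source detects in exactly the same way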

How a latch/flip-flop goes metastable

In the post metastability, we discussed that an inverter loop can be put into a metastable state. Since latches and flip-flops consist of inverter loops controlled by transmission gates, they are also susceptible to metastability. For instance, consider a negative latch as shown in figure 1, and the clock waveform alongside. The instant of interest to us is when switch_1 closes, i.e., the transparency_close edge of the latch. Also, we know from theory two important concepts, "setup time" and "hold time". Let us call the region between setup time and hold time the setup-hold window. If the data toggles before the setup-hold window, it is guaranteed to get captured and propagated to the latch output. If data toggles after the setup-hold window, it is guaranteed not to get propagated to the latch output. On the other hand, if it toggles during the setup-hold window, it may or may not propagate to the output. Also, it may happen that when the switch closes, the input level is such that the latch goes into a metastable state.

Figure 1: How latch goes into metastable state
As is evident from figure 1, the input is still transitioning when the switch has closed, and the output goes metastable.

Similarly, as a flip-flop is also composed of latches in master-slave configuration, a flip-flop goes metastable in the same way. In general, we can describe it as:

A flip-flop/latch has a defined timing requirement in terms of when data should be available at its input so that it is correctly captured. These requirements are termed setup and hold times. If these requirements are not met, there is a possibility of the flip-flop going metastable.

In general, following are the scenarios which can cause a flip-flop's output to go metastable.

  • Asynchronous timing paths: Paths crossing clock domains, where the launch and capture clocks do not have a definite phase relationship, cannot be assured to be captured outside the setup/hold window.
  • Timing paths violating setup and/or hold: the capturing flip-flop may go metastable at a certain PVT where the data is probable to get captured within the setup-hold window.
How metastability impacts design: Let us assume that the output of a flip-flop goes to a number of gates (say 100). As long as the flip-flop output is in the metastable range, it causes short-circuit current to flow in all these gates. This link shows the short-circuit current to be in the range of 100 uA. So, a large amount of short-circuit current will flow for a considerable amount of time.


What helps a flip-flop come out of metastability?

As described earlier, theoretically it is possible for the flip-flop to remain in the metastable state for an infinite time in the absence of any disturbance. However, there are certain factors which help it come out of metastability.


  • Ability of the inverter pair to detect a disturbance and act on it: If the inverter pair is able to detect even the smallest of disturbances, it will act upon it and eventually come out of metastability. So, having the following characteristics for the transistors in the inverter pair will help:
    • Low VT
    • High drive strength
  • The more time available for metastability resolution, the greater the chance of a disturbance occurring; hence, the flip-flop will eventually come out of metastability

In general, the ability of a flip-flop to come out of metastability is measured by a parameter known as MTBF (Mean Time Between Failures). It can be thought of as the inverse of the failure rate: the higher the MTBF, the higher the probability of the flip-flop coming out of metastability within a given time. It depends upon the following factors, which the formula shown after this list ties together:
  • Technology factors
  • Time available to resolve metastability: the more time available, the higher the MTBF
  • Frequency of the clock received by the flip-flop: the higher the clock frequency, the lower the MTBF
  • Frequency of toggling of the data received by the flip-flop: the higher the data frequency, the lower the MTBF
  • Internal design of the flip-flop: the ability of the flip-flop to act on the smallest of disturbances, as discussed earlier. In general, a flip-flop consuming more power and having higher gain will be able to come out of metastability quickly
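A widely quoted first-order model tying these factors together (a textbook formula, not derived in this post) is:

MTBF = e^(t_r/tau) / (T0 . f_clk . f_data)

where t_r is the time available for metastability resolution, tau and T0 are the resolution time constant and metastability window of the flip-flop (technology- and design-dependent), and f_clk and f_data are the clock and data toggle frequencies. The exponential dependence on t_r is why even one extra synchronizer stage improves MTBF dramatically.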

Metastability

What is metastability: Literally speaking, a metastable state refers to a state "which is not so stable", where a slight disturbance will cause the system to lose the state. In the context of VLSI, specifically sequential design, in addition to the two stable states "0" and "1", there is also a state in between, at which the output may wander for some time due to the inherent feedback design of a latch. This state is known as the metastable state, and the phenomenon is known as metastability. In order to understand this, let us study two back-to-back connected inverters, as shown in the figure below.

Figure 1: Inverter loop schematic


When the output (N2) of the top inverter is 1, M4 & M1 are on. Similarly, when the output is 0, M3 & M2 are on. If the voltage of the output node is left at anything other than VDD or GND, M4 and M1 try to pull the output towards 1 and the other two pull it towards 0. The final settling value (either 0 or 1) depends upon whose initial pull is stronger. However, if the combined initial pull of M4 & M1 is equal to that of M3 & M2, the output will not move. The output may remain at this level for some time, as long as this equilibrium is maintained. However, if we forcefully change the output by even a small value, it will settle at 0 or 1 depending upon the direction of the forced change. This state is the so-called metastable state. The inverter pair can remain in this state as long as there is no disturbance in the voltage levels. This disturbance can be due to external factors, such as an externally forced voltage, or internal factors, such as crosstalk. So, if the circuit is left to come out of metastability by itself, the time to come out of metastability is unknown. It depends upon:

  • The value of the voltage stimulus: If the level of disturbance is greater than a certain threshold, the output will start to move towards one of the stable levels. Of course, the larger the initial disturbance, the faster the settling to a stable state
  • Strengths of the inverters to pull towards "0" and "1": The smaller the disturbance the inverter pair is able to resolve and amplify, the shorter the time it takes to come out of metastability

How can we generate a pulse for every edge of the incoming pulse

It is a very common requirement to detect the positive edge, negative edge or both edges of a signal, and the circuit that detects an edge and generates a single-cycle pulse is quite simple. In this post, we will discuss how we can detect the positive edge, negative edge and both edges of a signal.

  • Detect positive edge of a signal: A positive edge of a signal means that the current state of the signal is "1" and the previous state is "0". A pulse output means that the output of the circuit is "1" for one cycle. So, we need a circuit which generates "1" as output when the present state is "1" and the previous state is "0", and "0" otherwise.
Thus, Pos_edge_detect = D(n-1) & D(n-2)'
The required implementation is shown in figure 1 below:
Figure 1: Detection of positive edge of signal

  • Detect negative edge of a signal: A negative edge of a signal means that the current state of the signal is "0" and the previous state is "1". So, we need a circuit which generates "1" as output when the present state is "0" and the previous state is "1", and "0" otherwise.
Thus, Neg_edge_detect = D(n-2) & D(n-1)'
 The required implementation is shown in figure 2 below:
Figure 2: Detection of negative edge of signal


  • Detecting both positive and negative edges of the signal: Simply ORing the Pos_edge_detect and Neg_edge_detect signals produces an output which is a single-cycle pulse for either edge of the incoming signal. Consecutive edges of the incoming signal must be at least 2 cycles apart; otherwise, the output will not be a pulse, but a continuous signal.
Any_edge_detect = Pos_edge_detect + Neg_edge_detect
The required implementation is shown in figure 3 below:

Figure 3: Detection of both positive and negative edges of signal
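An RTL sketch of all three detectors together (module and signal names are our own) could look like this:

module edge_detect (
    input  clk,
    input  rst_n,
    input  d,
    output pos_edge_detect,
    output neg_edge_detect,
    output any_edge_detect
);
    reg d1, d2; // D(n-1) and D(n-2): one- and two-cycle-old samples
    always @(posedge clk or negedge rst_n)
        if (!rst_n)
            {d2, d1} <= 2'b00;
        else
            {d2, d1} <= {d1, d};
    assign pos_edge_detect = d1 & ~d2;
    assign neg_edge_detect = ~d1 & d2;
    assign any_edge_detect = pos_edge_detect | neg_edge_detect;
endmodule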

The technique we discussed here delays the output by two cycles. Can you think of any other way to detect the edges of a signal which is more efficient?


Interview questions related to clock jitter and duty cycle variations

Below we list a few of our posts related to clock jitter and duty cycle variation. Happy learning.


  • Clock jitter: Discusses the definition and types of clock jitter.
  • Duty cycle of clock: Discusses the definition of duty cycle and how it impacts timing slack of timing paths.
  • Duty cycle variation: Discusses in detail basics of duty cycle variation and its timing implications.

Our purpose is to make this page a single destination for any questions related to clock jitter and duty cycle variation. Please feel free to ask any question related to clock jitter and duty cycle variations.

Interview questions related to reset design and reset timing

Below we list a few of our posts related to reset timing and some design concepts related to reset. Happy learning.

Reset basics - Discusses the purpose and design strategies related to reset

Synchronous and asynchronous resets - Discusses the basics of synchronous reset and asynchronous reset. Also discusses a few differences between them

Reset synchronizer - Discusses the definition, need and working of reset synchronizers.

Recovery and removal checks - Discusses the timing aspects of asynchronous reset. Provides the definition of recovery check, removal check, recovery time and removal time.

Asynchronous reset assertion timing scenarios - Discusses if there may arise a need to time the assertion of asynchronous resets

Duty cycle care-abouts for clock paths in reset assertion - Discusses how reset assertion can alter the duty cycle of clock, and what needs to be taken care of.

Our purpose is to make this page a single destination for any questions related to reset design and timing. If you have any source of educational information related to reset, please comment or send an email to myblogvlsiuniverse@gmail.com and we will add it here. Also, feel free to ask any question related to reset design and timing.

Does it make sense to check hold violations at synthesis stage

As we know from the basics of STA, the hold timing equation is generally of the form:

Hold_slack = Data_path_delay - hold_time_of_flop - clock_skew

This can be re-organized as follows:

Hold_slack = Data_path_delay + launch_clock_delay - capture_clock_delay - hold_time_of_flop

Data_path_delay + launch_clock_delay may be combined to represent the arrival of data at the capture flip-flop with respect to the clock source. The same is evident from the figure below.



The modified equation then becomes:
Hold_slack = Data_arrival_at_capture_flop - Clock_arrival_at_capture_flop - hold_time_of_flop
It is clear from the above equation that a hold check essentially comprises a race condition between clock and data, minus a fixed number (whether hold_time_of_flop is really a fixed number is a separate question).
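Plugging in purely illustrative numbers (assumptions, not from any real design): if the data arrives at the capture flop 0.8 ns after the source clock edge, the clock arrives at the capture flop after 0.5 ns, and the hold time is 0.05 ns, then:

Hold_slack = 0.8 - 0.5 - 0.05 = 0.25 ns

The positive slack means the new data cannot race the clock and corrupt the capture of the previous cycle's data.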

Before clock-tree synthesis, we do not have one major aspect of this equation; i.e., we do not know the arrival times of the clock at the launch and capture flip-flops. Also, we do not know how much the skew and uncommon clock path will contribute to on-chip variations, thus requiring extra margin for hold slack. Talking about logic synthesis, most of the time we do not even have placement data, so we do not even have correct data path delay estimates. So, fixing hold at synthesis will not help, as there will anyway be a requirement for hold fixes taking into account actual data and clock path delays.

But the most important reason for not fixing hold before clock-tree synthesis is that the task of hold fixing is not very complex for the available tools. It may be as simple as downsizing logic and/or adding buffers in the data path where setup slack is available. So, during synthesis and logic placement, tools focus on meeting setup targets, and focus on hold after the clock tree is built.

Another point to note is that after data net routing, data path delay may increase because of detouring of data nets as compared to the original estimates. So, it may be wise not to fix hold violations of very small magnitude even after clock-tree synthesis, and to wait for data net routing to see how many of those violations remain.

4:1 mux as universal gate

A universal gate is a gate which can implement any given logic function. NAND and NOR gates are commonly known as universal gates, since you can implement any logic function with them. A multiplexer, in a sense, can also be termed a universal gate, since you can realize any function by using a mux as a look-up-table structure. In this post, we discuss how we can utilize a 4:1 mux as a universal gate realizing 2-input gates.

Any two-input gate gives a definite value (either 0 or 1) for each combination of its inputs, and can be represented in the form of a truth table, as shown below (I1 and I0 are the gate inputs):

I1 I0 | Output
 0  0 |   A
 0  1 |   B
 1  0 |   C
 1  1 |   D

Here, A, B, C & D can each be either "0" or "1" depending upon the functionality of the gate. For instance, for a 2-input AND gate, A = B = C = 0 and D = 1.

Utilizing a 4:1 mux, this generic 2-input gate can be implemented as shown below, with the two gate inputs driving the mux select lines and the constants A, B, C & D driving the data inputs:


For instance, a 2-input AND gate will be implemented as follows:


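In RTL terms, the idea is just a 4-entry look-up table (a sketch with our own names):

module mux_universal (
    input        a, b, // gate inputs, used as mux selects
    input  [3:0] lut,  // {D, C, B, A}: the truth-table constants
    output       z
);
    assign z = lut[{a, b}]; // 4:1 mux indexed by the two inputs
endmodule

// 2-input AND gate: A = B = C = 0, D = 1
// mux_universal u_and (.a(x), .b(y), .lut(4'b1000), .z(f));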
This post was written as a response to a query from one of our readers. You can also post your query at post your query.

Duty cycle care-abouts for clock paths in reset assertion

In the post Asynchronous reset assertion timing scenarios, we discussed how we may need to time the assertion of asynchronous reset as well. In this post, we will talk about how important the duty cycle aspect is as well. We will use the same example as in our previous post to help make a better connection.

In the figure below (discussed in Asynchronous reset assertion timing scenarios), Q0 goes from 0 -> 1 and 1 -> 0 in the same cycle, thereby providing a very diminished high pulse, or a very diminished low pulse, to the BIT_1 flip-flop, depending upon the delay in the reset path of the BIT_0 flip-flop. This can result in a violation of the minimum pulse width requirement of BIT_1. We will discuss this in some detail in this post.

Let us talk only about clock edge number 4. Q0 goes 1 and BIT_1 receives it as a positive edge of its clock, thereby changing state too. The delay until BIT_1 receives the positive edge of clock is given as

POS_CK_AT_BIT_1 = (Latency of BIT_0) + (CLK->Q of BIT_0) + (BIT_0/Q -> BIT_1/CK)

Now, both Q0 and Q1 are 1, causing the output of the NAND gate to go "0", which asserts the reset of both BIT_0 and BIT_1. BIT_1 now receives the negative edge of its clock, whose delay is

NEG_CK_AT_BIT_1 = (Latency of BIT_0) + (CLK->Q of BIT_0) + NAND_DELAY + (R -> Q of BIT_0) + (BIT_0/Q -> BIT_1/CK)

The width of the high pulse that the flip-flop receives is equal to the difference between the above two values:
HIGH_PULSE_WIDTH_AT_BIT_1 = NAND_DELAY + (R -> Q of BIT_0)

And the low pulse is equal to
LOW_PULSE_WIDTH_AT_BIT_1 = CLK_PERIOD - HIGH_PULSE_WIDTH_AT_BIT_1

Now, depending upon the combinational delays mentioned, as well as CLK_PERIOD, BIT_1 may receive a pulse (either high or low) with a width less than what is permissible for its proper functionality. Thus, there will be a pulse width violation.

Looking at the equations for high and low pulse widths, it seems more probable for the high pulse width to violate for BIT_1, unless either the clock period is very small or there is buffering in the reset path of BIT_0. We would need to increase buffering to the reset pin of BIT_0 to increase the width of the high pulse, and vice-versa.

The discussion we just had is applicable to any design where a reset controls a signal driving another flip-flop's CK pin. However, one may argue the following about this particular circuit:

This particular circuit is resistant to high pulse width violation, but may have low pulse width violations.

Can you argue in favor of or against this statement? What could be the reason for making this statement? (Hint: the answer lies in the sequence of events causing CK of BIT_1 to go high and then go low.)


Asynchronous reset assertion timing scenarios

We have always heard that for asynchronous resets, only de-assertion needs to be timed. This is true for most designs, given the guidelines followed for implementation of asynchronous resets. However, there may be a scenario wherein we need to consider reset assertion as well in our timing checks. In this post, we will try to shed some light on it.

In our post "recovery and removal checks", we have elaborated following for asynchronous resets:

* Reset assertion "combinationally" causes the output to go 0; i.e., assertion of reset does not wait for edge of the clock to alter the state of flip-flop

* Reset de-assertion waits for clock edge to propagate the value at "input" to "output". There are checks corresponding to de-assertion of reset with respect to clock known as "recovery" and "removal" checks.

Thus, we see that "recovery" and "removal" checks are defined only for de-assertion of asynchronous reset. So, one might think that there is no timing requirement for reset assertion. This is true in the sense that there is no requirement for reset assertion at the flip-flop that is receiving the reset, and overall no timing requirement for a carefully implemented design. But sometimes there may be a corner-case scenario requiring the combinational path through "reset -> output" to be timed. We will discuss some cases to elaborate our statement.

CASE 1: The output of the flip-flop goes to another flip-flop, which itself is getting reset at the same time.
Here, since the flip-flop in the fanout is itself in reset state, it will not be sampling its input. Hence, there is no need to time the assertion of reset.



CASE 2: The output of the flip-flop goes to an asynchronous domain, or to a synchronizer which is not itself in reset. Here, the asynchronous-domain flip-flop is expected to be a synchronizer in most cases. So, there does not arise a need for meeting timing.

CASE 3: The output of the flip-flop goes to a flip-flop which is working on a synchronous clock and is expecting synchronous data. In this case, we will have to meet timing through the R -> Q arc of the source flip-flop, with the data getting captured at the destination. We need to keep in mind that a reset synchronizer also transfers reset assertion through its R -> Q arc. So, this case cannot be valid for global asynchronous reset assertion. For instance, let us consider the below as a valid scenario.
For the below case, we have to meet reset assertion timing from
ASYNC_RESET_SOURCE -> REG_B/R -> REG_B/Q -> REG_C/R -> REG_C/Q -> REG_D/D

Since the reset source works on an asynchronous clock, it is a design violation to have it captured at a flip-flop which is running functionally and expecting synchronous data.

Thus, there may not be any scenario requiring timing of asynchronous reset assertion globally. However, state machines may locally utilize asynchronous reset assertion to get things done. For instance, you must have come across a common problem known as "conversion of an asynchronous counter to a decade counter", wherein the asynchronous reset pin is utilized to reset the count whenever the count reaches 10. We will simplify it to "conversion of a 2-bit asynchronous counter to a modulo-3 counter" to serve our purpose.

Consider the following design of a 2-bit asynchronous counter. We have shown two flip-flops registering the output of this counter for illustration purposes and to make the understanding clearer. Flip-flop "BIT_1" gets the output of "BIT_0" as its clock; all other flip-flops get CLK as clock. Whenever the outputs of both "BIT_1" and "BIT_0" go 1, the reset of both flops causes the output to go "00" in the same cycle. This is expected to reach the registers by the next clock cycle so that it can be captured properly. The intermediate state "11" is not captured at the next stage, as understood from the basics of state machines.


We get the following timing equations to be met in a single cycle (minus setup time and skews, etc.):
BIT_1/CK -> BIT_1/Q -> BIT_1/R -> BIT_1/Q -> REG_1/D
BIT_1/CK -> BIT_1/Q -> BIT_0/R -> BIT_0/Q -> REG_0/D
BIT_0/CK -> BIT_0/Q -> BIT_1/R -> BIT_1/Q -> REG_1/D
BIT_0/CK -> BIT_0/Q -> BIT_0/R -> BIT_0/Q -> REG_0/D

Thus, this is an example of a scenario wherein we need timing for the assertion of reset as well; it is captured at the next flip-flops, propagating combinationally through the flip-flops' "Q" pins. Can you deduce the timing paths to be timed for the original case of the "asynchronous counter as a decade counter" as well?

Another possible care-about of asynchronous reset assertion may be degradation of duty cycle if the output is used as a clock, such as the highlighted path, where Q0 is consumed as clock by the "BIT_1" flip-flop. This can cause the minimum pulse width requirement to be violated for the "BIT_1" flip-flop. This perspective of reset assertion is discussed here.



Can we use discrete latches and AND/OR gates instead of ICG?

In the post Integrated Clock Gating Cell, we discussed that an ICG has a negative level-sensitive latch preceding an AND gate in order to relax the hold timing for the clock gating check. And we discussed that it gives benefits in area, power and timing. Let us discuss how area, power and timing are saved. We will discuss only the case of the AND gate; the same follows for the OR gate.

1. Architectural benefits - simplicity in clock handling: By introducing ICGs in place of discrete gates, you don't have to worry about the launch edge of the signal while writing RTL (for details, see here). One can always launch the signal from a positive edge-triggered flip-flop for timing and architectural simplicity, without worrying about the possibility of a glitch in the clock path due to a wrong-polarity flip-flop launching the enable signal.

2. Benefits in area and power: Having a custom module allows for better utilization of resources inside the custom ICG module; hence, it is expected to have lower power than a latch and an AND gate combined.

3. Benefits in timing: Having the latch -> AND path inside the ICG saves us from having to meet these paths individually, which could take a lot of effort with a discrete latch and AND gate. Also, it allows the latch to have almost full time borrow, thereby making almost a full-cycle path from a positive edge-triggered flip-flop to the ICG.
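For reference, below is a behavioral sketch of what such an ICG implements internally (our own names; real ICGs are hand-crafted library cells, and the test_en override shown here is an assumption, though commonly present):

module icg_sketch (
    input  clk,
    input  en,      // functional enable
    input  test_en, // test override (assumed in this sketch)
    output gclk
);
    reg en_latched;
    // negative level-sensitive latch: transparent while clk is low,
    // so the enable cannot change while the clock is high
    always @(clk or en or test_en)
        if (!clk)
            en_latched <= en | test_en;
    assign gclk = clk & en_latched; // glitch-free gated clock
endmodule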



Design problem : Convert a multiplexer to priority mux (Logic restructuring for a multiplexer for timing critical paths)

Problem statement: Given an 8:1 multiplexer, the signal connected to the 5th input (D5) is the most setup-timing-critical, and the other inputs are timing critical in the order D0 > D1 > D2 > D3 > D4 > D6 > D7. Restructure the logic accordingly.

Solution: We know that the most setup-timing-critical signal should see the least logic in its data path. So, we need to prioritize the 5th input such that it has the least logic of all the inputs. In other words, this is a problem of converting an ordinary multiplexer into a priority multiplexer. Let us first discuss how we can convert a multiplexer into a priority mux.

Figure 1 below shows a multiplexer with 8 inputs D0 - D7 and selects S2, S1, S0.
Figure 1: 8:1 multiplexer

The equation for output is given as below:

O = S2.S1.S0.D7 + S2.S1.S0’.D6 + S2.S1’.S0.D5 + S2.S1’.S0’.D4 + S2’.S1.S0.D3 + S2’.S1.S0’.D2 + S2’.S1’.S0.D1 + S2’.S1’.S0’.D0

This multiplexer can be represented in the form of a priority multiplexer, as required, as shown in figure 2 below.


We can start from the equation of the priority multiplexer and prove that it is actually equivalent to the 8:1 mux.

The equation of the priority multiplexer is given as:

O = (S0.S1'.S2).D5
  + (S0.S1'.S2)'.(S0'.S1'.S2').D0
  + (S0.S1'.S2)'.(S0'.S1'.S2')'.(S0.S1'.S2').D1
  + (S0.S1'.S2)'.(S0'.S1'.S2')'.(S0.S1'.S2')'.(S0'.S1.S2').D2
  + (S0.S1'.S2)'.(S0'.S1'.S2')'.(S0.S1'.S2')'.(S0'.S1.S2')'.(S0.S1.S2').D3
  + (S0.S1'.S2)'.(S0'.S1'.S2')'.(S0.S1'.S2')'.(S0'.S1.S2')'.(S0.S1.S2')'.(S0'.S1'.S2).D4
  + (S0.S1'.S2)'.(S0'.S1'.S2')'.(S0.S1'.S2')'.(S0'.S1.S2')'.(S0.S1.S2')'.(S0'.S1'.S2)'.(S0'.S1.S2).D6
  + (S0.S1'.S2)'.(S0'.S1'.S2')'.(S0.S1'.S2')'.(S0'.S1.S2')'.(S0.S1.S2')'.(S0'.S1'.S2)'.(S0'.S1.S2)'.(S0.S1.S2).D7

Here, each term is the select decode of that input ANDed with the complements of all higher-priority decodes.

Simplifying the above equation leads us back to the equation of the ordinary multiplexer: since the eight decode terms are mutually exclusive (exactly one of them is true for any combination of S2, S1 and S0), all the complemented factors drop away.
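An RTL sketch of the restructured mux (our own names) makes the intent clear: D5 reaches the output through a single final 2:1 stage and therefore sees the least logic depth, while the remaining inputs go through the larger mux tree.

module priority_mux8 (
    input  d0, d1, d2, d3, d4, d5, d6, d7,
    input  s2, s1, s0,
    output o
);
    wire [2:0] s = {s2, s1, s0};
    reg rest;
    always @(*)
        case (s)
            3'b000: rest = d0;
            3'b001: rest = d1;
            3'b010: rest = d2;
            3'b011: rest = d3;
            3'b100: rest = d4;
            3'b110: rest = d6;
            3'b111: rest = d7;
            default: rest = 1'b0; // 3'b101 is handled by the bypass below
        endcase
    // the late-arriving D5 passes through only the final 2:1 stage
    assign o = (s == 3'b101) ? d5 : rest;
endmodule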