How clock gating reduces power dissipation

As discussed in clock gating - basics, the enable signal in the data path is moved into the clock path in order to save dynamic power. But exactly how is this power saved? This post answers that question.




A flip-flop implemented as a standard cell usually has two internal inverters that generate the clk' and clk_delay signals. So, even if the flip-flop input is held constant, these inverters still toggle with the clock, thereby dissipating dynamic power. In addition, there is internal power dissipation inside the flip-flop due to the repeated charging and discharging of transistor gates driven by the clock, but this component is not significant compared to the dynamic power of the inverters. Figure 2 below shows the internal structure of a flip-flop, which has two latches in master-slave configuration and two inverters in the clock path.

Figure 2: Flip-flop internal structure

Every clock cycle, these two inverters toggle regardless of whether the flip-flop output toggles. However, clock gating stops these inverters from toggling when the data is not toggling. Let us assume that a latch-based ICG is inserted. Thus, a mux in the data path is replaced by an ICG in the clock path. But there is a difference here. If there are, say, 1000 flip-flops with the same enable signal, a common ICG will be inserted for all of them. Thus, instead of 2000 inverters (inside 1000 flops) toggling while the flip-flop outputs are constant, we have only 2 inverters inside the ICG consuming dynamic power. This is how dynamic power is saved. However, if only 1 flop had been clock gated in this manner, there would not have been any dynamic power saving; since we now have an ICG instead of a mux, it may even result in an overall loss in terms of area and power.

Whether there is any net saving is governed by how many flip-flops are clock gated using a single ICG.
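The inverter arithmetic above can be sketched in a few lines of Python. This is only a counting model, not a power estimate: it assumes, as in the text, 2 clock inverters per flip-flop and 2 inside the ICG, and the function and parameter names are made up for illustration.

```python
def toggling_clock_inverters(num_flops, gated,
                             inverters_per_flop=2, inverters_per_icg=2):
    """Clock inverters that still toggle every cycle while the enable is low."""
    if gated:
        # The gated clock is held static, so the flops' internal inverters
        # are quiet; only the ICG's own clock inverters keep toggling.
        return inverters_per_icg
    return num_flops * inverters_per_flop


for n in (1, 10, 1000):
    print(n, toggling_clock_inverters(n, gated=False),
          toggling_clock_inverters(n, gated=True))
```

With a single flop the two numbers are equal, which is why gating one flop alone gives no dynamic saving in this simple model.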

Also, as discussed, many muxes in the data path sharing the same enable are replaced by a single ICG in the clock path. Thus, there are savings in area and leakage power too, in addition to dynamic power.

Design query: How can we construct a non-overlapping "101" counter for a 32-bit input using only a combinational circuit? For example, for 10101001 I want the output to be 1, since there is only one non-overlapping "101" sequence.

Solution: The design in question is a combinational design with a 32-bit input (given) and a 4-bit output, as shown in figure 1 below. Why the output is 4 bits wide is a bit tricky. For this, we have to understand the problem. We have to count the number of non-overlapping "101" sequences in the 32-bit input. Thus, "10101" counts as only a single occurrence, and "101101" counts as two occurrences. So, the maximum number of such patterns occurs when "101101" is repeated, which comes out to be 10 in a 32-bit number. 10 can be represented by a 4-bit number.
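To make the counting rule concrete, here is a small behavioural reference in Python. It is a sequential left-to-right scan, not the combinational design discussed in this post; the function name is hypothetical and the bit string is assumed to be written MSB first.

```python
def count_non_overlapping_101(bits: str) -> int:
    """Greedy left-to-right count of non-overlapping "101" occurrences."""
    count = 0
    i = 0
    while i + 3 <= len(bits):
        if bits[i:i + 3] == "101":
            count += 1
            i += 3          # skip the whole match so occurrences cannot overlap
        else:
            i += 1
    return count


print(count_non_overlapping_101("10101"))      # single occurrence
print(count_non_overlapping_101("101101"))     # two occurrences
print(count_non_overlapping_101("101101" * 5 + "10"))  # a 32-bit worst case
```

A reference like this is handy for checking any combinational implementation against a few hand-picked inputs.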

Figure 1: Design representation
One solution, of course, is to make a truth table and derive the logic equations from it. But the number of input combinations here is huge (2^32), which makes this practically impossible. So, we need to follow a modular approach here.

We can divide the problem into two parts: detecting the required pattern, and then counting how many patterns were actually detected. We introduce an intermediate 32-bit output, in which the Nth bit indicates whether the pattern was found with the Nth bit of the input as the middle symbol of the pattern. To detect a non-overlapping "101" pattern, we need to look at 2 more bits on each side of the 3-bit pattern, thereby making a combinational logic of 7 bits. There are special cases for the terminal bits (here bit-0, bit-1, bit-30 and bit-31), where we know that fewer than 2 bits are available on one side. So, we need special logic for these bits.

The overall combinational logic will look as shown in the figure below. Int-N (the Nth bit of the intermediate output) is a function of Bit-(N-3) to Bit-(N+3). The number of 1's in the intermediate output tells how many patterns were detected, which, as discussed earlier, is at most 10.

Figure 2: Block diagram representation of design

Let us now derive the generic logic for the Nth intermediate output. As discussed, the Nth output (On) depends upon 7 bits: the Nth bit itself, 3 bits above it and 3 bits below it. The Nth bit of the output will be "1" only for the following cases: X1101XX and 00101XX. The case 10101XX is excluded because it will be detected as "1" at bit N+2 of the output. If we denote the variables involved as G, F, E, D, C, B, A, then the output expression becomes

On = FED'C + G'F'ED'C
On = ED'C (F+G')
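The simplification above can be checked exhaustively with a short Python sketch. It only verifies the Boolean algebra for the variables G, F, E, D, C that appear in the equation; the function names are made up for this check.

```python
from itertools import product


def on_expanded(g, f, e, d, c):
    """On = FED'C + G'F'ED'C, written out term by term."""
    return bool((f and e and not d and c) or
                (not g and not f and e and not d and c))


def on_simplified(g, f, e, d, c):
    """On = ED'C (F + G')."""
    return bool(e and not d and c and (f or not g))


# Compare both forms over all 32 input combinations.
assert all(on_expanded(*v) == on_simplified(*v)
           for v in product([0, 1], repeat=5))
print("both forms agree on all 32 input combinations")
```

Note that this confirms the algebra only. A fixed 7-bit window cannot see how long an alternating run is, so an input such as 1010101 (where a greedy scan finds two non-overlapping occurrences) deserves a separate check against a reference model.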

For bit 31, the upper two bits do not exist, so 101XX is the pattern that gives an output of 1. Thus, the equation can similarly be expressed as:
O31 = ED'C

For bit 30, the pattern comes out to be X101XX. So, O30 has the same logic as O31.

For O1 and O0, we get the same expression as for On. The logic diagram for obtaining the intermediate outputs is shown in figure 3 below:

Figure 3: Circuit with logic for intermediate output


Now we have the intermediate outputs (figure 3 above) showing at which bits the specified pattern was detected. The number of 1's in the intermediate output is our final answer. So, we need a circuit that combinationally counts the number of 1's in a bus. "Combinationally count number of 1's in a bus" explains how we can do this.

This post is in response to a query posted on our "post your query" page. In case you want to have an answer to your query, you can post a comment. We will try our best to answer.

Design query :: Combinationally count number of 1's in a 32-bit bus

Solution: The design in question is a combinational design with a 32-bit input and a 6-bit output, as there can be at most 32 1's and 32 is "100000" in binary. Making a truth table or K-map for this problem is not practical, so we have to take a modular approach. Let us divide the problem into counting the number of 1's in each group of 4 bits and then adding the resulting numbers together to obtain the total count.

Let us first create a truth table converting the number of 1's in a 4-bit stream into a 3-bit number (a 4-bit group can contain anywhere from 0 to 4 ones). The resulting truth table is shown in figure 1.

Figure 1: Truth table for 4-bit count 1's circuit

Solving the above for O2, O1 and O0 using K-maps, we get the expressions as shown in figures 2, 3 and 4 below.

Figure 2: Expression for O2

Figure 3: Expression for O1



Figure 4: Expression for O0

Thus, we have 8 instances, each counting the number of 1's in its respective 4 bits. The next thing we need is to add these 8 three-bit numbers to obtain the total number of 1's in the 32-bit input. For this, we can again follow a modular approach and add two numbers at a time until we are left with a single number. The block diagram of the complete solution is shown below in figure 5.

Figure 5: Complete block diagram of counting number of 1's
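The structure of figure 5 can be modelled in Python as a quick sanity check. The expressions used for O2, O1 and O0 below are one possible derivation written for this sketch (the figures with the original K-map results are not reproduced here), and the function names are hypothetical.

```python
def count4(a, b, c, d):
    """3-bit count (O2 O1 O0) of the 1's among four input bits."""
    o2 = a & b & c & d                      # all four bits set
    o0 = a ^ b ^ c ^ d                      # odd number of 1's
    at_least_two = (a & b) | (a & c) | (a & d) | (b & c) | (b & d) | (c & d)
    o1 = at_least_two & (o2 ^ 1)            # two or three 1's, but not four
    return (o2 << 2) | (o1 << 1) | o0


def count_ones_32(word):
    """Count 1's in a 32-bit word: eight 4-bit slices, then a pairwise adder tree."""
    bits = [(word >> i) & 1 for i in range(32)]
    partial = [count4(*bits[i:i + 4]) for i in range(0, 32, 4)]
    while len(partial) > 1:                 # 8 -> 4 -> 2 -> 1 partial sums
        partial = [partial[i] + partial[i + 1]
                   for i in range(0, len(partial), 2)]
    return partial[0]


print(count_ones_32(0b10101001))
```

The `while` loop plays the role of the three adder levels; in hardware each level would be a row of small adders whose width grows by one bit per level.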


On chip bus power reduction techniques



Data transmission on an on-chip bus causes switching activity on the bus wires, which charges and discharges the capacitance associated with the wires and consequently leads to dynamic power dissipation.
Bus encoding is a widely used technique to reduce dynamic switching power. For any encoding scheme, the encoder at the sender encodes the signal, while the decoder at the receiver decodes it with the inverse function. The power reduction encoding techniques can be divided into 2 categories: a) self-switching power reduction b) coupling power reduction.
Self-switching is the toggling of a bit between the 0 and 1 levels on a wire over time, which charges and discharges the capacitance of that wire with respect to its metal layer. The following techniques are used to address this power dissipation:
1. Address bus encoding. It exploits the high regularity of address streams, which is characterized by spatial and temporal locality.
   a. Gray code encoding. This scheme guarantees only a single bit flip when sequential addresses are accessed.
   b. T0 code. It uses an extra signal on the bus which indicates whether the currently accessed address is sequential to the previously accessed one. If yes, the address bus isn’t toggled, and the receiver is responsible for calculating the address from the previous one.
   c. T0-C code. Here the extra signal is eliminated, and instead a new address is sent to indicate that the address regularity has ended.
2. Data bus encoding. A data bus, in contrast to an address bus, doesn’t possess any regularity and can rather be considered random. Therefore, spatial and temporal locality cannot be effectively exploited.
   a. Bus-invert code. It computes the Hamming distance (the number of changed bits) between the current value and the next value on the bus and inverts the next value if the distance is greater than half of the bit width. An additional signal indicates that the value is inverted.
   b. Transition signaling. In this scheme, a logical 1 is indicated by a level transition from 0 to 1 or from 1 to 0, while a logical 0 doesn’t cause a transition. This scheme ensures that the number of transitions on the bus is equal to the number of 1s, and it is effective for data where the number of 1s is less than the number of 0s.
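As an illustration of the bus-invert code described above, here is a minimal Python sketch. It assumes an 8-bit bus that starts at all zeros and compares each new word with the value currently driven on the wires; the function names and the transition counter are made up for this example.

```python
def bus_invert_encode(words, width=8):
    """Return (driven_value, invert_flag) pairs for a stream of data words."""
    mask = (1 << width) - 1
    prev = 0                                  # assumed initial state of the bus wires
    encoded = []
    for w in words:
        distance = bin((prev ^ w) & mask).count("1")   # Hamming distance
        invert = 2 * distance > width         # more than half of the bits would flip
        driven = (~w & mask) if invert else (w & mask)
        encoded.append((driven, int(invert)))
        prev = driven
    return encoded


def bus_invert_decode(encoded, width=8):
    """Undo the inversion using the invert flag."""
    mask = (1 << width) - 1
    return [(~d & mask) if inv else d for d, inv in encoded]


def count_transitions(values, width):
    """Total bit flips on a bus of the given width, starting from all zeros."""
    mask = (1 << width) - 1
    prev, total = 0, 0
    for v in values:
        total += bin((prev ^ v) & mask).count("1")
        prev = v
    return total


words = [0x00, 0xFF, 0x00, 0x0F]
enc = bus_invert_encode(words)
print(enc, bus_invert_decode(enc) == words)
```

Counting transitions on the raw stream versus the encoded stream (data lines plus the invert line) shows the saving for this particular sequence; random data benefits less.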
Coupling power is dissipated when crosstalk occurs between different wires of the bus. The following techniques are used to address this power dissipation:
1. Address bus encoding.
   a. Permutation of address bus lines is done at the physical design stage to reduce coupling. It can be achieved by orthogonal layout of the wires or by routing them through different metal layers.
2. Data bus encoding.
   a. CBI (coupling bus-invert). It is very similar to the bus-invert code scheme explained previously, but it inverts the data when doing so reduces the coupling activity between neighboring wires.
   b. Transition pattern coding scheme (TPC). It adds signals to the bus to encode codeword patterns in which neighboring lines change in phase.

For more power reduction schemes, you can refer to the book On-chip Communication Architectures.

Courtesy www.shellbr.com.

Metastability tolerant designs

In our posts "metastability" and "how a flip-flop goes metastable", we discussed the basics of metastability and what causes metastability failures in designs. We also discussed the impact of metastability failures on our designs. So, we are bound to think of ways of preventing metastability in designs. The only way to do so is not to let the input toggle during the setup-hold window. This can be done if we have completely synchronous designs and setup-hold timing is met for all timing paths. But every design is bound to have asynchronous signals, as everything in this world cannot run on a single clock. For example, when you press the reset button on a device, this has to be an asynchronous event since the event is generated by your body, which runs on a different clock than the device. :-)

Thus, in reality, we cannot prevent metastability. We can only reduce its occurrence or make our designs such that the occurrence of metastability does not affect the state machine.
  • Avoid metastability as much as possible: As discussed earlier, we can try to make the designs as synchronous as possible. Thus, by virtue of setup-hold requirements being met, metastability will not be much of an issue. Another possible solution is to decrease the frequency of the system. Fewer clock edges mean a lower probability of data being captured during the setup/hold window. However, do we really want to limit our designs' performance just for the sake of metastability? So, we must make our designs metastability tolerant.
  • Make our designs metastability tolerant as much as possible: The most common way to make designs metastability tolerant is to add synchronizer stages. Doing this, we allow certain flip-flops in the design to go metastable, but do not allow their metastability to impact the design by propagating to later stages. This does not guarantee perfect immunity to design failures, but, if carefully designed, it reduces the occurrence of design failures due to metastability to almost nil.

Asynchronous FIFO


An asynchronous FIFO is a bus synchronization technique that is agnostic to the frequency relationship between the source and destination clocks, and can therefore be considered practically universal.

It is convenient to make the write/read pointers one bit wider than needed for the FIFO size. The msb then plays the role of a “sign” (wrap) bit. The pointers (buses) are synchronized with the help of Gray encoding. Gray code encoding is a popular technique to synchronize a bus because only one bit changes at a time. This ensures that we always sample either the old or the new value on the bus and never an inconsistent one. “g2b” and “b2g” are the logic blocks that convert Gray code to binary and vice versa. It is out of the scope of this article to depict their design.
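Although the hardware design of these converters is not covered here, their behaviour is easy to show with a small Python sketch. The names b2g and g2b follow the text; the implementation below is the standard reflected binary Gray code and is only an illustrative model, not the logic used in any particular FIFO.

```python
def b2g(n: int) -> int:
    """Binary to Gray: each Gray bit is the XOR of two adjacent binary bits."""
    return n ^ (n >> 1)


def g2b(g: int) -> int:
    """Gray to binary: XOR together all right-shifted copies of the Gray value."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


# Consecutive 4-bit pointer values differ in exactly one Gray bit,
# including the wrap from 15 back to 0.
for value in range(16):
    nxt = (value + 1) % 16
    changed = bin(b2g(value) ^ b2g(nxt)).count("1")
    print(f"{value:2d} -> {nxt:2d}: gray {b2g(value):04b} -> {b2g(nxt):04b}, bits changed = {changed}")
```

Because only one bit changes per increment, a synchronizer that samples the pointer mid-transition sees either the old or the new value.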



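//write pointer (one bit wider than the FIFO address; the msb is the wrap bit)
always @(posedge src_clk or negedge rst_n)
      if (!rst_n)
                  wr_ptr <= 'd0;
      else if (push)
                  wr_ptr <= wr_ptr + 1'b1;
//read pointer
always @(posedge dst_clk or negedge rst_n)
      if (!rst_n)
                  rd_ptr <= 'd0;
      else if (pop)
                  rd_ptr <= rd_ptr + 1'b1;
//full: wrap bits differ and addresses match
//(rd_ptr_synch is the read pointer synchronized into the src_clk domain, in binary)
assign full = (wr_ptr[$clog2(FIFO_SIZE)] ^ rd_ptr_synch[$clog2(FIFO_SIZE)]) &&
              (wr_ptr[$clog2(FIFO_SIZE)-1:0] == rd_ptr_synch[$clog2(FIFO_SIZE)-1:0]);
//empty: pointers match including the wrap bit
//(wr_ptr_synch is the write pointer synchronized into the dst_clk domain, in binary)
assign empty = (wr_ptr_synch[$clog2(FIFO_SIZE):0] == rd_ptr[$clog2(FIFO_SIZE):0]);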
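The full and empty conditions can be exercised with a tiny Python model. This is a sketch under simplifying assumptions: the depth of 8 is arbitrary, the constant and function names are hypothetical, and synchronization latency is ignored, so both sides see each other's pointer immediately.

```python
FIFO_SIZE = 8                          # assumed depth, a power of two
ADDR_W = FIFO_SIZE.bit_length() - 1    # log2(FIFO_SIZE) address bits
PTR_MASK = (1 << (ADDR_W + 1)) - 1     # pointers carry one extra wrap bit


def is_full(wr_ptr, rd_ptr):
    """Wrap bits differ while the address bits are equal."""
    wrap_differs = ((wr_ptr ^ rd_ptr) >> ADDR_W) & 1
    same_address = (wr_ptr & (FIFO_SIZE - 1)) == (rd_ptr & (FIFO_SIZE - 1))
    return bool(wrap_differs) and same_address


def is_empty(wr_ptr, rd_ptr):
    """Pointers are equal, wrap bit included."""
    return wr_ptr == rd_ptr


wr = rd = 0
for _ in range(FIFO_SIZE):             # push until the FIFO fills up
    wr = (wr + 1) & PTR_MASK
print("after 8 pushes: full =", is_full(wr, rd), "empty =", is_empty(wr, rd))
for _ in range(FIFO_SIZE):             # pop everything back out
    rd = (rd + 1) & PTR_MASK
print("after 8 pops:   full =", is_full(wr, rd), "empty =", is_empty(wr, rd))
```

The extra pointer bit is what lets the model tell "full" apart from "empty", since the address bits alone are equal in both cases.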

The important thing to remember is that the size of the FIFO has to be exactly a power of two. For any other size, the pointer wrap-around changes several bits at once even with standard Gray code encoding, which violates the requirement that only one bit of the synchronized bus changes at a time.
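The wrap-around problem is easy to see numerically. The sketch below uses the standard reflected Gray code and counts how many Gray bits flip when a pointer with a given number of states wraps back to zero; the helper names are made up, and `states` stands for the number of pointer values (twice the FIFO depth when the extra wrap bit is used).

```python
def b2g(n: int) -> int:
    """Binary to reflected Gray code."""
    return n ^ (n >> 1)


def bits_changed_at_wrap(states: int) -> int:
    """Gray bits that flip when a counter wraps from states-1 back to 0."""
    return bin(b2g(states - 1) ^ b2g(0)).count("1")


for states in (16, 32, 10, 12):
    print(states, "states ->", bits_changed_at_wrap(states), "bit(s) change at wrap")
```

For power-of-two counts a single bit changes at the wrap, while the other counts shown here flip several bits at once, which is exactly the situation a synchronizer cannot sample safely.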


Half-handshake synchronization scheme

Synchronization questions are among the favorites of VLSI job interviewers. This is because they check not just the general intellectual abilities of the candidate but also very specific professional knowledge, which is usually acquired only by experience. When it comes to synchronization, there are plenty of schemes. During the interview, the discussion often arrives at the "ultimate" solution: a synchronizer that tolerates any source-destination conditions (relative frequencies, duration of signals, etc). The expected answer is the very well known full-handshake scheme. It is definitely the "ultimate" solution. But its extra-generic nature comes at the cost of a very long processing cycle (6 source + 6 destination cycles).

Less known is the half-handshake synchronization scheme, which differs from the full-handshake scheme in that it uses signal toggling, rather than signal level, as the indication that transfers synchronization information from side to side.



At both the source and destination sides, it is the toggling (a 0 to 1 signal change or vice versa) of the synchronized valid signal or ack signal that indicates the synchronized output may be issued and/or the state changed. A toggle can be detected by comparing (XOR) the next and current values of the signal. The current signal value needs to be latched at each processing cycle.

The half-handshake scheme provides a processing cycle 2 times shorter than the full-handshake scheme, because it consists of only a synchronization-acknowledge cycle rather than synchronization, acknowledge, synchronization de-assertion and acknowledge de-assertion.
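The XOR-based toggle detection described above can be modelled cycle by cycle in Python. This is a behavioural sketch only: the class name is hypothetical, the input is assumed to be already synchronized into the local clock domain, and the stored value is assumed to reset to 0.

```python
class ToggleDetector:
    """Emits a one-cycle pulse whenever the (synchronized) input changes level."""

    def __init__(self):
        self.prev = 0                    # value latched in the previous cycle

    def step(self, sig: int) -> int:
        pulse = sig ^ self.prev          # XOR of next and current value
        self.prev = sig                  # latch the value for the next cycle
        return pulse


# Each level change of the synchronized valid signal marks one transfer,
# whether it is a 0->1 or a 1->0 change.
synced_valid = [0, 1, 1, 0, 0, 0, 1]
detector = ToggleDetector()
pulses = [detector.step(v) for v in synced_valid]
print(pulses)
```

Since every edge carries information, the sender does not need to return the signal to 0 before the next transfer, which is where the shorter cycle comes from.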

How a latch/flip-flop goes metastable

In the post metastability, we discussed that an inverter loop can be put into a metastable state. Since latches and flip-flops consist of inverter loops controlled by transmission gates, they are also susceptible to metastability. For instance, consider a negative latch as shown in figure 1 and the clock waveform alongside. The instant of interest to us is when switch_1 closes, that is, the transparency_close edge of the latch. Also, we know from theory two important concepts, "setup time" and "hold time". Let us call the region between setup time and hold time the setup-hold-window. If the data toggles before the setup-hold-window, it is guaranteed to get captured and propagated to the latch output. If the data toggles after the setup-hold-window, it is guaranteed not to get propagated to the latch output. On the other hand, if it toggles during the setup-hold-window, it may or may not propagate to the output. Also, it may happen that when the switch closes, the input level is such that the latch goes into a metastable state.

Figure 1: How latch goes into metastable state
As is evident from figure 1, the input is still transitioning when the switch closes, and the output goes metastable.

Similarly, as a flip-flop is composed of latches in master-slave configuration, a flip-flop also goes metastable in the same way. In general, we can describe it as:

A flip-flop/latch has a defined timing requirement in terms of when data should be available at its input so that it is correctly captured. These requirements are termed as setup and hold times. If these requirements are not met, there is a possibility of flip-flop going metastable.

In general, following are the scenarios which can cause a flip-flop's output to go metastable.

  • Asynchronous timing paths: Data on paths crossing clock domains, where the launch and capture clocks do not have a definite phase relationship, cannot be guaranteed to be captured outside the setup/hold window.
  • If there is a timing path violating setup and/or hold, then the capturing flip-flop will go metastable at a certain PVT, where the data is likely to be captured within the setup-hold-window.

How metastability impacts design: Let us assume that the output of the flip-flop goes to a number of gates (say 100). So, as long as the flip-flop output is in the metastable range, it will cause short circuit current to flow in all of these gates. This link shows the short circuit current to be in the range of 100 uA. So, a large amount of short circuit current will flow for a considerable amount of time.


What helps a flip-flop come out of metastability?

As described earlier, it is theoretically possible for the flip-flop to remain in the metastable state for infinite time in the absence of any disturbance. However, there are certain factors which help it come out of metastability.


  • Ability of the inverter pair to detect a disturbance and act on it: If the inverter pair is able to detect even the smallest of disturbances, it will act upon it and eventually come out of metastability. So, having these characteristics for the transistors in the inverter pair will help:
    • Low VT
    • High drive strength
  • The more time available for metastability resolution, the higher the chance of a disturbance occurring; hence, the flip-flop will eventually come out of metastability

In general, the ability of a flip-flop to come out of metastability is measured by a parameter known as MTBF (Mean Time Between Failures). It can be thought of as the inverse of the failure rate. The higher the MTBF, the higher the probability of the flip-flop coming out of metastability within a given time. It depends upon:
  • Technology factors
  • Time available to resolve metastability: The more time available, the higher the MTBF
  • Frequency of the clock received by the flip-flop: The higher the clock frequency, the lower the MTBF
  • Frequency of toggling of the data received by the flip-flop: The higher the data frequency, the lower the MTBF
  • Internal design of the flip-flop: Ability of the flip-flop to act on the smallest of disturbances, as discussed earlier. In general, a flip-flop consuming more power and having high gain will be able to come out of metastability quickly
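These dependencies are captured by a commonly used first-order model, MTBF = exp(t_r / tau) / (T_w * f_clk * f_data). The formula is not given in this post, so treat the sketch below as an assumption-laden illustration: tau and T_w stand for technology- and design-dependent constants of the flip-flop, and every numeric value is made up purely to show the trends.

```python
import math


def mtbf_seconds(t_resolve, tau, t_window, f_clk, f_data):
    """First-order synchronizer MTBF model (all arguments in SI units).

    t_resolve : time available for metastability to resolve
    tau       : resolution time constant of the flip-flop (technology/design)
    t_window  : metastability capture window of the flip-flop
    f_clk     : frequency of the capturing clock
    f_data    : toggle rate of the asynchronous data
    """
    return math.exp(t_resolve / tau) / (t_window * f_clk * f_data)


# Illustrative, made-up numbers only.
base = dict(t_resolve=1e-9, tau=20e-12, t_window=50e-12, f_clk=500e6, f_data=50e6)
print("baseline         :", mtbf_seconds(**base))
print("more resolve time:", mtbf_seconds(**{**base, "t_resolve": 2e-9}))
print("faster clock     :", mtbf_seconds(**{**base, "f_clk": 1e9}))
print("faster data      :", mtbf_seconds(**{**base, "f_data": 100e6}))
```

The exponential term explains why giving the flip-flop extra resolution time, for example by adding a synchronizer stage, improves MTBF far more than any linear change in clock or data frequency.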