Design problem: Clock gating for a shift register

Problem: There is a 4-bit shift register with parallel read and write capability, as shown in the diagram. We need to find an opportunity to clock gate the module.

Mode selection bits ("S1" and "S0") control the operation of this shift register with the following settings:

Solution: From the basics of clock gating, we know that if the state of a flip-flop is not changing, there lies an opportunity to gate its clock. Observing the table, we see that the state of all the flip-flops does not change when "S1,S0" is either "00" or "11". So, when the mode selection bits take these values, we can gate the clock to this shift register. In other words, the clock should reach the module only when (S1 xor S0) is equal to 1.
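As a rough illustration, here is a small behavioral sketch in Python (not RTL) of this gating condition. The shift directions assumed for modes "01" and "10" are only for illustration; the key point is that a clock edge is needed only when (S1 xor S0) equals 1.

    def clock_enable(s1, s0):
        """The gated clock reaches the register only when exactly one select bit is high."""
        return s1 ^ s0

    def next_state(q, s1, s0, serial_in=0):
        """Next state of the 4-bit register q[0..3] at a clock edge."""
        if not clock_enable(s1, s0):         # S1S0 = 00 or 11: state does not change, clock can be gated
            return list(q)
        if (s1, s0) == (0, 1):               # assumed mode: shift towards q[0]
            return list(q[1:]) + [serial_in]
        return [serial_in] + list(q[:-1])    # assumed mode (10): shift towards q[3]

    q = [1, 0, 1, 1]
    print(next_state(q, 0, 0))   # S1S0 = 00 -> [1, 0, 1, 1]: unchanged, so the clock edge can be gated
    print(next_state(q, 0, 1))   # S1S0 = 01 -> [0, 1, 1, 0]: state changes, so the clock must reach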


Can you relate the timing of S1 and S0? Should they come from a positive edge-triggered flip-flop or a negative edge-triggered flip-flop? Clock gating checks explain the timing of clock gating signals with respect to the clock.

MOS transistor structure

A MOSFET (Metal Oxide Semiconductor Field Effect Transistor), or MOS as it is commonly called, is an electronic device which converts a change in input voltage into a change in output current. The basic structure of a MOS transistor (as seen sideways) is shown in figure 1. The substrate is a lightly doped semiconductor. The source and drain regions are heavily doped regions of type opposite to the substrate. In between the source and drain is a region called the channel. Above the channel is a very thin layer of oxide.

The input voltage is applied to the "Gate" terminal. If sufficient voltage is applied at the gate terminal, a channel gets formed between the source and drain terminals. Depending upon the nature of the channel formed, the MOS is termed N-MOS or P-MOS.

N-MOS: For an N-MOS, the substrate is P-type and the source and drain regions are N-type. Application of a positive voltage at the Gate terminal with respect to the substrate results in the formation of a channel of electrons.

P-MOS: For a P-MOS, the substrate is N-type and the source and drain regions are P-type. Application of a negative voltage at the Gate terminal with respect to the substrate results in the formation of a channel of holes.
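As a rough first-order sketch of this channel-formation condition, consider the Python snippet below; the threshold voltages of +/-0.4 V are assumed values for illustration, not taken from the text.

    def channel_formed(device, v_gs, v_th_n=0.4, v_th_p=-0.4):
        """First-order check for channel formation (gate voltage taken w.r.t. the substrate)."""
        if device == "NMOS":
            return v_gs > v_th_n      # sufficiently positive gate voltage: channel of electrons
        if device == "PMOS":
            return v_gs < v_th_p      # sufficiently negative gate voltage: channel of holes
        raise ValueError("device must be 'NMOS' or 'PMOS'")

    print(channel_formed("NMOS", 0.9))    # True
    print(channel_formed("PMOS", -0.9))   # True
    print(channel_formed("NMOS", 0.1))    # False: below threshold, no channel forms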


What is the difference between a normal buffer and clock buffer?

A buffer is an element which produces an output signal of the same value as the input signal. We can also refer to a buffer as a repeater, which repeats the signal it receives, just as there are repeaters in telephone signal transmission lines. You must have noticed that two kinds of buffers (or of any logic gate) are available in standard cell libraries:

  • Clock buffer: Clock buffers are designed to have specific properties that are good for clock distribution networks (clock trees). The properties required in an ideal clock tree buffer are given below. However, it is not possible to attain these ideal properties for every buffer at every technology node; it may only be possible to get close to them.
    • Equal rise and fall times
    • Low delay
    • Low delay variation with PVT and OCV
  • Normal buffer/data buffer: For a data buffer, the above properties are usually less critical.
Usually, the following differences exist between a clock buffer and a normal buffer:
  • In SoCs, clock routing is done in higher metal layers than signal routing. So, to provide easier access to clock pins from these layers, clock buffers may have their pins in higher metal layers; that is, the vias are provided within the standard cell itself instead of having to be built into the clock distribution network. For a data buffer, the pins are expected to be in lower layers only.
  • Clock buffers are balanced. In other words, the rise and fall times of clock buffers are nearly equal. The reason is that if the clock buffers are not balanced, there will be duty cycle distortion in the clock tree, which can lead to pulse width violations as discussed in the minimum pulse width violation example (a small sketch of this effect follows this list). On the other hand, data buffers can compromise on either the rise or the fall time. In other words, they do not need a PMOS:NMOS size ratio of 2:1 and hence can be smaller than clock buffers.
  • Due to the above reason, clock buffers consume more power than normal buffers.
  • Generally, you will find clock buffers with higher drive strength than normal buffers, so that a clock buffer can drive long nets and higher fanouts. This helps clock buffers, and hence clock trees, have less overall delay.
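To see why balanced rise and fall delays matter for a clock buffer, here is a small back-of-the-envelope sketch in Python of how duty-cycle distortion accumulates along a chain of non-inverting buffers; the delay numbers are assumed, not taken from any library.

    def high_pulse_width_after_chain(high_width_ns, d_rise_ns, d_fall_ns, n_stages):
        """High pulse width after n non-inverting buffer stages.

        Each stage delays the rising edge by d_rise_ns and the falling edge by
        d_fall_ns, so the high phase stretches or shrinks by their difference.
        """
        return high_width_ns + n_stages * (d_fall_ns - d_rise_ns)

    # 1 GHz clock, 50% duty cycle -> 0.5 ns high phase, passing through 10 buffer stages
    print(high_pulse_width_after_chain(0.5, d_rise_ns=0.05, d_fall_ns=0.05, n_stages=10))  # 0.5 ns: balanced
    print(high_pulse_width_after_chain(0.5, d_rise_ns=0.04, d_fall_ns=0.06, n_stages=10))  # 0.7 ns: distorted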

Performance gain with latches

The property of latches being transparent gives them a basic characteristic, known as time borrowing, owing to which they can capture data over a period of time rather than at an instant. Using this property intelligently can result in a performance advantage in specific design scenarios, especially for designs having asymmetric data paths in subsequent stages. Let us elaborate with the help of an example.
Let us suppose a design having two pipeline stages with combinational logic delays of 12 ns and 5 ns respectively, as shown in figure 1 below:

Figure 1: 2-stage pipelining

If we assume the clock period to be 16 ns (half cycle being 8 ns), then each latch stage will borrow time from the subsequent stage as shown in the figure below.

Now, since all the registers get the same clock signal, with REGB triggered on the opposite clock edge to REGA and REGC, each combinational stage gets only half the clock period. The minimum half clock period is therefore the maximum of the combinational delays from REGA to REGB and from REGB to REGC:

Tclk/2 > MAX (Tcomb(REGA->REGB), Tcomb(REGB->REGC))



Thus, this circuit cannot run with a half clock period less than 12 ns, i.e., a clock period less than 24 ns.

This situation can be eased if we replace REGB with a negative level-sensitive latch. Let us have a look at figure 2 below. Although the number of stages remains the same, LATB can borrow time from the next stage without changing any logic.

Figure 2: Latch replacing register in the 2-stage pipelining
The same is shown in figure 3 below with the help of a waveform. The clock now has a half period of 9 ns (a period of 18 ns). The latch can borrow 3 ns from the next stage, still meeting setup by 1 ns. Thus, we have succeeded in reducing the half period from 12 ns to 9 ns (the clock period from 24 ns to 18 ns) just by changing the register to a latch. This is how a latch can help gain performance.
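The arithmetic above can be checked with the small Python sketch below. It assumes REGA launches at the rising edge, LATB is transparent during the low phase, REGC captures at the next rising edge, and setup/hold and clock-to-q delays are zero; the function name is just for illustration.

    def latch_pipeline_slack(half_period_ns, t_comb1_ns, t_comb2_ns):
        """Time borrowed by the first stage and setup slack at the capturing register."""
        period = 2 * half_period_ns
        arrival_at_latch = t_comb1_ns                     # data launched by REGA at t = 0
        borrowed = max(0.0, arrival_at_latch - half_period_ns)
        arrival_at_regc = arrival_at_latch + t_comb2_ns   # latch is transparent, data flows through
        return borrowed, period - arrival_at_regc

    # Flip-flop-only baseline: each stage must fit in half a period -> 12 ns half period, 24 ns period
    print(max(12, 5))

    # With LATB as a latch:
    print(latch_pipeline_slack(9, 12, 5))   # (3.0, 1.0): borrows 3 ns, meets setup by 1 ns
    print(latch_pipeline_slack(8, 12, 5))   # (4.0, -1.0): a 16 ns clock period would still fail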

If there are multiple latch stages in series, each can borrow from the subsequent stage such that overall timing is met. For example, figure 3 shows 6 latches in series.


How delay of a standard cell changes with drive strength

A standard cell (let us say a buffer) can be represented as shown in figure 1 below, where 
R = Channel resistance 
Cds = Drain-to-source capacitance (internal capacitance of cell)
Cload = Load capacitance


So, the RC time constant can be represented as "R * (Cds + Cload)".

What happens on increasing the drive strength? In our post "what is meant by drive strength", we discussed that the drive strength of a standard cell increases when we increase the size of its transistors. So, basically, a cell with drive strength 2X will have twice the width of one with drive strength 1X.
And we know that:
Channel resistance is inversely proportional to "W".
Drain-to-source capacitance is directly proportional to "W".
So, upon doubling the drive strength, the internal capacitance doubles and the channel resistance halves. The same is depicted in figure 2 below.


Time constant of "1X" buffer = R * (Cds + Cload)
 Time constant of "2X" buffer = R/2 * (2Cds + Cload) 
Now, let us talk of following scenarios:

Special case 1: Load capacitance is negligible.
In this scenario, we are left with only internal resistance and capacitance of the cell.

Time constant of "1X" buffer = R * Cds
Time constant of "2X" buffer = R * Cds
So, in this case, increasing the drive strength of the standard cell should have no impact on its delay. Hence, when there is negligible load, we should not upsize the standard cell. Doing so may instead increase the overall path delay, as the higher drive strength cell presents a larger load to the previous stage cell, thereby increasing the delay of the previous stage.

Special case 2: Load capacitance is very large as compared to internal capacitance.
In this scenario,
Time constant of "1X" buffer = R * Cload
Time constant of "2X" buffer = (R * Cload ) / 2 
So, the "2X" buffer will take approximately half the time to charge the load capacitance as compared to the "1X" buffer.

So, we see that the maximum possible benefit in delay from increasing the drive strength of a standard cell is a reduction by a factor of two. In the worst case, we may not see any benefit at all.
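As a quick numerical check of these two limits, the Python sketch below sweeps the load and prints the ratio of the "1X" and "2X" time constants; R and Cds are normalized to 1, which is an assumption made only for illustration.

    R, CDS = 1.0, 1.0     # normalized channel resistance and internal capacitance of the 1X cell

    for cload in [0.0, 0.5, 1.0, 5.0, 50.0, 1000.0]:
        tau_1x = R * (CDS + cload)                # time constant of the 1X cell
        tau_2x = (R / 2) * (2 * CDS + cload)      # time constant of the 2X cell
        print(f"Cload = {cload:6.1f} * Cds  ->  tau_1X / tau_2X = {tau_1x / tau_2x:.3f}")

    # The ratio approaches 1.0 for negligible load (no benefit from upsizing) and
    # approaches 2.0 for a very large load (delay roughly halves).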

We can also look at the above equation by splitting the cell delay into two components:
  1. Cell delay due to its own intrinsic capacitance: It does not scale with drive strength and is constant for a given type of standard cell.
  2. Cell delay due to external load capacitance: It is variable and decreases as we increase the drive strength of the standard cell.