Clock Tree Synthesis (CTS)

Admin
August 26, 2020

Clock – A periodic signal that ideally has a 50% duty cycle (equal high and low times). It controls data propagation through clocked elements such as flip-flops and latches. The clock source usually sits at the top level of the design, and the clock propagates from there. Stable sources such as PLLs and oscillators are normally used to generate the clock.

Because the clock sets the frequency at which data moves through the design, we need to be very careful when creating clocks. The SDC section of the physical design inputs already covered how clocks are created. Once a clock is created, it must be propagated so that all the clocked elements in the design switch at (nearly) the same time. To achieve this we need to balance the clocks, and this is the role of Clock Tree Synthesis in physical design. Let’s talk about this topic in detail.

Inputs of Clock Tree Synthesis

1. Placement DB
2. CTS Spec File

Placement DB:

The placement DB contains the placement-completed netlist, DEF, LIB, LEF, SDC, UPF and the other files carried over from the placement database. It can be a zipped file. This DB is also known as the PLACE EXIT DB, which means that no further standard cell placement or related work is done from here onward.

CTS Spec File:

CTS spec file contains the below information:

1. The inverters or buffers to be used to balance the clock tree.
2. CTS exceptions (end points of the clock tree).
3. Skew group information.
4. Target skew, maximum target transition and other timing constraints for the clock tree.
5. Top and bottom routing layers, and the vias to be used during clock routing.
6. Clock-related information (generated clocks, e.g. clock dividers, clock multipliers etc.).
7. NDR (non-default rule) definitions.

Clock Tree Synthesis Overview

  • Clocks are used to synchronize data communication. Before clock tree synthesis, the clock is treated as ideal: every sink is assumed to see the same delay from the clock source, so there is no skew.
  • Clock tree synthesis (CTS) is the automatic insertion of buffers/inverters along the clock paths of the ASIC design to balance the clock delay to all clock inputs. In effect, the clock gets evenly distributed to all the sequential elements in the design.
  • There are a number of structures used to build the clock tree:
    • H Tree
    • Clock Mesh
    • Spine
    • Fish bone

In recent designs, the H-tree algorithm is commonly used to complete clock tree balancing. Let's go into the details of the H-tree algorithm.

Algorithm steps for the H-Tree

  • Find all the flops present in the design.
  • Find the center of all the flops.
  • Trace the clock from the clock port to this center point.
  • Divide the core into two parts and trace from the center to the center of each part.
  • From each of these centers, divide the area into two again and trace to the center of each new part.
  • Repeat until the flop clock pins are reached.
Figure: H-tree Structure in Clock Tree Synthesis
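
The H-tree steps listed above can be sketched in a few lines of Python. This is only a minimal illustration under simple assumptions, not how a CTS tool implements it: flops are bare (x, y) points, the "center" is taken as the average of the flop coordinates, the split direction alternates between x and y, and the function names (`center`, `h_tree_segments`, `_split`) are made up for this example.

```python
from statistics import mean

def center(points):
    """Average position of a non-empty list of (x, y) flop locations."""
    return (mean(x for x, _ in points), mean(y for _, y in points))

def _split(flops, parent, split_on_x, segments):
    """Divide the flops into two halves and trace from parent to each half."""
    axis = 0 if split_on_x else 1
    ordered = sorted(flops, key=lambda p: p[axis])
    half = len(ordered) // 2
    for part in (ordered[:half], ordered[half:]):
        if len(part) == 1:
            # Only one flop left: trace straight to its clock pin.
            segments.append((parent, part[0]))
        else:
            part_center = center(part)
            segments.append((parent, part_center))
            # Alternate the split direction at every level.
            _split(part, part_center, not split_on_x, segments)

def h_tree_segments(clock_port, flops):
    """Return the (from_point, to_point) segments to trace, port first."""
    root = center(flops)
    segments = [(clock_port, root)]
    if len(flops) > 1:
        _split(flops, root, True, segments)
    return segments

# Four flops on the corners of a 10 x 10 core, clock port below the core.
flops = [(0, 0), (0, 10), (10, 0), (10, 10)]
segs = h_tree_segments((5, -5), flops)
for start, end in segs:
    print(start, "->", end)
```

With four flops on the corners of a square, the traced segments form the familiar "H": one segment from the port to the core center, two horizontal arms, and four vertical arms ending on the flop clock pins.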

Standard H-Tree advantages:

  • Very good cross-corner scaling behavior.
  • Balanced by construction.
  • Assumes shielded clock routes and an insignificant effect of congestion.
  • Lower power than mesh.

Standard H-tree disadvantages:

  • Needs a power-of-two number of sinks (tap buffers).
  • Needs a rectangular, unblocked area.
  • Higher power than an ad-hoc CTS tree.

Flexible H-tree:

  • Generalization of the H-tree.
  • Flexibility in sink placement.
  • Non-rectangular floorplans with multiple blockages are supported.
  • Intelligent trade-offs are made between skew and power.
  • Intended to be used for the top of the tree, not the whole tree.

Flexible H-tree – Limitations:

You will need to determine where to insert the H-tree(s) in the clock architecture:

  • Requires an understanding of the clock architecture
  • User-specified root pin
  • User-specified pre-existing leaf pins, or a sink grid area for new leaf buffers
  • Newly inserted leaf buffers become tap buffers for multi-tap CTS

The flexible H-tree contains only buffers or inverters:

  • No logic or gating

Before building the clock tree, we need to run sanity checks that look for the following:

- The design is placed, i.e. Place Exit is complete.
- Clocks have been defined.
- Clock roots are not on hierarchical pins.

Clock Tree Synthesis Steps

The following steps are performed during Clock Tree Synthesis:

  • Clustering
  • DRV Fixing
  • Insertion Delay Reduction
  • Power Reduction
  • Balancing
  • Post-Conditioning

- Clustering

Sinks are grouped based on their geometric locations, and the skew groups are created as described in the spec file.

- DRV Fixing

At this stage, design rule violations (DRVs: max_tran, max_cap, max_length, max_fanout) are fixed.

- Insertion Delay Reduction

At this stage, insertion delay is minimized as far as possible, which is one of the main goals of Clock Tree Synthesis.

- Power Reduction

The clock is a major power consumer, so the tree is analyzed and optimized to reduce its power consumption.

- Balancing

The main balancing happens at this stage, using different clock buffers and inverters.

- Post-Conditioning

At this stage the DRVs are checked again and fixed if required.
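
To make the Balancing step a little more concrete, here is a small Python sketch of the basic idea: find the slowest sink and work out how much delay each of the other branches would need to match it. The sink names and latency numbers are invented for illustration, and a real tool also speeds up slow branches and respects DRVs instead of only padding the fast ones.

```python
def balancing_delays(sink_latency_ps):
    """Extra delay (ps) each sink's branch needs to match the slowest sink.

    sink_latency_ps maps a sink name to its insertion delay in picoseconds.
    """
    target = max(sink_latency_ps.values())
    return {sink: target - latency for sink, latency in sink_latency_ps.items()}

def global_skew(sink_latency_ps):
    """Difference between the largest and smallest insertion delay."""
    values = sink_latency_ps.values()
    return max(values) - min(values)

# Hypothetical sinks and latencies, not taken from any real design.
latencies = {"ff1/CK": 120, "ff2/CK": 150, "ff3/CK": 135}
padding = balancing_delays(latencies)
balanced = {sink: latencies[sink] + padding[sink] for sink in latencies}

print(padding)
print(global_skew(latencies), global_skew(balanced))
```

In this example the fastest branch needs 30 ps of extra delay and the skew drops from 30 ps to zero, at the cost of a higher average insertion delay. That is why balancing and insertion delay reduction have to be traded off against each other.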

CTS Quality Checks

The following quality checks apply to Clock Tree Synthesis:

  • Minimize Insertion Delay
  • Skew Balancing
  • Duty Cycle
  • Pulse Width
  • Clock Tree power consumption
  • Signal Integrity and Crosstalk

Let’s discuss these topics in detail:

Minimize Insertion Delay

Figure: Source Latency and Network Latency

Advantages of low latency:

  • Fewer buffers, hence lower power consumption. Clock paths are among the highest power-dissipating paths in the design.
  • Reduced cell area, since there are fewer buffers in the clock path.
  • Less runtime, since fewer buffers need to be inserted, which also reduces optimization effort.

Skew Balancing

Skew: Skew is the difference in clock arrival time between two clock pins, for example the clock pins of a launch flop and a capture flop. Let’s understand more through the picture.

Figure: Skew between launch flop and capture flop
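
As a small worked example, the Python sketch below computes the insertion delay of two flops as source latency plus network latency, and then the skew between them. The numbers are made up, and the sign convention (capture minus launch, so a later capture clock gives positive skew) is the commonly used one but is an assumption here, since the original picture is not reproduced.

```python
def insertion_delay(source_latency_ps, network_latency_ps):
    """Total clock latency at a pin: source latency plus network latency."""
    return source_latency_ps + network_latency_ps

def skew(launch_latency_ps, capture_latency_ps):
    """Skew between two flops, taken as capture arrival minus launch arrival."""
    return capture_latency_ps - launch_latency_ps

# Both flops share the same source latency; only the network part differs.
launch = insertion_delay(source_latency_ps=50, network_latency_ps=120)
capture = insertion_delay(source_latency_ps=50, network_latency_ps=135)

print(launch, capture, skew(launch, capture))
```

Here the capture clock arrives 15 ps after the launch clock, so the skew is +15 ps. Swapping the two network latencies would give -15 ps.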

Useful Skew

Useful skew is a very important concept in CTS. Let's discuss it through an example.

Figure: Before Useful Skew

In the above picture, the first path has a positive skew of 15 ps, the second path has a negative skew of 5 ps and the third path has a positive skew of 5 ps. Among these three paths, the negative path can borrow from the first path, which has 15 ps to spare, so that the skew is balanced between the paths. After borrowing, as the picture below shows, the numbers are positive for all three paths. This concept is called useful skewing, and the skew that was borrowed is the useful skew.

Figure: After Useful Skew
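
The borrowing idea can be sketched in Python as below. This is an illustration under stated assumptions: the three numbers (+15, -5, +5 ps) are treated as setup margins of three back-to-back paths, borrowing is modelled as shifting the clock of the flop shared by two adjacent paths, and the borrowed amount of 10 ps is chosen for the example and is not taken from the pictures. The function name `borrow_margin` is hypothetical.

```python
def borrow_margin(margins_ps, donor, receiver, amount_ps):
    """Move amount_ps of margin from one path to an adjacent path.

    Shifting the clock of the flop shared by the two paths takes margin
    from the donor path and gives exactly the same amount to the receiver.
    """
    if abs(donor - receiver) != 1:
        raise ValueError("paths must be adjacent (share a flop)")
    if amount_ps > margins_ps[donor]:
        raise ValueError("donor path would become negative")
    result = list(margins_ps)
    result[donor] -= amount_ps
    result[receiver] += amount_ps
    return result

before = [15, -5, 5]
after = borrow_margin(before, donor=0, receiver=1, amount_ps=10)
print(before, "->", after)
```

The total margin across the paths does not change. It is only redistributed so that no path is left negative. Hold margins move in the opposite direction when a clock is shifted, so hold timing has to be rechecked after useful skew is applied.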

Duty Cycle & Pulse Width

https://data.ivlsi.com/duty-cycle-pulse-width-vlsi-physical-design/

Clock Tree Power Consumption

The clock network is the most heavily switching net in the design.

Clock tree power depends on the two factors below:

1. Latency: Lower latency means fewer buffers in the design, and therefore lower power consumption.
2. Transition: Good (sharp) transitions mean lower power consumption, since slow transitions increase the short-circuit power of the driven cells. Transition is often the cause of power losses in the design, so it has to be traded off against the other targets.

Signal Integrity & Crosstalk

https://data.ivlsi.com/signal-integrity-and-crosstalk-vlsi-physical-design/

Runtime:

  • Runtime is the time taken to build the clock tree in the design, optimize it against the QoR targets and route the clock nets.
  • During optimization, tighter constraints cost more time, because more effort is needed to meet the targets after the clock tree is built. Both setup and hold have to be taken care of, so timing optimization takes longer.

Clock Tree Structure

The clock tree is divided into three parts (Top, Trunk and Leaf) to make it easier to reason about CTS quality and balancing. The picture below shows the clock tree structure.

Figure: Clock Tree Structure

Clock Tree Network

A clock tree can be described as follows:
- It is the transitive fanout of a root pin.
- A sink can belong to more than one clock tree.

Root pins: The starting points of the clock signal.
Internal pins: The pins through which the clock propagates from root to sink.
Sink pins: The terminal points of a clock signal: sequential element clock pins, stop pins or ignore pins.
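
The two statements above (a clock tree is the transitive fanout of a root pin, and a sink can belong to more than one tree) can be illustrated with a small graph walk in Python. The connectivity below is a made-up toy netlist with two clock roots feeding a shared mux, and the `fanout` dictionary is just an assumed data structure, not a tool format.

```python
from collections import deque

def clock_tree_sinks(fanout, root, stop_pins=frozenset()):
    """Collect the sink pins in the transitive fanout of a root pin.

    fanout maps a pin name to the list of pins it drives. A pin is a sink
    if it drives nothing further or if it is listed in stop_pins.
    """
    sinks, seen, queue = set(), {root}, deque([root])
    while queue:
        pin = queue.popleft()
        driven = fanout.get(pin, [])
        if pin in stop_pins or not driven:
            sinks.add(pin)
            continue
        for nxt in driven:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sinks

# Toy netlist: clkA and clkB both reach ff3/CK through a clock mux.
fanout = {
    "clkA": ["buf1/Y", "mux/Y"],
    "clkB": ["ff2/CK", "mux/Y"],
    "buf1/Y": ["ff1/CK"],
    "mux/Y": ["ff3/CK"],
}
sinks_a = clock_tree_sinks(fanout, "clkA")
sinks_b = clock_tree_sinks(fanout, "clkB")
print(sorted(sinks_a), sorted(sinks_b), sorted(sinks_a & sinks_b))
```

In this toy case `ff3/CK` shows up in both trees, which is exactly the situation where skew groups and clock tree exceptions are needed to tell the tool how to balance it.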

Clock Tree Exceptions

The following clock tree exceptions are available:

Stop Pin – No buffer/inverter insertion beyond this point (don’t-touch scenario)
Ignore Pin (Float Pin) – No DRV fixing, no balancing
Exclude Pin – DRV fixing but no balancing
Through Pin – DRV fixing as well as balancing

Timing Analysis and fixing

The best approach is to look at the problem and analyse the root cause before trying to solve it. So let's first look at a few sets of issues that can cause timing violations after clock tree synthesis.

The following can be reasons for broken timing; let’s investigate them in detail:

  • Clock Latencies, skews and uncertainties:
    • How large is the uncertainty compared to the clock period?
    • Are the different clocks correctly balanced (unless there are false paths)?
  • Cell distribution over the path:
    • Are there suspiciously long buffering chains (>10 buffers back to back)?
    • Is it a short (< 5 instances) or a deep logic level path (> 30 instances)?
    • Are the drive strengths chosen correctly?
    • Are the correct library cells being used (fast cells for timing critical paths)?
  • Net load, slew, fanout and wire length:
    • Are there unexpectedly large fanouts (> 50) or long nets (> 1000 um)?
    • Are there nets with unexpectedly large load or slew compared to other nets?
  • Instance and net delay:
    • Are there instances or nets with unexpectedly large delay (> 5x) compared to others?
  • Net and Cell derating:
    • Are the derating values realistic? (between 0.8 and 1.2)
  • Congested regions: Is there a congested area where the clock buffers/inverters could not be placed, so that they ended up far away? If so, that area needs to be de-congested so that the clock buffers/inverters get a proper physical location. Both placement and routing congestion can break timing.
  • Is the placement compact or widely spread over the floorplan?
  • Are the instances correctly spread from the start point to the endpoint?
  • Is the path topology straight or detoured?
  • Are the channels created during floorplanning sized correctly?
  • Does the path cross power domains, or is it detoured because of the large macros present in the design?
  • Power domain shapes often matter. It's good to check the shape of each power domain and ensure that it’s not too big or too small. I have personally faced this issue many times, so take care of the power domain shape in order to converge timing.
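
Several of the checks above come with rule-of-thumb numbers (more than 10 buffers back to back, fewer than 5 or more than 30 instances, fanout above 50, nets longer than 1000 um, delay more than 5x the others, derating outside 0.8 to 1.2). The Python sketch below turns those thresholds into a simple path screen. The `path` dictionary layout, the cell type names and the use of the median as the "typical" delay are assumptions made for this example; a real flow would fill this from the tool's timing report.

```python
from statistics import median

def flag_path(path):
    """Return a list of warnings for one timing path (hypothetical format).

    path["cells"]          - list of (cell_type, delay_ps), start to end
    path["fanouts"]        - fanout of each net on the path
    path["net_lengths_um"] - length of each net on the path
    path["derates"]        - derating factors applied on the path
    """
    warnings = []

    # Longest run of back-to-back buffers.
    run = longest = 0
    for cell_type, _ in path["cells"]:
        run = run + 1 if cell_type == "buf" else 0
        longest = max(longest, run)
    if longest > 10:
        warnings.append("long buffer chain")

    depth = len(path["cells"])
    if depth < 5:
        warnings.append("short path")
    elif depth > 30:
        warnings.append("deep logic path")

    if max(path["fanouts"]) > 50:
        warnings.append("large fanout")
    if max(path["net_lengths_um"]) > 1000:
        warnings.append("long net")

    delays = [delay for _, delay in path["cells"]]
    if max(delays) > 5 * median(delays):
        warnings.append("outlier cell delay")

    if any(not 0.8 <= d <= 1.2 for d in path["derates"]):
        warnings.append("unrealistic derate")
    return warnings

# Made-up path: 12 buffers in a row, one high-fanout long net, one slow cell.
example = {
    "cells": [("buf", 20)] * 12 + [("nand", 20), ("inv", 200)],
    "fanouts": [4, 60],
    "net_lengths_um": [20, 1500],
    "derates": [1.1],
}
print(flag_path(example))
```

A screen like this does not fix anything by itself, but it quickly narrows down which of the questions above is worth opening the layout for.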

Having looked at the possible reasons, let's see the steps we can use to debug timing at the post-CTS stage:

  • Look at the worst negative slack path in the tool's timing debug window.
  • Check the path and understand the driver and receiver cells and the cells present in the path.
  • Look at the insertion delay and check whether some cells have a high delay value.
  • Check the location of those cells in the design. Are they too far from the driver? Does the net have a detour?
  • Has the leaf (driver) cell got stuck in a channel between macros or near an area of high placement congestion?
  • Find out why the cell has a high insertion delay and got detoured. If it sits in a channel between macros, block that area with a hard blockage if possible.

Integrated Clock Gating (ICG) Cell and related concepts

https://data.ivlsi.com/integrated-clock-gating-icg-cell-and-related-concepts-vlsi-physical-design/

Clock Tree Route and NDR

Once clock tree balancing is done and all the clock cells are placed, we do the final routing of the clock nets and fix them. After clock routing, whatever routing resource is left is used for signal routing. Now the question is which metal layers to use and how to route so as to avoid issues in later stages.

Generally, we use mid layers for clock routing, for example M5-M8 in TSMC 7nm, where metal layers M0-M13 are available. Higher metal layers have lower resistance than lower metal layers, so why stop at M5-M8?

Normally M12, M11, M10 and M9 are used for power routing, and the remaining layers are available for clock routing. Reaching the higher layers also needs a tall stack of vias, which is another issue. Hence M5-M8 is the best choice for clock routing.

The foundry defines a default metal width and spacing for routing, but if required we change the width and spacing of the metal layers to achieve our PPA target. We can increase the width and spacing, but cannot go below what the foundry has specified.

During clock tree building our targets are timing and power: the clock is a continuously switching signal, so it consumes more power. To achieve good PPA we therefore use a non-default routing rule (NDR). Normally we use double width and double spacing for clock routing, but this may change depending on the requirement.
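
As a rough sketch of what a double-width, double-spacing NDR means in numbers, the Python snippet below scales a table of default per-layer rules and refuses any multiplier that would go below the foundry default. The layer widths and spacings are placeholder values, not real foundry numbers, and `make_ndr` is a hypothetical helper rather than a tool command.

```python
def make_ndr(default_rules, width_mult=2.0, spacing_mult=2.0):
    """Scale per-layer default (width_um, spacing_um) pairs into an NDR.

    Multipliers below 1.0 are rejected, because an NDR may widen the
    default rule but must not go below what the foundry specifies.
    """
    if width_mult < 1.0 or spacing_mult < 1.0:
        raise ValueError("NDR cannot go below the foundry default")
    return {
        layer: (width * width_mult, spacing * spacing_mult)
        for layer, (width, spacing) in default_rules.items()
    }

# Placeholder default rules for the clock routing layers.
defaults = {"M5": (0.04, 0.04), "M6": (0.04, 0.04), "M7": (0.08, 0.08)}
ndr = make_ndr(defaults)
for layer, (width, spacing) in ndr.items():
    print(layer, width, spacing)
```

Wider wires lower the resistance of the clock net and extra spacing reduces coupling to neighbouring nets, which is why this kind of rule is applied to clock routes even though it costs routing tracks.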

 

 
