# Lecture 3: Nano-CMOS High-Level Synthesis

CSCE 6730: Advanced VLSI Systems

Instructor: Saraju P. Mohanty, Ph. D.

**NOTE**: The figures, text etc included in slides are borrowed from various books, websites, authors pages, and other sources for academic purpose only. The instructor does not claim any originality.





#### Outline of the Talk

- Issues in Nano-CMOS
- Challenges in The Context of HLS
- Proposed Techniques in Current Literature
- Conclusions





# **Issues in Nano-CMOS**





#### Issues in Nano-CMOS Circuits ...

- Variability: Variability in process and design parameters has increased. It affects design decisions, yield, and circuit performance.
- Leakage: Leakage is increasing. It affects average as well as peak power metrics, and is most significant for applications where the system goes into standby mode very often, e.g. PDAs.
- **Power**: Overall chip power dissipation is increasing. It affects energy consumption, cooling costs, and packaging costs.





#### **Issues in Nano-CMOS Circuits**

- Thermals or Temperature: Maximum temperature that can be reached by a chip during its operation is increasing. Affects reliability and cooling costs.
- Reliability: Circuit reliability is decreasing due to compound effects from variations, power, and thermals.
- Yield: Circuit yield is decreasing due to increased variability.





## Variability: Origin and Sources

- Ion implantation
- Chemical mechanical polishing (CMP)
- Chemical vapor deposition (CVD)
- Sub-wavelength lithography
- Lens aberration
- Materials flow
- Gas flow
- Thermal processes
- Spin processes
- Microscopic processes
- Photo processes

Source: Singhal, DAC Booth 2007





## Variability: Types ...

#### **Parametric Variations**




UNIVERSITY OF NORTH TEXAS Discover the power of ideas

#### Variability: Types ...







#### Variability: Types ...

## **Variability Classifications**

#### Inter-Die or Intra-Die

#### Random or Systematic

#### **Correlated or Uncorrelated**

Spatial or Temporal





## Variability: Types

- Process variations are classified as:
  - Inter-die and Intra-die.




### Variability: The Impact in a Wafer ...



Source–drain resistance differs from chip to chip on the same wafer.





Gate-to-source and gate-to-drain overlap capacitance differs from chip to chip on the same wafer.

Source: Bernstein et al., IBM J. Res. & Dev., July/Sep 2006.





## Variability: The Impact in a Wafer

- The impact of process variations is seen as design yield loss.
- Digital circuits are typically optimized for speed and power.
- Analog circuits are designed to meet as many as five to ten performance metrics.
- Variations in process parameters have a resounding effect on the performance metrics of analog/mixed-signal and RF circuits.
- Figure showing impact of effective transistor channel length on the speed of an adder cell.







## Variability: The 15 Device Parameters

- 1) V<sub>DD</sub>: supply voltage
- 2) V<sub>Thn</sub>: NMOS threshold voltage
- 3) V<sub>Thp</sub>: PMOS threshold voltage
- 4) t<sub>gaten</sub>: NMOS gate dielectric thickness
- 5) t<sub>gatep</sub>: PMOS gate dielectric thickness
- 6) L<sub>effn</sub>: NMOS channel length
- 7) L<sub>effp</sub>: PMOS channel length
- 8) W<sub>effn</sub>: NMOS channel width
- 9) W<sub>effp</sub>: PMOS channel width
- 10) N<sub>gaten</sub>: NMOS gate doping concentration
- 11) N<sub>gatep</sub>: PMOS gate doping concentration
- 12) N<sub>chn</sub>: NMOS channel doping concentration
- 13) N<sub>chp</sub>: PMOS channel doping concentration
- 14) N<sub>sdn</sub>: NMOS source/drain doping concentration
- 15) N<sub>sdp</sub>: PMOS source/drain doping concentration.





#### Power and Leakage ...







#### Power and Leakage

- The relative prominence of the power and leakage components depends on:
  - Technology Node: 65nm, 45nm, or 32nm
  - Process : SiO<sub>2</sub>/Poly or High-κ/Metal-Gate



• Band-to-band tunneling (BTBT) is important for sub-45nm nodes.





# Challenges in The Context of HLS





#### **High-Level Synthesis : An Effective Approach**

- High-level synthesis (HLS) is defined as the translation from a behavioral hardware description of a chip to its register-transfer level (RTL) structural description.
- Allows exploration of design alternatives, including low power, prior to layout of the circuit in actual silicon.
- An efficient way to cope with system design complexity.
- Can facilitate early design verification.
- Can increase design reuse.





#### Nano-CMOS HLS: Goal

- Variability-driven statistical HLS is stated as: Given an unscheduled data flow graph (DFG), it is required to find a scheduled data flow graph with appropriate resource binding such that specified costs for the circuit are minimized statistically while accounting for variability and satisfying constraints.
- The resource, latency, and/or yield constrained optimization problem can be formulated as follows:

Minimize: *PDF<sub>Cost, DFG</sub> (Mean, Variance)* ... (1)

such that the following resource, latency, and yield constraints are satisfied:

$$
\begin{array}{ll}
\text{Allocated}(FU_{k,i}) \leq \text{Available}(FU_{k,i}), \ \text{for each cycle } c & (2) \\
\text{Expected}\left[ PDF_{DFG,\,Delay,\,Critical}(Mean, Variance) \right] \leq Delay_{DFG,\,Target} & (3) \\
Yield_{Circuit} \geq Yield_{Target} & (4)
\end{array}
$$

NOTE: PDF is probability density function.
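As a rough illustration of constraints (3) and (4), the sketch below assumes that the critical-path delay of the DFG follows a normal distribution, so the timing yield is the probability that the delay stays within the target. The function name `timing_yield` and all numbers are hypothetical and are not taken from the lecture.

```python
from statistics import NormalDist

def timing_yield(mean_delay, sigma_delay, target_delay):
    """Probability that a normally distributed delay meets the target.

    Assumption: the critical-path delay PDF is Gaussian with the given
    mean and standard deviation (same time unit as target_delay).
    """
    return NormalDist(mu=mean_delay, sigma=sigma_delay).cdf(target_delay)

# Hypothetical numbers: 10 ns mean, 0.5 ns sigma, 11 ns target (a 2-sigma margin).
yield_estimate = timing_yield(10.0, 0.5, 11.0)
print(f"estimated timing yield: {yield_estimate:.4f}")

# Constraint (4) then becomes a simple comparison against a target yield.
meets_yield_target = yield_estimate >= 0.95
```

Under this assumption, tightening the delay target or increasing the variance lowers the estimated yield, which is the trade-off that the optimization has to balance.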





#### Nano-CMOS HLS: Design Space








## Nano-CMOS HLS: Challenges

- Unified consideration of axes of design space exploration for trade-offs.
- Determination of statistical models for variability of different nano-CMOS technologies.
- Propagation of the statistics to different levels of circuit abstraction.
- Performing statistical modeling of power, leakage, and delay for different RTL components.
- Estimating power, leakage, delay, area, and yield during HLS in the presence of variations.



#### Nano-CMOS HLS: Feedback Needed








#### Nano-CMOS HLS: Questions

- How do the HLS phases (e.g. scheduling, binding) affect power, leakage, area, and yield in the presence of variations?
- How do we judiciously consider design corners (e.g.  $V_{DD}$ ,  $V_{Th}$ ) to obtain a circuit that is globally optimal in power, leakage, and performance for the given circuit constraints (from specifications)?





# **Proposed Approaches**





#### Nano-CMOS HLS : Approaches








# Statistical Nano-CMOS HLS for Power and Leakage

**Source**: S. P. Mohanty and E. Kougianos, "Simultaneous Power Fluctuation and Average Power Minimization during Nano-CMOS Behavioral Synthesis", in *Proceedings of the 20th IEEE International Conference on VLSI Design (VLSID)*, pp. 577-582, 2007.





#### Proposed Statistical Nano-CMOS HLS Framework








#### **Statistical HLS : Formulation**

Minimize: 
$$I_{Total}^{DFG}(\mu_{I}^{DFG}, \sigma_{I}^{DFG})$$

Subject to (Resource/Time Constraints):

$$\text{Allocated}(FU_{k,i}) \leq \text{Available}(FU_{k,i}), \quad \forall \text{ cycle } c$$

$$D_{CP}^{DFG}(\mu_D^{DFG}, \sigma_D^{DFG}) \leq D_{Con}(\mu_D^{Con}, \sigma_D^{Con})$$







• 3 level hierarchical approach.





- It is assumed that resources such as adders, subtractors, multipliers, and dividers are constructed using 2-input NAND gates.
- There are a total of *N* NAND gates in the network of NAND gates constituting an *n*-bit functional unit.
- $N_{CP}$  of these NAND gates are in the critical path.





• The PDF of a current component of a functional unit is calculated as:

$$I_{dyn}^{FU} = \text{Statistical Summation over } N \ (I_{dyn}^{NAND})$$

$$I_{sub}^{FU} = \text{Statistical Summation over } N \ (I_{sub}^{NAND})$$

$$I_{gate}^{FU} = \text{Statistical Summation over } N \ (I_{gate}^{NAND})$$

• The PDF of delay can be calculated as:

$$D_{prop}^{FU} = \text{Statistical Summation over } N_{CP} \ (D_{prop}^{NAND})$$

• Correlation needs to be considered.
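One possible reading of "statistical summation" is sketched below, assuming every NAND-level quantity is normally distributed and that all pairs of gates share a single correlation coefficient `rho`. Both assumptions, and the name `statistical_sum`, are illustrative and are not stated in the slides.

```python
import math

def statistical_sum(means, sigmas, rho=0.0):
    """Mean and sigma of a sum of normal variables.

    Assumption: every pair of variables has the same correlation
    coefficient rho (rho = 0 means independent, rho = 1 fully correlated).
    """
    mu = sum(means)
    var = sum(s * s for s in sigmas)
    n = len(sigmas)
    for i in range(n):
        for j in range(i + 1, n):
            # Covariance term 2 * rho * sigma_i * sigma_j for each pair.
            var += 2.0 * rho * sigmas[i] * sigmas[j]
    return mu, math.sqrt(var)

# Four identical hypothetical NAND gates: mean 1.0, sigma 0.1 (arbitrary units).
print(statistical_sum([1.0] * 4, [0.1] * 4, rho=0.0))
print(statistical_sum([1.0] * 4, [0.1] * 4, rho=1.0))
```

With independent gates the sigma grows with the square root of the gate count, while with fully correlated gates it grows linearly, which is why ignoring correlation can underestimate the spread of a functional unit.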





• Through Monte Carlo simulations the input process and design variations are modeled.
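A minimal Monte Carlo sketch of this idea follows. It assumes Gaussian variation in only two of the parameters, $V_{DD}$ and $V_{Th}$, and uses an alpha-power-law expression as a stand-in delay model; the nominal values, the 10% three-sigma spread, and the exponent 1.3 are hypothetical, whereas the lecture obtains its PDFs from circuit simulation.

```python
import random
import statistics

def monte_carlo_delay(n_samples=5000, seed=1):
    """Estimate mean and sigma of a toy gate delay under parameter variation."""
    rng = random.Random(seed)
    delays = []
    for _ in range(n_samples):
        # Assumed nominal values with 3 sigma = 10% of nominal.
        vdd = rng.gauss(0.7, 0.7 * 0.10 / 3)
        vth = rng.gauss(0.22, 0.22 * 0.10 / 3)
        # Toy alpha-power-law delay model in arbitrary units (assumption).
        delays.append(vdd / (vdd - vth) ** 1.3)
    return statistics.mean(delays), statistics.stdev(delays)

mu, sigma = monte_carlo_delay()
print(f"delay mean = {mu:.3f}, sigma = {sigma:.3f} (arbitrary units)")
```

The resulting mean and sigma are the kind of per-gate statistics that a characterized library would store for each current component and for the propagation delay.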








#### Statistical HLS : Library ... (PDFs of Currents and Delay)





Figure panels: gate leakage current, subthreshold leakage current, and propagation delay.






# Statistical HLS : Library (Relative Contributions)







### Statistical HLS : Optimization ...

```
Simulated Annealing Algorithm (UDFG, Constraints, Library)
{
    Perform ASAP and ALAP scheduling.
    Temp ← Initial Temperature.
    While there exists a schedule with available resources
        i ← Number of iterations.
        Perform resource constrained ASAP and ALAP.
        Initial Solution ← ASAP Schedule.
        S ← Allocate-Bind().
        Initial Cost ← Statistical-Cost(S).
        While (i > 0)
            Generate random transition from S to S*.
            Δ-Cost ← Statistical-Cost(S) − Statistical-Cost(S*).
            // Accept an improvement always, a worse solution with probability e^(Δ-Cost/Temp).
            if (Δ-Cost > 0) or (e^(Δ-Cost/Temp) > random[0,1)) then S ← S*.
            i ← i − 1.
        end While
        Decrement available resources.
        Temp ← Cooling Rate × Temp.
    end While
    return S.
}
```
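The inner annealing loop can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the outer loop over resource constraints is omitted, the temperature is cooled after every batch of iterations, and the solution is a toy binding of each operation to one of two functional-unit variants. `COST_TABLE`, `binding_cost`, and `flip_one` are hypothetical stand-ins for `Statistical-Cost` and the random transition.

```python
import math
import random

def simulated_annealing(initial, cost, neighbor, temp=1.0, cooling=0.9,
                        iters_per_temp=200, min_temp=1e-3, seed=0):
    """Generic minimizing annealer; returns the best solution seen and its cost."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    while temp > min_temp:
        for _ in range(iters_per_temp):
            candidate = neighbor(current, rng)
            candidate_cost = cost(candidate)
            delta = current_cost - candidate_cost  # > 0 means the candidate is better
            # Always accept improvements; accept worse moves with prob. e^(delta/temp).
            if delta > 0 or math.exp(delta / temp) > rng.random():
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        temp *= cooling
    return best, best_cost

# Hypothetical cost of binding each operation to FU variant 0 or variant 1.
COST_TABLE = [(5, 3), (2, 7), (6, 4), (1, 9), (8, 2), (3, 3), (7, 5), (4, 6)]

def binding_cost(binding):
    return sum(COST_TABLE[op][variant] for op, variant in enumerate(binding))

def flip_one(binding, rng):
    """Random transition: rebind one randomly chosen operation."""
    op = rng.randrange(len(binding))
    changed = list(binding)
    changed[op] = 1 - changed[op]
    return tuple(changed)

best, best_cost = simulated_annealing((0,) * len(COST_TABLE), binding_cost, flip_one)
print(best, best_cost)
```

Tracking the best solution seen is a common safeguard that the pseudocode above does not include; it keeps a good solution from being lost to a late uphill move.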




## Statistical HLS : Optimization

Statistical-Cost (S, Library):

- $I_{dyn}^{c}$ = Statistical Summation over all FU in $c$ $(I_{dyn}^{FU})$
- $I_{sub}^{c}$ = Statistical Summation over all FU in $c$ $(I_{sub}^{FU})$
- $I_{gate}^{c}$ = Statistical Summation over all FU in $c$ $(I_{gate}^{FU})$
- $I_{total}^{c} = \text{Statistical Summation} \left( I_{dyn}^{c}, I_{sub}^{c}, I_{gate}^{c} \right)$
- $I_{total}^{DFG}$ = Statistical Summation over all cycles $(I_{total}^{c})$
- $Cost_{I}^{DFG} = \mu_{I}^{DFG} + 3 \times \sigma_{I}^{DFG}$
- Similarly calculate the delay cost $Cost_D^{DFG}$ of the DFG.
- $Cost = Cost_I^{DFG} \times Cost_D^{DFG}$
- Return $Cost$.
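The cost computation above can be sketched as follows, assuming that every quantity is an independent normal variable described by a (mean, sigma) pair and that the DFG delay is the statistical sum of hypothetical per-cycle delays. The independence assumption and the names `stat_sum` and `statistical_cost` are illustrative choices and are not part of the lecture.

```python
import math

def stat_sum(pdfs):
    """Sum independent normal PDFs given as (mean, sigma) pairs."""
    mu = sum(m for m, _ in pdfs)
    sigma = math.sqrt(sum(s * s for _, s in pdfs))
    return mu, sigma

def statistical_cost(current_per_cycle, delay_per_cycle):
    """Product of the (mean + 3 sigma) current cost and delay cost.

    current_per_cycle: one list per cycle with (mean, sigma) pairs for
        every current component of every FU active in that cycle.
    delay_per_cycle: one (mean, sigma) pair per cycle (assumed model).
    """
    mu_i, sigma_i = stat_sum([stat_sum(cycle) for cycle in current_per_cycle])
    mu_d, sigma_d = stat_sum(delay_per_cycle)
    cost_i = mu_i + 3 * sigma_i
    cost_d = mu_d + 3 * sigma_d
    return cost_i * cost_d

# Hypothetical two-cycle schedule with one current entry per cycle.
currents = [[(3.0, 0.4)], [(4.0, 0.3)]]
delays = [(1.0, 0.0), (1.0, 0.0)]
print(statistical_cost(currents, delays))
```

Using the mean plus three sigma penalizes both a high average and a wide spread, so a schedule with slightly higher mean current but much lower variation can still win.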



#### Statistical HLS : Results







# Parametric Nano-CMOS HLS for Leakage

**Source**: S. P. Mohanty, R. Velagapudi, and E. Kougianos, "Physical-Aware Simulated Annealing Optimization of Gate Leakage in Nanoscale Datapath Circuits", in *Proc. 9th IEEE International Conference on Design Automation and Test in Europe (DATE)*, pp. 1191-1196, 2006.







#### Parametric HLS : Formulation

Minimize: 
$$I_{Total}^{DFG}$$
 (Parameters :  $\kappa, T_{gate}, V_{Th}, V_{DD}, L_{eff}, W$ )

Subject to (Resource/Time Constraints):

$$\text{Allocated}(FU_{k,i}) \leq \text{Available}(FU_{k,i}), \quad \forall \text{ cycle } c$$

$$D_{CP}^{DFG} \ (\text{Parameters}: \kappa, T_{gate}, V_{Th}, V_{DD}, L_{eff}, W) \leq D_{Con}$$








We calculate the direct tunneling current (*I*<sub>oxFU</sub>) of an *n*-bit functional unit as:

$$I_{ox FU} = \sum_{i=1}^{N} I_{ox NANDi}$$

where  $I_{oxNANDi}$  is the average gate oxide tunneling current dissipation of the *i*<sup>th</sup> 2-input NAND gate in the functional unit, assuming all states to be equiprobable.

Similarly, the propagation delay and silicon area of an *n*-bit functional unit are

$$T_{pdFU} = \sum_{i=1}^{N_{CP}} T_{pdNANDi}$$

$$A_{FU} = \sum_{i=1}^{N} A_{NANDi}$$





- At logic level we used BPTM BSIM4 models for analog simulation to find  $I_{\rm ox}$  and  $T_{\rm pd}.$
- Due to unavailability of silicon data we used an analytical estimate for area calculations.

$$A_{NAND} = K_{inv} \left( 1 + 4(n_{in} - 1)\sqrt{\frac{AR_{NAND}}{K_{inv}}} \right) * \left( 1 + \frac{\left(\frac{W_{NMOS}}{f} - 1\right)(1 + \beta_{NAND})}{\sqrt{K_{inv}AR_{NAND}}} \right)$$

where,

- W<sub>NMOS</sub> = NMOS width,
- f = minimum feature size for a technology,
- K<sub>inv</sub> = area of minimum size inverter with respect to  $f^2$,
- AR<sub>NAND</sub> = aspect ratio of NAND gate,
- n<sub>in</sub> = number of inputs, and
- $\beta_{NAND}$  = ratio of PMOS width to NMOS width.

Source: Bowman, TED, Aug. 2001.
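The area estimate can be transcribed directly into a small function, which helps when checking how the estimate scales with transistor width or fan-in. The function below follows the expression as printed in this section; the name `nand_area`, the argument names, and the example values are hypothetical, and the result is presumably expressed in the same units as $K_{inv}$.

```python
import math

def nand_area(k_inv, n_in, aspect_ratio, w_nmos, f, beta):
    """Analytical NAND gate area, transcribed from the expression above.

    k_inv: area of a minimum size inverter with respect to f^2
    n_in: number of inputs
    aspect_ratio: aspect ratio of the NAND gate
    w_nmos, f: NMOS width and minimum feature size (same length unit)
    beta: ratio of PMOS width to NMOS width
    """
    fan_in_term = 1 + 4 * (n_in - 1) * math.sqrt(aspect_ratio / k_inv)
    width_term = 1 + ((w_nmos / f - 1) * (1 + beta)) / math.sqrt(k_inv * aspect_ratio)
    return k_inv * fan_in_term * width_term

# Hypothetical 2-input NAND with minimum-width NMOS devices.
print(nand_area(k_inv=100.0, n_in=2, aspect_ratio=1.0, w_nmos=45.0, f=45.0, beta=2.0))
```

A quick sanity check on the transcription is that a single-input gate with minimum-width devices reduces to the inverter area $K_{inv}$.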






 $I_{ox}(\mu A) = A \exp\left(-\frac{T_{ox}}{\beta}\right) + \beta$ 

















## Parametric HLS : Optimization ...

- The objective is to reduce both the gate leakage and area of the circuit for given time constraints.
- The objective function used by the optimization algorithm is:  $Cost = a \cdot I_{ox} + b \cdot A$
- *I*<sub>ox</sub> of the circuit is calculated as the sum of the tunneling currents of all the nodes in the circuit. *A* is the sum of the areas of all the allocated resources. '*a*' and '*b*' are the weights of current and area respectively, chosen in such a way that the effects of current and area are normalized.
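One simple way to normalize the two terms, shown here purely as an assumption since the slides do not say how the weights are chosen, is to set each weight to the reciprocal of a reference value so that both terms equal 1 for a reference design. The names `make_cost`, `i_ref`, and `a_ref` and the numbers are hypothetical.

```python
def make_cost(i_ref, a_ref):
    """Build Cost = a * I_ox + b * A with a = 1 / i_ref and b = 1 / a_ref."""
    a, b = 1.0 / i_ref, 1.0 / a_ref

    def cost(i_ox, area):
        return a * i_ox + b * area

    return cost

# Hypothetical reference design: 200 uA of tunneling current, 5000 area units.
cost = make_cost(i_ref=200.0, a_ref=5000.0)
print(cost(200.0, 5000.0))  # the reference design itself
print(cost(100.0, 5000.0))  # half the current, same area
```

With this choice, halving the current and halving the area reduce the cost by the same amount, so neither quantity dominates just because of its units.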





## Parametric HLS : Optimization ...

- (01) Initial Temperature  $t \leftarrow t_o$  and available Resources  $\leftarrow$  Resource constraints.
- (02) While there exists a schedule with available resources.

- (03) i = Number of iterations.
- (04) Perform resource constrained ASAP and resource constrained ALAP.
- (05) Make initial Solution as ASAP Schedule.
- (06)  $S \leftarrow \text{Allocate-Bind}()$  and Initial Cost  $\leftarrow Cost(S)$ .
- (07) While (*i* > 0)
- (08) Generate a random thickness in the range  $(T_{ox} - \Delta T_{ox},\ T_{ox} + \Delta T_{ox})$ .
- (09) Generate random transition from S to  $S^*$ .
- (10)  $\Delta C \leftarrow Cost(S) - Cost(S^*)$
- (11) if  $(\Delta C > 0)$  then  $S \leftarrow S^*$ .
- (12) else if  $(e^{\Delta C/t} > random[0,1))$  then  $S \leftarrow S^*$ .
- (13)  $i \leftarrow i - 1.$
- (14) end While.
- (15) Decrement available resources.
- (16)  $t \leftarrow \text{Cooling Rate } \times t.$
- (17) end While.
- (18) return *S*.
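The ASAP and ALAP schedules used to seed the algorithm can be sketched as below. This version is a simplification: it ignores resource constraints and assumes every operation takes one control step, whereas the algorithm above uses the resource constrained variants. The example DFG and the names `asap`, `alap`, and `mobility` are hypothetical.

```python
def asap(dfg):
    """Earliest control step of each operation; dfg maps op -> list of predecessors."""
    step = {}

    def visit(op):
        if op not in step:
            step[op] = 1 + max((visit(p) for p in dfg[op]), default=0)
        return step[op]

    for op in dfg:
        visit(op)
    return step

def alap(dfg, latency):
    """Latest control step of each operation for a given total latency."""
    succs = {op: [] for op in dfg}
    for op, preds in dfg.items():
        for p in preds:
            succs[p].append(op)
    step = {}

    def visit(op):
        if op not in step:
            step[op] = min((visit(s) for s in succs[op]), default=latency + 1) - 1
        return step[op]

    for op in dfg:
        visit(op)
    return step

# Hypothetical DFG: three multiplications feeding two additions.
dfg = {"m1": [], "m2": [], "m3": [], "a1": ["m1", "m2"], "a2": ["a1", "m3"]}
early = asap(dfg)
late = alap(dfg, latency=max(early.values()))
mobility = {op: late[op] - early[op] for op in dfg}
print(early, late, mobility)
```

The mobility of an operation, i.e. the difference between its ALAP and ASAP steps, is the freedom that random transitions can exploit; operations on the critical path have zero mobility.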







### **Parametric HLS : Optimization**



Each layer corresponds to a different resource constraint: each time the number of  $T_{oxH}$  multipliers is decreased, a new layer is formed. We observed that the number of design corners reduces when we use more multipliers of  $T_{oxH}$  thickness, since delay increases and the mobility of the nodes is restricted in order to satisfy the time constraint.





#### Parametric HLS : Results



Results are presented for different benchmarks with a delay trade-off factor of 1.4,  $T_{oxL}$  = 1.4nm, and  $T_{oxH}$  = 1.7nm.





# Statistical Nano-CMOS HLS for Timing

**Source**: Jongyoon Jung, Taewhan Kim, "Timing Variation-Aware High-Level Synthesis", in *Proceedings of IEEE/ACM International Conference on Computer-Aided Design (ICCAD)*, 2007, pp. 424-428.











# Statistical Timing HLS : Algorithm

- Branch-and-bound algorithm for scheduling and binding.
- The search process is sped up using window-based search.
- A window is the maximum number of consecutive clock cycles satisfying the resource constraints.





## Statistical Timing HLS : Results

Results compared with traditional list scheduling:

| Benchmarks |     | Yield Obtained | Yield Penalty | Latency Reduction |
|------------|-----|----------------|---------------|-------------------|
| Avg. of 4  | 90% | 92.9%          | 7.1%          | 18.8%             |
| Avg. of 4  | 80% | 88.1%          | 11.9%         | 20.2%             |





# Statistical Nano-CMOS HLS for Post-Silicon Tuning

**Source**: Feng Wang, Xiaoxia Wu, and Yuan Xie, "Variability-Driven Module Selection With Joint Design Time Optimization and Post-Silicon Tuning", in *Proceedings of the Asia and South Pacific Design Automation Conference (ASPDAC)*, 2008, pp. 2-9.





# Silicon Tuning HLS : Approach

- Two stage module selection:
  - Stage 1: An iterative algorithm for power and timing variability aware module selection.
  - Stage 2: A sequential conic program (SCP) to determine the optimal body bias for post-silicon tuning which influences design-time module selection.





# Silicon Tuning HLS : Results

Power yield for a 99% performance yield constraint:

| Benchmarks | Power Constraint | Yield for Design Time Variation Aware Selection | Yield for Post Silicon Tuning + Design Time Variation Aware Selection | Improvements |
|------------|------------------|-------------------------------------------------|-----------------------------------------------------------------------|--------------|
| Avg. of 6  | No               | 66%                                             | 88%                                                                   | 38%          |
| Avg. of 6  | Yes              | 83%                                             | 92%                                                                   | 11%          |





#### **Summary and Conclusions**

- Most variability-aware analysis and optimization work is at the circuit or logic level.
- Work at the architecture level and during HLS is slowly making progress.
- Pre-silicon and post-silicon approaches have been introduced to improve power and timing yield.
- The main challenge is the unified consideration of variability, power, and timing.
- Another challenge is the translation of process and physical level information to the architecture level to close the design-to-silicon loop.



