System-level Thermal and Reliability Analysis and Management for Multi-Core and 3D Microprocessors

 

Figure 1. Stress distribution in a metal line (courtesy of Dr. Sukharev).

 

Figure 2. EM-induced stress development in the metal wire over time.

 

 

 

 

 

Principal Investigator:

 

Dr. Sheldon Tan (PI)

 

Graduate Students:

 

Xin Huang, Yan Zhu, Sahana Swarup, Taeyoung Kim

Graduate Students (graduated):

 

Zao Liu (Intel Corp), Xuexin Liu (Synopsys)

Industry liaisons: 

1.    Dr. Valeriy Sukharev, Mentor Graphics Corporation

2.    Dr. Ashish X. Gupta, Intel Corporation

3.    Dr. Jinjun Xiong, IBM Research

4.    Dr. Logendran Bharatham, Freescale Semiconductor, Inc. 

 

Academic Collaborators:

  Dr. Hai Wang, The University of Electronic Science and Technology of China, Chengdu, China

  Dr. Haibao Chen, Shanghai Jiaotong University, Shanghai, China

Funding:

 

We thank the following funding agencies for their generous support of this project.

1.    National Science Foundation, NSF FRS (Failure Resistant Systems) program (CCF-1255899), "Thermal-Sensitive System-Level Reliability Analysis and Management for Multi-Core and 3D Microprocessors", $180K, April 1, 2013 to March 31, 2016. PI (single PI).

2.    Semiconductor Research Corporation, NSF/SRC Multi-core Program (SRC 2013-TJ-2417), "Thermal-Sensitive System-Level Reliability Analysis and Management for Multi-Core and 3D Microprocessors", $120K, April 1, 2013 to March 30, 2016. PI.

3.    Academic Senate COR (Committee on Research) Fellowship, "Runtime Thermal Management for Multi/many Core and 3D Integrated Systems", $7,500, July 2012 to June 2013. PI.

 

Awards:

1.     Dr. Valeriy Sukharev received the prestigious SRC Mahboob Khan Outstanding Industry Liaison Award!

a.     The Mahboob Khan Outstanding Industry Liaison/Associate Awards recognize individuals who demonstrate outstanding commitment and effectiveness in facilitating university research, mentoring graduate students, and disseminating knowledge and research results to industry.

b.     Dr. Sukharev has been selected as a recipient of one of the 2014 Mahboob Khan Outstanding Industry Liaison/Associate Awards. His dedication and personal contributions as a liaison to the SRC research program under the direction of Dr. Sheldon Tan, University of California, Riverside, on SRC research #2417.001, Thermal-Sensitive System-Level Reliability Analysis and Management for Multi-Core and 3D Microprocessors, have served to strengthen our industry. SRC lauds his efforts and holds up his accomplishments as a model for others.

c.     The Mahboob Khan Outstanding Industry Liaison/Associate Awards will be presented at the SRC TECHCON 2014 banquet on Monday, September 8th in Austin, TX.

2.    X. Huang, T. Yu, V. Sukharev, S. X.-D. Tan, "Physics-based electromigration assessment for power grid networks", Proc. IEEE/ACM Design Automation Conference (DAC'14), San Francisco, June 2014. (Best Paper Award Nomination, 12 out of 787 submissions, 1.5%)

3.    H. Chen, S. X.-D. Tan, X. Huang, V. Sukharev, "New electromigration modeling and analysis considering time-varying temperature and current densities", Proc. Asia South Pacific Design Automation Conference (ASP-DAC'15), Chiba, Japan, Jan. 2015. (Best Paper Award Nomination)

 

 

Project Descriptions

 

Reliability has become a significant challenge for current multi-core and emerging 3D microprocessor designs. Aggressive transistor scaling and increasing processor power density lead to excessive on-chip temperatures and increase the risk that microprocessors will fail. Many long-term failure mechanisms, such as electromigration, stress migration, and thermal cycling, are very sensitive to temperature or temperature changes. The elevated temperatures and temperature gradients caused by continued integration in multi-core and emerging 3D microprocessors have significant adverse effects on these reliability issues. Wear-out-based long-term reliability issues were traditionally addressed in the process and manufacturing stages. But as reliability becomes a major design constraint for nanometer VLSI systems, it must be addressed at multiple layers. As a result, there is an urgent need for reliability awareness and optimization at the micro-architectural design stage. Since temperature has an exponential impact on many failure mechanisms, accurate and fast thermal estimation is crucial for reliability analysis and optimization at the architecture and package levels.
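To illustrate the exponential temperature sensitivity described above, here is a minimal sketch based on the classical Black's equation for electromigration, MTTF = A * j^(-n) * exp(Ea / (k * T)). The activation energy (0.9 eV) and the two temperatures are illustrative assumptions rather than values from this project, and the function name is hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def black_mttf_ratio(t_ref_k, t_hot_k, ea_ev=0.9):
    """Return MTTF(t_hot_k) / MTTF(t_ref_k) under Black's equation.

    Both temperatures are in kelvin. The current density is assumed to be
    the same in both cases, so the prefactor A and the j^(-n) term cancel
    and only the Arrhenius factor remains. ea_ev is an assumed value.
    """
    if t_ref_k <= 0 or t_hot_k <= 0:
        raise ValueError("temperatures must be positive (kelvin)")
    exponent = (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_hot_k - 1.0 / t_ref_k)
    return math.exp(exponent)


if __name__ == "__main__":
    # Compare an 85 C reference against a 105 C hot spot.
    ratio = black_mttf_ratio(t_ref_k=358.15, t_hot_k=378.15)
    print(f"MTTF at 105 C relative to 85 C: {ratio:.2f}")
```

Under these assumed parameters, a 20 K rise reduces the estimated lifetime several-fold, which is why thermal estimates feed directly into reliability analysis.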

This project addresses the fundamental challenges in system-level reliability modeling, analysis and optimization. The project consists of the following thrusts:

First, we propose to develop architecture-level full-chip reliability modeling and analysis techniques that consider new integration structures and the dominant hard failure mechanisms. We will then develop reliability-aware dynamic thermal management techniques for multi-core and 3D stacked microprocessors, focusing on thermal management based on task migration and dynamic voltage and frequency scaling.

Second, we propose to develop full-chip thermal estimation and prediction techniques that consider realistic conditions, such as a limited number of physical thermal sensors and errors in the thermal and power models, for run-time system-level reliability analysis and optimization. For fast thermal analysis and estimation at the design stage, we also propose a module-based hierarchical thermal analysis technique, which promises both accuracy and efficiency.

We expect the following results from this research:

1.     Development of architecture-level full-chip reliability modeling and analysis techniques.

2.     Development of reliability-aware dynamic thermal management techniques for multi-core and 3D stacked microprocessors.

3.     Design of full-chip thermal estimation and prediction techniques that account for a limited number of thermal sensors and for noise errors, for run-time thermal management and optimization.

Task goal:

The objective of this project is to develop novel, efficient system- and architecture-level reliability analysis and optimization techniques for multi-core and 3D microprocessors. We seek to regulate on-chip temperature, which affects wear-out faults the most, in order to manage system reliability dynamically.

Three thrusts in the task:

1.     Fast thermal estimation and prediction techniques

2.     Full-chip failure rate and MTTF modeling and analysis techniques

3.     New reliability-aware dynamic thermal management techniques

Features of the proposed methods

1.     Address long-term thermal-sensitive reliability issues, such as EM, SM, TDDB, and thermal cycling effects, through system-level thermal and power management.

2.     New fast physics-based EM assessment techniques that are more accurate and predictable than the existing Black's and Blech's equations.

3.     Thermal estimation and prediction techniques that account for more realistic conditions.
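For reference, the classical Blech criterion mentioned above can be sketched as follows: a wire segment is treated as immune to electromigration when the product of its current density and length stays below a critical value. The critical product used here is a placeholder assumption and the function name is hypothetical; this is the traditional baseline, not the project's physics-based assessment method.

```python
def is_blech_immortal(current_density_a_per_m2, length_m,
                      critical_product_a_per_m=3.0e5):
    """Return True when j * L is below the assumed critical Blech product.

    current_density_a_per_m2: current density j in A/m^2
    length_m: wire segment length L in meters
    critical_product_a_per_m: assumed (j * L) threshold in A/m; the real
        value depends on the metal, the dielectric, and the process.
    """
    if current_density_a_per_m2 < 0 or length_m < 0:
        raise ValueError("current density and length must be non-negative")
    return current_density_a_per_m2 * length_m < critical_product_a_per_m


if __name__ == "__main__":
    j = 1.0e10  # 1 MA/cm^2 expressed in A/m^2
    for length_um in (10, 100):
        immortal = is_blech_immortal(j, length_um * 1e-6)
        print(f"L = {length_um} um -> immortal: {immortal}")
```

With the assumed threshold, the short segment passes the check and the long one does not, which shows how the criterion reduces EM screening to a single product per wire segment.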

 

Invited Presentations by Dr. Sheldon Tan and collaborators

 

1.     Nanyang Technological University, School of Electrical and Electronic Engineering, Singapore, "Thermal Modeling, Estimation and Prediction for Package Design and On-Chip Temperature Regulation", Aug. 16, 2011.

2.     The Hong Kong University of Science and Technology, Department of Electrical and Computer Engineering, Hong Kong, China, "Reliable Thermal Estimation and Prediction for On-Chip Temperature Regulation", Aug. 22, 2011.

3.     Mentor Graphics Corp, Calibre Group, Fremont, CA, "Thermal Modeling and Analysis Research for High-Performance Package and Chip Design", Dec. 14, 2011.

4.     MediaTek Singapore Pte Ltd, Singapore, "Thermal Analysis and Runtime Management Research for Multi-core Microprocessors", July 27, 2012.

5.     International Talent Innovation and Entrepreneurship Week of Shanghai, 2012, Shanghai, "New Battery State of Charge Estimation Techniques for EV", Aug. 7, 2012.

6.     International Workshop on Emerging Circuits and Systems (IWECS'13), University of Electronic Science and Technology of China (UESTC), Chengdu, Sichuan Province, China, "Thermal resistance modeling and characterization for TSV and TSV array", July 26, 2013.

7.     Seoul National University, Embedded System Research Center (ESRC), Seoul, Korea, "Architecture Level Thermal Modeling, Management for Multi-core and 3D Microprocessors", Dec. 10, 2013. Host: Prof. Naehyuck Chang of SNU.

8.     The University of Hong Kong, Department of Electrical and Electronic Engineering, Hong Kong, China, "New More Physics-Based Full-Chip Electron-migration Modeling and Analysis", Jan. 24, 2014. Host: Prof. Ngai Wong of Univ. of HK.

9.     The University of California at San Diego, Department of Electrical and Computer Engineering, San Diego, CA, "New Physics-Based Full-Chip Electron-Migration Analysis and System-level Reliability Management", April 23, 2014. Host: Prof. Chung-Kuan Cheng of UCSD.

10.    The Institute of Computing Technology, State Key Lab of Computer Architecture, Chinese Academy of Sciences, Beijing, China, "Physics-Based Full-Chip Electron-Migration Analysis and System-level Reliability Management", July 4, 2014. Host: Prof. Yu Hu of ICT, CAS.

11.    2nd International Workshop on Cross-layer Resiliency (IWCR 2014), USC Information Sciences Institute (ISI), Marina del Rey, CA, "Physics-Based Full-Chip Electron-Migration Modeling and System-level Reliability Management", July 28, 2014.

12.    EDA workshop, Daejeon Convention Center, Daejeon, Korea, "Physics-Based Full-Chip Electron-Migration Modeling and Cross-Layer Reliability Management", August 26, 2014.

13.    University of Electronic Science and Technology of China (UESTC), School of Microelectronics and Solid State Electronics, Chengdu, China, "Physics-Based Full-Chip Electron-Migration Modeling and Cross-Layer Reliability Management", Sept. 10, 2014.

14.    13th International Workshop on Stress-Induced Phenomena in Microelectronics (Stress Workshop), The University of Texas at Austin, Austin, "Physics-Based Electromigration Assessment for Power Grid Networks", Oct. 15, 2014.

 

Tutorial Presentations by Dr. Sheldon Tan

 

 

 

Software Download

 

 

Relevant Publications

Journal publications

J1       D. Li, S. X.-D. Tan, E. H. Pacheco, M. Tirumala, "Parameterized architecture-level thermal modeling for multi-core microprocessors", ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 15, no. 2, pp. 1-22, February 2010 (one of the top 10 downloaded ACM TODAES articles published in 2010).

J2       T. Eguia, S. X.-D. Tan, R. Shen, D. Li, E. H. Pacheco, M. Tirumala, L. Wang, "General parameterized thermal modeling for high-performance microprocessor design", IEEE Transactions on Very Large Scale Integration (VLSI) Systems (TVLSI), vol. 20, no. 2, pp. 221-224, Feb. 2012. doi: 10.1109/TVLSI.2010.2098054.

J3       H. Wang, S. X.-D. Tan, D. Li, A. Gupta, Y. Yuan, "Composable Thermal Modeling and Simulation for Architecture-Level Thermal Designs of Multi-core Microprocessors", ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 18, no. 2, March 2013.

J4       Z. Liu, S. X.-D. Tan, H. Wang, Y. Hua, and A. Gupta, "Compact thermal modeling for packaged microprocessor design with practical power maps", Integration, The VLSI Journal, vol. 47, no. 1, January 2014. (One of the most downloaded papers in 2014 after its publication, 178 downloads in 3 months.) See: http://www.journals.elsevier.com/integration-the-vlsi-journal/most-downloaded-articles/  Online access: http://www.sciencedirect.com/science/article/pii/S0167926013000412

J5       Z. Liu, S. X.-D. Tan, X. Huang and H. Wang, "Task migrations for distributed thermal management considering transient effects", IEEE Transactions on Very Large Scale Integration (VLSI) Systems (TVLSI), (in press).

J6       Z. Liu, S. Swarup, S. X.-D. Tan, H. Chen, H. Wang, "Compact lateral thermal resistance model of TSVs for fast finite-difference based thermal analysis of 3D stacked ICs", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 33, no. 10, Oct. 2014.

 

Conference publications

C1      H. Wang, S. X.-D. Tan, X. Liu, A. Gupta, "Runtime power estimator calibration for high-performance microprocessors", Proc. Design, Automation and Test in Europe (DATE'12), pp. 352-357, Dresden, Germany, March 2012.

C2      Z. Liu, S. X.-D. Tan, H. Wang, A. Gupta, and S. Swarup, "Compact nonlinear thermal modeling of packaged integrated systems", Proc. Asia South Pacific Design Automation Conference (ASP-DAC'13), pp. 157-162, Yokohama, Japan, Jan. 2013.

C3      Z. Liu, T. Xu, S. X.-D. Tan, and H. Wang, "Dynamic thermal management for multi-core microprocessors considering transient thermal effects", Proc. Asia South Pacific Design Automation Conference (ASP-DAC'13), pp. 473-478, Yokohama, Japan, Jan. 2013.

C4      H. Wang, S. X.-D. Tan, S. Swarup, and X. Liu, "A power-driven thermal sensor placement algorithm for dynamic thermal management", Proc. Design, Automation and Test in Europe (DATE'13), pp. 1215-1220, Grenoble, France, March 2013.

C5      Z. Liu, S. Swarup, and S. X.-D. Tan, "Compact lateral thermal resistance modeling and characterization for TSV and TSV array", Proc. IEEE/ACM International Conf. on Computer-Aided Design (ICCAD'13), San Jose, CA, Nov. 2013.

C6      Z. Liu, X. Huang, S. X.-D. Tan, H. Wang, H. Tang, "Distributed task migration for thermal hot spot reduction in many-core microprocessors", Proc. International Conference on ASIC (ASICON'13), Shenzhen, China, Oct. 2013.

C7      Y. Chi, S. X.-D. Tan, T. Yu, X. Huang and N. Wong, "Direct finite-element-based solver for 3D-IC thermal analysis via H-matrix representation", Proc. Int. Symposium on Quality Electronic Design (ISQED'14), San Jose, CA, March 2014.

C8      X. Huang, T. Yu, V. Sukharev, S. X.-D. Tan, "Physics-based electromigration assessment for power grid networks", Proc. IEEE/ACM Design Automation Conference (DAC'14), San Francisco, June 2014. (Best Paper Award Nomination, 12 out of 787 submissions, 1.5%)

C9      Z. Liu, X. Huang, V. Sukharev and S. X.-D. Tan, "EM-reliability system modeling and performance optimization for high-performance microprocessors", TECHCON'2014, Austin, TX, Sept. 2014.

C10     V. Sukharev, X. Huang, H. Chen and S. X.-D. Tan, "IR-drop based electromigration assessment: parametric failure chip-scale analysis", Proc. IEEE/ACM International Conf. on Computer-Aided Design (ICCAD'14), San Jose, CA, Nov. 2014.

C11     T. Kim, B. Zheng, H. Chen, Q. Zhu, V. Sukharev and S. X.-D. Tan, "Lifetime optimization for real-time embedded systems considering electromigration effects", Proc. IEEE/ACM International Conf. on Computer-Aided Design (ICCAD'14), San Jose, CA, Nov. 2014.

C12     H. Chen, S. X.-D. Tan, X. Huang, V. Sukharev, "New electromigration modeling and analysis considering time-varying temperature and current densities", Proc. Asia South Pacific Design Automation Conference (ASP-DAC'15), Chiba, Japan, Jan. 2015. (Best Paper Award Nomination)