Compact Thermal Modeling and Simulations, Software Thermal Sensor Techniques for Multi-Core and High Performance Integrated Systems

 

 


 

Principal Investigators:

 

Dr. Sheldon Tan (PI),

Dr. Yingbo Hua (co-PI).

 

Graduate Students:

 

                Hai Wang, Zou Liu, Duo Li, Thom Eguia, Ruijing Shen, Shengyang Xu, Wei Wu, Pu Liu

 

Funding:

 

We thank the following funding agencies for their generous support of this project.

·         National Science Foundation, “Fast Software Thermal Sensing and Control for Efficient Dynamic Thermal Management”, (CCF-0541456), 7/1/2006-6/30/2009, co-PI: Sheldon Tan, PI: Jun Yang.

·         National Science Foundation,  “Parameterized Architecture-Level Thermal Modeling and Characterization for Multi-Core Microprocessor Design”, (CCF-0902885), 8/1/09-7/31/12, PI: Sheldon Tan, co-PI: Yingbo Hua.

·         Semiconductor Research Corporation, “Parameterized Architecture-Level Thermal Modeling and Characterization for Multi-Core Microprocessor Design”, NSF/SRC Multi-core Program (SRC 2009-TJ-1991), Aug. 1, 2009 to July 30, PI: Sheldon Tan, co-PI: Yingbo Hua

·         UC MICRO Program (via Intel Corporation) (#07-101), “Parameterized Thermal Behavioral Modeling and Simulation for Designing System Platforms”, Sept. 2007 to Aug. 2008, PI: Sheldon Tan

·         UC MICRO Program (via Intel Corporation) (#08-11), “Parameterized Thermal Behavioral Modeling and Simulation for Designing System Platforms”, Sept. 2008 to Dec. 2009, PI: Sheldon Tan

·         Intel Corporation, Nov. 2009 to Dec. 2010, PI: Sheldon Tan

 

Project Descriptions

 

As more devices are integrated into a single chip with ever-increasing functionality, today's chips run very hot. A Pentium 4 processor typically dissipates ~70 W within a 3.2 x 3.2 cm² die, and local temperatures can quickly rise above the boiling point of water if the cooling system (which consumes power as well) is not efficient. These trends result in chip-level heat fluxes of over 100 W/cm² in some applications. Designing the package for worst-case scenarios is too expensive, since it dramatically increases the cooling and packaging cost even though the worst cases are rare. With 70-125°C as the maximum allowable chip temperature in many applications, effective online thermal management and regulation is a key enabler for next-generation integrated electronic systems. Excessive local thermal stress creates reliability problems for the entire system, speeds up the depreciation of expensive computing equipment, reduces processor speed, and causes significant leakage power consumption, because the sub-threshold leakage of CMOS devices depends strongly on the substrate temperature. The Semiconductor Industry Association roadmap continues to identify thermal management as one of the five key challenges for achieving the projected performance goals of the industry during the next decade.

In this project, we investigate efficient thermal simulation and modeling techniques for architecture-level thermal profile estimation and hot-spot identification. These techniques guide thermal-aware chip and package design and provide fast thermal estimation for on-chip dynamic thermal management and regulation, mitigating the growing thermal crisis in today's multi-core microprocessors and high-performance integrated systems. We envision a fast thermal simulator that can serve as a soft thermal sensor for efficient on-chip dynamic thermal management, mitigating the problems of physical sensors.

To address thermal estimation at the architecture and package levels, we need to address the following aspects: (1) thermal system modeling and characterization, (2) fast thermal simulation and analysis techniques, and (3) accurate power estimation at the architecture level.

In the past several years, our group has made a number of contributions to these critical areas. First, in the fast thermal simulation and analysis area (2), we proposed a fast thermal moment matching (TMM) algorithm [J1, C1, C2], which performs transient thermal analysis in linear time and has proven well suited for online thermal management and regulation [J2]. TMM has been used at Intel Corporation for fast package-level thermal simulation and thermal modeling. We further developed another fast thermal analysis approach, FEKIS, which combines two existing numerical techniques: an extended Krylov subspace reduction technique, which reduces the complexity of the thermal circuit, and a large-step integration method, which exploits the piecewise-constant power inputs typical of architecture-level power traces. The resulting method is 10X faster than TMM and is better suited for chip-level thermal analysis [J3, C7].
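The large-step integration idea can be pictured on a toy model. The sketch below is not the FEKIS implementation: it assumes a single thermal node with a hypothetical thermal resistance r_th and capacitance c_th, and it omits the Krylov subspace reduction entirely. It only shows why a piecewise-constant power trace allows one exact step per power segment, and compares the result against a small-step forward Euler reference.

```python
import math


def simulate_piecewise_constant(power_trace, r_th, c_th, t_amb, t_init):
    """Advance a one-node thermal RC model, one exact step per power segment.

    Model (an assumption for illustration): c_th * dT/dt = P - (T - t_amb) / r_th.
    power_trace is a list of (duration_s, power_w) segments. Power is constant
    inside a segment, so the solution there is a single exponential and the
    step can be as long as the segment itself.
    """
    tau = r_th * c_th  # thermal time constant in seconds
    temp = t_init
    temps = [temp]
    for duration, power in power_trace:
        t_steady = t_amb + r_th * power  # steady state for this segment
        temp = t_steady + (temp - t_steady) * math.exp(-duration / tau)
        temps.append(temp)
    return temps


def simulate_forward_euler(power_trace, r_th, c_th, t_amb, t_init, dt):
    """Reference solution: many small forward Euler steps over the same trace."""
    temp = t_init
    temps = [temp]
    for duration, power in power_trace:
        for _ in range(int(round(duration / dt))):
            temp += dt * (power - (temp - t_amb) / r_th) / c_th
        temps.append(temp)
    return temps


# Hypothetical trace: three segments of (seconds, watts).
trace = [(0.5, 20.0), (1.0, 5.0), (0.2, 40.0)]
large_step = simulate_piecewise_constant(trace, 1.5, 0.2, 45.0, 45.0)
small_step = simulate_forward_euler(trace, 1.5, 0.2, 45.0, 45.0, 1e-4)
```

Here the large-step version evaluates three exponentials, while the Euler reference takes 17,000 steps to reach comparable segment-end temperatures.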

Second, in the thermal system modeling and characterization area (1), we proposed a host of new methods to address the thermal modeling problem at the architecture and package levels. This research is concerned with building compact thermal circuits and systems (instead of solving the partial differential equations of thermal diffusion) to facilitate fast thermal simulation without loss of accuracy. Unlike traditional methods, we build the thermal circuit systems from given power and temperature information obtained from field solvers or measured data. That is, we build behavioral thermal models without regard to the non-essential physical properties of a thermal system. We proposed a pencil-of-function (POF) based thermal modeling technique for step-function power inputs. The resulting technique, called ThermPOF [J4, C5, C6], builds the transfer functions from a step power input and the given temperature response. We further extended ThermPOF into ParThermPOF, which accounts for varying package parameters such as the thermal conductivity of the heat sink and the temperature at different locations of the sink [J6, C8].
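To give a feel for extracting a transfer function from a step response, here is a minimal sketch for the simplest possible case: a single pole. It is not the ThermPOF code. The function name, the one-pole response model, and the sample values are assumptions made for illustration. For model order one, the matrix pencil formed from the shifted sample sequences reduces to a scalar least-squares ratio, which is what the code computes.

```python
import math


def fit_one_pole_step_response(temps, dt, power_step):
    """Fit T(t) = T0 + R * P * (1 - exp(-t / tau)) to uniformly spaced samples.

    Assumes temps[0] is sampled at the instant the power step P is applied.
    The successive differences d[k] = T[k+1] - T[k] then form a geometric
    sequence d[k] = d[0] * z**k with z = exp(-dt / tau). The least-squares
    ratio between the sequence and its one-sample shift recovers z, which is
    the order-one special case of a matrix-pencil pole estimate.
    """
    diffs = [b - a for a, b in zip(temps, temps[1:])]
    head, tail = diffs[:-1], diffs[1:]
    z = sum(h * t for h, t in zip(head, tail)) / sum(h * h for h in head)
    if not 0.0 < z < 1.0:
        raise ValueError("samples do not look like a decaying one-pole response")
    tau = -dt / math.log(z)                        # time constant in seconds
    r_th = diffs[0] / (power_step * (1.0 - z))     # DC gain in K/W
    return r_th, tau


# Synthetic step response with hypothetical values R = 2 K/W, tau = 50 ms.
samples = [40.0 + 2.0 * 10.0 * (1.0 - math.exp(-k * 0.01 / 0.05)) for k in range(30)]
r_fit, tau_fit = fit_one_pole_step_response(samples, 0.01, 10.0)
```

The fitted pair (r_fit, tau_fit) defines the one-pole transfer function R / (1 + s * tau) from power to temperature rise. A practical package model needs several poles and therefore a true matrix pencil computation.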

To remove the restriction that power inputs must be step functions in the ThermPOF method, we proposed a new thermal modeling technique based on the recently proposed subspace identification method. The new method, called ThermSID, allows arbitrary power inputs as the training data. However, overfitting (the model may capture non-essential measurement errors instead of real system information) typically plagues such identification methods. We proposed a cross-validation-like method to mitigate the overfitting issue in ThermSID [C10, C11, J7]. The subspace-based approach, however, may suffer from a predictability problem when the practical power inputs are spatially correlated. Our further study shows that there is a theoretical spatial rank requirement (the rank of the signals among the different correlated power inputs) that must be met to ensure model predictability. Building on this, we developed a new algorithm that generates independent power maps to meet the spatial rank requirement and automatically selects the order of the resulting thermal models for given error bounds [C15, C16].
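One simple way to picture the spatial rank of a set of power inputs is the numerical rank of the matrix whose rows are the per-unit power traces. The sketch below works under that assumption and is not the algorithm of [C15, C16]. The function name, the absolute tolerance, and the sample traces are hypothetical. It shows that fully correlated traces collapse to rank one, so they cannot excite each input path of a model independently.

```python
def numerical_rank(rows, tol=1e-9):
    """Rank of a small dense matrix via Gaussian elimination with partial pivoting.

    rows: one list per power input (for example, one per core), holding its
    samples over time. tol is an absolute threshold below which a pivot is
    treated as zero.
    """
    mat = [[float(v) for v in row] for row in rows]
    n_rows = len(mat)
    n_cols = len(mat[0]) if mat else 0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the largest remaining entry in this column as the pivot.
        pivot = max(range(rank, n_rows), key=lambda r: abs(mat[r][col]))
        if abs(mat[pivot][col]) <= tol:
            continue  # no usable pivot in this column
        mat[rank], mat[pivot] = mat[pivot], mat[rank]
        for r in range(rank + 1, n_rows):
            factor = mat[r][col] / mat[rank][col]
            for c in range(col, n_cols):
                mat[r][c] -= factor * mat[rank][c]
        rank += 1
    return rank


# Three inputs that are scaled copies of each other (fully correlated).
correlated = [[1, 2, 3, 4], [2, 4, 6, 8], [0.5, 1, 1.5, 2]]
# Three inputs that vary independently.
independent = [[1, 2, 3, 4], [4, 1, 0, 2], [0, 3, 1, 5]]
rank_correlated = numerical_rank(correlated)
rank_independent = numerical_rank(independent)
```

With three inputs, training data like the first set cannot separate the contribution of each input, while the second set has full rank.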

We are also working on composite (composable) thermal modeling techniques, in which individual thermal models are connected electronically according to their physical connections in the larger structure to build a large thermal system for fast thermal validation. We will start from accurate finite difference or finite element methods and build the compact thermal models with novel reduction techniques [C14].

Recently we proposed a new method, called FRETEP, to accurately estimate and predict the full-chip temperature at runtime under more practical conditions: an inaccurate thermal model, less accurate power estimates, and a limited number of on-chip physical thermal sensors. First, we propose a new thermal-sensor-based error compensation method to correct the errors due to inaccuracies in the thermal model and power estimates. Second, we introduce a new correlation-based method to estimate the error compensation with a limited number of thermal sensors. Third, we optimize the compact modeling technique and integrate it into the error compensation process so that thermal estimation with error compensation can be performed at runtime. Finally, to enable accurate temperature prediction for emerging predictive thermal management, we design a full-chip thermal prediction framework that employs a time series prediction method [C17].

To address the power estimation aspect of this project (3), our group has also made a number of contributions. First, we proposed a new unit power estimation method based on the total power and access counts in modern single-core and multi-core microprocessors [C3].
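As a rough illustration of recovering per-unit power from total power and access counts, the sketch below assumes a purely linear model in which total power is a weighted sum of the access counts of two functional units, and fits the weights by least squares. This linear model, the two-unit restriction, and all names and numbers are assumptions made for illustration. The method in [C3] may differ.

```python
def estimate_unit_weights(access_a, access_b, total_power):
    """Least-squares fit of total_power[k] ~ w_a * access_a[k] + w_b * access_b[k].

    Solves the 2x2 normal equations with Cramer's rule. Each list holds one
    value per sampling interval.
    """
    s_aa = sum(a * a for a in access_a)
    s_ab = sum(a * b for a, b in zip(access_a, access_b))
    s_bb = sum(b * b for b in access_b)
    s_ap = sum(a * p for a, p in zip(access_a, total_power))
    s_bp = sum(b * p for b, p in zip(access_b, total_power))
    det = s_aa * s_bb - s_ab * s_ab
    if abs(det) < 1e-12:
        raise ValueError("access counts are linearly dependent; weights are not identifiable")
    w_a = (s_ap * s_bb - s_bp * s_ab) / det
    w_b = (s_bp * s_aa - s_ap * s_ab) / det
    return w_a, w_b


# Hypothetical access counts per interval for two units.
counts_a = [100, 250, 50, 400]
counts_b = [30, 10, 80, 60]
# Synthetic total power generated from known weights 0.02 and 0.05.
measured = [0.02 * a + 0.05 * b for a, b in zip(counts_a, counts_b)]
w_a, w_b = estimate_unit_weights(counts_a, counts_b, measured)
# Per-interval power attributed to each unit.
power_a = [w_a * a for a in counts_a]
power_b = [w_b * b for b in counts_b]
```

The fit is only identifiable when the access counts of the units do not move in lockstep, which is why the code rejects a near-zero determinant.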

Another important contribution is statistical leakage power analysis. One profound change in chip design is that engineers can no longer reproduce a design precisely in silicon. Manufacturing process variations have started to play a big role, and their influence on a chip's performance, yield and reliability has become significant. Leakage power is especially sensitive to process variations because it changes exponentially with channel lengths and threshold voltages. When process variations are considered, especially in the presence of spatial correlations among the process variables, computing the important statistical information (mean, variance) requires non-linear (quadratic) time. We first addressed this efficiency problem through variable reduction via principal component analysis (PCA) and orthogonal polynomial representations [J5, C9]. This approach works quite well if the spatial correlations are strong, but it is less effective when they are weak. To address this outstanding issue, we further proposed, for the first time, a linear-time algorithm using the virtual grid technique (originally proposed by IBM for statistical timing analysis) [C12, C13]. The resulting algorithm has linear time complexity for both weak and strong correlations, has very good accuracy, and is many orders of magnitude faster than the state-of-the-art methods.
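The quadratic cost mentioned above can be seen in a small sketch. It assumes a common first-order model in which the leakage of gate i is exp(X_i) and the X_i are jointly Gaussian with given means and covariances. The model and the names are assumptions, not the formulation of [J5] or [C12, C13]. Under that model the mean of the total leakage needs one term per gate, while the variance needs one covariance term per pair of gates.

```python
import math


def total_leakage_stats(mu, cov):
    """Mean and variance of sum_i exp(X_i) for jointly Gaussian X.

    mu[i] is the mean of X_i and cov[i][j] the covariance of X_i and X_j.
    Uses the log-normal identities
        E[exp(X_i)] = exp(mu_i + cov_ii / 2)
        Cov(exp(X_i), exp(X_j)) = E[exp(X_i)] * E[exp(X_j)] * (exp(cov_ij) - 1)
    """
    means = [math.exp(m + 0.5 * cov[i][i]) for i, m in enumerate(mu)]
    mean_total = sum(means)  # linear in the number of gates
    var_total = 0.0
    for i in range(len(mu)):  # every pair contributes: quadratic in the number of gates
        for j in range(len(mu)):
            var_total += means[i] * means[j] * math.expm1(cov[i][j])
    return mean_total, var_total


# Two gates with the same variance, first uncorrelated, then fully correlated.
mean_u, var_u = total_leakage_stats([0.0, 0.0], [[0.04, 0.0], [0.0, 0.04]])
mean_c, var_c = total_leakage_stats([0.0, 0.0], [[0.04, 0.04], [0.04, 0.04]])
```

Correlation leaves the mean unchanged but doubles the variance in this two-gate case, which is why spatial correlation cannot be ignored and why the pairwise sum becomes the bottleneck for a full chip.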

In addition to the thermal estimation problem, we have also addressed thermal-aware design to improve the reliability of on-chip caches in high-performance microprocessors [C4].


A quad-core architecture


Different thermal sensors in the Sink in a quad-core architecture.


Quad-core temperature with Aluminum sink


Parameterized thermal modeling (two parameters – distance in sink and thermal conductivities)


Temperature distributions on a 16-core architecture

 

 

Invited Presentations by Dr. Sheldon Tan

 

·         Computer Science and Engineering Colloquium, UCR, “Architecture-level thermal and power modeling and simulation for high performance microprocessor”, May 21, 2007.

·         Beijing Normal University, Beijing, China, “Architecture-level power modeling and thermal estimation for high performance microprocessor designs”, July 9, 2007.

·         Tsinghua University, Beijing, China, “Architecture-level power modeling and thermal estimation for high performance microprocessor designs”, July 18, 2007.

·         Electrical Engineering Colloquium, UCR, “Architecture level power, thermal modeling, and reliable cache design for high-performance multi-core microprocessors”, Oct. 22, 2007.

·         Fudan University, Shanghai, China, “Architecture-level Thermal Modeling and Simulation for Chip-Multiprocessor Designs”, July 10, 2008.

·         Intel Corporation, Corporate Technology Group, Hillsboro, OR, “Architecture-level Thermal Modeling and Simulation for Multi-Core Architecture Design”, Oct. 17, 2008.

·         International Workshop on Emerging Circuits and Systems (IWECS’09), Shanghai, China, “Chip-Level parameterized thermal modeling for multi-core microprocessor design”, July 6, 2009.

·         2nd Nanoelectronics and Advanced Design Seminar at INAOE (National Institute of Astrophysics, Optics and Electronics), Puebla, Mexico, “Architecture-level Thermal Modeling and Simulation for Multi-Core Chip Design”, May 21, 2010.

·         International Workshop on Emerging Circuits and Systems (IWECS’10), Hefei, China, “Composable Thermal Modeling for Multicore Microprocessor Design”, August 5, 2010.

·         University of Electronic Science and Technology of China (UESTC), Chengdu, China, “Thermal Modeling and Estimation for Multi-Core Microprocessor Design”, August 10, 2010.

·         Intel Corp. Chandler, AZ, ATTD Group, “Chip-Level Thermal Modeling and Characterizations for Single and Multi Core Processor Designs”, Sept. 13, 2010.

 

Software Download

 

The software package for ThermPOF, the matrix-pencil-based thermal behavioral modeling technique, can be found here.

The software package for ThermSID, the subspace-identification-based thermal behavioral modeling technique, can be found here.

The software packages for ThermalSubCP, which performs thermal modeling considering realistic power maps, and ThermalSubPWL, which models nonlinear thermal behavior using the subspace identification method, can be found here.

 

Relevant Publications

Journal publications

J1.      P. Liu, H. Li, L. Jin, W. Wu, S. X.-D. Tan and J. Yang, “Fast thermal simulation for runtime temperature tracking and management”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 12, pp. 2882-2893, 2006.

J2.      W. Wu, L. Jin, J. Yang, P. Liu and S. X.-D. Tan, “Efficient power modeling and software thermal sensing for runtime temperature monitoring”, ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 12, no. 3, August 2007.

J3.          S. X.-D. Tan, P. Liu, L. Jiang, W. Wu, M. Tirumala, “A fast architecture-level thermal analysis method for runtime thermal regulation”, ASP Journal of Low Power Electronics (JOLPE), vol. 4, no. 4, August, pp.139-148, 2008.

J4.          D. Li, S. X.-D. Tan, E. H. Pacheco, M. Tirumala, “Architecture-level thermal characterization for multi-core microprocessors”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems (TVLSI), vol. 17, no. 10, pp. 1495-1507, October 2009.

J5.          R. Shen, S. X.-D. Tan, N. Mi and Y. Cai, “Statistical modeling and analysis of chip-level leakage power by spectral stochastic method”, Integration, The VLSI Journal, vol. 43, no. 1, pp. 156-165,  January 2010. (online permanent DOI Link)

J6.          D. Li, S. X.-D. Tan, E. H. Pacheco, M. Tirumala, “Parameterized architecture-level thermal modeling for multi-core microprocessors”, ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 15, no. 2, pp. 1-22, February 2010 (one of top 10 downloaded ACM TODAES Articles published in 2010).

J7.          T. Eguia, S. X.-D. Tan, R. Shen, D. Li, E. H. Pacheco, M. Tirumala, L. Wang, “General parameterized thermal modeling for high-performance microprocessor design”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems (TVLSI), vol. 20, no. 2, pp. 221-224, Feb. 2012. doi: 10.1109/TVLSI.2010.2098054.

J8.          H. Wang, S. X.-D. Tan, D. Li, A. Gupta, Y. Yuan, “Composable Thermal Modeling and Simulation for Architecture-Level Thermal Designs of Multi-core Microprocessors”, ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 18, no. 2, March 2013.

J9.          Z. Liu, S. X.-D. Tan, H. Wang, Y. Hua, and A. Gupta, “Compact thermal modeling for packaged microprocessor design with practical power maps”, Integration, The VLSI Journal, (in press).

Conference publications

 

C1        H. Li, P. Liu, Z. Qi, L. Jin, W. Wu, S. X.-D. Tan, and J. Yang, “Efficient thermal simulation for run-time temperature tracking and management”, in Proc. Int. Conf. Computer Design (ICCD), pp.130-133, San Jose, CA 2005.

C2        P. Liu, Z. Qi, H. Li, L. Jin, W. Wu, S. X.-D. Tan and J. Yang, “Fast thermal simulation for architecture level dynamic thermal management”, Proc. IEEE/ACM International Conf. on Computer-Aided Design (ICCAD), pp.639-644, San Jose, CA, Nov. 2005.

C3        W. Wu, L. Jin, J. Yang, P. Liu and S. X.-D. Tan “A systematic method for functional unit power estimation in microprocessors”, Proc. IEEE/ACM Design Automation Conference (DAC’06), pp.554-557, CA, 2006.

C4        W. Wu, J. Yang, S. X.-D. Tan, S.-L. Lu, “Improving the reliability of on-chip caches under process variations”, in Proc. Int. Conf. Computer Design (ICCD), pp. 325-332, Lake Tahoe, CA, 2007.  Best Paper Award (<2%).

C5        D. Li, S. X.-D. Tan, and M. Tirumala, “Architecture-level thermal behavioral modeling for quad-core microprocessors”, IEEE International Workshop on Behavioral Modeling and Simulation (BMAS), pp. 22-27, San Jose, CA, Sept. 2007.

C6        D. Li, S. X.-D. Tan, and M. Tirumala, “Architecture-level thermal behavioral characterization for multi-core microprocessors”, Proc. Asia South Pacific Design Automation Conference (ASP-DAC’08), pp. 456-461, Seoul, Korea, Jan. 2008.

C7        P. Liu, S. X.-D. Tan, W. Wu and M. Tirumala, “FEKIS: A fast architecture-level thermal analyzer for online thermal regulation”, Proc. IEEE/ACM International Great Lakes Symposium on VLSI (GLSVLSI’08), pp. 411-416, Orlando, 2008.

C8        D. Li, S. X.-D. Tan, E. H. Pacheco, M. Tirumala, “Parameterized transient thermal behavioral modeling for chip multiprocessors”, Proc. IEEE/ACM International Conf. on Computer-Aided Design (ICCAD), pp. 611-617, San Jose, CA, Nov. 2008.

C9        R. Shen, N. Mi, S. X.-D. Tan, Y. Cai, X. Hong, “Statistical modeling and analysis of chip-level leakage power by spectral stochastic method”, Proc. Asia South Pacific Design Automation Conference (ASP-DAC’09), pp. 161-166, Yokohama, Japan, Jan. 2009.

C10    T. Eguia, S. X.-D. Tan, E. H. Pacheco, M. Tirumala, “Architecture level thermal modeling for multi-core systems using subspace system method”, in Proc. International Conference on ASIC (ASICON’09), pp. 714-717, Changsha, China, Oct. 2009. (Invited).

C11    T. Eguia, S. X.-D. Tan, R. Shen, E. H. Pacheco, M. Tirumala, “General behavioral thermal modeling and characterization for multi-core microprocessor design”, Proc. Design, Automation and Test in Europe (DATE'10), Dresden, Germany, pp.1136-1141, March 2010. 

C12    R. Shen,  S. X.-D. Tan, J. Xiong, “A linear statistical analysis for full-chip leakage power with spatial correlation”, Proc. IEEE/ACM International Great Lakes Symposium on VLSI (GLSVLSI’10), pp.27-232, Providence, RI, May, 2010.

C13    R. Shen,  S. X.-D. Tan, J. Xiong, “A linear algorithm for full-chip statistical leakage power analysis considering weak spatial correlation”, Proc. IEEE/ACM Design Automation Conference (DAC’10),  pp.481-486, Anaheim, CA, 2010.

C14    H. Wang, D. Li, S. X.-D. Tan, M. Tirumala and A. X. Gupta, “Composable thermal modeling and characterization for fast temperature estimation”, Conference on Electrical Performance of Electronic Packaging and Systems (EPEPS), Austin, TX, Oct. 2010.

C15  Z. Liu, S. X.-D. Tan, H. Wang,  R. Quintanilla and A. Gupta, “Compact thermal modeling for package design with practical power maps”, 1st International IEEE Workshop on Thermal Modeling and Management: Chips to Data Centers (TEMM), Orlando, FL, July, 2011.

C16 Z. Liu, S. X.-D. Tan, R. Quintanilla and A. Gupta, “Compact behavioral thermal modeling for microprocessor design with spatially correlated power inputs”, TECHCON, Austin, 2011.

C17 H. Wang, S. X.-D. Tan, G. Liao, R. Quintanilla and A. Gupta, “Full-chip runtime error-tolerant thermal estimation and prediction for practical thermal management”,  Proc. IEEE/ACM International Conf. on Computer-Aided Design (ICCAD), San Jose, CA, Nov. 2011.

C18 H. Wang, S. X.-D. Tan, X. Liu, A. Gupta, “Runtime power estimator calibration for high-performance microprocessors”, Proc. Design, Automation and Test in Europe (DATE'12), pp.352-357, Dresden, Germany, March 2012. 

C19 Z. Liu, S. X.-D. Tan, H. Wang,  Y. Hua,  and A. Gupta, “Compact nonlinear thermal modeling of packaged microprocessors”, TECHCON’2012 , Austin, TX,  Sept. 2012.

C20 S. Xu, Y. Hua, and S. X.-D. Tan, “Thermal modeling and temperature prediction using least square model averaging with model screening”, TECHCON’2012 , Austin, TX,  Sept. 2012.

C21 Z. Liu, S. X.-D. Tan, H. Wang, A. Gupta, and S. Swarup, “Compact nonlinear thermal modeling of packaged integrated systems”, Proc. Asia South Pacific Design Automation Conference (ASP-DAC’13), pp. 157-162, Yokohama, Japan, Jan. 2013.

C22 H. Wang, S. X.-D. Tan, S. Swarup, and X. Liu, “A power-driven thermal sensor placement algorithm for dynamic thermal management”, Proc. Design, Automation and Test in Europe (DATE'13), pp.1215-1220, Grenoble, France, March 2013.