J. W. McPherson, Ph.D., Evelyn Landman, Shai Cohen, Noam Brousard, Raanan Gewirtzman, Inbar Weintrob, Eyal Fayne, Yahel David, Ph.D., Yuval Bonen, Omer Niv, Shai Tzroia, Alex Burlak

### **ABSTRACT**

This paper describes a reliability degradation modeling and monitoring method based on a combination of novel IC-embedded circuits and off-chip machine learning algorithms that draw inferences from the digital readouts of these circuits during test and operational lifetime. Together, they monitor the margin degradation of an IC, as well as other vital parameters of the IC and its environmental stress. This method enables the prevention of future failures and points to the Physics of Failure, making it possible to estimate the time to failure.

Keywords- Degradation monitoring, BTI, EM, Stress Migration, software platform, Margin Agents, Profiling Agents, machine learning algorithms and estimators, predictive maintenance, reliability, Idsat, Vth, Ioff, HTOL, integrated circuits (IC).

### I. INTRODUCTION

Historically, reliability physics has focused on models for time-to-failure. The time-to-failure models for circuits were generally developed from data gathered on very simple test structures that could be stressed to failure. Today, however, with circuits playing such a critical role in our life support systems (implantable medical devices, auto safety devices, home security devices, etc.), a circuit failure is no longer an option. So what will be the next big step in ensuring that circuit failures never occur? This is a critically important question that must be answered if truly autonomous vehicles are to be fully realized.

The next paradigm shift in reliability assurance is believed to be in the area of degradation monitoring and analysis. It is well known that degradation is a precursor of failure, with many failure mechanisms (HCI, BTI, EM, SM, etc.) showing continuous degradation well in advance of failure [1-5].
Therefore, with relatively small monitoring circuits strategically placed and connected in many locations on the chip, there is no fundamental reason why these monitoring circuits cannot forewarn of chip-circuit degradation and alert the user of impending failure. In this way, actual circuit failure does not have to occur. Even for a mechanism such as TDDB (which usually shows little or no degradation prior to failure), the monitor circuit can provide vital information such as voltage overshoots, duty cycle and local temperature rises, all of which are known to accelerate TDDB [6].

### II. DEGRADATION MONITORING IMPLEMENTATION

The method described in this paper is based on three types of novel IC-embedded circuits with digital readouts, called "Agents": Margin Agents, Profiling Agents, and Operational Agents.

Margin Agents are circuits that monitor the setup timing margin of millions of combinational paths and memory array outputs in an IC, while consuming only minimal silicon area and power, without impacting timing or routability. **Fig. 1** shows the Margin Agent scheme for monitoring sequential and combinatorial logic and memory outputs.

Figure 1: Margin Agent connected to and monitoring a design/circuit

The Margin Agents measure the worst margin of the millions of paths covered by each Margin Agent without interrupting the IC's regular operation, meaning there is no need to switch the IC into a test mode to get this information. The worst-case margin measured by each Margin Agent is stored in the IC and can be read at any time.

The Operational Agents monitor the effects of actual operation of the IC in the system environment: the stress or workload it experiences, the voltage noise, clock characteristics, thermal stress, and other environmental and operational aspects of the IC. One type of Operational Agent, the Workload Agent, provides an integrated stress measure that combines voltage, temperature and toggle rate into one number.
This information can be used in production to compare different testing environments and the stress (as measured by voltage, temperature and toggle rate) they induce in the IC. During lifetime acceleration tests it can be used to correlate degradation to induced stress, while during operational lifetime it can be used to explain degradation in terms of stress.

The Profiling Agents can be divided into two types of circuits, which have common goals and complement each other. The Design Profiling Agents (DPAs) are circuits based on standard cells which mimic the specific design, design tools and methods. The Process Classification Agents (PCAs) are custom circuits that expose the actual transistor electrical parameters and the RC characteristics. These latter Agents, once activated to measure, generate digital data with high correlation to certain electrical parameters, such as Vth, Idsat and Ioff, of that die and of the specific die area they are placed on. These Agents are capable of distinguishing between sets of electrical parameters of different Vth type, effective L, and MOS type (PMOS and NMOS).

The insertion of all the Agents into the IC is done by a specialized IP compiler that analyzes the actual synthesis model of the design and generates RTL code of the three Agent types to be integrated into the design. The Agents are tailored to the specific design and process node, ensuring the best accuracy, coverage and efficiency.

Fig. 2 shows the flow for inferring the transistor electrical parameters. The PCAs' digital readouts obtained in the Post-Silicon (Post-Si) stage are used by machine learning estimators and algorithms, which were previously trained during design using simulations, to output accurate estimates of the transistor electrical parameters. Post-Si refers to the production testing stages (both IC-level and system-level production or lifetime accelerated tests) and to operational lifetime.
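
The train-on-simulation, infer-on-silicon flow described above can be illustrated with a minimal sketch. The paper does not specify the estimator, so ordinary least squares stands in for it here. The function names, the single-readout PCA, and the linear count-versus-Vth relation are all hypothetical, and the data are synthetic.

```python
import random

def train_estimator(sim_readouts, sim_vth):
    """Design stage: least-squares fit of vth ~ slope * readout + intercept."""
    n = len(sim_readouts)
    mean_r = sum(sim_readouts) / n
    mean_v = sum(sim_vth) / n
    cov = sum((r - mean_r) * (v - mean_v) for r, v in zip(sim_readouts, sim_vth))
    var = sum((r - mean_r) ** 2 for r in sim_readouts)
    slope = cov / var
    return slope, mean_v - slope * mean_r

def infer_vth(model, readout):
    """Post-Si stage: map one digital readout from silicon to an estimated Vth."""
    slope, intercept = model
    return slope * readout + intercept

# Synthetic "simulation" data (assumption): a made-up PCA whose digital count
# falls linearly as Vth rises, with a small amount of Gaussian noise.
rng = random.Random(0)
sim_vth = [rng.uniform(0.30, 0.45) for _ in range(200)]  # volts
sim_readouts = [round(5000 - 8000 * v + rng.gauss(0, 5)) for v in sim_vth]

model = train_estimator(sim_readouts, sim_vth)

# In this toy model a count of 1800 corresponds to a true Vth of 0.40 V.
silicon_readout = 1800
estimated_vth = infer_vth(model, silicon_readout)
print(f"estimated Vth = {estimated_vth:.3f} V")
```

Repeating the inference at each readout point of a stress test would give a Vth-versus-time series per die. A production estimator would presumably combine many PCA readouts and use a richer model than a single straight line.
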
Pre-Silicon (Pre-Si) refers to the IC design stage.

Figure 2: PCAs data flow, Pre and Post-Si

By measuring and reading the PCAs data during lifetime acceleration tests such as HTOL and Burn-In, and during lifetime operation, shifts and degradation in the electrical parameters of PMOS and NMOS transistors of different types can be monitored.

The degradation monitoring method is founded on the Margin Agents, which constantly and accurately report the margin of millions of paths in the IC over time, and on the fact that margin (delay) degradation is a good precursor of failure for many failure mechanisms [1-5]. By alerting on different evolving degradation patterns, including degradation rate, absolute margin degradation relative to thresholds, and time-to-failure estimation, actual failure can be avoided in real-life situations.

During lifetime acceleration tests such as HTOL and Burn-In, regularly reading these Agents can provide valuable insight into the modelled rate of margin (delay) degradation. The data collected during the process enables deeper insights: the margin (delay) degradation rate observed in the Margin Agents, together with the separate degradation rates of the PMOS and NMOS transistor electrical parameters observed in the PCAs, during lifetime acceleration tests and operational lifetime, can shed light on the Physics of Failure (PoF) behind the degradation (NBTI, EM, SM, etc.) and allow the optimization of the qualification process and design guardbands.

By using the PCA and DPA readouts in conjunction with different machine learning algorithms and estimators, ICs may be classified into "Families". "Families" are groups of ICs that share similar delay and current consumption parameters. This characteristic is maintained across different voltage or temperature conditions and across the supply chain, while the ICs go through transformations: wafer, packaged IC, and IC assembled on a system.
The parametric closeness of the members of a "Family", together with this invariance characteristic, makes the "Family" a very powerful tool for correlation and binning.

An additional layer of reliability degradation monitoring is provided by the "Family" classification through a novel outlier detection method. ICs from the same "Family" are expected to experience similar degradation per transistor type and similar margin degradation during lifetime acceleration tests. ICs that degrade at a different pace or with a different trend than the other ICs in their Family are singled out as potential outliers.

Completing the solution, Operational Agents are used to correlate noisy environment, thermal stress, temperature profile and workload (general stress) profile with the measured IC delay degradation, relative to its Family classification, during lifetime acceleration tests and operational lifetime. They can provide explanations for the different delay degradations observed between ICs or systems, or even within an IC.

Throughout the lifetime of the product, a software-based platform constantly reads the embedded Agents and feeds their readouts into machine learning algorithms that continuously improve the accuracy and precision of the failure predictions. Correlating readouts across the full population of a specific product further improves the reliability of the platform's predictive maintenance.

### III. DATA/RESULTS AND DISCUSSION

By integrating the Agents discussed above into an IC design, one can achieve degradation modeling and monitoring capabilities during IC lifetime accelerated tests and later during operational lifetime. In this paper we present the method's capabilities, based on silicon experience.

By reading the PCAs during the HTOL readout periods, the electrical parameters of the transistors (Vth, Idsat, Ioff) are estimated as previously explained and plotted over time. **Fig.
3** describes the expected shift over time of Vth in PMOS and NMOS transistors when the underlying degradation mechanism is BTI.

Figure 3: BTI-induced Vth shift (pMOS NBTI vs. nMOS PBTI)

In Fig. 4 we present the method's capabilities by showing the Vth shift over time of the PMOS and NMOS transistors of different Vth and L-effective types, as measured by the PCAs and inferred using the machine learning algorithms, at different readout time periods for several ICs. The actual Vth shift values (Y axis) are not shown.

Figure 4: Inferred Vth shift of PMOS and NMOS after HTOL stress, average for all ICs and all PMOS and NMOS transistor types

The inferred Vth shifts for the PMOS and NMOS transistors indicate that the degradation mechanism is BTI. Based on the measured trend per parameter, the PoF can be deduced and a degradation model can be fitted by finding the coefficients that best describe the degradation trend. Fig. 5 shows the NBTI time dependence and how the degradation model coefficients are found. Applying the coefficient calculation method shown in Fig. 6 to several ICs, the n coefficient obtained was indeed between 0.2 and 0.3, while the $A_0$ obtained differs between ICs, as expected, since it depends on the specific material manufactured.

The Margin Agents' margin (delay) degradation monitoring capabilities were tested and proven during HTOL as well. A path delay degradation trend was found across the millions of paths monitored by the Margin Agents. The degradation curve and rate also suggested that NBTI was the cause of the path delay degradation over time.

Figure 5: Finding degradation model coefficients

Figure 6: Calculating the BTI coefficients

**Fig. 7** shows the delay degradation measured by the Margin Agents after the complete HTOL stress process. The delay degradation percentage values (Y axis) are normalized. Since the Margin Agents cover millions of paths, a comprehensive view of delay degradation is established.
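
The coefficient extraction discussed with Figs. 5 and 6 can be sketched as follows. Since the figures are not reproduced here, the sketch assumes the commonly used BTI power-law form $\Delta V_{th}(t) = A_0 \, t^{n}$, which becomes a straight line in log-log space, $\ln \Delta V_{th} = \ln A_0 + n \ln t$. The function name, the readout times, and the synthetic shift values are illustrative assumptions, not measured data.

```python
import math

def fit_power_law(times_h, delta_vth):
    """Fit delta_vth = a0 * t**n by ordinary least squares on log-log data."""
    xs = [math.log(t) for t in times_h]
    ys = [math.log(d) for d in delta_vth]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    n = sxy / sxx                          # slope of the log-log line
    a0 = math.exp(mean_y - n * mean_x)     # intercept, mapped back from log space
    return a0, n

# Hypothetical HTOL readout times (hours) and synthetic Vth shifts (volts),
# generated from a0 = 2 mV and n = 0.25 purely to exercise the fit.
readout_hours = [24, 48, 168, 500, 1000]
shifts = [0.002 * t ** 0.25 for t in readout_hours]

a0, n = fit_power_law(readout_hours, shifts)
print(f"A0 = {a0:.4f} V, n = {n:.3f}")
```

With real readouts, the fitted n could be compared against the 0.2 to 0.3 range quoted above, and the per-IC $A_0$ values compared across dies. Noisy data would call for more readout points or a weighted fit.
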
The Fig. 7 distribution shows that the delay degradation differs between ICs, and also that different Margin Agents of the same IC show non-uniform delay degradation that depends on the millions of paths monitored by each Margin Agent.

Figure 7: Margin Agents delay degradation distribution at the end of the HTOL stress

In addition, the degradation can be shown on a Margin Map, which reveals degradation-location dependencies in the IC or in a block of the IC, as presented in Fig. 8. Each point in the Margin Map represents a specific Margin Agent location, and each Margin Agent covers millions of gates and paths sampled by the monitored FFs.

Figure 8: Delay degradation map measured by the Margin Agents; green means less delay degradation and red means more delay degradation

Since the Margin Agents provide visibility into the degradation of millions of different paths in the IC, additional interesting insights can be discussed. One example is why specific paths degrade more than others. One possible explanation is the toggle rate exercised on the different paths by the test running during HTOL. It could also be that the $A_0$ for the paths covered by the different Margin Agents is different, since it depends on the specific material manufactured and may vary within the die. Another possibility is that the different kinds of cells/gates, and/or their connectivity topology, that comprise the millions of paths monitored by each Margin Agent could explain the different delay degradation rates of different Margin Agents within the die.

By characterizing hundreds of thousands of the paths covered by each Margin Agent in terms of attributes such as % of L-effective type, % of Vth type, % of RC delay, % of different cell sizes, % of different fanout topologies, etc., we can correlate this set of attributes to the degradation rate observed per Margin Agent. Fig. 9 shows an example of such a correlation of degradation per Margin Agent vs.
monitored paths statistics.

Figure 9: Correlation of delay degradation measured by the Margin Agents vs. monitored paths per Margin Agent statistics for cell Vt type

In addition to degradation monitoring during IC lifetime acceleration tests, the high coverage and non-intrusive nature of the Margin Agents (they operate while the IC is in functional mode, without interfering with regular IC operation) make it possible to monitor the degradation of each IC during operational lifetime. By using the readouts from the Margin Agents during normal IC operation in the field, degradation trends can be quantified and alerts can be provided well before actual failure. These degradation trends can lead to the replacement of a single IC before it fails due to a latent manufacturing defect or a reliability issue, or can point to specific system-level issues that may have emerged over time. In addition, when a large-scale reliability issue arises, the Margin Agents can efficiently separate the affected from the non-affected population, helping to prevent such events or dramatically shorten their time to resolution, and reducing the cost and reputational damage caused by a large-scale recall.

### IV. CONCLUSIONS

We showed a novel, non-intrusive method of delay degradation monitoring, which can be applied while the IC is in operational mode, without the need to interrupt the IC's normal operation. This method also has very high "reliability coverage", covering high percentages of the combinatorial and sequential logic as well as memory outputs within an IC. This method was proven in silicon using lifetime acceleration tests such as HTOL.
In addition, we showed an "Agent Fusion" concept that can provide comprehensive degradation analysis from different aspects: estimation of transistor electrical parameters and their degradation over time; Family classification, which divides ICs into groups with similar parametric behavior so that ICs expected to behave similarly can be correlated; and a workload monitor or stress indicator that can explain why different ICs show different degradation values.

### **REFERENCES**

- [1] J. W. McPherson, Reliability Physics and Engineering, 3rd Ed., Springer Publishing, 2019.
- [2] M. Seok, et al., Recent Advances in In-situ and In-field Aging Monitoring and Compensation for Integrated Circuits, IEEE-IRPS Proceedings, 5C.1-1 (2018).
- [3] G. Park, et al., All Digital PLL Frequency and Phase Noise Degradation Measurements Using Simple On-Chip Monitoring Circuits, IEEE-IRPS Proceedings, 5C.2-1 (2018).
- [4] S. Jagannathan, et al., Design of Aging Aware 5 Gbps LVDS Transmitter for Automotive Applications, 5C.3-1 (2018).
- [5] R. Shah, et al., Investigation of Speed Sensors Accuracy For Process and Aging Compensation, IEEE-IRPS, 5C.6-1 (2018).
- [6] D. Patra, et al., Accelerated BTI Degradation under Stochastic TDDB Effect, IEEE-IRPS Proceedings, 5C.5-1 (2018).
- [7] V. Reddy, et al., Impact of Negative Bias Temperature Instability on Digital Circuit Reliability (2002).

proteanTecs Ltd. 36 Kdoshei Bagdad Dr. Haifa, Israel 33032 www.proteanTecs.com © 2019 proteanTecs. All rights reserved