No Fault Found
A Weibull trend, a located short, a supplier report that said EOS on everything, and a bench that never reproduced it. The investigation found where the failure was. Not what caused it.
There is a joke in failure analysis that has been told so many times it has stopped being funny and started being true.
When a chip fails and nobody can determine why, someone in the room will eventually say: electrical overstress. They will say it with the particular confidence of a person who has found a conclusion that cannot be disproven. They will say it because EOS — electrical overstress — is the universal solvent of semiconductor failure analysis. It dissolves everything. It explains the damage without explaining the cause. It redirects the question from what happened inside the chip to what happened outside it, and what happened outside it is almost always impossible to reconstruct with certainty. EOS is where investigations go when they have nowhere else to go.
This is the story of a case where we refused to go there. And arrived nowhere anyway.
The device had been on the market for less than two months when the returns started coming in.
It was a small mobile product — the kind people carry on their body, handle constantly, hold in their palm through an ordinary day. Inexpensive, high volume, the sort of device where the margin for error in both cost and reliability is narrow. The failure presentation was consistent across every unit we received: the back cover of the device, directly above one specific chip, was deformed. Not cracked, not scratched. Deformed — the matte plastic surface had smoothed and slightly warped, the way a material behaves when it has been heated past its glass transition temperature. The chip underneath was running hot enough to reshape its own enclosure.
The devices either would not power on at all, powered on briefly and died, or powered on and ran so hot they were uncomfortable to hold. We disassembled the units and isolated the chip. When we replaced it, the device returned to full function. When we left it in place and powered the device, the heating began almost immediately. The circuit board around it was intact. The rest of the electronics were fine. The chip was the issue.
What was wrong with the chip was not yet known.
At this point in most investigations the question is straightforward: is this a random event or a systemic one? The answer determines everything — the urgency, the scope, the resources committed, whether the program pauses or continues. Two failures in winter, with cold dry air and carpeted offices and people walking around in wool socks, can reasonably be attributed to electrostatic discharge. ESD is real. A human body crossing a carpeted floor in low humidity can accumulate enough charge to damage sensitive semiconductor junctions. It is not the most common cause of chip failure but it is a legitimate one, and it is the kind of event that produces a small, random, unpredictable cluster of returns that looks alarming and turns out to be noise.
We did not think this was noise.
The field return rate was climbing. Not randomly fluctuating — climbing. We plotted the time-to-failure data and fitted it to a Weibull distribution, which is the standard tool for understanding whether a failure population is random, infant mortality, or wear-out. The Weibull distribution has a shape parameter called beta. When beta equals one, failures are arriving randomly — no trend, no pattern, just the constant background rate of a process with no memory. When beta is less than one, the failure rate is decreasing — early defects washing out, the population getting healthier over time. When beta is greater than one, the failure rate is increasing. Something is accumulating. Something is getting worse.
Our beta was greater than one.
This was not winter static. This was a trend. Something was happening to these chips in the field that was not random and was not the result of customers walking on carpets. The Weibull fit was telling us the population was deteriorating over time in a way that implied a systematic cause. It was not telling us what that cause was.
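For readers who want to see the mechanics, the following is a minimal sketch of one common way to estimate beta from time-to-failure data: median rank regression, the arithmetic behind a hand-drawn Weibull plot. It is not a record of the analysis in this case. The function names, the tolerance band around one, and the days-to-failure values are all hypothetical, and the sketch assumes every unit in the sample has failed.

```python
import math


def weibull_mrr(times):
    """Estimate Weibull shape (beta) and scale (eta) by median rank regression.

    Assumption: a complete sample, i.e. every entry in `times` is an observed
    failure. Units still working in the field (censored data) are NOT handled.
    """
    t = sorted(times)
    n = len(t)
    if n < 2 or t[0] <= 0:
        raise ValueError("need at least two positive failure times")

    # Linearized Weibull CDF: ln(-ln(1 - F)) = beta * ln(t) - beta * ln(eta)
    xs = [math.log(v) for v in t]
    # Bernard's approximation to the median rank of the i-th ordered failure
    ys = [
        math.log(-math.log(1.0 - (i - 0.3) / (n + 0.4)))
        for i in range(1, n + 1)
    ]

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("failure times must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    beta = sxy / sxx                          # slope of the fitted line
    eta = math.exp(mean_x - mean_y / beta)    # from intercept = -beta * ln(eta)
    return beta, eta


def classify_trend(beta, tol=0.1):
    """Translate beta into the plain-language reading used in the text.

    `tol` is an arbitrary illustrative band, not a statistical test.
    """
    if beta < 1.0 - tol:
        return "decreasing"        # infant mortality: early defects washing out
    if beta > 1.0 + tol:
        return "increasing"        # wear-out: something is accumulating
    return "roughly constant"      # random arrivals, no memory


if __name__ == "__main__":
    # Hypothetical days-to-failure values, invented for illustration only.
    days_to_failure = [12, 19, 23, 27, 30, 34, 37, 41, 46, 52]
    beta, eta = weibull_mrr(days_to_failure)
    print(f"beta = {beta:.2f}, eta = {eta:.1f} days, "
          f"failure rate {classify_trend(beta)}")
```

The slope of the fitted line is beta, which is why a steep line on a Weibull plot is bad news. Real field-return data is rarely a complete sample: most shipped units are still working when the analysis is run, and ignoring them biases the estimate. A production analysis would treat those units as censored observations, typically with maximum likelihood estimation, and would report a confidence interval on beta rather than a single number.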
We opened the chips.
Semiconductor failure analysis is a discipline that requires humility before it requires anything else. The structures you are examining are measured in nanometers. The defects that matter are often invisible to any technique available outside a dedicated semiconductor fabrication facility. What we can do — what the best-equipped independent FA lab can do — is find where the damage is, characterize what it looks like, and attempt to determine from its morphology whether it is consistent with electrical stress, thermal stress, a manufacturing anomaly, or contamination. What we cannot reliably do, in most cases, is trace the damage back to its origin event with the certainty required to assign cause.
We used optical methods to identify the active region on the powered chip, the area generating the most heat. We used OBIRCH (optical beam induced resistance change), which maps resistive anomalies by scanning a laser across the die and detecting the change in current caused by the local heating, to locate the site of the short. We found it. A specific location on the die, a specific junction, a measurable resistive path where there should not have been one.
And then we arrived at the question that defines semiconductor failure analysis and that this case never answered.
Was that location the culprit, or the victim?
If it was the culprit — if a manufacturing defect at that junction had caused it to fail, had introduced a weak point that degraded under normal operation until it shorted — then the chip had a reliability problem. The supplier had shipped defective parts. The root cause was in the fabrication process, somewhere in the hundreds of steps required to build a modern semiconductor device, and finding it would require access to the fab's process data, equipment logs, and wafer-level inspection records that we did not have and would not be given.
If it was the victim — if something upstream had stressed that junction beyond its rated limits, had driven a voltage spike or a current surge or a thermal excursion that breached the dielectric and created the short — then the chip was fine and the system around it was the problem. The power circuit. The board design. Something in the operating environment generating conditions the chip was not specified to handle.
The damage pattern at the failure site was consistent with both hypotheses. This is the fundamental problem of semiconductor FA after a thermal event — the damage erases its own history. A junction that has been shorted catastrophically looks the same whether it shorted because it was weak or because it was pushed. The evidence that would distinguish culprit from victim is consumed in the event itself.
We sent units to the supplier.
The reports came back within weeks. Every chip. Every unit. The same conclusion on all of them.
Electrical overstress.
The supplier's reports were technically competent. The analysis was documented. The damage morphology was described accurately. And the conclusion — EOS — was not wrong, exactly. There had clearly been electrical stress. There was clearly a failed junction. Everything they wrote was defensible.
What the reports did not contain was any analysis of why the electrical stress had occurred, where it had come from, or whether the chip's own construction might have made it susceptible. EOS was the finding and EOS was also the endpoint. The reports did not distinguish culprit from victim. They did not need to — EOS, by definition, places the cause outside the chip and inside the system. The customer's system. Our client's product. Our problem, not theirs.
We had expected this. Expecting it did not make it less frustrating.
We ran every test we could think of.
Accelerated life testing. Temperature cycling. Voltage stress. Current stress. Power cycling. Combinations of stress conditions intended to reproduce or accelerate whatever mechanism was killing chips in the field. We tested hundreds of units across multiple production lots. We monitored them continuously. We looked for the Weibull trend to appear in the lab the way it was appearing in the field.
Nothing failed.
Not one chip failed in a way that reproduced the field failure. Not one unit showed the heating pattern. Not one back cover deformed. The failure mode that was appearing in customers' hands within weeks of purchase would not appear on our benches under any condition we could generate. This is its own kind of data, because the absence of replication tells you something, but what it told us in this case was ambiguous. Either the stress condition causing the field failures was something we had not thought to simulate, or the chips that were failing had a characteristic the chips in our test population did not share, or the failure required a combination of conditions that our tests were not reproducing simultaneously.
We could not tell which. Without replication there is no confirmation. Without confirmation there is no root cause. The investigation had found the location of the failure, characterized the damage, established that the failure rate was increasing, and eliminated nothing. Every hypothesis remained open. The case had the shape of an investigation without the substance of one.
The business had been patient. It could not remain patient indefinitely.
The conversation that followed was the kind that happens when an engineering problem exceeds the timeline available to solve it and becomes a business problem instead. The options on the table were roughly what you would expect. Kill the product — absorb the development cost, stop the bleeding, move on. Continue the investigation — more testing, more analysis, third-party semiconductor labs, deeper engagement with the supplier, months of additional work with no guaranteed outcome. Find a second source chip and qualify it, on the theory that if the problem was in the fabrication process it might not exist in a different fab's process. Or ship, with mitigations, and monitor.
Before the business could decide, someone asked a different question.
How hot does it actually get?
Not in the abstract. Not as a reliability metric or a failure rate. In practical terms — if a person were holding this device and the chip entered its failure mode, what temperature would they feel on the back cover, how long would they need to hold it to sustain a burn injury, and what was the probability of that scenario occurring given realistic usage patterns?
This was the reframing that ended the investigation. Not because it answered the question of what was causing the failures — it did not — but because it converted an unanswerable engineering question into an answerable safety question. We could not find the root cause. We could measure the thermal hazard.
We instrumented the back cover above the chip on failed units. We measured surface temperatures under the failure condition. We modeled contact duration against burn injury thresholds, using the Stoll curve, a standard reference for the onset of second-degree burns in skin. We estimated the probability of a user maintaining contact with the hottest area of the back cover for the duration required to cause injury, given how people actually hold and use a device of this form factor. The numbers were uncomfortable but bounded. The risk was not zero. It was also not the kind of risk that, at the field return rate we were seeing, implied imminent widespread harm.
The business made its decision.
They threw the kitchen sink at it.
A second source chip from a different supplier. A revised power circuit with tighter regulation and better transient suppression. A fuse added to the chip's power rail, one that would protect against overcurrent events but would also render the device non-functional if it blew, trading a hot device for a dead one. The chip relocated slightly on the board. The back cover thickened by a fraction of a millimeter to add thermal mass between the chip and the user's hand.
None of these changes were validated against a confirmed root cause, because there was no confirmed root cause. Each one addressed a hypothesis. Collectively they addressed all of them. The revised design went through qualification testing. It passed. The product continued shipping in its new configuration.
We never learned which change had mattered. We never learned if any of them had. The field return rate on the revised product was lower. Whether that was because the second source chip didn't have whatever the first chip had, or because the power circuit was cleaner, or because the fuse was catching something, or because the original issue had been a transient population-level defect that had exhausted itself — we could not say. We still cannot.
The case was closed because the program moved forward. It was not closed because anyone knew what had happened.
No Fault Found is the formal disposition in failure analysis for an investigation that examined the evidence and could not identify a defect. It is sometimes accurate — some returned units genuinely have nothing wrong with them, have been mishandled or misdiagnosed, and show no failure upon examination. But NFF is also, in practice, a filing cabinet. It is where you put the cases that defeated you. The ones where the evidence was insufficient, the tools were inadequate, the supplier was uncooperative, the failure mode unreproducible, the root cause permanently beyond reach.
This case belongs in that cabinet. Not because we did not look. Because the thing we were looking for does not leave the kind of evidence that can be found.
Semiconductor failure analysis operates at the boundary of what is physically knowable. The structures are too small, the events too fast, the damage too consuming. A junction that has failed carries almost no information about why it failed. The supplier knows this. The customer knows this. Everyone in the room knows this, which is why EOS is such a durable conclusion — it is not provably wrong, and in semiconductor FA, not provably wrong is sometimes the best available standard.
What I took from this case was something about the limits of the discipline I practice. Failure analysis is predicated on the assumption that physical evidence survives the failure event in a form that can be interpreted. For most failures, at most scales, this is true. A fractured switch retains its fracture pattern. A corroded connector retains its corrosion products. A damaged connector retains the dent.
A failed semiconductor junction retains almost nothing. The evidence is the damage, and the damage is the end of the story, not the beginning of it.
The chip ran hot. The enclosure deformed. The beta was greater than one. Somewhere in a fabrication process we never had access to, or in a power circuit that never misbehaved on our benches, or in some interaction between the two that we never found the conditions to reproduce — something went wrong. We do not know what. The product was redesigned around our ignorance and shipped anyway.
Culprit or victim. We never found out which.
We never do.