The Dent
Five teams, one switch, and a fracture pattern too consistent to be random. The answer was in the assembly. The argument continued anyway.
Nobody panics quietly in hardware.
The call came in the way these things always do — not as a question but as a declaration. The mechanical switch was failing. Pre-mass production review, light drop testing, a core product feature that was now broken on units that were supposed to be ten million copies away from being someone's problem. The program was weeks from mass production sign-off. The blast radius, as someone put it in the first meeting, was the entire launch.
By the time we were pulled in, the room had already decided several things. The design was bad. The test coverage was insufficient. The reliability team had missed it. The placement of the switch on the board was wrong. The standard operating procedure was inadequate. Everything was wrong, and everything was wrong because of engineering, and engineering needed to fix it immediately because mass production could not wait.
This is a recognizable species of crisis. The feature involved is real — a mechanical switch triggered by physical rotation of the device, the kind of tactile confirmation that tells the firmware something has happened and tells the customer that the product is doing what it should. When it works, nobody notices it. When it fails, it is suddenly the most important thing in the building. And when it fails just before mass production, with tooling locked and schedules committed and supply chain already in motion, the organizational immune response is immediate and not always rational.
We let the noise run for a day before we started asking questions.
The switch itself was straightforward in concept. A small mechanical component mounted on the printed circuit board, positioned adjacent to a rotating assembly. When the rotation occurred, a physical protrusion on the assembly pressed against the switch, actuating it. The product design team had conceived the mechanism. The layout team had determined the switch's position and mounting on the board. The reliability team had validated it through the engineering builds. The process engineering team had written the procedure for assembling the enclosure around it. The factory had executed that procedure.
Five teams. One switch. No single owner.
The device quality team, who had found the failure, had a clear theory: the design was too sensitive, the switch too exposed, the tolerances too tight for a mass production environment. They had run the same drop test that reliability had run months earlier during engineering validation. The switch had passed then. It was failing now. To device quality, the explanation was straightforward — reliability had not done its job, the design had never been truly validated, and engineering owed the program a fix before a single production unit shipped.
Engineering's response was equally straightforward and pointed in the opposite direction. If the design was fundamentally flawed, why had it passed during engineering validation? Why was the failure appearing now, at this stage, on these samples, and not earlier? Either the test was wrong then or the test was wrong now, and either way the question of why the failure had not surfaced until the eleventh hour was a question worth asking before blame was assigned.
Both positions were reasonable. Both were also, in the way of reasonable positions held by people under pressure, slightly more confident than the available evidence warranted.
We started with the data.
The reliability team's early test results were clean. Same test methodology, same drop heights, same orientations, same number of cycles. We checked for engineering changes between the early builds and the pre-production units — changes to the switch specification, the board layout, the enclosure geometry, the assembly procedure. We were looking for the delta. The thing that had changed between passing and failing.
Nothing had changed. Not on paper.
The device quality team had an answer for that too. Functional testing, they pointed out, does not require a teardown. The early builds had been tested functionally — actuate the switch, confirm the firmware response, log the result, move on. Nobody had opened the devices afterward to inspect the switch physically. Which meant, they argued, that the switches in the early builds might already have been cracked. The fractures might already have been there. Functional testing would not have caught it if the switch was damaged but still making contact. The passing results, in other words, proved nothing.
It was a reasonable argument. It was also the kind of argument that is very difficult to refute cleanly, which is part of what made it attractive to the people making it.
We went to look at the failed units.
The fracture pattern was the first thing that stopped us.
In a free drop test, where a device is released from a fixed height and allowed to fall and impact a surface, the direction and magnitude of the force at any given internal component are a function of the drop orientation, the impact dynamics, the device geometry, and a collection of variables that produce, across a sample population, a distribution of outcomes. Some fractures here, some there. Some clean, some ragged. The randomness is not total but it is present. It is the signature of an environmental input that varies.
These fractures did not vary.
Across every failed unit we examined, the fracture pattern on the mechanical switch was consistent. Same origin location. Same propagation direction. Same stress signature. The force that had broken these switches had come from the same direction, at approximately the same magnitude, every time. This was not the signature of a drop. This was the signature of a process.
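If you want that distinction made quantitative, it comes down to dispersion: how much the fracture-origin position varies across the failed population. A minimal sketch of the check, in Python, using entirely hypothetical measurements standing in for the real teardown data:

```python
import statistics

# Hypothetical fracture-origin positions (mm along the switch body),
# standing in for real teardown measurements. A drop-driven failure
# population scatters; a process-driven one clusters tightly.
drop_like = [0.4, 1.9, 0.9, 2.6, 1.2, 0.1, 2.2, 1.6]   # what random impacts tend to produce
observed = [1.31, 1.29, 1.33, 1.30, 1.32, 1.28, 1.31]  # same spot, every unit

def dispersion_mm(sample_mm):
    """Standard deviation of fracture-origin position, in mm."""
    return statistics.stdev(sample_mm)

print(f"drop-like spread: {dispersion_mm(drop_like):.2f} mm")
print(f"observed spread:  {dispersion_mm(observed):.2f} mm")

# A spread this far below the drop-like baseline says the force came
# from the same place, at the same magnitude, every time. That is the
# signature of a repeatable process, not a variable environment.
```

The numbers above are invented for illustration; the real evidence was the teardowns. But the logic is exactly this: variance is the fingerprint of the input.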
We set the drop hypothesis aside and started asking a different question. Not what had broken the switches, but what action — repeatable, directional, consistent — could produce that specific fracture pattern.
The answer was in the assembly.
To enclose the switch within the device, an operator presses two halves of the plastic housing together until they snap and latch. It is a simple operation, the kind performed thousands of times a day on a production line without a second thought. The SOP described it in a few lines. There was a fixture to assist with alignment, though the fixture had tolerances of its own.
We modeled the assembly action. We looked at what happened when the two halves were pressed together slightly off-axis — not dramatically misaligned, not the result of carelessness, just the ordinary variation that exists in any manual assembly operation when the fixture does not fully constrain the motion. When the misalignment was present, one half of the enclosure, as it was pressed closed, did not clear the switch. It impinged on it. The force of closing the housing — the operator pressing the two halves together — was transmitted directly into the switch body.
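The model itself is back-of-the-envelope geometry: an angular misalignment at the fixture becomes lateral travel at the switch, and the question is whether that travel exceeds the clearance the layout allows. A sketch of the calculation, with hypothetical dimensions (the real tolerances and clearances belong to the program, not to this telling):

```python
import math

# Hypothetical geometry, for illustration only. None of these values
# come from the actual program.
CLEARANCE_MM = 0.30     # nominal gap between enclosure half and switch body
LEVER_ARM_MM = 25.0     # distance from the fixture's effective pivot to the switch
FIXTURE_PLAY_DEG = 1.5  # angular play the fixture fails to constrain

def lateral_travel_mm(angle_deg: float) -> float:
    """Lateral displacement at the switch for a small angular misalignment."""
    return LEVER_ARM_MM * math.sin(math.radians(angle_deg))

def clears_switch(angle_deg: float) -> bool:
    """True if the closing enclosure half still clears the switch body."""
    return lateral_travel_mm(angle_deg) < CLEARANCE_MM

for angle in (0.0, 0.5, 1.0, FIXTURE_PLAY_DEG):
    travel = lateral_travel_mm(angle)
    status = "clears" if clears_switch(angle) else "impinges"
    print(f"{angle:>4.1f} deg off-axis -> {travel:.2f} mm of travel, {status}")

# With these numbers, anything past roughly 0.7 degrees of misalignment
# consumes the entire clearance, and the closing force goes into the
# switch body instead of the latch.
```

With those invented values, ordinary hand variation eats the margin well inside the fixture's play. The real analysis was more careful, but the shape of the conclusion was the same.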
We went back to the failed units and looked at the switch housings under magnification.
There it was.
A dent. Small, consistent across units, located on the switch body adjacent to the fracture origin. Not from a drop. Not from rotation. From a finger pressing two pieces of plastic together on a factory floor, slightly off angle, in a way the SOP had not anticipated and the fixture had not prevented.
The switch was not failing in the field. It was failing during build.
We presented the finding. The room received it the way rooms receive findings that are clear but inconvenient — with a brief silence followed by the resumption of the prior argument in a slightly modified form.
The assembly process explanation was accepted, nominally. The SOP would be reviewed. The fixture would be evaluated. Corrective actions would be assigned. Mass production would proceed.
And then device quality made the point they had been waiting to make since the beginning. If the design were less sensitive — if the switch had more clearance, more protection, more margin against exactly this kind of assembly variation — the SOP would not need to be perfect. The fixture would not need to be perfect. The operator would not need to be perfect. A robust design tolerates imperfect assembly. This design did not. Which meant the design was the problem, the assembly process had merely revealed it, and engineering still owed the program an answer.
It was not a wrong argument. It was also not the argument that was going to get resolved in that room, or in any room that week.
The VP made the call that VPs make when the technical answer is clear and the organizational answer is not. Engineering would incorporate design guidelines to avoid switch placements this sensitive to assembly variation in future programs. Quality would revise the SOP and the fixture in coordination with process engineering and the product design team. Both teams would treat this as a shared learning. Nobody would be on record as having caused the problem.
The product launched on schedule.
I have been in many rooms where the root cause was found and the argument continued anyway.
This is one of the things they do not tell you about failure analysis when you are learning it — that the investigation and the resolution are separate events, and the second does not always follow from the first. We found the dent. We traced it to the assembly action. We demonstrated the mechanism with enough clarity that nobody in the room genuinely disputed the physical explanation. And then the room spent another week negotiating something that had nothing to do with physics.
What device quality wanted was not just a root cause. They wanted a finding that reorganized the ownership boundaries — that established, on the record, that engineering had produced something that could not be built reliably, and that the downstream consequences of that were engineering's responsibility to absorb. What engineering wanted was a finding that kept the boundary where it was — that said the design was sound, the process had failed, and the fix belonged to operations. The forensic answer — assembly variation, impingement, a fixture that didn't fully constrain the motion — did not cleanly satisfy either position, which is why neither team fully accepted it as a conclusion.
The VP's resolution was not a technical judgment. It was a political one, and it was probably the right call given what was actually at stake. The program needed to ship. The teams needed to be able to work together afterward. A clean verdict would have required one side to lose, and a side that loses that kind of argument does not forget it.
What I took from this case was something about the nature of ownership in complex systems. The switch had five teams responsible for different aspects of its existence — conception, placement, validation, procedure, execution. In a system with that many owners, the question of who is responsible for a failure is almost always the wrong question. The better question is where the system's assumptions broke down. In this case, the assumption was that a fixture with known tolerances and an SOP with a few lines of description were sufficient to constrain a manual assembly operation to the precision the design required. They were not. Nobody had checked whether they were.
The dent was small. The distance between the enclosure half and the switch body, when the alignment was off, was a fraction of a millimeter. The force required to fracture the switch was not large. Everything about the physical failure was minor, localized, almost trivial in isolation.
What was not trivial was that nobody owned the gap between the design's requirements and the process's capabilities. That gap is where the switch broke. It is also, in my experience, where most things break — not at the center of anyone's responsibility, but at the boundary between one team's assumptions and another team's execution.
Nobody owns the boundary. That is usually where you find the dent.