Sunday, September 16, 2012
Non Carborundum Illegitimi
The statistical work with Printing Systems was not my first time as a statistician. While working in diskette drive manufacturing I did some statistical analysis. Let’s start with that story. I worked on eight-inch diskette drives as a manufacturing engineer. I was responsible for testing all the drives manufactured in Boulder. IBM also had a plant in Italy that manufactured drives. I designed, built, and maintained a tool called ESTAR for Eight Station Test and Repair. This was a console used to operate and test the diskette drives at the end of their production line.
After IBM came out with the PC in 1981, we began work on five-and-one-quarter-inch drives for it. Prior to that we had been developing a four-inch drive, but the IBM team producing the PC in Boca Raton did not want to use such new and untested technology, so they chose the then-industry-standard 5.25-inch drives, purchased outside IBM. When the PC became a large seller, we responded to that business opportunity, and Boulder began designing and building 5.25-inch drives.
One notable fact about these “floppy drives” was that the magnetic read/write head rode directly on the diskette, which could cause problems with wear. Although diskettes were designed to be as smooth as possible, their magnetic coating was primarily iron oxide particles and would act like sandpaper, wearing down the heads. We had done tests on our drives and were satisfied that they would have sufficient life even with this wear.
However, in the last couple of months before our drives would be ready for delivery, we realized that diskettes from different manufacturers had different characteristics affecting head wear. The different processes used by manufacturers to make diskettes produced varying degrees of roughness that the development lab had not expected. We had not tested all the different diskettes, but had focused only on the IBM diskette. That was a big mistake. Once we realized there was so much variability in surface smoothness between diskette manufacturers, we needed to perform a broader test using more types of diskettes.
Since I had just received a Master’s degree in mathematics, and I was responsible for drive testing on the current 8-inch manufacturing line, I was assigned to test the wear characteristics of the 5.25-inch drives with various types of diskettes. I gathered 35 PCs from around the plant, effectively borrowing them for two months. I set them up in a test lab and quickly wrote a program that would move the head to the center of the diskette’s tracks and read data for fifteen seconds. The drive would then reset to track zero, access the middle track again, and read data for another fifteen seconds. This process repeated over and over in a loop. The program kept this up until it detected repeated read failures caused by the recording head wearing out from friction with the diskettes. The program kept track of the results and ran unattended.
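The original test program is long gone, but the loop it describes can be sketched in modern Python. The failure threshold and the stub “drive” below are assumptions for illustration, not the original values or hardware access:

```python
CONSECUTIVE_FAILURES_TO_STOP = 3   # assumed threshold; the real value is not recorded

def run_wear_test(read_middle_track, max_cycles=1000):
    """Exercise loop: seek to the middle track, read for ~15 seconds,
    reset to track zero, and repeat until repeated read failures
    indicate the head has worn out."""
    consecutive_failures = 0
    for cycle in range(1, max_cycles + 1):
        ok = read_middle_track()            # one 15-second read pass
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures >= CONSECUTIVE_FAILURES_TO_STOP:
            return cycle                    # head judged worn out at this cycle
    return None                             # survived the whole run

def make_stub_drive(fail_after):
    """Stand-in for real hardware: reads succeed for `fail_after`
    passes, then fail from that point on."""
    state = {"reads": 0}
    def read():
        state["reads"] += 1
        return state["reads"] <= fail_after
    return read

print(run_wear_test(make_stub_drive(500), max_cycles=600))  # 503
```

The key design point survives the translation: the test stops on *repeated* failures, so a single soft read error does not end a month-long run.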
I had designed both the program and the experiment to measure the wear characteristics of the head, and running the tests on 35 PCs allowed me to test several different brands of diskettes. There were seven primary brands that made up about 90% of the diskettes sold in the US. (The plant in Italy was responsible for testing with European diskettes.) So I put five of each brand of diskette in the computers and started the test.
I was using a statistical method called Weibull analysis to convert the results of the test into a “mean time to failure” which would be a measure of the life of the head with that particular diskette. The Weibull distribution or probability density function is widely used in reliability and life data analysis due to its versatility. Further analysis would combine the results for the different brands of diskettes and produce an overall estimate of life. I already had the extensive test results the development lab had done with one brand of diskette sold under the IBM name.
The parameterized distribution for the data can be used to estimate important life characteristics of the product such as reliability or probability of failure at a specific time, the mean life and the failure rate, and other life characteristics. However, to actually calculate results, the test must continue to its conclusion. That is, you have to run the test until diskette drives start to fail from wear. I didn’t need all 35 PC tests to fail, but I needed over half of them to reach the point of failure before I could produce meaningful results and estimates.
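Once the Weibull shape and scale parameters are estimated, the mean life falls out of the gamma function. A minimal sketch, with illustrative parameter values rather than anything from the actual test:

```python
import math

def weibull_mean_life(beta, eta):
    """Mean time to failure for a Weibull life distribution:
    MTTF = eta * Gamma(1 + 1/beta), where beta is the shape
    parameter and eta the characteristic life."""
    return eta * math.gamma(1.0 + 1.0 / beta)

# beta > 1 indicates wear-out failures (as with head wear);
# beta = 1 reduces to a constant failure rate, where MTTF equals eta.
print(weibull_mean_life(1.0, 1000.0))  # 1000.0
print(weibull_mean_life(2.0, 1000.0))  # ~886.2
```

This is why the shape parameter matters so much: the same characteristic life gives different mean lives depending on whether failures cluster late (wear-out) or arrive at random.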
It was expected from the initial IBM diskette results that it would take over 30 days of continuous running to get to the point where drives were failing. Although we were still several months from the scheduled GA date (General Availability is the point where the product becomes available to customers), management was anxious to have the results as soon as possible.
Things went well for the first couple of weeks, as expected, and then the executives started to get antsy. The vice president of development was an acquaintance of mine, since I had taught him how to use the PC when it first came out. (I taught a series of classes on using the PC at the Boulder plant when the PC was first released by IBM. My students included several site executives.)
He started calling me asking how the test was going. I explained that I would not have any results until drives started to fail. I would need a number of failure data points in order to plot the curve. He asked at what point in time, assuming no failures, the drives would meet the specifications we were aiming for. I replied that it was true that the longer the test ran without failures, the better. However, I could not accurately calculate life until I had a number of failure data points, as the time between failures was an important parameter in the calculation. I needed to know when in the life of the head the peak failures occurred in order to determine the parameters to use in the mathematical model that would predict general head life. That is, the life of an average recording head.
He understood, since he was an engineer too, but that didn’t keep him from calling me several times a week during the latter phase of the testing. Finally the drives started to fail. Once the first drive failed, several others died within days. Over half of the drives reached the point of failure during a one-week period, which was good because it meant the wear mechanism was stable and consistent. As soon as about 70% of the drives had failed, I could plot the Weibull curve, and I calculated a value for head life. The good news was that the results would meet our requirements.
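The “plot the Weibull curve” step can be reproduced with median-rank regression: plot the log of each failure time against ln(−ln(1−F)) and fit a straight line, whose slope is the shape parameter. A sketch in plain Python, using Bernard’s approximation for the median ranks (a standard textbook choice; the original HP-calculator procedure may well have differed):

```python
import math

def fit_weibull(failure_times, sample_size):
    """Median-rank regression on a Weibull plot.
    failure_times: times of the units that failed;
    sample_size: total units on test (failed plus still running)."""
    n = sample_size
    xs, ys = [], []
    for i, t in enumerate(sorted(failure_times), start=1):
        f = (i - 0.3) / (n + 0.4)           # Bernard's median-rank estimate
        xs.append(math.log(t))
        ys.append(math.log(-math.log(1.0 - f)))
    # ordinary least squares: slope is beta, intercept gives eta
    m = len(xs)
    xbar, ybar = sum(xs) / m, sum(ys) / m
    beta = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
            / sum((x - xbar) ** 2 for x in xs))
    intercept = ybar - beta * xbar
    eta = math.exp(-intercept / beta)       # characteristic life
    return beta, eta
```

Because only failed units contribute points while the total sample size sets the ranks, the method copes with a test stopped at roughly 70% failures, which is exactly the situation described above.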
An interesting sidelight was how worn the diskettes themselves were. Not only did the diskette wear the recording head down like sandpaper, but the diskettes themselves lost magnetic coating due to friction with the head. There was a noticeable “rut” in the diskette at the center track from head contact, and yet the diskettes could still be read. We verified that fact by reading diskettes from drives that had failed the test in known-good drives. It was not our intent to measure diskette error, and the test was somewhat unrealistic in that regard since the head stayed primarily in one place on the disk. In normal use the head would read and write on all the tracks, not just ride continuously in one place. An important point in statistics and design of experiments is how well the methodology reflects real-life performance. This experiment was designed to test head wear.
The failed heads were removed and examined under a microscope for additional information about the failures, but we had seen worn-out heads before and there were no surprises in that regard. Still, my tests added to our knowledge of head performance and our manufacturing process. We didn’t want any surprises once thousands of customers started using our drives. The goal was for the diskette drives to function flawlessly for the life of the computer, and from this experimental data we created estimates of how many drives would fail and how often our customer service people would have to replace a diskette drive. IBM was well known for manufacturing highly reliable products, and we were able to verify these drives would not be a problem for our customers.
The mathematics I used to calculate results came right out of a math “cookbook.” Weibull analysis can be done with specialized statistical programs, but I just used an HP programmable calculator and some formulas from a math book. I was given the assignment because my manager knew I had just graduated with a math degree. What he didn’t realize was that my studies had included only one course on statistics, and that was during my undergraduate work. My math degree was in the area of “analysis,” which is a fancy math word for calculus. I had taken lots of calculus, differential equations, vector algebra, and other advanced algebra classes, but I had not trained as a statistician.
Still, I had the basic tools needed to perform statistical analysis. I just had to read the right books and use the right formulas. I returned to statistics at the end of my IBM career. This time I was calculating product quality based on customer data, with the goal of continuous improvement of product quality. I used my statistical analysis to determine product quality over the product’s life and to verify that new products IBM released were better than their predecessors. This involved a lot of metrics and setting product goals, and I was always the “numbers guy” the executives would come to for setting appropriate targets and measuring whether we were meeting them.
I focused on something called RA/MM, which is the number of repair actions performed per machine-month. “Machine-month” is simply the total inventory of a particular printer model in the field in a given month, and a “repair action” is any time a customer support representative had to go out and fix a problem on the machine. Depending on the complexity of the printer, we had a numerical goal for just how often the repairman had to be called. We had several other metrics measuring how easy it was to fix failures on the printers, how long a repair took on average, how much it was costing in parts to maintain the printer, and other measurements that would allow us to determine overall quality and the impact of the printer’s performance on customer satisfaction.
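The RA/MM arithmetic itself is simple division. A small illustration with invented numbers (not actual IBM service data, and the goal value is assumed):

```python
def ra_per_machine_month(repair_actions, installed_machines):
    """RA/MM for one month: repair actions logged that month
    divided by the installed base (machine-months) that month."""
    return repair_actions / installed_machines

# Hypothetical month: 4,000 printers in the field, 120 service calls
rate = ra_per_machine_month(120, 4000)
print(rate)          # 0.03 repair actions per machine-month
print(rate <= 0.05)  # True: within an assumed 0.05 target
```

In other words, at that rate an average machine would need a service call roughly once every 33 months, which is the kind of figure the targets were set against.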
My job was to work with product engineering to set these targets and then work with field service to gather data and verify that the printer was meeting them. If it wasn’t, we would plan some sort of action to rectify the problem, perhaps a redesign of a part or a switch to a different supplier for more reliable components. It was a continual process of setting goals and monitoring performance against those quality targets.
Think of how often you have to make repairs on your car or imagine the lonely Maytag repairman who has nothing to do since the washing machine is so reliable. That is what I’m talking about.
I’ll save that tale for another telling.