Tuesday, December 14, 2010

Machine Vision and Microsoft’s New Kinect

My first experience with Machine Vision was in 1983. Let me define terms. “Machine Vision” is the process of using a computer to analyze visual data in the same way that the human brain analyzes the information fed to it by the eyes. I was working in magnetic recording head manufacturing. At that time magnetic recording heads were “glass sandwiches.” The process involved building up the pieces one step at a time. At each step new glass was applied, using a formula of glass that melted at a slightly lower temperature than the previous glass. So you would take glass, add metal, melt, add more glass, melt the new glass, add more metal and glass, melt the latest layer. Thus were built up these miniature electronic devices that contained the magnetic recording and reading elements embedded in glass.

This process was not only time consuming and expensive, but the heads were too large for the next generation of miniaturized disk drives. So IBM pioneered (as it had so often done in extending the art of computer engineering) something called “thin film heads.” These state-of-the-art (for the 1980s) recording heads were made with a process more like the one used to make transistors and integrated circuits. And they shared a problem with integrated circuits: yield. That is, not all the heads were good. Now that is OK. If you can cheaply, and at one time, make over 100 heads, it doesn’t matter if only 60 are good. But it does create the problem of determining which are good.

Of course, you could install the heads in a drive and test them that way, but that was way too expensive and wasteful. We did test these heads 100% once they were in a drive. In fact, that was my job, and my creation, the ESTAR (Eight Station Test and Repair) machine, did just that. I was a test engineer and testing was my game.

But what we wanted to do was visually inspect the heads and eliminate as many of the bad heads before assembly as possible. Ideally we wanted to inspect the heads while they were still all attached to the same substrate called the “wafer.” That was done with a microscope viewed on a TV screen and a motorized stage that moved the heads so they could each be inspected one by one. The inspector could press a button while viewing a head and a dot of ink was dropped on the bad head. Press another button and the stage automatically repositioned to the next head for inspection. After inspection was completed, a machine would cut the wafer into individual heads and sort them based on the drop of ink: bad heads in the trash, good heads move down the line for assembly.

Now it was a very boring and time consuming job to inspect the heads manually, so — as the test engineer with an electronics engineering degree, design experience, knowledge of programming, and advanced math skills — naturally I got the job of creating a computerized head inspection tool.

This was my first experience with “Artificial Intelligence” or AI. That is the area of computer science interested in making computers “think” like humans do, learn like humans do, and in general duplicate human thought processes. AI had been an important part of robotic designs and other interesting areas of research since the 1950s. So I took one of the inspection tools, fed the video into a computer and added computer control for the table and the ink button. I programmed the computer using state-of-the-art AI software and proceeded to “teach” it good and bad heads. I would feed the video for a good head and program the computer to view it as “good” and the same with bad heads programmed as “bad.” After running a few thousand heads through the “learning” circuits, I thought the computer would be able to distinguish good from bad. It didn’t do very well, giving me both “false positives” and “false negatives.” In other words, it marked good heads as bad, and bad heads were not “inked.”

So I modified the vision system, adding a second camera, so that I had one in visible light and one with a blue light filter (since blue light seemed to show the imperfections better). That helped, but still we did not achieve the accuracy we needed. I worked with a Ph.D. math intern on algorithms, and the best we could get after working all summer was for the computer to correctly spot bad heads about 80% of the time and to correctly pass good heads over 90% of the time. But that was not good enough. Human operators ran near 98% in both categories.
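To make those two percentages concrete, here is a minimal sketch of how they would be computed from inspection counts. The function name and the counts are hypothetical, picked only so the results roughly match the figures quoted above; they are not data from the original project.

```python
def inspection_rates(true_bad, missed_bad, true_good, false_reject):
    """Return (bad-head detection rate, good-head acceptance rate)."""
    # Share of genuinely bad heads that the machine inked.
    detection = true_bad / (true_bad + missed_bad)
    # Share of genuinely good heads that the machine left alone.
    acceptance = true_good / (true_good + false_reject)
    return detection, acceptance


# Hypothetical counts for illustration only.
detection, acceptance = inspection_rates(
    true_bad=320, missed_bad=80, true_good=540, false_reject=60
)
print(f"bad heads caught: {detection:.0%}, good heads passed: {acceptance:.0%}")
```

A missed bad head is a false negative and a rejected good head is a false positive, so the two rates have to be tracked separately; a single "accuracy" number would hide which kind of mistake the machine was making.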

We kept increasing the precision of the algorithms, but pretty soon the system ran too slowly. We needed this machine to inspect over 5,000 heads a day, and it just couldn’t do that in an 8 hour shift. Humans were required to insert the wafers, so we couldn’t run around the clock, and finally the project was canceled. I did inherit a $10,000 Zeiss optical system in the process, but later gave it to a friend working in Austin, TX who used it to inspect processor chips. He did have better luck than me.
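The throughput requirement translates into a per-head time budget. A quick sketch of the arithmetic, using only the 5,000 heads and the 8 hour shift mentioned above (the names are mine):

```python
SHIFT_HOURS = 8
HEADS_PER_DAY = 5000


def seconds_per_head(heads, hours):
    """Average time available per head if the whole shift is spent inspecting."""
    return hours * 3600 / heads


budget = seconds_per_head(HEADS_PER_DAY, SHIFT_HOURS)
print(f"{budget:.2f} seconds per head")
```

That budget has to cover moving the stage, grabbing the image, running the algorithm, and inking, with no allowance for loading wafers, so the time left for the analysis itself would have been smaller still.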

The problem was that certain imperfections on the recording head surface were not detrimental to operation, while other imperfections were. My computer algorithms just could not distinguish which of the slight imperfections were damaging and which were not. I measured the size of the imperfections, their reflection indices, and even their color (in a limited sense), but I could not get the level of discernment of the human eye and thinking brain. The computer just could not tell the difference between good and bad heads with the reliability that matched humans.

Now my lack of success was typical of AI at that point in time. Everyone looked for the holy grail of “faster processors.” Even better would have been parallel processors, since the algorithms I was running did tree searches and you can easily parallelize those algorithms.
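Tree searches parallelize easily because each subtree can be searched without knowing anything about the others. The sketch below is not the algorithm I ran in 1983; it is a toy example, with a made-up nested-list tree and hypothetical function names, that hands each top-level branch to its own worker and combines the results.

```python
from concurrent.futures import ThreadPoolExecutor


def best_leaf(node):
    """Depth-first search for the highest leaf score in a nested-list tree."""
    if not isinstance(node, list):
        return node  # a leaf holds its own score
    return max(best_leaf(child) for child in node)


def parallel_best_leaf(tree, workers=4):
    """Search each top-level subtree in its own worker, then combine."""
    if not isinstance(tree, list):
        return tree
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each branch is independent, so the searches need no coordination.
        return max(pool.map(best_leaf, tree))


# Made-up tree: inner lists are branches, numbers are leaf scores.
tree = [[3, [7, 1]], [4, [9, 2], 5], [6]]
print(parallel_best_leaf(tree))
```

Threads are used here only to keep the example short. In standard Python they will not speed up pure number crunching, so a real version would assume separate processes or genuinely parallel hardware, which is exactly what we did not have back then.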

Move the clock ahead nearly 30 years. Enter game systems like the Wii and the Xbox 360. Now, as many of you know, the Wii has been doing very well in the market, partially because of the lower cost of that system compared to the Sony and Microsoft offerings, but also because of the interesting controllers. These are motion and position sensitive, wireless, handheld devices that let Wii game writers create interesting games like bowling, tennis, and exercise software.

So, Microsoft did them one better with the completely controller-less Kinect box. This is a device that employs machine vision (and machine hearing) to monitor the game players and detect movement of their hands, arms, legs, body, and head. The AI research that went into this box is fascinating and must have cost millions of dollars of research time. That, coupled with an interesting, low cost interface, is a fascinating thing. In fact, these boxes are being bought by hackers who are quickly modifying them to work directly with computers. MS doesn’t know exactly what to think about this. They want to sell the box, but they also want to sell Xbox 360 units. At the same time, they appreciate the interest and the good press. Now, in my opinion, this is just the start of something. I forecast laptop and desktop PCs with machine vision eliminating the need for a mouse or a touch screen and maybe even a keyboard. Imagine this interface integrated in a smart phone. Maybe we will enter text using the international sign language for the deaf. Interesting thoughts.

As an aside, Microsoft seems to be good at producing hardware. Their mouse was often the best on the market. It was designed and engineered at Microsoft in Ft. Collins, and I know many of the engineers who worked there. Sadly, that engineering work has been sent to China, and my friends were laid off. As I’ve said before, I don’t think the manufacturing moved offshore will ever return. I’m much more concerned about the engineering moving offshore. And now back to your regularly scheduled program.

With the assistance of my friends at UBM TechInsights in Austin, TX, here is a breakdown of the Kinect hardware. Fabless semiconductor company PrimeSense, based in Tel Aviv, Israel, enabled the technological feat via its PrimeSensor reference design, which it says lets a computer “perceive the world in three dimensions and translate these sections into a synchronized image.”

Another aside, the Israelis continue to build computer chips while the Palestinians produce “potato chips.” Add to list of articles in my queue, my opinions on the Israeli and Palestinian situation. I’m sure I could shed some light on that … sure! OK, back to Kinect.

In the MS approach, the room and its occupants are peppered with a pattern of dots, unseen by the users and generated by a near-infrared laser; the use of a Class I laser device provides focus at a distance without hazard to the players. I’m sure some of you have seen how Hollywood does some special effects, photographing actors in full skin suits covered with white balls. It is the same idea: the computer needs these fixed reference points to establish location and movement.

A CMOS image sensor in the Kinect detects reflected segments of the infrared dot pattern and maps the intensity of each segment to a corresponding distance from the sensor, with resolution of the depth dimension (z axis) down to 1 centimeter. Spatial resolution (x and y axes) is on the order of millimeters (which for those of you not good at metric measurements, are even smaller than centimeters), and RGB input from a second CMOS image sensor is pixel-aligned to add color to the acquired data.
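The paragraph above describes the depth mapping in terms of intensity. I don’t have the details of the actual algorithm, so the sketch below swaps in a different, commonly used technique for projected dot patterns: triangulation, where a dot’s sideways shift in the image (its disparity) gives the distance. The function name and the focal length and baseline values are hypothetical, and this is not a claim about how the Kinect really computes depth.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Triangulated distance in meters for a dot shifted by disparity_px pixels."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    # Similar triangles: distance = focal length * baseline / disparity.
    return focal_length_px * baseline_m / disparity_px


# Hypothetical optics: 580-pixel focal length, 7.5 cm projector-to-sensor baseline.
for shift in (10.0, 20.0, 40.0):
    distance = depth_from_disparity(shift, 580.0, 0.075)
    print(f"{shift:5.1f} px shift -> {distance:.2f} m")
```

Note that doubling the shift halves the distance, so with this approach depth resolution gets coarser the farther a player stands from the sensor.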

The Kinect uses the three-dimensional position and movement data to produce corresponding on-screen movements by each player’s avatar. A motorized gear assembly keeps the image sensors aimed at the action. As players move, the Kinect follows. Four microphones are used to cancel echoes and background noise while helping determine which player has issued a voice command. It’s not too hard to think of other applications for this technology, but for now, it’s available as a video game interface.
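One plausible way a microphone array can tell which player spoke is to compare when the same sound reaches two microphones. The sketch below assumes a far-away talker and a single pair of microphones; the function name and the 20 cm spacing are made up, and the Kinect’s real audio processing is certainly more involved.

```python
import math

SPEED_OF_SOUND = 343.0  # meters per second in air at roughly room temperature


def arrival_angle(delay_s, mic_spacing_m):
    """Angle of the talker from straight ahead, in degrees.

    delay_s is how much later the sound reaches the second microphone.
    """
    ratio = SPEED_OF_SOUND * delay_s / mic_spacing_m
    # Clamp so measurement noise cannot push asin() out of its domain.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))


# Hypothetical pair of microphones 20 cm apart, sound arriving 0.2 ms late.
print(f"{arrival_angle(0.0002, 0.2):.1f} degrees off center")
```

With an estimated angle for each voice command and a known position for each player from the depth camera, matching the command to a player becomes a simple comparison.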

Microsoft expects to sell a few million Kinect units by the end of the year, so it comes as no surprise that several of the commodity components have second and third sources. The 64 Mbyte DDR2 SDRAM socket may contain parts from Samsung, Elpida, or Hynix. Also, the 1-Mbyte NOR flash may be from Silicon Storage Technology or STMicroelectronics. The Kinect contains plenty of op amps and other small components that are easy to multi-source.

The “eyes” of the Kinect are a pair of cameras, both of which incorporate CMOS image sensors from Aptina Imaging. The unit uses a PS1080 for communications via USB 2.0 with the application processor, a Marvell PXA168: a low-power, low-cost, gigahertz-plus screamer that should have tech-frenzied gamers swooning. A pair of Wolfson Microelectronics WM8737L stereo A/D converters with built-in microphone preamps accommodate the array of microphones.

Kinect also houses a MEMS accelerometer to support the unit’s limited range of motion provided by stepper and DC motor drivers. A USB hub controller from NEC and a pair of Texas Instruments USB audio streaming controllers and eight-channel A/D converters round out the processing power.

What is really amazing is that all this algorithmic power is available for $150 retail. Just think of the possibilities beyond the Kinect: a TV with no remote; a computer with no mouse, no track pad, and no touch screen; affordable advances in home security; and any number of aids for the elderly and disabled. Whether you run off and attach your Kinect to a homemade robot, limit its use to the intended gaming purpose, or do neither, you’ll be seeing this technology again.

I remember IBM demonstrations at Disney World where keyboards were projected onto any surface with a red laser and then a camera would observe the “typist” using the keyboard. Obviously, MS has taken these ideas several steps further.

Tell me again why I’m retiring now … the future is so bright I’ll have to wear shades, with laser dot, infrared technology built in, methinks!
