Is computer vision about to reinvent itself, again?
Ryad Benosman, professor of ophthalmology at the University of Pittsburgh and an adjunct professor at the CMU Robotics Institute, believes it is. As one of the founding fathers of event-based vision technologies, Benosman expects neuromorphic vision (computer vision based on event-based cameras) to be the next direction computer vision takes.
“Computer vision has been reinvented many, many times,” he said. “I’ve seen it reinvented twice at least, from scratch, from zero.”
Benosman cites a shift in the 1990s from image processing with a bit of photogrammetry to a geometry-based approach, and then a second shift today, with the rapid move toward machine learning. Despite these changes, modern computer vision techniques are still predominantly based on image sensors: cameras that produce an image similar to what the human eye sees.
According to Benosman, as long as the image-sensing paradigm remains serviceable, it holds back innovation in alternative technologies. The effect has been prolonged by the development of high-performance processors such as GPUs, which delay the need to look for alternative solutions.
“Why are we using images for computer vision? That’s the million-dollar question to start with,” he said. “We have no reasons to use images, it’s just because there’s the momentum from history. Before even having cameras, images had momentum.”
Image cameras have been around since the pinhole camera emerged in the fifth century B.C. By the 1500s, artists built room-sized devices used to trace the image of a person or a landscape outside the room onto canvas. Over the years, the paintings were replaced with film to record the images. Innovations such as digital photography eventually made it easy for image cameras to become the basis for modern computer vision techniques.
Benosman argues, however, that image camera-based techniques for computer vision are hugely inefficient. His analogy is the defense system of a medieval castle: guards positioned around the ramparts look in every direction for approaching enemies. A drummer plays a steady beat, and on each drumbeat, every guard shouts out what they see. Among all the shouting, how easy is it to hear the one guard who spots an enemy at the edge of a distant forest?
The 21st-century hardware equivalent of the drumbeat is the electronic clock signal, and the guards are the pixels: a huge batch of data is created and must be examined on every clock cycle, which means a lot of redundant information and a lot of unnecessary computation.
“People are burning so much energy, it’s occupying the entire computation power of the castle to defend itself,” Benosman said. If an interesting event is spotted, represented by the enemy in this analogy, “you’d have to go around and collect useless information, with people screaming all over the place, so the bandwidth is huge… and now imagine you have a complicated castle. All those people have to be heard.”
Enter neuromorphic vision. The basic idea is inspired by the way biological systems work, detecting changes in the scene dynamics rather than analyzing the entire scene continuously. In our castle analogy, this would mean having guards keep quiet until they see something of interest, then shout their location to sound the alarm. In the electronic version, this means having individual pixels decide if they see something relevant.
“Pixels can decide on their own what information they should send, instead of acquiring systematic information they can look for meaningful information — features,” he said. “That’s what makes the difference.”
This event-based approach can save a huge amount of power, and reduce latency, compared to systematic acquisition at a fixed frequency.
“You want something more adaptive, and that’s what that relative change [in event–based vision] gives you, an adaptive acquisition frequency,” he said. “When you look at the amplitude change, if something moves really fast, we get lots of samples. If something doesn’t change, you’ll get almost zero, so you’re adapting your frequency of acquisition based on the dynamics of the scene. That’s what it brings to the table. That’s why it’s a good design.”
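The adaptive sampling Benosman describes can be sketched with a toy model of a single event-camera pixel. The snippet below is an illustration of the generic DVS contrast-threshold idea, not any particular vendor's design, and the parameter names are my own: the pixel emits a signed event only when log intensity has drifted by more than a threshold since its last event, so a static signal generates nothing while a fast-changing one generates a burst of events.

```python
import math

def dvs_pixel_events(samples, threshold=0.2):
    """Toy model of one event-camera pixel (illustrative sketch only).

    Emits (sample_index, polarity) events whenever log intensity has
    changed by more than `threshold` since the last event, instead of
    reporting a value on every clock tick like a frame-based sensor.
    """
    events = []
    ref = math.log(samples[0])  # log intensity at the last emitted event
    for i, intensity in enumerate(samples[1:], start=1):
        delta = math.log(intensity) - ref
        # A large, fast change can produce several events in one step.
        while abs(delta) >= threshold:
            polarity = 1 if delta > 0 else -1
            events.append((i, polarity))
            ref += polarity * threshold  # move the reference level
            delta = math.log(intensity) - ref
    return events

# A static scene produces no events; a rapid brightness ramp produces many.
static_scene = dvs_pixel_events([100.0] * 10)
moving_scene = dvs_pixel_events([100.0, 150.0, 220.0])
```

Here the effective "acquisition frequency" falls out of the scene dynamics: zero events for a constant input, many for a fast change, exactly the adaptivity the quote describes.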
Benosman entered the field of neuromorphic vision in 2000, convinced that image-based approaches could never deliver advanced computer vision, because images are not the right way to do it.
“The big shift was to say that we can do vision without grey levels and without images, which was heresy at the end of 2000 — total heresy,” he said.
The techniques Benosman proposed, the basis for today's event-based sensing, were so different that papers submitted to the foremost IEEE computer vision journal of the time were rejected without review. Indeed, it took until the development of the dynamic vision sensor (DVS) in 2008 for the technology to start gaining momentum.
Neuromorphic technologies are those inspired by biological systems, including the ultimate computer: the brain, with its compute elements, the neurons. The problem is that no one fully understands exactly how neurons work. We know that neurons act on incoming electrical signals called spikes, but until relatively recently, researchers characterized neurons as rather sloppy devices, thinking that only the number of spikes mattered. This hypothesis persisted for decades. More recent work has shown that the timing of these spikes is absolutely critical, and that the architecture of the brain creates delays in these spikes to encode information.
Today’s spiking neural networks, which emulate the spike signals seen in the brain, are simplified versions of the real thing — often binary representations of spikes. “I receive a 1, I wake up, I compute, I sleep,” Benosman explained. The reality is much more complex. When a spike arrives, the neuron starts integrating the value of the spike over time; there is also leakage from the neuron meaning the result is dynamic. There are also around 50 different types of neurons with 50 different integration profiles. Today’s electronic versions are missing the dynamic path of integration, the connectivity between neurons, and the different weights and delays.
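The role of timing and leakage can be seen in a minimal leaky integrate-and-fire sketch (the standard textbook simplification, far cruder than the roughly 50 real neuron types the article mentions, and with illustrative parameter values of my choosing): the membrane potential jumps on each incoming spike and leaks away exponentially between spikes, so the same number of input spikes may or may not fire the neuron depending purely on when they arrive.

```python
import math

def lif_first_spike(spike_times, weight=0.6, tau=10.0, threshold=1.0):
    """Leaky integrate-and-fire neuron (textbook model, illustrative).

    Membrane potential decays with time constant `tau` between input
    spikes and jumps by `weight` at each spike.  Returns the time of the
    first output spike, or None if the threshold is never reached.
    """
    v, last_t = 0.0, 0.0
    for t in sorted(spike_times):
        v *= math.exp(-(t - last_t) / tau)  # leakage since the last input
        v += weight                         # integrate the incoming spike
        last_t = t
        if v >= threshold:
            return t                        # output spike
    return None

# Two input spikes in both cases; only the tightly spaced pair fires
# the neuron, because the widely spaced one leaks away in between.
burst = lif_first_spike([0.0, 1.0])    # fires at t = 1.0
spread = lif_first_spike([0.0, 40.0])  # never fires
```

A pure spike-counting model would treat both inputs identically, which is exactly the simplification Benosman says the brain does not make.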
“The problem is to make an effective product, you cannot [imitate] all the complexity because we don’t understand it,” he said. “If we had good brain theory, we would solve it — the problem is we just don’t know [enough].”
Today, Benosman runs a unique laboratory dedicated to understanding the mathematics behind cortical computation, with the aim of creating new mathematical models and replicating them in silicon devices. This includes directly monitoring spikes from pieces of real retina.
For the time being, Benosman is against trying to faithfully copy the biological neuron, describing that approach as old-fashioned.
“The idea of replicating neurons in silicon came about because people looked into the transistor and saw a regime that looked like a real neuron, so there was some thinking behind it at the beginning,” he said. “We don’t have cells; we have silicon. You need to adapt to your computing substrate, not the other way around… if I know what I’m computing and I have silicon, I can optimize that equation and run it at the lowest cost, lowest power, lowest latency.”
The realization that it's unnecessary to replicate neurons exactly, combined with the development of the DVS camera, is the driver behind today's neuromorphic vision systems. While such systems are already on the market, there is still a way to go before fully human-like vision is available for commercial use.
Initial DVS cameras had “big, chunky pixels,” since the components around the photodiode itself reduced the fill factor substantially. While investment in the development of these cameras accelerated the technology, Benosman made it clear that today's event cameras are simply an improvement on the original research devices developed as far back as 2000. State-of-the-art DVS cameras from Sony, Samsung, and Omnivision have tiny pixels, incorporate advanced technology such as 3D stacking, and reduce noise. Benosman's worry is whether the types of sensors used today can successfully be scaled up.
“The problem is, once you increase the number of pixels, you get a deluge of data, because you’re still going super fast,” he said. “You can probably still process it in real time, but you’re getting too much relative change from too many pixels. That’s killing everybody right now, because they see the potential, but they don’t have the right processor to put behind it.”
General-purpose neuromorphic processors are lagging behind their DVS camera counterparts. Efforts from some of the industry's biggest players (IBM's TrueNorth, Intel's Loihi) are still works in progress. Benosman said that the right processor paired with the right sensor would be an unbeatable combination.
“[Today’s DVS] sensors are extremely fast, super low bandwidth, and have a high dynamic range so you can see indoors and outdoors,” Benosman said. “It’s the future. Will it take off? Absolutely!”
“Whoever can put the processor out there and offer the full stack will win, because it’ll be unbeatable,” he added.
— Professor Ryad Benosman will give the keynote address at the Embedded Vision Summit in Santa Clara, Calif. on May 17.