Reconstructing visual experiences from brain activity evoked by natural movies
Shinji Nishimoto, An T. Vu, Thomas Naselaris, Yuval Benjamini, Bin Yu & Jack L. Gallant.
Current Biology, published online September 22, 2011.
Quantitative modeling of human brain activity can provide crucial insights about cortical representations and can form the basis for brain decoding devices. Recent functional magnetic resonance imaging (fMRI) studies have modeled brain activity elicited by static visual patterns and have reconstructed these patterns from brain activity. However, blood oxygen level-dependent (BOLD) signals measured via fMRI are very slow, so it has been difficult to model brain activity elicited by dynamic stimuli such as natural movies. Here we present a new motion-energy encoding model that largely overcomes this limitation. The model describes fast visual information and slow hemodynamics by separate components. We recorded BOLD signals in occipitotemporal visual cortex of human subjects who watched natural movies and fit the model separately to individual voxels. Visualization of the fit models reveals how early visual areas represent the information in movies. To demonstrate the power of our approach, we also constructed a Bayesian decoder by combining estimated encoding models with a sampled natural movie prior. The decoder provides remarkable reconstructions of the viewed movies. These results demonstrate that dynamic brain activity measured under naturalistic conditions can be decoded using current fMRI technology.
Simple example of reconstruction
The left clip is a segment of a Hollywood movie trailer that the subject viewed while in the magnet. The right clip shows the reconstruction of this segment from brain activity measured using fMRI. The procedure is as follows:
Reconstruction for different subjects
This video is organized as follows: the movie that each subject viewed while in the magnet is shown at upper left. Reconstructions for three subjects are shown in the three rows at bottom. All these reconstructions were obtained using only each subject's brain activity and a library of 18 million seconds of random YouTube video that did not include the movies used as stimuli. The reconstruction at far left is the Average High Posterior (AHP). The reconstruction in the second column is the Maximum a Posteriori (MAP). The other columns represent less likely reconstructions. The AHP is obtained by simply averaging over the 100 most likely movies in the reconstruction library. These reconstructions show that the process is very consistent, though the quality of the reconstructions does depend somewhat on the quality of brain activity data recorded from each subject.
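The difference between the MAP and AHP reconstructions can be sketched in a few lines. This is only an illustration under stated assumptions: the function name map_and_ahp, the toy clip sizes and the random scores are all hypothetical, and the posterior scores are taken as given rather than computed from brain activity.

```python
import numpy as np

def map_and_ahp(log_posterior, clips, top_k=100):
    """Return the MAP clip and the AHP clip from a scored clip library.

    log_posterior: (n_clips,) score of each library clip (higher = more likely).
    clips:         (n_clips, ...) pixel arrays, one per library clip.
    """
    log_posterior = np.asarray(log_posterior, dtype=float)
    clips = np.asarray(clips, dtype=float)
    order = np.argsort(log_posterior)[::-1]    # most likely clip first
    map_clip = clips[order[0]]                 # MAP: the single best clip
    k = min(top_k, len(order))
    ahp_clip = clips[order[:k]].mean(axis=0)   # AHP: average of the top k clips
    return map_clip, ahp_clip

# Toy library (hypothetical sizes): 1000 clips of 4 frames, 8x8 pixels each.
rng = np.random.default_rng(0)
library = rng.random((1000, 4, 8, 8))
scores = rng.normal(size=1000)                 # stand-in for posterior scores
map_clip, ahp_clip = map_and_ahp(scores, library, top_k=100)
```

Averaging many plausible clips blurs details that differ between them and keeps what they have in common, which is one way to see why the AHP looks smoother than any single library clip.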
Frequently Asked Questions About This Work
Could you give a simple outline of the experiment?
The goal of the experiment was to design a process for decoding dynamic natural visual experiences from human visual cortex. More specifically, we sought to use brain activity measurements to reconstruct natural movies seen by an observer. First, we used functional magnetic resonance imaging (fMRI) to measure brain activity in visual cortex as a person looked at several hours of movies. We then used these data to develop computational models that could predict the pattern of brain activity that would be elicited by any arbitrary movies (i.e., movies that were not in the initial set used to build the model). Next, we used fMRI to measure brain activity elicited by a second set of movies that were completely distinct from the first set. Finally, we used the computational models to process the elicited brain activity, in order to reconstruct the movies in the second set of movies. This is the first demonstration that dynamic natural visual experiences can be recovered from very slow brain activity recorded by fMRI.
Can you give an intuitive explanation of movie reconstruction?
As you move through the world or you watch a movie, a dynamic, ever-changing pattern of activity is evoked in the brain. The goal of movie reconstruction is to use the evoked activity to recreate the movie you observed. To do this, we create encoding models that describe how movies are transformed into brain activity, and then we use those models to decode brain activity and reconstruct the stimulus.
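The decoding step can be illustrated with a toy example. The sketch below assumes a linear encoding model with known weights, independent Gaussian noise in every voxel, and a uniform prior over a library of candidate clips; all names and sizes are hypothetical. To make the result easy to check, the toy library contains the viewed clip, whereas the library used in the study did not include the stimuli.

```python
import numpy as np

def log_likelihood(measured, predicted, noise_sd=1.0):
    # Gaussian log-likelihood up to a constant, assuming independent voxels.
    resid = measured - predicted
    return -0.5 * np.sum((resid / noise_sd) ** 2, axis=-1)

def rank_library(measured, library_feats, weights, noise_sd=1.0):
    """Rank library clips by how well their predicted activity matches the data.

    With a uniform prior over the sampled clips, the posterior ranking is the
    same as the likelihood ranking.
    """
    predicted = library_feats @ weights        # (n_clips, n_voxels)
    scores = log_likelihood(measured, predicted, noise_sd)
    return np.argsort(scores)[::-1], scores

rng = np.random.default_rng(0)
n_clips, n_feats, n_voxels = 500, 40, 200      # hypothetical toy sizes
weights = rng.normal(size=(n_feats, n_voxels)) # stand-in for fitted encoding models
library_feats = rng.random((n_clips, n_feats)) # stand-in for filtered library clips
viewed = 123                                   # index of the clip the "subject" saw
measured = library_feats[viewed] @ weights + rng.normal(size=n_voxels)

ranking, scores = rank_library(measured, library_feats, weights)
```

Here ranking[0] is the clip whose predicted activity best matches the measured activity.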
Can you explain the encoding model and how it was fit to the data?
To understand our encoding model, it is most useful to think of the process of perception as one of filtering the visual input in order to extract useful information. The human visual cortex consists of billions of neurons. Each neuron can be viewed as a filter that takes a visual stimulus as input, and produces a spiking response as output. In early visual cortex these neural filters are selective for simple features such as spatial position, motion direction and speed. Our motion-energy encoding model describes this filtering process. Currently the best method for measuring human brain activity is fMRI. However, fMRI does not measure neural activity directly, but rather measures hemodynamic changes (i.e. changes in blood flow, blood volume and blood oxygenation) that are caused by neural activity. These hemodynamic changes take place over seconds, so they are much slower than the changes that can occur in natural movies (or in the individual neurons that filter those movies). Thus, it has previously been thought impossible to decode dynamic information from brain activity recorded by fMRI. To overcome this fundamental limitation we use a two-stage encoding model. The first stage consists of a large collection of motion-energy filters that span the same range of positions, motion directions and speeds as the underlying neurons. This stage models the fast responses in the early visual system. The output from the first stage of the model is fed into a second stage that describes how neural activity affects hemodynamic activity in turn. The two-stage processing allows us to model the relationship between the fine temporal information in the movies and the slow brain activity signals measured using fMRI. Functional MRI records brain activity from small volumes of brain tissue called voxels (here each voxel was 2.0 x 2.0 x 2.5 mm). Each voxel represents the pooled activity of hundreds of thousands of neurons.
Therefore, we do not model each voxel as a single motion-energy filter, but rather as a bank of thousands of such filters. In practice fitting the encoding model to each voxel is a straightforward regression problem. First, each movie is processed by a bank of nonlinear motion-energy filters. Next, a set of weights is found that optimally maps the filtered movie (now represented as a vector of about 6,000 filter outputs) into measured brain activity. (Linear summation is assumed in order to simplify fitting.)
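The regression step for a single voxel might look roughly like the sketch below. It is not the fitting procedure used in the paper: the filter outputs are simulated rather than computed from movies, the gamma-shaped hemodynamic response (toy_hrf) is a hypothetical stand-in for the second stage, and ridge regression is an assumed choice of regularized linear fit. All function names and sizes are illustrative.

```python
import numpy as np

def toy_hrf(n_samples=10, peak=5.0):
    # Hypothetical gamma-like hemodynamic response, one sample per fMRI volume.
    t = np.arange(n_samples, dtype=float)
    h = t ** peak * np.exp(-t)
    return h / h.sum()

def apply_hemodynamics(features, hrf):
    # Stage 2: smooth and delay each filter output (columns) with the slow response.
    n_t = features.shape[0]
    cols = [np.convolve(features[:, j], hrf)[:n_t] for j in range(features.shape[1])]
    return np.stack(cols, axis=1)

def fit_voxel_weights(design, bold, alpha=1.0):
    # Ridge regression: w = (X^T X + alpha * I)^-1 X^T y
    n_f = design.shape[1]
    return np.linalg.solve(design.T @ design + alpha * np.eye(n_f), design.T @ bold)

rng = np.random.default_rng(0)
n_train, n_test, n_filters = 600, 200, 50      # toy sizes, far smaller than ~6,000 filters
true_w = rng.normal(size=n_filters)            # simulated "ground truth" voxel weights
hrf = toy_hrf()

def simulate(n_t):
    feats = rng.random((n_t, n_filters))       # stand-in for stage-1 filter outputs
    design = apply_hemodynamics(feats, hrf)
    bold = design @ true_w + 0.1 * rng.normal(size=n_t)
    return design, bold

X_train, y_train = simulate(n_train)           # movies used to fit the model
X_test, y_test = simulate(n_test)              # separate movies used for evaluation
w_hat = fit_voxel_weights(X_train, y_train, alpha=1.0)
prediction_r = np.corrcoef(X_test @ w_hat, y_test)[0, 1]
```

Evaluating prediction_r on data that were not used for fitting mirrors the logic of the experiment, in which the movies used to build the models and the movies used to test them were separate.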
How accurate is the decoder?
A good decoder should produce a reconstruction that a neutral observer judges to be visually similar to the viewed movie. However, it is difficult to quantify human judgments of visual similarity. In this paper we use similarity in the motion-energy domain. That is, we quantify how much of the spatially localized motion information in the viewed movie was reconstructed. The accuracy of our reconstructions is far above chance.
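One simple way to express "similarity in the motion-energy domain" and "above chance" is shown below. This is an illustrative assumption, not the metric reported in the paper: similarity is taken to be the correlation between filter outputs for the viewed and reconstructed movies, and chance is estimated by shuffling the time order of the reconstruction. The feature arrays are simulated.

```python
import numpy as np

def feature_similarity(viewed_feats, recon_feats):
    # Correlation between filter outputs of the viewed and reconstructed movies.
    a = np.ravel(viewed_feats)
    b = np.ravel(recon_feats)
    return np.corrcoef(a, b)[0, 1]

def chance_similarities(viewed_feats, recon_feats, n_shuffles=200, seed=0):
    # Null distribution: pair each viewed time point with a random reconstructed one.
    rng = np.random.default_rng(seed)
    n_t = len(recon_feats)
    return np.array([
        feature_similarity(viewed_feats, recon_feats[rng.permutation(n_t)])
        for _ in range(n_shuffles)
    ])

rng = np.random.default_rng(1)
viewed = rng.random((60, 30))                  # 60 time points x 30 filters (toy sizes)
recon = viewed + 0.5 * rng.normal(size=viewed.shape)   # noisy toy "reconstruction"

observed = feature_similarity(viewed, recon)
null = chance_similarities(viewed, recon)
```

If observed lies well above every value in null, the reconstruction carries more of the spatially localized motion information than would be expected by chance.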
Other studies have attempted reconstruction before. How is your study different?
Previous studies showed that it is possible to reconstruct static visual patterns (Thirion et al., 2006 Neuroimage; Miyawaki et al., 2008 Neuron), static natural images (Naselaris et al., 2009 Neuron) or handwritten digits (van Gerven et al., 2010 Neural Computation). However, no previous study has produced reconstructions of dynamic natural movies. This is a critical step toward obtaining reconstructions of internal states such as imagery, dreams and so on.
Why is this finding important?
From a basic science perspective, our paper provides the first quantitative description of dynamic human brain activity during conditions simulating natural vision. This information will be important to vision scientists and other neuroscientists. Our study also represents another important step in the development of brain-reading technologies that could someday be useful to society. Previous brain-reading approaches could only decode static information. But most of our visual experience is dynamic, and these dynamics are often the most compelling aspect of visual experience. Our results will be crucial for developing brain-reading technologies that can decode dynamic experiences.
How many subjects did you run? Is there any chance that they could have cheated?
We ran three subjects for the experiments in this paper, all co-authors. There are several technical considerations that made it advantageous to use authors as subjects. It takes several hours to acquire sufficient data to build an accurate motion-energy encoding model for each subject, and naive subjects find it difficult to stay still and alert for this long. Authors are motivated to be good subjects, so their data are of high quality. These high quality data enabled us to build detailed and accurate models for each individual subject. There is no reason to think that the use of authors as subjects weakens the validity of the study. The experiment focuses solely on the early part of the visual system, and this part of the brain is not heavily modulated by intention or prior knowledge. The movies used to develop encoding models for each subject and those used for decoding were completely separate, and there is no plausible way that a subject could have changed their own brain activity in order to improve decoding. Many fMRI studies use much larger groups of subjects, but they collect much less data on each subject. Such studies tend to average over a lot of the individual variability in the data, and the results provide a poor description of brain activity in any individual subject.
What are the limits on brain decoding?
Decoding performance depends on the quality of brain activity measurements. In this study we used functional MRI (fMRI) to measure brain activity. (Note that fMRI does not actually measure the activity of neurons. Instead, it measures blood flow consequent to neural activity. However, many studies have shown that the blood flow signals measured using fMRI are generally correlated with neural activity.) fMRI has relatively modest spatial and temporal resolution, so much of the information contained in the underlying neural activity is lost when using this technique. fMRI measurements are also quite variable from trial-to-trial. Both of these factors limit the amount of information that can be decoded from fMRI measurements. Decoding also depends critically on our understanding of how the brain represents information, because this will determine the quality of the computational model. If the encoding model is poor (i.e., if it does a poor job of prediction) then the decoder will be inaccurate. While our computational models of some cortical visual areas perform well, they do not perform well when used to decode activity in other parts of the brain. A better understanding of the processing that occurs in parts of the brain beyond visual cortex (e.g. parietal cortex, frontal cortex) will be required before it will be possible to decode other aspects of human experience.
What are the future applications of this technology?
This study was not motivated by a specific application, but was aimed at developing a computational model of brain activity evoked by dynamic natural movies. That said, there are many potential applications of devices that can decode brain activity. In addition to their value as a basic research tool, brain-reading devices could be used to aid in diagnosis of diseases (e.g., stroke, dementia); to assess the effects of therapeutic interventions (drug therapy, stem cell therapy); or as the computational heart of a neural prosthesis. They could also be used to build a brain-machine interface.
Could this be used to build a brain-machine interface (BMI)?
Decoding visual content is conceptually related to the work on neural-motor prostheses being undertaken in many laboratories. The main goal in the prosthetics work is to build a decoder that can be used to drive a prosthetic arm or other device from brain activity. Of course there are some significant differences between sensory and motor systems that impact the way that a BMI system would be implemented in the two systems. But ultimately, the statistical frameworks used for decoding in the sensory and motor domains are very similar. This suggests that a visual BMI might be feasible.
At some later date when the technology is developed further, will it be possible to decode dreams, memory, and visual imagery?
Neuroscientists generally assume that all mental processes have a concrete neurobiological basis. Under this assumption, as long as we have good measurements of brain activity and good computational models of the brain, it should be possible in principle to decode the visual content of mental processes like dreams, memory, and imagery. The computational encoding models in our study provide a functional account of brain activity evoked by natural movies. It is currently unknown whether processes like dreaming and imagination are realized in the brain in a way that is functionally similar to perception. If they are, then it should be possible to use the techniques developed in this paper to decode brain activity during dreaming or imagination.
At some later date when the technology is developed further, will it be possible to use this technology in detective work, court cases, trials, etc?
The potential use of this technology in the legal system is questionable. Many psychology studies have now demonstrated that eyewitness testimony is notoriously unreliable. Witnesses often have poor memory, but are usually unaware of this. Memory tends to be biased by intervening events, inadvertent coaching, and rehearsal (prior recall). Eyewitnesses often confabulate stories to make logical sense of events that they cannot recall well. These errors are thought to stem from several factors: poor initial storage of information in memory; changes to stored memories over time; and faulty recall. Any brain-reading device that aims to decode stored memories will inevitably be limited not only by the technology itself, but also by the quality of the stored information. After all, an accurate read-out of a faulty memory only provides misleading information. Therefore, any future application of this technology in the legal system will have to be approached with extreme caution.
Will we be able to use this technology to insert images (or movies) directly into the brain?
Not in the foreseeable future. There is no known technology that could remotely send signals to the brain in a way that would be organized enough to elicit a meaningful visual image or thought.
Does this work fit into a larger program of research?
One of the central goals of our research program is to build computational models of the visual system that accurately predict brain activity measured during natural vision. Predictive models are the gold standard of computational neuroscience and are critical for the long-term advancement of brain science and medicine. To build a computational model of some part of the visual system, we treat it as a "black box" that takes visual stimuli as input and generates brain activity as output. A model of the black box can be estimated using statistical tools drawn from classical and Bayesian statistics, and from machine learning. Note that this reverse-engineering approach is agnostic about the specific way that brain activity is measured. One good way to evaluate these encoding models is to construct a corresponding decoding model, and then assess its performance in a specific task such as movie reconstruction.
Why is it important to construct computational models of the brain?
The brain is an extremely complex organ and many convergent approaches are required to obtain a full understanding of its structure and function. One way to think about the problem is to consider three different general goals of research in systems/computational neuroscience. (1) The first goal is to understand how the brain is divided into functionally distinct modules (e.g., for vision, memory, etc.). (2) The second goal, contingent on the first, is to determine the function of each module. One classical approach for investigating the function of a brain circuit is to characterize neural responses at a quantitative computational level that is abstracted away from many of the specific anatomical and biophysical details of the system. This helps make tractable a problem that would otherwise seem overwhelmingly complex. (3) The third goal, contingent on the first two, is to understand how these specific computations are implemented in neural circuitry. A byproduct of this model-based approach is that it has many specific applications, as described above.
Can you briefly explain the function of the parts of the brain examined here?
The human visual system consists of several dozen distinct cortical visual areas and sub-cortical nuclei, arranged in a network that is both hierarchical and parallel. Visual information comes into the eye and is there transduced into nerve impulses. These are sent on to the lateral geniculate nucleus and then to primary visual cortex (area V1). Area V1 is the largest single processing module in the human brain. Its function is to represent visual information in a very general form by decomposing visual stimuli into spatially localized elements. Signals leaving V1 are distributed to other visual areas, such as V2 and V3. Although the function of these higher visual areas is not fully understood, it is believed that they extract relatively more complicated information about a scene. For example, area V2 is thought to represent moderately complex features such as angles and curvature, while high-level areas are thought to represent very complex patterns such as faces. The encoding model used in our experiment was designed to describe the function of early visual areas such as V1 and V2, but was not meant to describe higher visual areas. As one might expect, the model does a good job of decoding information in early visual areas but it does not perform as well in higher areas.
Are there any ethical concerns with this type of research?
The current technology for decoding brain activity is relatively primitive. The computational models are immature, and in order to construct a model of someone's visual system they must spend many hours in a large, stationary magnetic resonance scanner. For this reason it is unlikely that this technology could be used in practical applications any time soon. That said, both the technology for measuring brain activity and the computational models are improving continuously. It is possible that decoding brain activity could have serious ethical and privacy implications downstream in, say, the 30-year time frame. As an analogy, consider the current debates regarding availability of genetic information. Genetic sequencing is becoming cheaper by the year, and it will soon be possible for everyone to have their own genome sequenced. This raises many issues regarding privacy and the accessibility of individual genetic information. The authors believe strongly that no one should be subjected to any form of brain-reading process involuntarily, covertly, or without complete informed consent.