Attention during natural vision warps semantic representation across the human brain

Tolga Çukur, Shinji Nishimoto, Alexander G. Huth & Jack L. Gallant (Nature Neuroscience, 2013)

Humans are remarkably adept at finding specific categories of objects amongst formidable clutter in real-world scenes. Because humans can effortlessly search for thousands of distinct categories, it is unlikely that this remarkable ability is mediated by distinct brain areas dedicated to recognizing each and every category. A more efficient scheme would be to dynamically reallocate neural resources based on behavioral demands. To investigate this issue we recorded brain activity while people performed a challenging, dynamic visual search task involving natural movies. In one condition they searched for “humans” in natural movies, and in another condition they searched for “vehicles”. We selected these specific categories because they are common targets of visual search in daily life. We then assessed how dynamic visual search affects the brain representations of 935 object and action categories that appeared in the movies. We find that shifting attention between “humans” and “vehicles” dramatically changes brain representations of all categories: search for a specific category increases the amount of cortex that is devoted to processing the category of interest and decreases the amount of cortex devoted to processing other, irrelevant categories.

Video Explanation of the Paper

The first author, Tolga Çukur, gives a brief explanation of the motivation for the work, the method used to collect and analyze the data, and the meaning of the results.

Frequently asked questions about this work

What were the various experimental stages performed in this study?

Study participants performed three separate tasks during the experiments: passive viewing, covert visual search for “humans”, or covert visual search for “vehicles”. These visual tasks allowed us to assess the way that attention changes the processing of visual categories in the brain. This assessment can be broken down into the following stages: [1] Collect brain activity data while people watch movies and perform the designated tasks; [2] Code the presence of 935 object and action categories in the movies; [3] For each one of the 50,000 locations recorded across the cortex, find a set of weights that indicates how much each of the 935 categories changes brain activity; [4] Use principal components analysis (PCA) on the category weights obtained from passive viewing data to derive a continuous space that organizes categories according to their semantic similarity; [5] Project category weights obtained from visual search data into the semantic space; [6] For each location on cortex, compare the semantic-space projections across the two search tasks; [7] Visualize the results by projecting the WordNet features and the cortical flatmaps into the semantic space.

To reveal attentional changes in visual processing when the targets appeared in the movies, we used all of the data collected during the visual search tasks. To reveal attentional changes in visual processing when the targets did not appear in the movies, stages [1]-[3] were performed after removing all data collected during those segments of the movies in which either “humans” or “vehicles” were present.
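The target-absent analysis described above amounts to masking out time points before the weights are fit. The following is a minimal sketch of that idea, not the authors' code: the function name, the array layout (time by categories, time by cortical locations) and the toy data are all assumptions made for illustration, and it ignores the delay between a stimulus and the fMRI response that a real analysis would have to account for.

```python
import numpy as np

def drop_target_timepoints(stim, resp, target_cols):
    """Keep only time points at which no target category is present.

    stim: (time, categories) boolean indicator matrix (hypothetical layout)
    resp: (time, locations) response matrix
    target_cols: column indices of the target categories
    """
    target_present = stim[:, target_cols].any(axis=1)
    keep = ~target_present
    return stim[keep], resp[keep]

# Toy data: 6 time points, 4 categories. Columns 0 and 1 stand in for
# "humans" and "vehicles"; the other columns are arbitrary categories.
stim = np.array([
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
], dtype=bool)
resp = np.arange(6 * 3, dtype=float).reshape(6, 3)

stim_kept, resp_kept = drop_target_timepoints(stim, resp, target_cols=[0, 1])
print(stim_kept.shape, resp_kept.shape)  # (3, 4) (3, 3)
```

Only the rows in which neither target column is set survive, so any weights fit to the remaining data describe responses to non-target categories alone.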

Do the findings of this study confirm or challenge current views of attentional processes in the human brain?

Attention is a neural control system that can alter the flow and processing of sensory information in the brain through several concurrent mechanisms. Previous studies have reported relatively simple mechanisms that mediate increases in the quality of brain activity evoked by attended objects, without affecting the way that information is represented in each brain region. Yet, because there are a limited number of cortical neurons, it seems unlikely that all brain regions retain fixed representations irrespective of behavioral demands. Our results provide the first evidence for a distinct mechanism that dynamically changes the way that visual information is represented across the entire human brain. This mechanism increases the amount of cortex that is devoted to representing task-relevant categories, and decreases the amount of cortex devoted to other irrelevant categories. Therefore, our results suggest a more complex and dynamic view of attention compared to conventional views.

Why are these findings important?

In everyday life, humans frequently search for particular visual objects. One might often search for buildings while navigating a city, for example, and someone might be looking for animals in search of their lost cat. Although real-world scenes are cluttered with many different objects, humans are extremely adept at finding target objects in natural environments and shifting their attention rapidly between different targets. Our results reveal a novel attentional mechanism that helps explain this remarkable human ability. This mechanism reallocates neural resources across the brain to maximize sensitivity for the target and improve target detection under demanding conditions. Our results suggest that the human brain optimizes behavior by focusing on cognitive operations relevant to the current task at the expense of reducing resources available for other mental tasks.

Can these results be explained by overt eye movements to look at target objects?

During natural visual search, humans can move their eyes to bring target objects to the center of their visual field, the point of greatest visual acuity. However, eye movements effectively alter the visual stimulus that falls onto our retinas, and hence they can potentially confound the results of visual attention studies. For this reason, we instructed participants to steadily fixate on a dot at the center of the visual field that was overlaid onto the movies. We recorded participants’ eye positions during the experiments and found no systematic differences in eye positions between the two search tasks used in our study. This suggests that our results cannot be explained by overt eye movements.

Can these results be explained by focusing our attention on simple features such as spatial position or orientation of target objects?

To facilitate target detection during visual search, humans can focus their attention on simple features such as a specific location in the visual field or the spatial orientation of an object. We selected the natural movies used in this study from a diverse variety of sources, and these movies contained many different objects that appeared in many different viewing conditions. Thus, we do not expect that there are significant attentional changes in brain activity evoked by simple visual features such as spatial location, spatio-temporal frequency, orientation, and eccentricity. To further investigate this issue, we found a separate set of weights for each location on cortex that indicates how much each of these visual features changes brain activity. We found that only a negligible amount of change in brain activity across search tasks could be attributed to these features. This finding suggests that our results reflect visual attention to object categories as opposed to simpler visual features.

Do these results reflect bottom-up or top-down effects?

In the experiments that we have reported in this paper, the movies shown to the participants as they searched for “humans” were identical to those shown during search for “vehicles”. Thus, the changes in brain activity across the two tasks necessarily reflect top-down effects due to visual attention.

Is it possible to determine in what order different regions of cortex were activated?

Not from these results. Functional MRI (fMRI) only measures slow signals related to blood oxygenation, blood volume and blood flow. These signals are much slower than the underlying neural signals, so it is very difficult to determine the order in which areas are activated using fMRI.

Where does the WordNet graph come from?

WordNet is a relational tree that was developed by linguists. WordNet is an “is a” hierarchy. For example, a cat is a feline, a feline is a mammal, a mammal is an animal, and an animal is an organism. The organization of the graph is based purely on intrinsic properties of the world and language; it has nothing to do with the brain data.
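To make the “is a” idea concrete, here is a small sketch that walks up such a hierarchy. The dictionary below is a hand-written toy fragment containing only the example chain from the text; it is not the real WordNet database, and the function name is hypothetical.

```python
# Toy "is a" hierarchy: each category maps to its parent category.
# This is a hand-written illustration, not data taken from WordNet itself.
IS_A = {
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
    "animal": "organism",
}

def ancestors(category, is_a=IS_A):
    """Return the chain of increasingly general categories above `category`."""
    chain = []
    while category in is_a:
        category = is_a[category]
        chain.append(category)
    return chain

print(ancestors("cat"))  # ['feline', 'mammal', 'animal', 'organism']
```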

What is PCA and what are PCs?

Principal components analysis (PCA) is a method for reducing a high-dimensional data set into a smaller, lower-dimensional space. The dimensionality reduction step in our experiments was critical for interpreting the results and for finding commonality across subjects. After model fitting, each subject’s data consisted of a ~50,000 x 935 matrix of weights (cortical locations by categories), which is far too much data to interpret directly. PCA finds a lower-dimensional semantic space that accounts for as much of the variance in these weights as possible. In our study, we primarily focused on the first 4 dimensions of this space, because those dimensions were almost identical across different individuals.
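The sketch below illustrates the general technique on synthetic data: PCA is computed from one weight matrix (standing in for passive viewing), and a second weight matrix (standing in for a search task) is projected onto the first 4 components. The matrix sizes, the random data, the mean-centering step and the function name are assumptions for illustration; the study's actual preprocessing may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_locations, n_categories, n_components = 500, 40, 4  # toy sizes, not 50,000 x 935

# Synthetic "passive viewing" weights with 4 underlying dimensions plus noise.
latent = rng.normal(size=(n_locations, n_components))
axes = rng.normal(size=(n_components, n_categories))
passive_weights = latent @ axes + 0.1 * rng.normal(size=(n_locations, n_categories))

def fit_pca(weights, n_components):
    """PCA via SVD. Returns the mean, the top components (as rows) and
    the fraction of variance each of those components explains."""
    mean = weights.mean(axis=0)
    _, s, vt = np.linalg.svd(weights - mean, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return mean, vt[:n_components], explained[:n_components]

mean, components, explained = fit_pca(passive_weights, n_components)

# Synthetic "search task" weights, projected into the space defined above.
search_weights = passive_weights + 0.05 * rng.normal(size=passive_weights.shape)
projected = (search_weights - mean) @ components.T

print(projected.shape)   # one 4-dimensional point per cortical location
print(explained.sum())   # fraction of variance captured by the 4 components
```

Each row of `components` is a direction in category space, so every cortical location ends up described by 4 numbers instead of one weight per category.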

What does fMRI measure and how can the signals be interpreted?

Functional MRI (fMRI) does not measure neural activity directly. It measures changes in blood oxygenation, blood flow and blood volume that are caused by neural activity. Most of the useful fMRI signal comes from veins about the size of the sampling lattice (in this study, volumes of about 2x2x4 mm). Given this, the only safe assumption is that fMRI signals are monotonically related to the integrated synaptic activity of the local neuropil upstream from the site of measurement. Thus, fMRI signals likely reflect both excitatory and inhibitory synaptic activity of both feed-forward and feed-back connections. That said, many studies that have aimed at validating fMRI have shown that fMRI signals are most closely correlated with local field potentials (LFPs) and local multi-unit activity (MUA). Other studies have shown that LFPs and MUA are most closely associated with excitatory neurons. (Perhaps this is because local inhibitory influences in cortex appear to be relatively untuned, when compared to excitatory signals.) Finally, studies of the primate analog of the fusiform face area (FFA) have shown that tuning of single excitatory neurons and of voxels are similar. Still, the conservative position is to assume that the jury is still out on the precise relationship between fMRI signals and neural signals.

Why are data shown on the surface of a flattened cortex?

The cortex is a sheet of tissue that covers the visible surface of the brain. In the normal human, the cortex is highly folded so that the brain (and the head) can be as small as possible. To visualize the distribution of semantic tuning across the cortical surface, we computationally inflate the cortex, then put a few cuts in it, then flatten it out (to see how this works, use our public interactive brain viewer). Then we can project the data onto the surface of the flattened cortex in order to visualize it easily. Note that most studies show fMRI results in the 3D brain space. They do not flatten the cortex and they do not show flat maps. This makes their data difficult to see and interpret. Given that most fMRI studies are focused on cortical activity, we believe that these other studies would be easier to understand if they made flat maps as we did here.

Is it appropriate to generalize the results of this study to a larger population?

Most fMRI studies average the data from a typical sample size of, say, 20 subjects, and they only make inferences at the group level. Averaging data across subjects this way discards interesting data that reflects variability across individuals. Furthermore, because averaging is done in a standard brain space (MNI or Talairach coordinates), it tends to accentuate signals in clear anatomical landmarks, while diminishing signals in other locations. Thus, the results are distorted and there is no real way to tell whether those results actually reflect what would be seen in any individual subject.

To avoid this problem, the tradition in psychophysics is to record substantially more data from fewer subjects (5 subjects in this study). This provides sufficient statistical power to analyze the data from individual subjects. Only after data processing is completed in individual subjects do we aggregate across subjects to recover general principles. This provides a much more sensitive assay than can be obtained through the more common group-averaging approach. Our procedure preserves all of the data at the individual subject level, and so we have considerably more detailed results that are much closer to ground truth. The only cost is that we need a greater amount of data from individual subjects, and the data processing applied to each subject is more labor intensive.

Several categories may appear on screen simultaneously in natural movies. How can brain activity evoked by separate categories be distinguished?

In this study we used regularized regression to estimate the relationship between brain activity measurements and 935 categories of objects and actions in the movies. The regression procedure works correctly even when multiple objects and actions appear on the screen simultaneously. There are some limitations: if two objects always appear together, then the regression procedure cannot discriminate between their effects. But fortunately, in natural movies the correlation between objects and actions is never perfect, so this is generally not a problem.
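The following sketch illustrates both points with synthetic data. It uses ridge regression as one common form of regularized regression; the text does not say which regularizer was used, so that choice, the function name and all of the numbers are assumptions. When two categories usually (but not always) co-occur, their separate weights can be recovered. When they always co-occur, only their combined effect can be estimated.

```python
import numpy as np

def ridge_fit(X, Y, alpha):
    """Closed-form ridge regression: solve (X'X + alpha*I) W = X'Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

rng = np.random.default_rng(0)
n_time = 2000

# Category A is on screen about 30% of the time. Category B copies A at
# about 80% of time points and is independent of A otherwise.
a = rng.random(n_time) < 0.3
b = np.where(rng.random(n_time) < 0.8, a, rng.random(n_time) < 0.3)
X = np.column_stack([a, b]).astype(float)

true_w = np.array([[2.0], [-1.0]])  # hypothetical effects of A and B
Y = X @ true_w + 0.5 * rng.normal(size=(n_time, 1))

# Imperfect co-occurrence: the two effects can be told apart.
w = ridge_fit(X, Y, alpha=1.0)
print(w.ravel())

# Perfect co-occurrence: both columns are identical, so the fit can only
# estimate the combined effect and splits it equally between them.
X_same = np.column_stack([a, a]).astype(float)
Y_same = X_same @ true_w + 0.5 * rng.normal(size=(n_time, 1))
w_same = ridge_fit(X_same, Y_same, alpha=1.0)
print(w_same.ravel())
```

In the second case the two estimated weights are equal and sum to roughly the combined true effect, which is the limitation described above.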

Why are results measured in terms of predictions rather than significance?

Most fMRI experiments focus on statistical significance. Statistical significance is important because it tells scientists that the thing that is being measured is not just random noise. However, a more rigorous and important criterion for any theory in science is its ability to predict the future. Because we seek to create a model that explains as much as possible about visual processing, we focus on predictive power rather than mere significance. To generate and test predictions, we collect two separate sets of data. The first set is used to estimate category weights for each location in cortex. The second set is used to validate the models by testing predictions of brain activity for each location.
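A minimal sketch of this estimate-then-validate procedure is shown below, using synthetic data. The ridge fit, the use of Pearson correlation as the prediction score, the function name and all sizes are assumptions for illustration rather than details taken from the study.

```python
import numpy as np

def prediction_correlation(Y_true, Y_pred):
    """Pearson correlation between measured and predicted responses,
    computed separately for each column (cortical location)."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    return (yt * yp).sum(axis=0) / np.sqrt((yt**2).sum(axis=0) * (yp**2).sum(axis=0))

rng = np.random.default_rng(1)
n_train, n_test, n_categories, n_locations = 400, 100, 10, 3

W_true = rng.normal(size=(n_categories, n_locations))
W_true[:, 2] = 0.0  # the third location carries no category information

# First data set: used only to estimate the category weights.
X_train = rng.normal(size=(n_train, n_categories))
Y_train = X_train @ W_true + rng.normal(size=(n_train, n_locations))

# Second data set: held out and used only to test predictions.
X_test = rng.normal(size=(n_test, n_categories))
Y_test = X_test @ W_true + rng.normal(size=(n_test, n_locations))

alpha = 1.0  # ridge penalty, chosen arbitrarily for this sketch
W_hat = np.linalg.solve(
    X_train.T @ X_train + alpha * np.eye(n_categories), X_train.T @ Y_train
)

scores = prediction_correlation(Y_test, X_test @ W_hat)
print(scores)  # high for the first two locations, near zero for the third
```

Because the scores are computed on data the model never saw during fitting, a high score cannot be produced by overfitting noise in the estimation set.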

Where is this research going?

We are now developing more detailed cortical maps that show how the brain represents various other perceptual and conceptual features, and how these representations are modulated by top-down influences such as attention. The approach used here and in our earlier papers (Kay et al., 2008; Naselaris et al., 2009; Nishimoto et al., 2011; Huth et al., 2012) provides a very powerful set of methods for recovering detailed information about the way that sensory and cognitive information is represented across the human brain under naturalistic conditions. Therefore, we are also beginning to examine other cognitive processes such as language and decision making.