A continuous semantic space describes the representation of thousands of object and action categories across the human brain

Alexander H. Huth, Shinji Nishimoto, An T. Vu & Jack L. Gallant (Neuron 2012)

Humans can see and name thousands of distinct object and action categories, so it is unlikely that each category is represented in a distinct brain area. A more efficient scheme would be to represent categories as locations in a continuous semantic space mapped smoothly across the cortical surface. To search for such a space, we used fMRI to measure human brain activity evoked by natural movies. We then used voxelwise models to examine the cortical representation of 1,705 object and action categories. The first few dimensions of the underlying semantic space were recovered from the fit models by principal components analysis. Projection of the recovered semantic space onto cortical flat maps shows that semantic selectivity is organized into smooth gradients that cover much of visual and nonvisual cortex. Furthermore, both the recovered semantic space and the cortical organization of the space are shared across different individuals.

Video explanation of the paper

The first author, Alex Huth, gives a brief explanation of the motivation for the work, the method used to collect and analyze the data, and the meaning of the results.

Frequently asked questions about this work

What were the various stages of the experiment?

The experiment can be broken down into the following stages: [1] collect brain activity data while people watch movies; [2] code the presence of 1705 object and action categories in the movies; [3] for each of the roughly 30,000 locations recorded across the cortical surface, find a set of weights that indicates how much each of the 1705 categories changes brain activity; [4] use principal components analysis (PCA) to reduce the dimensionality of the fit weights to something more manageable; [5] visualize the results by projecting the WordNet features and the cortical flat maps into the semantic space recovered in step 4.

Where does the WordNet graph come from?

WordNet is a relational tree that was developed by linguists. It is an “is-a” hierarchy: for example, a wolf is a canine, a canine is a mammal, a mammal is an animal, and an animal is an organism. The organization of the graph is based purely on intrinsic properties of the world and language; it has nothing to do with the brain data.
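The “is-a” structure can be illustrated with a minimal Python sketch. The dictionary below is a toy stand-in that contains only the wolf example from the text; it is not the real WordNet database, and the function names hypernym_chain and is_a are hypothetical helpers introduced for illustration.

```python
# Toy "is-a" hierarchy holding only the example from the text.
# This is an illustrative stand-in, NOT the real WordNet database.
HYPERNYM = {
    "wolf": "canine",
    "canine": "mammal",
    "mammal": "animal",
    "animal": "organism",
}


def hypernym_chain(category):
    """Return the category followed by each of its ancestors, most specific first."""
    chain = [category]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def is_a(category, ancestor):
    """True if `ancestor` lies above `category` in the hierarchy."""
    return ancestor in hypernym_chain(category)[1:]


print(" -> ".join(hypernym_chain("wolf")))
# wolf -> canine -> mammal -> animal -> organism
print(is_a("wolf", "mammal"), is_a("mammal", "wolf"))
# True False
```

Because each category points to a single parent, walking upward from any category always ends at the root of the tree.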

Do the semantic maps revealed by this study confirm or challenge current views of semantic organization in the human brain?

Previous theories of semantic organization in the human brain have generally fallen into two classes, region-based and distributed. Region-based theories state that certain categories (such as faces and body parts) are strongly represented in discrete, highly localized regions of cortex. Distributed theories state that certain categories (such as household objects) are represented by distributed patterns of activity. Our results suggest that the true answer lies somewhere between these extremes. Most parts of visual cortex respond to many different categories, but nearby areas of the cortex tend to respond to similar categories, and some parts are more narrowly tuned than others. Perhaps the most surprising aspect of the cortical maps is that category selectivity is so widespread, encompassing the entirety of higher visual cortex and other parts of the brain as well.

Do these maps reflect bottom-up or top-down effects?

When someone is watching a movie and a scene cut occurs, the brain activity that immediately follows is driven by the objects and actions in the movie. But after the scene has played out a bit, the viewer apprehends the scene and brain activity reflects attention and expectation. Thus, our results probably reflect both bottom-up and top-down effects. In the experiments reported in this paper, the subjects’ only task was to fixate on a small spot superimposed at the center of the movie. They were not required to think about the scenes or to do any other task. Therefore, we expect that most of the brain activity reflects bottom-up, non-attentional information. (We have also done some attention experiments using visual search; those results are now in review and will be published soon.)

What is PCA and what are PCs?

Principal components analysis (PCA) is a method for projecting a high-dimensional data set into a smaller, lower-dimensional space. The dimensionality reduction step in our experiments was critical for understanding the results and for finding commonality across subjects. After model fitting, each subject’s data consisted of a ~30,000 x 1705 matrix of weights, which is far too much data to understand directly. PCA finds a lower-dimensional semantic space that accounts for as much of the variance in the data as possible. In our study, we focused on the first 4 dimensions of this space, because those dimensions were almost identical across different individuals.
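As a rough illustration of this step, the sketch below runs PCA on a small synthetic weight matrix using NumPy’s singular value decomposition. The matrix sizes, the random data, the mean-centering step, and all variable names are assumptions chosen to keep the example small; this is not the analysis code used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)
n_voxels, n_categories, n_latent = 500, 40, 4  # toy sizes, far smaller than ~30,000 x 1705

# Synthetic voxel-by-category weights: four hidden dimensions plus a little noise.
latent = rng.normal(size=(n_voxels, n_latent)) * np.array([4.0, 3.0, 2.0, 1.5])
axes = rng.normal(size=(n_latent, n_categories))
weights = latent @ axes + 0.1 * rng.normal(size=(n_voxels, n_categories))

# PCA via SVD of the mean-centered matrix (centering is an assumption of this sketch).
centered = weights - weights.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)  # fraction of variance captured by each PC

n_keep = 4
components = Vt[:n_keep]                  # each row is one PC, a direction in category space
voxel_scores = centered @ components.T    # each voxel's position in the 4-D space

print(voxel_scores.shape)  # (500, 4)
print(f"variance explained by first {n_keep} PCs: {explained[:n_keep].sum():.3f}")
```

Each row of components assigns one number to every category, and each row of voxel_scores places one voxel in the reduced space, which is what makes it possible to draw both categories and cortical locations in the same low-dimensional space.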

What is the difference between PC 1 and PCs 2-4?

PC1 distinguishes between categories with high stimulus energy (e.g., moving objects like “person” and “vehicle”) and those with low stimulus energy (e.g., stationary objects like “sky” and “city”). Thus, although PC1 accounts for more of the variability in brain activity than the other PCs, it is not very interesting, because we would expect bright, fast things to elicit more brain activity than dim, slow things. PCs 2-4 are more informative, because they reveal aspects of semantic organization that are not merely related to stimulus energy.

What does fMRI measure and how can the signals be interpreted?

Functional MRI (fMRI) does not measure neural activity directly. It measures changes in blood oxygenation, blood flow and blood volume that are caused by neural activity. Most of the useful fMRI signal comes from veins about the size of the sampling lattice (in this study, volumes of about 2x2x4 mm). Given this, the only safe assumption is that fMRI signals are monotonically related to the integrated synaptic activity of the local neuropil upstream from the site of measurement. Thus, fMRI signals likely reflect both excitatory and inhibitory synaptic activity of both feed-forward and feed-back connections. That said, many studies that have aimed at validating fMRI have shown that fMRI signals are most closely correlated with local field potentials (LFPs) and local multi-unit activity (MUA). Other studies have shown that LFPs and MUA are most closely associated with excitatory neurons. (Perhaps this is because local inhibitory influences in cortex appear to be relatively untuned when compared to excitatory signals.) Finally, studies of the primate analog of the fusiform face area (FFA) have shown that the tuning of single excitatory neurons and of voxels is similar. Still, the conservative position is to assume that the jury is still out on the precise relationship between fMRI signals and neural signals.

Is it possible to determine in what order different regions of cortex were activated?

Not from these results. Functional MRI only measures slow signals related to blood oxygenation, blood volume and blood flow. These signals are much slower than the underlying neural signals, so it is very difficult to determine the order in which areas are activated using fMRI.

Why are data shown on the surface of a flattened cortex?

The cortex is a sheet of tissue that covers the visible surface of the brain. In the normal human, the cortex is highly folded so that the brain (and the head) can be as small as possible. To visualize the distribution of semantic selectivity across the cortical surface, we computationally inflate the cortex, then put a few cuts in it, then flatten it out (to see how this works, use our public interactive brain viewer). Then we can project the data onto the surface of the flattened cortex in order to visualize it easily. Note that most studies show fMRI results in 3D brain space. They do not flatten the cortex and they do not show flat maps. This makes their data difficult to see and interpret. Given that most fMRI studies are focused on cortical activity, we believe that these other studies would be easier to understand if they made flat maps as we did here.

This study uses co-authors as subjects. What are the advantages and disadvantages of this?

Most fMRI studies group the data from different subjects together, and they only make inferences at the group level. Grouping this way discards a lot of interesting data, and it tends to average out results that are due to individual variability. To maximize sensitivity we focus on modeling each subject individually. In this particular study the first three authors (Huth, Nishimoto and Vu) and two other members of the laboratory served as subjects. (Gallant did not serve as a subject; subject JG is someone else.) This is not uncommon in visual neuroscience experiments, because these experiments are tedious (they require many hours of scanning per subject), they require certain skills (like fixating steadily on a dot while a movie plays in the background), and they are unlikely to be heavily biased by anything the subject might do.

Is it appropriate to generalize the results of this experiment given that there were only five subjects?

Most people expect a more typical psychology sample size of, say, 20 subjects. However, when psychologists use large sample sizes, they end up averaging over subjects and they only show you the averaged results. After averaging there is no real way to tell whether those results actually reflect what would be seen in any individual subject. (If the variance between people was huge, then the mean would look nothing like the data acquired from any individual subject.) Furthermore, because averaging is done in a standard brain space (MNI or Talairach coordinates), it tends to increase signal from clear anatomical landmarks and reduce signal everywhere else. This distorts the results. To avoid this problem, the tradition in psychophysics is to record much more data from fewer subjects. This provides sufficient statistical power to analyze the data from individual subjects. Only after data processing is completed in individual subjects do we aggregate across subjects to recover general principles. This provides a much more sensitive assay than can be obtained the more common way, and it makes a huge difference in the results. Most fMRI studies show very strong localization, because their averaging procedure artificially increases signal in some places and decreases it in others. Our procedure preserves all of the data at the individual subject level, and so we have much more detailed results that are much closer to ground truth. The cost is that we need a lot more data from individual subjects, and the data processing applied to each subject is much more labor intensive.

Would the semantic organization of the brain be different in individuals from another culture?

It is possible that semantic organization would vary between individuals who grew up in very different environments. For example, the brain of someone who grew up in a western urban environment might represent the world differently from the brain of someone who grew up in a pre-industrial rainforest environment. These differences probably aren’t huge in the visual system, because much of the organization of visual cortex is likely driven by structural features such as edges and surfaces that are present in both natural and man-made environments. But these differences are likely to be large in higher-order regions of the brain that are less tightly tied to the physical world.

When several categories are on screen at the same time in the movie, how can brain activity evoked by one category be separated from activity evoked by a different category?

In this study we used regularized regression to estimate the relationship between brain activity measurements and 1705 categories of objects and actions in the movies. The regression procedure works correctly even when multiple objects and actions appear on the screen simultaneously. There are some limitations: if two objects always appear together, then the regression procedure cannot discriminate between their effects. But fortunately, in natural movies the correlation between objects and actions is never perfect, so this is generally not a problem.
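The following minimal sketch illustrates both points on synthetic data. It assumes ridge (L2-regularized) regression as one common form of regularized regression; the text does not specify which form was used. It also assumes just two hypothetical categories (“person” and “dog”), and the weights, noise level, and alpha value are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_timepoints = 2000

# Binary category indicators: "dog" copies "person" 80% of the time,
# so the two co-occur often but not always.
person = rng.random(n_timepoints) < 0.5
dog = np.where(rng.random(n_timepoints) < 0.8, person, rng.random(n_timepoints) < 0.5)
X = np.column_stack([person, dog]).astype(float)

true_w = np.array([2.0, -1.0])  # made-up response weights for one voxel
y = X @ true_w + 0.5 * rng.normal(size=n_timepoints)


def ridge_fit(X, y, alpha):
    """Closed-form ridge solution: (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Imperfect correlation: the two weights can still be told apart.
w_hat = ridge_fit(X, y, alpha=1.0)
print("correlated categories:", np.round(w_hat, 2))

# Perfect co-occurrence: the columns are identical, so only their combined
# effect is identifiable and ridge assigns the same weight to both.
X_dup = np.column_stack([person, person]).astype(float)
y_dup = X_dup @ true_w + 0.5 * rng.normal(size=n_timepoints)
w_dup = ridge_fit(X_dup, y_dup, alpha=1.0)
print("identical categories: ", np.round(w_dup, 2))
```

In the first case the estimated weights land close to the true values even though the categories are strongly correlated. In the second case the two weights are identical to each other, which shows why perfectly co-occurring categories cannot be separated.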

Why are results measured in terms of predictions rather than significance?

Most fMRI experiments focus on statistical significance. Statistical significance is important because it tells scientists that the thing that is being measured is not just random noise. However, a more rigorous and important criterion for any theory in science is its ability to predict the future. Because we seek to create a model that explains as much as possible about visual processing, we focus on predictive power rather than mere significance. To generate and test predictions we collect two separate sets of data. The first set is used to estimate a semantic model for each voxel. The second set, which plays no part in model estimation, is used to validate the models by testing their predictions of activity for each voxel.
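The estimate-then-validate logic can be sketched as follows. The data are synthetic, the ridge penalty and the use of the correlation between predicted and measured responses as the score are assumptions of this sketch, and names such as columnwise_corr are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_train, n_val, n_features, n_voxels = 600, 200, 20, 5

true_w = rng.normal(size=(n_features, n_voxels))
true_w[:, -1] = 0.0  # the last toy voxel carries no stimulus-related signal


def simulate(n):
    """Generate toy stimulus features and noisy voxel responses."""
    X = rng.normal(size=(n, n_features))
    Y = X @ true_w + 2.0 * rng.normal(size=(n, n_voxels))
    return X, Y


X_train, Y_train = simulate(n_train)  # first data set: estimate the models
X_val, Y_val = simulate(n_val)        # second data set: test the predictions

# Fit one ridge model per voxel (all voxels solved at once), training data only.
alpha = 10.0
W = np.linalg.solve(X_train.T @ X_train + alpha * np.eye(n_features), X_train.T @ Y_train)

# Predict responses to stimuli the models have never seen.
Y_pred = X_val @ W


def columnwise_corr(A, B):
    """Pearson correlation between matching columns of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    return (A * B).sum(axis=0) / np.sqrt((A**2).sum(axis=0) * (B**2).sum(axis=0))


pred_corr = columnwise_corr(Y_pred, Y_val)
print(np.round(pred_corr, 2))
```

Voxels whose responses depend on the stimulus features are predicted well on the held-out set, while the voxel with no signal yields a correlation near zero. A model that merely fit noise in the first data set would fail this test.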

Where is this research going?

We are now using the detailed category selectivity maps revealed here to answer specific questions about previously identified brain regions. We are also investigating how these maps are modulated by top-down influences such as attention. Finally, we are developing more detailed interactive maps that show how the brain responds to various sorts of perceptual and conceptual features. The approach used here and in our earlier papers (Kay et al., 2008; Naselaris et al., 2009; Nishimoto et al., 2011) provides a very powerful and quite general set of methods for recovering detailed information about the way that sensory and cognitive information is represented across the human brain. Therefore, we are also beginning to examine other cognitive processes such as language and decision making.