Nature Neuroscience

# Mouse visual cortex areas represent perceptual and semantic features of learned visual categories

### Mice

All experimental procedures were conducted according to institutional guidelines of the Max Planck Society and the regulations of the local government ethical committee (Beratende Ethikkommission nach §15 Tierschutzgesetz, Regierung von Oberbayern). Adult male C57BL/6 mice ranging from 6 to 10 weeks of age at the start of the experiment were housed individually or in groups in large cages (type III and GM900, Tecniplast) containing bedding, nesting material and two or three pieces of enrichment such as a tunnel, a triangular-shaped house and a running wheel (Plexx). In a subset of experiments (stimulus-shift experiment, n = 3; local cortical inactivation experiment, n = 3), we used mice (12 to 15 weeks old; two female and one male) that expressed the genetically encoded calcium indicator GCaMP6s in excitatory neurons (B6;DBA-Tg(tetO-GCaMP6s)2Niell/J (Jax, 024742) crossed with B6.Cg-Tg(Camk2a-tTA)1Mmay/DboJ (Jax, 007004))77,78. All mice were housed in a room having a 12-h reversed day/night cycle, with lights on at 22:00 and lights off at 10:00 in winter time (23:00 and 11:00 in summer time), a room temperature of ~22 °C and a humidity of ~55%. Standard chow and water were available ad libitum except during the period spanning behavioral training, in which access to either food or water was restricted (for a detailed procedure see ref. 79).

### Head bar implantation and virus injection

A head bar was implanted under surgical anesthesia (0.05 mg per kg body weight fentanyl, 5.0 mg per kg body weight midazolam, 0.5 mg per kg body weight medetomidine in saline, injected intraperitoneally) and analgesia (5.0 mg per kg body weight carprofen, injected subcutaneously (s.c.); 0.2 mg ml⁻¹ lidocaine, applied topically) using procedures described earlier79. Next, a circular craniotomy with a diameter of 5.5 mm was performed over the visual cortex and surrounding higher visual areas. The location and extent of V1 was determined using IOS imaging37,80,81 and the locations of higher visual areas were extrapolated based on the acquired retinotopic maps and literature36,37,82,83. A bolus of 150 nl to 250 nl of AAV2/1-hSyn-GCaMP6m-GCG-P2A-mRuby2-WPRE-SV40 (ref. 84) was injected at 50 nl min⁻¹ in the center of V1 and into four to six higher visual areas at a depth of 350 μm below the dura (viral titers were 1.24 × 10¹³ and 1.02 × 10¹³ GC per ml). Following virus injection, the craniotomy was closed using a cover glass with a diameter of 5.0 mm (no. 1 thickness) and sealed with cyanoacrylate glue and a thin edge of dental cement. Animals recovered from surgery under a heat lamp and received a mixture of antagonists (1.2 mg per kg body weight naloxone, 0.5 mg per kg body weight flumazenil and 2.5 mg per kg body weight atipamezole in saline, injected s.c.). Postoperative analgesia (5.0 mg per kg body weight carprofen, injected s.c.) was given on the next 2 d. In some animals, we performed a second surgery (following procedures as described above) to remove small patches of bone growth underneath the window.

### Visual stimuli for information-integration categorization

Visual information-integration categories were constructed from a 2D stimulus space of orientations and spatial frequencies, in which the category boundary was determined by a 45° diagonal line6,53,58,85. In experiments with freely moving mice, the category space consisted of stationary square-wave gratings of approximately 7 cm in diameter, having one of seven orientations equally spaced by 15° between the cardinal axes, and seven spatial frequencies (0.03, 0.035, 0.04, 0.05, 0.07, 0.09 and 0.11 cycles per degree, as seen from a distance of 2.5 cm). Stimuli exactly on the diagonal category boundary were left out, resulting in two categories with 21 stimuli each (Fig. 1b). In the touch screen task, animals tended to weigh orientation over spatial frequency, which is possibly the result of greater variability in perceived spatial frequency than orientation during the approach to the screen.
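The construction of this category space can be sketched as follows. This is a minimal illustration, assuming that the 45° boundary corresponds to the main diagonal of the 7 × 7 index grid and that the orientations run from 0° to 90°; the function name and the category labels "A" and "B" are hypothetical.

```python
from itertools import product

ORIENTATIONS = [15 * i for i in range(7)]  # 0..90 deg in 15-deg steps (assumed range)
SPATIAL_FREQS = [0.03, 0.035, 0.04, 0.05, 0.07, 0.09, 0.11]  # cycles per degree


def build_category_space():
    """Assign each grid stimulus to a category, skipping the diagonal boundary."""
    categories = {"A": [], "B": []}
    for i, j in product(range(7), range(7)):
        if i == j:  # stimulus lies exactly on the boundary and is left out
            continue
        label = "A" if i > j else "B"
        categories[label].append((ORIENTATIONS[i], SPATIAL_FREQS[j]))
    return categories


space = build_category_space()
print(len(space["A"]), len(space["B"]))  # prints: 21 21
```

Removing the seven diagonal stimuli from the 49 grid positions leaves 42 stimuli, split evenly between the two categories.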

In experiments with head-fixed mice, visual stimuli consisted of sinusoidal drifting gratings presented in a 32° diameter patch and extended by 4° wide faded edges, on a gray background. The stimulus was positioned in front of the mouse with its center at 26° azimuth and 10° elevation. In experiments without chronic imaging, stimuli had one of seven orientations spaced by 20°, and one of six spatial frequencies (0.04, 0.06, 0.08, 0.12, 0.16 and 0.24 cycles per degree) and drifted with 1.5 cycles per second in a single direction. The category space was always centered on one of the cardinal orientations (for example, centered on 180° resulted in a stimulus range from 120° to 240°). The category boundary had an angle of 45° and was placed such that no stimuli were directly on the boundary (Fig. 2b). In these experiments, animals tended to weigh spatial frequency more strongly than orientation, which could indicate that the differences in spatial frequency were perceived as more salient.
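The orientation range quoted above follows from simple arithmetic, sketched here with a hypothetical helper that places a given number of equally spaced orientations symmetrically around a center orientation.

```python
def centered_orientations(center_deg, n_orientations, step_deg):
    """Return n orientations spaced by step_deg, symmetric around center_deg."""
    half_span = step_deg * (n_orientations - 1) / 2
    return [center_deg - half_span + k * step_deg for k in range(n_orientations)]


# Seven orientations spaced by 20 deg, centered on 180 deg.
orientations = centered_orientations(180, 7, 20)
print(orientations[0], orientations[-1])  # prints: 120.0 240.0
```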

For experiments in which the stimulus position was altered, the center of the computer monitor was repositioned from the default setting (right of the mouse, 26° azimuth) to a position straight in front of the mouse (0° azimuth) or left of the mouse (−26° azimuth; Fig. 2c). The monitor rotated on a swivel arm that was secured below the mouse such that the foot point (the point closest to the eye) was always in the center of the monitor. In addition, we verified that at each position the monitor was equidistant to the mouse. The relative position of the stimulus on the computer monitor and all other features were kept constant.

For chronic imaging, most stimulus parameters were identical to experiments without imaging. The complete stimulus space consisted of a full 360° range of orientations spaced by 18° (two directions of motion per orientation) and five spatial frequencies (either 0.06, 0.08, 0.12, 0.16 and 0.24 cycles per degree or 0.04, 0.06, 0.08, 0.12 and 0.16 cycles per degree). For each mouse, the category space was selected to contain six consecutive orientations (spaced by 18°) and the full range of five spatial frequencies, centered on one of the cardinal orientations (for example, centered on 180° resulted in a range from 135° to 225°). However, the stimuli were reduced in number; only the stimuli furthest from the boundary (initial stimuli) and closest to the category boundary (category stimuli) were used in the behavioral task (Fig. 3g). The reduced category space was implemented to consist of fewer stimuli, such that each stimulus would have a larger number of presentations (trials), thus facilitating a precise assessment of stimulus and category selectivity in the neural data. The angle of the category boundary in chronic imaging experiments was adjusted for rule bias to 23° (or 67° in two mice) to aid the animals that were biased to follow information of a particular stimulus dimension (see Extended Data Fig. 4b for the individual category space of each chronically imaged mouse).

### Touch screen operant chamber

Conditioning of freely moving animals was done in a modular touch screen operant chamber (MED Associates), which was operated using commercial software (K-LIMBIC) and was placed in a sound-attenuating enclosure86,87,88. The north wall of the operant chamber consisted of a touch screen with two apertures in which visual stimuli were presented, and a small petri dish that served as receptacle for a food pellet (equivalent to regular chow; TestDiet 5TUM). The south wall housed a lamp, a speaker and a retractable lever, and the east wall of the chamber held a water bottle.

Animals were pretrained in three stages. First, food-restricted mice were habituated to the experimental environment for a single, 20-min session, during which they were placed in the operant chamber and in which the food pellet receptacle contained 20–30 food pellets. In the next stage, the animals were exposed to a rudimentary trial sequence. After a 30–60-s intertrial interval, two visual stimuli were presented in the apertures of the touch screen monitor. The stimuli differed in both spatial frequency and orientation. Touching one of the two stimuli (the rewarded stimulus) led to delivery of a food pellet in the receptacle (food tray), while touching the other stimulus had no effect. If the mouse did not touch the rewarded stimulus within ~30 s from stimulus onset, the trial timed out and the next intertrial interval started. This stage lasted for two to four daily sessions (each lasting 1–1.5 h), until the mouse performed at least 50 rewarded trials. In the final pretraining stage, the lever was introduced. The trial sequence was almost identical to the previous stage, except now the trial started with lever extrusion instead of visual stimulus presentation. The visual stimuli were only presented after the mouse had pressed the lever. If the mouse failed to press the lever within ~30 s, the trial timed out (without visual stimulus presentation) and the sequence proceeded with the next intertrial interval.

Mice switched to the operant training paradigm as soon as they performed over 50 rewarded trials in the last pretraining stage. The trial sequence was very similar to the pretraining sequence: a 30–60-s intertrial interval was followed by lever extrusion (Extended Data Fig. 1b). When the mouse pressed the lever, it was retracted, and two visual stimuli were presented in the apertures of the touch screen. One stimulus was selected from the rewarded category and one stimulus was selected from the non-rewarded category such that they mirrored each other’s position across the center of the category space. If the mouse touched the screen within the aperture where the rewarded stimulus was presented, a food pellet was delivered in the receptacle. If the mouse touched the non-rewarded stimulus, the trial ended and proceeded to the next intertrial interval. Because the intertrial interval already lasted 30–60 s, no additional time-out or other punishment was implemented.

Finally, after mice had learned to discriminate the first set of two stimuli (>70% correct), we introduced four additional stimuli, one step closer to the category boundary. The original stimuli were also kept in the stimulus set. If there was a reduction in performance, animals were trained for a second day on this new stimulus set. Over the next 3 d, we introduced six, eight and ten additional stimuli. The set of ten stimuli was trained for 2–3 d, after which we added the final 12 stimuli and the animals discriminated the full information-integration categorization space (Fig. 1b).
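The number of stimuli added at each step matches the number of grid positions at each distance from a diagonal boundary. The sketch below illustrates this, assuming the 7 × 7 stimulus grid of the touch screen task with the boundary on the main diagonal; the variable names are hypothetical.

```python
from collections import Counter

GRID_SIZE = 7  # seven orientations x seven spatial frequencies

# Distance to the boundary in grid steps, assuming the boundary is the main diagonal.
steps_from_boundary = Counter(
    abs(i - j)
    for i in range(GRID_SIZE)
    for j in range(GRID_SIZE)
    if i != j
)

# Training starts with the stimuli furthest from the boundary and moves inward.
schedule = [steps_from_boundary[d] for d in range(GRID_SIZE - 1, 0, -1)]
print(schedule, sum(schedule))  # prints: [2, 4, 6, 8, 10, 12] 42
```

The cumulative total of 42 stimuli equals the two categories of 21 stimuli each.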

### Head-fixed operant conditioning

Head-restrained conditioning was performed in a setup described in ref. 79. In brief, the mouse was placed with its head fixed, on an air-suspended Styrofoam ball89,90, facing a computer monitor (Fig. 2a). The computer monitor was placed with its center at 26° azimuth and 0° elevation. The monitor extended 118° horizontally and 86° vertically, and pixel positions were adapted to curvature-corrected coordinates37. Two lick spouts were positioned in front of the mouse within reach of the tongue91. The setup recorded licks on each spout, as well as the running speed on the Styrofoam ball using circuits described in ref. 79. Water rewards were delivered through each lick spout by gravitational flow using a fully opening pinch valve (NResearch). The setup was controlled by a closed-loop MATLAB routine using Psychophysics Toolbox extensions92 for showing visual stimuli, and in addition, all signals were continuously recorded using a custom-written LabView routine.

Before head-fixed training, animals were habituated by handling, exposure to the Styrofoam ball and by drinking water from a handheld lick spout. After the habituation period, animals underwent head-fixed pretraining in two stages.

Pretraining phase 1 consisted of trials in which animals were trained to lick for reward on a single lick spout. Each trial in this training phase started with an intertrial interval of 2 s, followed by a period during which the mouse had to withhold from running and licking for 0.5 ± 0.05 s (a no-lick, no-run period). Next, stimulus presentation commenced, with the stimulus randomly selected from the full set of stimuli (all combinations of five different spatial frequencies and ten different orientations, moving in two directions; ‘Visual stimuli for information-integration categorization’). Stimulus presentation lasted 0.9 ± 0.1 s. After stimulus presentation, and a 0.1-s delay, there was a period in which the mouse could make a response (response window), lasting 10 s. The first lick on the lick spout within the response window resulted in immediate delivery of a water reward. The trial would count as correct and the trial sequence proceeded into the intertrial interval of the next trial. If the mouse did not make a lick, the response window would time out, the trial counted as a miss and the trial sequence also proceeded into the intertrial interval. The goal of this stage was to familiarize the mouse with the general sequence of withholding licking and running, stimulus presentation and licking for reward. Animals were typically kept in this stage for 4–6 d, and during these days the intertrial interval was gradually lengthened to 5 s.

Pretraining phase 2 consisted of the same basic trial structure as phase 1, but had two available lick spouts. During phase 2, the no-lick, no-run period was increased to 0.7 ± 0.1 s, stimulus presentation was lengthened to 1.5 ± 0.1 s and the delay between stimulus offset and response window was increased to 0.2 ± 0.1 s. The presented stimulus was chosen randomly from the same set as in phase 1, but now only one of the two lick spouts was randomly assigned for reward delivery (there was no relation between the stimulus and the rewarded lick spout). Water reward was given after the mouse had licked the predetermined lick spout. If the mouse licked the other spout, it had no effect on the trial flow; that is, the mouse could still lick the other spout and obtain the reward within the period of the response window. Pretraining phase 2 lasted until the animal performed >50 trials per day, and at least until the period of out-of-task baseline imaging ended (duration ranging between 7 and 17 d).

Following pretraining, animals were initially trained using two stimuli, one requiring a lick response on the left lick spout and one on the right lick spout. These training sessions implemented the same trial structure as pretraining phase 2 (Extended Data Fig. 2b), but now the stimuli indicated the side of the lick spout that would give a drop. For the first three to five training sessions, licks on the incorrect spout did not alter the trial flow (these sessions are marked as ‘shaping’ in the timeline in Extended Data Fig. 2a). After these initial shaping sessions, a lick on the incorrect spout during the response window period resulted in a time-out stimulus (black bar, 8° high and 106° wide, centered on the computer monitor), which was presented for the duration of the 2-s time-out. Time-out stimuli were not shown during imaging. After initial stimuli were discriminated with more than 70% correct, we gradually introduced more stimuli for categorization. As long as performance stayed above 65% correct trials, we added stimuli that were, each time, one step closer to the category boundary until the full categories were discriminated.

During pretraining phase 2 and subsequent training, an automated lick-side bias-correction algorithm directed the setup to increase the number of trials having the active lick spout on the side that the animal did not prefer (see ref. 79). This algorithm was stopped as soon as the animal showed signs of above-chance stimulus discrimination and was never implemented during sessions in which imaging was performed during the behavioral task. In a subset of experiments, we initially displaced the retinotopic position of the stimuli to the left and the right sides of the monitor (−16 and +16° azimuth) in such a way that it matched the side of the active lick spout where the response should be made. This was done to facilitate learning of the ‘lick-left’/‘lick-right’ association. This training stage is marked ‘shifted’ in the timeline depicted in Extended Data Fig. 2a. After mice reached the criterion using this position-shifted paradigm, we gradually shifted all stimuli to the center position and proceeded with the imaging of the time point ‘stimulus discrimination’ only when stable high performance was maintained without stimulus shifts.

In a subset of experiments (three of five mice from the experiment in which the monitor position was altered and in experiments presented in Extended Data Fig. 5), we connected the above-described lick-side bias-correction algorithm to a servo system that could micro-adjust the left/right position of the lick spouts. While online adjustments of the lick spout position were not made often, this method of physically opposing the lick spout position to the side bias could correct the left/right licking behavior of mice that occasionally defaulted to respond only on a single lick spout. These online adjustments, however, could not in any way affect behavioral performance or category-specific choices of the mouse.

### Time points of image acquisition

Imaging sessions were performed throughout the experiment and differed in several aspects. Each imaging time point was acquired over multiple days, with a different visual cortical area imaged on each day. For each mouse, the same subset of cortical areas was imaged at every imaging time point throughout the experiment. Thus, each time point contained the same complete cycle through all areas (Fig. 3a). We acquired imaging data using two different visual stimulation protocols, one for in-task imaging and one for out-of-task imaging.

Out-of-task imaging sessions were acquired at two baseline imaging time points, during the period of pretraining. In addition, one out-of-task time point was acquired at the end of the chronic imaging experiment (Fig. 3a). Out-of-task imaging sessions were always acquired after the behavioral session had been completed, thus the animal was in a satiated state. In these imaging sessions, the setup was kept in the same configuration as during behavioral training, except that the lick spouts were moved out of the mouse’s view. The imaging sessions started with 15 min of darkness, followed by ~15 s of gray screen (50% luminance, allowing the animals to adapt to the screen brightness). Next, stimuli were presented, interleaved by periods of a gray screen. The stimuli were presented in eight blocks containing all 100 unique stimuli (all combinations of ten orientations, moving in two directions and five spatial frequencies). The order of stimulus presentation was shuffled within each block individually.
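The block-wise shuffling of the out-of-task stimulus set can be sketched as below. This is an illustration only: stimuli are represented by index tuples, and the function name and the `seed` argument are hypothetical (the original routine was written in MATLAB).

```python
import random
from itertools import product

ORIENTATION_INDICES = range(10)  # ten orientations
DIRECTIONS = range(2)            # two directions of motion per orientation
SPATIAL_FREQ_INDICES = range(5)  # five spatial frequencies
N_BLOCKS = 8


def make_block_sequence(n_blocks=N_BLOCKS, seed=None):
    """Return n_blocks lists, each an independent shuffle of all unique stimuli."""
    rng = random.Random(seed)
    stimuli = list(product(ORIENTATION_INDICES, DIRECTIONS, SPATIAL_FREQ_INDICES))
    blocks = []
    for _ in range(n_blocks):
        block = stimuli[:]  # copy, so every block contains each stimulus once
        rng.shuffle(block)
        blocks.append(block)
    return blocks


blocks = make_block_sequence(seed=0)
print(len(blocks), len(blocks[0]))  # prints: 8 100
```

Shuffling within blocks guarantees that every stimulus is shown exactly once per block, so presentations of a given stimulus are spread evenly over the session.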

In-task imaging sessions started with 12 min of darkness, followed by ~15 s of gray screen (50% luminance) during which the mouse usually received a few drops of water to indicate that the task was about to start. After this pre-task period, the visual categorization task started and lasted for 35 min. At the end of the imaging session, there was another period of 12 min darkness and a 3-min period in which a water reward was given roughly every 20 s, on either the left or the right lick spout (pseudorandom side assignment per reward). In-task imaging sessions were performed at three distinct time points (Fig. 3a). The first in-task imaging time point was acquired during the baseline period, directly after pretraining was finished. At these time points, both the initial and the category stimuli were included in the stimulus set. If the animal made a mistake in such a session (that is, a lick on the incorrect side), no punishment or time-out was implemented (that is, the animal could still obtain a reward by making a lick on the other spout). In three animals, we performed a full repeat of the in-task baseline imaging time point. The second in-task imaging time point was acquired after the animal had reached the criterion on the visual discrimination task. At this time, only the initial stimuli were shown. Incorrect choices were always followed by a time-out, but without the visual time-out stimulus being shown. The final in-task imaging time point was acquired after the animal performed above chance on the category learning task. This task included only the category stimuli.

### Muscimol inactivation

At the end of the chronic imaging time series, five mice underwent two experiments on consecutive days, in which visual cortical areas were inactivated, or a control manipulation was performed. The order of cortical inactivation and the control experiment was counterbalanced across mice. Under isoflurane anesthesia (3% induction and 1.5% maintenance in O2), the chronically implanted cranial window was opened and the surface of the exposed cortex was treated for 20 min with a solution containing 5 mM muscimol in aCSF93. Subsequently, the cortex was covered with 0.75% agarose (in aCSF) containing 5 mM muscimol, and sealed with a cover glass. The mouse was allowed to recover for approximately an hour. During the behavioral experiment following this manipulation, we performed calcium imaging of L2/3 and L5 neurons in primary visual cortex to confirm cortical inactivation. The control experiment was executed in the exact same way, except that muscimol was not added to the aCSF.

For the targeted inactivation of specific visual cortical areas, three mice that were extensively trained on the information-integration category task underwent a series of muscimol (inactivation) and saline (control) injections into retinotopically determined visual cortical areas (V1, AL and POR). In all mice, inactivation and control conditions were interleaved by one day of behavioral training without manipulation (for timeline, see Extended Data Fig. 5a). Mice were lightly anesthetized with isoflurane (3% for induction and 1.2–1.5% for maintenance in O2), the chronically implanted window was opened and either a 25-nl solution of 5 mM muscimol in saline or 25 nl saline was injected 300 µm below the cortical surface. The injection parameters were calibrated to result in a spread of the injected solution approximately 700 µm radially from the injection center (Extended Data Fig. 5b). Injections targeted at area AL were done slightly more anterolaterally such that they likely also affected area RL, but not area LM. Injections targeted at area POR likely inactivated areas LI and LM also. Following the injection, the cortex was sealed with a cover glass. After approximately an hour of recovery, categorization behavior was tested.

### Intrinsic signal imaging

IOS imaging was performed according to methodology described before80. For IOS imaging during window implantation surgery, we illuminated the exposed, cleaned skull, within the 7-mm-diameter central opening of the head bar. We centered an approximately 5 × 5-mm FOV on stereotaxic coordinates of V1 and focused the image on the surface of the exposed skull using green light (540 nm). For IOS imaging through an implanted cranial window, we centered the FOV on the window and focused the image on the dural and pial blood-vessel pattern. Next, we changed the illumination wavelength to 740 nm (emission filter of 740 nm, full-width half-maximum value of 10 nm) and moved the focal plane down to approximately 800 μm below the skull surface, which was an estimated 300–400 μm below the pial surface. Images were acquired using a Teledyne DALSA Dalstar CCD camera and a Matrox frame grabber. Data processing and storage were done using a custom-written image acquisition and analysis program in MATLAB (MathWorks). During the period of image acquisition, we presented visual stimuli on a curvature-corrected37, gamma-corrected, LCD monitor (DELL; 59.9 cm wide and 33.8 cm high). The monitor background luminance was kept at 50% gray values, which was equiluminant to the visual stimuli when averaged over a larger area.

For discrete retinotopic maps81, the visual stimulus was a square-wave grating (0.04 cycles per degree), drifting at two cycles per second in eight directions in a semi-random sequence (500 ms per direction). The stimulus was presented for a duration of 6 s in a square or rectangular aperture of a specific retinotopic size (that depended on the number of apertures (patches) used for mapping). We typically used four or six patches for IOS imaging during window implantation surgery, and thus presented the stimuli in a 2 × 2 or 2 × 3 vertical/horizontal grid. When imaging through an already-implanted cranial window, we typically used 12 (3 × 4), 15 (3 × 5) or 24 (4 × 6) patches. Stimulus presentations were interleaved by a 12-s inter-stimulus interval.

For continuous retinotopic maps37,94, we presented a checkerboard stimulus in a wide rectangular aperture spanning 20° on one axis and the full width/height of the monitor on the other axis. The checkerboard pattern consisted of a grid of full-contrast black and white patches, ~12° in size, repositioned and contrast inverted every 166 ms. The aperture in which the checkerboard was displayed drifted continuously across the screen. Each of the four cardinal drift directions looped either 10–20 times at a drift speed of 3–4° per second or 40–50 times at a drift speed of 15–20° per second, with a 30-s pause in between sets of drift-direction loops.

### Two-photon calcium imaging

In vivo two-photon calcium imaging95 was performed with a customized commercially available Bergamo II (Thorlabs) two-photon laser scanning microscope96 using a pulsed femtosecond Ti:Sapphire laser (Mai Tai HP Deep See, Spectra-Physics) and controlled by ScanImage 4 (ref. 97). The calcium indicator GCaMP6m98 and the structural marker mRuby2 (ref. 99) were both excited with a wavelength of 940 nm. Emitted photons were filtered for reflected laser light (720/25 short-pass filter), spectrally separated using a dichroic beamsplitter (FF560) and two band-pass filters (500–550 nm for GCaMP6m; 572–642 nm for mRuby2) and detected using two GaAsP photomultiplier tubes. Laser power was kept between 18 and 35 mW, depending on the depth of imaging and the quality of the chronic window. Images were acquired from two alternating planes, 40 μm apart, using a ×16 0.8-NA objective (Nikon) mounted on a piezoelectric stepper (Physik Instrumente). The xy image dimensions were 325 × 250 μm (512 × 512 pixels), and each image plane was acquired at a rate of ~15 Hz (total frame rate of ~30 Hz).

### Image processing

The background signal of the photomultiplier tubes was measured at the start of each imaging stack, and the mean background signal level was subtracted from the entire stack (dark noise subtraction). Lines in the images were scanned bidirectionally and an inadvertent line shift was corrected for by calculating the maximum cross-correlation of lines scanned in each direction. Image planes from acquired stacks were realigned to correct for in-plane movement artifacts, using an algorithm that calculates the maximum cross-correlation of the Fourier transforms of two images100.
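The bidirectional line-shift estimate can be illustrated with a one-dimensional cross-correlation. The sketch below is a simplified stand-in, not the original implementation: it assumes integer-pixel shifts, a small search range and mean-subtracted lines, and all function names are hypothetical.

```python
def cross_correlation(a, b, shift):
    """Sum of products a[x] * b[x + shift] over the overlapping samples."""
    total = 0.0
    for x in range(len(a)):
        y = x + shift
        if 0 <= y < len(b):
            total += a[x] * b[y]
    return total


def estimate_line_shift(forward_line, backward_line, max_shift=5):
    """Return the integer s for which backward_line[x + s] best matches forward_line[x]."""
    # Subtract the mean so that overall brightness does not dominate the correlation.
    mean_f = sum(forward_line) / len(forward_line)
    mean_b = sum(backward_line) / len(backward_line)
    f = [v - mean_f for v in forward_line]
    b = [v - mean_b for v in backward_line]
    return max(range(-max_shift, max_shift + 1),
               key=lambda s: cross_correlation(f, b, s))


forward = [0, 0, 1, 5, 9, 5, 1, 0, 0, 0, 0, 0]
backward = [0, 0, 0, 0, 1, 5, 9, 5, 1, 0, 0, 0]  # same profile, displaced by two pixels
print(estimate_line_shift(forward, backward))  # prints: 2
```

Once the shift is known, lines scanned in one direction can be displaced by the opposite amount to restore alignment.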

### Within-session and across-session region of interest identification

To assist with image annotation, we produced a high signal-to-noise average image for each channel from the resulting stack as well as a maximum projection image using a running average of 5 s. In addition, we calculated a ΔF/F stimulus locked-response image in which brightness of the pixels indicated the stimulus-induced increase in fluorescence relative to baseline, for that pixel. The outlines of neuronal regions of interest (ROIs) from five mice were annotated manually by using the average image of each channel, but with assistance of the maximum projection and the ΔF/F response image. Annotations were made by one of three experimenters, and subsequently adjusted by a single experimenter using a custom-written MATLAB (MathWorks) program.

These manually annotated image stacks were used to train two multilayered convolutional neural networks programmed using Tensorflow101 and Python3, which were then used to annotate the imaging stacks for five additional mice (https://github.com/pgoltstein/NeuralNetImageAnnotation/). One network annotated the centers of neurons (5 × 5-pixel centroid region) and the other annotated the complete somata of neurons on a pixel-by-pixel basis. We used the average image of both imaging channels, as well as the ΔF/F response image as source data for the annotation. The input layer of the network supplied a 33 × 33-pixel FOV around each single pixel, thus its dimensions were 33 × 33 pixels by three channels. The network had four 3 × 3 convolutional layers with 2 × 2 max-pooling applied to each of these layers, and 16, 32, 64 and 128 channels in each layer, respectively. The last convolutional layer was connected to a fully connected layer containing 512 units, and the fully connected layer in turn connected to two output layer units, one indicating that the pixel was part of the ROI center or body, and one unit indicating the inverse. All layers consisted solely of rectifying linear units.
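The feature-map sizes implied by this architecture can be traced with a short calculation. Padding and pooling stride are not stated above, so the sketch assumes 'same' padding for the convolutions and stride-2 pooling that rounds odd sizes up (the behavior of TensorFlow's 'SAME' padding mode); the function name is hypothetical.

```python
import math

INPUT_SIZE = 33                    # 33 x 33-pixel patch around each pixel
INPUT_CHANNELS = 3                 # two average images plus the response image
CONV_CHANNELS = [16, 32, 64, 128]  # channels of the four 3 x 3 convolutional layers
DENSE_UNITS = 512
OUTPUT_UNITS = 2


def layer_shapes(size=INPUT_SIZE):
    """Trace (height, width, channels) through the network under the stated assumptions."""
    shapes = [(size, size, INPUT_CHANNELS)]
    for channels in CONV_CHANNELS:
        # A 3 x 3 convolution with 'same' padding keeps the spatial size;
        # 2 x 2 max-pooling with stride 2 then halves it, rounding up.
        size = math.ceil(size / 2)
        shapes.append((size, size, channels))
    return shapes


shapes = layer_shapes()
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes)
print(flattened, "->", DENSE_UNITS, "->", OUTPUT_UNITS)
```

Under these assumptions the spatial size shrinks from 33 to 17, 9, 5 and 3 pixels, so the last convolutional stage passes 3 × 3 × 128 = 1,152 values to the fully connected layer.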

The network was trained by minimizing the softmax cross-entropy using the Adam optimizer102 on repeated batches of 2,000 samples, drawn equally from the training data (512 × 512 pixels from 122 images from five mice). Regularization during training was implemented by dropout in the fully connected layer with a probability of 0.5. Each network was trained using a learning rate varying between 10 × 10⁻³ and 10 × 10⁻⁵. The centroid-detecting network was trained on 10.6 × 10⁶ samples, and the cell-body-detecting network was trained on 96.1 × 10⁶ samples. Cross-validated pixel-wise performance was determined using 122 different annotated images of the same mice. The centroid-detecting network performed at 87.5% correct (precision of 0.95, recall of 0.79) and the cell-body-detecting network performed at 86.6% correct (precision of 0.88, recall of 0.84). Next, an algorithm identified centers of individual cells from the network-centroid annotations and used the network-body annotations to detect the outlines of these cells. Network annotations were further corrected by a single experimenter using a custom-written MATLAB (MathWorks) program.

Before further processing, we removed overlap between annotations using an algorithm. In addition, we removed all (parts of) annotations that, due to motion artifacts, shifted out of the FOV for more than 0.1% of the stack. We aligned annotations of all stacks from a single chronic recording using a custom-written MATLAB (MathWorks) program that matched ROIs across imaging sessions using an affine transform and allowed additional manual control over alignment parameters. Neurons that shared more than 50% overlap of the cell-body pixels were defined as a putative matched group. Finally, we manually inspected and corrected all matched groups that were present in all chronic recordings for continuity, missed annotations or false-positive annotations.
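The overlap criterion for putative matched groups can be sketched as follows. The sketch assumes that the ROIs have already been brought into a common coordinate frame, that each ROI is a non-empty set of pixel coordinates, and that overlap is measured relative to the smaller of the two ROIs; the normalization and all names are assumptions.

```python
def overlap_fraction(roi_a, roi_b):
    """Fraction of the smaller ROI's pixels that are shared with the other ROI."""
    shared = len(roi_a & roi_b)
    return shared / min(len(roi_a), len(roi_b))


def match_rois(session_a, session_b, threshold=0.5):
    """Pair ROI indices across two aligned sessions when overlap exceeds threshold."""
    matches = []
    for i, roi_a in enumerate(session_a):
        for j, roi_b in enumerate(session_b):
            if overlap_fraction(roi_a, roi_b) > threshold:
                matches.append((i, j))
    return matches


def square(x0, y0, size=4):
    """Toy ROI: a size x size block of (x, y) pixel coordinates."""
    return {(x, y) for x in range(x0, x0 + size) for y in range(y0, y0 + size)}


session_1 = [square(0, 0), square(10, 10)]
session_2 = [square(1, 0), square(20, 20)]
print(match_rois(session_1, session_2))  # prints: [(0, 0)]
```

In the toy example the first pair shares 12 of 16 pixels (75%) and is matched, whereas the remaining ROIs do not overlap and stay unmatched.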

### Neuronal region of interest signal extraction

For each ROI, we calculated a GCaMP6m and mRuby2 fluorescence signal by taking the mean of all pixels within the ROI, for each channel separately. In addition, we calculated a local neuropil signal, a measure of local fluorescence intensity, over a circular region surrounding the ROI (2–33-μm ring). Using these signals, we first compensated for non-cell-specific fluorescence bleeding into the ROI signals by subtracting the neuropil signal time series, multiplied by 0.7, from the raw fluorescence time series, a method known as neuropil correction98,103,104. The median of the neuropil time series (multiplied by 0.7) was added, to offset the lower baseline fluorescence signals resulting from neuropil correction. Next, we compensated for small fluctuations in fluorescence that followed changes in the axial position of cells (for example, due to slow drift or motion artifacts) by calculating the ratio (R) between the green and red channel, as both channels should be affected equally by such out-of-plane motion105.
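The neuropil correction and the green/red ratio can be written compactly. In the sketch below the traces are plain lists of per-frame values and the function names are hypothetical. Whether the red channel was neuropil-corrected in the same way is not specified here, so the example corrects only the green trace.

```python
from statistics import median

NEUROPIL_FACTOR = 0.7


def neuropil_correct(roi_trace, neuropil_trace, factor=NEUROPIL_FACTOR):
    """Subtract the scaled neuropil trace and add back its scaled median."""
    offset = factor * median(neuropil_trace)
    return [f - factor * n + offset for f, n in zip(roi_trace, neuropil_trace)]


def green_red_ratio(green_trace, red_trace):
    """Frame-by-frame ratio R between the green and the red channel."""
    return [g / r for g, r in zip(green_trace, red_trace)]


green_roi = [120.0, 130.0, 200.0, 125.0]
green_neuropil = [50.0, 60.0, 70.0, 50.0]
red_roi = [100.0, 100.0, 100.0, 100.0]

corrected_green = neuropil_correct(green_roi, green_neuropil)
ratio = green_red_ratio(corrected_green, red_roi)
print([round(r, 3) for r in ratio])
```

Adding back the scaled median keeps the corrected trace at roughly the original baseline level, which avoids near-zero denominators in later normalization steps.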

For each frame, an R0 value was calculated from the lowest 25% values in a 60-s window around that frame. The ΔR/R value was calculated by subtracting the R0 value from the fluorescence value (R) of a frame and dividing the remainder over the R0 value (adapted from ref. 106). To further remove artifacts, the resulting GCaMP6m ΔR/R fluorescence time series was processed using the constrained FOOPSI algorithm107,108, which fits the calcium ΔR/R time series with a biologically plausible model and provides an inferred spike time series for each neuron with high temporal resolution that was used in all following analyses. Visualized traces of inferred spike activity were smoothed with a five-frame flat kernel.
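The sliding-baseline ΔR/R computation can be sketched as follows. Two points are assumptions rather than statements from the text: R0 is taken as the mean of the lowest 25% of values in the window, and the window is truncated at the start and end of the recording. The function name is hypothetical.

```python
import numpy as np

def delta_r_over_r(ratio, frame_rate, window_s=60.0, lowest_fraction=0.25):
    """Frame-wise (R - R0) / R0 with R0 from a window centered on each frame."""
    ratio = np.asarray(ratio, dtype=float)
    half = int(round(window_s * frame_rate / 2))
    out = np.empty_like(ratio)
    for i in range(ratio.size):
        # Window around frame i, clipped at the recording edges (assumption).
        window = ratio[max(0, i - half): i + half + 1]
        n_low = max(1, int(np.floor(lowest_fraction * window.size)))
        # R0: mean of the lowest 25% of values in the window (assumption).
        r0 = np.sort(window)[:n_low].mean()
        out[i] = (ratio[i] - r0) / r0
    return out

# Toy trace: flat ratio of 2.0 with a single transient to 3.0.
trace = np.full(200, 2.0)
trace[100] = 3.0
drr = delta_r_over_r(trace, frame_rate=1.0)
```

In this toy trace the transient frame yields ΔR/R = (3 − 2) / 2 = 0.5, while all other frames yield 0.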

### Analysis of behavioral data

Behavioral performance was reported as the fraction of correct trials. In the touch screen task, this was quantified as the number of trials in which the mouse touched the correct (rewarded) stimulus, divided by the total number of trials in which the mouse made a touch response. In the ‘lick-left’/‘lick-right’ task (head-fixed), this was quantified as the number of trials in which the animal licked on the correct lick spout, divided by the total number of trials in which the mouse made a lick response. Steepness of categorization, a function of the distance of stimuli to the category boundary, was determined from the steepness parameter of a fitted sigmoid curve.

While information-integration categories were trained with a systematic boundary requiring the linear integration of the two stimulus features, orientation and spatial frequency, not all mice bisected the stimulus space using the trained boundary angle. The boundary angle, as behaviorally expressed by the animal, was calculated by fitting a 2D plane through a three-dimensional space having orientation and spatial frequency on the x and y axes, respectively, and performance on the z axis. The behaviorally expressed boundary was defined as the intersection of the fitted plane with the plane z = 0.5. For category spaces with a reduced number of stimuli (as used in the chronic imaging experiment), the behaviorally expressed boundary vector was calculated using a support vector machine.
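As an illustration of the plane-fitting step, the sketch below fits z = a·x + b·y + c by ordinary least squares and derives the angle of the line where the fitted plane crosses z = 0.5. It assumes that orientation and spatial frequency have been expressed in comparable (for example, normalized) units, which the text does not specify. The function name is hypothetical.

```python
import numpy as np

def fit_boundary(ori, sf, perf):
    """Least-squares plane through (ori, sf, perf) and its z = 0.5 boundary angle."""
    ori = np.asarray(ori, dtype=float)
    sf = np.asarray(sf, dtype=float)
    design = np.column_stack([ori, sf, np.ones_like(ori)])
    (a, b, c), *_ = np.linalg.lstsq(design, np.asarray(perf, dtype=float), rcond=None)
    # The boundary a*x + b*y + c = 0.5 runs along the direction (b, -a),
    # perpendicular to the plane gradient (a, b). Angle is reported in [0, 180).
    angle = np.degrees(np.arctan2(-a, b)) % 180.0
    return a, b, c, angle

# Toy category space: performance depends on (orientation - spatial frequency),
# so the expressed boundary is the diagonal at 45 degrees.
ori_grid, sf_grid = np.meshgrid(np.linspace(-2, 2, 5), np.linspace(-2, 2, 5))
perf_grid = 0.5 + 0.1 * (ori_grid - sf_grid)
a, b, c, angle = fit_boundary(ori_grid.ravel(), sf_grid.ravel(), perf_grid.ravel())
```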

During behavioral experiments in which we shifted the stimulus position, we tracked the positions of both eyes using infrared cameras (The Imaging Source). We manually annotated the outlines of the eyes and pupils in a set of sample images using DeepLabCut109,110 and used the software to further annotate the movies (see Extended Data Fig. 2g for examples). The pupil diameter was calculated as the average distance between each of four sets of opposing markers on the pupil outline. Horizontal eye position was calculated as the distance from the center of the pupil (the mean of the x and y coordinates of the eight markers on the pupil outline) to the marker on the left side of the outline of the eye. Both pupil diameter and horizontal eye position were normalized to the width of the eye, defined as the distance between the left and right marker on the outline of the eye. Similarly, during two control experiments, we tracked features of the mouth of the mouse (see example annotated video frames in Extended Data Figs. 10f,i). We quantified the variable ‘relative mouth opening’ as the distance between the central marker on the upper-left jaw and the anterior marker on the lower jaw, normalized to the distance between the central markers on the upper-left and upper-right jaws (Extended Data Figs. 10f,i).

### Image analysis

Discrete retinotopic stimulation was analyzed for each retinotopically specific patch individually. The intrinsic signal response per pixel was quantified as percentage decrease during stimulus presentation (mean signal from 1 s to 6 s after stimulus onset) relative to baseline (mean signal from −6 s to −1 s before stimulus onset). The 2D maps of the IOS response per trial were averaged and smoothed to result in a single average intrinsic signal response map for each retinotopic stimulus position. These average maps were normalized to values of between 0 and 1, to compensate for lower signal strength of patches in the eccentricity of the visual field. From the individual average maps, a single image was constructed by assigning every pixel a color, based on the patch that elicited maximum activity (each retinotopic stimulus position was associated with a unique color).

Periodic visual stimulation was analyzed as described in refs. 37,94. In brief, the time series of each pixel in the continuous acquisition was low-pass filtered at four times the slowest stimulus-repetition frequency. The phase and power of the intrinsic signal at the stimulus-repetition frequency were determined for each pixel using a Fourier transform. Retinotopic maps detailing visual-response amplitude and preferred azimuth and elevation were subsequently produced by recalculating the phase to a position in monitor space and scaling the image by the signal power. Finally, equi-elevation and equi-azimuth lines were overlaid on a wide-field image of the cortical blood-vessel pattern.
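The per-pixel Fourier step can be sketched for a single pixel time series as below. The low-pass filtering and the conversion from phase to monitor position are omitted, and the function name and toy parameters are hypothetical.

```python
import numpy as np

def phase_and_power(pixel_ts, sample_rate, stim_freq):
    """Phase (radians) and power of one pixel's signal at the stimulus frequency."""
    pixel_ts = np.asarray(pixel_ts, dtype=float)
    spectrum = np.fft.rfft(pixel_ts - pixel_ts.mean())
    freqs = np.fft.rfftfreq(pixel_ts.size, d=1.0 / sample_rate)
    k = np.argmin(np.abs(freqs - stim_freq))  # nearest frequency bin
    return np.angle(spectrum[k]), np.abs(spectrum[k]) ** 2

# Toy pixel: 100 s sampled at 10 Hz, responding at 0.1 Hz with a phase of -pi/3.
t = np.arange(1000) / 10.0
phase, power = phase_and_power(np.cos(2 * np.pi * 0.1 * t - np.pi / 3), 10.0, 0.1)
```

Applied to every pixel, the phase would give the preferred stimulus position and the power the response strength used to scale the map.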

HLS maps were calculated on a pixel-by-pixel basis from calcium imaging time series. First, a baseline fluorescence map was calculated by averaging all images acquired in the intertrial intervals preceding a visual stimulus presentation. Similarly, a stimulus fluorescence map was calculated for each stimulus individually by averaging all images acquired in the period from visual stimulus onset to 0.5 s after stimulus offset. A ΔF/F response map was subsequently calculated by subtracting the baseline from each stimulus map and dividing the remainder over the baseline map. For each pixel, the color (hue) was selected based on the stimulus that gave the largest ΔF/F response. The brightness (lightness) of each pixel was determined by the ΔF/F response amplitude to the best stimulus. Color intensity (saturation) was determined by calculating the resultant length of the stimulus-averaged ΔF/F responses, sorted from largest to smallest, mapped onto a circular space. This resulted in a value of 1.0 when only a single stimulus elicited a response, displaying full color saturation of the pixel. The resultant length was 0.0 when all stimuli drove equal ΔF/F response amplitudes, resulting in a white pixel. Multiple HLS maps detailing retinotopic position preference (for example, center versus surround of the visual field) were stitched together using coordinates from the microscope’s motor position controller, to produce a wide FOV HLS map with cellular resolution (Fig. 3b and Extended Data Fig. 3).
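One possible reading of the hue, lightness and saturation computation is sketched below. The mapping "onto a circular space" is interpreted here as placing the sorted responses at equally spaced angles on the unit circle and normalizing the resultant vector by the summed response. This interpretation, the clipping of negative responses to zero and all names are assumptions. The sketch reproduces the two limiting cases stated in the text (1.0 for a single driving stimulus, 0.0 for equal responses).

```python
import numpy as np

def hls_components(baseline, stim_maps):
    """baseline: (H, W) map; stim_maps: (S, H, W) maps, one per stimulus."""
    baseline = np.asarray(baseline, dtype=float)
    stim_maps = np.asarray(stim_maps, dtype=float)
    dff = (stim_maps - baseline) / baseline       # dF/F per stimulus and pixel
    hue_index = dff.argmax(axis=0)                # best stimulus selects the hue
    lightness = dff.max(axis=0)                   # response to the best stimulus
    # Sort responses from largest to smallest; negatives are clipped (assumption).
    resp = np.sort(np.clip(dff, 0.0, None), axis=0)[::-1]
    n_stim = resp.shape[0]
    angles = 2 * np.pi * np.arange(n_stim) / n_stim
    resultant = np.abs((resp * np.exp(1j * angles)[:, None, None]).sum(axis=0))
    total = resp.sum(axis=0)
    saturation = np.divide(resultant, total, out=np.zeros_like(total), where=total > 0)
    return hue_index, lightness, saturation

# Two toy pixels: the first responds to one stimulus only, the second to all equally.
base = np.ones((1, 2))
maps = np.array([[[2.0, 2.0]], [[1.0, 2.0]], [[1.0, 2.0]], [[1.0, 2.0]]])
hue_index, lightness, saturation = hls_components(base, maps)
```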

### Fraction of responsive neurons

For each imaging session, we quantified the fraction of responsive neurons using inferred spiking activity in the first second of visual stimulus presentation of trials featuring stimuli that were part of the reduced category space (the 1-s period was chosen because it contained relatively few running and licking events, and no rewards occurred). If the recording was an out-of-task imaging time point, in which each stimulus was repeated eight times, we performed a Mann–Whitney U test comparing the 1-s period just before stimulus onset to the 1-s period directly after stimulus onset. The following responsiveness criteria were applied for each stimulus: (1) the non-parametric test indicated a significant difference (P < 0.05) and (2) the peak inferred spike rate difference was at least 0.01. A neuron was classified as responsive when these criteria were met for at least one stimulus of the reduced category space (containing ten stimuli).

In-task time points were analyzed slightly differently, because the chance of detecting responsive neurons scaled with the variable number of trials that the animals performed. We used subsampling to allow a direct comparison of the fraction of responsive neurons with the out-of-task time points. For each stimulus, we randomly sampled eight trials from the total number of trials, performed the same testing criteria as described above, and repeated the procedure 100 times, resulting in 100 estimates of the stimuli that a neuron was responsive to. From these data, we calculated the probability of the neuron being significantly responsive to at least one stimulus by dividing the number of subsamples with at least one significant stimulus over the total of 100 repeats.

Thus, each method resulted in a single vector listing the probability, per neuron, that it significantly responded to at least one visual stimulus. The out-of-task sessions resulted in binary entries reflecting probabilities of zero and one. The in-task sessions resulted in vectors having values on the interval (0, 1), reflecting probabilities that neurons were significantly responsive on a more continuous scale, as derived from subsampling. By averaging this vector, we obtain in both cases the probability of observing that a randomly chosen neuron from that session is significantly responsive, which, importantly, is equivalent to the overall fraction of responsive neurons per session.
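The responsiveness criteria and the subsampling procedure can be sketched as follows. Several details are assumptions, because the text leaves them open: each trial's 1-s window is reduced to its mean before the Mann–Whitney U test, and the peak difference is taken as the maximum of the trial-averaged post-stimulus trace minus the mean pre-stimulus level. Function names are hypothetical.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def is_responsive(pre, post, alpha=0.05, min_peak=0.01):
    """pre, post: (trials, frames) inferred spikes before/after onset, one stimulus."""
    pre_mean, post_mean = pre.mean(axis=1), post.mean(axis=1)
    if np.ptp(np.concatenate([pre_mean, post_mean])) == 0:
        return False  # all values identical, so no difference can be detected
    p = mannwhitneyu(pre_mean, post_mean, alternative="two-sided").pvalue
    peak_diff = post.mean(axis=0).max() - pre.mean()  # assumption, see lead-in
    return bool(p < alpha and peak_diff >= min_peak)

def responsive_probability(pre, post, n_trials=8, n_repeats=100, rng=None):
    """pre, post: lists with one (trials, frames) array per stimulus."""
    rng = np.random.default_rng(rng)
    hits = 0
    for _ in range(n_repeats):
        for a, b in zip(pre, post):
            idx = rng.choice(a.shape[0], size=n_trials, replace=False)
            if is_responsive(a[idx], b[idx]):
                hits += 1   # responsive to at least one stimulus in this subsample
                break
    return hits / n_repeats

# Toy session with two neurons and one stimulus: one strongly driven, one silent.
gen = np.random.default_rng(0)
p_driven = responsive_probability([0.01 * gen.random((20, 10))],
                                  [1.0 + gen.random((20, 10))], rng=1)
p_silent = responsive_probability([np.zeros((20, 10))], [np.zeros((20, 10))], rng=1)
fraction_responsive = np.mean([p_driven, p_silent])
```

Averaging the per-neuron probabilities gives the session's fraction of responsive neurons, here 0.5 for the two toy neurons.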

The time-varying patterns of the fraction of responsive neurons across imaging time points (Fig. 4a and Extended Data Fig. 6a) were normalized to the range from 0 to 1, and grouped into clusters using the k-means clustering algorithm (scikit-learn). For different values of k (2 to 8), we compared the cluster inertia (within-cluster sum of squares) of the actual data to the mean cluster inertia of 100 shuffles (Fig. 4b). The difference between the real and shuffled cluster inertia indicates clustering performance and suggested that the data were best grouped into two clusters (Fig. 4b). Performing the same analysis, but with an artificial third cluster included, accurately detected three clusters. Furthermore, while the k-means algorithm has a random initialization step, multiple runs of the same analysis resulted in the same cluster groupings.
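The comparison of real and shuffled cluster inertia can be sketched with scikit-learn's `KMeans`, whose fitted `inertia_` attribute is the within-cluster sum of squares. How the data were shuffled is not stated. The sketch assumes that values are permuted across recordings independently for each time point, and the function name is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def inertia_gap(patterns, k, n_shuffles=100, seed=0):
    """Mean shuffled inertia minus real inertia for k clusters.

    patterns: (recordings, time points) array, normalized to the range 0 to 1.
    """
    rng = np.random.default_rng(seed)

    def inertia(x):
        return KMeans(n_clusters=k, n_init=10, random_state=0).fit(x).inertia_

    real = inertia(patterns)
    shuffled = []
    for _ in range(n_shuffles):
        # Permute each time point across recordings independently (assumption).
        x = np.column_stack([rng.permutation(col) for col in patterns.T])
        shuffled.append(inertia(x))
    return np.mean(shuffled) - real

# Toy data: two groups of ten recordings with opposite time courses plus noise.
gen = np.random.default_rng(1)
toy = np.vstack([np.tile([1, 1, 1, 0, 0, 0], (10, 1)),
                 np.tile([0, 0, 0, 1, 1, 1], (10, 1))]).astype(float)
toy += 0.05 * gen.standard_normal(toy.shape)
gap_k2 = inertia_gap(toy, k=2, n_shuffles=5)
```

A larger gap indicates that the real data cluster more tightly than expected by chance for that value of k.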

Using linear regression, we aimed to identify four components making up the time-varying pattern of the fraction of responsive neurons (as visually depicted in Fig. 4g). The components were: (1) a stable, non-time-varying fraction (baseline), which was assigned the value 1 at each time point; (2) an exponentially decaying fraction, having the value 1 for the first time point, 0.5 for the second, and so on; (3) a task modulation component having the value 1 for the in-task time points and 0 otherwise; and (4) a learning-associated component having a value of 1 for all the post-category learning time points and 0 otherwise. We calculated the contribution of each time-varying component by applying an NNLS fitting algorithm (SciPy) to the time-varying fractions of responsive neurons of each individual chronic recording. In general, the linear model fitted the time-varying fraction of responsive neurons well (R² = 0.77 ± 0.21 s.d.; n = 39 chronic recordings), and each regressor made a unique contribution to the explained variance (decay, ΔR² = 0.254 ± 0.056 (s.e.m.); task, ΔR² = 0.408 ± 0.052 (s.e.m.); learning, ΔR² = 0.095 ± 0.035 (s.e.m.); we did not quantify the unique contribution of base, as it is the intercept and cannot be shuffled; see below for a detailed explanation of ΔR²).
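A minimal sketch of this four-component fit uses `scipy.optimize.nnls`. The session layout (which time points are in-task or post-learning) and the weights in the example are hypothetical, as are the function names.

```python
import numpy as np
from scipy.optimize import nnls

def build_design(n_timepoints, in_task, post_learning):
    """Columns: baseline, exponential decay, task modulation, learning."""
    return np.column_stack([
        np.ones(n_timepoints),                 # (1) stable baseline
        0.5 ** np.arange(n_timepoints),        # (2) 1, 0.5, 0.25, ...
        np.asarray(in_task, dtype=float),      # (3) 1 for in-task time points
        np.asarray(post_learning, dtype=float) # (4) 1 after category learning
    ])

def fit_components(fraction, design):
    """Non-negative least-squares weights of the four components."""
    weights, _ = nnls(design, np.asarray(fraction, dtype=float))
    return weights

# Hypothetical chronic recording with nine imaging time points.
design = build_design(9, in_task=[0, 0, 1, 1, 0, 1, 1, 0, 1],
                      post_learning=[0, 0, 0, 0, 0, 1, 1, 1, 1])
true_weights = np.array([0.2, 0.3, 0.1, 0.05])
weights = fit_components(design @ true_weights, design)
```

With noise-free toy data, the fit recovers the weights used to generate it.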

### Encoding model

To determine how individual task-related and other measured covariates influenced a neuron’s inferred spiking activity, we used a generalized linear model (GLM; encoding model) to predict the inferred spikes of each neuron per imaging frame15,19,111,112,113. Regressors for discrete events such as stimulus onset were represented by a boxcar function, while regressors for continuous parameters such as running speed were represented by scalar values for each imaging frame. All regressors were smoothed with a Gaussian kernel (σ = 0.5 s) and repeated over a defined range with 0.5-s steps (see Supplementary Table 3 and Fig. 5a,b for all individual regressors and their ranges).
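Construction of one discrete regressor set can be sketched as below: a boxcar at each event, Gaussian smoothing with σ = 0.5 s, and copies shifted in 0.5-s steps over a time range. The boxcar width (one 0.5-s step) and the order of smoothing and shifting are assumptions, and the function name and parameters are hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def event_regressor_set(event_frames, n_frames, frame_rate,
                        t_start, t_stop, step_s=0.5, sigma_s=0.5):
    """Design-matrix columns for one event type, one column per time shift."""
    step = int(round(step_s * frame_rate))
    base = np.zeros(n_frames)
    for f in event_frames:
        base[f:f + step] = 1.0  # boxcar, one step wide (assumption)
    base = gaussian_filter1d(base, sigma=sigma_s * frame_rate, mode="constant")
    shifts = np.arange(int(round(t_start * frame_rate)),
                       int(round(t_stop * frame_rate)) + 1, step)
    columns = []
    for s in shifts:
        col = np.zeros(n_frames)
        if s >= 0:
            col[s:] = base[:n_frames - s]   # regressor delayed by s frames
        else:
            col[:s] = base[-s:]             # regressor advanced by |s| frames
        columns.append(col)
    return np.column_stack(columns)

# One event at frame 100, 10 Hz imaging, kernels from -1 s to +2 s around the event.
X_event = event_regressor_set([100], n_frames=200, frame_rate=10.0,
                              t_start=-1.0, t_stop=2.0)
```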

Stimulus-onset aligned regressors encoded the stimulus parameters: orientation, spatial frequency and trained category. In addition, a regressor (task) encoded whether or not the mouse made a response in that trial, that is, fitting activity related to task engagement. One regressor set aligned to running onset (run), the first imaging frame in a trial in which running speed exceeded 1 cm per second. We implemented two choice-related regressor sets: one aligning with the first sequence of three licks in a row on the side where the mouse would also choose to lick in the response window (choice left/right 1), the other aligning with the first lick during the response window (choice left/right 2), which was the decisive lick in the behavioral paradigm. One regressor set (reward) aligned to reward occurrence, and one (T.O.) to the moment that the time-out was given. Two continuous regressor sets were constructed from the per-imaging-frame lick rate of the mouse, that is, lick rate (left) and lick rate (right), and one continuous regressor set reflected the running speed of the mouse. Finally, we added a constant offset to the model.

All regressor sets were combined in a single design matrix (Fig. 5a). The response variable, inferred spike activity, was smoothed with a Gaussian kernel (σ = 0.5 s). The data were subsequently divided into individual trials, only including data that were in range of at least one trial-aligned discrete regressor set. Model parameters were fit on a subset of trials (70%) using NNLS fitting (SciPy) and L1 regularization (where L1 = 0.1 × size of response variable). This specific L1 value was determined by comparing trained model fitted R² values with cross-validated R² values using a wide range of possible L1 values (Extended Data Fig. 7a). Model performance was expressed as R² (equations (1)–(3), where \(N\) equals the number of imaging frames in the to-be-predicted response variable \(y\), and \(p_i\) is the frame-by-frame model prediction). Cross-validated model performance (R²) was calculated on the remaining 30% of trials.

$$\mathrm{SS}_{\mathrm{residual}} = \sum\limits_{i = 1}^{N} \left( p_i - y_i \right)^2$$

(1)

$$\mathrm{SS}_{\mathrm{total}} = \sum\limits_{i = 1}^{N} \left( y_i - \bar{y} \right)^2$$

(2)

$$R^2 = 1 - \frac{\mathrm{SS}_{\mathrm{residual}}}{\mathrm{SS}_{\mathrm{total}}}$$

(3)
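Equations (1)–(3) and the 70%/30% trial split can be sketched as follows. The sketch uses plain `scipy.optimize.nnls` and deliberately omits the L1 regularization described in the text, so it is a simplified illustration rather than the fitting procedure used in the study. Function names and toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls

def r_squared(y, p):
    """R^2 = 1 - SS_residual / SS_total for observations y and predictions p."""
    ss_residual = np.sum((p - y) ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_residual / ss_total

def cross_validated_r2(X, y, trial_ids, train_fraction=0.7, seed=0):
    """Fit on a random subset of trials, evaluate R^2 on the held-out trials."""
    rng = np.random.default_rng(seed)
    trials = rng.permutation(np.unique(trial_ids))
    n_train = int(round(train_fraction * trials.size))
    train = np.isin(trial_ids, trials[:n_train])
    weights, _ = nnls(X[train], y[train])  # no L1 penalty in this sketch
    return r_squared(y[~train], X[~train] @ weights), weights

# Toy data: 30 trials of 10 frames, response fully explained by the regressors.
gen = np.random.default_rng(0)
X_toy = gen.random((300, 3))
y_toy = X_toy @ np.array([0.5, 0.0, 1.0])
cv_r2, w_fit = cross_validated_r2(X_toy, y_toy, np.repeat(np.arange(30), 10))
```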

Regressor sets were assigned into seven subgroups (Supplementary Table 3) and subgroup unique contribution to the explained variance was calculated by subtracting the R² value of the model with regressors belonging to that subgroup shuffled, from the R² of the full model, thus resulting in a ΔR² value, similar to what is described in ref. 113. The ΔR² would assume a value of 0 if all the variance that the subgroup explains can also be explained by any combination of regressors from other subgroups. A positive value for ΔR² reflects the degree of explained variance that can only be explained by this specific subgroup. For the subgroups that were most central to our analysis (that is, stimulus, orientation/spatial frequency, category, choice and reward), we calculated the maximum variance inflation factor114 across all kernels of each model regressor, for every in-task chronic recording. We found that the value of the variance inflation factor never exceeded 5, and typically ranged between 1 and 4.
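The unique-contribution measure can be illustrated with the sketch below, which shuffles the rows of one subgroup's columns, refits the model and subtracts the resulting R² from that of the full model. Refitting after shuffling, shuffling across frames and the absence of cross-validation and L1 regularization are simplifying assumptions. All names are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls

def r_squared(y, p):
    return 1.0 - np.sum((p - y) ** 2) / np.sum((y - y.mean()) ** 2)

def delta_r2(X, y, columns, seed=0):
    """R^2 of the full model minus R^2 with the given columns shuffled."""
    rng = np.random.default_rng(seed)

    def fit_r2(design):
        weights, _ = nnls(design, y)
        return r_squared(y, design @ weights)

    shuffled = X.copy()
    # Shuffle the subgroup's regressors across frames, keeping other columns intact.
    shuffled[:, columns] = X[rng.permutation(X.shape[0])][:, columns]
    return fit_r2(X) - fit_r2(shuffled)

# Toy model: offset plus two regressors, of which only the first drives the response.
gen = np.random.default_rng(0)
X_toy = np.column_stack([np.ones(500), gen.random(500), gen.random(500)])
y_toy = 0.5 + 2.0 * X_toy[:, 1]
unique_driving = delta_r2(X_toy, y_toy, [1])
unique_irrelevant = delta_r2(X_toy, y_toy, [2])
```

In the toy example the driving regressor has a ΔR² close to 1, whereas the irrelevant regressor has a ΔR² of approximately 0.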

R² values were compared to values of the same model fitted on trial-shuffled data to establish whether they were significantly above chance, using the following procedure. Chance-level model performance was determined by fitting the model to a response variable in which trial identity was shuffled (thus keeping confounders like the nonspecific within-trial temporal structure and offsets the same in the shuffled model). Both the non-shuffled and shuffled model fits were repeated 100 times. We used a non-parametric bootstrap procedure115 to estimate the mean and 95% confidence interval of the cross-validated R² values of both the shuffled and non-shuffled models. An R² value was considered significant if (1) the mean of the shuffled R² value was below the lower 95% confidence interval of the non-shuffled R² value, and (2) the mean of the non-shuffled R² value was above the upper 95% confidence interval of the shuffled R² value.
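The two-sided significance criterion can be sketched with a percentile bootstrap of the mean. The choice of a percentile bootstrap and the number of resamples are assumptions, as the text only specifies a non-parametric bootstrap. Function names and toy values are hypothetical.

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=1000, ci=95.0, seed=0):
    """Mean and percentile-bootstrap confidence interval of the mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    means = rng.choice(values, size=(n_boot, values.size), replace=True).mean(axis=1)
    tail = (100.0 - ci) / 2.0
    lower, upper = np.percentile(means, [tail, 100.0 - tail])
    return values.mean(), lower, upper

def r2_is_significant(real_r2, shuffled_r2, **kwargs):
    """Apply both criteria: shuffled mean below real CI, real mean above shuffled CI."""
    real_mean, real_lower, _ = bootstrap_mean_ci(real_r2, **kwargs)
    shuf_mean, _, shuf_upper = bootstrap_mean_ci(shuffled_r2, **kwargs)
    return bool(shuf_mean < real_lower and real_mean > shuf_upper)

# Toy R^2 values from 100 repeated fits of the real and the trial-shuffled model.
gen = np.random.default_rng(0)
real = 0.3 + 0.02 * gen.standard_normal(100)
shuffled = 0.02 * gen.standard_normal(100)
significant = r2_is_significant(real, shuffled)
```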

A semantic CTI was calculated from weights of the left-category and right-category regressor sets (equation (4); where \(\bar{w}_L\) is the mean of all left-category regressor weights across the cross-validation trials, and \(\bar{w}_R\) is the same for right-category weights). Feature CTI was calculated from the weights of the orientation and spatial frequency-specific regressors selectively (equation (5)).

$$\mathrm{Semantic\;CTI} = \frac{\bar{w}_L - \bar{w}_R}{\bar{w}_L + \bar{w}_R}$$

(4)

$$\mathrm{Feature\;CTI} = \frac{\sum\nolimits_{l}^{L} \left( \bar{w}_{ori_l} + \bar{w}_{sf_l} \right) - \sum\nolimits_{r}^{R} \left( \bar{w}_{ori_r} + \bar{w}_{sf_r} \right)}{\sum\nolimits_{l}^{L} \left( \bar{w}_{ori_l} + \bar{w}_{sf_l} \right) + \sum\nolimits_{r}^{R} \left( \bar{w}_{ori_r} + \bar{w}_{sf_r} \right)}$$

(5)

Here, \(L\) is the number of left-category stimuli and \(R\) is the number of right-category stimuli. \(\bar{w}_{ori_l}\) is the mean of all orientation regressor weights across cross-validations for left-category stimulus \(l\) and \(\bar{w}_{sf_l}\) is the mean of all spatial frequency regressor weights across cross-validations for left-category stimulus \(l\). \(\bar{w}_{ori_r}\) and \(\bar{w}_{sf_r}\) are the same, but for the right category (Fig. 6a).
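Equations (4) and (5) translate directly into code. In the sketch below the inputs are assumed to be weights that have already been averaged across cross-validations, with one entry per stimulus for the feature CTI. Function names and example weights are hypothetical.

```python
import numpy as np

def semantic_cti(w_left, w_right):
    """Equation (4): contrast of mean left- and right-category weights."""
    w_l, w_r = np.mean(w_left), np.mean(w_right)
    return (w_l - w_r) / (w_l + w_r)

def feature_cti(w_ori_left, w_sf_left, w_ori_right, w_sf_right):
    """Equation (5): contrast of summed orientation and spatial frequency weights."""
    left = np.sum(np.asarray(w_ori_left) + np.asarray(w_sf_left))
    right = np.sum(np.asarray(w_ori_right) + np.asarray(w_sf_right))
    return (left - right) / (left + right)

# Hypothetical weights for a neuron that prefers the left category.
sem = semantic_cti([0.3, 0.5], [0.1, 0.1])
feat = feature_cti([0.2, 0.2], [0.1, 0.1], [0.1, 0.1], [0.1, 0.1])
```

Because the weights are non-negative (NNLS), both indices lie between −1 and 1. They are undefined when all weights involved are zero.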

### Statistics

Statistical analyses were performed using Python (3.7.10), NumPy (1.16.4) and SciPy (1.5.2). No statistical methods were used to predetermine sample sizes, but our sample sizes are similar to those reported in previous publications47,69,113. No data were excluded from the experiment involving touch screen operant chambers. We excluded five animals from the experiment involving head-fixed conditioning because they did not reach criterion on the stimulus discrimination task, three animals because their performance dropped to chance level during category learning, and one animal because it refused to lick on the left lick spout. We excluded three animals from the chronic imaging experiment because their cranial windows did not allow imaging at the time point of category learning. Data collection and analysis were not performed blind to the conditions of the experiments. All data are presented as mean (±s.e.m.) unless otherwise noted. Frequency observations were compared using a chi-squared test. Tests for normality of distributions were not conducted, as the number of observations was often below ten, and testing for normality would be underpowered. Thus, behavioral and imaging data were compared using non-parametric tests: a WMPSR test for paired samples, a Mann–Whitney U test for independent samples, and a Kruskal–Wallis test, followed by post hoc WMPSR tests or Mann–Whitney U tests, when more than two groups were compared. Significance of R² values of individual neurons was determined using non-parametric bootstrap procedures as described above.

### Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.
