VSS 2021 Abstracts

Hu, Y., Ongchoco, J. D. K., & Scholl, B. J. (2020). From causal perception to event segmentation: Using spatial memory to reveal how many visual events are involved in causal launching. Poster presented at the annual meeting of the Vision Sciences Society, 5/20/20, Online.  
The currency of visual experience is frequently not static scenes, but dynamic events. And perhaps the most central topic in the study of event perception is *event segmentation* -- how the visual system carves a continuous stream of input into discrete temporal units. A different tradition has tended to focus on particular types of events, the most famous example of which may be *causal launching*: a disc (A) moves until it reaches another stationary disc (B), at which point A stops and B starts moving in the same direction. Since these two well-studied topics (event segmentation and causal perception) have never been integrated, we asked a simple question: how many events are there in causal launching? Just one (the impact)? Or two (A’s motion and B’s motion)? We explored this using spatial memory, predicting that memory for intermediate moments within a single event representation should be worse than memory for moments at event boundaries. Observers watched asynchronous animations in which each of six discs started and stopped moving at different times, and (in different experiments) simply indicated each disc’s initial and final position. The discs came in pairs, and in some cases A launched B. To ensure that the results reflect perceived causality, other trials involved the same component motions but with spatiotemporal gaps between them (which eliminate perceived launching). The critical locations were the two intermediate ones (A’s final position and B’s initial position), and spatial memory was indeed worse for launching displays (perhaps because these locations occurred in the middle of a single ongoing event) compared to displays with spatiotemporal gaps (perhaps because these same locations now occurred at the perceived event boundary between A’s motion and B’s motion). This suggests that causal perception leads the two distinct motions to be represented as a single visual event.
Belledonne, M., Butkus, E., Scholl, B. J., & Yildirim, I. (2021). Attentional dynamics during multiple object tracking are explained at subsecond resolution by a new 'hypothesis-driven adaptive computation' framework. Poster presented at the annual meeting of the Vision Sciences Society, 5/23/21, Online.  
A tremendous amount of work on visual attention has helped to characterize *what* we attend to, but has focused less on precisely *how* and *why* attention is allocated to dynamic scenes across time. Nowhere is this contrast more apparent than in multiple object tracking (MOT). Hundreds of papers have explored MOT as a paradigmatic example of selective attention, in part because it so well captures attention as a dynamic process. It is especially ironic, then, that nearly all of this work reduces each MOT trial to a single value (i.e. the number of targets successfully tracked) -- when in reality, each MOT trial presents an experiment unto itself, with constantly shifting attention over time. Here we seek to capture this dynamic ebb and flow of attention at a subsecond resolution, both empirically and computationally. Empirically, observers completed MOT trials during which they also had to detect sporadic momentary probes, as a measure of the moment-by-moment degree of attention being allocated to each object. Computationally, we characterize (for the first time, to our knowledge) an algorithmic architecture of just how and why such dynamic attentional shifts occur. To do so, we introduce a new 'hypothesis-driven adaptive computation' model. Whereas previous models employed many MOT-specific assumptions, this new approach generalizes to any task-driven context. It provides a unified account of attention as the dynamic allocation of computing resources, based on task-driven hypotheses about the properties (e.g. location, target status) of each object. Here, this framework was able to explain the observed probe detection performance measured at a subsecond resolution, independent of general spatial factors (such as the proximity of each probe to the MOT targets' centroid). This case study provides a new way to think about attention and how it interfaces with perception in terms of rational resource allocation.
Berke, M., Walter-Terrill, R., Jara-Ettinger, J., & Scholl, B. J. (2021). Flexible goals require that inflexible perceptual systems produce veridical representations: Implications for realism as revealed by evolutionary simulations. Talk given at the annual meeting of the Vision Sciences Society, 5/26/21, Online.  
How veridical is perception? Rather than representing objects as they actually exist in the world, might perception instead represent objects only in terms of the utility they offer to an observer? Previous work employed evolutionary simulations to show that under certain assumptions, natural selection favors "strict interface" perceptual systems that represent objects exclusively in terms of subjective utility. These simulations showed that interface perceptual systems regularly drive "veridical" systems (those that represent objects in terms of their ground-truth, observer-independent properties) to extinction. This view has fueled considerable debate, but we think that discussions so far have failed to consider the implications of two critical aspects of perception. First, while previous simulations have explored single utility functions, perception must always serve multiple largely-independent goals. (Sometimes when looking at an apple you want to know how appropriate it is for eating, and other times you want to know how appropriate it is for throwing.) Second, perception often operates in an inflexible, automatic manner -- proving 'impenetrable' to shifting higher-level goals. (When your goal shifts from 'eating' to 'throwing', your visual experience does not dramatically transform.) These two points have important implications for the veridicality of perception. In particular, as the need for flexible goals increases, inflexible perceptual systems must become more veridical to meet that need. We support this position with evolutionary simulations showing that as the number of independent utility functions increases, the distinction between 'interface' and 'veridical' perceptual systems dissolves. Under one utility function (or one inflexible goal), our simulations replicate previous findings that favor interface systems, but under multiple independent utility functions, we find that veridical systems are best able to accommodate multiple goals. 
Although natural selection evaluates perceptual systems only in terms of fitness, the most fit perceptual systems may nevertheless represent the world as it is.
Bi, W., Shah, A., Wong, K., Scholl, B. J., & Yildirim, I. (2021). Perception of soft materials relies on physics-based object representations: Behavioral and computational evidence. Poster presented at the annual meeting of the Vision Sciences Society, 5/23/21, Online.  
When encountering objects, we readily perceive not only low-level properties (e.g., color and orientation), but also seemingly higher-level ones -- some of which seem to involve aspects of physics (e.g., mass). Perhaps nowhere is this contrast more salient than in the perception of soft materials such as cloths: the dynamics of these objects (including how their three-dimensional forms vary) are determined by their physical properties such as stiffness, elasticity, and mass. Here we argue that the perception of cloths and their physical properties must involve not only image statistics, but also abstract object representations that incorporate "intuitive physics". We do so by exploring the ability to *generalize* across very different image statistics in both visual matching and computational modeling. Behaviorally, observers had to visually match the stiffness of animated cloths reacting to external forces and undergoing natural transformations (e.g. flapping in the wind, or falling onto the floor). Matching performance was robust despite massive variability in the lower-level image statistics (including those due to location and orientation perturbations) and the higher-level variability in both extrinsic scene forces (e.g., wind vs. rigid-body collision) and intrinsic cloth properties (e.g., mass). We then confirmed that this type of generalization can be explained by a computational model in which, given an input animation, cloth perception amounts to inverting a probabilistic physics-based simulation process. Only this model -- and neither the alternatives relying exclusively on simpler representations (e.g., dynamic image features such as velocity coherence) nor alternatives based on deep learning approaches -- was able to explain observed behavioral patterns. 
These behavioral and computational results suggest that the perception of soft materials is governed by a form of "intuitive physics" -- an abstract, physics-based representation of approximate cloth mechanics that explains observed shape variations in terms of how unobservable properties determine cloth reaction to external forces.
Colombatto, C., Chen, Y.-C., & Scholl, B. J. (2021). Gazing to look vs. gazing to think: Gaze cueing is modulated by the perception of others' external vs. internal attention. Poster presented at the annual meeting of the Vision Sciences Society, 5/23/21, Online.  
What we see depends on where we look, and where we look is often influenced by where others are looking. In particular, when we see another person turn to look in a new direction, we automatically follow their gaze and attend in the same direction -- a phenomenon known as gaze cueing. This reflexive reorienting is adaptive, since people usually shift their gaze to *look* toward the objects or locations they are attending to. But not always: Sometimes people shift their gaze to *think*, as when they look up and away while retrieving information from memory or solving a difficult problem. Such gazes are not directed at any particular external location, but rather signal disengagement from the external world to aid internal focus. Is gaze cueing sophisticated enough to be sensitive to others' (external vs. internal) focus of attention? To find out, we had observers view videos of an actress who is initially looking forward. She is then asked a question, and before responding she looks upward and to the side. The questions themselves concerned either an external stimulus ("Who painted that piece of art on the wall over there?") or an internal memory ("Who painted that piece of art we saw in the museum?"). Despite using identical videos (differing only in their audio tracks), gazes preceded by the 'external' (vs. 'internal') questions elicited far stronger gaze cueing, as measured by the ability to identify a briefly flashed symbol in the direction of the gaze. This effect replicated in multiple samples, and with multiple pairs of 'external' vs. 'internal' questions. This shows how gaze cueing is surprisingly 'smart', and is not simply a brute reflex triggered by others' eye and head movements. And perhaps more importantly, it demonstrates how perception constructs a rich and flexible model of others' attentional states.
Lopez-Brau, M., Colombatto, C., Jara-Ettinger, J., & Scholl, B. J. (2021). Attentional prioritization for historical traces of agency. Talk given at the annual meeting of the Vision Sciences Society, 5/22/21, Online.  
Among the most important stimuli we can perceive are other agents. Accordingly, a great deal of work has shown how visual attention is prioritized not just for certain lower-level properties (e.g. brightness or motion) but also for *social* stimuli (e.g. our impressive efficiency at detecting the presence of people in natural scenes). In nearly all such work, the relevant agents are explicitly visible -- e.g. in the form of bodies, faces, or eyes. But we can also readily perceive the *historical traces* that agents may leave behind. When walking along a hiking trail, for example, a stack of rocks along the side of the path may elicit the immediate strong impression that an agent had been present, since such configurations are exceptionally unlikely to be produced by natural processes. Does visual processing also prioritize such 'traces of agency' (independent from properties such as order and complexity)? We explored this using visual search, in scenes filled with two kinds of block towers. In Agentic Trace towers, the blocks were slightly misaligned (as would only likely occur if they had been intentionally stacked by an agent), while in Non-Agentic towers they were perfectly stacked (in ways an agent would be unlikely to achieve). Across multiple experiments, observers were both faster and more accurate at detecting Agentic Trace towers (in arrays of Non-Agentic towers), compared to detecting Non-Agentic towers (in arrays of Agentic Trace towers). Critically, this difference was stronger than when the same stimuli were presented in ways that equated order and complexity (e.g. with additional vertical spacing), while eliminating perceived traces of agency. This attentional prioritization for "agency without agents" reveals that social perception is not just a response to the superficial appearances of agents themselves, but also to the deeper and subtler traces that they leave in the world.
Ongchoco, J., Walter-Terrill, R., & Scholl, B. J. (2021). Visual event boundaries eliminate anchoring effects: A case study in the power of visual perception to influence decision-making. Poster presented at the annual meeting of the Vision Sciences Society, 5/24/21, Online.  
Visual stimulation is continuous, yet we experience time unfold as a sequence of discrete events. A great deal of work has explored the consequences of such event segmentation on perception and attention, but this work has rarely made contact with higher-level thought. Here we bridge this gap, demonstrating that visual event boundaries can eliminate one of the most notorious (and stubbornly persistent) biases in decision-making. Subjects viewed an immersive 3D animation in which they walked down a long virtual room. During their walk, some subjects passed through a doorway, while for others there was no such event boundary -- equating the paths, speeds, and overall room layouts. At the end of their walk, subjects encountered an item (e.g. a suitcase on the floor) and were asked to estimate its monetary value. The other critical manipulation was especially innocuous, not appearing to be part of the experiment at all. Before the online trial began, subjects reported the two-digit numerical value from a visually distorted 'CAPTCHA' ("to verify that you are human") -- where this task-irrelevant 'anchor' was either low (e.g. 29) or high (e.g. 92). In the no-doorway condition, we observed the well-known anchoring effect: value estimates were higher for subjects who encountered the high CAPTCHA value. Anchoring is especially difficult to resist (even with enhanced motivation, forewarning, and incentives), but remarkably, anchoring was eliminated in the doorway condition. Further experiments replicated this effect in multiple independent samples (and with other objects), showed that it does not depend on explicit memory for the initial anchors, and confirmed that it was due to the event boundary per se (and not to superficial differences such as the visual complexity of the room with and without a dividing wall). This demonstrates how subtle aspects of visual processing can really *matter* for higher-level decision-making.
Uddenberg, S., Kwak, J., & Scholl, B. J. (2021). Reconstructing physical representations of block towers in visual working memory. Poster presented at the annual meeting of the Vision Sciences Society, 5/24/21, Online.  
Recent studies have explored the perception of physical properties (such as mass and stability) in psychology, neuroscience, and AI, and perhaps the most popular stimulus from such studies is the block tower -- since such displays (of stacked rectilinear objects) evoke immediate visual impressions of physical (in)stability. Here we explored a maximally simple question: what properties are represented during natural viewing of such stimuli? Previous work on this question has been limited in two ways. First, such studies typically involve explicit judgments ("Which way will it fall?"), which may prompt encoding strategies that would not otherwise operate automatically. Second, such studies can typically only explore those tower properties that are systematically manipulated as explicit independent variables. Here we attempted to overcome such limitations in an especially direct way: observers viewed a briefly-flashed block tower, and then immediately *reproduced* its structure from memory -- by dragging and dropping an array of blocks (initially presented on the simulated ground plane) using a custom 3D interface. This allowed us to directly measure the success of reproductions in terms of both lower-level image properties (e.g. the blocks' colors/orientations) and higher-level physical properties (e.g. when comparing the stability of the initial towers and their reproductions). Analyses revealed two types of evidence for the visual representation of 'invisible' abstract physical properties. First, the (in)stability of the reproductions (computed, e.g., in terms of the blocks' summed displacements from their original positions, as analyzed in a physics engine with simulated gravity) could not be directly predicted by lower-level image properties (such as the blocks' initial heights or spread). Second, reproductions of unstable towers tended to be more stable, but not vice versa. 
This work demonstrates how physical representations in visual memory can be revealed, all without ever asking anyone anything about physics.
Wang, V., Ongchoco, J., & Scholl, B. J. (2021). Here it comes: Working memory is effectively 'flushed' even just by anticipation of an impending visual event boundary. Poster presented at the annual meeting of the Vision Sciences Society, 5/22/21, Online.  
Though visual input arrives in a continuous stream, our perceptual experiences unfold as a sequence of discrete events. This form of visual event segmentation has important consequences for our mental lives. For example, memory is disrupted not only by elapsed time, but also by crossing an event boundary. Even an activity as simple as walking through a doorway can effectively 'flush' memory (just as one might empty a cache in a computer program), perhaps because this is when the visual statistics of our local environments tend to change most dramatically -- and it may be downright maladaptive to hold on to now-obsolete information. But just when does this 'flushing' occur? At the very moment we cross the boundary? When we encounter new post-boundary information? Here we provide what may be a surprising answer: even just the *anticipation* of an impending event boundary is sufficient to flush memory. Observers viewed an immersive 3D animation in which they walked down a long virtual room. Before their virtual walk, they saw a list of pseudo-words, their recognition memory for which was then tested immediately after the walk ended. Two of the conditions were inspired by past work: during their walk, some observers passed through a doorway, while others traversed the identical path through a room that had no such event boundary. Critically, we also tested a third condition, in which memory was probed just before the observers would have crossed through the doorway -- while carefully equating for elapsed time by manipulating the doorway's location. Relative to the baseline no-doorway condition, we observed reliable memory disruptions in *both* the 'doorway' and 'anticipation' conditions -- and additional control experiments confirmed that this was due to anticipation of the event boundary (and not just surprise). Visual processing thus *proactively* flushes memory by anticipating future events.
Wong, K., Bi, W., Yildirim, I., & Scholl, B. J. (2021). Seeing cloth-covered objects: A case study of intuitive physics in perception, attention, and memory. Poster presented at the annual meeting of the Vision Sciences Society, 5/23/21, Online.  
We typically think of intuitive physics in terms of high-level cognition, but might aspects of physics also be extracted during lower-level visual processing? In short, might we not only *think* about physics, but also *see* it? We explored this in the context of *covered* objects -- as when you see a chair with a blanket draped over it. To successfully recover the underlying structure of such scenes (and determine which image components reflect the object itself), we must account for the physical interactions between cloth, gravity, and object -- which govern not only the way the cloth may wrinkle and fold on itself, but also the way it hangs across the object's edges and corners. We explored this using change detection: Observers saw two images of cloth-covered objects appear quickly one after the other, and simply had to detect whether the two raw images were identical. On "Same Object" trials, the superficial folds and creases of the cloth changed dramatically, but the underlying object was identical (as might happen if you threw a blanket onto a chair repeatedly). On "Different Object" trials, in contrast, both the cloth and the underlying covered object changed. Critically, "Same Object" trials always had *greater* visual change than "Different Object" trials -- in terms of both brute image metrics (e.g. the number of changed pixels) and higher-level features (as quantified by distance in vectorized feature-activation maps from relatively late layers in a convolutional neural network trained for object recognition [VGG16]). Observers were far better at detecting changes on "Different Object" trials, despite the lesser degree of overall visual change. Just as vision "discounts the illuminant" to recover the deeper property of reflectance in lightness perception, visual processing uses intuitive physics to "discount the cloth" in order to recover the deeper underlying structure of objects.