Ambiguity in Pictorial Depth

Perception, 2007, volume 36, pages 1290-1304
doi:10.1068/p5591
Balaraju Battu, Astrid M L Kappers, Jan J Koenderink
Physics of Man, Helmholtz Institute, Utrecht University, Princetonplein 5, PO Box 80 000, NL 3508 TA Utrecht, The Netherlands; e-mail: a.m.l.kappers@phys.uu.nl Received 6 March 2006, in revised form 13 December 2006; published online 29 August 2007
Abstract. Pictorial space is the 3-D impression that one obtains when looking `into' a 2-D picture. One is aware of 3-D `opaque' objects. `Pictorial reliefs' are the surfaces of such pictorial objects in `pictorial space'. Photographs (or any pictures) in no way fully specify physical scenes. Rather, any photograph is compatible with an infinite number of possible scenes that may be called `metameric scenes'. If pictorial relief is one of these metameric scenes, the response may be considered `veridical'. The conventional usage is more restrictive and is indeed inconsistent. Thus the observer has much freedom in arriving at such a `veridical' response. To address this ambiguity, we determined the pictorial reliefs for eight observers, six pictures, and two psychophysical methods. We used `methods of cross-sections' to operationalise pictorial reliefs. We find that linear regression of the depths of relief at corresponding locations in the picture for different observers often leads to very low (even insignificant) R² values. Thus the responses are idiosyncratic to a large degree. Perhaps surprisingly, we also observed that multiple regression of depth and picture coordinates at corresponding locations often leads to very high R² values. Often R² increased from insignificant up to almost 1. Apparently, to a large extent `depth' is irrelevant as a psychophysical variable, in the sense that it does not uniquely account for the relation of the response to the pictorial structure. This clearly runs counter to the bulk of the literature on pictorial `depth perception'. The invariant core of interindividual perception proves to be of an `affine' rather than a Euclidean nature; that is to say, `pictorial space' is not simply the picture plane augmented with a depth dimension.
1 Introduction
With `pictorial space' one conventionally indicates the 3-D spatial impression (2-D plus a `depth' dimension) that one obtains when looking into a 2-D photograph or any (perhaps artistic) rendering on a 2-D canvas. Pictorial reliefs are the surfaces of `pictorial' objects in pictorial space. A photograph of a scene does not specify the actual scene, but, rather, the photograph is compatible with an infinite number of virtual scenes that may be called `metameric scenes' (Belhumeur et al 1997; Koenderink and van Doorn 1997). One, perhaps trivial, interpretation is that of the photograph itself as a physical object. Each of the (typically infinitely many) metameric scenes would have yielded the same photograph. Each of these metameric scenes would `explain' the photograph and must thus be considered veridical. An example that is well known in the literature of visual perception is the so-called `Ames room' (Ames 1952). Remarkably, to the observer the perception appears unambiguous though. Confronted with a picture, an observer can look either at it or into it. Looking at a picture evokes the perception of a flat physical object. When an observer looks into a picture (either monocularly or binocularly) he or she actively constructs pictorial space. This `construction' occurs largely on a precognitive level. One has conventionally described this process in terms of `depth cues'. Looking into a picture, an observer consciously experiences a 3-D pictorial space. Monocentric pictorial space (pictorial space obtained by looking at the picture from a single vantage point) is generally compatible with the `pictorial cues' available to the observer. Here `availability' depends
Author to whom all correspondence should be addressed.
both on the stimulus and on the competence of the observer. The combination of cues used by the observer is not known a priori and can only be inferred from the observer's responses. Although it is convenient to discuss 3-D pictorial structure in terms of 2-D position in the picture plane and `depth', this by no means implies that depth is a primary constituent of perceptions nor the immediate datum implied by the cues. Alternative to depth are the spatial attitudes or shapes of local surface elements. From a formal, mathematical perspective these are equivalent descriptions. Whether depth is a natural parameter in the description of pictorial relief is a matter of empirical research. It is a priori clear that an absolute depth reference is not applicable in the case of pictorial space though, since the observer is evidently not part of pictorial space. So-called `depth cues' rarely specify absolute depth, but either depth relations (for instance, the occlusion cue) or the attitude or shape of local surface elements (eg the shading cue). In the literature (eg Blake et al 1993; Palmer 1999; Hillis et al 2004), it is often assumed that `veridical depth' in a picture is totally determined by the image structure and independent of the observer. Indeed `observer dependence' and `veridical' are generally considered to be mutually exclusive. One assumes each cue to yield a well-defined contribution to `depth', `slant', `tilt', etc, and assumes a deterministic mechanism of `cue combination', so as to arrive at the perception. The assumption of an ideal observer arriving at a unique perception consistent with the cues available from the picture leaves no role for any `observer's freedom'. When the observer is less than `ideal', the interpretation of cues depends on the observer's optical competence, the psychophysical task, and the observer's a priori assumptions of his/her biotope (the generic living environment).
Both in daily life and in psychophysical studies the goal or task often induces or even forces the observer to select certain cues and ignore others. In the `standard model' the non-ideal observer is often assumed to be an ideal observer except for the lack of certain competences, ie the non-ideal observer is assumed to (perhaps unknowingly) ignore part of the available structure and/or combine various fragments of information in a less than optimal way. There is reason to believe that this `standard model' does not universally apply though, and that is the issue addressed in this paper. Two major approaches to pictorial space are `inverse optics' (Marr 1982; Poggio 1984) and `Bayesian inferences' (Helmholtz 1867/1962; Yuille and Bülthoff 1996). Both approaches deal with the fact that the cues heavily underconstrain the `solution'. The former approach achieves this through regularisation constraints, the latter by positing a prior probability distribution. In either case one can introduce an ideal observer whose perception would be uniquely determined by the picture. The fact that all cues are inherently ambiguous is well known, although the ambiguity has been formalised in only a few cases. A particularly well-documented case is that of the shading cue (Belhumeur et al 1997). We have found that, typically, different observers do not arrive at the same unique responses (Koenderink et al 2001). Traditionally, such `variability' in observers' responses is considered a flaw due to extreme stimulus reduction, to differences in psychophysical tasks, or to idiosyncrasies of the observers. This stems from the notion that there exists an ideal-observer model that predicts a unique result. Apparently, this is only partly the case. We find that the variability in observers' responses is indeed heavily constrained but in many cases significant. The nature of the ambiguity can indeed be predicted from ideal-observer models though.
Observers thus have a certain leeway in arriving at the response as long as they fully exploit the available cues. Veridicality and idiosyncrasy of response do not mutually exclude each other.
Indeed, the inherent cue ambiguity forces an idiosyncratic component on the observer's perception. Since observers typically arrive automatically at some unique perception, the idiosyncratic component is apparently generated at the precognitive level. In pictorial space, this variability has been called the `beholder's share' by Gombrich (1975). The freedom of the observer is simply the ambiguity left by the cues, ie given by the syntactical structure of the image. It may be assumed to be influenced by the semantic (depth-cue-independent) content of the image, though. Thus a formal theory of the cues is simultaneously a theory of the beholder's share. For some cues [for instance, the shading cue (eg Belhumeur et al 1997; Koenderink and van Doorn 1997)] such a formal analysis is available, but for most it is lacking. These considerations imply that the conventional notion of an ideal observer is much too restrictive, since cue ambiguity has to be taken into account. Since a complete formal theory of all cue ambiguities is not forthcoming, we are not in a position to outline the beholder's share in a formal manner. The study of the beholder's share is thus a matter of empirical research and is indeed the major topic of the present paper. Generally, even for the same picture, different psychophysical tasks yield different pictorial reliefs. In our view, this is only to be expected and not at all problematic. However, in traditional psychophysics this would be taken to mean that something must be wrong. This is not a necessary conclusion, though. A reason might well be that the observers resolve the ambiguity of the stimulus in different ways. Different tasks might induce the observer to select different cues or combine them in a different way. It is a matter of empirical research how different responses are related in different psychophysical tasks and stimuli. In this paper we also investigate this expected task dependence.
Since pictorial space is a mental entity, it is necessarily defined operationally. Any such operationalisation involves a stimulus, an observer, and a task. The task has to be considered an important aspect of the operationalisation and the stimulus-response relationship cannot be expected to be invariant over tasks. The empirical study of pictorial space thus necessarily involves a variation of pictures, observers, and tasks (eg Christou and Koenderink 1997; Koenderink et al 1996). The study of this interplay is an important topic of this paper. Koenderink et al (2001) reported that minor variations of a certain task influence pictorial relief to an unexpectedly large extent. They found that pictorial relief depends quantitatively on the task and the stimulus. They found large variations for slightly different tasks in the case of a photograph of an abstract piece of sculpture. They speculated that the semantic content of the picture, in this case the apparent bilateral symmetry of the object in the scene, was used by the observers to resolve the inherent ambiguity of the stimulus. In this paper, we address this speculation through a suitably defined set of stimuli. Like Koenderink et al (2001), we use a bilaterally symmetric object, but this time photographed from a number of viewing angles.
2 Methods
2.1 Observers
Eight paid observers voluntarily participated in the experiments. They were naive with respect to the stimuli and tasks. All observers had normal or corrected-to-normal visual acuity.
2.2 Stimuli
A bicycle saddle was painted with white paint, mixed with rough sawdust, yielding a rough white surface texture. Six monochrome photographs of the bicycle saddle were prepared in six different views (see figure 1). A bicycle saddle was chosen because of its
Figure 1. The stimuli used in this experiment are photographs of a white painted bicycle saddle illuminated from the upper right. The overall plane of the saddle is slanted either 20° (upper row) or 40° (bottom row). The plane of bilateral symmetry was rotated left by 30° (left column), straight forward (middle column), or rotated to the right by 30° (right column). The photographs have been taken with a visual angle of 13°.
generic shape (as different from, for example, a sphere or an ellipsoid) and its bilateral symmetry. A texture was used that also yielded a clear contour in the image. The six photographs were taken with an Olympus 20E camera set to a field of view of 13°. The saddle was illuminated with a collimated beam from the upper right of the saddle and the photographs were taken in colour and converted to greyscale. The saddle was placed in a horizontal ground plane in six different orientations. The overall plane of the saddle was slanted either 20° (figure 1, upper row) or 40° (figure 1, bottom row). The plane of bilateral symmetry was either rotated to the left by 30°, straight forward, or rotated to the right by 30° (left, middle, and right pictures, respectively). In figure 1 all the stimuli are shown. They provide a rich set of cues to pictorial depth. The body shadows and the cast shadows clearly reveal the nature of the light source and the direction of the illumination. Note that the orientation of the saddle with respect to the light source has a major influence on the appearance of shape. The stimuli were presented on a 21 inch CRT monitor set to standard Mac γ = 1.8, white point D65, in 1600 × 1200 pixel, 75 Hz, RGB mode. The sessions took place in a darkened room. Viewing distance was 75 cm. Viewing was monocular, observers using their preferred eye, the other eye being occluded. Their head was fixed with a chin-rest.
2.3 Tasks
We measured pictorial relief with a method introduced by Koenderink et al (2001), which was roughly modelled on a method introduced by Frisby et al (1995). The basic idea of the method is to have the observer somehow specify a section of the pictorial relief defined by a straight line on the picture plane. When many such sections have been obtained, along many straight lines that crisscross over the picture plane, they are combined as a surface, the `pictorial relief', on which all these sections lie.
Of course, in practice one cannot measure continuous sections, and one has to deal with a finite number of samples. A convenient way to do this is to define
a regular grid of points in the picture plane and to measure the sections for all sets of collinear points on this grid. We used a hexagonal point grid, in which one has sections oriented at 120° intervals. The observer indicates the section by moving the sample points in the `depth' direction. However, it is not the case that the observer is required to `estimate the depth', but to indicate the shape of the section through a series of samples. The absolute depth is actually discarded. After the observer has indicated the sections, these are then combined. A typical point of a hexagonal grid is then a member of three sections, each in a different direction. Thus the pictorial relief is overdetermined by the measurements. The way to deal with this is to find the best-fitting surface in a least-squares sense. In finding this fit one may freely shift the sections in depth, leaving their shape fixed, but discarding absolute depth. The degree to which the final surface fits the sections yields a convenient measure of confidence. This can be compared with the variance of settings of repeated sessions. This yields an estimate of whether the surface can be said to describe the data at all, ie whether a `pictorial relief' can be said to exist. The implementation of this method was straightforward. We put two windows on the computer screen: the `stimulus window' contained the picture, whereas the `interaction window' was used to implement the observer interaction. The section was defined in the stimulus window with a number of coloured, collinear, equally spaced dots superimposed over the (monochrome) stimulus image. At the initiation of a trial, an equal number of evenly spaced dots appeared in the interaction window. The observer was permitted to drag any of these points in a direction perpendicular to that of the initial row of dots with the computer mouse. A special mark in the interaction window differentiated the directions away from and towards the observer.
The direction of the row of dots, as distinct from its orientation, was marked by using a special symbol for the `first' dot. The task of the observer was to arrive at a configuration of dots in the interaction window that looked like the profile for the section indicated in the stimulus window. A section was determined by at least three dots. The observer was free to move any dot, only the relative position of the dots being of relevance. Following Koenderink et al (2001) we implemented two variations of this task (figure 2). In one case the row of collinear dots in the interaction window was always horizontal, irrespective of the corresponding row in the stimulus window. In this paper we refer to this task as the CH (cross-section horizontal) task. In the other case, the row of collinear dots in the interaction window was parallel to the corresponding row in the stimulus window. In this paper we refer to this task as the CP (cross-section parallel) task. Although these tasks appear very similar, it turns out that many observers have a strong preference for one of the two (Koenderink et al 2001). The main choice in the actual implementation of the method is that of the coarseness of the regular point grid. For a given stimulus image we define the area for which the pictorial relief is to be measured by tracing a closed contour around its perimeter. We then triangulate this area in such a way that the number of faces in the triangulation is about 150. This number determines the magnitude of the task. Typically the number of points was about 140 (between 136 and 152), and the number of collinear dots in a section varied between 3 and 8. In any session, all rows of collinear dots contained in the triangulation were used. The same triangulation (and thus the same set of rows of collinear dots) was used in both the CH and the CP tasks. The resulting pictorial relief is represented in terms of a depth map. 
These depths are defined only in a relative sense; there is no absolute depth reference. It is important to remember that, although the relief is represented in terms of depth, the observers' task was not to indicate depth, but to indicate the shape of sections of the pictorial relief. We use the depth for the purpose of convenient analyses only.
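The least-squares combination of sections described in section 2.3 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `fit_relief`, the data layout (each section as a pair of point indices and settings), and the zero-mean gauge used to discard absolute depth are all hypothetical choices. Each setting is modelled as the depth of its grid point plus a free per-section offset, so only the shape of a section constrains the surface.

```python
import numpy as np

def fit_relief(n_points, sections):
    """Fit one depth per grid point to a set of cross-section settings.

    sections: list of (point_indices, profile) pairs; profile[k] is the
    observer's setting at point_indices[k], meaningful only up to an
    arbitrary shift of the whole section (hypothetical data layout).
    Returns (depths, section_offsets, rms_residual).
    """
    n_sec = len(sections)
    rows, rhs = [], []
    for s, (indices, profile) in enumerate(sections):
        for p, value in zip(indices, profile):
            row = np.zeros(n_points + n_sec)
            row[p] = 1.0              # unknown depth of grid point p
            row[n_points + s] = 1.0   # unknown free shift of section s
            rows.append(row)
            rhs.append(value)
    # Absolute depth is undefined, so fix the gauge: depths sum to zero.
    gauge = np.zeros(n_points + n_sec)
    gauge[:n_points] = 1.0
    rows.append(gauge)
    rhs.append(0.0)

    A = np.array(rows)
    b = np.array(rhs, dtype=float)
    solution, *_ = np.linalg.lstsq(A, b, rcond=None)
    # Residual of the settings only (the gauge row is excluded).
    residual = A[:-1] @ solution - b[:-1]
    rms_residual = float(np.sqrt(np.mean(residual ** 2)))
    return solution[:n_points], solution[n_points:], rms_residual
```

The returned rms residual plays the role of the "spread" of the fit: a small value relative to the total depth range suggests that a single surface accounts for the sections.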
Figure 2. In both the horizontal cross-section task (CH) and the parallel cross-section task (CP) the observer is confronted with the stimulus (left) overlaid with a set of collinear points. The observers adjust the corresponding points in the right hand window by dragging each collinear dot orthogonally to the collinear line in the direction of the reference dot (the isolated dot above all the dots) according to the relative depth differences of the collinear dots in pictorial space. (a) In the CH task, initially the collinear dots are horizontal with respect to the experimental window. (b) In the CP task, initially the collinear dots are parallel to the collinear dots in the stimulus window. In both tasks, the final adjustment typically indicates the cross-section of the pictorial relief, which was under the collinear dots. For clarity, the dots are somewhat larger than in the actual experiments.
3 Analysis
The degree to which surface consistency is violated can be estimated from the depth variance per point of the triangulation as found from the least-squares estimation of the fit. Comparing this spread with the total depth range, we find that for all observers the depth range exceeds the spread by at least a factor of 15 (range 15-130). This means that the number of discriminable depth levels ranges between 15 and 130. Thus it is reasonable to speak of pictorial relief.
A convenient and intuitive way to visualise the pictorial relief is a depth-contour map. Examples are shown in figure 3. In these maps we draw the contour of the triangulated area as a white line and in its interior we represent the local depth values with a grey tone, lighter corresponding to closer to the observer. In order to make it easier to see the shape, we use discrete levels indicated by equal-depth contours. The spacing of these equal-depth contours indicates the steepness of the pictorial relief: the closer these contours crowd together, the steeper the profile. Note that a closed
Figure 3. Pictorial reliefs for observers HP, MV, and JA (rows, top to bottom) for stimulus 5, in the CH task (left column) and the CP task (right column). The representative depth contour maps are drawn with equal interval spacings. Lighter shading indicates closer, darker shading more remote parts of the relief.
contour indicates a depth extremum, that is a point either closest or farthest away from the observer. Thus in figure 3, the pattern at the top right represents a surface that slants away from the observer, whereas the pattern at the bottom left represents an almost frontoparallel surface. The spacing and number of depth contours indicate whether the surface is rather flattish, or has a steep inclination. Another way to analyse the results that is especially convenient for comparisons is to consider scatter plots of depth values at corresponding points. Unlike in the case of contour maps, in the case of depth scatter plots the configuration of the points in the picture plane is completely ignored. Nevertheless, this type of analysis is very useful since it allows us to find correlation measures for the depth values. Koenderink et al (2001) discovered that in many cases the correlations found from straight depth scatter plots are very low. They found that in many cases the correlations increased enormously when multiple correlations involving both the depth and the picture plane coordinates were considered. Let the depth values at the points $(x_i, y_i)$ in two different sessions be denoted $(z_i, z'_i)$; then `straight depth correlation' considers the correlation between the lists $(z_i)$ and $(z'_i)$, whereas `affine depth correlation' considers the linear model $z'_i = a + b x_i + c y_i + d z_i$. The index $i$ labels the points in the triangulation. The total number of points was about 140. Koenderink et al (2001) discuss many examples where the straight correlation of depth values for the responses of a certain observer and a single stimulus for two slightly different tasks was not significant, whereas the affine correlation was close to 1. In such cases, the depth values are not at all descriptive of the structure of the response. Affine depth correlation is an instance of the more general procrustean method of shape analysis, introduced by Kendall (1989).
In this method, one compares or correlates two structures after applying a procrustean transformation to one of these. For instance, two cubes of arbitrary size correlate perfectly after a suitable scaling and rotation of one of them. A lozenge correlates perfectly with a square after a suitable rotation-magnification-shear. Formally, one characterises a geometry through its group of congruences. For instance, in Euclidean geometry, a congruence conserves distances and angles. In affine geometry (eg Coxeter 1961), only parallelity between lines and bisection of line segments are conserved. In projective geometry, only the incidence relations between points and lines are conserved. There exists a natural hierarchy of geometries. Thus, Euclidean geometry is also affine, and affine geometry is also projective, but in projective geometry parallelity is not defined, and in affine geometry distance is not defined. It is easily shown that the equation of affine depth correlation describes a congruence of affine geometry. The constant a describes a translation in depth, the constant d a scaling of the depth domain, whereas the constants b and c describe a transformation that is usually denoted as `shear' in terms of the conventional Euclidean geometry. A pure shear will transform a rectangle into a parallelogram. Thus a shear does not conserve Euclidean angles. `Rectangle' is a term that is meaningless in terms of affine geometry. Since the equation describes an affine congruence, it can be taken as a formal definition of `pictorial relief'. Pictorial relief is that property of a pictorial surface that is invariant with respect to arbitrary affine congruences. This is fully analogous to the conventional definition of Euclidean `shape'. Shape is the property of a surface that is invariant to arbitrary Euclidean congruences, that is, arbitrary translations and/or rotations.
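The difference between straight and affine depth correlation can be made concrete with a short sketch. It assumes ordinary least-squares regression with an intercept and the usual adjusted R² formula; the function names `adjusted_r2` and `straight_and_affine` are hypothetical, and nothing here is taken from the authors' own analysis code.

```python
import numpy as np

def adjusted_r2(target, predictors):
    """Ordinary least squares with intercept.

    Returns (adjusted R^2, coefficients); coefficients[0] is the intercept.
    """
    target = np.asarray(target, dtype=float)
    columns = [np.ones(len(target))]
    columns += [np.asarray(p, dtype=float) for p in predictors]
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    ss_res = np.sum((target - X @ coef) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    n, k = len(target), len(predictors)
    r2 = 1.0 - ss_res / ss_tot
    # The adjustment penalises the extra predictors of the affine model.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1), coef

def straight_and_affine(x, y, z, z_prime):
    """Compare z' ~ z (straight) with z' ~ a + b*x + c*y + d*z (affine)."""
    r2_straight, _ = adjusted_r2(z_prime, [z])
    r2_affine, coef = adjusted_r2(z_prime, [x, y, z])
    return r2_straight, r2_affine, coef  # coef = (a, b, c, d)
```

When two reliefs differ mainly by a shear, the straight value can be near zero while the affine value is near 1, which is the pattern the text describes.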
Note that, whereas the straight depth correlation fully ignores the location of corresponding points in the picture plane, the affine correlation retains some of the geometry. The linear model represents a depth shift, a depth shear, and a depth scaling. The depth shift also occurs in the straight correlation (the coefficient a in the linear model) and is irrelevant in our analysis, since there is no natural origin of the depth domain. The shear and the depth scaling have significant meaning in the experiment, though. The interpretation of the depth scaling is immediate and corresponds to the slope of the straight scatter plot when the shear is negligible. Very significant depth scalings are commonly found in pictorial relief. They were already described by Hildebrand (1893/1945) more than a century ago. The geometrical significance of the shear (coefficients b and c of the linear model) is perhaps less readily appreciated. At any single point of the picture plane, the shear does not magnify the depth dimension but only shifts it in depth. These shifts are linearly distributed over the picture plane, with the result that a frontal parallel plane becomes slanted. Because the slant is typically both in the x and y directions, the shear has to be specified in two directions. A convenient way to specify a shear is by way of the slants of an originally frontoparallel plane in the x and y directions. These slants can be specified in degrees. The coefficients b and c in the linear model are simply the tangents of these angles. Notice that the total depth range over the picture is also influenced by the shear. Thus the total depth range cannot be immediately used to estimate the depth magnification. We analyse the results in two different ways. First of all, we consider comparisons for different tasks and different observers using the three methods discussed above. Then we also compare the result of each single session with that of an overall `grand mean'. 
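The geometry of the linear model may be easier to see in matrix form. The block below is a restatement of the model under two assumptions that are not spelled out in the text: the symbols $\theta_x$ and $\theta_y$ are introduced here for the slants of an originally frontoparallel plane, and depth is taken to be expressed in the same units as the picture coordinates so that b and c are dimensionless tangents.

```latex
\begin{pmatrix} x_i \\ y_i \\ z'_i \end{pmatrix}
=
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ b & c & d \end{pmatrix}
\begin{pmatrix} x_i \\ y_i \\ z_i \end{pmatrix}
+
\begin{pmatrix} 0 \\ 0 \\ a \end{pmatrix},
\qquad
b = \tan\theta_x, \quad c = \tan\theta_y .

% A frontoparallel plane z_i = z_0 is mapped to the slanted plane
%   z'_i = (a + d z_0) + b x_i + c y_i .
% Worked example: a slant of \theta_x = 13^\circ corresponds to
%   b = \tan 13^\circ \approx 0.231 .
```

The picture coordinates are left untouched; only the depth row mixes in x and y, which is why the shear tilts the relief without moving any point in the picture plane.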
For each stimulus the grand mean is the average of all normalised pictorial reliefs over all sessions over all observers. The pictorial reliefs are normalised through a transformation that renders them frontoparallel with unit rms-depth variation. We use the grand mean because we do not have a `veridical' fiducial surface (the problematic nature of veridicality as it is commonly used in the literature is discussed in section 1).
4 Results
In figure 3, a number of representative depth contour maps are shown for stimulus 5. The left and the right columns correspond to results of the CH and CP tasks. As even a cursory perusal of these data reveals, there exist very significant differences in the pictorial reliefs for different tasks or different observers. Perhaps the major differences are in the overall slants of the pictorial reliefs. The huge overall differences in spatial attitude make it virtually impossible to compare the shapes of the various reliefs (note that shape is by definition invariant against changes of spatial attitude). This indicates that an analysis of affine correlations is mandatory for pairwise comparisons of reliefs to make sense. For instance, a straight correlation is unlikely to reveal shape differences if the overall slants are significantly different. In the left column of figure 4 we show straight scatter plots of depth values for the same examples as in figure 3. The affine correlations are shown in the right column of figure 4. From top to bottom the plots represent scatter plots of depths obtained in the two tasks for different observers. The depth values of observer HP correlated well (R² = 1.00; we report adjusted R² throughout this paper). On the other hand, the plots of observers MV and JA show large scatter, indicating that their depth profiles were quite different in the two tasks. As a consequence, the corresponding R² values are quite low: 0.54 and 0.44, respectively.
Affine correlation brings these R² values up to 0.99 and 0.94, respectively. It is clear that in the two different tasks, the pictorial reliefs indeed differ only by a shear and/or depth scaling.
Figure 4. Scatter plots of the depth values in the case of observers HP, MV, and JA for stimulus 5, and for CH and CP tasks. The left column shows straight scatter plots for the depth values in the two tasks. In the scatter plots on the right, the depth values for the CH task have been affinely corrected (denoted as CH*). Notice the enormous increase of coefficient of determination after the multiple linear regression.
In figure 5 bar plots of straight (black bars) and affine (white bars) R² values are shown (for all observers and all stimuli). Whenever the affine correlation is significantly better than the straight correlation (at 5% level) asterisks are shown on top of the bars. Here the significance takes the additional degree of freedom into account, for the affine correlations clearly can never be smaller than the straight correlations. The indications of significance in the figure are based on a standard two-sided test of Fisher's z transformation of Pearson's R values (Press et al 1988). In cases where the R values are very close to 1, this test is rather weak (and the difference is near zero and thus invisible in the plot). In the context of this experiment, the interesting cases are those where the differences between the straight and the affine correlations are very large. In almost all cases the pictorial reliefs for the two tasks differ merely in spatial attitude but not in shape.
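A two-sided test on Fisher's z transformation can be sketched as below. This follows the textbook form for comparing two correlation coefficients from independent samples; it is an assumption that this matches the test actually used, and the sketch does not reproduce the paper's correction for the additional degrees of freedom of the affine model. The function names are hypothetical.

```python
import math

def fisher_z(r):
    """Fisher's z transformation of a correlation coefficient, |r| < 1."""
    return math.atanh(r)

def correlation_difference_p(r1, r2, n1, n2):
    """Two-sided p-value for the difference between two correlations.

    Assumes independent samples of sizes n1 and n2 (both > 3); under the
    null hypothesis the difference of z values is approximately normal
    with variance 1/(n1 - 3) + 1/(n2 - 3).
    """
    z_diff = abs(fisher_z(r1) - fisher_z(r2))
    std_err = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    # P(|Z| > t) for a standard normal Z equals erfc(t / sqrt(2)).
    return math.erfc(z_diff / (math.sqrt(2.0) * std_err))
```

Because the transformation diverges as R approaches 1, correlations that are numerically equal to 1 have to be excluded or clipped before the test is applied.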
Figure 5. The coefficients of determination (R² values) for the depths of pairs of methods for linear (black bars) and affine regression (white bars). The asterisks indicate cases where the affine correlation is significantly better than the straight correlations. The graphs correspond from left to right to observers HH, HP, LB, and MA (top), and to observers MV, OE, LI, and JA (bottom), respectively. Vertical axes: adjusted R²; horizontal axes: stimulus number (1-6).
In figure 6 the R² values are plotted for the straight correlations and affine correlations between the observers for stimulus 2. The upper plots show the values for the CH task, the lower plots for the CP task. For all the observers the affine correlations are significantly better (typically much better) than straight correlations (at 5% level). In almost all cases the R² values in straight correlations with observer LB (observer 3) are very low (close to zero); R² values after affine correlations are improved but even then they are often still far from 1. Observer LB's responses are apparently different from the rest of the observers' responses. A comparison of the pictorial shape for observer LB with those of the other observers reveals that where the shapes for the other observers are very similar to each other, the pictorial shape for observer LB is indeed qualitatively different. In order to enable an easy comparison of all the observers' responses, we compare each to the standard grand mean (SGM). Observer LB was left out of the grand mean, because LB's results clearly deviate much from those of the other observers. Because of the normalisation used to find the SGM, an expected result is that there are pronounced shears along the vertical direction, rather less so along the horizontal direction (see figures 7 and 8). The shears are different for different observers and tasks. For example, the shears along the vertical axis in the CH task range from 0.20° to 6.46° and in the CP task from 0.29° to 13°. Clearly, for some observers the tasks have an influence on their pictorial reliefs. The shears along the horizontal direction are very low (see figure 7) and range from −1° to 1°.
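The normalisation behind the standard grand mean (section 3: frontoparallel, unit rms-depth variation) might be sketched as follows. The interpretation is an assumption: "rendering frontoparallel" is read here as subtracting the least-squares plane through the relief, and the function names `normalise_relief` and `grand_mean` are hypothetical.

```python
import numpy as np

def normalise_relief(x, y, z):
    """Remove the best-fitting plane, then scale to unit rms depth.

    Assumes the relief is not itself planar (nonzero residual).
    """
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    X = np.column_stack([np.ones(len(z)), x, y])
    plane, *_ = np.linalg.lstsq(X, z, rcond=None)
    residual = z - X @ plane          # relief with overall slant removed
    return residual / np.sqrt(np.mean(residual ** 2))

def grand_mean(x, y, reliefs):
    """Average of normalised reliefs sampled at the same grid points."""
    return np.mean([normalise_relief(x, y, z) for z in reliefs], axis=0)
```

Under this reading, two reliefs that differ only by a depth shift, a shear, and a positive depth scaling normalise to the same surface, so the shears and scalings reported per session are measured relative to a slant-free reference.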
[Figure 6 plot: adjusted R² (0 to 1) on the vertical axis against observer combinations on the horizontal axis.]
Figure 6. The bars indicate the coefficients of determination (R²s) between the observers for the two tasks for linear regression (black bars) and affine regression (white bars) for stimulus 2. Numbers 1 through 8 correspond to observers HH, HP, LB, MA, MV, OE, LI, and JA, respectively.
[Figure 7 plot: one panel per observer (HH, HP, MA, MV, OE, LI, JA); shear in degrees plotted against stimulus number.]
Figure 7. The bars indicate the depth shears along the horizontal axis with respect to the reference plane for all observers and stimuli. The white bars represent shears in the CH task and the black bars in the CP task.
[Figure 8 plot: one panel per observer (HH, HP, MA, MV, OE, LI, JA); shear along the vertical axis in degrees plotted against stimulus number (1 to 6).]
Figure 8. The depth shears along the vertical axis with respect to the reference plane for all observers and stimuli. The white bars represent the depth shears in the CH task and the black bars in the CP task.
The depth scaling also depends on the observers and the tasks (see figure 9). In the CH task the scaling ranges from 2.5 to 14 and in the CP task from 2.5 to 28 (because of the normalisation of the SGM, all scaling fractions are less than 100). Notice that the total depth range is also influenced by the shear.
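The remark that the total depth range is influenced by the shear as well as by the depth scaling can be made concrete with a short sketch. It assumes, hypothetically, that an observer's relief is related to a reference relief by z' = a + bx + cy + dz, that the shear angles are the arctangents of the slopes b and c (which presumes depth and picture coordinates share the same units), and that the depth scaling is reported as 100·d. These conventions, the function names, and the synthetic values are illustrative assumptions, not the procedure used in the study.

```python
import math
from typing import Dict, Sequence


def describe_affine_fit(b: float, c: float, d: float) -> Dict[str, float]:
    """Express slopes and gain as shear angles (degrees) and a scaling (x100).

    ASSUMPTION: shear angle = atan(slope); the paper's exact convention
    may differ from this illustrative choice.
    """
    return {
        "shear_horizontal_deg": math.degrees(math.atan(b)),
        "shear_vertical_deg": math.degrees(math.atan(c)),
        "depth_scaling_x100": 100.0 * d,
    }


def depth_range(x: Sequence[float], y: Sequence[float], z_ref: Sequence[float],
                a: float, b: float, c: float, d: float) -> float:
    """Total depth range of the affinely transformed reference relief."""
    z_obs = [a + b * xi + c * yi + d * zi for xi, yi, zi in zip(x, y, z_ref)]
    return max(z_obs) - min(z_obs)


# Synthetic reference relief on a 3 x 3 grid (NOT data from the study).
grid = [(gx, gy) for gx in (-1.0, 0.0, 1.0) for gy in (-1.0, 0.0, 1.0)]
xs = [p[0] for p in grid]
ys = [p[1] for p in grid]
z_ref = [gx * gx + gy * gy for gx, gy in grid]

# Same depth scaling (d = 0.1), without and with a vertical shear (c = 0.2).
range_without_shear = depth_range(xs, ys, z_ref, a=0.0, b=0.0, c=0.0, d=0.1)
range_with_shear = depth_range(xs, ys, z_ref, a=0.0, b=0.0, c=0.2, d=0.1)
print(describe_affine_fit(b=0.0, c=0.2, d=0.1))
print(range_without_shear, range_with_shear)
```

With the scaling held fixed, adding the vertical shear more than doubles the total depth range in this example, which is why the depth range alone cannot be read as a measure of the scaling.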
[Figure 9 plot: one panel per observer (HH, HP, MA, MV, OE, LI, JA); depth scaling (×100) plotted against stimulus number.]
Figure 9. The bars indicate the depth scalings (×100) of the z axis with respect to the standard grand mean for all observers and stimuli. The white bars represent the depth scalings in the CH task and the black bars in the CP task.
5 Discussion
This research was undertaken in order to study the nature of two types of evidence. The first type is that of the familiar `depth cues'. Note that `depth cues' is actually an awkward term, because only in rare cases does such a cue actually refer to `depth'. This is, for instance, the case for the `familiar size' cue. Most so-called depth cues rather specify a depth order (occlusion), a depth ratio (atmospheric perspective), a spatial attitude (texture gradient), or a curvature (shading). These cues are effective in early common, that is precategorical, stages of perception. In principle, the depth cues can be used in a bottom-up fashion.
The second type of evidence is of a completely different nature. By definition, the depth cues exhaust the causal connections between image structure and scene structure. Thus, again by definition, the evidence of the second type cannot be uniquely defined by image structure alone. This evidence depends on idiosyncratic assumptions on the part of the observer. For instance, in the present experiment, one might expect the observers to take the objects in the scene to be bilaterally symmetric in the Euclidean sense. Notice that this is indeed an assumption that is in no way forced by the image structure, for the bilateral symmetry is not necessarily present in the image. Such an assumption must be due to prior experiences of the observer with generic physical scenes. This type of evidence can only be used in a top-down fashion. It seems, a priori, likely that human observers will make similar assumptions given the general similarity of their experiences. This was indeed suggested by Koenderink et al (2001). In the present paper, we addressed this issue with a larger set of observers and stimuli.
Do all observers use the classical `depth' cues in the same way? From the present study the answer is clearly yes.
This conclusion follows from the fact that the affine correlations of the pictorial reliefs for different observers are generally very high (with the exception of observer LB). The inherent ambiguity of the pictorial cues from the first class indicates that one cannot expect the straight correlations to be high, since the depths are not specified by the cues (Belhumeur et al 1997; Koenderink and van Doorn 1997). In the affine correlations, the effect of the inherent cue ambiguities is removed. Thus, the only way to compare observers with respect to their ability to use the classical depth cues is to factor out the stimulus ambiguities, for instance by using affine correlations.
Do all observers use the second type of evidence in the same way? From the present study the answer is: ``probably no''. Given that the observer exhausts the traditional depth cues, the ambiguity still leaves the freedom to decide idiosyncratically on the apparent frontoparallel plane and depth range. The choice cannot depend on the depth cues, but has to be based on prior experience. For instance, most objects are likely to be as deep as they are wide, thus settling the depth range. From the straight correlations, it is evident that both the apparent frontoparallel planes and the depth ranges differ widely for different observers for any given stimulus. The apparent frontoparallel planes can easily differ by as much as 10°, and the depth ranges by factors between 2 and 4. The data do not reveal a clear pattern. In many cases, we notice that a change of task for a given observer and a given stimulus makes a large difference. There is some tendency for observer differences to persist over stimuli. However, this pattern is definitely not systematic. The suggestion of Koenderink et al (2001) that observers would use the evidence of the second type in similar ways because of their common visual experiences is not corroborated by the present data.
General conclusions that can be drawn from these results are the following. First of all, we find conclusively that depth is irrelevant as a psychophysical variable in the context of pictorial relief. This is an important conclusion because much of the literature is preoccupied with depth. Suppose that $z(x, y)$ is the depth $z$ as a function of the location $(x, y)$ for a given observer. Then another observer has a perfect affine correlation with the first observer if $z'(x, y) = a + bx + cy + d\,z(x, y)$, where the parameters $a$, $b$, $c$, and $d$ are arbitrary, though $d$ should be different from zero. In that case, the two observers have exploited the classical depth cues to the same extent and their pictorial reliefs may be said to be identical. The pictorial relief is the relevant psychophysical variable in these experiments. Another conclusion that follows from these data is that the sets of parameters $\{a, b, c, d\}$, that is the affine transformations that characterise the observers' perceptions, are essentially idiosyncratic. Koenderink et al (2001) have characterised these sets of parameters as `mental changes of viewpoint'. As mentioned in section 1, a likely hypothesis is that the observers are somehow influenced by the apparent bilateral symmetry of the depicted object. However, the present data indicate that such mental changes of viewpoint appear to be unrelated to the semantic content of the images. This is perhaps somewhat surprising and calls for further investigation.
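Why the parameter d must differ from zero, and why two affinely related reliefs may be treated as identical, can be made explicit with a short worked derivation. The block below is an added illustration, not part of the original analysis; it starts from the affine relation z'(x, y) = a + bx + cy + d z(x, y) between two observers' reliefs, and the third relief z'' and the subscripted parameters are notation introduced here.

```latex
\begin{align*}
% Affine relation between the reliefs of two observers
z'(x, y) &= a + b x + c y + d\, z(x, y), \qquad d \neq 0 \\
% Solving for z: the inverse relation is again affine
z(x, y) &= -\frac{a}{d} - \frac{b}{d}\, x - \frac{c}{d}\, y + \frac{1}{d}\, z'(x, y) \\
% A third relief related affinely to the second ...
z''(x, y) &= a_2 + b_2 x + c_2 y + d_2\, z'(x, y), \qquad d_2 \neq 0 \\
% ... is also related affinely to the first
&= (a_2 + d_2 a) + (b_2 + d_2 b)\, x + (c_2 + d_2 c)\, y + d_2 d\, z(x, y)
\end{align*}
```

The relation is therefore symmetric and transitive, so it sorts reliefs into classes whose members differ only by a shear, a depth scaling, and an offset. If d were zero, the second relief would reduce to a plane and would carry no information about the first.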
References
Ames A, 1952 The Ames Demonstrations in Perception (New York: Hafner)
Belhumeur P, Kriegman D J, Yuille A L, 1997 ``The bas-relief ambiguity'', in Proceedings of the 1997 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Washington, DC: IEEE Computer Society Press) pp 1060–1066
Blake A, Bülthoff H H, Sheinberg D, 1993 ``Shape from texture: ideal observer and human psychophysics'' Vision Research 12 1723–1737
Christou C G, Koenderink J J, 1997 ``Light source dependence in shape from shading'' Vision Research 37 1441–1449
Coxeter H S M, 1961 Introduction to Geometry (New York: John Wiley)
Frisby J P, Buckley D, Bayliss F, Freeman J, 1995 ``Integration of conflicting stereo and texture cues in quasi-natural viewing of a torso sculpture'' Perception 24 Supplement, 136
Gombrich E, 1975 ``Review Lecture Mirror and Map: Theories of pictorial representation'' Philosophical Transactions of the Royal Society of London, Series B 270 119–149
Helmholtz H von, 1867/1962 Treatise on Physiological Optics volume 3 (New York: Dover, 1962); English translation by J P C Southall for the Optical Society of America (1925) from the 3rd German edition of Handbuch der physiologischen Optik (first published in 1867, Leipzig: Voss)
Hildebrand A, 1893/1945 Das Problem der Form in der bildenden Kunst (Strassburg: Heitz, 1893) [Translated by M Meyer, R M Ogden The Problem of Form in Painting and Sculpture (New York: G E Stechert, 1945)]
Hillis J M, Watt S J, Landy M S, 2004 ``Slant from texture and disparity cues: Optimal cue combination'' Journal of Vision 4 967–992
Kendall D G, 1989 ``A survey of the statistical theory of shape (with discussion)'' Statistical Science 4 81–120
Koenderink J J, Doorn A J van, 1997 ``The generic bilinear calibration-estimation problem'' International Journal of Computer Vision 23 217–234
Koenderink J J, Doorn A J van, Christou C, Lappin J S, 1996 ``Perturbation study of shading in pictures'' Perception 25 1009–1026
Koenderink J J, Doorn A J van, Kappers A M L, Todd J T, 2001 ``Ambiguity and the `mental eye' in pictorial relief'' Perception 30 431–448
Marr D, 1982 Vision: A Computational Investigation into the Human Representation and Processing of Visual Information (San Francisco, CA: W H Freeman)
Palmer S E, 1999 Vision Science: Photons to Phenomenology (Cambridge, MA: MIT Press)
Poggio T, 1984 ``Low-level vision as inverse optics'', in Proceedings of Symposium on Computational Models of Hearing and Vision Ed. M Rauk (Tallin: Academy of Sciences of the Estonian SSR) pp 123–127
Press W H, Flannery B P, Teukolsky S A, Vetterling W T, 1988 Numerical Recipes in C: The Art of Scientific Computing (Cambridge: Cambridge University Press)
Yuille A L, Bülthoff H H, 1996 ``Bayesian decision theory and psychophysics'', in Perception as Bayesian Inference Eds D Knill, W Richards (Cambridge: Cambridge University Press) pp 123–161
2007 a Pion publication
ISSN 0301-0066 (print)
ISSN 1468-4233 (electronic)
www.perceptionweb.com
Conditions of use. This article may be downloaded from the Perception website for personal research by members of subscribing organisations. Authors are entitled to distribute their own article (in printed form or by e-mail) to up to 50 people. This PDF may not be placed on any website (or other online distribution system) without permission of the publisher.
