Perceptual Grouping

The projection of the three-dimensional world onto two-dimensional images results in the loss of information such as depth. The true three-dimensional structure has to be recovered by the visual system from images that could have arisen from an infinite number of possible scenes. The recovery of the lost three-dimensional information cannot be done uniquely based on a geometrical theory alone, and additional assumptions are necessary.
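The loss of depth can be seen in a minimal sketch of perspective projection. The pinhole camera model, the function name `project`, and the `focal_length` parameter are assumptions made here for illustration; the text does not commit to a particular camera model.

```python
def project(point, focal_length=1.0):
    """Perspective (pinhole) projection of a 3-D point (X, Y, Z) onto the image plane.

    Assumes the camera looks along the +Z axis and the point has Z > 0.
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    return (focal_length * x / z, focal_length * y / z)


# Two different scene points on the same ray through the optical center.
near = (1.0, 2.0, 4.0)
far = (2.0, 4.0, 8.0)  # 'near' scaled by 2, so twice as far away

print(project(near))  # (0.25, 0.5)
print(project(far))   # (0.25, 0.5)
```

Every point along the ray projects to the same image location, so the image alone cannot say which of them was actually in the scene.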

These assumptions could be very specific in some well-specified domains. For example, the assumption might be that a database of all the objects that would ever be encountered, together with their three-dimensional models, is given. The three-dimensional structure could then be deduced by using two-dimensional images as indices into the database. Although useful for applications in constrained environments, such assumptions are very restrictive and impractical for general vision.

Assumptions that do not restrict the domain of functionality must exploit properties of scenes and imaging that are general. The following questions arise in this regard. What are the general properties of scenes that might be used to perform domain-independent interpretation of images? Must all properties be expressed explicitly in terms of the entities in the scene and the constraints imposed by projective geometry? Or could the properties of the real world lead to image plane counterparts defined strictly in terms of image plane entities, with no direct reference to three-dimensional geometry? If so, what are the image entities to which these properties apply? How must the constraints be expressed, if not in terms of projective geometry? At what stage does the explicit three-dimensional nature of the scene enter the interpretation process?

To illustrate these questions, consider an image that contains two parallel lines. The parallelism could be the result of actual parallelism of two lines in three-dimensional space. Or the parallel lines could be the projections of two parallel, planar curves viewed so that the planes containing the curves project as straight lines. The latter case is unlikely, since it requires an unstable viewpoint: a small change in viewing position would destroy the effect. In general, it appears safe to assume that if two lines in the image plane are parallel, then they are also parallel in space, without requiring further data to verify that the image does not result from an unstable viewpoint. Making this assumption eliminates the need to first obtain a three-dimensional description of the lines and then test whether they are parallel in three-dimensional space. This illustrates the use of image plane entities, and constraints on their image plane structure, to make inferences about three-dimensional scene structure directly, without involving any three-dimensional structural primitives or using specific prior knowledge about the contents of the scene [5].
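As a rough sketch of what such an image plane test might look like, the following checks whether two line segments, given only by their two-dimensional endpoints, are parallel within a tolerance. The function names and the `tol_degrees` threshold are hypothetical choices made here, not part of any method described in the text; real image measurements are noisy, so some tolerance is needed.

```python
import math


def direction_angle(segment):
    """Orientation of a 2-D segment in [0, pi), ignoring which endpoint comes first."""
    (x1, y1), (x2, y2) = segment
    return math.atan2(y2 - y1, x2 - x1) % math.pi


def are_parallel(seg_a, seg_b, tol_degrees=2.0):
    """True if the two image segments have the same orientation within tol_degrees."""
    diff = abs(direction_angle(seg_a) - direction_angle(seg_b))
    diff = min(diff, math.pi - diff)  # orientations wrap around at pi
    return diff <= math.radians(tol_degrees)


a = ((0.0, 0.0), (4.0, 1.0))
b = ((1.0, 3.0), (9.0, 5.0))   # same direction as a, different position and length
c = ((0.0, 0.0), (1.0, 1.0))   # 45-degree segment

print(are_parallel(a, b))                       # True
print(are_parallel(a, ((4.0, 1.0), (0.0, 0.0))))  # True (reversed endpoints)
print(are_parallel(a, c))                       # False
```

The test uses image coordinates only; under the stable-viewpoint assumption, a positive result is taken as evidence of parallelism in space without ever reconstructing the lines in three dimensions.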

The significance of image plane structure is determined by image plane entities. Structurally related entities are said to be grouped. Grouping thus means putting items seen in the visual field together, that is, organizing image data. The organization may be at different scales. The rules to detect the organization may be stated completely in terms of intrinsic properties of the tokens being grouped and their image plane relationships. Since the detected image organization ultimately captures the organization of the scene, application of these rules should be a useful step towards image interpretation. Thus, grouping is a form of early inference about the structure of objects in the scene being viewed, without the explicit use of three-dimensional or domain-specific knowledge. The image plane entities, or tokens, that may be grouped include blobs, edge segments, and geometrical features of image regions.
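To make the idea of a grouping rule stated purely in image plane terms concrete, here is a minimal sketch that groups dot tokens by distance: two dots belong to the same group if they are linked by a chain of neighbors no farther apart than a threshold. The rule, the function name `group_by_proximity`, and the `max_dist` parameter are illustrative assumptions, not the algorithm of any of the cited works.

```python
import math


def group_by_proximity(points, max_dist):
    """Group 2-D points into connected components of the 'closer than max_dist' relation."""
    parent = list(range(len(points)))  # union-find forest, one node per point

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of points that are close enough in the image plane.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= max_dist:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    # Collect the points belonging to each root.
    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())


dots = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
print(group_by_proximity(dots, max_dist=1.5))
# [[(0, 0), (1, 0), (0, 1)], [(10, 10), (11, 10)]]
```

The rule refers only to token positions in the image. Changing `max_dist` changes the scale at which organization is detected, which is one simple way to read the remark that organization may exist at different scales.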

Gestalt Laws of Perceptual Grouping

Gestalt psychologists undertook the first detailed study of the grouping phenomenon in human vision in the early part of the twentieth century [1,2]. They identified certain rules or principles to explain the particular way the human perceptual system groups tokens together, and suggested that grouping among tokens takes place based on the criteria shown in the accompanying figure.

For any given stimulus, one or more of these rules might be at work in determining the perceived grouping. If more than one rule is at work, the rules might be cooperating or competing.

One question that must then be answered is how to resolve the conflicts that may arise among the results of applying these different rules. The Gestalt psychologists raised such questions. Explanations of how the grouping processes work, and reasons why such processes exist, were later proposed by researchers in computational vision [3,4,5,6].

Our environment causes the images to have the structure they do. Therefore, the criteria for grouping are intimately tied to properties of the environment. Some examples of properties of the environment that may have a significant impact on the nature of image structure are discussed by Marr:

``The visible world consists of smooth surfaces that have certain reflectances; a surface's reflectance function often is generated by processes operating at different scales; the tokens generated at a given scale by the same process tend to be similar to each other; markings on a surface which are generated by the same process often have certain coherent spatial arrangements; the loci of discontinuities in depth or surface orientation are often smooth almost everywhere.'' [3]

Possible interpretations of an image that depend upon an accidental viewpoint are generally ignored by the human visual system, and interpretations that are viewpoint-stable are preferred. Further, the larger the number of simultaneous accidental alignments required to support a given interpretation, the less likely that interpretation becomes. Some researchers have suggested that the grouping criteria identified by the Gestalt psychologists result from assumptions that eliminate unlikely interpretations.
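The dependence on the number of alignments can be made concrete with a rough calculation. As an assumption made here purely for illustration (it is not stated in the text), suppose an interpretation requires n accidental alignments, that alignment i arises by chance with probability p_i, and that the alignments are independent. Then:

```latex
P(\text{all } n \text{ alignments are accidental})
  = \prod_{i=1}^{n} p_i
  \;\le\; \min_{1 \le i \le n} p_i
```

For instance, with p_i = 0.05 for each of n = 3 alignments, the product is 0.05^3 = 1.25 x 10^-4. Each additional required coincidence multiplies the probability by a factor no greater than one, so under these assumptions an interpretation that needs many coincidences quickly becomes implausible.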


  1. K. Koffka, Principles of Gestalt Psychology, Harcourt Brace, New York, 1935.
  2. M. Wertheimer, ``Laws of Organization in Perceptual Forms,'' in A Source Book of Gestalt Psychology, W. D. Ellis (ed), pp. 71-88, Harcourt Brace, 1938.
  3. D. Marr, Vision, W.H. Freeman and Company, San Francisco, CA, 1982.
  4. A. P. Witkin and J. M. Tenenbaum, ``On the Role of Structure in Vision,'' in Human and Machine Vision, Jacob Beck, Barbara Hope, and Azriel Rosenfeld (eds), pp. 481-543, Academic Press, New York, 1983.
  5. D. G. Lowe, Perceptual Organization and Visual Recognition, Kluwer Academic Publishers, Boston, 1985.

Related Publications by Mihran Tuceryan

  7. N. Ahuja and M. Tuceryan, ``Extraction of Early Perceptual Structure in Dot Patterns: Integrating Region, Boundary, and Component Gestalt,'' Computer Vision, Graphics, and Image Processing, vol. 48, pp. 304-356, December 1989.
  8. D. A. Trytten and M. Tuceryan, ``Segmentation and Grouping of Object Boundaries Using Energy Minimization,'' in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 730-731, Maui, Hawaii, June 1991.