Go to Alexandria's home page
The Library of Alexandria

Windows Vision

Alexandria Home | Up One Level

Last updated 2/26/2005

Table of Contents

The Premise
The Core Algorithm
Expansion
Expansion's Shortcomings and Challenges
Contraction
Conclusions
Download the Demonstration Program

Back up to the table of contents The Premise

While thinking about getting a computer to have a better understanding of the visual world we inhabit, I realized there was a small class of visual tasks that merits study: the traditional, windowed, graphical user interface (GUI) most computer users are familiar with. We easily grasp and take for granted an understanding of complex visual concepts such as frame windows, icons, text boxes, and so forth. Why can't a machine do the same?

Having a full grasp of all of what one can visually see in a windowed operating system like Microsoft Windows, Mac OS, or X Windows would require a true, general-purpose vision system. I reasoned, though, that it should be possible to achieve practical goals by making a computer be able to recognize the primitives that we commonly encounter on a typical computer screen as such and, in doing so, be further able to do things like recognize which programs are open by pattern matching or identify regions of text for traditional text recognition. These in turn could form the basis of more sophisticated capabilities, like having programs providing more proactive help and advice based on what we're doing, having help systems that allow one to point at parts of a window instead of keying in search terms, providing enhanced automated monitoring of computer usage in high security facilities, or even helping the blind or otherwise handicapped to use GUIs.

Back up to the table of contents The Core Algorithm

The heart of this concept of getting a computer to eat its own dog food for a change is a fairly simple premise that can be translated into an algorithm. Most windows and controls are composed primarily of rectangular regions within regions. We'll call them "blocks". Study your own screen for a while and you'll see what I mean. Some programs such as media players that have artistic "skins" and many graphics-heavy web pages are harder to characterize this way, but let's leave them aside for purposes of this discussion.

This simple observation that window components are composed of blocks has important implications for vision. Let me further claim that most of these blocks have mostly smoothly connected outer margin pixels. A typical push-button, for example, will have text that sits on a mostly gray block. That block in turn sits inside some other blocks that may be a few pixels wide and give the button a beveled appearance. Around that may be a black border to clearly demarcate this button from the rest of the window in which it exists.

To be sure, there are plenty of blocks that are not mostly monotone like this. Many buttons, for example, have graphical icons that fill up most of their open space. Still, I find that even those tend to have smoothly connected outer margin pixels.

That most UI blocks have smoothly connected outer margins opens up the possibility of creating an algorithm for finding UI blocks by finding the outermost edges of those smooth margins. Consider the figure below, which illustrates this concept:

Figure 1
Figure 1: Expansion or contraction to find block boundaries.

This is a magnified view of the calculator program that comes with MS Windows. The algorithm under consideration includes two core operations for finding UI blocks: expanding outward and contracting inward. The red band and arrows illustrate expansion from a starting point and the orange band and arrows illustrate contraction from a larger starting rectangle.

The goal of an expansion algorithm would be to take a starting rectangle that is already located within the bounds of a UI block and have its own outer edges creep gradually outward until they find a continuously smooth outer margin and then until they find an abrupt change from that margin, presumably to some other, external UI components.

The goal of a contraction algorithm would be to take a rectangle that is already located outside the bounds of a UI block and have its own outer edges creep gradually inward until they find one or more continuously smooth outer margins just inside abrupt changes from that margin. This is more complicated than expansion, though, because such an algorithm would have to be able to deal with the possibility of multiple candidate blocks and be able to ignore ones that are below a certain size or complexity threshold to be considered of interest.

Back up to the table of contents Expansion

The careful observer might already have noticed that in the illustration above, the buttons are not perfectly rectangular. They have rounded corners. And that the button faces are not homogeneously colored. They have softened bevels around them. Another case that this image doesn't illustrate is when one has a large text box with text that spills off to the right of the visible window of text, such that those characters at the right edge leave no white margin before the boundary. These are some of the most basic challenges I identified in thinking about this algorithm. They are actually fairly easy to solve.

I had originally considered using a modified version of a basic flood-fill algorithm and identifying the extents of the flooded region. One problem I saw with doing this, though, is that the start point could be in the middle of an icon or piece of text or some other confounding region. I considered a variant of this idea that would involve picking start points that spiral outward from a center point until there is a sufficiently large area filled that has a mostly smooth, rectangular outer edge. This may actually work in some cases, but I worried about what might happen if there were text or other modest pixels that would stop a proper flooding? And what would happen if there was a pixel adjacent to the main block that had a color close enough to the flood fill's threshold so that the flooding spills out beyond the block?

Another approach I considered was to have an expanding box in which all pixels' colors would be sampled. Take the most common color. In many cases, that'll be the blocks' predominant color and can be used to see if the outer edges are mostly made up of it. But then, what about the case where most of the control is made up of a complex graphic? Only a small percent of the block's smooth outer margin would be visible.

It occurred to me that one way to do this that might be simpler and even save on computation would be to use a modified version of the above in which only the pixels of the outermost boundary of the current sampling region are considered. The goal would be to study the thin vertical and horizontal strips of pixels on the outer boundary of this sampling region to see if they contain mostly smoothly continuous colors. Because modern user interfaces more and more involve soft gradients from one color to another, often to convey a three dimensional feel, I thought it would be worthwhile to define "continuously smooth" in such a way that would allow for subtle changes in color from pixel to pixel; i.e., a gradient.

The key function used in this expansion algorithm analyzes a horizontal or vertical strip of pixels. The figure below illustrates this.

Figure 2
Figure 2: The search for a linear gradient.

Starting from the left (or top) and working to the right (or bottom), we look to see if each subsequent pixel is almost the same color as the previous one. We count up how many pixels do fit this simple criterion. When a pixel is found with a strongly different color, it is not counted. Analysis continues to the next pixel, looking to see if it continues the trend of the last pixel that was considered a match. This continues until all pixels in the strip are analyzed. The result of this first pass is a strength value that is calculated as the number of pixels fitting the trend divided by the total number of pixels in the strip. The strength's value is in the range from zero to one (or 0% to 100%, if you prefer). If fewer than half of the pixels fit a smooth trend, the algorithm doesn't bother returning a strength other than zero.

Because the first pixel might itself actually not fit the predominating trend -- remember the curved corners in the button example of figure 1; or the leftmost pixel in figure 2? -- the test is run again, starting from the second pixel and, assuming the previous pixels are not matching, continues forward. The test repeats again, starting from the third pixel. This continues forward until the start point is at the halfway point. Because a strength of less than 50% is unacceptable, there's no point in starting past the halfway point.

With each analysis pass of the given strip of pixels, we keep track of the analysis that yielded the highest strength value for the candidate gradient that began with the color of the first pixel tested. It's that strongest candidate that we'll eventually return. As a short-cut, if all pixels from the start point onward fit the gradient, there's no need to continue the analysis further because there can't be a stronger match to come out of subsequent passes.

The algorithm I experimented with is able to do the above and does demonstrate an ability to find mostly smoothly continuous linear gradients. When all four edges of a test rectangle are considered, the effect is quite useful. As the rectangle grows, one pixel at a time, we keep track of what the previous boundary was like, looking for the transition from "rough" to "smooth" edges and, moreover, for an abrupt change in color that would indicate we've reached the outer limit for that edge.

The next level up involves defining four predominating-gradient test strips together as a test rectangle. Given a starting point (or a small starting rectangle) that is inside a block to be found, this test rectangle expands outward one pixel length at a time. Each edge of the rectangle moves outward like a wave front, seeking a mostly smooth plain where the strength of the predominating gradient is above a predefined threshold. Once it finds that smooth plain, it will only stop advancing once it finds either a break in the smoothness or another smooth plain that has a significantly different color gradient. In figure 1, for example, the left edge of the red test box should stop advancing when it meets the black border around the button.

The "linear predominating gradient" algorithm I wrote in my demo program returns, among other things, the color of the left-most and right-most matching pixels in the gradient. I started out with the assumption that a proper expansion algorithm should be such that the test rectangle's "corners" should have the same colors and so form a roughly continuous gradient of color. I would thus test to make sure that the top edge's leftmost color matches the left edge's topmost color, for example, and so on with all four corners. But two things made me realize this is unnecessary and even undesirable. It should be unnecessary because a mostly smoothly continuous block's edges will, by definition, also be smoothly continuous. As such, the edges will not need to "compare notes" about each others' predominating gradients' colors.

Second, and more interestingly, I realized such a comparison would actually ignore an important feature of many modern UI components; that they have beveled edges. The following figure illustrates this:


Figure 3: Many UI elements have bevels -- lighter and darker outer edges that give a 3D look.

As you can see here, there are bevels around the four gray buttons, around the text box, to the left of the text box, surrounding the toolbars these elements are on, and so forth. While it's harder to make out, the outermost window frame (blue), caption bar, and main min / max / close buttons are also beveled.

These bevels might at first seem to confound the expansion algorithm because they don't fit the strict definition of a continuously smooth block. But think of them instead as separate little smooth edges. Our expansion algorithm is only testing edges. If we do not require the edges being tested to share the same basic colors to their corners, it becomes clear that a test rectangle that rests on a bevel should generally match it. And expanding outward from there should mean eventually finding hard edges to the bevel that indicate the outermost bounds of the bevel, itself.

One assumption the expansion concept makes is that expanding outward means finding that blocks are generally nested neatly inside of other blocks. One general function one might wish to perform using expansion alone is to find a hierarchy of nested blocks, starting from a single point and working outward to the entire desktop. Each time a block is found, a new test rectangle is created that is one pixel unit wider than the block found. One might expect a search starting within the minimize button in figure 3, for example, to first find the gray button face, then the white and dark gray bevel around that, then maybe the toolbar on which it sits, then the frame window containing it, and so on.

What makes the bevels interesting is that they can be used to help identify what a given block represents. We could craft a rule, for example, that takes note of cases where a block with another block that is one or two pixels around the first block and which is a bevel (different colored edges). We might then note the width and height and find the percentage of pixels inside the inner block that fit a smooth gradient. We could test those against a database of known block types and then add further logic -- perhaps a neural network -- to see if the image inside matches a known type of icon, for example. Or we might run an optical character recognition (OCR) scan against the box to see if there is recognizable text. Being able to recognize bevels can be quite helpful, then, in the area of interpretation of the results of expansion testing.

Back up to the table of contents Expansion's Shortcomings and Challenges

In a relatively short amount of time, I was able to create a simple demonstration of the expansion concept. Doing so helped (forced) me to clarify many of the concepts described above, but it also made clear some of the limitations of the concept thus far. Figure 4 illustrates some of them.


Figure 4: Screen shot of the demonstration program showing regions found when the user clicked certain spots in the image.

Each of the red boxes is the output of a single expansion test that began with a 10- by 10-pixel rectangle centered wherever I clicked my mouse on the image.

One thing that becomes immediately clear is how brittle the algorithm is when it comes to dealing with text. When I came up with the notion of mostly continuously smooth gradients punctuated by "dirt", it was in response to an expectation that an expansion algorithm would have to contend with text. What I didn't realize is that, for a given strip of pixels, a line of text can actually contribute a large percentage of the color (here, black pixels) to that strip (here, otherwise white). One key problem was setting an acceptable threshold for the strength of the predominating gradient of a test strip. Set it too low and the test rectangle will spill out beyond some blocks. Set it high enough to avoid that and text and other features might trip it up.

I don't consider this particular malady to be unsolvable, though. I got far enough with my algorithm, I think, to have proven the expansion concept. Had I continued, I could, for example, have changed the algorithm to consider the predominating color gradient of the rectangle, if it's not a bevel, to look for cases like this where that gradient "oozes through" an apparent barrier like a line of text or a dashed line.

Another challenge is with cases like the toolbar in figure 4 where the buttons do not actually have clearly demarked outer boundaries. We perceive the buttons in part because we perceive strong boundaries around the buttons' icons and because we see a grid-like regularity of the sizes and arrangement of the icons. I think this is where the contraction concept will play a more important role. A test rectangle starting with the toolbar's outer edge and working its way in should naturally stop shrinking when it finds the icons and other controls and decorations. Perceiving their grid-like regularity shouldn't be too hard, either.

Another interesting challenge comes with partial bevels. Some toolbars and other windows have bevels that may only appear on one or two edges. Expansion might fail to properly find these or may consider them to be a part of larger bevels in the outer-more blocks. It's not immediately apparent to me how to deal with that case. One possibility would to be see if a test rectangle has three edges that are smoothly connected and the fourth one is different. That edge could be pushed out one more pixel length and the test continued, for example.

One important shortcoming I've ignored for now is that fact that sometimes UI blocks overlap each other. The most obvious case is when an entire framed window partly covers another one. I believe the expansion algorithm can be enhanced with extra information that might come from other sources (like contraction) and thus be able to morph around known boundaries like a window's corner to still find partly covered text boxes, buttons, and so on.

Perhaps one of the most obvious limitations of this algorithm comes in the form of the question: where do we start our expansion search? In my demo program, I clicked on parts of the image of a desktop. In an interactive help system, it might be that we just follow the cursor as it moves, performing the test from time to time. But is that enough? I think it's worth invoking the fovea concept here. The human eye uses an analogous model. Its center is packed full of retinal cells that see in great detail, whereas the broader expanse of the retina is sparely populated for use in tracking broader shapes and movement. Perhaps its fair to think of an agent that's watching a desktop as being responsible for deciding where to focus on the screen instead of having a "gods eye view" from moment to moment. We don't, so why should it?

Besides the functional shortcomings of the algorithm I quickly wrote and tested, I was quite surprised and disappointed at the performance of it. A larger block match like the one in figure 4 inside the main text area would typically take a few seconds to find. Part of the cost surely comes from the fact that there are several analytical multipliers -- that a single linear gradient means scanning the strip fully and repeatedly; that a single test rectangle involves four such gradients; that expansion of the rectangle means testing the whole rectangle with each step; and so on. But I'm willing to bet that a lot of the cost is probably related to the fact that I did little to optimize the algorithm. Although I didn't bother doing time testing of various components of the algorithm, I'd be willing to bet that simple things like the GetPixel() and SetPixel() routines I used to get the colors of individual pixels are probably gluttonously slow.

Still, while I have no doubt the algorithm I wrote could easily be rewritten to run 10 to 100 times faster, that wasn't the point of this experiment. And I think the code is fairly readable, which I consider more important with novel algorithms.

Back up to the table of contents Contraction

I haven't talked much about contraction because I didn't do any actual experimentation into it. The basic concept, though, is to proceed in the opposite direction as expansion: to start with a test rectangle and work it gradually inward in order to find the blocks that are immediately inside the block that the test rectangle starts with.

One key difference between expansion and contraction that makes contraction more complicated is that we don't simply search for a single inner bounds to a smoothly continuous block that marks where other things begin. We want to be able to recognize multiple punctuating blocks. This clearly adds a significant amount of extra complexity. One possible way to deal with it would be to use a flood-fill approach as a starting point. Because there would invariably be small "leaks", we would have to study the edges formed to look for cases where leaks in strong edges are the result of only a small fraction of boundary pixels' colors being in common with inside pixels' colors.

Another approach might be to take note, when a boundary reaches a break in the smoothness, where the breaks are and to push new test rectangles into those breaks to probe them to see if, by a modified sort of expansion, they are significantly sized. This could get to be quite an expensive analysis, though, in terms of time and memory usage.

One important goal for a contraction algorithm would be to study the outcome to look for grid-like regularities. This should not be difficult. One fairly simple way would involve starting with a collection of all the inner blocks found directly inside this one. Each one's top edge would be tested to see if it is at the same level as each other's top edge. The same would apply to the left, right, and bottom edges. One ideal result would be the construction of a simplified model of the layout in terms of a checkerboard-like grid onto which each constituent rectangle is placed. For example, a simple popup message box that contains a block of text and "OK" and "Cancel" buttons might be represented with a 2 x 2 grid where the top two boxes represent the single text block and the two boxes on the bottom would represent the "OK" and "Cancel" buttons, respectively. This kind of simplified model could be used with neural networks and other traditional pattern matching techniques to help identify what this part of a window might represent.


Figure 5: A simplified block model of a simple message box.

Back up to the table of contents Conclusions

Although I did not achieve the stated goal in the beginning of this document, a small amount of coding and experimenting did show me that the basic concept is sound. I may in time come back to study this algorithm more, but this is good enough for now.

I conclude that the dual concepts of expansion and contraction can be used in helping to characterize a graphical user interface. In conjunction with other traditional AI techniques and a sizable enough knowledge base, these techniques should help to divide and conquer the complexity of such GUIs.

Expansion and contraction are very handy for use in this narrow class of machine vision problems. I do not mean to suggest they should be used as described here for more general machine vision, but I do want to stress the importance of this divide and conquer approach to vision. Expansion and contraction are discrete, primitive techniques that can be used to place the boundaries around subjects of interest that can be the beginning of a much more sophisticated analysis. The algorithm may not be reusable, but the general approach should be.

I began down this road with the premise that a practical, general purpose visual system must use such tricks to preprocess information. My "Texture Vision" experiments began with a related premise that most objects we see are made up of smooth gradients or repeating patterns. I think our eyes see these directly and save our visual systems a lot of trouble by preprocessing them.

In conclusion, I'm pretty happy with the results of this experiment. I think there are plenty of practical applications of this algorithm, and I encourage readers to explore them further.

Back up to the table of contents Download the Demonstration Program

The demonstration code I wrote for this project was written using Visual Studio .NET 2003. Most of the core code is in C#, but the user interface is written in VB.NET. If you wish to review the source code, click on the following link to download a ZIP file containing the entire solution folder:

Download the ZIP file

Back up to the table of contents Your Feedback

I encourage you to let me know what you think of this project or if you intend to use this in your own projects. Does this help solve some problems for you? What limitations get in the way of your research? To be sure, I don't really want to be a help desk for the software, but you're welcome to ask if you want some advice on how to apply these concepts to your work. Drop me a line.

Send me email.


Go to Alexandria's home page Copyright © 2012 The Library of Alexandria. All rights reserved.
Produced in cooperation with Carnell Information Systems, Inc.