Interpreting the Structure of Single
Images by Learning from Examples
Osian Haines
A dissertation submitted to the University of Bristol in accordance with the
requirements for the degree of Doctor of Philosophy in the Faculty of Engineering,
Visual Information Laboratory.
October 2013
52000 words
Abstract
One of the central problems in computer vision is the interpretation of the
content of a single image. A particularly interesting example of this is the
extraction of the underlying 3D structure apparent in an image, which is
especially challenging due to the ambiguity introduced by having no depth
information. Nevertheless, knowledge of the regular and predictable nature of
the 3D world imposes constraints upon images, which can be used to recover
basic structural information.
Our work is inspired by the human visual system, which appears to have
little difficulty in interpreting complex scenes from only a single viewpoint.
Humans are thought to rely heavily on learned prior knowledge for this. As
such we take a machine learning approach, to learn the relationship between
appearance and scene structure from training examples.
This thesis investigates this challenging area by focusing on the task of plane
detection, which is important since planes are a ubiquitous feature of human-
made environments. We develop a new plane detection method, which works
by learning from labelled training data, and can find planes and estimate
their orientation. This is done from a single image, without relying on explicit
geometric information, nor requiring depth.
This is achieved by first introducing a method to identify whether an individ-
ual image region is planar or not, and if so to estimate its orientation with
respect to the camera. This is done by describing the image region using
basic feature descriptors, and classifying against training data. This forms
the core of our plane detector, since by applying it repeatedly to overlapping
image regions we can estimate plane likelihood across the image, which is
used to segment it into individual planar and non-planar regions. We evaluate
both algorithms against known ground truth, with good results, and compare
them to prior work.
We also demonstrate an application of this plane detection algorithm, show-
ing how it is useful for visual odometry (localisation of a camera in an un-
known environment). This is done by enhancing a planar visual odometry
system to detect planes from one frame, thus being able to quickly initialise
planes in appropriate locations, avoiding a search over the whole image. This
enables rapid extraction of structured maps while exploring, and may increase
accuracy over the baseline system.
Declaration
I declare that the work in this dissertation was carried out in accordance with the Reg-
ulations of the University of Bristol. The work is original, except where indicated by
special reference in the text, and no part of the dissertation has been submitted for any
other academic award.
Any views expressed in the dissertation are those of the author and in no way represent
those of the University of Bristol.
The dissertation has not been presented to any other University for examination either
in the United Kingdom or overseas.
SIGNED:
DATE:
Acknowledgements
I would like to begin by thanking my supervisor Andrew Calway, for all his guidance and
sage advice over the years, and making sure I got through the PhD. I would also like to
thank Neill Campbell and Nishan Canagarajah, for very useful and essential comments,
helping to keep me on the right track. My special thanks go to José Martínez-Carranza
and Sion Hannuna, who have given me so much help and support throughout the PhD – I
would not have been able to do this without them. Thanks also to my former supervisors
at Cardiff, Dave Marshall and Paul Rosin, for setting me on the research path in the
first place.
I owe a large amount of gratitude to Mam, Dad and Nia, for being supportive throughout
the PhD, always being there when needed, for their encouragement and understanding,
and not minding all the missed birthdays and events.
I am enormously grateful to my fantastic team of proofreaders, whose insightful com-
ments and invaluable suggestions made this thesis into what it is. They are, in no particu-
lar order: David Hanwell, Austin Gregg-Smith, José Martínez-Carranza, Oliver Moolan-
Feroze, Toby Perrett, Rob Frampton, John McGonigle, Sion Hannuna, Nia Haines, and
Jack Greenhalgh.
I would like to thank all of my friends in the lab, and in Bristol generally, who have
made the past five years so enjoyable, and without whose friendship and support the
PhD would have been a very different experience. This includes, but is not limited
to: Elena, Andrew, Dima, Adeline, Lizzy, Phil, Tom, John, Anthony, Alex, Tim, Kat,
Louise, Teesid, and of course all of the proofreaders already mentioned above.
Thanks also to the British Machine Vision Association, whose helping hand has shaped the
PhD in various ways, including the excellent computer vision summer school, experiences
at the BMVC conferences, and essential financial help for attending conferences toward
the end of the PhD.
Finally, thanks to Cathryn, for everything.
Publications
The work described in this thesis has been presented in the following publications:
1. Visual mapping using learned structural priors (Haines, Martínez-Carranza and
Calway, International Conference on Robotics and Automation 2013) [59]
2. Detecting planes and estimating their orientation from a single image (Haines and
Calway, British Machine Vision Conference 2012) [57]
3. Estimating planar structure in single images by learning from examples (Haines
and Calway, International Conference on Pattern Recognition Applications and
Methods 2012) [58]
4. Recognising planes in a single image (Haines and Calway, submitted to IEEE
Transactions on Pattern Analysis and Machine Intelligence)
For Cathryn
Contents

List of Figures
List of Tables

1 Introduction
   1.1 Perception of Single Images
   1.2 Motivation and Applications
   1.3 Human Vision
   1.4 On the Psychology of Perception
      1.4.1 Hypotheses and Illusions
      1.4.2 Human Vision Through Learning
   1.5 Machine Learning for Image Interpretation
   1.6 Directions
   1.7 Plane Detection
   1.8 Thesis Overview
   1.9 Contributions

2 Background
   2.1 Vanishing Points
   2.2 Shape from Texture
   2.3 Learning from Images
      2.3.1 Learning Depth Maps
      2.3.2 Geometric Classification
   2.4 Summary

3 Plane Recognition
   3.1 Overview
   3.2 Training Data
      3.2.1 Data Collection
      3.2.2 Reflection and Warping
   3.3 Image Representation
      3.3.1 Salient Points
      3.3.2 Features
      3.3.3 Bag of Words
      3.3.4 Topics
      3.3.5 Spatiograms
   3.4 Classification
      3.4.1 Relevance Vector Machines
   3.5 Summary

4 Plane Recognition Experiments
   4.1 Investigation of Parameters and Settings
      4.1.1 Vocabulary
      4.1.2 Saliency and Scale
      4.1.3 Feature Representation
      4.1.4 Kernels
      4.1.5 Synthetic Data
      4.1.6 Spatiogram Analysis
   4.2 Overall Evaluation
   4.3 Independent Results and Examples
   4.4 Comparison to Nearest Neighbour Classification
      4.4.1 Examples
      4.4.2 Random Comparison
   4.5 Summary of Findings
      4.5.1 Future Work
      4.5.2 Limitations

5 Plane Detection
   5.1 Introduction
      5.1.1 Objective
      5.1.2 Discussion of Alternatives
   5.2 Overview of the Method
   5.3 Image Representation
   5.4 Region Sweeping
   5.5 Ground Truth
   5.6 Training Data
      5.6.1 Region Sweeping for Training Data
   5.7 Local Plane Estimate
   5.8 Segmentation
      5.8.1 Segmentation Overview
      5.8.2 Graph Segmentation
      5.8.3 Markov Random Field Overview
      5.8.4 Plane/Non-plane Segmentation
      5.8.5 Orientation Segmentation
      5.8.6 Region Shape Verification
   5.9 Re-classification
      5.9.1 Training Data for Final Regions
   5.10 Summary

6 Plane Detection Experiments
   6.1 Experimental Setup
      6.1.1 Evaluation Measures
   6.2 Discussion of Parameters
      6.2.1 Region Size
      6.2.2 Kernel Bandwidth
   6.3 Evaluation on Independent Data
      6.3.1 Results and Examples
      6.3.2 Discussion of Failures
   6.4 Comparative Evaluation
      6.4.1 Description of HSL
      6.4.2 Repurposing for Plane Detection
      6.4.3 Results
      6.4.4 Discussion
   6.5 Conclusion
      6.5.1 Saliency
      6.5.2 Translation Invariance
      6.5.3 Future Work

7 Application to Visual Odometry
   7.1 Introduction
      7.1.1 Related Work
   7.2 Overview
   7.3 Visual Odometry System
      7.3.1 Unified Parameterisation
      7.3.2 Keyframes
      7.3.3 Undelayed Initialisation
      7.3.4 Robust Estimation
      7.3.5 Parameters
      7.3.6 Characteristics and Behaviour of IDPP
   7.4 Plane Detection for Visual Odometry
      7.4.1 Structural Priors
      7.4.2 Plane Initialisation
      7.4.3 Guided Plane Growing
      7.4.4 Time and Threading
      7.4.5 Persistent Plane Map
   7.5 Results
   7.6 Conclusion
      7.6.1 Fast Map Building
   7.7 Future Work
      7.7.1 Comparison to Point-Based Mapping
      7.7.2 Learning from Planes

8 Conclusion
   8.1 Contributions
   8.2 Discussion
   8.3 Future Directions
      8.3.1 Depth Estimation
      8.3.2 Boundaries
      8.3.3 Enhanced Visual Mapping
      8.3.4 Structure Mapping
   8.4 Final Summary

References
List of Figures

1.1 Projection from 3D to 2D — depth information is lost
1.2 Some configurations of 3D shapes are more likely than others
1.3 Examples of images
1.4 Images whose interpretation is harder to explain
1.5 The hollow face illusion
1.6 Example outputs of plane recognition
1.7 Examples of plane detection
2.1 Examples from Košecká and Zhang
2.2 Shape from texture result due to Gårding
2.3 Some objects are more likely at certain distances
2.4 Illustration of Saxena et al.'s method
2.5 Typical outputs from Saxena et al.
2.6 Illustration of the multiple-segmentation of Hoiem et al.
2.7 Example outputs of Hoiem et al.
3.1 How the ground truth orientation is manually specified
3.2 Examples of training data
3.3 Examples of warped training data
3.4 The quadrant-based gradient histogram
3.5 Words and topics in an image
3.6 Topic visualisation
3.7 Topic visualisation
3.8 Topic distributions in 2D
4.1 Performance as the size of the vocabulary is changed
4.2 Effect of different patch sizes for saliency detection
4.3 Comparison of different kernel functions
4.4 Adding synthetically generated training data
4.5 Region shape itself can suggest orientation
4.6 Distribution of orientation errors for cross-validation
4.7 Distribution of errors for testing on independent data
4.8 Example results on an independent dataset
4.9 Examples of correct detection of non-planes
4.10 Examples where the algorithm fails
4.11 Comparison of RVM and KNN
4.12 Example results with K-nearest neighbours
4.13 Example results with K-nearest neighbours
4.14 Example of failure cases with K-nearest neighbours
4.15 Comparison of KNN and random classification
5.1 Problems with appearance-based image segmentation
5.2 Examples of the four main steps of plane detection
5.3 Illustration of region sweeping
5.4 Examples of ground truth images for plane detection
5.5 Obtaining the local plane estimate from region sweeping
5.6 Thresholding on classification confidence
5.7 Examples of pointwise local plane estimates
5.8 Segmentation: planes and non-planes
5.9 Illustration of the kernel density estimate
5.10 Visualisation of the kernel density estimate of normals
5.11 Plane segmentation using estimated orientations
5.12 Colour coding for different orientations
6.1 Results when changing region size for sweeping
6.2 Changing the size of sweep regions for training data
6.3 Results for varying mean shift bandwidth
6.4 Distribution of orientation errors for evaluation
6.5 Hand-picked examples of plane detection on our independent test set
6.6 Some of the best examples of plane detection on our independent test set
6.7 Some typical examples of plane detection on our independent test set
6.8 Some of the worst examples of plane detection on our independent test set
6.9 Example of poor performance of plane detection
6.10 Hoiem et al.'s algorithm when used for plane detection
6.11 Comparison of our method to Hoiem et al.
6.12 Increased density local plane estimate
6.13 Colour coding for different orientations
6.14 Illustrating the re-location of a plane across the image
6.15 Using translation invariant versus absolute position spatiograms
7.1 Examples of plane detection masks
7.2 Comparing plane initialisation with IDPP and our method
7.3 Views of the 3D map for the Berkeley Square sequence
7.4 Views of the 3D map for the Denmark Street sequence
7.5 Examples of visual odometry with plane detection
7.6 Examples of plane detection
7.7 Trajectories overlaid on a map
7.8 Trajectories overlaid on a map
7.9 Frame-rate of our method, compared to the original
List of Tables

4.1 Comparison between gradient and colour features
4.2 Kernel functions used for the RVM
4.3 Evaluating spatiograms with circular regions
6.1 Comparison to Hoiem et al.
6.2 Comparison of absolute and zero-mean spatiograms
7.1 Comparison between IDPP and PDVO
CHAPTER 1
Introduction
A key problem in computer vision is the processing, perception and understanding of
individual images. This is an area which includes heavily researched tasks such as object
recognition, image segmentation, face detection and text recognition, but can also involve
trying to interpret the underlying structure of the scene depicted by an image. This is a
particularly interesting and challenging problem, and includes tasks such as identifying
the 3D relationships of objects, gauging depth without parallax, segmenting an image
into structural regions, and even creating 3D models of a scene. However, in part due to
the difficulty and ill-posed nature of such tasks, it is an area which – compared to object
recognition, for example – has not been so widely studied.
One reason why perceiving structure from one image is difficult is that unlike many
tasks in image processing, it must deal with the fact that depth is ambiguous. Short of
using laser scanners, structured light or time-of-flight sensors, an individual image will
not record absolute depth; and with no parallax information (larger apparent motion
in the image for closer objects, in a moving camera or an image pair), it is impossible
to distinguish even relative depths. Furthermore, when relying on the ambiguous and
potentially low resolution pixel information, it remains difficult even to tell which regions
belong to continuous surfaces – an important problem in image segmentation – making
extraction of structure or surface information challenging.
Nevertheless, recent work has shown that substantial progress can be made in perceiving
3D structure by exploiting various features of single images: they may be ambiguous, but
there is still sufficient information to begin perceiving the structures represented. This
involves making use of cues such as vanishing points and rectilinear structure, shading
and texture patterns, or relating appearance directly to structure.
Motivated by the initial success of such techniques, and driven by the potential benefits
that single-image perception will bring, this thesis focuses on investigating new methods
of interpreting the 3D structure of a single image. We believe this is a very interesting
task theoretically, since despite the considerable difficulties involved, some kinds of single
image structure perception do indeed seem to be possible. Recent developments in
this area also highlight that it is of great practical interest, with applications in 3D
reconstruction [66, 113], object recognition [65], wide-baseline matching [93], stereopsis
[111] and robot navigation [73, 92], amongst others.
Furthermore, this is an interesting topic because of its relationship to biological vision
systems, which as we discuss below seem to have little difficulty in interpreting complex
structure from single viewpoints. This is important, since it shows that a sufficiently
advanced algorithm can make good use of the very rich information available in an image;
and suggests that reliable and detailed perception from even a single image must be in
principle possible. The means by which biological systems are thought to do this even
hint at possible ways of solving the problem, and motivates striving for a more general
solution than existing methods. In particular, humans’ ability to take very limited
information, and by relating it to past visual experiences generate a seemingly complete
model of the real world, is something we believe we can learn much from, as we discuss
in depth below.
1.1 Perception of Single Images
In general, extracting the 3D structure from a 2D image is a very difficult problem,
because there is insufficient information recorded in an image. There is no way of re-
covering depth information directly from the image pixels, due to the nature of image
formation. Assuming a pin-hole camera model, all points along one ray from the cam-
era centre, toward some location in space, will be projected to the same image location
(Figure 1.1), which means that from the image there is no way to work backwards to
find where on that line the original point was.
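To make this ambiguity concrete, the short sketch below (an illustration only, not code from the thesis; the focal length, principal point and ray direction are made-up values) projects several 3D points lying at different depths along the same ray through the camera centre of a pinhole camera, in the spirit of Figure 1.1, and shows that they all land on the same pixel.

```python
import numpy as np

# Minimal pinhole camera: focal length f, principal point (cx, cy).
# These values are illustrative only, not taken from the thesis.
f, cx, cy = 500.0, 320.0, 240.0
K = np.array([[f, 0, cx],
              [0, f, cy],
              [0, 0,  1]])

def project(X):
    """Project a 3D point X = (x, y, z) in camera coordinates to pixel coordinates."""
    x = K @ X
    return x[:2] / x[2]          # perspective division discards depth

# Points at different depths along the same ray through the camera centre:
direction = np.array([0.2, -0.1, 1.0])
for depth in [1.0, 2.5, 10.0]:
    X = depth * direction
    print(depth, project(X))     # the same pixel is printed for every depth
```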
Figure 1.1: This illustrates the ambiguity of projecting 3D points to a 2D
image. The circle in the image plane shows the projection of the red sphere, but
any one of the spheres along the ray through the camera centre will project to
the same location. We have shown the image plane in front of the camera centre
for visual clarity.
Because of this there could potentially be an infinite number of different 3D scenes which
lead to the same 2D image. For example, a configuration of irregular quadrilaterals,
appropriately placed and shaded, may falsely give the appearance of a street scene or a
building façade, when viewed from a particular vantage point. At the extreme, one may
even be looking at a picture of a picture, and have no way of knowing that all depths
are in fact equal (this is resolved as soon as multi-view information becomes available).
However, while this argument seems to imply that any extraction of structure with-
out 3D information is not possible, it is evident that although any of a number of 3D
configurations are technically valid, some are much more likely than others. While we
cannot for definite say that we are looking at a building, say, as opposed to a contrived
collection of shapes which happen to give that appearance after projection, the former
interpretation is much more plausible. We illustrate this in Figure 1.2, where out of two
possible 3D configurations for a 2D image, one is much more realistic. Pictures, most
of the time, depict actual physical objects which obey physical laws, and tend to be
arranged in characteristic ways (especially if human-made).
This observation – that the real world is quite predictable and structured – is what
makes any kind of structure perception from a single image possible, by finding the most
plausible configuration of the world out of all possible alternatives [133], based on what
we can assume about the world in general. Thus the notion of prior knowledge, either
explicitly encoded or learned beforehand, becomes essential for making sense of otherwise
confusing image data.
Figure 1.2: From even this very simple cartoon image of a house (a), it is
possible to recognise what it is, and from this imagine a likely 3D configuration
(b), despite the fact that, from a single image, the ambiguity of depth means
there are an infinite number of possible configurations, such as (c), which project
to an identical image. Such contrived collections of shapes are arguably much
less likely in the real world.
The different ways in which generic knowledge about the world is used leads to the various
computer vision approaches to detecting structure in single images. For example, parallel
world lines are imaged as lines which converge at a point in the image, and can be used
to recover the orientation of a plane; and the predictable deformation of texture due
to projection is enough to recover some of the viewing information. We go into more
detail about these possibilities in the following chapter, but for now it suffices to say
that, despite the difficulty of the problem, a number of attempts have been made toward
addressing it, with considerable success.
1.2 Motivation and Applications
The task of extracting structure from single images is interesting for several reasons.
As stated above, it is a difficult and ill-posed problem, since the information in one
image cannot resolve the issue of depths; and yet by exploiting low-level cues or learning
about characteristic structures, it is possible to recover some of this lost information.
It is also an interesting task to attempt since it has some biological relevance: because
it is something humans have little difficulty with, it suggests ways of addressing the
problem, and gives us a baseline against which to compare. An algorithm
attempting to emulate the psychology of vision may also shed some light on unknown
details of how this is achieved, if it is sufficiently physiologically accurate, although this
is not a route we intend to investigate.
From a practical point of view, the potential of being able to interpret single images would
have a number of interesting applications. For example, being able to perceive structure
in a single image would allow quick, approximate 3D reconstructions to be created, using
only one image as input [4, 66]. For some tasks, nothing more sophisticated than
knowledge of the structures themselves would be necessary — that is, the geometry
alone is sufficient. For example, reconstructing rough 3D models when limited data are
available (historical images, for example, or more prosaically to visualise holiday photos),
or to extend the range and coverage of multi-view reconstructions [113].
Alternatively, a deeper understanding of the visual elements would make possible a va-
riety of interesting applications. For example, perception of the underlying structural
elements or their relationships would be useful in reconstructing 3D models where in-
sufficient data are available to see the intersections of all surfaces or retrieve the depth
of all pixels [113]. It would also be useful for providing context for other tasks, such as
object recognition, acting as a prior on likely object location in the image [65].
Knowledge of structure would also be very useful to tasks such as mapping and robot
navigation. Usually, when exploring a new and unknown environment, all that can be
sensed is the location of individual points from different positions, from which a 3D
point cloud can be derived. If however there was a way of gleaning knowledge of the
structure apparent in the image, this could be used to more quickly create a map of
the environment, and build a richer representation. Indeed, knowledge of higher level
structure (obtained by means other than single-image perception) has been very useful in
simplifying map representations and reducing the computation burden of maintaining the
map [47, 88]. We speculate that being able to more quickly derive higher-level structures
would bring further benefits, in terms of scene representation and faster initialisation of
features. This is something we come back to in Chapter 7.
1.3 Human Vision
As discussed above, interpreting general structure from single images is a significant
challenge, and one which is currently far from being solved. On the other hand, we
argue that this must in principle be possible, due to the obvious success of human (and
other animal) vision systems. To a human observer, it appears immediately obvious when
viewing a scene (and crucially, even an image of the scene, void of stereo or parallax
cues) what it is, with an interpretation of the underlying structure good enough to make
Figure 1.3: Examples of images which can be interpreted without difficulty by
a human, despite the complex shapes and clutter.
judgements about relative depths and spatial relationships [51]. Despite the complexity
of the task, this even appears subjectively to happen so fast as to be instantaneous [124],
and in reality cannot take much more than a fraction of a second.
The human vision system is remarkably adept at interpreting a wide range of types of
scene, and this does not appear to depend on calculation from any particular low-level
features, such as lines or gradients. While these might seem to be useful cues, humans
are quite capable of perceiving more complex scenes where such features are absent or
unreliable. Indeed, it is difficult to articulate precisely why one sees a scene as one does,
other than simply that it looks like what it is, or looks like something similar seen before.
This suggests that an important part of human vision may be the use of learned prior
experience to interpret new scenes, which is why complex scenes are so ‘obvious’ as to
what they contain.
For example, consider the image of Figure 1.3a. This is clearly a street scene, comprising
a ground plane with two opposing, parallel walls. The geometric structure is quite ap-
parent, in that there are a number of parallel lines (pointing toward a common vanishing
point near the centre of the image). It is plausible that this structure is what allows
the shape to be so easily perceived, and indeed this has been the foundation of many
geometry-based approaches to single-image perception [73, 93]. However, consider Fig-
ure 1.3b, which shows a similar configuration of streets and walls, and while it remains
obvious to a human what this is, there are considerably fewer converging lines, and the
walls show a more irregular appearance.
Figure 1.4: These images further highlight the powers of human perception.
The grassy ground scattered with leaves can easily be seen as sloping away from
the viewer; and the reflections in a river are seen to be of trees, despite the lack
of tree-like colour or texture.
A more extreme example is shown in Figure 1.3c, where the main structures are obscured
by balconies and other obtrusions, and people occlude the ground plane. This would be
difficult for any algorithm attempting to recover the scene structure based on explicit
geometric constructs, or for general image segmentation — and yet the human brain still
sees it with ease. This example in particular is suggestive that perception is a top-down
process of interpretation, rather than building up from low-level cues, and depends upon
a wealth of visual experience for it to make sense.
Finally, we demonstrate how humans can perceive structural content in images with
limited or distorted information. In Figure 1.4a, a grassy plane is shown, strewn with
fallen leaves. Despite the lack of any other objects for context or scaling, nor a uniform
size of leaves (and where cues from foreshortening of texture are quite weak), it is still
possible to see the relative orientation of the ground. Another interesting example where
the information available in the image itself is ambiguous is in Figure 1.4b, taken from
a picture of a riverside. One can see without much difficulty that this depicts trees
reflected in water. However, there is very little in the reflections corresponding to the
features generally indicative of trees. The branching structure is completely invisible,
the overall shape is not particularly tree like, and even the colour is not a reliable cue.
As before, this gives the impression that these objects and structures are perceptible by
virtue of prior experience with similar scenes, and knowing that trees reflected in water
tend to have this kind of distorted shape. The alternative, of recovering shapes from the
image, accounting for the distortion produced by the uneven reflections, and matching
them to some kind of tree prototype, seems rather unlikely (similar effects are discussed
in [51]).
Of course, despite these abilities, a human cannot guess reliable depth estimates nor
report the precise orientation of surfaces in a scene. While they may have the impression
that they can visualise a full model of the scene, this is rarely the case, and much of the
actual details of structures remain hidden. Humans tend to be good at perceiving the
general layout of a scene and its content, without being able to describe its constituent
parts in detail (the number of branches on a tree or the relative distances of separated
objects, for example). Then again, the fact that such details are not even necessary
for a human to understand the gross layout is very interesting, hinting that the ability
to describe a scene in such detail is a step too far. This is an important point for
an algorithm attempting to follow the success of human vision since it suggests that
attempting to fully recover the scene structure or depth may not be necessary for a
number of tasks, and a limited perception ability can be sufficient.
These insights into the powers of human vision are what inspire us to take on the
challenge of using computer vision techniques to extract structure from single images,
and suggests that a learning-based approach is a good way to address this. In the
next section we take a deeper look at the details of human perception, focusing on how
learning and prior knowledge are thought to play a crucial role.
1.4 On the Psychology of Perception
Having described how the impressive feats of human perception inspire us to build a
learning-based system, we now consider how humans are believed to achieve this.
The mechanics of human vision – the optics of the eye, the transmission of visual stimuli
to the brain, and so on – are of little interest here, being less germane to our discussion
than the interpretation of these signals once they arrive. In this section we give a brief
overview of current theories of human perception, focusing on how prior knowledge forms
a key part of the way humans so easily perceive the world.
An important figure in the psychology of human perception is James J. Gibson, whose
theory of how humans perceive pictures [50] focused on the idea that light rays emanating
from either a scene or a picture convey information about its contents, the interpretation
of which allows one to reconstruct what is being viewed. As such, vision can be considered
a process of picking up information about the world, and actively interpreting it, rather
than passive observation. This contrasted strongly with previous theories, which claimed
for example that the light rays emanating from a picture, if identical to those from a
real object, would give rise to the same perception.
One aspect of his theory is that what humans internally perceive are perceptual and
temporal invariants [49], encompassing all views of an object, not just the face currently
visible. A consequence is that in order to perceive an object as a whole there must be
some model in the mind of what it is likely to look like, since from any particular view,
the information about an object is incomplete, and must be supplemented by already
acquired internal knowledge to make sense. Thus, vision is a process of applying stored
knowledge about the world, in order to make sense of partial information. This certainly
implies some kind of learning is necessary for the process, even in cases of totally novel
objects.
Similarly, Ernst Gombrich, a contemporary of Gibson, likened an image to a trace left
behind by some object or event, that must be interpreted to recover its subject [51]. He
described this as requiring a “well stocked mind”, again clearly implying that learning
is important for perception. It follows that no picture can be properly interpreted on its
own, as being a collection of related 3D surfaces, without prior knowledge of how such
structures generally relate to each other, and how they are typically depicted in images.
For example, seeing an image of a building front as anything other than a jumble of
shapes is not possible without knowing what such shapes tend to represent.
These theories suggest that even though humans have the impression they perceive a
real and detailed description of the world, it is based on incomplete and inaccurate
information (due to occlusion, distance, limited acuity and simply being unable to see
everything at once), and prior beliefs will fill in the mental gaps. This inference and
synthesis is automatic and unconscious, and as Gombrich said, it is almost impossible to
see with an ‘innocent eye’, meaning perceive only what the eyes are receiving, without
colouring with subjective interpretations.
1.4.1 Hypotheses and Illusions
These ideas were developed further by Richard Gregory, reacting to what he perceived as
Gibson’s ignorance of the role of ‘perceptual intelligence’. This was defined as knowledge
applied to perception, as opposed to ‘conceptual intelligence’ (the knowledge of specific
facts). In this paradigm, perception is analogous to hypothesis generation [53], so that
vision can be considered a process of generating perceptual hypotheses (explanations of
the world derived from incomplete data received from the eyes), and comparing them to
reality. Recognition of objects and scenes equates to the formation of working hypotheses
Figure 1.5: The hollow face illusion: even when this face mask is viewed from
behind (right) the human brain more easily interprets it as being convex. This
suggests that prior experience has a strong impact on perception (figure taken
from [54]).
about the world, when the information directly available is insufficient to give a complete
description.
This also explains why familiar, predictable objects are easier to see: the mind more
readily forms hypotheses to cover things about which it has more prior experience, and
requires more evidence to believe the validity of an unusual visual stimulus. One exam-
ple of this is the hollow face illusion [54], shown in Figure 1.5, where despite evidence
from motion and stereo cues, the hollow interior of a face mask is mistakenly perceived
as a convex shape. This happens because such a configuration agrees much better with
conventional experience — in everyday life, faces tend to be convex. Interestingly, the
illusion is much more pronounced when the face is the right way up, further supporting
the theory that it is its familiarity as a recognised object which leads to the misinterpre-
tation. This is a striking example of where the patterns and heuristics, learned in order
to quickly generate valid hypotheses, can sometimes lead one astray, when presented
with unusual data.
This is explored further by Gregory in his discussion of optical illusions [54], phenomena
which offer an interesting insight into human perception, and show that people are
easily fooled by unexpected stimuli. Not all optical illusions are of this type, for example
the Café wall illusion, which is due to neural signals becoming confused by adjacent
parallel lines [55]. But many are due to misapplication of general rules, presumably
learned from experience. One such example is the Ames window illusion [54], where the
strong expectation for a window shape to be rectangular – when it is actually a skewed
quadrilateral – makes it appear to change direction as it rotates.
1.4.2 Human Vision Through Learning
These examples and others strongly suggest that there is an important contribution to
human vision from learned prior information, and that one’s previous experience with
the world is used to make sense of it. Humans cannot see without projecting outward
what they expect to see, in order to form hypotheses about the world. This allows quick
interpretation of scenes from incomplete and ambiguous data, but also leads to formation
of incorrect beliefs when presented with unusual stimuli. In turn this suggests that a
successful approach to interpreting information from single images would also benefit
from attempting to learn from past experiences, enabling fast hypothesis generation
from incomplete data — even if in doing so we are not directly emulating the details of
human vision. Indeed, as Gregory states in the introduction to [54], taking advantage
of prior knowledge should be crucial to machine vision, and the failure to recognise this
may account for the lack of progress to date (as of 1997).
With these insights in mind, we next introduce possibilities for perceiving structure in
single images by learning from prior experience. It is important to note, however, that
we are not claiming that the above overview is a comprehensive or complete description
of human vision. Nor do we claim any further biological relevance for our algorithm.
We do not attempt to model ocular or neurological function, but resort to typical image
processing and machine learning techniques; we merely assert that the potential learning-
based paradigm of human vision is a starting point for thinking of solutions to the
problem. Given the success of human vision at arbitrarily difficult tasks, an approach
in a similar vein, based on using machine learning to learn from prior visual experience,
appears to offer great promise.
1.5 Machine Learning for Image Interpretation
Learning from past experience therefore appears to be a promising and biologically plau-
sible means of interpreting novel images. The use of machine learning is further motivated
by the difficulty of creating models manually for tasks this complex. As previous work
has shown (such as [73, 93, 107] as reviewed in Chapter 2), explicitly defining how visual
cues can be used to infer structure in general scenes is difficult. For example, shape
from texture methods make assumptions on the type of texture being viewed, and from
a statistical analysis of visible patterns attempt to directly calculate surface slant and
tilt [44, 133]. However, this means that they can work well only in a limited range of
scenarios.
Alternatively, when it is difficult to specify the details of the model, but we know what
form its inputs and outputs should take, it may be more straightforward to learn the
model. Indeed, this is a common approach to complex problems, where describing the
system fully is tedious or impossible. Almost all object recognition algorithms, for exam-
ple, are based on automatically learning what characterises certain objects [69], rather
than attempting to explicitly describe the features which can be used to identify and
distinguish them. We must proceed with caution, since this still supposes that the world
is well behaved, and we should accept that almost any algorithm (including sometimes
human vision, as discussed above) would be fooled by sufficiently complicated or con-
trived configurations of stimuli. Thus the desire to make use of heuristics and ever more
general assumptions is tempered by the need to deal with unusual situations.
We are also motivated by other recent works, which use machine learning to deduce
the 3D structure of images. This includes the work of Saxena et al. [113], who learn the
relationship between depth maps (gathered by a laser range sensor) and images, allowing
good pixel-wise depth maps to be estimated for new, previously unseen images. Another
example is Hoiem et al. [66], who use classification of image segments into geometric
classes, representing different types of structure and their orientation, to build simple
3D interpretations of images. Both these examples, which use quite different types of
scene structure, have very practical applications, suggesting that machine learning is a
realistic way to tackle the problem — and so next we investigate possible directions to
take, with regards what kind of structure perception we intend to achieve.
1.6 Directions
There are several possible ways in which we might explore the perception of structure
in a single image. As stated above, one of the primary difficulties of single image inter-
pretation is that no depth information is available. Once depth can be observed, it is
then a fairly simple matter to create a 3D point cloud, from which other reconstruction
or segmentation algorithms can work [25, 32]. Therefore one of the most obvious ways
to approach single-image perception would be to try to recover the depth. Clearly this
cannot be done by geometric means, but as the work of Torralba and Oliva [127] and
Sudderth et al. [121] has shown, exploiting familiar objects, or configurations of scene
elements, can allow some rudimentary depth estimates to be recovered. This is related
to the discussion above, in that this relies on the knowledge that some relationships of
depths to image appearance are much more likely than others.
The most sophisticated algorithm to date for perceiving depth from a single image is by
Saxena et al. [113], where detailed depth maps can be extracted from images of general
outdoor scenes. This work has shown that very good depth estimates can be obtained in
the absence of any stereo or multi-view cues (and even used in conjunction with them for
better depth map generation). Since this work has succeeded in extracting rather good
depth maps, we leave the issue of depth detection, and consider other types of structure.
As we discussed in the context of human vision, it may not be necessary to know de-
tailed depth information in order to recover the shape, and this suggests that we could
attempt to recover the shape of general objects without needing to learn the relationship
with depth. Indeed, some progress has been made in this direction using shape from
shading and texture [41, 98, 122], in which the shape of even deformable objects can be
recovered from one image. However, this is a very difficult problem in general, due to
the complexities of shapes in the real world, and the fact that lighting and reflectance
introduce significant challenges — thus it may be difficult to extend to more general
situations.
A simplification of the above is to assume the world is composed of a range of simple
shapes – such as cubes and spheres – and attempt to recover them from images. This
is an interesting approach, since it relies on the fact that certain kinds of volumetric
primitive are likely to appear, and we can insert them into our model based on limited
data (i.e. it is not necessary to see all sides of a cuboid). This is the motivation for
early works such as Roberts’ ‘blocks world’ [109], in which primitive volumes are used to
approximate the structure of a scene; and a more recent development [56] based on first
estimating the scene layout [66] and fitting primitives using physical constraints. While
this approach limits the resulting model to only fairly simple volumetric primitives, many
human-made scenes do tend to consist of such shapes (especially at medium-sized scales),
and so a rough approximation to the scene is plausible.
A rather different approach would be to attempt to segment the image, so that segments
correspond to the continuous parts of the scene, and by assigning each a semantic label
via classification, to begin to perceive the overall structure. Following such a segmenta-
tion, we could potentially combine the information from the entire image into a coherent
whole, using random fields for example. This bears some similarity to the work of [66], in
which the segments are assigned to geometric classes, which are sufficient to recover the
overall scene layout. A related idea is to associate elements detected across the image to
create a coherent whole using a grammar based interpretation [3, 74], in which the prior
information encoded in the grammatical rules enforces the overall structure.
The above possibilities offer potentially powerful ways to get structure from a single im-
age. For practical purposes, in order to make a start at extracting more basic structures,
we will consider a somewhat more restricted – though still very general – alternative,
based on two of the key ideas above. First, we might further simplify the volume primi-
tives concept to use a smaller class of more basic shapes, which should be easier to find
from a single image, at the same time as being able to represent a wider variety of types
of scene. The most basic primitive we could use in this way is the plane. Planar surfaces are
very common and so can compactly represent a diverse range of images. If we consider
human-made environments – arguably where the majority of computer vision algorithms
will see most use – these are often comprised of planar structures. The fact that planes
can compactly represent structure means they have seen much use in tasks such as 3D
reconstruction and robot navigation.
While volumes such as cubes and spheres may be distinctive due to their shape in an image,
the same cannot be said of planes. Planes project simply to a quadrilateral, assuming
there is no occlusion; and while quadrilateral shapes have been used to find planes [94]
this is limited to fairly simple environments with obvious line structure. It is here that the
second idea becomes relevant: that using classification of surfaces can inform us about
structure. Specifically, while the shape of planes may not be sufficiently informative, their
appearance is often distinctive, due to the way planar structures are constructed and
used. For example, building façades and walls have an easily recognisable appearance,
and this allows them to be quickly recognised by humans as being planar even if they
have no obvious outline. Moreover, the appearance of a plane is related to its orientation
with respect to the viewer. Its surface will appear different depending on which way it is
facing, due to effects such as foreshortening and texture compression, which suggests we
should be able to exploit this in order to predict orientation.
For these reasons, planar structures appear to be a good place to begin in order to
extract structure from a single image. Therefore, to investigate single-image perception
in a feasible manner, we look at the potential of recovering planar structures in images
of urban scenes, by learning from their appearance and how this relates to structure.
1.7 Plane Detection
Planes are important structures in computer vision, partly because they are amongst
the simplest possible objects and are easy to represent geometrically. A 3D plane needs
only four parameters to completely specify it, usually expressed as a normal vector in
R 3 and a distance (plus a few more to describe spatial extent if appropriate). Because
of this, they are often easy to extract from data. Only three 3D points are required
to hypothesise a plane in a set of points [5], enabling planar structure recovery from
even reasonably sparse point clouds [46]. Alternatively, a minimum of four point
correspondences between two images is required to estimate a homography [62], which
defines a planar relationship between the two views, making plane detection possible from
pairs [42] or sequences [136] of images.
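The sketch below makes these parameterisations concrete (the point coordinates, intrinsics and pose are illustrative values only, not taken from the thesis). It fits a plane in normal-plus-distance form to three 3D points, and then builds the plane-induced homography H = K (R + t n^T / d) K^-1 relating two views of that plane — a standard multiple-view geometry result, stated here with the convention n . X = d for points on the plane.

```python
import numpy as np

# Plane through three 3D points, in normal-plus-distance form: n . X = d
# for every point X on the plane (values below are illustrative only).
P1 = np.array([0.0, 0.0, 5.0])
P2 = np.array([1.0, 0.0, 5.0])
P3 = np.array([0.0, 1.0, 6.0])
n = np.cross(P2 - P1, P3 - P1)
n /= np.linalg.norm(n)          # unit normal: 3 parameters
d = float(n @ P1)               # signed perpendicular distance from the camera centre: 1 parameter

# Homography induced by this plane between two views, for a camera with
# intrinsics K and relative pose (R, t) taking points from the first camera
# frame to the second: H = K (R + t n^T / d) K^-1.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
R = np.eye(3)                              # no rotation, for simplicity
t = np.array([[0.1], [0.0], [0.0]])        # small sideways translation
H = K @ (R + (t @ n.reshape(1, 3)) / d) @ np.linalg.inv(K)
# H maps pixel coordinates of plane points in the first view to the second view.
```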
Planes are also important because they are ubiquitous in urban environments, making
up a significant portion of both indoor and outdoor urban scenes. Their status as one
of the most basic geometric primitives makes them a defining feature of human-made
environments, and therefore an efficient and compact way of representing structure,
giving realistic and semantically meaningful models while remaining simple. As such,
they have been shown to be very useful in tasks including wide baseline matching [73],
object recognition [65], 3D mesh reconstruction [25], robot navigation [47] and augmented
reality [16].
To investigate the potential of detecting and using planes from single image information,
we have developed an algorithm capable of detecting planar structures in single images,
and estimating their orientation with respect to the camera. This uses general feature
descriptors and standard methods for image representation, combined with a classifier
and regressor to learn from a large set of training data. One of the key points about
our method is that it does not depend upon any particular geometric or textural feature
(such as vanishing points and image gradients), and so is not constrained to a particular
type of scene. Rather, it exploits the fact that appearance of an image is related to scene
structure, and that learning from relevant cues in a set of training images can be used
to interpret new images. We show how this can be used in a variety of environments,
and that it is useful for a visual odometry task.
The work can be divided into three main sections: developing the basic algorithm to
represent image regions, recognise planes and estimate orientation; using this for detect-
ing planes in an image, giving groupings of image points into planar and non-planar
regions; and an example application, showing how plane detection can provide useful
prior information for a plane-based visual mapping system.
1.8 Thesis Overview
After this introductory chapter concludes by summarising our main contributions, Chap-
ter 2 presents some background to this area and reviews related work, including details of
geometric, texture, and learning-based approaches to single image structure perception.
In Chapter 3 we introduce our method for plane recognition , which can identify planes
from non-planes and estimate their orientation, example results of which are shown in
Figure 1.6. This chapter describes how regions of an image are represented, using a
collection of feature descriptors and a visual bag of words model, enhanced with spatial
information. We gather and annotate a large training set of examples, and use this to
train a classifier, so that new image regions may be recognised as planar or not; then,
for those which are planar, their orientation can be estimated by training a regression
algorithm. The important point about this chapter is that it deals only with known,
pre-specified image regions. This algorithm cannot find planes in an image, but can
identify them assuming a region of interest has been defined.
Figure 1.6: Examples of our plane recognition algorithm, which for manually
delineated regions such as these, can identify whether they are planar or not,
and estimate their orientation with respect to the camera.
Chapter 4 thoroughly evaluates the plane recognition algorithm, investigating the effects
of feature representation, vocabulary size, use of synthetic data, and other design choices
and parameters, by running cross-validation on our training set. We also experiment with
an alternative classification technique, leading to useful insights into how and why the
algorithm works. We evaluate the algorithm against an independent test set of images,
showing it can generalise to a new environment.
The plane recognition algorithm is adapted for use as part of a full plane detection
algorithm in Chapter 5, which is able to find the planes themselves, in a previously unseen
image; and for these detected planes, estimate their orientation. Since the boundaries
between planes are unknown, the plane recognition algorithm is applied repeatedly across
the image, to find the most likely locations of planes. This allows planar regions to be
segmented from each other, as well as separating planes with different orientations. The
result is an algorithm capable of detecting multiple planes from a single image, each with
an orientation estimate, using no multi-view or depth information nor explicit geometric
features: this is the primary novel contribution of this thesis. Example results can be
seen in Figure 1.7.
Figure 1.7: Examples of our plane detection algorithm, in a variety of environ-
ments, showing how it can find planes from amongst non-planes, and estimate
their orientation.
We then evaluate this detection algorithm in Chapter 6, showing the results of exper-
iments to investigate the effect of parameters such as the size of regions used for the
recognition step, or sensitivity of plane segmentation to different orientations. Such ex-
periments are used to select the optimal parameters, before again testing our algorithm
on an independent, ground-truth-labelled test set, showing good results for plane detec-
tion in urban environments. We also compare our algorithm to similar work, showing
side-by-side comparison for plane detection. Our method compares favourably, showing
superior performance in some situations and higher accuracy overall.
Finally, Chapter 7 presents an example application of the plane detection algorithm.
Planar structures have been shown to be very useful in tasks such as simultaneous
localisation and mapping and visual odometry, for simplifying the map representation
and producing higher-level, easier to interpret scene representations [47, 89, 132]. We
show that our plane detector can be beneficial in a plane-based visual odometry task, by
giving prior information on the location and orientation of planes. This means planes may
be initialised quickly and accurately, even before the parallax required by multi-view
methods has accumulated. We show that this enables fast building of structured maps
of urban environments, and may improve the accuracy by reducing drift. We end with
Chapter 8, which concludes the thesis with a summary of the work and a discussion of
possible future directions.
1.9 Contributions
To summarise, these are the main contributions of this thesis:
• We introduce a method for determining whether individual regions of an image are planar or not, according to their appearance (represented with a variety of basic descriptors), based on learning from training data.
• We show that it is possible, for these planar regions, to estimate their 3D orientation with respect to the viewer, based only on their appearance in one image.
• The plane recognition algorithm can be used as part of a plane detection algorithm, which can recover the location and extent of planar structure in one image.
• This plane detection algorithm can estimate the orientation of (multiple) planes in an image, showing that there is an important link between appearance and structure, and that this is sufficient to recover the gross scene structures.
• We show that both these methods work well in a variety of scenes, and investigate the effects of various parameters on the algorithms' performance.
• Our method is shown to perform similarly to a state-of-the-art method on our dataset.
• We demonstrate an example application of this plane detection method, by applying it to monocular visual odometry, which as discussed above could benefit from being able to quickly see structure without requiring parallax.
    CHAPTER 2
    Background
    In this chapter we discuss related works on single image perception, focusing on those
    which consider the problem of detecting planar structure, or estimating surface orien-
tations. These can be broadly divided into two main categories. The first comprises
those which directly use the geometric or textural properties of the image to infer
structure; these may be further divided into methods which calculate directly from
visible geometric entities, and those that take a statistical approach based on image
features. The second comprises methods which use machine learning to learn the
relationship between appearance and scene structure.
    As we discussed in Chapter 1, even when only a single image is available it is usually
    possible to infer something about the 3D structure it represents, by considering what is
    most likely, given the information available. Various methods have been able to glean
    information about surface orientation, relative depth, curvature or material properties,
    for example, even though none of these are measurable from image pixels themselves.
    This is due to a number of cues present in images, especially those of human-made,
    regular environments, such as vanishing points, rectilinear structure, spectral properties,
    colours and textures, lines and edges, gradient and shading, and previously learned
    objects.
The use of such cues amounts to deliberately applying prior knowledge to the scene.
    Here prior knowledge means information that is assumed to be true in general, and can
    be used to extrapolate beyond what is directly observed. This can be contrasted with
    specific scene knowledge [2, 102], which can also help to recover structure when limited
    or incomplete observations are available. However, this is limited to working in specific
    locations.
    Over the following sections we review work on single image perception which makes
    use of a variety of prior knowledge, represented in different ways and corresponding to
    different types of assumption about the viewed scene. These are organised by the way
    they make assumptions about the image, and which properties of 3D scenes and 2D
    projections they exploit in order to achieve reconstruction or image understanding.
    2.1 Vanishing Points
Vanishing points are defined as the points in the image where the projections of parallel
3D lines appear to meet. They are the images of points on the plane at infinity, a special
construct of projective space that is invariant to translations of the camera, and they
therefore place constraints on the geometry of the scene. Vanishing points are especially
useful in urban, human-made scenes, in which
    pairs of parallel lines are ubiquitous, and this means that simple methods can often be
    used to recover 3D information. A detailed explanation of vanishing points and the plane
    at infinity can be found in chapters 3 and 6 of [62].
    A powerful demonstration of how vanishing points and vanishing lines can be used is
    by Criminisi et al. [26], who describe how the vanishing line of a plane and a vanishing
    point in a single image can be used to make measurements of relative lengths and areas.
    If just one absolute distance measurement is known, this can be used to calculate other
    distances, which is useful in measuring the height of people for forensic investigations,
    for example. As the authors state, such geometric configurations are readily available in
    structured scenes, though they do not discuss how such information could be extracted
    automatically.
    Another example of using vanishing points, where they are automatically detected, is the
work of Košecká and Zhang [73], in which the dominant planar surfaces of a scene are
    recovered, with the aim of using them to match between widely separated images. The
    method is based on the assumption that there are three primary mutually orthogonal
    directions present in the scene, i.e. this is a ‘Manhattan-like’ environment, having a
    ground plane and mutually perpendicular walls. Assuming that there are indeed three
    dominant orientations, the task is thus to find the three vanishing points of the image.
    This is done by using the intersections of detected line segments to vote for vanishing
    points in a Hough accumulator. However, due to noise and clutter (lines which do not
    correspond to any of the orientations), this is not straightforward. The solution is to use
    expectation maximisation (EM) to simultaneously estimate the location of the vanishing
    points in the image, and the probability of each line belonging to each vanishing point;
    the assignment of lines to vanishing points updates the vanishing points’ positions, which
    in turn alters the assignment, and this iterates until convergence. An example result,
    where viable line segments are coloured according to their assigned vanishing direction,
    is shown in Figure 2.1a.
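The following toy sketch (numpy; the residual measure, kernel width and initialisation are our own simplifying choices, and it is in no sense a reimplementation of [73]) illustrates the EM idea of soft-assigning line segments to vanishing points and re-estimating the points from their assigned lines:

    import numpy as np

    def line_homog(p, q):
        # Homogeneous line through two image points p and q.
        return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

    def fit_vp(lines, weights):
        # Weighted least-squares vanishing point: minimise sum_i w_i (l_i . v)^2 with ||v|| = 1.
        _, _, Vt = np.linalg.svd(lines * weights[:, None])
        return Vt[-1]

    def em_vanishing_points(segments, init_vps, n_iter=20, sigma=0.05):
        lines = np.array([line_homog(p, q) for p, q in segments], dtype=float)
        lines /= np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
        vps = np.array([v / np.linalg.norm(v) for v in init_vps], dtype=float)
        for _ in range(n_iter):
            # E-step: a line passing through a vanishing point v satisfies l . v = 0,
            # so |l . v| serves as a crude residual for the soft assignment.
            resid = np.abs(lines @ vps.T)
            resp = np.exp(-0.5 * (resid / sigma) ** 2)
            resp /= resp.sum(axis=1, keepdims=True) + 1e-12
            # M-step: re-fit each vanishing point from its softly assigned lines.
            vps = np.array([fit_vp(lines, resp[:, k]) for k in range(len(vps))])
        return vps, resp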
    To find actual rectangular surfaces from this, two pairs of lines, corresponding to two
    different vanishing points, are used to hypothesise a rectangle (a quadrilateral in the
    image). These are verified using the distribution of gradient orientations within the
    image region, which should contain two separate dominant orientations. An example
    of a set of 3D rectangles detected in one image is shown in Figure 2.1b. These can
    subsequently be used for wide-baseline matching to another image.
Figure 2.1: Illustration of the method of Košecká and Zhang [73], showing the
    lines coloured by their assignment to the three orthogonal vanishing directions
    (a), and a resulting set of planar rectangles found on an image (b). Images are
    adapted from [73].
    This method has shown promising results, in both indoor and outdoor scenes, and the
    authors mention that it would be useful for robot navigation applications. Its main
    downside, however, is that it is limited to scenes with this kind of dominant rectangular
    structure. As such, planes which are perpendicular to the ground, but are oriented differ-
    ently from the rest of the planes, would not be easily detected, and may cause problems
    for the EM algorithm. Furthermore, the method relies on there being sufficiently many
    good lines which can be extracted and survive the clustering, which may not be the case
    when there is background clutter or spurious edges produced by textured scenes.
A related method is presented by Mičušík et al. [93], where line segments are used to
    directly infer rectangular shapes, which in turn inform the structure of the scene. It
    differs from [73] in that it avoids the high computational complexity of exhaustively
    considering all line pairs to hypothesise rectangles. Rectangle detection is treated as a
    labelling problem on the set of suitable lines, using a Markov random field. By using two
    orthogonal directions at a time, labels are assigned to each line segment to indicate the
    vanishing direction to which it points, and which part of a rectangle it is (for example
    the left or right side); from this, individual rectangles can be effortlessly extracted.
    These methods highlight an important issue when using vanishing points and other
    monocular cues. While they are very useful for interpreting a single image, they can also
    provide useful information when multiple images are present, by using single image cues
    to help in wide-baseline matching for example. Similar conclusions are drawn in work
    we review below [111], where a primarily single-image method is shown to be beneficial
    for stereo vision by exploiting complementary information.
    An alternative way of using vanishing points is to consider the effect that sets of con-
    verging parallel lines will have on the spectral properties of the image. For example,
    rather than detecting lines directly in the spatial domain, Ribeiro and Hancock [107]
    use the power spectrum of the image, where spectra which are strongly peaked describe
    a linear structure. From this, properties of the texture and the underlying surface can
    be recovered, such as its orientation. In their later work [108], spectral moments of the
    image’s Fourier transform are used to find vanishing lines via a Hough-like accumula-
    tor. Such methods are dependent on the texture being isotropic and homogeneous, such
    that any observed distortions are due to the effects of projection rather than the pat-
    terns themselves. These restrictions are quite limiting, and so only a subset of scenes –
    namely those with regular grid-like structures, such as brick walls and tiled roofs – are
    appropriate for use with this method.
    2.2 Shape from Texture
    An alternative to using vanishing points, which has received much attention in the liter-
    ature, is known as shape from texture (this is part of a loose collection of methods known
    collectively as ‘shape-from-X’, which includes recovery of 3D information from features
    such as shading, defocus and zoom [35, 75, 116]). Such an approach is appealing, since
    it makes use of quite different types of information from the rectangle-based methods
    above. In those, texture is usually an inconvenience, whereas many real environments
    have multiple textured surfaces, especially outdoors.
    Early work on shape from texture was motivated by the insights of Gibson into human
    vision [48], specifically that humans extract information about surface orientation from
    apparent gradients. However, this model was not shown to be reliable for real textures
[133]; and because they must assume homogeneity and isotropy of texture (conditions
that, while unrealistic, allow surface orientation to be computed directly), methods
developed from these ideas fall short of being able to recover structure from real images.
    Work by Witkin [133] allows some of these assumptions to be relaxed in order to better
    portray real-world scenes. Rather than requiring the textures to be homogeneous or
    uniform, the only constraint is that the way in which textures appear non-uniform does
    not mimic perspective projection — which is in general true, though of course there
    will be exceptional cases. Witkin’s approach is based on the understanding that for one
    image there will be a potentially infinite number of plausible reconstructions due to the
    rules of projective geometry, and so the task is to find the ‘best’, or most likely, from
    amongst all possible alternatives. This goal can be expressed in a maximum-likelihood
    framework.
    The method works by representing texture as a distribution of short line segments,
    and assumes that for a general surface texture their orientations will be distributed
    uniformly. When projected, however, the orientations will change, aligning with the
    axis of tilt, so there is a direct relationship between the angular distribution of line
    segments (which can be measured), and the slant and tilt of the image. An iterative
    method is used to find the most likely surface orientation given the observed angles, in a
maximum likelihood approach. This is developed by Gårding [44], who proposes a fast
    and efficient one-step method to estimate the orientation directly from the observations,
    with comparable results. An example of applying the latter method is shown in Figure
    2.2, which illustrates the estimated orientation at the four image quadrants.
    However, shape from texture techniques such as these again face the problem that tex-
    tural cues can be ambiguous or misleading, and an incorrect slant and tilt would be
estimated for textures with persistent elongated features, for example rock fissures or
fence-posts.

Figure 2.2: An example result from the shape from texture algorithm of Gårding [44], applied to an image of an outdoor scene. Orientation is approximately recovered for the four quadrants of the image (there is no notion of plane detection here, only slant/tilt estimation). The image was adapted from [44].

Moreover, while the accuracy of [44] was shown to be good for simulated
    textures of known orientation, no ground truth was available for evaluation on real im-
    ages. While the presented results are qualitatively good, it was not possible to accurately
    assess its applicability to real images.
    A further issue with shape from texture methods is that generally they deal only with
    estimating the orientation of given regions (as in the image above), or even of the whole
    image; the detection of appropriate planar surfaces, and their segmentation from non-
    planar regions, is not addressed, to our knowledge.
    2.3 Learning from Images
    The approaches discussed above use a well defined geometric model, relating features of
    appearance (vanishing points, texture, and so on) to 3D structure. They have proven
    to be successful in cases where the model is valid – where such structure is visible and
    the key assumptions hold – but are not applicable more generally. This is because it
    is very difficult to explicitly specify a general model to interpret 3D scenes, and so a
    natural alternative is to learn the model from known data. This leads us on to the next
    important class of methods, to which ours belongs: those which use techniques from
    machine learning to solve the problem of single-image perception.
    An interesting approach is presented in a set of papers by Torralba and Oliva, where
    general properties are extracted from images in order to describe various aspects of
    the scene [99, 100, 127, 128]. They introduce the concept of the ‘scene envelope’ [99],
    which combines various descriptions of images, such as whether they are natural or
    artificial locations, depict close or far structures, or are of indoor or outdoor scenes. By
estimating where along each of these axes an image falls, a qualitative description can
    be generated automatically (for example ‘Flat view of a man-made urban environment,
    vertically structured’ [99]). This is achieved by representing each image by its 2D Fourier
    transform, motivated by the tendency for the coarse spectral properties of the image to
    depend upon the type and configuration of the scene. An indoor scene, for example,
    will produce sharp vertical and horizontal features, which is evident when viewed in the
    frequency domain.
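As a rough illustration of this style of global spectral description (a much-simplified stand-in for the descriptors of [99, 127], not their actual features), one can average the log power spectrum of a greyscale image over a coarse grid of frequency bands:

    import numpy as np

    def spectral_signature(gray, n_blocks=4):
        # Coarse global descriptor from the 2D log power spectrum of a greyscale image.
        f = np.fft.fftshift(np.fft.fft2(gray))
        power = np.log1p(np.abs(f) ** 2)
        h, w = power.shape
        bh, bw = h // n_blocks, w // n_blocks
        return np.array([power[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw].mean()
                         for i in range(n_blocks) for j in range(n_blocks)])

    sig = spectral_signature(np.random.rand(240, 320))   # placeholder image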
    This work is further developed for scene depth estimation [127], where the overall depth
    of an image is predicted. This is based on an important observation, that while it is
possible for an object to be of any size and at any depth, there are certain characteristic
    sizes for objects, which can be taken advantage of (a concept illustrated by Figure 2.3).
    For example, buildings are much more likely to be large and distant than small and close
    (as in a toy model); conversely something that looks like a desktop is very unlikely to be
    scenery several kilometres away. Again, this relates to the key point that we desire the
    most likely interpretation. Like the methods mentioned above, this method is based on
    the frequency characteristics of the image. Image properties are represented using local
    and global wavelet energy, with which depth is estimated by learning a parametric model
    using expectation maximisation. While this is a very interesting approach to recovering
    depth, the main downside is that it gives only the overall depth of the image. There is
    no distinction between near or far objects within one image, so it cannot be used to infer
    any detailed scene structure.
    Figure 2.3: An illustration from Torralba and Oliva [127], making the point
    that some objects tend to only appear at characteristic scales, which means they
    can be used to recover approximate scene depth (image taken from [127]).
    2.3.1 Learning Depth Maps
    A more sophisticated approach by Saxena et al. [113] estimates whole-image depth maps
    for single images of general outdoor scenes. This is based on learning the relationship be-
    tween image features and depth, using training sets consisting of images with associated
    ground truth depth maps acquired using a custom built laser scanner unit. The cen-
    tral premise is that by encoding image elements using a range of descriptors at multiple
    scales, estimates of depth can be obtained.
    In more detail, they first segment the image into superpixels (see Figure 2.4b), being local
    homogeneous clusters of pixels, for which a battery of features are computed (includ-
    ing local texture features and shape features). As well as using features from individual
    superpixels, they combine features from neighbouring superpixels in order to encode con-
    textual information, because the orientation or depth of individual locations is strongly
    influenced by that of their surroundings. In addition to these features, they extract
    features to represent edge information to estimate the location of occlusion boundaries.
    These features allow them to estimate both relative and absolute depth, as well as local
    orientation. The latter is useful in evaluating whether there should be a boundary
    between adjacent superpixels. This is interesting, since it implies they are exploiting
    planarity — indeed, they make the assumption that each superpixel can be considered
    locally planar. This is a reasonable assumption when they are small. They justify this
    by analogy with the meshes used in computer graphics, where complex shapes are built
    from a tessellation of triangles. They explicitly build such meshes using the depth and
    orientation estimates, an example of which is shown in Figure 2.4c.
    Figure 2.4: The depth map estimation algorithm of Saxena et al. [113] takes
    a single image (a) as input, and begins by segmenting to superpixels (b). By
    estimating the orientation and depth of each locally planar facet, they produce
    a 3D mesh representation (c). Images are taken from [113].
    Figure 2.5: This shows some typical outputs of depth maps (bottom row)
    estimated by Saxena et al. [113] from single images (top row), where yellow is
    the closest and cyan the farthest. These examples showcase the ability of the
    algorithm to recover the overall structure of the scene with good accuracy (images
    taken from Saxena et al. [113]).
Following the example of human vision, which easily integrates information from multiple
    cues over the whole image, they attempt to relate information at one location to all other
    locations. This is done by formulating the depth estimation problem in a probabilistic
    model, using a Markov random field, with the aim of capturing connected, coplanar and
    collinear structures, and to combine all the information to get a consistent overall depth
    map. Example results of the algorithm are shown in Figure 2.5.
    This has a number of interesting applications, such as using the resulting depth map
    to create simple, partially realistic single-image reconstructions. These are sufficient to
    create virtual fly-throughs, by using the connected mesh of locally planar regions, and
these can semi-realistically render the photographed scene from new viewpoints. Such
    reconstructions were shown to rival even those of Hoiem et al. [64] discussed below.
    Interestingly, while estimating the orientation of the planar facets is a necessary step of
    the algorithm, to relate the depth of adjacent segments, they do not actually produce a
    set of planes. Instead they focus on polygonal mesh representations, and while it might
    be possible to attempt to extract planar surfaces from such models, their results hint
    that it would not be trivial to do so.
    They also show that their depth estimates can be combined for creating multi-view re-
    constructions from images which are normally too widely separated for this to be possible
    (with too little overlap to allow reliable reconstruction using traditional structure-from-
    motion, for example). In [111] this algorithm is also shown to be beneficial for stereo
    vision, as another means of estimating depth. This is interesting since even when two
    images are available, from which depth can be calculated directly, a single-image depth
    estimation still provides valuable additional information, by exploiting complementary
    cues. A simplified version of the algorithm was even used to guide an autonomous vehicle
over unknown terrain [92]. These last two examples illustrate the value of using perception from
    single images in multi-view scenarios, an idea we come back to in Chapter 7.
    2.3.2 Geometric Classification
    The work of Hoiem et al. [66] has the goal of interpreting the overall scene layout from a
    single image. This is motivated by the observation that in a great many images, almost
    all pixels correspond to either the ground, the sky, or some kind of upright surface, and
    so if these three classes can be distinguished a large portion of images can be described
    in terms of their rough geometry. This notion of ‘geometric classification’ is the core of
    their method, in which regions of the image are assigned to one of three main classes
    — namely support (ground), sky, and vertical, of which the latter is further subdivided
    into left, right, and forward facing planes, or otherwise porous or solid.
    While the method is not explicitly aimed at plane detection, it is an implicit part of
    understanding the general structure of scenes, as the image is being effectively partitioned
    into planar (ground, left, right, forward) and non-planar (sky, solid, porous) regions. It
    is important to note, however, that plane orientation is limited by quantisation into four
discrete classes; no finer resolution of surface orientation than 'left-facing', and so on,
can be obtained.
    Classification is achieved using a large variety of features, including colour (summary
    statistics and histograms), filter bank responses to represent texture, and image location.
    Larger-scale features are also used, including line intersections, shape information (such
    as area of segments), and even vanishing point information. These cues are used in the
    various stages of classification, using boosted decision trees, where the logistic regression
    version of Adaboost chosen for the weak learners is able to select the most informative
    features from those available. This automatic selection out of all possible features is
    one of the interesting properties of the method, in that it is not even necessary to tell
    it what to learn from — although this is at the expense of extracting many potentially
    redundant feature descriptors. The classifiers are trained on manually segmented training
    data, where segments have been labelled with their true class.
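In the same spirit, a boosted decision-tree classifier over per-segment features can be sketched with scikit-learn; the feature matrix and labels below are random placeholders, and GradientBoostingClassifier is used merely as a readily available stand-in for the logistic-AdaBoost variant of [66]:

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 40))           # per-segment features (colour, texture, location, ...)
    y = rng.integers(0, 3, size=300)         # 0 = support, 1 = vertical, 2 = sky

    clf = GradientBoostingClassifier(n_estimators=200, max_depth=3)
    clf.fit(X, y)
    label_likelihoods = clf.predict_proba(X[:5])   # per-class likelihoods for some segments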
    The central problem is that in an unknown image, it is not known how to group the
    pixels, and so more complex features than simple local statistics cannot be extracted.
    Thus features such as vanishing points cannot be included until initial segments have
    been hypothesised, whose fidelity in turn depends upon such features. Their solution is
    to gradually build up support, from the level of pixels to superpixels (over-segmented
    image regions) to segments (groups of superpixels), which are combined in order to cre-
    ate a putative segmentation of the image. An initial estimate of structure (a particular
    grouping of superpixels) is used to extract features, which are then used in order to up-
    date the classifications, to create a better representation of the structure and a coherent
    labelling of all the superpixels.
The superpixels are found by using Felzenszwalb and Huttenlocher's graph-based segmen-
    tation method [36] to separate the image into a number (usually around 500) of small,
    approximately homogeneous regions, from which local features can be extracted. These
    are the atomic elements, and are much more convenient to work with than pixels them-
    selves. Segments are formed by grouping these superpixels together, combining infor-
    mation probabilistically from multiple segmentations of different granularity (see Figure
    2.6). Since it is not feasible to try all possible segmentations of superpixels, they sample
    a smaller representative set.
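For reference, this style of graph-based superpixel segmentation is available in scikit-image; the parameters below are illustrative, not those used in [66]:

    from skimage import data
    from skimage.segmentation import felzenszwalb

    image = data.astronaut()                               # any RGB image as a stand-in
    labels = felzenszwalb(image, scale=50, sigma=0.8, min_size=20)
    print(labels.max() + 1, "superpixels")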
    Figure 2.6: For some image (a), this shows the superpixels extracted (b),
    and two segmentations at different granularities (c,d, with 15 and 80 segments
    respectively). Images adapted from [66].
    There are two main steps in evaluating whether a segment is good and should be retained.
    First, they use a pairwise-likelihood classification on pairs of adjacent superpixels, which
    after being trained on ground truth data is able to estimate the probability of them
    having the same label. This provides evidence as to whether the two should belong in
    the same eventual segment, or straddle a boundary. Next there is an estimate of segment
    homogeneity, which is computed using all the superpixels assigned to a segment, and is in
    turn used to estimate the likelihood of its label. To get the label likelihoods for individual
    superpixels, they marginalise over all sampled segments in which the superpixel lies.
    They show how these likelihoods can be combined in various ways, from a single max-
    margin estimate of labels, to more complex models involving a Markov random field or
    simulated annealing. The final result is a segmentation of the image into homogeneously
    labelled sets of superpixels, each with a classification to one of the three main labels,
    and a sub-classification into the vertical sub-classes where appropriate. Some examples
    from their results are shown in Figure 2.7.
Figure 2.7: Example outputs of Hoiem et al. [66], for given input images. The
    red, green and blue regions denote respectively vertical, ground and sky classes.
    The vertical segments are labelled with symbols, showing the orientation (arrows)
    or non-planar (circle and cross) subclass to which they are assigned. Images
    adapted from [66].
    The resulting coarse scene layout can help in a number of tasks. It can enable simple 3D
    reconstruction of a scene from one image [64], termed a ‘pop-up’ representation, based on
    folding the scene along what are estimated to be the ground-vertical intersections. Its use
    in 3D reconstruction is taken further by Gupta and Efros [56], who use the scene layout
    as the first step in a blocks-world-like approach to recovering volumetric descriptions of
    outdoor scenes.
    Scene layout estimation is also used as a cue for object recognition [65], because knowing
    the general layout of a scene gives a helpful indication of where various objects are
    most likely to appear, which saves time during detection. For example, detecting the
    location and orientation of a road in a street scene helps predict the location and scale of
    pedestrians (i.e. connected to the road and around two metres in height), thus discarding
    a large range of useless search locations. This is interesting as it shows how prior scene
    knowledge can be beneficial beyond reconstructing a scene model.
    However, this method has a few shortcomings. Firstly from an algorithmic standpoint,
    its sampling approach to finding the best segmentation is not deterministic, and so quite
    different 3D structures will be obtained for arbitrarily similar images. The iterative
    nature of the segmentation, in which the features are extracted from a structure estimate
    which is not yet optimal, might also be problematic, for example falling into local minima
    where the true segmentation cannot be found because the most appropriate cues cannot
    be exploited.
    In terms of scene understanding, there are a few more drawbacks. The method rests on
    the not unreasonable assumption that the camera is level (i.e. there is no roll), which
for most photographs is true, although it cannot be guaranteed in a robot navigation
    application, for example (indeed, such images were manually removed from their dataset
    before processing). Moreover, the horizon needs to be somewhere within the image,
    which again may be rather limiting. While the assumptions it makes are much less
    restrictive than those of, say, shape from texture, it is still limited in that the scene
    needs to be well represented by the discrete set of classes available.
    In terms of plane detection, the quantisation of orientation (left-facing, right-facing and
    so on) is the biggest downside, since it limits its ability to distinguish planes from one
    another if they are similar in orientation, and would give ambiguous results for planes
    facing obliquely to the camera; as well as the obvious limit to accuracy attainable for
    any individual plane. Thus while the algorithm shows impressive performance in a range
    of images, extracting structure from arbitrarily placed cameras could be a problem.
    Nevertheless, the fact that a coarse representation gives a very good sense of scene
    structure, and is useful for the variety of tasks mentioned above, is reassuring, and paves
    the way for other learning-based single-image perception techniques.
    2.4 Summary
    This chapter has reviewed a range of methods for extracting the structure from single
    images, giving examples of methods using either direct calculation from image features or
    inference via machine learning. While these have been successful to an extent, a number
    of shortcomings remain. As we have discussed, those which rely on texture gradients
    or vanishing points are not applicable to general scenes where such structures are not
    present. Indeed, as Gregory [54] suggested, their shortcomings arise because they fail
    to take into account the contribution of learned prior knowledge to vision, and thus are
    unable to deal with novel or unexpected situations.
    It is encouraging therefore that several methods using machine learning have been devel-
oped; however, these also have a few problems. The work of Torralba and Oliva focuses
    on global scene properties, which is a level of understanding too coarse for most inter-
    esting applications. Saxena et al. successfully show that depth maps can be extracted
    from single images (so long as comprehensive and accurate ground truth is available for
    training), then used to build rough 3D models. While this does reflect the underlying
    scene structure, it does not explicitly find higher-level structures with which the scene
    can be represented, nor determine what kind of structure the superpixels represent. We
    emphasise that this work has made significant progress in terms of interpreting single
    images, but must conclude that it does not wholly address the central issue with which
    this thesis is concerned — namely, the interpretation and understanding of scenes from
    one image. That is, they can estimate depth reliably, but do not distinguish different
    types of scene element, nor divide structures as to their identity (the mesh represents
    the whole scene, with no knowledge of planes or other objects). By contrast, the plane
    detection algorithm we present distinguishes planar and non-planar regions, and though
    it does not classify planes according to what they actually are, is a first step toward un-
    derstanding the different structures that make up the scene, potentially enabling more
    well-informed augmentation or interaction.
    Hoiem et al. on the other hand focus almost exclusively on classifying parts of the
    image into geometric classes, bridging the gap between semantic understanding and
    3D reconstruction, and combining techniques from both machine learning and single
view metrology. However, because orientations are coarsely quantised,
    the recovered 3D models lack specificity, being unable to distinguish similarly oriented
    planes which fall into the same category; and any reconstruction is ultimately limited
by the fidelity of the initial superpixel extraction. This limitation is not only a practical
    inconvenience, but suggests that the available prior knowledge contained in the training
    set could be exploited more thoroughly. Moreover, their requirements that the camera
    be roughly aligned with the ground plane, and the use of vanishing point information as
    a cue, suggest they are not making use of fully general information. In particular, they
    tend to need a fairly stable kind of environment, with a visible and horizontal ground
    plane at the base of the image. While this is a common type of scene, what we are
    aiming for is a more general treatment of prior information.
On the other hand, its ability to cope with cartoon images and even paintings shows it is
    much more flexible than typical single-image methods, being able to extract very general
    cues which are much more than merely the presence of lines or shapes. This method,
    more than any others, shows the potential of machine learning methods to make sense
    of single images. In this thesis we attempt to further develop these ideas, to show how
    planar structures can be extracted from single images.
    CHAPTER 3
    Plane Recognition
    This chapter introduces our plane recognition algorithm, whose purpose is to determine
    whether an image region is planar or not, and estimate its 3D orientation. As it stands,
    this requires an appropriately delineated region of interest in the image to be given as
    input, and does not deal with the issue of finding or segmenting such regions from the
    whole image; however we stress that this technique, while limited, will form an essential
    component of what is to come.
    As we discussed in Chapter 2, many approaches to single image plane detection have
    used geometric or textural cues, for example vanishing points [73] or the characteristic
    deformation of textures [44]. We aim to go beyond these somewhat restrictive paradigms,
    and develop an approach which aims to be more general, and is applicable to a wider
    range of scenes. Our approach is inspired by the way humans appear to be able to
    easily comprehend new scenes, by exploiting prior visual experiences [50, 51, 54, 101].
    Therefore we take a machine learning approach, to learn the relationship between image
    appearance and 3D scene structure, using a collection of manually labelled examples.
    To be able to learn, our algorithm requires the collection and annotation of a large set
    of training data; representation of training and test data in an appropriate manner; and
    the training of classification and regression algorithms to predict class and orientation
    respectively for new regions. Over the course of this chapter we develop these concepts
    in detail.
    3.1 Overview
    The objective of our plane recognition algorithm is as follows: for a given, pre-segmented
    area of an image (generally referred to as the ‘image region’), to classify it as being planar
    or non-planar, and if it is deemed to be planar, to estimate its orientation with respect
    to the camera coordinate system. For now, we are assuming that an appropriate region
    of the image is given as input.
    The basic principle of our method is to learn the relationship between appearance and
    structure in a single image. Thus, the assumption upon which the whole method is
    founded is that there is a consistent relationship between how an image region looks
    and its 3D geometry. While this statement may appear trivial, it is not necessarily true:
    appearances can deceive, and the properties of surfaces may not be all they seem. Indeed,
    exploiting the simplest of relationships (again, like vanishing points) between appearance
    and structure in a direct manner is what can lead to the failure of existing methods, when
    such assumptions are violated. However, we believe that in general image appearance is
    a very useful cue to the identity and orientation of image regions, and we show that this
    is sufficient and generally reliable as a means of predicting planar structure.
    To do this, we gather a large set of training examples, manually annotated with their
    class (it is to be understood that whenever we refer to ‘class’ and ‘classification’, it
    is to the distinction of plane and non-plane, rather than material or object identity)
    and their 3D orientation (expressed relative to the camera, since we do not have any
global coordinate frame, and represented as a normalised vector in R^3, pointing toward
    the viewer), where appropriate. These examples are represented using general features,
    rather than task-specific entities such as vanishing points. Image regions are represented
    using a collection of gradient orientation and colour descriptors, calculated about salient
    points. These are combined into a bag of words representation, which is projected to a
    low dimensional space by using a variant of latent semantic analysis, before enhancement
    with spatial distribution information using ‘spatiograms’ [10]. With these data and the
    regions’ given target labels, we train a classifier and regressor (often referred to as simply
    ‘the classifiers’). Using the same representation for new, previously unseen, test regions,
    the classifiers are used to predict their class and orientation.
    In the following sections, we describe in detail each of these steps, with a discussion of
    the methods involved. A full evaluation of the algorithm, with an investigation of the
    various parameters and design choices, is presented in the next chapter.
    3.2 Training Data
    We gather a large set of example data, with which to train the classifiers. From the raw
    image data we manually choose the most relevant image regions, and mark them up with
    ground truth labels for class and orientation, then use these to synthetically generate
    more data with geometric transformations.
    3.2.1 Data Collection
    Using a standard webcam (Unibrain Fire-i) 1 connected to a laptop, we gathered video
    sequences from outdoor urban scenes. These are of size 320 × 240 pixels, and are corrected
for radial distortion introduced by a wide-angle lens 2. For development and validation
    of the plane recognition algorithm, we collected two datasets, one taken in an area
    surrounding the University of Bristol, for training and validating the algorithm; and
    a second retained as an independent test set, taken in a similar but separate urban
    location.
    To create our training set, we select a subset of video frames, which show typical or
    interesting planar and non-planar structures. In each, we manually mark up the region
    of interest, by specifying points that form its convex hull. This means that we are using
    training data corresponding to either purely planar or non-planar regions (there are no
    mixed regions). This is the case for both training and testing data.
    To train the classifiers (see Section 3.4), we need ground truth labels. Plane class is easy
    to assign, by labelling each region as to whether it is planar or not. Specifying the true
    orientation of planar regions is a little more complicated, since the actual orientation is of
    course not calculable from the image itself, so instead we use an interactive method based
on vanishing points, as illustrated in Figure 3.1 [26, 62].

1 www.unibrain.com/products/fire-i-digital-camera/
2 Using the Caltech calibration toolkit for Matlab, from www.vision.caltech.edu/bouguetj/calib_doc

Figure 3.1: Illustration of how the ground truth orientation is obtained, using manually selected corners of a planar rectangle (the two vanishing points v_1 and v_2 give the vanishing line l = v_1 × v_2, from which the plane normal is n = K^T l).

Four points corresponding to the corners of a rectangle lying on the plane in 3D are marked up by hand, and the pairs of
    opposing edges are extended until they meet to give vanishing points in two orthogonal
directions. These are denoted v_1 and v_2, and are expressed in homogeneous coordinates
(i.e. extended to 3D vectors by appending a 1). Joining these two points in the image
by a line gives the vanishing line, which is represented by a 3-vector l = v_1 × v_2. The
plane which passes through the vanishing line and the camera centre is parallel to the
scene plane described by the rectangle, and its normal can be obtained from n = K^T l,
where K is the 3 × 3 intrinsic camera calibration matrix [62] (a superscripted 'T'
denotes the matrix transpose). Examples of training data, both planar, showing the
    ‘true’ orientation, and non-planar, are shown in Figure 3.2.
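The following is a minimal numpy sketch of this ground-truth computation; the corner ordering, the example intrinsics and the sign convention for the normal are our own illustrative choices (in practice the corners come from the interactive mark-up tool):

    import numpy as np

    def plane_normal_from_rectangle(corners, K):
        # corners: four image corners of a rectangle on the plane, ordered around the rectangle.
        c = [np.array([x, y, 1.0]) for x, y in corners]
        v1 = np.cross(np.cross(c[0], c[1]), np.cross(c[3], c[2]))   # one pair of opposite edges meets at v1
        v2 = np.cross(np.cross(c[1], c[2]), np.cross(c[0], c[3]))   # the other pair meets at v2
        l = np.cross(v1, v2)                                        # vanishing line
        n = K.T @ l                                                 # plane normal, up to sign and scale
        n = n / np.linalg.norm(n)
        return n                    # may need negating so that it points toward the viewer

    K = np.array([[500.0, 0.0, 160.0],       # illustrative intrinsics for a 320 x 240 image
                  [0.0, 500.0, 120.0],
                  [0.0, 0.0, 1.0]])
    corners = [(60, 40), (250, 70), (240, 200), (55, 170)]   # placeholder 'clicked' corners
    n = plane_normal_from_rectangle(corners, K)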
    3.2.2 Reflection and Warping
    In order to increase the size of our training set, we synthetically generate new training
    examples. First, we reflect all the images about the vertical axis, since a reflection of an
    image can be considered equally physically valid. This immediately doubles the size of
    our training set, and also removes any bias for left or right facing regions.
    We also generate examples of planes with different orientations, by simulating the view
    as seen by a camera in different poses. This is done by considering the original image to
    be viewed by a camera at location [ I | 0 ], where I is the 3 × 3 identity matrix (no rotation)
    and 0 is the 3D zero vector (the origin); and then approximating the image seen from
    a camera at a new viewpoint [ R | t ] (rotation matrix R and translation vector t with
    respect to the original view) by deriving a planar homography relating the image of the
plane in both views.

Figure 3.2: Examples of our manually outlined and annotated training data, showing examples of both planes (orange boundary, with orientation vectors) and non-planes (cyan boundary).

The homography linking the two views is calculated by
H = R + \frac{t n^T}{d}    (3.1)
    where n is the normal of the plane and d is the perpendicular distance to the plane
    (all defined up to scale, which means without loss of generality we set d = 1). We use
    this homography to warp the original image, to approximate the view from the new
    viewpoint. To generate the pose [ R | t ] of the hypothetical camera, we use rotations of
angle θ_γ about the three coordinate axes γ ∈ {x, y, z}, each represented by a rotation
matrix R_γ, the product of which gives us the final rotation matrix

R = R_x R_y R_z =
\begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\theta_x & -\sin\theta_x \\ 0 & \sin\theta_x & \cos\theta_x \end{pmatrix}
\begin{pmatrix} \cos\theta_y & 0 & \sin\theta_y \\ 0 & 1 & 0 \\ -\sin\theta_y & 0 & \cos\theta_y \end{pmatrix}
\begin{pmatrix} \cos\theta_z & -\sin\theta_z & 0 \\ \sin\theta_z & \cos\theta_z & 0 \\ 0 & 0 & 1 \end{pmatrix}    (3.2)
We calculate the translation as t = −RD + D, where D is a unit vector in the direction
of the point on the plane around which to rotate (calculated as D = K^{-1}m, where m is
a 2D point on the plane, usually the centroid, expressed in homogeneous coordinates).
After warping the image, the normal vector for this warped plane is Rn.
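To make the warping procedure concrete, here is a sketch of Equations 3.1 and 3.2 in Python with numpy and OpenCV. The conjugation by K (to apply the homography in pixel rather than normalised coordinates) and the function names are our own additions, and the warp direction is only correct up to the usual convention choices:

    import numpy as np
    import cv2

    def rotation(theta_x, theta_y, theta_z):
        # R = Rx Ry Rz, as in Equation 3.2 (angles in radians).
        cx, sx = np.cos(theta_x), np.sin(theta_x)
        cy, sy = np.cos(theta_y), np.sin(theta_y)
        cz, sz = np.cos(theta_z), np.sin(theta_z)
        Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
        Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
        Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
        return Rx @ Ry @ Rz

    def warp_view(image, K, n, m, theta_x, theta_y):
        # Approximate the view of a plane (normal n, unit distance) from a rotated camera,
        # rotating about the scene point that projects to the 2D point m (e.g. the centroid).
        R = rotation(theta_x, theta_y, 0.0)
        D = np.linalg.inv(K) @ np.array([m[0], m[1], 1.0])
        D = D / np.linalg.norm(D)                 # unit ray towards the rotation point
        t = -R @ D + D                            # keeps that point fixed under [R | t]
        H = R + np.outer(t, n)                    # Equation 3.1 with d = 1
        H_pix = K @ H @ np.linalg.inv(K)          # express in pixel coordinates
        warped = cv2.warpPerspective(image, H_pix, (image.shape[1], image.shape[0]))
        return warped, R @ n                      # warped image and rotated plane normal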
To generate new [ R | t ] pairs we step through angles in x and y in increments of 15°, up
to ±30° in both directions. While we could easily step in finer increments, or also apply
    rotations about the z -axis (which amounts to rotating the image), this quickly makes
training sets too large to deal with.

Figure 3.3: Training data after warping using a homography to approximate new viewpoints.

We also omit examples which are very distorted, by
checking whether the ratio of the eigenvalues of the resulting regions is within a certain
    range, to ensure that the regions are not stretched too much. Warping is also applied to
    the non-planar examples, but since these have no orientation specified, we use a normal
    pointing toward the camera, to generate the homography, under the assumption that the
    scene is approximately front-on. Warping these is not strictly necessary, but we do so to
    increase the quantity of non-planar examples available, and to ensure that the number
    of planar and non-planar examples remain comparable. Examples of warped images are
    shown in Figure 3.3.
    3.3 Image Representation
    We describe the visual information in the image regions of interest using local image
    descriptors. These represent the local gradient and colour properties, and are calculated
    for patches about a set of salient points. This means each region is described by a large
    and variable number of descriptors, using each individual point. In order to create more
    concise descriptions of whole regions, we employ the visual bag of words model, which
    represents the region according to a vocabulary; further reduction is achieved by discov-
    ering a set of underlying latent topics, compressing the bag of words information to a
    lower dimensional space. Finally we enhance the topic-based representation with spatial
    distribution information, an important step since the spatial configuration of various
    image features is relevant to identifying planes. This is done using spatial histograms,
    or ‘spatiograms’. Each step is explained in more detail in the following sections.
    3.3.1 Salient Points
    Only salient points are used, in order to reduce the amount of information we must deal
    with, and to focus only on those areas of the image which contain features relevant to our
    task. For example, it would be wasteful to represent image areas devoid of any texture
    as these contribute very little to the interpretation of the scene.
    Many alternatives exist for evaluating the saliency of points in an image. One popular
    choice is the FAST detector [110], which detects corner-like features. We experimented
    with this, and found it to produce not unreasonable results (see Section 4.1.2); however
    this detects features at only a single scale, ignoring the fact that image features tend
    to occur over a range of different scales [80]. Since generally there is no way to know a
    priori at which scale features will occur, it is necessary to analyse features at all possible
    scales [79].
    To achieve this we use the difference of Gaussians detector to detect salient points, which
    is well known as the first stage in computing the SIFT feature descriptor [81]. As well
    as a 2D location for interest points, this gives the scale at which the feature is detected.
    This is to be interpreted as the approximate size, in pixels, of the image content which
    leads the point to be considered salient.
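With OpenCV, for example, the difference-of-Gaussians keypoints and their scales can be obtained through the SIFT interface (a sketch only; the filename is a placeholder and the detector settings need not match those used here):

    import cv2

    image = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # placeholder path to a 320 x 240 frame
    sift = cv2.SIFT_create()                 # difference-of-Gaussians detector underneath
    keypoints = sift.detect(image, None)     # detection only; descriptors are not needed here

    for kp in keypoints[:5]:
        print(kp.pt, kp.size)                # 2D location and scale (salient patch size in pixels)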
    3.3.2 Features
    Feature descriptors are created about each salient point in the image. The patch sizes
    used for building the descriptors come from the points’ scales, since the scale values
    represent the regions of the image around the points which are considered salient. The
    basic features we use to describe the image regions capture both texture and colour
    information, these being amongst the most important basic visual characteristics visible
    in images.
    Texture information is captured using histograms of local gradient orientation; such
    descriptors have been successfully used in a number of applications, including object
    recognition and pedestrian detection [28, 82]. However one important difference between
    our descriptors and something like SIFT is that we do not aim to be invariant. While
    robustness to various image transformations and deformations is beneficial for reliably
    detecting objects, this would actually be a disadvantage to us. We are aiming to actually
    recover orientation, rather than be invariant to its effects, and so removing its effects
    from the descriptor would be counter-productive.
To build the histograms, the local orientation at each pixel is obtained by convolving
the image with the mask [-1, 0, 1] in the x and y directions separately, to approximate
the first derivatives of the image. This gives the gradient values G_x and G_y, for the
horizontal and vertical directions respectively, which can be used to obtain the angle θ
and magnitude m of the local gradient orientation:

\theta = \tan^{-1}\!\left(\frac{G_y}{G_x}\right), \qquad m = \sqrt{G_x^2 + G_y^2}    (3.3)
The gradient orientation and magnitude are calculated for each pixel within the image
patch in question, and gradient histograms are built by summing the magnitudes over
orientations, quantised into 12 bins covering the range [0, π). Our descriptors are actually
    formed from four such histograms, one for each quadrant of the image patch, then
    concatenated into one 48D descriptor vector. This is done in order to incorporate some
    larger scale information, and we illustrate it in Figure 3.4. This construction is motivated
    by histograms of oriented gradient (HOG) features [28], but omits the normalisation over
    multiple block sizes, and has only four non-overlapping cells.
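As a concrete illustration, the following is a minimal numpy sketch of such a quadrant-based descriptor. The 12 orientation bins, the four quadrants and the [-1, 0, 1] derivative mask follow the description above; the boundary handling, the unsigned orientation convention and the function name are assumptions of this sketch.

```python
import numpy as np

def gradient_descriptor(patch, n_bins=12):
    """48-D descriptor: a 12-bin gradient orientation histogram per quadrant.

    `patch` is a square greyscale image patch (2D float array). Boundary
    handling and the exact binning convention are assumptions of this sketch.
    """
    # First derivatives via the [-1, 0, 1] mask in x and y.
    gx = np.zeros_like(patch)
    gy = np.zeros_like(patch)
    gx[:, 1:-1] = patch[:, 2:] - patch[:, :-2]
    gy[1:-1, :] = patch[2:, :] - patch[:-2, :]

    theta = np.arctan2(gy, gx) % np.pi      # unsigned orientation in [0, pi)
    mag = np.sqrt(gx ** 2 + gy ** 2)

    h, w = patch.shape
    hists = []
    for rows, cols in [(slice(0, h // 2), slice(0, w // 2)),
                       (slice(0, h // 2), slice(w // 2, w)),
                       (slice(h // 2, h), slice(0, w // 2)),
                       (slice(h // 2, h), slice(w // 2, w))]:
        # Sum gradient magnitudes over quantised orientations in this quadrant.
        hist, _ = np.histogram(theta[rows, cols], bins=n_bins,
                               range=(0.0, np.pi),
                               weights=mag[rows, cols])
        hists.append(hist)
    return np.concatenate(hists)            # 4 x 12 = 48 dimensions
```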
The importance of colour information for geometric classification was demonstrated by
Hoiem et al. [66], and is used here as it may disambiguate otherwise difficult examples,
such as rough walls and foliage. To encode colour, we use RGB histograms, which are
formed by concatenating intensity histograms built from each of the three colour channels
of the image, each with 20 bins, to form a 60D descriptor.

Figure 3.4: An illustration showing how we create the quadrant-based gradient
histogram descriptor. An image patch is divided into quadrants, and a separate
orientation histogram created for each, which are concatenated.
    We use both types of feature together for plane classification; however, it is unlikely
    that colour information would be beneficial for estimating orientation of planar surfaces.
    Therefore, we use only gradient features for the orientation estimation step; the means
    by which we use separate combinations of features for the two tasks is described below.
    3.3.3 Bag of Words
    Each image region has a pair of descriptors for every salient point, which may number
    in the tens or hundreds, making it a rather rich but inefficient means of description.
    Moreover, each region will have a different number of points, making comparison prob-
    lematic. This is addressed by accumulating the information for whole regions in an
    efficient manner using the bag of words model.
    The bag of words model was originally developed in the text retrieval literature, where
    documents are represented simply by relative counts of words occurring in them; this
somewhat naïve approach has achieved much success in tasks such as document classifi-
    cation or retrieval [77, 115]. When applied to computer vision tasks, the main difference
    that must be accounted for is that there is no immediately obvious analogue to words
    in images, and so a set of ‘visual words’ are created. This is done by clustering a large
    set of example feature vectors, in order to find a small number of well distributed points
    in feature space (the cluster centres). Feature vectors are then described according to
    their relationship to these points, by replacing the vectors by the ID of the word (clus-
    ter centre) to which they are closest. Thus, rather than a large collection of descriptor
    vectors, an image region can be represented simply as a histogram of word occurrences
    over this vocabulary. This has been shown to be an effective way of representing large
    amounts of visual information [134]. In what follows, we use ‘word’ (or ‘term’) to refer
    to such a cluster centre, and ‘document’ is synonymous with image.
    We create two separate codebooks (sets of clustered features), for gradient and colour.
    These are formed by clustering a large set of features harvested from training images
    using K-means, with K clusters. Clustering takes up to several minutes per codebook,
    but must only be done once, prior to training (codebooks can be re-used for different
    training sets, assuming there is not too much difference in the features used).
Word histograms are represented as K-dimensional vectors w, with elements denoted
w_k, for k = 1, ..., K. Each histogram bin is w_k = |Λ_k|, where Λ_k is the set of points which
quantised to word k (i.e. w_k is the count of occurrences of word k in the document). To
reduce the impact of commonly occurring words (words that appear in all documents
are not very informative), we apply term frequency–inverse document frequency (tf-idf)
weighting as described in [84]. We denote word vectors after weighting by w′, and when
we refer to word histograms hereafter we mean those weighted by tf-idf, unless otherwise
stated.
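The following is a minimal sketch of this pipeline, assuming scikit-learn's K-means for the clustering; the particular tf-idf weighting shown is an assumption of this sketch, standing in for the exact scheme of [84].

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch: build a codebook from a pool of descriptors, then turn each region's
# descriptors into a tf-idf weighted word histogram.
def build_codebook(descriptor_pool, k=400):
    return KMeans(n_clusters=k, n_init=10).fit(descriptor_pool)

def word_histogram(codebook, region_descriptors):
    words = codebook.predict(region_descriptors)           # nearest cluster (word) IDs
    k = codebook.n_clusters
    return np.bincount(words, minlength=k).astype(float)   # w_k = |Lambda_k|

def tfidf_weight(histograms):
    """histograms: (M, K) matrix of word counts, one row per document/region."""
    tf = histograms / np.maximum(histograms.sum(axis=1, keepdims=True), 1)
    df = (histograms > 0).sum(axis=0)                      # documents containing each word
    idf = np.log(histograms.shape[0] / np.maximum(df, 1))
    return tf * idf
```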
    As discussed above, we use two different types of feature vector, of different dimension-
    ality. As such we need to create separate codebooks for each, for which the clustering
    and creation of histograms is independent. The result is that each training region has
    two histograms of word counts, for its gradient and colour features.
    3.3.4 Topics
    The bag of words model, as it stands, has a few problems. Firstly, there is no association
    between different words, so two words representing similar appearance would be deemed
    as dissimilar as any other pair of words. Secondly, as the vocabularies become large, the
    word histograms will become increasingly sparse, making comparison between regions
    unreliable as the words they contain are less likely to coincide.
    The answer again comes from the text retrieval literature, where similar problems (word
synonymy and high dimensionality) are prevalent. The idea is to make use of an under-
lying latent space amongst the words in a corpus, which can be thought of as representing
‘topics’, each of which should roughly correspond to a single semantic concept.
    It is not realistic to expect each document to correspond to precisely one latent topic, and
    so a document is represented by a distribution over topics. Since the number of topics
    is generally much less than the number of words, this means topic analysis achieves
    dimensionality reduction. Documents are represented as a weighted sum over topics,
    instead of over words, and ideally synonyms are implicitly taken care of by being mapped
    to the same semantic concept.
A variety of methods exist for discovering latent topics, which all take as input a ‘term-
document’ matrix. This is a matrix W = [w′_1, w′_2, ..., w′_M], where each column is the
weighted word vector for one of the M documents, so each row corresponds to a word k.
    3.3.4.1 Latent Semantic Analysis
The simplest of these methods is known as latent semantic analysis (LSA) [30], which
discovers latent topics by factorising the term-document matrix W, using the singular
value decomposition (SVD), into W = U D V^T. Here U is a K × K matrix, V is M × M,
and D is the diagonal matrix of singular values. The reduced dimensionality form is
obtained by truncating the SVD, retaining only the top T singular values (where T is
the desired number of topics), so that W ≈ U_t D_t V_t^T, where U_t is K × T and V_t is
M × T. The rows of matrix V_t are the reduced dimensionality description – the topic
vectors t_m – for each document m.
An advantage of LSA is that it is very simple to use, partly because topic vectors for
the image regions in the training set are extracted directly from V_t. Topic vectors
for new test image regions are calculated simply by projecting their word histograms
into the topic space, using t_j = D_t^{-1} U_t^T w′_j. However, LSA suffers from an important
problem, due to its use of SVD: the topic weights (i.e. the elements of the vectors t_j)
may be negative. This makes interpretation of the topic representation difficult (what
does it mean for a document to have a negative amount of a certain topic?), but more
importantly the presence of negative weights will become a problem when we come to
take weighted means of points’ topics (see Section 3.3.5), where all topic weights would
need to be non-negative.
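For reference, a minimal numpy sketch of LSA as described above, using the truncated SVD and the projection t_j = D_t^{-1} U_t^T w′_j; the function names are illustrative only.

```python
import numpy as np

# Sketch of LSA on a K x M term-document matrix W (tf-idf weighted).
def lsa_fit(W, n_topics):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    Ut = U[:, :n_topics]                  # K x T
    Dt = np.diag(s[:n_topics])            # T x T
    train_topics = Vt[:n_topics, :].T     # one T-dimensional topic vector per document
    return Ut, Dt, train_topics

def lsa_project(Ut, Dt, w_new):
    # t = Dt^{-1} Ut^T w' ; may contain negative entries, as noted above.
    return np.linalg.inv(Dt) @ (Ut.T @ w_new)
```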
    A later development called probabilistic latent semantic analysis (pLSA) [63] does pre-
    serve the non-negativity of all components (because they are expressed as probabilities),
    but its use of expectation maximisation to both find the factorisation, and get the topic
    vectors for new data, is infeasibly slow for our purposes. Fortunately, a class of methods
    known as non-negative matrix factorisation has been shown, under some conditions, to
    be equivalent to pLSA [31, 45].
    3.3.4.2 Non-negative Matrix Factorisation
    As the name implies, non-negative matrix factorisation (NMF) [76] is a method for
    factorising a matrix (in this case, the term-document matrix W ) into two factor matrices,
    with reduced dimensionality, where all the terms are positive or zero. The aim is to find
the best low-rank approximation W ≈ BT to the original matrix with non-negative
    terms. If W is known to have only non-negative entries (as is the case here, since its
    entries are word counts) so will B and T . Here, T is T × M and will contain the topic
    vectors corresponding to the columns of W ; and B can be interpreted as the basis of
    the topic space (of size K × T , the number of words and topics respectively). There are
    no closed form solutions to this problem, but Lee and Seung [76] describe an iterative
    algorithm, which can be started from a random initialisation of the factors.
As with LSA, topic vectors for training regions are simply the columns of T. Unfor-
tunately, although we can re-arrange the above equation to obtain t_j = B† w′_j (where
B† is the Moore-Penrose pseudoinverse) to get a low-dimensional approximation for test
vectors w′_j, there is no non-negativity constraint on B†. This means that the resulting
topic vectors t_j too will contain negative elements, and so we have the same problem as
with LSA.
    The problem arises from using the pseudoinverse. It is generally calculated using SVD,
    which as we saw above does not maintain non-negativity. One way to ensure that the
    inverse of B is non-negative is to make B orthogonal, since for a semi-orthogonal matrix
    (non-square) its pseudoinverse is also its transpose. This leads us to the methods of
    orthogonal non-negative matrix factorisation.
    3.3.4.3 Orthogonal Non-negative Matrix Factorisation
The objective of orthogonal non-negative matrix factorisation (ONMF) [18] is to factorise
as above, with the added constraint that B^T B = I. Left-multiplying W = BT by B^T
we get B^T W = B^T B T = T, and so it is now valid to project a word vector w′_j to
the topic space by t_j = B^T w′_j, such that the topic vector contains only non-negative
elements. In practice, these factors are found by a slight modification of the method of
NMF [135], and so the update rules used for ONMF are:

B_{nt} \leftarrow B_{nt} \frac{(W T^T)_{nt}}{(B T W^T B)_{nt}}, \qquad
T_{tm} \leftarrow T_{tm} \frac{(B^T W)_{tm}}{(B^T B T)_{tm}}    (3.4)
    The result is a factorisation algorithm which directly gives the topic vectors for each
    training example used in creating the term-document matrix, and a fast linear method
    for finding the topic vector of any new test word histogram, simply by multiplying with
    the transpose of the topic basis matrix B .
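A minimal numpy sketch of ONMF as used here, implementing the multiplicative updates of (3.4) and the projection t = B^T w′; the random initialisation, the fixed iteration count and the small epsilon are assumptions of this sketch.

```python
import numpy as np

def onmf(W, n_topics, n_iters=200, eps=1e-9):
    """Orthogonal NMF sketch: W (K x M, non-negative) ~ B T with B^T B ~ I.

    Uses the multiplicative updates of equation (3.4); iteration count and
    random initialisation are assumptions of this sketch.
    """
    K, M = W.shape
    rng = np.random.default_rng(0)
    B = rng.random((K, n_topics))
    T = rng.random((n_topics, M))
    for _ in range(n_iters):
        B *= (W @ T.T) / (B @ T @ W.T @ B + eps)
        T *= (B.T @ W) / (B.T @ B @ T + eps)
    return B, T                      # columns of T are the training topic vectors

def project_to_topics(B, w_new):
    # Fast linear projection of a new word histogram: t = B^T w'.
    return B.T @ w_new
```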
(a) The words in the image, each with a unique colour; (b) points contributing to Topic 16;
(c) points contributing to Topic 5.
Figure 3.5: For a given planar region (a), on which we have drawn coloured
points to indicate the different words, we can also show the weighting of these
points to different topics, namely Topics 16 (b) and 5 (c), corresponding to those
visualised below.
    3.3.4.4 Topic Visualisation
So far, in considering the use of topic analysis, we have treated it simply as a means of
dimensionality reduction. However, just as topics should represent underlying semantic
concepts in text documents, they should also represent similarities across our visual words.
Indeed, it is partly in order to capture otherwise overlooked similarities between different
words that we have used topic analysis, and so it would be interesting to check whether this
is actually the case.
    First, consider the example plane region in Figure 3.5a, taken from our training set. In
    the left image, we have shown all the words, each with a unique colour – showing no
    particular structure. The two images to its right show only those points corresponding
    to certain topics (Topic 16 and Topic 5, from a gradient-only space of 20 topics), via the
    words that the features quantise to, where the opacity of the points represent the extent
    to which the word contributes to that topic. It appears that words contributing to Topic
    16 tend to occur on the tops and bottoms of windows, while those for Topic 5 lie within
    the windows, suggesting that they may be picking out particular types of structure.
    We expand upon this, by more directly demonstrating what the topics represent. Vi-
    sualising words and topics can be rather difficult, since both exist in high dimensional
    spaces, and because topics represent a distribution over all the words, not simply a re-
    duced selection of important words. Nevertheless, it should be the case that the regions
    which quantise to the words which contribute most strongly to a given topic will look
    more similar to each other than other pairs of words.
(a) Word 336; (b) Word 84; (c) Word 4.
Figure 3.6: A selection of patches quantising to the top three words for Topic
16, demonstrating different kinds of horizontal edge.

(a) Word 376; (b) Word 141; (c) Word 51.
Figure 3.7: Patches representing the top three words for Topic 5, which appear
to correspond to different kinds of grid pattern (or, a combination of vertical
and horizontal edges; a few apparent outliers are shown toward the right,
although these retain a similar pattern of edges).
    For the image region above, we looked at the histogram of topics, and found that the
    two highest weighted were Topics 16 and 5 (those shown above). Rather than showing
these topics directly, we can find which words contribute most strongly to those topics.
    For Topic 16, the highest weighted words (using the word-topic weights from B ) were
    336, 84 and 4; and for Topic 5, the words with the highest weights were 376, 141 and
    51. Again, these numbers have little meaning on their own, being simply indices within
    the unordered set of 400 gradient words we used.
    We show example image patches for each of these words, for those two topics, in Figures
    3.6 and 3.7. These patches were extracted from the images originally used to create
    the codebooks (the patches are of different sizes due to the use of multi-scale salient
    point detection, but have been resized for display). If the clustering has been performed
    correctly there should be similarities between different examples of the same word; and
    according to the idea behind latent topic analysis, we should also find that the words
    assigned to the same topic are similar to each other.
    This does indeed appear to be the case. The three words shown for Topic 16 represent
    various types of horizontal features. Because we use gradient orientation histograms as
    features, the position of these horizontal edges need not remain constant, nor the direc-
    tion of intensity change. Similarly, Topic 5 appears to correspond to grid-like features,
    such as window panes and tiles. The fact that these groups of words have been placed
    together within topics suggests topic analysis is performing as desired, in that it is able
    to link words with similar conceptual features which would usually – in a bag of words
    model – be considered as different as any other pair of words. Not only does this grouping
    suggest the correct words have been associated together, but it agrees with the location
    of the topics as visualised in Figure 3.5, which are concentrated on horizontal window
    edges and grid-like window panes respectively.
    We acknowledge that these few examples do not conclusively show that this happens
    consistently, but emphasise that our method does not depend on it: it is sufficient that
    ONMF performs dimensionality reduction in the usual sense of the word. Nevertheless,
    it is reassuring to see such structure spontaneously emerge from ONMF, and for this to
    be reflected in the topics found in a typical image from our training set.
    3.3.4.5 Combining Features
    The above discussion assumes there is only one set of words and documents, and does not
    deal with our need to use different vocabularies for different tasks. Fortunately, ONMF
    makes it easy to combine the information from gradient and colour words. This is done
    by concatenating the two term-document matrices for the corpus, so that effectively
    each document has a word vector of length 2 K (assuming the two vocabularies have the
    same number of words). This concatenated matrix is the input to ONMF, which means
    the resulting topic space is over the joint distribution of gradient and colour words, so
    should encode correlations between the two types of visual word. Generally we double
    the number of topics, to ensure that using both together retains the details from either
    used individually.
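The construction of the two topic spaces can be sketched as follows, reusing the onmf function from the sketch above; the matrix names and the helper itself are illustrative, not part of the original implementation.

```python
import numpy as np

# Sketch: joint topic space over gradient and colour words. W_grad and W_col
# are the (K x M) weighted term-document matrices for the same M regions, and
# W_grad_planar uses only the planar regions. Topic counts follow the text.
def build_topic_spaces(W_grad, W_col, W_grad_planar, n_topics=20):
    W_joint = np.vstack([W_grad, W_col])            # each document now has 2K words
    B_joint, T_joint = onmf(W_joint, 2 * n_topics)  # classification topic space
    B_grad, T_grad = onmf(W_grad_planar, n_topics)  # regression (gradient-only) space
    return (B_joint, T_joint), (B_grad, T_grad)
```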
    When regressing the orientation of planes, colour information is not needed. We run
    ONMF again to create a second topic space, using a term-document matrix built only
    from gradient words, and only using planar image regions. This means that there are two
    topic spaces, a larger one containing gradient and colour information from all regions,
    and a smaller one, of lower dimensionality, using only the gradient information from the
    planar regions.
    3.3.5 Spatiograms
    The topic histograms defined above represent image regions compactly; however, we
    found classification and regression accuracy to be somewhat disappointing using these
    alone (see our experiments in Section 4.1.1). A feature of the underlying bag of words
    model is that all spatial information is discarded. This also applies to the latent topic
    representation. Although this has not hampered performance in tasks such as object
recognition, for our application the relative spatial positions of features are likely to be
    important. For example, some basic visual primitives may imply a different orientation
    depending on their relative position.
    It is possible to include spatial information by representing pairwise co-occurrence of
    words [9], by tiling overlapping windows [131], or by using the constellation or star models
    [37, 38]. While the latter are effective, they are very computationally expensive, and scale
    poorly in terms of the number of points and categories. Instead, we accomplish this by
    using spatiograms. These are generalisations of histograms, able to include information
    about higher-order moments. These were introduced by Birchfield and Rangarajan [10]
    in order to improve the performance of histogram-based tracking, where regions with
    different appearance but similar colours are too easily confused.
    We use second-order spatiograms, which as well as the occurrence count of each bin, also
    encode the mean and covariance of points contributing to that bin. These represent the
    spatial distribution of topics (or words), and they replace the topic histograms above
    (not the gradient or colour histogram descriptors). While spatiograms have been useful
    for representing intensity, colour and terrain distribution information [10, 52, 83], to our
    knowledge they have not previously been used with a bag of words model.
We first describe spatiograms over words, in order to introduce the idea. A word spa-
tiogram s^word, over K words, is defined as a set of K triplets s_k^word = (h_k^word, µ_k^word, Σ_k^word),
where the histogram values h_k^word = w′_k are the elements of the word histogram as above,
and the mean and (unbiased) covariance are defined as

\mu_k^{word} = \frac{1}{|\Lambda_k|} \sum_{i \in \Lambda_k} v_i, \qquad
\Sigma_k^{word} = \frac{1}{|\Lambda_k| - 1} \sum_{i \in \Lambda_k} v_i^k (v_i^k)^T    (3.5)

where v_i is the 2D coordinate of point i and v_i^k = v_i − µ_k^word, and as above Λ_k is the
subset of points whose feature vectors quantise to word k.
    To represent 2D point positions within spatiograms, we normalise them with respect
    to the image region, leaving us with a set of points with zero mean, rather than being
    coordinates in the image. This gives us a translation invariant region descriptor. Without
    this normalisation, a similar patch appearing in different image locations would have
    a different spatiogram, and so would have a low similarity score, which may adversely
affect classification. The shift is achieved by replacing the v_i in the equations above with
v_i − \frac{1}{N} \sum_{i=1}^{N} v_i. Note that by shifting the means, the covariances (and histogram
values) are unaffected.
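A numpy sketch of the word spatiogram of equation (3.5), including the region-centring just described; the function signature is an assumption of this sketch.

```python
import numpy as np

def word_spatiogram(points, word_ids, word_hist, n_words):
    """Second-order spatiogram over words (equation 3.5).

    points:    (N, 2) coordinates of the region's salient points
    word_ids:  (N,) word index of each point
    word_hist: (K,) tf-idf weighted word histogram h_k
    """
    points = points - points.mean(axis=0)          # centre coordinates on the region
    means = np.zeros((n_words, 2))
    covs = np.zeros((n_words, 2, 2))
    for k in range(n_words):
        vk = points[word_ids == k]
        if len(vk) > 0:
            means[k] = vk.mean(axis=0)
        if len(vk) > 1:
            d = vk - means[k]
            covs[k] = d.T @ d / (len(vk) - 1)      # unbiased covariance
    return word_hist, means, covs
```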
    The definition of word spatiograms is fairly straightforward, since every 2D point quan-
    tises to exactly one word. Topics are not so simple, since each word has a distribution
    over topics, and so each 2D point contributes some different amount to each topic. There-
    fore, rather than simply a sum over a subset of points, the mean and covariance will be
    a weighted mean and a weighted (unbiased) covariance over all the points, according to
    their contribution to each topic. It is because of this calculation of weighted means that
    the topic weights cannot be negative, as ensured by our use of ONMF, discussed above.
Rather than a sum over individual points, we can instead express topic spatiograms as
a sum over the words, since all points corresponding to the same word contribute the
same amount to their respective topics. The resulting topic spatiograms s^topic, defined
over T topics, consist of T triplets s_t^topic = (h_t^topic, µ_t^topic, Σ_t^topic). Here, the scalar elements
h_t^topic are from the region's topic vector as defined above, while the mean and covariance
are calculated using

\mu_t^{topic} = \frac{1}{\alpha_t} \sum_{k=1}^{K} \eta_{tk} \mu_k^{word}, \qquad
\Sigma_t^{topic} = \frac{\alpha_t}{\alpha_t^2 - \beta_t} \sum_{k=1}^{K} \frac{\eta_{tk}}{|\Lambda_k|}
\sum_{i \in \Lambda_k} v_i^t (v_i^t)^T    (3.6)

where v_i^t = v_i − µ_t^topic, α_t = Σ_{k=1}^K η_tk, and β_t = Σ_{k=1}^K η_tk² / |Λ_k|. The weights η_tk are given by
η_tk = B_tk w′_k and reflect both the importance of word k through w′_k and its contribution to
topic t via B_tk (elements from the topic basis matrix). As in Section 3.3.4.5, we maintain
    two spatiograms per (planar) image region, one for gradient and colour features, and one
    for just gradient features. We illustrate the data represented by spatiograms in Figure
    3.8, where we have shown an image region overlaid with some of the topics to which the
    image features contribute (see Figure 3.5 above), as well as the mean and covariance for
    the topics, which is encoded within the spatiogram.
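For clarity, a numpy sketch of the topic spatiogram of equation (3.6), which forms the weighted means and covariances from the per-word point sets; the variable names and the handling of empty words are assumptions of this sketch.

```python
import numpy as np

def topic_spatiogram(points, word_ids, word_hist, topic_vec, B):
    """Topic spatiogram sketch (equation 3.6).

    points (N,2), word_ids (N,), word_hist (K,) weighted histogram w',
    topic_vec (T,) the region's topic vector, B (K,T) the ONMF topic basis.
    """
    points = points - points.mean(axis=0)           # region-centred coordinates
    K, T = B.shape
    counts = np.array([(word_ids == k).sum() for k in range(K)], dtype=float)
    word_means = np.array([points[word_ids == k].mean(axis=0) if counts[k] else np.zeros(2)
                           for k in range(K)])
    eta = B * word_hist[:, None]                    # eta[k, t] = B_kt * w'_k
    alpha = eta.sum(axis=0)                         # (T,)
    beta = (eta ** 2 / np.maximum(counts, 1)[:, None]).sum(axis=0)

    means = (eta.T @ word_means) / alpha[:, None]   # (T, 2) weighted means
    covs = np.zeros((T, 2, 2))
    for t in range(T):
        S = np.zeros((2, 2))
        for k in range(K):
            if counts[k] == 0:
                continue
            d = points[word_ids == k] - means[t]
            S += (eta[k, t] / counts[k]) * (d.T @ d)
        covs[t] = alpha[t] / (alpha[t] ** 2 - beta[t]) * S
    return topic_vec, means, covs
```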
To use the spatiograms for classification and regression, we use a similarity measure
proposed by Ó Conaire et al. [97]. This uses the Bhattacharyya coefficient to compare
spatiogram bins, and a measure of the overlap of Gaussian distributions to compare
their spatial distribution.
Figure 3.8: An illustration of what is represented by topic spatiograms. The
points show the contribution of individual points to each topic (a,c), and the
spatiograms represent the distribution of these contributions (b,d), displayed
here as an ellipse showing the covariance, centred on the mean, for individual
topics.
For two spatiograms s_A and s_B of dimension D, this similarity function is defined as

\rho(s_A, s_B) = \sum_{d=1}^{D} \sqrt{h_d^A h_d^B} \; 8\pi \,
|\Sigma_d^A \Sigma_d^B|^{\frac{1}{4}} \,
\mathcal{N}\big(\mu_d^A ; \mu_d^B, 2(\Sigma_d^A + \Sigma_d^B)\big)    (3.7)
    where N ( x ; µ , Σ ) is a Gaussian with mean µ and covariance matrix Σ evaluated at x .
    Following [97], we use a diagonal version of the covariance matrices since it simplifies the
    calculation.
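A numpy sketch of this similarity measure, using the diagonal covariance approximation mentioned above; the handling of degenerate bins is an assumption of this sketch.

```python
import numpy as np

def spatiogram_similarity(hA, muA, SA, hB, muB, SB):
    """Similarity of equation (3.7), with diagonal 2x2 covariances.

    h*: (D,) bin weights; mu*: (D, 2) means; S*: (D, 2, 2) covariances.
    """
    rho = 0.0
    for d in range(len(hA)):
        Sa = np.diag(np.diag(SA[d]))               # diagonal approximation
        Sb = np.diag(np.diag(SB[d]))
        Ssum = 2.0 * (Sa + Sb)
        det = np.linalg.det(Ssum)
        if det <= 0:
            continue                               # degenerate bin: skip (assumption)
        diff = muA[d] - muB[d]
        gauss = np.exp(-0.5 * diff @ np.linalg.inv(Ssum) @ diff) / (2.0 * np.pi * np.sqrt(det))
        rho += np.sqrt(hA[d] * hB[d]) * 8.0 * np.pi * np.linalg.det(Sa @ Sb) ** 0.25 * gauss
    return rho
```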
    3.4 Classification
    After compactly representing the image regions with spatiograms, they are used to train
    a classifier. We use the relevance vector machine (RVM) [125], which is a sparse kernel
    method, conceptually similar to the more well-known support vector machine (SVM). We
    choose this classifier because of its sparse use of training data: once it has been trained,
    only a small subset of the data need to be retained (fewer even than the SVM), making
    classification very fast even for very large training sets. The RVM gives probabilistic
    outputs, representing the posterior probability of belonging to either class.
    3.4.1 Relevance Vector Machines
    The basic model of the RVM is very similar to standard linear regression, in which the
    output label y for some input vector x is modelled as a weighted linear combination of
M fixed (potentially non-linear) basis functions [11]:

y(x) = \sum_{i=1}^{M} \omega_i \phi_i(x) = \omega^T \phi(x)    (3.8)

where ω = (ω_1, ..., ω_M)^T is the vector of weights, for some number M of basis functions
φ(x) = (φ_1(x), ..., φ_M(x))^T. If we choose the basis functions such that they are given by
kernel functions, where there is one kernel function for each of the N training examples
(a similar structure to the SVM), (3.8) can be re-written:

y(x) = \sum_{n=1}^{N} \omega_n k(x, x_n) + b    (3.9)
    where x n are the training data, for n = 1 ,...,N , and b is a bias parameter. The kernel
    functions take two vectors as input and return some real number. This can be considered
    as a dot product in some higher dimensional space — indeed, under certain conditions, a
    kernel can be guaranteed to be equivalent to a dot product after some transformation of
    the data. This mapping to a higher dimensional space is what gives kernel methods their
    power, since the data may be more easily separable in another realm. Since the mapping
    need never be calculated explicitly (it is only ever used within the kernel function), this
    means the benefits of increased separability can be attained without the computational
    effort of working in higher dimensions (this is known as the ‘kernel trick’ [11]).
    Training the RVM is done via the kernel matrix K , where each element is the kernel
    function between two vectors in the training set, K ij = k ( x i , x j ). The matrix is sym-
    metric, which saves some time during computation, but calculating it is still quadratic
    in the number of data. This becomes a problem when using such kernel methods on
    very large datasets (the RVM training procedure is cubic, due to matrix inversions [11],
    although iteratively adding relevance vectors can speed it up somewhat [126]).
    While equation 3.9 is similar to standard linear regression, the important difference here
    is that rather than having a single shared hyperparameter over all the weights, the RVM
    introduces a separate Gaussian prior for each ω i controlled by a hyperparameter α i . Dur-
    ing training, many of the α i tend towards infinity, meaning the posterior probability of
    the associated weight is sharply peaked at zero — thus the corresponding basis functions
    are effectively removed from the model, leading to a significantly sparsified form.
    The remaining set of training examples – the ‘relevance vectors’ – are sufficient for
    prediction for new test data. These are analogous to the support vectors in the SVM,
    though generally far fewer in number; indeed, in our experiments we observed an over
    95% reduction in training data used, for both classification and regression. Only these
    data (and their associated weights) need to be stored in order to use the classifier to
predict the target value of a new test datum x′. Classification is done through a new
kernel matrix K′ (here being a single column), whose elements are calculated as
K′_r = k(x_r, x′) for R relevance vectors indexed r = 1, ..., R. The prediction is then
calculated simply by matrix multiplication:

y(x') = \omega^T K'    (3.10)

where ω is again the vector of weights. For regression problems, the target value is
simply y(x′); whereas for classification, this is transformed by a logistic sigmoid,

p(x') = \sigma(y(x')) = \sigma(\omega^T K'), \qquad \sigma(x) = \frac{1}{1 + \exp(-x)}    (3.11)

This maps the outputs to p(x′) ∈ (0, 1), which is interpreted as the probability of the
test datum belonging to the positive class. Thresholding this at 0.5 (equal probability
of either class) gives a binary classification. In our work we used the fast [126] ‘Sparse
Bayes’ implementation of the RVM made available by the authors³.
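A minimal sketch of RVM prediction as in (3.10) and (3.11), given an already trained set of relevance vectors and weights; the explicit bias term and the function names are assumptions of this sketch.

```python
import numpy as np

def rvm_predict(x_new, relevance_vectors, weights, bias, kernel, classify=True):
    """Prediction sketch for a trained RVM (equations 3.10 and 3.11).

    relevance_vectors: the retained training data x_r
    weights:           (R,) weight vector (the bias is an assumption, cf. (3.9))
    kernel:            function k(x_r, x_new) -> float
    """
    k_col = np.array([kernel(x_r, x_new) for x_r in relevance_vectors])
    y = weights @ k_col + bias
    if not classify:
        return y                                    # regression: the target value
    return 1.0 / (1.0 + np.exp(-y))                 # classification: P(positive class)
```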
    The above describes binary classification or single-variable regression in the standard
    RVM. In order to regress multi-dimensional data – in our case the three components of the
    normal vectors – we use the multi-variable RVM (MVRVM) developed by Thayananthan
    et al. [123]; the training procedure is rather different from the above, but we omit details
    here (see [11]). This regresses over each of the D output dimensions simultaneously,
    using the same set of relevance vectors for each. Regression is achieved as in (3.10),
but now the weights form an R × D matrix, with a column for each dimension of the
target variable, and y(x′) is given by the product of the transpose of this weight matrix
with K′. For this we adapted code available online⁴. For both classification and regression,
we can predict values for many data simultaneously if necessary, simply by adding more
columns to K′.

³ www.vectoranomaly.com/downloads/downloads.htm
    All that remains is to specify the form of kernel function we use. When using word
    or topic histograms, we experimented with a variety of standard histogram similarity
    measures, such as the Bhattacharyya coefficient and cosine similarity; for spatiograms,
    we use various functions of equation 3.7. More details of the different functions can be
    found in our experiments in Section 4.1.4.
    3.5 Summary
    This concludes our description of the plane recognition algorithm, for distinguishing
    planes from non-planes in given image regions and estimating their orientation. To
    summarise: we represent data for regions by detecting salient points, which are described
    using gradient and colour features; these are gathered into a bag of words, reduced with
    topic analysis, and enhanced with spatial information using spatiograms. These data
    are used to train an RVM classifier and regressor. For a test region, once the spatiogram
    has been calculated, the RVMs are used to classify it and, if deemed planar, to regress
    its orientation.
    However, it is important to realise that, as successful as this algorithm may be, it is
    only able to estimate the planarity and orientation for a given region of the image. It
    is not able to detect planes in a whole image, since there is no way of knowing where
    the boundaries between regions are, and this is something we shall return to in Chapter
    5. Before that, in the next chapter we present a thorough evaluation of this algorithm,
    both in cross-validation, to investigate the effects of the details discussed above; and
    an evaluation on an independent test set, showing that it can generalise well to novel
    environments.
⁴ mi.eng.cam.ac.uk/~at315/MVRVM.htm
    CHAPTER 4
    Plane Recognition Experiments
    In this chapter, we present experimental results for our plane recognition algorithm. We
    show how the techniques described in the previous chapter affect recognition, for example
    how the inclusion of spatial distribution information makes classification and regression
    much more accurate, and the benefits of using larger amounts of training data. We
    describe experiments on an independent set of data, demonstrating that our algorithm
    works not only on our initial training and validation set, but also in a wider context;
    and compare to an alternative classifier.
    Our experiments on the plane recognition algorithm were conducted as follows: we began
    with a set of training data (individual, manually segmented regions, not whole images),
    which we represented using the steps discussed in Section 3.3. The resulting descriptions
    of the training regions were used to train the RVM classifiers. Only at this point were
    the test data introduced, so the creation of the latent space and the classifiers was
    independent of the test data.
    4.1 Investigation of Parameters and Settings
    For the first set of experiments, we used only our training set, collected from urban
    locations at the University of Bristol and the surrounding area. This consisted of 556
    image regions, each of which was labelled as described in Section 3.2; these were reflected
    to create our basic dataset of 1112 regions. We also warped these regions, as described
    in Section 3.2.2, to obtain a total of 7752 regions. The effect of using these extra regions
    is discussed later.
    All of these experiments used five-fold cross-validation on this training set — the train
    and test folds were kept independent, the only association between them being that the
    data came from the same physical locations, and that the features used to build the bag
    of words codebooks could have overlapped both train and test sets. We also ensured
    that warped versions of a region never appeared in the training set when the original
    region was in the test set (since this would be a potentially unfairly easy test). All
    of the results quoted below came from ten independent runs of cross-validation, from
    which we calculated the mean and standard deviation, which are used to draw error bars.
    The error bars on the graphs show one standard deviation either side of the mean, over
    all the runs of cross-validation (i.e. it is the standard deviation of the means from the
    multiple cross-validation runs, and not derived from the standard deviations of individual
    cross-validations).
    4.1.1 Vocabulary
    The first experiment investigated the performance for different vocabulary sizes, for
    different basic representations of the regions. The vocabulary size is the number of
    clusters K used in the K-means algorithm, and by testing using different values for K
    we could directly see what effect this had on plane recognition (instead of choosing the
    best K according to the distortion measure [11], for example).
    This experiment also compared four different representations of the data. First, we used
    weighted word histograms as described in Section 3.3.3 — this representation does not
    use the latent topics, nor use spatial information. We compared this to the basic topic
    histogram representation, where the word vectors have been projected into the topic
    space to reduce their dimensionality (refer to Section 3.3.4); in these experiments we
    always used a latent space of 20 topics, but found performance to be robust to different
    values. Next we created spatiograms for both word and topic representations, using
    equations 3.5 and 3.6.
    The experiment was conducted as follows: for each different vocabulary size, we created
    a new codebook by clustering gradient features harvested from a set of around 100
    exemplary images. The same set of detected salient points and feature descriptors was
    used each time (since these are not affected by the vocabulary size); and for each repeat
    run of cross-validation the same vocabulary was used, to avoid the considerable time it
    takes to re-run the clustering. This was done for each of the four region descriptions,
    then the process was repeated for each of the vocabulary sizes. We used only gradient
    features for this to simplify the experiment — so the discussion of combining vocabularies
    (Section 3.3.4.5) is not relevant at this point. In this experiment (and those that follow),
    we used only the original marked-up regions, and their reflections – totalling 1112 regions
    – since warping was not yet confirmed to be useful, and to keep the experiment reasonably
    fast. The kernels used for the RVMs were polynomial sums (see Section 4.1.4) of the
    underlying similarity measure (Bhattacharyya for histograms and spatiogram similarity
    [97] for spatiograms).
Figure 4.1: Performance of plane classification (a) and orientation estimation (b)
as the size of the vocabulary was changed, for word histograms, topic histograms,
word spatiograms and topic spatiograms.
    The results are shown in Figure 4.1, comparing performance for plane classification (a)
    and orientation regression (b). It is clear that spatiograms outperformed histograms,
    over all vocabulary sizes, for both error measures. Furthermore, using topics also tended
    to increase performance compared to using words directly, especially as the vocabulary
    size increased. As one would expect, there was little benefit in using topics when the
    number of topics was approximately the same as the number of words, and when us-
    ing small vocabularies, performance of word spatiograms was as good as using topic
    spatiograms. However, since this relies on using a small vocabulary it would overly con-
    strain the method (and generally larger vocabularies, up to an extent, give improved
    results [68]); and best performance was only seen when using around 400 words, the
    difference being much more pronounced for orientation estimation. This experiment
    confirms the hypothesis advanced in Section 3.3 that using topic analysis plus spatial
    representation is the best way, of those studied, for representing region information.
    4.1.2 Saliency and Scale
    In Section 3.3.1 we discussed the choice of saliency detector, and our decision to use a
    multi-scale detector to make the best use of the information at multiple image scales.
    We tested this with an experiment, comparing the performance of the plane recognition
    algorithm when detecting points using either FAST [110] or the difference of Gaussians
    (DoG) detector [81]. FAST gives points’ locations only, and so we tried a selection of
    possible scales to create the descriptors. We did the same using DoG (i.e. ignoring the
    scale value and using fixed patch sizes), to directly compare the type of saliency used.
    Finally, we used the DoG scale information to choose the patch size, in order to create
    descriptors to cover the whole area deemed to be salient.
Figure 4.2: The effect of patch size (in pixels) on orientation accuracy, for different
means of saliency detection (FAST, DoG position only, and DoG with scale).
    Our results are shown in Figure 4.2, for evaluation using the mean angular error of
    orientation regression (we found that different saliency detectors made no significant
    difference to classification accuracy). It can be seen that in general, DoG with fixed patch
    sizes out-performed FAST, although at some scales the difference was not significant.
    This suggests that the blob-like features detected by DoG might be more appropriate
    for our plane recognition task than the corner-like features found with FAST. The green
    line shows the performance when using the scale chosen by DoG (thus the patch size
    axis has no meaning, hence the straight line), which in most cases was clearly superior
    to both FAST and DoG with fixed patch sizes.
    We note that at a size of 15 pixels, a fixed patch size appeared to out-perform scale
    selection, and it may be worth investigating further since this would save computational
    effort. However, we do not feel this one result is enough yet to change our method, as
    it may be an artefact of these data. We conclude that this experiment broadly supports
    our reasons for using scale selection, rather than relying on any particular patch size;
    especially given that we would not know the best scale to use with a particular dataset.
    4.1.3 Feature Representation
    The next experiment compared performance when using different underlying feature
    representations, namely the gradient and colour features as described in Section 3.3.2.
    In these experiments, we used a separate vocabulary for gradient and colour descriptors,
    using K-means as before (having fixed the number of words, on the basis of the earlier
    experiment, at 400 words for the gradient vocabulary, and choosing 300 for colour).
    For either feature descriptor type used in isolation, the testing method was effectively
    the same; when using both together, we combined the two feature representations by
    concatenating their word histograms before running ONMF (words were re-numbered as
    appropriate, so that word i in the colour space became word K g + i in the concatenated
    space, where K g is the number of words in the gradient vocabulary). Since this used
    around twice as much feature information, we doubled the number of topics to 40 for the
    concatenated vocabularies, which we found to improve performance somewhat (simply
    doubling the number of topics for either feature type in isolation showed little difference,
    so any improvement will be due to the extra features).
These experiments were again conducted using ten runs of five-fold cross-validation, using
    topic spatiograms, after showing them to be superior in the experiment above. We used
    the non-warped set of 1112 regions, with the polynomial sum kernel (see below), and the
    vocabularies were fixed throughout.
    Table 4.1 shows the results, which rather interestingly indicate that using colour on its
    own gave superior performance to gradient information. This is somewhat surprising
    given the importance of lines and textural patterns in identifying planar structure [44],
    although as Hoiem et al. [66] discovered, colour is important for geometric classification.
    It is also important to remember that we use spatiograms to represent the distribution
    of colour words, not simply using the mean or histogram of regions’ colours directly.
    Even so, our hypothesis that using both types of feature together is superior is verified,
    since the concatenated gradient-and-colour descriptors performed better than either in
    isolation, suggesting that the two are representing complementary information.
                                 Gradient      Colour       Gradient & Colour
Classification Accuracy (%)      86.5 (1.8)    92.5 (0.5)   93.9 (2.8)
Orientation Error (deg)          13.1 (0.2)    28.4 (0.3)   17.9 (0.7)

Table 4.1: Comparison of average classification accuracy and orientation er-
ror when using gradient and colour features. Standard deviations are shown in
parentheses.
    On the other hand, as we expected, colour descriptors fared much worse when estimating
    orientation, and combining the two feature types offered no improvement. This stands
    to reason, since the colour of a region – beyond some weak shading information, or the
    identity of the surface – should not give any indication to its 3D orientation, whereas
    texture will [107]. Adding colour information would only serve to confuse matters, and
    so the best approach is simply to use only gradient information for orientation regression.
    To summarise, the image representations we use, having verified this by experiment,
    are computed as follows: gradient and colour features are created for all salient points
    in all regions, which are used to create term-document matrices for the two vocabu-
    laries. These are used to create a combined 40D topic space, encapsulating gradient
    and colour information, and this forms the classification topic space. Then, a separate
    term-document matrix is built using only planar regions, using only gradient features, to
    create a second 20D gradient only topic space, to be used for regression. This means that
    each region will have two spatiograms, one of 40 dimensions and one of 20 dimensions,
    for classification and regression respectively.
Data          Name                        Function k(x_i, x_j) =
Histogram     Linear                      x_i^T x_j
              Euclidean                   ||x_i − x_j||
              Bhattacharyya               Σ_d √(x_{id} x_{jd})
              Bhattacharyya Polynomial    Σ_{q=1}^{Q} ( Σ_d √(x_{id} x_{jd}) )^q
Spatiogram    Spatiogram                  ρ(s_i, s_j)  (see (3.7))
              Gaussian                    exp( −(1 − ρ(s_i, s_j))^p / 2σ² )
              Spatiogram Polynomial       Σ_{q=1}^{Q} ρ(s_i, s_j)^q
              Logistic                    1 / (1 + exp(−ρ(s_i, s_j)))
              Weighted Polynomial         Σ_{q=1}^{Q} w_q ρ(s_i, s_j)^q

Table 4.2: Description of the kernel functions used by the RVM, for histograms
and spatiograms.
    4.1.4 Kernels
    Next, we compared the performance of various RVM kernels, for both classification
    and orientation estimation (this test continued to use only gradient features for ease of
    interpreting the results). In this experiment, we compared kernels on both histograms
    and spatiograms.
    For histograms, we used various standard comparison functions, including Euclidean,
    cosine, and Bhattacharyya distances, as well as simply the dot product (linear kernel).
    For spatiograms, since they do not lie in a vector space, we could only use the spatiogram
    similarity measure, denoted ρ , from equation 3.7 [97], and functions of it. Variations we
    used include the original measure, the version with diagonalised covariance, a Gaussian
    radial basis function, and polynomial functions of the spatiogram similarity. The latter
    were chosen in order to increase the complexity of the higher dimensionality space and
    strive for better separability. These functions are described in Table 4.2.
    Figure 4.3 shows the results. It is clear that as above the spatiograms out-performed his-
    tograms in all cases, though the difference is less pronounced for classification compared
    to regression. Of the spatiogram kernels, the polynomial function showed superior per-
    formance (altering the weights for each degree seemed to make no difference). Therefore
    we chose the unweighted polynomial sum kernel for subsequent testing (and we set the
    maximum power to Q = 4).
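For concreteness, the chosen kernel can be sketched as follows, reusing the spatiogram_similarity sketch given alongside equation (3.7); the helper name is illustrative.

```python
def poly_sum_kernel(s_i, s_j, Q=4):
    """Unweighted polynomial sum of the spatiogram similarity (Table 4.2).

    s_i, s_j are (h, mu, Sigma) spatiogram triplets; spatiogram_similarity is
    the sketch given alongside equation (3.7).
    """
    rho = spatiogram_similarity(*s_i, *s_j)
    return sum(rho ** q for q in range(1, Q + 1))
```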
    It is also interesting to compare this polynomial sum kernel to the equivalent polynomial
    sum of the Bhattacharyya coefficient on histograms, which leads to substantially lower
Figure 4.3: Comparison of different kernel functions, for histogram (first five)
and spatiogram (the others) region descriptions. Spatiograms always out-perform
histograms, with the polynomial sum kernels proving to be the best.
    performance. This confirms that the superior performance is due to using spatiograms,
    rather than the polynomial function or Bhattacharyya comparison of histogram bins;
    while the polynomial function leads to increased performance compared to the regular
    spatiogram similarity kernel.
    4.1.5 Synthetic Data
    In Section 3.2.2 we described how, from the initial marked-up training examples, we can
    synthetically generate many more by reflecting and warping these, to approximate views
    from different locations. We conducted an experiment to verify that this is actually ben-
    eficial to the recognition algorithm (note that all the above experiments were using the
    marked-up and reflected regions, but not the warped). This was done by again running
    cross-validation, where the test set was kept fixed (comprising marked-up and reflected
    regions as before), but for the training set we added progressively more synthetically
    generated regions. The experiment started with a minimal set consisting of only the
    marked-up regions, then added the reflected regions, followed by increasing quantities
    of warped regions, and finally included warped and reflected regions. These were added
    such that the number of data was increased in almost equal amounts each time.
Figure 4.4: The effect of adding progressively more synthetically generated
training data is to generally increase performance, although the gains diminish
as more is added. The best improvement is seen when adding reflected regions
(second data point).
    The results of this experiment, on both classification and orientation performance, are
    shown in Figure 4.4. As expected, adding more data was beneficial for both tasks.
    However, while performance tended to increase as more training data were used, the
    gains diminished until adding the final set of data made little difference. This could be
    because we were only adding permutations of the already existing data, from which no
    new information could be derived. It is also apparent that the biggest single increase was
    achieved after adding the first set of synthetic data (second data point) which consisted
    of the reflected regions. This seems reasonable, since of all the warped data this was the
    most realistic (minimal geometric distortion) and similar to the test regions.
    It seems that synthetically warping data was generally beneficial, and increased perfor-
    mance on the validation set — but as most of the benefit came from reflected images this
    calls into question how useful or relevant the synthetic warping actually was (especially
    given there were a very large number of warped regions). This experiment confirmed
    that using a larger amount of training data was an advantage, and that reflecting our
    initial set was very helpful, but it may be that including more marked-up data, rather
    than generating new synthetic regions, would be a better approach.
    4.1.6 Spatiogram Analysis
    It may be thought that part of the success of spatiograms comes from the way the test
    regions have been manually segmented, with their shape often being indicative of their
    actual orientation – for example the height of an upright planar region in the image
    Figure 4.5: Region shape, even without any visual features, is suggestive of the
    orientation of manually segmented regions.
    will often diminish as it recedes into the distance. Even without considering the actual
    features in a region, the shape alone can give an indication of its likely orientation, and
    such cues are sure to exist when boundaries are outlined by a human. We illustrate
    this effect in Figure 4.5. This is significant, as unlike histograms, spatiograms use the
    location of points, which implicitly encodes the region shape. This is a problem, since
    in ‘real’ images, such boundary information would not be available; worse, it could bias
    the classifier into an erroneous orientation simply due to the shape of the region.
    To investigate how much of an effect this has on our results, an experiment was carried
    out where all regions were reduced to being circular in shape (by finding the largest
    circle which can fit inside the region). We would expect this to reduce performance
    generally, since there was less information available to the classifier; however, we found
    that spatiograms still significantly outperformed histograms for orientation regression,
    as Table 4.3 shows. While circular regions gave worse performance, this was by a simi-
    lar amount for both representations, and adding spatial information to circular regions
    continued to boost accuracy. A similar pattern was seen for classification too, where the
    region shape should not be such a significant cue. To summarise it seems that the region
    shapes are not a particularly important consideration, confirming our earlier conclusions
    that spatiograms contribute significantly to the performance of the plane recognition
    algorithm.
                      Histograms    Spatiograms   Cut Hist.    Cut Spat.
Class. Acc. (%)       77.3 (1.0)    87.9 (1.0)    75.3 (1.0)   84.4 (0.9)
Orient. Err. (deg)    24.9 (0.2)    13.3 (0.1)    26.6 (0.2)   17.0 (0.3)

Table 4.3: Comparison of performance for histograms and spatiograms on re-
gions cut to be uniformly circular, compared to the original shaped regions. Spa-
tiograms are still beneficial, showing this is not only due to the regions’ shapes
(standard deviation in parentheses).
    4.2 Overall Evaluation
    Finally, using the experiments above, we decided upon the following as the best settings
    to use for testing our algorithm:
  • Difference of Gaussians saliency detection, to detect location and scale of salient points.
  • Gradient and colour features combined for classification, but gradient only for regression.
  • A vocabulary of size 400 and 300 for gradient and colour vocabularies respectively.
  • Latent topic analysis, to reduce the dimensionality of words to 20-40 topics.
  • Spatiograms as opposed to histograms, to encode spatial distribution information.
  • Polynomial sum kernel of the spatiogram similarity measure within the RVM.
  • An augmented training set with reflected and warped regions.
Figure 4.6: Distribution of orientation errors for cross-validation, showing that
the majority of errors were below 15°.
    Using the above settings, we ran a final set of cross-validation runs on the full dataset,
    with reflected and warped regions, comprising 7752 regions. We used the full set for
    training but only the marked-up and reflected regions for testing. We observed a mean
classification accuracy of 95% (standard deviation σ = 0.49%) and a mean orientation
error of 12.3° (σ = 0.16°), over the ten runs. To illustrate how the angular errors were
distributed, we show a histogram of orientation errors in Figure 4.6. Although some are
large, a significant number are under 15° (72%) and under 20° (84%). This is a very
encouraging result, as even with a mean as low as 12.3°, the errors are not normally
distributed, but show a clear tendency towards lower orientation errors.
Figure 4.7: Distribution of errors for testing on independent data.
    4.3 Independent Results and Examples
    While the above experiments were useful, they were not a good test of the method’s
    ability to generalise, since the training and test images, though never actually coinciding,
    were taken from the same physical locations, and so there would inevitably be a degree
    of overlap between them.
    To test the recognition algorithm properly, we used a second dataset of images, gathered
    from an independent urban location, ensuring the training and test sets were entirely
separate. This set consisted of 690 image regions, with an equal number of planes and
    non-planes (we did not use any reflection or warping on this set), which were marked
    up with ground truth class and orientation as before. Again, we emphasise these were
    manually chosen regions of interest, not whole images. Ideally, the intention was to keep
    this dataset entirely separate from the process of training and tuning the algorithm, and
    to use it exactly once at the end, for the final validation. Due to the time-consuming
    nature of acquiring new training data, this was not quite the case, and some of the data
    would have been seen more than once, as follows. A subset of these data (538 regions)
    were used to test an earlier version of the algorithm as described in [58] (without colour
    features and using a different classifier). The dataset was expanded to ensure equal
    balance between the classes, but the fact remains that most of the data have been used
    once before. Furthermore, we needed to use this dataset once more to verify a claim
    made about the vocabulary size: in section 4.1.1 we justified using a larger number of
Figure 4.8 panel orientation errors: (a) 0.9°, (b) 1.9°, (c) 2.0°, (d) 3.1°, (e) 3.1°,
(f) 3.4°*, (g) 10.4°, (h) 10.6°, (i) 10.7°, (j) 11.0°, (k) 11.7°, (l) 11.7°, (m) 27.9°,
(n) 30.2°, (o) 30.7°, (p) 31.8°, (q) 33.2°, (r) 45.4°.
    Figure 4.8: Example results, selected algorithmically as described in the text,
    showing typical performance of plane recognition on the independent dataset.
    The first six (a-f) are selected from the best of the results, the next six (g-l)
    are from the middle, and the final six (m-r) are from the worst, according to the
    orientation error. *Note that (f) was picked manually, because it is an important
    illustrative example.
    words partly by the fact that this should allow the algorithm to generalise better to
    new data. A test (not described here) was done to see if this was true for our test set
    (the result showed that using a small vocabulary without topic discovery was indeed
    detrimental to performance, more so than implied by the proximity of the two curves
    toward the left of Figure 4.1a). Other than these lapses, the independent dataset was
    unseen with respect to the process of developing and honing our method.
    The results we obtained for plane recognition were a mean classification accuracy of
    91.6% and a mean orientation error of 14.5 . We also show in Figure 4.7 a plot of the
    orientation errors; this is to be compared to the results in Figure 4.6, and shows that
    here too the spread of orientation errors is good, with the majority of regions being given
    an accurate normal estimate. This suggests that the algorithm is capable of generalising
    well to new environments, and supports our principal hypothesis that by learning from a
    set of training images, it is possible to learn how appearance relates to 3D structure; and
    that this can be applied to new images with good accuracy. We have not compared this
    to other methods, due to the lack of appropriately similar work with which to compare
    (though a comparative evaluation of the full plane detector is presented in Chapter 6)
    but we believe that these results represent a good level of accuracy, given the difficulty of
    the task, and the fact that no geometric information about the orientation is available.
Figures 4.8 to 4.10 show typical example results of the algorithm on this independent
test set. To avoid bias in choosing the results to show, the images were chosen as
    follows. The correct classifications of planar regions (true positives) were sorted by their
    orientation error. We took the best ten percent, the ten percent surrounding the median
    value, and the worst ten percent, then chose randomly from those (six from each set)
    – these are shown in Figure 4.8. The only exception to this is Figure 4.8f which we
    picked manually from the best ten percent, because it is a useful illustrative example;
    all the others were chosen algorithmically. This method was chosen to illustrate good,
    typical, and failure cases of the algorithm, but does not reflect the actual distribution of
errors (cf. Figure 4.7) which of course has more good than bad results. We then chose
randomly from the set of true negatives (i.e. correct identification of non-planes), false
    Figure 4.9: Correct classification of non-planes, in various situations (selected
    randomly from the results).
    negatives, and false positives, as shown in the subsequent images, again in order to avoid
    biasing the results we display.
    These results show the algorithm is able to work in a variety of different environments.
    This includes those with typical Manhattan-like structure, for example 4.8a — but cru-
    cially, also those with more irregular textures like Figure 4.8f. While the former may
    well be assigned good orientation estimates by typical vanishing-point algorithms [73],
    such techniques will not cope well with the more complicated images.
We also show examples of successful non-plane detection in Figure 4.9. These are mostly
composed of foliage and vehicles, but we also observed the algorithm correctly classifying
people and water features.
    Figure 4.10: Some cases where the algorithm fails, showing false negatives (a-c)
    and false positives (d-f) (these were chosen randomly from the results).
    It is also interesting to consider cases where the algorithm performs poorly, examples of
    which are shown in the lower third of Figure 4.8 and in Figure 4.10. The former shows
    regions correctly classified as planes, but with a large error in the orientation estimate
    (many of these are of ground planes dissimilar to the data used for training), while the
rectangular window on a plain wall in Figure 4.8p may not have sufficiently informative
    visual information. The first row of Figure 4.10 shows missed planes; we speculate that
    Figure 4.10c is misclassified due to the overlap of foliage into the region. The second
    row shows false detections of planes, where Figure 4.10d may be confused by the strong
    vertical lines, and Figure 4.10e has too little visual information to be much use. These
    examples are interesting in that they hint at some shortcomings of the algorithm; but we
    emphasise that such errors are not common, with the majority of regions being classified
    correctly.
    4.4 Comparison to Nearest Neighbour Classification
Relevance vector machines perform well for classification and regression; however, it is
not straightforward to interpret the reasons why the RVMs behave as they do. That is,
we cannot easily learn from them which aspects of the data they are exploiting, or indeed
    if they are functioning as we believe they are. We investigated this by using a K-nearest
    neighbour (KNN) classifier instead. This assigns a class using the modal class of the K
    nearest neighbours, and the orientation as the mean of its neighbours’ orientations. By
    looking at the regions the KNN deemed similar to each other, we can see why the given
    classifications and orientations were assigned. Ultimately this should give some insight
    into the means by which the RVM assigns labels.
    The first step was to verify that the KNN and RVM gave similar results, otherwise it
    would be perverse to claim one gives insight into the other. This was done by running a
    cross-validation experiment with the KNN on varying amounts of training data. We used
    the recognition algorithm in the same way as above, except for the final classification
    step.
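As a rough illustration of the KNN baseline described here, the Python sketch below predicts a class by modal vote and an orientation as the (renormalised) mean of the neighbours' normals. It assumes a precomputed vector of similarities between the test region and the training regions (e.g. the spatiogram similarity of Chapter 3), which is not re-implemented here; the array names are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(similarities, train_labels, train_normals, k=5):
    """K-nearest-neighbour prediction for one test region.

    similarities  : (N,) similarity between the test region and each training region.
    train_labels  : (N,) class labels (1 = plane, 0 = non-plane).
    train_normals : (N, 3) unit normals of the training regions.
    """
    # Indices of the k most similar training regions.
    nn = np.argsort(similarities)[::-1][:k]

    # Class: modal label of the neighbours.
    label = Counter(train_labels[nn]).most_common(1)[0][0]

    # Orientation: mean of the neighbours' normals, renormalised to unit length.
    normal = train_normals[nn].mean(axis=0)
    normal /= np.linalg.norm(normal)
    return label, normal
```

Inspecting the indices returned by such a neighbour search is what allows the matched training regions to be displayed alongside each test region, as in the examples later in this section.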
Figure 4.11 panels: (a) classification accuracy and (b) orientation error against training
set size; (c) classification time and (d) setup time against training set size, for the KNN
and RVM.
    Figure 4.11: Comparison of RVM and KNN, showing improved accuracy and
    better scalability to larger training sets (at the expense of a slow training phase
    for the RVM).
    The results, from ten runs of cross-validation, were a mean classification accuracy of
95.6% (σ = 0.49%) and an orientation error of 13.9° (σ = 0.16°), neither of which
    were substantially different from results with the RVM. As Figures 4.11a and 4.11b
    show, performance for both algorithms improved with more training data, and was fairly
    similar (although the RVM is generally better). Figure 4.11c compares classification
    time, showing that the KNN was much slower and scaled poorly to larger training sets,
    justifying our choice of the RVM. The main drawback of the RVM, on the other hand,
    is its training time (the KNN requires no training). Figure 4.11d compares the setup
    time (creation of training descriptors and training the classifiers if necessary) for both
    algorithms, where the time taken for the RVM increased dramatically with training set
    size.
    We also tested the KNN version of the algorithm on our independent test set, and again
    found similar performance: classification accuracy was 87.8%, while orientation error
increased to 18.3°.
    4.4.1 Examples
    In this section we show example results from the independent test set, with the ground
    truth and classification overlaid as before, accompanied by the nearest neighbours for
    each (using K = 5 neighbours). Obviously, this set is no longer completely independent
    or unseen, since it is the same as used above to test the method using the RVM; but no
    changes were made based on those results before using the KNN.
    As above, the examples we show here were not selected manually, but chosen randomly in
    order to give a fair representation of the algorithm. This was done as above, i.e. taking
    the best, middle, and worst ten percent of the results for true positive cases (Figure
    4.12), and selecting randomly from each set. Examples of true negatives (Figure 4.13),
    and false positives and negatives (Figure 4.14), are selected randomly.
    These images illustrate that classification was achieved by finding other image regions
    with similar structure. It is interesting to note that the neighbours found were not always
    perceptually similar, for example Figures 4.12b and 4.12d. This is important, since while
    an algorithm which matched only to visually similar regions would work to an extent, it
    would fail when presented with different environments. Figure 4.13c, for example, shows
    how non-planes can be correctly matched to similar non-planar regions in the training
    Figure 4.12: Examples of plane classification and orientation when using a
    K-nearest neighbour classifier, showing the input image overlaid with the classi-
    fication and orientation (left), and the five nearest neighbours from the training
    set. These show triplets of images selected randomly from the best (a-c), middle
    (d-f), and worst (g-i) examples, for correct plane classification.
    Figure 4.13: Examples of correct identification of non-planar regions, using a
    K-nearest neighbour classifier; these examples were chosen randomly from the
    results.
    set, but Figure 4.13b is also classified correctly, despite being visually different from its
    neighbours.
    It is interesting to note the role that the reflected and warped data play in classification
    – in many situations several versions of the same original image are found as neighbours
    (for example Figure 4.12e in particular). This stands to reason as they will be quite
    close in feature space. On the other hand, the tendency to match to multiple versions of
    the same image with different orientations can cause large errors, as in Figure 4.12g.
    It is also instructive to look at examples where the KNN classifier performed poorly, since
    now we can attempt to discover why. Figure 4.12i, for example, has a large orientation
    error. By looking at the matched images, we can see that this is because it has matched
    to vertical walls which share a similar pattern of lines tending toward a frontal vanishing
    point, but whose orientation is not actually very similar. Misclassification of a wall occurs
    in Figure 4.14a where a roughly textured wall was predominantly matched to foliage,
    resulting in an incorrect non-planar classification (interestingly, there is a wall behind
    the trees in the neighbouring training regions). Figure 4.14b was also wrongly classified,
    perhaps due to the similarity between the ventilation grille and a fence. These two
    examples are interesting since they highlight the fact that what is planar can sometimes
    be rather ambiguous. Indeed, Figure 4.14c shows the side of a car being classified as
    planar, which one could argue is actually correct.
    Figure 4.14: Examples of incorrect classification of regions, showing randomly
    chosen examples of false negatives (a,b) and false positives (c,d).
    4.4.2 Random Comparison
    A further useful property of the KNN was that we could confirm that the low aver-
    age orientation error we obtained was a true result, not an artefact of the data or test
    procedure. It is conceivable that the recognition algorithm was simply exploiting some
    property of the dataset, rather than actually using the features we extracted. For exam-
    ple, if all the orientations were actually very similar, any form of regression would return
    a low error. We refute this in Figure 4.15b, which shows the spread of orientation er-
    rors obtained (in cross-validation) when using randomly chosen neighbours in the KNN,
    instead of the spatiogram similarity. This means there was no image information being
    used at all. Compared to results obtained using the KNN classifier (shown in Figure
    4.15a), performance was clearly much worse. The histogram of results for the KNN also
    shows similar performance to the RVM (Figure 4.6 above).
These experiments with the KNN were quite informative, showing that even such a
simple classifier can effectively make use of the visual information available in test
data, in an intuitive and comprehensible way, to find structurally similar training
    examples. The RVM and KNN exhibit broadly similar performance, though they work
    by different mechanisms. It is reasonable, therefore, to consider the RVM as being a more
    efficient way of approximating the same goal, that of choosing the regions in feature space
    most appropriate for a given test datum [11]. The superior performance of the RVM at
    Figure 4.15: Comparison of the distribution of orientation errors (in cross-
    validation), for K-nearest neighbour regression (a) and randomly chosen ‘neigh-
    bours’ (b).
    a lower computational cost (during testing), and more efficient handling of large training
    sets, suggests it is a suitable choice of classifier.
    4.5 Summary of Findings
    In this chapter, we have shown experimental results for the plane recognition algorithm
    introduced in Chapter 3. We investigated the effects of various parameters and imple-
    mentation choices, and showed that the methods we chose to represent the image regions
    were effective for doing plane classification, and out-performed the simpler alternatives.
    We also showed that the algorithm generalises well to environments outside the training
    set, being able to recognise and orient planar regions in a variety of scenes.
    4.5.1 Future Work
    There are a number of further developments of this algorithm which would be interesting
    to consider, but fall outside the scope of this thesis. First, while we are using a saliency
    detector which identifies scale, and using it to select the patch size for descriptor creation,
    this is not fully exploiting scale information (rather, we are striving to be invariant to
    it). However, the change in size of similar elements across a surface is an important
    depth cue [43], and is something we might consider using — perhaps by augmenting
    the spatiograms to represent this as well as 2D position. On the other hand, whether
    the notion of image scale detected by the DoG detector has any relation to real-world
    scale or depth is uncertain. Alternatively, investigation of other types of saliency may
    be fruitful.
    We could also consider using different or additional feature descriptors. We have already
    shown how multiple types of descriptor, each with their own vocabulary, can be combined
    together using topics, and that these descriptors are suited to different tasks. We could
    further expand this by using entirely different types of feature for the two tasks, for
    example using sophisticated rotation and scale invariant descriptors for classification,
    following approaches to object recognition [33, 70], and shape or line based [6] features
    for orientation estimation.
    4.5.2 Limitations
    The system as described above has a number of limitations, which we briefly address
here. First of all, because it is based on a finite training set, it has a limited ability to deal
    with new and exceptional images. We have endeavoured to learn generic features from
    these, to be applicable in unfamiliar scenes, though this may break down for totally
    new types of environment. Also, while we have avoided relying on any particular visual
    characteristics, such as isotropic texture, the choices we have made are tailored to outdoor
    locations, and we doubt whether it would perform well indoors where textured planar
    surfaces are less prevalent.
    The most important and obvious limitation is that it requires a pre-segmented region of
    interest, both for training and testing, which means it requires human intervention to
    specify the location. This was sufficient to answer an important question, namely, given
    a region, can we identify it as a plane or not and estimate its orientation? However, the
    algorithm would be of limited use in most real-world tasks, and we cannot simply apply
    it to a whole image, unless we already know a plane is likely to fill the whole image.
    The next step, therefore, is to place the plane recognition into a broader framework in
    which it is useful for both finding and orienting planes. This is the focus of the following
    chapter, in which we show how it is possible, with a few modifications, to use it as part
    of a plane detection algorithm, which is able to find planes from anywhere within the
    image, making it possible to use directly on previously unseen images.
    CHAPTER 5
    Plane Detection
    This chapter introduces our novel plane detection method, which for a single image can
    detect multiple planes, and predict their individual orientations. It is important to note
    the difference between this and the plane recognition algorithm presented in Chapter 3,
    which required the location of potential planar regions to be known already, limiting its
    applicability in real-world scenarios.
    5.1 Introduction
    The recognition algorithm forms a core part of our plane detection method, where it is
    applied at multiple locations, in order to find the most likely locations of planes. This
    is sufficient to be able to segment planes from non-planes, and to separate planes from
    each other according to their orientation. This continues with our aim of developing
    a method to perceive structure in a manner inspired by human vision, since the plane
    detection method extends the machine learning approach introduced in the recognition
    method. Again this means we need a labelled training set of examples, though with
    some key differences.
    5.1.1 Objective
    First we more rigorously define our objective: in short, what we intend to achieve is the
    detection of planes in a single image. More precisely, we intend to group the salient points
    detected in an image into planar and non-planar regions, corresponding to locations of
    actual planar and non-planar structures in the scene. The planar regions should then
    be segmented into groups of points, having the same orientation, and corresponding
    correctly to the planar surfaces in the scene. Each group will have an accurate estimate
    of the 3D orientation with respect to the camera. This is to be done from a single image
    of a general outdoor urban scene, without knowledge such as camera pose, physical
    location or depth information, nor any other specific prior knowledge about the image.
Nor can we rely upon specific features such as texture distortion or vanishing points.
    While such methods have been successful in some situations (e.g. [13, 44, 73, 93]), they
    are not applicable to more general real-world scenes. Instead, by learning the relationship
    between image appearance and 3D structure, the intention is to roughly emulate how
    humans perceive the world in terms of learned prior experience (although the means by
    which we do this is not at all claimed to be biologically plausible). This task as we have
    described it has not, to the best of our knowledge, been attempted before.
    5.1.2 Discussion of Alternatives
    Given that we have developed an algorithm capable of recognising planes and estimating
    their orientation in a given region of an image (Chapter 3), a few possible methods
    present themselves for using this to detect planes. Briefly, the alternatives are to sub-
    divide the image, perhaps using standard image segmentation algorithms, to extract
    candidate regions; to find planar regions in agglomerations of super-pixels; and to search
    for the optimal grouping by growing, splitting and merging segmentations over the salient
    points.
    Amongst the simplest potential approaches is to provide pre-segmented regions on which
    the plane recogniser can work, using a standard image segmentation algorithm. However,
    image segmentation is in general a difficult and unsolved problem [36, 130], especially
    when dealing with more complicated distinctions than merely colour or texture, and it
    is unlikely that general algorithms would give a segmentation suitable for our purposes.
    To illustrate the problem, Figure 5.1 shows typical results of applying Felzenszwalb and
    Huttenlocher’s segmentation algorithm [36], with varying settings (controlling roughly
    Figure 5.1: This illustrates the problems with using appearance-based segmen-
    tation to find regions to which plane recognition may be applied. Here we have
    used Felzenszwalb and Huttenlocher’s algorithm [36], which uses colour informa-
    tion in a graph-cut framework. While the image is broadly broken down into
    regions corresponding to real structure, it is very difficult to find a granularity
    (decreasing left to right) which does not merge spatially separate surfaces, while
    not over-segmenting fine details.
    the number of segments) to some of our test images. The resulting segments are either
    too small to be used, or will be a merger of multiple planar regions, whose boundary is
    effectively invisible (for example the merging of walls and sky caused by an over-exposed
    image).
    An even more basic approach would be simply to tile the image with rectangular blocks,
    and run plane recognition on each, then join adjacent blocks with the same classifi-
    cation/orientation. This does not depend on any segmentation algorithm to find the
    boundaries, but will result in a very coarse, blocky segmentation. Choosing the right
    block size would be problematic, since with blocks too large we gain little information
    about the location or shape of surfaces, but too small and the recognition will perform
too poorly to be of any use (cf. Chapter 3 and experiment 6.2.1). However, one could
    allow the blocks to overlap, and while the overlapped sections are potentially ambiguous,
    this could allow us to use sufficiently large regions while avoiding the blockiness — this
    is something we come back to later.
    Rather than use fixed size blocks, we could deliberately over-segment the image into so-
    called ‘superpixels’, where reasonably small homogeneous regions are grouped together.
    This allows images to be dealt with much more efficiently than using the pixels them-
    selves. Individual superpixels would likely not be large enough to be classifiable on their
    own, but ideally they would be merged into larger regions which conform to planar or
    non-planar surfaces. Indeed, this is rather similar to Hoiem et al. [66], who use seg-
    ments formed of superpixels to segment the image into geometric classes. They use local
    features such as filter responses and mean colour to represent individual superpixels,
    which are very appropriate for grouping small regions (unlike our larger-scale classifica-
    tion). Even so, finding the optimal grouping is prohibitively expensive. In our case we
    would also have to ensure that there are always enough superpixels in a collection being
    classified at any time, constraining the algorithm in the alterations it can make to the
    segmentation.
    Another alternative is to initialise a number of non-overlapping regions, formed from
    adjacent sets of salient points rather than superpixels (an initial segmentation). Plane
    recognition would be applied to each, followed by iterative update of the regions’ bound-
    aries in order to search for the best possible segmentation. The regions could be merged
    and split as necessary and the optimal configuration found using simulated annealing
    [34], for example. Alternatively this could be implemented as region growing where a
    smaller number of regions are initialised centred at salient points, and are grown by
    adding nearby points if they increase the ability of the region to describe a plane; then
    split apart if they become too large or fail to be classified confidently. The problem is
    that while region growing has been successful for image segmentation [129], our situa-
    tion differs in that the decision as to whether a point should belong to a plane is not
    local — i.e. there is nothing about the point itself which indicates that it belongs to a
    nearby region. Rather, it is only after including the point within a region to be classi-
    fied that anything about its planar or non-plane status can be known. This is different
    from segmenting according to colour, for instance, which can be measured at individual
    locations.
    These methods would need some rule for evaluating the segmentation, or deciding
    whether regions should be merged or split, and unfortunately it is not clear how to
    define which cost function we should be minimising (be it the energy in simulated an-
    nealing or the region growing criterion). One candidate is the probability estimate given
    by the RVM classifier, where at each step, the aim would be to maximise the overall
    probability of all the classifications, treating the most confident configuration as the
    best. The problem is that we cannot rely on the probability given by the RVM, partly
    due to the interesting property of the RVM that the certainty of classification, while
    generally sensible for validation data, actually tends to increase as the test data move
    further away from the training data. This is an unfortunate consequence of enforcing
    sparsity in the model [106], and cannot be easily resolved without harming the efficiency
    of the algorithm. Indeed, in some early experiments using region growing, we found that
    classification certainty tended to increase with region size, so that the final result always
    comprised a single large region no matter what the underlying geometry. This particu-
    lar problem is specific to the RVM, although similar observations hold for classification
    algorithms more generally — any machine learning method would be constrained by its
    training data, and liable to make erroneous predictions outside this scope.
    Bearing the above discussion in mind, what we desire is a method that does not rely
    upon being able to classify or regress small regions (since our plane recogniser cannot
    do this); avoids the need for any prior segmentation or region extraction (which cannot
    be done reliably); does not rely on accurate probabilities from a classifier in its error
    measure (which are hard to guarantee); and will not require exploration or optimisation
over a combinatorial search space. In the following we present such a method, whose
crucial component is a step we call region sweeping, which we use to drive a segmentation
algorithm at the level of salient points.
    5.2 Overview of the Method
    This section gives a brief overview of the plane detection method, the details of which
    are elaborated upon in the following sections, and images representing each of the steps
    are shown in Figure 5.2. We begin by detecting salient points in the image as before,
    and assigning each a pair of words based on gradient and colour features. Next we use a
    process we call region sweeping, in which a region is centred at each salient point in turn,
    to which we apply the plane recognition algorithm. This gives plane classification and
    orientation estimation at a number of overlapping locations covering the whole image.
    We use these to derive the ‘local plane estimate’ which is an estimate at each individual
    salient point of its probability of belonging to a plane, and its orientation, as shown in
    Figure 5.2b. These estimates of planarity and orientation at each point are derived from
    all of the sweep regions in which the point lies.
    While this appears to be a good approximation of the underlying planar structure, it is
    not yet a plane detection, since we know nothing of individual planes’ locations. The
    next step, therefore, is to segment these points into individual regions, using the local
    plane estimates to judge which groups of points should belong together, as illustrated
    in Figure 5.2c. The output of the segmentation is a set of individual planar and non-
    planar segments, and the final step is to verify these, by applying the plane recognition
    Figure 5.2: Example steps from plane detection: from the input image (a), we
    sweep the plane recogniser over the image to obtain a pointwise estimate of plane
    probability and orientation (b). This is segmented into distinct regions (c), from
    which the final plane detections are derived (d).
    algorithm once more, on these planar segments. The result, as shown in Figure 5.2d, is a
    detection of the planar surfaces, comprised of groups of salient points, with an estimate
    of their orientation, derived from their spatial adjacency and compatibility in terms of
    planar characteristics.
    5.3 Image Representation
    The representation of images will closely follow the description in Section 3.3, except
    that now we need to consider the entire image, as well as multiple overlapping regions.
    Because regions overlap, feature vectors are shared between multiple regions.
We begin by detecting a set of salient points V = {v_1, ..., v_n} over the whole image,
    using the difference of Gaussians (DoG) detector. For each point, we create a gradient
    and colour descriptor. Assuming that we have already built bag of words vocabularies,
    we quantise these to words, so that the image is represented by pairs of words (gradient
    and colour) assigned to salient points (for more details, refer back to Section 3.3). The
    further stages of representation – word/topic histograms and spatiograms – depend on
    having regions, not individual points, and so are not applied yet. Conveniently, this set
    of salient points and words can be used for any regions occurring in the image.
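As an illustration of the quantisation step, the following Python sketch assigns each descriptor to its nearest vocabulary centre by a brute-force distance computation; the array names and the brute-force search are assumptions made for clarity, not a description of the actual implementation.

```python
import numpy as np

def quantise_to_words(descriptors, vocabulary):
    """Assign each descriptor the index of its nearest vocabulary centre.

    descriptors : (P, D) array, one row per salient point.
    vocabulary  : (W, D) array of cluster centres (the bag-of-words vocabulary).
    Returns an array of P word indices in [0, W).
    """
    # Squared Euclidean distance between every descriptor and every centre.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(d2, axis=1)

# Each salient point is then represented by a (gradient word, colour word) pair,
# e.g. gradient_words = quantise_to_words(gradient_descriptors, gradient_vocab)
#      colour_words   = quantise_to_words(colour_descriptors, colour_vocab)
```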
    5.4 Region Sweeping
In order to find the most likely locations of planar structure, we apply a ‘region sweeping’
stage, using the set V of salient points. Region sweeping creates a set of approximately
circular overlapping regions R, by using each salient point v_i in turn to create a ‘sweep
region’ R_i ∈ R, using the point as the centroid and including all other points within a
fixed radius κ. We define R_i as the set of all salient points within the radius from the
centroid: R_i = {v_j : ||v_j − v_i|| < κ, j = 1, ..., n}. To speed up the process, we generally
use every fourth salient point only, rather than every point, as a centroid. Points are
processed in order from the top left corner, to ensure the subset we use is approximately
evenly distributed.
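A minimal sketch of this grouping is given below, assuming the salient points are available as an array of image coordinates; the KD-tree used for the radius query is an implementation choice of the sketch, not something specified in the text.

```python
import numpy as np
from scipy.spatial import cKDTree

def sweep_regions(points, kappa, stride=4):
    """Group salient points into approximately circular sweep regions.

    points : (N, 2) array of salient point image coordinates.
    kappa  : sweep radius in pixels.
    stride : use every `stride`-th point as a region centroid (4 in the text).
    Returns a list of index arrays, one per sweep region R_i.
    """
    tree = cKDTree(points)
    regions = []
    for i in range(0, len(points), stride):
        # All salient points within distance kappa of the centroid point.
        members = tree.query_ball_point(points[i], r=kappa)
        regions.append(np.array(members))
    return regions
```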
Topic spatiograms are created for each sweep region (see Section 3.3.5). Using these, the
plane recognition algorithm can be applied, resulting in an estimate of the probability
p(R_i) ∈ (0, 1) of belonging to the plane class, and an estimate of the orientation n(R_i) ∈
R³ (normal vector), for each sweep region R_i in isolation. The result – before any further
processing – can be seen in Figure 5.3, showing multiple overlapping regions R coloured
according to their probability of being planar, with the orientation estimate shown for
each planar region. These regions are classified and regressed using RVMs, the training
data for which are described in the next section.
Figure 5.3: Input image (left) and the result of region sweeping (right); this
shows the hull of each region, coloured according to its estimated probability of
    being a plane (red is planar, blue is non-planar, and shades of purple in between),
    and the regressed normal for plane regions (only a subset of the regions are shown
    for clarity).
    Note that the choice of region size is dictated by two competing factors. On one hand,
    larger regions will give better recognition performance, but at the expense of obtaining
    coarser-scale information from the image, blurring plane boundaries. Small regions would
    be able to give precise and localised information, except that accuracy falls as region size
    Figure 5.4: Examples of manually segmented ground truth data, used for train-
    ing the classifiers, showing planes (red) with their orientation and non-planes
    (blue).
    decreases. Fortunately, due to the segmentation method we will introduce soon, this does
    not mean our algorithm is incapable of resolving details smaller than the region size. We
    investigate the implications of region size in our experiments presented in Section 6.2.1.
    5.5 Ground Truth
    Before discussing the next step in the detection algorithm, it is necessary to explain the
    ground truth data. This is because it is crucial for training the recognition algorithm,
    as well as validation of the detection algorithm (see Chapter 6).
    Unlike in the previous chapters, these ground truth data will contain the location and
    orientation of all planes in the image, not just a region of interest. We begin with a set of
    images, selected from the training video sequences, and hand segment them into planar
and non-planar regions. We mark up the entire image, so that no areas are left
ambiguous. Plane orientations are specified using the interactive vanishing line
    method as before (refer to Figure 3.1). Examples of such ground truth regions are shown
    in Figure 5.4.
    5.6 Training Data
    In Section 3.2, we described how training data were collected by manually selecting and
    annotating regions from frames from a video sequence. These regions are no longer
    suitable, because the manually selected region boundaries are not at all like the shapes
    obtained from region sweeping. As a consequence, classification performance suffers.
    Furthermore, we can no longer guarantee that all regions used for recognition will be
    purely planar or non-planar, since they are not hand-picked but are extracted from
    images about all salient points. Thus training regions which are cleanly segmented and
    correspond entirely to one class (or orientation) are not representative of the test data.
    5.6.1 Region Sweeping for Training Data
    To obtain training data more appropriate to the plane detection task, we gather it using
    the same method as we extract the sweep regions themselves (see above), but applied
    instead to ground truth images described in the previous section. When creating regions
    by grouping all salient points within a radius of the central salient point, we use only a
    small subset of salient points in each image as centre points, so that we do not get an
    unmanageably large quantity of data. Since these are ground truth labelled images, we
    use the ground truth to assign the class and orientation labels to these extracted regions.
    Inevitably some regions will lie over multiple ground truth segments, as would test data.
    This is dealt with by assigning to each salient point the class of the ground truth region
    in which it lies, then the training regions are labelled based on the modal class of their
    salient points. The same is done for assigning orientation, except for using the geometric
    median.
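A sketch of the class labelling just described is shown below; the orientation label would be obtained analogously, using the geometric median of the member points' ground-truth normals (a sketch of a geometric median appears with the local plane estimate in Section 5.7). The variable names are illustrative.

```python
import numpy as np
from collections import Counter

def label_region_from_ground_truth(region_points, gt_class_of_point):
    """Assign a class label to a sweep region extracted from a ground-truth image.

    region_points     : indices of the salient points in the region.
    gt_class_of_point : array mapping each salient point index to the class of the
                        ground-truth segment it falls in (1 = plane, 0 = non-plane).
    The region label is the modal class of its member points, as described above.
    """
    classes = gt_class_of_point[region_points]
    return Counter(classes.tolist()).most_common(1)[0][0]
```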
    We also investigated making use of the estimated probability of each region being a
    plane, which was calculated as the proportion of planar points in the region. The aim
    would be to regress such a probability estimate for test regions, but experiments showed
    no benefit in terms of the resulting accuracy, and so we use the method outlined above.
    The downside to this approach is that regions whose true class is ambiguous (having for
    example almost equal numbers of plane and non-plane points) will be forced into one
    class or the other. However, this will be the same during testing, and so we consider it
    sensible to leave these in the training set.
    This method has the advantage that it allows a very large amount of training data to be
    extracted with little effort, other than the initial marking up of ground truth. Indeed, this
    is far more than we can deal with if we do not sparsify the sweeping process considerably.
    As such it is not necessary to apply warping as well (Section 3.2.2), although we still
    consider it beneficial to reflect all the ground truth images before extraction begins.
    5.6.1.1 Evaluation of Training Data
    To verify that gathering an entirely new set of training regions is necessary, we col-
    lected a test set of planar and non-planar regions by applying region sweeping to an
    independent set of ground truth images (those which are later used to evaluate the full
    algorithm). This produced a new set of 3841 approximately circular regions. When using
    the plane recognition algorithm on these, using the original hand-segmented training set
    from Chapter 4, the results were a classification accuracy of only 65.8%, and a mean
orientation error of 22°, which would not be good enough for reliable plane segmentation
    and detection. However, running the same test using the new sweeping-derived training
    set described here increases classification accuracy to 84.6%, indicating that having an
    appropriate training set is indeed important. We note with some concern that the mean
orientation error decreased only marginally, to 21°. Possibly this is because both training
    and testing data now include regions containing a mixture of different planes, making it
    more difficult to obtain good accuracy with respect to the ground truth.
    5.7 Local Plane Estimate
    After running region sweeping, as described in Section 5.4, and classifying these regions
    with classifiers trained on the data just described, we have a set of overlapping regions
    R covering the image. This gives us local estimates of what might be planar, but says
    nothing about boundaries. Points in the image lying inside multiple regions have am-
    biguous classification and orientations. We address this by considering the estimate given
    to each region R i containing that point as a vote for a particular class and orientation.
    Intuitively, a point where all the regions in which it lies are planar is very likely to be
    on a plane. Conversely a point where there is no consensus about its class is uncertain,
    and may well be on the boundary between regions. This observation is a crucial factor
    in finding the boundaries between planes and non-planes, and between different planes.
    More formally, we use the result of region sweeping to estimate the probability of a
    salient point belonging to a planar region, by sampling from all possible local regions it
    could potentially be a part of, before any segmentation has been performed. For points
    which are likely to be in planar regions, we also estimate their likely orientation, using
the normals of all the planar surfaces they could lie on. Each salient point v_i lies within
multiple regions R_i ⊆ R, where R_i is defined as R_i = {R_k | v_i ∈ R_k}; that is, the subset
    Figure 5.5: Using the sweep regions (left), we obtain the local plane estimate,
    which for each point, located at v i , assigns a probability q i of belonging to a
    plane, and an orientation estimate m i (right). Probability is coloured from red
    to blue (for degree of plane to non-plane), and orientation is shown with a normal
    vector (only a subset of points are shown for clarity).
of regions R_k in which v_i appears. Each point v_i is given an estimate of its probability
of being on a plane, denoted q_i, and of its normal vector m_i, calculated as follows:

$$q_i = \zeta\big(\{\, p(R_k) \mid R_k \in R_i \,\}\big), \qquad
m_i = \zeta_G\big(\{\, n(R_k) \mid R_k \in R_i,\ p(R_k) > 0.5 \,\}\big) \tag{5.1}$$

where ζ and ζ_G are functions calculating the median and the geometric median in R³
respectively. Note that m_i is calculated using only regions whose probability of being a
plane is higher than for a non-plane. To clarify, equation 5.1 describes how the plane
probability and normal estimate for the point i come from the median over the regions
in R_i. We use the median rather than the mean since it is a more robust measure of
    central tendency, to reduce the effect of outliers, which will inevitably occur when using
    a non-perfect classifier. Figure 5.5 illustrates how the sweep regions lead to a pointwise
    local plane estimate.
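The sketch below illustrates equation 5.1 for a single point; the Weiszfeld iteration used here for the geometric median is an assumption of the sketch, since the text does not specify how it is computed.

```python
import numpy as np

def geometric_median(vectors, iters=50, eps=1e-9):
    """Geometric median of a set of 3D vectors via Weiszfeld's iteration
    (an assumed implementation; the text does not specify one)."""
    m = vectors.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(vectors - m, axis=1)
        w = 1.0 / np.maximum(d, eps)
        m = (vectors * w[:, None]).sum(axis=0) / w.sum()
    return m

def local_plane_estimate(point_index, regions, probs, normals):
    """Equation 5.1: per-point plane probability q_i and normal m_i.

    regions : list of index arrays, one per sweep region R_k.
    probs   : probability p(R_k) of each region being planar.
    normals : (num_regions, 3) regressed normals n(R_k).
    """
    containing = [k for k, r in enumerate(regions) if point_index in r]
    if not containing:
        return None, None  # point lies in no (retained) sweep region
    q_i = np.median([probs[k] for k in containing])
    planar = [k for k in containing if probs[k] > 0.5]
    m_i = geometric_median(normals[planar]) if planar else None
    return q_i, m_i
```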
    In order to improve the accuracy of the local plane estimate, we discard regions whose
    classification certainty is below a threshold. The classification certainty is defined as the
probability of belonging to the class it has been assigned, and thus is p(R_i) and 1 − p(R_i),
    for planar and non-planar regions respectively. This discarding of classified regions is
    justified by a cross-validation experiment (similar to those in Section 4.1). As shown
    in Figure 5.6, as the threshold is increased, to omit the less certain classifications, the
    mean classification accuracy increases. This is at the expense of decreasing the number
    of regions which can be used. It is this thresholding for which having a confidence
    Figure 5.6: As the threshold on classifier certainty increases, less confident
    regions are discarded, and so accuracy improves (a); however, the number of
    regions remaining drops (b).
    value for the classification, provided by the RVM, is very useful, and is an advantage
    over having a hard classification. The effect of this is to remove the mid-range from
    the spread of probabilities, after which the median is used to choose which of the sides
    (closer to 0 or 1) is larger. This is equivalent to a voting scheme based on the most
    confident classifications. However, our formulation can allow more flexibility if necessary,
    for example using a different robust estimator.
    Since there are usually a large number of overlapping regions available for each point,
    the removal of a few is generally not a problem. In some cases, of course, points are
left without any regions that they lie inside (R_i = ∅), and so these points are left out
    of subsequent calculations. Such points are in regions where the classifier cannot be
    confident about classification, and so should play no part in segmentation.
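A minimal sketch of this certainty filter is shown below; the threshold value is illustrative, as the text only reports how accuracy and the number of retained regions vary as the threshold is swept.

```python
import numpy as np

def filter_by_certainty(probs, threshold=0.8):
    """Keep only sweep regions whose classification certainty exceeds a threshold.

    probs : array of plane probabilities p(R_i) in (0, 1).
    Certainty is p(R_i) for regions classified as planar, and 1 - p(R_i) otherwise.
    Returns the indices of the regions that are retained.
    """
    certainty = np.where(probs > 0.5, probs, 1.0 - probs)
    return np.flatnonzero(certainty >= threshold)
```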
    The result is an estimate of planarity for each point, which we will refer to as the local
    plane estimate — examples are shown in Figure 5.7. We have now obtained a represen-
    tation of the image structure which did not require the imposition of any boundaries,
    and circumvented the problem of being unable to classify arbitrarily small regions. It is
    encouraging to note that at this stage, the underlying planar structure of the image is
    visible, although as expected it is less well defined at plane boundaries, where orientation
    appears to transition smoothly between surfaces.
    Figure 5.7: Examples of local plane estimates, for a variety of images; colour,
    from red to blue, shows the estimated probability of belonging to a plane, and
    the normal vector is the median of all planar regions in which the point lies.
    5.8 Segmentation
    Although the local plane estimate produced above is a convincing representation of the
    underlying structure of the scene – showing where planes are likely to be, and their ap-
    proximate orientation – this is not yet an actual plane detection. Firstly, plane bound-
    aries remain unknown, since although we know the estimates at each point, it is not
    known which points are connected to which other points on a common surface. Sec-
    ondly, each pointwise local plane estimate is calculated from all possible surfaces on
    which the point might lie. As such, this is not accurate, which is especially clear in Fig-
    ure 5.7c where points near plane-plane boundaries take the mean of the two adjoining
    faces.
    This section describes how the local plane estimate is used to segment points from each
other to discover the underlying planar structure of the scene, in terms of sets of connected,
    coplanar points. This consists of segmenting planes from non-planes, deciding how many
    planes there are and how they are oriented, then segmenting planes from each other.
    5.8.1 Segmentation Overview
    Our segmentation is performed in two separate steps. This is because we found it was
    not straightforward to select a single criterion (edge weight, energy function, set of clique
    potentials etc.) to satisfactorily express the desire to separate planes and non-planes,
    while at the same time separating points according to their orientation. Therefore, we
    first segment planar from non-planar regions, according to the probability of belonging
    to a plane given by the local plane estimate. Next, we consider only the resulting planar
    regions, and segment them into distinct planes, according to their orientation estimate,
    for which we need to determine how many planes there are. We do this by finding
    the modes in the distribution of normals observed in the local plane estimate, which
    represent the likely underlying planes. This is based on the quite reasonable assumption
    that there are a finite number of planar surfaces which we need to find.
    5.8.2 Graph Segmentation
    We formulate the segmentation as finding the best partition of a graph, using a Markov
    random field (MRF) framework. A MRF is chosen since it can well represent our task,
    which is for each salient point, given its observation, to find its best ‘label’ (for plane
    class and orientation), while taking into account the values of its neighbours. The values
    of neighbouring points are important, since as with many physical systems we can make
    the assumption of smoothness. This means we are assuming that points near each other
    will generally (discontinuities aside) have similar values [78]. Without the smoothness
    constraint, segmentation would amount simply to assigning each point to its nearest
    class label, which would be trivial, but would not respect the continuity of surfaces.
    Fortunately, optimisation of MRFs is a well-studied problem, and a number of efficient
    algorithms exist.
    First we build the graph to represent the 2D configuration of points in the image and their
    neighbourhoods. We do this with a Delaunay triangulation of the salient points to form
the edges of the graph, using the efficient S-Hull implementation¹ [118]. We modify the
    standard triangulation, first by removing edges whose endpoints never appear in a sweep
    region together. This is because for two points which are never actually used together
    during the plane sweep recognition stage, there is no meaningful spatial link between
    them, so there should be no edge. To obtain a graph with only local connectivity, we
    impose a threshold on edge length (typically 50 pixels), to remove any undesired long-
    range effects.
¹ The code is available at www.s-hull.org/
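The graph construction might look something like the following sketch, which substitutes scipy's Delaunay triangulation for the S-Hull implementation used in the thesis and applies the two pruning rules described above; the quadratic construction of the co-occurrence set is purely for clarity.

```python
import numpy as np
from scipy.spatial import Delaunay

def build_mrf_graph(points, regions, max_edge_len=50.0):
    """Neighbourhood graph over salient points for the MRF segmentation.

    points  : (N, 2) array of salient point coordinates.
    regions : list of index arrays (the sweep regions), used to prune edges whose
              endpoints never appear together in any sweep region.
    Returns a set of undirected edges (i, j) with i < j.
    """
    tri = Delaunay(points)

    # Record which pairs of points co-occur in at least one sweep region.
    cooccur = set()
    for r in regions:
        for a in r:
            for b in r:
                if a < b:
                    cooccur.add((a, b))

    edges = set()
    for simplex in tri.simplices:
        for a, b in [(simplex[0], simplex[1]),
                     (simplex[0], simplex[2]),
                     (simplex[1], simplex[2])]:
            i, j = (a, b) if a < b else (b, a)
            # Keep only co-occurring endpoints joined by a sufficiently short edge.
            if (i, j) in cooccur and np.linalg.norm(points[i] - points[j]) < max_edge_len:
                edges.add((int(i), int(j)))
    return edges
```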
    5.8.3 Markov Random Field Overview
    A MRF expresses a joint probability distribution on an undirected graph, in which
    every node is conditionally independent of all other nodes, given its neighbours [78] (the
    Markov property). This is a useful property, since it means that the global properties
    of the graph can be specified by using only local interactions, if certain assumptions
    are made. In general the specification of the joint probability of the field would be
    intractable, were it not for a theorem due to Hammersley and Clifford [60] which states
    that a MRF is equivalent to a Gibbs random field, which is a random field whose global
    properties are described by a Gibbs distribution. This duality enables us to work with
    the MRF in an efficient manner and to find optimal values, in terms of maximising the
    probability.
If we define a MRF over a family of variables F = {F_1, ..., F_N}, each taking the value
F_i = f_i, then f = {f_1, ..., f_N} denotes a particular realisation of F, where some f
describes the labels assigned to each node, and is called a configuration of the field. In
this context, the Gibbs distribution is expressed as

$$P(f) = Z^{-1} \times e^{-U(f)} \tag{5.2}$$

where U(f) is the energy function, and for brevity we have omitted the temperature
parameter from the exponent. Z is the partition function:

$$Z = \sum_{f \in F} e^{-U(f)} \tag{5.3}$$

which is required to ensure the distribution normalises to 1. Since this must be evaluated
over all possible configurations f (the space F), evaluating (5.2) is generally intractable.
However, since Z is the same for all f, it is not needed in order to find the optimal
configuration of the field — i.e. it is sufficient that P(f) ∝ e^{−U(f)}. The optimal config-
uration is the f which is most likely (given the observations and any priors), so the aim
is to find the f = f* which maximises the probability P.
    The energy function U ( f ) sums contributions from each of the neighbourhood sets of the
    graph. Since the joint probability is inversely related to the energy function, minimising
the energy is equivalent to finding the MAP (maximum a-posteriori) configuration f^*
of the field. Thus, finding the solution to a MAP-MRF is reduced to an optimisation
problem on U(f). This is generally written as a sum over clique potentials, U(f) =
\sum_{c \in C} V_c(f), where a clique is a set of nodes which are all connected to each other, and
V_c is the clique potential for clique c (in the set of all possible cliques C). In our case, we
deal with up to second order cliques – that is, single nodes and pairs of nodes – so the
energy function can be expressed as:
U(f) = \sum_{\{i\} \in C_1} V_1(f_i) + \sum_{\{i,i'\} \in C_2} V_2(f_i, f_{i'}) = \sum_{i \in S} V_1(f_i) + \sum_{i \in S} \sum_{i' \in N_i} V_2(f_i, f_{i'}) \qquad (5.4)
    where V 1 and V 2 are the first and second order clique potentials respectively. The left
    hand equation expresses the energy as a sum over the two clique potentials, over the set of
    all first order C 1 (single nodes) and second order C 2 (pairs connected by an edge) cliques;
    the right hand side expresses this more naturally as a sum of potentials over nodes, and
    a sum over all the neighbourhoods for all the nodes; S is the set of all nodes in the graph
and N_i \subset S denotes all the neighbours of node i. The two clique potentials take into
account respectively the dependence of the label on the observed value, at a single node,
and the interaction between pairs of labels at adjacent nodes (this is how smoothness
is controlled). In summary, it is this energy U(f), calculated using the variables f_i of
the configuration f and the functions V_1 and V_2, which is to be minimised, in order to find
the best configuration f^* (and thus the optimal labels f_i^* for all the nodes).
    We use iterative conditional modes (ICM) to optimise the MRF, which is a simple but
    effective iterative algorithm developed by Besag [8]. The basic principle of ICM is to
    set every node in turn to its optimal value, given the current value of its neighbours,
    monotonically decreasing the total energy of the MRF. The updates will in turn alter
    the optimal value for already visited nodes, so the process is repeated until convergence
    (convergence to a (local) minimum can be proven, assuming sequential updates). ICM
    is generally quite fast to converge, and although it may not be the most efficient opti-
    misation algorithm, it is very simple to implement, and was found to be suitable for our
    task.
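A minimal sketch of the ICM update loop is given below (illustrative Python, assuming generic unary and pairwise potential functions supplied by the caller; the names are placeholders rather than those of our implementation).

def icm(labels, neighbours, unary, pairwise, label_set, max_iters=20):
    """Iterative conditional modes on a labelled graph.

    labels     : dict node -> current label (the initial configuration)
    neighbours : dict node -> list of adjacent nodes
    unary      : function (node, label) -> single-site energy V1
    pairwise   : function (label, label) -> pair-site energy V2
    label_set  : sequence of allowed labels
    """
    for _ in range(max_iters):
        changed = False
        for i in labels:
            # Energy of giving node i label l, with its neighbours held fixed.
            def local_energy(l):
                return unary(i, l) + sum(pairwise(l, labels[j])
                                         for j in neighbours[i])
            best = min(label_set, key=local_energy)
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:      # no node changed: a local minimum has been reached
            break
    return labels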
    5.8.4 Plane/Non-plane Segmentation
    The first step is to segment planes from non-planes. We do this using a MRF as follows:
let p represent a configuration of the field, where each node p_i \in \{0, 1\} represents the class
    of point i (1 and 0 being plane and non-plane, respectively). We then seek the optimal
configuration p^*, defined as p^* = \arg\min_{p} U(p), where U(p) represents the posterior
    energy of the MRF:
U(p) = \sum_{i \in S} V_1(p_i, q_i) + \sum_{i \in S} \sum_{j \in N_i} V_2(p_i, p_j) \qquad (5.5)
    Here, q i is the observation at point i , which is the estimated probability of this point
    belonging to a plane, obtained as in equation 5.1. The set S contains all salient points
    in the image (assuming they have been assigned a probability). The functions V 1 and V 2
    are the single site and pair site clique potentials respectively, defined as
V_1(p_i, q_i) = (p_i - q_i)^2, \qquad V_2(p_i, p_j) = \delta_{p_i \neq p_j} \qquad (5.6)
where \delta_{p_i \neq p_j} has value 0 iff p_i and p_j are equal, and 1 otherwise. Here we express the function
V_1 with two arguments, since it depends not only on the current value of the node but
also its observed value. This function penalises deviation of the assigned value p_i at a point
from its observed value q_i, using a squared error, since we want the final configuration to
    correspond as closely as possible to the local plane estimates. We desire pairs of adjacent
    nodes to have the same class, to enforce smoothness where possible, so δ in function V 2
    returns a higher value (1) to penalise a difference in its arguments.
    Each p i is initialised to the value in { 0 , 1 } which is closest to q i (i.e. we threshold the
    observations to obtain a plane/non-plane class) and optimise using ICM. We generally
    find that this converges within a few iterations, since the initialisation is usually quite
    good. It tends to smooth the edges of irregular regions and removes small isolated
    segments. After this optimisation, each node is set to its most likely value (plane or not),
    given the local plane estimate and a smoothness constraint imposed by its neighbours, so
    that large neighbourhoods in the graph will correspond to the same class. The process
    is illustrated with examples in Figure 5.8. Finally, segments are extracted by finding the
    connected components corresponding to planes and non-planes, which now form distinct
    regions (Figure 5.8d).
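The following sketch shows how the potentials of equation 5.6 and the thresholded initialisation could be combined with ICM for this binary labelling (illustrative Python only; the smoothness weight is exposed as a parameter for convenience, whereas in equation 5.6 it is simply 1).

def segment_plane_nonplane(q, neighbours, smoothness=1.0, max_iters=20):
    """Binary plane/non-plane labelling from per-point plane probabilities.

    q          : dict node -> estimated plane probability in [0, 1]
    neighbours : dict node -> list of adjacent nodes
    """
    # Initialise by thresholding the observations (closest of 0 or 1).
    labels = {i: (1 if q[i] >= 0.5 else 0) for i in q}

    def unary(i, l):        # V1: squared deviation from the observed probability
        return (l - q[i]) ** 2

    def pairwise(a, b):     # V2: penalty (scaled) when neighbouring labels differ
        return smoothness if a != b else 0.0

    for _ in range(max_iters):
        changed = False
        for i in labels:
            def local_energy(l):
                return unary(i, l) + sum(pairwise(l, labels[j])
                                         for j in neighbours[i])
            best = min((0, 1), key=local_energy)
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels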
    Figure 5.8: The process of segmenting planes from non-planes. Using the
    probabilities estimated at each point from region sweeping (a), we initialise a
    Markov random field (b); this is optimised using iterative conditional modes
    (c), resulting in a clean segmentation into smooth disjoint regions of planes and
    non-planes (red and blue respectively) (d).
    5.8.5 Orientation Segmentation
    Once planar regions have been separated from non-planar regions using the above MRF,
    we are left with regions supposedly consisting only of planes. However, except in very
    simple images, these will be made up of multiple planes, with different orientations. The
    next step is to separate these planes from each other, using the estimated orientation m i
    at each salient point.
    This is done by optimising a second MRF, with the goal of finding the points belonging
    to the same planar surfaces. This is defined on the subgraph of the above which con-
    tains only points in planar segments. This graph may consist of multiple independent
    connected components, but this makes no difference to the formulation of the MRF.
    In contrast to the simplicity of the first MRF, which was a two-class problem, the values
    we have for plane normals are effectively continuously valued variables in R 3 , where we
    do not know the values of the true planes. Some brief experimentation suggested that
    attempting to find an energy function which is able to not only segment the points,
    but also find the correct number of planes, was far from straightforward. Generally, a
    piecewise constant segmentation approach did enforce regions of constant orientation,
but typically produced far too many, forming banding effects (essentially discretising the
normals into overly fine graduations).
    We take an alternative approach, by using mean shift to find the modes of the kernel
    density estimate of the normals in the image. This is justified as follows: we assume
    that in the image, there are a finite number of planar surfaces, each with a possibly
    different orientation, to which all of the planar salient points belong. This implies that
    close to the observed normals, there are a certain number, as yet unknown, of actual
    orientations, from which all these observations are derived. The task is to find these
    underlying orientations, from the observed normals. Once we know the value of these,
    the problem reduces to a discrete, multi-class MRF optimisation, which is easily solved
    using ICM. The following sections introduce the necessary theory and explain how we
    use this to find the plane orientations.
    5.8.5.1 Orientation Distribution and the Kernel Density Estimate
    Kernel density estimation is a method to recover the density of the distribution of a
    collection of multivariate data. Conceptually this is similar to creating a histogram, in
    order to obtain a non-parametric approximation of a distribution. However, histograms
    are always limited by the quantisation into bins, introducing artefacts. Instead of as-
    signing each datum to a bin, the kernel density estimate (KDE) places a kernel at every
    datum, and sums the results to obtain an estimate of the density at any point in the
    space, as we illustrate in Figure 5.9 with the example of a 1D Gaussian kernel.
    Figure 5.9: This illustrates the central principle behind the kernel density
    estimate. For 1D data (the black diamonds), we estimate the probability density
    by placing a kernel over each point (here it is a Gaussian with σ = 0 . 5 ), drawn
    with dashed lines. The sum of all these kernels, shown by the thick red line, is
    the KDE. Clusters of nearby points’ kernels sum together to give large peaks in
    the density; outlying points give low probability maxima.
In general, the KDE function \hat{f} for multivariate data X = \{x_1, \dots, x_N\}, evaluated at a
point y within the domain of X, is defined as [23]
\hat{f}(y) = \frac{1}{N h^d} \sum_{i=1}^{N} K\!\left( \frac{y - x_i}{h} \right) \qquad (5.7)
    where K is the kernel function, h is the bandwidth parameter, and d is the number of
    dimensions.
    Following the derivation of Comaniciu and Meer [23], the kernel function is expressed in
terms of a profile function k, so that K(x) = c\,k(\|x\|^2), where c is some constant. For a
Gaussian kernel, the profile function is

k(x) = \exp\!\left( -\tfrac{1}{2} x \right) \qquad (5.8)
which leads to an isotropic multivariate Gaussian kernel:

K(x) = \frac{1}{(2\pi)^{d/2}} \exp\!\left( -\tfrac{1}{2} \|x\|^2 \right) \qquad (5.9)

where the constant c became (2\pi)^{-d/2}. Substituting the Gaussian kernel function (5.9)
    into equation 5.7 yields the function for directly calculating the KDE at any point:
\hat{f}(y) = \frac{1}{N h^d} \sum_{i=1}^{N} \frac{1}{(2\pi)^{d/2}} \exp\!\left( -\frac{1}{2} \left\| \frac{y - x_i}{h} \right\|^2 \right) \qquad (5.10)
    which states that the estimate of the density at any point is proportional to the sum of
    kernels placed at all the data, evaluated at that point, as the illustration above showed.
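Equation 5.10 translates directly into code; a sketch using NumPy (illustrative only, not the implementation used in this work) is given below.

import numpy as np

def gaussian_kde(y, data, h):
    """Evaluate the kernel density estimate of equation 5.10 at a point y.

    y    : (d,) query point
    data : (N, d) array of observations x_i
    h    : kernel bandwidth
    """
    data = np.asarray(data, dtype=float)
    N, d = data.shape
    diff = (np.asarray(y, dtype=float) - data) / h      # (y - x_i) / h
    sq_norm = np.sum(diff ** 2, axis=1)                  # squared norms
    kernels = np.exp(-0.5 * sq_norm) / (2.0 * np.pi) ** (d / 2.0)
    return kernels.sum() / (N * h ** d)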
    In our case, we use the KDE to recover the distribution of normals observed in the
    image, based on the hypothesis that while the normals are smoothly varying due to the
    region sweeping, they will cluster around the true plane orientations. In a two-plane
    image, for example, two distinct modes should be apparent in the density estimate. The
    values of these modes will then correspond to the normals around which the observations
    are clustered, and which will be used to segment them. These normals become the
    discrete set of class labels for the MRF. If we assume the data vary smoothly, and are
    approximately normally distributed about the true normals, this justifies the use of a
    Gaussian kernel function.
    The normal vectors have so far been represented in R 3 , but can be more compactly
    represented with spherical coordinates, using a pair of angles θ and φ . An important
issue when dealing with angular values is to avoid wrap-around effects (angles of 350°
and -10° are the same, for example). Conveniently, because we only represent angles
    facing toward the camera (because we cannot see planes facing away), only half of the
space of all θ, φ is used (to be precise, the hemisphere θ ∈ [π/2, 3π/2] and φ ∈ [0, π]), and
    so with careful choice of parameterisation, we can avoid any such wrapping effects, and
    stay away from the ‘edge’ of the parameter space (and so our formulae below may appear
    different from standard spherical coordinates). The transformation of a normal vector
n = (n_x, n_y, n_z)^T to and from angular representation is therefore:

\theta = \tan^{-1}\!\left( \frac{n_x}{n_z} \right), \qquad \phi = \cos^{-1}\!\left( \frac{n_y}{|n|} \right);
\qquad n_x = \sin\theta \sin\phi, \quad n_y = \cos\phi, \quad n_z = \cos\theta \sin\phi \qquad (5.11)
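A sketch of this conversion is shown below (illustrative Python; arctan2 is used in place of a plain inverse tangent so that the quadrant is unambiguous, which is an implementation choice rather than part of equation 5.11).

import numpy as np

def normal_to_angles(n):
    """Convert a normal n = (nx, ny, nz) to the (theta, phi) pair of equation 5.11."""
    nx, ny, nz = n
    theta = np.arctan2(nx, nz)                    # quadrant-aware tan^-1(nx / nz)
    phi = np.arccos(ny / np.linalg.norm(n))
    return theta, phi

def angles_to_normal(theta, phi):
    """Inverse mapping from (theta, phi) back to a unit normal."""
    return np.array([np.sin(theta) * np.sin(phi),
                     np.cos(phi),
                     np.cos(theta) * np.sin(phi)])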
    To use the KDE we need to set the value of the bandwidth parameter h . There are
    various methods outlined in the literature for automatic bandwidth selection [22], or
    even the use of varying bandwidths according to the data [24]. For convenience, we set
    our bandwidth by observing the performance on training data, to a value of 0.2 radians
    (see our experiments in Section 6.2.2) — though we acknowledge that more intelligent
    selection or adaptation of the bandwidth would be a worthwhile area for exploration.
    Figure 5.10: Visualisation of the kernel density estimate (KDE) for the dis-
    tribution of normals in an image. The estimated normals from the local plane
estimate (a) are used to calculate a KDE with a Gaussian kernel. Plot (b) shows
all the normals represented in θ, φ space, and (c) shows the colour map of the KDE
    (red denotes higher probability density). (d) shows a 3D representation, making
    the structure of the density clearer.
    5.8.5.2 KDE Visualisation
    Because the kernel density estimates are calculated in the 2D space of angles, the process
    is easy to understand and visualise. The density can be represented as a 2D image, with
    the horizontal and vertical axes corresponding to the two angles, where the KDE at each
    point is represented by the colour of the pixel. Since the 2D heat map is not particularly
    easy to interpret, we can also show the KDE visualised as a 3D surface, where the axes
    in the horizontal plane correspond to the angles and the height is the magnitude of the
    density estimate at the point (the mesh is built simply by connecting points in a 4-way
    grid).
    In Figure 5.10 we show some examples, consisting of the local plane estimate (Section
    5.7), showing all the normals in the image, plus the 2D and 3D representations of the
    KDE. In each example, one can clearly see the peaks in the KDE corresponding to the
    dominant planes in the scene, whose relative height (density estimate) roughly corre-
    sponds to the size of the plane (i.e. the number of observations relating to it). An
    important point about the KDE is that it requires no a priori knowledge of the number
    of modes (in contrast to K-means, for example), and is entirely driven by the data. This
    gives us an elegant way of choosing the number of planes present in the image.
    5.8.5.3 Mode Finding and Mean Shift
    The above examples suggest that the KDE is a suitable way of describing the underlying
    plane orientations, but does not yet give an easy way to actually find the values of these
    modes. Sampling the space, at the resolution displayed above (the data are in a space
    of 100 × 200 divisions in θ and φ ), is rather time consuming since the KDE must be
evaluated at each point, and each evaluation of \hat{f} involves the summation of Gaussian
    kernels centred at each of the N observations.
    A better solution is to use mean shift [17], which is a method for finding the modes
    of a multivariate distribution, and is intimately related to the KDE. The idea in mean
    shift is to follow the direction of steepest ascent (the normalised gradient) in the KDE
    until a stationary point is reached, i.e. a local maximum. By starting the search from
    sufficiently many points in the domain, all modes can be recovered. In order to find the
    modes, it is not actually necessary to calculate the KDE itself (neither over the whole
    space as in the visualisations, nor even strictly at the points themselves, unless we wish
    to recover the probabilities, which we do once after convergence), since the necessary
    information is encoded implicitly by the gradient function.
    The mean shift vector m ( y ) at some point y describes the magnitude and direction in
    which to move from y toward the nearest mode. We omit a full derivation of how the
    mean shift vector is obtained from the partial derivatives of the KDE equations — a
    thorough exposition can be found in [23]. In general, the mean shift vector for a point
    y is defined as:
m(y) = \frac{ \sum_{i=1}^{N} x_i \, g\!\left( \left\| \frac{y - x_i}{h} \right\|^2 \right) }{ \sum_{i=1}^{N} g\!\left( \left\| \frac{y - x_i}{h} \right\|^2 \right) } - y = \acute{m}(y) - y \qquad (5.12)

where g(x) = -k'(x) is the negative derivative of the kernel profile described above,
and we have used \acute{m}(y) to conveniently denote the left hand term in the mean shift
expression. Since this is the direction in which the point y should be moved, the new
value for y, denoted y', is simply y' = y + m(y) = \acute{m}(y), and so \acute{m}(\cdot) is a function
    updating the current point to its next value on its path to the mode. Using the Gaussian
profile function k(x) = \exp(-\tfrac{1}{2}x), the function g(x) = \tfrac{1}{2}\exp(-\tfrac{1}{2}x), and by substituting
this into the above, we obtain

\acute{m}(y) = \frac{ \sum_{i=1}^{N} x_i \exp\!\left( -\frac{1}{2} \left\| \frac{y - x_i}{h} \right\|^2 \right) }{ \sum_{i=1}^{N} \exp\!\left( -\frac{1}{2} \left\| \frac{y - x_i}{h} \right\|^2 \right) } \qquad (5.13)
To run mean shift we use a set of points Y = \{y_1, \dots, y_N\}, initialised by setting all
y_i = x_i, i.e. a copy of the original data. At each step, we update all the y_i with
y_i' = \acute{m}(y_i), and iterate until convergence. Using separate variables x and y highlights
    the fact that while it is the original data that we update and follow, the data X used for
    calculating the vectors remain fixed (otherwise the KDE itself would alter as we attempt
    to traverse it).
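A sketch of this procedure, using the Gaussian update of equation 5.13, is given below (illustrative Python; the convergence tolerance and iteration limit are arbitrary choices, not values from our implementation).

import numpy as np

def mean_shift_update(y, data, h):
    """One application of equation 5.13: the Gaussian-weighted mean of the data."""
    sq = np.sum(((y - data) / h) ** 2, axis=1)
    w = np.exp(-0.5 * sq)
    return (w[:, None] * data).sum(axis=0) / w.sum()

def mean_shift(data, h, tol=1e-5, max_iters=200):
    """Run mean shift from every datum; the returned points cluster at the modes."""
    data = np.asarray(data, dtype=float)
    points = data.copy()                 # the y_i, initialised to the data
    for i in range(len(points)):
        y = points[i]
        for _ in range(max_iters):
            y_new = mean_shift_update(y, data, h)
            if np.linalg.norm(y_new - y) < tol:
                break
            y = y_new
        points[i] = y
    return points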
    5.8.5.4 Accelerating Mean Shift
    The mean shift process itself is time consuming, since it requires iteration for every one
    of the data points, of which there are several hundred. This amounts to a significant
    factor in the overall time for plane detection. Fortunately, we can make some alterations
    to achieve significant speedups. This is because at each iteration of mean shift for a given
point, its next value y' is determined entirely by its current value y (i.e. location in θ, φ
    space) and the direction and magnitude of the gradient m ( y ). Furthermore, there is
    no distinction between the points, which means once we know the trajectory of a given
    point toward its mode, then all other points, which fall anywhere on that trajectory,
    will behave the same. This means we can avoid re-computing trajectories which will
    eventually converge, allowing us to significantly accelerate mean shift.
    Two points’ trajectories may not coincide exactly, so we divide the space into a grid of
    cells, their size being on the order of the kernel bandwidth. As soon as a point enters a
    cell through which another point has passed (at any time during its motion), we remove
    it, leaving just one point to be updated for that trajectory. Ideally, this will lead to
    only one point per mode being updated by the end of the process, which means we are
    still able to find the same modes, without the wasted computation of following every y i
    which will terminate there.
    This algorithm is based on the assumption that two points within the same cell will
    converge to the same mode, which is not always true, because points arbitrarily close to
    a watershed between two modes’ basins of attraction will diverge. We found that if the
    cell size is made smaller than the bandwidth, this does not appear to present a problem.
    This fairly simple approach is sufficient for our needs, achieving a speedup of around
    100 × while giving almost identical results to full mean shift. The experiments described
    in Section 6.2.2 provide evidence for this.
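A sketch of this pruning scheme is given below (illustrative Python; the half-bandwidth cell size and the bookkeeping details are assumptions made for the sketch, not a description of our exact implementation). Nearby converged points can be merged afterwards so that each mode is reported once.

import numpy as np

def accelerated_mean_shift(data, h, cell_size=None, tol=1e-5, max_iters=200):
    """Mean shift with trajectory pruning: a point is dropped as soon as it
    enters a grid cell already visited by another point's trajectory."""
    data = np.asarray(data, dtype=float)
    if cell_size is None:
        cell_size = 0.5 * h               # cells smaller than the bandwidth

    def cell_of(y):
        return tuple(np.floor(y / cell_size).astype(int))

    visited = {}                          # cell -> index of the claiming point
    points = data.copy()
    active = list(range(len(points)))
    converged = []

    for _ in range(max_iters):
        if not active:
            break
        still_active = []
        for i in active:
            sq = np.sum(((points[i] - data) / h) ** 2, axis=1)
            w = np.exp(-0.5 * sq)
            new = (w[:, None] * data).sum(axis=0) / w.sum()
            moved = np.linalg.norm(new - points[i])
            points[i] = new
            owner = visited.setdefault(cell_of(new), i)
            if owner != i:
                continue                  # another trajectory already passed here
            if moved < tol:
                converged.append(i)       # this point has reached a mode
            else:
                still_active.append(i)
        active = still_active
    # Positions of the surviving trajectories; near-duplicates can be merged.
    return points[converged]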
    5.8.5.5 Segmentation
    Finally, we use the modes given by mean shift as the discrete labels in the MRF (after
    converting back to normals in R 3 ). Mean shift can give us the identity of the mode to
    which each point converges. Equivalently, we can initialise the nodes by setting their
    initial label, denoted n i to be the one closest (measured by angle) to its observed value
    (where the observed values of the nodes are the m i , from region sweeping).
    The MRF optimisation can now proceed, using only the discrete labels provided by mean
    shift. This guarantees it will output the number of different planes we have already
    found to be in the image, and avoids searching over the continuum of normals. This
    is formulated as optimisation of an energy function E ( n ), where n is a configuration
of normals n = \{n_1, \dots, n_{|S'|}\} on the nodes in S' \subseteq S, the subset of points in the
graph which were segmented into planar regions. As before we desire to find the optimal
configuration n^* = \arg\min_{n} E(n), by minimising the energy E(n), expressed as a sum of
    clique potentials:
E(n) = \sum_{i \in S'} F_1(n_i, m_i) + \sum_{i \in S'} \sum_{j \in N_i} F_2(n_i, n_j) \qquad (5.14)
    where both clique potential functions F 1 and F 2 return the angle between two vectors in
    R 3 , thus penalising deviation of labels from the observations, and neighbours from each
    other. This is optimised using ICM, which converges to groups of spatially contiguous
    points corresponding to the same normal. This is illustrated in Figure 5.11.
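A sketch of this discrete-label optimisation, with both potentials of equation 5.14 implemented as the angle between unit vectors, is given below (illustrative Python only, not the thesis implementation).

import numpy as np

def angle_between(a, b):
    """Angle in radians between two unit vectors (used for both F1 and F2)."""
    return np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))

def segment_orientations(m, neighbours, modes, max_iters=20):
    """Assign each planar point one of the mode normals found by mean shift.

    m          : dict node -> observed normal (unit vector, from region sweeping)
    neighbours : dict node -> list of adjacent planar nodes
    modes      : list of candidate mode normals (unit vectors)
    """
    # Initialise each node to the mode closest in angle to its observation.
    labels = {i: min(range(len(modes)),
                     key=lambda k: angle_between(m[i], modes[k]))
              for i in m}
    for _ in range(max_iters):
        changed = False
        for i in labels:
            def local_energy(k):
                e = angle_between(modes[k], m[i])                     # F1
                e += sum(angle_between(modes[k], modes[labels[j]])    # F2
                         for j in neighbours[i])
                return e
            best = min(range(len(modes)), key=local_energy)
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels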
    Figure 5.11: Segmentation of planes using their orientation. From the ini-
    tial sweep estimate of orientation (a), we initialise a second MRF (b), where
    each normal is coloured according to its orientation (see Figure 5.12), showing
    smooth changes between points. After optimisation, the normals are now piece-
    wise constant (c), and correspond to the two dominant planes in the scene. After
    segmentation by finding connected components with the same normal, we obtain
    an approximate plane detection, where the normals shown are simply the mean
    of all salient points in the segments (d).
    Figure 5.12: The colours which represent
    different orientation vectors (the point in this
    map to which a normal vector from the centre
    of the image would project gives the colour it
    is assigned).
    5.8.6 Region Shape Verification
    After running the two stages of segmentation, we are left with a set of non-planar seg-
    ments (from the first stage), and a set of planar segments with orientation estimates,
    from the second. Before using these to get the final plane detection, we discard inap-
    propriate regions. First, any regions smaller than a certain size are discarded, since they
    are not likely to contribute meaningfully to the final detection, nor be given reliable ori-
    entation estimates in the next step. We apply a threshold to both the number of points
    in the region, typically 30, and to the pixel area covered by the region, typically 4000
    pixels. Second, we also remove excessively elongated regions, a fairly unusual shape for
planes, by calculating the eigenvalues of the covariance of the 2D point positions in a region,
and discarding those where the smaller eigenvalue is less than a tenth of the larger.
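A sketch of these checks is given below (illustrative Python; the thesis does not specify exactly how the pixel area of a region is measured, so a bounding-box area is assumed here).

import numpy as np

def keep_region(points_2d, min_points=30, min_area=4000.0, ratio=10.0):
    """Decide whether a segmented region should be kept.

    points_2d : (N, 2) array of the salient point positions in the region
    """
    points_2d = np.asarray(points_2d, dtype=float)
    if len(points_2d) < min_points:
        return False
    # Area test, approximated here by the bounding box of the points.
    extent = points_2d.max(axis=0) - points_2d.min(axis=0)
    if extent[0] * extent[1] < min_area:
        return False
    # Elongation test: eigenvalues of the covariance of the point positions.
    evals = np.linalg.eigvalsh(np.cov(points_2d.T))
    return evals.min() >= evals.max() / ratio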
    5.9 Re-classification
Figure 5.11d above shows the result of plane segmentation, a collection of disjoint groups
of points with normal estimates. However, these are not themselves the final plane
    detection, since they have not actually been classified or regressed. These regions simply
    have an average of the plane class and orientation of the points which lie within them,
    which in turn are derived from all sweep regions in which they lie. This means the
    estimates for these regions use data which extend well beyond their extent in the image.
    We are now finally in a stage where we can run our original plane recognition algorithm
    from Chapter 3 on appropriately segmented regions. We re-classify the planar segments,
    to ensure that they are planar — generally they remain so, though often a few outliers
    are discarded at this stage. We do not attempt to re-classify the non-planar segments,
    because we have already removed them from consideration, and they were not part of
    the orientation segmentation. For those which are classified as planes, we re-estimate
    their orientation, so that the orientation estimate is derived from only the points inside
    the region. In general we find that the re-estimated normals are not hugely different
    from the means of the segments’ points. This is encouraging, since it suggests that using
    the sweep estimates to segment planes from each other is a valid approach.
    5.9.1 Training Data for Final Regions
    The data we wish to classify here are rather different in shape from the training data.
    This is because the training data were created by region sweeping, because we wished to
    learn from regions of similar shape to those used during detection (Section 5.6). However,
    the segments resulting from the MRF are often much more irregular, as well as being
    different in shape and size from our original manually segmented regions. To address
    this we introduce another set of training data, thereby creating a second pair of RVMs,
    trained for this final task alone.
    Appropriate training data are generated by applying the full plane detection algorithm
    to our ground truth training images (Section 5.5), and using the resulting plane and
    non-plane detections as training data. These should have similar shapes to detections
    on test data. Of course, we can use the ground truth labels for these segments, so
    the classification and orientation accuracy on the final segments is irrelevant. We also
    enhanced this training set by including all marked-up regions from the ground truth
    images, whose shapes correspond to those of true planes and non-planes. We found that
replacing the final RVMs with ones trained on this better-matched data increased the orientation
    accuracy of the final result by several degrees on average (the segmentation itself, of
    course, is completely unaffected by this).
    We now have an algorithm whose final output is a grouping of the points of the image into
    planar and non-planar regions, where the planar regions have a good estimate of their
    orientation, provided by our original plane recognition algorithm trained on example
    data.
    5.10 Summary
    This chapter has introduced a new algorithm to detect planes, and estimate their 3D
    orientation, from a single image. This has two important components: first, a region
    sweeping stage in which we repeatedly sample regions of the image with the plane recog-
    nition algorithm from Chapter 3, in order to find the most likely locations of planes.
    This allows us to calculate a ‘local plane estimate’ which gives an approximate plane
    probability and orientation to all salient points. Second, we use this intermediate re-
    sult within a two-stage Markov random field framework, to segment into planar and
    non-planar regions, before running plane recognition again on these to obtain the final
    classification and orientation.
    Unlike existing methods, we do not rely upon rectilinear structure or vanishing points,
    nor on specific types of texture distortion. This makes the method applicable to a
    wider range of scenes. Furthermore, since we do not rely on any prior segmentation of
the image, the algorithm is not dependent on any underlying patterns or structure being
    present, for example planar regions being demarcated by strong lines. Most importantly,
    this algorithm requires only a single image as input, and does not need any cues from
    stereo or multiple views, in contrast to most methods for plane detection. In the following
    chapter, we evaluate the performance of the algorithm on various images, and investigate
    the effects of the design decisions outlined above. We also compare the method to existing
    work.
    CHAPTER 6
    Plane Detection Experiments
    This chapter presents the results of experimental validation of our plane detection algo-
    rithm. We discuss the effect of some of the parameters on plane detection, and describe
    the experiments with which we chose the best settings, by using our training dataset. We
    then show the results of our evaluation on an independent dataset. Finally, we describe
    the comparison of our algorithm to a state of the art method for extracting scene layout,
    with favourable results.
    6.1 Experimental Setup
    To evaluate the performance of the plane detection algorithm, we used the manually
    segmented ground truth data described in Section 5.5. These were whole images hand-
    segmented into planar and non-planar regions, and the planar regions were labelled with
    an orientation. This means that every pixel in the image was included in the ground
    truth, and there was no ambiguity. For regions which were genuinely ambiguous (such
    as stairs and fences), we tried to give the semantically most sensible labels, and made
    sure that we were consistent across training and test data.
    The key difference compared to the experiments in Chapter 4 is that we now have marked-
    up detections, rather than class and orientation for individual planes. The difficulty is
    to evaluate how well one segmentation of the image (our detection, which has sections
    missing) corresponds to another (the ground truth). Comparing segmentations in general
    is not easy (especially when there are multiple potentially ‘correct’ answers [130]). Our
    approach is based on assessing the classification and orientation accuracies across all the
    detected segments.
    6.1.1 Evaluation Measures
    We use two evaluation measures, for the classification accuracy and orientation error.
    We cannot directly compare ground truth regions and detected regions, since there will
    not be a direct correspondence. Instead, we perform the comparison via the set of salient
    points, which is the level at which our detection and grouping actually operate.
    We measure classification accuracy as the mean accuracy over all salient points (mean
    over the image, or mean over all points in all images when describing a full set of tests).
That is, we take the ground truth class of a point to be the class of the labelled region
in which it falls, and its estimated class to be the class of the detected plane of which it
forms a part.
    We take a slightly different approach to evaluating orientation accuracy. While we could
    have taken the mean orientation error over salient points, as above, this would not give a
    true sense of how points are divided into planes, giving undue influence to large planes,
    which contain more salient points. In keeping with the objective of evaluating the planes
    themselves, rather than the points, we evaluate orientation accuracy as the orientation
    error for whole regions. This is calculated by comparing to the true orientation of the
    region, which is taken to be the mean orientation of its salient points, whose orientation
    comes from the true region in which they lie. This means that if the detected plane
    is entirely within one ground truth region, we are comparing to that region’s normal;
    otherwise, we are comparing to the mean of those regions it covers, weighted by the
    number of points from each region.
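A sketch of these two measures for a single image is given below (illustrative Python; the data structures and the function name are hypothetical, chosen only to make the computation explicit).

import numpy as np

def evaluate_image(gt_class, est_class, gt_normal, detected_regions, est_normals):
    """Point-level classification accuracy and per-region orientation error.

    gt_class, est_class : dict point -> 0/1, ground-truth and estimated class
                          for every salient point used in some region
    gt_normal           : dict point -> ground-truth normal of the labelled
                          region containing that point
    detected_regions    : list of lists of point ids, one list per detected plane
    est_normals         : list of estimated unit normals, one per detected plane
    """
    # Classification accuracy over all salient points.
    accuracy = np.mean([gt_class[p] == est_class[p] for p in gt_class])

    # Orientation error per detected plane, against the mean ground-truth normal
    # of its points (implicitly weighted by the number of points per true region).
    errors = []
    for region, n_est in zip(detected_regions, est_normals):
        n_true = np.mean([gt_normal[p] for p in region], axis=0)
        n_true = n_true / np.linalg.norm(n_true)
        errors.append(np.degrees(np.arccos(np.clip(np.dot(n_est, n_true), -1.0, 1.0))))
    return accuracy, errors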
    6.2 Discussion of Parameters
    Each of the components of the algorithm described in Chapter 5 will have some effect
    on the performance of the plane detection algorithm. In this section, we show the
    results of experiments conducted to investigate how performance changes for different
    configurations of some important parameters. The results of these experiments were then
    used to choose the best parameters to use in subsequent evaluation. These experiments
    used our training set of ground truth data, comprised of 439 images.
    6.2.1 Region Size
    The first parameter we needed to set was the region size for sweeping (the process by
    which we extract multiple overlapping regions from an image, described in Section 5.4).
    This is both a part of the plane detection algorithm, and used to gather the training
    data used for the plane recognition algorithm.
    It was not immediately obvious what the best size would be as there is a trade-off
    between the accuracy and the specificity of regions. We would generally expect larger
    regions to perform better than smaller regions, since there is more visual information
    available. However, using regions which are as large as possible was not advisable. This is
    because these regions were not hand segmented, so larger regions would begin to overlap
    adjacent true regions, which would have a different class or orientation. Thus we needed
    to compromise between having large regions, and having ‘pure’ regions, by which we
    mean those that lie within only one ground truth region.
    To investigate region size, we first conducted an experiment where we stepped through
    different sizes of region when using the plane recognition algorithm. We ran this ex-
    periment using plane recognition, rather than plane detection, since it was simpler to
    set up and gave a direct way of seeing how region size affected classification, without
    considering other factors such as the segmentation.
    Using each of the region sizes, we harvested training regions by sweeping over a large set
    of fully labelled ground truth images (see Sections 5.4 and 5.5), where region size was
    specified by the radius around the centre point within which neighbouring salient points
    were included. We then fully trained the plane recognition algorithm, as described in
    Chapter 3, and tested it using cross-validation.
(a) Overall accuracy. (b) Overall orientation error. (c) Area of regions. (d) Purity, defined
as the amount by which true regions are mixed. (e) Covariance of region orientations.
(All quantities plotted against sweep radius.)
    Figure 6.1: Using different region sizes (measured in pixels) when using sweep-
    ing to get training data. These experiments are for plane recognition.
    Figure 6.1 shows the results. Increasing the region size improved performance; yet con-
    trary to expectation performance continued to rise even for very large regions, which
    occupied a significant fraction of the whole image, mixing the different classes. This
    is evident in Figures 6.1a and 6.1b, which show the mean classification accuracy and
    orientation error respectively. Figure 6.1c confirms that as the radius was increased, the
    actual area of regions did indeed get larger.
    Nevertheless, using arbitrarily large regions is inadvisable. As Figure 6.1d shows, when
    the regions were bigger, they were on average less ‘pure’, meaning they tended to overlap
    multiple ground truth regions, mixing classes and orientations. We define purity as the
    percentage of points in a sweep region which belong to the largest truth region falling
    inside that sweep region. Thus, the trade-off is that as we use larger regions, we can
    be less confident that they accurately represent single planar or non-planar regions,
    potentially making the training set more difficult to learn from as appearance corresponds
    less consistently to specific orientations.
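A sketch of this purity measure is given below (illustrative Python only).

from collections import Counter

def purity(sweep_region_points, gt_region_of):
    """Fraction of a sweep region's points belonging to the largest ground-truth
    region falling inside it.

    sweep_region_points : list of point ids in the sweep region
    gt_region_of        : dict point id -> ground-truth region id
    """
    counts = Counter(gt_region_of[p] for p in sweep_region_points)
    return max(counts.values()) / len(sweep_region_points)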
    We explain the continually increasing performance with Figure 6.1e, where we plot the
    determinant of the covariance matrix formed from all the regions’ true orientations.
    As the regions got bigger, the covariance got smaller, which means the normals were
    becoming more similar to each other. This suggests that as regions get larger, they will
    tend to cover more nearby planes, averaging out the difference between them. As planar
    regions approach the size of the image, the orientation is simply the mean orientation
    over all points in the ground truth. For an image with multiple planes, these will likely
    average out to an approximately frontal orientation. Thus we would expect to see less
    overall variation in the normals, as indeed we did. The lower amount of variability in
    turn makes these orientations easier to regress.
    This experiment confirmed our suspicion that very small regions were not suitable for
    classification, which supports the case for using our sweeping-based method rather than
    over-segmentation (c.f. Section 5.1.2). The experiment also suggested that overly large
    regions were not suitable either. Unfortunately since the performance measures kept on
    rising, this experiment could not be used to determine which region size should be used,
    and it was necessary to consider what effect the region size had on segmentation. As
    such the following experiment was conducted to further investigate this issue.
    6.2.1.1 Testing Region Size using Plane Detection
    Since testing the recognition algorithm on its own was not a good way to find the opti-
    mum region size for plane detection, we instead ran a test on the whole plane detection
    algorithm, in cross-validation, for different region sizes. From this we could directly ob-
    serve any effect of changing the region size on the results of the plane detection, rather
    than relying on a proxy for how well it would perform. This experiment involved carry-
    ing out all of the steps described in the previous chapter, repeatedly, for multiple sets of
    truth regions with different sweep settings.
    The experimental procedure consisted of k -fold cross-validation with n repeats, described
    as follows. We looped over a set of region sizes, from a radius r of 20 pixels up to 100 in
    increments of 10. For each, we ran k -fold cross-validation, where we divided the data into
    k equal sets, and for each fold, used all but one to train the algorithm. This was done
    by detecting salient points, creating and quantising features and so on, then running
    region sweeping to extract training regions, with which we trained a pair of RVMs. It
    was necessary to also gather regions for the final re-classification step, so we ran the
    plane detection on these test images and retained the resulting detected regions (plus
    the ground truth regions themselves), and trained a second pair of RVMs. Finally, we
(a) Classification accuracy and (b) orientation error, plotted against sweep radius.
    Figure 6.2: Changing the size of sweep regions (measured in pixels) for training
    data used for plane detection. The error bars show one standard deviation either
    side of the mean, calculated over twelve repeat runs.
    used this plane detector on the images held out as test data for this fold, using a sweep
    radius r , and stored the mean accuracy and orientation error compared to its ground
    truth. We repeated all of this n times, and calculated the mean and standard deviation
    of the results for each radius.
    The results are shown in Figure 6.2, which graphs the classification accuracy and orien-
    tation error, against sweep region radius. The graphs show the results from twelve runs
    of five-fold cross-validation for each sweep size – the points are at the mean, and the
    error bars show one standard deviation either side of the mean, over the twelve runs.
    Even with this many runs, the results were somewhat erratic. This may have been due
to using only a quarter of the set of truth images we had available (115 images were
used), and limiting the training stage to use a maximum of 1000 sweep regions for each
    run, due to the large amount of time this took. As such the overall accuracy might have
    been diminished, and the uncertainty shown by the error bars is so large that the best
    parameters are hard to determine.
    The results hint that the optimal sweep size for classification and regression may be
    a little different, with the best classification accuracy occurring at small radii, while
    orientation estimation seemed to improve up to a radius of around 90 pixels. If this
    behaviour is real, we could in principle use two separate region sizes for the two tasks,
    which would entail having two entirely different sets of training regions, not just different
    descriptors, c.f. Section 3.3.4.5. However, since the results of this experiment are not
    conclusive, we chose the simpler option of using the same radius for both; this experiment
    does not dictate a clear value to use, so a radius of 70 pixels was chosen as a compromise.
    This value was used for the rest of the experiments in this chapter, unless otherwise
    stated.
    6.2.2 Kernel Bandwidth
    We performed an experiment to determine the optimal kernel bandwidth for mean shift
    (to find the modes of the kernel density estimate of all orientations in an image, as de-
    scribed in Section 5.8.5.1). It would be possible to evaluate this as above, by running the
    whole plane detection algorithm for a range of different bandwidth parameters. How-
    ever, this would give an indirect measurement of the quality of the segmentation, since it
    would measure the accuracy after having re-classified the detected regions, the result of
    which would depend on all the stages of plane detection. To compare kernel bandwidths,
    we needed only to determine the best granularity for splitting planes from each other,
    given the local plane estimate. This was done by using just the local plane estimate,
    calculated not from classification but from ground truth data, which was sufficient since
    for ground truth images we know the true number of planes we should find.
    The experiment was set up as follows: we calculated the local plane estimate, using region
    sweeping, but rather than run full plane recognition we simply used the ground truth
    data for each region. We processed overlapping regions as before, taking the median and
    geometric median of classifications and orientations respectively. This gave us a local
    plane estimate which was as good as possible, independent of the features or classifiers.
    After segmenting the planes from non-planes (using the first MRF), we took all the
    estimated normals from points in planar regions and ran mean shift to find the modes.
    We were able to evaluate these modes without running the subsequent segmentation,
    because ideally each true plane would correspond to one mode, and so we measured
    the difference between the number of modes and the number of true planes (we counted
    parallel planes as one, since their orientation does not distinguish them). This evaluation
    did not consider whether the orientations for the modes returned by mean shift were
    actually correct, but given that we were testing on a ground truth local plane estimate,
    it should not be an issue.
    Figure 6.3 shows the results. Clearly, very small bandwidths (corresponding to a very
    rough density estimate) were poor, as they led to far too many plane orientations. Ac-
    curacy improved until around 0.2, after which there was a very slow deterioration (pre-
(a) The mean error in number of modes compared to true planes. Error bars correspond
to one standard deviation, calculated over all test images. (b) Mean time taken for mean
shift, comparing the full algorithm and our accelerated version. (Both plotted against
kernel bandwidth, for full and accelerated mean shift.)
    Figure 6.3: Results for evaluating bandwidth for mean shift, when finding the
    modes within the local plane estimate.
sumably arbitrarily high bandwidths would increase the error, as there would always be
only one mode), and so we chose this as our bandwidth value for further experiments.
    We also compared the full mean shift algorithm, in which all the data were iterated until
    they reached their mode, with our accelerated version (Section 5.8.5.4). The results
    confirm that this approximation does not increase the error. Figure 6.3a shows the
    results of the two overlaid, with virtually no visible difference in either mean or standard
    deviation. Figure 6.3b plots the average time per image for both methods. Clearly, our
    accelerated version offered a huge improvement in speed, being on average 100 times
    faster.
    6.3 Evaluation on Independent Data
    We proceeded to evaluate our algorithm on an independent dataset. For training, we
    used the set of 439 ground-truth images from above, which were first reflected about the
    vertical axis, to double the number of images from which training data was gathered.
    From these we extracted around 10000 regions by sweeping, which were used to train
    the full plane detector. Our independent dataset was taken from a different area of the
    city, to ensure we had a proper test of the generalisation ability of the algorithm. This
    consisted of 138 images, which were also given ground truth segmentations, and plane
class and orientation labels. These exhibited a variety of types of structure, including
    roads, buildings, vehicles and foliage. As in Chapter 4 the aim was for these to be
    totally unseen image regions, though we must specify a few caveats. First, these data are
    harvested from the same video sequences (but not the same frames) as the independent
    dataset of Chapter 4, so the location is not wholly new (again, this location was never
    used in training data). A subset of the data were used to evaluate the algorithm as
    described in [57], so some of the images will have been seen before during testing (this is
    the subset used in Section 6.4 below). We also used some of these images (because of the
ground truth labelling being available) in Section 5.6.1.1, to show that extracting training
regions by sweeping was necessary when the test regions were similarly extracted. Other
than these exceptions, the test data described here are new and unseen during the
    process of developing the algorithm.
    6.3.1 Results and Examples
    When running the evaluation on our test data, we obtained a mean classification accuracy
    of 81%, which was calculated over all points, except those which were not used in any
    regions. We obtained a mean orientation error of 17.4 (standard deviation 14.7 ). Note
    that standard deviation was calculated over all the test images, not over multiple runs.
    This is larger than the error of 14.5 obtained on basic plane recognition (Section 4.3),
    but the task was very different, in that it first needed to find and segment appropriate
    regions.
    These results are a little worse (though within a few percent for both measures) than
    we originally reported in [57]. However, those were obtained using a smaller set of test
    images, which may have been less challenging.
    To clarify what these results mean, we show a histogram of the orientation errors in
    Figure 6.4. Comparing this to Figure 4.7 in Chapter 4 shows it to be only a little worse,
    even though this is a much more difficult task, with a large majority of the orientation
errors remaining below 20°. We believe that these results are very reasonable given the
    difficulty of the task, in that these planes were not specified a priori but were segmented
    automatically from whole images, without recourse to geometric information.
    We now show example results from this experiment. First, in Figure 6.5, we show some
    manually selected example results, to showcase some of the more interesting aspects of
Figure 6.4: Distribution of orientation errors for regions detected in the independent set
of test images.
the method. Then, in order to show a fair and unbiased sample from our results, we choose
    the example images to display in the following way. We sort all the results by the mean
    orientation error on the detected planar surfaces (this is one of two possibilities, as we
    could also use the classification accuracy over the salient points – the former was chosen
    since we believe orientation accuracy is the more interesting criterion here). We then
    take the best ten percent, the worst ten percent, and the ten percent surrounding the
median error; we then choose six random images from these sets. This is in order to give a
    fair sampling of the best cases, some typical results in the middle, and situations where
    the algorithm performs poorly. The best, medium, and worst examples are shown in
    Figures 6.6, 6.7 and 6.8 respectively.
    Our algorithm is able to extract planar structures, and estimate their orientation, when
    dominant, orthogonal structures with converging lines are apparent. Figures 6.6c and
    6.7a, for example, show it can deal with situations typical for vanishing line based al-
    gorithms. Crucially, we also show that the algorithm can find planes even when such
    structure is not available, and where the image consists of rough textures, such as Figure
    6.6e. This would be challenging for conventional methods, since there are not many
    intersections between planes, and the tops of the walls are not actually horizontal. An-
    other example is shown in Figure 6.5c, where the wall and floor have been separated
    from each other.
    Our method is also able to reliably distinguish planes from non-planes, one of the most
important features of the algorithm. For example, Figures 6.5a and 6.5d show that a car and
a tree, respectively, are correctly separated from the surrounding planes before attempting
    to estimate their orientations. This sets our approach apart from typical shape from
texture methods, which tend to assume that the input is a plane-like surface which
    Figure 6.5: Examples of plane detection on our independent test set. These
images were hand-picked to show some interesting behaviour, rather than being
    a representative sample from the results (see the next images). Columns are:
    input image, ground truth, local plane estimate, plane detection result.
    Figure 6.6: Examples of plane detection on our independent test set. These are
    randomly chosen from the best 10% of the examples, sorted by orientation error.
    Columns are: input image, ground truth, local plane estimate, plane detection
    result.
    Figure 6.7: Examples of plane detection on our independent test set. These
    have been selected randomly from the 10% of examples around the median,
    sorted by orientation error. Columns are: input image, ground truth, local plane
    estimate, plane detection result.
    Figure 6.8: Examples of plane detection on our independent test set. These
    have been selected randomly from the worst 10% of examples, sorted by orienta-
    tion error. Columns are: input image, ground truth, local plane estimate, plane
    detection result.
Figure 6.9: An example of plane detection on an image from our independent
    test set, where two planes have been split into three, due to the algorithm being
    unable to perceive the boundary between them. From top left: input image,
    ground truth, local plane estimate, plane detection result.
    needs its orientation estimated, without being able to report that it is not in fact planar.
    These results also suggest that our algorithm is able to generalise quite well to new
    environments. It does this by virtue of the way we have represented the training data
    using quite general feature descriptors, which should encode the underlying relationships
    between gradient or colour and structure, as opposed to using the appearance directly.
    This is supported by Figure 6.7f for example, where the car park structure on the left is
    correctly classified even though such structures are not represented in the training set.
    However, we observe that in these images there are some missing planes. It is common
    for our method to miss ground planes due to the lack of texture. This is because if there
    are insufficient salient points, there will be no classification nor orientation assigned,
    so major planar structures may be missed – for example in Figure 6.5e where there
    are insufficient salient points on the ground to support a plane, despite the grid-like
    appearance; and in Figure 6.7b where the road is entirely featureless and thus no planes
    (or indeed non-planes) can be found. On the other hand, in some cases, such as Figure
    6.5b, the ground plane can be detected and given a plausible orientation, at least in the
    region in which salient points exist.
    6.3.2 Discussion of Failures
    Our method fails in some situations, examples of which are shown in Figure 6.8, which
    are sampled randomly from the worst 10% of the results by orientation error. Misclas-
    sification during the sweeping stage causes problems, for example Figure 6.8b where the
    tree was classified as being planar, which led to the plane region over-extending the wall.
    Another example of a plane leaking beyond its true boundary is shown in Figure 6.8d,
    where the orientation estimated for the ground and the vertical wall are unfortunately
    quite similar, leading them to be grouped into the same segment. Misclassification also
causes problems in Figure 6.8e, where a pedestrian has been partly classified as planar
    (and the ground missed), though the region in question is rejected by the final classifi-
    cation step.
    A common problem is the inability to deal with small regions. This is due to the region-
    based classifier, and how fine detail is obscured by the nature of our region-sweeping
    stage. This is shown in Figure 6.5f for example, where rather complex configurations
    of planes partially occluded by other planes are not perceived correctly. We observed
    a tendency to over-segment, such as Figure 6.8c, where the ground plane has been
    unnecessarily divided in two, due to a failure to merge the varying normals into one
    segment. Conversely, the algorithm may fail to divide regions when it should, as we
    pointed out above in Figures 6.8b and 6.8d, and also in Figure 6.5f where the whole
    scene has been merged into one large plane. A related problem is the over-extension or
‘leaking’ of planes, as evinced by Figure 6.7b, where the plane envelops the pavement
    as well as the wall. This is an example where the algorithm has failed to respect true
    scene and image boundaries.
    Finally we consider the interesting case of Figure 6.9 (enlarged to more clearly show
    the variations in the local plane estimate, bottom left). As one might expect, the local
    plane estimate shows how the normals at the points vary smoothly around the sharp
    corner. Unfortunately, the MRF has segmented this into three rather than two segments,
    effectively seeing the middle of the transition as a plane in its own right. Currently there
    is nothing in our algorithm to prevent this, if those normals are sufficiently numerous to
    be another mode in the kernel density estimate. This is also an example where the final
    classifier has assigned incorrect orientations to the three planes, perhaps caused by the
    segmentation being incorrect.
    These failures are the most common types of error we observe (though we emphasise
    that, in accordance with Figure 6.4, the majority of orientations are good), and can
    generally be explained given the way the algorithm works. They do, however, suggest
    definite ways in which it could be improved, and hint at directions for future work.
    6.4 Comparative Evaluation
    As we discussed in Chapter 2, our algorithm is quite different from most existing meth-
    ods. For example, algorithms that use information such as vanishing points to directly
    estimate the scene geometry should perform better than ours when such features are
    available, but are not applicable to the more general types of scene we encountered. As
    such we omit a direct comparison with such methods, though we acknowledge that when
    obvious vanishing point structure is visible, they will almost certainly perform better.
    Shape from texture methods, on the other hand, may perform well in more general en-
    vironments, but impose constraints of their own. As we mentioned in the background
    chapter, these are not usually able to find the planes, and assume orientation is to be
    estimated for the whole image, so a comparison is not well defined.
    A good example of a single-image interpretation algorithm which can deal with similar
    types of image to our work is the scene layout estimation of Hoiem et al. [66] (which
    we will henceforth refer to as HSL). This uses a machine learning algorithm to segment
the image into geometric classes representing support surfaces, vertical surfaces, and sky, includ-
    ing a discrete estimate of surface orientation, and has been used to create simple 3D
    reconstructions, and as a prior for object recognition [65]. In the following, we explain
    the relevant details of this algorithm and how it relates to our own, before showing the
    results of using it for plane detection on our dataset.
    6.4.1 Description of HSL
    We have already described the HSL algorithm in our background chapter — refer to
    Section 2.3.2. Here we recap some of the more important aspects, as relevant to the
    comparison. First, note that it uses superpixels, obtained by over-segmentation (based
    on intensity and colour) as its atomic representation, rather than working with salient
    points. While these superpixels give some sense of where image boundaries are, these
    boundaries are only as accurate as the initial image segmentation.
    Superpixels are grouped together by a multiple segmentation process. This uses classifiers
    to decide whether two superpixels should be together; whether a segment is sufficiently
    homogeneous in terms of its labelling; and to estimate the likelihood of a class label per
segment. By using cues extracted from the superpixels to form segments, and then
extracting larger-scale features from these segments, the algorithm can build up structure
from the level of superpixels to the level of segments. A variety of features such as colour,
    texture, shape, line length, and vanishing point information are used for this.
    The result is a set of image segments, each labelled with a geometric class, which rep-
    resent geometric properties of image elements, as opposed to their identity or material.
    The three main classes are ground, sky, and vertical surfaces, which should be able to
    represent the majority of image segments. The vertical class is divided further into left,
    right, and forward facing planes; and porous and solid non-planes. This allows the un-
    derlying scene layout to be perceived, but offers no finer resolution on orientation than
    these labels.
    6.4.2 Repurposing for Plane Detection
    Although HSL was developed for coarse scene layout estimation, and not for plane detec-
    tion, there are some important similarities. By separating the sky from the other main
    classes, and subdividing the vertical class into the planar and non-planar (porous and
    solid) subclasses, this is effectively classifying regions into planar or non-planar classes.
    Thus, HSL can be used as a form of plane detection, and so it was our intention to
    evaluate how well it could do this — specifically, how well it could perform at our stated
    task of grouping points into planar regions and estimating their orientation.
    We ran the unmodified HSL algorithm on our dataset, then processed the output so that
    it represents plane detection. We considered the ground class, and the left, right, and
    forward facing subclasses of the vertical class to be planar, and the rest (sky, porous,
    and solid) to be non-planar. This corresponds to our separation of plane from non-
    plane. Plane orientations come from the planar subclasses of the vertical segments, and
    the support class. Re-drawing the output to show this is very interesting, as it shows
    some shortcomings of the algorithm that are not obvious in the usual way the output is
    drawn (e.g. in [64, 66]). As Figure 6.10 shows, regions which are correctly deemed to be
    vertical, and usually drawn all in red, may contain a mixture of conflicting orientations,
    and include non-planar sections.
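To make the re-labelling above concrete, the following minimal Python sketch shows one way the geometric classes could be mapped onto our plane/non-plane dichotomy; the label strings are illustrative stand-ins rather than the identifiers used in the HSL code.

    # Sketch of the re-labelling used to repurpose HSL output for plane detection.
    # Class names here are illustrative, not HSL's exact label strings.
    PLANAR_CLASSES = {"support", "vertical-left", "vertical-right", "vertical-forward"}
    NON_PLANAR_CLASSES = {"sky", "vertical-porous", "vertical-solid"}

    def as_plane_label(hsl_class):
        """Map an HSL geometric class to the plane / non-plane dichotomy."""
        if hsl_class in PLANAR_CLASSES:
            return "plane"
        if hsl_class in NON_PLANAR_CLASSES:
            return "non-plane"
        raise ValueError("unknown class: " + hsl_class)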
For these experiments, we used code provided by the authors¹. We made no changes to
    adapt it to our dataset, and the fact that we did not retrain it using the same dataset
    used for our detection algorithm (due to the difficulty of marking up the data in the
    required manner) may cause some bias in our results. However the training data used
    to train the provided classifiers (described in [66]) should be suitable, since the range of
image sizes covers the size we use, and the types of image are similar. The results we
    obtained appear reasonable compared to published examples, suggesting the method is
    able to deal sufficiently well with the data we collected.
    To compare with our algorithm, we looked at the difference in classification and ori-
    entation at each salient point in the image. This is because there was no easy way to
¹ Available at www.cs.uiuc.edu/homes/dhoiem/
(a) Input (b) Output drawn similar to the original (c) Drawn to show plane detection
    Figure 6.10: Re-purposing HSL[66] for plane detection: the original way to
    draw the output (b) indicates the majority of classifications are correct (in-
    deed, the whole surface is ‘vertical’); however when we draw to highlight planar
    and non-planar regions and to distinguish orientations (c) (colours as in Fig-
    ure 6.11) errors become apparent, including regions of non-planar classification
    falling within walls.
    correspond the regions in HSL to either ours or the ground truth; and because com-
    paring at every pixel does not make sense for our method, since unlike HSL we do not
    segment using every pixel in the image. Salient points are not used as part of HSL,
    but we assume that they are a sufficiently good sampling to represent the segmentation.
    This comparison will thus focus on textured regions, where salient points lie, so if HSL
    performs better in texture-less regions we cannot measure this; and we acknowledge that
    comparing the algorithms only at the locations where ours outputs a value, rather than
    all locations, is not necessarily a fair comparison.
    For this comparison, we continued to use the percentage of points assigned to the correct
    class as the measure of classification accuracy (as above), since this is well defined for
    both methods. However, we needed to compare orientation differently, since our algo-
rithm assigns orientations to each plane as a normal vector in ℝ³, whereas HSL can only
    give coarser division into orientation classes. To do this we sacrificed the specificity of
    our method, and quantised our orientation estimates into one of the four orientation
    classes. This was done by finding which of the four canonical vectors, representing left,
right, upwards and forwards, was closest in angle to a given normal vector. Using this
    quantisation, for both the ground truth and detected planes, orientation error was mea-
    sured as a classification accuracy. This may introduce quantisation artefacts (arbitrarily
    similar orientations near a quantisation boundary will be treated as different), but gave
    us a fair comparison.
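The quantisation step can be illustrated with a short Python sketch: a normal vector is assigned to whichever canonical class direction it is closest to in angle. The particular canonical vectors (in camera coordinates) are assumptions made for illustration; the text above names only the classes left, right, upwards and forwards.

    import numpy as np

    # Assumed canonical directions for the four orientation classes.
    CANONICAL = {
        "left":     np.array([-1.0, 0.0, 0.0]),
        "right":    np.array([ 1.0, 0.0, 0.0]),
        "upwards":  np.array([ 0.0, 1.0, 0.0]),
        "forwards": np.array([ 0.0, 0.0, -1.0]),
    }

    def quantise_normal(n):
        """Return the canonical class closest in angle to the unit normal n."""
        n = np.asarray(n, dtype=float)
        n = n / np.linalg.norm(n)
        # The smallest angle corresponds to the largest dot product of unit vectors.
        return max(CANONICAL, key=lambda name: float(np.dot(CANONICAL[name], n)))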
    6.4.3 Results
This experiment was performed using a subset of 63 images from our labelled ground truth data,
    using plane detection trained on a subset of the training data described above (the initial
    datasets for which we reported results in [57]). The detector was trained by the same
    procedure as before, except that we used regions of radius 50 pixels. We do not imagine
    this has a major impact on our conclusions, since the difference in performance between
    the two radii according to Figure 6.2 was relatively minor.
    The results are shown in Table 6.1. Our algorithm gave better results than HSL— which
    stands to reason since our method is geared specifically toward plane detection, and uses
    a training set gathered solely for this purpose.
    Example results are shown in Figure 6.11 (see caption on facing page for explanation
                          Ours   HSL
Classification accuracy   84%    71%
Orientation accuracy      73%    68%
Table 6.1: Comparison of plane and quantised orientation classification accu-
racy between our method and Hoiem et al. [66] (HSL) for plane detection.
    of colours). Note that these are illustrative examples hand-picked from the results, in
    order to show the similarities and differences between the algorithms, as opposed to a
    fully representative sample. The fifth column shows the typical output of HSL; as we
    mentioned above, these segmentations appear accurate, but do not illustrate performance
of plane detection. The final column shows the same results when drawn to show plane
detection, which no longer appear so cleanly segmented, and numerous errors in plane
    extent and surface orientation are visible. We draw the result of our method and the
    ground truth in the same manner for comparison, as well as the input image and standard
    output of our algorithm for reference.
    6.4.4 Discussion
    These images show some interesting similarities and differences between the two algo-
    rithms. In many situations, they performed similarly, such as Figures 6.12a and 6.12b,
    where in both images the main plane(s) were found and assigned a correct orientation
    class. It is worth noting that in Figure 6.12b our method did not detect the ground
    plane, due to the lack of texture and salient points, but this posed no problem for HSL.
    This is partly down to the different features used (such as shape and colour saturation),
    but also because image position is an explicit feature in HSL, meaning that pixels near
    Figure 6.11: Illustrative examples, comparing our method to the surface layout
    method of Hoiem et al. [66] (HSL). The columns are the input image, ground
    truth, our method, our method quantised, original output of HSL, and HSL
    drawn to show orientation classes. The original HSL outputs are drawn here as
    in their own work, where green, red, and blue denote ground, vertical, and sky
    classes respectively, and the symbols denote the vertical subclasses. The second,
    fourth and sixth columns are drawn so that non-planar regions are shown in blue,
    and left, right, frontal, and horizontal surfaces are drawn in yellow, red, brown
    and green respectively, overlaid with appropriate arrows — thus illustrating the
    orientation classes. Captions show classification accuracy for both methods, dis-
    played as plane/non-plane classification and orientation classification accuracy,
    respectively.
    (a) Ours: (99% , 91%) HSL: (95% , 92%)
    (b) Ours: (84% , 96%) HSL: (81% , 100%)
    (c) Ours: (85% , 93%) HSL: (75% , 20%)
    (d) Ours: (94% , 100%) HSL: (53% , 28%)
    (e) Ours: (98% , 97%) HSL: (99% , 68%)
    (f) Ours: (98% , 29%) HSL: (41% , 71%)
    (g) Ours: (100% , 59%) HSL: (73% , 99%)
    the bottom of the image are quite likely to be classified as ground. This can itself lead
    to problems, as in Figure 6.12e where a slight change in intensity mid-way up a stone
    wall misled HSL into extending the ground plane too far.
    In other cases, our algorithm performed better, such as being able to disambiguate the
    two planes in Figure 6.12c, by assigning them different orientations, whereas HSL merged
    them together as a forward-facing plane. This failure of geometric classification to dis-
    ambiguate surfaces suggests that being able to estimate actual orientations is beneficial.
    Also, in Figure 6.12d our algorithm found the whole plane, and gave an orientation class
    matching the ground truth; HSL missed half of the wall and assigned the ‘wrong’ ori-
    entation. It could be argued that the true orientation for Figure 6.12d should not be
    frontal (brown) but right (red). This ambiguity in orientation class caused by arbitrarily
    angled planes is exactly the reason we require fine-grained plane orientation, rather than
    geometric classification.
    On the other hand, HSL clearly out-performed our method in Figure 6.12f by finding
    some of the planes, whereas our method was confused by the multiple small surfaces. The
    use of superpixels in HSL allows it to perceive smaller details than our sweeping approach,
    as well as showing better adherence to strong edges, by cleanly segmenting the edges
    of buildings (even if it does misclassify some). Figure 6.12g is another example where
    directly seeing edges may help, since our algorithm failed to distinguish two orthogonal
    planes, giving a nonsensical orientation for the whole image.
    Despite the differences between the two algorithms – and the fact that HSL is not
    designed specifically to detect planes – they both gave quite similar performance when
    presented with the same data. Given that our algorithm was capable of producing
    better results on our test data, and was superior in a number of cases, this suggests that
    there is benefit in using our method, rather than simply re-purposing HSL for the task.
    Furthermore, as the results have shown, there is a good reason for estimating continuous
    orientation as opposed to discrete classes, since the latter can fail to disambiguate non-
    coplanar structure. The ability to more accurately distinguish orientations could also be
    useful in various applications, as we begin to investigate in the next chapter.
    6.5 Conclusion
    In this chapter we have thoroughly evaluated our plane detector. We began by show-
    ing, through cross-validation on training data, how its performance changed as various
    parameters were altered. These experiments allowed us to select the best parameters
    empirically, before applying it to real test data. We then demonstrated the algorithm
    working on an independent and previously unseen dataset, captured in a different area
    of the city (albeit with some fairly similar structures). The performance on these data
    shows that our algorithm generalises well to new environments.
    More generally, these results show that our initial objective, of developing a method using
    machine learning to perceive structure in a single image, is achievable. We emphasise
    what is being accomplished: the location of planar structures, with estimates of their
    3D orientation with respect to the camera, are being found from only a single image,
    using neither depth nor multi-view information, and without using geometric features
    such as vanishing points or texture distortion, as in previous methods. While the results
    we show here exhibit room for improvement, we believe they conclusively show that such
    a method has promise, and that exploiting learned prior knowledge – inspired by, but
    not necessarily emulating, human vision – is a worthwhile approach.
    Despite the algorithm’s success, the experiments have exposed a few key limitations,
    which we discuss in more detail here. First, due to the way we obtain the initial local
    plane estimates via region sweeping, our method is not able to perceive very small
    regions. While the MRF segmentation does in principle allow it to extract small segments
    (certainly smaller than the 70 pixel radius segments used for the first stage), very small
    planes are generally not detected because of the smoothly varying local plane estimate,
    which comes from sampling the class and orientation estimates from overlapping regions.
    This also means the algorithm is not perceptive of boundaries between regions, except
    when there is a noticeable change in the orientation estimates. Of course, as we discussed
    in Chapter 3, we would not want to rely on edge or boundary information, since this
    may not always be present or reliable, but some awareness of it could be advantageous.
    6.5.1 Saliency
    The comparison with HSL has also highlighted another limitation, albeit one added
to the algorithm deliberately. Because we use only salient points, our plane detection
method does not deal with any regions in which there is no texture. This was done
in order to focus on ‘interesting’ regions, and to avoid wasting computational effort.
However, this means that comparatively blank parts of the image, which may still be
important structures, such as roads, are omitted.
Figure 6.12: It is possible to create a denser local plane estimate, by using a
different set of points than the salient points used to create descriptors. In these
two examples, for the input image (top left) we create the local plane estimate
(top right) at the salient points as normal; we can also do this at a regular grid
(bottom left), or even at every pixel (bottom right), where the colours represent
orientation, as described in Figure 6.13. Grey means non-plane and black is
outside the swept regions. The two images show very different orientations,
and hence their colours are in different parts of the colour map. Note that no
segmentation has been performed yet.
Figure 6.13: The colours which represent different orientation vectors (the point
in this map to which a normal vector from the centre of the image would project
gives the colour it is assigned).
    We could possibly simply increase the density of the points, either by lowering the
    saliency threshold, using a different measure of saliency, or even using a regular grid.
    Such points may not be ideal for creating descriptors, but we can decouple the set of
    points used for image representation and those used to build the local plane estimate.
    Presently the two sets coincide primarily for convenience. It is possible to use one set
    of salient points to build the word histograms and spatiogram descriptors, to describe
    the sweeping regions, while sampling at another set of points to create the local plane
    estimate (a point will lie inside multiple sweep regions, and can be assigned a class
    probability and approximate normal, independently of whether it has associated feature
vectors).
Figure 6.14: When the image of the plane on the right is moved to the left of
the image, it appears to have a different orientation, due to the perspective at
which it is apparently being viewed.
    We experimented with this technique, to create dense local plane estimates over an entire
    image (using every pixel, except for those outside any sweep region) as shown in Figure
    6.12. This gives us a pixel-wise map of estimated orientation across a surface (although
    as with the local plane estimate, it does not show where the actual planes are). In
principle, such a dense map would allow us to extend detection to non-textured regions.
    However, in these regions, there will have been insufficient sweep regions (because they
    are still centred on salient points) to give a reliable and robust estimate, so accuracy may
    suffer. The best combination of salient points and local plane estimate density would
    require further work.
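The decoupling just described can be sketched in a few lines of Python: any query point (pixel, grid point or salient point) lying inside one or more sweep regions is given the median plane probability and median normal of those regions, regardless of whether it has a descriptor of its own. The function and the region representation are illustrative assumptions, not the implemented code.

    import numpy as np

    def local_plane_estimate(query, regions):
        """regions: iterable of (centre, radius, plane_prob, normal) from sweeping."""
        probs, normals = [], []
        q = np.asarray(query, dtype=float)
        for centre, radius, prob, normal in regions:
            if np.linalg.norm(q - np.asarray(centre, dtype=float)) <= radius:
                probs.append(prob)
                normals.append(normal)
        if not probs:
            return None                      # outside all swept regions ('black')
        n = np.median(np.asarray(normals, dtype=float), axis=0)
        return float(np.median(probs)), n / np.linalg.norm(n)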
    6.5.2 Translation Invariance
    In Section 3.3.5 we described how we shift the points before creating spatiogram descrip-
    tors, such that they have zero mean, giving us a translation invariant descriptor. While
    this seems like a desirable characteristic, it has important implications, since in effect
    we are saying that the position of a plane in the image is not relevant to its orientation.
    However, on further consideration this does not seem to be true. If a plane is visible
    in one part of an image, then the exact same pixels in another part of the image would
    imply a different orientation — see for example Figure 6.14, where we have copied the
    right half of the image onto the left half. Even though the two sides are identical, to an
    observer they appear to have different orientations, with the right half appearing to be
more slanted away from the viewer. This is because, as a plane moves across one’s field
of view while keeping the same orientation in the world, it will generally appear more or
less foreshortened.
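This effect can be verified numerically: the slant of a plane relative to the viewing ray through a pixel depends on where in the image the plane appears, even when its world orientation is unchanged. The following sketch is purely illustrative (it is not part of our pipeline), and the focal length, pixel coordinates and normal are arbitrary example values.

    import numpy as np

    def slant_wrt_viewing_ray(pixel, focal_length, normal):
        """Angle (degrees) between the plane normal and the viewing ray at a pixel."""
        u, v = pixel
        ray = np.array([u / focal_length, v / focal_length, 1.0])
        ray /= np.linalg.norm(ray)
        n = np.asarray(normal, dtype=float)
        n /= np.linalg.norm(n)
        cosine = np.clip(abs(np.dot(ray, n)), 0.0, 1.0)
        return np.degrees(np.arccos(cosine))

    wall = np.array([0.3, 0.0, -1.0])                     # a slightly turned, wall-like plane
    print(slant_wrt_viewing_ray((-200, 0), 500.0, wall))  # seen on the left of the image
    print(slant_wrt_viewing_ray(( 200, 0), 500.0, wall))  # same plane on the right: larger slant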
    This has an important implication for our plane detector. Since this effect has not been
    accounted for, we have been implicitly assuming that the planes are sufficiently far from
    the observer for this to not be relevant. Fortunately, since the planes are not generally
    very close to the viewer (where such parallax effects are strongest), we do not envisage it
    would cause a drastic difference. Indeed, in the image shown, despite moving the plane
    to the other side of the image, the orientation change is fairly small. Nevertheless, we
also investigated how much of a difference a translation invariant representation makes
to our results.
    First, we compared the performance of the plane recognition algorithm, when using two
    types of spatiogram. These were the translation invariant version as before (‘zero-mean’),
    and an alternative where we use the original image point coordinates (‘absolute’). In
    the latter, similar regions in different locations are described differently. This could in
    fact be an advantage, as it implicitly uses image location itself as a feature, which Hoiem
    et al. [66] found to be most effective. The experiment was carried out using a large set
    of image regions, harvested from the ground truth images as described in the previous
    chapter, and by running five runs of five-fold cross-validation. The results are shown in
    Table 6.2.
                          Zero-mean      Absolute position
Classification accuracy   83% (0.3%)     79% (1.0%)
Orientation error         23.0° (0.1°)   23.1° (0.3°)
Table 6.2: The difference between using zero-mean (translation invariant) and
absolute position within spatiograms, when running plane recognition.
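The difference between the two variants amounts to whether the region's mean coordinate is subtracted before the spatial statistics are accumulated. The sketch below is a simplified illustration under that assumption (per histogram bin, a spatiogram stores the count plus the spatial mean and covariance of the contributing points); it is not the thesis implementation.

    import numpy as np

    def spatiogram(points, bin_ids, n_bins, zero_mean=True):
        """points: (N, 2) coordinates; bin_ids: histogram bin index per point."""
        pts = np.asarray(points, dtype=float)
        ids = np.asarray(bin_ids)
        if zero_mean:
            pts = pts - pts.mean(axis=0)      # translation invariant variant
        descriptor = []
        for b in range(n_bins):
            members = pts[ids == b]
            count = len(members)
            mean = members.mean(axis=0) if count else np.zeros(2)
            cov = np.cov(members.T) if count > 1 else np.zeros((2, 2))
            descriptor.append((count, mean, cov))
        return descriptor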
    Interestingly, the results using the absolute image coordinates were slightly worse, al-
    though there was no drastic change. We speculate that competing factors were at work.
    Non-invariant descriptors may give a better representation of image regions according
    to their orientation, but at the cost of making similar appearance in different places
    seem more different than it should be. We hypothesise that this would mean that more
    training data are required to achieve the same goal, since there will be fewer potential
    matches for a region at a given image location, though further experiments would be
    needed to confirm this. This is more likely to be an issue for plane classification, since
    unlike orientation estimation, image location should be irrelevant, which partly explains
    the larger observed difference for this measure.
Despite our concerns, the experiment appears to suggest there is no benefit in adding
    spatial location information (at least in the way we have done so), so the model of plane
    recognition we have used is not invalidated. Nevertheless, so far we have considered only
    plane recognition performance, which was sufficient to determine whether the two means
    of description behave similarly, but may miss some important effects. To address this,
    we conducted a further experiment to visualise what the change in representation means
    for plane detection. This was done by training a new plane detector, using the same
    images as before but using the absolute position spatiograms, and applying both this
    and the original to the artificial split plane from Figure 6.14.
    (a) Local plane estimate (subset shown for clarity)
    (b) Local plane estimate, orientation illustrated by
    colour
    (c) Plane detection result
    Figure 6.15: An example of using the zero-mean (left) and absolute position
    (right) spatiograms for plane detection. The first row shows the local plane esti-
    mate, which is more clearly seen when represented as colours (b). The estimated
    orientations are different when using absolute position, for identical image con-
    tent, reflecting how it is perceived. Given the right mean shift bandwidth, this
    means the two planes can actually be separated, when using MRF segmentation,
    unlike when using our original translation invariant representation (c).
    The results are shown in Figure 6.15, for the original zero-mean representation on the
    left and the absolute position version on the right. The first row shows the local plane
    estimates (LPE). Since this is difficult to interpret with so small an orientation change,
    we show the LPEs using coloured points in the second row, where the colours correspond
    to orientation as before (Figure 6.13). This indicates that the orientations for the two
    halves, in the zero-mean parameterisation, are basically identical. This is as expected,
    since there is nothing to distinguish between the two halves of the image. When using
    the absolute positions, on the other hand, the colours are different for the two halves
    of the image. The effect is rather subtle, and may not be different enough to give
    ‘correct’ orientation for either side, but is sufficient to show that by incorporating image
    location into the representation we can get different orientation values depending on a
    plane’s location within the image. In fact, as the bottom row shows, this is enough of
    a difference to be able to split the surface into two planes during MRF segmentation,
    whereas the original zero-mean representation does not have enough difference between
    halves. To do this it was necessary to lower the mean shift bandwidth (to 0.05, for both
    versions), but crucially the halves cannot be split with any bandwidth using the original
    parameterisation.
    To conclude, while the zero-mean spatiogram representation may not be able to faithfully
    represent differences in appearance caused by changes in orientation as we intended, the
alternative parameterisation using absolute position does not perform better. However,
    since we have shown that an awareness of image location is able to tell the difference
    between exactly the same image patterns when presented in different places – taking
    advantage of location context – this would be another useful avenue of future work,
    potentially able to increase the algorithm’s ability to discriminate between planes and
    predict more accurately their orientation.
    6.5.3 Future Work
We defer an in-depth discussion of future work and applications, involving more drastic
    changes to our method, to Chapter 8, but briefly mention some improvements that could
    be made to the algorithm as it is.
There are two main reasons why our detector may perform poorly: errors in the local
plane estimate (LPE), which in turn cause the segmentation to fail (see for example
Figure 6.5f); or the MRF segmentation being unable to correctly extract planes from a
(potentially complicated) LPE. We consider these two issues in turn in
    suggesting future work.
    An inferior LPE is primarily caused by incorrect classification or regression. This means
    that any method to accumulate the information at salient points will have difficulty.
    We have gone some way to deal with erroneous classification when forming the LPE
    by using a robust estimator (the median) to calculate plane probability and orientation
    at salient points. We could further develop this by using more sophisticated robust
    statistics, such as the Tukey biweight or other M-estimators [90]. Even assuming all
    plane recognitions are correct, the problem remains that the LPE will be very smooth,
    making small regions and fast changes invisible. One possible way forward would be
    to derive separate estimates of the reliability of regions, before incorporating them into
the LPE. We might for example be able to classify whether a region is likely to lie
entirely on a single surface, or straddle a surface boundary, and use this to
    pre-filter what goes into building the LPE.
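As a sketch of the M-estimator suggestion above, the Tukey biweight (with its conventional tuning constant c = 4.685) could be used within an iteratively reweighted scheme to combine noisy per-region estimates, down-weighting gross outliers entirely. This is a possible future direction, not the implemented method; all names are illustrative.

    import numpy as np

    def tukey_weights(residuals, c=4.685):
        """Tukey biweight weights; residuals are assumed already scaled."""
        r = np.abs(np.asarray(residuals, dtype=float)) / c
        w = (1.0 - r ** 2) ** 2
        w[r > 1.0] = 0.0                     # outliers receive zero weight
        return w

    def robust_location(x, iterations=10):
        """Iteratively reweighted robust mean of scalar estimates."""
        x = np.asarray(x, dtype=float)
        estimate = np.median(x)              # start from the current (median) choice
        for _ in range(iterations):
            residuals = x - estimate
            scale = 1.4826 * np.median(np.abs(residuals)) + 1e-9   # MAD scale estimate
            w = tukey_weights(residuals / scale)
            estimate = np.sum(w * x) / (np.sum(w) + 1e-9)
        return estimate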
    Next we consider methods to improve the segmentation of a given LPE, by using a more
    sophisticated method of segmenting the MRF. So far we have only used the iterative
    conditional modes algorithm to optimise the configuration over the field, chosen for its
    computational simplicity, though other more complex methods may give us results closer
    to what we desire. It would also be worthwhile to investigate the use of other energy
    functions in the MRF, either by weighting the contribution differently for the single
    and pair site clique potentials (with values learned from training data), or by including
    higher-order cliques to model long-range dependencies.
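For reference, the iterated conditional modes optimisation mentioned above can be expressed very compactly; the sketch below uses a simple Potts-style pairwise term weighted by a parameter beta against per-node unary costs. The data structures and the weighting are illustrative assumptions, not the values used in our implementation.

    import numpy as np

    def icm(unary, neighbours, beta=1.0, iterations=10):
        """unary: (n_nodes, n_labels) cost array; neighbours: adjacency lists."""
        n_nodes, n_labels = unary.shape
        labels = np.argmin(unary, axis=1)        # initialise from the unary term alone
        for _ in range(iterations):
            changed = False
            for i in range(n_nodes):
                costs = unary[i].copy()
                for j in neighbours[i]:
                    # Pairwise clique potential: penalise disagreement with neighbours.
                    costs += beta * (np.arange(n_labels) != labels[j])
                best = int(np.argmin(costs))
                if best != labels[i]:
                    labels[i], changed = best, True
            if not changed:
                break
        return labels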
    Additionally, motivated by the under-segmentation and leaking of planes seen above,
    we could attempt to incorporate edge information into our segmentation. As it stands,
    segmentation in the MRF uses only the local plane estimates at each point, making
    boundaries between planes hard to perceive, whereas if we can incorporate information
    about edges it may improve plane segmentation. This could either be edge information
    from the image itself, using an edge detector or gradient discontinuity information; or
    derived from further classification, in which we attempt to classify whether pairs of points
    or regions belong to the same or different planes. This could potentially be incorporated
    into the same probabilistic framework, by using a MRF to deal with nodes representing
    both the points and boundaries between regions [78]. Similar use of edge information
    was shown to be beneficial by [67].
    Given the strengths and weaknesses of our plane detector as compared to the work of
    Hoiem et al. [67] (HSL), as illustrated by our results, it would be worthwhile to consider
    combining them to create a hybrid system. It is unclear how our sweeping could be used
    with their multiple-segmentation approach. On the other hand, it may be useful to use
    their algorithm to find coarse structure such as the ground plane – at which it excels,
    especially in texture-free regions – and use this to guide our segmentation. One might
    even consider using our plane recognition algorithm on groups of superpixels, once HSL
    has joined them into putative clusters, rather than relying on geometric classification for
    the final scene layout.
    CHAPTER 7
    Application to Visual Odometry
    In this chapter, we demonstrate the use of our plane detection algorithm in a real-world
    application, showing that it can be of practical use. We focus on the task of real-time
    visual mapping, where we integrate the plane detector into an existing visual odometry
    system which uses planes in order to quickly recover the structure of a scene. This work
was originally published in collaboration with José Martínez-Carranza [59].
    7.1 Introduction
    In vision based mapping, whether for visual odometry (VO) or simultaneous localisation
    and mapping (SLAM), early and fast instantiation of 3D features improves performance,
    by increasing robustness and stability of pose tracking [19]. For example, careful selection
    of feature combinations for initialisation can yield faster convergence of 3D estimates and
    hence better mapping and localisation, as described in [61].
    A powerful way of improving the initialisation of features is to make use of higher-
    level structures, such as lines and planes, which has shown promise in terms of aiding
    intelligent feature initialisation and measurement, thus increasing accuracy [47]. An
    important benefit of structure-based priors is that they can be used to quickly build
    a more comprehensive map representation, which can be much more useful for human
    understanding, interaction and augmentation [16]. This is in contrast to point clouds,
    which are difficult to interpret without further post-processing.
    This chapter introduces a new approach to speeding up map building, motivated by the
    observation that if knowledge of objects or structural primitives is available, and a means
    of detecting their presence from a single video frame, then it provides a quick way of
    deriving strong priors for constructing the relevant portions of the map. At one extreme
    this could involve instantaneous insertion of known 3D objects, derived, for example,
from scene-specific CAD models [71, 104], or navigation with a rough prior map
[102]. However, these limit mapping to previously known scenes, and require the effort of
    creating the models or maps. Rather, our interest is in the more general middle ground,
    to consider whether knowledge of the appearance and geometry of generic primitive
    classes can allow fast derivation of strong priors for directing feature initialisation.
    We investigate this by using the algorithm we developed in the previous chapters, to fo-
    cus specifically on map building with planar structure. This is an area which has received
    considerable attention due to the ubiquity of planes in urban and indoor environments.
    Previous approaches to exploiting planar structure in maps have included measuring lo-
    cally planar patches [95], fitting planes to point clouds [47], and growing planes alongside
    points [88, 89]. These methods are generally handicapped by having to allow sufficient
    parallax (and hence time) for detecting planes in 3D, either by waiting for sufficient 3D
    information to become available before fitting models, or by simultaneously estimating
    planar structure while building the map. This is due to the fact that such methods are
    not able to observe higher-level structures directly, and rely on being able to infer them
    from measurements of the geometry of simpler (i.e. point) features.
    This suggests that planar mapping would benefit from a method of obtaining structure
    information more directly, and independently of the 3D mapping component. Such infor-
    mation would be able to inform the map building and act as a prior on the location and
    orientation of planes, and as a guide for the initialisation of planar structure even before
    it becomes fully observable in a geometric sense. Indeed, this was nicely demonstrated
    in the work of Castle et al. [14], in which specific planar objects with known geometry
    were detected and inserted in the map, to quickly build a rich map representation; and
    also by Flint et al. [40], who use the regular Manhattan-like structure of indoor scenes
    to quickly build 3D models.
    Our work further develops such ideas, in order to derive strong priors for the location of
    planar structure in general, without reference to specific planes, or being overly restricted
    by the type of environment. This is achieved by combining our plane detection algorithm
    with an extended Kalman filter (EKF) visual odometry (VO) system, by modifying the
plane growing method developed by Martínez-Carranza and Calway [89]. Monocular
    visual odometry, and the closely related problem of simultaneous localisation and map-
    ping (SLAM), are interesting areas on which to focus since we believe they have the
    most to gain from single-image perception, since depth, an important property of any
    visual feature, is not directly observable from a monocular camera. A single image, as
    we discussed in Chapter 1, is rich enough for a human to immediately get a sense of
    scene structure; and yet currently much of this information is discarded, to focus only
    on point features.
    In the next section we discuss related work in the field of plane-based SLAM and VO,
    followed by an overview of our hybrid detection-based method. This is followed by
    a description of the baseline VO system we use in Section 7.3, and a more in-depth
    discussion of how the two methods are brought together to form our combined plane
    detection–visual odometry (PDVO) system in Section 7.4. Our results are presented
    in Section 7.5, which show that the approach is capable of incorporating larger planar
    structures into the map and at a faster rate than previously reported in [89] – averaging
    around 60 fps – while still giving good pose trajectory estimates. This demonstrates the
    potential of the approach both for the specific case of planar mapping, and more generally
    the plausibility of using single image perception to introduce priors for map building.
    Section 7.6 concludes, with a summary and some ideas for future work, including a
    discussion of how our PDVO system might be extended to allow the detector to be
    informed by the 3D map, with the potential of ultimately learning about structure from
    the environment directly.
    7.1.1 Related Work
    The use of planes in monocular SLAM/VO has a long history, motivated by the ubiquity
    of planar structure in human-made scenes. Planes are useful for mapping in a variety
    of ways, from being a convenient assumption during measurement, to being an integral
    part of an efficient state parameterisation.
    Amongst the earliest to use planar features in visual SLAM were Molton et al. [95], who
    use locally planar patches rather than points as the basic feature representation, within
    an EKF framework. Salient points can usually be considered locally planar, with some
    orientation. After estimating this orientation, the image patch is warped in order to
    account for distortion due to change in view, to allow better matching and tracking of
    features by predicting their appearance. This results in maps consisting of many plane-
    patch features, which as well as improving the localisation ability of the mobile camera,
    give an improved interpretability of the resulting 3D map. However, the planar patches
    remain independent, and are not joined together into higher-level, continuous surfaces.
    A similar approach was developed by Pietzsch [103], in which the parameters of the
    planar patches are included into the SLAM state (again using an EKF), rather than
    being estimated separately. This work shows that even a single planar feature is sufficient
to accurately localise a moving camera, something which would otherwise require very many point
    correspondences. Planes are measured using image alignment, which allows accurate
    measurements to be made. However, including all pixels in the SLAM state incurs a
    significant penalty in computational complexity, since updating the EKF is quadratic
    in the number of features and cubic in the number of measurements (due to inversion
    of the innovation matrix). Furthermore, no mechanism is described for the detection of
    such planar features, relying instead upon manual initialisation.
    More extensive use of image alignment for planar surfaces is made by Silveira et al. [117],
    in which the whole SLAM problem is treated as optimisation, not only over camera and
    scene parameters but also surface properties and illumination. The assumption that a
    scene can be well approximated by a collection of planar surfaces can even apply to large
    scale outdoor scenes, to the extent that this method is capable of localising a camera
while mapping a large, complex outdoor scene. Since the trajectories shown do not close
any loops, it is difficult to evaluate the global accuracy of the method.
    The above methods effectively use planar structures to either improve the appearance
    of a map or to make better use of visual features; but planes can also help reduce the
    complexity of the map representation. If many points lie on the same plane, they can
    all be represented with a more compact representation (essentially by exploiting the
    correlations between their states). A good example is by Gee et al. [47] who use planes
    to collapse the state space in EKF SLAM. Planes are detected from a 3D map built
    using regular point-based mapping, by applying RANSAC to the point cloud in order to
    find coplanar collections in a manner similar to Bartoli [5]. Once such planes have been
    found, they are inserted into the EKF to replace the point features, so that whole sections
    of the map are represented by their relationship to the plane. This effectively achieves
    a reduction in state size, while maintaining a full map, at the same time as introducing
    higher-level structures which may be used for augmentation [16]. Efficiency in terms
    of the state size is important in EKF SLAM since the filter updates are quadratic in
    the size of the state, therefore a reduction in state size has a big impact on increasing
    computational efficiency, or allows larger environments to be mapped.
    The disadvantage of [47] is that it requires a 3D point cloud in order to find the planes,
    which must already have converged sufficiently (i.e. 3D points are well localised). This
    means initialisation of planar surfaces can take some time, especially in larger environ-
ments. Thus while planar surfaces can be most useful once they have
been detected, the initial mapping itself derives no benefit from the planarity of the
    scene. Furthermore, since the primary aim is to reduce the state size, it does not nec-
    essarily follow that detected planes will correspond to true planes in the world. It is
    possible for coplanar configurations of points within the cloud to be mistaken for planes,
    especially in complex and cluttered scenes. This makes no difference to the state reduc-
    tion ability, but means interpreting the features as belonging to true planes in the world
    is problematic.
    An alternative method of finding planes while performing visual SLAM, without relying
on converged coplanar points, was developed by Martínez-Carranza and Calway [86],
    based on the appearance of images in regions hypothesised to belong to planes. The
    basis of the method is to use triplets of 3D points visible in the current image (obtained
    by a Delaunay triangulation of the visible points) and test whether they might form a
    plane, by determining if pixels inside the triangle obey a planar constraint across multiple
    views. Adherence to this constraint disambiguates planar surfaces from other triplets of
    points. Crucially, the method takes into account the uncertainty estimate maintained by
the EKF of the location of the camera and the 3D points, using a χ² test to determine
    whether there is sufficient evidence to reject the planar hypothesis; this is effectively
    a variable threshold on the sensitivity of the planar constraint according to the filter
    uncertainty. The method was shown to be successful in single-camera real-time SLAM,
    and was used for making adaptive measurements of points on known planes to cope with
    occlusion and enable tracking without increasing the state size [87].
    Building on the above ideas is the inverse depth planar parameterisation (IDPP) [88],
    which adapts the inverse depth representation [19] to planar features. Planes are detected
    as the map is built, allowing planes to be initialised and represented compactly within
    the filter, while their parameters (represented as an inverse depth and orientation with
    respect to a reference camera) are optimised, based on a number of points estimated to
    belong to the plane. This was shown to be successful in a variety of settings, and easily
    adapted for visual odometry [89]. We use this method for our plane detection enhanced
    visual odometry application, and give further details in Section 7.3.
    The above are good examples of how, by exploiting the existence of planar structures,
    SLAM and VO can be enhanced. These concepts are taken further by Castle et al.
    [14], who use the learned appearance of individual planar objects within an EKF SLAM
    framework. This involves the recognition, in real-time, of a set of planar objects with
    known appearance, thereby incorporating some very specific knowledge. These are in-
    serted into the map once they are detected, using SIFT [81] descriptors to match to
    the object prototype, and localised, by using corner points of the planar surfaces as
    standard EKF map features. Placing the recognised objects into the map immediately
    gives a more interpretable structure. More importantly, by using the known dimensions
    of the learned objects, the absolute metric scale of the mapped scene can be recovered,
    something which cannot be done with pure monocular SLAM. Other than the ease of
    tracking objects in the map, it does not depend upon them being planar, and has been
    extended to non-planar objects [21].
    The main disadvantage to this is that it requires a pre-built database of the objects
    of interest. Learning consists of recording SIFT features relating to each image, along
    with its geometry and the image itself, in a database, which precludes automatic online
    learning, and constrains the method to operate only within previously explored locations,
    rather than being able to learn planar features in general or extracting them from new
    environments. Nevertheless, the idea of recognising planes visually and inserting them
    into the map is most appealing, as we discuss below.
    An interesting application of higher level structures is demonstrated by Wangsiripitak
    and Murray [132]. They reason about the visibility of point features, within a SLAM
    system based on the parallel tracking and mapping (PTAM) of [72], and use this infor-
    mation to more judiciously decide which points to measure. Usually a point-based map
    is implicitly assumed to be transparent, so in principle points can be observed from any
    pose, but in reality they will often be occluded by objects. To address this they use a
    modified version of the plane recognition enhanced SLAM of [14] to recognise 3D objects
    in the map, as well as automatically detecting planar structures from the point cloud.
    These structures are used to determine the visibility of points, which is useful because
    if a point is behind an occluding surface with respect to the camera, it should not be
    measured. This increases the duration of successful tracking, by avoiding the risk of
    erroneous matches and focusing on those points which are currently visible.
    Finally, the work of Flint et al. [40] is interesting in that it explores the use of seman-
    tic geometric labels of planar surfaces within a SLAM system, in order to create more
    comprehensible maps, based on the fact that many indoor environments obey the Man-
    hattan world assumption. It also uses PTAM [72], where only a sparse point cloud is
    initially available, but this is supplemented by using a vanishing-line method similar
to Košecká and Zhang [73], taking the lines detected in a scene and grouping them
    according to which of the three vanishing directions they belong, if any. Because the
    vanishing directions (in a world coordinate frame) remain fixed, they may be estimated
    from multiple images, giving a much more robust and stable calculation. Once these
are known, surfaces are labelled according to their orientation, introducing semantic labels
    based on context (i.e. floor, wall, ceiling). The result is that quite accurate schematic
    representations of 3D scenes, comprising the main environmental features, can be built
    automatically, without relying on fitting planes directly to the point cloud. While the
    dependence on multiple frames for the optimisation means this is fundamentally quite
    different from our work, it is interesting to note how this assumption of orthogonal planes
    is very powerful in creating clean and semantically meaningful descriptions, for simple
    indoor scenes; though this is of course limited to places where at least two sets of surfaces
    are mutually orthogonal.
    7.2 Overview
    We now proceed to give an overview of the plane detection–visual odometry (PDVO)
    system we developed. This uses our detector to find planes and estimate their orienta-
    tion, given only a single image from a moving monocular camera navigating an outdoor
    environment. The mapping algorithm, which attempts to recover the camera trajectory
    in real time by tracking points and planes, is based on the IDPP VO system [88]. Ordi-
    narily this maps planes in the scene by growing them from initial seed points, but because
the location of planes is not known a priori, it is forced to attempt plane initialisation
    at many image locations, and to grow them slowly and conservatively.
    The novelty comes from the fact that using single image plane detection helps the VO
    system to initialise planes quickly and reliably. By introducing planes using only one
    frame, with an initial estimate of their image extent and 3D orientation, we can make
    plane growing faster and more reliable. The result is that fewer planes are created,
    covering more of the scene, while being restricted only to those areas in which planes
    are likely to occur, rather than attempted at every possible location. This means that,
    unlike previous approaches, we need not wait for 3D data or multi-view constraints to
    be available, but can directly use the information available in a single frame to guide
    the initialisation and optimisation of scene features. Therefore plane detection is driv-
    ing the mapping, rather than feeding off its results. This process allows planes to be
    added quickly, but later updated as multi-view information becomes available, so that 3D
    mapping can benefit from prior information but is not corrupted by incorrect estimates.
    We also show that when these planes are added to the map, we can quickly build a rough
    plane-based model of the scene, by leaving the detected and updated planes – which tend
to cover a reasonable amount of the scene – in the map once they are no longer in view.
    This results in a more visually appealing reconstruction, more easily interpretable by
    humans.
    In the next sections, we explain how the baseline VO system works, how we adapted
    our plane detector to use with it, and how they are combined. The results we show in
    Section 7.5 indicate that this is a promising approach, and while our evaluation is not
    sufficient to show that it is necessarily more accurate than existing methods, our aim is
    to demonstrate that using learned generic prior information is a valid and useful way of
    building fast structure-driven maps in a real-time setting.
    7.3 Visual Odometry System
    This section describes the visual odometry system we used, which is based on the inverse
    depth plane parameterisation (IDPP) [88]. This uses an extended Kalman filter (EKF)
    SLAM engine ultimately based on [19, 29]. The distinction which makes this visual
    odometry (VO), rather than full SLAM, is that features are removed from the state once
    they are no longer observed, which means the estimation can progress indefinitely as
    new environments are explored, at the cost of losing global consistency.
    7.3.1 Unified Parameterisation
    An important aspect of IDPP is that two different types of feature are used, namely
    points and planes. All features are mapped using an inverse depth formulation [19], in
    which depth is represented by its inverse, allowing a well behaved distribution over depths
    even for points effectively at infinity. This representation is extended to planar features,
    in a unified framework, which allows both points and planes to be encoded in the same
    efficient way. A general planar feature is described by its state vector m i = [ r i , ω i , n i i ],
    which represents the position and orientation (using exponential map representation) of
    the camera, normal vector of the plane, and inverse depth respectively, giving a total of 10
    dimensions. To represent points, the normal n i is omitted, leaving a 7D parameterisation.
    This is further reduced to simply a 3D point for features which have converged to a good
    estimate of depth.
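The dimensionality bookkeeping can be made explicit with a small sketch: a planar feature stacks the reference camera position, its exponential-map orientation, the plane normal and the inverse depth (3 + 3 + 3 + 1 = 10), while a point feature omits the normal (7). The ordering and the symbol ρ for inverse depth are assumptions made for illustration.

    import numpy as np

    def plane_feature(r, omega, n, rho):
        """r, omega, n: length-3 arrays; rho: scalar inverse depth -> 10D vector."""
        return np.concatenate([r, omega, n, [rho]])      # 3 + 3 + 3 + 1 = 10

    def point_feature(r, omega, rho):
        """As above, with the normal omitted -> 7D vector."""
        return np.concatenate([r, omega, [rho]])         # 3 + 3 + 1 = 7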
    Planes are defined as collections of observed point features which satisfy a planarity con-
    straint, and are measured and updated by projecting these points into the current image,
    using the recovered orientation estimate. Measurements are made using normalised cross
    correlation of image patches, and like Molton et al. [95] patches can be warped according
    to their planar orientation in order to improve matching. Using patch correlation is gen-
    erally faster than using descriptor based methods [15]. Moreover, while such descriptors
    are powerful due to being scale, rotation and view invariant, this means they will match
    at a greater range of possible orientations. Cross correlation, on the other hand, will
    only find a match with the correct warping. This means patches not actually on the
    plane will fail to be matched, and thus discarded.
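For completeness, the normalised cross-correlation score used for patch matching of this kind can be written as below; only the scoring function is sketched, and the warping of the stored patch is assumed to have been applied beforehand.

    import numpy as np

    def ncc(patch_a, patch_b):
        """Normalised cross-correlation of two equal-sized patches, in [-1, 1]."""
        a = np.asarray(patch_a, dtype=float).ravel()
        b = np.asarray(patch_b, dtype=float).ravel()
        a -= a.mean()
        b -= b.mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom > 0 else 0.0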
    Further efficiency gains are achieved by sharing reference cameras between features (be
    they points or planes) initialised in the same frame. This means that even though the
    features have more dimensions than the 6D features of inverse depth points, the overall
    state size is reduced both by sharing of reference cameras, and because coplanar points
    are represented as part of the same planar feature. This leads to a large reduction in
    state size when the scene is dominated by planes, and hence more efficient mapping, or
    the ability to maintain a map of larger areas.
    7.3.2 Keyframes
    An important feature of IDPP is the use of keyframes. These are images retained when
    new features are initialised, relating the initial camera view to the feature state, and
    to which new observations must be matched for measurement. The pose of the camera
    when the keyframe is taken is stored as a reference camera, associated with the image;
    this is very convenient for plane detection, as will become apparent. Note that this use
    of keyframes is very different from keyframe-based bundle adjustment methods (such
    as [72, 119]), where points are tracked between frames to localise the camera but only
    measurements on keyframes themselves are used to update the map; whereas since IDPP
    is based on a filter, all measurements in all frames are used to update the state recursively.
    7.3.3 Undelayed Initialisation
    An important difference compared to planar SLAM methods such as [47] is the undelayed
    initialisation of planar features. As soon as a point is observed, being a candidate for a
    new plane, it is added to the state and estimation of its location proceeds immediately
    (this can happen even though the initial depth is unknown due to the inverse depth
    representation). Immediately after initialisation, this initial ‘seed’ point can be grown
    into a plane, by finding nearby salient points and testing whether they belong to a
    plane with the same possible orientation. By this method, the algorithm simultaneously
    estimates which points are part of the plane, while updating its orientation.
    In more detail, plane growing starts with one initial seed point on the keyframe, and
    assuming this is successfully matched, new points in the keyframe, within some maximum
    distance of the original and chosen randomly, are added to the plane. If these are also
    matched, a plane normal can be calculated from them, along with an estimate of its
    uncertainty (done as part of the filter update, since the normal is part of the state). The
    process continues, adding more points in the vicinity of the previous points, expanding in
    a tree structure — but only from those points whose measurements show they are indeed
    part of the same plane. Points which are not compatible are discarded, and it is this
    which allows plane growing to fill out areas of the image whose geometry implies a plane
    is present. Using such a geometric constraint does mean that planes will also be detected
    from coincidentally coplanar points (assuming the patches can still be matched), or where
    the points are so far away that no parallax is observed. Nevertheless, while these are
    not actually planar, they are still beneficial in making measurements and collapsing the
    state space, until an increase in error forces points to be discarded.
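The growing loop can be summarised by the following sketch; the consistency predicate, the salient-point list and the parameter values are placeholders standing in for the corresponding IDPP machinery (in the real system the test is a filter measurement, not a simple callback).

```python
import math
import random


def grow_plane(seed, salient_points, measurement_consistent,
               max_dist=12.0, points_per_step=3, max_steps=50):
    """Expand a plane outward from a seed point in a tree-like fashion.

    seed                   : (x, y) seed position in the keyframe
    salient_points         : list of (x, y) candidate positions
    measurement_consistent : callable((x, y)) -> bool; True if the point's
                             measurement agrees with the current plane estimate
    """
    plane = [seed]
    frontier = [seed]
    remaining = [p for p in salient_points if p != seed]

    for _ in range(max_steps):
        if not frontier or not remaining:
            break
        new_frontier = []
        for fp in frontier:
            # candidates within max_dist of a point already on the plane, chosen randomly
            near = [p for p in remaining
                    if math.hypot(p[0] - fp[0], p[1] - fp[1]) <= max_dist]
            random.shuffle(near)
            for cand in near[:points_per_step]:
                remaining.remove(cand)
                if measurement_consistent(cand):
                    plane.append(cand)          # accepted: expand from here next
                    new_frontier.append(cand)
                # otherwise the candidate is simply discarded
        frontier = new_frontier
    return plane
```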
    It should be emphasised that while planar feature initialisation is undelayed in terms
    of the filtering, it is not the entire planar structure which is instantaneously available.
    Rather, ‘undelayed’ refers to the fact that seed points can be updated and grown as
    soon as they are added to the map. Planes take time to grow, and must be augmented
    cautiously lest incorrect points be added, leading to an incorrect estimate of orientation,
which can be hard to recover from. Such errors are liable to occur, and it is a delicate matter to allow planes to grow fast enough to be useful, while they are still visible, without introducing erroneous measurements into the filter.
    7.3.4 Robust Estimation
    Careful feature parameterisation and view-dependent warping, as discussed above, both
    help to improve the efficiency and accuracy of the algorithm, but it would quickly be-
    come over-confident and inconsistent if not for some robust outlier rejection. One way
    this is done is by using the one-point RANSAC algorithm, first described in [114] and
    implemented for visual odometry by Civera et al. [20]. RANSAC (RAndom SAmple
    Consensus) [39] is a hypothesise-and-verify framework for robust model fitting and out-
    lier rejection, where a random, minimal subset of measurements is taken, and used to
    hypothesise a possible model for all the data points (for example, four point correspon-
    dences in a pair of images to generate a homography). This is scored by the number
    of inlier measurements, i.e. how consistent all the data are with this model. The most
    consistent model is chosen, and only the inliers are used to calculate the final model.
    The number of samples required for RANSAC to reliably find the correct model is a
    function of the number of expected inliers and the number of measurements required to
    form a minimal hypothesis. The key to one-point RANSAC is to reduce the number of
    measurements needed as far as possible, to using only one, and to use prior information
    about the scenario to complete the model. Therefore while a single measurement cannot
    generate an instance of the model, it is sufficient to choose from a one-parameter family.
    In IDPP, one-point RANSAC is used to test possible camera poses, to ensure that no
    outlier measurements corrupt the current estimate. Instead of using the five point corre-
    spondences typically needed to generate a camera pose, one-point RANSAC allows the
    current state and its covariance to limit the range of most likely poses, meaning that
    only a single point measurement is sufficient to hypothesise a new camera. Because the
    minimal set is so much smaller, in order to get a high probability of finding the correct
    pose only seven measurements are sufficient on average, making this significantly faster
    than full RANSAC.
    One-point RANSAC proceeds by selecting each point in turn, from the randomly chosen
    set, and using it to perform a partial update on the EKF (to update the state but not the
    full covariance). Then the difference between the updated points and their measurements
    (innovation) is used to find which are the inlier points, since after the update those
    which have a high innovation are not consistent, and so are likely to be outliers. Once
    the hypothesis with the highest number of low-innovation inlier measurements is found,
    these are used to perform a full state update. However, it is possible that some of the
points rejected as outliers are actually correct (such as those recently initialised and not yet converged), so the final step is to rescue high-innovation inliers, by finding
    those points which still are consistent with the model, even if they are less certain.
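To illustrate the principle on a deliberately simplified problem, the toy sketch below estimates a single scalar quantity from measurements containing gross outliers, where one measurement is enough to hypothesise the model; the EKF partial update and the covariance-based innovation gate of the real system are replaced here by a plain distance threshold.

```python
import random


def one_point_ransac(measurements, n_hypotheses=7, inlier_threshold=1.0):
    """Toy one-point RANSAC: each hypothesis comes from a single measurement.

    measurements : non-empty list of scalar observations of the same quantity,
                   some of which may be gross outliers.
    Returns (estimate refined from the inliers, the inlier set).
    """
    best_inliers = []
    for m in random.sample(measurements, min(n_hypotheses, len(measurements))):
        hypothesis = m                      # one measurement defines the model
        inliers = [x for x in measurements if abs(x - hypothesis) < inlier_threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # 'full update': refine the estimate using only the low-innovation inliers
    return sum(best_inliers) / len(best_inliers), best_inliers


if __name__ == "__main__":
    data = [2.1, 1.9, 2.0, 2.2, 9.0, 1.8, -5.0, 2.05]   # two gross outliers
    print(one_point_ransac(data))
```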
    However, as powerful as it is, one-point RANSAC is not sufficient to make the algorithm
    fully robust. A further level of checking is added, in the form of a 3D consistency test (for
    full details see [85]), allowing the extra information available from the 3D map to be used.
    This is useful because often 2D points will seem to be consistent with their individual
    covariance bounds, whereas they do not lie on the correct plane in 3D. The covariance
    of the planar features in 3D is propagated to individual planar points, meaning points
    not actually conforming to the plane can be removed. The 3D consistency check is more
    selective, but more time consuming, than the 2D consistency check, so the 2D check is
    run first for all points and the 3D check run afterwards to remove any remaining outliers.
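As an illustration of the kind of gating involved (the exact formulation in [85] may differ), a point can be tested against the plane using the Mahalanobis distance of its point-to-plane error, with the variance propagated from the point's 3D covariance:

```python
import numpy as np


def consistent_with_plane(x, cov_x, n, p0, chi2_gate=3.84):
    """Gate 3D point x (covariance cov_x) against the plane with unit normal n
    through p0, at roughly the 95% chi-square level (1 degree of freedom)."""
    n = n / np.linalg.norm(n)
    e = float(n @ (x - p0))          # signed point-to-plane distance
    var = float(n @ cov_x @ n)       # variance of that distance
    return (e * e) / max(var, 1e-12) <= chi2_gate
```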
    7.3.5 Parameters
    Amongst the parameters which control the operation of the plane growing algorithm is
    the minimum distance between planar points, which determines from how far away new
    points are added to the plane, measured in the keyframe (set to a value of 12 pixels).
    A related parameter is the number of new points which are added to a plane at each
    step (set to 3). Together these two parameters control the speed at which planes grow.
    Finally, there is the maximum number of measurements (of either feature type) which
    can be made in one frame, which increases map density to the detriment of frame rate
    (set to 200).
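For reference, these values can be gathered into a single configuration structure (the names below are ours, chosen for readability):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaneGrowingConfig:
    min_point_spacing_px: int = 12         # minimum distance between planar points in the keyframe
    points_per_step: int = 3               # new points added to a plane at each step
    max_measurements_per_frame: int = 200  # cap on measurements (of either feature type) per frame
```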
    7.3.6 Characteristics and Behaviour of IDPP
This plane parameterisation was shown to be successful in detecting planes in cluttered
    environments, such as an office. Moreover, due to the ability to quickly add relevant
    measurements to a plane as they become available, it is able to deal smoothly with
    occlusion, so that while parts of a plane are occluded, other parts are measured. The
    method was even shown to work in reasonably featureless indoor environments, by virtue
    of picking up many more features than is possible with point-based SLAM.
    The IDPP system, when being used as visual odometry, performs well in outdoor en-
    vironments, and achieves comparable performance to [20] on the Rawseeds dataset 1 ,
    compared to ground truth GPS data, while running at a significantly faster frame-rate
    [89]. Visual odometry discards the built map, so it can run indefinitely with constant pro-
    cessing time; but even so, its accuracy is sufficient to almost close large loops (although
    some scale drift is inevitable).
    However, despite the benefits of the algorithm outlined above, there are a number of
    disadvantages which must be considered. First, although initialisation is undelayed and
    planar structures can be mapped immediately, it still takes time to build these up,
    and gather sufficient measurements to determine that a plane does indeed exist. Often
    collections of points can be measured as planes for some time before realising this is not
    actually the case. At worst this risks introducing erroneous measurements, and at best
    slows down execution as many false planes are attempted and discarded.
    Successful plane growing also relies on having sufficient parallax, so if the camera is
    stationary or performing pure rotational motion, there will be no way to recover planes
    given any number of frames. Consequently, the accuracy of the resulting planes will
    be a function of the distance the camera has translated and the distance to the plane.
    As well as potentially introducing false planes, this could lead to incorrect orientations
    being assigned, meaning the map does not correspond well to reality, but also making
    it difficult for the planes to be corrected when reliable measurements are made, or to
    continue growing.
    One of the main implications of not knowing where planes are beforehand is that many
    potential locations must be investigated. A large number of planar features must be
    initialised and grown, which is rather computationally expensive. While this blind prob-
    ing is ultimately effective, it risks many planes being grown over non-planar structures,
    or for planes to overlap and compete for measurements, further slowing the process of
    finding true planar structure. An ideal solution to this and the other problems, of course,
    would be some way of knowing a priori which image regions correspond to planes.
    1 See www.rawseeds.org
    7.4 Plane Detection for Visual Odometry
    Now that we have outlined the important details of the IDPP VO system we use as a
    basis, and that the details of the plane detection algorithm are thoroughly explained
    in Chapters 3 and 5, we proceed to describe how the two are combined, to produce a
    plane-based visual odometry system that runs in real time.
    7.4.1 Structural Priors
    As we stated earlier, the function of our plane detection algorithm here is to provide a
    prior location, and orientation, of planar structures in the image. That is, we do not rely
    exclusively upon our detection algorithm, since it is prone to error, nor do we repeatedly
    use it to refine the estimates of the planes as they are observed. This cleanly separates
    the domain of the two algorithms: plane detection finds the location of the planes in the
    image, and passes this information to the VO system, which then initialises planes and
    continues to measure and map them as appropriate. Plane detection runs as described
    previously, using only the information in a single colour image as input, independently
    of any estimates present in the map.
    One issue to consider is how the data are passed from the plane detector to IDPP.
    Plane detection operates at the level of salient points (Section 3.3.1), but this is only
a limited sampling of the image; and while the VO also uses a set of salient points to
    track and localise features, these are not necessarily the same points (in fact, we use the
    difference of Gaussians detector, which picks blob-like salient regions at multiple scales,
    while IDPP uses the corner-like features provided by FAST [110]). We bridge this gap
    by modifying our algorithm to produce a mask as output, indicating which pixels belong
    to which planes. This is by nature an approximation since we do not have pixel-level
    information, but it is valid if we assume that planes are continuous between the salient
    points assigned to them, and do not extend significantly beyond their planar points.
    We create the mask using the Delaunay triangulation, which is already available having
    created the graph for the Markov random field (Section 5.8.2), and assign pixels which
    fall inside each triangle to have the same class (and orientation) as the three vertices,
    for those triplets which are in the same segment. Pixels inside triangles whose vertices
    belong to different segments, or those outside any triangles, are not assigned to a planar
region, and assumed non-planar (the non-planar regions are not important here).
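A sketch of this mask construction is given below, using scipy's Delaunay triangulation and an OpenCV polygon fill; the segment labels come from the plane detector, the mask is stored as 8-bit grey levels (so at most 255 planes per image), and the function name and interface are our own.

```python
import cv2
import numpy as np
from scipy.spatial import Delaunay


def make_plane_mask(image_shape, points, segment_ids):
    """Rasterise a per-pixel plane-ID mask from labelled salient points.

    image_shape : (height, width) of the keyframe
    points      : (N, 2) array of salient point positions (x, y)
    segment_ids : length-N array; 0 = non-planar, k > 0 = plane ID k
    """
    mask = np.zeros(image_shape, dtype=np.uint8)   # 0 everywhere = non-planar
    segment_ids = np.asarray(segment_ids)
    tri = Delaunay(points)
    for simplex in tri.simplices:
        labels = segment_ids[simplex]
        # fill the triangle only if all three vertices belong to the same plane
        if labels[0] > 0 and labels[0] == labels[1] == labels[2]:
            cv2.fillConvexPoly(mask, points[simplex].astype(np.int32), int(labels[0]))
    return mask
```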
    Figure 7.1: Examples of the masks we use to inform the visual odometry
    system of plane detections. From the input image (left) we use a Delaunay
    triangulation to create a mask, showing which pixels belong to planes, each with
    a grey level mapping to an ID with which orientation is looked up (right). These
    triangulations are similar to how the plane detection is usually displayed (centre).
    The masks that result from this are vectorised to integer arrays, where all non-planar
    pixels are set to 0, and pixels in a plane have a value corresponding to the plane’s ID
    number, which is used to recover the plane normal from a list. We can also show these
    masks as grey scale images, examples of which are in Figure 7.1, where for display pur-
    poses IDs are mapped to visible grey levels. The approximated pixel-wise segmentation is
    actually the same as the way we have illustrated the extent of planar regions in previous
    chapters.
    7.4.2 Plane Initialisation
    When a keyframe is created by IDPP, instead of attempting to initialise planes at any
    and all locations, the keyframe image is passed to the plane detector, and upon receipt of
    the plane mask image, planar features are created. We initialise one IDPP planar feature
    per detected plane, using the centroid of the region (specifically, the nearest FAST corner
to the centre of mass of the mask pixels) as the location of the seed point. To avoid overwriting existing planes, we ensure that each centroid is at least a minimum distance (set to 30 pixels) from any planar point in the keyframe, and discard the detected plane otherwise.
    Where no planes are present (black areas of the mask), point features are initialised as
    normal, if necessary, to ensure there is sufficient coverage of the environment.
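The seed selection described above might look like the following sketch; the corner list and the set of existing planar points are assumed to be supplied by the VO system, and the function name is our own.

```python
import numpy as np


def select_plane_seeds(mask, fast_corners, existing_planar_points, min_sep=30.0):
    """Choose one seed (the FAST corner nearest the mask centroid) per detected plane.

    mask                   : integer plane-ID mask (0 = non-planar)
    fast_corners           : (M, 2) array of corner positions (x, y) in the keyframe
    existing_planar_points : (K, 2) array of planar points already in the map
    """
    seeds = {}
    for plane_id in np.unique(mask):
        if plane_id == 0:
            continue
        ys, xs = np.nonzero(mask == plane_id)
        centroid = np.array([xs.mean(), ys.mean()])
        # nearest FAST corner to the centre of mass of the mask pixels
        seed = fast_corners[np.argmin(np.linalg.norm(fast_corners - centroid, axis=1))]
        # discard the detected plane if it would overwrite an existing one
        if existing_planar_points.size and \
           np.min(np.linalg.norm(existing_planar_points - seed, axis=1)) < min_sep:
            continue
        seeds[int(plane_id)] = seed
    return seeds
```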
    The estimated normal vector from the plane detector is used to initialise the normal in
    the feature state. The aim is that this will be close to the true value, much more so than
    the front-on orientation which must be assumed in the absence of any knowledge. This
    allows faster convergence of the normal estimate, and avoids falling into local minima.
However, the normal must be initialised with sufficiently large uncertainty: while an initial estimate is useful, it cannot be relied upon to be accurate, and so it must remain possible to update it as more measurements are made, without the filter becoming
    inconsistent. Using these initial normals also helps ensure non-planar points are not used
    for measurements, since as described above, their warped image will be less similar if
    the correct normal is used.
    7.4.3 Guided Plane Growing
    It is important to note that we do not initialise the whole extent of the plane immediately,
    spreading points over the mask region. This would risk introducing incorrect estimates
    into the filter, caused by not having enough baseline to correct the normal estimate from
    all the measurements. Such errors would be difficult to recover from and may corrupt
    the map.
    Instead, we use the regular plane growing algorithm to probe the extent of the plane,
    but allow the growing to proceed faster than normal, by adding more new points per
    frame. The number of new planar points added to a plane in each frame is increased to
    10 (from 3 in IDPP), meaning we can exploit the prior knowledge of the plane location
    and orientation to map planes quickly, but retain the ability to avoid regions which are
    not actually coplanar. Furthermore, any points which do not conform to the planar
    estimate are automatically pruned by the algorithm’s 3D consistency test (see Section
    7.3.4 above), so minor errors in the plane detection stage (planes leaking into adjoining
    areas, for example) do not cause problems in the map estimation.
    The planes are not permitted to grow outside the bounds set by the plane detector,
    which means the growing is automatically halted once the edge of the detected plane
has been reached (they cannot inadvertently envelop the whole image). No planar features other than those detected are allowed. The result is a much smaller but more precise
    set of planes, corresponding better to where they should be; and no wasted time in
    attempting to grow planes in regions which are not appropriate. This, combined with
    the computational savings achieved by initialising with a good normal estimate, allows
    for a substantial increase in frame rate.
    What we are striving for, in effect, is a system whereby the strengths of both methods
are combined, in order to correct each other’s errors. The inability of IDPP to quickly
    map out new planes is mitigated by using the prior information, and the potential for
    detected planes to have inaccurate orientations is ameliorated by the iterative update
    over subsequent frames.
    7.4.4 Time and Threading
    While our original implementation of the plane detector was reasonably fast, running
    in one to two seconds per image, this was using un-optimised code. For application
    in a real-time system, the fastest execution possible is desired. As such we parallelise
    portions of our algorithm to decrease its execution time. The region sweeping stage is
    ideal for splitting between simultaneous processes or threads, since each sweep region
    is independent (indeed, it is basically a separate invocation of the plane recognition
    algorithm). This means the creation of descriptors and classification of the regions can
    be done separately for sets of regions, then combined together when creating the local
    plane estimate.
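A minimal sketch of this parallel sweep is shown below, using a thread pool; `classify_region` stands in for one invocation of the plane recognition algorithm, and in practice a process pool (or native threads in a compiled implementation) may be preferable.

```python
from concurrent.futures import ThreadPoolExecutor


def classify_regions_parallel(regions, classify_region, workers=4):
    """Classify overlapping sweep regions in parallel.

    regions         : list of region descriptions (e.g. bounding boxes or crops)
    classify_region : callable(region) -> (is_plane, normal) for one region
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # results come back in the same order as 'regions', so they can be
        # combined directly into the local plane estimate afterwards
        return list(pool.map(classify_region, regions))
```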
    While this is still not sufficient to run in real-time (the camera images are received at
a rate of 30Hz), it is not strictly necessary for our detector to run this fast (i.e. with
    an execution time below 33ms, which may be possible but difficult to obtain). This is
    because we can run the detector in the background, in a separate processor thread, and
    return the result for use by the VO system when it has completed.
    Since the plane detector is running in a separate thread, we take full advantage of this
    by running it continuously. As soon as the plane detector has finished one frame, it will
    move on to the next current image from the camera. This means the VO thread has a
    constant stream of plane detections, as soon as they are available.
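The threading arrangement can be sketched as follows: the detector thread always works on the most recent keyframe handed to it, and posts each result back tagged with the keyframe it came from. This is a simplified illustration; the real system ties results to IDPP keyframes and their reference cameras.

```python
import queue
import threading
import time


class BackgroundPlaneDetector:
    """Run a (slow) plane detector continuously in a background thread."""

    def __init__(self, detect_fn):
        self.detect_fn = detect_fn        # callable(image) -> plane mask
        self.latest = None                # (keyframe_id, image); most recent only
        self.lock = threading.Lock()
        self.results = queue.Queue()      # (keyframe_id, mask) consumed by the VO thread
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, keyframe_id, image):
        with self.lock:
            self.latest = (keyframe_id, image)   # older unprocessed frames are dropped

    def _loop(self):
        while True:
            with self.lock:
                job, self.latest = self.latest, None
            if job is None:
                time.sleep(0.005)                # nothing to do yet
                continue
            keyframe_id, image = job
            self.results.put((keyframe_id, self.detect_fn(image)))
```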
    This is where the keyframes of the VO system are particularly convenient, since these
    are retained even after the camera has moved on, and their associated camera poses are
    maintained in the state. This means once the plane detector has finished, planes can
    be related back to the keyframe on which they were detected, in order to grow them
    and update the reference camera. Updates to one keyframe, even in the ‘past’, can
    be used to update the current map. The disadvantage is that updates resulting from
    measuring planes arrive several frames late, although this will not be a problem under
    the assumption that there are no drastic changes in camera pose in the interim, and
    so long as the planes are still in view. The delay is more than compensated for by the
    increased speed at which planes can grow and converge.
    While one of the primary reasons to use only a single frame to detect planar structure
    was to do this quickly, it appears we must still wait several frames for detections to
    become available. One might argue that rather than wait for the single-frame estimate,
    it would be simpler to take all the frames from a corresponding time period and apply
    standard stereo or multi-view algorithms to recover planar structure. This is not an
    appropriate alternative, and we refute it thus:
    The problem with simply using all the images in a 20-30 frame window to recover 3D
    information is that accuracy is very sensitive to the baseline. The camera would need
    to move far enough, over a period of around one second, to perceive large differences
    in depth, which is rather unlikely when the camera is moving at moderate speeds in an
    urban scene. Indeed, this is the limitation which stops IDPP growing planes instanta-
    neously. While one could, in principle, make use of information from all these frames,
    it would not necessarily be of any benefit; indeed, the original IDPP algorithm, which
    uses information from every frame, filtered by an EKF, would itself be one of the best
ways of doing this. In contrast, even if our plane detection allows 30 frames to pass by unseen while detecting planes in just one of them, it does not matter, even if all of those frames show essentially the same view.
    A second reason is that the plane detection algorithm uses an entirely different kind
    of visual cue. Even if some superior multi-view geometry algorithm could be used to
    extract accurate planar orientation from such a narrow time window, this still uses only
    the geometric information apparent from depth and parallax cues. On the other hand,
    by exploiting the appearance information of the image, our algorithm is attempting to
    interpret structure directly, based on learned prior knowledge. This is independent of 3D
    information, and so is complementary to the kind of measurements made by IDPP. We
    draw a parallel with the work of Saxena et al. [111], who show that estimating a depth
    map from a single image helped to improve the results from a stereo camera system, by
    combining the two complementary sources of information.
    7.4.5 Persistent Plane Map
    One objective of our PDVO is to create a plane-based map of the scene. In general,
    the accuracy of visual odometry is not good enough to create a persistent map, and the
    quality and extent of planes created by IDPP is not sufficient for a 3D map display.
    However, by using our plane detection, the planes tend to be larger, more accurate, and
    better represent the true 3D scene structure, and we can be more confident about the
    planes’ authenticity than with IDPP alone. This suggests an easy way to create maps
    quickly, by simply leaving the mapped planes in the world, even after they are no longer
    in view. Given that this is visual odometry, they are removed from the state, maintaining
    constant-time operation, and so will not be re-estimated when they are again in view. We
    find that the accuracy of their pose is sufficient to build such a 3D map, to immediately
    give a good sense of the 3D structure of the world; note that this work is at a preliminary
    stage, leading to quite rough 3D models.
    7.5 Results
    A number of experiments were carried out using videos of outdoor urban scenes. These
    were recorded using a hand-held webcam running at 30Hz, of size 320 × 240 pixels and
    corrected for distortion caused by a wide-angle lens. Our intention was to investigate
    what is possible when using learned planar priors, rather than to exhaustively evaluate
    the difference between the two methods. As such, we tuned both methods to work as
    well as possible by altering the number of new planar points that can be initialised at
    each frame. For IDPP, this was set to 3, for conservative plane growing, while for PDVO
    we used a value of 10, allowing planes to more rapidly fill the detected region, permissible
    since it is less sensitive to how planes are grown into unknown image regions.
    First we consider the implications of the delay in initialisation while waiting for the
    plane detector to run, compared to the undelayed initialisation of (seeds of) planes by
    IDPP. In Figure 7.2 we show the development of a keyframe over several frames after
    initialisation in both methods. The first row shows the initial input image, and the
    result of plane detection used to initialise planes in the PDVO system. Following this
    are images showing the progression of plane estimation. It is clear that IDPP, in the left
    column, quickly initialised many planes, at many image locations (some of which were
    not at all planar), but these took some time to grow, and competed for measurements.
    Figure 7.2: Comparison of the initialisation of plane features using the original
    IDPP method (left) and when augmenting it with plane detection (PDVO, right).
    Images of the camera view after 2 (initialisation), 14 (detection ends), and 46
    frames have elapsed are shown, demonstrating that although there is a delay of
    many frames while the plane detector runs, the good initial estimate makes up
    for this in terms of the number and quality of the resulting planes. The bottom
    row shows planes in 3D at frame 46.
    When using plane detection (right column), a single plane was initialised at the centroid
    of the detected region, and grew quickly. Even with a delay of around 14 frames before
    the detector finished, the plane expanded rapidly, overtaking those initialised by IDPP in
    number of measurements and image coverage. The bottom row shows 3D visualisations
    of the planes, corresponding to the last camera frame shown; the many planes created by
    IDPP had not yet attained good poses, while the plane initialised in PDVO already shows
    appropriate orientation. Again, this difference was partly because only the one plane
    was initialised, and because the plane prior allowed us to choose a less conservative rate
for plane growing, highlighting the fact that the two methods operate in very different
    ways.
    Figure 7.3: Some views of the Berkeley Square sequence, showing the original
    IDPP (left) and our improved method (right). The top images show a top-down
    view of the whole path, while the lower images show oblique views, illustrating
    that the PDVO method produces less clutter and larger planes.
    Next, we compared the two methods on a long video sequence, as the camera traversed
    a large loop of approximately 300 metres — this was a square surrounded by houses,
    with trees on the inside (the Berkeley Square sequence). 3D views resulting from the
    two methods are compared in Figure 7.3; while both recovered an approximately correct
trajectory (the true path was not actually square, but the ends should meet) and placed
planes parallel to the route along its length, it is clear in the PDVO method (right)
that there are fewer planes, which tend to be larger and less cluttered, giving a clearer
representation of the 3D environment. The oblique views underneath show this clearly,
where compared to PDVO, the planes mapped by IDPP are smaller, more irregular, and
with more varying orientations.
Figure 7.4: Comparison on the Denmark Street sequence: IDPP (left) again
has more numerous and smaller planes than PDVO (right) (note that the grid
spacing is arbitrary and does not reflect actual scale).
    We also show results for another video sequence, taken in a residential area, surrounded
    by planes on all sides (the Denmark Street sequence), shown in Figure 7.4. Again, the
    map visualisation created using our method is more complete and clear than that with
    the original IDPP, with fewer and larger planes. Examples of planes as seen from the
    camera are shown in Figure 7.5, and further examples of plane detections acquired during
    mapping are shown in Figure 7.6, showing our detection algorithm is quite capable of
    operating in such an environment.
Method | Total planes | Points per plane | Average area (pixels)
IDPP   | 205          | 17.9             | 521.0
PDVO   | 52           | 28.9             | 1254.4
    Table 7.1: Comparison of summary statistics for the IDPP and PDVO methods,
    on the Berkeley Square sequence.
    Table 7.1 compares statistics calculated from mapping the Berkeley Square sequence, in
    order to quantify the apparent reduction in clutter. These confirm our intuition that
    when using plane priors, fewer planes will be initialised, by avoiding non-planar regions.
    Furthermore, the planes resulting from PDVO are measurably larger, both in terms of
    the average number of point measurements, and number of pixels covered.
    Figure 7.5: Visual odometry as seen from the camera. For IDPP, many planes
    are initialised on one surface (a), or on non-planar regions (b); whereas PDVO
    has fewer, larger planes, being initialised only on regions classified as planes (c,d).
    Figure 7.6: Examples of plane detection from the Berkeley Square sequence,
    showing the area deemed to be planar and its orientation. Note the crucial
    absence of detections on non-planar areas; and that multiple planes are detected,
    being separated according to their orientation.
    As we emphasised earlier, our intention is to show the potential for using the plane
    detection method for fast map building, and not necessarily to produce a more accurate
    visual odometry. However, it is interesting to analyse the accuracy of PDVO compared
    to IDPP against the areas’ actual geography. Ground truth was not available, but the
    trajectories can be manually aligned with a map, as shown in Figures 7.7 and 7.8 for the
    Berkeley Square and Denmark Street sequences respectively. The latter is a compelling
    example, suggesting that, under certain conditions, our method helps to ameliorate the
    problem of scale drift (a well known problem for monocular visual odometry [119]); of
    course, many more repeated runs would be needed to quantify this, but we consider
    these initial tests to be good grounds for further investigation.
    One of our main hypotheses was that by using strong structural priors, we can make
    mapping faster by more carefully selecting where to initialise planes. Our experiments
confirmed this, as shown in Figure 7.9, where we compare the computation time (mea-
    sured in frames per second) for both methods, on the Berkeley Square sequence. As
    previously reported in [89], the IDPP system achieves a frame rate of between 18 and 23
    fps (itself an improvement on similar methods running at 1 fps [20]), which is confirmed
by this experiment (blue curve). Our method clearly out-performed this, achieving a
substantially higher average frame rate of 60 fps and remaining consistently faster
throughout the sequence (this measures only the VO thread, so does not include the time
taken to run the plane detector). We are not aware of existing visual odometry systems
running at such high frame rates for a similar level of accuracy, suggesting that our use
of learned structural knowledge is a definite advantage. Running at such high speeds is
beneficial since it means more measurements can be made for the same computational
load, which tends to increase accuracy [120], or frees computation time for global map
correction methods [119].
Figure 7.7: In lieu of ground truth data, the trajectories were manually overlaid
on a map for comparison, for the Berkeley Square sequence. The true trajectory
closes the loop, and while both methods show noticeable drift, the error for our
PDVO method (red) was an improvement on that of IDPP (blue).
Figure 7.8: A comparison of mapping ability of the two methods on the Denmark
Street sequence, compared to a map. Again, both methods exhibit gradual drift,
but this is reduced by our PDVO method (red) compared to IDPP (blue).
    Figure 7.9: Time (frames per second) for each of the methods (smoothed with
    a width of 100 frames for clarity). The mean is also shown for both.
    7.6 Conclusion
    This chapter has described how we can use our plane detection algorithm as part of a
    visual odometry system, being a good example of a real-world application. We achieved
    this by modifying an existing plane-based visual odometry system to take planes from
    our detector and use them to quickly initialise planar features in appropriate image
    locations.
    Part of the success of this approach was due to careful choice of the baseline VO system
    we used. The IDPP visual odometry system is based on taking measurements from one
    keyframe image, and growing planes from seed points, and these qualities make it ideal
for incorporating planar priors, by using them to specify where on the keyframe a plane
    should be initialised, and by having the confidence to grow these planes much faster.
    Furthermore, this VO system can make use of the single image estimate of the plane
    normal in a principled way, by using it to initialise the plane feature directly in the filter
    state. This means that our estimated value can help with faster initialisation, without
    having to wait for image measurements; but on the other hand reasonable errors in this
    value will not cause problems since it will be updated as more multi-view information
    becomes available.
    However, there is no reason why this would be the only type of SLAM or VO system
    able to benefit from having plane priors, and we could consider the use of methods based
    on fitting planes to points [47] or based on bundle adjustment [132]. For example, we
    might use RANSAC to find collections of coplanar points in a point cloud, and filter out
    false planes using our plane detection algorithm, to detect planes both geometrically and
    with some semantic guidance.
    A key contribution of this chapter was to show that by exploiting general prior knowledge
    about the real world – encoded via training data – we can derive strong structural priors,
    which are useful for fast initialisation of map features. Direct use of such general prior
    information has not been done in this way before — the closest equivalents are Flint
    et al. [40] who use assumptions on the orthogonality of indoor scenes to semantically
    label planar surfaces, and Castle et al. [14] who use knowledge of specific planar objects
    to enhance the map and recover absolute scale.
    7.6.1 Fast Map Building
    We also demonstrated that by detecting planes almost immediately from a single camera
    frame, they can be inserted directly into the map, to quickly give a concise and meaning-
    ful representation of the 3D structure, again due to having good priors. While planes are
    added to the map as they are built in the regular VO system, of course, the difference
    is that we have good reason to believe these planes will better reflect the actual scene
    structure, as opposed to being planes grown from coincidental coplanar structure. This
    was supported by our results, showing the detected planes to be fewer in number, larger
    in size, and seeming to align better with the known scene layout.
    It would be interesting to develop this further, toward producing fast and accurate plane-
    based 3D models of outdoor environments as they are traversed. This could be used,
    for example, to create quick visualisations of a scene, with textures on the planes taken
    from the camera images. By giving a better sense of the scene structure such maps
    would also be useful for robot navigation and path planning, being better able to avoid
    vertical walls or traverse ground planes; or for human-robot interaction, making it easier
    to communicate locations and instructions in terms of a common 3D map.
    7.7 Future Work
    This section discusses potential developments to this PDVO system, for further evalu-
ation and use as part of an online learning system; discussion of future work in other
    applications is deferred to Chapter 8.
    7.7.1 Comparison to Point-Based Mapping
    One important area which requires further investigation is exactly why we see the per-
    formance gains we do, in terms of frame-rate and accuracy. Our algorithm is able to
    initialise planes in a more intelligent way, by only creating seed points in regions deemed
    planar by our detector, which avoids the potential problems, and computational bur-
    den, of growing planes in inappropriate places. However, it could be that some of the
    benefits come simply from having fewer planes in total, for example if reducing the num-
    ber of planes, irrespective of where they are created, is beneficial. If, hypothetically,
    introducing planar structures inevitably leads to errors, then using fewer of them would
improve performance, calling into question the benefit of using single image plane detec-
    tion. This could be investigated by comparing the plane detection enhanced version,
    and the standard plane-growing version of IDPP, to a purely point-based system, to
    evaluate the differences between them (note that the point features used in IDPP are
    parameterised using the efficient, unified parameterisation, and so have advantages over
    standard point-based systems).
    We could also compare PDVO and a similar version which initialises the same number
    of planes, but in random locations. We would expect, if the conclusions of this chapter
    hold, that using the detected locations would be significantly better. Furthermore, the
    IDPP method was itself originally compared to point-only mapping, both in simulation
and on real data, and found to be superior [85, 89]. This suggests it is likely that using
    detected planes as opposed to a random subset would be beneficial, so we maintain our
    assumption that plane detection provides benefit over only points.
    7.7.2 Learning from Planes
Presently, the training data for PDVO comes from manually labelled images, the
    creation of which is a time consuming process. An alternative would be to use the IDPP
    visual odometry system itself to detect planes, and use these as training data. Once
    IDPP has been used to map an environment, the result is a map with planar structure,
which relates planes in the world to the keyframes from which they were observed. This
    means the information available in the keyframes is fairly similar to the type of ground
    truth data we have manually labelled, and so we could process these as described in
    Section 5.4, to extract training regions by sweeping over the whole image. Not only
    would this avoid manually annotating images, but would allow us to easily tailor the
    detector to new environments, by obtaining a training set more similar to the type of
    scenery encountered.
    There are a few issues to consider in pursuing this idea. Firstly, while keyframes are
    ideal for recovering the identified planar structures, it is not clear at which point during
    the planes’ evolution they should be used as training data. It would be prudent to wait
    for planes to expand and to converge to a stable orientation, although planar points may
    be removed as they go out of view or are occluded, so waiting too long would lead to
    fewer or smaller planes. Points will also be removed from planes if they are later found
    not to be co-planar, so taking the largest coverage of planes would also be inadvisable.
    In addition to this, it would not be possible to know that regions without planes are
    indeed non-planar, since the absence of planes might simply be due to having sufficiently
    many measurements without initialising additional planes on the keyframe.
    A further issue in using planes detected by IDPP – or indeed any geometric method –
    is that these cannot be used to determine what a human would consider planar, but
    only what the mapping algorithm considers to be planar. This may be useful, in terms
    of giving a prior on the locations of the kind of plane that will be detected by the
    mapping system; but it would no longer be encapsulating any human prior knowledge,
    or any planar characteristics complementary to what geometric methods can see. This
    is unfortunate, since it seems that a key benefit of PDVO is the ability to avoid planes
    in inappropriate places, and to predict their extent in the image, something which is
    difficult for IDPP.
    The above idea can be extended further, by combining both training and detection (which
    would both be autonomous) into an online system. The ideal would be a combined plane-
    detection/plane-mapping visual odometry system that starts with no knowledge of planes
    or the environment. As it maps planes using multi-view information, it would gradually
    learn about their appearance. From this it would detect new planes from single images,
    increasing the efficiency of detection as it learns more about its surroundings. In order to
    achieve this it would be necessary to make some changes to the plane detection method,
    primarily to make training feasible in an online system. Training the RVM would no
    longer be practical, since this takes a large amount of time, and would require retention
    of the whole training set (since the relevance vectors are liable to change as more data
    become available). An alternative classifier would be necessary, for example random
    forests [12] which would be easier to incrementally train as data become available.
    This would be a very interesting way to develop the PDVO algorithm, since it would be
    a step toward creating a self-contained perceptual system, which explores its environ-
    ment, learns from it, and uses this learned knowledge to aid further exploration. Again,
    we make an analogy with the way humans perceive their environment, as discussed in
    Chapter 1. Biological systems are capable of such learning, ultimately starting with no
    prior information, which implies it may be possible to design a vision system to achieve
    similar goals.
    However, many challenges remain before developing such a system using the methods
    described above. Because training data would come from plane growing, this could lead
    to undesirable drift in what is considered a plane, if false planes are detected and used as
    training examples. Alternatively, if few planes are detected, there would be insufficient
    data to learn from, making the plane detector unable to help initialise new features.
    As such, building an online learning system with the methods we have discussed here
    remains a distant prospect requiring considerable further work.
    CHAPTER 8
    Conclusion
    This thesis has investigated methods for finding structure in single images. This was
    inspired by the process of human vision, specifically the way that humans are thought to
    learn how to interpret complex scenes by virtue of their prior knowledge. As we discussed
    in Chapter 1, learning from experience appears to play an important part in how humans
see the world — evinced by phenomena such as optical illusions. Since humans appear
    capable of perceiving structure from both reality and in pictures, without necessarily
    using stereo or parallax cues, this can provide useful insights into how computer vision
    algorithms might approach such tasks.
    This motivated us to take a machine learning approach to tackle the problem, where
    rather than explicitly specify the model underlying single image perception, we learn
    from training data. This is similar in spirit to a number of approaches to tasks such
    as object recognition [37, 69], face verification [7], robotics [1], and so on. We are also
motivated by other recent work [66, 113], which has used machine learning methods
    for perceiving structure in single images; and driven by the range of possible applications
    single-image perception would have.
    Amongst the myriad possibilities for attempting single-image perception (such as re-
    covering depth or estimating object shapes), we have begun by focusing on the task of
    plane detection. This was chosen because planes are amongst the simplest of geometric
    objects, making them easy to incorporate into models of a scene, and can be described
    with a small number of parameters. Furthermore, planes are ubiquitous in human-made
    environments, so can be used to compactly represent many different indoor or urban
    environments. The importance of this task is underlined by many recent works on plane
    detection generally [5, 42], as well as attempts to extract planar structure in single im-
    ages [73, 93]. However, as we described in Chapter 2, existing methods for single image
    plane detection suffer shortcomings such as a dependence on certain types of feature, or
    an inability to accurately predict orientation, so we believe our new method satisfies an
    important need.
    In order to begin the interpretation of planar scenes, we developed a method for recog-
    nising planes in single images and estimating their orientation, which uses basic image
    descriptors in a bag of words framework, enhanced with spatial information (Chapter
    3). We use this representation to train classifiers which can then predict planarity and
    orientation for new, previously unseen image regions. We believe this is a good approach
to the problem, since it avoids extracting potentially difficult structures such
    as vanishing points, which may not be appropriate in many situations.
    Our experiments in Chapter 4 confirm the validity of such a learning based approach,
    showing that it can deal with a variety of situations, including both regular Manhattan-
    like scenes and more irregular collections of surfaces. We acknowledge that our method
    may give orientation accuracies inferior to direct methods (using vanishing points, for
    example) in the more regular scenes. However, we are not bound by their constraints,
    and can predict orientation in the absence of any such regular structure. This chapter
    also explored a number of design choices involved in creating the algorithm. This is
    important, since it gives some insight into why the method achieves its results, and its
    potential limitations.
    This work was not in itself complete, since it required the correct part of the image to
    be marked up. We addressed this in Chapter 5, where we demonstrated that this plane
    recognition algorithm can be incorporated into a full plane detection system, based on
    applying it multiple times over the image, in order to sample all possible locations of
    planes. This allows us to estimate planarity at each point (the ‘local plane estimate’),
    which gives sufficient cues to be able to segment planes from non-planes, and from each
    other, implemented with a Markov random field (MRF).
    We showed in Chapter 6 that this detection method is indeed capable of detecting planar
    structure in various situations, including street scenes with orthogonal or vanishing-point
    structures, but also more general locations, without the kind of obvious planar structure
    usually required for such tasks. We emphasise that our algorithm also deals with non-
    planar regions, and can determine if there are not in fact any planes in the image.
    These experiments confirm our initial hypothesis, that by learning about the relationship
    between appearance and structure in single images, we can begin to perceive the structure
    in previously unseen images, without needing multi-view cues or depth information.
    Nevertheless, the task of perceiving structure generally is far from complete, in that we
    have looked only at planar structures so far. Moving on to more complicated types of
    scene would be an interesting avenue to explore in future, although potentially much
    more challenging.
    Chapter 7 showed how our plane detector can be applied in a real application. Here we
    investigated its use for visual odometry (VO), a task where planes have been useful for
    efficient state representation and recovering higher-level maps [46, 87]. We experimented
    by modifying an existing visual odometry system [89] that can simultaneously grow
    planes and estimate their orientation in the map, while using them to localise the camera.
    We chose this system since it was clear that it would benefit from knowing the likely
    locations of planes. We used our plane detector to find planes in a set of keyframes,
    and from there initialise planar features in the map, using our estimated orientation as
    a prior. This allowed planes to be initialised only in locations where they should be, to
    grow quickly into their detected region and not exceed their bounds, and to be initialised
    with an approximate orientation, which while not perfect was better than assuming the
    planes face toward the camera. This increased the accuracy of the resulting maps, and
    drastically increased the frame-rate over the baseline VO system, while also allowing fast
    construction of plane-based maps, by retaining detected planes even after they had been
    removed from the state.
    8.1 Contributions
    Here we briefly summarise the key contributions of this thesis:
• We described a method of compactly representing image regions, using a variety of basic features in a bag of words framework, enhanced with spatial distribution information.
• Using this representation, we trained a classifier and regressor to predict the planarity and orientation of new candidate regions, and showed that this is accurate and performs well with a variety of image types.
• We developed this for use as part of a plane detection algorithm, which is able to recover the location and extent of planar structure in one image, and give reliable estimates of their 3D orientation.
• This method is novel, in that it is able both to detect planes using general image information – without relying either on depth information or specific geometric cues – and to estimate continuous orientation (a normal vector as opposed to an orientation class).
• We also compared our method to a state of the art method for extracting geometric structure from a single image, and found it to compare well, with superior performance on average on our test set according to the evaluation measures we used.
• Finally, we demonstrated that single-image plane detection is useful in the context of monocular visual odometry, by giving reliable priors on both the location and orientation of planar structures, enabling faster and more accurate maps to be constructed.
    8.2 Discussion
    Having outlined the primary contributions and achievements of this work, we now dis-
    cuss some of the more problematic areas. One implicit limitation in our current plane
    recognition algorithm is that it depends upon having a camera of known and constant
    calibration. Images from different cameras with different parameters may look consider-
    ably different, due to the effects of lens distortion or picture quality for example. This
    would impact upon classification accuracy, but could be solved by increasing the quan-
    tity of training data. However, while the calibration parameters do not appear in any
    of our equations or image representations, they are required in order to relate the four
    marked corners of the image to the ground truth normal (Section 3.2). This implies that
    if an image exists with an identical quadrilateral shape, that comes from a camera with
    different parameters, the true normal would actually be different, causing our algorithm
    to give erroneous results. This limits our algorithm to images and videos taken with the
    same known camera (the results in this thesis were all obtained thus); and yet it would
    clearly be beneficial to attempt to generalise our method, firstly to work with any given
    camera matrix, and ultimately to be independent of calibration, to use images from a
    more diverse range of sources. We envisage a system able to freely harvest images from
the internet, and intriguingly perhaps even from Google Street View 1 , which already
contains some orientation information.
1 www.maps.google.com
    As we discussed in Section 6.5.2, the model we have assumed in order to relate appear-
    ance to structure might cause problems. Specifically, by using a translation invariant
    descriptor (the spatiograms with zero-mean position), we are assuming that location in
    the image is irrelevant for either plane classification or orientation. Not only does this
    miss out on a potentially useful source of information, but is inaccurate, since experience
    shows that an identical planar appearance seen in different image locations may imply a
    different orientation. The brief experiments we conducted to investigate this fortunately
    show that it is not a pressing issue. The difference in orientation as the plane moves
    across an image is slight, and the results using both types of description showed broadly
    similar performance. Nevertheless, it would be desirable to ensure the way we represent
appearance and orientation allows their true relationship to be expressed, which may lead to better results.
    We described in Chapter 5 how we use a two-step process for segmenting planes according
    to class then orientation. We found this worked well in practice, but are aware that
    the use of two stages, plus mean shift to discretise the orientations, might be an over-
    complication. In principle it should be possible to perform the entire segmentation in
    one step, simultaneously estimating class and orientation for plane segments, while also
    finding the best set of orientations. This would perhaps be more efficient, or at least more
    computationally elegant. Such a one-step segmentation could potentially be achieved by
    treating it as a hierarchical segmentation problem on a MRF [78]; or by using a more
    sophisticated model such as a conditional random field, where the best parameters to
    use would be learned from labelled training data [105].
    One significant problem which we observed (see for example the images in Chapter 6) is
    the plane detector’s inability to adhere to the actual edges of planar surfaces. Frequently,
    planes will not reach the edges of the regions, since they only exist where salient points
    occur; but we also observe many cases where planes overlap the true boundaries and leak
    into other areas. This is an important problem since for any kind of 3D reconstruction,
    or when using the planes to augment the image, this will lead to errors, making the
    actual structure more difficult to interpret. As we have discussed, making use of edge
    information, or explicitly classifying the boundaries, would be possible ways to mitigate
    this. As it stands, the fact that planes can be detected in the right location with broadly
    correct extent is encouraging, but expanding our algorithm to respect plane boundaries
would not only significantly improve the presentation of the results, but also bolster the case
    that learning from images can be more powerful than traditional techniques such as
    explicit rectangle detection.
    We also note that one of the criticisms we levelled at the work of Hoiem et al. [66] is
    its dependence on scenes structured in a certain way — for example having a horizontal
    ground plane and a visible horizon. Our method is not constrained thus, since it does
    not explicitly require any such characteristics of its test images. Nevertheless, a useful
    avenue of future research would be to more thoroughly investigate this, by evaluating
    the performance of the algorithm on particularly unusual viewpoints (pointing upward
to a ceiling, for example), or on arbitrarily rotated images. In theory our method would
    be able to cope with such situations (depending on the training data), and quantifying
    this would further support the idea of extrapolating from training data to unexpected
    scenarios.
    We attempted to show in Chapter 7 how using the plane detector can improve monocular
visual odometry. Our results suggest that this is possible, but we acknowledge
    that the experiments shown here are not rigorous enough to be certain of any increased
    accuracy. Ideally, we should have run the point-based, IDPP, and enhanced PDVO ver-
    sions of the algorithms in simulation, to determine error bounds and verify consistency
    under carefully controlled conditions, before attempting to evaluate in a real-world sce-
    nario. More rigorous outdoor testing is also required to validate our apparent increase
    in trajectory accuracy over IDPP, using ground truth data if possible (using GPS, for
    example).
    8.3 Future Directions
    The plane detection algorithm we developed was shown to work in various situations, and
    to be useful in a real-world application. However, there are many avenues left unexplored,
    both in terms of improvements to the algorithm itself, and further development of the
    ideas to new situations. We have already described some modifications and extensions
    in earlier chapters; in this section we briefly outline some further interesting directions
    this work could lead to.
    8.3.1 Depth Estimation
    To investigate the potential of using single images for structure perception, we have
    focused on plane detection, but it would be interesting to investigate other tasks using
    a similar approach. As we discussed in the introduction, another useful ability would
    be depth perception — something which has important implications in many situations,
    as evinced by the recent popularity of the Kinect sensor [96], which is able to sense
    depths in indoor locations. We have so far ignored the implications of depth for our
plane detection, and assume either that it is not relevant (Chapter 5) or that it can be detected by other means (Chapter 7).
    Estimating accurate depth maps for the whole image is challenging, and has been ad-
mirably tackled by Saxena et al. [112]. This is rather different from the task we have considered, since we have focused so far on region-based description, whereas depth maps require finer-
    grained perception. However, if we consider depth estimation for planes themselves, we
    could use the results of our plane detection algorithm as a starting point for perceiv-
    ing depth for large portions of the scene. Thus, we would not need to estimate depth
at individual pixels or superpixels: by estimating a mean depth for each plane, we could approximately position it in 3D space. Along with other such planes, placed in relation to each
    other and to the camera, this would provide a rough but useful representation of the
    scene, perhaps even suitable for simple 3D visualisation [64].
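As a sketch of how little is needed for this, the snippet below places a detected plane in 3D from its estimated normal, an assumed mean depth and the camera intrinsics (all names and values here are illustrative, not part of our system): the region centroid is back-projected to the assumed depth, and the region's boundary pixels are then intersected with the resulting plane to give 3D corners suitable for simple visualisation.

import numpy as np

def place_plane(region_px, normal, mean_depth, K):
    """Back-project an image region onto a 3D plane.

    region_px  : Nx2 pixel coordinates of the region boundary.
    normal     : estimated unit plane normal (camera coordinates).
    mean_depth : assumed depth, along the optical axis, of the region centroid.
    K          : 3x3 intrinsic matrix.
    Returns Nx3 points on the plane n.X = n.X0, where X0 is the centroid
    back-projected to the given depth.
    """
    K_inv = np.linalg.inv(K)
    px = np.asarray(region_px, float)

    # Back-project the centroid to a 3D anchor point at the mean depth.
    c = np.append(px.mean(axis=0), 1.0)
    ray_c = K_inv @ c
    X0 = ray_c * (mean_depth / ray_c[2])

    # Intersect each boundary pixel's viewing ray with the plane.
    pts = []
    for u, v in px:
        ray = K_inv @ np.array([u, v, 1.0])
        t = normal @ X0 / (normal @ ray)   # ray-plane intersection
        pts.append(t * ray)
    return np.array(pts)

# Illustrative use with made-up values.
K = np.array([[500, 0, 320], [0, 500, 240], [0, 0, 1]], float)
boundary = [(100, 100), (500, 120), (480, 400), (90, 380)]
n = np.array([0.1, 0.0, -1.0]) / np.linalg.norm([0.1, 0.0, -1.0])
corners_3d = place_plane(boundary, n, 4.0, K)
print(corners_3d)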
    8.3.2 Boundaries
    We discussed at the end of Chapter 6 ideas for using edge information to enhance plane
    detection, in order to better perceive the boundaries between planar or non-planar re-
    gions. One other way in which we might explore this is to use other cues as well, such
    as segmentation information. Hoiem et al. [66] and Saxena et al. [113] used an over-
    segmentation of the image to extract initial regions, for scene layout and depth map
    estimation respectively; however, as we stated previously, we do not believe this would
    be appropriate for creating regions for plane detection. On the other hand, we could
    use the boundaries between segments as evidence for discontinuities between surfaces,
    essentially treating it as a kind of edge detection, where the edges have more global sig-
    nificance and consistency. For example, Felzenszwalb and Huttenlocher’s segmentation
    algorithm [36] tries to avoid splitting regions of homogeneous texture, while simply using
    a Canny edge detector could fill such regions with edges.
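A quick way to compare the two sources of evidence is sketched below using off-the-shelf implementations (scikit-image here, purely for illustration, with an arbitrary placeholder filename and parameter settings): boundaries between Felzenszwalb-Huttenlocher segments [36] tend to follow coherent region outlines, whereas a Canny edge map responds to every local gradient, including texture interior to a surface.

import numpy as np
from skimage import color, feature, io, segmentation

# 'scene.png' is a placeholder filename for any test image.
img = io.imread('scene.png')
gray = color.rgb2gray(img)

# Local edge evidence: responds to texture inside homogeneous regions.
canny_edges = feature.canny(gray, sigma=2.0)

# Region-based evidence: boundaries between graph-based segments,
# which tend to have more global significance and consistency.
labels = segmentation.felzenszwalb(img, scale=100, sigma=0.8, min_size=50)
segment_edges = segmentation.find_boundaries(labels, mode='thick')

print('Canny edge pixels:      ', int(canny_edges.sum()))
print('Segment boundary pixels:', int(segment_edges.sum()))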
    8.3.3 Enhanced Visual Mapping
    As described in Chapter 7, using plane detection can enhance visual odometry, by show-
ing it where planes should be initialised. Our implementation does not enforce any consistency between plane detections, however, and treats each frame as independent. This
    is a reasonable assumption when there is up to one second between frames, as the camera
    is likely to have moved on in between; but as available computing power increases (or we
    further optimise the detection algorithm), it may eventually be possible to apply plane
    detection to every frame in real time.
    In this situation it would be desirable to impose some kind of temporal consistency, to
    use information across multiple images. This would not necessarily be stereo or multi-
    view information of course — as we have emphasised before, the camera may well be
    stationary or purely rotating between adjacent frames. We could, for example, use the
previous frame to guide detection of planes in the current frame, to ensure the layout is similar.
    Without this, we would likely observe fairly large changes between frames, in orientation
    and even in number and size of planes, since the MRF segmentation is begun anew each
    time (plus the locations of salient points may shift dramatically). A principled way of
    achieving this would be to extend the MRF to have temporal as well as spatial links, so
    that points in two (or more) images are linked together in the graph, and segmentation
    takes both into account. This could be applied to video sequences using a sliding window
    approach, taking multiple frames at a time to build a graph across space and time.
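A minimal sketch of how such a spatio-temporal graph might be assembled is given below; the neighbourhood choices are illustrative rather than a description of our MRF. Salient points within each frame are linked to their nearest spatial neighbours, and each point is additionally linked to the closest point in the previous frame (if one lies within a radius), giving the temporal edges over which a segmentation could enforce consistency.

import numpy as np
from scipy.spatial import cKDTree

def build_spatiotemporal_graph(frames, k_spatial=4, temporal_radius=20.0):
    """Build node and edge lists for a graph over several frames.

    frames : list of Nx2 arrays of salient point positions, one per frame.
    Returns (nodes, edges), where nodes are (frame_index, point_index) pairs
    and edges are index pairs into the node list.
    """
    nodes, edges, offsets = [], [], []
    for f, pts in enumerate(frames):
        offsets.append(len(nodes))
        nodes.extend((f, i) for i in range(len(pts)))

    for f, pts in enumerate(frames):
        tree = cKDTree(pts)
        # Spatial edges: each point to its k nearest neighbours in this frame.
        _, nbrs = tree.query(pts, k=k_spatial + 1)
        for i, row in enumerate(nbrs):
            for j in row[1:]:                      # skip self (first neighbour)
                edges.append((offsets[f] + i, offsets[f] + j))
        # Temporal edges: each point to the closest point in the previous frame.
        if f > 0:
            prev_tree = cKDTree(frames[f - 1])
            dist, j = prev_tree.query(pts, k=1)
            for i, (d, jj) in enumerate(zip(dist, j)):
                if d < temporal_radius:
                    edges.append((offsets[f] + i, offsets[f - 1] + jj))
    return nodes, edges

# Illustrative use with random point sets standing in for salient points.
rng = np.random.default_rng(2)
frames = [rng.uniform(0, 640, size=(100, 2)) for _ in range(3)]
nodes, edges = build_spatiotemporal_graph(frames)
print(len(nodes), 'nodes,', len(edges), 'edges')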
    If we can develop such an algorithm, to run in real time and give temporally consistent
    results, this would have many benefits for our visual odometry application. Rather than
the plane detector running on whichever frame is next available (which might not be the best frame to use), we could send requests to the plane detector at times deemed
    necessary by the mapping system, in the knowledge that the result would be available
fast enough to make a decision regarding initialisation. Planes would either be initialised immediately, or the plane detector would be called again in a subsequent frame. Alternatively,
    the decision to initialise planes could come from the detector. If it is executed for every
    incoming frame, we can decide only to initialise planar features in frames where they are
    particularly large or have good shape characteristics, and keep detecting in every frame
    to find the best opportunities.
    8.3.4 Structure Mapping
    Finally, we mention another potential approach to building maps using our plane detec-
    tor, without being reliant on a pre-existing VO/SLAM system. If we can reliably detect
    planes and their orientation in one image, then move the camera and detect planes again,
we could begin to associate structures between frames. If the frames are close
    together in time, it is likely we would be viewing the same set of planes. This could be
    used to gauge depth (up to the unknown scale of the camera motion, as in monocular
SLAM), and, by aligning the orientations of the corresponding planes, to recover a rough estimate
    of camera motion. We envisage a probabilistic approach where information from multiple
    frames is combined to try to accurately infer structure from unreliable plane measure-
    ments, while tracking the camera motion. This could be an interesting new approach to
plane-based mapping, since the planes are detected first and then used to build the map, which is the reverse of the standard approach.
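As an indication of how plane orientations alone constrain the motion, the sketch below recovers the relative camera rotation from matched plane normals in two frames using the standard SVD alignment; this is one plausible ingredient of such a system rather than something implemented in this thesis, and translation and scale would still have to come from the plane depths or point correspondences.

import numpy as np

def rotation_from_normals(normals_a, normals_b):
    """Least-squares rotation R such that normals_b ~ R @ normals_a.

    normals_a, normals_b : Nx3 arrays of matched unit plane normals,
    expressed in the camera frames of the two images.  At least two
    non-parallel normals are needed for a unique rotation.
    """
    A = np.asarray(normals_a, float)
    B = np.asarray(normals_b, float)
    H = A.T @ B                      # 3x3 correlation matrix
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    return Vt.T @ D @ U.T            # guard against reflections

# Illustrative check with a known rotation (20 degrees about the y axis).
theta = np.radians(20)
R_true = np.array([[np.cos(theta), 0, np.sin(theta)],
                   [0, 1, 0],
                   [-np.sin(theta), 0, np.cos(theta)]])
na = np.array([[0, 0, -1.0], [1, 0, 0], [0, 1, 0]])
nb = na @ R_true.T                   # rotate each normal into the second frame
R_est = rotation_from_normals(na, nb)
print(np.allclose(R_est, R_true))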
    We could also consider a more topological approach, in which the relationships between
planes (identity across frames, and locations within frames) are maintained in a graph
    structure to build a semantic representation of the scene, without ever estimating lo-
    cations in a global coordinate frame; this would be quite a different result from the
    structure-enhanced maps discussed above, more like the relational maps produced by
    FAB-MAP [27] and non-Euclidean relative bundle adjustment [91], and would be a very
interesting way to develop our single-image algorithm into a means of gaining a larger-scale semantic understanding of scenes as a whole.
    8.4 Final Summary
    To conclude, we briefly summarise what we have achieved in this thesis. We have in-
    vestigated methods for single image perception, inspired by the learning-based process
    of human vision, and focused on plane detection in order to develop our ideas. Our hy-
    pothesis that a learning-based approach to perception can be successful was supported
    in that we developed a single-region recognition algorithm for planes, followed by a full
    plane detection algorithm. We showed that this can be done with a significant relaxation
    of the assumptions previously relied upon; we retain only the requirement that the test
    images have appearance and structure reasonably similar to the training set. Therefore
    we conclude that a method based on learning from training examples is a valid and suc-
    cessful approach to understanding the content of images. Furthermore, we have shown
    that such an algorithm can be useful in the context of exploring and mapping an un-
known outdoor environment, and is able to extract a map of large-scale structures in real
    time.
    We believe our specific approach has the potential to be further developed, to perform
better using more visual information, and to be adapted to cope with new tasks. There is
    also much interesting work left to do in terms of using machine learning techniques to
    perceive structure from single images, to move on from only finding and orienting planes
    and to understand their 3D relationships or gauge their depth, for example. Ultimately
    the goal would be to go beyond using only planar structures, and recover a more complex
    understanding of the scene, pushing ever further the extent to which human-inspired
    models of visual perception can be used to make sense of the 3D world.
    References
    [1] P. Abbeel, A. Coates, and A. Ng. Autonomous helicopter aerobatics through ap-
    prenticeship learning. Int. Journal of Robotics Research , 29(13):1608–1639, 2010.
    [2] S. Albrecht, T. Wiemann, M. Gunther, and J. Hertzberg. Matching CAD object
    models in semantic mapping. In Proc. IEEE Int. Conf. Robotics and Automation,
    Workshop , 2011.
    [3] L. Antanas, M. van Otterlo, O. Mogrovejo, J. Antonio, T. Tuytelaars, and L. De
    Raedt. A relational distance-based framework for hierarchical image understand-
    ing. In Proc. Int. Conf. Pattern Recognition Applications and Methods , 2012.
    [4] O. Barinova, V. Konushin, A. Yakubenko, K. Lee, H. Lim, and A. Konushin. Fast
    automatic single-view 3-d reconstruction of urban scenes. In Proc. European Conf.
    Computer Vision , 2008.
    [5] A. Bartoli. A random sampling strategy for piecewise planar scene segmentation.
    Computer Vision and Image Understanding , 105(1):42–59, 2007.
    [6] S. Belongie, J. Malik, and J. Puzicha. Shape context: A new descriptor for shape
    matching and object recognition. In Proc. Conf. Advances in Neural Information
    Processing Systems , 2000.
    [7] T. Berg and P. Belhumeur. Tom-vs-Pete classifiers and identity-preserving align-
    ment for face verification. In Proc. British Machine Vision Conf. , 2012.
    [8] J. Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statis-
    tical Society B , 48(3):259–302, 1986.
    [9] N. Bhatti and A. Hanbury. Co-occurrence bag of words for object recognition.
    In Proc. Computer Vision Winter Workshop, Czech Pattern Recognition Society ,
    2010.
    [10] S. Birchfield and S. Rangarajan. Spatiograms versus histograms for region-based
tracking. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
    [11] C. Bishop. Pattern Recognition and Machine Learning . Springer New York, 2006.
    [12] L. Breiman. Random forests. Machine learning , 45(1):5–32, 2001.
    [13] L. Brown and H. Shvaytser. Surface orientation from projective foreshortening
    of isotropic texture autocorrelation. IEEE Trans. Pattern Analysis and Machine
    Intelligence , 12(6):584–588, 1990.
    [14] R. Castle, G. Klein, and D. Murray. Combining monoSLAM with object recogni-
    tion for scene augmentation using a wearable camera. Image and Vision Comput-
    ing , 28(11):1548–1556, 2010.
    [15] D. Chekhlov, M. Pupilli, W. Mayol-Cuevas, and A. Calway. Real-time and ro-
    bust monocular SLAM using predictive multi-resolution descriptors. In Proc. Int.
    Symposium on Visual Computing , 2006.
    [16] D. Chekhlov, A. Gee, A. Calway, and W. Mayol-Cuevas. Ninja on a plane: Au-
    tomatic discovery of physical planes for augmented reality using visual SLAM. In
    Proc. Int. Symposium on Mixed and Augmented Reality , 2007.
    [17] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Analysis
    and Machine Intelligence , 17(8):790–799, 1995.
    [18] S. Choi. Algorithms for orthogonal nonnegative matrix factorization. In Proc. Int.
    Joint Conf. Neural Networks , 2008.
    [19] J. Civera, A. Davison, and J. Montiel. Inverse depth parametrization for monocular
    SLAM. IEEE Trans. Robotics , 24(5), October 2008.
    [20] J. Civera, O. Grasa, A. Davison, and J. Montiel. 1-point RANSAC for EKF-based
    structure from motion. In Proc. Int Conf. Intelligent Robots and Systems , 2009.
[21] J. Civera, D. Gálvez-López, L. Riazuelo, J. Tardós, and J. Montiel. Towards
    semantic SLAM using a monocular camera. In Proc. Int. Conf. Intelligent Robots
    and Systems , 2011.
    [22] D. Comaniciu. An algorithm for data-driven bandwidth selection. IEEE Trans.
    Pattern Analysis and Machine Intelligence , 25(2):281–288, 2003.
    [23] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space
    analysis. IEEE Trans. Pattern Analysis and Machine Intelligence , 24(5):603–619,
    2002.
    [24] D. Comaniciu, V. Ramesh, and P. Meer. The variable bandwidth mean shift and
    data-driven scale selection. In Proc. Int. Conf. Computer Vision , 2001.
    [25] O. Cooper and N. Campbell. Augmentation of sparsely populated point clouds
    using planar intersection. In Proc. Int. Conf. Visualization, Imaging and Image
    Processing , 2004.
    [26] A. Criminisi, I. Reid, and A. Zisserman. Single view metrology. Int. Journal of
    Computer Vision , 40(2):123–148, 2000.
[27] M. Cummins and P. Newman. Highly scalable appearance-only SLAM: FAB-MAP
    2.0. In Proc. Robotics Science and Systems , 2009.
    [28] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In
    Proc. IEEE Conf. Computer Vision and Pattern Recognition , 2005.
    [29] A. Davison, I. Reid, N. Molton, and O. Stasse. MonoSLAM: Real-time single
    camera SLAM. IEEE Trans. Pattern Analysis and Machine Intelligence , 29(6):
    1052–1067, 2007.
    [30] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by
    latent semantic analysis. Journal of the American society for information science ,
    41(6):391–407, 1990.
    [31] C. Ding, T. Li, and W. Peng. On the equivalence between non-negative matrix
    factorization and probabilistic latent semantic indexing. Computational Statistics
    & Data Analysis , 52(8):3913–3927, 2008.
    [32] P. Dorninger and C. Nothegger. 3D segmentation of unstructured point clouds
    for building modelling. Int. Archives of the Photogrammetry, Remote Sensing and
    Spatial Information Sciences , 35(3/W49A):191–196, 2007.
    [33] S. Ekvall and D. Kragic. Receptive field cooccurrence histograms for object detec-
    tion. In Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems , 2005.
    [34] Z. Fan, J. Zhou, and Y. Wu. Multibody motion segmentation based on simulated
    annealing. In Proc. IEEE Conf. Computer Vision and Pattern Recognition , 2004.
    [35] P. Favaro and S. Soatto. A geometric approach to shape from defocus. IEEE
    Trans. Pattern Analysis and Machine Intelligence , 27(3):406–417, 2005.
    [36] P. Felzenszwalb and D. Huttenlocher. Efficient graph-based image segmentation.
    Int. Journal of Computer Vision , 59(2):167–181, 2004.
    [37] R. Fergus, M. Weber, and P. Perona. Efficient methods for object recognition using
    the constellation model. Technical report, California Institute of Technology, 2001.
    [38] R. Fergus, P. Perona, and A. Zisserman. A sparse object category model for efficient
    learning and exhaustive recognition. In Proc. IEEE Conf. Computer Vision and
    Pattern Recognition , 2005.
    [39] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting
    with applications to image analysis and automated cartography. Communications
    of the ACM Archive , 24:381–395, 1981.
    [40] A. Flint, C. Mei, D. W. Murray, and I. D. Reid. Growing semantically meaning-
ful models for visual SLAM. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
    [41] D. Forsyth. Shape from texture and integrability. In Proc. Int. Conf. Computer
    Vision , 2001.
    [42] D. F. Fouhey, D. Scharstein, and A. J. Briggs. Multiple plane detection in image
    pairs using j-linkage. In Proc. ICPR , 2010.
[43] J. Gårding. Shape from texture for smooth curved surfaces in perspective projec-
    tion. Journal of Mathematical Imaging and Vision , 2:329–352, 1992.
[44] J. Gårding. Direct estimation of shape from texture. IEEE Trans. Pattern Analysis
    and Machine Intelligence , 15(11):1202–1208, 1993.
    [45] E. Gaussier and C. Goutte. Relation between PLSA and NMF and implications.
    In Proc. Int. Conf. Research and Development in Information Retrieval , 2005.
    [46] A. Gee, D. Chekhlov, W. Mayol, and A. Calway. Discovering planes and collapsing
    the state space in visual SLAM. In Proc. British Machine Vision Conf. , 2007.
    [47] A. Gee, D. Chekhlov, A. Calway, and W. Mayol-Cuevas. Discovering higher level
    structure in visual SLAM. IEEE Trans. on Robotics , 24:980–990, 2008.
    [48] J. Gibson. The Perception of the Visual World . Houghton Mifflin, 1950.
    [49] J. Gibson. The ecological approach to the visual perception of pictures. Leonardo ,
    11:227–235, 1978.
    [50] J. J. Gibson. The information available in pictures. Leonardo , 4:27–35, 1971.
    [51] E. H. Gombrich. Interpretation: Theory and Practice , chapter The Evidence of
    Images: 1 The Variability of Vision, pages 35–68. 1969.
    [52] L. Gong, T. Wang, F. Liu, and G. Chen. A lie group based spatiogram similarity
    measure. In Proc. IEEE Int. Conf. Multimedia and Expo , 2009.
[53] R. Gregory. Perceptions as hypotheses. Philosophical Trans. Royal Society, B 290:
    181–197, 1980.
    [54] R. Gregory. Knowledge in perception and illusion. Philosophical Trans. Royal
    Society of London. Series B: Biological Sciences , 352(1358):1121–1127, 1997.
[55] R. Gregory and P. Heard. Border locking and the café wall illusion. Perception, 8:
    365–380, 1979.
[56] A. Gupta, A. Efros, and M. Hebert. Blocks world revisited: Image understanding
    using qualitative geometry and mechanics. In Proc. European Conf. Computer
    Vision , 2010.
    [57] O. Haines and A. Calway. Detecting planes and estimating their orientation from
    a single image. In Proc. British Machine Vision Conf. , 2012.
    [58] O. Haines and A. Calway. Estimating planar structure in single images by learning
    from examples. In Proc. Int. Conf. Pattern Recognition Applications and Methods ,
    2012.
[59] O. Haines, J. Martínez-Carranza, and A. Calway. Visual mapping using learned
    structural priors. In Proc. IEEE Int. Conf. Robotics and Automation , 2013.
[60] J. M. Hammersley and P. Clifford. Markov fields on finite graphs and lattices. Unpublished manuscript, 1971.
    [61] A. Handa, M. Chli, H. Strasdat, and A. Davison. Scalable active matching. In
    Proc. IEEE Conf. Computer Vision and Pattern Recognition , 2010.
    [62] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision . Cam-
    bridge University Press, 2003.
    [63] T. Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in
    Artificial Intelligence , 1999.
    [64] D. Hoiem, A. Efros, and M. Hebert. Automatic photo pop-up. ACM Trans.
    Graphics , 24(3):577–584, 2005.
    [65] D. Hoiem, A. Efros, and M. Hebert. Putting objects in perspective. In Proc. IEEE
    Conf. Computer Vision and Pattern Recognition , 2006.
    [66] D. Hoiem, A. Efros, and M. Hebert. Recovering surface layout from an image. Int.
    Journal of Computer Vision , 75(1):151–172, 2007.
    [67] D. Hoiem, A. Stein, A. Efros, and M. Hebert. Recovering occlusion boundaries
    from a single image. In Proc. Int. Conf. Computer Vision , 2007.
    [68] J. Hou, J. Kang, and N. Qi. On vocabulary size in bag-of-visual-words represen-
    tation. Advances in Multimedia Information Processing , pages 414–424, 2010.
    [69] M. Jones and P. Viola. Robust real-time object detection. In Proc. Workshop on
    Statistical and Computational Theories of Vision , 2001.
    [70] S. Kim, K. Yoon, and I. Kweon. Object recognition using a generalized robust in-
    variant feature and Gestalt’s law of proximity and similarity. Pattern Recognition ,
    41(2):726–741, 2008.
[71] G. Klein and D. Murray. Full-3D edge tracking with a particle filter. In Proc. British Machine Vision Conf., 2006.
    [72] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces.
    In Proc. Int. Symposium on Mixed and Augmented Reality , 2007.
[73] J. Košecká and W. Zhang. Extraction, matching, and pose recovery based on
    dominant rectangular structures. Computer Vision and Image Understanding , 100
    (3):274–293, 2005.
    [74] P. Koutsourakis, L. Simon, O. Teboul, G. Tziritas, and N. Paragios. Single view
    reconstruction using shape grammars for urban environments. In Proc. Int. Conf.
    Computer Vision , 2009.
    [75] J. Lavest, G. Rives, and M. Dhome. Three-dimensional reconstruction by zooming.
    IEEE Trans. Robotics and Automation , 9(2):196–207, 1993.
    [76] D. Lee and H. Seung. Learning the parts of objects by non-negative matrix fac-
    torization. Nature , 401(6755):788–791, 1999.
    [77] D. Lewis. Naive (Bayes) at forty: The independence assumption in information
    retrieval. In Proc. European Conf. Machine Learning , 1998.
    [78] S. Li. Markov Random Field Modeling in Image Analysis . Springer-Verlag New
    York Inc, 2009.
    [79] T. Lindeberg. Scale-space. Encyclopedia of Computer Science and Engineering , 4:
    2495–2504, 2009.
    [80] T. Lindeberg. Scale-space theory: A basic tool for analyzing structures at different
    scales. Journal of applied statistics , 21(1-2):225–270, 1994.
    [81] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. Journal
    of Computer Vision , 60(2):91–110, 2004.
    [82] D. G. Lowe. Object recognition from local scale-invariant features. In Proc. Int.
    Conf. Computer Vision , 1999.
    [83] D. Lyons. Sharing and fusing landmark information in a team of autonomous
    robots. In Proc. Society of Photo-Optical Instrumentation Engineers Conf. , 2009.
[84] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval.
    Cambridge University Press, 2008.
[85] J. Martínez-Carranza. Efficient Monocular SLAM by Using a Structure Driven
    Mapping . PhD thesis, University of Bristol, 2012.
[86] J. Martínez-Carranza and A. Calway. Appearance based extraction of planar struc-
    ture in monocular SLAM. In Proc. Scandinavian Conf. Image Analysis , 2009.
[87] J. Martínez-Carranza and A. Calway. Efficiently increasing map density in visual
    SLAM using planar features with adaptive measurement. In Proc. British Machine
    Vision Conf. , 2009.
[88] J. Martínez-Carranza and A. Calway. Unifying planar and point mapping in
    monocular SLAM. In Proc. British Machine Vision Conf. , 2010.
[89] J. Martínez-Carranza and A. Calway. Efficient visual odometry using a structure-
    driven temporal map. 2012.
    [90] P. Meer, D. Mintz, A. Rosenfeld, and D. Kim. Robust regression methods for
    computer vision: A review. Int. Journal of Computer Vision , 6(1):59–70, 1991.
    [91] C. Mei, G. Sibley, M. Cummins, P. Newman, and I. Reid. RSLAM: A system
    for large-scale mapping in constant-time using stereo. Int. Journal of Computer
    Vision , 2010.
    [92] J. Michels, A. Saxena, and A. Ng. High speed obstacle avoidance using monocular
    vision and reinforcement learning. In Proc. Int. Conf. Machine learning , 2005.
[93] B. Mičušík, H. Wildenauer, and J. Košecká. Detection and matching of rectilinear
    structures. In Proc. IEEE Conf. Computer Vision and Pattern Recognition , 2008.
[94] B. Mičušík, H. Wildenauer, and M. Vincze. Towards detection of orthogonal planes
    in monocular images of indoor environments. In Proc. IEEE Int. Conf. Robotics
    and Automation , 2008.
    [95] N. Molton, A. Davison, and I. Reid. Locally planar patch features for real-time
    structure from motion. In Proc. British Machine Vision Conf. , 2004.
    [96] R. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. Davison, P. Kohli,
    J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface
    mapping and tracking. In Proc. Int. Symposium on Mixed and Augmented Reality ,
    2011.
[97] C. Ó Conaire, N. O'Connor, and A. Smeaton. An improved spatiogram similarity measure for robust object localisation. In Proc. IEEE Int. Conf. Acoustics, Speech
    and Signal Processing , 2007.
    [98] J. Oliensis. Uniqueness in shape from shading. Int. Journal of Computer Vision ,
    6(2):75–104, 1991.
    [99] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic represen-
    tation of the spatial envelope. Int. Journal of Computer Vision , 42(3):145–175,
    2001.
    [100] A. Oliva and A. Torralba. Scene-centered description from spatial envelope prop-
    erties. In Proc. Biologically motivated computer vision , 2002.
[101] G. Orbán, J. Fiser, R. Aslin, and M. Lengyel. Bayesian model learning in human
    visual perception. Advances in neural information processing systems , 18:1043,
    2006.
    [102] M. Parsley and S. Julier. SLAM with a heterogeneous prior map. In Proc. SEAS-
    DTC Conf. , 2009.
    [103] T. Pietzsch. Planar features for visual SLAM. Advances in Artificial Intelligence ,
    pages 119–126, 2008.
    [104] M. Pupilli and A. Calway. Real-time camera tracking using known 3D models and
    a particle filter. In Proc. Int. Conf. Pattern Recognition , 2006.
    [105] A. Quattoni, S. Wang, L. Morency, M. Collins, and T. Darrell. Hidden conditional
    random fields. IEEE Trans. Pattern Analysis and Machine Intelligence , 29(10):
    1848–1852, 2007.
    [106] C. Rasmussen and J. Quinonero-Candela. Healing the relevance vector machine
    through augmentation. In Proc. Int. Conf. Machine learning , 2005.
    [107] E. Ribeiro and E. Hancock. Estimating the 3D orientation of texture planes using
    local spectral analysis. Image and Vision Computing , 18(8):619–631, 2000.
    [108] E. Ribeiro and E. Hancock. Estimating the perspective pose of texture planes
    using spectral analysis on the unit sphere. Pattern recognition , 35(10):2141–2163,
    2002.
    [109] L. Roberts. Machine perception of three-dimensional solids. Technical report,
    DTIC Document, 1963.
    [110] E. Rosten and T. Drummond. Machine learning for high-speed corner detection.
    Lecture Notes in Computer Science , 3951:430–443, 2006.
    [111] A. Saxena, J. Schulte, and A. Ng. Depth estimation using monocular and stereo
    cues. In Proc. Int. Joint Conf. Artificial Intelligence , 2007.
    [112] A. Saxena, S. Chung, and A. Ng. 3-d depth reconstruction from a single still image.
    Int. Journal of Computer Vision , 76(1):53–69, 2008.
    [113] A. Saxena, M. Sun, and A. Ng. Make3D: learning 3D scene structure from a
    single still image. IEEE Trans. Pattern Analysis and Machine Intelligence , pages
    824–840, 2008.
    [114] D. Scaramuzza, F. Fraundorfer, and R. Siegwart. Real-time monocular visual
    odometry for on-road vehicles with 1-point RANSAC. In Proc. IEEE Int. Conf.
    Robotics and Automation , 2009.
    [115] S. Scott and S. Matwin. Feature engineering for text classification. In Proc.
    Machine Learning Int. Workshop , 1999.
    [116] H. Shimodaira. A shape-from-shading method of polyhedral objects using prior
    information. IEEE Trans. Pattern Analysis and Machine Intelligence , 28(4):612–
    624, 2006.
    [117] G. Silveira, E. Malis, and P. Rives. An efficient direct approach to visual SLAM.
    IEEE Trans. on Robotics , 24(5):969–979, 2008.
    [118] D. A. Sinclair. S-Hull: a fast radial sweep-hull routine for Delaunay triangulation.
    Technical report, S-Hull, Cambridge UK, 2010.
    [119] H. Strasdat, J. Montiel, and A. Davison. Scale drift-aware large scale monocular
    SLAM. In Proc. Robotics Science and Systems , 2010.
    [120] H. Strasdat, J. Montiel, and A. Davison. Real-time monocular SLAM: Why filter?
    In Proc. IEEE Int. Conf. Robotics and Automation , 2010.
    [121] E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Depth from familiar
    objects: A hierarchical model for 3D scenes. In Proc. IEEE Conf. Computer
    Vision and Pattern Recognition , 2006.
    [122] B. Super and A. Bovik. Planar surface orientation from texture spatial frequencies.
    Pattern Recognition , 28(5):729–743, 1995.
    [123] A. Thayananthan, R. Navaratnam, B. Stenger, P. Torr, and R. Cipolla. Multi-
    variate relevance vector machines for tracking. In Proc. European Conf. Computer
    Vision , 2006.
    [124] S. Thorpe, D. Fize, and C. Marlot. Speed of processing in the human visual system.
    Nature , 381(6582):520–522, 1996.
    [125] M. Tipping. Sparse Bayesian learning and the relevance vector machine. The
    Journal of Machine Learning Research , 1, 2001.
    [126] M. Tipping and A. Faul. Fast marginal likelihood maximisation for sparse Bayesian
    models. In Proc. Int. Workshop on Artificial Intelligence and Statistics , 2003.
    [127] A. Torralba and A. Oliva. Depth estimation from image structure. IEEE Trans.
    Pattern Analysis and Machine Intelligence , 24(9):1226–1238, 2002.
    [128] A. Torralba and A. Oliva. Semantic organization of scenes using discriminant
    structural templates. In Proc. Int. Conf. Computer Vision , 1999.
    [129] A. Tremeau and N. Borel. A region growing and merging algorithm to color seg-
    mentation. Pattern Recognition , 30(7):1191–1203, 1997.
    [130] R. Unnikrishnan, C. Pantofaru, and M. Hebert. Toward objective evaluation of
    image segmentation algorithms. IEEE Trans. Pattern Analysis and Machine In-
    telligence , 29(6):929–944, 2007.
    [131] V. Viitaniemi and J. Laaksonen. Spatial extensions to bag of visual words. In
    Proc. ACM Int. Conf. Image and Video Retrieval , 2009.
    [132] S. Wangsiripitak and D. Murray. Reducing mismatching under time-pressure by
    reasoning about visibility and occlusion. Journal of Computer Vision , 60(2):91–
    110, 2004.
    [133] A. Witkin. Recovering surface shape and orientation from texture. Artificial In-
    telligence , 17(1-3):17–45, 1981.
    [134] J. Yang, Y. Jiang, A. Hauptmann, and C. Ngo. Evaluating bag-of-visual-words
    representations in scene classification. In Proc. Int. Workshop on Multimedia In-
    formation Retrieval , 2007.
    [135] J. Yoo and S. Choi. Orthogonal nonnegative matrix factorization: Multiplicative
    updates on Stiefel manifolds. Intelligent Data Engineering and Automated Learn-
    ing , pages 140–147, 2008.
    [136] M. Zucchelli, J. Santos-Victor, and H. Christensen. Multiple plane segmentation
    using optical flow. In Proc. British Machine Vision Conf. , pages 313–322, 2002.