UNIT 4
Triangulation
Triangulation is the technique of locating a point's projections in two or more 2D images in order to
determine its 3D position in space. This technique is crucial in tasks like 3D reconstruction, pose
estimation, and structure-from-motion. Triangulation operates on the intersection principle: the 3D point
is situated at the intersection of the lines of sight from the cameras through the corresponding 2D image
points.
Here is a brief explanation of how triangulation works:
Image Correspondence: First, you need to establish correspondences between points in different
images. These points are typically feature points, like keypoints, that can be reliably tracked in multiple
images.
Camera Calibration: It's crucial to have accurate camera calibration parameters for each camera, such
as the intrinsic parameters (focal length, principal point) and extrinsic parameters (position and
orientation of the camera in 3D space).
Epipolar Geometry: Use the epipolar geometry to find the epipolar lines in each image. An epipolar line
is the intersection of the epipolar plane (the plane through the two camera centres and the 3D point)
with an image plane, and it helps narrow down the search for the corresponding point.
Triangulation: For each pair of corresponding points in the two images, back-project a ray from each
camera centre through its image point. The 3D position of the point is then found at (or nearest to) the
intersection of these two rays. Several methods can be used to perform triangulation, such as algebraic
(midpoint) methods, iterative refinement, or the direct linear transformation (DLT).
Depth Estimation: Once the 3D position of the point is determined, you can calculate its depth with
respect to the cameras. This provides information about how far the point is from the cameras.
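In practice the two rays rarely intersect exactly because of noise, so the DLT solves a small least-squares
problem. Below is a minimal sketch using OpenCV's cv2.triangulatePoints; the projection matrices and the
matched image coordinates are hypothetical inputs chosen for illustration.

import numpy as np
import cv2

# Assumed inputs (hypothetical values): two projection matrices and matched
# normalized image coordinates (intrinsics already removed from the pixels).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                   # first camera at the origin
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])   # second camera, baseline along X

pts1 = np.array([[0.10, 0.10], [0.05, -0.02]]).T    # 2xN points seen in image 1
pts2 = np.array([[0.00, 0.10], [-0.05, -0.02]]).T   # 2xN corresponding points in image 2

# DLT triangulation: returns 4xN homogeneous 3D points
X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)
X = (X_h[:3] / X_h[3]).T           # Euclidean 3D coordinates, one row per point

# With the first camera at the origin, the Z coordinate is the depth w.r.t. that camera
print(X)                           # expected to be close to (1, 1, 10) and (0.5, -0.2, 10)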
Triangulation is used in various applications, such as 3D reconstruction from multiple images, pose
estimation (determining the camera's position and orientation in space), and augmented reality, among
others. It plays a vital role in converting 2D image data into 3D information, enabling the understanding
and manipulation of the three-dimensional world from a collection of two-dimensional images.
Two-Frame Structure from Motion
A two-frame structure is frequently used for a variety of tasks, especially motion analysis and object
tracking. It involves comparing two successive frames of a video stream in order to extract information
about the motion of objects or of the camera. An overview of the two-frame structure is given below:
Frame Acquisition: The first step involves capturing two consecutive frames from a video sequence.
These frames are typically represented as images, and they serve as the input for the subsequent
analysis.
Feature Detection: In the two-frame structure, the primary focus is on detecting and tracking key
features or objects between the two frames. These features could be points, corners, blobs, or any
distinctive parts of the image.
Feature Matching: After feature detection, the next step is to match the features between the two
frames. This is often done by finding points or key features in the second frame that correspond to those
in the first frame. Feature matching can be achieved using feature descriptors (e.g., SIFT, SURF, ORB)
together with a matching strategy, with outliers rejected by robust methods such as RANSAC.
Motion Estimation: Once corresponding features are identified, the computer vision system can
estimate the motion between the two frames. This involves calculating the translation, rotation, scaling, or
any other transformation that describes how the objects or the camera have moved between the frames.
Common techniques for motion estimation include optical flow (e.g., the Lucas-Kanade method) and
RANSAC-based methods for robust motion estimation.
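As a rough sketch of this two-frame pipeline, the snippet below detects ORB features in two consecutive
frames, matches them, and estimates the inter-frame motion as a homography with RANSAC. The file names
frame1.png and frame2.png are placeholders.

import cv2
import numpy as np

# Load two consecutive frames (placeholder file names)
img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# 1. Feature detection and description (ORB)
orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# 2. Feature matching (Hamming distance for binary ORB descriptors)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

# 3. Motion estimation: fit a homography between the frames with RANSAC
src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, ransacReprojThreshold=3.0)

print("Estimated inter-frame homography:\n", H)
print("Inlier ratio:", inliers.mean())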
Applications:
Object Tracking: The two-frame structure is fundamental for object tracking in video sequences. It
allows for tracking objects over time by continuously updating the position and motion of objects
between frames.
Optical Flow: Optical flow is a technique used to estimate the motion of objects in an image or scene
by analyzing the displacement of pixels between two frames.
Camera Stabilization: In the case of camera motion, the two-frame structure can be used to estimate
the camera's motion and compensate for it to produce stabilized video footage.
Challenges: One of the main challenges in using a two-frame structure is handling occlusions, abrupt
changes in motion, and feature matching problems. Object disappearance and re-appearance between
frames can make it challenging to maintain a consistent track.
Extensions: In more advanced applications, multi-frame structures (using more than two frames) are
used to improve the accuracy of motion estimation. These methods leverage information from multiple
frames to achieve more robust tracking and motion analysis.
The two-frame structure serves as the foundation for numerous algorithms and applications in motion
analysis, object tracking, and scene understanding in video sequences.
Projective Reconstruction
Projective reconstruction is a technique for recreating a 3D scene from 2D images. It is especially useful
when dealing with multiple-view geometry, and it seeks to recover the 3D geometry of the world or of
objects from a set of 2D images taken from various viewpoints. This is essential for applications such as
3D modelling, augmented reality, and computer vision tasks like object tracking and recognition.
Projective reconstruction refers to the computation of the structure of a scene from images taken with
uncalibrated cameras, resulting in a scene structure and camera motion that may differ from the true
geometry by an unknown 3D projective transformation.
Suppose that a set of interest points are identified and matched (or tracked) in several images. The
configuration of the corresponding 3D points and the locations of the cameras that took these images
are assumed to be unknown. The task of reconstruction is to determine the values of these unknown
quantities.
Formally, assume that a set of image points {x_ij} are known, where x_ij represents the image coordinates
of the j-th point seen in the i-th image. It is generally not required that every point's location be known
in every image, so only a subset of all possible x_ij are given. The Structure from Motion (SfM) problem
is to determine the camera projection matrices P_i and the 3D point locations X_j such that the projection
of the j-th point in the i-th image is the measured x_ij. Assuming a pinhole (projective) camera model,
this relationship is expressed linearly as

x_ij = P_i X_j    (1)

where P_i is a 3 × 4 matrix of rank 3, X_j and x_ij are expressed in homogeneous coordinates, and the
equality is intended to hold only up to an unknown scale factor λ_ij. More precisely, therefore, the
projection equation is

λ_ij x_ij = P_i X_j    (2)
In the SfM problem, cameras Pi and points Xj are to be determined, given only the point
correspondences.
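A quick numerical illustration of equation (2), with a hypothetical projection matrix and 3D point:
projecting X_j with P_i yields a homogeneous vector whose third component is exactly the unknown scale λ_ij.

import numpy as np

# Hypothetical camera matrix P_i (3x4, rank 3) and 3D point X_j in homogeneous coordinates
P_i = np.array([[800.0,   0.0, 320.0, 0.0],
                [  0.0, 800.0, 240.0, 0.0],
                [  0.0,   0.0,   1.0, 0.0]])
X_j = np.array([0.5, -0.2, 4.0, 1.0])            # (X, Y, Z, 1)

proj = P_i @ X_j                                  # homogeneous image point, = lambda_ij * x_ij
lam = proj[2]                                     # the unknown scale factor lambda_ij
x_ij = proj / lam                                 # normalized pixel coordinates (u, v, 1)

print("lambda_ij =", lam)
print("x_ij =", x_ij)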
Self-Calibration
A camera is a device that converts the 3D world into a 2D image: it captures a three-dimensional scene
and stores it as a two-dimensional image. The mathematics behind this projection can be captured by the
following equation.
x = PX
Here x denotes a 2D image point, P denotes the camera matrix, and X denotes a 3D world point.
Figure: vector representation of x = PX [1]
Self-calibration is the process of calculating a camera's intrinsic parameters using only the
information available in the images taken by that camera.
Self-calibration is a powerful method that allows researchers to recover 3D models from image
sequences. It's performed by:
• Calculating all the intrinsic parameters of the camera using only the information available in
the images taken by that camera
• Automatically matching images using the SfM algorithm
• Estimating the calibration parameters with photogrammetric bundle adjustment
• Recovering both the internal and external parameters by using correspondences between three
images
• Obtaining matching pairs of points in two images by corner and vertex detection
A camera is often characterized by a set of intrinsic parameters such as:
• Axis skew
• Focal length
• Principal point
The camera's position and orientation are expressed by the extrinsic parameters: rotation and translation.
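As a sketch under assumed parameter values, the snippet below assembles the intrinsic matrix K from the
focal length, principal point, and skew, combines it with a rotation and translation into P = K [R | t],
and projects a world point with x = PX.

import numpy as np

# Assumed intrinsic parameters (hypothetical values)
fx, fy = 800.0, 800.0          # focal lengths in pixels
cx, cy = 320.0, 240.0          # principal point
s = 0.0                        # axis skew

K = np.array([[fx,  s, cx],
              [ 0, fy, cy],
              [ 0,  0,  1]])

# Assumed extrinsic parameters: identity rotation, small translation
R = np.eye(3)
t = np.array([[0.1], [0.0], [0.0]])

# Full camera matrix P = K [R | t]
P = K @ np.hstack([R, t])

# Project a 3D world point X (homogeneous coordinates) with x = P X
X = np.array([0.5, -0.2, 4.0, 1.0])
x = P @ X
x = x / x[2]                    # normalize so the last coordinate is 1
print("Projected pixel:", x[:2])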
Perspective and Projective Factorization
Computer vision techniques such as perspective and projective factorization are applied to tasks including
camera calibration, structure from motion, and 3D reconstruction. These techniques aim to extract 3D
information from 2D images taken by one or more cameras. A summary of perspective and projective
factorization is given below:
Perspective Factorization: Perspective factorization is applied to retrieve the intrinsic and extrinsic
camera parameters. These parameters include the camera's position and orientation in 3D space, focal
length, principal point, lens distortion, and others. This method is primarily used for single-camera
calibration and assumes a pinhole camera model. The main steps of perspective factorization are:
• Image point correspondence: Find corresponding points across several images that can be used to
determine the camera parameters.
• Perspective projection: Based on the perspective projection equations, use the correspondences to
estimate the intrinsic and extrinsic camera parameters.
Projective Factorization: By extending the concept of perspective factorization to multiple cameras,
projective factorization is appropriate for multi-view geometry and 3D reconstruction. It addresses
projective distortions in cameras rather than just pure perspective. Projective factorization involves the
following key steps:
• Image point correspondence: Identify corresponding points in multiple images.
• Homography estimation: Calculate homographies (2D projective transformations) that map
points from one image to another. These homographies capture the projective distortions (a minimal
sketch is given after this list).
• Factorization: Factorize the homography matrices to obtain the intrinsic and extrinsic
parameters for each camera.
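A minimal sketch of the homography-estimation step, assuming matched point sets pts1 and pts2 are
already available (the coordinates below are hypothetical):

import numpy as np
import cv2

# Hypothetical matched points between two images (at least 4 pairs are required)
pts1 = np.array([[10, 10], [200, 15], [205, 180], [12, 175], [100, 90]], dtype=np.float32)
pts2 = np.array([[14, 12], [210, 20], [212, 190], [15, 183], [106, 95]], dtype=np.float32)

# Estimate the 2D projective transformation (homography) with RANSAC
H, mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, ransacReprojThreshold=3.0)
print("Estimated homography:\n", H)

# Map a point from image 1 into image 2 using the homography
p = np.array([100.0, 90.0, 1.0])
q = H @ p
print("Mapped point:", q[:2] / q[2])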
These techniques are frequently used in tasks like 3D reconstruction from multiple photos or video
sequences, and they are necessary for applications such as scene modelling, camera tracking, and
augmented reality. Some important considerations:
• The simplified pinhole camera model used in perspective factorization might not accurately
represent all the intricacies of actual cameras. It is primarily employed for calibration of a single
camera.
• Projective factorization is more flexible and can account for projective distortions in multi-camera
setups. It is especially helpful for calibrating and reconstructing scenes with non-pinhole cameras,
such as those with wide-angle or fisheye lenses.
• Both methods depend on precise correspondences between image points, which can be difficult to
obtain in practice, especially when noise and occlusions are present.
• While these techniques provide valuable tools for computer vision tasks, modern approaches
often combine them with other methods and algorithms, such as bundle adjustment and feature
matching, to improve accuracy and robustness in 3D reconstruction and camera calibration.
Bundle Adjustment
Bundle adjustment is a key method in 3D reconstruction. It is used to refine the parameters of a 3D
scene, such as the 3D structure of objects and the camera poses, by minimising the difference between
the observed 2D image points and the projections of their corresponding 3D points. Structure-from-motion,
visual SLAM (Simultaneous Localization and Mapping), 3D reconstruction, augmented reality, and other
applications all depend on this technique.
Here's how bundle adjustment works and why it's important:
Input Data: Bundle adjustment takes as input a set of images (usually from multiple cameras) and a
set of 2D-3D correspondences. These correspondences are typically feature points or keypoints
identified in the images and associated with their 3D coordinates.
Initial Parameters: It starts with an initial estimate of the camera poses (extrinsic parameters), the
camera intrinsics, and the 3D point positions. These initial parameters may be obtained through
techniques like Structure-from-Motion or other calibration methods.
Cost Function: Bundle adjustment formulates an optimization problem whose goal is to minimize a cost
function that quantifies the error between the observed 2D image points and the projections of the
estimated 3D points. This error is commonly called the reprojection error.
Nonlinear Optimization: Since the cost function is typically nonlinear, bundle adjustment uses
nonlinear optimization techniques, such as Levenberg-Marquardt, Gauss-Newton, or other optimization
algorithms, to iteratively refine the camera poses and 3D points.
Iterations: Bundle Adjustment performs multiple iterations, each of which refines the parameters to
minimize the cost function. It iterates until convergence is reached, or a certain stopping criterion is
met.
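A toy sketch of this optimization, assuming two views with known intrinsics and using
scipy.optimize.least_squares for the iterative nonlinear refinement; all values below are illustrative, and
the first camera is held fixed to remove gauge freedom.

import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

# Toy bundle adjustment: two views of a small set of 3D points. The first camera is held
# fixed at the origin; the second camera's pose and all 3D points are refined jointly.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
rng = np.random.default_rng(0)
true_pts = rng.uniform([-1, -1, 4], [1, 1, 6], size=(8, 3))
true_rvec2, true_t2 = np.array([0.0, 0.1, 0.0]), np.array([-0.5, 0.0, 0.0])

def project(rvec, tvec, pts3d):
    R = Rotation.from_rotvec(rvec).as_matrix()
    cam = (R @ pts3d.T).T + tvec          # world -> camera coordinates
    img = (K @ cam.T).T
    return img[:, :2] / img[:, 2:3]       # perspective division

# Simulated noisy observations in both images (stand-ins for detected keypoints)
obs1 = project(np.zeros(3), np.zeros(3), true_pts) + rng.normal(0, 0.5, (8, 2))
obs2 = project(true_rvec2, true_t2, true_pts) + rng.normal(0, 0.5, (8, 2))

def residuals(params):
    rvec2, t2 = params[:3], params[3:6]
    pts3d = params[6:].reshape(-1, 3)
    r1 = project(np.zeros(3), np.zeros(3), pts3d) - obs1   # reprojection error, view 1
    r2 = project(rvec2, t2, pts3d) - obs2                  # reprojection error, view 2
    return np.concatenate([r1.ravel(), r2.ravel()])

# Perturbed initial guesses, as would come from an SfM initialization
x0 = np.hstack([true_rvec2 + 0.05, true_t2 + 0.05, (true_pts + 0.1).ravel()])
result = least_squares(residuals, x0)            # iterative nonlinear refinement
print("RMS reprojection error:", np.sqrt(np.mean(result.fun ** 2)))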
Challenges: Bundle Adjustment can be computationally intensive, especially for large-scale
reconstructions. Therefore, efficient implementations and parallelization are often used. Robust
techniques are also employed to handle outliers and inaccuracies in the data.
Output: The final output of bundle adjustment is a more accurate estimation of the 3D scene's structure
and camera poses. This refined 3D model can be used for various applications, such as 3D visualization,
virtual reality, or robotics.
Exploiting Sparsity
"Exploiting sparsity" means taking advantage of the fact that much visual data, and many of the
algorithms that process it, are sparse, meaning they contain a considerable number of zero values or
other kinds of emptiness. This idea is especially important in activities like image and video processing,
where the data can be extremely high-dimensional and computationally expensive to handle. The efficiency
and effectiveness of computer vision algorithms can be greatly increased by recognising and exploiting
the sparse nature of visual data. Ways to take advantage of sparsity include the following:
Sparse Representations:
Use sparse data representations: represent images and visual data with sparse formats, such as
compressed sensing, sparse coding, or sparse dictionaries, to reduce the amount of storage and
processing needed.
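As a small illustration, a mostly-zero array can be stored in a compressed sparse format; the snippet
below uses scipy.sparse on made-up data.

import numpy as np
from scipy import sparse

# A made-up "feature map" in which the vast majority of entries are zero
dense = np.zeros((1000, 1000), dtype=np.float32)
idx = np.random.default_rng(0).integers(0, 1000, size=(10000, 2))
dense[idx[:, 0], idx[:, 1]] = 1.0

# Compressed Sparse Row storage keeps only the nonzero values and their indices
sp = sparse.csr_matrix(dense)
print("Dense size (bytes):  ", dense.nbytes)
print("Sparse size (bytes): ", sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes)

# Operations such as matrix-vector products skip the zero entries entirely
v = np.ones(1000, dtype=np.float32)
result = sp @ v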
Sparse Filtering:
Utilise filters that take the sparsity of the data into account, such as sparse convolutional filters, to enable
convolutional neural networks (CNNs) to extract features with fewer operations.
Sparse Sampling:
Employ sparse sampling strategies: Capture and process only a subset of the data, particularly when
dealing with high-resolution images or video frames. This reduces the computational load while
preserving critical information.
Sparse Data Augmentation:
Augment sparse data: Generate synthetic data to fill in sparse regions, either through interpolation or
by leveraging techniques like Generative Adversarial Networks (GANs).
Sparse Depth and Motion Information:
Use depth and motion data: In addition to RGB information, incorporate depth or motion data to identify
and track objects, particularly in scenarios like 3D reconstruction and object tracking.
Sparse Optical Flow:
Utilize sparse optical flow: Instead of dense optical flow, consider sparse optical flow methods for
estimating motion in videos. These methods focus on tracking specific feature points, reducing the
computational load.
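A minimal sketch of sparse optical flow with OpenCV's pyramidal Lucas-Kanade tracker, tracking only
corner points between two frames (the file names are placeholders):

import cv2
import numpy as np

prev = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)   # placeholder file names
curr = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# Track only a sparse set of good corner points instead of every pixel
pts = cv2.goodFeaturesToTrack(prev, maxCorners=200, qualityLevel=0.01, minDistance=7)

# Pyramidal Lucas-Kanade: estimates where each point moved in the next frame
next_pts, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, pts, None,
                                                 winSize=(21, 21), maxLevel=3)

# Keep only successfully tracked points and compute their displacement vectors
good_old = pts[status.flatten() == 1].reshape(-1, 2)
good_new = next_pts[status.flatten() == 1].reshape(-1, 2)
flow = good_new - good_old
print("Tracked", len(flow), "points; mean displacement:", flow.mean(axis=0))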
Sparse Sampling of Training Data:
Train on sparse data: When training machine learning models, use a sparse subset of the available
training data, which can help speed up the training process without sacrificing performance.
Constrained Structure and Motion
Constrained Structure and Motion (CSM) is a fundamental computer vision problem whose core task is
the estimation of 3D scene structure and camera motion from a series of 2D images or video frames.
Numerous applications, including robotics, augmented reality, and 3D reconstruction, rely heavily on this
problem. To make it tractable and to produce more accurate results, CSM often imposes a number of
constraints.
The following are some essential concepts and constraints related to Constrained Structure and Motion:
• Epipolar Geometry
• Calibration Constraints
• Scale Ambiguity
• Triangulation
• Bundle Adjustment
• Outlier Rejection
• Sparse vs. Dense Reconstruction
• Temporal Consistency
• Non-Rigid Structure and Motion
• Sensor Fusion
Hierarchical Motion Estimation
Hierarchical motion estimation is a method used to estimate the motion of an object or point within a
video sequence. It divides the motion estimation procedure into a number of levels or hierarchies, each
with a different level of accuracy and detail. This method is particularly helpful when dealing with
complicated scenes or large displacements, or when real-time performance is desired.
Hierarchical estimation of the motion vector field (also known as pyramid search) is a widely applied
approach to motion estimation. It offers low computational complexity and high efficiency, plus a large
degree of flexibility in the trade-off between the two.
Hierarchical motion estimation has the following advantages:
• Low computational complexity
• High efficiency
• Robust estimation of high-motion content
• More homogeneous block-motion fields
• Better representation of the true motion in the frame
The basic components of the hierarchical motion estimation framework are (a coarse-to-fine sketch is
given after this list):
• Pyramid construction
• Motion estimation
• Image warping
• Coarse-to-fine refinement
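The sketch below is a simplified coarse-to-fine scheme: it builds a Gaussian pyramid with cv2.pyrDown and
refines a Farneback flow estimate level by level. The explicit warping step of the full framework is
approximated here by passing the coarser flow as the initial estimate, and the file names are placeholders.

import cv2
import numpy as np

def coarse_to_fine_flow(img1, img2, levels=3):
    """Coarse-to-fine dense flow: estimate at the top of a Gaussian pyramid,
    then upsample and refine at each finer level. An illustrative sketch."""
    # 1. Pyramid construction
    pyr1, pyr2 = [img1], [img2]
    for _ in range(levels - 1):
        pyr1.append(cv2.pyrDown(pyr1[-1]))
        pyr2.append(cv2.pyrDown(pyr2[-1]))

    flow = None
    # 4. Coarse-to-fine refinement: iterate from the smallest images to the largest
    for a, b in zip(reversed(pyr1), reversed(pyr2)):
        if flow is None:
            flow = np.zeros((*a.shape, 2), np.float32)
        else:
            # Upsample the previous level's flow and double its magnitude
            flow = 2.0 * cv2.resize(flow, (a.shape[1], a.shape[0]))
        # 2-3. Motion estimation: the current flow is passed as an initial guess,
        # so Farneback only has to estimate the residual motion at this level
        flow = cv2.calcOpticalFlowFarneback(a, b, flow, pyr_scale=0.5, levels=1,
                                            winsize=15, iterations=3, poly_n=5,
                                            poly_sigma=1.2,
                                            flags=cv2.OPTFLOW_USE_INITIAL_FLOW)
    return flow

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)   # placeholder file names
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)
flow = coarse_to_fine_flow(img1, img2)
print("Flow field shape:", flow.shape)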
Fourier Based Alignment
The goal in this topic is to perform color channel alignment using the Fourier transform. As discussed
earlier, convolution in the spatial domain corresponds to multiplication in the frequency domain.
Further, the Fast Fourier Transform algorithm computes a transform in O(NM log NM) operations
for an N by M image. As a result, Fourier-based alignment may provide an efficient alternative to
sliding-window alignment approaches for high-resolution images.
Similarly, you will perform color channel alignment on the same set of six low-resolution input
images and three high-resolution images. You can use the same preprocessing to split the data
into individual color channels. You should use only the original input scale (not the multiscale pyramid)
for both high-resolution and low-resolution images in Fourier-based alignment.
Algorithm outline
The Fourier-based alignment algorithm consists of the following steps:
1. For two color channels C1 and C2, compute corresponding Fourier transforms FT1 and FT2.
2. Compute the conjugate of FT2 (denoted as FT2*), and compute the product of FT1 and FT2*.
3. Take the inverse Fourier transform of this product and find the location of the maximum value
in the output image. Use the displacement of the maximum value to obtain the offset of C2
from C1.
To colorize a full image, you will need to choose a base color channel, and run the above algorithm
twice to align the other two channels to the base.
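A minimal NumPy sketch of this cross-correlation-by-FFT procedure, assuming two equal-sized
single-channel arrays c1 and c2:

import numpy as np

def fourier_align(c1, c2):
    """Estimate the (row, col) offset of channel c2 relative to channel c1
    via cross-correlation computed in the frequency domain."""
    ft1 = np.fft.fft2(c1)                       # step 1: Fourier transforms
    ft2 = np.fft.fft2(c2)
    product = ft1 * np.conj(ft2)                # step 2: FT1 * conjugate(FT2)
    corr = np.real(np.fft.ifft2(product))       # step 3: inverse transform
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap large displacements to negative offsets (the FFT output is cyclic)
    if dy > c1.shape[0] // 2:
        dy -= c1.shape[0]
    if dx > c1.shape[1] // 2:
        dx -= c1.shape[1]
    return dy, dx

# Usage: align channel c2 to base channel c1, then shift c2 by the offset
# c1, c2 = ...  (equal-sized 2D arrays for two color channels)
# dy, dx = fourier_align(c1, c2)
# aligned_c2 = np.roll(np.roll(c2, dy, axis=0), dx, axis=1)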
Color channel pre-processing: Applying the Fourier-based alignment to the image color channels
directly may not be sufficient to align all the images. To address any faulty alignments, try sharpening
the inputs or applying a small Laplacian of Gaussian filter to highlight edges in each color channel.
Submission Checklist
In your report, you should provide the following for each of the six low-resolution and three
high-resolution images:
• Final aligned output image
• Displacements for color channels
• Inverse Fourier transform output visualization for both channel alignments without preprocessing
• Inverse Fourier transform output visualization for both channel alignments with any sharpening
or filter-based preprocessing you applied to color channels
Incremental Refinement
The term "incremental refinement" describes the process by which a system incrementally improves its
comprehension of visual data, frequently through a series of processing steps or learning stages. With
complicated and noisy visual data, this method is used to gradually improve the accuracy and
dependability of computer vision systems.
The following are some essential features of computer vision's progressive refinement:
Multi-Stage Processing: In many computer vision tasks, such as object recognition or scene
understanding, the initial processing stage might provide a rough estimate or preliminary results.
Subsequent stages then work on refining and improving these results. For example, an object detection
system might first detect candidate objects in an image, and then refine the bounding boxes and classify
the objects within those boxes in subsequent stages.
Feedback Loops: Incremental refinement often involves feedback loops where the output of one stage
is used to guide or correct the processing of subsequent stages. This feedback loop allows the system to
learn from its mistakes and make incremental improvements. For example, if an initial object detection
stage misclassifies an object, the system can learn from that mistake and adjust its model accordingly.
Deep Learning and Transfer Learning: Deep learning models are commonly used in computer
vision, and they can be incrementally refined by fine-tuning on new data. Transfer learning, which
involves using pre-trained models and fine-tuning them on specific tasks, is an example of how models
can be incrementally improved for new tasks.
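A brief sketch of such fine-tuning with a pre-trained backbone, using torchvision as one possible example
(API names as in recent torchvision releases; the number of classes and the data batch are placeholders):

import torch
import torch.nn as nn
from torchvision import models

# Load a backbone pre-trained on ImageNet and adapt it to a new task
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained layers so only the new head is refined initially
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification layer for a hypothetical 5-class task
model.fc = nn.Linear(model.fc.in_features, 5)

# Only the new head's parameters are passed to the optimizer
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch (real data would come from a DataLoader)
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 5, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print("Loss:", loss.item())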
Data Augmentation and Data Collection: Collecting additional data and augmenting the existing
dataset with more diverse examples is a common strategy for incremental refinement. This helps the
model generalize better and improve its accuracy.
Semi-Supervised Learning: In situations where acquiring labelled data is expensive or time-
consuming, semi-supervised learning can be used for incremental refinement. The model can start with
a small labelled dataset and then iteratively improve by actively selecting and labelling more data points
based on its current performance.
Online Learning: Some computer vision systems are designed to operate in real-time or on streaming
data. In such cases, they can use online learning methods, where the model is updated with new data as
it becomes available, allowing for continuous refinement.
Human-in-the-Loop: In some scenarios, human annotators or experts are involved in the incremental
refinement process. For instance, if a computer vision system is used for medical diagnosis, a radiologist
might review and correct the system's findings, helping it improve over time.
Adaptive Models: Models can adapt to changing conditions over time. For example, a computer vision
system deployed in autonomous vehicles may need to adapt to varying lighting and weather conditions,
and it does so incrementally to improve its performance in different situations.