RMK Group 21CS905 CV Unit 5
Please read this disclaimer before proceeding:
This document is confidential and intended solely for the educational purposes of RMK Group of Educational Institutions. If you have received this document through e-mail in error, please notify the system manager. This document contains proprietary information and is intended only for the respective group / learning community as addressed. If you are not the addressee, you should not disseminate, distribute, or copy it through e-mail. Please notify the sender immediately by e-mail if you have received this document by mistake and delete it from your system. If you are not the intended recipient, you are notified that disclosing, copying, distributing, or taking any action in reliance on the contents of this information is strictly prohibited.
21CS905 COMPUTER VISION
Department: ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
Batch/Year: BATCH 2021-25 / IV
Created by: Dr. V. Seethalakshmi, Associate Professor, AI&DS, RMKCET
Date: 18-09-2024
Table of Contents
1. Contents
2. Course Objectives
6. CO-PO/PSO Mapping
7. Lecture Plan (S. No., Topic, No. of Periods, Proposed Date, Actual Lecture Date, pertaining CO, Taxonomy Level, Mode of Delivery)
8. Activity Based Learning
9. Lecture Notes (with links to videos, e-book references, PPTs, quizzes and other learning materials)
10. Assignments (for higher-level learning and evaluation; examples: case study, comprehensive design, etc.)
11. Part A Q & A (with K level and CO)
18. Mini Project
COURSE OBJECTIVES
PRE-REQUISITES
Syllabus
UNIT IV 3D RECONSTRUCTION 9
Shape from X - Active rangefinding - Surface representations - Point-based representations - Volumetric representations - Model-based reconstruction - Recovering texture maps and albedos.
UNIT V IMAGE-BASED RENDERING AND RECOGNITION 9
View interpolation - Layered depth images - Light fields and Lumigraphs - Environment mattes - Video-based rendering - Object detection - Face recognition - Instance recognition - Category recognition - Context and scene understanding - Recognition databases and test sets.
Course Outcomes
(Course Outcome | Description | Knowledge Level)
Knowledge levels: K1 - Knowledge, K2 - Comprehension, K3 - Application, K4 - Analysis, K5 - Synthesis, K6 - Evaluation
CO – PO/PSO Mapping
CO – PO/PSO Mapping Matrix
Columns: PO1–PO12, PSO1–PSO3
CO1 | 3 2 1 1 3
CO2 | 3 3 2 2 3
CO3 | 3 3 1 1 3
CO4 | 3 3 1 1 3
CO5 | 3 3 1 1 3
CO6 | 2 2 1 1 3
UNIT V – IMAGE-BASED RENDERING AND RECOGNITION
Lecture Plan
Lecture Plan – Unit 5 – IMAGE-BASED RENDERING AND RECOGNITION
Sl. No. | Topic | No. of Periods | CO | Taxonomy Level | Mode of Delivery
1 | View interpolation, Layered depth images | 1 | CO5 | K3 | Blackboard / ICT Tools
2 | Light fields and Lumigraphs | 1 | CO5 | K3 | Blackboard / ICT Tools
9 | Context and scene understanding | 1 | CO5 | K4 | Blackboard / ICT Tools
10 | Recognition databases and test sets | 1 | CO5 | K4 | Blackboard / ICT Tools
Activity Based Learning
UNIT 5 – IMAGE-BASED RENDERING AND RECOGNITION: Contents
1. View interpolation and layered depth images
2. Light fields and Lumigraphs
3. Environment mattes
4. Video-based rendering
5. Object detection
6. Face recognition
7. Instance recognition
8. Category recognition
9. Context and scene understanding
10. Recognition databases and test sets
UNIT V IMAGE-BASED RENDERING AND RECOGNITION
1. VIEW INTERPOLATION AND LAYERED DEPTH IMAGES
View interpolation is based on the premise that given two or more images of the same
scene from different viewpoints, intermediate views can be generated using geometric
and image warping techniques.
To handle depth information, the interpolation must also adjust each pixel's location according to its depth value, so that nearby surfaces shift more between views than distant ones (parallax).
A Layered Depth Image (LDI) stores multiple depth layers at each pixel location,
representing surfaces that appear at different depths along the same viewing ray. This is
useful for handling occlusions in view interpolation, as the multiple layers allow for more
accurate representation of the 3D scene.
Each pixel in an LDI stores a list of depth values and corresponding color values.
Multiple depth layers are useful for handling the occlusion problem because the
frontmost layer may not always be visible in a new view.
LDI Representation:
Here is a simplified diagram that illustrates the concept of view interpolation using layered depth images:
[Diagram: viewing rays from Viewpoint A and Viewpoint B passing through layered depth samples stored at each pixel, converging on an intermediate viewpoint.]
This diagram shows how multiple depth layers along the viewing rays from different
viewpoints are used to interpolate the view in between them. The scene information is
stored as multiple layers of depth and color at each pixel.
Imagine a scenario where you have two images of a building taken from two different
angles. The images show occlusions, such as parts of the building that are hidden from
one viewpoint but visible from the other. Using LDIs, you store multiple depth values at
each pixel, which allows you to synthesize a new view where previously hidden parts of
the scene become visible as you change the viewpoint.
1. Capture two or more images of the scene from different viewpoints, together with per-pixel depth (or disparity) estimates.
2. Construct LDIs by storing color and depth values at multiple layers for each pixel.
3. Use the interpolation formula to blend the images and adjust pixel locations based on depth to generate an intermediate view (a simplified sketch follows).
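The sketch below is illustrative only: it assumes rectified views, a single depth layer per pixel, and horizontal parallax proportional to disparity; the function names and the naive splatting scheme are not taken from any specific system.

import numpy as np

def forward_warp(img, disp, shift):
    """Splat each pixel 'shift * disparity' pixels along x (a very naive warp)."""
    h, w, c = img.shape
    out = np.zeros((h, w, c), dtype=np.float64)
    hits = np.zeros((h, w, 1), dtype=np.float64)
    for y in range(h):
        for x in range(w):
            xn = int(round(x + shift * disp[y, x]))
            if 0 <= xn < w:
                out[y, xn] += img[y, x]
                hits[y, xn] += 1.0
    return out, hits

def interpolate_views(img_a, disp_a, img_b, disp_b, t):
    """Synthesize an intermediate view at t in [0, 1] between view A (t=0) and view B (t=1)."""
    warped_a, hits_a = forward_warp(img_a.astype(np.float64), disp_a, +t)        # A shifts toward B
    warped_b, hits_b = forward_warp(img_b.astype(np.float64), disp_b, -(1 - t))  # B shifts toward A
    blend = (1 - t) * warped_a + t * warped_b
    hits = (1 - t) * hits_a + t * hits_b
    return (blend / np.maximum(hits, 1e-6)).astype(img_a.dtype)

An LDI-based version would keep several (depth, color) samples per pixel and warp each layer separately, so that surfaces occluded in one input can still be filled in from another layer.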
1.4 Case Study: The "Tour into the Picture"
One of the classic case studies is the "Tour into the Picture" application where a user
explores a 3D scene using 2D images. This technique was effectively used for generating
smooth transitions between captured images of a real-world scene. The LDIs allow for
rendering of complex scenes with occlusions, and intermediate views can be synthesized
dynamically as the user moves through the virtual space. By using multiple depth layers
at each pixel, this system manages occlusions effectively and provides smooth visual
transitions.
2. LIGHT FIELDS AND LUMIGRAPHS
Light Fields and Lumigraphs are powerful techniques used in computer vision and
computer graphics to capture and render complex visual scenes. They allow for the
realistic reproduction of a scene by capturing how light travels through every point in
space, and how it changes depending on the viewer's perspective. Here's an explanation
of both:
2.1 Light Fields
A light field describes the amount of light traveling in every direction through every point
in space. Essentially, it encodes the light's intensity and color at each point in space and
at every direction. This method allows for the rendering of images from arbitrary
viewpoints without needing to explicitly model the 3D geometry of the scene.
The concept of the light field is mathematically formalized using the plenoptic function, which describes the intensity of light rays in a scene as a function of several variables: the 3D position of the observer (x, y, z), the viewing direction (θ, φ), the wavelength λ, and time t, i.e. P(x, y, z, θ, φ, λ, t).
In a simplified form, the light field is typically reduced to a 4D function L(u, v, s, t), assuming the light is constant along a ray and fixed in wavelength and time; (u, v) and (s, t) are the coordinates where the ray crosses two parallel reference planes (the two-plane parameterization).
This 4D representation allows for the synthesis of novel views of the scene, essentially
capturing both spatial and angular information about the light.
Light field cameras (plenoptic cameras): These cameras have a microlens array in
front of the sensor, which allows them to capture multiple light rays entering from
different angles. Lytro cameras are an example.
Multi-camera arrays: By arranging many cameras in an array, the light field can be
approximated by taking multiple images from slightly different viewpoints.
Once a light field is captured, it can be used to render images from any viewpoint within
the captured field. The advantage is that no explicit 3D geometry reconstruction is
needed, and the scene can be rendered directly by sampling the appropriate rays from
the light field data.
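A toy sketch of that idea, assuming the light field is stored as a NumPy array over an S x T grid of captured views (real systems resample individual rays rather than blending whole images):

import numpy as np

def render_from_light_field(lf, s, t):
    """Approximate a novel view at fractional camera-grid position (s, t).

    lf: array of shape (S, T, U, V, 3) holding the images captured by an
        S x T grid of cameras (a two-plane light field).
    The four nearest captured views are blended bilinearly, which stands in
    for resampling the individual rays of the 4D light field.
    """
    S, T = lf.shape[:2]
    s0, t0 = int(np.floor(s)), int(np.floor(t))
    s1, t1 = min(s0 + 1, S - 1), min(t0 + 1, T - 1)
    ws, wt = s - s0, t - t0
    view = ((1 - ws) * (1 - wt) * lf[s0, t0] +
            (1 - ws) * wt * lf[s0, t1] +
            ws * (1 - wt) * lf[s1, t0] +
            ws * wt * lf[s1, t1])
    return view

# Example: a 4 x 4 grid of 64 x 64 random "views"; render between cameras.
lf = np.random.rand(4, 4, 64, 64, 3)
novel = render_from_light_field(lf, s=1.3, t=2.6)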
2.2 Lumigraphs
The Lumigraph is an extension of the light field concept, but it incorporates geometric
information in the scene, providing more flexibility and improving the rendering quality
when handling complex visual phenomena such as occlusions.
Like light fields, which encode the light information for each ray in space, Lumigraphs provide an efficient way to store and render a scene; in addition, they take advantage of a coarse geometric model of the scene to better interpolate between views and reduce visual artifacts.
1. Scene Sampling: Like light fields, Lumigraphs capture a dense set of images from
multiple viewpoints around the object or scene.
2. Coarse Geometry: Instead of treating the scene as purely a collection of light rays, a
rough 3D geometric model of the scene is constructed.
3. View Interpolation: When rendering a new viewpoint, the system uses the coarse
geometry to determine how to sample rays from the stored light field, adjusting for
occlusions and depth discontinuities.
Given a coarse geometry model, Lumigraphs can improve image quality by reducing distortions using view-dependent textures. The plenoptic function for a Lumigraph incorporates this approximate geometry: rays are intersected with the geometric proxy so that samples from neighboring camera views are depth-corrected before they are blended.
Capturing a Lumigraph requires both a camera array for capturing light rays and
some method for building a coarse geometric model of the scene. This can be
done using photogrammetry, structure-from-motion, or depth sensors.
Lumigraphs can handle more complex scenes with occlusions better than pure
light fields because they use geometry to inform how light rays should be
sampled.
They allow for faster rendering and require less storage than light fields by using
fewer sampled images and leveraging geometric knowledge.
Applications of light fields and Lumigraphs include:
1. Virtual Reality (VR) and Augmented Reality (AR): Light fields and Lumigraphs
enable immersive experiences where users can move freely within a scene and view it
from any angle, enhancing realism.
2. Cinematic VR: These techniques are used in VR movies where multiple viewpoints are
necessary to provide a realistic 3D experience.
3. Image-Based Rendering (IBR): Light fields are heavily used in IBR to synthesize
novel views from captured imagery without needing complex 3D modeling.
4. Medical Imaging: In some cases, light fields can be used for 3D visualization of
medical scans, allowing doctors to see internal structures from multiple angles.
5. Cultural Heritage Preservation: Museums and historical sites use light fields and
Lumigraphs to create virtual tours of artifacts or sites, preserving them digitally in a
highly realistic manner.
2.3 Light Field Displays
A light field display is a type of 3D display that shows different images depending on the
angle from which it's viewed, much like a hologram. By projecting multiple images at
different angles, a light field display can give the illusion of depth, allowing viewers to
see 3D content without special glasses. These displays are based on light field rendering
technology, where the display simulates the behavior of light in a real scene.
How it Works:
1. Multiple views of the scene are rendered or captured from closely spaced angles, for example by sampling a light field.
2. These views are then projected on the light field display, where the user perceives different angles depending on their position relative to the display.
3. As the user moves, the display updates the images to simulate depth and parallax, giving
a 3D experience.
3. ENVIRONMENT MATTES
Environment mattes are a technique used in computer vision and computer graphics
to accurately represent and manipulate the reflections and refractions of an environment
on a transparent or partially transparent object, such as glass or water. They help in
generating realistic images where objects interact with their surroundings through
complex light behavior like reflection, refraction, and transparency.
Environment mattes are particularly useful in applications such as visual effects for
movies, augmented reality, and any scenario where objects must seamlessly blend into
or reflect their surroundings.
For example, imagine placing a glass bottle in front of a colorful background. The
environment matte would capture how the background is seen through the glass,
including:
• Reflections: Parts of the surrounding environment mirrored on the surface of the glass.
• Refractions: Distorted parts of the background seen through the transparent parts of the glass.
• Transparency: How much of the background shows through different regions of the object.
The environment matting process is usually modeled using the matting equation, which
separates the object from the background. For environment mattes, we modify this
concept to handle transparency, reflection, and refraction. The key idea is to express
how a pixel in the final image is a combination of the object’s inherent color (matte
color) and how it interacts with its environment.
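The standard compositing equation is C = alpha * F + (1 - alpha) * B; environment matting generalizes the background term so that it is looked up through the object. A rough sketch under that simplification, where a hypothetical per-pixel warp field stands in for the full reflection/refraction model:

import numpy as np

def composite_with_environment_matte(fg, alpha, backdrop, warp):
    """Composite a (partially) transparent object over a new backdrop.

    fg:       H x W x 3 foreground (matte) color of the object.
    alpha:    H x W x 1 opacity in [0, 1].
    backdrop: H x W x 3 new background image.
    warp:     H x W x 2 per-pixel (row, col) lookup into the backdrop,
              describing where refracted/reflected light comes from
              (identity coordinates mean no bending of light).
    """
    h, w = alpha.shape[:2]
    rows = np.clip(warp[..., 0].astype(int), 0, h - 1)
    cols = np.clip(warp[..., 1].astype(int), 0, w - 1)
    bent_background = backdrop[rows, cols]   # light seen *through* the object
    return alpha * fg + (1.0 - alpha) * bent_background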
To create environment mattes, special techniques are used to capture how light from the environment interacts with the object. This process involves:
1. Capturing with Structured Backgrounds: The object is photographed in front of a series of known, patterned backgrounds so that the mapping from background to image can be recovered.
2. Analyzing Reflections and Refractions: The images are analyzed to determine how
different parts of the object reflect or refract the background. This helps in creating a
model of how the object would interact with any arbitrary environment.
3. Matte Extraction: The matte is extracted, representing both the transparency of the
object and how it alters the appearance of the background due to reflections and
refractions.
Environment mattes are particularly useful in film and visual effects, augmented
reality, and virtual production. Some specific use cases include:
Product Visualization: For product design and visualization, environment mattes allow
for accurate rendering of materials like glass, plastic, and other semi-transparent objects
under different lighting conditions. This is useful for industries like automotive design,
where reflections on the car's surface must look realistic.
Consider a scenario where a glass bottle needs to be composited into a virtual scene
with realistic reflections and refractions.
1. Capture the Environment Matte: Multiple photos of the glass bottle are taken in front
of different patterned backgrounds. These patterns help in identifying how light interacts
with the bottle's surface.
2. Analyze Reflections and Refractions: By comparing how the bottle affects each
background, the environment matte captures how the bottle reflects light from the front
and refracts light from behind.
3. Apply the Matte to a New Scene: Once the environment matte is computed, the
bottle can be placed into any new scene. The matte will ensure that the bottle reflects
and refracts the new background correctly, leading to realistic integration.
4. VIDEO-BASED RENDERING
Video-based rendering (VBR) uses pre-recorded video footage of a scene to synthesize new, photorealistic views of it. Typical goals include:
• Virtual camera movement in scenes that have only been filmed from a few angles.
The primary challenge in VBR is how to use the captured video frames to generate realistic
and consistent new frames, especially when dealing with complex scenes involving
occlusions, moving objects, and changing lighting conditions.
There are several key techniques and methods used in VBR to achieve realistic scene
reconstruction and rendering:
a) View Interpolation
As described in Section 1, view interpolation generates intermediate frames between the captured camera positions, here applied frame by frame to video.
b) Layered Depth Images
Layered Depth Images (LDIs) are an extension of view interpolation where the scene
is represented as multiple layers of depth information for each pixel. Each pixel stores
not only its color but also multiple depth values corresponding to different surfaces along
the same ray.
In VBR, LDIs help handle occlusions and depth discontinuities by allowing the system to
track multiple layers of a scene. This improves the accuracy of interpolating new views,
particularly in complex environments where objects may block parts of the scene.
c) Optical Flow
Optical flow is the pattern of apparent motion between two consecutive video frames.
It is used in VBR to track the motion of objects between frames, enabling the system to
accurately interpolate new views in dynamic scenes where objects are moving.
By calculating the optical flow, VBR systems can predict how pixels will move in
subsequent frames, allowing for smooth rendering of dynamic scenes.
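A small sketch of this step using OpenCV's dense Farneback optical flow (the file names are placeholders, and warping a single frame half-way along the flow is only a crude stand-in for proper bidirectional frame interpolation):

import cv2
import numpy as np

prev_frame = cv2.imread("frame_0.png")   # placeholder file names
next_frame = cv2.imread("frame_1.png")

prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)

# Dense optical flow: a 2-channel (dx, dy) displacement for every pixel.
flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

# Backward-warp the first frame half-way along the flow as a rough
# approximation of the frame that lies between the two inputs
# (treats the flow as locally constant).
h, w = prev_gray.shape
grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
map_x = (grid_x - 0.5 * flow[..., 0]).astype(np.float32)
map_y = (grid_y - 0.5 * flow[..., 1]).astype(np.float32)
intermediate = cv2.remap(prev_frame, map_x, map_y, cv2.INTER_LINEAR)
cv2.imwrite("frame_0_5.png", intermediate)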
d) Multi-View Stereo (MVS)
MVS reconstructs a 3D model by aligning and correlating different views from several
cameras, determining the depth for each pixel. The more views that are available, the
better the depth estimation.
VBR is applied in several areas, particularly where realism and cost-effectiveness are
essential. Some of the key applications include:
VBR is widely used in virtual reality and augmented reality to provide users with
immersive experiences by allowing free movement within a captured video scene. For
instance, in cinematic VR, users can view a pre-recorded scene from different
perspectives as if they were physically present, by synthesizing new views from recorded
video data.
Video-based rendering is also used in sports broadcasting for creating replays with
free-moving cameras. For example, the famous "bullet-time" effect, which was
popularized in movies like The Matrix, allows the camera to move around the action in
slow motion. This effect can be achieved by capturing the scene from multiple angles
and using VBR techniques to interpolate intermediate views.
In the film industry, VBR allows filmmakers to add new camera angles or viewpoints
after shooting a scene, without having to reshoot it. This can be useful for special
effects, where scenes are captured in a controlled environment (e.g., a green screen)
and then new views are generated in post-production.
In interactive media and video games, VBR allows for real-time manipulation of
scenes. For example, a video game could use VBR to simulate realistic movement within
a scene by dynamically interpolating new frames based on the player's viewpoint.
While VBR has many advantages, it also comes with several challenges:
a) Depth Discontinuities and Occlusions
Handling depth discontinuities and occlusions (when one object blocks another) is
difficult, especially in dynamic scenes. If the depth information is inaccurate,
interpolating new views can lead to artifacts such as ghosting or incorrect object
alignment.
b) Real-Time Performance
Generating new frames from video footage in real time requires a lot of computational
power, particularly if the scene involves complex movements or lighting conditions.
Ensuring real-time performance while maintaining high-quality results is challenging.
c) Dynamic Scenes
When objects or people in the scene are moving, VBR must correctly capture and
interpolate motion. Sudden movements can be hard to interpolate, and tracking such
movements using optical flow can sometimes produce incorrect or distorted results.
d) Lighting Changes
Illumination can change between viewpoints or over time; blending views captured under different lighting can introduce visible seams or flicker unless the differences are compensated.
One well-known application of VBR is Free-Viewpoint Video (FVV), where a user can
move a virtual camera freely within a recorded scene. This is particularly popular in
sports broadcasting and immersive media.
1. Multiple Cameras Capture: The scene is recorded using an array of cameras from
different viewpoints.
2. Depth Estimation: Depth maps are generated for each frame using stereo matching or
other techniques.
3. View Interpolation: Using the depth information and the captured frames,
intermediate views are generated for virtual cameras positioned between the real
cameras.
4. User Interaction: The user can then explore the scene from different angles,
seamlessly moving between recorded viewpoints.
In sports broadcasting, this is used to provide 3D replays where the audience can
rotate around players or action moments and view the scene from any angle, even those
not covered by actual cameras.
5. OBJECT DETECTION
Object detection is a computer vision technique used to identify and locate objects
within an image or video. It involves both classification (what is in the image) and
localization (where the object is located). Unlike image classification, which assigns a
single label to an image, object detection provides bounding boxes around multiple
objects in the scene and identifies their categories.
1. Classification: Determining which object categories are present in the image.
2. Localization: Determining where the object is located in the image, usually represented
as a bounding box around the object.
The output of a detector typically includes:
• Bounding boxes: The image coordinates of a rectangle enclosing each detected object.
• Class labels: The category (e.g., car, person, dog) for each detected object.
• Confidence scores: A probability indicating how certain the model is about the
presence of an object in a given bounding box.
There are various techniques for object detection, categorized broadly into traditional
methods and deep learning-based methods. In recent years, deep learning
approaches have become the dominant method due to their superior accuracy and ability
to handle complex tasks.
a) Traditional Methods
Traditional object detection methods typically relied on features like HOG (Histogram
of Oriented Gradients) or SIFT (Scale-Invariant Feature Transform) and
classifiers like Support Vector Machines (SVM). Some key traditional methods
include:
• Sliding Window: This involves sliding a window over the image and applying a
classifier to each sub-image to determine if it contains an object.
• Selective Search: A method used to propose regions of interest (ROI) in the image
that likely contain objects. These regions are then classified using a classifier.
These methods, while effective, were often computationally expensive and not very
flexible for detecting different sizes and types of objects.
b) Deep Learning-Based Methods
Deep learning detectors are usually grouped into two-stage approaches (the R-CNN family, which first proposes candidate regions and then classifies them) and single-stage approaches such as the following.
YOLO (2016): YOLO is a real-time object detection system that formulates object
detection as a single regression problem. It divides the image into a grid and predicts
bounding boxes and class probabilities for each grid cell in one pass through the
network, making it extremely fast.
SSD (2016): Similar to YOLO, SSD detects objects in a single shot by dividing the image
into a grid and predicting both object categories and bounding boxes at each grid
location. It handles objects at multiple scales and is computationally efficient, balancing
speed and accuracy.
RetinaNet (2017):
Known for its Focal Loss function, which helps handle the class imbalance issue in object
detection by focusing on hard-to-classify objects, RetinaNet improves the accuracy of
detecting small and difficult objects while maintaining high speed.
Object detection faces several practical challenges:
• Scale Variation: Objects can appear in different sizes within the same image (e.g., a
faraway car versus a close-up car).
• Occlusion: Objects may be partially obscured by other objects or parts of the scene.
• Deformation: Objects like humans and animals can change their shape, making them
harder to detect.
• Class Imbalance: Some objects (e.g., people or cars) may appear frequently, while
others (e.g., rare animals) are less common, making training difficult.
Object detection models are evaluated using various metrics, the most common being:
• Intersection over Union (IoU): Measures the overlap between the predicted
bounding box and the ground truth. It is calculated as the area of overlap divided by the
area of union between the predicted and ground truth boxes.
A prediction is considered correct if the IoU is above a certain threshold, often set at 0.5 (IoU > 0.5); a small IoU computation is sketched after this list of metrics.
• Precision and Recall: These are standard classification metrics that evaluate how
many of the predicted bounding boxes are correct (precision) and how many actual
objects were successfully detected (recall).
• Mean Average Precision (mAP): A popular metric that averages the precision scores
across all object categories and IoU thresholds.
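As referenced above, a minimal IoU computation for axis-aligned boxes given as (x1, y1, x2, y2) corners (the example boxes are made up):

def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((10, 10, 50, 50), (12, 12, 52, 52)))   # about 0.82, above the usual 0.5 threshold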
Object detection has a wide range of practical applications across many industries:
a) Autonomous Vehicles:
Self-driving systems use object detection to locate pedestrians, other vehicles, traffic signs, and obstacles in real time.
c) Healthcare:
In the medical field, object detection is used in diagnostic imaging (e.g., identifying
tumors or lesions in medical scans) to assist radiologists.
f) Robotics:
Robots rely on object detection to interact with objects in their environment, such as
picking and placing items or navigating through complex spaces.
g) Image Search:
Search engines and photo management systems use object detection to enable users to
search for images containing specific objects (e.g., searching for images of dogs or cars
in a photo library).
Let’s take an example of how YOLO (You Only Look Once) works for detecting objects in
an image.
1. Input Image: The input is an image of a street scene with multiple objects like cars,
people, and traffic lights.
2. Dividing the Image into a Grid: YOLO divides the input image into a grid, say 13x13.
Each grid cell is responsible for predicting whether it contains an object.
3. Bounding Box Prediction: For each grid cell, YOLO predicts several bounding boxes
along with their confidence scores and class probabilities.
4. Non-Maximum Suppression: Overlapping boxes that refer to the same object are pruned, keeping only the highest-scoring box for each object.
5. Final Output: The final result consists of bounding boxes drawn around detected objects (e.g., cars, people, traffic lights), each labeled with the object category and confidence score (a toy decoding sketch follows).
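The toy sketch of the grid-style decoding step referred to above; random numbers stand in for real network outputs, the grid size, box count, and 0.5 threshold are illustrative, and a real detector would apply the non-maximum suppression of step 4 afterwards:

import numpy as np

S, B, C = 13, 2, 3          # grid size, boxes per cell, number of classes (toy values)
rng = np.random.default_rng(0)

# Pretend network output: per cell, B boxes of (x, y, w, h, confidence) + C class scores.
boxes = rng.random((S, S, B, 5))
class_probs = rng.random((S, S, C))
class_probs /= class_probs.sum(axis=-1, keepdims=True)

detections = []
for row in range(S):
    for col in range(S):
        for b in range(B):
            x, y, w, h, conf = boxes[row, col, b]
            cls = int(np.argmax(class_probs[row, col]))
            score = conf * class_probs[row, col, cls]
            if score > 0.5:                       # confidence threshold
                # (x, y) are offsets inside the cell; convert to image-relative coords.
                cx, cy = (col + x) / S, (row + y) / S
                detections.append((cls, float(score), cx, cy, w, h))

print(f"{len(detections)} candidate boxes above threshold")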
6. FACE RECOGNITION
Face recognition systems typically address one of two tasks:
1. Face Verification: Verifying whether a given face matches a specific identity (one-to-one matching).
2. Face Identification: Determining which identity in a database of known faces a given face belongs to (one-to-many matching).
a) Face Detection
The first step in face recognition is detecting the presence of a face in an image or video.
This step isolates the face region from the background or other objects in the scene.
Common face detection techniques include:
• Haar Cascades: A traditional method that uses a cascade of classifiers trained with
positive and negative images.
• Deep Learning Methods: Modern face detectors like MTCNN (Multi-task Cascaded
Convolutional Networks) or YOLO for real-time detection.
After detecting the face, the system aligns the face to a standard orientation (e.g.,
frontal view) for consistent recognition.
b) Face Preprocessing
Before recognition, the face image is typically preprocessed to enhance the quality of
facial features. Preprocessing may include:
• Alignment: Correcting the face's orientation (e.g., rotation) so that the eyes and mouth
are in the same position for all images.
c) Feature Extraction
In this step, a unique feature vector (also called an embedding) is generated for each
face image. This vector represents the distinctive characteristics of a face, such as the
distance between the eyes, the shape of the nose, and other facial features.
Modern systems compute these embeddings with deep convolutional neural networks (CNNs). The CNN models are trained on large datasets of labeled face images, allowing them to learn complex, high-dimensional facial feature representations.
d) Face Matching
Once the feature vectors are extracted, face recognition is performed by comparing
these vectors. The two most common tasks are:
• Verification: Compare two feature vectors using a distance metric (such as Euclidean distance or cosine similarity) to determine if they belong to the same person.
• Identification: Compare a query face's vector against a gallery of enrolled identities (one-to-many matching).
For verification, the distance between the two embeddings should be below a certain
threshold for the system to confirm that they represent the same person. For
identification, the embedding of the detected face is compared with all stored
embeddings, and the system returns the identity of the closest match.
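A minimal sketch of both operations on precomputed embeddings (the toy 128-dimensional vectors and the 0.8 threshold are made-up values; real systems tune the threshold on a validation set):

import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def verify(emb_a, emb_b, threshold=0.8):
    """Same person if the embedding distance falls below the threshold."""
    return euclidean(emb_a, emb_b) < threshold

def identify(query_emb, gallery):
    """Return the enrolled identity whose embedding is closest to the query."""
    return min(gallery, key=lambda name: euclidean(query_emb, gallery[name]))

# Toy 128-D embeddings standing in for FaceNet-style outputs.
rng = np.random.default_rng(1)
gallery = {"alice": rng.normal(size=128), "bob": rng.normal(size=128)}
probe = gallery["alice"] + 0.05 * rng.normal(size=128)   # noisy re-capture of Alice
print(verify(probe, gallery["alice"]), identify(probe, gallery))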
Several face recognition models and architectures are widely used due to their high
accuracy and robustness:
a) DeepFace
• Developed by Facebook, DeepFace was one of the earliest successful deep learning-
based face recognition models.
• It uses a deep CNN trained on a large dataset of faces and achieves near-human
accuracy by generating 3D face models for alignment and feature extraction.
b) FaceNet
• Developed by Google, FaceNet introduced the concept of triplet loss, where the goal is
to minimize the distance between an anchor image and a positive image (same person)
while maximizing the distance between the anchor and a negative image.
• FaceNet outputs a 128-dimensional embedding for each face, which is used for face
matching.
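In equation form, the loss for one triplet is L = max(0, ||f(a) - f(p)||^2 - ||f(a) - f(n)||^2 + margin). A NumPy sketch (the margin value is illustrative):

import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss on embedding vectors: pull the positive closer than the negative by a margin."""
    d_pos = np.sum((anchor - positive) ** 2)   # squared distance to the same identity
    d_neg = np.sum((anchor - negative) ** 2)   # squared distance to a different identity
    return max(0.0, d_pos - d_neg + margin)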
c) VGG-Face
• Developed by the Visual Geometry Group at the University of Oxford, VGG-Face applies a deep VGG-style CNN to face recognition.
• VGG-Face is known for its ability to handle variations in pose, lighting, and facial expressions.
d) ArcFace
• ArcFace trains with an additive angular margin loss, which spreads face embeddings more evenly and increases the separation between different identities.
• ArcFace provides high accuracy in both verification and identification tasks, making it suitable for large-scale face recognition systems.
Despite this progress, face recognition still faces several challenges:
a) Pose and Expression Variation
Faces can look very different from different angles or when a person changes their
expression. Handling pose variations (e.g., profile view vs. frontal view) is particularly
challenging. Solutions like 3D face modeling or robust training datasets help mitigate this
issue.
b) Lighting Conditions
Changes in lighting can dramatically affect the appearance of facial features. Deep
learning models trained on large, diverse datasets are more robust to varying lighting
conditions compared to traditional methods.
c) Aging
As a person ages, their facial features change over time. Long-term recognition systems
must account for these gradual changes, often requiring regular updates to the face
embeddings.
d) Occlusion
Occlusion occurs when parts of the face are hidden by objects like glasses, masks, or
hair. Some modern face recognition models use partial matching techniques to
recognize faces even when occluded.
e) Adversarial Attacks
Carefully crafted image perturbations can cause a recognition model to mis-identify a face, so deployed systems may need defenses against such attacks.
Face recognition is used in a wide range of applications:
a) Surveillance
• Surveillance systems use face recognition to monitor public spaces and identify persons of interest, such as suspects or missing individuals.
b) Law Enforcement
Law enforcement agencies use face recognition to match suspects against a database of
known criminals. It is also used in forensic analysis to identify people from CCTV footage
or other media.
c) Device Authentication
Most modern smartphones and laptops use face recognition for user authentication. This biometric method provides a secure and convenient alternative to passwords or fingerprint recognition.
d) Social Media and Photo Management
Platforms like Facebook and Google Photos use face recognition to automatically
tag people in images. Face recognition enables users to organize their photo libraries
and search for images based on the individuals in them.
e) Retail
In the retail industry, face recognition is used to analyze customer behavior, personalize
shopping experiences, or even provide targeted advertising based on the recognition of
frequent customers.
f) Healthcare
Hospitals can use face recognition for patient identification and check-in, helping match patients to the correct records.
Let’s break down how FaceNet, a state-of-the-art deep learning model for face
recognition, works:
1. Training: FaceNet is trained using triplet loss, where it learns to map images of the
same person close to each other in the embedding space while mapping images of
different people far apart.
2. Embedding Generation: Each detected and aligned face is passed through the network, which outputs a 128-dimensional embedding.
3. Verification: To verify if two faces belong to the same person, the system compares their embeddings using a distance metric like Euclidean distance. If the distance is below a certain threshold, the faces are considered to belong to the same individual.
7. INSTANCE RECOGNITION
Instance recognition differs from object recognition in that the latter focuses on
identifying the general class or type of an object (e.g., dog, car, person), while instance
recognition identifies unique instances within a class (e.g., a specific breed of dog or a
specific license plate).
A typical instance recognition pipeline involves the following steps (a keypoint-matching sketch follows this list):
1. Feature Extraction: The system extracts distinctive features from the image that can
uniquely identify the object. These features might include keypoints (e.g., corners,
edges) or descriptors that summarize the object’s texture, color, or shape.
2. Matching: Once the features are extracted, they are compared to a pre-built
database of known instances. The system uses a similarity measure (like Euclidean
distance) to find the closest match between the detected features and those stored in
the database.
3. Pose Estimation (optional): If the object in the image is not perfectly aligned (rotated
or viewed from a different angle), some systems estimate the pose and adjust the
recognition accordingly. This is common in applications like augmented reality or 3D
model matching.
4. Instance Recognition Output: The system outputs the specific identity of the
detected instance, often with a confidence score that reflects how certain the system is
about the match.
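The keypoint-matching sketch referred to above, covering steps 1-2 with ORB features in OpenCV (the image paths are placeholders and the distance threshold of 40 is an arbitrary choice):

import cv2

query = cv2.imread("query_object.jpg", cv2.IMREAD_GRAYSCALE)        # placeholder paths
reference = cv2.imread("database_instance.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp_q, des_q = orb.detectAndCompute(query, None)
kp_r, des_r = orb.detectAndCompute(reference, None)

# Brute-force Hamming matching with cross-checking suits binary ORB descriptors.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_q, des_r), key=lambda m: m.distance)

# A simple similarity score: the number of good (low-distance) matches.
good = [m for m in matches if m.distance < 40]
print(f"{len(good)} good matches; a higher count suggests the same instance")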
Instance recognition can be achieved using various methods, ranging from traditional
computer vision techniques to modern deep learning-based approaches.
a) Traditional Methods
• SIFT (Scale-Invariant Feature Transform): Detects scale- and rotation-invariant keypoints and describes them with local gradient histograms; matching these descriptors against a database identifies a specific instance.
• SURF (Speeded Up Robust Features): SURF is a faster alternative to SIFT that also
detects keypoints and uses descriptors for instance matching.
• Bag of Visual Words (BoVW): This method works by quantizing keypoint descriptors
into visual words and using them for image matching.
These methods are widely used for landmark recognition, logo detection, and
product identification, where distinct visual features help identify specific instances.
b) Deep Learning-Based Methods
• Siamese Networks: Siamese networks consist of two or more identical CNNs sharing
weights. They are used to compute embeddings for two input images and then compare
them to see if they represent the same instance.
• Region-based Methods (R-CNN, Fast R-CNN): These models detect specific objects
or regions of interest and can also be used for instance recognition by further refining
the identification process with a secondary matching step.
Applications of instance recognition include:
a) Product Recognition in Retail
In retail settings, instance recognition can identify specific products on store shelves for
inventory management or for customer applications like visual search (e.g., identifying
a particular brand or model of a shoe or clothing item).
b) Facial Recognition
c) Logo Detection
d) Landmark Recognition
e) Augmented Reality (AR)
In AR, instance recognition allows virtual objects to be overlaid on top of specific real-
world objects (e.g., tracking a product or an image in real-time to display interactive
content).
f) License Plate Recognition
Instance recognition is used to detect and recognize specific license plates from
images or video frames. This is crucial for applications like traffic law enforcement,
parking management, and vehicle tracking.
g) Robotics
Robots that operate in complex environments rely on instance recognition to detect and
interact with specific objects (e.g., identifying specific tools in a manufacturing line).
Instance recognition, like any computer vision task, has its own set of challenges:
• Intra-Class Variability: Objects belonging to the same category can have subtle
differences that make it difficult to recognize specific instances.
• Viewpoint Variation: Objects may appear different when viewed from different angles,
which requires the system to generalize across viewpoints.
• Occlusion: Parts of the object may be hidden or blocked by other objects, making
recognition harder.
Scenario: A customer uses a smartphone app to take a photo of a specific brand of cereal
in a store. The system needs to recognize this particular brand (and not just any cereal) to
provide information about it, such as its price, availability, and nutritional facts.
Steps:
1. Image Capture: The customer photographs the cereal box with the smartphone app.
2. Feature Extraction: The system uses a CNN to extract features from the image, focusing on distinctive elements like the brand logo, packaging design, and text.
3. Database Lookup: The extracted features are compared with stored feature sets of known products.
4. Instance Recognition: The system identifies the exact product based on the closest
match in the database and provides the customer with relevant product information.
This instance recognition system would need to handle variations in how the photo was
taken (e.g., different angles, lighting conditions, or partial occlusion by other products).
8. CATEGORY RECOGNITION
Category recognition in computer vision refers to the task of identifying the class or
type of an object in an image, without distinguishing between specific instances of that
object. It focuses on recognizing objects as belonging to broad categories such as "cat,"
"car," "tree," or "person," rather than identifying a particular cat or a specific car.
Category recognition is also known as object classification or object recognition.
Category recognition generally follows a standard pipeline that involves the following
steps:
a) Image Preprocessing
Before category recognition, the input image is typically preprocessed to enhance the
features necessary for recognition:
• Resizing: Images are resized to a fixed size to make them uniform for input to a
recognition model.
• Normalization: The pixel values are normalized to a common range (e.g., [0, 1] or [-1,
1]).
• Data Augmentation: Techniques like flipping, rotating, cropping, and scaling the image
can help improve the robustness of the recognition model by exposing it to varied
appearances of objects.
b) Feature Extraction
Feature extraction is one of the most important steps in category recognition. Here,
distinctive patterns of the object are extracted from the image, such as texture, edges,
shapes, and colors.
c) Classification
Once the relevant features are extracted, the next step is to classify the object into one of the pre-defined categories. This is typically done using machine learning classifiers or deep learning models, for example an SVM trained on the extracted features or, in CNNs, a final fully connected layer followed by a softmax over the category scores.
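A short sketch of this pipeline with a pretrained torchvision ResNet (the exact form of the weights argument depends on the installed torchvision version, and the image path is a placeholder):

import torch
import torchvision
from torchvision import transforms
from PIL import Image

# Standard ImageNet-style preprocessing: resize, crop, tensorize, normalize.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = torchvision.models.resnet50(weights="IMAGENET1K_V1")   # pretrained on ImageNet
model.eval()

image = Image.open("example.jpg").convert("RGB")               # placeholder path
batch = preprocess(image).unsqueeze(0)                         # shape (1, 3, 224, 224)

with torch.no_grad():
    logits = model(batch)
    probs = torch.softmax(logits, dim=1)

top_prob, top_class = probs.max(dim=1)
print(f"Predicted category index {top_class.item()} with probability {top_prob.item():.2f}")

In a full system, the predicted index would be mapped back to a human-readable category name using the label list that accompanies the pretrained weights.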
Over the years, several architectures have become popular for performing category
recognition, especially in deep learning:
a) AlexNet
AlexNet was one of the first CNN architectures that gained popularity after winning the
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. It
demonstrated that CNNs could outperform traditional methods in large-scale category
recognition tasks. AlexNet introduced key innovations like ReLU activation and
dropout to reduce overfitting.
b) VGGNet
VGGNet improved over AlexNet by using a deep architecture of small 3x3 convolutional
filters. It showed that deeper networks could achieve better category recognition
accuracy by capturing more complex visual features. The simplicity of VGGNet made it a
popular architecture for many downstream computer vision tasks.
c) ResNet
ResNet introduced skip connections that allowed the training of very deep neural
networks (e.g., with 50 or more layers). ResNet's deep architecture is highly effective for
category recognition because it can learn complex feature hierarchies without suffering
from vanishing gradients, which often occurs in very deep networks.
d) Inception
Inception networks use a combination of convolutions with different filter sizes at each
layer to capture multi-scale information. This architecture is efficient in terms of
computation and memory usage, making it suitable for large-scale category recognition
tasks.
e) MobileNet
MobileNet uses lightweight depthwise-separable convolutions to reduce computation and model size, making category recognition practical on mobile and embedded devices.
Category recognition must cope with several challenges:
a) Intra-Class Variability
Objects within the same category can appear significantly different due to differences in
shape, size, color, texture, or style. For instance, the category "car" includes various
models, colors, and designs. The system needs to generalize these variations and
recognize them as belonging to the same category.
b) Inter-Class Similarity
Some categories have objects that look very similar to each other, making it difficult to
distinguish between them. For example, some breeds of dogs might look very similar to
wolves, or some brands of shoes may have similar appearances.
c) Viewpoint Variation
Objects can appear different when viewed from various angles or perspectives. A system
trained on images of a car from a frontal view might struggle to recognize the same car
from a side or top view. Models need to be robust to viewpoint variations.
d) Occlusion
Partial occlusion, where parts of an object are hidden by other objects or by the image
boundary, can make category recognition challenging. The system must be able to
recognize objects even if they are only partially visible.
e) Lighting Conditions
Changes in lighting, reflections, and shadows can alter the appearance of an object. A
robust system must be able to handle different lighting conditions, including
overexposure or dim lighting.
f) Data Imbalance
In real-world applications, certain categories may have significantly more examples than
others in the training data. For example, there might be thousands of images of "cats"
but only a few of "raccoons." Imbalanced data can lead to biased models that perform
well on common categories but poorly on rare ones.
Applications of category recognition include:
a) Autonomous Vehicles
Recognizing broad categories such as "pedestrian," "vehicle," "traffic sign," and "lane marking" is a core part of a self-driving perception pipeline.
b) Healthcare
Classifying medical images into categories (for example, the type of tissue or the presence of an abnormality) assists clinicians in diagnosis.
c) E-commerce and Retail
Category recognition helps e-commerce platforms like Amazon or Google identify product
categories from images. For instance, recognizing whether an uploaded image is a "t-
shirt," "dress," or "shoes" can help organize and recommend products efficiently.
d) Robotics
In robotics, category recognition enables robots to understand and interact with their
environments. For instance, recognizing objects like "cup," "bottle," or "tool" allows
robots to pick and place objects correctly in assembly lines or service robots in
households.
e) Surveillance
Recognizing object categories such as people, vehicles, or bags in camera feeds supports automated monitoring and alerting.
f) Social Media
Social media platforms like Facebook and Instagram use category recognition to
automatically tag objects, places, and people in uploaded photos. This helps improve
user experience and also enables features like image-based search.
One of the most famous examples of category recognition is the ImageNet Large
Scale Visual Recognition Challenge (ILSVRC). In this competition, models are
trained to classify images into one of 1,000 different categories, including animals,
objects, and scenes.
For example, an image of a tiger would be classified into the "tiger" category, while an
image of a bicycle would be classified as "bicycle." Models like AlexNet, VGGNet, and
ResNet have all competed in this challenge, with ResNet achieving breakthrough
performance due to its deep architecture and ability to learn complex features.
9. CONTEXT AND SCENE UNDERSTANDING
Context and scene understanding refer to the ability of computer vision systems to
not only detect individual objects in an image but also interpret the relationships
between these objects, their spatial arrangements, and their environment as a whole.
This goes beyond object detection and classification, aiming to provide a higher-level
comprehension of what is happening in the scene, similar to how humans understand
complex environments.
Understanding scenes is crucial for tasks such as autonomous driving, image captioning,
human-robot interaction, and surveillance, where recognizing individual objects is not
enough. The system needs to infer how objects relate to each other, predict potential
interactions, and even reason about unseen parts of the environment.
a) Object Detection and Classification
At the base level, understanding a scene involves detecting and classifying objects within
the scene. This step involves identifying the categories (e.g., car, person, tree) of the
objects and localizing them within the image using bounding boxes, keypoints, or
segmentation masks.
b) Contextual Relationships
Context refers to the spatial and semantic relationships between objects within the
scene. For example:
• Spatial Relationships: Objects that are close to each other in space often share a contextual relationship. For instance, a "car" is likely to be on a "road" and not on a "table."
• Semantic Relationships: Certain objects tend to co-occur or function together; detecting a "keyboard," for example, makes a nearby "monitor" more likely.
Understanding context helps computer vision systems to make better predictions and
resolve ambiguities. For instance, if a blurry object is detected near a "bed," the system
may infer that it’s likely to be a "pillow" rather than a "laptop."
c) Scene Classification
Scene classification is the process of recognizing the broader category or setting of the
entire image, such as "beach," "forest," "office," or "city street."
Scene classification provides the global context for the objects in the image and can
guide the interpretation of what is happening in the scene. For instance, knowing the
scene is a "kitchen" can help recognize objects like "stove," "sink," and "fridge."
d) Semantic Segmentation
Semantic segmentation divides an image into regions associated with specific object
categories, labeling each pixel with its corresponding class. This allows for detailed
understanding of which parts of the image correspond to which objects, aiding in both
object recognition and understanding spatial relationships.
e) Instance Segmentation
Instance segmentation extends semantic segmentation by separating individual object instances of the same class, so that, for example, each person in a crowd receives its own mask.
f) Action and Interaction Recognition
Understanding a scene also involves recognizing the actions taking place (e.g.,
"walking," "eating," "driving") and interactions between objects or people. For
example, recognizing that a person is holding a cup, or that a dog is chasing a ball,
adds depth to the understanding of the scene.
g) 3D Scene Understanding
3D scene understanding involves interpreting the depth, layout, and geometry of the
scene, which is important for applications like augmented reality (AR) and autonomous
navigation. By understanding the 3D relationships between objects, systems can reason
about occlusions, predict what is behind objects, and navigate the environment.
Several techniques and models are used to achieve context and scene understanding:
a) Convolutional Neural Networks (CNNs)
CNNs have been the foundation for object detection and scene classification due to their
ability to extract hierarchical features from images. They are used in architectures like
Faster R-CNN and YOLO (You Only Look Once) for detecting and localizing objects
in images.
b) Fully Convolutional Networks (FCNs)
For semantic segmentation, FCNs are commonly used. FCNs predict class labels for
every pixel in the image, allowing the system to understand which pixels belong to which
object or background region.
RNNs and Long Short-Term Memory (LSTM) networks are often used for tasks that
require temporal understanding, such as video scene understanding. They help to
capture sequences of actions and interactions in a video by remembering past frames
and context.
e) Transformers
In recent years, Vision Transformers (ViT) have emerged as powerful tools for scene
understanding. Transformers capture long-range dependencies and relationships
between different parts of the image, making them effective for tasks involving complex
scenes and high-level context reasoning.
Context and scene understanding face several challenges:
a) Scene Complexity and Ambiguity
Scenes can be very complex, with multiple objects interacting in different ways.
Recognizing and reasoning about all objects and their relationships, actions, and context
is computationally challenging. Moreover, scenes can often be ambiguous, requiring
high-level reasoning to understand their true meaning.
b) Occlusion
In real-world scenes, objects are often partially occluded by other objects, making it
difficult to recognize and understand the entire scene. For example, if part of a car is
hidden behind a tree, the system must infer that it is still a car.
c) Scale and Viewpoint Variation
Objects in a scene can appear at various scales and orientations, especially in outdoor
environments. This variability complicates recognition and understanding since objects
can appear quite different based on distance, lighting, or occlusion.
d) Missing or Ambiguous Labels
In some cases, object labels may not be available or easy to infer, which makes it harder
for the system to understand the context. For example, in abstract or artistic images,
understanding the scene may require high-level reasoning that goes beyond object
recognition.
e) Real-Time Performance
Reasoning jointly about many objects, their relationships, and their motion at interactive or video frame rates is computationally demanding.
Applications of context and scene understanding include:
a) Autonomous Driving
In autonomous vehicles, context and scene understanding are critical for recognizing
road scenes, detecting obstacles, and predicting the behavior of pedestrians and other
vehicles. The system needs to understand the full environment, including road signs,
traffic lights, road conditions, and other dynamic objects, to navigate safely.
b) Robotics
For robots to interact effectively with humans and objects, they must understand the
context of their environment. This includes recognizing objects, actions, and spatial
relationships. In tasks like pick-and-place or household chores, robots rely on scene
understanding to perform efficiently.
d) Image Captioning
Scene understanding supports image captioning, where a system generates a natural-language description of an image from the detected objects, their relationships, and the overall scene type.
Example: Scene Understanding in Autonomous Driving
In the context of autonomous driving, scene understanding is crucial for making safe driving decisions. The system needs to:
1. Detect and Localize Objects: Identify pedestrians, vehicles, traffic signs, lane markings, and other obstacles.
2. Classify the Scene: Recognize the type of scene, such as a highway, residential area, or intersection.
3. Understand Relationships: Reason about how the detected objects relate to the road and to each other (e.g., a pedestrian standing near a crossing).
4. Predict Actions: Based on context, predict the future actions of dynamic objects (e.g.,
whether a car will turn or if a pedestrian will cross).
5. React in Real-Time: Make driving decisions (e.g., stop, accelerate, change lanes)
based on the understanding of the entire scene.
This holistic understanding of the scene is what enables autonomous cars to safely
navigate complex environments with dynamic objects and changing conditions.
10. RECOGNITION DATABASES AND TEST SETS
In the field of computer vision, recognition databases and test sets play a critical
role in the development, evaluation, and benchmarking of models for tasks like object
recognition, face recognition, image classification, and scene understanding. These
datasets provide the standardized data necessary for training models and comparing the
performance of different algorithms in a consistent and objective manner.
Recognition databases can vary depending on the task, and they are typically split
into three subsets:
• Training Set: Used to train the model. It contains labeled data that the model uses to learn patterns, features, and associations.
• Validation Set: Held out during training to tune hyperparameters and detect overfitting before the final evaluation.
• Test Set: A separate set of data used for evaluating the model's final performance after training is complete (a small splitting sketch follows).
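The splitting sketch referred to above, using scikit-learn on toy data (the 70/15/15 proportions are illustrative and vary by project):

import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 1,000 flattened "images" with 10 balanced class labels.
images = np.random.rand(1000, 32 * 32)
labels = np.repeat(np.arange(10), 100)

# 70% training, then split the remaining 30% evenly into validation and test.
train_x, rest_x, train_y, rest_y = train_test_split(
    images, labels, test_size=0.30, stratify=labels, random_state=42)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.50, stratify=rest_y, random_state=42)

print(len(train_x), len(val_x), len(test_x))   # 700 150 150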
Key Characteristics
• Annotations: Most databases come with annotations such as bounding boxes, pixel-
wise segmentation masks, class labels, and sometimes attributes or relationships
between objects. These annotations are essential for supervised learning tasks.
• Size: The number of images in a recognition database can vary from a few hundred to
millions. Larger datasets often lead to better model performance, as they provide more
examples from which the model can learn.
• Quality: High-quality images with clear labels are critical for effective training. Images
should be representative of real-world conditions, including variations in lighting,
occlusion, and scale.
Some widely used recognition databases include:
• ImageNet: Contains over 14 million images across thousands of categories, widely used
for training and benchmarking deep learning models. ImageNet’s challenge, ILSVRC
(ImageNet Large Scale Visual Recognition Challenge), is a key event in the field.
• Pascal VOC: Provides a benchmark for object detection and segmentation tasks,
featuring a variety of images with labeled objects and detailed annotations across 20
categories.
• CelebA: A large-scale dataset with over 200,000 celebrity images, annotated with facial
attributes and landmarks, used for face recognition and attribute detection tasks.
Test Sets
a) Definition
A test set is a subset of a recognition database specifically reserved for evaluating the
performance of a model after it has been trained. It contains examples that the model
has never seen during training, allowing for an unbiased assessment of its generalization
capability.
b) Key Features
• Separation from Training Data: It is crucial that test sets remain completely separate
from the training and validation datasets. This separation ensures that performance
metrics reflect true model performance rather than memorization of training examples.
• Representativeness: Test sets should reflect the diversity and variability present in the
entire dataset. This includes different categories, orientations, lighting conditions, and
levels of occlusion.
• Size: While test sets can be smaller than training datasets, they should be large enough
to provide reliable and statistically significant evaluations of model performance.
c) Evaluation Metrics
Common metrics include (a small numeric example follows this list):
• Accuracy: The fraction of correctly predicted instances out of the total instances.
• Precision: The ratio of true positive predictions to the total predicted positives,
measuring the correctness of positive predictions.
• Recall: The ratio of true positive predictions to the actual positives, indicating the ability
to capture all relevant instances.
• F1 Score: The harmonic mean of precision and recall, providing a single metric that
balances both aspects.
• Mean Average Precision (mAP): Commonly used in object detection, mAP evaluates
the precision of predictions at different recall levels, providing a comprehensive measure
of accuracy across multiple classes.
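As noted above, a tiny numeric sketch of precision, recall, and F1 computed from raw counts (the counts are made up):

def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 80 correct detections, 20 spurious ones, 40 objects missed.
print(precision_recall_f1(tp=80, fp=20, fn=40))   # (0.80, 0.666..., 0.727...)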
Importance of Recognition Databases and Test Sets
a) Model Development
Recognition databases are essential for training machine learning models. The quality
and variety of the data directly influence how well a model learns to recognize objects
and make accurate predictions.
b) Benchmarking
Databases with established test sets provide benchmarks for comparing the performance
of different algorithms and models. Researchers can publish results on these standard
datasets, facilitating a common ground for evaluation.
c) Transfer Learning
Many models are pre-trained on large recognition databases before being fine-tuned on
specific tasks. This approach leverages learned features from broad datasets, allowing
models to perform well even on smaller, domain-specific datasets.
d) Real-World Applications
High-quality recognition databases and test sets are crucial for developing reliable
systems used in various applications, such as facial recognition, autonomous vehicles,
medical image analysis, and augmented reality. Reliable performance evaluations ensure
that these systems are robust and trustworthy.
Challenges with Recognition Databases and Test Sets
a) Data Imbalance
In many recognition databases, certain categories may have significantly more examples
than others, leading to biased models that perform well on common classes but poorly
on rare ones.
b) Annotation Quality
The quality of annotations can vary, and inaccuracies in labels can adversely affect model
training and evaluation. Ensuring high-quality annotations is a critical step in dataset
preparation.
d) Overfitting
Models trained too closely on specific datasets may perform poorly when applied to new,
unseen data, indicating overfitting. It's crucial to have a well-defined test set to evaluate
generalization.
ACTIVITIES
Activity 1: Workshop on Depth and Light Field Techniques
Objective: Understand and apply concepts of view interpolation and layered
depth images, light fields, and lumigraphs.
Format: Hands-on workshop
Description:
• Introduction (30 mins): Begin with a presentation explaining view
interpolation, layered depth images, light fields, and lumigraphs. Include
examples and their applications in computer vision.
• Hands-On Activity (1.5 hours):
• Divide participants into small groups.
• Provide them with a dataset of images with depth information.
• Each group will implement a basic view interpolation algorithm using
layered depth images and light field techniques.
• Encourage experimentation with different interpolation methods and
visualizing the results.
• Discussion (30 mins): Groups will present their findings and challenges faced
during implementation. Discuss potential applications in video rendering and
immersive experiences.
Activity 2: Recognition Systems Challenge
Objective: Build a basic object and face recognition system using recognition
databases.
Format: Group project and presentation
Description:
• Preparation (1 week prior): Assign groups and provide them with a choice of
recognition databases (e.g., COCO for object detection, LFW for face recognition).
• Project Development (2 weeks):
• Groups will design a simple recognition system that can detect objects or
recognize faces in images/videos.
• They will preprocess the data, train their models, and evaluate their
performance using established test sets.
• Encourage groups to explore different models (e.g., CNNs, transfer
learning).
• Final Presentation (1 hour): Each group will present their system, the dataset
used, performance metrics, and insights gained from the project.
Activity 3: Context and Scene Understanding Simulation
Objective: Explore context and scene understanding through practical
applications and discussions.
Format: Interactive simulation and debate
Description:
• Interactive Simulation (1 hour): Use a simulation tool or a computer vision
platform to analyze various scenes. Participants will identify objects, their
relationships, and contextual information in given images.
• Group Analysis (30 mins):
• Split into small groups, each assigned a different scene type (e.g., urban,
natural, indoor).
• Groups will analyze how context influences the understanding of objects
within their scene. What assumptions can be made? How do different
contexts change interpretations?
• Debate (30 mins): Host a debate on the importance of context in computer
vision applications. Discuss scenarios where context is crucial (e.g., autonomous
driving vs. image classification) and how ignoring context can lead to failures.
Video Links – Unit 5
Assignments – Unit V
Assignment Questions – Very Easy
Q. No. | Assignment Question | Marks | Knowledge Level | CO
1 | Explain what is meant by "view interpolation" in the context of computer vision. | 5 | K1 | CO5
2 | List and briefly describe the essential components of a recognition database used in computer vision. | 5 | K1 | CO5
3 | Describe how face recognition works in basic terms, including what features are typically analyzed in this process. | 5 | K2 | CO5
Assignment Questions – Medium
Q. No. | Assignment Question | Marks | Knowledge Level | CO
1 | Imagine a scenario where light field technology could be applied in real life. Describe how it would enhance visual experience or solve a problem. | 5 | K3 | CO5
2 | Apply your knowledge of category recognition and instance recognition by comparing their applications in object detection tasks. | 5 | K3 | CO5
3 | Identify and analyze at least two major challenges in video-based rendering. How do these challenges affect the visual quality and performance of rendering systems? | 5 | K4 | CO5
Assignment Questions – Very Hard
Q. No. | Assignment Question | Marks | Knowledge Level | CO
1 | Evaluate how context influences scene understanding. Provide an example where context is critical for accurate recognition and explain why. | 5 | K5 | CO5
2 | Assess how environment mattes contribute to improving visual effects in film or virtual reality environments. What are the pros and cons of using them in these applications? | 5 | K5 | CO5
Course Outcomes:
CO5: To understand image-based rendering and recognition.
*Allotment of Marks
15 - 5 20
Part A – Questions & Answers (Unit V)
1. What is view interpolation?
View interpolation is a technique used in computer vision and
graphics to generate intermediate frames between two or more
existing images or views, creating smooth transitions between
different viewpoints.
6. How are layered depth images used in view interpolation?
Layered depth images store information about both color and
depth at multiple layers in an image, allowing for accurate interpolation and
rendering of intermediate views by considering occluded objects.
7. Why is context important in scene understanding?
Context provides additional information about the relationships
between objects and their environment, helping to improve accuracy in
recognizing scenes and understanding the roles of different objects.
8. What is the purpose of a test set in machine learning?
A test set is used to evaluate the performance of a trained model
on unseen data, ensuring that the model generalizes well and is not simply
memorizing the training data.
9. What is instance recognition?
Instance recognition refers to identifying specific occurrences of
objects, such as recognizing a particular car or a specific person's face, as
opposed to recognizing the general category.
10. What is video-based rendering?
Video-based rendering is a technique that generates new views
of a scene using pre-recorded video footage, often used in virtual reality,
visual effects, and immersive media experiences.
11. How could light fields improve 3D displays?
16. Why are larger recognition databases better for model training?
19. How can layered depth images handle occlusions during view
interpolation?
Layered depth images store multiple depth layers for each pixel,
allowing the rendering process to account for occluded objects and providing
a more realistic interpolation between views.
21. Evaluate the effectiveness of face recognition in crowded
environments.
23. Evaluate the role of light fields in virtual reality (VR) systems.
Part B Questions
Q. No. | Question | K Level | CO Mapping
1 | Describe the key features of view interpolation and layered depth images. | K1 | CO5
7 | Analyze the impact of recognition databases on model performance. What factors should be considered when selecting a database for training and testing? | K4 | CO5
Supportive Online Certification Courses (NPTEL, Swayam, Coursera, Udemy, etc.)
Coursera – Introduction to Computer Vision
• Description: This course provides an overview of computer vision, including image processing, feature extraction, and object recognition.
• Offered by: Georgia Tech
• Link: https://www.coursera.org/learn/introduction-computer-vision
NPTEL – Computer Vision
• Computer Vision - Course (nptel.ac.in)
Udemy
• Computer Vision – Practical OpenCV 3 (Udemy course)
Real-Time Applications in Day-to-Day Life and in Industry
Content Beyond Syllabus
Advanced Concepts in Recognition Databases and Test Sets
Assessment Schedule (Proposed Date & Actual Date)
Sl. No. | Assessment | Proposed Date | Actual Date
1 | First Internal Assessment | |
2 | Second Internal Assessment | |
3 | Model Examination | |
4 | End Semester Examination | |
Prescribed Text Books & References
TEXT BOOKS:
1. D. A. Forsyth, J. Ponce, "Computer Vision: A Modern Approach", Pearson Education, 2003.
REFERENCES:
1. B. K. P. Horn, Robot Vision, McGraw-Hill.
2. Simon J. D. Prince, Computer Vision: Models, Learning, and Inference, Cambridge University Press, 2012.
Mini Project Suggestions
1. Very Hard
Implement a deep learning-based approach for optical flow estimation using a pre-
trained model like RAFT or PWC-Net. Further, develop a layered motion analysis system
that uses optical flow data to segment a scene into multiple layers, each representing
independent motion. This project can explore novel deep learning architectures or
enhancements for better real-time performance and accuracy.
2. Hard
Develop a robust bundle adjustment system for refining 3D reconstruction from
multiple views. The system should perform feature matching across several images,
initial pose estimation, and iterative optimization to minimize reprojection error using
sparse bundle adjustment techniques. Handle large datasets and consider using multi-
threading or GPU acceleration for optimization.
3. Medium
Implement a two-frame structure from motion (SfM) system to reconstruct a 3D
model of a scene using two images. The system should detect features, match them,
estimate the essential matrix, recover the camera pose, and triangulate points to build a
sparse 3D model.
4. Easy
Create an Augmented Reality (AR) application that uses pose estimation with
fiducial markers (like ArUco or AprilTag) to place virtual objects in the real world. The
project involves detecting the markers in a video feed, estimating the camera's pose
relative to the markers, and rendering a 3D object on top of them.
5. Very Easy
Develop a simple image mosaicing application using 2D feature-based alignment
techniques like SIFT or ORB. The application should detect features in two overlapping
images, match them, estimate a homography, and blend the images to create a
seamless panorama.
Thank you