
1

2
Please read this disclaimer before
proceeding:
This document is confidential and intended solely for the
educational purpose of RMK Group of Educational Institutions. If
you have received this document through email in error,
please notify the system manager. This document contains
proprietary information and is intended only to the respective
group / learning community as intended. If you are not the
addressee you should not disseminate, distribute or copy
through e-mail. Please notify the sender immediately by e-mail
if you have received this document by mistake and delete this
document from your system. If you are not the intended recipient
you are notified that disclosing, copying, distributing or taking any
action in reliance on the contents of this information is strictly
prohibited.

3
21CS905
COMPUTER
VISION

Department:
ARTIFICIAL INTELLIGENCE AND DATA
SCIENCE
Batch/Year: BATCH 2021-25/IV
Created by:
Dr.V.Seethalakshmi, Associate Professor, AI&DS, RMKCET

Date: 18-09-2024

4
Table of Contents
Sl. No. Contents Page No.

1 Contents 5

2 Course Objectives 6

3 Pre Requisites (Course Name with Code) 8

4 Syllabus (With Subject Code, Name, LTPC details) 10

5 Course Outcomes (6) 12

6 CO-PO/PSO Mapping 14
Lecture Plan (S.No., Topic, No. of Periods, Proposed
7 date, Actual Lecture Date, pertaining CO, Taxonomy 17
level, Mode of Delivery)
8 Activity based learning 19
Lecture Notes (with Links to Videos, e-book reference,
9 21
PPTs, Quiz and any other learning materials )
Assignments ( For higher level learning and Evaluation
10 75
- Examples: Case study, Comprehensive design, etc.,)
11 Part A Q & A (with K level and CO) 80

12 Part B Qs (with K level and CO) 86


Supportive online Certification courses (NPTEL,
13 89
Swayam, Coursera, Udemy, etc.,)
14 Real time Applications in day to day life and to Industry 91
Contents beyond the Syllabus ( COE related Value
15 93
added courses)

16 Assessment Schedule ( Proposed Date & Actual Date) 95

17 Prescribed Text Books & Reference Books 97

18 Mini Project 99

5
Course
Objectives

6
COURSE OBJECTIVES

 To understand the fundamental concepts related to image formation and processing.

 To learn feature detection, matching and segmentation.

 To become familiar with feature-based alignment and motion estimation.

 To develop skills on 3D reconstruction.

 To understand image-based rendering and recognition.

7
PRE
REQUISITES

8
PRE REQUISITES

 SUBJECT CODE: 22MA101


 SUBJECT NAME: Matrices and Calculus

 SUBJECT CODE: 22MA201


 SUBJECT NAME: Transforms and Numerical
Methods

9
Syllabus

10
Syllabus

21CS905 COMPUTER VISION                                   L T P C
                                                          3 0 0 3

UNIT I INTRODUCTION TO IMAGE FORMATION AND PROCESSING           9
Computer Vision - Geometric primitives and transformations - Photometric image
formation - The digital camera - Point operators - Linear filtering - More
neighborhood operators - Fourier transforms - Pyramids and wavelets - Geometric
transformations - Global optimization.

UNIT II FEATURE DETECTION, MATCHING AND SEGMENTATION 9


Points and patches - Edges - Lines - Segmentation - Active contours - Split and
merge - Mean shift and mode finding - Normalized cuts - Graph cuts and energy-
based methods.

UNIT III FEATURE-BASED ALIGNMENT & MOTION ESTIMATION 9


2D and 3D feature-based alignment - Pose estimation - Geometric intrinsic
calibration - Triangulation - Two-frame structure from motion - Factorization -
Bundle adjustment - Constrained structure and motion - Translational alignment -
Parametric motion - Spline-based motion - Optical flow - Layered motion.

UNIT IV 3D RECONSTRUCTION 9
Shape from X - Active rangefinding - Surface representations - Point-based
representations- Volumetric representations - Model-based reconstruction -
Recovering texture maps and albedos.

UNIT V IMAGE-BASED RENDERING AND RECOGNITION 9


View interpolation - Layered depth images - Light fields and Lumigraphs - Environment
mattes - Video-based rendering - Object detection - Face recognition - Instance
recognition - Category recognition - Context and scene understanding - Recognition
databases and test sets.
Course
Outcomes

12
Course Outcomes
Course Outcome (Knowledge Level): Description

CO1 (K2): To understand basic knowledge, theories and methods in image processing and computer vision.

CO2 (K3): To implement basic and some advanced image processing techniques in OpenCV.

CO3 (K3): To apply 2D feature-based image alignment, segmentation and motion estimation.

CO4 (K4): To apply 3D image reconstruction techniques.

CO5 (K5): To design and develop innovative image processing and computer vision applications.

Knowledge Level Description

K6 Evaluation

K5 Synthesis

K4 Analysis

K3 Application

K2 Comprehension

K1 Knowledge

13
CO – PO/PSO
Mapping

14
CO – PO /PSO Mapping
Matrix
CO   PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12   PSO1 PSO2 PSO3
1 3 2 1 1 3

2 3 3 2 2 3

3 3 3 1 1 3

4 3 3 1 1 3

5 3 3 1 1 3

6 2 2 1 1 3

15
UNIT –V IMAGE-
BASED RENDERING
AND RECOGNITION

16
Lecture
Plan

17
Lecture Plan – Unit 5 – IMAGE-BASED RENDERING AND RECOGNITION

Sl. No.  Topic                                        No. of Periods  Proposed Date  Actual Lecture Date  CO   Taxonomy Level  Mode of Delivery

1        View interpolation and Layered depth images  1               -              -                    CO5  K3              Blackboard / ICT Tools
2        Light fields and Lumigraphs                  1               -              -                    CO5  K3              Blackboard / ICT Tools
3        Environment mattes                           1               -              -                    CO5  K4              Blackboard / ICT Tools
4        Video-based rendering                        1               -              -                    CO5  K4              Blackboard / ICT Tools
5        Object detection                             1               -              -                    CO5  K4              Blackboard / ICT Tools
6        Face recognition                             1               -              -                    CO5  K3              Blackboard / ICT Tools
7        Instance recognition                         1               -              -                    CO5  K4              Blackboard / ICT Tools
8        Category recognition                         1               -              -                    CO5  K4              Blackboard / ICT Tools
9        Context and scene understanding              1               -              -                    CO5  K4              Blackboard / ICT Tools
10       Recognition databases and test sets          1               -              -                    CO5  K4              Blackboard / ICT Tools

18
Activity Based
Learning

19
Activity Based Learning
Sl. No. Contents Page No.

1   Recognition Datasets and Test Sets   66

Example Activities on Recognition Databases and Test Sets


1. Data Collection and Annotation
Question: Describe the process of creating a recognition database from scratch. What considerations
should be taken into account for data collection, and how would you ensure the quality and accuracy
of the annotations?
Activity: In groups, outline a plan for building a small recognition database, including category
selection, data collection methods, and annotation strategies.
2. Impact of Dataset Size
Question: How does the size of a recognition database influence the performance of machine
learning models? Discuss the trade-offs between having a larger dataset versus a smaller, more
curated one.
Activity: Research and present case studies where model performance varied significantly based on
dataset size. Analyze the results and discuss implications for model training.
3. Evaluation Metrics
Question: What are some common evaluation metrics used to assess model performance on test
sets? Explain the importance of each metric and when it is most appropriate to use them.
Activity: Create a visual poster that outlines various evaluation metrics (e.g., accuracy, precision,
recall, F1 score) and provides examples of when to use each in the context of object detection and
classification tasks.
4. Handling Class Imbalance
Question: Class imbalance is a common challenge in recognition databases. What strategies can be
employed to mitigate its effects when training models?
Activity: Design an experiment using a sample dataset with class imbalance. Implement at least two
strategies (e.g., data augmentation, class weighting) and compare the model's performance with and
without these strategies.
5. Benchmarking with Standard Datasets
Question: Why is benchmarking against standard datasets important in the field of computer vision?
Discuss the implications of using these benchmarks for both researchers and industry practitioners.
Activity: Choose a well-known recognition database (e.g., ImageNet, COCO) and analyze recent
research papers that report results on that dataset. Summarize the findings and discuss trends in
model performance over time.
20
Lecture Notes –
Unit 5

21
UNIT-5 IMAGE-BASED
RENDERING AND
RECOGNITION

Sl. Contents Page

No. No.

1 View interpolation and Layered depth images 23

2 Light fields and Lumigraphs 26

3 Environment mattes 31

4 Video-based rendering 34

5 Object detection 39
6 Face recognition 45
7 Instance recognition 51

8 Category recognition 55

9 Context and scene understanding 60

10 Recognition databases and test sets 66

22
UNIT V IMAGE-BASED RENDERING AND
RECOGNITION
1. VIEW INTERPOLATION AND LAYERED DEPTH IMAGES

View interpolation is a technique used to synthesize intermediate views of a scene


given two or more input images. The goal is to generate novel viewpoints that provide
the illusion of smooth motion between captured viewpoints. Layered depth images
(LDI) are a way of representing a scene using multiple layers, where each layer
contains color and depth information. This allows for effective view synthesis by handling
occlusions and depth discontinuities in 3D scenes.

1.1 Concept of View Interpolation

View interpolation is based on the premise that given two or more images of the same
scene from different viewpoints, intermediate views can be generated using geometric
and image warping techniques.

Equation for View Interpolation

The fundamental equation for view interpolation can be written as:
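A commonly used linear-blend form, with standard notation assumed here (I_0 and I_1 are the two input views, x_0 and x_1 the corresponding pixel positions of the output pixel x, and t in [0, 1] the interpolation parameter):

I_t(x) = (1 - t)\, I_0(x_0) + t\, I_1(x_1), \qquad t \in [0, 1]

so that t = 0 reproduces the first view and t = 1 the second.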

1.1.1 Depth-based View Interpolation:

To handle depth information, interpolation must also adjust for changes in pixel locations
based on their depth values:
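A sketch of the depth-adjusted version, assuming a per-pixel disparity (inverse depth) map d(x_0) relating the two views: each pixel is first shifted in proportion to its disparity and the shifted views are then blended,

x_t = x_0 + t\, d(x_0), \qquad I_t(x_t) = (1 - t)\, I_0(x_0) + t\, I_1(x_1)

so that nearby surfaces (large disparity) move more between views than distant ones.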

1.2. Layered Depth Images (LDI)

A Layered Depth Image (LDI) stores multiple depth layers at each pixel location,
representing surfaces that appear at different depths along the same viewing ray. This is
useful for handling occlusions in view interpolation, as the multiple layers allow for more
accurate representation of the 3D scene.

Each pixel in an LDI stores a list of depth values and corresponding color values.

Multiple depth layers are useful for handling the occlusion problem because the
frontmost layer may not always be visible in a new view.

LDI Representation:

LDI at each pixel is represented as:
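One common way to write this (the notation below is assumed, not reproduced from the original figure): each pixel (x, y) stores an ordered list of colour/depth layers along its viewing ray,

\mathrm{LDI}(x, y) = \{ (c_1, d_1), (c_2, d_2), \ldots, (c_n, d_n) \}, \qquad d_1 < d_2 < \cdots < d_n

where c_i is the colour and d_i the depth of the i-th surface intersected by the ray.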


In essence, an LDI stores multiple surfaces along each pixel's ray, enabling robust
handling of scene depth and occlusions.

Here’s a simplified diagram that illustrates the concept of view interpolation using layered
depth images

Viewpoint A Viewpoint B

\ /

\ /

\____________________/ <-- Camera views with different perspectives

| | |

Z1 Z2 Z3 <-- Layered depth images with multiple depth layers

This diagram shows how multiple depth layers along the viewing rays from different
viewpoints are used to interpolate the view in between them. The scene information is
stored as multiple layers of depth and color at each pixel.

1.3 Example of View Interpolation Using LDIs

Imagine a scenario where you have two images of a building taken from two different
angles. The images show occlusions, such as parts of the building that are hidden from
one viewpoint but visible from the other. Using LDIs, you store multiple depth values at
each pixel, which allows you to synthesize a new view where previously hidden parts of
the scene become visible as you change the viewpoint.

1. Capture two images from different angles.

2. Construct LDIs by storing color and depth values at multiple layers for each pixel.

3. Use the interpolation formula to blend the images and adjust pixel locations based on
depth to generate an intermediate view.
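A minimal Python/NumPy sketch of this pipeline, simplified to a single depth layer per pixel (a z-buffer stands in for the full multi-layer LDI) and assuming a known horizontal disparity map disp0 for the first view; names and conventions are illustrative:

import numpy as np

def interpolate_view(img0, img1, disp0, t=0.5):
    # img0, img1: H x W x 3 float images (the two captured views)
    # disp0: H x W horizontal disparity of img0 relative to img1 (assumed known)
    # t: interpolation parameter in [0, 1]; 0 returns img0's viewpoint, 1 img1's
    h, w = disp0.shape
    out = np.zeros_like(img0)
    zbuf = np.full((h, w), -np.inf)            # closest-surface buffer per output pixel

    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.round(xs - t * disp0).astype(int)  # forward-warp columns by t * disparity
    valid = (xt >= 0) & (xt < w)

    for y, x, xn in zip(ys[valid], xs[valid], xt[valid]):
        if disp0[y, x] > zbuf[y, xn]:          # nearer surfaces overwrite farther ones
            zbuf[y, xn] = disp0[y, x]
            x1 = int(np.clip(round(x - disp0[y, x]), 0, w - 1))  # matching pixel in img1
            out[y, xn] = (1 - t) * img0[y, x] + t * img1[y, x1]
    return out

# Example: out = interpolate_view(img0, img1, disp0, t=0.5) for two rectified views

In a full implementation, holes left by the forward warp (output pixels that no source pixel maps to) would be filled from the second view or by interpolation, which is exactly where the extra layers of an LDI help.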
1.4 Case Study: The "Tour into the Picture"

One of the classic case studies is the "Tour into the Picture" application where a user
explores a 3D scene using 2D images. This technique was effectively used for generating
smooth transitions between captured images of a real-world scene. The LDIs allow for
rendering of complex scenes with occlusions, and intermediate views can be synthesized
dynamically as the user moves through the virtual space. By using multiple depth layers
at each pixel, this system manages occlusions effectively and provides smooth visual
transitions.

2. LIGHT FIELDS AND LUMIGRAPHS

Light Fields and Lumigraphs are powerful techniques used in computer vision and
computer graphics to capture and render complex visual scenes. They allow for the
realistic reproduction of a scene by capturing how light travels through every point in
space, and how it changes depending on the viewer's perspective. Here's an explanation
of both:

2.1 Light Fields

A light field describes the amount of light traveling in every direction through every point
in space. Essentially, it encodes the light's intensity and color at each point in space and
at every direction. This method allows for the rendering of images from arbitrary
viewpoints without needing to explicitly model the 3D geometry of the scene.

2.1.1 Plenoptic Function:

The concept of the light field is mathematically formalized using the plenoptic function,
which describes the intensity of light rays in a scene as a function of several variables:
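In its full form, the plenoptic function is usually written as a seven-dimensional function

P = P(x, y, z, \theta, \phi, \lambda, t)

where (x, y, z) is the position of the viewer, (\theta, \phi) the direction of the incoming ray, \lambda the wavelength of the light and t time.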

In a simplified form, the light field is typically reduced to a 4D function, assuming the light
is constant in terms of wavelength:
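With a static scene, a fixed wavelength and radiance assumed constant along each ray in free space, this reduces to the familiar two-plane ("light slab") parameterization

L = L(u, v, s, t)

where (u, v) and (s, t) are the intersections of a ray with two parallel reference planes.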

This 4D representation allows for the synthesis of novel views of the scene, essentially
capturing both spatial and angular information about the light.

2.1.2 Capturing a Light Field

Light fields can be captured using:

Light field cameras (plenoptic cameras): These cameras have a microlens array in
front of the sensor, which allows them to capture multiple light rays entering from
different angles. Lytro cameras are an example.

Multi-camera arrays: By arranging many cameras in an array, the light field can be
approximated by taking multiple images from slightly different viewpoints

2.1.3 Rendering with Light Fields

Once a light field is captured, it can be used to render images from any viewpoint within
the captured field. The advantage is that no explicit 3D geometry reconstruction is
needed, and the scene can be rendered directly by sampling the appropriate rays from
the light field data.
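A minimal sketch of this idea, assuming the light field is stored as a regular S x T grid of sub-aperture images: a novel view at a fractional camera position is approximated by bilinearly blending the four nearest captured views (no depth model is used here, and the array layout is an assumption for illustration):

import numpy as np

def render_view(lightfield, s, t):
    # lightfield: array of shape (S, T, H, W, 3), a grid of S x T captured views
    # (s, t): desired, possibly fractional, position inside the camera grid
    S, T = lightfield.shape[:2]
    s0, t0 = int(np.floor(s)), int(np.floor(t))
    s1, t1 = min(s0 + 1, S - 1), min(t0 + 1, T - 1)
    ws, wt = s - s0, t - t0                     # fractional blending weights

    top = (1 - wt) * lightfield[s0, t0] + wt * lightfield[s0, t1]
    bottom = (1 - wt) * lightfield[s1, t0] + wt * lightfield[s1, t1]
    return (1 - ws) * top + ws * bottom

# Example: a 4 x 4 grid of 64 x 64 RGB views, rendered at grid position (1.5, 2.25)
lf = np.random.rand(4, 4, 64, 64, 3)
novel_view = render_view(lf, 1.5, 2.25)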
2.2 Lumigraphs

The Lumigraph is an extension of the light field concept, but it incorporates geometric
information in the scene, providing more flexibility and improving the rendering quality
when handling complex visual phenomena such as occlusions.

2.2.1 Lumigraph Basics

A Lumigraph can be thought of as a hybrid between:

Light fields, which encode the light information for each ray in space, and

Geometry-based representations, where the geometry of the scene is used to


enhance rendering.

Lumigraphs provide an efficient way to store and render a scene, taking advantage of a
coarse geometric model of the scene to better interpolate between views and reduce
visual artifacts.

2.2.2 How a Lumigraph Works

1. Scene Sampling: Like light fields, Lumigraphs capture a dense set of images from
multiple viewpoints around the object or scene.

2. Coarse Geometry: Instead of treating the scene as purely a collection of light rays, a
rough 3D geometric model of the scene is constructed.

3. View Interpolation: When rendering a new viewpoint, the system uses the coarse
geometry to determine how to sample rays from the stored light field, adjusting for
occlusions and depth discontinuities.

2.2.3 Lumigraph Equation

Given a coarse geometry model, Lumigraphs can improve image quality by reducing
distortions using view-dependent textures. The plenoptic function for a Lumigraph
incorporates the geometry of the scene.

2.2.4 Capturing a Lumigraph

Capturing a Lumigraph requires both a camera array for capturing light rays and
some method for building a coarse geometric model of the scene. This can be
done using photogrammetry, structure-from-motion, or depth sensors.

2.2.5 Advantages of Lumigraphs

Lumigraphs can handle more complex scenes with occlusions better than pure
light fields because they use geometry to inform how light rays should be
sampled.

They allow for faster rendering and require less storage than light fields by using
fewer sampled images and leveraging geometric knowledge.

2.2.6 Comparison of Light Fields and Lumigraphs



2.3 Applications of Light Fields and Lumigraphs

1. Virtual Reality (VR) and Augmented Reality (AR): Light fields and Lumigraphs
enable immersive experiences where users can move freely within a scene and view it
from any angle, enhancing realism.

2. Cinematic VR: These techniques are used in VR movies where multiple viewpoints are
necessary to provide a realistic 3D experience.

3. Image-Based Rendering (IBR): Light fields are heavily used in IBR to synthesize
novel views from captured imagery without needing complex 3D modeling.

4. Medical Imaging: In some cases, light fields can be used for 3D visualization of
medical scans, allowing doctors to see internal structures from multiple angles.

5. Cultural Heritage Preservation: Museums and historical sites use light fields and
Lumigraphs to create virtual tours of artifacts or sites, preserving them digitally in a
highly realistic manner.

2.4 Example Case Study: Light Field Displays

A light field display is a type of 3D display that shows different images depending on the
angle from which it's viewed, much like a hologram. By projecting multiple images at
different angles, a light field display can give the illusion of depth, allowing viewers to
see 3D content without special glasses. These displays are based on light field rendering
technology, where the display simulates the behavior of light in a real scene.

How it Works:

1. Multiple views of a scene are captured using light field cameras.

2. These views are then projected on the light field display, where the user perceives
different angles depending on their position relative to the display.

3. As the user moves, the display updates the images to simulate depth and parallax, giving
a 3D experience.

3. ENVIRONMENT MATTES

Environment mattes are a technique used in computer vision and computer graphics
to accurately represent and manipulate the reflections and refractions of an environment
on a transparent or partially transparent object, such as glass or water. They help in
generating realistic images where objects interact with their surroundings through
complex light behavior like reflection, refraction, and transparency.

Environment mattes are particularly useful in applications such as visual effects for
movies, augmented reality, and any scenario where objects must seamlessly blend into
or reflect their surroundings.

3.1 Basic Idea

An environment matte captures how an object reflects or refracts its surroundings.


The goal is to represent how light from the surrounding environment interacts with the
object, so when the object is composited into a scene, it maintains realistic lighting,
reflection, and transparency effects.

For example, imagine placing a glass bottle in front of a colorful background. The
environment matte would capture how the background is seen through the glass,
including:

• Reflections: Parts of the background reflected off the glass surface.

• Refractions: Distorted parts of the background seen through the transparent parts of
the glass.

• Transparency: How much of the background shows through different regions of the
object.

The environment matting process is usually modeled using the matting equation, which
separates the object from the background. For environment mattes, we modify this
concept to handle transparency, reflection, and refraction. The key idea is to express
how a pixel in the final image is a combination of the object’s inherent color (matte
color) and how it interacts with its environment.
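The standard matting (compositing) equation referred to here expresses each observed pixel as a blend of foreground and background colour:

C = \alpha F + (1 - \alpha) B

where C is the composite colour, F the foreground (object) colour, B the background colour and \alpha the opacity. Environment matting extends this, written here in a simplified schematic form, with an extra term describing the light the object gathers from its surroundings by reflection and refraction:

C = \alpha F + (1 - \alpha) B + \Phi(E)

where E is an environment map and \Phi is the object's reflective/refractive transfer, estimated from images taken against known calibration backgrounds.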

3.2 Capturing Environment Mattes

To create environment mattes, special techniques are used to capture how light from the
environment interacts with the object. This process involves:

1. Photographing the Object with Different Backgrounds: Multiple images of the


object are taken against different known backgrounds. These backgrounds are often
chosen to provide different lighting or color variations.

2. Analyzing Reflections and Refractions: The images are analyzed to determine how
different parts of the object reflect or refract the background. This helps in creating a
model of how the object would interact with any arbitrary environment.

3. Matte Extraction: The matte is extracted, representing both the transparency of the
object and how it alters the appearance of the background due to reflections and
refractions.

3.3 Applications of Environment Mattes

Environment mattes are particularly useful in film and visual effects, augmented
reality, and virtual production. Some specific use cases include:

Compositing Transparent Objects: In movies, environment mattes are used to


composite transparent objects like glass, bottles, or water into different scenes. For
example, a glass object composited into a scene would reflect and refract the surroundings
correctly, providing a more realistic look.

Visual Effects in Movies: When creating special effects involving semi-transparent


objects, environment mattes ensure that reflections and refractions are accurately
portrayed. This is crucial when creating CGI objects that need to blend with live-action
footage.

Augmented Reality (AR): In AR applications, environment mattes are used to integrate


virtual objects with real-world environments. For example, a virtual glass or water surface
placed into a real scene needs to reflect and refract the actual surroundings to look
believable.

Product Visualization: For product design and visualization, environment mattes allow
for accurate rendering of materials like glass, plastic, and other semi-transparent objects
under different lighting conditions. This is useful for industries like automotive design,
where the reflection of the car's surface must be realistic.

3.4 Example: Compositing a Glass Bottle into a Scene

Consider a scenario where a glass bottle needs to be composited into a virtual scene
with realistic reflections and refractions.

1. Capture the Environment Matte: Multiple photos of the glass bottle are taken in front
of different patterned backgrounds. These patterns help in identifying how light interacts
with the bottle's surface.

2. Analyze Reflections and Refractions: By comparing how the bottle affects each
background, the environment matte captures how the bottle reflects light from the front
and refracts light from behind.

3. Apply the Matte to a New Scene: Once the environment matte is computed, the
bottle can be placed into any new scene. The matte will ensure that the bottle reflects
and refracts the new background correctly, leading to realistic integration.

3.5 Advantages of Environment Mattes

Accurate Reflections and Refractions: Environment mattes enable detailed control


over how objects interact with their surroundings, making reflections and refractions look
natural.

Realism in Visual Effects: By capturing complex light interactions, environment mattes


provide high-quality compositing, particularly for transparent and semi-transparent
objects.

Efficiency: Environment mattes provide a relatively efficient way to composite complex


transparent objects without needing to fully simulate the physical behavior of light in 3D
space.

4. VIDEO-BASED RENDERING

Video-Based Rendering (VBR) is a technique in computer vision and computer


graphics that involves generating novel views of a scene from multiple input videos
rather than relying on complex 3D models. This approach is commonly used to produce
realistic animations and virtual environments where traditional 3D modeling might be too
complex or time-consuming. The goal of VBR is to synthesize new frames or views based
on existing video footage.

4.1. What is Video-Based Rendering?

In Video-Based Rendering (VBR), the basic idea is to leverage video sequences


captured from different viewpoints to generate new views or manipulate existing ones.
By capturing visual data (images and videos) from various perspectives, it becomes
possible to interpolate or extrapolate new frames that weren't explicitly captured. This
allows for effects like:

• Virtual camera movement in scenes that have only been filmed from a few angles.

• Synthesis of novel perspectives in multi-camera setups.

• Viewpoint interpolation for smooth transitions between camera angles.



The primary challenge in VBR is how to use the captured video frames to generate realistic
and consistent new frames, especially when dealing with complex scenes involving
occlusions, moving objects, and changing lighting conditions.

4.2 Key Techniques in Video-Based Rendering

There are several key techniques and methods used in VBR to achieve realistic scene
reconstruction and rendering:

a) View Interpolation

View interpolation is the process of generating an intermediate view between two or


more captured video frames. This is done by blending multiple video frames based on
the geometry of the scene and the camera positions.

The view interpolation equation is typically expressed as:
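In its simplest form this is the same linear blend used in Section 1.1 (notation assumed as before):

I_t(x) = (1 - t)\, I_0(x_0) + t\, I_1(x_1), \qquad t \in [0, 1]

where I_0 and I_1 are the two captured frames and x_0, x_1 the corresponding pixel positions in each frame.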

Depth information or camera geometry is often required to determine the correct


alignment of pixels in the interpolated view.

b) Layered Depth Images (LDI)

Layered Depth Images (LDIs) are an extension of view interpolation where the scene
is represented as multiple layers of depth information for each pixel. Each pixel stores
not only its color but also multiple depth values corresponding to different surfaces along
the same ray.

In VBR, LDIs help handle occlusions and depth discontinuities by allowing the system to
track multiple layers of a scene. This improves the accuracy of interpolating new views,
particularly in complex environments where objects may block parts of the scene.

c) Optical Flow

Optical flow is the pattern of apparent motion between two consecutive video frames.
It is used in VBR to track the motion of objects between frames, enabling the system to
accurately interpolate new views in dynamic scenes where objects are moving.

The optical flow equation is:
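The usual brightness-constancy form of this constraint is

I_x u + I_y v + I_t = 0

where I_x, I_y and I_t are the spatial and temporal derivatives of image intensity and (u, v) is the flow (motion) vector at each pixel.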

By calculating the optical flow, VBR systems can predict how pixels will move in
subsequent frames, allowing for smooth rendering of dynamic scenes.

d) Multi-View Stereo (MVS)

Multi-View Stereo is used in video-based rendering to extract depth information from


multiple camera views. This allows for the creation of 3D models of the scene from video
footage, which can be used to assist in generating new views or blending multiple
viewpoints.

MVS reconstructs a 3D model by aligning and correlating different views from several
cameras, determining the depth for each pixel. The more views that are available, the
better the depth estimation.

4.3 Applications of Video-Based Rendering

VBR is applied in several areas, particularly where realism and cost-effectiveness are
essential. Some of the key applications include:

a) Virtual Reality (VR) and Augmented Reality (AR)

VBR is widely used in virtual reality and augmented reality to provide users with
immersive experiences by allowing free movement within a captured video scene. For
instance, in cinematic VR, users can view a pre-recorded scene from different
perspectives as if they were physically present, by synthesizing new views from recorded
video data.

b) Sports and Event Broadcasting

Video-based rendering is also used in sports broadcasting for creating replays with
free-moving cameras. For example, the famous "bullet-time" effect, which was
popularized in movies like The Matrix, allows the camera to move around the action in
slow motion. This effect can be achieved by capturing the scene from multiple angles
and using VBR techniques to interpolate intermediate views.

c) Film and Visual Effects

In the film industry, VBR allows filmmakers to add new camera angles or viewpoints
after shooting a scene, without having to reshoot it. This can be useful for special
effects, where scenes are captured in a controlled environment (e.g., a green screen)
and then new views are generated in post-production.

d) 3D Reconstruction and Simulation

VBR is used in 3D reconstruction from video footage, particularly when scanning


environments for simulation or analysis. This technique is often used in archaeology,
architectural preservation, and cultural heritage projects to digitally reconstruct
sites and artifacts.

e) Interactive Media and Video Games

In interactive media and video games, VBR allows for real-time manipulation of
scenes. For example, a video game could use VBR to simulate realistic movement within
a scene by dynamically interpolating new frames based on the player's viewpoint.

4.4 Challenges in Video-Based Rendering

While VBR has many advantages, it also comes with several challenges:

a) Depth and Occlusion Handling

Handling depth discontinuities and occlusions (when one object blocks another) is
difficult, especially in dynamic scenes. If the depth information is inaccurate,
interpolating new views can lead to artifacts such as ghosting or incorrect object
alignment.

b) Real-Time Performance

Generating new frames from video footage in real time requires a lot of computational
power, particularly if the scene involves complex movements or lighting conditions.
Ensuring real-time performance while maintaining high-quality results is challenging.

c) Dynamic Scenes

When objects or people in the scene are moving, VBR must correctly capture and
interpolate motion. Sudden movements can be hard to interpolate, and tracking such
movements using optical flow can sometimes produce incorrect or distorted results.

d) Lighting Changes

Handling lighting changes in dynamic scenes is another challenge. Variations in


lighting across frames can make it difficult to maintain consistency when generating new
views.

4.5 Example Case Study: Free-Viewpoint Video (FVV)

One well-known application of VBR is Free-Viewpoint Video (FVV), where a user can
move a virtual camera freely within a recorded scene. This is particularly popular in
sports broadcasting and immersive media.

Steps in FVV using VBR:

1. Multiple Cameras Capture: The scene is recorded using an array of cameras from
different viewpoints.

2. Depth Estimation: Depth maps are generated for each frame using stereo matching or
other techniques.

3. View Interpolation: Using the depth information and the captured frames,
intermediate views are generated for virtual cameras positioned between the real
cameras.

4. User Interaction: The user can then explore the scene from different angles,
seamlessly moving between recorded viewpoints.

In sports broadcasting, this is used to provide 3D replays where the audience can
rotate around players or action moments and view the scene from any angle, even those
not covered by actual cameras.

5. OBJECT DETECTION

Object detection is a computer vision technique used to identify and locate objects
within an image or video. It involves both classification (what is in the image) and
localization (where the object is located). Unlike image classification, which assigns a
single label to an image, object detection provides bounding boxes around multiple
objects in the scene and identifies their categories.

Object detection is a key element in many applications such as autonomous driving,


surveillance, robotics, image annotation, and augmented reality.

5.1 How Object Detection Works

Object detection combines two tasks:

1. Classification: Determining what type of object is present in an image.

2. Localization: Determining where the object is located in the image, usually represented
as a bounding box around the object.

The output of an object detection model consists of:

• Bounding boxes: A rectangle surrounding each detected object, defined by coordinates


(x,y,w,h), where (x,y) is the top-left corner of the box, and w and h are the width and
height.

• Class labels: The category (e.g., car, person, dog) for each detected object.

• Confidence scores: A probability indicating how certain the model is about the
presence of an object in a given bounding box.
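A small Python/OpenCV sketch illustrating this output format; the file names and detection values below are made up for illustration:

import cv2

# Hypothetical detections in the (x, y, w, h) format described above,
# each with a class label and a confidence score.
detections = [
    {"box": (40, 60, 120, 80),  "label": "car",    "score": 0.91},
    {"box": (200, 30, 50, 130), "label": "person", "score": 0.84},
]

image = cv2.imread("street.jpg")          # hypothetical input image

for det in detections:
    x, y, w, h = det["box"]
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)   # draw the bounding box
    caption = "{} {:.2f}".format(det["label"], det["score"])
    cv2.putText(image, caption, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)

cv2.imwrite("street_detections.jpg", image)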

5.2 Popular Object Detection Models and Methods

There are various techniques for object detection, categorized broadly into traditional
methods and deep learning-based methods. In recent years, deep learning
approaches have become the dominant method due to their superior accuracy and ability
to handle complex tasks.

a) Traditional Methods

Traditional object detection methods typically relied on features like HOG (Histogram
of Oriented Gradients) or SIFT (Scale-Invariant Feature Transform) and
classifiers like Support Vector Machines (SVM). Some key traditional methods
include:

• Sliding Window: This involves sliding a window over the image and applying a
classifier to each sub-image to determine if it contains an object.

• Selective Search: A method used to propose regions of interest (ROI) in the image
that likely contain objects. These regions are then classified using a classifier.

These methods, while effective, were often computationally expensive and not very
flexible for detecting different sizes and types of objects.

b) Deep Learning-Based Methods

Deep learning methods, especially those based on Convolutional Neural Networks


(CNNs), have revolutionized object detection. The most common deep learning
architectures for object detection include:

1. R-CNN (Region-based CNN) Family:

• R-CNN (2014): The Region-based Convolutional Neural Network (R-CNN) method uses
selective search to propose regions in the image and applies a CNN to classify them.
However, R-CNN was slow due to the separate steps for region proposal and classification.

• Fast R-CNN (2015): Improves on R-CNN by sharing CNN features across different region
proposals, significantly speeding up the process.

• Faster R-CNN (2016): Introduces the Region Proposal Network (RPN), which directly
learns to propose regions of interest, making it much faster and more accurate.

2. YOLO (You Only Look Once):

YOLO (2016): YOLO is a real-time object detection system that formulates object
detection as a single regression problem. It divides the image into a grid and predicts
bounding boxes and class probabilities for each grid cell in one pass through the
network, making it extremely fast.

• YOLOv3, YOLOv4, YOLOv5: Successive versions of YOLO improve accuracy while


maintaining real-time speed. YOLO is widely used in applications requiring fast detection.

3. SSD (Single Shot Multibox Detector):

SSD (2016): Similar to YOLO, SSD detects objects in a single shot by dividing the image
into a grid and predicting both object categories and bounding boxes at each grid
location. It handles objects at multiple scales and is computationally efficient, balancing
speed and accuracy.

4. RetinaNet (2017):

Known for its Focal Loss function, which helps handle the class imbalance issue in object
detection by focusing on hard-to-classify objects, RetinaNet improves the accuracy of
detecting small and difficult objects while maintaining high speed.

5.3 Challenges in Object Detection

Object detection is a challenging task due to several factors:

• Scale Variation: Objects can appear in different sizes within the same image (e.g., a
faraway car versus a close-up car).

• Occlusion: Objects may be partially obscured by other objects or parts of the scene.

• Deformation: Objects like humans and animals can change their shape, making them
harder to detect.

• Class Imbalance: Some objects (e.g., people or cars) may appear frequently, while
others (e.g., rare animals) are less common, making training difficult.

• Real-Time Performance: Achieving high accuracy while maintaining real-time


detection is critical in applications like autonomous driving and robotics.

5.4 Metrics for Evaluating Object Detection

Object detection models are evaluated using various metrics, the most common being:

• Intersection over Union (IoU): Measures the overlap between the predicted
bounding box and the ground truth. It is calculated as the area of overlap divided by the
area of union between the predicted and ground truth boxes.

A prediction is considered correct if the IoU is above a certain threshold, often set at 0.5
(IoU > 0.5).

• Precision and Recall: These are standard classification metrics that evaluate how
many of the predicted bounding boxes are correct (precision) and how many actual
objects were successfully detected (recall).

• Mean Average Precision (mAP): A popular metric that averages the precision scores
across all object categories and IoU thresholds.
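A short Python function for the IoU computation described above, using the same (x, y, w, h) box convention as Section 5.1:

def iou(box_a, box_b):
    # Boxes are (x, y, w, h) with (x, y) the top-left corner.
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh

    # Intersection rectangle (zero if the boxes do not overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 10, 10)))   # 25 / 175, i.e. about 0.14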

5.5 Applications of Object Detection

Object detection has a wide range of practical applications across many industries:

a) Autonomous Vehicles:

Object detection plays a crucial role in autonomous driving by detecting pedestrians,


vehicles, road signs, and obstacles in real-time to ensure safe navigation.

b) Surveillance and Security:

In video surveillance, object detection is used to automatically detect intruders,


suspicious objects, or unusual activities in real-time from CCTV footage.

c) Healthcare:

In the medical field, object detection is used in diagnostic imaging (e.g., identifying
tumors or lesions in medical scans) to assist radiologists.

d) Retail and Inventory Management:

Object detection helps in automated checkouts, shelf monitoring, and inventory


management by detecting products in stores and ensuring they are correctly stocked.

e) Augmented Reality (AR):

In AR applications, object detection enables the system to recognize real-world objects


and overlay relevant information or virtual objects on top of them.

f) Robotics:

Robots rely on object detection to interact with objects in their environment, such as
picking and placing items or navigating through complex spaces.

g) Image Search and Annotation:

Search engines and photo management systems use object detection to enable users to
search for images containing specific objects (e.g., searching for images of dogs or cars
in a photo library).

5.6 Example of Object Detection: YOLO

Let’s take an example of how YOLO (You Only Look Once) works for detecting objects in
an image.

1. Input Image: The input is an image of a street scene with multiple objects like cars,
people, and traffic lights.

2. Dividing the Image into a Grid: YOLO divides the input image into a grid, say 13x13.
Each grid cell is responsible for predicting whether it contains an object.

3. Bounding Box Prediction: For each grid cell, YOLO predicts several bounding boxes
along with their confidence scores and class probabilities.

4. Non-Maximum Suppression: After generating bounding boxes, YOLO applies non-


maximum suppression to eliminate duplicate detections of the same object and retains
only the highest confidence bounding box.

5. Final Output: The final result consists of bounding boxes drawn around detected
objects (e.g., cars, people, traffic lights), each labeled with the object category and
confidence score.
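A minimal NumPy sketch of the non-maximum suppression step (step 4); the boxes are assumed here to be given as (x1, y1, x2, y2) corners rather than the (x, y, w, h) format used earlier:

import numpy as np

def non_max_suppression(boxes, scores, iou_thresh=0.5):
    # boxes: N x 4 array of (x1, y1, x2, y2); scores: N confidence values
    order = np.argsort(scores)[::-1]           # process highest-confidence boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # IoU of the kept box with every remaining box
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou < iou_thresh]          # drop boxes that overlap the kept one
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(non_max_suppression(boxes, scores))       # keeps box 0 (box 1 is suppressed) and box 2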

6. FACE RECOGNITION

Face recognition is a biometric technology that identifies or verifies a person from an


image or video based on the person's facial features. It is a subset of computer vision
and pattern recognition, widely used in various applications such as security systems,
smartphones, law enforcement, surveillance, and social media.

Face recognition can perform two major tasks:

1. Face Verification: Verifying whether a given face matches a specific identity (one-to-
one matching).

2. Face Identification: Identifying a face from a database of known faces (one-to-many


matching).

6.1 How Face Recognition Works

Face recognition systems typically consist of the following steps:

a) Face Detection

The first step in face recognition is detecting the presence of a face in an image or video.
This step isolates the face region from the background or other objects in the scene.
Common face detection techniques include:

• Haar Cascades: A traditional method that uses a cascade of classifiers trained with
positive and negative images.

• HOG (Histogram of Oriented Gradients): A feature descriptor that detects objects


by capturing gradients in the image.

• Deep Learning Methods: Modern face detectors like MTCNN (Multi-task Cascaded
Convolutional Networks) or YOLO for real-time detection.

After detecting the face, the system aligns the face to a standard orientation (e.g.,
frontal view) for consistent recognition.

b) Face Preprocessing

Before recognition, the face image is typically preprocessed to enhance the quality of
facial features. Preprocessing may include:

• Grayscale conversion: Simplifying the image by converting it to grayscale.

• Normalization: Standardizing the face in terms of size, brightness, and contrast.

• Alignment: Correcting the face's orientation (e.g., rotation) so that the eyes and mouth
are in the same position for all images.

c) Feature Extraction

In this step, a unique feature vector (also called an embedding) is generated for each
face image. This vector represents the distinctive characteristics of a face, such as the
distance between the eyes, the shape of the nose, and other facial features.

Two main approaches for feature extraction are:

• Traditional Approaches: Methods like Eigenfaces, LBP (Local Binary Patterns),


and Fisherfaces rely on manual feature engineering. These methods work well in
controlled environments but are less robust to changes in lighting, pose, and expression.

• Deep Learning-Based Approaches: Modern systems use Convolutional Neural


Networks (CNNs) to automatically extract features from the face image. DeepFace,
FaceNet, and VGG-Face are popular models that generate highly discriminative
embeddings for face recognition.

The CNN models are trained on large datasets of labeled face images, allowing them to
learn complex, high-dimensional facial feature representations.

d) Face Matching

Once the feature vectors are extracted, face recognition is performed by comparing
these vectors. The two most common tasks are:

• Verification: Compare two feature vectors using a distance metric (such as Euclidean
distance or Cosine similarity) to determine if they belong to the same person.

• Identification: Compare a feature vector against a database of stored vectors


(embeddings) to find the closest match.

For verification, the distance between the two embeddings should be below a certain
threshold for the system to confirm that they represent the same person. For
identification, the embedding of the detected face is compared with all stored
embeddings, and the system returns the identity of the closest match.
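A small NumPy sketch of both tasks, assuming the embeddings have already been produced by a model such as FaceNet; the 128-dimensional vectors and the threshold value below are purely illustrative:

import numpy as np

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def verify(emb1, emb2, threshold=1.0):
    # Verification: same person if the embedding distance is below a chosen threshold
    return euclidean_distance(emb1, emb2) < threshold

def identify(query_emb, database):
    # Identification: return the enrolled identity whose embedding is closest to the query
    best_id, best_dist = None, float("inf")
    for identity, emb in database.items():
        d = euclidean_distance(query_emb, emb)
        if d < best_dist:
            best_id, best_dist = identity, d
    return best_id, best_dist

# Illustrative 128-dimensional embeddings
db = {"alice": np.random.rand(128), "bob": np.random.rand(128)}
query = db["alice"] + 0.01 * np.random.rand(128)     # a slightly perturbed copy of "alice"
print(identify(query, db))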

6.2 Popular Face Recognition Models and Architectures

Several face recognition models and architectures are widely used due to their high
accuracy and robustness:

a) DeepFace

• Developed by Facebook, DeepFace was one of the earliest successful deep learning-
based face recognition models.

• It uses a deep CNN trained on a large dataset of faces and achieves near-human
accuracy by generating 3D face models for alignment and feature extraction.

b) FaceNet

• Developed by Google, FaceNet introduced the concept of triplet loss, where the goal is
to minimize the distance between an anchor image and a positive image (same person)
while maximizing the distance between the anchor and a negative image.

• FaceNet outputs a 128-dimensional embedding for each face, which is used for face
matching.

c) VGG-Face

• Developed by the Visual Geometry Group at the University of Oxford, VGG-Face is


based on the popular VGGNet architecture. It uses deep CNNs to extract facial features
and is trained on millions of images.

• VGG-Face is known for its ability to handle variations in pose, lighting, and facial
expressions.

d) ArcFace

• ArcFace is a state-of-the-art face recognition model that improves on previous methods


by using an additive angular margin loss to enhance the discriminative power of
facial embeddings.

• ArcFace provides high accuracy in both verification and identification tasks, making it
suitable for large-scale face recognition systems.

6.3 Challenges in Face Recognition

Despite its widespread success, face recognition faces several challenges:

a) Variations in Pose and Expression

Faces can look very different from different angles or when a person changes their
expression. Handling pose variations (e.g., profile view vs. frontal view) is particularly
challenging. Solutions like 3D face modeling or robust training datasets help mitigate this
issue.

b) Lighting Conditions

Changes in lighting can dramatically affect the appearance of facial features. Deep
learning models trained on large, diverse datasets are more robust to varying lighting
conditions compared to traditional methods.

c) Aging

As a person ages, their facial features change over time. Long-term recognition systems
must account for these gradual changes, often requiring regular updates to the face
embeddings.

d) Occlusion

Occlusion occurs when parts of the face are hidden by objects like glasses, masks, or
hair. Some modern face recognition models use partial matching techniques to
recognize faces even when occluded.

e) Adversarial Attacks

Face recognition systems can be vulnerable to adversarial attacks, where subtle


modifications to a face image (imperceptible to humans) can fool the recognition model
into making incorrect predictions.

6.4 Applications of Face Recognition

Face recognition is widely used in various industries, including:

a) Security and Access Control

• Biometric authentication using face recognition is commonly used to unlock


smartphones (e.g., Apple's Face ID) or control access to secure areas.

• Surveillance systems use face recognition to monitor public spaces and identify
persons of interest, such as suspects or missing individuals.

b) Law Enforcement

Law enforcement agencies use face recognition to match suspects against a database of
known criminals. It is also used in forensic analysis to identify people from CCTV footage
or other media.

c) Smartphones and Devices

Most modern smartphones and laptops use face recognition for user authentication.
This biometric method provides a secure and convenient alternative to passwords or
fingerprint recognition

d) Social Media and Tagging

Platforms like Facebook and Google Photos use face recognition to automatically
tag people in images. Face recognition enables users to organize their photo libraries
and search for images based on the individuals in them.

e) Retail and Marketing

In the retail industry, face recognition is used to analyze customer behavior, personalize
shopping experiences, or even provide targeted advertising based on the recognition of
frequent customers.

f) Healthcare

Face recognition can be applied in patient identification and monitoring, especially in


remote healthcare settings. For instance, it can be used to identify patients suffering
from certain diseases or conditions where facial patterns change.

6.5 Example of Face Recognition: FaceNet

Let’s break down how FaceNet, a state-of-the-art deep learning model for face
recognition, works:

1. Training: FaceNet is trained using triplet loss, where it learns to map images of the
same person close to each other in the embedding space while mapping images of
different people far apart.

2. Face Embedding: For each face, FaceNet generates a 128-dimensional embedding,


which represents the unique characteristics of the face.

3. Verification: To verify if two faces belong to the same person, the system compares
their embeddings using a distance metric like Euclidean distance. If the distance is
below a certain threshold, the faces are considered to belong to the same individual.
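For reference, the triplet loss mentioned in step 1 has the standard form (embedding function f and margin m assumed):

\mathcal{L} = \max\left(0,\; \|f(a) - f(p)\|_2^2 - \|f(a) - f(n)\|_2^2 + m\right)

where a is the anchor image, p a positive image of the same person and n a negative image of a different person; minimizing it pulls matching faces together and pushes non-matching faces apart in the embedding space.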

7. INSTANCE RECOGNITION

Instance recognition refers to the task of recognizing specific instances of an object


or entity within an image, rather than just recognizing the general object category (e.g.,
not just a car but a particular make and model of a car). It is used to identify specific
occurrences of objects, like recognizing a particular person in a crowd or detecting a
specific product on a shelf.

Instance recognition differs from object recognition in that the latter focuses on
identifying the general class or type of an object (e.g., dog, car, person), while instance
recognition identifies unique instances within a class (e.g., a specific breed of dog or a
specific license plate).

7.1 How Instance Recognition Works

Instance recognition consists of multiple key steps:

1. Feature Extraction: The system extracts distinctive features from the image that can
uniquely identify the object. These features might include keypoints (e.g., corners,
edges) or descriptors that summarize the object’s texture, color, or shape.

2. Matching: Once the features are extracted, they are compared to a pre-built
database of known instances. The system uses a similarity measure (like Euclidean
distance) to find the closest match between the detected features and those stored in
the database.

3. Pose Estimation (optional): If the object in the image is not perfectly aligned (rotated
or viewed from a different angle), some systems estimate the pose and adjust the
recognition accordingly. This is common in applications like augmented reality or 3D
model matching.

4. Instance Recognition Output: The system outputs the specific identity of the
detected instance, often with a confidence score that reflects how certain the system is
about the match.
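A minimal OpenCV sketch of steps 1 and 2 using ORB features (a free alternative to SIFT/SURF); the file names, distance cut-off and match-count threshold are illustrative assumptions:

import cv2

query = cv2.imread("shelf_photo.jpg", cv2.IMREAD_GRAYSCALE)             # image to search in
reference = cv2.imread("product_reference.jpg", cv2.IMREAD_GRAYSCALE)   # known instance

orb = cv2.ORB_create(nfeatures=1000)
kp_q, des_q = orb.detectAndCompute(query, None)        # keypoints + binary descriptors
kp_r, des_r = orb.detectAndCompute(reference, None)

# Brute-force matching with Hamming distance (appropriate for binary ORB descriptors)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_q, des_r), key=lambda m: m.distance)

# Simple decision rule: enough good matches means the specific instance is present
good = [m for m in matches if m.distance < 50]
print("Instance recognised" if len(good) > 20 else "No confident match")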

7.2 Popular Instance Recognition Techniques

Instance recognition can be achieved using various methods, ranging from traditional
computer vision techniques to modern deep learning-based approaches.

a) Traditional Methods

• SIFT (Scale-Invariant Feature Transform): SIFT is a feature extraction technique


that detects keypoints and generates descriptors to match specific instances of an object,
regardless of changes in scale, rotation, or partial occlusion.

• SURF (Speeded Up Robust Features): SURF is a faster alternative to SIFT that also
detects keypoints and uses descriptors for instance matching.

• Bag of Visual Words (BoVW): This method works by quantizing keypoint descriptors
into visual words and using them for image matching.

These methods are widely used for landmark recognition, logo detection, and
product identification, where distinct visual features help identify specific instances.

b) Deep Learning-Based Methods

• CNN-based Embeddings: In deep learning, Convolutional Neural Networks


(CNNs) are used to extract deep features from images. These deep features capture
more complex patterns and allow for more robust instance recognition under variations
in lighting, viewpoint, and occlusion.

• Siamese Networks: Siamese networks consist of two or more identical CNNs sharing
weights. They are used to compute embeddings for two input images and then compare
them to see if they represent the same instance.

• Region-based Methods (R-CNN, Fast R-CNN): These models detect specific objects
or regions of interest and can also be used for instance recognition by further refining
the identification process with a secondary matching step.

7.3 Applications of Instance Recognition

a) Product Recognition in Retail

In retail settings, instance recognition can identify specific products on store shelves for
inventory management or for customer applications like visual search (e.g., identifying
a particular brand or model of a shoe or clothing item).

b) Facial Recognition

In facial recognition, the goal is to identify specific individuals in a crowd or in a


surveillance setting. This is a form of instance recognition where the system matches a
person’s facial features to a database of known faces.

c) Logo Detection

Instance recognition is widely used in logo detection to identify specific brands in


images or videos. This has applications in brand monitoring and advertising
analytics.

d) Landmark Recognition

Systems can recognize specific landmarks such as buildings, monuments, or natural


features. For example, Google’s Image Search allows users to take a picture of a
famous landmark, and the system identifies its exact name and location.

e) Augmented Reality (AR)

In AR, instance recognition allows virtual objects to be overlaid on top of specific real-
world objects (e.g., tracking a product or an image in real-time to display interactive
content).

f) License Plate Recognition (LPR)

Instance recognition is used to detect and recognize specific license plates from
images or video frames. This is crucial for applications like traffic law enforcement,
parking management, and vehicle tracking.

g) Robotics and Industrial Applications

Robots that operate in complex environments rely on instance recognition to detect and
interact with specific objects (e.g., identifying specific tools in a manufacturing line).

7.4 Challenges in Instance Recognition

Instance recognition, like any computer vision task, has its own set of challenges:

• Intra-Class Variability: Objects belonging to the same category can have subtle
differences that make it difficult to recognize specific instances.

• Viewpoint Variation: Objects may appear different when viewed from different angles,
which requires the system to generalize across viewpoints.

• Occlusion: Parts of the object may be hidden or blocked by other objects, making
recognition harder.

• Lighting and Environmental Conditions: Changes in lighting, reflections, and


shadows can alter the appearance of objects, affecting the accuracy of instance
recognition.

• Real-Time Performance: For applications like surveillance or autonomous driving,


real-time instance recognition is essential, but achieving it while maintaining high
accuracy can be computationally challenging.

7.5 Example of Instance Recognition: Product Recognition in Retail

Scenario: A customer uses a smartphone app to take a photo of a specific brand of cereal
in a store. The system needs to recognize this particular brand (and not just any cereal) to
provide information about it, such as its price, availability, and nutritional facts.

Steps:

1. Image Capture: The customer takes a photo of the cereal box.

2. Feature Extraction: The system uses a CNN to extract features from the image,
focusing on distinctive elements like the brand logo, packaging design, and text.

3. Database Matching: These features are compared to a database of known product


images, where each product has its own unique feature representation.

4. Instance Recognition: The system identifies the exact product based on the closest
match in the database and provides the customer with relevant product information.

This instance recognition system would need to handle variations in how the photo was
taken (e.g., different angles, lighting conditions, or partial occlusion by other products).
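A minimal Python sketch of the matching step is given below. It assumes that feature vectors for the store's products have already been extracted (random numbers stand in for real CNN descriptors here), and it simply returns the database entry with the highest cosine similarity to the query; the product names and the 128-dimensional descriptor size are illustrative, not taken from any real system.

import numpy as np

def cosine_similarity(query, database):
    """Cosine similarity between one query vector and each row of a database matrix."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    return db @ q

# Hypothetical product database: one stored descriptor per known cereal box.
product_names = ["BrandA Corn Flakes", "BrandB Muesli", "BrandC Oat Rings"]
database = np.random.rand(3, 128)                    # stand-ins for CNN features

query = database[0] + 0.05 * np.random.rand(128)     # noisy photo of product 0

scores = cosine_similarity(query, database)
best = int(np.argmax(scores))
print(f"Best match: {product_names[best]} (similarity {scores[best]:.3f})")

In a deployed system the descriptors would come from the CNN mentioned in step 2, and an approximate nearest-neighbour index would typically replace this brute-force comparison once the database grows large.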

8. CATEGORY RECOGNITION

Category recognition in computer vision refers to the task of identifying the class or
type of an object in an image, without distinguishing between specific instances of that
object. It focuses on recognizing objects as belonging to broad categories such as "cat,"
"car," "tree," or "person," rather than identifying a particular cat or a specific car.
Category recognition is also known as object classification or object recognition.

While instance recognition aims to identify specific occurrences of an object, such as a


particular car with a given license plate, category recognition only needs to recognize
that the object belongs to the category "car."

8.1 How Category Recognition Works

Category recognition generally follows a standard pipeline that involves the following
steps:

a) Image Preprocessing

Before category recognition, the input image is typically preprocessed to enhance the features necessary for recognition (a short code sketch of these steps follows the list):

• Resizing: Images are resized to a fixed size to make them uniform for input to a
recognition model.

• Normalization: The pixel values are normalized to a common range (e.g., [0, 1] or [-1,
1]).

• Data Augmentation: Techniques like flipping, rotating, cropping, and scaling the image
can help improve the robustness of the recognition model by exposing it to varied
appearances of objects.
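A short sketch of these preprocessing steps, written with the torchvision transforms API (PyTorch, torchvision and Pillow are assumed to be installed), is shown below; the synthetic image, the 224x224 target size and the normalization constants are placeholder choices rather than values required by any particular model.

from PIL import Image
import torchvision.transforms as T

# A synthetic 640x480 RGB image stands in for a real photograph.
img = Image.new("RGB", (640, 480), color=(120, 180, 60))

# Typical pipeline: resize, augment, convert to a tensor, normalize.
preprocess = T.Compose([
    T.Resize((224, 224)),           # uniform input size for the recognition model
    T.RandomHorizontalFlip(p=0.5),  # simple augmentation used during training
    T.ToTensor(),                   # scales pixel values to [0, 1]
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # roughly [-1, 1]
])

x = preprocess(img)
print(x.shape, float(x.min()), float(x.max()))   # torch.Size([3, 224, 224]) ...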

b) Feature Extraction

Feature extraction is one of the most important steps in category recognition. Here,
distinctive patterns of the object are extracted from the image, such as texture, edges,
shapes, and colors.

Common methods for feature extraction include:

• Traditional Methods: These include SIFT (Scale-Invariant Feature Transform),


HOG (Histogram of Oriented Gradients), and SURF (Speeded-Up Robust
Features). These techniques detect key features in an image and extract descriptors
that help in distinguishing between categories.

• Deep Learning-Based Methods: In modern category recognition systems,


Convolutional Neural Networks (CNNs) automatically extract features from the
input image. CNNs consist of multiple layers of convolution, pooling, and non-linear
activation functions that progressively extract complex hierarchical features, from simple
edges to object parts.

c) Classification

Once the relevant features are extracted, the next step is to classify the object into one
of the pre-defined categories. This is typically done using machine learning classifiers or
deep learning models. Some common methods include:

• Traditional Machine Learning Classifiers: These include Support Vector


Machines (SVMs), Random Forests, or k-Nearest Neighbors (k-NN), which take
in the extracted features and assign the image to one of the known categories.

• Deep Learning-Based Classifiers: Modern approaches use the fully connected


layers of CNNs followed by a softmax layer to assign probabilities to different
categories and predict the object’s category.
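As a small illustration of the traditional route, the sketch below trains a scikit-learn SVM on synthetic feature vectors that stand in for real descriptors (for example HOG or CNN features); the three category names, the 64-dimensional features and the class separation are all artificial.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic descriptors: 50 samples per category, each category centred differently.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 64)) for c in range(3)])
y = np.repeat(["cat", "car", "tree"], 50)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf")        # classic classifier applied to extracted features
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))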
d) Post-Processing
After classification, the output might be refined through confidence thresholding or
non-maximum suppression (for multiple object detection), and the final label for the
object category is produced.

8.2 Popular Models for Category Recognition

Over the years, several architectures have become popular for performing category
recognition, especially in deep learning:

a) AlexNet

AlexNet was one of the first CNN architectures that gained popularity after winning the
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. It
demonstrated that CNNs could outperform traditional methods in large-scale category
recognition tasks. AlexNet introduced key innovations like ReLU activation and
dropout to reduce overfitting.

b) VGGNet

VGGNet improved over AlexNet by using a deep architecture of small 3x3 convolutional
filters. It showed that deeper networks could achieve better category recognition
accuracy by capturing more complex visual features. The simplicity of VGGNet made it a
popular architecture for many downstream computer vision tasks.

c) ResNet (Residual Networks)

ResNet introduced skip connections that allowed the training of very deep neural
networks (e.g., with 50 or more layers). ResNet's deep architecture is highly effective for
category recognition because it can learn complex feature hierarchies without suffering
from vanishing gradients, which often occurs in very deep networks.

d) Inception Networks (GoogLeNet)

Inception networks use a combination of convolutions with different filter sizes at each
layer to capture multi-scale information. This architecture is efficient in terms of
computation and memory usage, making it suitable for large-scale category recognition
tasks.

e) MobileNet

MobileNet is a lightweight CNN architecture designed for resource-constrained devices


like mobile phones. It uses depthwise separable convolutions to reduce the number
of parameters, enabling category recognition on devices with limited computational
power.
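The sketch below shows how such a CNN is typically driven for category recognition in PyTorch: a preprocessed image tensor goes through a ResNet-18 and a softmax turns the 1,000 ImageNet logits into class probabilities. To keep the example self-contained the network is left untrained here; in practice ImageNet-pretrained weights would be loaded (for example through the weights argument available in recent torchvision releases).

import torch
import torchvision.models as models

# ResNet-18 with random weights; a pretrained variant would normally be used,
# e.g. models.resnet18(weights="IMAGENET1K_V1") in recent torchvision versions.
model = models.resnet18()
model.eval()

x = torch.rand(1, 3, 224, 224)          # one preprocessed image (batch size 1)
with torch.no_grad():
    logits = model(x)                   # shape [1, 1000], one score per class
    probs = torch.softmax(logits, dim=1)

top5 = torch.topk(probs, k=5, dim=1)
print("Top-5 class indices:", top5.indices.tolist())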

8.3 Challenges in Category Recognition

a) Intra-Class Variability

Objects within the same category can appear significantly different due to differences in
shape, size, color, texture, or style. For instance, the category "car" includes various
models, colors, and designs. The system needs to generalize across these variations and recognize all of them as belonging to the same category.

b) Inter-Class Similarity

Some categories have objects that look very similar to each other, making it difficult to
distinguish between them. For example, some breeds of dogs might look very similar to
wolves, or some brands of shoes may have similar appearances.

c) Viewpoint Variation

Objects can appear different when viewed from various angles or perspectives. A system
trained on images of a car from a frontal view might struggle to recognize the same car
from a side or top view. Models need to be robust to viewpoint variations.

d) Occlusion

Partial occlusion, where parts of an object are hidden by other objects or by the image
boundary, can make category recognition challenging. The system must be able to
recognize objects even if they are only partially visible.

e) Lighting and Environmental Conditions

Changes in lighting, reflections, and shadows can alter the appearance of an object. A
robust system must be able to handle different lighting conditions, including
overexposure or dim lighting.

f) Data Imbalance

In real-world applications, certain categories may have significantly more examples than
others in the training data. For example, there might be thousands of images of "cats"
but only a few of "raccoons." Imbalanced data can lead to biased models that perform
well on common categories but poorly on rare ones.

8.4 Applications of Category Recognition

a) Autonomous Vehicles

In autonomous vehicles, category recognition is critical for recognizing pedestrians,


vehicles, traffic signs, and other objects in real-time to ensure safe navigation. These
systems must correctly identify categories like "stop sign," "traffic light," and "cyclist" to
make appropriate driving decisions.

b) Healthcare

Category recognition is used in medical imaging to recognize diseases and anomalies in


medical scans such as MRI, X-ray, and CT scans. For example, it can classify tumors into
benign or malignant categories, or recognize different types of cells in pathology images.

c) Retail and E-commerce

Category recognition helps e-commerce platforms like Amazon or Google identify product
categories from images. For instance, recognizing whether an uploaded image is a "t-shirt," "dress," or "shoes" can help organize and recommend products efficiently.

d) Robotics

In robotics, category recognition enables robots to understand and interact with their
environments. For instance, recognizing objects like "cup," "bottle," or "tool" allows
robots to pick and place objects correctly in assembly lines or service robots in
households.

e) Surveillance

Category recognition can be used in surveillance systems to classify objects detected in


video footage as "person," "vehicle," or "animal." This is useful for monitoring public
spaces or in smart home security systems that need to distinguish between different
types of movement.

f) Social Media

Social media platforms like Facebook and Instagram use category recognition to
automatically tag objects, places, and people in uploaded photos. This helps improve
user experience and also enables features like image-based search.

8.5 Example: ImageNet Classification

One of the most famous examples of category recognition is the ImageNet Large
Scale Visual Recognition Challenge (ILSVRC). In this competition, models are
trained to classify images into one of 1,000 different categories, including animals,
objects, and scenes.

For example, an image of a tiger would be classified into the "tiger" category, while an
image of a bicycle would be classified as "bicycle." Models like AlexNet, VGGNet, and
ResNet have all competed in this challenge, with ResNet achieving breakthrough
performance due to its deep architecture and ability to learn complex features.

9. CONTEXT AND SCENE UNDERSTANDING

Context and scene understanding refer to the ability of computer vision systems to
not only detect individual objects in an image but also interpret the relationships
between these objects, their spatial arrangements, and their environment as a whole.

This goes beyond object detection and classification, aiming to provide a higher-level
comprehension of what is happening in the scene, similar to how humans understand
complex environments.

Understanding scenes is crucial for tasks such as autonomous driving, image captioning,
human-robot interaction, and surveillance, where recognizing individual objects is not
enough. The system needs to infer how objects relate to each other, predict potential
interactions, and even reason about unseen parts of the environment.

9.1 Key Components of Scene Understanding

Scene understanding typically involves the following key components:

a) Object Detection and Recognition

At the base level, understanding a scene involves detecting and classifying objects within
the scene. This step involves identifying the categories (e.g., car, person, tree) of the
objects and localizing them within the image using bounding boxes, keypoints, or
segmentation masks.

b) Contextual Relationships

Context refers to the spatial and semantic relationships between objects within the
scene. For example:

• Spatial Relationships: Objects that are close to each other in space often share a
contextual relationship. For instance, a "car" is likely to be on a "road" and not on a
"table."

• Semantic Relationships: Objects often co-occur in certain scenes. For example,


"plates" and "forks" are more likely to be seen together in a "dining room" or on a
"table" than in a "bedroom."

Understanding context helps computer vision systems to make better predictions and
resolve ambiguities. For instance, if a blurry object is detected near a "bed," the system
may infer that it’s likely to be a "pillow" rather than a "laptop."
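A toy numeric sketch of this kind of contextual reasoning is given below: the same classifier scores for an ambiguous region are re-weighted by an assumed scene-conditioned prior, so the region is read as a "pillow" in a bedroom but as a "laptop" in an office. All of the numbers are invented for illustration.

import numpy as np

classes = ["pillow", "laptop", "cat"]

# Scores from a (hypothetical) object classifier for one blurry image region.
object_scores = np.array([0.40, 0.38, 0.22])

# Illustrative co-occurrence priors: how likely each class is, given the scene type.
scene_priors = {
    "bedroom": np.array([0.70, 0.10, 0.20]),
    "office":  np.array([0.05, 0.85, 0.10]),
}

def rescore(scores, prior):
    combined = scores * prior          # weight detector scores by the scene prior
    return combined / combined.sum()   # renormalize to a probability distribution

for scene, prior in scene_priors.items():
    posterior = rescore(object_scores, prior)
    print(scene, "->", classes[int(np.argmax(posterior))], np.round(posterior, 2))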

c) Scene Classification

Scene classification is the process of recognizing the broader category or setting of the
entire image, such as "beach," "forest," "office," or "city street."

Scene classification provides the global context for the objects in the image and can
guide the interpretation of what is happening in the scene. For instance, knowing the
scene is a "kitchen" can help recognize objects like "stove," "sink," and "fridge."

d) Semantic Segmentation

Semantic segmentation divides an image into regions associated with specific object
categories, labeling each pixel with its corresponding class. This allows for detailed
understanding of which parts of the image correspond to which objects, aiding in both
object recognition and understanding spatial relationships.

e) Instance Segmentation

In contrast to semantic segmentation, instance segmentation not only labels different


categories but also distinguishes between multiple instances of the same category. This
is useful for understanding scenes with multiple objects of the same type (e.g., several
people in a crowd).

f) Action and Interaction Recognition

Understanding a scene also involves recognizing the actions taking place (e.g.,
"walking," "eating," "driving") and interactions between objects or people. For
example, recognizing that a person is holding a cup, or that a dog is chasing a ball,
adds depth to the understanding of the scene.

g) 3D Scene Understanding

3D scene understanding involves interpreting the depth, layout, and geometry of the
scene, which is important for applications like augmented reality (AR) and autonomous
navigation. By understanding the 3D relationships between objects, systems can reason
about occlusions, predict what is behind objects, and navigate the environment.

9.2 Methods for Scene Understanding

Several techniques and models are used to achieve context and scene understanding:

a) Convolutional Neural Networks (CNNs)

CNNs have been the foundation for object detection and scene classification due to their
ability to extract hierarchical features from images. They are used in architectures like
Faster R-CNN and YOLO (You Only Look Once) for detecting and localizing objects
in images.

b) Fully Convolutional Networks (FCNs)

For semantic segmentation, FCNs are commonly used. FCNs predict class labels for
every pixel in the image, allowing the system to understand which pixels belong to which
object or background region.
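To make the idea concrete, the sketch below defines a toy fully convolutional network in PyTorch: because every layer is convolutional, the output keeps the spatial grid and assigns a score to every pixel for every class. The layer widths and the number of classes are arbitrary illustrative choices, not a real segmentation architecture.

import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Toy fully convolutional network producing per-pixel class logits."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(16, num_classes, kernel_size=1)  # 1x1 conv classifier

    def forward(self, x):
        return self.head(self.body(x))

x = torch.rand(1, 3, 64, 64)        # one small RGB image
logits = TinyFCN()(x)               # shape [1, 4, 64, 64]
labels = logits.argmax(dim=1)       # predicted class index for every pixel
print(logits.shape, labels.shape)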

c) Recurrent Neural Networks (RNNs) and LSTMs

RNNs and Long Short-Term Memory (LSTM) networks are often used for tasks that
require temporal understanding, such as video scene understanding. They help to
capture sequences of actions and interactions in a video by remembering past frames
and context.

d) Graph Neural Networks (GNNs)

GNNs are used to model contextual relationships between objects. By representing


objects as nodes and their relationships as edges, GNNs can reason about how objects in
a scene interact. This is useful for tasks like relationship detection (e.g., "a person
riding a bike" or "a dog lying on the floor").

e) Transformers

In recent years, Vision Transformers (ViT) have emerged as powerful tools for scene
understanding. Transformers capture long-range dependencies and relationships
between different parts of the image, making them effective for tasks involving complex
scenes and high-level context reasoning.

9.3 Challenges in Context and Scene Understanding

a) Complexity and Ambiguity

Scenes can be very complex, with multiple objects interacting in different ways.
Recognizing and reasoning about all objects and their relationships, actions, and context
is computationally challenging. Moreover, scenes can often be ambiguous, requiring
high-level reasoning to understand their true meaning.

b) Occlusion

In real-world scenes, objects are often partially occluded by other objects, making it
difficult to recognize and understand the entire scene. For example, if part of a car is
hidden behind a tree, the system must infer that it is still a car.

c) Scale and Variability

Objects in a scene can appear at various scales and orientations, especially in outdoor
environments. This variability complicates recognition and understanding since objects
can appear quite different based on distance, lighting, or occlusion.

d) Lack of Explicit Object Labels

In some cases, object labels may not be available or easy to infer, which makes it harder
for the system to understand the context. For example, in abstract or artistic images,
understanding the scene may require high-level reasoning that goes beyond object
recognition.

e) Real-Time Performance

Many applications, such as autonomous driving or robotics, require real-time scene


understanding to make decisions. Achieving real-time performance while handling
complex scenes and contexts remains a major challenge.

9.4 Applications of Scene Understanding

a) Autonomous Driving

In autonomous vehicles, context and scene understanding are critical for recognizing
road scenes, detecting obstacles, and predicting the behavior of pedestrians and other
vehicles. The system needs to understand the full environment, including road signs,
traffic lights, road conditions, and other dynamic objects, to navigate safely.

b) Robotics

For robots to interact effectively with humans and objects, they must understand the
context of their environment. This includes recognizing objects, actions, and spatial
relationships. In tasks like pick-and-place or household chores, robots rely on scene
understanding to perform efficiently.

c) Surveillance and Security

In surveillance, scene understanding helps recognize suspicious activities or anomalies in


real-time. The system can understand interactions between people, objects, and
environments to detect abnormal behavior or dangerous situations, such as a person
leaving a bag unattended in a crowded area.

d) Image Captioning

Image captioning systems use scene understanding to generate descriptions of images.


This involves recognizing objects, actions, and context to describe what is happening in
the scene, such as "a man is riding a bike on a city street."

e) Augmented Reality (AR) and Virtual Reality (VR)

In AR and VR applications, scene understanding is used to interact with real-world


environments. AR systems need to understand the 3D layout and context of the
environment to overlay virtual objects accurately.

9.5 Example: Autonomous Driving Scene Understanding

In the context of autonomous driving, scene understanding is crucial for making safe
driving decisions. The system needs to:

1. Detect Objects: Identify vehicles, pedestrians, traffic signs, and obstacles.

2. Classify the Scene: Recognize the type of scene, such as a highway, residential area,
or intersection.

3. Understand Context: Infer relationships between objects, such as the distance


between vehicles or whether a pedestrian is about to cross the street.

4. Predict Actions: Based on context, predict the future actions of dynamic objects (e.g.,
whether a car will turn or if a pedestrian will cross).

5. React in Real-Time: Make driving decisions (e.g., stop, accelerate, change lanes)
based on the understanding of the entire scene.

This holistic understanding of the scene is what enables autonomous cars to safely
navigate complex environments with dynamic objects and changing conditions.

10. RECOGNITION DATABASES AND TEST SETS

In the field of computer vision, recognition databases and test sets play a critical
role in the development, evaluation, and benchmarking of models for tasks like object
recognition, face recognition, image classification, and scene understanding. These
datasets provide the standardized data necessary for training models and comparing the
performance of different algorithms in a consistent and objective manner.

10.1 What are Recognition Databases?

A recognition database (or dataset) is a collection of images, videos, or other types of


visual data that are used to train and test computer vision models. These datasets are
typically labeled with the ground truth information needed to evaluate the model’s
performance, such as object categories, bounding boxes, pixel-wise segmentation labels,
or facial identities, providing the information needed for supervised learning.

Recognition databases can vary depending on the task, and they are typically split into three subsets; a small code sketch of such a split follows the list:

• Training Set: Used to train the model. It contains labeled data that the model uses to
learn patterns, features, and associations.

• Validation Set: Used to fine-tune model hyperparameters and to prevent overfitting


during training.

• Test Set: A separate set of data used for evaluating the model's final performance after
training is complete.
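A minimal sketch of such a split, using scikit-learn's train_test_split on stand-in data, is shown below; the 60/20/20 proportions are a common choice but not a fixed rule.

import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in dataset: 1,000 feature vectors with labels from 5 categories.
X = np.random.rand(1000, 32)
y = np.random.randint(0, 5, size=1000)

# First hold out a test set, then split the remainder into train and validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)  # 0.25 * 0.8 = 0.2

print(len(X_train), len(X_val), len(X_test))   # 600 200 200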

Key Characteristics

• Diversity: A well-rounded recognition database includes a wide variety of categories,


environments, and scenarios. This diversity ensures that models can learn to recognize
objects in different contexts and variations.

• Annotations: Most databases come with annotations such as bounding boxes, pixel-
wise segmentation masks, class labels, and sometimes attributes or relationships
between objects. These annotations are essential for supervised learning tasks.

• Size: The number of images in a recognition database can vary from a few hundred to
millions. Larger datasets often lead to better model performance, as they provide more
examples from which the model can learn.

• Quality: High-quality images with clear labels are critical for effective training. Images
should be representative of real-world conditions, including variations in lighting,
occlusion, and scale.

Common Recognition Databases

• ImageNet: Contains over 14 million images across thousands of categories, widely used
for training and benchmarking deep learning models. ImageNet’s challenge, ILSVRC
(ImageNet Large Scale Visual Recognition Challenge), is a key event in the field.

• COCO (Common Objects in Context): A dataset designed for detecting and


segmenting objects in complex scenes. It contains over 330k images with annotations for
object detection, segmentation, and image captioning.

• Pascal VOC: Provides a benchmark for object detection and segmentation tasks,
featuring a variety of images with labeled objects and detailed annotations across 20
categories.

• MNIST: A classic dataset for handwritten digit recognition, consisting of 70,000


grayscale images of digits (0-9) used for training and evaluating models in simple
classification tasks.

• CIFAR-10/CIFAR-100: These datasets contain small images (32x32 pixels)


categorized into 10 and 100 classes, respectively, commonly used in image classification
tasks.

• CelebA: A large-scale dataset with over 200,000 celebrity images, annotated with facial
attributes and landmarks, used for face recognition and attribute detection tasks.
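As an illustration, the sketch below loads one of these standard datasets (CIFAR-10) through torchvision; it assumes torchvision is installed and that the roughly 170 MB archive can be downloaded on first use.

import torchvision
import torchvision.transforms as T

transform = T.ToTensor()   # convert each 32x32 RGB image to a tensor in [0, 1]

train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                        download=True, transform=transform)

image, label = train_set[0]
print(len(train_set), len(test_set))            # 50000 10000
print(image.shape, train_set.classes[label])    # torch.Size([3, 32, 32]) and a class name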

10.2 Test Sets

a) Definition

A test set is a subset of a recognition database specifically reserved for evaluating the
performance of a model after it has been trained. It contains examples that the model
has never seen during training, allowing for an unbiased assessment of its generalization
capability.

b) Key Features

• Separation from Training Data: It is crucial that test sets remain completely separate
from the training and validation datasets. This separation ensures that performance
metrics reflect true model performance rather than memorization of training examples.

• Representativeness: Test sets should reflect the diversity and variability present in the
entire dataset. This includes different categories, orientations, lighting conditions, and
levels of occlusion.

• Size: While test sets can be smaller than training datasets, they should be large enough
to provide reliable and statistically significant evaluations of model performance.

c) Evaluation Metrics

The performance of models on test sets is evaluated using various metrics, several of which are computed in the short sketch after this list:

• Accuracy: The fraction of correctly predicted instances out of the total instances.

• Precision: The ratio of true positive predictions to the total predicted positives,
measuring the correctness of positive predictions.

• Recall: The ratio of true positive predictions to the actual positives, indicating the ability
to capture all relevant instances.

• F1 Score: The harmonic mean of precision and recall, providing a single metric that
balances both aspects.

• Mean Average Precision (mAP): Commonly used in object detection, mAP evaluates
the precision of predictions at different recall levels, providing a comprehensive measure
of accuracy across multiple classes.
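The sketch below computes several of these metrics with scikit-learn on a tiny, made-up set of ground-truth labels and predictions; mAP is omitted because it is normally computed with detection-specific tooling such as the COCO evaluation API.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical ground-truth labels from a test set and a model's predictions.
y_true = ["cat", "dog", "dog", "cat", "dog", "cat", "dog", "dog"]
y_pred = ["cat", "dog", "cat", "cat", "dog", "dog", "dog", "dog"]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, pos_label="dog"))
print("Recall   :", recall_score(y_true, y_pred, pos_label="dog"))
print("F1 score :", f1_score(y_true, y_pred, pos_label="dog"))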

10.3 The Importance of Recognition Databases and Test Sets

a) Model Development

Recognition databases are essential for training machine learning models. The quality
and variety of the data directly influence how well a model learns to recognize objects
and make accurate predictions.

b) Benchmarking

Databases with established test sets provide benchmarks for comparing the performance
of different algorithms and models. Researchers can publish results on these standard
datasets, facilitating a common ground for evaluation.

c) Transfer Learning

Many models are pre-trained on large recognition databases before being fine-tuned on
specific tasks. This approach leverages learned features from broad datasets, allowing
models to perform well even on smaller, domain-specific datasets.

d) Real-World Applications

High-quality recognition databases and test sets are crucial for developing reliable
systems used in various applications, such as facial recognition, autonomous vehicles,
medical image analysis, and augmented reality. Reliable performance evaluations ensure
that these systems are robust and trustworthy.

10.4 Challenges in Recognition Databases and Test Sets

a) Data Imbalance

In many recognition databases, certain categories may have significantly more examples
than others, leading to biased models that perform well on common classes but poorly
on rare ones.

b) Annotation Quality

The quality of annotations can vary, and inaccuracies in labels can adversely affect model
training and evaluation. Ensuring high-quality annotations is a critical step in dataset
preparation.

c) Variability in Real-World Conditions

While databases strive to be comprehensive, real-world scenarios often involve unseen


conditions (e.g., extreme lighting, weather changes) that may not be well-represented in
training and test sets.

d) Overfitting

Models trained too closely on specific datasets may perform poorly when applied to new,
unseen data, indicating overfitting. It's crucial to have a well-defined test set to evaluate
generalization.
ACTIVITIES
Activity 1: Workshop on Depth and Light Field Techniques
Objective: Understand and apply concepts of view interpolation and layered
depth images, light fields, and lumigraphs.
Format: Hands-on workshop
Description:
• Introduction (30 mins): Begin with a presentation explaining view
interpolation, layered depth images, light fields, and lumigraphs. Include
examples and their applications in computer vision.
• Hands-On Activity (1.5 hours):
• Divide participants into small groups.
• Provide them with a dataset of images with depth information.
• Each group will implement a basic view interpolation algorithm using
layered depth images and light field techniques.
• Encourage experimentation with different interpolation methods and
visualizing the results.
• Discussion (30 mins): Groups will present their findings and challenges faced
during implementation. Discuss potential applications in video rendering and
immersive experiences.
Activity 2: Recognition Systems Challenge
Objective: Build a basic object and face recognition system using recognition
databases.
Format: Group project and presentation
Description:
• Preparation (1 week prior): Assign groups and provide them with a choice of
recognition databases (e.g., COCO for object detection, LFW for face recognition).
• Project Development (2 weeks):
• Groups will design a simple recognition system that can detect objects or
recognize faces in images/videos.
• They will preprocess the data, train their models, and evaluate their
performance using established test sets.
• Encourage groups to explore different models (e.g., CNNs, transfer
learning).
• Final Presentation (1 hour): Each group will present their system, the dataset
used, performance metrics, and insights gained from the project.
ACTIVITIES
Activity 3: Context and Scene Understanding Simulation
Objective: Explore context and scene understanding through practical
applications and discussions.
Format: Interactive simulation and debate
Description:
• Interactive Simulation (1 hour): Use a simulation tool or a computer vision
platform to analyze various scenes. Participants will identify objects, their
relationships, and contextual information in given images.
• Group Analysis (30 mins):
• Split into small groups, each assigned a different scene type (e.g., urban,
natural, indoor).
• Groups will analyze how context influences the understanding of objects
within their scene. What assumptions can be made? How do different
contexts change interpretations?
• Debate (30 mins): Host a debate on the importance of context in computer
vision applications. Discuss scenarios where context is crucial (e.g., autonomous
driving vs. image classification) and how ignoring context can lead to failures.
Video Links
Unit – 5

Video Links
Sl. No. | Title | Video Link
1. View interpolation, Layered depth images: https://www.youtube.com/watch?v=uqKTbyNoaxE
2. Video-based rendering: https://www.youtube.com/watch?v=NqqbQJCqFvI
3. Object detection: https://www.youtube.com/watch?v=MyvOfDFZvgE
4. Face recognition: https://www.youtube.com/watch?v=5r6ldRG9FWk
5. Instance recognition: https://www.youtube.com/watch?v=5QUmlXBb0MY
6. Category recognition: https://www.youtube.com/watch?v=I_6sQjeo168
7. Recognition databases and test sets: https://www.youtube.com/watch?v=SfqN-Hc5two

Assignments
Unit - V

Assignment Questions
Assignment Questions – Very Easy
1. Explain what is meant by "view interpolation" in the context of computer vision. (5 marks, K1, CO5)
2. List and briefly describe the essential components of a recognition database used in computer vision. (5 marks, K1, CO5)

Assignment Questions – Easy


1. Provide a brief explanation of how layered depth images are used for view interpolation and what advantages they offer. (5 marks, K2, CO5)
2. Describe how face recognition works in basic terms, including what features are typically analyzed in this process. (5 marks, K2, CO5)

Assignment Questions
Assignment Questions – Medium
1. Imagine a scenario where light field technology could be applied in real life. Describe how it would enhance visual experience or solve a problem. (5 marks, K3, CO5)
2. Apply your knowledge of category recognition and instance recognition by comparing their applications in object detection tasks. (5 marks, K3, CO5)

Assignment Questions – Hard


1. Why are test sets important in the evaluation of a recognition system? Analyze the potential consequences of not using a proper test set during model evaluation. (5 marks, K4, CO5)
2. Identify and analyze at least two major challenges in video-based rendering. How do these challenges affect the visual quality and performance of rendering systems? (5 marks, K4, CO5)
Assignment Questions
Assignment Questions – Very Hard
1. Evaluate how context influences scene understanding. Provide an example where context is critical for accurate recognition and explain why. (5 marks, K5, CO5)
2. Assess how environment mattes contribute to improving visual effects in film or virtual reality environments. What are the pros and cons of using them in these applications? (5 marks, K5, CO5)

Course Outcomes:
CO5: To understand image based rendering and recognition.
*Allotment of Marks

Correctness of the Content: 15 marks | Presentation: - | Timely Submission: 5 marks | Total: 20 marks

Part A – Questions
& Answers
Unit – V

Part A - Questions & Answers
1. What is view interpolation?
View interpolation is a technique used in computer vision and
graphics to generate intermediate frames between two or more
existing images or views, creating smooth transitions between
different viewpoints.

2. What is a recognition database?

A recognition database is a collection of labeled images or


videos used to train and evaluate computer vision models for tasks such
as object detection, face recognition, and scene understanding.

3. What is an environment matte?

An environment matte is a visual effect technique used to


extract a foreground object from its background, often for compositing
the object into a different scene or environment.

4. What does "light field" refer to in computer vision?

A light field refers to a function that describes the amount of


light traveling in every direction through every point in space, used in
capturing rich 3D scene information.

5. What is category recognition?

Category recognition is the task of identifying the class or


category an object belongs to (e.g., car, tree, dog) in a given image or
video, without recognizing individual instances of the object.

6. How are layered depth images used in view interpolation?
Layered depth images store information about both color and
depth at multiple layers in an image, allowing for accurate interpolation and
rendering of intermediate views by considering occluded objects.
7. Why is context important in scene understanding?
Context provides additional information about the relationships
between objects and their environment, helping to improve accuracy in
recognizing scenes and understanding the roles of different objects.
8. What is the purpose of a test set in machine learning?
A test set is used to evaluate the performance of a trained model
on unseen data, ensuring that the model generalizes well and is not simply
memorizing the training data.
9. What is instance recognition?
Instance recognition refers to identifying specific occurrences of
objects, such as recognizing a particular car or a specific person's face, as
opposed to recognizing the general category.
10. What is video-based rendering?
Video-based rendering is a technique that generates new views
of a scene using pre-recorded video footage, often used in virtual reality,
visual effects, and immersive media experiences.

11. How could light fields improve 3D displays?

Light fields could improve 3D displays by capturing and


rendering light from multiple directions, providing more realistic depth and
viewpoint-dependent effects, enhancing the viewer's immersive experience.

12. How can object detection be applied in autonomous vehicles?

Object detection can be applied in autonomous vehicles to


identify and track objects such as pedestrians, other cars, and obstacles in
real-time, ensuring safe navigation and decision-making.

13. What would be a real-world application of environment


mattes?

Environment mattes are used in film production to seamlessly


blend actors or objects filmed in front of green screens with digitally
created environments or backgrounds.

14. How does face recognition technology enhance security


systems?

Face recognition technology can enhance security systems by


identifying individuals from video feeds or images, allowing for automated
access control and alerting security personnel to unauthorized individuals.

15. How are test sets used to prevent overfitting in models?

Test sets allow models to be evaluated on data they haven't


seen during training, ensuring that the model performs well on new,
unseen examples and doesn't just memorize the training data, preventing
overfitting.

16. Why are larger recognition databases better for model training?

Larger recognition databases provide more diverse examples,


enabling models to learn from a broader range of features, variations, and
conditions, thus improving generalization and performance on unseen data.

17. What challenges arise in video-based rendering for dynamic


scenes?

In dynamic scenes, video-based rendering faces challenges such


as handling motion blur, varying lighting conditions, and accurately rendering
objects as they move through the scene, which can reduce realism.

18. How do environment mattes affect visual effects in


filmmaking?

Environment mattes enable filmmakers to extract foreground


elements and seamlessly integrate them into complex backgrounds,
enhancing the overall visual quality and flexibility in post-production.

19. How can layered depth images handle occlusions during view
interpolation?

Layered depth images store multiple depth layers for each pixel,
allowing the rendering process to account for occluded objects and providing
a more realistic interpolation between views.

20. How does context affect object recognition in cluttered scenes?

In cluttered scenes, context helps the recognition system


disambiguate overlapping or partially obscured objects by considering their
spatial relationships and likely associations with other objects in the scene.

21. Evaluate the effectiveness of face recognition in crowded
environments.

Face recognition in crowded environments may struggle with occlusion,


varying lighting, and face angles, reducing its effectiveness. However,
advances in deep learning models are improving performance in such
conditions.

22. Assess the benefits of using standard test sets for


benchmarking.

Using standard test sets for benchmarking allows researchers to


compare model performance objectively, fostering innovation and providing
a consistent basis for evaluating advancements in computer vision.

23. Evaluate the role of light fields in virtual reality (VR) systems.

Light fields enhance VR systems by providing more realistic depth and


motion parallax, creating a more immersive experience. However, they
require significant computational resources for capture and rendering.

24. Assess the limitations of object detection algorithms in real-


world applications.

Object detection algorithms can struggle with occlusion, varying


lighting, and unusual object orientations in real-world applications, limiting
their performance compared to controlled environments.

25. Evaluate the impact of training on imbalanced datasets.

Training on imbalanced datasets can lead to biased models that perform


well on frequent classes but poorly on rare ones. Addressing class imbalance
through techniques like data augmentation or re-weighting is necessary for
better performance.
Part B – Questions
Unit – V

Part B Questions
1. Describe the key features of view interpolation and layered depth images. (K1, CO5)
2. What are recognition databases and test sets in computer vision? List the common datasets used for object detection and face recognition. (K1, CO5)
3. Explain the role of light fields and lumigraphs in capturing 3D scenes. How do they differ from traditional 2D images? (K2, CO5)
4. Explain how category recognition differs from instance recognition in object detection. Why is this distinction important in real-world applications? (K2, CO5)
5. Design an experiment using video-based rendering techniques to simulate a virtual environment. What challenges might you face, and how would you overcome them? (K3, CO5)
6. Apply context and scene understanding to analyze an image of a cluttered room. Identify key objects and discuss how context helps in recognizing the objects. (K3, CO5)
7. Analyze the impact of recognition databases on model performance. What factors should be considered when selecting a database for training and testing? (K4, CO5)
8. Analyze how environment mattes enhance realism in visual effects. Compare their use to traditional chroma keying techniques. (K4, CO5)
9. Evaluate the effectiveness of object detection algorithms in real-time applications such as autonomous vehicles. What are the main limitations, and how can they be addressed? (K5, CO5)
10. Evaluate the role of recognition databases and test sets in ensuring fairness and reducing bias in AI models. What strategies can be used to reduce dataset bias? (K5, CO5)

Supportive online
Certification
courses (NPTEL,
Swayam, Coursera,
Udemy, etc.,)

Supportive Online Certification
Courses
 Coursera – Introduction to Computer
Vision
• Description:
This course provides an overview of computer
vision, including image processing, feature
extraction, and object recognition.
• Offered by:
Georgia Tech
https://www.coursera.org/learn/introdu
ction-computer-vision
• NPTEL:Computer Vision
• Computer Vision - Course (nptel.ac.in)

 Udemy:
• Computer Vision courses on Udemy (e.g., Practical OpenCV)

Real time
Applications in day
to day life and to
Industry

Real time Applications

Scenario 1: Autonomous Vehicle Object Detection and


Recognition

An autonomous vehicle (self-driving car) is navigating a busy urban


environment. The vehicle's vision system relies on object detection,
instance recognition, and category recognition algorithms to safely
navigate through traffic, avoid obstacles, and follow traffic signals.

Scenario 2: Real-Time Face Recognition in Airport Security

A busy international airport has implemented a face recognition


system to enhance security and streamline the boarding process. The
system is designed to match passengers' faces with their passport
photos as they pass through security checkpoints.

Scenario 3: Virtual Reality (VR) Immersive Experience Using


Light Fields and Layered Depth Images

A VR gaming company is developing an immersive virtual reality


experience where players can explore realistic 3D environments. The
game uses light fields to capture scenes and layered depth images for
rendering dynamic perspectives based on player movement.

Content Beyond
Syllabus

Advanced Concepts in Recognition
Databases and Test Sets

1. Synthetic Data Generation for Training Models:


With the rise of deep learning, generating synthetic data through
Generative Adversarial Networks (GANs) and simulation engines (e.g.,
Unity, Unreal Engine) has become an important method for creating
diverse, labeled datasets for training models, especially when real-
world data is scarce or biased.

2. Federated Learning for Distributed Databases:


In applications where data privacy is paramount (e.g., medical or
personal data), federated learning allows AI models to be trained
across multiple decentralized databases without sharing sensitive data.
This approach ensures that large-scale, diverse datasets can be utilized
while maintaining user privacy.
Assessment
Schedule
(Proposed Date &
Actual Date)

Assessment Schedule
(Proposed Date & Actual Date)
Sl. No. | Assessment | Proposed Date | Actual Date
1 | First Internal Assessment
2 | Second Internal Assessment
3 | Model Examination
4 | End Semester Examination

Prescribed Text
Books & Reference

Prescribed Text Books &
Reference
TEXT BOOKS:
1. D. A. Forsyth, J. Ponce, “Computer Vision: A Modern Approach”, Pearson Education, 2003.
2. Richard Szeliski, “Computer Vision: Algorithms and Applications”, Springer Verlag London Limited, 2011.

REFERENCES:
1. B. K. P. Horn, “Robot Vision”, McGraw-Hill.
2. Simon J. D. Prince, “Computer Vision: Models, Learning, and Inference”, Cambridge University Press, 2012.
3. Mark Nixon and Alberto S. Aguado, “Feature Extraction & Image Processing for Computer Vision”, Third Edition, Academic Press, 2012.
4. E. R. Davies, “Computer & Machine Vision”, Fourth Edition, Academic Press, 2012.
5. Reinhard Klette, “Concise Computer Vision: An Introduction into Theory and Algorithms”, 2014.

Mini Project
Suggestions

Mini Project Suggestions
1 Very Hard
Implement a deep learning-based approach for optical flow estimation using a pre-
trained model like RAFT or PWC-Net. Further, develop a layered motion analysis system
that uses optical flow data to segment a scene into multiple layers, each representing
independent motion. This project can explore novel deep learning architectures or
enhancements for better real-time performance and accuracy.
2. Hard
Develop a robust bundle adjustment system for refining 3D reconstruction from
multiple views. The system should perform feature matching across several images,
initial pose estimation, and iterative optimization to minimize reprojection error using
sparse bundle adjustment techniques. Handle large datasets and consider using multi-
threading or GPU acceleration for optimization.
3. Medium
Implement a two-frame structure from motion (SfM) system to reconstruct a 3D
model of a scene using two images. The system should detect features, match them,
estimate the essential matrix, recover the camera pose, and triangulate points to build a
sparse 3D model.
4. Easy
Create an Augmented Reality (AR) application that uses pose estimation with
fiducial markers (like ArUco or AprilTag) to place virtual objects in the real world. The
project involves detecting the markers in a video feed, estimating the camera's pose
relative to the markers, and rendering a 3D object on top of them.
5. Very Easy
Develop a simple image mosaicing application using 2D feature-based alignment
techniques like SIFT or ORB. The application should detect features in two overlapping
images, match them, estimate a homography, and blend the images to create a
seamless panorama.

Thank you
Disclaimer:

This document is confidential and intended solely for the educational


purpose of RMK Group of Educational Institutions. If you have received
this document through email in error, please notify the system
manager. This document contains proprietary information and is
intended only to the respective group / learning community as
intended. If you are not the addressee you should not disseminate,
distribute or copy through e-mail. Please notify the sender immediately
by e-mail if you have received this document by mistake and delete this
document from your system. If you are not the intended recipient you
are notified that disclosing, copying, distributing or taking any action in
reliance on the contents of this information is strictly prohibited.

