MobileNet V2 Architecture in Computer Vision

MobileNet V2 is an efficient convolutional neural network architecture designed for mobile and embedded vision applications. Developed by Google, it improves upon MobileNet V1 by enhancing performance while maintaining a lightweight design suitable for resource-constrained environments.

Designed for low-power and mobile devices
Uses inverted residual blocks for efficiency
Provides a strong balance between accuracy and computational cost

Key Features

1. Inverted Residuals

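MobileNet V2 introduces inverted residual blocks, which are its core building units. A traditional residual block is wide at its ends and narrow in the middle: it reduces the channel count, filters, and then restores it. The inverted residual block does the opposite: it expands a narrow input, filters it, and compresses it back to a narrow output. An inverted residual block consists of three steps: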

1×1 Convolution (Expansion Layer): Increases the number of channels by an expansion factor (6 in most blocks), giving the next step a richer representation to filter
Depthwise Convolution: Applies a 3×3 spatial filter independently to each channel
1×1 Convolution (Projection Layer): Reduces the channels back to a narrow bottleneck

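Because the costly 3×3 spatial filtering is done depthwise and the tensors passed between blocks stay narrow, this design keeps computation and memory use low. When the stride is 1 and the input and output have the same number of channels, a shortcut connection adds the block's input to its output.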
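The three steps above can be sketched in plain NumPy for a single image. This is a minimal illustration, not the reference implementation: batch normalization is omitted, the weights are random, the padding is a symmetric one-pixel zero border, and the helper names (`pointwise_conv`, `depthwise_conv3x3`, `inverted_residual`) are made up for this example. The shortcut is added only when the stride is 1 and the input and output shapes match.

```python
import numpy as np

def relu6(x):
    # ReLU capped at 6
    return np.clip(x, 0.0, 6.0)

def pointwise_conv(x, w):
    # 1x1 convolution: x is (H, W, C_in), w is (C_in, C_out)
    return x @ w

def depthwise_conv3x3(x, k, stride=1):
    # One 3x3 filter per channel: x is (H, W, C), k is (3, 3, C)
    h, w, c = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out_h = (h + stride - 1) // stride
    out_w = (w + stride - 1) // stride
    out = np.zeros((out_h, out_w, c))
    for i in range(out_h):
        for j in range(out_w):
            r, s = i * stride, j * stride
            patch = padded[r:r + 3, s:s + 3, :]
            out[i, j, :] = np.sum(patch * k, axis=(0, 1))
    return out

def inverted_residual(x, w_expand, k_depthwise, w_project, stride=1):
    expanded = relu6(pointwise_conv(x, w_expand))                       # expand
    filtered = relu6(depthwise_conv3x3(expanded, k_depthwise, stride))  # filter
    projected = pointwise_conv(filtered, w_project)                     # project (no activation)
    if stride == 1 and x.shape == projected.shape:
        return x + projected                                            # shortcut
    return projected

# Hypothetical sizes: a 56x56x24 input and an expansion factor of 6
rng = np.random.default_rng(0)
c_in, t = 24, 6
x = rng.standard_normal((56, 56, c_in))
w_expand = rng.standard_normal((c_in, c_in * t)) * 0.1
k_depthwise = rng.standard_normal((3, 3, c_in * t)) * 0.1
w_project = rng.standard_normal((c_in * t, c_in)) * 0.1

y = inverted_residual(x, w_expand, k_depthwise, w_project, stride=1)
print(y.shape)  # (56, 56, 24)
```

Only the 144-channel intermediate tensors are wide; the block's input and output both have 24 channels.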

2. Depthwise Separable Convolutions

Like MobileNet V1, MobileNet V2 relies on depthwise separable convolutions to cut parameters and computation. With 3×3 kernels, they need roughly 8 to 9 times fewer computations than standard convolutions, at only a small cost in accuracy.

It splits standard convolution into:

Depthwise convolution: Applies filtering on each channel separately
Pointwise convolution (1×1): Combines information across channels
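The saving can be checked with simple arithmetic. The sketch below counts weights (ignoring biases) for one layer; the function names and the example sizes (3×3 kernel, 32 input channels, 64 output channels) are chosen only for illustration.

```python
def standard_conv_params(k, c_in, c_out):
    # Every output channel has its own k x k filter over all input channels
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes channels
    return depthwise + pointwise

k, c_in, c_out = 3, 32, 64
std = standard_conv_params(k, c_in, c_out)
sep = separable_conv_params(k, c_in, c_out)
print(std, sep, round(std / sep, 2))  # 18432 2336 7.89
```

Multiply-adds scale both counts by the same output height and width, so the ratio is the same for computation.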

3. Linear Bottlenecks

MobileNet V2 uses linear bottlenecks in the final projection layer of each block.

Instead of applying a non-linear activation after the projection, it keeps that transformation linear
ReLU zeroes out negative values, which discards information that a low-dimensional bottleneck cannot afford to lose; a linear output avoids this

4. ReLU6 Activation Function

MobileNet V2 uses ReLU6, a variation of ReLU.

It clips output values to the range 0 to 6: ReLU6(x) = min(max(x, 0), 6)
The bounded range makes activations more robust to the low-precision arithmetic used on mobile hardware
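As a quick illustration, ReLU6 can be written in one line of plain Python (the function name `relu6` here is just for this example):

```python
def relu6(x: float) -> float:
    # Negative inputs become 0, inputs above 6 are capped at 6
    return min(max(x, 0.0), 6.0)

print([relu6(v) for v in (-3.0, 0.5, 6.0, 9.2)])  # [0.0, 0.5, 6.0, 6.0]
```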

Architecture

The full network stacks inverted residual blocks between a standard convolution at the input and a classification head at the output:

Initial Convolution Layer: A standard 3×3 convolution layer with 32 filters and a stride of 2.
Series of Inverted Residual Blocks: The network contains several stages, each with a specific number of inverted residual blocks. The expansion factors, output channels, and strides vary across stages to manage the computational complexity and receptive field.
Final Convolution Layer: A 1×1 convolution layer with 1280 filters, followed by a global average pooling layer.
Fully Connected Layer: A fully connected layer with softmax activation for classification tasks (1000 outputs for ImageNet).

Detailed Layer Configuration

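The following table shows the layer-wise configuration of MobileNet V2 for a 224x224 input. For inverted residual blocks, the kernel size is that of the depthwise convolution, and the stride applies only to the first block of each stage; the remaining blocks in the stage use a stride of 1: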

Layer Type	Input Size	Output Size	Kernel Size	Stride	Expansion Factor
Initial Conv	224x224x3	112x112x32	3x3	2	-
Inverted Residual Block x1	112x112x32	112x112x16	3x3	1	1
Inverted Residual Block x2	112x112x16	56x56x24	3x3	2	6
Inverted Residual Block x3	56x56x24	28x28x32	3x3	2	6
Inverted Residual Block x4	28x28x32	14x14x64	3x3	2	6
Inverted Residual Block x3	14x14x64	14x14x96	3x3	1	6
Inverted Residual Block x3	14x14x96	7x7x160	3x3	2	6
Inverted Residual Block x1	7x7x160	7x7x320	3x3	1	6
Final Conv	7x7x320	7x7x1280	1x1	1	-
Global Avg Pooling	7x7x1280	1x1x1280	-	-	-
Fully Connected	1x1x1280	1x1x1000	-	-	-
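The output sizes in the table follow directly from the strides and channel counts. The sketch below re-derives them; the `STAGES` list simply restates the table rows, and `trace_shapes` is a helper name invented for this example.

```python
# (expansion factor, output channels, number of blocks, stride of first block)
STAGES = [
    (1, 16, 1, 1),
    (6, 24, 2, 2),
    (6, 32, 3, 2),
    (6, 64, 4, 2),
    (6, 96, 3, 1),
    (6, 160, 3, 2),
    (6, 320, 1, 1),
]

def trace_shapes(input_size=224):
    size = input_size // 2  # the initial convolution has stride 2
    rows = [("initial conv", size, 32)]
    for t, c, n, s in STAGES:
        size //= s          # only the first block of a stage downsamples
        rows.append((f"{n} block(s), t={t}", size, c))
    return rows

for name, size, channels in trace_shapes():
    print(f"{name:<18} -> {size}x{size}x{channels}")

print("total blocks:", sum(n for _, _, n, _ in STAGES))  # total blocks: 17
```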

Implementing MobileNet V2 using TensorFlow

Consider an example of using a pre-trained MobileNet V2 model to classify an image of a cat.

Python

import numpy as np
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

# Load MobileNetV2 with ImageNet weights (downloaded on first use)
model = MobileNetV2(weights='imagenet')

# Load the test image, resized to the model's default 224x224 input
img_path = '/content/simba-8618301_1280.jpg'  # Path to your test image
img = image.load_img(img_path, target_size=(224, 224))

# Convert to a float array, add a batch dimension, and scale pixels to [-1, 1]
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

# Predict and decode the three most likely ImageNet classes
preds = model.predict(x)
print('Predicted:', decode_predictions(preds, top=3)[0])

Output:

Predicted: [('n02123045', 'tabby', 0.5783735), ('n02123159', 'tiger_cat', 0.11342117), ('n02124075', 'Egyptian_cat', 0.05013833)]

decode_predictions converts the model's 1000 class probabilities into a list of (class ID, class name, probability) tuples, sorted from most to least likely.

The model predicts the image is most likely a tabby cat with ~57.8% confidence.
Other possible predictions include tiger cat and Egyptian cat, but with lower confidence.

Advantages

Shows high efficiency by balancing accuracy and computational cost, making it suitable for mobile and embedded devices
Offers flexibility by allowing scaling through width and resolution multipliers for different use cases
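Delivers improved performance compared to MobileNet V1: the original paper reports higher ImageNet top-1 accuracy (72.0% vs. 70.6%) with fewer parameters (3.4M vs. 4.2M) and roughly half the multiply-adds (300M vs. 575M)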
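In Keras, the width multiplier is exposed as the `alpha` argument and the resolution is set through `input_shape`. The sketch below builds two untrained variants to compare their sizes; `alpha=0.5` and the 160x160 input are arbitrary choices for illustration, and `weights=None` is used so nothing is downloaded (pre-trained ImageNet weights are only published for certain combinations of `alpha` and input size).

```python
from tensorflow.keras.applications import MobileNetV2

# weights=None builds the architecture with random weights
full = MobileNetV2(weights=None)  # defaults: alpha=1.0, 224x224 input
small = MobileNetV2(weights=None, alpha=0.5, input_shape=(160, 160, 3))

# The narrower model has fewer parameters; the smaller input reduces computation
print("alpha=1.0:", full.count_params())
print("alpha=0.5:", small.count_params())
```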

Applications

Used for image classification, efficiently identifying objects on mobile devices with limited resources
Acts as a backbone in object detection, supporting lightweight detection models
Enables semantic segmentation, allowing real-time pixel-level understanding on constrained devices
Powers embedded vision applications, such as drones, robots, and IoT systems
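For detection or segmentation, the network is typically used without its classification head so that downstream layers can consume the final feature map. A minimal sketch with Keras, assuming a 224x224 input and random weights (`weights=None`) to avoid a download:

```python
from tensorflow.keras.applications import MobileNetV2

# include_top=False drops the global pooling and fully connected layers
backbone = MobileNetV2(weights=None, include_top=False, input_shape=(224, 224, 3))

print(backbone.output_shape)  # (None, 7, 7, 1280)
```

In practice, `weights='imagenet'` would usually be passed instead so that the backbone starts from pre-trained features.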