MobileNet V2 Architecture in Computer Vision

Last Updated : 12 May, 2026

MobileNet V2 is an efficient convolutional neural network architecture designed for mobile and embedded vision applications. Developed by Google, it improves upon MobileNet V1 by enhancing performance while maintaining a lightweight design suitable for resource-constrained environments.

  • Designed for low-power and mobile devices
  • Uses inverted residual blocks for efficiency
  • Provides a strong balance between accuracy and computational cost

Key Features

1. Inverted Residuals

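MobileNet V2 introduces inverted residual blocks, which are its core building units. A traditional residual block reduces the number of channels first and restores it at the end (wide, narrow, wide). An inverted residual block does the opposite: it first expands the input, filters it, and then compresses it back, so the shortcut connects the narrow ends. An inverted residual block consists of three steps: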

  • 1×1 Convolution (Expansion Layer): Increases the number of channels to capture more features
  • Depthwise Convolution: Applies spatial filtering independently on each channel
  • 1×1 Convolution (Projection Layer): Reduces channels back to a smaller size

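Because the spatial filtering is done depthwise, the wide expanded representation stays cheap to compute, and only the narrow bottleneck tensors are passed between blocks, which keeps memory use low. A shortcut (residual) connection adds the block input to its output when the stride is 1 and the input and output channel counts match.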
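
The three steps above can be sketched in plain NumPy to make the shape flow visible. This is a minimal illustration, not the reference implementation: batch normalization is omitted, the weights are random, the padding is a simple symmetric zero pad, and the helper names (`pointwise_conv`, `depthwise_conv3x3`, `inverted_residual`) are made up for this example.

```python
import numpy as np

def relu6(x):
    # Clip activations to the range [0, 6]
    return np.minimum(np.maximum(x, 0.0), 6.0)

def pointwise_conv(x, w):
    # 1x1 convolution: x is (H, W, C_in), w is (C_in, C_out)
    return x @ w

def depthwise_conv3x3(x, w, stride=1):
    # 3x3 depthwise convolution with zero padding of 1.
    # x is (H, W, C), w is (3, 3, C); each channel has its own filter.
    h, wd, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += padded[i:i + h, j:j + wd, :] * w[i, j, :]
    return out[::stride, ::stride, :]

def inverted_residual(x, w_expand, w_depthwise, w_project, stride=1):
    h = relu6(pointwise_conv(x, w_expand))                # 1. expand channels
    h = relu6(depthwise_conv3x3(h, w_depthwise, stride))  # 2. filter per channel
    h = pointwise_conv(h, w_project)                      # 3. project, no activation
    if stride == 1 and x.shape[-1] == h.shape[-1]:
        h = h + x                                         # shortcut between bottlenecks
    return h

rng = np.random.default_rng(0)
c_in, t = 24, 6  # input channels and expansion factor
x = rng.standard_normal((56, 56, c_in))
w_expand = rng.standard_normal((c_in, c_in * t)) * 0.1
w_depthwise = rng.standard_normal((3, 3, c_in * t)) * 0.1
w_project = rng.standard_normal((c_in * t, c_in)) * 0.1

y = inverted_residual(x, w_expand, w_depthwise, w_project)
print(y.shape)  # (56, 56, 24)
```

Inside the block the tensor is 144 channels wide (24 × 6), but the input and output both have only 24 channels.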

2. Depthwise Separable Convolutions

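Like MobileNet V1, MobileNet V2 uses depthwise separable convolutions to reduce the number of parameters and computations. With 3×3 kernels, a depthwise separable convolution needs roughly 8 to 9 times less computation than a standard convolution.

A depthwise separable convolution splits a standard convolution into two steps: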

  • Depthwise convolution: Applies filtering on each channel separately
  • Pointwise convolution (1×1): Combines information across channels
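
The saving can be checked with simple arithmetic. The sketch below counts weights and multiply-adds for one layer, ignoring biases and batch normalization; the function names and the example layer sizes (32 input channels, 64 output channels, a 56×56 feature map) are chosen only for illustration.

```python
def standard_conv_params(k, c_in, c_out):
    # Every output channel has a k x k filter over every input channel
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution mixes the channels
    return depthwise + pointwise

def mult_adds(params, height, width):
    # With stride 1 and "same" padding, each weight is used once per output pixel
    return params * height * width

k, c_in, c_out = 3, 32, 64
std = standard_conv_params(k, c_in, c_out)
sep = separable_conv_params(k, c_in, c_out)

print(std)                      # 18432
print(sep)                      # 2336
print(f"{std / sep:.2f}x fewer weights")
print(mult_adds(std, 56, 56), mult_adds(sep, 56, 56))
```

The ratio approaches k² (9 for a 3×3 kernel) as the number of output channels grows.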

3. Linear Bottlenecks

MobileNet V2 uses linear bottlenecks in the final projection layer of each block.

  • Instead of applying a non-linear activation at the end, it keeps the transformation linear
  • This helps prevent loss of important information, especially in low-dimensional spaces
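
A toy example shows why a non-linearity can be harmful in a narrow layer. It is only an illustration with two hand-picked 2-channel vectors and an arbitrary invertible matrix `W`, not the experiment from the MobileNet V2 paper: ReLU maps two different inputs to the same output, while a linear map keeps them distinct and recoverable.

```python
import numpy as np

# Two different feature vectors in a 2-channel (low-dimensional) space
a = np.array([-1.0, 2.0])
b = np.array([-3.0, 2.0])

# ReLU zeroes the negative channel, so both inputs collapse to [0, 2]
relu_a = np.maximum(a, 0.0)
relu_b = np.maximum(b, 0.0)
print(np.array_equal(relu_a, relu_b))  # True: the difference is lost

# A linear projection (arbitrary invertible matrix, assumed for the demo)
W = np.array([[2.0, 1.0],
              [1.0, 1.0]])
lin_a, lin_b = a @ W, b @ W
print(np.array_equal(lin_a, lin_b))    # False: inputs stay distinguishable

# The original vector can be recovered exactly from the linear output
recovered_a = lin_a @ np.linalg.inv(W)
print(np.allclose(recovered_a, a))     # True
```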

4. ReLU6 Activation Function

MobileNet V2 uses ReLU6, a variation of ReLU.

  • It clips output values to the range 0 to 6: ReLU6(x) = min(max(0, x), 6)
  • The bounded range makes activations more robust under the low-precision (fixed-point) arithmetic common on mobile hardware
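
The following sketch implements ReLU6 in NumPy and shows why a fixed range helps with low precision. The 8-bit scheme here (`quantize_uint8`, `dequantize`) is a simplified uniform quantizer assumed for illustration; it is not the scheme used by any particular mobile runtime.

```python
import numpy as np

def relu6(x):
    # min(max(0, x), 6)
    return np.minimum(np.maximum(x, 0.0), 6.0)

def quantize_uint8(y):
    # Map the known range [0, 6] onto the 256 levels of an unsigned byte
    return np.round(y / 6.0 * 255).astype(np.uint8)

def dequantize(q):
    return q.astype(np.float64) * 6.0 / 255

x = np.array([-3.0, 0.5, 4.0, 9.0])
y = relu6(x)
print(y.tolist())  # [0.0, 0.5, 4.0, 6.0]

# Because the range is fixed, the step size is fixed too (6 / 255),
# so the rounding error is at most half a step.
error = np.abs(dequantize(quantize_uint8(y)) - y).max()
print(error <= 3.0 / 255)  # True
```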

Architecture

MobileNet V2 follows a streamlined architecture: a standard convolutional stem, a stack of inverted residual blocks, and a small classification head.

  1. Initial Convolution Layer: A standard 3×3 convolution layer with 32 filters and a stride of 2.
  2. Series of Inverted Residual Blocks: The network contains several stages, each with a specific number of inverted residual blocks. The expansion factors, output channels, and strides vary across stages to manage the computational complexity and receptive field.
  3. Final Convolution Layer: A 1×1 convolution layer with 1280 filters, followed by a global average pooling layer.
  4. Fully Connected Layer: A fully connected layer with softmax activation for classification tasks.

Detailed Layer Configuration

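The following table shows the layer-wise configuration of MobileNet V2 for a 224x224 input and 1000 output classes. For a repeated block (x2, x3, x4), the listed stride applies only to the first block in the group; the remaining blocks use a stride of 1.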

| Layer Type | Input Size | Output Size | Kernel Size | Stride | Expansion Factor |
|---|---|---|---|---|---|
| Initial Conv | 224x224x3 | 112x112x32 | 3x3 | 2 | - |
| Inverted Residual Block | 112x112x32 | 112x112x16 | 3x3 | 1 | 1 |
| Inverted Residual Block x2 | 112x112x16 | 56x56x24 | 3x3 | 2 | 6 |
| Inverted Residual Block x3 | 56x56x24 | 28x28x32 | 3x3 | 2 | 6 |
| Inverted Residual Block x4 | 28x28x32 | 14x14x64 | 3x3 | 2 | 6 |
| Inverted Residual Block x3 | 14x14x64 | 14x14x96 | 3x3 | 1 | 6 |
| Inverted Residual Block x3 | 14x14x96 | 7x7x160 | 3x3 | 2 | 6 |
| Inverted Residual Block x1 | 7x7x160 | 7x7x320 | 3x3 | 1 | 6 |
| Final Conv | 7x7x320 | 7x7x1280 | 1x1 | 1 | - |
| Global Avg Pooling | 7x7x1280 | 1x1x1280 | - | - | - |
| Fully Connected | 1x1x1280 | 1x1x1000 | - | - | - |
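
The sizes in the table can be reproduced by walking through the stage settings. The sketch below encodes each stage as (expansion factor, output channels, repeats, stride) and tracks the spatial size and channel count; the names `STAGES` and `trace_shapes` are made up for this example, and it assumes an input size that halves evenly at every stride-2 layer.

```python
# (expansion factor t, output channels c, repeats n, stride s of the first block)
STAGES = [
    (1, 16, 1, 1),
    (6, 24, 2, 2),
    (6, 32, 3, 2),
    (6, 64, 4, 2),
    (6, 96, 3, 1),
    (6, 160, 3, 2),
    (6, 320, 1, 1),
]

def trace_shapes(input_size=224):
    # Initial 3x3 convolution: stride 2, 32 filters
    size, channels = input_size // 2, 32
    rows = [("conv 3x3", size, channels)]
    for _t, c, n, s in STAGES:
        for i in range(n):
            stride = s if i == 0 else 1  # only the first block downsamples
            size //= stride
            channels = c
        rows.append((f"bottleneck x{n}", size, channels))
    rows.append(("conv 1x1", size, 1280))  # final convolution
    return rows

for name, size, channels in trace_shapes():
    print(f"{name:<15} {size}x{size}x{channels}")

print(sum(n for _t, _c, n, _s in STAGES))  # 17 inverted residual blocks in total
```

The last two rows come out as 7x7x320 and 7x7x1280, matching the table.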

Implementing MobileNet V2 using TensorFlow

Consider an example of using a pre-trained MobileNet V2 model to classify an image of a cat.

Python
import tensorflow as tf
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input, decode_predictions
import numpy as np

# Load the MobileNetV2 model
model = MobileNetV2(weights='imagenet')

# Load an image for testing
img_path = '/content/simba-8618301_1280.jpg'  # Path to your test image
img = image.load_img(img_path, target_size=(224, 224))

# Preprocess the image
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

# Make predictions
preds = model.predict(x)
print('Predicted:', decode_predictions(preds, top=3)[0])

Output:

Predicted: [('n02123045', 'tabby', 0.5783735), ('n02123159', 'tiger_cat', 0.11342117), ('n02124075', 'Egyptian_cat', 0.05013833)]

The model returns a list of predictions, where each entry contains a class ID, class name, and its probability score.

  • The model predicts the image is most likely a tabby cat with ~57.8% confidence.
  • Other possible predictions include tiger cat and Egyptian cat, but with lower confidence.
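
Since each entry is a (class ID, class name, score) tuple, the result is easy to post-process. The short sketch below works on the output shown above, copied into a plain list so it runs without loading the model.

```python
# Top-3 predictions copied from the output above
top3 = [
    ('n02123045', 'tabby', 0.5783735),
    ('n02123159', 'tiger_cat', 0.11342117),
    ('n02124075', 'Egyptian_cat', 0.05013833),
]

# Print each prediction as a readable line
for class_id, name, score in top3:
    print(f"{name:<14} {score:.1%}  ({class_id})")

# Pick the most likely class by its score
best_id, best_name, best_score = max(top3, key=lambda entry: entry[2])
print(best_name)  # tabby
```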

Advantages

  • Shows high efficiency by balancing accuracy and computational cost, making it suitable for mobile and embedded devices
  • Offers flexibility by allowing scaling through width and resolution multipliers for different use cases
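  • Delivers improved performance compared to MobileNet V1 with fewer parameters and lower computation: the MobileNet V2 paper reports 72.0% ImageNet top-1 accuracy with 3.4M parameters and 300M multiply-adds, against 70.6%, 4.2M and 575M for MobileNet V1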
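
The width multiplier mentioned above scales the number of channels in every stage. The sketch below shows one way to apply it, rounding each result to a multiple of 8. The helper `make_divisible` follows the rounding convention commonly used in public MobileNet implementations, but treat the exact rule as an assumption of this example rather than part of the architecture definition.

```python
def make_divisible(value, divisor=8):
    # Round to the nearest multiple of `divisor`, never below `divisor`
    rounded = max(divisor, int(value + divisor / 2) // divisor * divisor)
    # Avoid shrinking the channel count by more than 10%
    if rounded < 0.9 * value:
        rounded += divisor
    return rounded

# Output channels of the seven inverted residual stages at width 1.0
BASE_CHANNELS = [16, 24, 32, 64, 96, 160, 320]

def scaled_channels(width_multiplier):
    return [make_divisible(c * width_multiplier) for c in BASE_CHANNELS]

print(scaled_channels(1.0))   # [16, 24, 32, 64, 96, 160, 320]
print(scaled_channels(0.75))  # [16, 24, 24, 48, 72, 120, 240]
```

A smaller multiplier gives a thinner, faster network at some cost in accuracy. The resolution multiplier works on the input instead, for example by feeding 160×160 images rather than 224×224.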

Applications

  • Used for image classification, efficiently identifying objects on mobile devices with limited resources
  • Acts as a backbone in object detection, supporting lightweight detection models
  • Enables semantic segmentation, allowing real-time pixel-level understanding on constrained devices
  • Powers embedded vision applications, such as drones, robots, and IoT systems