MobileNet V2 is an efficient convolutional neural network architecture designed for mobile and embedded vision applications. Developed by Google, it improves upon MobileNet V1 by enhancing performance while maintaining a lightweight design suitable for resource-constrained environments.
- Designed for low-power and mobile devices
- Uses inverted residual blocks for efficiency
- Provides a strong balance between accuracy and computational cost
Key Features
1. Inverted Residuals
MobileNet V2 introduces inverted residual blocks, which are its core building units. Instead of reducing dimensions first (as in traditional residual blocks), it first expands the input and then compresses it back. An inverted residual block consists of three steps:
- 1×1 Convolution (Expansion Layer): Increases the number of channels to capture more features
- Depthwise Convolution: Applies spatial filtering independently on each channel
- 1×1 Convolution (Projection Layer): Reduces channels back to a smaller size
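Because the spatial filtering is done depthwise and the tensors passed between blocks stay narrow, this design keeps computation and memory low while preserving important features. When the stride is 1 and the input and output channel counts match, a residual shortcut adds the block's input to its output.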
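To make the three steps concrete, here is a minimal sketch that traces tensor shapes through one inverted residual block. The function name `inverted_residual_shapes` and its parameters are illustrative, not part of any library, and the shortcut rule (stride 1 with matching channel counts) is taken from the original paper's design rather than from a specific implementation.

```python
def inverted_residual_shapes(height, width, c_in, c_out, stride=1, expansion=6):
    """Return the (H, W, C) tensor shape after each step of one block."""
    hidden = c_in * expansion                  # channels after the 1x1 expansion
    # Assumes padding that preserves size at stride 1, so output = ceil(H / stride).
    out_h = (height + stride - 1) // stride
    out_w = (width + stride - 1) // stride
    shapes = [
        ("input", (height, width, c_in)),
        ("expand 1x1", (height, width, hidden)),
        ("depthwise 3x3", (out_h, out_w, hidden)),
        ("project 1x1", (out_h, out_w, c_out)),
    ]
    # A shortcut is only possible when input and output shapes match.
    use_shortcut = stride == 1 and c_in == c_out
    return shapes, use_shortcut


shapes, use_shortcut = inverted_residual_shapes(56, 56, 24, 24, stride=1)
for name, shape in shapes:
    print(f"{name:<14} {shape}")
print("residual shortcut:", use_shortcut)
```

With an expansion factor of 6, the 24 input channels widen to 144 inside the block and narrow back to 24 at the output, so only the thin tensors travel between blocks.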
2. Depthwise Separable Convolutions
Like MobileNet V1, MobileNet V2 uses depthwise separable convolutions, which significantly reduce the number of parameters and computations.
A depthwise separable convolution splits a standard convolution into two steps:
- Depthwise convolution: Applies filtering on each channel separately
- Pointwise convolution (1×1): Combines information across channels
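The savings from this split can be estimated with simple arithmetic. The sketch below counts multiply-adds for a standard convolution and for its depthwise separable counterpart; the helper names and the example layer size (56×56 feature map, 64 to 128 channels) are chosen purely for illustration.

```python
def standard_conv_madds(h, w, c_in, c_out, k=3):
    # Every output value mixes all input channels over a k x k window.
    return h * w * c_in * c_out * k * k


def separable_conv_madds(h, w, c_in, c_out, k=3):
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 convolution mixing channels
    return depthwise + pointwise


std = standard_conv_madds(56, 56, 64, 128)
sep = separable_conv_madds(56, 56, 64, 128)
print(f"standard:  {std:,} multiply-adds")
print(f"separable: {sep:,} multiply-adds")
print(f"reduction: {std / sep:.1f}x")
```

For a 3×3 kernel the reduction factor is 1 / (1/c_out + 1/9), so it approaches 9x as the number of output channels grows.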
3. Linear Bottlenecks
MobileNet V2 uses linear bottlenecks in the final projection layer of each block.
- Instead of applying a non-linear activation at the end, it keeps the transformation linear
- This helps prevent loss of important information, especially in low-dimensional spaces
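A toy example can illustrate why a non-linearity is risky in a low-dimensional space. In the sketch below, applying ReLU directly to two different 2-D points makes them indistinguishable, while expanding to more channels first, applying ReLU there, and projecting back linearly recovers the originals. The `expand` and `project` functions use hand-picked fixed weights that are purely hypothetical; a real network learns these weights.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]


def expand(vector):
    # Hypothetical fixed expansion: keep each value and its negation (2 -> 4 channels).
    return [sign * v for v in vector for sign in (1.0, -1.0)]


def project(expanded):
    # Linear projection back to 2 channels, using relu(v) - relu(-v) == v.
    return [expanded[i] - expanded[i + 1] for i in range(0, len(expanded), 2)]


a, b = [-1.0, 2.0], [-3.0, 2.0]

# ReLU applied directly in 2-D collapses a and b onto the same point.
print(relu(a), relu(b), relu(a) == relu(b))

# Expand, apply ReLU in the wider space, then project linearly: nothing is lost.
print(project(relu(expand(a))), project(relu(expand(b))))
```

This mirrors the block layout: the non-linearity lives in the wide expanded space, and the narrow output is produced by a linear layer.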
4. ReLU6 Activation Function
MobileNet V2 uses ReLU6, a variation of ReLU.
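- It clips output values to the range [0, 6], that is, ReLU6(x) = min(max(0, x), 6)
- Because activations are bounded, the model is more robust under the low-precision (fixed-point) arithmetic common on mobile hardware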
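As a minimal sketch, ReLU6 can be written in a single line of plain Python. The 8-bit step-size calculation at the end is an illustrative assumption about how a fixed activation range might be quantized, not a description of any particular runtime.

```python
def relu6(x):
    # Clip to [0, 6]: negatives become 0, values above 6 saturate at 6.
    return min(max(x, 0.0), 6.0)


print([relu6(v) for v in (-2.0, 0.5, 3.0, 6.0, 9.5)])

# Illustrative assumption: mapping the fixed range [0, 6] onto 256 unsigned
# 8-bit levels gives a constant step size, independent of the input data.
step = 6.0 / 255
print(f"8-bit step size: {step:.4f}")
```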
Architecture
MobileNet V2 follows a streamlined architecture: an initial convolution, a stack of inverted residual blocks, and a small classification head.
- Initial Convolution Layer: A standard convolution layer with 32 filters and a stride of 2.
- Series of Inverted Residual Blocks: The network contains several stages, each with a specific number of inverted residual blocks. The expansion factors, output channels, and strides vary across stages to manage the computational complexity and receptive field.
- Final Convolution Layer: A 1x1 convolution layer with 1280 filters, followed by a global average pooling layer.
- Fully Connected Layer: A fully connected layer with softmax activation for classification tasks (1000 output classes for ImageNet).
Detailed Layer Configuration
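The following table shows the layer-wise configuration of MobileNet V2 for a 224x224 RGB input. In stages with repeated blocks, the stride applies only to the first block and the remaining blocks use stride 1. The kernel size listed for inverted residual blocks is that of the depthwise convolution.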
| Layer Type | Input Size | Output Size | Kernel Size | Stride | Expansion Factor |
|---|---|---|---|---|---|
| Initial Conv | 224x224x3 | 112x112x32 | 3x3 | 2 | - |
| Inverted Residual Block x1 | 112x112x32 | 112x112x16 | 3x3 | 1 | 1 |
| Inverted Residual Block x2 | 112x112x16 | 56x56x24 | 3x3 | 2 | 6 |
| Inverted Residual Block x3 | 56x56x24 | 28x28x32 | 3x3 | 2 | 6 |
| Inverted Residual Block x4 | 28x28x32 | 14x14x64 | 3x3 | 2 | 6 |
| Inverted Residual Block x3 | 14x14x64 | 14x14x96 | 3x3 | 1 | 6 |
| Inverted Residual Block x3 | 14x14x96 | 7x7x160 | 3x3 | 2 | 6 |
| Inverted Residual Block x1 | 7x7x160 | 7x7x320 | 3x3 | 1 | 6 |
| Final Conv | 7x7x320 | 7x7x1280 | 1x1 | 1 | - |
| Global Avg Pooling | 7x7x1280 | 1x1x1280 | - | - | - |
| Fully Connected | 1x1x1280 | 1x1x1000 | - | - | - |
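The table can be cross-checked by encoding each stage as a tuple and tracing the spatial size and channel count through the network. This is a sketch only: `STAGES` and `trace_shapes` are illustrative names, and the code assumes that only the first block of each stage applies the stride, as in the original paper.

```python
# (expansion factor, output channels, number of blocks, stride), from the table
STAGES = [
    (1, 16, 1, 1),
    (6, 24, 2, 2),
    (6, 32, 3, 2),
    (6, 64, 4, 2),
    (6, 96, 3, 1),
    (6, 160, 3, 2),
    (6, 320, 1, 1),
]


def trace_shapes(size=224, channels=32):
    """Return (label, spatial size, channels) rows and the total block count."""
    size //= 2  # initial 3x3 convolution with stride 2
    rows = [("initial conv", size, channels)]
    total_blocks = 0
    for expansion, out_channels, repeats, stride in STAGES:
        size //= stride          # only the first block in a stage strides
        channels = out_channels
        total_blocks += repeats
        rows.append((f"stage t={expansion} n={repeats}", size, channels))
    return rows, total_blocks


rows, total_blocks = trace_shapes()
for label, size, channels in rows:
    print(f"{label:<16} {size}x{size}x{channels}")
print("inverted residual blocks:", total_blocks)
```

The trace ends at 7x7x320, matching the input of the final 1x1 convolution, and the stages add up to 17 inverted residual blocks.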
Implementing MobileNet V2 using TensorFlow
Consider an example of using a pre-trained MobileNet V2 model to classify an image of a cat.
```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

# Load the MobileNetV2 model with ImageNet weights
model = MobileNetV2(weights='imagenet')

# Load an image for testing, resized to the model's 224x224 input
img_path = '/content/simba-8618301_1280.jpg'  # Path to your test image
img = image.load_img(img_path, target_size=(224, 224))

# Preprocess the image: convert to an array, add a batch dimension,
# and scale pixel values to the range the model expects
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

# Make predictions and decode the top 3 classes
preds = model.predict(x)
print('Predicted:', decode_predictions(preds, top=3)[0])
```
Output:
```
Predicted: [('n02123045', 'tabby', 0.5783735), ('n02123159', 'tiger_cat', 0.11342117), ('n02124075', 'Egyptian_cat', 0.05013833)]
```
The model returns a list of predictions, where each entry contains a class ID, class name, and its probability score.
- The model predicts the image is most likely a tabby cat with ~57.8% confidence.
- Other possible predictions include tiger cat and Egyptian cat, but with lower confidence.
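The decoded result is an ordinary list of tuples, so it can be unpacked and formatted directly. The sketch below reuses the values from the output shown above instead of running the model; the variable names are illustrative.

```python
# Values copied from the example output above
predictions = [
    ('n02123045', 'tabby', 0.5783735),
    ('n02123159', 'tiger_cat', 0.11342117),
    ('n02124075', 'Egyptian_cat', 0.05013833),
]

# Each entry unpacks into (class ID, class name, probability score)
for class_id, label, score in predictions:
    print(f"{label:<13} {score:6.1%}  ({class_id})")

# Pick the entry with the highest probability
best = max(predictions, key=lambda entry: entry[2])
print("Top prediction:", best[1])
```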
Advantages
- Shows high efficiency by balancing accuracy and computational cost, making it suitable for mobile and embedded devices
- Offers flexibility by allowing scaling through width and resolution multipliers for different use cases
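- Delivers improved performance compared to MobileNet V1 with fewer parameters and lower computation: the original paper reports 72.0% ImageNet top-1 accuracy with about 3.4M parameters and 300M multiply-adds, versus 70.6%, 4.2M, and 575M for MobileNet V1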
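The width multiplier mentioned in the list above scales the number of channels in every layer. Here is a minimal sketch of the idea, assuming channel counts are rounded to a multiple of 8, a convention found in common reference implementations. The names `make_divisible`, `BASE_CHANNELS`, and `scaled_channels` are illustrative, not a library API.

```python
def make_divisible(value, divisor=8):
    """Round to the nearest multiple of `divisor`, never dropping below it."""
    rounded = max(divisor, int(value + divisor / 2) // divisor * divisor)
    # Avoid shrinking a layer by more than 10% through rounding.
    if rounded < 0.9 * value:
        rounded += divisor
    return rounded


# Output channels of the initial convolution and the seven block stages
BASE_CHANNELS = [32, 16, 24, 32, 64, 96, 160, 320]


def scaled_channels(alpha):
    """Apply width multiplier `alpha` to every stage's channel count."""
    return [make_divisible(c * alpha) for c in BASE_CHANNELS]


for alpha in (1.0, 0.75, 0.5):
    print(alpha, scaled_channels(alpha))
```

In Keras, the corresponding knob is the `alpha` argument of `MobileNetV2`, while the resolution multiplier corresponds to choosing a smaller input size.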
Applications
- Used for image classification, efficiently identifying objects on mobile devices with limited resources
- Acts as a backbone in object detection, supporting lightweight detection models
- Enables semantic segmentation, allowing real-time pixel-level understanding on constrained devices
- Powers embedded vision applications, such as drones, robots, and IoT systems