
Vertex AI AutoML Vision training keeps failing with "internal error"

Trying to train an AutoML Vision classification model in Vertex AI, but every time I start training I get:
"Training pipeline failed with error message: Internal error occurred. Please retry in a few minutes."

Tried different datasets, model names, and regions (europe-west4, us-central1); same error every time.

Anyone else experiencing this? Could this be related to the current GCE C3 VM issues?

2 REPLIES

Hi @stanley2001,

Welcome to Google Cloud Community!

The internal error message you're getting suggests that there could be an infrastructure issue, which is often tied to the underlying VM or hardware resources used for model training. Here are a few troubleshooting steps you can try to resolve the problem:

1. General Troubleshooting:

  • Transient Issues: The error message suggests retrying. System glitches can occur, and a simple retry might resolve the issue. Vertex AI automatically restarts CustomJob or HyperparameterTuningJob up to three times.
  • Check Logs: Look for more detailed error information in Cloud Logging. Filter by resource.type = "ml_job" and make sure the time range covers the failed run (see the sketch after this list).
  • Data Splits: If you're using the default data split, Vertex AI might assign too few instances of a class to a particular set (test, validation, or training). This is more common with imbalanced datasets. Try manually splitting your data or removing less frequent labels.
  • Stockout: Vertex AI trains models using Compute Engine resources. If Compute Engine is at capacity for a specific CPU or GPU type in a region, your job can fail. This is more common with GPUs. Consider switching to a different GPU type or region.
  • Increase VM resources: Consistently high CPU or memory utilization indicates the need to scale up the VM. If the VM consistently uses more than 90% of its CPU or memory, switch to a machine type with more vCPUs or memory.
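
If it helps, here is a minimal sketch of the log query with the google-cloud-logging Python client; the project ID is a placeholder, and the filter mirrors the one suggested above:

```python
from google.cloud import logging

# Placeholder project ID; replace with your own.
client = logging.Client(project="my-gcp-project")

# Same filter as suggested above: training-job logs at ERROR severity or higher.
log_filter = 'resource.type="ml_job" AND severity>=ERROR'

# Newest entries first; print a handful of matches with timestamp and payload.
for i, entry in enumerate(client.list_entries(filter_=log_filter, order_by=logging.DESCENDING)):
    print(entry.timestamp, entry.payload)
    if i >= 19:
        break
```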

2. Potential GCE C3 VM-Related Issues: While the generic "Internal error" doesn't definitively point to C3 VM problems, here's what to consider:

  • C3 Limitations: There are known limitations when using c3-standard-*-lssd and c3d-standard-*-lssd machine types with Google Kubernetes Engine.
  • IPv6 Issues: If you're using an IPv6-only C3 VM, it might become unreachable during live migration. Restarting the VM could help.

3. Data Issues

  • Missing Labels: When you use the default data split to train an AutoML classification model, Vertex AI might assign too few instances of a class to one of the sets (training, validation, or test), which causes training to fail. This happens more often with imbalanced classes or a small amount of training data; an explicit split (sketched below) can help.
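
For the manual split mentioned above, one option is to pass explicit split fractions when launching the AutoML training job with the Vertex AI Python SDK. This is only a rough sketch; the project, region, dataset ID, display names, and budget are placeholders you would replace with your own values:

```python
from google.cloud import aiplatform

# Placeholders: use your own project, region, and dataset resource ID.
aiplatform.init(project="my-gcp-project", location="us-central1")

dataset = aiplatform.ImageDataset("1234567890")  # existing image dataset ID

job = aiplatform.AutoMLImageTrainingJob(
    display_name="my-classification-job",
    prediction_type="classification",
    multi_label=False,
)

# Explicit fractions instead of the default split, so every class is more
# likely to land in all three sets; adjust to your dataset's class balance.
model = job.run(
    dataset=dataset,
    training_fraction_split=0.8,
    validation_fraction_split=0.1,
    test_fraction_split=0.1,
    budget_milli_node_hours=8000,
    model_display_name="my-classification-model",
)
```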

If the issue persists, you can reach out to Google Cloud Support; include detailed information and relevant screenshots of the errors you’ve encountered, which will help them diagnose and resolve your issue more efficiently.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.

Thank you for your reply. I will take a look and see if I can fix it.