Abstract
This extended paper presents a novel approach to unsupervised SEM image segmentation for IC layout extraction. Existing methods typically rely on supervised machine learning with manually labeled training data, requiring re-training and partial annotation when applied to new datasets. To address this issue, we propose a SEM image segmentation algorithm based on unsupervised deep learning, eliminating the need for manual labeling. We train and evaluate our approach on a real-world dataset comprising 648 SEM images of metal-1 and metal-2 layers from a commercial IC, achieving competitive segmentation error rates well below 1%. By releasing our dataset and algorithm implementations, we allow researchers to apply our approach to their own datasets and evaluate their methods against our dataset, facilitating reproducibility in the field.
1 Introduction
Integrated Circuit (IC) dissection has a broad range of applications, including competitive analysis [1], counterfeit detection [1], and identifying malicious circuit modifications [2]. Accurately extracting (parts of) the layout from an IC is an important task in this process [3]. For dies manufactured on modern technology nodes, it involves chemical and mechanical preparation and delayering of the chip prior to imaging each layer of the die with a Scanning Electron Microscope (SEM) [1]. Due to the usually imperfect SEM image quality, a major challenge of layout reconstruction is to segment all metal layers as precisely as possible into background, tracks, and vias [4]. State-of-the-art IC SEM image segmentation approaches often employ supervised machine learning [5,6,7], relying on manually annotated images to serve as labels during the training process. However, models trained on one dataset are often not directly applicable to others due to differing preparation, manufacturing, and imaging parameters [4]. Instead, models must be re-trained for each new dataset, which must either be partially annotated or otherwise preprocessed to fit the model’s training data, causing a performance degradation [8]. The dataset differences make a fair comparison between segmentation methods almost impossible. Furthermore, some literature only reports pixel-wise evaluation metrics, which contain very little information about the actual segmentation quality in terms of electrically relevant errors. And while meaningful metrics, such as the Electrically Significant Difference (ESD), have been proposed [9], the lack of open datasets and algorithm implementations obstructs a thorough comparison between segmentation algorithms for IC layout extraction. In this work, we strive to address these problems as follows:
-
First, we devise an automated approach for track segmentation on metal-layer IC SEM images that eliminates the need for costly and time-consuming manual labeling. To this end, we present a novel algorithm based on unsupervised deep learning that – not relying on labeled training data – allows for adaptation to new datasets with little human interaction.
-
Second, we perform a thorough evaluation of our approach on a real-world dataset consisting of 648 SEM images from the metal-1 and metal-2 layers of a commercial 180 nm IC. Our results indicate low ESD error rates of around 0.8% with high consistency, which is competitive with some state-of-the-art approaches for SEM image segmentation that use supervised learning.
-
Third, we enable other researchers to compare their segmentation methods to our results and apply our method to their own datasets by making the data we used in our work available under an open-source license. This data consists of our algorithm implementations, our evaluation code, as well as the real-world SEM image dataset described above, including a corresponding ground truth for the metal-2 layer.
Compared to our previous workshop version of this paper [10], we extend our work as follows. In Section 2.4, we (1) propose improvements to our approach that greatly enhance training stability, albeit achieving high performance only in a specific configuration. In Section 3, we introduce multiple state-of-the-art supervised Machine Learning (ML) models, which we (2) evaluate our original and improved approaches against in Section 5. Finally, we (3) perform an ablation study on our best configuration in Section 5.4, experimentally removing different parts of our algorithm to assess their impact on performance.
2 Architecture for unsupervised SEM image segmentation
In this section, we briefly introduce the advantages of unsupervised learning and propose an unsupervised ML architecture for SEM image segmentation that allows adaptation to new datasets without prior manual labeling.
In recent years, deep learning techniques have achieved impressive results for image analysis tasks, such as segmentation [11]. A common issue for supervised ML approaches is the availability of labeled training data, which is why unsupervised and semi-supervised approaches have gained traction. A supervised ML model learns a function that maps input data to provided labels. In contrast, an unsupervised model trains without labeled data and extracts information directly from the input distribution [12]. We base our architecture on the autoencoder design, which consists of two ML models, encoder and decoder [13]. The encoder compresses an input SEM image into a self-learned so-called hidden representation, from which the decoder reconstructs the original data. Our goal is to shape this hidden representation into a segmentation mask that classifies each pixel as either silicon background, metal track, or metal via. When training encoder and decoder together, the autoencoder would optimize its hidden representation for optimal input reconstruction, without forming a concept of background, track, or via segments. Instead, we train the decoder with segmentation masks obtained from conventional image segmentation algorithms, forcing the encoder to output a representation close to the decoder’s learned input format.
To generate training data for the autoencoder, we denoise the SEM images in a preprocessing phase using median filtering and split them into 512\(\times \)512 pixel SEM patches to improve scalability.
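The preprocessing step can be sketched as follows in plain NumPy. The 3\(\times \)3 median kernel size is an assumption for illustration, as the filter radius is not specified above:

```python
import numpy as np

def median_denoise(img, k=3):
    """Simple k x k median filter (edges handled by reflection padding)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    # Stack all k*k shifted views and take the per-pixel median.
    shifts = [padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
              for dy in range(k) for dx in range(k)]
    return np.median(np.stack(shifts), axis=0).astype(img.dtype)

def split_into_patches(img, size=512):
    """Split an image into non-overlapping size x size patches, dropping remainders."""
    h, w = img.shape
    return [img[y:y + size, x:x + size]
            for y in range(0, h - size + 1, size)
            for x in range(0, w - size + 1, size)]
```

For a 4096\(\times \)3536 pixel SEM image, `split_into_patches` with the default size yields 8\(\times \)6 full patches; the remainder rows and columns are discarded in this simplified sketch.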
2.1 Conventional segmentation algorithms
For decoder training, we apply one of three conventional SEM image segmentation algorithms to the SEM image patches. Although their performance is inadequate for direct automatic IC layout reconstruction because they produce a large number of segmentation errors, they aptly shape the decoder’s expected input and thereby constrain the autoencoder’s hidden representation.
Fixed threshold. Arguably the simplest image segmentation algorithm, fixed thresholding classifies image pixels as either background, track, or via depending on their brightness. We pick minimum track and via brightnesses – the thresholds – systematically by averaging the histograms of all images in the dataset and choosing appropriate thresholds between peaks corresponding to background, tracks, and vias.
Random threshold. We also tested randomizing track and via thresholds on a per-patch basis by drawing them uniformly at random from an interval around the fixed thresholds.
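Both threshold variants can be sketched compactly. We reuse the track and via thresholds of \(\frac{67}{256}\) and \(\frac{172}{256}\) reported in Section 5.2; the jitter interval width is an assumption, as the paper does not specify it:

```python
import numpy as np

T_TRACK, T_VIA = 67 / 256, 172 / 256  # thresholds derived from the dataset histogram

def fixed_threshold_mask(patch, t_track=T_TRACK, t_via=T_VIA):
    """Classify normalized [0, 1] pixels: 0 = background, 1 = track, 2 = via."""
    mask = np.zeros(patch.shape, dtype=np.uint8)
    mask[patch >= t_track] = 1
    mask[patch >= t_via] = 2
    return mask

def random_threshold_mask(patch, jitter=0.05, rng=None):
    """Per-patch thresholds drawn uniformly from an interval around the fixed
    values (the interval width here is an illustrative assumption)."""
    rng = np.random.default_rng(rng)
    t_track = rng.uniform(T_TRACK - jitter, T_TRACK + jitter)
    t_via = rng.uniform(T_VIA - jitter, T_VIA + jitter)
    return fixed_threshold_mask(patch, t_track, t_via)
```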
Morphological active contours without edges. A more advanced segmentation algorithm, Morphological Active Contours Without Edges (MorphACWE) evolves a level-set curve along edges in the image while being resistant to noise [14]. We use fixed thresholds to obtain initial level-set curves and run the algorithm separately for track and via labeling.
2.2 Decoder
Figure 1a visualizes the decoder training. First, we apply the conventional segmentation algorithms from the previous section to our 512\(\times \)512 pixel SEM image patches and thereby generate training data for the decoder. Instead of reconstructing the input SEM patch directly, we let the decoder predict an approximation of its gradient magnitude. The gradient magnitude of an image is the norm of the brightness differences between neighboring pixels in X and Y directions. Training the decoder to predict the image gradient, we prioritize the accurate placement of track and via boundaries over precise coloring of uniform areas, such as the background.
Reconstruction loss. As reconstruction loss for decoder training, we use Mean Squared Error Loss (MSELoss). The loss compares the decoder output to the target \(\min (\lambda _1 \, g(x), 1)\) from Equation 1, where \(g(x)\) denotes the morphological gradient of the input patch \(x\), an approximation of the gradient magnitude that, with a sufficiently large structuring element, produces less noise than the exact version. The hyperparameter \(\lambda _1\) allows tuning the loss function in conjunction with clamping the gradient to the decoder output range. We choose \(\lambda _1 = 5\), which saturates the gradients around vias and thus reduces the difference to the smaller gradients along track borders, balancing correct track border reconstruction with via reconstruction.
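The reconstruction target can be illustrated with a small NumPy sketch. We assume a 3\(\times \)3 structuring element and a decoder output range of [0, 1]; the paper's exact structuring element size is not stated:

```python
import numpy as np

def morph_gradient(img, k=3):
    """Morphological gradient: dilation minus erosion with a k x k structuring
    element, implemented via stacked shifted views."""
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    shifts = np.stack([padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
                       for dy in range(k) for dx in range(k)])
    return shifts.max(axis=0) - shifts.min(axis=0)

def reconstruction_target(patch, lam1=5.0):
    """Scaled morphological gradient, clamped to the decoder output range [0, 1]."""
    return np.clip(lam1 * morph_gradient(patch), 0.0, 1.0)

def mse_loss(pred, target):
    return float(np.mean((pred - target) ** 2))
```

With \(\lambda _1 = 5\), any gradient of at least 0.2 saturates to 1, which is what evens out the large gradients around vias relative to track borders.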

Decoder architecture. As decoder architecture, we use U-Net [15] with batch normalization and an output-normalizing activation in the last layer. Instead of deconvolution operations, we employ resize-convolutions to suppress high-frequency artifacts in the output image [16]. We also use padded convolutions to retain the input size of 512\(\times \)512 pixels for the output.
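One way to realize the resize-convolution upsampling step in PyTorch is sketched below. The channel counts, nearest-neighbor resize mode, and ReLU activation are illustrative assumptions; the paper's exact U-Net configuration may differ:

```python
import torch
import torch.nn as nn

class ResizeConv(nn.Module):
    """Upsampling block: nearest-neighbor resize followed by a padded convolution.

    Replaces transposed convolution to avoid checkerboard artifacts [16]."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(self.up(x))))
```

Because the convolution is padded, each block exactly doubles the spatial resolution, so the decoder output matches the 512\(\times \)512 pixel input size.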
(a) A conventional segmentation algorithm generates masks from SEM image patches as training data. We train the decoder to predict the SEM patch gradient approximation from these masks using MSELoss. (b) We train the encoder using the decoder and its reconstruction loss as encoder loss function, adding the class exclusivity loss term to improve the generated segmentation
2.3 Encoder
The encoder receives SEM image patches and predicts their segmentation masks with separate channels for background, track, and via class probabilities. These masks are the primary output of our segmentation approach and serve as the basis for layout extraction.
In an unsupervised setting, we cannot assess the quality of the encoder output directly. We can, however, apply the decoder to the encoder output and compute the reconstruction loss of the resulting gradient prediction, as depicted in Figure 1b. Assuming that an accurate gradient reconstruction from the decoder requires a high quality segmentation mask from the encoder, we gain an error metric for the encoder output. Propagating the reconstruction loss back through the decoder, we receive a differentiable loss function that allows us to train the encoder. During encoder training, we only update the encoder weights and do not train the decoder.
Class exclusivity loss. We add the term from Equation 2 to the encoder loss function to incentivize the encoder to predict that a pixel belongs exclusively to either background (B), track (T), or via (V), and scale this term by the hyperparameter \(\lambda _2\), choosing \(\lambda _2 = 0.1\). In conjunction with the reconstruction loss, this term trains the encoder to predict mostly binary segmentation masks that allow for straightforward track and via extraction.
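The combined encoder loss can be sketched as follows in PyTorch. The pairwise-product form of the exclusivity penalty is an assumption for illustration; the paper's Equation 2 may use a different formulation:

```python
import torch

def class_exclusivity(mask, lam2=0.1):
    """Exclusivity penalty scaled by lambda_2. Zero for one-hot pixels; positive
    whenever probability mass is spread across B, T, and V channels.
    (Pairwise-product form is an illustrative assumption.)"""
    b, t, v = mask[:, 0], mask[:, 1], mask[:, 2]
    return lam2 * (b * t + b * v + t * v).mean()

def encoder_loss(encoder, frozen_decoder, sem_patch, grad_target, lam2=0.1):
    """Reconstruction loss propagated through the frozen decoder, plus the
    class exclusivity term. Only the encoder's weights receive gradients."""
    mask = encoder(sem_patch)                     # (N, 3, H, W) class probabilities
    recon = frozen_decoder(mask)                  # predicted gradient approximation
    rec = torch.mean((recon - grad_target) ** 2)  # MSELoss
    return rec + class_exclusivity(mask, lam2)
```

A fully binary mask contributes nothing to the exclusivity term, while a maximally uncertain mask (all channels at \(\frac{1}{3}\)) contributes \(\lambda _2 / 3\).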

Encoder architecture. For our workshop version, we trained the decoder on smaller 128\(\times \)128 pixel SEM image patches and added an overlap for the encoder to increase segmentation accuracy around patch edges, resulting in a 174\(\times \)174 pixel encoder input size. In the extended version, we compare our approach to multiple supervised ML model architectures, which we discuss in Section 3, switching to 512\(\times \)512 pixel patches without overlap to integrate with the existing supervised training pipeline. Instead of training the decoder and encoder alternately each epoch, we now train the decoder first and then the encoder, computing its loss with the final decoder model. Separating this process allows encoder pre-training as presented in the next section, which would cause a performance imbalance with alternating decoder-encoder training. We additionally tried training the decoder on encoder output, but found that the autoencoder optimizes its hidden representation for reconstruction, while losing its interpretability as segmentation masks.
The encoder uses the decoder U-Net architecture with a final activation function that normalizes the class scores of each pixel. During deployment, we only require the encoder segmentation results and not the decoder, reducing the complexity and size of the model.
2.4 Improving training stability
With the alternating training procedure of decoder and encoder, we already observed stability issues in our workshop paper [10]. Randomly initializing the model weights before training, we encountered large variations in model performance between trained instances, with results ranging from performance competitive with supervised ML architectures to unusable models with error rates near 100%.
We initially solved this problem by training five model instances and picking the one with the lowest reconstruction loss on the training data, as we noted that low reconstruction loss tends to come with good ESD performance. In practice, this usually works well but causes a significant overhead in resources required for training. Additionally, this bears the risk of not yielding any high-performing models after multiple rounds of training, which we did observe for fixed thresholding in our experiments.
In this extension of our work, we present an improvement to the training process that solves the stability issues, but appears to be dependent on the input segmentation algorithm. Instead of starting the unsupervised encoder training process from randomly initialized weights, we pre-train the encoder with a similar setup and the same pseudo masks that we use for decoder training. For that, we provide SEM image patches as encoder inputs and compute the MSELoss of its output compared to these masks as labels. These pseudo masks can be generated without supervision by the conventional segmentation algorithms. Although learning from erroneous data, the pre-training imparts the expected outputs for background, tracks, and vias to the encoder, so that the unsupervised training starts with relatively consistent model weights between runs. With pre-training, the unsupervised encoder training described in the previous section produces models with consistent ESD performance, solving the stability issues. However, we were only able to obtain satisfactory ESD results, on par with supervised ML models, when using random thresholding as the input segmentation algorithm. In contrast, fixed thresholding and MorphACWE only marginally improved over the input segmentations when using pre-training. We discuss these results in detail in Section 5, comparing them to different supervised ML model architectures.
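A minimal sketch of this pre-training step, assuming a PyTorch encoder and an iterable of (SEM patch, pseudo mask) pairs; the optimizer choice mirrors the unsupervised training setup in Section 5.2:

```python
import torch

def pretrain_encoder(encoder, loader, epochs=1, lr=1e-3):
    """One-epoch warm start: fit the encoder to pseudo masks produced by a
    conventional segmentation algorithm (no ground truth involved)."""
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    for _ in range(epochs):
        for patch, pseudo_mask in loader:
            opt.zero_grad()
            loss = torch.mean((encoder(patch) - pseudo_mask) ** 2)  # MSELoss
            loss.backward()
            opt.step()
    return encoder
```

The point is not to fit the (erroneous) pseudo masks closely, but to start the subsequent unsupervised training from weights that already encode the expected output format.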
3 Comparison with supervised ML models
To benchmark our proposed unsupervised architecture against existing supervised ML approaches, we select three segmentation networks for comparison, namely Fully Convolutional Network (FCN) [17], U-Net [15], and DeepLabV3 [18]. FCN employs an encoder-decoder architecture, where the encoder downsamples the input to extract high-level features, and the decoder upsamples these features to produce dense predictions for each pixel. U-Net also adopts the encoder-decoder architecture but features a U-shaped structure with skip connections between all corresponding encoder and decoder layers that share the same feature dimensions to retain spatial information. DeepLabV3 utilizes dilated convolutions to capture multi-scale features without reducing spatial resolution. Following a feature extraction backbone, DeepLabV3 employs a modified atrous spatial pyramid pooling module to aggregate multi-scale image features. In our experiments, both FCN and DeepLabV3 adopt ResNet50 as the feature extraction backbone, following the official PyTorch implementations.
As supervised ML approaches, all three segmentation networks require pairs of input SEM images and corresponding targeted outputs in the training phase. We experimented with two different types of targeted outputs, the ground-truth masks as described in Section 4 and pseudo masks generated by fixed thresholding.
4 Real-world IC SEM image datasets
To train and evaluate our IC image segmentation approach, we employ a real-world SEM image dataset, whose creation and characteristics we detail in this section. The data consists of 648 SEM images showing the logic area of the metal-1 (M1) and metal-2 (M2) layers of a commercial IC produced on a 180 nm technology node. The target IC has a total of six metal layers and a polysilicon layer and was developed for (medium) security applications. Figure 2 contains a sample image with ground truth from the M2 layer, which we labeled and used for our evaluation. We published our dataset, including the M2 labels, under a permissive open-source license (CC-BY 4.0).
4.1 Chip preparation and imaging
Prior to imaging, we prepared the chip as follows: First, we chemically removed the device packaging, leaving the bare silicon die. The packaged chip dimensions are \(3{\times }3\,\hbox {mm}\) and the actual die measures \(2.16\,\hbox {mm}\times 1.68\,\hbox {mm}\). After controlled removal of a thick aluminium (Al) top metal structure and a silicon nitride (SiN) / silicon dioxide (\(\hbox {SiO}_{2}\)) layer, we obtained an almost perfectly flat surface, which is the prerequisite for the following processing steps. Subsequently, we performed delayering of each metal interconnect layer using a broad ion milling system. Using argon (Ar) ions at a pressure of \(4\times 10^{-4}\,\hbox {mbar}\), 400 V beam voltage, and a current density of \(0.18\,\hbox {mA}/\hbox {cm}^{2}\), we achieved an etch rate of 2 to 80 nm/min, depending on beam incident angle and material sputter yield.
Next, we acquired images of each layer using a Scanning Electron Microscope (SEM) with backscattered electron detector. As the backscatter electron yield is material dependent, both the metal layer and the underlying tungsten vias are visible in the images. To obtain satisfactory image contrast and signal-to-noise ratio, allowing us to discern tracks and vias from background, we used an electron energy of 15 keV and a dwell time of \({3}\,\upmu \hbox {s}\).
4.2 Dataset characteristics and labeling
The M1 and M2 layers were captured with a resolution of 14.65 nm per pixel and 10% overlap between SEM images. Each image is 4096\(\times \)3536 pixels in size. After discarding images of the area’s boundaries, our dataset contains 327 images from the M1 layer and 321 from M2, yielding a total of 6 GiB grayscale image data. For training and evaluation, we split the M2 layer SEM images into 15,408 patches of 512\(\times \)512 pixels. Compared to our original setup [10], we do not use the M1 layer as additional training data.
Typically, the preparation and imaging processes create artifacts in SEM images that may cause challenges for segmentation. Uneven delayering, for example, yields inhomogeneous track and background coloring, which reduces contrast and, in extreme cases, can uncover tracks from adjacent metal layers. Additionally, bright spots that occur during imaging may interfere with via detection. Finally, vias in the SEM image are regularly surrounded by halos, which can make boundaries between neighboring tracks hard to determine, an artifact we call bleeding. Our dataset also contains such artifacts (see Figure 2a for an example).
We automatically labeled track polygons and via positions on the M2 layer, using techniques such as thresholding, edge detection, and size, position, and complexity filtering. In a final step, an experienced analyst manually validated and corrected these labels, which took an average of 6 min per image and a total of 6 person days for the entire layer. As the labels were originally intended for manual IC layout extraction, we decided against removing duplicates of detected vias. Furthermore, tracks spanning multiple images are not labeled on each image and instead appear as missing tracks when images are processed in isolation.
4.3 Existing datasets
To our knowledge, we are the first to publish a real metal-layer IC SEM image dataset, and only two other open datasets exist. Cheng et al. [19] annotated gates on 640 polysilicon-layer IC SEM images with 384\(\times \)512 pixels each, which are available on request. Wilson et al. [20] generated and published 800,000 synthetic polysilicon and M1-layer SEM images from 32 nm and 90 nm IC layout files. With 250\(\times \)250 pixels each, the simulated images are much smaller than our real SEM images, only showing standard cell sections. Their simulation adds noise to emulate the imaging process, simple shape changes to imitate deformations from IC manufacturing, and shifted image regions to mimic stitching errors. In contrast to the artifacts we observed in our SEM images and described in the previous section, their background and tracks appear uniform apart from these simulated effects.
(a) Delayering artifacts manifest as uneven background and track coloring. Imaging artifacts are visible as bright spots on tracks. (b) Imperfect segmentation of (2) containing a short in red and an open track in blue. Track labels are outlined in gray
5 Evaluation
Here, we present the evaluation of our original and improved approaches on our real-world dataset from the previous section. First, we advocate for meaningful evaluation metrics and propose using the Electrically Significant Difference (ESD) error metric introduced by Trindade et al. [9]. Second, we provide implementation details and report our results, comparing our approach to three state-of-the-art supervised ML models. Third, we perform an ablation study to determine the impact of individual parts of our algorithm on performance. Finally, we discuss related work, as well as limitations of our approach and future research directions.
5.1 Evaluation metrics
IC layout extraction from SEM images is error prone, since even a few added or missing electrical connections between components can greatly affect the resulting layout and may require manual review and cleanup. Minimizing such faulty connections is thus paramount. However, commonly used evaluation metrics, namely mean Pixel Accuracy (mPA) and mean Intersection over Union (mIoU) [8, 19, 21], assess the classification accuracy of individual pixels without considering connectivity. Inaccurately segmented track borders, while still allowing perfect layout recovery as long as connectivity remains unchanged, can therefore have a profound impact on mPA and mIoU. Conversely, bridging two tracks through a few incorrectly classified pixels induces a connectivity error, while having a negligible impact on the aforementioned per-pixel metrics. Therefore, these metrics do not provide a meaningful quality measure for SEM image segmentation with respect to layout extraction.
Electrically significant difference. For this reason, we evaluate our approach with the ESD metric proposed by [9], sometimes also referred to as Connected Component Analysis (CCA) [20]. This metric counts the number of electrical shorts and opens within the segmentation. Shorts are created when two distinct tracks are merged into one segment, and opens are single tracks that are split into multiple segments by the algorithm. We additionally report ESD false positives (FPs), i. e., segments without corresponding track in the ground truth, and false negatives (FNs), i. e., tracks that do not appear in the segmentation. Figure 2b illustrates short and open errors in a segmented sample image.
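To make the metric concrete, the following NumPy sketch counts shorts and opens on binary track masks via connected components. Note that the implementation described in Section 5.2 operates on polygons extracted with OpenCV, so this is a simplified illustration of the same idea:

```python
import numpy as np
from collections import deque

def label_components(mask):
    """4-connected component labeling of a binary mask via BFS flood fill."""
    labels = np.zeros(mask.shape, dtype=int)
    current = 0
    for sy, sx in zip(*np.nonzero(mask)):
        if labels[sy, sx]:
            continue
        current += 1
        labels[sy, sx] = current
        queue = deque([(sy, sx)])
        while queue:
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = current
                    queue.append((ny, nx))
    return labels, current

def esd_errors(gt_mask, seg_mask):
    """Opens: a ground-truth track split across several segments.
    Shorts: a segment merging several ground-truth tracks."""
    gt, n_gt = label_components(gt_mask)
    seg, n_seg = label_components(seg_mask)
    opens = sum(len(set(seg[(gt == i) & (seg > 0)])) > 1
                for i in range(1, n_gt + 1))
    shorts = sum(len(set(gt[(seg == i) & (gt > 0)])) > 1
                 for i in range(1, n_seg + 1))
    return opens, shorts
```

A few bridging pixels between two tracks immediately register as a short here, even though they barely move pixel-wise metrics, which is exactly the sensitivity the ESD metric is meant to provide.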
5.2 Implementation details
Compared to our original work [10], we introduce a number of changes into our algorithm, the training process, and the evaluation. These changes integrate our setup with the existing ML pipeline we use for comparison with the supervised ML models described in Section 3 and significantly improve our training and testing methodology. We published our revised implementation together with our previous version on GitHub.
Dataset. To compare our approaches to the supervised models, we only use the annotated images from the M2 layer of our dataset. We split these into train, validation, and test sets comprised of 10%, 2%, and 88% of image patches, respectively. This split keeps the effort for manually labeling training and validation data reasonable in practice, while providing us with a large test set allowing for a comprehensive evaluation. Although our unsupervised approach does not have access to ground-truth information during training, we evaluate both the supervised models and our approaches only on the test set to prevent overfitting of the former and thus introduce a clear separation between training and test data.
Conventional segmentation algorithms. As discussed in Section 2.1, all our conventional SEM image segmentation algorithms rely on some form of thresholding. We systematically choose track and via thresholds from the histogram of all M2 layer images in the dataset: as track threshold, we pick the minimum between the peaks where background and track pixel brightnesses cluster; for vias, we pick an appropriate value, since the number of via pixels in the dataset is too small to manifest as a peak in the histogram. Thus, we arrive at the same track threshold of \(\frac{67}{256}\) we used in the evaluation of our workshop paper and a higher threshold of \(\frac{172}{256}\) for vias. Deriving from a MorphACWE reference implementation, we built a parallelized version of this algorithm using OpenCL. Based on qualitative visual inspection, we run MorphACWE for tracks and vias separately for 50 iterations each, with three rounds of smoothing per iteration and a foreground weight of 2 for tracks and \(\frac{1}{2}\) for vias.
Supervised models. As a benchmark and upper bound on the performance of our approach, we consider the three state-of-the-art supervised ML models introduced in Section 3. To detect performance variations between runs, we train five instances with different random seeds for each model architecture. Specifically, we train every model for 100 epochs, evaluating the mIoU on the validation set each epoch, and choose the checkpoint with the best mIoU as the final model. We use PyTorch with the Adam [22] optimizer with an initial learning rate of 0.1 and a learning rate scheduler that reduces it by a factor of 10 on stagnation over 10 epochs. Using batch size 32, we parallelize training over four NVIDIA® RTX™ A5000 GPUs.
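In PyTorch terms, this optimizer and scheduler configuration might look as follows. The stand-in model is a placeholder for the FCN, U-Net, and DeepLabV3 architectures, and the metric passed to the scheduler is whichever validation quantity is monitored for stagnation:

```python
import torch

# Stand-in model; the actual architectures are FCN, U-Net, and DeepLabV3.
model = torch.nn.Conv2d(1, 3, kernel_size=3, padding=1)
opt = torch.optim.Adam(model.parameters(), lr=0.1)
# Reduce the learning rate by a factor of 10 when the monitored validation
# metric stagnates for 10 epochs.
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.1, patience=10)
```

After each epoch, `sched.step(val_metric)` is called with the monitored value; once the metric fails to improve for more than 10 consecutive epochs, the learning rate drops from 0.1 to 0.01, and so on.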
Pseudo masks. Using the previous process, we also train the supervised ML architectures on pseudo masks instead of ground-truth labels, which we generate using fixed thresholding (see Section 2.1) as segmentation algorithm. With this more efficient and conceptually simpler approach, we obtain a lower bound for evaluating the segmentation performance of our approach.
Our unsupervised approach. We further adjust the training process of the supervised models to train our encoder and decoder instances for the three conventional segmentation algorithms introduced previously. Following the supervised process, we train every encoder and decoder model for 100 epochs, albeit with a lower learning rate of \(10^{-3}\), the PyTorch default. As the final model, we pick the checkpoint with the lowest reconstruction loss on the validation set, which is possible in an unsupervised manner. To initialize the model weights before encoder training, thus improving training stability, we pre-train the encoder for one epoch on the segmentation algorithm results we use as decoder training inputs, before proceeding with the encoder training we described in Section 2.3. Because encoder training also evaluates the decoder at every step, any parallelization speedups from replicating data and splitting input batches across GPUs per model are more than offset by the increased data transfers from gathering results after each model’s forward pass. Instead, we only use one GPU and reduce the batch size to 6 patches, as dictated by memory constraints.
ESD evaluation. To integrate with the supervised ML pipeline, we adapt our ESD evaluation to assess connectivity errors on a per-patch basis, rather than on entire SEM images. Instead of computing ESD errors on the segmentation masks and label image patches directly, we binarize the mask before extracting track and label polygons using OpenCV. This simplifies the implementation, as we can perform the ESD evaluation with polygon intersection. To avoid overly amplifying ESD open errors, we count each ground-truth track that is split as a single error, rather than counting each of the segments it is split into as an error, as we did in the workshop publication.
In order to reduce the number of false positives, we filter polygons with a total area smaller than 35 pixels and ignore ESD errors for tracks fully enclosed within a 35 pixel margin around patch borders, accounting for missing labels near image borders in our dataset (see Section 4.2) and deteriorating model performance due to missing pixel information near borders. For suitable datasets, overlapping patches could be used with appropriately cut results instead to achieve good performance on all image regions. We compute the ESD error rate as the ratio of incorrectly segmented tracks to the total number of tracks considered. Tracks spanning multiple patches are split on patch borders and counted per patch.
5.3 Results
We evaluate the performance of our improved approach against the original algorithm from our workshop paper [10], the supervised ML models trained on ground-truth labels and thresholded pseudo masks as described in Section 3, as well as the conventional input segmentation algorithms from Section 2.1 used directly. To evaluate instabilities between model instances, we train each model architecture five times with different random seeds. Figure 3 shows the ESD error rates for each run, denoting the ratios between short, open, FP, and FN errors, as well as the mean error rate across the five instances. We list the results of each run in more detail in Appendix A. Due to the changes to our evaluation that we detailed in the previous section, in particular the use of patches instead of whole SEM images, our results are not directly comparable to our workshop version, and we re-ran these experiments with our new training procedure.
Comparison of ESD error rates on our M2 layer test set between our original and improved approaches, supervised ML models, and input segmentations. As lower baseline, we trained the supervised models on pseudo masks obtained using fixed thresholding. For each approach, we report the results of each run individually, as well as the mean error rate across runs. While we report ESD error rate on a logarithmic scale, we denote the ratios of different error types in linear space
Conventional segmentation algorithms. To generate training data for the unsupervised approaches we evaluate in this section, we run each of the three conventional segmentation algorithms on the M2 layer patches we created from our dataset. For reference, we computed the ESD errors when using these masks as segmentations directly.
The results fall between 6.6% for fixed thresholding and 10.5% for random thresholds, revealing significant differences from our previous experiments [10]. Notably, by reducing the impact of ESD open errors, we halved this error type for random thresholding, yielding an overall error rate of 10.5% that is much closer to the two other segmentation algorithms, compared to the previous 24.8% of predominantly open errors. Fixed thresholding additionally outperforms MorphACWE with the changes to our evaluation methodology. We attribute these differences in total errors and ratios to evaluating patches individually, rather than entire SEM images, a change necessary to simplify our implementation and integrate with the supervised ML pipeline. Since larger tracks are now split on patch borders, the total number of tracks increases from 115 861 to 128 051, despite our test set covering only 88% of the M2 layer area.
Supervised models. The supervised ML models trained on ground-truth labels perform consistently well, with all instances achieving ESD error rates between 0.2% and 1.9%. Among the three trained model architectures, DeepLabV3 achieves the best results and is closely followed by the equally stable FCN, albeit at a slightly higher mean error rate. Finally, U-Net performs worst overall and has more variance between instances.
Pseudo masks. When trained on pseudo masks created with fixed thresholding, the same models do not significantly improve over this conventional segmentation algorithm. As a conceptually simpler approach, they form, together with the conventional algorithms, a lower bound for the performance we aim to achieve with our approach.
Our original approach. Rerunning the experiments with our previously published algorithm from Section 2, with alternating decoder-encoder training and without pre-training, we again observe training instabilities for different random seeds. On the one hand, the best-performing instances with random thresholding and MorphACWE as input algorithms achieve ESD error rates of 0.52% and 0.56%, respectively, which is on par with the supervised U-Net architecture. On the other hand, most training iterations produce either mediocre models that only slightly outperform the conventional algorithms or deteriorated instances producing unusable segmentations. Unexpectedly, all fixed-threshold models perform fairly consistently, with error rates between 4.4% and 6%, only a marginal improvement over the input segmentations and the models trained on pseudo masks. We believe that this apparent stability across random seeds is only by chance and that, with more training iterations, the same instabilities that we also observed for this input algorithm in our previous work would similarly emerge. However, we did not investigate these 'stable' but sub-par results further.
Our improved approach. As the main improvements over the workshop version, we separate encoder and decoder training and, to improve stability, introduce an epoch of pre-training the encoder on the pseudo masks we use for decoder training. Our experiments show that, for each of the three input algorithms, the trained model performance is very consistent between random seeds, with ESD error rates within 2% of each other. However, for fixed thresholding and MorphACWE, the models fail to improve significantly over the input segmentations and the supervised models trained on pseudo masks. Conversely, all instances trained with random thresholds achieve error rates below 1.4%, and 0.8% on average, which is competitive with the supervised models trained on ground-truth labels and the best results of our original, unstable approach.
We offer two possible explanations for these results. First, we believe that randomness might aid the performance of the trained models. Because fixed thresholding and MorphACWE are deterministic, the encoder can start learning to replicate their output during pre-training. During encoder training, passing similar segmentations to the decoder, which was already trained on pseudo masks from the respective conventional algorithm, might further push the encoder towards replicating the input segmentations, as decoder reconstruction losses for these well-known inputs tend to be low. Thus, the encoder outputs would also exhibit ESD error rates similar to the input pseudo masks. With random thresholding, neither encoder nor decoder can optimize for specific segmentations over overall segmentation quality. To stop the encoder from imitating the deterministic conventional algorithms, we tried reducing the number of patches used for pre-training. Additionally, we experimented with choosing different conventional segmentation algorithms for decoder and encoder pre-training, e.g., fixed thresholding for the decoder and random thresholds for the encoder. Neither of these variations, however, improved the results.
Second, the ratio between shorts and opens, the two dominant ESD error types, in the input segmentations seems to be important for model performance. Fixed thresholding and MorphACWE both produce more than 10 times as many shorts as opens in their pseudo masks, while random thresholding produces mainly ESD open errors. Models that outperform the conventional segmentation algorithms largely exhibit a more balanced ratio between these error types. In particular, this observation also holds for the supervised ML models trained on labeled data and for the one well-performing instance of our original approach using MorphACWE segmentations, which produces significantly more opens than its input algorithm. Conversely, pre-training or exclusively training on pseudo masks seems to impart the input error ratios onto the model results to some extent. When using pre-training for stability, it therefore seems important to choose an input algorithm that produces masks with suitable ratios between different ESD error types, as seems to be the case for random thresholding, but not for fixed thresholding and MorphACWE.
Per-pixel metrics. To allow comparison with existing work, we report the standard segmentation metrics mean Intersection over Union (mIoU) and mean Pixel Accuracy (mPA) for our results in Table 1. As discussed in Section 5.1, we consider mIoU and mPA insufficient measures of segmentation quality in the context of layout extraction. The evaluation results support our claim: while the per-pixel metrics tend to correlate with overall segmentation quality, they are not suitable for detailed comparisons between models. For example, DeepLabV3 performs significantly better than supervised U-Net on average, which their equal mIoUs and mPAs do not reflect. Conversely, our improved approach with pre-training using random thresholds performs on par with U-Net in terms of segmentation errors, but has considerably worse per-pixel results.
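For reference, both per-pixel metrics can be computed directly from the class masks. This is a minimal sketch under common definitions (mPA as the mean of per-class accuracies, classes absent from both masks skipped); the exact averaging conventions of our evaluation code may differ.

```python
import numpy as np

def miou_mpa(gt, pred, n_classes=3):
    # Classes: 0 = background, 1 = track, 2 = via.
    ious, accs = [], []
    for c in range(n_classes):
        g, p = gt == c, pred == c
        inter = np.logical_and(g, p).sum()
        union = np.logical_or(g, p).sum()
        if union > 0:                      # skip classes absent from both masks
            ious.append(inter / union)
        if g.sum() > 0:
            accs.append(inter / g.sum())   # per-class pixel accuracy (recall)
    return float(np.mean(ious)), float(np.mean(accs))

gt   = np.array([[0, 1], [1, 2]])
pred = np.array([[0, 1], [0, 2]])
miou, mpa = miou_mpa(gt, pred)
print(round(miou, 3), round(mpa, 3))  # 0.667 0.833
```

The toy example already hints at the weakness discussed above: a single mislabeled pixel can leave these averages nearly unchanged even when it electrically opens or shorts a track.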
Summary of results. Ultimately, both our original and improved approaches achieve results competitive with supervised ML models in some instances, without requiring manual labels. In this extended version of our work, we solved the training stability issues with random thresholding and consistently generate high-performing models, while no longer being input-algorithm agnostic. Our extensive evaluation shows that we outperform a traditional unsupervised learning method of training models on pseudo masks, which cannot improve significantly over the input segmentations.
5.4 Ablation study
To investigate the impact of each individual building block of our approach, we perform an ablation study and disable parts of our final algorithm with random threshold segmentations and one epoch of pre-training (see Section 2 for details). Specifically, we consider our approach without pre-training, without the class exclusivity loss \(\text {L}_\text {excl}\), and without reconstructing the SEM image gradient. For the latter, we instead let the decoder reconstruct the SEM image directly. With each of these parts removed in isolation, we train our approach five times and present the results in Table 2.
Pre-training. Without pre-training, our algorithm becomes very similar to our workshop version, the largest difference being that we first fully train the decoder and then the encoder instead of training them alternately. As expected, our results are in line with the previous experiments. One instance performs well, with an ESD error rate of 0.7%, comparable to the supervised approaches and the results of our improved approach. All other instances, however, produce unusable segmentations with 100% ESD errors, which confirms that pre-training indeed improves training stability substantially.
Class exclusivity loss. Evaluating our approach with pre-training but without the \(\text {L}_\text {excl}\) loss produces slightly higher error rates and more variation between instances. Additionally, the resulting segmentation masks show discoloring patterns rather than the uniform background, track, and via areas we achieve with this loss term, and the mIoU drops from 87.3% to 83.2%. These results indicate that the \(\text {L}_\text {excl}\) loss aids in accurately reconstructing track borders on a per-pixel level.
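To convey the intuition behind such a loss term, the following is one plausible formulation, not necessarily the paper's exact definition of \(\text {L}_\text {excl}\) (see Section 2 for that): a penalty that is zero exactly when each pixel's class distribution is one-hot, discouraging the mixed, "discolored" outputs described above.

```python
import numpy as np

def exclusivity_penalty(probs):
    # Illustrative class-exclusivity penalty (NOT necessarily the
    # paper's exact L_excl). `probs` holds per-pixel class
    # probabilities of shape (C, H, W); 1 - sum_c p_c^2 is zero iff
    # exactly one class has probability 1 at that pixel.
    return float(np.mean(1.0 - np.sum(probs ** 2, axis=0)))

one_hot = np.zeros((3, 1, 1)); one_hot[0] = 1.0
print(exclusivity_penalty(one_hot))                    # 0.0
uniform = np.full((3, 2, 2), 1 / 3)
print(round(exclusivity_penalty(uniform), 3))          # 0.667
```

Any penalty of this shape pushes the encoder towards committing each pixel to a single one of the background, track, and via classes.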
SEM gradient. Reconstructing the encoder input SEM images directly, instead of their gradients, achieves slightly worse ESD error rates than our final approach, at a comparable per-pixel accuracy. While we originally introduced SEM gradients to focus the decoder on correctly reconstructing track borders, pre-training seems to reduce their performance gain compared to our workshop version, possibly because it achieves the same goal by imparting track shapes on the encoder model instead.
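As a concrete reading of this reconstruction target, a per-pixel gradient magnitude can be computed via central differences. This is a sketch under the assumption of a finite-difference gradient; the exact gradient operator in our implementation may differ.

```python
import numpy as np

def gradient_magnitude(img):
    # Per-pixel gradient magnitude of a 2D image via central
    # differences; high values concentrate on track borders, which is
    # what the gradient target emphasizes for the decoder.
    gy, gx = np.gradient(img.astype(float))
    return np.hypot(gx, gy)

# A linear horizontal ramp has unit gradient magnitude everywhere:
ramp = np.tile(np.arange(4.0), (4, 1))
print(gradient_magnitude(ramp))  # all ones
```

Flat background and track interiors map to values near zero, so reconstruction errors there cost the decoder little compared to errors on the borders.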
Summary. In conclusion, each of the three building blocks of our approach that we considered in our ablation study improves its performance to varying degrees. In particular, pre-training boosts training stability while retaining the high performance for random thresholding.
5.5 Related work
Here we place our results in the context of four ML approaches to track segmentation that also report ESD errors. Hong et al. [5] train a Convolutional Neural Network (CNN) on 200 labeled SEM images and achieve 0.83 shorts and 0.26 opens per 2048\(\times \)1536 pixel image, which translates to 0.09 ESD errors per 512\(\times \)512 pixel patch. For comparison, our improved approach has an average per-patch error rate of 0.07. Using a Generative Adversarial Network (GAN) on labeled patches half this size, Tee et al. [7] achieve a per-pixel performance similar to our approach, with an mPA of 0.9442 and an mIoU of 0.8563, and report an ESD error rate of 4.71%, albeit, judging by their figures, on SEM images with a higher track density. Yu et al. [6] train a CNN on 21 SEM images of 8192\(\times \)8192 pixels for 100 epochs and use post-processing to reduce the number of ESD errors, reporting 50.71 errors in total (0.2 per our patch size) with 95.75% mPA and a high 91.86% mIoU. The images again appear to have a higher track density than our dataset. Wilson et al. [20] tested multiple ML and non-ML algorithms on their synthetically generated REFICS dataset (see Section 4.3), achieving segmentation error rates as low as 1% using CycleGAN [23] trained on labeled data, with an 89% mIoU. Although not directly comparable due to the different underlying datasets, the accuracy of our approach appears to be on par with the state of the art, which, however, requires labeled training data.
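The per-patch conversion of the figures reported by Hong et al. [5] is a short calculation, reproduced here as a worked check:

```python
# Hong et al. [5] report 0.83 shorts and 0.26 opens per 2048x1536
# pixel SEM image; one such image covers twelve 512x512 patches.
errors_per_image = 0.83 + 0.26
patches_per_image = (2048 * 1536) // (512 * 512)   # = 12
errors_per_patch = errors_per_image / patches_per_image
print(round(errors_per_patch, 2))  # 0.09
```

Normalizing to a common patch size in this way is what makes error counts from differently sized images comparable at all, though track density differences still limit the comparison.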
5.6 Limitations and future research
Our approach yields promising results and our work for this extended paper produces stable models, while no longer being input-algorithm agnostic. However, some of the limitations and areas for future work that we identified in our workshop paper still apply.
First, unsupervised learning is limited to the patterns it can observe in the input data; for our dataset, random bright artifacts caused by the imaging process interfere with reliable via detection. While a supervised approach with extensive labeling might learn to differentiate such artifacts, our approach has the advantage of automatically generating a low-error segmentation for tracks without any manual annotations. In future research, unsupervised track segmentation algorithms could be combined with specialized algorithms for via detection [2].
Second, we employed three relatively simple conventional input segmentation algorithms for decoder training, settling on random thresholding as the algorithm that produces well performing models with high consistency. We briefly considered adaptive and multi-level Otsu thresholding [24] as additional algorithms but were unable to obtain satisfactory initial segmentations. Evidently, our approach has specific requirements on these inputs: To supply the decoder with sufficient information to reconstruct SEM image features, suitable conventional algorithms have to discern between three classes with very different prevalence in the input data, e.g., considering the tiny footprint of vias compared to background areas.
Third, we originally evaluated our approach on a single metal layer of our real-world SEM image dataset, where it performs well. Manual inspection of selected results on our dataset's M1 layer, which features much thinner and more densely packed tracks, indicates a deterioration in the segmentation quality of our approach. For the extended version, we were able to test our approach on additional datasets, achieving mixed results. On one dataset with high-quality SEM images, all methods we evaluated in this paper, from the conventional segmentation algorithms to our approach, performed equally well, which is why we omit these experiments here. For datasets where our conventional segmentation algorithms could not reliably separate tracks from background, we were unable to train models that produce usable masks.
Having solved the instability during training, we can thus identify the conventional segmentation algorithms and specifically global thresholding, which all three evaluated algorithms rely on, as the main limitation of our approach. Finding suitable alternatives that retain the training stability we observed for random thresholding is therefore the most promising area of future work, with the potential to greatly improve adaptability to other datasets.
6 Conclusion
In this work, we have introduced a novel unsupervised approach for SEM image segmentation in IC layout extraction. Our method eliminates the need for manual labeling by leveraging unsupervised deep learning, enabling easier adaptation to new datasets within the constraints we identified for our input segmentation algorithms. Our evaluation on the M2 layer of a real-world dataset demonstrated low Electrically Significant Difference (ESD) error rates for track segmentation, comparable to state-of-the-art supervised approaches. However, challenges for successful layout extraction remain, such as imaging artifacts and the diversity of materials and technologies used in IC fabrication. To address these challenges, future research could explore alternative input segmentation algorithms to improve performance and generalizability, or otherwise combine our approach with supervised or conventional image processing.
To foster reproducibility and facilitate further research, we have released our dataset and algorithm implementation under an open-source license. This will enable other researchers to apply our approach to their own datasets and evaluate their methods using our dataset, fostering fair comparisons among different segmentation algorithms. By promoting collaboration and transparency, we hope to drive progress in the field of IC layout extraction.
Data Availability
Our SEM image dataset is available on Edmond, the Open Research Data Repository of the Max Planck Society under a permissive open-source license (CC-BY 4.0), https://doi.org/10.17617/3HY5SYN. We will publish our revised algorithms together with our previous version on GitHub, https://github.com/emsec/unsupervised-ic-sem-segmentation.
Notes
While the acronym ESD usually refers to the term electrostatic discharge, we have chosen to remain consistent with the original work.
References
Asadizanjani, N., Rahman, M.T., Tehranipoor, M.: Physical inspection of integrated circuits. Physical Assurance: For Electronic Devices and Systems, 49–65 (2021)
Puschner, E., Moos, T., Becker, S., Kison, C., Moradi, A., Paar, C.: Red team vs. blue team: A real-world hardware trojan detection case study across four modern CMOS technology generations. In: IEEE Symposium on Security and Privacy (SP), pp. 56–74. IEEE Computer Society, Los Alamitos, CA, USA (2023)
Kimura, A., Scholl, J., Schaffranek, J., Sutter, M., Elliott, A., Strizich, M., Via, G.D.: A decomposition workflow for integrated circuit verification and validation. J. Hardw. Syst. Secur. 4, 34–43 (2020)
Lin, T., Shi, Y., Shu, N., Cheng, D., Hong, X., Song, J., Gwee, B.H.: Deep learning-based image analysis framework for hardware assurance of digital integrated circuits. Microelectron. Reliab. 123, 114196 (2021)
Hong, X., Cheng, D., Shi, Y., Lin, T., Gwee, B.H.: Deep learning for automatic ic image analysis. In: 2018 IEEE 23rd International Conference on Digital Signal Processing (DSP), pp. 1–5. IEEE, Shanghai, China (2018)
Yu, Z., Trindade, B.M., Green, M., Zhang, Z., Sneha, P., Tavakoli, E.B., Pawlowicz, C., Ren, F.: A data-driven approach for automated integrated circuit segmentation of scan electron microscopy images. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 2851–2855. IEEE, Bordeaux, France (2022)
Tee, Y.-Y., Hong, X., Cheng, D., Chee, C.-S., Shi, Y., Lin, T., Gwee, B.-H.: Patch-based adversarial training for error-aware circuit annotation of delayered ic images. IEEE Trans. Circuits Syst. II Express Briefs 70(9), 3694–3698 (2023)
Tee, Y.-Y., Cheng, D., Chee, C.-S., Lin, T., Shi, Y., Gwee, B.-H.: Unsupervised domain adaptation with histogram-gated image translation for delayered ic image analysis. In: 2022 IEEE Physical Assurance and Inspection of Electronics (PAINE), pp. 1–7. IEEE, Huntsville, AL, USA (2022)
Machado Trindade, B., Ukwatta, E., Spence, M., Pawlowicz, C.: Segmentation of Integrated Circuit Layouts from Scan Electron Microscopy Images. In: 2018 IEEE Canadian Conference on Electrical & Computer Engineering (CCECE), pp. 1–4. IEEE, Quebec, QC, Canada (2018)
Rothaug, N., Klix, S., Auth, N., Böcker, S., Puschner, E., Becker, S., Paar, C.: Towards Unsupervised SEM Image Segmentation for IC Layout Extraction. In: Proceedings of the 2023 Workshop on Attacks and Solutions in Hardware Security. ASHES’23. Association for Computing Machinery, New York, NY, USA (2023)
Minaee, S., Boykov, Y., Porikli, F., Plaza, A., Kehtarnavaz, N., Terzopoulos, D.: Image segmentation using deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 44(7), 3523–3542 (2022)
Bengio, Y., Courville, A.C., Vincent, P.: Unsupervised feature learning and deep learning: A review and new perspectives. CoRR arXiv:1206.5538 (2012)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning, pp. 499–523. MIT Press, Cambridge, MA, USA (2016). Chap. 14. http://www.deeplearningbook.org
Márquez-Neila, P., Baumela, L., Alvarez, L.: A Morphological Approach to Curvature-Based Evolution of Curves and Surfaces. IEEE Trans. Pattern Anal. Mach. Intell. 36(1), 2–17 (2014)
Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI, pp. 234–241. Springer, Munich, Germany (2015)
Odena, A., Dumoulin, V., Olah, C.: Deconvolution and Checkerboard Artifacts (2016). http://distill.pub/2016/deconv-checkerboard
Long, J., Shelhamer, E., Darrell, T.: Fully Convolutional Networks for Semantic Segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
Chen, L.-C., Papandreou, G., Schroff, F., Adam, H.: Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv preprint arXiv:1706.05587 (2017)
Cheng, D., Shi, Y., Lin, T., Gwee, B., Toh, K.: Delayered IC image analysis with template-based Tanimoto Convolution and Morphological Decision. IET Circuits Devices Syst. 16(2), 169–177 (2021)
Wilson, R., Lu, H., Zhu, M., Forte, D., Woodard, D.L.: Refics: A step towards linking vision with hardware assurance. In: 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 3461–3470. IEEE, Waikoloa, HI, USA (2022)
Cheng, D., Shi, Y., Lin, T., Gwee, B.-H., Toh, K.-A.: Hybrid \(K\)-Means Clustering and Support Vector Machine Method for via and Metal Line Detections in Delayered IC Images. IEEE Trans. Circuits Syst. II Express Briefs 65(12), 1849–1853 (2018)
Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, San Diego, CA, USA (2015)
Zhu, J.-Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2242–2251. IEEE, Venice, Italy (2017)
Liao, P.-S., Chen, T.-S., Chung, P.C.: A Fast Algorithm for Multilevel Thresholding. J. Inf. Sci. Eng. 17, 713–727 (2001)
Acknowledgements
This work was partly supported by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC 2092 CASA – 390781972, by the German Federal Ministry of Education and Research (BMBF) under the project FINANTIA – 13N15298, and by the Research Center Trustworthy Data Science and Security (https://rc-trust.ai), one of the Research Alliance centers within the UA Ruhr (https://uaruhr.de). A research visit by the lead author to the School of Electrical and Electronic Engineering at NTU facilitated this extension of the paper.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Authors and Affiliations
Contributions
N.R. conducted the experiments concerning our unsupervised approach for the extended journal paper. N.R. further extended the workshop manuscript text, which he originally co-authored with S.K. and S.Be.. S.K. also drafted Section 2.4 of the journal paper. For the extended version, D.C. performed the supervised ML model experiments, contributed Section 3 to the manuscript, and co-supervised the work on the extension. N.A. and S.Bö. provided the SEM images for our dataset and, along with E.P., drafted Section 4. E.P. and S.Bö. also created the ground-truth labels for the dataset. S.Be. supervised the work on both the workshop and extended publications and, together with C.P., provided feedback on the manuscript.
Corresponding author
Ethics declarations
Competing Interests
Christof Paar is a member of the steering committee of the Journal of Cryptographic Engineering.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Experimental results
In Table 3, we list the Electrically Significant Difference (ESD) errors, mean Intersection over Union (mIoU), and mean Pixel Accuracy (mPA) as performance metrics for each model we trained for our evaluation. We repeated the training five times with different random seeds and report errors on our test set of 13 559 M2 layer SEM image patches of 512\(\times \)512 pixels each. The ESD evaluation considers a total of 128 051 tracks. Please refer to Section 5.3 for a discussion of the results.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Rothaug, N., Cheng, D., Klix, S. et al. Advancing training stability in unsupervised SEM image segmentation for IC layout extraction. J Cryptogr Eng 15, 21 (2025). https://doi.org/10.1007/s13389-025-00385-5