{"title": "Zero-shot recognition with unreliable attributes", "book": "Advances in Neural Information Processing Systems", "page_first": 3464, "page_last": 3472, "abstract": "In principle, zero-shot learning makes it possible to train an object recognition model simply by specifying the category's attributes. For example, with classifiers for generic attributes like striped and four-legged, one can construct a classifier for the zebra category by enumerating which properties it possesses --- even without providing zebra training images. In practice, however, the standard zero-shot paradigm suffers because attribute predictions in novel images are hard to get right. We propose a novel random forest approach to train zero-shot models that explicitly accounts for the unreliability of attribute predictions. By leveraging statistics about each attribute\u2019s error tendencies, our method obtains more robust discriminative models for the unseen classes. We further devise extensions to handle the few-shot scenario and unreliable attribute descriptions. On three datasets, we demonstrate the benefit for visual category learning with zero or few training examples, a critical domain for rare categories or categories defined on the fly.", "full_text": "Zero-Shot Recognition with Unreliable Attributes\n\nDinesh Jayaraman\n\nUniversity of Texas at Austin\n\nAustin, TX 78701\n\ndineshj@cs.utexas.edu\n\nKristen Grauman\n\nUniversity of Texas at Austin\n\nAustin, TX 78701\n\ngrauman@cs.utexas.edu\n\nAbstract\n\nIn principle, zero-shot learning makes it possible to train a recognition model\nsimply by specifying the category\u2019s attributes. 
For example, with classi\ufb01ers for\ngeneric attributes like striped and four-legged, one can construct a classi\ufb01er for\nthe zebra category by enumerating which properties it possesses\u2014even without\nproviding zebra training images.\nIn practice, however, the standard zero-shot\nparadigm suffers because attribute predictions in novel images are hard to get\nright. We propose a novel random forest approach to train zero-shot models that\nexplicitly accounts for the unreliability of attribute predictions. By leveraging\nstatistics about each attribute\u2019s error tendencies, our method obtains more robust\ndiscriminative models for the unseen classes. We further devise extensions to han-\ndle the few-shot scenario and unreliable attribute descriptions. On three datasets,\nwe demonstrate the bene\ufb01t for visual category learning with zero or few training\nexamples, a critical domain for rare categories or categories de\ufb01ned on the \ufb02y.\n\n1\n\nIntroduction\n\nVisual recognition research has achieved major successes in recent years using large datasets and\ndiscriminative learning algorithms. The typical scenario assumes a multi-class task where one has\nample labeled training images for each class (object, scene, etc.) of interest. However, many real-\nworld settings do not meet these assumptions. Rather than \ufb01x the system to a closed set of thoroughly\ntrained object detectors, one would like to acquire models for new categories with minimal effort\nand training examples. Doing so is essential not only to cope with the \u201clong-tailed\u201d distribution of\nobjects in the world, but also to support applications where new categories emerge dynamically\u2014for\nexample, when a scientist de\ufb01nes a new phenomenon of interest to be detected in her visual data.\nZero-shot learning offers a compelling solution. In zero-shot learning, a novel class is trained via\ndescription\u2014not labeled training examples [10, 18, 8]. 
In general, this requires the learner to have access to some mid-level semantic representation, such that a human teacher can define a novel unseen class by specifying a configuration of those semantic properties. In visual recognition, the semantic properties are attributes shared among categories, like black, has ears, or rugged. Supposing the system can predict the presence of any such attribute in novel images, then adding a new category model amounts to defining its attribute "signature" [8, 3, 18, 24, 19]. For example, even without labeling any images of zebras, one could build a zebra classifier by instructing the system that zebras are striped, black and white, etc. Interestingly, computational models for attribute-based recognition are supported by the cognitive science literature, where researchers explore how humans conceive of objects as bundles of attributes [25, 17, 5].
So, in principle, if we could perfectly predict attribute presence1, zero-shot learning would offer an elegant solution to generating novel classifiers on the fly. The problem, however, is that we can't assume perfect attribute predictions. Visual attributes are in practice quite difficult to learn accurately—often even more so than object categories themselves. This is because many attributes are correlated with one another (given only images of furry brown bears, how do we learn furry and brown separately? [6]), and abstract linguistic properties can have very diverse visual instantiations (compare a bumpy road to a bumpy rash). Thus, attribute-based zero-shot recognition remains in the "proof of concept" realm, in practice falling short of alternate transfer methods [23].

1and have an attribute vocabulary rich enough to form distinct signatures for each category of interest

We propose an approach to train zero-shot models that explicitly accounts for the unreliability of attribute predictions.
Whereas existing methods take attribute predictions at face value, our method\nduring training acknowledges the known biases of the mid-level attribute models. Speci\ufb01cally,\nwe develop a random forest algorithm that, given attribute signatures for each category, exploits\nthe attribute classi\ufb01ers\u2019 receiver operating characteristics to select discriminative and predictable\ndecision nodes. We further generalize the idea to account for unreliable class-attribute associations.\nFinally, we extend the solution to the \u201cfew-shot\u201d setting, where a small number of category-labeled\nimages are also available for training.\nWe demonstrate the idea on three large datasets of object and scene categories, and show its clear\nadvantages over status quo models. Our results suggest the valuable role attributes can play for\nlow-cost visual category learning, in spite of the inherent dif\ufb01culty in learning them reliably.\n2 Related Work\n\nMost existing zero-shot models take a two-stage classi\ufb01cation approach: given a novel image, \ufb01rst its\nattributes are predicted, then its class label is predicted as a function of those attributes. For example,\nin [3, 18, 30], each unseen object class is described by a binary indicator vector (\u201csignature\u201d) over its\nattributes; a new image is mapped to the unseen class with the signature most similar to its attribute\npredictions. The probabilistic Direct Attribute Prediction (DAP) method [8] takes a similar form, but\nadds priors for the classes and attributes and computes a MAP prediction of the unseen class label.\nA topic model variant is explored in [31]. The DAP model has gained traction and is often used in\nother work [23, 19, 29]. In all of the above methods, as in ours, training an unseen class amounts to\nspecifying its attribute signature. In contrast to our approach, none of the existing methods account\nfor attribute unreliability when learning an unseen category. 
As we will see in the results, this has a\ndramatic impact on generalization.\nWe stress that attribute unreliability is distinct from attribute strength. The former (our focus) per-\ntains to how reliable the mid-level classi\ufb01er is, whereas the latter pertains to how strongly an image\nexhibits an attribute (e.g., as modeled by relative [19] or probabilistic [8] attributes). PAC bounds\non the tolerable error for mid-level classi\ufb01ers are given in [18], but that work does not propose a\nsolution to mitigate the in\ufb02uence of their uncertainty.\nWhile the above two-stage attribute-based formulation is most common, an alternative zero-shot\nstrategy is to exploit external knowledge about class relationships to adapt classi\ufb01ers to an unseen\nclass. For example, an unseen object\u2019s classi\ufb01er can be estimated by combining the nearest exist-\ning classi\ufb01ers (trained with images) in the ImageNet hierarchy [23, 14], or by combining classi\ufb01ers\nbased on label co-occurrences [13].\nIn a similar spirit, label embeddings [1] or feature embed-\ndings [4] can exploit semantic information for zero-shot predictions. Unlike these models, we focus\non de\ufb01ning new categories through language-based description (with attributes). This has the ad-\nvantage of giving a human supervisor direct control on the unseen class\u2019s de\ufb01nition, even if its\nattribute signature is unlike that observed in any existing trained model.\nAcknowledging that attribute classi\ufb01ers are often unreliable, recent work abandons purely semantic\nattributes in favor of discovering mid-level features that are both detectable and discriminative for\na set of class labels [11, 22, 26, 15, 30, 27, 1]. However, there is no guarantee that the discovered\nfeatures will align with semantic properties, particularly \u201cnameable\u201d ones. 
This typically makes them inapplicable to zero-shot learning, since a human supervisor can no longer define the unseen class with concise semantic terms. Nonetheless, one can attempt to assign semantics post-hoc (e.g., [30]). We demonstrate that our method can benefit zero-shot learning with such discovered (pseudo)-attributes as well.
Our idea for handling unreliable attributes in random forests is related to fractional tuples for handling missing values in decision trees [21]. In that approach, points with missing values are distributed down the tree in proportion to the observed values in all other data. Similar concepts are explored in [28] to handle features represented as discrete distributions and in [16] to propagate instances with soft node memberships. Our approach also entails propagating training instances in proportion to uncertainty. However, our zero-shot scenario is distinct, and, accordingly, the training and testing domains differ in important ways. At training time, rather than build a decision tree from labeled data points, we construct each tree using the unseen classes' attribute signatures. Then, at test time, the inputs are attribute classifier predictions. Furthermore, we show how to propagate both signatures and data points through the tree simultaneously, which makes it possible to account for inter-dependencies among the input dimensions and also enables a few-shot extension.
We model each unseen class with a single signature (e.g., whales are big and gray).\nHowever, it is straightforward to handle the case where a class has a multi-modal de\ufb01nition (e.g.,\nwhales are big and gray OR whales are big and black), by learning a zero-shot model per \u201cmode\u201d.\nWhether the attribute vocabulary is hand-designed [8, 3, 19, 29, 23] or discovered [30, 11, 22], our\napproach assumes it is expressive enough to discriminate between the categories.\nSuppose there are K unseen classes of interest, for which we have no training images. Our zero-shot\nmethod takes as input the K attribute signatures and a dataset of images labeled with attributes, and\nproduces a classi\ufb01er for each unseen class as output. At test time, the goal is to predict which unseen\nclass appears in a novel image.\nIn the following, we \ufb01rst describe the initial stage of building the attribute classi\ufb01ers (Sec. 3.1).\nThen we introduce a zero-shot random forest trained with attribute signatures (Sec. 3.2). Next we\nexplain how to augment that training procedure to account for attribute unreliability (Sec. 3.2.2) and\nsignature uncertainty (Sec. 3.2.3). Finally, we present an extension to few-shot learning (Sec. 3.3).\n\n3.1 Learning the attribute vocabulary\nAs in any attribute-based zero-shot method [3, 8, 18, 23, 19, 7, 29], we \ufb01rst must train classi\ufb01ers to\npredict the presence or absence of each of the M attributes in novel images. Importantly, the images\nused to train the attribute classi\ufb01ers may come from a variety of objects/scenes and need not contain\nany instances of the unseen categories. The fact that attributes are shared across category boundaries\nis precisely what allows zero-shot learning.\nWe train one SVM per attribute, using a training set of images xi (represented with standard de-\nscriptors) with binary M-dimensional label vectors yi, where yi(m) = 1 indicates that attribute m\nis present in xi. 
Let \u02c6am(x) denote the Platt probability score from the m-th such SVM applied to\ntest input x.\n\n3.2 Zero-shot random forests\n\nNext we introduce our key contribution: a random forest model for zero-shot learning.\n3.2.1 Basic formulation: Signature random forest\nFirst we de\ufb01ne a basic random forest training algorithm for the zero-shot setting. The main idea is\nto train an ensemble of decision trees using attribute signatures\u2014not image descriptors or vectors\nof attribute predictions. In the zero-shot setting, this is all the training information available. Later,\nat test time, we will have an image in hand, and we will apply the trained random forest to estimate\nits class posteriors.\nRecall that the k-th unseen class is de\ufb01ned by its attribute signature Ak \u2208 (cid:60)M . We treat each such\nsignature as the lone positive \u201cexemplar\u201d for its class, and discriminatively train random forests to\ndistinguish all the signatures, A1, . . . , AK. We take a one-versus-all approach, training one forest\nfor each unseen class. So, when training class k, the K \u2212 1 other class signatures are the negatives.\n\n2We use \u201cclass\u201d and \u201ccategory\u201d to refer to an object or scene, e.g., zebra or beach, and \u201cattribute\u201d to refer\n\nto a property, e.g., striped or sunny. \u201cUnseen\u201d means we have no training images for that class.\n\n3\n\n\fFor each class, we build an ensemble of decision trees in a breadth-\ufb01rst manner. Each tree is learned\nby recursively splitting the signatures into subsets at each node, starting at the root. Let In denote\nan indicator vector of length K that records which signatures appear at node n. For the root node,\nall K signatures are present, so we have In = [1, . . . , 1]. 
Following the typical random forest protocol [2], the training instances are recursively split according to a randomized test; it compares one dimension of the signature against a threshold t, then propagates each one to the left child l or right child r depending on the outcome, yielding indicator vectors Il and Ir. Specifically, if In(k) = 1, then if Ak(m) > t, we have Ir(k) = 1. Otherwise, Ir(k) = 0. Further, Il = In − Ir.
Thus, during training we must choose two things at each node: the query attribute m and the threshold t, represented jointly as the split (m, t). We sample a limited number of (m, t) combinations3 and choose the one that maximizes the expected information gain IGbasic:

IGbasic(m, t) = H(p_{In}) − ( P(Ai(m) ≤ t | In(i) = 1) H(p_{Il}) + P(Ai(m) > t | In(i) = 1) H(p_{Ir}) )    (1)
             = H(p_{In}) − ( (||Il||_1 / ||In||_1) H(p_{Il}) + (||Ir||_1 / ||In||_1) H(p_{Ir}) ),    (2)

where H(p) = −Σ_i p(i) log_2 p(i) is the entropy of a distribution p. The 1-norm on an indicator vector I sums up the occurrences I(k) of each signature, which for now are binary, I(k) ∈ {0, 1}. Since we are training a zero-shot forest to discriminate class k from the rest, the distribution over class labels at node n is a length-2 vector:

p_{In} = [ In(k) / ||In||_1 ,  Σ_{i≠k} In(i) / ||In||_1 ].    (3)

We grow each tree in the forest to a fixed, maximum depth, terminating a branch prematurely if less than 5% of training samples have reached a node on it. We learn J = 100 trees per forest.
Given a novel image xtest, we compute its predicted attribute signature â(xtest) = [â1(xtest), . . . , âM(xtest)] by applying the attribute SVMs. Then, to predict the posterior for class k, we use â(xtest) to traverse to a leaf node in each tree of k's forest. Let P_k^j(ℓ) denote the fraction of positive training instances at a leaf node ℓ in tree j of the forest for class k. Then P(k | â(xtest)) = (1/J) Σ_j P_k^j(ℓ), the average of the posteriors across the ensemble.
If we somehow had perfect attribute classifiers, this basic zero-shot random forest (in fact, one such tree alone) would be sufficient. Next, we show how to adapt the training procedure defined so far to account for their unreliability.
3.2.2 Accounting for attribute prediction unreliability
While our training "exemplars" are the true attribute signatures for each unseen class, the test images will have only approximate estimates of the attributes they contain. We therefore augment the zero-shot random forest to account for this unreliability during training. The main idea is to generalize the recursive splitting procedure above such that a given signature can pursue multiple paths down the tree. Critically, those paths will be determined by the false positive/true positive rates of the individual attribute predictors. In this way, we expand each idealized training signature into a distribution in the predicted attribute space. Essentially, this preemptively builds in the appropriate "cushion" of expected errors when choosing discriminative splits.
Implementing this idea requires two primary extensions to the formulation in Sec. 3.2.1: (i) we inject attribute validation data and its associated attribute classification error statistics into the tree formation process, and (ii) we redefine the information gain to account for the partial propagation of training signatures. We explain each of these components in turn next.
First, in addition to signatures, at each node we maintain a set of validation data in order to gauge the error tendencies of each attribute classifier.
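As a rough illustration of this step, the node-level error statistics used below (TP(n, m, t), FP(n, m, t), and their complements) could be estimated from attribute-labeled validation images along the following lines. This is a hedged Python sketch, not the authors' implementation; the function and array names (`error_stats`, `scores`, `labels`) are invented for illustration.

```python
import numpy as np

def error_stats(scores, labels, m, t):
    """Estimate TP/FP/TN/FN rates for attribute m at threshold t.

    scores: (N, M) array of Platt probability outputs of the attribute SVMs
            on the validation images that reached this node, i.e. DV(n).
    labels: (N, M) binary ground-truth attribute labels for those images.
    Returns (TP, FP, TN, FN) rates, where e.g. FP = P(a_m(x) > t | y(m) = 0).
    """
    pred_pos = scores[:, m] > t            # classifier fires at threshold t
    truth = labels[:, m].astype(bool)      # ground-truth presence of attribute m
    # Guard against nodes with no positives / no negatives in DV(n).
    tp = (pred_pos & truth).sum() / max(truth.sum(), 1)
    fp = (pred_pos & ~truth).sum() / max((~truth).sum(), 1)
    return tp, fp, 1.0 - fp, 1.0 - tp      # TP, FP, TN, FN rates
```

For instance, if the validation scores happen to be perfectly separable at threshold t, the estimate returns TP = 1 and FP = 0, recovering the idealized behavior of Sec. 3.2.1.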
For the experiments in this paper (Sec 4), our method reserves some attribute classifier training data for this purpose. Denote this set of attribute-labeled images as DV. During random forest training, this data is recursively propagated down the tree following each split once it is chosen. Let DV(n) ⊆ DV denote the set of validation data inherited at node n. At the root, DV(n) = DV.

3With binary Ai(m), all 0 < t < 1 are equivalent in Sec 3.2.1. Selecting t becomes important in Sec 3.2.2.

With validation data thus injected, we can estimate the test-time receiver operating characteristic (ROC)4 for an attribute classifier at any node in the tree. For example, the estimated false positive rate at node n for attribute m at threshold t is FP(n, m, t) = Pn(âm(x) > t | y(m) = 0), which is the fraction of examples in DV(n) for which the attribute m is absent, but the SVM predicts it to be present at threshold t. Here, y(m) denotes the m-th attribute's label for image x.
For any node n, let I′_n be a real-valued indicator vector, such that I′_n(k) ∈ [0, 1] records the fractional occurrence of the training signature for class k at node n. At the root node, I′_n(k) = 1, ∀k. For a split (m, t) at node n, a signature Ak splits into the right and left child nodes according to its ROC for attribute m at the operating point specified by t. In particular, we have:

I′_r(k) = I′_n(k) Pn(âm(x) > t | y(m) = Ak(m)),  and  I′_l(k) = I′_n(k) Pn(âm(x) ≤ t | y(m) = Ak(m)),    (4)

where x ∈ DV(n). When Ak(m) = 1, the probability terms are TP(n, m, t) and FN(n, m, t) respectively; when Ak(m) = 0, they are FP(n, m, t) and TN(n, m, t). In this way, we channel all predicted negatives to the left child node.
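To make the fractional propagation concrete, here is a minimal, self-contained Python sketch of the soft split of Eqn. (4) and the resulting gain over fractional counts of Eqn. (5). It is an illustrative simplification under our own naming assumptions (`split_soft`, `ig_zero`, a single TP/FP pair shared by all classes at the node), not the paper's code.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_i p(i) log2 p(i), ignoring zero entries."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def class_dist(I, k):
    """Length-2 distribution [class k, rest] from soft indicator I (Eqn. 3)."""
    s = I.sum()
    return np.array([I[k] / s, (s - I[k]) / s])

def split_soft(I_n, A_m, tp, fp):
    """Eqn. (4): the fraction of each signature sent right is the TP rate
    if its attribute bit is 1, and the FP rate if it is 0."""
    p_right = np.where(A_m == 1, tp, fp)   # P(a_m(x) > t | y(m) = A_k(m))
    I_r = I_n * p_right
    I_l = I_n - I_r
    return I_l, I_r

def ig_zero(I_n, A_m, tp, fp, k):
    """Eqn. (5): expected information gain under fractional propagation."""
    I_l, I_r = split_soft(I_n, A_m, tp, fp)
    n, l, r = I_n.sum(), I_l.sum(), I_r.sum()
    return (entropy(class_dist(I_n, k))
            - (l / n) * entropy(class_dist(I_l, k))
            - (r / n) * entropy(class_dist(I_r, k)))
```

Note that setting tp equal to fp (a classifier no better than chance) sends identical fractions of every signature to each child, so the gain evaluates to zero, consistent with the discussion of unlearnable attributes in Sec. 3.2.2.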
In contrast, a naive random forest (RF) trained on signa-\ntures assumes ideal attribute classi\ufb01ers and channels all ground truth negatives\u2014i.e., true negatives\nand false positives\u2014through the left node.\nTo illustrate the meaning of this fractional propagation, consider a class \u201celephant\u201d known to have\nthe attribute \u201cgray\u201d. If the \u201cgray\u201d attribute classi\ufb01er \ufb01res only on 60% of the \u201cgray\u201d samples in the\nvalidation set, i.e., TP=0.6, then only 0.6 fraction of the \u201celephant\u201d signature is passed on to the\npositive (i.e., right) node. This process repeats through more levels until fractions of the single \u201cele-\nphant\u201d signature have reached all leaf nodes. Thus, a single class signature emulates the estimated\nstatistics of a full training set of class-labeled instances with attribute predictions.\nWe stress two things about the validation data propagation. First, the data in DV is labeled by\nattributes only; it has no unseen class labels and never features in the information gain computation.\nIts only role is to estimate the ROC values. Second, the recursive sub-selection of the validation data\nis important to capture the dependency of TP/FP rates at higher level splits. For example, if we were\nto select split (m, t) at the root, then the fractional signatures pushed to the left child must all have\nA(m) < t, meaning that for a candidate split (m, s) at the left child, where s > t, the correct TP and\nFP rates are both 0. This is accounted for when we use DV (n) to compute the ROC, but would not\nhave been, had we just used DV . Thus, our formulation properly accounts for dependencies between\nattributes when selecting discriminative thresholds, an issue not addressed by existing methods for\nmissing [21] or probabilistically distributed features [28].\nNext, we rede\ufb01ne the information gain. 
When building a zero-shot tree conscious of attribute unreliability, we choose the split maximizing the expected information gain according to the fractionally propagated signatures (compare to Eqn. (2)):

IGzero(m, t) = H(p_{I′_n}) − ( (||I′_l||_1 / ||I′_n||_1) H(p_{I′_l}) + (||I′_r||_1 / ||I′_n||_1) H(p_{I′_r}) ).    (5)

The distribution p_{I′_z}, z ∈ {l, r}, is computed as in Eqn. (3). For full pseudocode and a schematic illustration of our method, please see supp.
The discriminative splits under this criterion will be those that not only distinguish the unseen classes but also persevere (at test time) as a strong signal in spite of the attribute classifiers' error tendencies. This means the trees will prefer both reliable attributes that are discriminative among the classes, as well as less reliable attributes coupled with intelligently selected operating points that remain distinctive. Furthermore, they will omit splits that, though highly discriminative in terms of idealized signatures, were found to be "unlearnable" among the validation data. For example, in the extreme case, if an attribute classifier cannot distinguish positives and negatives, meaning that TPR=FPR, then the signatures of all classes are equally likely to propagate to the left or right, i.e., I′_l(k)/I′_n(k) = I′_l(j)/I′_n(j) and I′_r(k)/I′_n(k) = I′_r(j)/I′_n(j) for all k, j, which yields an information gain of 0 in Eqn. (5) (see supp). Thus, our method, while explicitly making the best of imperfect attribute classification, inherently prefers more learnable attributes.

4The ROC captures the true positive (TP) vs.
false positive (FP) rates (equivalently the true negative (TN) and false negative (FN) rates) as a function of a decision value threshold.

The proposed approach produces unseen category classifiers with zero category-labeled images. The attribute-labeled validation data is important to our solution's robustness. If that data perfectly represented the true attribute errors on images from the unseen classes (which we cannot access, of course, because images from those classes appear only at test time), then our training procedure would be equivalent to building a random forest on the test samples' attribute classifier outputs.
3.2.3 Accounting for class signature uncertainty
Beyond attribute classifier unreliability, our framework can also deal with another source of zero-shot uncertainty: instances of a class often deviate from class-level attribute signatures. To tackle this, we redefine the soft indicators I′_r and I′_l in Eqn. (4), appending a term to account for annotation noise. Please see supp. for details.
3.3 Extending to few-shot random forests
Our approach also admits a natural extension to few-shot training. Extensions of zero-shot models to the few-shot setting have been attempted before [31, 26, 14, 1]. In this case, we are given not only attribute signatures, but also a dataset DT consisting of a small number of images with their class labels. We essentially use the signatures A1, . . . , AK as a prior for selecting good tree splits that also satisfy the traditional training examples. The information gain on the signatures is as defined in Sec. 3.2.2, while the information gain on the training images, for which we can compute classifier outputs, uses the standard measure defined in Sec. 3.2.1. Using some notation shortcuts, for few-shot training we recursively select the split that maximizes the combined information gain:

IGfew(m, t) = λ IGzero(m, t){A1, . .
. , AK} + (1 \u2212 \u03bb) IGbasic(m, t){DT},\n\n(6)\n\nwhere \u03bb controls the role of the signature-based prior. Intuitively, we can expect lower values of \u03bb to\nsuf\ufb01ce as the size of DT increases, since with more training examples we can more precisely learn\nthe class\u2019s appearance. This few-shot extension can be interpreted as a new way to learn random\nforests with descriptive priors.\n\n4 Experiments\n\nDatasets and setup We use three datasets: (1) Animals with Attributes (AwA) [8] (M = 85\nattributes, K = 10 unseen classes, 30,475 total images), (2) aPascal/aYahoo objects (aPY) [3] (M =\n65, K = 12, 15,339 images) (3) SUN scene attributes (SUN) [20] (M = 102, K = 10, 14,340\nimages). These datasets capture a wide array of categories (animals, indoor and outdoor scenes,\nhousehold objects, etc.) and attributes (parts, affordances, habitats, shapes, materials, etc.). The\nattribute-labeled images originate from 40, 20, and 707 \u201cseen\u201d classes in each dataset, respectively;\nwe use the class labels solely to map to attribute annotations. We use the unseen class splits speci\ufb01ed\nin [9] for AwA and aPY, and randomly select the 10 unseen classes for SUN (see supp.). For all three,\nwe use the features provided with the datasets, which include color histograms, SIFT, PHOG, and\nothers (see [9, 3, 20] for details).\nFollowing [8], we train attribute SVMs with combined \u03c72-kernels, one kernel per feature channel,\nand set C = 10. Our method reserves 20% of the attribute-labeled images as ROC validation data,\nthen pools it with the remaining 80% to train the \ufb01nal attribute classi\ufb01ers. We stress that our method\nand all baselines have access to exactly the same amount of attribute-labeled data.\nWe report results as mean and standard error measured over 20 random trials. Based on cross-\nvalidation, we use tree depths of (AwA-9, aPY-6, SUN-8), and generate (#m, #t) tests per node\n(AwA-(10,7), aPY-(8,2), SUN-(4,5)). 
When too few validation points (< 10 positives or negatives) reach a node n, we revert to computing statistics over the full validation set DV rather than DV(n).

Baselines    In addition to several state-of-the-art published results and ablated variants of our method, we also compare to two baselines: (1) SIGNATURE-RF: random forests trained on class-attribute signatures as described in Sec. 3.2.1, without an attribute uncertainty model, and (2) DAP: Direct Attribute Prediction [8, 9], which is a leading attribute-based zero-shot object recognition method widely used in the literature [8, 3, 18, 30, 23, 19, 29].5

5We use the authors' code: http://attributes.kyb.tuebingen.mpg.de/

Figure 1: Zero-shot accuracy on AwA as a function of attribute uncertainty, in controlled noise scenarios.

Method/Dataset                          AwA             aPY             SUN
DAP                                     40.50           18.12           52.50
SIGNATURE-RF                            36.65 ± 0.16    12.70 ± 0.38    13.20 ± 0.34
OURS W/O ROC PROP, SIG UNCERTAINTY      39.97 ± 0.09    24.25 ± 0.18    47.46 ± 0.29
OURS W/O SIG UNCERTAINTY                41.88 ± 0.08    24.79 ± 0.11    56.18 ± 0.27
OURS                                    43.01 ± 0.07    26.02 ± 0.05    56.18 ± 0.27
OURS+TRUE ROC                           54.22 ± 0.03    33.54 ± 0.07    66.65 ± 0.31

Table 1: Zero-shot learning accuracy on all three datasets. Accuracy is percentage of correct category predictions on unseen class images, ± standard error.

4.1 Zero-shot object and scene recognition

Controlled noise experiments    Our approach is designed to overcome the unreliability of attribute classifiers. To glean insight into how it works, we first test it with controlled noise in the test images' attribute predictions. We start with hypothetical perfect attribute classifier scores âm(x) = Ak(m) for x in class k, then progressively add noise to represent increasing errors in the predictions.
We\nexamine two scenarios: (1) where all attribute classi\ufb01ers are equally noisy, and (2) where the average\nnoise level varies per attribute. See supp. for details on the noise model.\nFigure 1 shows the results using AwA. By de\ufb01nition, all methods are perfectly accurate with zero\nnoise. Once the attributes are unreliable (i.e., noise > 0), however, our approach is consistently\nbetter. Furthermore, our gains are notably larger in the second scenario where noise levels vary\nper attribute (right plot), illustrating how our approach properly favors more learnable attributes\nas discussed in Sec. 3.2.2. In contrast, SIGNATURE-RF is liable to break down with even minor\nimperfections in attribute prediction. These results af\ufb01rm that our method bene\ufb01ts from both (1)\nestimating and accounting for classi\ufb01er noisiness and (2) avoiding uninformative attribute classi\ufb01ers.\n\nReal unreliable attributes experiments Next we present the key zero-shot results for our method\napplied to three challenging datasets using over 250 real attribute classi\ufb01ers. Table 1 shows the\nresults. Our method signi\ufb01cantly outperforms the existing DAP method [9]. This is an important\nresult: DAP is today the most commonly used model for zero-shot object recognition, whether using\nthis exact DAP formulation [8, 23, 19, 29] or very similar non-probabilistic variants [3, 30]. Note that\nour approach beats DAP despite the fact we use only 80% of the attribute-labelled images to train\nattribute classi\ufb01ers. This indicates that modeling how good/bad the attribute classi\ufb01ers are is even\nmore important than having better attribute classi\ufb01ers. 
Furthermore, this demonstrates that modeling only the confidence of an attribute's presence in a test image (which DAP does) is inadequate; our idea of characterizing the attribute classifiers' error tendencies during training is valuable.

Our substantial improvements over SIGNATURE-RF also confirm it is imperative to model attribute classifier unreliability. Our gains over DAP are especially large on SUN and aPY, which have fewer positive training samples per attribute, leading to less reliable attribute classifiers; this is exactly where our method is needed most. On AwA too, we outperform DAP on 7 out of 10 categories, with the largest gains on "giant panda" (10.2%), "whale seal" (9.4%), and "persian cat" (7.4%), classes that are very different from the training classes. Further, if we repeat the experiment on AwA with only 500 randomly chosen images for attribute training, our overall accuracy gain over DAP widens to 8 points (28.0 ± 0.9 vs. 20.42).

[Figure 1 plots: accuracy (%) vs. noise level η, for uniform noise levels (left) and attribute-specific noise levels (right); curves for OURS, SIGNATURE-RF, and DAP.]

Method                        Accuracy
Lampert et al. [8]            40.5
Yu and Aloimonos [31]         40.0
Rohrbach et al. [24]          35.7
Kankuekul et al. [7]          32.7
Yu et al. [30]                48.3
OURS (named attributes)       43.0 ± 0.07
OURS (discovered attributes)  48.7 ± 0.09

Figure 2: (a) Few-shot results; stars denote the selected λ. (b) Zero-shot results on AwA compared to the state of the art.

Table 1 also helps isolate the impact of two components of our method: the model of signature uncertainty (see OURS W/O SIG UNCERTAINTY) and the recursive propagation of validation data (see OURS W/O ROC PROP, SIG UNCERTAINTY).
For the latter, we further compute TPR/FPRs globally on the full validation dataset DV rather than for node-specific subsets DV(n). We see that both aspects contribute to our full method's best performance (see OURS). Finally, OURS+TRUE ROC provides an "upper bound" on the accuracy achievable with our method on these datasets; this is the result attainable were we to use the unseen class images themselves as validation data DV. This also points to an interesting direction for future work: to better model expected error rates on images with unseen attribute combinations. Our initial attempts in this regard included focusing the validation data on seen-class images whose signatures are most like those of the unseen classes, but the impact was negligible.

Figure 2b compares our method against all published results on AwA, using both named and discovered attributes. When using the standard AwA named attributes, our method comfortably outperforms all prior methods. Further, when we use the discovered attributes from [30], it performs comparably to their attribute decoding method, achieving the state of the art on AwA. This result was obtained using a generalization of our method to handle the continuous attribute strength signatures of [30].

4.2 Few-shot object and scene recognition

Finally, we demonstrate our few-shot extension. Figure 2a shows the results as a function of both the number of labeled training images and the prior-weighting parameter λ (cf. Sec. 3.3).6 When λ = 0, we rely solely on the training images DT; when λ = 1, we rely solely on the attribute signatures, i.e., zero-shot learning. As a baseline, we compare to a method that uses only the few training images to learn the unseen classes (dotted lines). We see the clear advantage of our attribute signature prior for few-shot random forest training. Furthermore, we see that, as expected, the optimal λ shifts towards 0 as more samples are added.
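The λ trade-off can be illustrated schematically. The sketch below assumes a simple convex combination of the signature-derived class distribution and the empirical distribution from DT; the paper's actual formulation (Sec. 3.3) applies the weighting inside random forest training, and the function name here is hypothetical:

```python
import numpy as np

def blend_class_distribution(prior_dist, train_counts, lam):
    """Blend a zero-shot signature prior with few-shot training data.

    prior_dist   : (Z,) class distribution implied by attribute signatures.
    train_counts : (Z,) class counts from the few labeled images D_T.
    lam          : weight on the prior; lam = 1 is pure zero-shot,
                   lam = 0 relies only on the training images.

    A schematic stand-in for the node-level weighting of Sec. 3.3.
    """
    prior = np.asarray(prior_dist, float)
    prior = prior / prior.sum()
    counts = np.asarray(train_counts, float)
    if counts.sum() > 0:
        empirical = counts / counts.sum()
    else:
        empirical = np.full_like(prior, 1.0 / len(prior))  # no data yet: uniform
    return lam * prior + (1 - lam) * empirical
```

The empirical term grows more trustworthy as labeled samples accumulate, consistent with the shift in the optimal λ observed above.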
Still, even with 200 training images in DT, the prior plays a role (e.g., the best λ is 0.3 on the blue curve). The star on each curve indicates the λ value our method selects automatically with cross-validation.

5 Conclusion

We introduced a zero-shot training approach that models unreliable attributes, both due to classifier predictions and due to uncertainty in their association with unseen classes. Our results on three challenging datasets indicate the method's promise, and suggest that the elegance of zero-shot learning need not be abandoned even though visual attributes remain very difficult to predict reliably. Further, our idea is applicable to other uses of semantic mid-level concepts for higher-level tasks, e.g., poselets for action recognition [12] and discriminative mid-level patches for location recognition [27], and to domains outside computer vision. In future work, we plan to develop extensions that accommodate inter-attribute correlations in the random forest tests, and multi-label random forests to improve scalability to many unseen classes.

Acknowledgements: We thank Christoph Lampert and Felix Yu for helpful discussions and for sharing their code. This research is supported in part by NSF IIS-1065390 and ONR ATL.

6 These are for AwA; see supp. for similar results on the other two datasets.

[Figure 2a plot: accuracy (%) vs. λ, for 50-, 100-, and 200-shot settings, each with our prior and with the baseline.]

References
[1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for attribute-based classification. In CVPR, 2013.
[2] L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
[3] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009.
[4] A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov.
DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.
[5] P. Gärdenfors. Conceptual Spaces: The Geometry of Thought, volume 106. 2000.
[6] D. Jayaraman, F. Sha, and K. Grauman. Decorrelating semantic visual attributes by resisting the urge to share. In CVPR, 2014.
[7] P. Kankuekul, A. Kawewong, S. Tangruamsub, and O. Hasegawa. Online incremental attribute-based zero-shot learning. In CVPR, 2012.
[8] C. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
[9] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. TPAMI, 2014.
[10] H. Larochelle, D. Erhan, and Y. Bengio. Zero-data learning of new tasks. In AAAI, 2008.
[11] D. Mahajan, S. Sellamanickam, and V. Nair. A joint learning framework for attribute models and object descriptions. In ICCV, 2011.
[12] S. Maji, L. Bourdev, and J. Malik. Action recognition from a distributed representation of pose and appearance. In CVPR, 2011.
[13] T. Mensink, E. Gavves, and C. Snoek. COSTA: Co-occurrence statistics for zero-shot classification. In CVPR, 2014.
[14] T. Mensink and J. Verbeek. Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV, 2012.
[15] R. Mittelman, H. Lee, B. Kuipers, and S. Savarese. Weakly supervised learning of mid-level features with beta-Bernoulli process restricted Boltzmann machines. In CVPR, 2013.
[16] C. Olaru and L. Wehenkel. A complete fuzzy decision tree technique. Fuzzy Sets and Systems, 138(2):221-254, Sept 2003.
[17] D. Osherson, E. Smith, T. Myers, E. Shafir, and M. Stob. Extrapolating human probability judgment. Theory and Decision, 36:103-129, 1994.
[18] M. Palatucci, D. Pomerleau, G. Hinton, and T. Mitchell.
Zero-shot learning with semantic output codes. In NIPS, 2009.
[19] D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011.
[20] G. Patterson and J. Hays. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In CVPR, 2012.
[21] J. Quinlan. Induction of decision trees. Machine Learning, pages 81-106, 1986.
[22] M. Rastegari, A. Farhadi, and D. Forsyth. Attribute discovery via predictable discriminative binary codes. In ECCV, 2012.
[23] M. Rohrbach, M. Stark, and B. Schiele. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In CVPR, 2011.
[24] M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, and B. Schiele. What helps where and why? Semantic relatedness for knowledge transfer. In CVPR, 2010.
[25] E. Rosch and B. Lloyd. Cognition and Categorization. 1978.
[26] V. Sharmanska, N. Quadrianto, and C. Lampert. Augmented attribute representations. In ECCV, 2012.
[27] S. Singh, A. Gupta, and A. Efros. Unsupervised discovery of mid-level discriminative patches. In ECCV, 2012.
[28] S. Tsang, B. Kao, K. Yip, W.-S. Ho, and S. Lee. Decision trees for uncertain data. IEEE Transactions on Knowledge and Data Engineering, 23(1):64-78, January 2011.
[29] N. Turakhia and D. Parikh. Attribute dominance: What pops out? In ICCV, 2013.
[30] F. Yu, L. Cao, R. Feris, J. Smith, and S.-F. Chang. Designing category-level attributes for discriminative visual recognition. In CVPR, 2013.
[31] X. Yu and Y. Aloimonos. Attribute-based transfer learning for object categorization with zero/one training example. In ECCV, 2010.