

A Sequence-Oblivious Generation Method for Context-Aware Hashtag Recommendation

Junmo Kang Jeonghwan Kim Suwon Shin Sung-Hyon Myaeng


School of Computing, KAIST
Daejeon, Republic of Korea
{junmo.kang, jeonghwankim123, ssw0093, myaeng}@kaist.ac.kr

arXiv:2012.02957v1 [cs.CL] 5 Dec 2020

Abstract

Like search, a recommendation task accepts an input query or cue and provides desirable items, often based on a ranking function. Such a ranking approach rarely considers explicit dependency among the recommended items. In this work, we propose a generative approach to tag recommendation, where semantic tags are selected one at a time, conditioned on the previously generated tags, to model the inter-dependency among the generated tags. We apply this tag recommendation approach to an Instagram data set where an array of context feature types (image, location, time, and text) is available for posts. To exploit the inter-dependency among the distinct types of features, we adopt a simple yet effective architecture using self-attention, making deep interactions possible. Empirical results show that our method is significantly superior not only to the usual ranking schemes but also to autoregressive models for tag recommendation. They indicate that it is critical to fuse mutually supporting features at an early stage to induce an extensive and comprehensive view of inter-context interaction when generating tags in a recurrent feedback loop.

[Figure 1: Ranking vs. Generation model. A ranking model scores every candidate tag #Tag1 ... #TagN against the context in a single pass (e.g. 0.43, 0.22, 0.15, ..., 0.03), whereas the generation model produces one tag at a time, feeding the context plus all previously generated tags (Context; Context #Tag1; ...; Context #Tag1 #Tag2 ... #TagN-1) back into the model at each step.]

1 Introduction

From traditional term-based methods to deep neural network based models, recommendation functions are widely adopted in the Internet domain. While categorical features are often used for collaborative filtering along with a user cue, recommending text or images, i.e. content-based recommendation, requires handling unstructured data based on the relevance of items toward a user query. Like search, most of these recommendation systems rank and return the items with the top-k predicted logit values (Covington et al., 2016; Weston et al., 2014; Wu et al., 2018b). We posit that tag recommendation touches on the middle ground: while the tags can be seen as predefined categories, they can also be generated as a sequence of tags, like a natural language sentence.

Social networking service (SNS) platforms like Instagram and Twitter are example cases in which tag recommendation plays a significant role in aggregating and distributing information. Users tend to include hashtags when they post daily life stories or advertisements, expecting them to serve as keywords for the semantics and pragmatics of unstructured content like images and text. Recommending appropriate hashtags to the users would increase the global coherency of the tags used in the user population and subsequently facilitate grouping posts of similar topics for easier navigation and search.

A number of hashtag recommendation methods have been proposed to date (Weston et al., 2014; Gong and Zhang, 2016; Wu et al., 2018a; Wang et al., 2019; Zhang et al., 2019; Yang et al., 2020b; Kaviani and Rahmani, 2020; Yang et al., 2020a). These methods focus on modeling a latent topic distribution over hashtags (Godin et al., 2013), rely heavily on late fusion to model the interaction between image and text input (Zhang et al., 2017; Yang et al., 2020b), or project the words within the given SNS post and the hashtag embeddings (i.e. tag embeddings) into a common high-dimensional space, updating the embeddings with a pairwise ranking loss (Weston et al., 2014).
However, prior studies taking the ranking approach to tag recommendation neglect the inter-dependency among the generated hashtags for the given context (i.e., post).

We propose a recurrent hashtag feedback approach to tag recommendation, which enables the recommendation model to repeatedly consider previously generated tags when generating the next "relevant" tag. Our recurrent model using BERT (Devlin et al., 2019) generates hashtags conditioned on the assorted context information and the previously generated hashtags, as in Figure 1. Note that this recurrent BERT model is devised for the unique nature of syntax-free tag generation, instead of the usual RNN or BERT approach for language generation. On a different note, this approach can be seen as analogous to pseudo-relevance feedback (PRF) (Xu and Croft, 1996), where previously retrieved items are deemed relevant and additional query terms are extracted as relevance feedback. For tag recommendation, we likewise assume that previously generated tags can be trusted to be relevant and incorporate them in generating the next one. In effect, they become part of the new "query" for generating the next tag.

Our work also proposes an early fusion of the multimodal context features of Instagram posts (image, location, time, and text). Prior works (Denton et al., 2015; Wang et al., 2019; Gong et al., 2018; Li et al., 2016) on combining multi-modal features in hashtag recommendation tend to merge the representations with either a co-attention (Lu et al., 2016) or a bi-attention (Seo et al., 2017) mechanism after independently encoding the features of differing modalities. Based on our intuition that context input features should affect the representation modeling process as early as possible, we exploit the self-attention based pre-trained BERT to fuse the different features and encode their relationships at an early stage of building representations. This approach has the added benefit of allowing an investigation of how the features influence each other for tag generation, beyond a usual ablation study, which can only reveal the role of each feature type in the overall performance.

Our experimental work shows that the proposed method outperforms the ranking approaches by a significant margin. To further differentiate and evaluate our model against a generative approach that has been used in other studies (Wang et al., 2019; Yang et al., 2020b), we build a BERT-based autoregressive (AR) model with a Transformer (Vaswani et al., 2017) decoder. The experimental results show that our model also outperforms the AR model by a large margin.

Our key contributions are summarized as follows:

• A generation framework, recurrent hashtag feedback, for tag recommendation, considering inter-tag dependency

• An early fusion approach enabling deep interactions among context features and tags, and

• Experiments showing the superiority of the proposed approaches and shedding light on the way the context features interact with each other for tag generation.

2 Related Work

The hashtag recommendation problem has been studied as a ranking problem (Park et al., 2016; Zangerle et al., 2011; Denton et al., 2015; Sedhai and Sun, 2014; Li et al., 2016; Weston et al., 2014; Wu et al., 2018b; Gong and Zhang, 2016). A representative approach is to run a visual feature extractor on the input image and employ a multi-label classifier to calculate the score of each hashtag, providing top-k hashtag recommendations (Park et al., 2016). Others handle multiple input feature types (i.e. image, text) by mapping them into a common representation space and applying a pairwise ranking loss, such as the weighted approximate-rank pairwise (WARP) loss (Weston et al., 2011), as the training objective (Denton et al., 2015; Weston et al., 2014; Wu et al., 2018b). Many of the prior studies on hashtag recommendation (Godin et al., 2013; Ding et al., 2012; Li et al., 2019; Zhao et al., 2016) take topic modeling approaches with Latent Dirichlet Allocation (LDA), which is often used to discover general topics in a large collection of documents. Unlike our approach, however, such ranking approaches do not explicitly consider the inter-dependency among the generated hashtags.

The use of multi-modal features is also evident (Denton et al., 2015; Wang et al., 2019; Gong et al., 2018; Li et al., 2016; Zhang et al., 2017; Yang et al., 2020a,b). The types of multi-modal features and
[Figure 2 (diagram): at each step s, the input I_s = [C, ht̂_<s, [MASK]], where C = [img, loc, time, txt], is fed to BERT; the representation at the [MASK] position yields ht̂_s, which is appended to the input for step s+1.]

Figure 2: Overall architecture of the proposed approach (sequential tag generation).

the ways they are fused are quite distinct. For instance, Denton et al. (2015) incorporate user metadata (e.g. age, gender) with 3-way multiplicative gating along with the image for hashtag recommendation. Another example (Wang et al., 2019) uses both the text description of a given tweet and the thread conversation by employing bi-attention (Seo et al., 2017). A more recent approach makes use of audio features (Yang et al., 2020a) for short-video information. A common drawback of these previous models is that they capture only a very limited amount of the relationships among the input features of different modalities. Our approach of using a self-attention mechanism by employing a pre-trained BERT ensures that the inter-dependency among the given image, location, time, and text description is learned at an early stage of the layer hierarchy to model the contextual information dispersed throughout the input features.

3 Approach

3.1 Recommendation as a Generation Task

In contrast to the ranking approaches, our model generates a sequence of interrelated hashtags given an assortment of context features. As in Figure 2, our model takes as input the context C consisting of the image, location, time and text features, together with the [MASK] token appended at the tail end of the sequence (see Section 3.3 for details on the encoding of each context feature). Between the context features and the [MASK] token comes the sequence of hashtags generated by the model so far (initially null), which incorporates the tag dependency into the subsequent input sequences. As each hashtag ht is generated, it is added to the input sequence, as in Figure 2. Unlike autoregressive approaches, in which only a hidden state vector is passed on to the next state (Sutskever et al., 2014) or the representation of the immediately preceding token is pooled (Vaswani et al., 2017), our model fuses the mutually supporting context features "directly" with the self-attention mechanism, along with the incrementally appended hashtags. This early fusion approach is conducive to modeling our representation space because the output space is jointly modeled with the combined context-tag representation at every generation step. On the contrary, the commonly used late fusion of the latent representations of multi-modal input vectors is limited to aggregating the projected information. The expected benefit of early fusion is an extensive and comprehensive view of the contextual information when generating a hashtag.

Inspired by the generative application of BERT in (Chan and Fan, 2019), our model uses a pre-trained BERT model that generates tags one after another and recurrently feeds them back as input. Our BERT-based model is trained on a hashtag prediction task: given the context features and co-occurring tags as input, it predicts a relevant hashtag while taking into account what hashtags have been generated. More formally, the sth hashtag is generated as follows:

ht̂_s = argmax_{ht_s ∈ HT} P(ht_s | img, loc, time, txt, ht̂_<s)    (1)

where HT is the set of all hashtags, ht̂_<s refers to the hashtags generated so far for a given post instance, and img, loc, time, and txt denote the image, location, time, and text features, respectively.

3.2 Generation Model

As in (Chan and Fan, 2019), where BERT is used as a generative model in a question generation task, we employ the recurrent BERT model for the tag generation task. BERT is a language model that takes as input a sequence of partially masked words and predicts which word should originally be placed in each masked position.
The recurrent BERT model takes advantage of this characteristic, using the [MASK] token to produce a probability distribution over a vocabulary set (i.e. a hashtag set in our case). The recurrent BERT model is adopted and extended for hashtag recommendation.

Given a sequence of n tokens x = [x_1, x_2, x_3, ..., x_n] representing the input context, our model begins with:

X_1 = [x, [SEP], [MASK]]    (2)

where [SEP] indicates the end of the input sequence, and [MASK] is appended at the end as a generative token over a target vocabulary set V (here, the hashtag set HT). As the model generates one token after another, the predicted tokens are consecutively appended to the input and fed back to the model:

X_i = [x, [SEP], ŷ_1, ŷ_2, ..., ŷ_{i−1}, [MASK]]    (3)

where X_i refers to the ith input for a given data instance. Following the concatenation of the generated tokens between the [SEP] and [MASK] tokens, the recurrent BERT feeds X_i back into BERT and takes the representation at the [MASK] token's position to generate the subsequent token:

h_i = BERT(X_i)    (4)
z = h_i^[MASK] · W_vocab^T + b    (5)
p^k = exp(z^k) / Σ_{ht ∈ HT} exp(z^ht)    (6)
ŷ_i = argmax(p)    (7)

where h_i is the sequence of BERT hidden states, z ∈ R^|V|, and p ∈ R^|V|. The generated ŷ_i is then used to expand the input sequence for the next step:

X_{i+1} = [x, [SEP], ŷ_1, ŷ_2, ..., ŷ_{i−1}, ŷ_i, [MASK]]    (8)
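Concretely, the recurrence in Eqs. (2)-(8) can be realized with an off-the-shelf BERT encoder. The following is a minimal Python sketch, not the authors' released code: bert and tag_head stand in for a pre-trained BERT encoder (e.g. a transformers BertModel) and the fully-connected layer over the hashtag vocabulary, and we assume each hashtag is a single unit in an extended token vocabulary so that a predicted tag id can be appended directly to the next input.

    import torch

    def generate_tags(bert, tag_head, mask_id, context_ids, k=10):
        # Greedy, sequence-oblivious generation (cf. Eqs. 2-8).
        # bert:        encoder returning hidden states [B, T, H]
        # tag_head:    linear layer H -> |HT| over the hashtag vocabulary
        # context_ids: token ids of the fused context x, ending with [SEP]
        generated = []                           # ht_<s, initially empty (Eq. 2)
        for _ in range(k):
            # X_i = [x, [SEP], y_1 ... y_{i-1}, [MASK]]  (Eq. 3)
            x = torch.tensor([context_ids + generated + [mask_id]])
            h = bert(x).last_hidden_state        # h_i = BERT(X_i)  (Eq. 4)
            z = tag_head(h[0, -1])               # logits at [MASK] (Eq. 5)
            if generated:                        # suppress repeats: an
                z[generated] = float("-inf")     # implementation choice the
                                                 # paper does not specify
            p = torch.softmax(z, dim=-1)         # Eq. 6
            generated.append(int(p.argmax()))    # greedy pick, fed back (Eqs. 7-8)
        return generated

The same forward pass is used during training, except that the logits at the [MASK] position are compared against the ground truth (Section 3.4) instead of being decoded greedily.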
3.3 Early Fusion of Different Features

As implied by Eq. 1, where a hashtag is generated from an input context consisting of img, loc, time, and txt, a precursor of our early fusion strategy is to convert the input context into textual form. That way, the four different types of data are entered into BERT as one sequence of tokens and fused into a representation with the self-attention mechanism. This makes it possible to take advantage of BERT's language modeling capability on the fused representation of each input context. On the contrary, late fusion attempts to encode the different feature types in different spaces that are mapped into a common space at a later stage. As in (Denton et al., 2015; Wang et al., 2019; Gong et al., 2018; Li et al., 2016), typical late fusion strategies employ co-attention or bi-attention.

We pre-process each input feature type as follows:

• Image. We generate an image caption from each image using the image captioning module provided by the Microsoft Azure Computer Vision API. The first image is used for each post.

• Location. We utilize only the symbolic names (e.g., My Home, XX National Park) given by the users of Instagram, ignoring coordinate information such as GPS locations.

• Time. A numeric time expression is converted into words in three categories by a rule-based converter: season, day of the week, and part of the day (i.e. morning, afternoon, evening or night). For example, '2020-07-01 (Wed) 14:52:00' is converted into {summer, weekday, afternoon} and used as the time feature.

• Text. A set of words is collected from the user's textual description in a post. We strip hashtags from the description and use only the remaining text as the text feature.

We then enter the input context C of image img, location loc, time time, and text description txt, each followed by its delimiter token ([IMG], [LOC], [TIME], [SEP]). The hashtag(s) ht̂_<s generated in the previous steps and the [MASK] token are appended at the end:

img = [img_1, img_2, ..., img_{|img|−1}, [IMG]]    (9)
loc = [loc_1, loc_2, ..., loc_{|loc|−1}, [LOC]]    (10)
time = [time_1, time_2, ..., time_{|time|−1}, [TIME]]    (11)
txt = [txt_1, txt_2, ..., txt_{|txt|−1}, [SEP]]    (12)
C = [img, loc, time, txt]    (13)
I_s = [C, ht̂_<s, [MASK]]    (14)

where I_s is the input to our recurrent BERT model at the sth step for a given post. The last token representation h^[MASK] is fed to the fully-connected softmax layer to obtain a probability distribution over the defined hashtag vocabulary, as in Eqs. (4-6). The hashtag ht̂_s with the highest probability is chosen by Eq. 7, given the input I_s.
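To illustrate the textualization and the delimiter scheme of Eqs. (9)-(14), here is a small sketch. The season and part-of-day boundaries are our own assumptions (the paper only names the three output categories), and [IMG], [LOC] and [TIME] are assumed to be registered as additional special tokens of the tokenizer:

    from datetime import datetime

    def time_to_words(ts):
        # Rule-based time textualization (Sec. 3.3). The boundary choices
        # below are illustrative assumptions.
        season = ["winter", "spring", "summer", "autumn"][(ts.month % 12) // 3]
        day = "weekday" if ts.weekday() < 5 else "weekend"
        if 5 <= ts.hour < 12:
            part = "morning"
        elif 12 <= ts.hour < 17:
            part = "afternoon"
        elif 17 <= ts.hour < 21:
            part = "evening"
        else:
            part = "night"
        return [season, day, part]

    def build_input(caption, location, time_words, text, prev_tags):
        # I_s = [img [IMG] loc [LOC] time [TIME] txt [SEP] ht_<s [MASK]]
        # (Eqs. 9-14), rendered as one string for a BERT tokenizer.
        parts = [caption, "[IMG]", location, "[LOC]", " ".join(time_words),
                 "[TIME]", text, "[SEP]", " ".join(prev_tags), "[MASK]"]
        return " ".join(p for p in parts if p)

    # e.g. time_to_words(datetime(2020, 7, 1, 14, 52)) returns
    #      ["summer", "weekday", "afternoon"], matching the paper's example.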
3.4 Training and Decoding Strategy

We turn an Instagram post instance containing N ground truth hashtags into N separate training instances. Each instance begins with the context features [img, loc, time, txt] and receives no tag, one tag, and so on, all the way up to the maximum number of tags in the recurrent model. With the first training instance T_1 being [C, [SEP], [MASK]], where C denotes the list of context features, we obtain a total of N training instances with:

T_i = [C, [SEP], ht_1, ht_2, ..., ht_{i−1}, [MASK]]    (15)

where i = 1, ..., N.

When each training instance is used for training, it goes through the BERT module (corresponding to the number of hashtags that have been generated) so that the resulting representation for the [MASK] becomes the input to the fully-connected network classifier that returns h^[MASK], which is then compared to the ground truth.

Given a training instance T_i (Eq. 15), we can use either one ground truth label or multiple ground truth labels that have not been generated up to that point. For a training instance with i−1 tags already predicted, the model attempts to predict the ith tag, which is judged to be correct as long as it matches one of the remaining N−i+1 tags in the ground truth. We define the terms "1-to-1" and "1-to-M" (for one-to-many) with the following ground truth labels, L_i:

L_i^{1-to-1} = ht_i    (16)
L_i^{1-to-M} = [ht_i, ht_{i+1}, ..., ht_N]    (17)

Since the exact sequence needs to be generated in usual language generation, there should be only one ground truth label at every generation step (i.e. in a 1-to-1 fashion). For hashtag generation, however, there can be multiple ground truth tags, any of which can serve as the ground truth at that step (i.e. in a 1-to-M fashion). That is, the order in which the ground truth tags are generated does not matter. For example, the next hashtag that a model needs to generate is not necessarily ht_2 when ht_1 is part of the input, even if they appear in that order in the post. This assumption is referred to as orderlessness in this paper; it can be seen as enforced globally because even the same set of hashtags associated with multiple posts can appear in different orders, thereby mimicking a global permutation of multiple hashtags.

Locally, we enforce the 1-to-M relationship using the KL divergence loss (Eq. 18), comparing the output distribution of predicted tags against the ground truth distribution:

D_KL(q ∥ p) = Σ_{ht ∈ HT} q(ht) log (q(ht) / p(ht))    (18)

where ht refers to a tag within the tag space HT.

To generate the top-k tags at the inference step, we employ greedy search rather than the beam search commonly used in language generation tasks. The rationale is that the prediction of context-relevant hashtags can take place early in the generation process, leading to the incremental addition of pertinent information to the input sequence. Since tag generation does not consider how "probable" the sequence is but how relevant each tag is, greedy search suits the purpose.
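In code, the 1-to-M scheme reduces to instance expansion plus a soft-target loss. A minimal sketch follows; taking q as uniform over the remaining gold tags is our assumption, as the paper does not spell out the exact shape of the ground truth distribution in Eq. 18:

    import torch
    import torch.nn.functional as F

    def make_training_instances(context_tokens, gold_tags):
        # One post with N gold tags -> N instances (Eq. 15):
        # T_i = [C, [SEP], ht_1 .. ht_{i-1}, [MASK]], target = remaining tags.
        for i in range(len(gold_tags)):
            yield context_tokens + gold_tags[:i] + ["[MASK]"], gold_tags[i:]

    def one_to_m_kl_loss(logits, remaining_gold_ids, num_tags):
        # KL(q || p) of Eq. 18 for a single instance. q is taken as uniform
        # over the N-i+1 not-yet-generated gold tags (an assumption).
        q = torch.zeros(num_tags)
        q[remaining_gold_ids] = 1.0 / len(remaining_gold_ids)
        log_p = F.log_softmax(logits, dim=-1)
        # The sum of q*log(q) is constant w.r.t. the model parameters, so
        # minimizing -sum(q * log_p) is equivalent to minimizing KL(q || p).
        return -(q * log_p).sum()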
4 Experiments

4.1 Experimental Settings

4.1.1 Data Construction

To collect meaningful and diverse hashtags from Instagram, we first define a set of seed tags based on their level of generality and frequency. The seed tags consist of 6 categories (Activity, Emotion, Event, Location, Object, Time), and each category contains 10 tags. For example, "#beach" is assigned to Location and "#happy" to Emotion. Using the seed tags, we collect 181,620 posts from Instagram and filter out those with more than 20 hashtags, resulting in 87,872 posts and 194,773 unique hashtags in total. This filtering strategy is based on the rationale that posts with exceptionally many hashtags are highly likely to be advertisements. To avoid recommending overly specific and meaningless hashtags, we also filter out hashtags with fewer than 400 occurrences across the data set, resulting in a final set of 907 hashtags. For training and evaluation, we split the 87,872 instances in a 9:1 ratio, ending up with 79,085 training and 8,787 evaluation samples.
Model P@1 P@3 P@5 P@10 R@1 R@3 R@5 R@10
Frequency-Based 0.0699 0.0533 0.0465 0.0387 0.0156 0.0422 0.0576 0.1104
Context-Tag Mapping 0.1521 0.1130 0.0958 0.0730 0.0621 0.1278 0.1740 0.2521
BERT-based Ranking (LF) 0.2884 0.1905 0.1488 0.1039 0.1134 0.2124 0.2677 0.3585
BERT-based Ranking (EF) 0.4178 0.2652 0.2051 0.1350 0.1752 0.2869 0.3450 0.4209
BERT-based AR (1-to-1) 0.3006 0.1854 0.1466 0.1047 0.1283 0.2049 0.2588 0.3544
BERT-based AR (1-to-M) 0.3734 0.2233 0.1746 0.1213 0.1553 0.2404 0.2947 0.3809
Ours (1-to-1) 0.4615 0.2803 0.2058 0.1278 0.1981 0.3044 0.3465 0.3953
Ours (1-to-M) 0.5185 0.3015 0.2203 0.1390 0.2229 0.3238 0.3662 0.4263

Table 1: Precision@K and Recall@K evaluation on baseline models and our model.

4.1.2 Metrics

We conduct our evaluation with precision-at-k (P@K) and recall-at-k (R@K), both of which are widely used for recommendation tasks. With k set to 1, 3, 5 and 10, we calculate the scores for the given metrics to compare our model's performance against several baselines:

P@K = (1/N) Σ |Ranked top-K ∩ Ground Truth| / K
R@K = (1/N) Σ |Ranked top-K ∩ Ground Truth| / |Ground Truth|

where the sums run over the N evaluation posts.
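Both are per-post scores averaged over the N evaluation samples; a direct implementation of the two formulas:

    def precision_recall_at_k(ranked_tags, gold_tags, k):
        # P@K and R@K for a single post (Sec. 4.1.2).
        hits = len(set(ranked_tags[:k]) & set(gold_tags))
        return hits / k, hits / len(gold_tags)

    def evaluate(predictions, golds, k):
        # Corpus-level scores: average the per-post values over N posts.
        pairs = [precision_recall_at_k(r, g, k)
                 for r, g in zip(predictions, golds)]
        n = len(pairs)
        return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n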

4.1.3 Baselines

To validate our hypotheses and conduct a comparative analysis of our recurrent BERT hashtag generation model, we design and evaluate the following baselines:

• Frequency-Based. We generate the most frequent tags regardless of the given context.

• Context-Tag Mapping (Ranking). Conventional tag recommendation models (Weston et al., 2014; Denton et al., 2015; Wu et al., 2018b; Yang et al., 2020a) project the input embedding and the tag embeddings onto the same representation space and learn with a pairwise ranking loss. Building upon this scheme, this generalized context-tag mapping model merges the encoded features to represent the context. At inference time, it selects the top-k tags nearest to the context embedding in the jointly projected space. It uses a CNN for image encoding and an LSTM for text encoding, and is trained with a triplet loss that pulls positive tags closer to the context while pushing negative tags away from it.

• BERT-based Ranking (EF vs. LF). This model sends only the context features to BERT and produces a top-k list of hashtags based on the [CLS] representation. To investigate the difference between the early fusion (EF) strategy proposed in our model and the commonly used late fusion (LF), we design two versions of this model. The EF version takes the input C (Eq. 13), whereas the LF version passes each feature separately through BERT to encode it independently.

• BERT-based AR (1-to-1 vs. 1-to-M). For the autoregressive (AR) tag generation model, we employ the Transformer architecture. For the sake of fairness, we use BERT as the encoder and a single-layer Transformer decoder to match the parameter count of the other BERT-based baselines. This model is divided into 1-to-1 and 1-to-M versions.

4.1.4 Implementation details

Our implementations of the BERT-based models are based on the transformers library (Wolf et al., 2019), and we use an NVIDIA V100 GPU for training. With the batch size set to 8, the hidden size to 768 and the learning rate to 5e-5, we use the Adam optimizer and a random seed of 42. The maximum input sequence length of our model is set to 384.

4.2 Lessons from the Comparisons

Table 1 presents the P@K and R@K scores of our model against the baselines, showing very clearly that our model outperforms the baseline models by significant margins. The most salient outcome of the experiment is that the particular generative approach in our proposed model is much superior to the ranking and autoregressive generation approaches.
Model P@1 P@3 P@5 P@10 R@1 R@3 R@5 R@10
All features 0.5185 0.3015 0.2203 0.1390 0.2229 0.3238 0.3662 0.4263
w/o Image 0.4057 0.2169 0.1572 0.0992 0.1744 0.2350 0.2629 0.3026
w/o Location 0.4979 0.2851 0.2085 0.1299 0.2129 0.3061 0.3464 0.3990
w/o Time 0.4323 0.2596 0.1927 0.1229 0.1849 0.2769 0.3164 0.3714
w/o Text 0.3724 0.2048 0.1474 0.0913 0.1573 0.2182 0.2435 0.2792

Table 2: Ablation over the feature types using our model. Only one feature is removed at a time.

Still another significant result is that the early fusion strategy is superior to late fusion for both the ranking and generative approaches. Further analyses follow:

• Context-Tag Mapping vs. BERT-based Ranking (LF). This comparison shows that, even for late fusion without deep interaction among the features, the language modeling capability of BERT contributes significantly to the hashtag recommendation problem compared to the joint space approach.

• EF vs. LF. We assess the early fusion and late fusion approaches for inter-context feature modeling by comparing the BERT-based Ranking (LF) and BERT-based Ranking (EF) baselines. In the LF approach, the model separately encodes each feature (image, location, time and text) with a shared-parameter BERT and averages over them to form a single, aggregated context representation. The EF approach, on the other hand, jointly feeds the context features in a single step when ranking the top-k hashtags based on the fused context information. As the results of these two models imply, early fusion provides an extensive and more comprehensive view of the given features, unlike the delayed aggregation of separately modeled context representations.

• Ranking vs. Generation. The experimental results show that casting recommendation as generation using recurrent BERT yields a significant benefit in performance. This performance gain is attributed to the generation aspect that considers the dependency among the generated tags. Note that the BERT-based AR models are weaker than the ranking model, which reinforces the importance of the particular way the proposed model handles generation. The BERT-based AR models are directly influenced by the immediately preceding input token (i.e., hashtag). This property is suitable for sequentially dependent natural language generation, but it keeps the AR models from using the entire context including the generated hashtags, which causes their weaker performance relative to the ranking model. On the contrary, our model uses a [MASK] token representation that learns to aggregate the entire context and the previous tag information when generating the output hashtags. The result validates the effectiveness of our generation model as specifically suited to the hashtag recommendation task.

• 1-to-1 vs. 1-to-M. For both the BERT-based AR model and our model, the 1-to-M variants outperform the 1-to-1 variants by a significant gap, which shows the effectiveness of our approach under the orderlessness assumption. To further test orderlessness, we shuffled and reversed the order of hashtags within the posts and trained our model under the same setting. There was no meaningful performance gap from the original result, validating our assumption.

4.3 Ablation Study

We also conduct an ablation study to see how each feature contributes to the model performance and the hashtag recommendation. As can be seen in Table 2, every evaluation score decreases when we remove one of the input features, implying that all of the features contribute to the task and the model. Text is the most important feature, probably because it comes directly from the users and is the most native to the BERT language model. On the other hand, the location and time features appear to be less important because they are secondary descriptions derived from the original descriptors. Usually the text form of a location is too specific and diverse for the model to capture its patterns. We leave the issue of extracting more innate representations for those features to future work.
From/To   Image   Location  Time    Text    Tag
Image     39.74%  13.65%    22.33%  9.76%   14.52%
Location  13.48%  47.01%    19.37%  8.80%   11.34%
Time      11.07%  11.61%    51.67%  10.77%  14.88%
Text      12.39%  10.37%    16.71%  39.19%  21.34%
Tag       9.29%   6.72%     10.27%  10.41%  63.31%
[MASK]    14.27%  9.28%     11.58%  10.86%  54.01%

Table 3: Interactions among features. Each row shows how much that feature attends to the others (averaged attention shares over the test set).
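Scores of this kind can be obtained by averaging the self-attention maps over the token spans belonging to each feature. A sketch, assuming attention tensors from a transformers-style encoder called with output_attentions=True and known segment boundaries; averaging over all layers and heads is our own choice, as the paper does not state which layers Table 3 uses:

    import torch

    def feature_attention_shares(attentions, segments):
        # Aggregate token-level self-attention into feature-level shares,
        # in the spirit of Table 3.
        #   attentions: tuple of [B, heads, T, T] tensors, one per layer
        #   segments:   {"img": (start, end), "loc": ..., "tag": ...}
        att = torch.stack(attentions).mean(dim=(0, 2))[0]   # -> [T, T]
        shares = {}
        for src, (s0, s1) in segments.items():
            row = att[s0:s1].mean(dim=0)                    # avg over source tokens
            mass = {dst: row[d0:d1].sum().item()
                    for dst, (d0, d1) in segments.items()}
            total = sum(mass.values())
            shares[src] = {dst: m / total for dst, m in mass.items()}
        return shares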


[Figure 3: Visualization of attention scores for encoding a [MASK] token to be used for generating the next hashtag. Panels (a) and (b) show two posts whose inputs mix context tokens (image caption, location, time words such as "weekday evening", text) with previously generated hashtags (#autumn, #london, #park, #nature, #uk, #trees, #sky); in both, the [MASK] token attends most strongly to the previously generated hashtags.]

4.4 Attention Analysis

We examine the level of interaction among the feature types, the context-to-tag correlation, and the importance of the generated tags as well as the feature types in the [MASK] token representation. Table 3 shows the attention scores for each feature averaged over the test set, which indicate how much each feature attends to the other features (row-wise). It is evident that the features interact with one another quite actively, as expected. We note that a major portion of the attention from [MASK] (54.01%) is on the previous tags, witnessing that the dependency among tags manifests itself in our model. In Figure 3 (a) and (b), which illustrate how much the [MASK] token attends to the other tokens, we observe that it largely focuses on the previous hashtags. This shows that the generated hashtags play a critical role in generating a related tag, which is in accordance with our claim.

4.5 Greedy vs. Beam Search

As a way to confirm our hypothesis that our proposed model is more amenable to hashtag generation than the generation models developed for Natural Language Generation (NLG), we compare beam search against the greedy search adopted for our model. We apply beam search with different width settings in generating the tag sequence. In Figure 4, we observe a substantial performance drop in our model when we apply beam search to tag generation.

[Figure 4: Beam search results with different beam widths (B = 1, i.e. greedy search, vs. B = 3, 5 and 10), measured in P@1 and R@1.]

This result can be explained by the characteristic differences between a natural language sequence and a tag sequence. In an NLG task, making the sentence natural and understandable as a whole is important; thus beam search, which maximizes the probability of the entire sequence (sentence), improves NLG performance. Tag generation, however, inherently prefers greedy search because it needs to accumulate relevant hashtags from the early steps of generation, without having to consider complex syntactic and semantic constraints. This outcome substantiates our orderlessness assumption.

4.6 Qualitative Analysis

Aside from the extensive quantitative analysis, we show a few test cases in which we can examine how different models generate lists of hashtags for the posts in Figure 5. The predicted tags in red are the correct ones that actually exist for the post. For the top post, we see that the BERT-based Ranking model produces relevant hashtags like #disney but fails to predict others like #fear and #anger, which require considering inter-tag dependency (Joy, Fear and Anger are characters from the Disney movie Inside Out). Our model, however, successfully generates those hashtags because it can make use of the information from previously generated hashtags. For the bottom post, the BERT-based AR model fails to generate any of the gold hashtags because of its autoregressive property, which relies heavily on the immediately preceding hashtag instead of seeing the entire context. Since the model initially produced incorrect hashtags (#disgust, #insideout), the autoregressive characteristic propagates the erroneous tag information through the subsequent generation steps. In contrast, our model generates many of the correct tags successfully, owing to its direct view of the entire contextual information.
[Figure 5: Hashtag lists generated by different models for two posts, with correct (gold) tags highlighted in red in the original. Top post (a drawing of Inside Out characters), gold tags #disney #insideout #sadness #joy #anger #fear #disgust; BERT-based Ranking: #drawing #art #disney #digitalart #sketch #disgust #illustration #joy #artist #artwork; Ours: #insideout #disgust #disney #pixar #sadness #joy #fear #anger #disneyland #green. Bottom post (an autumn park scene); BERT-based AR: #disgust #insideout #disney #fear #anger #sadness #pixar #joy #halloween #emotions; Ours: #park #fall #hiking #autumn #nature #forest ...]

Figure 5: Qualitative analysis for comparing generated tags by different models.

5 Conclusion

We introduce a novel hashtag feedback/generation approach that explicitly considers the dependency among the tags while accounting for the characteristic difference between tag generation and language generation. To exploit the rich information among the assorted features, we adopt an early fusion approach, converting the distinct features to the same modality (i.e., text) and leveraging BERT's language modeling capabilities. Through an extensive analysis against the baseline models, we show significant improvements. We also show that the 1-to-M generation approach under the orderlessness assumption is more appropriate for the tag recommendation task. For future work, we plan to evaluate our model on other benchmarks of a similar kind to test the robustness of our tag generation approach. Our generative framework is also expected to generalize to other recommendation tasks that deal with items exhibiting inter-dependency.

References

Ying-Hong Chan and Yao-Chung Fan. 2019. A recurrent BERT-based model for question generation. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 154–162, Hong Kong, China. Association for Computational Linguistics.

Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 191–198.

Emily Denton, Jason Weston, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. 2015. User conditional hashtag prediction for images. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1731–1740.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.

Zhuoye Ding, Qi Zhang, and Xuan-Jing Huang. 2012. Automatic hashtag recommendation for microblogs using topic-specific translation model. In Proceedings of COLING 2012: Posters, pages 265–274.

Fréderic Godin, Viktor Slavkovikj, Wesley De Neve, Benjamin Schrauwen, and Rik Van de Walle. 2013. Using topic models for Twitter hashtag recommendation. In Proceedings of the 22nd International Conference on World Wide Web, pages 593–596.

Yeyun Gong, Qi Zhang, and Xuanjing Huang. 2018. Hashtag recommendation for multimodal microblog posts. Neurocomputing, 272:170–177.

Yuyun Gong and Qi Zhang. 2016. Hashtag recommendation using attention-based convolutional neural network. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16, pages 2782–2788. AAAI Press.

Mohadeseh Kaviani and Hossein Rahmani. 2020. EmHash: Hashtag recommendation using neural network based on BERT embedding. In 2020 6th International Conference on Web Research (ICWR), pages 113–118. IEEE.

Quanzhi Li, Sameena Shah, Armineh Nourbakhsh, Xiaomo Liu, and Rui Fang. 2016. Hashtag recommendation based on topic enhanced embedding, tweet entity data and learning to rank. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pages 2085–2088.

Yang Li, Ting Liu, Jingwen Hu, and Jing Jiang. 2019. Topical co-attention networks for hashtag recommendation on microblogs. Neurocomputing, 331:356–365.

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems, pages 289–297.

Minseok Park, Hanxiang Li, and Junmo Kim. 2016. HARRISON: A benchmark on hashtag recommendation for real-world images in social networks. arXiv preprint arXiv:1605.05054.

Surendra Sedhai and Aixin Sun. 2014. Hashtag recommendation for hyperlinked tweets. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 831–834.

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Yue Wang, Jing Li, Irwin King, Michael R. Lyu, and Shuming Shi. 2019. Microblog hashtag generation via encoding conversation contexts. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1624–1633.

Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. WSABIE: Scaling up to large vocabulary image annotation.

Jason Weston, Sumit Chopra, and Keith Adams. 2014. #TagSpace: Semantic embeddings from hashtags. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1822–1827, Doha, Qatar. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Gaosheng Wu, Yuhua Li, Wenjin Yan, Ruixuan Li, Xiwu Gu, and Qi Yang. 2018a. Hashtag recommendation with attention-based neural image hashtagging network. In International Conference on Neural Information Processing, pages 52–63. Springer.

Ledell Yu Wu, Adam Fisch, Sumit Chopra, Keith Adams, Antoine Bordes, and Jason Weston. 2018b. StarSpace: Embed all the things! In AAAI.

Jinxi Xu and W. Bruce Croft. 1996. Query expansion using local and global document analysis. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '96, pages 4–11, New York, NY, USA. Association for Computing Machinery.

Chao Yang, Xiaochan Wang, and Bin Jiang. 2020a. Sentiment enhanced multi-modal hashtag recommendation for micro-videos. IEEE Access, 8:78252–78264.

Qi Yang, Gaosheng Wu, Yuhua Li, Ruixuan Li, Xiwu Gu, Huicai Deng, and Junzhuang Wu. 2020b. AMNN: Attention-based multimodal neural network model for hashtag recommendation. IEEE Transactions on Computational Social Systems.

E. Zangerle, W. Gassler, and G. Specht. 2011. Recommending #-tags in Twitter. In Proceedings of the Workshop on Semantic Adaptive Social Web (SASWeb 2011), CEUR Workshop Proceedings, volume 730, pages 67–78.

Qi Zhang, Jiawen Wang, Haoran Huang, Xuanjing Huang, and Yeyun Gong. 2017. Hashtag recommendation for multimodal microblog using co-attention network. In IJCAI, pages 3420–3426.

Suwei Zhang, Yuan Yao, Feng Xu, Hanghang Tong, Xiaohui Yan, and Jian Lu. 2019. Hashtag recommendation for photo sharing services. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5805–5812.

Feng Zhao, Yajun Zhu, Hai Jin, and Laurence T. Yang. 2016. A personalized hashtag recommendation approach using LDA-based topic model in microblog environment. Future Generation Computer Systems, 65(C):196–206.
