[Figure 2: Recurrent input construction. At each step s, the input I_s = [C, ĥt_{<s}, [MASK]] consists of the context C = [img, loc, time, txt] (with delimiter tokens [IMG], [LOC], [TIME], [SEP]), the hashtags generated so far, and a trailing [MASK] token used to predict the next hashtag.]
the ways they are fused are quite distinct. For instance, Denton et al. (2015) incorporate user metadata (e.g., age, gender) with 3-way multiplicative gating along with the image for hashtag recommendation. Another example (Wang et al., 2019) uses both the text description of a given tweet and the thread conversation by employing bi-attention (Seo et al., 2017). A more recent approach makes use of audio features (Yang et al., 2020a) for short-video information. A common drawback of these previous models is that they capture only a very limited amount of the relationships among input features of different modalities. Our approach of applying a self-attention mechanism through a pre-trained BERT ensures that the inter-dependency among the given image, location, time, and text description is learned at an early stage of the layer hierarchy, modeling the contextual information dispersed throughout the input features.

3 Approach

3.1 Recommendation as a Generation Task

In contrast to the ranking approaches, our model generates a sequence of interrelated hashtags given an assortment of context features. As in Figure 2, our model takes as input the context C consisting of the image, location, time, and text features, together with a [MASK] token appended at the tail end of the sequence (see Section 3.3 for details on encoding each context). Between the context features and the [MASK] token comes the sequence of hashtags generated by the model (initially empty), which incorporates the tag dependency into the subsequent input sequences. As each hashtag ht is generated, it is added to the input sequence as in Figure 2. Unlike autoregressive approaches, in which only a hidden state vector is passed on to the next state (Sutskever et al., 2014) or the representation of the immediately preceding token is pooled (Vaswani et al., 2017), our model fuses the mutually supporting context features "directly" with the self-attention mechanism, along with the incrementally appended hashtags. This early fusion approach is conducive to modeling our representation space because the output space is jointly modeled with the combined context-tag representation at every generation step. In contrast, the commonly used late fusion of latent representations of multi-modal input vectors is limited to aggregating the projected information. The expected benefit of early fusion is an extensive and comprehensive view of the contextual information when generating a hashtag.

Inspired by the generative application of BERT in (Chan and Fan, 2019), our model uses a pre-trained BERT model that generates tags one after another and recurrently feeds them back as input. Our BERT-based model is trained on a hashtag prediction task: given context features and co-occurring tags as input, it predicts a relevant hashtag while taking into account what hashtags have already been generated. More formally, the s-th hashtag is generated as follows:

ĥt_s = argmax_{ht_s ∈ HT} P(ht_s | img, loc, time, txt, ĥt_{<s})   (1)

where HT is the set of all hashtags, ĥt_{<s} refers to the hashtags generated so far for a given post instance, and img, loc, time, and txt denote the image, location, time, and text features, respectively.
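To make the recurrence of Eq. 1 concrete, the following is a minimal Python sketch of the generation loop. The helper `predict_next_tag` is a hypothetical stand-in for the recurrent BERT step of Section 3.2, and `context` stands for the textualized post context of Section 3.3; neither name comes from the authors' code.

```python
from typing import Callable, List

def generate_hashtags(context: str,
                      predict_next_tag: Callable[[str, List[str]], str],
                      k: int = 10) -> List[str]:
    """Greedy realization of Eq. 1: at step s, pick the hashtag with the highest
    probability given the context and the tags generated so far, then feed it back."""
    generated: List[str] = []
    for _ in range(k):
        # ht_s = argmax_{ht in HT} P(ht | img, loc, time, txt, ht_{<s})
        generated.append(predict_next_tag(context, generated))
    return generated
```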
3.2 Generation Model

As in (Chan and Fan, 2019), where BERT is used as a generative model for question generation, we employ the recurrent BERT model for the tag generation task. BERT is a language model that takes as input a sequence of partially masked words and predicts which word should originally be placed in each masked position. The recurrent BERT model takes advantage of this characteristic, using the [MASK] token to produce a probability distribution over a vocabulary set (i.e., a hashtag set in our case). We adopt and extend the recurrent BERT model for hashtag recommendation.

Given a sequence of n tokens x = [x_1, x_2, x_3, ..., x_n] representing the input context, our model begins with:

X_1 = [x, [SEP], [MASK]]   (2)

where [SEP] indicates the end of the input sequence and [MASK] is appended at the end as a generative token over a target vocabulary set V. As the model generates one token after another, the predicted tokens are consecutively appended to the input and fed back to the model:

X_i = [x, [SEP], ŷ_1, ŷ_2, ..., ŷ_{i−1}, [MASK]]   (3)

where X_i refers to the i-th input of a given data instance. After concatenating the generated sequence of tokens between the [SEP] and [MASK] tokens, the recurrent BERT feeds the sequence back into BERT and takes the representation at the [MASK] token's position to generate the subsequent token:

h_i = BERT(X_i)   (4)
z = h_i^{[MASK]} · W_vocab^T + b   (5)
p_k = exp(z_k) / Σ_{ht ∈ HT} exp(z_ht)   (6)
ŷ_i = argmax(p)   (7)

where h_i ∈ R^n, z ∈ R^{|V|}, and p ∈ R^{|V|}. The generated ŷ_i is then used again to expand the input sequence for the next step:

X_{i+1} = [x, [SEP], ŷ_1, ŷ_2, ..., ŷ_{i−1}, ŷ_i, [MASK]]   (8)
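Below is a minimal sketch of one decoding step (Eqs. 2-8), assuming the HuggingFace `transformers` API. The placeholder hashtag vocabulary and the projection layer `W_vocab` are introduced here only for illustration; they are not the authors' released code, and in practice the projection would be trained as described in Section 3.4. This is the kind of step function the loop sketched after Eq. 1 would call.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

hashtag_vocab = ["#beach", "#happy", "#sunset"]                          # placeholder tag set HT
W_vocab = torch.nn.Linear(bert.config.hidden_size, len(hashtag_vocab))  # z = h·W^T + b (Eq. 5)

def predict_next_tag(context_text: str, generated: list) -> str:
    # X_i = [x, [SEP], y_1, ..., y_{i-1}, [MASK]]   (Eq. 3)
    x_i = " ".join([context_text, "[SEP]"] + generated + ["[MASK]"])
    inputs = tokenizer(x_i, return_tensors="pt")
    with torch.no_grad():
        h_i = bert(**inputs).last_hidden_state                          # h_i = BERT(X_i) (Eq. 4)
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
    z = W_vocab(h_i[0, mask_pos])                                       # Eq. 5
    p = torch.softmax(z, dim=-1)                                        # Eq. 6
    return hashtag_vocab[int(p.argmax())]                               # Eq. 7 (greedy choice)
```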
3.3 Early Fusion of Different Features

As implied by Eq. 1, where a hashtag is generated from an input context consisting of img, loc, time, and txt, a precursor to our early fusion strategy is converting the input context into textual form. That way, the four different types of data are entered into BERT as a single sequence of tokens and fused into a representation by the self-attention mechanism. This approach makes it possible to exploit BERT's language modeling capability on the fused representation of each input context. On the contrary, late fusion encodes different feature types in different spaces that are mapped into a common space at a later stage; as in (Denton et al., 2015; Wang et al., 2019; Gong et al., 2018; Li et al., 2016), typical late fusion strategies employ co-attention or bi-attention.

We pre-process each input feature type as follows:

• Image. We generate an image caption from each image using the image captioning module provided by the Microsoft Azure Computer Vision API. The first image of each post is used.

• Location. We use only the symbolic names (e.g., My Home, XX National Park) given by Instagram users, ignoring coordinate information such as GPS locations.

• Time. A numeric time expression is converted into words in three categories by a rule-based converter: season, day of the week, and part of the day (i.e., morning, afternoon, evening, or night). For example, '2020-07-01 (Wed) 14:52:00' is converted into {summer, weekday, afternoon} and used as the time feature (see the sketch after this list).

• Text. A set of words is collected from the user's textual description of a post. We strip hashtags from the description and use only the remaining text as the text feature.
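The following is a minimal sketch of such a rule-based time converter. The exact hour ranges and the month-to-season mapping are our assumptions, since the text only specifies the three output categories; the "(Wed)" annotation of the example is omitted from the input format.

```python
from datetime import datetime

def convert_time(timestamp: str) -> set:
    """Map a timestamp to {season, weekday/weekend, part of day}."""
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    season = ["winter", "winter", "spring", "spring", "spring", "summer",
              "summer", "summer", "autumn", "autumn", "autumn", "winter"][dt.month - 1]
    day = "weekday" if dt.weekday() < 5 else "weekend"
    if 5 <= dt.hour < 12:
        part = "morning"
    elif dt.hour < 17:
        part = "afternoon"
    elif dt.hour < 21:
        part = "evening"
    else:
        part = "night"
    return {season, day, part}

print(convert_time("2020-07-01 14:52:00"))  # {'summer', 'weekday', 'afternoon'}
```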
We then form the input context C from the image img, location loc, time time, and text description txt, each followed by a delimiter token ([IMG], [LOC], [TIME], [SEP]). Note that the hashtag(s) generated in the previous steps, ĥt_{<s}, and a [MASK] token are appended to the end:

img = [img_1, img_2, ..., img_{|img|−1}, [IMG]]   (9)
loc = [loc_1, loc_2, ..., loc_{|loc|−1}, [LOC]]   (10)
time = [time_1, time_2, ..., time_{|time|−1}, [TIME]]   (11)
txt = [txt_1, txt_2, ..., txt_{|txt|−1}, [SEP]]   (12)
C = [img, loc, time, txt]   (13)
I_s = [C, ĥt_{<s}, [MASK]]   (14)

where I_s is the input to our recurrent BERT model at the s-th step for a given post. The representation at the last ([MASK]) token, h_[MASK], is fed to the fully-connected softmax layer to obtain a probability distribution over the defined hashtag vocabulary, as in Eqs. (4-6). The hashtag ĥt_s with the highest probability is then chosen by Eq. 7, given the input I_s.
into BERT as a sequence of tokens and fused into resentation h[MASK] is fed to the fully-connected
softmax layer to obtain a probability distribution that order in the post. This assumption is referred
over a defined set of vocabulary for hashtags, as in to as orderlessness in this paper, which can be
Eq. (4-6). A hashtag htˆ s with the highest probabil- seen as enforced globally because even the same
ity is chosen by Eq. 7, given the input Is . set of hashtags associated with multiple posts can
appear in different orders, thereby mimicking a
3.4 Training and Decoding Strategy global permutation of multiple hashtags.
We turn an Instagram post instance containing N Locally, we enforce the 1-to-M relationship us-
ground truth hashtags into N separate training in- ing KL divergence loss (Eq. 18), comparing the
stances. Each instance begins with the context fea- output distribution of predicted tags against the
tures [img, loc, time, txt] and receives no tag, one ground truth distribution.
tag, and so on, all the way to the maximum number
of tags in the recurrent model. With the first train-
ing instance T1 being [C, [SEP], [MASK]] where X q(ht)
DKL (q||p) = q(ht) log (18)
C denotes the list of context features, we obtain a p(ht)
ht∈HT
total of N training instances with:
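A small sketch of this instance expansion (Eq. 15) follows. Pairing each instance with its remaining ground-truth tags anticipates the 1-to-M labels of Eq. 17 below; the function and variable names are ours, not the authors'.

```python
def expand_post(context: str, hashtags: list) -> list:
    """Turn one post with N ground-truth hashtags into N training instances:
    T_i = [C, [SEP], ht_1, ..., ht_{i-1}, [MASK]]   (Eq. 15)."""
    instances = []
    for i in range(1, len(hashtags) + 1):
        t_i = " ".join([context, "[SEP]"] + hashtags[: i - 1] + ["[MASK]"])
        labels = hashtags[i - 1:]          # remaining tags; the 1-to-M label set (Eq. 17)
        instances.append((t_i, labels))
    return instances

# A post with three ground-truth tags yields three training instances.
expand_post("a dog on a beach [IMG] xx park [LOC] summer [TIME] best day [SEP]",
            ["#beach", "#dog", "#happy"])
```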
When a training instance is used for training, it goes through the BERT module (corresponding to the number of hashtags that have been generated so far), and the resulting representation h_[MASK] for the [MASK] token becomes the input to the fully-connected network classifier, whose prediction is then compared against the ground truth.

Given a training instance T_i (Eq. 15), we can use either a single ground-truth label or the multiple ground-truth labels that have not yet been generated up to that point. For a training instance in which i − 1 tags have already been predicted, the model attempts to predict the i-th tag, which is judged correct as long as it matches one of the remaining N − i + 1 ground-truth tags. We define the terms "1-to-1" and "1-to-M" (for one-to-many) with the following ground-truth labels L_i:

L_i^{1-to-1} = ht_i   (16)
L_i^{1-to-M} = [ht_i, ht_{i+1}, ..., ht_N]   (17)

Since the exact target sequence must be reproduced in usual language generation, there is only one ground-truth label at every generation step (i.e., a 1-to-1 fashion). For hashtag generation, however, there can be multiple ground-truth tags, any of which can serve as the ground truth at that step (i.e., a 1-to-M fashion). That is, the order in which the ground-truth tags are generated does not matter. For example, the next hashtag the model needs to generate is not necessarily ht_2 when ht_1 is part of the input, even if they appear in that order in the post. We refer to this assumption as orderlessness in this paper; it can be seen as enforced globally because even the same set of hashtags associated with multiple posts can appear in different orders, thereby mimicking a global permutation of multiple hashtags.

Locally, we enforce the 1-to-M relationship using a KL-divergence loss (Eq. 18), comparing the output distribution of predicted tags against the ground-truth distribution:

D_KL(q ‖ p) = Σ_{ht ∈ HT} q(ht) log ( q(ht) / p(ht) )   (18)

where ht refers to a tag within the tag space HT.
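A PyTorch sketch of the 1-to-M objective of Eq. 18 is given below. We assume here that the ground-truth distribution q is uniform over the post's remaining (not yet generated) tags; the text does not spell out the exact form of q, so this choice is an assumption.

```python
import torch
import torch.nn.functional as F

def one_to_many_kl(logits: torch.Tensor, remaining_tag_ids: list) -> torch.Tensor:
    """KL(q || p) of Eq. 18 for one training instance.
    logits: unnormalized scores over the hashtag vocabulary HT (shape [|HT|]).
    remaining_tag_ids: indices of the ground-truth tags not yet generated."""
    q = torch.zeros_like(logits)
    q[remaining_tag_ids] = 1.0 / len(remaining_tag_ids)   # assumed uniform ground truth
    log_p = F.log_softmax(logits, dim=-1)                  # model distribution p over HT
    return F.kl_div(log_p, q, reduction="sum")             # expects log-probs and target probs

loss = one_to_many_kl(torch.randn(907), remaining_tag_ids=[3, 17, 42])
```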
To generate the top-k tags at inference time, we employ greedy search rather than the beam search commonly used in language generation tasks. The rationale is that prediction of context-relevant hashtags can take place early in the generation process, leading to the incremental addition of pertinent information to the input sequence. Since tag generation is not concerned with how "probable" the whole sequence is but with how relevant each individual tag is, greedy search suits this purpose.

4 Experiments

4.1 Experimental Settings

4.1.1 Data Construction

To collect meaningful and diverse hashtags from Instagram, we first define a set of seed tags based on their level of generality and frequency. The seed tags consist of 6 categories (Activity, Emotion, Event, Location, Object, Time), and each category contains 10 tags. For example, "#beach" is assigned to Location and "#happy" to Emotion. Using the seed tags, we collect 181,620 posts from Instagram and filter out those with more than 20 hashtags, resulting in 87,872 posts and 194,773 unique hashtags in total. This filtering strategy is based on the rationale that posts with exceptionally many hashtags are very likely to be advertisements. To avoid recommending overly specific and meaningless hashtags, we also filter out hashtags with fewer than 400 occurrences across the data set, resulting in a final set of 907 hashtags. For training and evaluation, we split the 87,872 instances in a 9:1 ratio, ending up with 79,085 training and 8,787 evaluation samples.
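A short sketch of the filtering pipeline described above; `posts` is a hypothetical list of (post_id, hashtag-list) pairs, and only the thresholds come from the text.

```python
from collections import Counter

def build_dataset(posts: list):
    """posts: list of (post_id, [hashtags]) pairs collected with the seed tags."""
    # Drop likely advertisements: posts carrying more than 20 hashtags.
    posts = [(pid, tags) for pid, tags in posts if len(tags) <= 20]
    # Keep only hashtags with at least 400 occurrences across the data set.
    counts = Counter(tag for _, tags in posts for tag in tags)
    vocab = {tag for tag, c in counts.items() if c >= 400}
    posts = [(pid, [t for t in tags if t in vocab]) for pid, tags in posts]
    # 9:1 train/evaluation split.
    cut = int(0.9 * len(posts))
    return posts[:cut], posts[cut:], vocab
```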
Model P@1 P@3 P@5 P@10 R@1 R@3 R@5 R@10
Frequency-Based 0.0699 0.0533 0.0465 0.0387 0.0156 0.0422 0.0576 0.1104
Context-Tag Mapping 0.1521 0.1130 0.0958 0.0730 0.0621 0.1278 0.1740 0.2521
BERT-based Ranking (LF) 0.2884 0.1905 0.1488 0.1039 0.1134 0.2124 0.2677 0.3585
BERT-based Ranking (EF) 0.4178 0.2652 0.2051 0.1350 0.1752 0.2869 0.3450 0.4209
BERT-based AR (1-to-1) 0.3006 0.1854 0.1466 0.1047 0.1283 0.2049 0.2588 0.3544
BERT-based AR (1-to-M) 0.3734 0.2233 0.1746 0.1213 0.1553 0.2404 0.2947 0.3809
Ours (1-to-1) 0.4615 0.2803 0.2058 0.1278 0.1981 0.3044 0.3465 0.3953
Ours (1-to-M) 0.5185 0.3015 0.2203 0.1390 0.2229 0.3238 0.3662 0.4263
Table 1: Precision@K and Recall@K evaluation on baseline models and our model.
Table 2: Ablation over the feature types using our model. Only one feature is removed at a time.
… approaches. Still another significant result is that the early fusion strategy is superior to late fusion for both the ranking and the generative approaches. Further analyses follow:

• Context-Tag Mapping vs. BERT-based Ranking (LF). This comparison is intended to show that, even with late fusion and without deep interaction among the features, the language modeling capability of BERT contributes significantly to the hashtag recommendation problem compared to the joint-space approach.

• EF vs. LF. We assess the early fusion (EF) and late fusion (LF) approaches to inter-context feature modeling by comparing the BERT-based ranking (LF) and BERT-based ranking (EF) baselines (see the sketch after this list). In the LF approach, the model encodes each feature (image, location, time, and text) separately with a shared-parameter BERT and averages the results to form a single, aggregated context representation. The EF approach, on the other hand, feeds the context features jointly in a single step when ranking the top-k hashtags based on the fused context information. As the results of these two models imply, early fusion provides an extensive and more comprehensive view of the given features, unlike the delayed aggregation of separately modeled context representations.

• Ranking vs. Generation. The experimental results show that casting recommendation as generation with recurrent BERT yields a significant performance benefit. This gain is attributed to the generative aspect that considers the dependency among the generated tags. Note that the BERT-based AR models are weaker than the ranking model, which reinforces the importance of the particular way the proposed model handles generation. The BERT-based AR models are directly influenced by the immediately preceding input token (i.e., hashtag). This property is suitable for sequentially dependent natural language generation but keeps the AR models from using the entire context, including the previously generated hashtags, which explains their weaker performance compared with the ranking model. On the contrary, our model uses the [MASK] token representation, which learns to aggregate the entire contextual and previous-tag information when generating the output hashtags. The result validates the effectiveness of our generation model specifically suited to the hashtag recommendation task.

• 1-to-1 vs. 1-to-M. For both the BERT-based AR model and our model, the 1-to-M variants outperform the 1-to-1 variants by a significant margin, which shows the effectiveness of our approach under the orderlessness assumption. To further test orderlessness, we shuffled and reversed the order of the hashtags within the posts and trained our model under the same setting. There was no meaningful performance gap from the original result, validating our assumption.
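To make the EF/LF contrast referenced in the second bullet concrete, here is a schematic sketch; `encode` stands for a shared text encoder mapping a string to a vector (e.g., a BERT pooled output) and is our own abstraction, not the baseline implementation.

```python
import torch

def late_fusion(encode, features: list) -> torch.Tensor:
    """LF: encode each feature string separately, then aggregate by averaging."""
    vectors = [encode(f) for f in features]       # one vector per modality
    return torch.stack(vectors).mean(dim=0)       # interaction happens only after encoding

def early_fusion(encode, features: list) -> torch.Tensor:
    """EF: concatenate all features into one sequence so self-attention can relate
    tokens across modalities inside the encoder."""
    return encode(" [SEP] ".join(features))       # interaction happens inside the encoder
```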
4.3 Ablation Study

We also conduct an ablation study to see how much each feature contributes to the model performance and the hashtag recommendation. As can be seen in Table 2, every evaluation score decreases when we remove one of the input features, implying that all of the features contribute to the task and the model. Text is the most important feature, probably because it comes directly from the users and is the most native to the BERT language model. On the other hand, the location and time features appear to be less important because they are secondary descriptions derived from the original descriptors: the textual form of a location is usually too specific and diverse for the model to capture its patterns.
We leave the issue of extracting more innate representations of these features for future work.
[Figure residue: columns of input tokens ([MASK], [IMG], [LOC], [TIME], [UNK], [SEP], caption, location and time words such as regent, park, river, evening, fall) alongside generated hashtags (#autumn, #london, #park, #nature, #trees, #sky, #uk), plus a qualitative example in which the BERT-based AR baseline generates: #disgust #insideout #disney #fear #anger #sadness #pixar #joy #halloween #emotions.]
This characteristic propagates erroneous tag information through the subsequent generation steps. In contrast, our model successfully generates many of the correct tags, demonstrating the benefit of its direct view of the contextual information.

5 Conclusion

We introduce a novel hashtag recommendation/generation approach that explicitly considers the dependency among tags, while accounting for the characteristic differences between tag generation and language generation. To exploit the rich information among the assorted features, we adopt an early fusion approach, converting the distinct features into the same modality (i.e., text) and leveraging BERT's language modeling capabilities. Through an extensive analysis against the baseline models, we show significant improvements. We also show that the 1-to-M generation approach under the orderlessness assumption is more appropriate for the tag recommendation task. For future work, we plan to evaluate our model on other benchmarks of a similar kind to test the robustness of our tag generation approach. We also expect our generative framework to generalize to other recommendation tasks that deal with interdependent items.

References
Ying-Hong Chan and Yao-Chung Fan. 2019. A recurrent BERT-based model for question generation. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 154–162, Hong Kong, China. Association for Computational Linguistics.

Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 191–198.

Emily Denton, Jason Weston, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. 2015. User conditional hashtag prediction for images. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1731–1740.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.

Zhuoye Ding, Qi Zhang, and Xuan-Jing Huang. 2012. Automatic hashtag recommendation for microblogs using topic-specific translation model. In Proceedings of COLING 2012: Posters, pages 265–274.

Fréderic Godin, Viktor Slavkovikj, Wesley De Neve, Benjamin Schrauwen, and Rik Van de Walle. 2013. Using topic models for Twitter hashtag recommendation. In Proceedings of the 22nd International Conference on World Wide Web, pages 593–596.

Yeyun Gong, Qi Zhang, and Xuanjing Huang. 2018. Hashtag recommendation for multimodal microblog posts. Neurocomputing, 272:170–177.

Yuyun Gong and Qi Zhang. 2016. Hashtag recommendation using attention-based convolutional neural network. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16, pages 2782–2788. AAAI Press.

Mohadeseh Kaviani and Hossein Rahmani. 2020. EmHash: Hashtag recommendation using neural network based on BERT embedding. In 2020 6th International Conference on Web Research (ICWR), pages 113–118. IEEE.

Quanzhi Li, Sameena Shah, Armineh Nourbakhsh, Xiaomo Liu, and Rui Fang. 2016. Hashtag recommendation based on topic enhanced embedding, tweet entity data and learning to rank. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pages 2085–2088.

Yang Li, Ting Liu, Jingwen Hu, and Jing Jiang. 2019. Topical co-attention networks for hashtag recommendation on microblogs. Neurocomputing, 331:356–365.

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems, pages 289–297.

Minseok Park, Hanxiang Li, and Junmo Kim. 2016. HARRISON: A benchmark on hashtag recommendation for real-world images in social networks. arXiv preprint arXiv:1605.05054.

Surendra Sedhai and Aixin Sun. 2014. Hashtag recommendation for hyperlinked tweets. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 831–834.

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Yue Wang, Jing Li, Irwin King, Michael R. Lyu, and Shuming Shi. 2019. Microblog hashtag generation via encoding conversation contexts. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1624–1633.

Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. WSABIE: Scaling up to large vocabulary image annotation.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Gaosheng Wu, Yuhua Li, Wenjin Yan, Ruixuan Li, Xiwu Gu, and Qi Yang. 2018a. Hashtag recommendation with attention-based neural image hashtagging network. In International Conference on Neural Information Processing, pages 52–63. Springer.

Ledell Yu Wu, Adam Fisch, Sumit Chopra, Keith Adams, Antoine Bordes, and Jason Weston. 2018b. StarSpace: Embed all the things! In AAAI.

Jinxi Xu and W. Bruce Croft. 1996. Query expansion using local and global document analysis. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '96, pages 4–11, New York, NY, USA. Association for Computing Machinery.

Chao Yang, Xiaochan Wang, and Bin Jiang. 2020a. Sentiment enhanced multi-modal hashtag recommendation for micro-videos. IEEE Access, 8:78252–78264.

Qi Yang, Gaosheng Wu, Yuhua Li, Ruixuan Li, Xiwu Gu, Huicai Deng, and Junzhuang Wu. 2020b. AMNN: Attention-based multimodal neural network model for hashtag recommendation. IEEE Transactions on Computational Social Systems.

E. Zangerle, W. Gassler, and G. Specht. 2011. Recommending #-tags in Twitter. In Proceedings of the Workshop on Semantic Adaptive Social Web (SASWeb 2011), CEUR Workshop Proceedings, volume 730, pages 67–78.

Qi Zhang, Jiawen Wang, Haoran Huang, Xuanjing Huang, and Yeyun Gong. 2017. Hashtag recommendation for multimodal microblog using co-attention network. In IJCAI, pages 3420–3426.

Suwei Zhang, Yuan Yao, Feng Xu, Hanghang Tong, Xiaohui Yan, and Jian Lu. 2019. Hashtag recommendation for photo sharing services. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5805–5812.

Feng Zhao, Yajun Zhu, Hai Jin, and Laurence T. Yang. 2016. A personalized hashtag recommendation approach using LDA-based topic model in microblog environment. Future Generation Computer Systems, 65(C):196–206.