
NLP Style and Semantics Analysis

This document discusses style and semantics in natural language processing. It covers topics like stylometry, authorship attribution, style transfer, and meaning representations. It introduces concepts like Abstract Meaning Representation (AMR) and Minimal Recursion Semantics (MRS) as formal representations of meaning. It also discusses using neural models with supervised and unsupervised training to learn disentangled representations of form and meaning from text. Evaluation of such models includes style transfer, retrieval, and zero-shot prediction tasks.

Style, Semantics, and Other Things
Krishnapriya Vishnubhotla (KP)
Intro: Style in NLP

● Uniqueness of writing style

● Due to:
○ Lexical choices (big words vs small words)

○ Sentence structure (short and simple vs complex with clauses)

● Stylometry:
○ Surface features (word lengths, sentence lengths)

○ Lexical features (LIWC, number of hapax legomena)

○ Syntactic features (function word frequencies, PoS tag frequencies, parse tree features, character trigrams)

● Authorship attribution, plagiarism detection, digital forensics
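
A minimal sketch of how a few of the stylometric features listed above could be computed from raw text. The tokenisation rules and the tiny `FUNCTION_WORDS` set are assumptions made for illustration, not a standard inventory; LIWC, PoS tags and parse tree features need external resources and are left out.

```python
import re
from collections import Counter

# Tiny illustrative function-word list (an assumption, not a standard inventory).
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that"}

def stylometric_features(text):
    """Compute a handful of surface, lexical and syntactic-proxy features."""
    # Naive sentence and word splitting; real pipelines would use a tokenizer.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    n_words = max(len(words), 1)          # avoid division by zero on empty input
    n_sentences = max(len(sentences), 1)
    trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return {
        # Surface features
        "mean_word_length": sum(len(w) for w in words) / n_words,
        "mean_sentence_length": len(words) / n_sentences,
        # Lexical feature: words that occur exactly once
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
        # Function word frequency, relative to all words
        "function_word_rate": sum(counts[w] for w in FUNCTION_WORDS) / n_words,
        # Most frequent character trigrams
        "top_char_trigrams": trigrams.most_common(3),
    }

features = stylometric_features("The dog ate the bone. It was a big bone.")
```

Feature dictionaries like this one are what an authorship attribution classifier would typically take as input.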


Form and Meaning

● Text generation process:


○ A meaning, or content +
○ A form, or style
● Multiple surface realisations are possible for the same meaning
● Natural language corpora:
○ Complex vs simple wikipedia
○ Literary translations
● Closely related to: paraphrases
Paraphrases

● Paraphrase identification, generation


● Datasets: Quora Question Pairs, Microsoft Research Paraphrase Corpus, ParaNMT
● Semantic Textual Similarity tasks
NLP: Style Transfer

● Lots of work on style transfer in NLP

● “Style” → factor of variation


○ Sentiment
○ Attributes
○ Topics

● Usually guided by the dataset used.

● Problematic:
○ What should be preserved?
○ Adds to already problematic evaluation metrics
Complications

● There are no true synonyms -- “near-synonyms”


● Changing active to passive → change of focus
● Pragmatics -- viewpoint, framing, denotation, connotation, implication.
● Can draw some fuzzy boundaries between clusters of near-synonyms at a word-level
○ What about for phrases/sentences/documents?
● Style: Literary definition: what is “lost in translation”
Meaning Representations

● Formal representation of meaning/semantics


● Lots of CL research on logical forms, compositionality
● Two relatively recent projects I came across
○ Abstract Meaning Representation (AMR)
○ Minimal Recursion Semantics (MRS)
Abstract Meaning Representation

● Rooted, directed, (edge+leaf)-labelled graph


● Uses PropBank frames
● Example: “The dog is eating a bone.”

[Figure: AMR graph for the example, with its relations and variable / concept nodes labelled]
● “The dog ate the bone that he found.”

● Has ways to handle:


○ Coreference
○ Negation
○ Numbers/quantity
○ Names
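
As a rough illustration of the graph structure, the second example can be written as variable/concept pairs plus labelled relations. The PENMAN string in the comment follows the AMR guidelines; the Python data layout and the helper `reentrant_variables` are hypothetical names chosen for this sketch, not part of any AMR toolkit.

```python
from collections import Counter

# "The dog ate the bone that he found." in PENMAN notation:
#   (e / eat-01
#      :ARG0 (d / dog)
#      :ARG1 (b / bone
#               :ARG1-of (f / find-01 :ARG0 d)))

# Each variable is bound to a concept (PropBank frames carry a sense suffix).
instances = {"e": "eat-01", "d": "dog", "b": "bone", "f": "find-01"}

# Relations as (source, role, target) triples, with inverse roles normalised.
relations = [
    ("e", "ARG0", "d"),  # the dog is the eater
    ("e", "ARG1", "b"),  # the bone is the thing eaten
    ("f", "ARG1", "b"),  # normalised form of (b :ARG1-of f): the bone is found
    ("f", "ARG0", "d"),  # "he" reuses variable d, so the dog is the finder
]

def reentrant_variables(relations):
    """Variables that are the target of more than one relation."""
    incoming = Counter(target for _, _, target in relations)
    return sorted(v for v, n in incoming.items() if n > 1)

shared = reentrant_variables(relations)
```

Here `d` is shared because “he” corefers with the dog, and `b` is shared because the relative clause attaches to the bone. The simpler sentence “The dog is eating a bone.” keeps only the first two relations and has no shared variables.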
Generalisation capabilities

- The man described the mission as a disaster.
- The man’s description of the mission: disaster.
- As the man described it, the mission was a disaster.
- The man described the mission as disastrous.

All four sentences have the same AMR.

● Abstracts away morphological and syntactic variations.


● But does not handle synonyms
○ “afraid” and “terrified” are treated as different concepts.
● Useful?
○ Not yet.
○ Purpose: dataset to help develop algorithms that can generate AMRs.
Minimal Recursion Semantics

● Another formalism, used with phrase structure grammars (HPSG)


● More fine-grained
● Can distinguish tense and number.
● Practical utility:
○ Has a command-line parser you can use
○ Can generate simple paraphrases
Practical Utility

● Unlikely that they can parse many real-world sentences:


○ LIT paper: successful at 19.7% of SNLI sentences

● Using AMR to detect paraphrases:


○ ~85% on the Microsoft Paraphrase Corpus

● A separate research problem, not a tool to be used.


Back to Representation Learning

● Let us assume we have…

● Some proxy information for:
○ Form
○ Meaning

[Diagram: text t is mapped to a form vector (stylistic similarity) and a meaning vector (semantic similarity)]

Neural Models

● Modified Autoencoders
● Encode into two vectors
● Use both to reconstruct
● Restrict information using motivational/adversarial discriminators

[Diagram labels: paraphrases, semantic z, syntactic z, classifier]
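
A forward-pass sketch of the two-vector autoencoder idea, in plain Python with untrained random weights. Everything here is an assumption made for illustration: the class name `TwoVectorAutoencoder`, the linear encoders and decoder, and the toy input vectors. Training and the motivational/adversarial discriminators are deliberately left out.

```python
import random

def random_matrix(n_out, n_in, rng):
    """Untrained weights; a real model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

class TwoVectorAutoencoder:
    def __init__(self, n_input, n_form, n_meaning, seed=0):
        rng = random.Random(seed)
        self.form_encoder = random_matrix(n_form, n_input, rng)
        self.meaning_encoder = random_matrix(n_meaning, n_input, rng)
        # The decoder sees the concatenation of both vectors.
        self.decoder = random_matrix(n_input, n_form + n_meaning, rng)

    def encode(self, x):
        """Encode one input into a (form, meaning) pair of vectors."""
        return matvec(self.form_encoder, x), matvec(self.meaning_encoder, x)

    def decode(self, form, meaning):
        """Reconstruct the input from both vectors."""
        return matvec(self.decoder, form + meaning)

    def reconstruction_loss(self, x):
        """Mean squared error between the input and its reconstruction."""
        x_hat = self.decode(*self.encode(x))
        return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

model = TwoVectorAutoencoder(n_input=4, n_form=2, n_meaning=3)
text_a = [1.0, 0.0, 0.5, 0.2]   # toy stand-ins for encoded sentences
text_b = [0.0, 1.0, 0.1, 0.9]
form_a, meaning_a = model.encode(text_a)
form_b, meaning_b = model.encode(text_b)

# Style transfer by swapping: keep the meaning of A, borrow the form of B.
transferred = model.decode(form_b, meaning_a)
```

In a trained model, the discriminators would be extra losses that push style information out of the meaning vector and content information out of the form vector.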
What kinds of supervision?

● Style class labels
● Paraphrases
● Heuristic info:
○ BoW for content
● Syntax: Syntax tree features
○ Tree edit distance

Datasets
● Paraphrase datasets
● Parallel style transfer datasets
○ Formality
○ Diachronic language change
● Data-to-text datasets
○ ~Synthetic
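
One way the “BoW for content” heuristic could act as a proxy signal for meaning is cosine similarity between bag-of-words counts. This is a minimal sketch under that assumption; the tokeniser and the function names are illustrative, and the example sentences are taken from the AMR slide.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercased word counts; word order and syntax are discarded."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two count vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

original = bag_of_words("The man described the mission as a disaster.")
paraphrase = bag_of_words("As the man described it, the mission was a disaster.")
unrelated = bag_of_words("The dog ate the bone.")

paraphrase_score = cosine(original, paraphrase)
unrelated_score = cosine(original, unrelated)
```

The paraphrase scores higher than the unrelated sentence, but shared function words such as “the” still give the unrelated pair a non-zero score, which is one reason this is only a heuristic.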
Synthetic Dataset: PersonageNLG

● Personality model might be questionable


● BUT gives us two neat dimensions of variation.
All the losses later….

Evaluation:
● Style transfer (swap variables + generate)
● Retrieval
● Prediction (kNN)
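
The retrieval and kNN prediction evaluations can be sketched over any set of learned vectors. The toy two-dimensional “form vectors” and the formal/informal labels below are made up for illustration, and Euclidean distance is an assumption; cosine distance would work the same way.

```python
import math
from collections import Counter

def nearest(query, vectors, k):
    """Retrieval: indices of the k vectors closest to the query."""
    order = sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))
    return order[:k]

def knn_predict(query, vectors, labels, k=3):
    """Prediction: majority label among the k nearest neighbours."""
    votes = Counter(labels[i] for i in nearest(query, vectors, k))
    return votes.most_common(1)[0][0]

# Toy form vectors with made-up style labels.
vectors = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
           [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
labels = ["informal"] * 3 + ["formal"] * 3

predicted = knn_predict([0.95, 1.0], vectors, labels, k=3)
closest = nearest([0.0, 0.1], vectors, k=1)
```

If the form vector really captures style, neighbours in form space should share style labels regardless of content, and the reverse should hold for the meaning vector.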
More supervision == better representations

● Kinda boring
● Just train a separate supervised model for each end-goal?
● Style transfer:
○ Generation problems
○ Evaluation problems
● Real-world text: not so cleanly separable.
:(
What would be interesting?

● Unsupervised disentanglement?
○ beta-VAE in vision
○ At least for the synthetic dataset
● Evaluating the representations:
○ Probe for linguistic knowledge/features
○ Robust to “noise”? → domain adaptation/zero-shot prediction
● Using pre-trained models?
● (TBD) Should the latent spaces be entirely unrelated?
○ Where do style and semantics intersect?
○ What is a “latent space of sentences” anyway?
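
For reference, the beta-VAE objective mentioned above, in its usual form. The symbols are the standard ones from the VAE literature and do not appear in the slides: an encoder q_φ(z | x), a decoder p_θ(x | z), and a prior p(z).

```latex
\mathcal{L}(\theta, \phi; x, \beta) =
  \mathbb{E}_{q_\phi(z \mid x)}\left[ \log p_\theta(x \mid z) \right]
  - \beta \, D_{\mathrm{KL}}\left( q_\phi(z \mid x) \,\|\, p(z) \right)
```

Setting β = 1 recovers the standard VAE objective; β > 1 puts more weight on matching the factorised prior, which is the pressure that is meant to encourage disentangled latent dimensions.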
