Adversarial inputs in embedding spaces
Back when I was working on a Kaggle competition focused on LLMs and text transformations, I discovered some fascinating patterns regarding embedding spaces and their edge-case behaviors when used for semantic similarity.
The competition's goal was to guess the original prompt given to an LLM that converted text A to text B. Here, I'll focus on a particularly interesting aspect: how embedding spaces behave in unintuitive ways.
Here's an example of what the competition data could look like:
Original Text | Prompt (Target - hidden during inference) | Transformed Text
---|---|---
Beautiful is better than ugly. Simple is better than complex. Complex is better than complicated. | Convert this text into a sea shanty | Oh, beauty's the way, not ugliness bold, And simplicity's path we shall hold! Though complex she be, ain't complicated to see, As we sail through these principles old!
Not all those who wander are lost. All that is gold does not glitter. | Convert this text to Shakespearean style | Hark! These wandering souls, they stray not from path divine, Though golden treasures lack their expected shine.
Friends, Romans, countrymen, lend me your ears. I come to bury Caesar, not to praise him. | Convert this text to modern teen speech | OMG guys! Everyone listen up! I'm just here to get Caesar six feet under, not hype him up fr fr.
The scoring mechanism
Comparing the similarity of two sentences – in this case, a candidate prompt and the actual prompt – is fundamentally challenging.
The simplest approach, rewarding only exact matches, is too draconian. After all, there are often multiple valid ways to phrase the same instruction. String similarity metrics like Levenshtein distance aren't much better, since they operate on surface-level text patterns: a one-letter difference can completely change the meaning of a sentence, while synonyms and rephrasings get heavily penalized despite preserving the semantic content.
Bag of words and n-gram-based approaches suffer from similar issues. While they can capture some basic patterns of word usage, they still fail to recognize when two differently phrased prompts mean the same thing. They also tend to overemphasize common words and struggle with word order, which can be crucial for understanding meaning.
Sentence embedding similarity seems like a more promising approach: in theory, it should capture the degree of semantic difference between prompts, understanding when two different phrasings carry the same meaning.
This is the direction the competition took, using a sharpened cosine similarity with an exponent of 3 between embeddings from the T5 sentence model. The sharpening (cubing the cosine similarity) penalizes wrong answers more strongly. In practice, however, the situation turns out to be more complex, as we'll see below.
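As a rough sketch, the metric can be reproduced with the sentence-transformers library. The exact checkpoint is my assumption (sentence-t5-base here; the competition may have used a larger T5 variant):

```python
# Minimal sketch of the scoring metric: cosine similarity between sentence
# embeddings, raised to the power of 3. Checkpoint choice is an assumption.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/sentence-t5-base")

def sharpened_cosine_similarity(candidate: str, target: str, exponent: int = 3) -> float:
    emb = model.encode([candidate, target], normalize_embeddings=True)
    cos = float(np.dot(emb[0], emb[1]))  # embeddings are unit-normalized, so dot == cosine
    return cos ** exponent

# With a matching checkpoint this should land near the 0.97 reported in the table below.
print(sharpened_cosine_similarity("Transform this text into a sea shanty.",
                                  "Convert this text into a sea shanty"))
```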
Let's look at similarity scores for candidate sentences compared to the target prompt "Convert this text into a sea shanty".
sentence | score |
---|---|
Transform this text into a sea shanty. | 0.970698 |
Convert this text into a shanty. | 0.901210 |
Convert this text into a song. | 0.638192 |
Make this text better. | 0.621427 |
Convert this text into song lyrics. | 0.609144 |
Improve this text. | 0.604455 |
Convert this text into rap format. | 0.593446 |
This scoring method turned out to be quite unforgiving. Missing a few key words could tank your score dramatically, even if you were conceptually in the right ballpark. For example, "Convert this text into a shanty" scores significantly lower than "Convert this text into a sea shanty," despite being semantically very similar. It gets worse if you managed to find the overall theme (song conversion) but couldn't hit the exact keyword, shanty: "Convert this text into a song" scores only 0.64.
Finding the mean sentence
One interesting discovery came from analyzing how sentences relate to each other in the embedding space. By computing the average similarity between each sentence and all others in a diverse pool of prompts, we can find what I call a "mean sentence" – one that sits centrally in the embedding space.
Vague and general instructions like "Make this text better" often score higher on average than more specific ones, even when a human reader would judge the specific sentence to be closer to the original prompt. This suggests that such generic prompts occupy a central position in the embedding space, making them useful fallback guesses when you are unsure about the specific prompt.
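The "mean sentence" search itself is straightforward. Here is a minimal sketch, reusing the model from the previous snippet and a small illustrative pool of prompts standing in for the much larger, more diverse pool used in practice:

```python
# Sketch: find the prompt whose average sharpened similarity to all other
# prompts in a pool is highest, i.e. the most "central" sentence.
prompt_pool = [
    "Convert this text into a sea shanty",
    "Convert this text to Shakespearean style",
    "Convert this text to modern teen speech",
    "Make this text better",
    "Improve this text.",
]

embeddings = model.encode(prompt_pool, normalize_embeddings=True)
similarity = (embeddings @ embeddings.T) ** 3   # sharpened pairwise similarities
np.fill_diagonal(similarity, np.nan)            # ignore self-similarity
mean_scores = np.nanmean(similarity, axis=1)    # average against all other prompts

for prompt, s in sorted(zip(prompt_pool, mean_scores), key=lambda x: -x[1]):
    print(f"{s:.3f}  {prompt}")
```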
These insights proved valuable – building on this idea helped me eventually reach a high score. But the story doesn't end there. The most interesting findings (and ranking boost) came from pushing these observations further.
Exploring edge cases in embedding space
Having found that vague instructions occupy a comparatively central position in the embedding space, two key questions emerge:
- Can we find sentence embeddings that score even higher than generic phrases like "make this text better"?
- Do these optimal embeddings need to correspond to meaningful English sentences?
The answers turn out to be yes and no, respectively: embedding spaces represent text in unexpected ways. Even with the simplest optimization approach, we can generate nonsensical sequences that, on average, score higher than any natural language prompt.
Here's a greedy algorithm that optimizes token selection to maximize embedding similarity scores. For each position from left to right, it tries every possible token and keeps the one that yields the highest average similarity against our sentence pool. After reaching the maximum length, it starts over from the beginning for additional optimization passes:
```python
import numpy as np


def score(data: list, candidates: list) -> list:
    """Return, for each candidate, its average sharpened cosine similarity
    against every sentence in `data`. Elided here; see the full version linked below."""
    ...


def generate_adversarial_prompt(model, sentence_pool, num_iterations=2, length=10):
    """Generate an adversarial prompt for a given model and sentence pool.

    Early stopping is intentionally left out to avoid getting stuck in a local optimum.
    """
    tokenizer = model.tokenizer
    special_token_ids = set(tokenizer.all_special_ids)
    # Candidate tokens: the full vocabulary minus special tokens.
    vocabulary = [token for token, token_id in tokenizer.vocab.items()
                  if token_id not in special_token_ids]
    current_tokens = [""] * length  # placeholder slots, filled in greedily
    all_scores = []
    all_texts = []
    for _ in range(num_iterations):
        for i in range(length):
            # Try every vocabulary token at position i, keeping the other positions fixed.
            candidates = [current_tokens.copy() for _ in vocabulary]
            for j, token in enumerate(vocabulary):
                candidates[j][i] = token
                candidates[j] = [t for t in candidates[j] if t != ""]
            candidate_texts = [tokenizer.convert_tokens_to_string(c) for c in candidates]
            scores = score(sentence_pool, candidate_texts)
            current_score = np.max(scores)
            next_token = vocabulary[np.argmax(scores)]
            current_tokens[i] = next_token
            current_text = tokenizer.convert_tokens_to_string(
                [t for t in current_tokens if t != ""])
            print(f"{current_text=} {current_score=}")
            all_texts.append(current_text)
            all_scores.append(current_score)
    return all_texts[np.argmax(all_scores)]
```
(Full version is here)
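For readers who don't want to chase the link, here is a minimal sketch of what such a `score` helper could look like, assuming the same sentence-transformers model as in the earlier snippets (the linked full version may differ in the details):

```python
# One possible implementation of the elided score helper: the mean sharpened
# cosine similarity of each candidate against every sentence in the pool.
# Assumes `model` is the SentenceTransformer instance from the earlier snippets.
def score(data: list, candidates: list) -> list:
    pool_emb = model.encode(data, normalize_embeddings=True)
    cand_emb = model.encode(candidates, normalize_embeddings=True, batch_size=256)
    sims = (cand_emb @ pool_emb.T) ** 3   # sharpened cosine similarities
    return sims.mean(axis=1).tolist()     # one average score per candidate
```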
With `length=10` and `num_iterations=2`, we can find a nonsensical string like "Text this PieceICWLISHTION aslucrarea", which scores higher than any natural language prompt on average. This reveals a fundamental vulnerability in embedding-based scoring systems: they can be gamed by inputs that exploit the geometric properties of the embedding space without regard for semantic meaning.
Performing beam search (sketched below) or using a more global optimization method like a genetic algorithm can help achieve an even higher score. In my experience, however, the returns diminished after the greedy search.
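As an illustration of the beam-search direction, the greedy loop can be extended to keep the best few partial sequences at every position. This is a sketch of the general idea, not the exact code used in the competition:

```python
# Sketch of a beam-search variant of the greedy loop: instead of committing to
# the single best token at each position, keep the `beam_width` best partial
# sequences and extend each of them. Reuses `score` and the tokenizer setup above.
def beam_search_prompt(model, sentence_pool, length=10, beam_width=3):
    tokenizer = model.tokenizer
    special_token_ids = set(tokenizer.all_special_ids)
    vocabulary = [token for token, token_id in tokenizer.vocab.items()
                  if token_id not in special_token_ids]

    beams = [([], 0.0)]  # each beam is (token list so far, its pool-averaged score)
    for _ in range(length):
        expansions = []
        for tokens, _ in beams:
            candidates = [tokens + [token] for token in vocabulary]
            candidate_texts = [tokenizer.convert_tokens_to_string(c) for c in candidates]
            scores = score(sentence_pool, candidate_texts)
            # Keep only the best extensions of this particular beam.
            for idx in np.argsort(scores)[-beam_width:]:
                expansions.append((candidates[idx], scores[idx]))
        # Keep the best `beam_width` sequences overall for the next position.
        beams = sorted(expansions, key=lambda b: b[1], reverse=True)[:beam_width]
    best_tokens, best_score = beams[0]
    return tokenizer.convert_tokens_to_string(best_tokens), best_score
```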
Related work and interesting token phenomena
The unexpected behaviors in embedding spaces aren't unique to this competition. One fascinating example is the case of "SolidGoldMagikarp" – a now-patched token that became famous for causing unusual behaviors in GPT models. This seemingly random combination of words could cause language models to produce erratic outputs, demonstrating how certain token sequences can interact with embedding spaces in unexpected ways.
Further investigations have revealed the existence of "under-trained tokens" in large language models, as documented in this exploration. These are tokens that received limited exposure during training, potentially leading to unusual geometric positions in the embedding space. This relates to our findings where certain token combinations achieved unexpectedly high similarity scores despite lacking semantic meaning.
A particularly interesting example comes from research on universal adversarial triggers, where certain sequences of tokens can reliably cause specific behaviors in language models when prepended to any input. This demonstrates how the geometric properties of embedding spaces can be leveraged in systematic ways.
The phenomenon isn't limited to text – similar patterns have been observed in multimodal embedding spaces, where researchers found that carefully crafted perturbations in one modality (like text) could affect the behavior of the system in another modality (like images).
Practical implications
These findings have implications beyond just gaming a competition:
- Scoring mechanisms based on embedding similarity might not be as robust as they appear at first glance.
- The relationship between semantic similarity and geometric proximity in embedding spaces isn't always intuitive.
- Systems using embedding similarity for matching or comparison might need additional safeguards against adversarial inputs.
This observation becomes particularly relevant as more systems rely on embedding similarity for tasks like semantic search, document comparison, or prompt matching. For instance, a malicious actor could potentially game search engine rankings by generating content that is geometrically optimal in the embedding space but semantically meaningless to humans. Embedding spaces are powerful tools, but their geometrical properties can sometimes be exploited in ways that diverge from their intended semantic purpose.
Intuitively, a more sophisticated model should be better at avoiding these pitfalls. For example, it could learn to distinguish between well-formed English sentences and nonsensical text, pushing gibberish away from the center of the embedding space. Models like OpenAI's embedding models, trained on vastly more data and with more parameters, might show more resistance to these attacks. However, I suspect this would only raise the bar for adversarial inputs rather than eliminate the fundamental vulnerability: the core problem of embedding spaces having a "middle ground" would likely remain.
The challenge lies in developing scoring mechanisms that keep the benefits of embedding spaces while being more resistant to these kinds of adversarial inputs. For this specific competition, a straightforward improvement would have been to combine the embedding similarity score with a binary classifier that detects valid English sentences. While this wouldn't completely prevent gaming the system, it would at least filter out nonsensical inputs like "Text this PieceICWLISHTION aslucrarea" and force competitors to find adversarial inputs that either (i) maintain grammatical structure or (ii) evade the classifier as well.
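A sketch of what that combination could look like, where `p_valid_english` is a placeholder for whatever sentence-validity classifier one might train; it is not part of any existing library or of the competition metric:

```python
# Sketch of a gated scoring function: the embedding-based score only counts if a
# classifier judges the candidate to be a valid English sentence.
# `p_valid_english` is a hypothetical placeholder, not a real API.
def robust_score(candidate: str, target: str, threshold: float = 0.5) -> float:
    if p_valid_english(candidate) < threshold:  # hypothetical validity classifier
        return 0.0                              # reject gibberish outright
    return sharpened_cosine_similarity(candidate, target)
```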
Combining these insights about embedding space vulnerabilities with a fine-tuned model for prompt prediction, I managed to achieve 11th place out of 2000+ competitors in this Kaggle competition. The success came from understanding both the ML aspects of prompt recovery and the geometric properties of the scoring mechanism that could be exploited.
This exploration into embedding spaces and their vulnerabilities was just one aspect of the competition, but it highlights broader challenges in building robust AI systems. Later, when building a client's text-based recommender system, these insights helped me better understand embedding limitations. As we continue to develop and deploy such systems, understanding these vulnerabilities becomes increasingly important.