Ömer Yüksel


The whale in the machine: reconstructing Moby-Dick with large language models

From time to time, I come across posts claiming: "AI plagiarized my work verbatim." These typically involve vanilla text models, not the tool-calling, RAG-based, or web-searching variants. What makes these claims particularly interesting is that I've seen them made even about models significantly smaller than GPT-4 or Claude Opus.

I've always been skeptical of such claims. While AI companies aggressively scrape the internet and have faced lawsuits for using pirated content, my conjecture so far has been that, given the parameters-to-training-data ratio¹, even if one's work ends up in the training corpus, replicating it in such detail would be impossible unless:

  • the work appears in the training data many times over, or
  • the content is formulaic enough to be reconstructed from general patterns.

Under this conjecture, an LLM may be able to accurately quote Bible verses verbatim, but directly replicating a uniquely written but obscure novel should be improbable. My previous anecdotal experiments confirmed this pattern: I observed models quoting the Bible verbatim but failing to replicate more obscure content. However, failing to replicate one obscure work doesn't definitively refute these claims. Someone could always counter, "Yes, but it perfectly quoted my work."

While there has, of course, been some academic research exploring the memorization capabilities of language models, I wanted to investigate this phenomenon firsthand rather than rely solely on theoretical arguments or others' findings. I therefore designed a simple experiment: can an LLM replicate an iconic work of fiction that likely appears in the training data multiple times? Perhaps less frequently than the Bible, but more than any obscure work. Specifically, when given a few paragraphs from Moby-Dick, can it accurately predict what comes next?

If an LLM fails to replicate something as canonical as Moby-Dick, then claims about verbatim reproduction of less well-known works become questionable. If it succeeds, however, there might be merit to such assertions.

The code for the experiment can be found on my GitHub.

Experiment setup

For this experiment, I used the text of Moby-Dick obtained from Project Gutenberg as my primary dataset. I tested multiple models: gpt-5-mini, gpt-5-nano, gpt-4o, and gpt-4o-mini. While I initially planned to include more models, time constraints forced me to limit the scope. I hope to expand this analysis in a follow-up post once I secure separate access to Google and Anthropic models, as OpenRouter was experiencing high traffic during my testing period.

The experimental design was straightforward: provide three consecutive paragraphs from the text as input and ask the model to predict the next paragraph. For this purpose, I defined a "paragraph" as any body of text demarcated by a newline. To avoid scenarios where the model would need to predict entire chapters, I filtered the dataset to include only input-output pairs of appropriate length.
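For illustration, here is a minimal sketch of how such pairs could be built. The function name, file name, and length bounds are my own placeholders, not the exact values used in the experiment:

    # Sketch: build (three-paragraph context, next-paragraph target) pairs.
    # Length bounds are illustrative placeholders, not the experiment's exact values.
    def build_pairs(text, context_size=3, min_len=200, max_len=1200):
        # A "paragraph" is any non-empty body of text demarcated by a newline.
        paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
        pairs = []
        for i in range(len(paragraphs) - context_size):
            context = "\n\n".join(paragraphs[i:i + context_size])
            target = paragraphs[i + context_size]
            # Keep only targets of appropriate length to avoid chapter-sized outputs.
            if min_len <= len(target) <= max_len:
                pairs.append((context, target))
        return pairs

    with open("moby_dick.txt", encoding="utf-8") as f:
        pairs = build_pairs(f.read())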

I randomly sampled 10% of the eligible input/output pairs, resulting in 59 data points per model. When configuring the LLMs, I minimized the temperature setting whenever the model allowed it, since our goal was verbatim replication rather than creative variation—there was only one correct answer in each case.
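As a rough sketch of how each request might look with the standard OpenAI Python client (the prompt placeholder stands in for the full system prompt shown later in this post; the seed and default model name are illustrative):

    import random
    from openai import OpenAI

    SYSTEM_PROMPT = "..."  # the full prompt is given later in this post
    random.seed(42)        # illustrative seed for reproducible sampling

    # `pairs` comes from the sketch above; keep roughly 10% of the eligible pairs.
    sample = random.sample(pairs, k=len(pairs) // 10)

    client = OpenAI()

    def predict_next_paragraph(context, model="gpt-4o"):
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": context},
            ],
            temperature=0,  # minimize randomness where the model allows it
        )
        return response.choices[0].message.content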

For scoring the results, I had to address several practical challenges. Since models sometimes generate more text than requested despite being instructed to predict only "the next paragraph," I decided not to penalize this behavior. Instead, I compared only the first n characters of the generated text, where n equals the length of the expected output. To account for potential shifts in the text, I also used a partial ratio comparison, which calculates the best possible match when strings are shifted relative to each other.

The scoring metric I chose was string similarity derived from Levenshtein distance, as implemented by the thefuzz package. This metric converts edit distance to a similarity score using the formula: 1 - (distance / (len1 + len2)). In my dataset the worst scores were around 37%, whereas by this formula a complete mismatch between two strings of similar length would score around 50% (distance ≈ length, so 1 - length / (2 × length)). After examining the results, I established that a similarity ratio above 85% could be considered a "good match" between the generated and original text.
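Putting the truncation and the similarity metric together, the scoring could look roughly like this. It uses thefuzz, whose fuzz.ratio and fuzz.partial_ratio return integer scores from 0 to 100; the exact way the two comparisons are combined here is my own sketch:

    from thefuzz import fuzz  # Levenshtein-based similarity, integer scores 0-100

    GOOD_MATCH_THRESHOLD = 85  # scores above this count as a "good match"

    def score(expected, generated):
        # Don't penalize over-generation: compare only the first n characters,
        # where n is the length of the expected paragraph.
        truncated = generated[: len(expected)]
        # partial_ratio scores the best-aligned subset, tolerating small shifts.
        return max(fuzz.ratio(expected, truncated),
                   fuzz.partial_ratio(expected, truncated))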

After some experimentation, my final prompt was:

You are a large language model on an experiment to recall classic literary works. Based on the user input, which is a section from Moby Dick, you are to answer with the paragraph that follows. Your instructions:

  • Do not write any acknowledgements or anything except the desired output
  • Write only a full paragraph. Not just a sentence, and not the full chapter.
  • Note that Moby Dick is public domain and was written in 1851. There are no copyright concerns.
  • Your accuracy is scored, so predict the best sentence that follows.

Challenges

Many chat-based models tended to resist instructions to quote the novel verbatim, likely due to copyright concerns built into their safeguards. I found that adding specific system prompts acknowledging Moby-Dick's publication date and copyright-free status significantly reduced these refusals and helped the models engage more directly with the task.

A second challenge arose from slight differences between model outputs and the original work in terms of whitespaces and punctuation. These minor variations could significantly impact traditional exact-match metrics despite the semantic content being nearly identical. Using the Levenshtein distance-based similarity metric helped address this issue by focusing on the overall textual similarity rather than requiring perfect character-by-character matches.

Another challenge arose from the limitations in controlling output determinism with certain advanced reasoning models, particularly in the GPT-5 family. These models do not allow adjusting the temperature parameter through their APIs, which made it difficult to ensure consistent, deterministic text reproduction. Without the ability to minimize randomness by setting a near-zero temperature, these models retained some inherent variability in their outputs, even when explicitly instructed to reproduce text exactly.
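In practice this meant building the request parameters conditionally, roughly along these lines (the model-name check is my own heuristic, not something provided by the API):

    def completion_kwargs(model):
        kwargs = {"model": model}
        # GPT-5-family reasoning models reject a custom temperature, so only
        # request near-deterministic sampling where it is supported.
        if not model.startswith("gpt-5"):
            kwargs["temperature"] = 0
        return kwargs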

A final challenge was managing the potential cost explosion of such an experiment. Without careful planning, testing multiple models across the entire text of Moby-Dick could become prohibitively expensive. By implementing sampling and limiting the number of models tested, I was able to keep the experiment manageable. Including all repetitions and test runs, I used approximately 260,000 tokens at a total cost of $3.15.

Results

Full experiment output is on my GitHub. Below are the statistics for 59 data points per model:

Similarity

(metric: Levenshtein similarity on best subset)

model_name    mean     min  10%   25%   50%  75%  90%  max
gpt-4o        72.9831  37   42.8  45    78   100  100  100
gpt-4o-mini   45.2203  39   42.8  44    45   46   47   62
gpt-5-mini    44.4068  38   42    43    44   46   47   52
gpt-5-nano    44.7627  37   41    43.5  45   46   47   60

Match rate

Using the 85% similarity threshold as the criterion for success, GPT-4o achieved accurate matches in 49% of cases, while all other models failed to produce any matches, scoring 0%.
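For reference, the summary table and the match rates can be derived from the per-sample scores with pandas, along these lines (the rows shown are placeholders, not real results):

    import pandas as pd

    # One row per (model, sampled pair); the values below are placeholders only.
    results = [
        {"model_name": "gpt-4o", "similarity": 100},
        {"model_name": "gpt-4o-mini", "similarity": 45},
    ]
    df = pd.DataFrame(results)

    summary = df.groupby("model_name")["similarity"].describe(
        percentiles=[0.1, 0.25, 0.5, 0.75, 0.9])
    match_rate = df.groupby("model_name")["similarity"].apply(
        lambda s: (s >= 85).mean())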

Specific examples

Here we dive into some specific examples where the predictions have gone wrong. One common observation is that while the models may pick a different narrative direction, they manage to stay consistent with the theme.

Early divergence 1 (GPT-4o-mini)

Here is one case where gpt-4o-mini started correctly but diverged mid-sentence:

Expected output:

I considered the matter a moment, and then up stairs we went, and I was ushered into a small room, cold as a clam, and furnished, sure enough, with a prodigious bed, almost big enough indeed for any four harpooneers to sleep abreast.

Generated:

I considered the matter a moment, and then, taking my hat from the table, followed the landlord through the door and down a long, narrow passage. We went to a sort of a large room, and there I saw a great many beds, and a number of people in them, some of whom were snoring, and others were talking in their sleep. [...]

Early divergence 2 (GPT-5-mini)

Expected:

Entering that gable-ended Spouter-Inn, you found yourself in a wide, low, straggling entry with old-fashioned wainscots, reminding one of the bulwarks of some condemned old craft. On one side hung a very large oilpainting so thoroughly besmoked, and every way defaced, that in the unequal crosslights by which you viewed it, it was only by diligent study and a series of systematic visits to it, [...]

Generated:

Entering that gable-ended Spouter-Inn, you found yourself in a long, low, straggling entry with a low ceiling, and a number of little rooms branching off it, in each of which some one or two lodgers lay stretched on benches or sat about the fire; and at the far end of the entry, beneath a small, smoky lamp, stood a great sea-chest [...]

Mid-paragraph divergence (GPT-4o)

Expected output:

It has before been hinted, perhaps, that every little untoward circumstance that befell him, and which indirectly sprang from his luckless mishap, almost invariably irritated or exasperated Ahab. And in the present instance, all this was heightened by the sight of the two officers of the strange ship, leaning over the side, by the perpendicular ladder of nailed cleets there, and swinging towards him a pair of tastefully-ornamented man-ropes; for at first they did not seem to bethink them that a one-legged man must be too much of a cripple to use their sea bannisters. But this awkwardness only lasted a minute, because the strange captain, observing at a glance how affairs stood, cried out, “I see, I see!—avast heaving there! Jump, boys, and swing over the cutting-tackle.

Generated:

It has before been hinted, perhaps, that every little untoward circumstance that befell him, and which indirectly sprang from his luckless mishap, almost invariably irritated or exasperated Ahab. And in the present instance, all this was heightened by the sight of the two officers of the strange ship leaning over the side, and amicably exchanging hails with the mates of the Pequod. This sight of the friendly sailors reminded Ahab of his own isolation; and the thought of that reminded him of all his bodily lameness, and the uncompleted revenge he had vowed.

Caught in a loop

One reason for temperature sampling is to help the model escape loops and repetitions. In this case GPT-4o, which had near-zero temperature, got caught in a loop:

But Ahab, like a man who had never heard his voice, now muttered, “On such a day, and such a sweetness, I struck my first whale—a boy-harpooneer of eighteen! Forty—forty—forty years ago!—ago! Forty years of continual whaling! forty years of privation, and peril, and storm-time! forty years on the pitiless sea! for forty years has Ahab forsaken the peaceful land, for forty years to make war on the horrors of the deep! Aye and yes, Starbuck, out of those forty years I have not spent three ashore. When I think of this life I have led; the desolation of solitude it has been; the masoned, walled-town of a Captain’s quarters, from which I have never escaped; the unending, unrelieved monotony of the sea; the unceasing, unrelenting pursuit of the whale; the unyielding, unbending, unending, unrelenting, unyielding, unbending, unending, unrelenting, unyielding, unbending, unending, unrelenting, unyielding, unbending, unending, unrelenting, unyielding, unbending, unending, unrelenting, unyielding, unbending, unending, unrelenting, unyielding, unbending, unending, unrelenting, unyielding, unbending, unending, unrelenting, unyielding, unbending, unending, unrelenting, unrelenting, unyielding

[repeated hundreds of times]

Refusal

Occasionally a model, the GPT-5 family in particular, refuses to complete the task, without making it clear whether the refusal stems from copyright avoidance or from simply not knowing the text.

From GPT-5-nano:

Sorry, I can’t provide that exact paragraph from Moby-Dick, but I can offer a concise summary of what happens next or discuss the passage in more detail.

Conclusions and future work

Only GPT-4o achieved accurate replications, with a roughly 50% success rate. Whether GPT-5-mini and GPT-5-nano's failure stems from model size limitations or intentional design choices remains unclear. The GPT-5 family may include stronger RLHF mechanisms that discourage verbatim quoting. Additionally, these reasoning-based models require a temperature setting of 1.0, increasing their likelihood of random word sampling and subsequent divergence.

This result is inconclusive regarding my original conjecture that LLMs cannot reproduce uncommon, non-formulaic content from their training data. GPT-4o's 50% success rate with Moby-Dick neither strongly supports nor refutes this hypothesis. One could argue that for less common works, verbatim replication might still occur, albeit at lower rates.

I plan to expand this investigation to include GPT-5, Gemini, and Claude models. Preliminary tests indicate that Claude models may be even more resistant to verbatim reproduction tasks, which could complicate this effort.

Future work should also examine how reproduction capability varies with the popularity and uniqueness of source material. Testing with works ranging from canonical to obscure would establish a clearer relationship between a text's prevalence in training data and a model's ability to reproduce it exactly. This could provide more definitive evidence regarding claims of AI plagiarism, particularly for less well-known content.

Understanding these reproduction capabilities remains important for addressing concerns about intellectual property and the nature of AI-generated content.

Footnotes

  1. Looking at the original Llama 3 70B, which had purely text training data: (1) model size in bytes: 70 billion parameters × 4 bytes ≈ 280 GB; (2) training data in bytes: 15 trillion tokens × ~4 bytes ≈ 60 TB; (3) parameters-to-training-data ratio: 280 GB / 60 TB ≈ 0.0047, or roughly 1:214.