We Rated 29,875 Books by Vocabulary Difficulty
Ulysses is harder than Pride and Prejudice. Middlemarch is harder than Emma. We can now prove both — mathematically.
The Question
Every reader has an intuition about which books are “harder” than others. Ulysses has a reputation for difficulty. Pride and Prejudice is considered more accessible. But what does “harder” actually mean, and can it be measured precisely?
Most attempts to measure reading difficulty rely on sentence length and word length (characters or syllables per word), metrics geared toward grading school texts, not adult literature. They would tell you that a book using short sentences and short words is “easy” even if its vocabulary is exceptionally rare.
We took a different approach.
The Method
Lemmerly maintains an ELO rating for every word in its corpus — a precise measure of how rare and difficult that word is, calibrated against the performance of thousands of players. A word rated 1600 is one that a Lex Expert player would typically know, but a Word Apprentice would miss.
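How a word's rating gets calibrated against players is not spelled out here. As a rough sketch only, assuming a textbook Elo update in which each attempt is treated as a match between a player and a word: the 400-point scale and K-factor of 32 are the conventional chess defaults, and the function names are hypothetical, not Lemmerly's actual code.

```python
def expected_score(player_rating: float, word_rating: float) -> float:
    """Chance the player knows the word, under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((word_rating - player_rating) / 400.0))


def update_ratings(player_rating: float, word_rating: float,
                   player_correct: bool, k: float = 32.0) -> tuple:
    """Return (new_player_rating, new_word_rating) after one attempt.

    Assumption: a correct answer counts as a win for the player and a
    loss for the word, so the two ratings move by equal and opposite amounts.
    """
    expected = expected_score(player_rating, word_rating)
    actual = 1.0 if player_correct else 0.0
    delta = k * (actual - expected)
    return player_rating + delta, word_rating - delta
```

Under these assumptions, a word that stronger players keep missing drifts upward, which is what makes a rating like 1600 meaningful as a difficulty level.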
Project Gutenberg, the world's largest library of free public-domain books, publishes word frequency counts for every book in its collection. The Standardised Project Gutenberg Corpus contains these counts for 55,905 books.
For each book, we matched its words against our vocabulary corpus, collected the ELO ratings of every matched word, and computed the 80th percentile: the ELO level below which 80% of the book's matched vocabulary falls. That percentile becomes the book's rating.
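The steps above can be sketched in a few lines. This is a minimal illustration, not the production pipeline: it assumes the percentile is taken over distinct matched words (not weighted by how often each appears) and uses the nearest-rank definition of a percentile. The names `book_rating`, `word_counts` and `word_elo` are hypothetical.

```python
import math
from typing import Optional


def book_rating(word_counts: dict, word_elo: dict, pct: float = 80.0) -> Optional[float]:
    """Rate a book as the pct-th percentile of its matched words' ELO ratings.

    word_counts: word -> occurrences in the book (e.g. Gutenberg frequency counts)
    word_elo:    word -> ELO rating from the vocabulary corpus
    Words with no rating are skipped. Returns None if nothing matched.
    """
    ratings = sorted(word_elo[w] for w in word_counts if w in word_elo)
    if not ratings:
        return None
    # Nearest-rank percentile: the smallest rating with at least pct% of
    # matched words at or below it (rank is 1-indexed).
    rank = max(1, math.ceil(pct * len(ratings) / 100.0))
    return ratings[rank - 1]
```

For example, a book whose matched words are rated 800, 900, 1580, 1650 and 1720 would be rated 1650 under this definition. Weighting by word count instead would pull ratings down sharply, since common words dominate any text.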
A book rated 1600 uses vocabulary that a Lex Expert reader would find mostly familiar — but that a Word Apprentice reader would find dense and difficult. A book rated 1400 sits comfortably within the range of an advanced reader but below the level of serious literary scholarship.
What We Found
29,875 books were rated. The range runs from 1,356 to 1,783. A few things stood out immediately.
The intuitions hold. The rankings largely confirm what serious readers already believe. Ulysses is indeed harder than Dubliners — and significantly harder than Emma. The data validates the literary consensus.
The differences within authors are real. Joyce's Ulysses (1649) is measurably harder than his Dubliners (1566). Austen's Persuasion and Emma are close, but Pride and Prejudice (1582) is slightly harder than Emma (1579) — a distinction even devoted Austen readers might not have expected.
Classic literature is genuinely hard. Every book in our Gutenberg corpus falls in the Lex Adept to Lex Master range. This makes sense — public-domain books are predominantly 19th-century literary works written for educated adult readers. There are no easy books in this corpus. The variation tells you which ones are harder within a uniformly challenging set.
Reading at Your Level
The practical application is straightforward. If you are at 1400 ELO on Lemmerly, books rated 1400 to 1500 are at your level: you will find them challenging but not impenetrable. Books rated 1600 and above will stretch you. Books rated below 1350 would feel comfortable, although no book in this corpus rates that low (the minimum is 1,356).
Linguists call this principle “comprehensible input”, often written “i+1”: the most effective reading for vocabulary growth is material just beyond your current level. A commonly cited rule of thumb is that roughly 95% of the vocabulary should be familiar and 5% new. Lemmerly can now tell you which books sit in that zone for your current rating.
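To make the 95% figure concrete, here is one way to estimate how much of a book a given reader already knows. It is a simplification under stated assumptions: a word counts as familiar if its rating is at or below the reader's, coverage is measured over running words (so frequent words count more), and unmatched words are ignored. The function name is hypothetical.

```python
from typing import Optional


def familiar_coverage(word_counts: dict, word_elo: dict, reader_elo: float) -> Optional[float]:
    """Share of a book's matched running words rated at or below reader_elo."""
    total = 0
    known = 0
    for word, count in word_counts.items():
        rating = word_elo.get(word)
        if rating is None:
            continue  # no rating for this word, so it cannot be judged
        total += count
        if rating <= reader_elo:
            known += count
    return known / total if total else None
```

A result near 0.95 would put a book in the zone described above; a result well below that suggests the reader would be stopping at too many unfamiliar words.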
This is why the book ratings appear on your session summary. After each session, Lemmerly suggests three books calibrated to your current ELO — classic literature you can actually read and absorb at your level.
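A suggestion step along these lines could look like the sketch below. The selection rule is an assumption for illustration: it takes books rated from the reader's ELO up to 100 points above it (matching the 1400 to 1500 example) and returns the three lowest-rated first. The function name and parameters are hypothetical; the ratings in the usage example are the ones quoted in this article.

```python
def suggest_books(book_ratings: dict, reader_elo: int, band: int = 100, n: int = 3) -> list:
    """Return up to n titles rated within [reader_elo, reader_elo + band].

    book_ratings: title -> book rating
    Titles are ordered from the gentlest stretch to the hardest.
    """
    in_band = [
        (rating, title)
        for title, rating in book_ratings.items()
        if reader_elo <= rating <= reader_elo + band
    ]
    in_band.sort()  # lowest rating first; ties broken alphabetically by title
    return [title for _, title in in_band[:n]]


ratings = {
    "Ulysses": 1649,
    "Dubliners": 1566,
    "Pride and Prejudice": 1582,
    "Emma": 1579,
    "War and Peace": 1617,
}
print(suggest_books(ratings, reader_elo=1550))
# ['Dubliners', 'Emma', 'Pride and Prejudice']
```

A real recommender would probably also weigh genre, length and what the reader has already finished; this only shows the rating filter.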
The Limitation
Our methodology measures vocabulary difficulty, not overall reading difficulty. A book can be conceptually demanding for reasons its vocabulary rating does not capture: War and Peace, for instance, is rated 1617, but much of its difficulty comes from its scope and structure rather than its word choice. Conversely, a book with rare vocabulary can be easy to follow if the writing is lucid.
The rating tells you something real and useful. It does not tell you everything.