Simple Readability Formulas And Boring Preprocessing
Friday, January 23, 2009 • Category: Automatic Mind
Intro
Readability formulas date back to the 1920s. They come in countless shapes and flavors, all sharing one common dream of their makers: to have a simple mathematical means of determining the reading difficulty of a given text. Is this text suitable reading for 4th-graders? Just stuff it into the formula and you will know which grade level it fits. Of course, people put up warning signs telling the naive users out there what to do and what not to do with these formulas. But don't these formulas resemble the big dream of all natural language processing (NLP)? After all, all we want is something smart and simple that does the job of dealing with real-world language. In this blog post, I will give a basic introduction to readability measures and I will point out in some detail that ›boring‹ preprocessing steps such as tokenization and sentence splitting are often underestimated. An interactive demo for computing readability scores is included.
Two Example Formulas
One of the most popular formulas is the Flesch Reading Ease, introduced by Rudolf Flesch in 1948 in his article A New Readability Yardstick. A reading ease of 90.0 to 100.0 indicates that the given text is very easy to read; comic books reach values in this range. The range of 0 to 30 indicates texts that are very hard to read. According to Flesch, scientific publications score in this range. The formula by Flesch looks relatively plain and simple, apart from some funny magic numbers:
R = 206.835 - 0.846 × WL - 1.015 × SL
where
WL = the number of syllables per 100 words (word length)
SL = the average sentence length (in words)
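As a minimal sketch, the reading ease R = 206.835 - 0.846 × WL - 1.015 × SL can be computed from three raw counts. The function name and parameter names are made up for illustration, and the counts themselves are assumed to come from some tokenizer, sentence splitter, and syllable counter that is not shown here.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease from raw counts supplied by the caller."""
    if words <= 0 or sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    wl = 100.0 * syllables / words  # WL: syllables per 100 words
    sl = words / sentences          # SL: average sentence length in words
    return 206.835 - 0.846 * wl - 1.015 * sl


# A 100-word sample with 10 sentences and 130 syllables.
score = flesch_reading_ease(words=100, sentences=10, syllables=130)
print(round(score, 3))
```

Note that every one of the three inputs is the output of a preprocessing step, so any counting error goes straight into the score.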
The formulation per 100 words points to a fact common to many early formulas: as computations had to be done manually, many authors advised users to work on small samples such as 100 words. Some variants of this advice suggest taking one small sample each from the beginning, the middle, and the end of a text. Kincaid and colleagues later adapted the formula to yield grade levels of U.S. education. The general idea and the linguistic analysis remained the same; only the magic numbers were adapted.
A less popular measure is the Läsbarhetsindex (Readability Index, LIX) introduced by Carl-Hugo Björnsson in 1968. ›Less popular‹ here means less popular in the English-speaking part of the world. For Swedish and Danish, LIX seems to be widely used. It is unclear where the formula given in the Swedish Wikipedia article is taken from. The German translation Lesbarkeit mit LIX (Readability with LIX) mentions several versions of one formula with differing magic numbers, adapted for German and Swedish respectively. I need to do further research on this. The formula is commonly given as follows:
LIX = W / P + (L × 100) / W
where
P = number of periods in the text or sample (lazy version of number of sentences)
W = number of words in the text or sample
L = number of long words (more than 6 characters) in the text or sample
The interpretation of LIX values ranges from 25 (very easy) to 65 (very difficult). But what is more important is that LIX does not require syllable counting. Syllable counting was found to be tedious in 1968, and the target audience, people involved in language teaching, probably had neither access to computers nor the knowledge to operate them. Nowadays, the problem is not computational power but the lack of accurate analysis. Part of this issue is discussed below.
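To make the ingredients concrete, here is a rough sketch of LIX in its commonly given form, LIX = W / P + (L × 100) / W, computed directly from a string. The function name, the word pattern, and the choice to count literal period characters (the ›lazy‹ sentence count from the definition of P) are assumptions made for this illustration, not a reference implementation.

```python
import re

# Runs of letters in any script; digits and underscores are excluded.
WORD = re.compile(r"[^\W\d_]+")


def lix_from_text(text: str) -> float:
    """LIX from W (words), P (periods) and L (words longer than 6 characters)."""
    words = WORD.findall(text)
    periods = text.count(".")  # lazy stand-in for the number of sentences
    if not words or periods == 0:
        raise ValueError("need at least one word and one period")
    long_words = sum(1 for w in words if len(w) > 6)
    return len(words) / periods + 100.0 * long_words / len(words)


sample = "The cat sat. The elephant wandered slowly."
# W = 7, P = 2, L = 2 ("elephant", "wandered"), so LIX = 7/2 + 200/7
print(round(lix_from_text(sample), 2))
```

No syllables are needed, but the result still depends on what the tokenizer accepts as a word and on what is counted as a sentence end.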
Most readability formulas look like the ones above. Of course, there are formulas that include more intelligent analyses such as word frequency lists, or even syntactic analyses (sentence structure). It seems that the assumptions behind most readability formulas do not stand on solid ground. But then again, who cares about these assumptions as long as the formulas actually work? Most authors carefully restrict the validity of their formulas to a certain language and even to certain text types and reader profiles. For example, the FORCAST formula introduced in 1973 by Caylor and colleagues is to be used only for U.S. Army technical documents that are read by young adult male readers.
Readability Demo
If you have Java installed on your system, you can get a feeling for a number of readability measures by using the little program below. Some ideas for input: Texts for children, Normal English Wikipedia articles vs. Simple English Wikipedia articles, abstracts of scientific papers. (Use the copy-paste keyboard shortcuts in the text area of the program.)
(This program is an old, probably incorrect version.)
The program is actually a preview of my readability library, the Phantom Readability Library, which will soon be available from the software projects section of my web page. Be aware that some measures might be incorrect. I am currently gathering the publications behind those formulas in order to check each of them for correctness.
Tokenization, Counting Sentences and Syllables
At the beginning of my readability adventure, I planned to simply use Java Fathom, a port of Perl's Lingua::EN::Fathom available from CPAN. I played with some texts, and a story for children came out at a Flesch-Kincaid grade level above 13. Clearly, something must have gone wrong. After a short journey into the code I found out that the sentence counter could not deal with direct speech as used in that story. As mentioned above, most formulas punish texts for having long sentences. With the issue fixed, the grade level now computes to 2.6 for the same text (in the current version of the program mentioned above).
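The sketch below illustrates one way a sentence counter can stumble over direct speech. It is a hypothetical failure mode constructed for this example, not the actual Fathom code: the naive rule expects whitespace right after the terminator, so a closing quotation mark hides the boundary. The pattern names and the sample story are made up.

```python
import re

# Naive rule: a sentence ends at . ! or ? directly followed by whitespace or end of text.
NAIVE_END = re.compile(r"[.!?]+(?=\s|$)")

# Quote-aware rule: closing quotes or brackets may sit between the
# terminator and the following whitespace.
CLOSERS = "\"'”’»)]"
QUOTE_AWARE_END = re.compile(r"[.!?]+[" + re.escape(CLOSERS) + r"]*(?=\s|$)")


def count_sentences(text: str, pattern: re.Pattern) -> int:
    """Count sentence ends found by the given pattern."""
    return len(pattern.findall(text))


story = ('"Where are you going?" "To the river." "Why?" '
         '"I am thirsty." The fox nodded.')

print(count_sentences(story, NAIVE_END))        # 1
print(count_sentences(story, QUOTE_AWARE_END))  # 5
```

With the naive rule, all 14 words land in a single sentence; with the quote-aware rule the average sentence length drops to 2.8 words. The quote-aware rule is still far from complete: a speech tag after a quoted question, as in "Why?" asked the fox, would be split in the wrong place and needs further rules.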
A former fellow student kindly provided me with her version of a Perl-based syllable counter, which I then ported to Java. She reports over 96% correct results on a large test data set, which is fairly good for a rule-based approach that operates directly on English spelling. With that tool available, I compared my results to those of other tools, mostly ones available online. They gave different results on measures that involve syllable counting. I found out that an average syllables-per-word ratio of 1.5 vs. 1.2 (my program) affects the readability measures a lot.
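How much does 1.5 vs. 1.2 syllables per word matter? The following back-of-the-envelope sketch holds the sentence length fixed at 10 words (an arbitrary assumption) and varies only the syllable ratio. The Flesch-Kincaid constants 0.39, 11.8 and 15.59 are the commonly cited ones and are not taken from this post; the function names are made up.

```python
def flesch_reading_ease(words_per_sentence: float, syllables_per_word: float) -> float:
    # WL is syllables per 100 words, hence the factor of 100.
    return 206.835 - 0.846 * (100 * syllables_per_word) - 1.015 * words_per_sentence


def flesch_kincaid_grade(words_per_sentence: float, syllables_per_word: float) -> float:
    # Commonly cited Flesch-Kincaid grade level constants.
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


SENTENCE_LENGTH = 10.0  # assumed, kept constant for both runs

for ratio in (1.2, 1.5):
    ease = flesch_reading_ease(SENTENCE_LENGTH, ratio)
    grade = flesch_kincaid_grade(SENTENCE_LENGTH, ratio)
    print(f"{ratio} syllables/word: ease {ease:.2f}, grade {grade:.2f}")
```

Because both formulas are linear in the syllable ratio, the shift does not depend on the sentence length chosen: 0.3 extra syllables per word cost 0.846 × 30 = 25.38 points of reading ease and add 11.8 × 0.3 = 3.54 grade levels.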
So what's the message? What I am trying to say here is that preprocessing accuracy does matter. The example with the sentence counting shows that it even matters a lot. And this is what happens to me all the time: after implementing any analysis component beyond tokenization and sentence splitting, I always find out that due to some errors during preprocessing, all later steps fail to a large extent. Preprocessing is the most underestimated thing in NLP! People tend to think of it as a solved problem, but in fact and in practice, it is still one of the biggest challenges we face.
Sometimes, as with the Fathom module, it is just people not being imaginative enough. Having [a-zA-Z] as the only legal characters of a word reveals an anglocentric computer scientist's view on a world without any other language than English and without foreign words being used.
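A small sketch shows what an [a-zA-Z]-only notion of a word does to text that contains other letters. The two patterns and the sample sentence are chosen for illustration; the second pattern is just one possible way to accept letters from any script in Python, not the fix used in any particular library.

```python
import re

ASCII_WORD = re.compile(r"[a-zA-Z]+")     # the anglocentric definition
UNICODE_WORD = re.compile(r"[^\W\d_]+")   # letters of any script, no digits or underscores

text = "Björnsson introduced the Läsbarhetsindex."

print(ASCII_WORD.findall(text))
# ['Bj', 'rnsson', 'introduced', 'the', 'L', 'sbarhetsindex']
print(UNICODE_WORD.findall(text))
# ['Björnsson', 'introduced', 'the', 'Läsbarhetsindex']
```

The ASCII pattern turns four words into six tokens and shortens two of them, which shifts word counts, word lengths, and everything computed from them.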
The Use and Abuse of Readability Formulas
So what are people doing with these formulas? First of all, there are commercial tools on offer at remarkable prices. That does not mean these products are actually selling, though. William H. DuBay writes in his accessible overview The Principles of Readability that the average adult citizen of the U.S. reads at the 7th grade level. A text written at the 10th grade level will not be understood by 80% of the U.S. population. In genres where a broad audience is addressed, such as manuals, health care, or government information, checking the texts with readability formulas can give hints on where to improve the communication. Apart from that, publishers may want to decrease the reading difficulty of their books or newspapers in order to reach a larger number of customers. However, they might scare away readers with higher reading proficiency, as reading a lot of text below one's level can be boring or even exhausting.
Language teachers could in theory use readability formulas to judge whether or not a text is suitable for their students.
One issue concerning readability formulas is that one must not ›write to the formula‹. Quite naturally, one would trick the formula by, for example, using shorter words or shorter sentences. This leads to texts that score better but are not necessarily easier to read. Readability formulas simply do not work the other way round.
A common misconception may be that these formulas are supposed to be an exact means of judging the reading difficulty of a text. However, they do not produce much more than ballpark figures. Their exactness heavily depends on the suitability of the text for the formula in use. Furthermore, the implementations in computer programs may differ widely, depending on the quality of preprocessing as discussed above and the linguistic conceptions of their makers. Christian Watson discusses the differences in the results of several online tools in his Smiley Cat Web Design Blog. Last but not least, it is possible to write a difficult text with easy words and easy sentences, simply because it deals with a topic hardly anyone is familiar with. This issue is partly addressed by formulas using vocabulary frequency lists… another story to write a blog post about.
Outro
Readability formulas are one of the things in computational linguistics that work well under certain conditions without requiring complete and complex analyses of language. However, they are rather shallow heuristics that may fail for a large number of texts. If they are used in the wrong context, they will almost certainly fail. What makes them attractive is their simplicity, which at the same time bears the danger of being blinded by its beauty, leading to the overuse or abuse of the formulas. From the computational perspective, these formulas are fragile because they depend on preprocessing analyses such as sentence splitting and syllable counting. The success of any analysis component stands and falls with preprocessing. We as CL people should not forget about that simple fact. The helpfulness of readability measures in computer-aided language learning (CALL) will be the subject of further analysis and discussion in my master's thesis.
Graphics taken from Open Clip Art Library, modified by Niels Ott.

5 Comments
Let me try to recapitulate: readability seems to me to be a very difficult problem that has been 'solved' on a 'good enough' basis. This means, for me, that when I want to implement a readability measure for a particular domain, I will not spend more than a day or two of work on it, unless I make it my main focus of research/work. In fact, I'd hope to be able to get away with an API call.
On the other hand, in order to have a deeper measure, one would have to invest a disproportionate amount of work in it. After all, it's subjective anyway. Moreover, making deeper analysis would only open up some more attack vectors for rival scientists to excoriate you and your work.
Which leads me to the one point where I have to disagree with you: those formulas are in no way beautiful. In fact, I've rarely seen a formula more disgraceful than R = 206.835 - .846 × WL - 1.015 × SL. Even an experimental physicist would have to cringe at that one.
They're dirty hacks at best.
I've got one question: are there any formulas or statistical values for the average number of syllables in different languages?
It takes a lot of time to split the text into syllables. I want to calculate the needed values in real time while the author is writing his text. If there is an average number of syllables per word (depending on the length of the word), I can use these values to get an approximate value. These values may depend on the length of the text too, but I think that the resulting values may be not that bad!?!
I don't know if any such data is available, but it should be easy to produce if you have two things for the respective language: a syllable splitter and a moderately large corpus of text: run the splitter on the corpus once, plot the number of syllables against the number of characters, and do a regression analysis. This should give you a nice and, I guess, very accurate function.
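The regression idea can be sketched in a few lines. Everything here is illustrative: the ten hand-labelled words stand in for the output of a real syllable splitter run over a real corpus, the helper names are invented, and a plain least-squares line is only the simplest possible model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy sample of (word, syllables); replace with splitter output on a corpus.
SAMPLE = [("cat", 1), ("text", 1), ("simple", 2), ("reading", 2),
          ("formula", 3), ("sentence", 2), ("syllable", 3),
          ("readability", 5), ("computer", 3), ("language", 2)]

lengths = [len(word) for word, _ in SAMPLE]
syllables = [count for _, count in SAMPLE]
SLOPE, INTERCEPT = fit_line(lengths, syllables)


def estimate_syllables(word: str) -> float:
    """Cheap estimate from word length; every word has at least one syllable."""
    return max(1.0, SLOPE * len(word) + INTERCEPT)


print(round(estimate_syllables("preprocessing"), 2))
```

Such an estimate only needs the word length at typing time, so it is cheap enough for real-time use. How accurate it is for a given language would have to be checked against the splitter on held-out text.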