Text Difficulty and Information Retrieval
Thursday, August 20, 2009 • Category: Automatic Mind
Intro
It has been a long pause here on Automatic Mind. After finishing my Master's project and thesis, it took me some time to adjust to my new situation as a researcher here at Tübingen University. Meanwhile, a few things have happened in the readability corner. The tool for computing readability formulas that I demonstrated as a Java applet in an earlier post is now freely available as a Java library, including the applet and a standalone demo GUI. Some bugs have been squashed and all formulas have been cross-checked against the corresponding original publications. In this post I will focus on what one can do with those readability formulas in information retrieval. It is a brief summary of topics from my MA thesis, entitled Information Retrieval for Language Learning: An Exploration of Text Difficulty Measures. The practical part of my thesis lives on as the Information Retrieval for Language Learning (IR4LL) project, which also features an online demo and a web site.
Yet another Search Engine?
»There is Google, so what do you want?« Apart from the fact that it is cool and nerdy to be able to say that one has developed one's own search engine, there is more to it. Google and its competitors are very good at retrieving web pages of interest, but they cannot guarantee that the texts on the returned pages are easy to read. For language learners, and for teachers gathering readings, it is crucial to have material at a suitable level of difficulty. Otherwise students will be either bored or overwhelmed. It is hard to say whether Google could do this if they were interested. They usually focus on statistical methods, and it is unlikely that they would do deep natural language processing: it is simply too expensive in terms of processing power. So here we go with a new search engine that limits itself to only a few web sites containing promising readings at several levels. At least this is the plan for the future; the current prototype shows that this is possible. Everything is work in progress.
Readability Measures and Beyond
A brief roundup of readability measures in general and of my previous post: readability measures, or formulas, try to compute a single value stating how difficult a text is to read. Many formulas are supposed to yield numbers on the scale of U.S. grade levels. Most formulas use the average sentence length and the average word length as variables. Word length is measured in syllables or in characters. While the formulas look as if they contain a lot of magic numbers, their constants are in fact usually chosen so that the result matches texts with a known difficulty level. To illustrate this, here is the Flesch-Kincaid formula, which is supposed to yield levels on the grade level scale:
Grade level = 0.39 · ASL + 11.8 · AWLs − 15.59
where
AWLs is the average word length counted in syllables and
ASL is the average sentence length counted in words.
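To make the two variables concrete, here is a minimal Python sketch of the Flesch-Kincaid grade level, commonly given as 0.39 · ASL + 11.8 · AWLs − 15.59. It is not the code of the Java library mentioned above. The function names, the regex-based sentence and word splitting, and above all the vowel-group syllable counter are assumptions made for illustration; a real implementation would need a much better syllable counter.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count groups of consecutive vowels.

    Assumption for illustration only. It miscounts many English words
    (silent "e", for instance) and is not the library's counter.
    """
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level: 0.39 * ASL + 11.8 * AWLs - 15.59."""
    # Naive segmentation: sentences end at ., ! or ?; words are letter runs
    # with an optional internal apostrophe.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    asl = len(words) / len(sentences)  # average sentence length in words
    awls = sum(count_syllables(w) for w in words) / len(words)  # syllables per word
    return 0.39 * asl + 11.8 * awls - 15.59


if __name__ == "__main__":
    print(flesch_kincaid_grade("The cat sat. The dog ran."))
```

Very short, monosyllabic text can come out below zero, which shows that the constants were fitted to graded school texts and are not a natural scale.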
To play with readability measures, check out the Java Webstart of the Phantom Demo GUI.
Unfortunately, the world of language is not as simple as that. Sentence length depends on the domain. It probably separates novels from technical manuals more reliably than easy texts from hard ones. Word length has its flaws, too: the general assumption that long words are harder (or even rarer, if we agree with Zipf's The Psycho-biology of Language) is challenged by frequent long words such as "beautiful" or "absolutely". While the current IR4LL implementation focuses strongly on readability measures, there are things to be explored beyond those.
Vocabulary and Grammar
For future research, there are two basic strands which I want to follow with respect to Information Retrieval for Language Learning (IR4LL):
- measures of text difficulty based on vocabulary lists, and
- syntax-based measures of text difficulty.
Thinking back to one's own foreign language classes in school, vocab drill is what everybody had to do. There are different styles of vocabulary learning, but there is no doubt that new words must be learned actively. This also means that missing vocabulary makes a text harder to read. Word lists are relatively easy for computational linguists to deal with. However, the contents of those lists must be well chosen, so this might turn out to be a challenge. Furthermore, vocabulary is again domain-specific. If a learner is a fantasy literature nerd, he or she will probably be able to read a novel of that genre that even native speakers would have a hard time with.
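One simple way to turn a vocabulary list into a number is the share of running words in a text that the list does not cover. The sketch below illustrates that idea only; it is not how IR4LL does it, and the function name and the tiny word list are made up for the example. It also matches surface forms only, so a real system would need lemmatization to recognize "sleeps" when the list contains "sleep".

```python
import re
from typing import Iterable


def unknown_word_ratio(text: str, known_words: Iterable[str]) -> float:
    """Fraction of tokens in `text` that are not in `known_words`.

    Hypothetical helper for illustration: tokens are lowercased letter
    runs and are compared as surface forms, without lemmatization.
    """
    known = {w.lower() for w in known_words}
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t not in known)
    return unknown / len(tokens)


if __name__ == "__main__":
    learner_list = ["the", "dragon", "sleeps"]  # made-up learner vocabulary
    print(unknown_word_ratio("The dragon sleeps beneath the mountain.", learner_list))
```

A higher ratio means more unfamiliar vocabulary for a learner who knows exactly the words on the list, which is why the choice of list matters so much.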
Syntactic complexity seems to be one of the more promising things to look at. Simple phrases linked with "and" or commas are probably much easier to understand than deeply nested subordinate clauses. There are a couple of existing approaches which I'm planning to integrate into the IR4LL prototype. A positive side effect will be the possibility to query directly for linguistic forms. One could query for texts containing a lot of gerunds, or a lot of simple past forms, and so on. Since most tenses are taught separately, this could turn up nice real-life reading materials for classroom use. Furthermore, the WERTi system (an automatic intelligent workbook) already interacts with IR4LL in its latest prototype version. A syntax-aware IR4LL system could greatly improve the usability of WERTi by finding better-suited readings.
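As a very crude stand-in for a syntax-based measure, one can count words that typically introduce subordinate clauses and divide by the number of sentences. This is only a sketch of the intuition. It is not one of the existing approaches referred to above, the word list is a hand-picked assumption, and without a parser or tagger it will miscount ambiguous words such as "that".

```python
import re

# Hand-picked, incomplete list of words that often introduce subordinate
# clauses. Assumption for illustration only, not a linguistic resource.
SUBORDINATORS = {
    "although", "because", "if", "since", "unless", "whereas",
    "which", "while", "that", "who", "whom", "whose", "when",
}


def subordinators_per_sentence(text: str) -> float:
    """Average number of subordinator-like words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = sum(1 for tok in tokens if tok in SUBORDINATORS)
    return hits / len(sentences)


if __name__ == "__main__":
    print(subordinators_per_sentence("I came and I saw."))
    print(subordinators_per_sentence("I left because it rained, which was a pity."))
```

A proper syntax-based measure would work on parse trees, for example by looking at clause nesting depth. The same tagging machinery would also make queries for forms such as gerunds possible.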
At the time of writing, many of these ideas are left to future research, which I hope we will be able to conduct here at the linguistics department of Tübingen University.
Combining Measures into Simpler Difficulty Levels
There is yet another challenge to IR4LL: most users will not be able to specify their language proficiency level in a detailed way. A self-assessment such as »yeah, I'm rather weak at future in the past, but I don't have any trouble with deeply structured sentences« is unlikely. Therefore, IR4LL aims to combine several measures (text difficulty, vocab-based, syntax-based) into single templates. I call these query models because they are used as part of the search engine query. It would be great to have query models that actually classify texts on a well-known scale such as CEF levels, or according to the stages of a foreign language teaching curriculum.
Outro
With not much related work out there, my MA thesis project contributes to the small field of IR4LL. The thesis discusses ways of measuring text difficulty, and the implementation part provides an extensible framework with a running prototype. Future plans include the integration of a web crawler and the refinement of the text categorization into well-established difficulty levels. Once this has been achieved, the search engine will be of great use to language teachers and learners. Will boring school book texts finally be a thing of the past? If we manage, and if there is enough easy text on the web, they will.
Graphics taken from Open Clip Art Library, modified by Niels Ott.

2 Comments
By watching users' clicks, one will always have the mixture of contents and readability: did the user switch to another site because he or she didn't find the text informative enough, or because it was too hard or too easy to read?
(Which also brings up the always-present issue of the inter-relation between contents and readability, which would be subject to other research.)