They are included because they were the reason people gave when asked why "The Silmarillion" was so unreadable.

(I'm the author)


The problem with stop words is that they tend to be the most common words in every piece of text[1], regardless of genre.

So, in order to test the hypothesis "The Silmarillion is harder to read because it has lots of stop words", you need to calculate the relative frequencies in lots of other texts and see if there's something special about The Silmarillion's top 10 versus all the others'.

Surely, you have already done that using LOTR and The Hobbit, but a much bigger sample is needed. At the very least, you may want to use 10-15 other works of fantasy from different authors, and that will still be just a back-of-the-envelope test to see if it is worth pursuing this experiment with a statistically significant sample.
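Something like this minimal sketch would do for the back-of-the-envelope version (the file names are hypothetical, the tokenisation deliberately naive, and the stop word list abridged):

    import re
    from collections import Counter

    # A few of the usual English stop words; a real run would use a full list.
    STOP_WORDS = {"the", "and", "of", "to", "a", "in", "that", "he", "was", "his"}

    def stop_word_ratio(path):
        # Fraction of all tokens that are stop words.
        words = re.findall(r"[a-z']+", open(path).read().lower())
        counts = Counter(words)
        return sum(counts[w] for w in STOP_WORDS) / len(words)

    # Hypothetical file names; add 10-15 more works from different authors.
    for title in ["silmarillion.txt", "lotr.txt", "hobbit.txt"]:
        print(title, stop_word_ratio(title))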

[edit] 1. Provided it is sufficiently large.


Yes--maybe I am thinking out loud 'too loudly' sometimes, but I think it's interesting, since it explains how one came to a conclusion; and if that conclusion is wrong, it can be corrected, because you know how it was reached. It's like you can "debug" a thought process, in a way, as you have a "trace" :-)

And thanks for all the suggestions! I've made a note of them all. Good to hear from people who know more about the topic than me.


Author here.

I got them in txt files, online. I own the original books too. I would have typed them in if I had all the time in the world, of course.


Hi, author here.

First, it's she, not he.

Second, because I hadn't heard about that obvious test at all back then. I never pretended to do a super-serious scientific analysis, but rather to answer the questions that came to my mind, using a computer to validate hypotheses.

Many thanks for the pointers & suggestions, though! I was thinking about rebuilding this to make it realtime + interactive so that more than one text could be analysed/visualised, so I'll make sure to introduce the new tests.


Hi author, I really liked the visualisations, especially the black hole thing.

I think I may have misunderstood the purpose of your article: I first read it as a serious attempt to use textual analysis to compare the comprehensibility of Tolkien's best-known works, in which case you fell way short of the mark by going no further than word counting. I think it started off like that, but on re-reading it I see that you say you hit a brick wall (paraphrasing) and decided to have some fun visualising your results so far; something you did very nicely. So perhaps I got the wrong end of the stick.

I thought about what you've taken on here. Doing things at the word level is pretty easy. Taking into account grammatical structure to get things like sentence lengths and clauses per sentence, or breaking words up into syllables (needed for F-K, for example) is considerably harder. I'm interested to hear what you come up with. Flesch & Flesch-Kincaid are US inventions so perhaps not obvious to everyone.

On he/she: I agonised over that for a good ten minutes over my breakfast. I read around your blog and your twitter page this morning for clues as to the appropriate pronoun. In the end I couldn't tell, so I went with 'he' because I stereotyped you. I nearly changed my wording to use constructions like "the author," "they," and various other mealy-mouthed alternatives but they were too ugly. So I did try, but I got it wrong. I apologise.


Isn't FK determined almost completely by sentence length? That is what I recall from messing with MS Word docs in high school.


FK combines average sentence length (total words/total sentences), average syllables per word and some fixed coefficients to come up with an equivalent school grade.

Flesch Reading Ease, which is what I actually meant, does the same but with different coefficients, to come up with a more granular readability score (higher = easier), usually in the range of 30-100 for ordinary prose.
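For reference, these are the standard published coefficients; a quick sketch in Python, with the syllable counting (the hard part) left out:

    def fk_grade_level(words, sentences, syllables):
        # Flesch-Kincaid grade level: higher = harder; roughly a US school grade.
        return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

    def flesch_reading_ease(words, sentences, syllables):
        # Flesch Reading Ease: higher = easier; ~30 is difficult, ~100 very easy.
        return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)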

They're both pretty arbitrary. The more I read up on this subject the more respect I have for the author's own attempts at an originality score. It's all subjective ultimately.


> On he/she

While I consider defaulting to 'he' to be legitimate and acceptable, I actually prefer the zie/zir gender-neutral pronouns when I think about it. http://santiago.mapache.org/nonfiction/essays/zie.html If I ever begin to agonize about the gender of the person I'm talking about, that's enough to kick me over into using GNPs.


Have fun with it! This seems like a great example to encourage further analysis because you still don't have a testable quantitative hypothesis for why it was harder to read. Makes me want to start coming up with metrics, too! Of course there's a big list of readability metrics, but it's way more fun to discover them on (y)our own. That said, I too was a little surprised when you had a heading titled "The classic graphs" without actually having perspective in the field for what would be classic.


The "originality" index bothered me, because as the work of a length grows, you'd generically expect less words to be introduced -- exactly what the results show.

The idea makes sense, but I'm unclear on how to actually measure it in a way that's normalized by page count.


There's the concept of a vocabulary growth curve. It shows the number of distinct words (or, in one variant, of words occurring only once) as a function of the amount of text, e.g., how many new words appear in the first 1000 words? how many in the first 2000? etc.
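A minimal sketch of computing one (tokenisation deliberately naive):

    import re

    def vocab_growth(text, step=1000):
        # Returns (tokens_seen, distinct_words_seen) pairs every `step` tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        seen, curve = set(), []
        for i, tok in enumerate(tokens, 1):
            seen.add(tok)
            if i % step == 0:
                curve.append((i, len(seen)))
        return curve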


But is there a good way to boil that down to a single metric? (i.e., can we parametrize these curves with a single value?)


I wouldn't know off the top of my head, but the concept was introduced by Harold Baayen. There are free PDFs of his online, with lots of interesting quantitative techniques.
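For what it's worth, one rough single-number summary (not necessarily what Baayen recommends) is to fit Heaps' law, V(N) ≈ K·N^β, to the curve and report the exponent β:

    import math

    def heaps_beta(curve):
        # curve: (tokens_seen, vocab_size) pairs, e.g. from a vocabulary
        # growth curve. Least-squares slope of log V against log N.
        xs = [math.log(n) for n, v in curve]
        ys = [math.log(v) for n, v in curve]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        den = sum((x - mx) ** 2 for x in xs)
        return num / den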


Binomial distribution mumble mumble logarithm of page count mumble.


Would it be possible for you to generate an HD version of the "Graphical representation of words frequency" image? It would look great as a poster.


Nice work! So my question is where did you get the text to analyze?


You can find the text for pretty much any popular work if you look in the proper places :-)


I think you're just talking about the applications in a stock Nexus One, not the operating system itself (i.e. the UI, the services it provides, etc). If the apps are lousy it's one thing, but the system behaves quite predictably more often than not.

And yes, the stock apps could do with a bit of improvement, but you can always install an alternative app from the market. That's what it is for, isn't it?


In my experience that's a distinction iPhone owners don't need to make.


Being an iPhone owner, I disagree (though it's an original iPhone, so I don't know about newer models). Mine crashed on a regular basis: not applications, the phone itself.


I have never had an OS crash so far on my iPhone 3GS (starting with 3.x, now still on 4.0).


The only idea that comes to mind is auditing advertising companies, but it doesn't seem too likely/feasible to me.


As far as I know, this won't work if the user came from https:// - the referrer will be empty, won't it?


That's right, but the article is about Google search, so the referrer will be hidden only if the user comes from https://encrypted.google.com
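A minimal server-side sketch of what that looks like in practice (Python/WSGI, purely illustrative):

    def search_referrer(environ):
        # Browsers omit the Referer header when navigating from an https://
        # page to an http:// one (RFC 2616, section 15.1.3), so visits from
        # https://encrypted.google.com arrive with no referrer at all.
        referer = environ.get("HTTP_REFERER", "")
        return referer or None  # None: hidden referrer, or a bookmark, etc.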


We were taught basic programming with Emacs, connecting to an HP-UX machine. So we really didn't know much Emacs anyway :-)

