Show HN: Generate a quiz from a Wikipedia page (github.com/alexgreene)
388 points by alex_g on Feb 19, 2017 | 85 comments



Have you thought about using dbpedia?

https://en.wikipedia.org/wiki/DBpedia


Has someone tried to host a DBpedia copy offline?

The supported Virtuoso database is quite esoteric ( https://en.wikipedia.org/wiki/Virtuoso_Universal_Server ).

Has one succeeded writing a script to import the data to Postgres, MySQL or Lucene?


I've hosted it in Jena, and I think most graph databases include importers.

You need a RDF/Graph database (unless you are up for a lot of re-engineering)


I hadn't but I will now. Thanks for sharing!


How does DBpedia compare with WikiData?


There was Freebase with a community around the project. Sadly Google took it offline and now only uses it internally, preventing other AI projects and competitors like IBM (Watson), Apple (Siri), and Microsoft (Cortana) from using it. So IBM bought the Blekko semantic search engine, Apple now has its own web crawler, and Microsoft had already bought various SV startups like Powerset to enhance MSN/Live/Bing search.

DBpedia is big as well.

WikiData is several orders of magnitude smaller than Freebase and DBpedia at the moment, but has an active, healthy community.

What's the best way to run an offline copy of WikiData these days? Does it run on MySQL (or Postgres) like all other Wikimedia properties?


> There was Freebase with a community around the project. Sadly Google took it offline

This isn't what happened. Freebase is still available to download and is slowly (with Google support) being migrated to WikiData[1].

There are some other pretty large Knowledge Graphs around. ConceptNet and Probase/MS Concept Graph are two that are worth looking at.

[1] https://static.googleusercontent.com/media/research.google.c...


I'd assume so, because it acts like a highly specialized MediaWiki.


Built this last week as part of the interview process for a job. I know it's flawed, but in my opinion neat nonetheless!


Excellent idea. There are a lot of interesting ways to improve this, but you have an MVP running, which is a good start.

Regarding your codebase: clear and to-the-point code, well commented, and helpful commit messages. Including a `requirements.txt` is a plus.

Good job, keep it up!


Well, I believe that any code should be readable enough that comments like the one below wouldn't be required.

        # splits a Wikipedia section into sentences

        # and then chunks/tokenizes each sentence

If I had interviewed the author, I would have asked him what the purpose of commenting like that is.


I believe it's easier to read a native language than code. I would also counter that there is no harm in comments like this, so just because you don't find it useful doesn't mean someone else won't.


Okay. My point is that in a real job you don't have time for writing this type of comment. Instead you have your current task to work on, the issue that was re-opened and needs to be revisited, the bug to argue with QA about, the deadline to discuss with the PM, the code review to do ASAP. You simply don't have time to write perfect code that is full of comments in the "native language".


This rather sounds like you don't have the time to not do it.

Imagine spending the time you spent on "re"-visiting "re"-opened bugs that are vague enough to be argued about on writing code that doesn't need these "re"s in the first place.

I concede that it might be a difficult place to get to, especially because it's a team effort as well, but I feel it's more productive and less stressful to work like that.


I am too long in IT to imagine anything like that.


That particular comment seems OK. In general, one should make the code readable to the point of not needing a comment; comments can rot over time in legacy systems. Some things really benefit from a comment. However, I feel the following snippet is a better example of a comment that should not exist, as it does not serve to clarify the code and is just a direct English translation of the simple code:

        # Iterate through article's sections

        for section in self.page.sections:


Yeah that's clearly redundant


One of the purposes is to write down what you need to do as comments, and then implement each part, like pseudo code. Saying you don't have time for it is like saying you don't have time to think through what you're trying to do.
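For example, a sketch of that comment-first workflow (the function and steps here are hypothetical, not taken from the project):

```python
# Comment-first workflow: write the steps as comments, then fill
# each one in. Function name and steps are made up for illustration.
def fill_in_blank(sentence, answer):
    # 1. locate the answer phrase in the sentence
    # 2. replace it with a blank to form the question
    question = sentence.replace(answer, '____')
    # 3. return the question/answer pair
    return question, answer

print(fill_in_blank("Paris is the capital of France.", "Paris"))
# ('____ is the capital of France.', 'Paris')
```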


It's entirely possible to think through what you're trying to do without writing the code comments. I prefer good old-fashioned pen and paper for example.


Of course, but you don't share those papers with us :) I like those kinds of comments a lot. In a good editor it's so easy to just scan a lot of code and understand what's what. You could look at something you or someone else wrote years ago, in a different style, a different paradigm, a different language, and still have a perfectly clear picture of what the code does in no time.

To each his own I guess.


The project is cool and I know the code is not necessarily the point. That said, if I were being picky, I'd ask why the author chose not to use docstrings. The code itself is fine but not very Pythonic. There are small inconsistencies* that running pycodestyle [1] once would have caught and could be fixed quickly — I recommend OP consider that.

*Mostly related to: whitespace / spacing, indentation, mixing single quotes and double quotes, magic numbers, naming conventions

[1]: https://pypi.python.org/pypi/pycodestyle


Super super awesome, what a brilliant idea. You might want to do pattern matching such that the answer to the question doesn't match the text of the question. Your example image shows the immediate flaw there.


Thanks, and good point. It's a very simple approach, so there are numerous weaknesses that will be improved upon with a bit more knowledge of NLP.


Did you get the job?


As far as I know they're still reviewing it.


Good luck! If it doesn't work out for any reason let me know if you're interested in Poll Everywhere.


They should have given you an offer... you clearly delivered.

I bet you're going to get some offers from making this post on HN. Make sure you have your contact info in your profile. :)


Good luck!


Neat and simple implementation. Consider docstrings for describing methods; these tend to integrate with IDEs a lot better than comments.


It's a nice idea

One potential improvement is to remove the common parts of the answer and the question (as in your Triumph example).
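A minimal sketch of that check (the helper name is hypothetical, not from the repo): reject a candidate answer that already appears in the question text.

```python
# Hypothetical helper: reject a candidate answer if it already
# appears in the question text (case-insensitive substring check).
def is_valid_answer(question, answer):
    return answer.lower() not in question.lower()

print(is_valid_answer("The Triumph Bonneville is made by ____.", "Triumph"))  # False
print(is_valid_answer("The ____ Bonneville is a motorcycle.", "Triumph"))    # True
```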


Very cool. Can you please add more info or talk about how the grammar/parsing is set up?


Sure! Take a look at: https://github.com/alexgreene/WikiQuiz/blob/master/python/Ar...

I used nltk (Natural Language Toolkit), which takes care of most of the hard work. It tokenizes whatever text you pass it, and even assigns each word a part of speech (noun, adjective, etc.).

The grammar is where I tinkered the most. You can see I have 3 grammar rules set up (NUMBER, LOCATION, PROPER). nltk will go through the tokenized words and see if any sequences of words match any of the rules. If it finds a match, it groups/chunks those words together into a phrase with the tag you've specified (i.e. LOCATION).

As for the rules themselves, they're very easy to write once you understand the syntax. For example, let's look at my PROPER rule, {<NNP|NNPS><NNP|NNPS>+}

Everything in the {} is the rule. The tags inside of the <> are the parts-of-speech assigned by nltk. Translating the rule literally would be: match any sequence that has: [an NNP or an NNPS] followed by one or more of [an NNP or an NNPS]. In other words, any sequence of two or more NNP or NNPS words.
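The rule's behaviour can be sketched in plain Python, nltk machinery aside (the sentence below is made up, pre-tagged the way nltk.pos_tag would tag it with Penn Treebank tags):

```python
# Plain-Python sketch of the PROPER rule {<NNP|NNPS><NNP|NNPS>+}:
# group runs of two or more consecutive NNP/NNPS tokens into phrases.
def chunk_proper(tagged):
    phrases, run = [], []
    for word, tag in tagged + [(None, None)]:  # sentinel flushes the last run
        if tag in ('NNP', 'NNPS'):
            run.append(word)
        else:
            if len(run) >= 2:
                phrases.append(' '.join(run))
            run = []
    return phrases

# Made-up sentence, pre-tagged as nltk.pos_tag would tag it
tagged = [('Alan', 'NNP'), ('Turing', 'NNP'), ('worked', 'VBD'),
          ('at', 'IN'), ('Bletchley', 'NNP'), ('Park', 'NNP'), ('.', '.')]
print(chunk_proper(tagged))  # ['Alan Turing', 'Bletchley Park']
```

A single NNP on its own (e.g. just 'Paris') is deliberately not chunked, because the rule requires at least two consecutive proper-noun tokens.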


Thanks a lot for the explanation. NNP would be Noun-Noun-Phrase and NNPS would be Noun-Noun-Phrase-Sentence I believe? I will play around with the syntax more.


Good luck mate, this is a really cool little project.


Awesome work! How long did this take you?


Those interested in this may also be interested in https://en.wikipedia.org/wiki/Incremental_reading which gradually converts reading material into flashcards that are memorized using spaced repetition software.


This is so cool!

Let me know if you plan on continuing it. I'd love to collaborate.


Thanks! let's chat: @alexg473 (twitter) or alexgrn7 (gmail)


For the lazy, hosted: https://wiki-quiz.herokuapp.com/


Thank you kind fellow!


No worries, cool project. Here are the changes I had to make to make it hostable, if you're interested: https://github.com/lutherism/WikiQuiz/commits/master


Ha! For a class in school we had to create a web app that allowed people to create quizzes and challenge others. Among the question types, we implemented a sort of fill-in-the-blank using Wikipedia's random article feature (https://en.m.wikipedia.org/wiki/Special:Random)


Can you tell us more about that web app? Are the quizzes generated automatically, and if so, how did you extract the information?


UPDATE: I'm truly excited about all of the feedback this project has received. Credit to Volley (http://volley.com) for requesting/inspiring this project!


> In Australian aboriginal mythology, ? is a god of earthly knowledge and physical might, created by Altjira to ensure that people did not get too arrogant or self-conceited.

[Jar'Edo Wens]

> Correct!


It appears that there is an open issue[1] in the Wikipedia python library where it does not list the different sections in a wikipedia page. So right now, this would only generate questions from the "Summary" section of any wikipedia page.

[1] https://github.com/goldsmith/Wikipedia/issues/119


Would be awesome if this could be used to generate a Kahoot quiz: https://kahoot.it/


Thanks for sharing this. I've done a similar thing to help study for exams except with pattern matching instead of nltk. I'm looking forward to understanding the natural language part.

It's very buggy though... I get more invalid questions than good ones, haha


I cloned the repository and installed the requirements, but after starting 'python python/server.py' I get a 404 when I try to open index.html as described. Anyone else having that problem?


Don't go to localhost:5000/, like you might think. Open file://.../WikiQuiz/index.html as a file in Chrome. It will make ajax requests to localhost:5000


That's the idea behind http://github.com/divbit/grimoire as well, except more for private notes.


"grok" mode does a quiz on all notes of a certain topic


Sorry, the readme is a bit out of date.


Nice, but it still has some room for improvement:

http://i.imgur.com/EVToWfI.png


The hosted demo isn't looking past the summary of the wiki article. If you follow the instructions on the README and run it locally it will have a much larger pool of answers/choices.


I really could have used this when I was teaching English at high schools. Thanks for the link, I will defiantly be sharing with mates still in the industry.


[flagged]


Give them some benefit of the doubt. Not saying it's the case, but phones and auto-correct can do the most amazing things.


Yeah. The worst thing is, I miss that autocorrect so often -.-


Yeah, autocorrect can be so helpful…until it's not!


A novel idea, but the example screenshot looks really trivial. I wouldn't have known it was Triumph on my own, but it appearing in the question narrows it down.


Nice! I am guessing you don't want to hardcode the domain/hostname in script.js (line 40)?


Right, it's just localhost right now, so I don't think it matters.


Btw I am unable to get it working locally. I only see 404s. Do I need anything special to serve the static files?


Happy to help. The static files should only be accessed by opening index.html in your browser. From another user: "Don't go to localhost:5000/, like you might think. Open file://.../WikiQuiz/index.html as a file in Chrome. It will make ajax requests to localhost:5000"


Ah I see, thanks for the help. Now I see 500s even though I have installed both the nltk packages. Hmm.


The exception appears to be:

        <class 'TypeError'>, TypeError("a bytes-like object is required, not 'str'",)
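That message is the classic Python 3 bytes/str split; code written for Python 2 often triggers it. A minimal reproduction and fix (zlib is chosen just for illustration; it may not be the exact call failing in the app):

```python
import zlib

data = "hello"
try:
    zlib.compress(data)  # Python 3 requires bytes here, not str
except TypeError as e:
    print(e)  # a bytes-like object is required, not 'str'

# Fix: encode the str to bytes before passing it along
compressed = zlib.compress(data.encode('utf-8'))
print(zlib.decompress(compressed).decode('utf-8'))  # hello
```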


Try reverting the changes from: https://github.com/alexgreene/WikiQuiz/commit/9696fe29b413a6...

Please report back if that worked or not with an Issue on the repo, so I can follow up with a fix. Thanks!


Seems like a good idea, it would be great if there was a demo available.


Sorry for that, I wasn't releasing this as a product, and certainly didn't expect it to get this much attention! Maybe a v2 of this will have its own server :)


I had the same thought. The Flask app could be a pretty trivial deploy on the Heroku free tier.


That's interesting!


Really cool idea!

Unfortunately, I'm getting a 500 error on every request.

What did I do wrong?


I got this problem too.

I solved it by downloading 'averaged_perceptron_tagger' from nltk.

        >>> import nltk

        >>> nltk.download('averaged_perceptron_tagger')


[deleted]


I tried both 2.7.10 and 3.6.0.


I've just updated the instructions on the README, let me know if you still can't get it working after that.


Hey. Can't get it working. Still get a 500 error on every request. Followed all instructions on the README. Checked for the presence of averaged_perceptron_tagger and punkt, too.


I have the same issue with both 2.7.10 and 3.5.1 after following the steps in your readme.


Managed it by doing nltk.download('punkt')


Fixed it for me too, thanks. How did you debug the issue?


It needs a better example.


Can you make it a multiplayer game played on phones, tablets, watches, TVs, and cars? Okay, maybe not cars.


You never know, people are going to have a lot of free time in cars pretty soon.


People already play along with radio quizzes while driving, so make it voice controlled and it's no different than that.


It's certainly possible.


That picture is not that great an example; the answer is right in the question.



