Hacker News
Third Annual GitHub Data Challenge (github.com/blog)
82 points by geetarista on July 22, 2014 | 9 comments



This just made me discover the github archive.

  $ wget http://data.githubarchive.org/2014-07-21-{0..23}.json.gz
  ...
  Downloaded: 24 files, 129M in 30s (4.25 MB/s)
Cool. A day's worth of public events is 129MB compressed. That's surprisingly small! Let's play for a second.

  $ ls *.gz | xargs -P4 -n1 gunzip
  $ du -sch *.json
  ...
  807M	total
Time to break out jq: https://stedolan.github.io/jq/manual/

  $ time jq .type *.json | wc -l
  408218

  real	0m16.788s
  user	0m16.366s
  sys	0m0.325s
That's an easy amount of data to mess with. If a day is 16 seconds to process, I can do 14 years on my measly desktop in one day! 408k public records - around 5 a second. I somehow imagined events would flood into github even faster than that. I wonder what their public/private activity ratio is.
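For anyone who wants to check that back-of-the-envelope math, here is a minimal sketch using only the two figures printed above (408218 events, 16.788s of wall-clock time). The variable names are mine, and it assumes every day of the archive is about the same size as this one, which is almost certainly not true for earlier years.

```shell
# Figures copied from the jq run above.
events=408218        # lines printed by `jq .type`, one per event
secs_per_day=16.788  # "real" time to process one day of data

# Average event rate over a 86400-second day.
rate=$(awk -v e="$events" 'BEGIN { printf "%.2f", e / 86400 }')

# Days of archive that fit into one day of processing, expressed in years.
years=$(awk -v t="$secs_per_day" 'BEGIN { printf "%.1f", (86400 / t) / 365 }')

echo "events per second: $rate"
echo "years of archive processed per day: $years"
```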

Let's explore the event types:

  $ time jq .type *.json | sort | uniq -c | sort -n
      405 "PublicEvent"
      697 "TeamAddEvent"
     1018 "ReleaseEvent"
     1636 "MemberEvent"
     3166 "CommitCommentEvent"
     3892 "GollumEvent"
     6925 "DeleteEvent"
     7051 "PullRequestReviewCommentEvent"
    14807 "ForkEvent"
    18579 "PullRequestEvent"
    19919 "IssuesEvent"
    37942 "WatchEvent"
    38402 "IssueCommentEvent"
    46033 "CreateEvent"
   207746 "PushEvent"
Pushes dominate: about 10 pushes for every IssuesEvent (which counts issues being closed and reopened as well as opened).
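To turn those raw counts into shares, something like the following should work. It is only a sketch: `event_shares` is a helper name I made up, and it assumes the input looks exactly like the `uniq -c` output above (a count, then a quoted type name).

```shell
# Reads `uniq -c` output (COUNT "TypeName") on stdin and prints each
# count, its percentage of the total, and the type name, smallest first.
event_shares() {
    awk '{ count[$2] = $1; total += $1 }
         END {
             for (t in count)
                 printf "%8d %6.2f%% %s\n", count[t], 100 * count[t] / total, t
         }' |
    sort -n
}

# Ratio of the two counts listed above: PushEvent over IssuesEvent.
push_ratio=$(awk 'BEGIN { printf "%.1f", 207746 / 19919 }')
echo "pushes per IssuesEvent: $push_ratio"
```

Usage would be to swap the final `sort -n` of the pipeline above for the helper: `jq .type *.json | sort | uniq -c | event_shares`.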

This is probably more than enough for an HN comment. It'll be fun to see what people do with this stuff this year. :)


The Google BigQuery implementation of the archive can do such a query across all the data in seconds.

I wasn't aware until today that you could use BigQuery on a recently-updated data set, though.


I can confirm. That query took about 2 seconds. More discussion here: http://www.datatau.com/item?id=3608


I don't get why the first prize is a one-day course about data visualization. You already won the contest, which shows that you are knowledgeable about data visualization. What would a one-day course do for you?


The course is taught by Edward Tufte.


> you’re not participating from a country against which the United States has issued export sanctions or other trade restrictions, including Cuba, Iran, North Korea, the Sudan and Syria

Does it include Russia these days?


No, this is the official boycott list, I guess.


Why are the prizes for this competition so pathetic? It looks like a corporate morale budget where you skimp on pennies (I know for a fact of a big company with a $75 morale budget per person per year). Is this how much our time is worth? If execs at GitHub decided to throw mere pennies at developers to compete, the organizers could at least have been more creative, like giving out some cool designed T-shirts or something.

Top 3 winners get $200, $100, and $50


> Top 3 winners get $200, $100, and $50

Those are 2013 numbers. This year it's "all-expense paid trip to attend a one-day data visualization course", $500, and $250 for top 3 winners. But presumably this isn't meant to be done as a job.



