
This just made me discover the GitHub Archive.

  $ wget http://data.githubarchive.org/2014-07-21-{0..23}.json.gz
  ...
  Downloaded: 24 files, 129M in 30s (4.25 MB/s)
Cool. A day's worth of public events is 129MB compressed. That's surprisingly small! Let's play for a second.

  $ ls *.gz | xargs -P4 -n1 gunzip
  $ du -sch *.json
  ...
  807M	total
Time to break out jq: https://stedolan.github.io/jq/manual/

  $ time jq .type *.json | wc -l
  408218

  real	0m16.788s
  user	0m16.366s
  sys	0m0.325s
That's an easy amount of data to mess with. If a day of events takes 16 seconds to process, I can churn through about 14 years of them on my measly desktop in a single day! 408k public events - around 5 per second. I somehow imagined events would flood into GitHub even faster than that. I wonder what their public/private activity ratio is.
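Quick sanity check on those numbers (pure arithmetic, nothing fancy):

  $ echo "86400 / 16" | bc                # days of archive one wall-clock day can chew through
  5400
  $ echo "scale=1; 5400 / 365" | bc       # roughly how many years that is
  14.7
  $ echo "scale=1; 408218 / 86400" | bc   # public events per second
  4.7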

Let's explore the event types:

  $ time jq .type *.json | sort | uniq -c | sort -n
      405 "PublicEvent"
      697 "TeamAddEvent"
     1018 "ReleaseEvent"
     1636 "MemberEvent"
     3166 "CommitCommentEvent"
     3892 "GollumEvent"
     6925 "DeleteEvent"
     7051 "PullRequestReviewCommentEvent"
    14807 "ForkEvent"
    18579 "PullRequestEvent"
    19919 "IssuesEvent"
    37942 "WatchEvent"
    38402 "IssueCommentEvent"
    46033 "CreateEvent"
   207746 "PushEvent"
Pushes dominate - roughly 10 pushes for every issues event.
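The ratio is just the two counts above. And if you want to see how pushes spread over the day, something like the following should work - assuming the events carry an ISO-8601-ish top-level `created_at` field (true for the files I downloaded, but the archive format has changed over the years):

  $ echo "scale=1; 207746 / 19919" | bc
  10.4
  $ # pushes per hour of the day (characters 12-13 of the timestamp are the hour)
  $ jq -r 'select(.type == "PushEvent") | .created_at' *.json \
      | cut -c12-13 | sort | uniq -c
  ...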

This is probably more than enough for an HN comment. It'll be fun to see what people do with this stuff this year. :)




The Google BigQuery implementation of the archive can do such a query across all the data in seconds.

I wasn't aware until today that you could use BigQuery on a recently-updated data set, though.


I can confirm. That query took about 2 seconds. More discussion here: http://www.datatau.com/item?id=3608
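For anyone who wants to reproduce it, a query along these lines from the bq command-line tool should do it - the dataset/table name here is from memory, so double-check the current GitHub Archive BigQuery layout before running:

  $ # table name is a guess at the public githubarchive per-day tables
  $ bq query --use_legacy_sql=false \
      'SELECT type, COUNT(*) AS n FROM `githubarchive.day.20140721` GROUP BY type ORDER BY n DESC'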



