ghukill's comments

ghukill · on Jan 30, 2024

If interested in WARC, recommend also checking out WACZ: https://specs.webrecorder.net/wacz/1.1.1/

infogulch · on Jan 30, 2024

What's the point of WACZ? It appears to wrap a number of WARC files into a single zip, enabling Range requests to specific WARC files so it can be served by a passive file server. But why is that needed?

nikisweeting · on Jan 30, 2024

It's huge for being able to replay big WARC files in a browser without having to download the whole thing. (e.g. try loading a 700mb WARC from IPFS to visit one page within it, it's too slow to work as-is)

It's used extensively by the Browsertrix/Webrecorder.io projects (whose team pioneered the WACZ format) and a few other projects.

infogulch · on Jan 30, 2024

Oh I may have missed that part. So the WACZ (indexes?) can contain offsets into the WARC file itself for each individual page?

nikisweeting · on Jan 30, 2024

WACZ is a replacement for WARC that has the index with offsets built in.

infogulch · on Jan 30, 2024

But it uses warc files inside as the archive format. It seems weird to call it a replacement when the original is still present.

nikisweeting · on Jan 30, 2024

I just meant from a user's perspective it's a format that supersedes WARC. But internally, yes, one is an encapsulation format for the other.
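
For anyone following the exchange above, here is a rough sketch of the "index with offsets" idea in Python. It is not the WACZ or WARC format: `build_archive`, `fetch_record`, and the dict-based index are hypothetical names made up for illustration, and the byte slice stands in for an HTTP Range request against a passive file server. The only assumption carried over from the thread is that each record can be located by an offset and a length.

```python
import gzip
import io


def build_archive(records):
    """Concatenate independently gzipped records into one blob.

    Returns the blob plus an index mapping each URL to (offset, length).
    This is a toy layout, not the real WARC/WACZ on-disk format.
    """
    blob = io.BytesIO()
    index = {}
    for url, body in records.items():
        member = gzip.compress(body)  # each record is its own gzip member
        index[url] = (blob.tell(), len(member))
        blob.write(member)
    return blob.getvalue(), index


def fetch_record(blob, index, url):
    """Read one record using only its offset and length.

    The slice plays the role of a Range request for
    bytes=offset..offset+length-1, so the rest of the blob is never touched.
    """
    offset, length = index[url]
    chunk = blob[offset:offset + length]
    return gzip.decompress(chunk)


# Hypothetical sample data for the sketch.
records = {
    "https://example.com/": b"<html>home</html>",
    "https://example.com/about": b"<html>about</html>",
}
blob, index = build_archive(records)
print(fetch_record(blob, index, "https://example.com/about"))
```

The point is only that a small index lets a client pull one page's bytes out of a large archive without downloading the whole thing.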

ghukill · on Oct 2, 2021

>> "Hell, at this point, GPT-3 is probably a better approach to knowledge processing than trying to piece together something actionable from a half baked information graph born of old programmers' utopian fever dreams."

Greatest thing I've read on HN. As a librarian and developer, can confirm. At least in most cases...(slipping back into fever dream)....

ghukill · on June 12, 2019

I don't miss the assumption that all developers are "guys".

ghukill · on Nov 16, 2018

And one hundred thousand people finally just got a good night's sleep...

ghukill · on June 30, 2018

I would propose a potential user as someone interested in some of the meta considerations and patterns of statistical reasoning, aka machine learning. There is a vast number of particulars in how the second hand on my watch operates (e.g. vibrating quartz, digital), but I can use that mostly reliable device to investigate higher-level phenomena, like calculating the distance of planets by timing their movement. This library opens a direct line to these algorithms such that one might intuit, and apply, their high-level behavior. Just as I could not time planets if consumed with the fidelity and reliability of resonating quartz, it would slow my ability to explore this kind of reasoning if I were concerned with the minutiae.

That said, all points taken. If this sparks interest in someone, as it stands, it would be on them to dig into all the considerations you've outlined.

ghukill · on June 30, 2018

I love it. Pasted in the column headers to `iris.data` from the Iris website. Voila, up and running per instructions on GitHub. For prototyping / exploring ideas, for the syntactical layman, but conceptually familiar, what a boon.

ghukill · on June 21, 2017

Are there queries that SPARQL can perform over a triplestore that cannot be done with SQL over normalized data? Perhaps not.

But data normalization to that end is a moving target, while a bag of subject-predicate-object statements is quite doable. This, I believe, is a uniquely powerful characteristic of linked data / graph query languages and protocols.
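
A minimal Python sketch of the "bag of statements" point, assuming nothing beyond plain tuples. This is not SPARQL or any triplestore API; `match` and the sample triples are hypothetical, with `None` acting as a wildcard in the pattern.

```python
def match(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None means 'anything'."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Hypothetical data: a bag of (subject, predicate, object) statements.
triples = {
    ("book:1", "title", "Moby-Dick"),
    ("book:1", "author", "person:1"),
    ("person:1", "name", "Herman Melville"),
}

# A new kind of fact needs no schema change, just another statement.
triples.add(("book:1", "heldBy", "library:7"))

print(match(triples, s="book:1"))
print(match(triples, p="name"))
```

Adding the `heldBy` statement required no migration or new table, which is the part that stays a moving target when the same data is normalized for SQL.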

To that end, agree with the comment above that GraphQL is mighty exciting.

barakm · on June 21, 2017

+1 Insightful. In fact, there's research toward showing the two are equivalent in possibility space of what can be represented/queried (https://arxiv.org/abs/1102.1889)

But yes, linked data and graphs are super powerful once the data is triplified. Suddenly you have an abstraction above the contents of your data into the 'shape' of your data.

SPARQL and RDF aren't going away, but they're the academic thing that I and others are trying to make useful. GraphQL is scratching the surface, but it's super exciting that it's scratching at all, imo.

(Disclosure: Founded CayleyGraph, supporting the open source https://github.com/cayleygraph/cayley, which I maintain and mostly wrote)

linkmotif · on June 21, 2017

GraphQL, though, is a bit of a lie nomenclature-wise. As I've experienced it, it's got nothing much to do with graphs, at least not in the sense that SPARQL deals with triples that form a graph. In this department I am really interested in TinkerPop [0].

I would love, some day, to spend some more time with triple stores, RDF and semantic technologies.

[0] http://tinkerpop.apache.org/docs/current/reference/

handojin · on June 21, 2017

You might really enjoy datomic (www.datomic.com). Everything is stored as entity attribute value time and you query with a dialect of datalog. You can check out www.learndatalogtoday.org to get a flavor.
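
To give a feel for the "entity attribute value time" shape, here is a toy Python sketch. It is not Datomic's API or query language: `value_as_of` and the sample facts are hypothetical, time is just an integer, and retractions and everything else a real system handles are ignored.

```python
# Hypothetical facts as (entity, attribute, value, time) tuples.
facts = [
    ("alice", "email", "a@old.example", 1),
    ("alice", "email", "a@new.example", 5),
    ("bob", "email", "b@example.org", 3),
]


def value_as_of(facts, entity, attribute, t):
    """Return the most recent value asserted at or before time t, or None."""
    candidates = [
        (tx, v)
        for e, a, v, tx in facts
        if e == entity and a == attribute and tx <= t
    ]
    # Tuples compare by time first, so max() picks the latest assertion.
    return max(candidates)[1] if candidates else None


print(value_as_of(facts, "alice", "email", 4))
print(value_as_of(facts, "alice", "email", 5))
```

Because old facts are kept rather than overwritten, asking what was true at an earlier time is just a filter on the last column.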

tannhaeuser · on June 21, 2017

Datomic, though, isn't Datalog syntax at all.

I've got nothing against Datomic, but can't help but think learndatalogtoday is outright false advertising by trying to capture "Datalog" as an SEO term for a proprietary graph database which has nothing to do with Datalog/Prolog.

The point of Datalog is that it's a subset of Prolog syntax, implying that engines can be reasonably exchanged for one another. But this is only possible with real Datalog, or SPARQL for that matter.

linkmotif · on June 21, 2017

Prolog and Datalog are really high on my list. Thanks for the reminder!

Datomic, though... wish there was an OSS version or CE or something.

ghukill · on April 7, 2016

Let the record show - this is how people will "bookmark". It's bringing in marked up data from the page, effectively treating websites as little interesting nuggets of data. We don't save links to aggregators, we save links to articles, nuggets. The UI is clunky, but it'll get better. Hierarchy is toast, and doesn't scale, welcome to your bag-of-visited-memories-websites-past.