
How to identify large, complex works that exist in variable forms (markup or editing formats, HTML, PDF, ePub, .mobi, etc.), such that changes and correspondences between versions can be tracked, is a question I've been kicking around.

Chunk-based hashing could work. Mapping those chunks to document structure (paragraphs, sentences, chapters, ...) might make more sense.
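As a minimal sketch of structure-aware chunk hashing: split on paragraph boundaries, normalize whitespace (so a reflowed HTML extraction and a plaintext extraction of the same paragraph hash identically), and hash each chunk. The splitting rule and normalization here are illustrative assumptions, not a fixed scheme.

```python
import hashlib

def paragraph_hashes(text):
    """Split on blank lines, normalize whitespace, and hash each paragraph."""
    paras = (" ".join(p.split()) for p in text.split("\n\n"))
    return [hashlib.sha256(p.encode("utf-8")).hexdigest() for p in paras if p]

# Two extractions of the same work with different line wrapping / spacing:
a = "First paragraph.\n\nSecond   paragraph,\nreflowed."
b = "First paragraph.\n\nSecond paragraph, reflowed."
print(paragraph_hashes(a) == paragraph_hashes(b))  # True
```

Matching hash lists then indicate corresponding paragraphs across format variants; a finer or coarser unit (sentence, chapter) just changes the splitting rule.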



Yeah, that's an interesting question: how to parse the content into meaningful pieces and then hash them such that the content itself isn't exposed, but each hash can be mapped back to where that piece sat in an earlier version of the document.
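One way to sketch that: build an index from each chunk's hash to its position in the old version, storing only hashes, not content. Against a new version, each chunk either maps to its prior position(s) or comes up empty (new or edited). The chunk unit and normalization are assumptions for illustration.

```python
import hashlib

def chunk_hash(chunk):
    """Hash a whitespace-normalized chunk; the plaintext is never stored."""
    return hashlib.sha256(" ".join(chunk.split()).encode("utf-8")).hexdigest()

def chunk_index(chunks):
    """Map hash -> list of positions where that chunk appeared."""
    index = {}
    for i, c in enumerate(chunks):
        index.setdefault(chunk_hash(c), []).append(i)
    return index

old = ["alpha", "beta", "gamma"]
new = ["beta", "alpha", "delta"]
old_index = chunk_index(old)

for i, c in enumerate(new):
    prior = old_index.get(chunk_hash(c))
    print(i, "previously at", prior)
# 0 previously at [1]
# 1 previously at [0]
# 2 previously at None   (new or edited chunk)
```

Whoever holds only the index can answer "where was this chunk before?" for chunks they already possess, without learning any other chunk's text. (Short or guessable chunks are still vulnerable to dictionary attacks on plain hashes; a keyed hash would be needed for real secrecy.)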


Keep in mind that at the scale of a large work, some looseness may be sufficient. Identity and integrity are distinct: the former is largely based on metadata (author, title, publication date, assigned identifiers such as ISBN, OCLC, DOI, etc.), while integrity is largely presumed unless challenged.

I'm somewhat familiar with RDA, FRBR, and WEMI.

https://en.wikipedia.org/wiki/Functional_Requirements_for_Bi...





