
I worked with the Wikipedia category system a few years ago, and you could see the problems with hierarchical tagging systems right in action back then. (Though it may have gotten better in the meantime)

The system appeared simple: There were just two relations, "Article A is a member of category B" and "Category X is a subcategory of category Y".

However, in practice, the community was using this system to represent a whole host of wildly different relationships between items, often with different implications for what a category actually applied to.

E.g., if A has a subcategory B, this could mean one of several things: B might add a constraint to the items in A ("American writers" -> "19th century American writers"); the things in B might be more specific than the things in A ("Writers" -> "Novelists"); A might apply to the concept B rather than to the things in B ("Occupations" -> "Writers"); or A might refer to the category B itself ("Categories with more than 100 entries" -> "Writers"); and on and on...

Of course, those different aspects could even be combined. E.g., "Categories with more than 100 entries" might have a child "Categories with more than 100 entries in need of review", which represents a constraint but might itself contain fewer than 100 entries...

The basic question "Is item X in category Y" becomes impossible to answer in general, because there is no clear indication whether a category applies only to its direct children, to all of its descendants, or only to the subcategories themselves.
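To make the ambiguity concrete, here is a minimal sketch of a two-relation category graph. The function names and the toy data are hypothetical, not anything from MediaWiki; the point is only that "direct" and "transitive" membership give different answers, and that neither is right for every edge.

```python
from collections import defaultdict

# Hypothetical toy data; category names are taken from the examples above.
subcategories = defaultdict(set)  # parent category -> direct subcategories
members = defaultdict(set)        # category -> articles listed directly in it


def add_subcategory(parent, child):
    subcategories[parent].add(child)


def add_article(category, article):
    members[category].add(article)


def is_direct_member(article, category):
    """True only if the article is listed in the category itself."""
    return article in members[category]


def is_transitive_member(article, category):
    """True if the article is listed in the category or in any descendant."""
    seen, stack = set(), [category]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles in the category graph
        seen.add(current)
        if article in members[current]:
            return True
        stack.extend(subcategories[current])
    return False


add_subcategory("Occupations", "Writers")
add_subcategory("Writers", "Novelists")
add_article("Novelists", "Mark Twain")

print(is_direct_member("Mark Twain", "Writers"))          # False
print(is_transitive_member("Mark Twain", "Writers"))      # True
print(is_transitive_member("Mark Twain", "Occupations"))  # True, yet Twain is not an occupation
```

Following every subcategory edge works for "Writers" -> "Novelists" but gives a nonsensical answer across "Occupations" -> "Writers", and nothing in the data tells a program which edges are safe to follow.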

I'm sure there are sophisticated ontological systems which would allow users to specify all those different relationships separately. I'm also pretty sure that users would become sloppy after a short time or would disagree about which particular relationship to use in a particular situation...



There are only two kinds of relation here, “subset of” and “instance of” (aka “element of”, type-token).

The category-category relations are intended to always be a subset relation. The article-category relations are intended to always be an instance-of relation.

- “19th century American writers” is a subset of “American writers”.

=> Both are a category, so no problem.

- “Novelists” is a subset of “Writers”.

=> Both are a category, so no problem.

- “Writer(s)” is an instance of “Occupation”.

=> Here the problem is that “Writers” is a category. It would be okay if it was an article “Writer (occupation)”.

- “Writers” is an instance of “Categories with more than 100 entries”.

=> Here, again, the problem is that “Writers” is a category, and having an instance-of relation between categories is not an intended/supported use-case.

This could conceivably be solved by supporting an instance-of relation between categories, in addition to the existing subset (subcategory) relation. It could be called a meta-category relation. Then you could have the category of occupation categories.

Another way to put this is that categories have to be typed: a category contains either (just) articles, or it contains (just) categories. Subcategories then must match the type of their supercategories and correspondingly must contain either articles or categories.

Basically, Wikipedia’s type system is not expressive enough to allow everything people would want to express in it.


> Another way to put this is that categories have to be typed: a category contains either (just) articles, or it contains (just) categories.

The problem with using such a strict type system as a tagging-system is exacerbated by the cases where:

1. Someone adds an article to a category (they tag the article), but then wants to add a subcategory to that category. Now the category contains both articles and subcategories (violating the constraint). So the user would have to move all its articles into its subcategories for the constraint to be satisfied. This can be an enormous amount of work (needing to invent new subcategories for all the articles in the category that don't fit into the specific subcategory they had in mind).

2. Someone wants to add an article, but only has a vague idea of a super-category in which it would fit. Now they have to exhaustively crawl/navigate the tree of sub-categories until they find the leaf sub-categories which contain only articles, since those are the only places they could put it. The input barrier thus becomes high (which is antithetical to how people expect to use tags).


Re 1: I think you misunderstood. The restriction of either articles or categories is for the instance-of relation. (The subset-of relation is naturally restricted to categories to begin with.) If a category contains articles, it can’t also contain categories as elements (instance-of relation). But it can contain subcategories as subsets.

Analogy: The set of real numbers has the set of natural numbers as a subset, but it doesn’t have the set of natural numbers as an element, because the set of natural numbers is not a real number — the individual natural numbers are.

Likewise, the category “Occupations” may contain articles describing occupations, and it may have a subcategory “Clerical occupations” (subset-of relation), but it cannot contain the category “Writers” as an element (in an instance-of relation as with the articles), because writers are not a subset of occupations.

Furthermore, as an example of a meta-category, the category “Categories with more than 100 entries” may contain the category “Occupations” as an element (but not as a subcategory!), and hence cannot contain any articles as elements.

The element type of the category “Categories with more than 100 entries” is categories, and the element type of “Occupations” is articles. The point is that you can’t mix both types of elements within the same category. This is independent of subcategories. Any category can have subcategories, the only condition being that the subcategories must have the same element type as the supercategory.

The idea is that a category can have both subset-of and instance-of relations at the same time (and each relation needs to be marked as such in the system), but the instance-of relation is restricted to be either articles or categories, but not both.

Re 2: I believe that problem goes away, given the above.
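A rough sketch of the rule described above, with instance-of and subset-of kept as separately marked relations. The class and method names are made up for illustration, and it assumes an item counts as a category exactly when its name has been registered as one (so articles and categories never share a name).

```python
from dataclasses import dataclass, field

ARTICLE, CATEGORY = "article", "category"


@dataclass
class Category:
    name: str
    element_type: str                           # ARTICLE or CATEGORY
    elements: set = field(default_factory=set)  # instance-of relation
    subsets: set = field(default_factory=set)   # subset-of relation


class TypedCategorySystem:
    def __init__(self):
        self.categories = {}

    def add_category(self, name, element_type):
        self.categories[name] = Category(name, element_type)

    def add_element(self, category, item):
        """Instance-of: the item's kind must match the category's element type."""
        cat = self.categories[category]
        item_type = CATEGORY if item in self.categories else ARTICLE
        if item_type != cat.element_type:
            raise TypeError(
                f"{category!r} has element type {cat.element_type}, got a {item_type}"
            )
        cat.elements.add(item)

    def add_subcategory(self, parent, child):
        """Subset-of: parent and child must share the same element type."""
        p, c = self.categories[parent], self.categories[child]
        if p.element_type != c.element_type:
            raise TypeError(f"{child!r} and {parent!r} have different element types")
        p.subsets.add(child)


system = TypedCategorySystem()
system.add_category("Occupations", ARTICLE)
system.add_category("Clerical occupations", ARTICLE)
system.add_category("Writers", ARTICLE)
system.add_category("Categories with more than 100 entries", CATEGORY)

system.add_element("Occupations", "Writer (occupation)")           # article as element
system.add_subcategory("Occupations", "Clerical occupations")       # same element type
system.add_element("Categories with more than 100 entries", "Occupations")  # meta-category

try:
    system.add_element("Occupations", "Writers")  # a category as element of an article category
except TypeError as err:
    print("rejected:", err)
```

Note that "Occupations" ends up with both an article element and a subcategory, which is allowed; only mixing articles and categories as elements is rejected.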


The issue is that the system has nodes and edges, but no concept of distinct graphs. That leaves you trying to fit all notable human knowledge onto a single graph, which is non-optimal. Whether it’s also a DAG, tree, or something else doesn’t even matter.

Ontologies are like languages. There is no correct one. What matters is how good a fit it is for the problem at hand and that you’re all using the same one! If half the people are using Italian and half Spanish, it’s going to be a disaster. I wouldn’t use APL to write a UI and I wouldn’t architect a computer system in Shipibo.

Similarly, if I’m bird watching, “Birds of Northern California” is very useful. Organizing them by genus is less useful to me in that moment, but it’s not wrong.


I don't think you necessarily need multiple graphs; just labeled edges.


You just need some way to interact with it as multiple graphs. Some variation of labeled edges is probably the best.


In your examples, would the edges be like:

tagged_with_italian_tag vs. tagged_with_spanish_tag ?

tagged_with_genus_tag vs. tagged_with_geo_tag ?

Would that afford such multiple graphs?


There are a bunch of ways to do it. You could use the Entity-Attribute-Value model[0]. Then it's (California Quail, Region, California), (California Quail, Genus, Callipepla). You could do relational tables, with a through table for each taxonomy. Or one through table with tags, which is like what your comment describes.

[0] https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80...
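As a small illustration of the EAV idea, assuming a plain in-memory list of rows (the helper names and the second bird are added for the example, not part of the comment): each attribute acts as its own labeled edge type, so you can pull out one "graph" per taxonomy from the same data.

```python
# (entity, attribute, value) rows. The first two come from the comment above;
# the Gambel's Quail rows are added only to make the grouping visible.
triples = [
    ("California Quail", "Region", "California"),
    ("California Quail", "Genus", "Callipepla"),
    ("Gambel's Quail", "Region", "Arizona"),
    ("Gambel's Quail", "Genus", "Callipepla"),
]


def entities_with(attribute, value):
    """All entities carrying the given attribute/value pair."""
    return {e for (e, a, v) in triples if a == attribute and v == value}


def view(attribute):
    """Group entities by a single attribute: one 'graph' per taxonomy."""
    groups = {}
    for e, a, v in triples:
        if a == attribute:
            groups.setdefault(v, set()).add(e)
    return groups


print(entities_with("Region", "California"))  # the bird-watching view
print(view("Genus"))                          # the taxonomic view
```

The same rows answer both the "Birds of Northern California"-style question and the genus question without either organization being baked into the storage.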


Isn't this literally just saying we need another layer of categorization on top of the categorization layer?


It’s saying you need support for multiple types of categories. You could use the same system to organize itself. No need for a meta layer.


Perhaps "adjacent to" rather than "on top of"? I've started looking at this kind of problem in terms of DB queries or set relations. Even "organization" can be a set relation if there are the right bits of metadata in place.


The problem might not be with hierarchical tagging systems, but with the specific hierarchical tagging system they use at Wikipedia.

Imagine another system with the following categories:

* People:ByOccupation:Creative:Writers

* Time:CommonEra:ByCentury:19

* Location:Earth:Americas:NorthAmerica:USA

In this scheme of things, e.g. Mark Twain would be tagged with all three. "19th century American writers" (which includes Mark Twain) would not be a category but a saved search. (Other saved searches — which would also include Mark Twain — would be "19th century people from Americas" or "Stuff from Planet Earth").
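Roughly, assuming tags are colon-separated paths and a saved search is just a stored list of path prefixes (the function names and the Charles Dickens entry with its Location path are made up for contrast):

```python
# Hypothetical tag store: item -> set of hierarchical tags.
tags = {
    "Mark Twain": {
        "People:ByOccupation:Creative:Writers",
        "Time:CommonEra:ByCentury:19",
        "Location:Earth:Americas:NorthAmerica:USA",
    },
    "Charles Dickens": {
        "People:ByOccupation:Creative:Writers",
        "Time:CommonEra:ByCentury:19",
        "Location:Earth:Europe:UK",  # assumed path, not from the comment
    },
}


def has_tag(item_tags, prefix):
    """A tag matches if it equals the prefix or sits below it in the hierarchy."""
    return any(t == prefix or t.startswith(prefix + ":") for t in item_tags)


def saved_search(*prefixes):
    """Return a callable that lists items matching every prefix."""
    def run():
        return sorted(
            name
            for name, item_tags in tags.items()
            if all(has_tag(item_tags, p) for p in prefixes)
        )
    return run


nineteenth_century_american_writers = saved_search(
    "People:ByOccupation:Creative:Writers",
    "Time:CommonEra:ByCentury:19",
    "Location:Earth:Americas:NorthAmerica:USA",
)
nineteenth_century_people_from_americas = saved_search(
    "Time:CommonEra:ByCentury:19", "Location:Earth:Americas"
)
stuff_from_planet_earth = saved_search("Location:Earth")

print(nineteenth_century_american_writers())      # ['Mark Twain']
print(nineteenth_century_people_from_americas())  # ['Mark Twain']
print(stuff_from_planet_earth())                  # ['Charles Dickens', 'Mark Twain']
```

Matching on `prefix + ":"` rather than a bare string prefix keeps a search for century "1" from accidentally matching century "19".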


Suppose you have someone who did a bunch of writing in America, then moved to Europe and became famous as an inventor there. Under your proposal, this person has both Location:Europe and Location:USA, and both Occupation:Writer and Occupation:Inventor. They therefore show up for queries for European writers and American inventors, neither of which was intended; I bet we can come up with situations where the false positives are even worse. The presence of those tags has to be interpreted in light of each other.

If you do this naively I think it's pretty clear you've either sacrificed expressivity or made the system a LOT more complicated/harder to understand. At best you end up with some kind of product structure in (what is no longer just) the set of tags on an article. You can think of explicating an implicit product structure in joining "American" and "Writer" in the same object. But I think if you've started talking about compound tags, you're really talking about something other than a tagging system.


I think the only feature you need to express this is to be able to limit a tag to the context of another tag. It's slightly different from a compound tag because each tag can still be used independently.

I experimented with a system like this recently[0] that used two different tag notations that seemed to make the mixing more intuitive. I didn't have enough time to iterate on it further or build it more seriously, but I think there is potential in this area.

[0] https://youtu.be/bi3YkY7UKmM

----

To be more specific, your entity could be tagged as Location:Europe in the context of Occupation:Inventor, and Location:USA in the context of Occupation:Writer. I still think the entity should match queries for any of the 4 tags, it just shouldn't match a query {Occupation:Inventor/Location:USA}
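One possible reading of that, as a sketch only: store each scoped tag under the tag that provides its context. The data layout and function names here are assumptions for illustration, not how the linked prototype works.

```python
# Hypothetical layout: entity -> {context tag -> tags that only apply in that context}.
contextual_tags = {
    "Example Person": {
        "Occupation:Writer": {"Location:USA"},
        "Occupation:Inventor": {"Location:Europe"},
    },
}


def matches_tag(entity, tag):
    """Plain query: any tag matches, whether it is a context or a scoped tag."""
    contexts = contextual_tags[entity]
    return tag in contexts or any(tag in scoped for scoped in contexts.values())


def matches_in_context(entity, context, tag):
    """Scoped query {context/tag}: the tag must be attached to that context."""
    return tag in contextual_tags[entity].get(context, set())


person = "Example Person"
for tag in ("Occupation:Writer", "Occupation:Inventor", "Location:USA", "Location:Europe"):
    print(tag, matches_tag(person, tag))  # all four match individually

print(matches_in_context(person, "Occupation:Inventor", "Location:Europe"))  # True
print(matches_in_context(person, "Occupation:Inventor", "Location:USA"))     # False
```

This layout does bake in a direction (Location scoped under Occupation), which is a design choice of the sketch rather than something the idea strictly requires.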


That introduces a dependency, or at least ordering, between the concepts of Location and Occupation that I'm not sure should exist, much less which direction it should point. It works for baking:skilled because skill level is inherently part of the property of being a baker, and skill is undefined without a thing to be skilled at, whereas someone can easily reside in a location with no occupation ({Location:USA/Occupation:Layabout}?) or have an occupation with unknown/unfixed location ({Occupation:Inventor/Location:Nomad}?).

And if you try to create a synthetic context to place both tags under, you get... compound tags, or close enough as makes no difference to me. :) I'd 1000x rather start from there, and special case it to return results for each tag individually, than start introducing spurious orderings or dependencies. (ed: maybe it would be clearer to say "composite tags", as in tags composed out of other tags?)


Perhaps I misunderstood the example but I thought the point was precisely that there is a dependency between Occupation and Location which individual tags cannot express...?


> I'm sure there are sophisticated ontological systems which would allow users to specify all those different relationships separately. I'm also pretty sure that users would become sloppy after a short time or would disagree which particular relationship to use in a particular situation

I think the problem is allowing users to freely tag, then. There should be easily accessed guidelines about how each tag should be used, and people who are constantly moving them, correcting them, and updating usage guidelines.

We need the ability to implement governance systems on top of web 2.0+ style content systems. People should be able to vote for representatives (with any number of voting systems), create committees, submit changes to be voted on, etc. Instead we usually work based on hierarchical dictatorships or imagined consensus. People need organizational management tools baked into software, because organization of information depends on it. Instead of proposing a new committee to come up with the schema of everything, we need better tools that enable users to build committees.


The fundamental tension in tagging systems, to me, is whether tagging is a feature the software offers to the user or a task the user performs to assist the software.

In the first case, you want freewheeling and tolerate ontological inconsistencies because you want to offer flexibility to users and will capture hard to quantify emergent benefits (some made up examples: "try the tag user233-favorite, I keep discovering awesome articles!", "the physicist-needed tag has highlighted a lot of misinformation surrounding quantum physics and relativity"). People use it to the extent it is useful.

The other way, with formal semantics, governance (which you made some very wise points about), etc allows the software to reply to queries like "19th-century + Missouri + humorists" in a performant and authoritative way. It's not really a feature so much as it is a way to enable other features.


I recently ran into the same kind of problem in Wikidata.

https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Onto...

A typical problem is of the "light rail (Q1268865) is data visualization (Q6504956)" kind. This specific one is fixed, but there are many similar ones.

https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/...

https://www.wikidata.org/wiki/Wikidata:Project_chat#Ontology...


Many comments below are hinting at - but not naming - triplestores. "A has relationship X with B". This is how wikidata works.

Learning about those and learning how to query wikidata just blew my mind.


If it isn't too much trouble, I would love to see an example of a particularly complex query that can be done on top of this...Pseudocode or just a text description is fine, it doesn't have to be precise syntax.


I was toying with the first world war at the time. You could query famous soldiers born in the same home town as the current president, for example.
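To give a feel for the shape of such a query without real SPARQL, here is a toy in-memory triple store. Everything in it is hypothetical: the names, the predicate labels (these are not Wikidata's property IDs), and the `match` helper.

```python
# Hypothetical "A has relationship X with B" triples.
triples = {
    ("Soldier A", "occupation", "soldier"),
    ("Soldier A", "place_of_birth", "Town X"),
    ("Soldier B", "occupation", "soldier"),
    ("Soldier B", "place_of_birth", "Town Y"),
    ("President P", "position_held", "president"),
    ("President P", "place_of_birth", "Town X"),
}


def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        (ts, tp, to)
        for (ts, tp, to) in triples
        if s in (None, ts) and p in (None, tp) and o in (None, to)
    ]


# Step 1: who holds the office, and where were they born?
presidents = {s for s, _, _ in match(p="position_held", o="president")}
hometowns = {o for pres in presidents for _, _, o in match(s=pres, p="place_of_birth")}

# Step 2: soldiers born in any of those towns.
soldiers = {s for s, _, _ in match(p="occupation", o="soldier")}
same_hometown = sorted(
    s
    for s in soldiers
    if any(match(s=s, p="place_of_birth", o=town) for town in hometowns)
)

print(same_hometown)  # ['Soldier A']
```

A real triplestore would express both steps as one query with shared variables, but the join on the birthplace value is the same idea.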


> I'm sure there are sophisticated ontological systems which would allow users to specify all those different relationships separately. I'm also pretty sure that users would become sloppy after a short time or would disagree which particular relationship to use in a particular situation...

You might be interested in SNOMED CT, a way to describe medical concepts. It does something rather similar.


I encountered the same problem a few years ago and indeed realized that using categories to understand what type of article a thing was (person? subject? event?) was utterly useless, for the reasons you describe.

On the other hand, I discovered that infoboxes (the data in the top-right box on most pages) were generally extremely reliable, if frustrating to parse.


The infoboxes are created from a query to Wikidata, which you can query yourself! No scraping necessary! https://query.wikidata.org/

You'll want to learn SPARQL, but if you know SQL it's not so bad to pick up.


As far as I can tell, that is not the case, sadly.

Right now it appears that only 3,975 articles have infoboxes auto-generated from Wikidata. [1] The wikitext contains something like "{{Wikidata Infobox ...}}" instead of just "{{Infobox ...}}".

If you look up a popular article like Barack Obama [2], it's just a traditional hand-edited infobox. In fact, one of the first lines of data says "Vice President = Joe Biden", while the Wikidata entry for Barack Obama [3] doesn't reference Biden anywhere -- so not only is the Wikipedia infobox not generated from Wikidata, but Wikidata isn't pulling all the relevant info from Wikipedia either.

Back when I had been working on my project, I'd hoped Wikidata could be a solution but it was far too incomplete and information was regularly out of date. Perhaps (hopefully) it's better now, but it's clearly not being used to power infoboxes yet except in a tiny number of cases. (Which actually complicates things more now, since anybody parsing Wikipedia infoboxes now has to deal separately with the 3,975 ones that grab from Wikidata, since none of the actual data is copied over into the wikitext...)

[1] https://en.wikipedia.org/wiki/Category:Articles_with_infobox...

[2] https://en.wikipedia.org/wiki/Barack_Obama

[3] https://www.wikidata.org/wiki/Q76


For sure, thank you for the correction. I was under the impression its role was broader.


Wikidata is not a solution at all.

I recently ran into the same kind of problem in Wikidata.

https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Onto...

A typical problem is of the "light rail (Q1268865) is data visualization (Q6504956)" kind. This specific one is fixed, but there are many similar ones.

https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/...

https://www.wikidata.org/wiki/Wikidata:Project_chat#Ontology...


> "Categories with more than 100 entries" might have a child "Categories with more than 100 entries in need of review"

This should have been specified by 2 separate tags: "Categories with more than 100 entries" and "In need of review".


("Occupations" -> "Writers") seems wrong. Why would you do this? Same for ("Categories with more than 100 entries" -> "Writers").

This seems like trying to put a tag on a category entity instead of creating a tag hierarchy.

Those 2 should be stored using different relationship type mechanisms.

("categoryTag", <SourceTag>, <DestinationTag>)

ex: ("categoryTag", "Occupations", "Writers")

and

("parentTag",<tagName1>, <tagName2>)

ex: ("parentTag", "American writers", "19th century American writers")


Indeed. It's a bit like if a programming language was trying to represent base classes and meta classes using the same mechanism.

My guess is that no one realized the need for "meta" categories when the system was implemented, so later the existing hierarchy was simply co-opted instead of implementing a new functionality for that use case.

As long as the categories are only used by human editors and use is only within some small subcommunity, it can work quite well. The problem starts if you want to combine categories used by different communities or if you (or your program) lack the domain knowledge to understand which nodes represent "meta" categories.

As another poster said, the better approach to using Wikipedia data for automated processing is to use infoboxes or the explicitly machine-readable Wikidata repository. The category system looks machine-readable at first glance but really isn't.


An exactly analogous problem exists in the Collections hierarchy at the Internet Archive, of uploaded/digitized material (not the Wayback Machine web captures).

A single graph is applied locally with very different semantics; and absent a distinct tagging system, collection membership is sometimes used to mark material for treatment in some way.


Clearly the solution to all of this would be the category of all those categories that do not contain themselves.



