Wednesday, September 29, 2021

Tags Are A Bad Data Model

We've all seen tags, right? Twitter, Instagram, Steam, Stack Overflow, Bandcamp, just about every blogging engine or CMS... they all have tags. Sometimes they seem useful, but mostly not. Why is that?

Tags are a fundamentally bad data model, because they offer exactly one extremely weak semantic.

Real quick, what do tags look like? Here's a basic relational implementation. Let's assume we have a post table, and we want to have tags and be able to associate tags with posts.

create table tag
(
  tag_id  serial        primary key
, name    varchar(200)  not null
);

create table post_tag_map
(
  post_tag_map_id  serial  primary key
, post_id          int     not null
, tag_id           int     not null
);

create unique index uidx_posttagmap_postidtagid on post_tag_map(post_id, tag_id);

There, those are the absolute basics. You would want other stuff like foreign keys, more indexes, created timestamps, and maybe which user created a tag or association, but this is the core of the model.

What can we do with this? Well, we can enumerate all tags that a post has, so that we can show them. Or, given a tag, we can enumerate all posts that have that tag. And... that's it.
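Here is a minimal sketch of those two lookups. It is not part of the original schema: it uses Python's built-in sqlite3 module with an in-memory SQLite database standing in for PostgreSQL, and the sample rows and the helper names `tags_for_post` and `posts_for_tag` are made up for illustration.

```python
import sqlite3

# Assumption: SQLite in memory stands in for the PostgreSQL tables above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table post (post_id integer primary key, name text not null);
    create table tag (tag_id integer primary key, name text not null);
    create table post_tag_map (
        post_tag_map_id integer primary key,
        post_id int not null,
        tag_id int not null,
        unique (post_id, tag_id)
    );
    -- Hypothetical sample data.
    insert into post (post_id, name) values (1, 'First post'), (2, 'Second post');
    insert into tag (tag_id, name) values (1, 'fiction'), (2, 'humor');
    insert into post_tag_map (post_id, tag_id) values (1, 1), (1, 2), (2, 2);
""")


def tags_for_post(conn, post_id):
    """Given a post, list the names of its tags."""
    rows = conn.execute(
        """select t.name
             from post_tag_map ptm
             join tag t on t.tag_id = ptm.tag_id
            where ptm.post_id = ?
            order by t.name""",
        (post_id,),
    )
    return [r[0] for r in rows]


def posts_for_tag(conn, tag_name):
    """Given a tag name, list the names of the posts that carry it."""
    rows = conn.execute(
        """select p.name
             from tag t
             join post_tag_map ptm on ptm.tag_id = t.tag_id
             join post p on p.post_id = ptm.post_id
            where t.name = ?
            order by p.post_id""",
        (tag_name,),
    )
    return [r[0] for r in rows]


print(tags_for_post(conn, 1))
print(posts_for_tag(conn, "humor"))
```

Both functions are a join and a filter. There is nothing else the model can be asked.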

To see how weak the semantic really is, let's imagine doing some basic analytics on our tags. How would we summarize the tagging of posts?

select p.post_id
     , p.name
     /* coalesce: a post with no tags gets null from the left joins, not 0 */
     , coalesce(max((t.name = 'fiction')::int), 0) as TAG_FICTION
     , coalesce(max((t.name = 'mysticism')::int), 0) as TAG_MYSTICISM
     , coalesce(max((t.name = 'politics')::int), 0) as TAG_POLITICS
     , coalesce(max((t.name = 'wtf')::int), 0) as TAG_WTF
     , coalesce(max((t.name = 'statistics')::int), 0) as TAG_STATISTICS
     , coalesce(max((t.name = 'humor')::int), 0) as TAG_HUMOR
     /* many more... */
  from post p
  left join post_tag_map ptm on (p.post_id = ptm.post_id)
  left join tag t on (ptm.tag_id = t.tag_id)
 group by p.post_id
        , p.name
;

That's the best we can do: a seemingly endless bit array of 1/0 (or true/false, if you like) flags showing whether each post has each tag. If new tags are added, we need to adjust our query (and our table, if we store these results for easy use).
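One way around editing the query by hand is to generate the flag columns from the tag table itself. The sketch below is only an illustration: it uses Python's sqlite3 module with an in-memory SQLite database in place of PostgreSQL, and the sample rows and the `tag_flags` helper are hypothetical. Note that the shape of the result still changes every time someone adds a tag.

```python
import sqlite3

# Assumption: SQLite in memory stands in for the PostgreSQL tables above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table post (post_id integer primary key, name text not null);
    create table tag (tag_id integer primary key, name text not null);
    create table post_tag_map (
        post_id int not null,
        tag_id int not null,
        unique (post_id, tag_id)
    );
    -- Hypothetical sample data: one tagged post, one untagged post.
    insert into post (post_id, name) values (1, 'First post'), (2, 'Untagged post');
    insert into tag (tag_id, name) values (1, 'fiction'), (2, 'humor');
    insert into post_tag_map (post_id, tag_id) values (1, 2);
""")


def tag_flags(conn):
    """Return {post_id: {tag_name: 0 or 1}} with one flag per existing tag."""
    tag_names = [r[0] for r in conn.execute("select name from tag order by name")]
    # One generated column per tag. Tag names are bound as parameters,
    # never pasted into the SQL text.
    flag_columns = "".join(", coalesce(max(t.name = ?), 0)" for _ in tag_names)
    sql = f"""
        select p.post_id{flag_columns}
          from post p
          left join post_tag_map ptm on ptm.post_id = p.post_id
          left join tag t on t.tag_id = ptm.tag_id
         group by p.post_id
         order by p.post_id
    """
    return {row[0]: dict(zip(tag_names, row[1:]))
            for row in conn.execute(sql, tag_names)}


flags = tag_flags(conn)

# A user adds a tag at runtime, and the "table" silently grows a column.
conn.execute("insert into tag (tag_id, name) values (3, 'wtf')")
flags_after_new_tag = tag_flags(conn)
```

SQLite caps the number of bound parameters per statement, so this particular trick stops working once there are enough tags, which is rather the point.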

No tag ever conflicts with any other tag. If we have tags for "red" and "blue" and "green", a post can have all of them. If we have tags for "fiction" and "non-fiction" a post can have both of them. Remember, each one is just a flag, and they are all independent of one another.
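To make that concrete, here is a small sketch, again using Python's sqlite3 module with an in-memory SQLite database as a stand-in for PostgreSQL. The rows are hypothetical. The only thing the schema ever refuses is the same tag applied twice to the same post.

```python
import sqlite3

# Assumption: SQLite in memory stands in for the PostgreSQL tables above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table tag (tag_id integer primary key, name text not null);
    create table post_tag_map (
        post_id int not null,
        tag_id int not null,
        unique (post_id, tag_id)
    );
    insert into tag (tag_id, name) values (1, 'fiction'), (2, 'non-fiction');
""")

# Tag post 1 as both 'fiction' and 'non-fiction'. Nothing in the schema objects.
conn.execute("insert into post_tag_map (post_id, tag_id) values (1, 1)")
conn.execute("insert into post_tag_map (post_id, tag_id) values (1, 2)")

contradictory = [r[0] for r in conn.execute(
    """select t.name
         from post_tag_map ptm
         join tag t on t.tag_id = ptm.tag_id
        where ptm.post_id = 1
        order by t.name""")]

# The unique index only rejects an exact duplicate (same post, same tag).
try:
    conn.execute("insert into post_tag_map (post_id, tag_id) values (1, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(contradictory, duplicate_rejected)
```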

In fact, we can describe our original data model a different way...

create table post_tag
(
  post_id         int      primary key
, tag_fiction     boolean  not null
, tag_mysticism   boolean  not null
, tag_politics    boolean  not null
, tag_wtf         boolean  not null
, tag_statistics  boolean  not null
, tag_humor       boolean  not null
/* many more... */
);

...where the ability to add columns -- at runtime! -- has been delegated to users, whether that be end-users or admins.

That's all the tags data model is. An infinity of boolean flags. No categories. No hierarchies. No key/value pairs or additional detail. This is all you get.

That is an incredibly weak semantic! It's terrible!

It only ever works where you, the system designer, fundamentally have no knowledge of what kind of meaning your users might want to impute to Things in your system, and never will.

It works on Twitter because the breadth of topics that get discussed on Twitter is up to Twitter users, and changes constantly. If someone wants to try to create a Schelling point around some topic by using a #hashtag, they can. Maybe it'll catch on and maybe it won't. Maybe the word or phrase they chose is awkward or ambiguous or otherwise fails to communicate meaning. Maybe it's disingenuous or an outright lie. It's (arguably) not up to Twitter to manage this, and given the speed at which trending topics morph there's really no way they could.

It works on Bandcamp because artists choose their tags, and they choose them with purpose in mind: self-identifying with genres or styles, to aid discoverability.

It works on Steam for similar reasons. Users tag games to help other users find games they might like. It helps that users have coalesced around a fairly finite set of popular tags which doesn't change much over time, and the Steam UI highlights only the most popular tags (which in turn relies on having a big, engaged user base).

It probably won't work very well on your blog. Absent any conceptual framework, you'll struggle to think of what tags each post should have. The tags I used in the example are real, taken from a blog I've read for years. They don't make a ton of sense, and have never been useful for... anything.

It definitely won't work within your business software, because you need stronger semantics! You need categories, and hierarchies, and sets of mutually-exclusive/collectively-exhaustive values. Maybe some flags, sure, but specific flags with specific meanings. And all of this needs to be designed by the folks that build the software, not left up to users (at least not by default).
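As a rough illustration of what stronger semantics can look like, here is a sketch where the designer, not the user, decides the categories. Everything in it is an assumption made for the example: the `genre` table, its `parent_genre_id` hierarchy, the `is_draft` flag, and the sample rows. It uses Python's sqlite3 module with an in-memory SQLite database rather than PostgreSQL, so details such as the foreign key pragma are SQLite-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite does not enforce FKs by default
conn.executescript("""
    -- Hypothetical category table: a fixed list of values chosen by the designer.
    create table genre (
        genre_id integer primary key,
        name text not null unique,
        parent_genre_id int references genre(genre_id)  -- optional hierarchy
    );
    create table post (
        post_id integer primary key,
        name text not null,
        -- Exactly one genre per post: mutually exclusive, and never missing.
        genre_id int not null references genre(genre_id),
        -- A specific flag with a specific meaning.
        is_draft int not null default 0 check (is_draft in (0, 1))
    );
    insert into genre (genre_id, name, parent_genre_id) values
        (1, 'fiction', null), (2, 'non-fiction', null), (3, 'mystery', 1);
    insert into post (post_id, name, genre_id) values (1, 'First post', 3);
""")


def try_insert(sql, params):
    """Return True if the insert is accepted, False if a constraint rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


# A post must have a genre, and it must be one the designer defined.
no_genre_ok = try_insert(
    "insert into post (post_id, name) values (?, ?)", (2, "Second post"))
unknown_genre_ok = try_insert(
    "insert into post (post_id, name, genre_id) values (?, ?, ?)",
    (3, "Third post", 99))

# Hierarchy: walk from a post's genre up to its root category.
path = [r[0] for r in conn.execute("""
    with recursive ancestry(genre_id, name, parent_genre_id, depth) as (
        select g.genre_id, g.name, g.parent_genre_id, 0
          from post p
          join genre g on g.genre_id = p.genre_id
         where p.post_id = 1
        union all
        select g.genre_id, g.name, g.parent_genre_id, a.depth + 1
          from genre g
          join ancestry a on g.genre_id = a.parent_genre_id
    )
    select name from ancestry order by depth
""")]

print(no_genre_ok, unknown_genre_ok, path)
```

Unlike the tag tables, this schema can say no: a post cannot be both fiction and non-fiction, cannot skip having a genre, and cannot invent a new one at runtime.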