A few thoughts on content categorization. No surprises there: less is more.

Since I started collecting bookmarks on Delicious, I've put a lot of effort into categorizing them, organizing them so that browsing is as simple as possible. The service supports two-level categorization (tag – bundle), which helps to keep the massive number of links people gather under control. But it's the experimentation with different structures that gives real insight into content categorization, and because this topic has already been mentioned and discussed a few times on this blog, it deserves a special mention. Let's begin.

Categories vs. Tags

Observing other blogs, I've noticed that a lot of them use both Categories and Tags. While I can understand the SEO (Search Engine Optimization) benefit of having as many different entry points (landing pages) as possible, I don't see any other added value in using both. From a logical point of view, they do the same thing (categorize content), just on a different level. Here's where tag bundles come in handy. With my bookmarks, I use tag bundles such as Wibe, Science, Brands, Work, etc., to combine different tags into groups according to their qualities. And aren't Categories and Tags just another form of the same thing, just two different tag bundles? Perhaps not, but that doesn't change the fact that one of them is probably redundant.

I still see cases where Categories are used as single items (one post is filed under one category), while Tags are always used as multiple items (one post can have many tags). This corresponds to the technical 1:N and M:N database relationships, and even though the second is a bit more complex to create and maintain, it provides much more flexibility. Hierarchy vs. matrix.
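
To make the hierarchy vs. matrix difference concrete, here is a minimal Python sketch. The post ids, categories and tags are made-up sample data, and `posts_with_all` is a hypothetical helper, not something Delicious or any blog engine provides. Each post holds exactly one category (1:N) but a set of tags (M:N), and an inverted index makes intersections cheap to query.

```python
from collections import defaultdict

# Hypothetical sample data: one category per post (1:N),
# any number of tags per post (M:N).
posts = {
    "post-1": {"category": "Marketing", "tags": {"marketing", "twitter"}},
    "post-2": {"category": "Marketing", "tags": {"marketing", "seo"}},
    "post-3": {"category": "Software", "tags": {"twitter", "bots", "software"}},
}

# Inverted index: tag -> set of ids of the posts carrying that tag.
tag_index = defaultdict(set)
for post_id, post in posts.items():
    for tag in post["tags"]:
        tag_index[tag].add(post_id)


def posts_with_all(*tags):
    """Return the ids of posts that carry every one of the given tags."""
    sets = [tag_index.get(tag, set()) for tag in tags]
    return set.intersection(*sets) if sets else set()


print(posts_with_all("marketing", "twitter"))  # {'post-1'}
```

With a single category per post, a question like "marketing + twitter" cannot even be asked; with tags it is one set intersection.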

Less is more, and intersections rock

Another thing I've noticed is that people use a lot of different tags. Too many to handle. I try to keep the number of tags as low as possible, working with intersections of tags (e.g. marketing + twitter) rather than with highly specific tags that are used only a few times. I made a quick calculation of how this works, estimating a model with 10.000 contents and 200 tags, which corresponds with my situation on Delicious:

10.000 contents, 200 unique tags, average 5 tags per content
10.000 contents * 5 tags = 50.000 total tags
50.000 total tags / 200 unique tags = 250 occurrences of each tag (contents per tag)
5/200 probability of first tag * 4/199 probability of second tag = 1/1.990 (0,0005) probability of two specific tags on a single content
or 200! / (2! * (200-2)!) = 19.900 unique combinations of two tags; one bookmark with 5 tags contains 10 pairs of tags, making a specific combination's probability 10/19.900 = 1/1.990
1/1.990 * 3/198 = 1/131.340 (0,0000076) probability of three specific tags on a content
Result: on average, 5 contents out of 10.000 will contain two desired tags, and only about 0,08 will contain three
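
The arithmetic above can be re-checked with a short Python sketch. It uses exact fractions so no rounding creeps in, and it keeps the same assumption as the model: tags are spread evenly. The constant names are only illustrative.

```python
from fractions import Fraction
from math import comb

ITEMS = 10_000        # bookmarks ("contents") in the model
UNIQUE_TAGS = 200     # distinct tags in use
TAGS_PER_ITEM = 5     # average number of tags on one bookmark

# Step-by-step product, as in the calculation above.
p_two = Fraction(5, 200) * Fraction(4, 199)
p_three = p_two * Fraction(3, 198)

# Same result via combinations: pairs on one bookmark / all possible pairs.
pairs_total = comb(UNIQUE_TAGS, 2)        # 19900
pairs_per_item = comb(TAGS_PER_ITEM, 2)   # 10
assert Fraction(pairs_per_item, pairs_total) == p_two

print(p_two)                              # 1/1990
print(p_three)                            # 1/131340
print(round(float(ITEMS * p_two), 2))     # 5.03
print(round(float(ITEMS * p_three), 2))   # 0.08
```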

The model is built on the assumption that all tags are spread evenly, which is far from reality, but you get the picture: the number of contents with multiple specific tags is pretty low. But if you lower the number of unique tags (e.g. 150 tags instead of 200 would raise the number of contents with a pair of tags from 5 to 8,9) or use the same tags more often (e.g. 6 instead of 5 tags per content would raise the number from 5 to 7,5), the results improve noticeably. Basic mathematics is a powerful tool, and intersections with two, three or more tags are definitely the way to go.
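
For playing with other numbers, the model can be wrapped in one small function. This is a sketch under the same even-spread assumption; `expected_matches` and its parameter names are hypothetical, chosen only for this example.

```python
from math import comb


def expected_matches(items, unique_tags, tags_per_item, wanted):
    """Expected number of items carrying `wanted` specific tags,
    assuming every tag is equally likely (the even-spread model)."""
    if wanted > tags_per_item:
        return 0.0
    # Tag sets that contain the wanted tags, divided by all possible tag sets.
    p = comb(unique_tags - wanted, tags_per_item - wanted) / comb(unique_tags, tags_per_item)
    return items * p


print(round(expected_matches(10_000, 200, 5, 2), 1))  # 5.0  (baseline)
print(round(expected_matches(10_000, 150, 5, 2), 1))  # 8.9  (fewer unique tags)
print(round(expected_matches(10_000, 200, 6, 2), 1))  # 7.5  (more tags per item)
```

Both levers work in the same direction: fewer unique tags or more tags per item make each intersection less empty.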

Applications

I've made a few applications using the techniques mentioned. For the general Categories of this blog, I used a combination of both, having Categories behave like Tags: using as few of them as possible (but attaching many to a single post) and displaying them as a tag cloud (bottom of the page). I used a similar approach on my iTunes library, abusing song Comments to act as Tags for advanced smart playlists. And some time ago, I developed a simple engine for related content, based on occurrences of different Categories / Tags on my blog posts, acting both as an additional feature for readers and as a tool for internal hyperlinking, used for SEO.
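
A related-content engine of this kind could look roughly like the Python sketch below. It is not the actual engine behind this blog: the scoring rule (count of shared tags), the function `related_posts` and the sample posts are all assumptions made for illustration.

```python
def related_posts(post_id, post_tags, limit=3):
    """Rank the other posts by how many tags they share with `post_id`."""
    own = post_tags[post_id]
    scored = []
    for other_id, tags in post_tags.items():
        if other_id == post_id:
            continue
        shared = len(own & tags)
        if shared:
            scored.append((shared, other_id))
    # Most shared tags first; ties are broken by id to keep the output stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other_id for _, other_id in scored[:limit]]


# Hypothetical sample data: post id -> set of tags.
post_tags = {
    "categorization": {"blogging", "data", "delicious", "mathematics"},
    "twitter-bot": {"twitter", "software", "delicious"},
    "magazine": {"delicious", "data", "software"},
    "travel-notes": {"travel"},
}

print(related_posts("categorization", post_tags))  # ['magazine', 'twitter-bot']
```

The fewer and more consistently used the tags are, the more overlaps such a ranking has to work with.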

These are a few cases which display the power of simplicity, using as little data as possible to create a lot of information. And while I know this is hard to do, I must continue to pursue this philosophy, be it in software development or blogging (ironically, I failed with this one). Things that are similar on an abstract, logical level should be the same on the technical level. Try it, you'll be amazed by the results.

Categories:

Blogging, Chronolog, Data, Delicious, Internet, Lifehacks, Mathematics, Software, Technology, User Experience, Web 2.0

written 29.4.2011 13:26 CET on chronolog