I once made a promise that I would try to incorporate as many interesting features as possible into my blog. My previous development sessions focused mostly on how readers interact with the posts, peaking with the Hot on the chronolog algorithm. But now that the chronolog has finally reached critical mass in the amount of content it holds, the time has come to do something new. The next step focuses on a different kind of functionality: a few days ago, the chronolog received an algorithm for recognizing relationships between different blog posts.
The whole concept is based on the occurrences of categories (which are actually tags) on different blog posts, the most obvious measure being the number of tags two posts share. We did something similar on a web portal we launched a few months ago, and it works pretty well. Sure, the proper way to do it would be real text mining, where the strength of a relationship would be based on the meaning and occurrences of words and external hyperlinks in a specific post. But at this stage, I'm keeping it simple: if two posts share a lot of tags, they appear more related.
Since some categories (tags) are used more often than others, they appear in many posts, making those posts too heavily related to each other. The number of categories attached to a single post also varies, giving a post with many tags a much better chance of appearing as related to another. The general equation therefore contains two modifiers, which weight each tag shared between two posts.
Categories that appear only a few times globally have more weight, because they represent a scarcer and therefore more interesting and stronger connection. This keeps frequently used tags from becoming too dominant. On the other hand, the weight of each tag on a post drops with the total number of tags the post has, so posts with a lot of tags don't end up related to every other post. It may sound confusing, but it's probably a bit simpler to develop than to explain.
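The two modifiers can be sketched in a few lines of Python. This is only an illustration under assumptions, not the code or the constants actually running on the chronolog: the logarithmic rarity weight, the square-root normalization by tag counts, and all names (tag_frequencies, relatedness, related_posts, the sample post slugs and tags) are hypothetical choices made for the example.

```python
from collections import Counter
from math import log, sqrt


def tag_frequencies(posts):
    """Count on how many posts each tag appears (global usage)."""
    return Counter(tag for tags in posts.values() for tag in set(tags))


def relatedness(a, b, posts, freq):
    """Score how related posts a and b are, based on shared tags.

    Assumed weighting (not the author's actual formula):
    - a shared tag counts more the rarer it is globally (log weight);
    - the sum is divided by the geometric mean of both posts' tag
      counts, so heavily tagged posts are not related to everything.
    """
    tags_a, tags_b = set(posts[a]), set(posts[b])
    shared = tags_a & tags_b
    if not shared:
        return 0.0
    total = len(posts)
    rarity = sum(log(1 + total / freq[tag]) for tag in shared)
    return rarity / sqrt(len(tags_a) * len(tags_b))


def related_posts(post, posts, limit=3):
    """Return up to `limit` slugs of the posts most related to `post`."""
    freq = tag_frequencies(posts)
    scored = [
        (relatedness(post, other, posts, freq), other)
        for other in posts
        if other != post
    ]
    # Drop unrelated posts; sort by score (descending), then by slug.
    scored = [(score, other) for score, other in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:limit]]


# Made-up sample data: slug -> list of tags.
posts = {
    "web20-logos": ["design", "cliche", "web"],
    "round-icons": ["design", "cliche"],
    "google-social": ["web", "google", "social"],
    "ms-social": ["web", "microsoft", "social"],
}

print(related_posts("web20-logos", posts))
```

With this sample data, the two design posts end up closest to each other because they share two relatively rare tags, while the widely used "web" tag contributes only a weak link to the other posts.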
I was actually quite surprised by the results the algorithm produces (which you can now see at the bottom of every post). As I played around a bit, observing how the calculation behaves and tweaking the constants, I found some interesting connections between posts that I hadn't noticed before. The engine finds quite a strong relationship between the post about using Web 2.0 logos in TV commercials and the one about Round browser icons, both of them being design clichés. The case of Microsoft and Google going social also scored strongly, as the two posts describe the struggle of two technology giants trying to adapt to the new situation. I could go on and on, but then you would probably just say I was doing SEO too hard.
Search Engine Optimization (SEO) is actually another hidden benefit of the feature, something that occurred to me only after I had finished working on it. Google likes it when your content is internally cross-linked, so what better way to do it than to let automation take care of it? So until SEO dies, this new functionality is a double win: the chronolog has become more optimized for crawlers and hopefully more useful for readers. Even though most of you probably won't even notice.