The chronolog now understands connections between content


I once made a promise that I would try to incorporate as many interesting features as possible into my blog. My previous development sessions focused mostly on readers' interactions with the posts, peaking with the Hot on the chronolog algorithm. But now that the chronolog has finally reached critical mass in the amount of content it holds, the time has come to do something new. The next step focuses on a different kind of functionality: a few days ago, the chronolog received an algorithm for recognizing relationships between different blog posts.

The connections

The whole concept is based on the occurrences of categories (which are actually tags) on different blog posts, the most obvious measure being the number of tags two posts share. We did something similar on a web portal we launched a few months ago, and it works pretty well. Sure, the proper way to do it would be real text mining, where the strength of the relationships would be based on the meaning and occurrences of words and external hyperlinks in a specific post. But at this stage, I'm keeping it simple: the more tags two posts share, the more related they appear.
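As a rough illustration of the simple version (not the chronolog's actual code), counting shared tags could look like the sketch below. The post identifiers, the tags, and the function names are all made up for the example.

```python
# Hypothetical sample data: post id -> list of tags (categories).
sample_posts = {
    "web20-logos": ["design", "cliche", "web"],
    "round-icons": ["design", "cliche", "browsers"],
    "google-social": ["web", "social", "google"],
    "microsoft-social": ["web", "social", "microsoft"],
}


def shared_tags(tags_a, tags_b):
    """Return the set of tags that both posts carry."""
    return set(tags_a) & set(tags_b)


def rank_by_shared_tags(post_id, posts):
    """List other posts sharing at least one tag, most shared tags first."""
    base = posts[post_id]
    counts = [
        (other, len(shared_tags(base, tags)))
        for other, tags in posts.items()
        if other != post_id
    ]
    related = [pair for pair in counts if pair[1] > 0]
    return sorted(related, key=lambda pair: pair[1], reverse=True)


print(rank_by_shared_tags("web20-logos", sample_posts))
```

With this sample data, the post sharing two tags is listed ahead of the posts sharing only one.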

The weight

Since some categories (tags) are used more often, they appear on many posts, making these posts too heavily related to each other. The number of categories attached to a single post also varies, giving a post with many tags a much stronger chance of appearing related to another. Therefore the general equation contains two modifiers, which give weight to each tag shared between two posts.

Categories that appear only a few times globally have more weight, because they represent a scarcer and therefore more interesting and stronger connection. This keeps the tags that are used very often from becoming too dominant. On the other hand, the weight of each tag on a post drops with the total number of tags the post has, so posts with a lot of tags don't become every other post's related post. It may sound confusing, but it's probably a bit simpler to develop than to explain.
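The exact equation and constants are not given here, so the following is only a sketch of how two such modifiers might be combined. It assumes a rarity weight of 1 / (number of posts using the tag) and a dilution factor of 1 / (tags on post A × tags on post B); both formulas, the sample data, and all names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical sample data: post id -> list of tags (categories).
posts = {
    "web20-logos": ["design", "cliche", "web"],
    "round-icons": ["design", "cliche", "browsers"],
    "google-social": ["web", "social", "google"],
    "microsoft-social": ["web", "social", "microsoft"],
}


def tag_frequencies(posts):
    """Count on how many posts each tag appears."""
    return Counter(tag for tags in posts.values() for tag in set(tags))


def relatedness(a, b, posts, freq):
    """Score two posts by their shared tags, using two assumed modifiers."""
    tags_a, tags_b = set(posts[a]), set(posts[b])
    shared = tags_a & tags_b
    if not shared:
        return 0.0
    # Assumed modifier 1: rarely used tags count for more.
    rarity = sum(1.0 / freq[tag] for tag in shared)
    # Assumed modifier 2: each tag weighs less on posts with many tags.
    dilution = 1.0 / (len(tags_a) * len(tags_b))
    return rarity * dilution


def related_posts(post_id, posts, limit=3):
    """Return up to `limit` (post, score) pairs, strongest first."""
    freq = tag_frequencies(posts)
    scores = [
        (other, relatedness(post_id, other, posts, freq))
        for other in posts
        if other != post_id
    ]
    ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
    return [pair for pair in ranked if pair[1] > 0][:limit]


for other, score in related_posts("web20-logos", posts):
    print(other, round(score, 3))
```

In this sample, "web" appears on three posts and therefore contributes less than the rarer "design" and "cliche" tags, which is the effect the first modifier is meant to produce.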

The results

I was actually quite surprised by the results the algorithm produces (which you can now see at the bottom of every post). As I was playing around, observing how the calculation behaves and tweaking the constants, I found some interesting connections between posts which I hadn't noticed before. The engine finds quite a strong relationship between the post about using Web 2.0 logos in TV commercials and the one about Round browser icons, both of them being design clichés. The case of Microsoft and Google going social also scored strongly, as the two posts describe the struggle of two technology giants trying to adapt to the new situation. I could go on and on, but then you would probably just say I was doing SEO too hard.

Search Engine Optimization (SEO) is actually another hidden benefit of the feature, something that occurred to me only after I had finished working on it. Google likes it if your content is internally cross-linked, so what better way to do it than to have automation take care of it? So until SEO dies, this new functionality is a double win, because the chronolog became more optimized for crawlers and hopefully more useful for readers. Even though most of you probably won't even notice.

written 03/11/2010 22:20 CET on chronolog
230 views   •   3 likes   •   6 comments
you did. i really want to learn as much as i can about tagging, categorisations. I find this knowledge very useful. Proper tagging/categorisation method can make your life easier. With this amount of content it is important to master tagging/categorisation... this is how content gets its (added) value. I've been using Evernote for a while ... Syncing Android phone, laptop, desktop computer, job computer... i must say i'm getting better and making better decisions and better use of time... But for using evernote you need discipline ;)
commented 04/11/2010 21:32 CET by jaka
Jaka, thanks for the great question. I also noticed there's a difference between categories and tags on wordpress. I think the way it's meant is that you have a few (one) categories and many tags on one post, making the main difference between categories and tags the relationship one-to-one (few) and one-to-many with posts (folders hierarchy vs. matrix). But then you end up using too many tags (most of them only once), making them useless for analysis and finding related content. On delicious I use as few tags as possible, because I think that is the proper way to do it, and I'm also doing a similar thing here. I decided to try to combine both approaches into one, naming it categories (because it's a classification concept), using many on each post (therefore behaving like tags). You can read a bit more about it in one of my first posts (http://stritar.net/Post/The_Chronolog_Is_Almost_Complete.aspx). Hope I've answered your question.
commented 04/11/2010 15:54 CET by Stritar
sorry for confusing writing and questions, i just wanted to know what you think of different tagging categorisation methodologies.
commented 04/11/2010 15:21 CET by jaka
hmm, i was thinking about categories vs. tags (wordpress) yesterday. Categories are a very nice addition. On your blog, you have only "tags", but you name them categories. do you have any special reason why "category" instead of "tag"? categories are "(sub)folders" in which you put your posts with looots of different tags, right? I think wordpress categories are the same as "tag bundle" in delicious.
commented 04/11/2010 15:17 CET by jaka
If I remember correctly, isn't it you who's the guy with glasses? I'm a geek. ;P
commented 04/11/2010 12:10 CET by Stritar
You're such a super nerd Stritar! However, this is actually really useful stuff, hmmm
commented 04/11/2010 11:50 CET by Nick Taylor

Connect with Grega Stritar: