Stritar's chronolog

The Risk board game dice roll probability calculator and battle simulator

Sun, 17 Mar 2013 16:32:12 GMT

I know there are plenty of you out there who love to play the board game Risk. We're hooked on the Lord Of The Rings edition, and I still need to check out the very rare expansion pack one of my friends recently got. As you will see, I'm getting ready for it with all I've got, developing myself a weapon that will help me dominate the game. Something that will turn the odds in my favor without actually cheating. Say hi to my Risk battle simulator, which is able to calculate the chance of winning for specific Risk situations.

Launch

Risk is an interesting game, powered by simple mathematics. In a battle, the attacker throws three dice, the defender throws two. The attacker's advantage is the one extra die. The defender's advantage is that he/she wins when the dice are tied. In the general situation of a 3 dice vs. 2 dice roll, this translates into the following odds: the attacker has a 37.17% chance of winning, the defender has a 29.26% chance of winning, and there's a 33.58% chance of a tie, in which they both lose one army. But what about specific situations? How can I know if my 10 armies are enough against my opponent's 5+4?

Since I'm a developer and not a mathematician, I decided I would rather build a simple brute-force JavaScript simulator than try to derive the formula behind the battles: an application that simulates 10,000 Risk fights (around 50,000 dice rolls) and calculates the odds of each result. That should be enough to make an approximation, right?

Risk's rules specify that the attacker always has to leave one army behind (it can't be used to attack), and can attack any number of territories in a single turn. The calculator takes this into account. For every territory the attacker wins, his/her number of armies is reduced by 1 (one army is left behind), and the last army is not able to attack. I think I got it right, but if you find an error, please let me know.

Risk dice roll battle simulation.
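The rules described above can be sketched as a small Monte Carlo simulation. This is a minimal sketch and not the code of the actual simulator: the function names, the injectable `rng` parameter and the assumption that the attacker's army count includes the one army that must stay behind are all mine.

```javascript
// Roll n six-sided dice and return them sorted from highest to lowest.
function rollDice(n, rng) {
  const dice = [];
  for (let i = 0; i < n; i++) dice.push(1 + Math.floor(rng() * 6));
  return dice.sort((a, b) => b - a);
}

// Fight over one territory until it falls or the attacker is down to one army.
function fightTerritory(attackers, defenders, rng) {
  while (defenders > 0 && attackers > 1) {
    const a = rollDice(Math.min(3, attackers - 1), rng); // one army never attacks
    const d = rollDice(Math.min(2, defenders), rng);
    const pairs = Math.min(a.length, d.length);
    for (let i = 0; i < pairs; i++) {
      if (a[i] > d[i]) defenders--; // attacker's die is strictly higher
      else attackers--;             // ties go to the defender
    }
  }
  return { attackers, defenders };
}

// Estimate the chance that `attackers` armies conquer every territory in
// `territories` (an array of defending army counts), one after another.
function simulateCampaign(attackers, territories, trials = 10000, rng = Math.random) {
  let wins = 0;
  for (let t = 0; t < trials; t++) {
    let armies = attackers;
    let conquered = 0;
    for (const defenders of territories) {
      const result = fightTerritory(armies, defenders, rng);
      if (result.defenders > 0) break; // the attack stalled
      armies = result.attackers - 1;   // one army is left behind
      conquered++;
    }
    if (conquered === territories.length) wins++;
  }
  return wins / trials;
}

// The "10 armies against 5+4" question from the text; the result varies per run.
console.log(simulateCampaign(10, [5, 4]));
```

Passing the random source as a parameter is only there to make the sketch easy to check with a fixed sequence of rolls.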

The simulator also knows how to calculate probabilities for specific single battle situations, for different numbers of dice thrown by the attacker and the defender, by going through all the possible dice throw results (which means 7,776 (6⁵) different dice throws in a 3 vs. 2 battle).

Risk dice roll possibilities and odds.
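Going through every possible throw can be sketched like this. It is an illustration of the enumeration idea under my own naming (`exactOdds` and its result fields are not taken from the simulator's source); each number from 0 to 6ⁿ - 1 is decoded into one combination of dice.

```javascript
// Enumerate every possible throw for the given numbers of dice and count
// who loses armies. Highest dice are compared pairwise; ties favor the defender.
function exactOdds(attackerDice, defenderDice) {
  const totalDice = attackerDice + defenderDice;
  const outcomes = Math.pow(6, totalDice);
  let attackerWins = 0; // attacker loses nothing
  let defenderWins = 0; // defender loses nothing
  let ties = 0;         // both sides lose one army
  for (let n = 0; n < outcomes; n++) {
    // Decode n into base-6 digits, one digit per die.
    const roll = [];
    let rest = n;
    for (let i = 0; i < totalDice; i++) {
      roll.push(1 + (rest % 6));
      rest = Math.floor(rest / 6);
    }
    const a = roll.slice(0, attackerDice).sort((x, y) => y - x);
    const d = roll.slice(attackerDice).sort((x, y) => y - x);
    let attackerLosses = 0;
    let defenderLosses = 0;
    for (let i = 0; i < Math.min(a.length, d.length); i++) {
      if (a[i] > d[i]) defenderLosses++;
      else attackerLosses++;
    }
    if (attackerLosses === 0) attackerWins++;
    else if (defenderLosses === 0) defenderWins++;
    else ties++;
  }
  return { attackerWins, defenderWins, ties, outcomes };
}

console.log(exactOdds(3, 2));
```

For a 3 vs. 2 roll the counts come out as 2,890 attacker wins, 2,275 defender wins and 2,611 ties out of 7,776 throws, which is 37.17%, 29.26% and 33.58%.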

I made this to help myself and others understand and appreciate the statistics behind the dice. So, the next time you play Risk, don't forget to bring your phone and use the simulator to your unfair advantage. And if you're a developer, feel free to upgrade the algorithm (available on GitHub). Game on.

http://diceroll.stritar.net/risk.html
https://github.com/gstritar/DiceRoll

A case study in agile development: the algorithm for Ljubljana Realtime's event discovery

Mon, 22 Oct 2012 20:01:16 GMT

When we started working on Ljubljana Realtime, we decided to approach it in an agile way. Amongst others, we wanted to use a few interesting lean concepts such as rapid development, Minimum Viable Product and the Build - Measure - Learn iterations. Less than two months later, the results are in, and they are very pleasing. The MVP in the shape of an activity map was developed in a few weeks, only to show there is a lot of social noise which will somehow need to be taken under control. But that's currently low priority, since the first pivot is already taking place, slowly shifting the focus from the rich map application towards an event discovery algorithm and stream. That's where I see the most potential of Ljubljana Realtime, and in the last weeks, that's where the most work was done.

Launch

The Ljubljana Realtime event discovery engine uses Foursquare trending venues and geo-tagged posts from Twitter, Instagram and Flickr to discover what's happening in real life. At least six people checking in on Foursquare, or two different people tweeting or posting photos in a single hour, could mean something is going on. These events are posted to Twitter and Facebook, with links to the posts. A few versions of this algorithm were already deployed, each one solving new problems, resulting in a few micro Build - Measure - Learn cycles in a single month.

Iteration 1: Foursquare, no duplicates

The first version of the stream (bot) was a simple one; at that point it was meant to work as promotion for the map. The only thing it knew how to do was to wait a few hours before it posted the same thing again. I think Foursquare check-ins are alive for three hours, so if a trending venue was still trending after that time, new people had to have checked in and the venue was still buzzing.

Problem: Plain, no real added value.

Iteration 2: Adding activity from other sources

When we were trying to make some space on the crowded map, we started grouping posts from Twitter and Instagram by the nearest Foursquare venue, which meant having fewer boxes on the screen. This turned out to be quite a complex thing to do properly, but it was worth the effort. On only a few occasions, one venue would have multiple posts in a single hour, and in most cases, that meant something was happening there. This provided another very interesting potential for the activity stream. Actually, it made the stream bigger than the map could ever be.

(I love it when such things happen, when you are trying to solve a problem, and it turns out there is much more hidden behind the resolution.)

Grouping posts by a venue. Did Ljubljana Realtime just discover an athletic meeting taking place?

The next problem: Activity in some venues, especially generic ones such as "Ljubljana", would trigger the stream almost every day. Similarly, some large venues, such as supermarkets, would be trending too many times on Foursquare.

Iteration 3: Balancing the posts

The algorithm needed an update that would lower the number of times a venue would be recognized as an event, either on Foursquare or on other channels. At first I thought about an upgrade which would set the number of people or tweets needed to trigger the "event discovered" action for a specific venue. This would enable us to reduce the importance of some venues, but it would also require manual work. Luckily, we came up with another brilliant idea: the more times a venue is trending, the harder it is for it to be trending again, at least for the next few days. Automatic balancing.

Venues with the most discovered events. Generic ones, besides massive places such as train stations, cinemas, squares and shopping centers, are too dominant.

The next problem: At this point, we have launched other test instances of Ljubljana Realtime (Maribor, Zagreb and Zurich), to see how the system behaves in other environments. Some cities are bigger, some are smaller, which means they produce different amounts of activity. Besides, different services are used differently in different cultures.

Iteration 4: Supporting local instances

Foursquare is big in Croatia (Zagreb), but not so much in Switzerland (Zurich), which means Zagreb Realtime's stream had a lot of Foursquare trending posts, while Zurich's had a lot of "Increased activity on Twitter and Instagram" posts. It was obvious that local instances needed different algorithms. While having an option to set the amounts which would trigger the post on a specific venue would be too much to moderate, having the same logic on a specific region could work. And it does. Zagreb now needs more people checked-in on Foursquare, while Zurich needs more unique people tweeting or sharing photos.

Number of discovered events by type (Foursquare vs. Twitter + Instagram) on each day. Foursquare trending venues are dominating Zagreb, while social streams are dominating Zurich Realtime.

The next problem: The basic algorithm requires two different people to tweet/post from the same location in one hour. In the case of Zurich, this number was set to three, but it turns out this situation happens rarely, around 10 times less often than with two people, or only two to three times a day. Obviously not enough.

Iteration 5: Improving the "increased activity" weight

You can only have a whole amount of people tweeting in the past hour. Two or three. In our case, we needed something in the range of 2 1/2. The modified solution adds the number of posts divided by ten to the number of users, which means that currently, at least two people making at least three posts in an hour will determine a trending event in Zurich. This is not a perfect solution from the event discovery view, but it does what urgently needed to be done: prevent having too many tweets in the stream.
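The "users plus posts divided by ten" weight can be sketched as follows. The post shape (`{ user }`), the function names and the concrete thresholds are my assumptions: 2.0 stands for the basic "two different people" rule and 2.3 for the Zurich rule of "two people making at least three posts". The score is kept in tenths so that the comparison does not depend on floating-point rounding.

```javascript
// Score in tenths: each unique user counts 10 and each post counts 1,
// which equals (users + posts / 10) multiplied by ten.
function activityScoreTenths(posts) {
  const users = new Set(posts.map((post) => post.user));
  return users.size * 10 + posts.length;
}

// Hypothetical per-city thresholds, also in tenths (assumed values).
const THRESHOLD_TENTHS = {
  ljubljana: 20, // two different people
  zurich: 23,    // two people and at least three posts
};

// True when the posts from one venue in the past hour count as an event.
function isTrending(posts, city) {
  const threshold = THRESHOLD_TENTHS[city];
  if (threshold === undefined) throw new Error(`Unknown city: ${city}`);
  return activityScoreTenths(posts) >= threshold;
}

const lastHour = [{ user: "ana" }, { user: "bor" }, { user: "ana" }];
console.log(isTrending(lastHour, "zurich")); // two users, three posts
```

Taken literally, this formula would also let one very active user cross the threshold on their own; whether the real algorithm guards against that is not stated.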

The next problem: we currently have four Twitter accounts that tweet events for these four cities. Our target was for each of them to make around 10 - 15 tweets a day, which seems like a number that is not spam. But how can a person see which of these events is THE event?

Iteration 6: Going super venue level 2

The latest version of the algorithm now recognizes two levels of events. An event (mostly 6 people on Foursquare, mostly 2 different people tweeting), and an outstanding event (around 12 people on Foursquare, around 4 people tweeting). Our goal was to make this super event happen only once every few days, on rare occasions two times per day, and it has already happened a few times.
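The two levels can be pictured as a pair of thresholds per source. This is only a sketch: the numbers are the approximate ones quoted above, and the names (`LEVELS`, `classifyActivity`) as well as the idea of a single "people" count per source are my simplification, since the real thresholds differ per city.

```javascript
// Approximate thresholds quoted in the text (they vary per city in practice).
const LEVELS = {
  foursquare: { event: 6, outstanding: 12 }, // people checked in
  social: { event: 2, outstanding: 4 },      // different people posting
};

// Classify the activity at one venue as "none", "event" or "outstanding".
function classifyActivity(source, people) {
  const level = LEVELS[source];
  if (!level) throw new Error(`Unknown source: ${source}`);
  if (people >= level.outstanding) return "outstanding";
  if (people >= level.event) return "event";
  return "none";
}

console.log(classifyActivity("foursquare", 7)); // "event"
console.log(classifyActivity("social", 5));     // "outstanding"
```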

Sometimes super events happen, with tens of posts in a single hour, such as the one for Philips Fashion week. These events definitely require more exposure.

The next iterations

At this point, I'm very satisfied with how the algorithm works, even though a few other modifications need to be done (especially to support the specifics and behavior of different days of the week). By measuring what is happening, learning from that information and building the next releases based on that knowledge, the activity stream logic has come a long way from the initial version. Measuring is crucial, and rarely have we gone to such lengths to enable it in the widest way possible (e.g. the update to balancing the posts based on the previous events would be trivial by itself, but we wanted to log things that would have happened but didn't, besides things that actually happened).

These cycles of Build - Measure - Learn can be a lot of hard work, but they provide great results, which are also very fun and rewarding. Some people simply need to see how deep the rabbit hole is. Do you have any other interesting cases or experience with this approach?

Putting 'people who look at you' to your Facebook profile would be the smartest thing to do

Sat, 02 Jun 2012 10:50:34 GMT

Are you one of those people who are wondering how Facebook decides which friends they put on your profile? I admit I am, both out of programmer's curiosity and, of course, because there have been rumors that those individuals are the ones who look at your profile. While LinkedIn offers this "who looks at your profile" insight to its (premium) users, Facebook is still very mysterious about it, denying this is how this particular algorithm works. But there is a simple reason I don't believe them: if I were Facebook, I would design it exactly like this.

EdgeRank

Facebook uses EdgeRank to calculate the connection between two people, determined by the amount of mutual friends, interactions, tagged photos, attended events and other parameters in a time period. Besides other things, the EdgeRank influences which posts get displayed in your news feed. It seems Facebook is saying that a similar algorithm is used for the friends on your profile, but is it really?

The exploit

Some time ago, someone managed to find a way inside the EdgeRank results. This guy noticed that Facebook caches the list of your friends, together with the level of proximity you have with each one. This stored part of the social graph helps search and other lists on Facebook to work faster and be sorted better. He was nice enough to write a script and make it public, so everybody can see who their Facebook BFFs are. The results look like the real deal, and it's actually quite fascinating that Facebook hasn't patched this potential abuse yet, as it's been available for almost a year.

Bottom line: the list of friends in your EdgeRank and the list of friends on your profile are almost, but not quite, entirely unlike each other.

Comparing my closest friends to those that are showing up on my Facebook profile

Why bother?

Facebook needs to constantly drive your engagement, and they have infinite data about you. They are trying to seamlessly integrate their experience into every pore of your life and make you even more connected. They are saying they can predict when hookups and breakups will happen. Who do you think they would put on your profile?

It would work

Adding "people who look at you" to your Facebook profile would act as the poke that never got clicked. The most basic (inter)action, something that wants to lead to something bigger. The invisible act of someone longing for engagement. Potential connection, potential partnership, potential relationship. The beyond EdgeRank scary social experiment, which holds infinite possibilities, positive and negative. An almost godly algorithm. Why would anyone even think of doing it differently? It simply doesn't get much better than this.

I would do it, I believe Facebook would do it as well, but even if they did, it's pretty clear why they can't tell us. This feature would work only as long as we wouldn't really believe it's being used. That's why you need to forget about all of this and simply enjoy your virtual life.

Trademarks and logos are the property of their respective owners.

Did Google just admit Apple's Siri is the future of search?

Sun, 04 Dec 2011 15:21:19 GMT

I don't know if you saw The evolution of Google search video, which they published a few days ago. You should, it's a cool movie, portraying the history of search and Google's vision of its future. But something went wrong. One of the punchlines of the video was a story from one of the engineers, who said that next-generation search engines will be able to answer complex questions such as the following:

"Hey, what is the best time for me to sow seeds in India given that monsoon was early this year?"

A very legitimate question.

I don't know if you've tried out iPhone's new personal assistant, Siri. It's awesome in every bit. Not only does it have state-of-the-art voice recognition, it's also packed with super smart artificial intelligence that supposedly allows you to ask crazy things such as:

"Can you remind me to call my wife when I leave the office?"

Another very legitimate question.

And there's a strong resemblance there. Both requests are really abstract and probably require quite a bit of computational power to be understood by a program. They have nothing to do with mathematical or social ranking currently used by Google (search), they are all about Artificial Intelligence and semantic interpretation. And while Google currently doesn't provide (or at least market) services that would be able to understand such sentences, Apple does.

I've noticed quite a few articles saying concepts such as Siri are the future of search. It's obvious artificial intelligence will play a big role in this segment. Apple's already in. Even if their technology is not superior to that of Google, which is also working on embedding AI into search, it's fully available today, and everybody knows it. Google should really be careful with such statements concerning their core business, Web search. Especially if they are competing against the marketing wizards of Apple, who know how to sell things even if they don't fully work.

Promoting a technology you don't have and your competition does? Stupid consumers such as myself might do something stupid.

UPDATE (5.12.2011): You can join the discussion on HackerNews.

Television and Social media? How did my recommendation engine miss this connection?

Sun, 27 Nov 2011 14:58:20 GMT

November has been a great month for this blog. For the first time in history, I managed to get more than 1,000 unique users on two different blog posts in a single month. Which is awesome, thanks! The first post was about the TV show Dexter and its Facebook game Slice of life. The other was about Slovenian TV show Soočenje and its buzz on Twitter. Just two posts, nothing special, right? Wrong. It's really obvious, but I missed it somehow. Both posts are talking about combining television and social media, silly me! I can't believe I failed to see it, but I did, and so did my blog. Not that it really matters anymore. You know those fantastic coincidences that happen sometimes and put everything into place? This story is full of them.

Function

Some of you may know this blog has an internal recommendation engine that calculates the correlation between different posts based on shared tags and their frequency, offering related reading at the bottom. It missed the connection. Others may know I'm a bit obsessed with cross-referencing my posts, which I do manually. I missed it too. Perhaps Facebook and Twitter aren't as similar as I would like to believe, but I'm putting my bet mostly on the different concepts of combining television and social media. The Dexter case was about entertainment, gaming and story-telling. Pop TV's case was about politics, news and ordinary people co-creating content. Different problem, different tags, but still, the strong relation between the two is very much there.

Try

The first funny coincidence was a blog post by @anejmehadzic written a few days after mine, discussing the general possibilities of a symbiosis between television and social media (in Slovene). The post provided enough insight to make me see what I missed. TV shows using Facebook, Twitter and YouTube to provide additional content to viewers was really something in between the two cases of mine. Revelation. At this point I knew I missed the connection myself, but how did my very smart algorithm also miss it?

Catch

The next lucky coincidence was a lecture on wwwh happening yet a few days later. @zbrchka was talking about transmedia, a term I hadn't heard of before. I thought multi-platform or cross-platform could be the concept that connected these two blog posts, but transmedia feels so much better. Transmedia is a technique for creating integrated content for different mediums, just the thing I was looking for. Something that's becoming so important it deserves exposure on this blog too.

Finally

With the new gathered knowledge, I made a new tag Transmedia, putting it on both posts, besides the one about Šport TV tweeting about the basketball championship. It worked like a charm. Since this tag is used so rarely, it dominated the recommendation engine, and to my great relief, all three posts gained the correlation they require to be listed as related content one to another.
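Why a rarely used tag dominates can be shown with a small sketch. The blog's actual formula is not given here, so the weighting below is purely an assumption for illustration: every shared tag contributes the inverse of the number of posts that use it. The post names and tags are made up.

```javascript
// Count how many posts use each tag.
function tagFrequency(posts) {
  const counts = new Map();
  for (const tags of Object.values(posts)) {
    for (const tag of tags) counts.set(tag, (counts.get(tag) || 0) + 1);
  }
  return counts;
}

// Assumed weighting: each shared tag adds 1 / (number of posts using it),
// so a tag used by few posts counts for more than a tag used everywhere.
function relatedness(posts, a, b) {
  const counts = tagFrequency(posts);
  const tagsOfB = new Set(posts[b]);
  let score = 0;
  for (const tag of posts[a]) {
    if (tagsOfB.has(tag)) score += 1 / counts.get(tag);
  }
  return score;
}

// Hypothetical data: two posts that only share one very common tag.
const posts = {
  dexter: ["web", "facebook", "games"],
  soocenje: ["web", "twitter", "politics"],
  other1: ["web"],
  other2: ["web", "twitter"],
};
console.log(relatedness(posts, "dexter", "soocenje")); // weak link via "web"

// Adding a rare shared tag lifts the score considerably.
posts.dexter.push("transmedia");
posts.soocenje.push("transmedia");
console.log(relatedness(posts, "dexter", "soocenje"));
```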

The blogosphere kicks ass, since bloggers are mutually inspiring each other and moving things forward. Wwwh is a great place to hang out and share knowledge and experience. My recommendation algorithm is awesome, fully working as expected. And those lucky coincidences are the thing that makes this existence an interesting place to be visiting. Everything is just the way it should be.

* try-catch-finally is an exception handling syntax used in some programming languages.

The great aquarium cleaning dilemma: should you be removing or replacing water?

Tue, 01 Nov 2011 18:44:18 GMT

Everybody that owns an aquarium probably came across this decision at one point. The water is filthy and needs to be replaced. All you have is a jar. And you ask yourself: should you be emptying the aquarium first, adding new water later on, or should you be replacing filthy water with clean water? The first choice seems more rational, but sometimes you can't fully empty the aquarium (e.g. you have fish), and you need to do more runs since you're not taking water both ways. The other option seems interesting since you're efficient both ways, but at the same time you're taking back fresh water mixed in the aquarium. So, what should you do?

The situation

In reality, you do have other options. A water pump, a larger intermediate basin or other things that can make this task easier. But believe me, sometimes you don't have the time to do it properly and you just want to clean the water a bit. And that's when you'll wonder what to do. It happened to me, and that's why I've made myself a model that would answer this question, a model that would determine the breaking point (where both options are equally effective) between the two techniques.

The model described contains the following parameters:

the aquarium volume (V - volume)
the jar volume (d - change)
number of two-way runs (x)

And it looks something like this:

The aquarium cleaning situation

The initial model

The original view I made in Excel is based on simple mathematics, where each one-way run is represented by one line in the table. Adding and removing water both-ways constantly reduces filthiness, but makes each additional run less effective. On the other hand, removing water first and adding fresh water later takes more runs (since your jar is empty in one direction, you require four one-way runs to replace an additional unit), but you're not removing clean water. Here's a preview of how this looks:

The basic calculation

The advanced model

The basic model is more understandable, but not appropriate to make a full mathematical equation. That's why I made a second model, which is based on two-way runs and exponential functions. Here's what I got:

The advanced calculation
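The original spreadsheet is not reproduced here, so the following is my own reading of the model described above, not the author's formulas. With aquarium volume V, jar volume d and x two-way runs, replacing water multiplies the share of dirty water by (1 - d/V) on every run, while draining first and refilling later exchanges one jar per two runs, leaving 1 - x·d/(2V) of the dirty water.

```javascript
// Share of the original dirty water left after x two-way runs when every run
// carries dirty water out and clean water back (replacing).
function filthReplacing(V, d, x) {
  return Math.pow(1 - d / V, x);
}

// Share left when half of the runs only carry water out and the other half
// only carry clean water in (draining first, refilling later).
function filthDraining(V, d, x) {
  return Math.max(0, 1 - (x * d) / (2 * V));
}

// Compare both techniques for a 100 l aquarium and a 10 l jar.
for (const x of [2, 6, 10, 14, 18, 20]) {
  console.log(
    x,
    filthReplacing(100, 10, x).toFixed(3),
    filthDraining(100, 10, x).toFixed(3)
  );
}
```

Under these assumptions replacing is ahead for most of the cleaning, and draining only catches up close to the point where the aquarium would be fully emptied and refilled.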

The results

The results show that if you don't have much time, it's better if you're replacing water, taking it both ways. This accounts for faster cleaning, which is slowing down in the long run. But the point is that no matter what the parameters are (aquarium and jar size), the breaking point happens more towards the end of the cleaning. So, if you're not prepared to fully empty the aquarium, it's generally better if you use the replacement option, removing and adding water at the same time.

The graphical results displaying water filthiness based on number of two-way runs for different jar and aquarium sizes

The breaking point equation

The results are there, and I managed to make the following equation which would calculate where both ways are equally effective:

This makes calculating the number of runs so complex it can't be calculated in basic Excel (because of the Lambert W or product log function).

But luckily, you can use Wolfram Alpha to calculate the breaking point of two-way runs for a given aquarium and jar size. If you're prepared to make fewer two-way runs than the result, take water both ways; otherwise, one way.

If you want to play around a bit, you can even download the complete model in Excel format.
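The breaking point can also be approximated numerically, without the Lambert W function. The sketch below is not the equation from the post: it assumes the same reading of the model as before, (1 - d/V)^x for replacing and 1 - x·d/(2V) for draining, and uses bisection to find the x where both are equal.

```javascript
// Approximate the number of two-way runs at which replacing water and
// draining-then-refilling leave the same share of dirty water.
// Assumes 0 < d <= V.
function breakEvenRuns(V, d) {
  const r = d / V;
  // Negative while replacing is cleaner, positive once draining is cleaner.
  const gap = (x) => Math.pow(1 - r, x) - (1 - (x * r) / 2);
  let lo = 1;     // after one run replacing is always ahead
  let hi = 2 / r; // after 2V/d runs draining has removed all dirty water
  for (let i = 0; i < 100; i++) {
    const mid = (lo + hi) / 2;
    if (gap(mid) < 0) lo = mid;
    else hi = mid;
  }
  return (lo + hi) / 2;
}

console.log(breakEvenRuns(100, 10)); // between 16 and 17 two-way runs
```

For a 100 l aquarium and a 10 l jar, a full drain and refill takes 20 two-way runs under this model, so the breaking point indeed falls close to the end of the cleaning.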

Conclusion

Perhaps there's an easier way to calculate the great aquarium cleaning dilemma, which speaks in favor of water replacement (taking water both ways). There's also a chance I've made an error somewhere (please let me know!). But the results seem correct, and I hope you find all of this amusing and / or helpful when you need to update your water. Aquarium cleaning will never be the same again.

Update (26.11.2011): While this may have been a fun experiment, I've just read it's not that smart to fully remove water in the aquarium, since it disrupts the balance of bacteria.

Please help me upgrade my Twitter bot

Tue, 30 Aug 2011 20:33:12 GMT

Half a year ago I decided to make something out of my Delicious bookmarks. The magazine-style display inspired by Flipboard wasn't enough, I wanted to publish these links somewhere outside my chronolog, somewhere on Twitter. So I made a bot. It's doing quite well, posting like mad, but it's really not where I want it to be. So far, it has made about 3,000 tweets (around 500 per month), but has only 67 followers. I know my taste in content is a bit obscure, but still, only 67 followers?

This calls for an upgrade. And since I won't change my interests and bookmarking habits, something else needs to be done. That's where I need your advice. Crowdsourcing the concept and stuff.

The brand

This is probably one of the greatest fails of the project. At first, I thought of it as an extension of my blog. Hence the account @stritar_net, the name Stritar's chronolog, together with the description it has. Should I rename it and try to make it a standalone "brand"? Should I openly say it's a bot (Stritar's bot or something)?

The selection

The bot currently posts ALL my bookmarks to Twitter (as mentioned, around 15 per day or 500 per month) without any selection. But it could be done. Since my bookmarks are originally tagged, it could leave out those with too few tags (since I use the same method of counting tags to determine the initial weight of content for the magazine). Or specialize in specific segments according to tags. There could even be more of these bots. What do you think?

The frequency

Currently, the bookmarks I make go into a queue. I hate those Twitter accounts that post 10 tweets in 5 minutes and go silent for a day. I wanted to make it smoother. So the queue always knows how many items it holds and adapts the frequency of posting according to it (fewer bookmarks in the queue mean less frequent tweets). But that produces the situation where most of them are already a few hours or days out of date when they are published. A higher publishing frequency would solve some of it, but it opens a great dilemma: what's the lesser evil, over-spamming or being out of date?
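The exact rule the bot uses is not described, so here is only one hypothetical way a queue could adapt its pace: spread the waiting items over a fixed window and clamp the gap between a minimum and a maximum. The function name, the 24-hour window and the 30-minute and 6-hour limits are all invented for the example.

```javascript
// Hypothetical pacing rule: spread the queued items over `drainMinutes`,
// but never tweet more often than `minGap` or less often than `maxGap`.
function minutesUntilNextTweet(queueLength, options = {}) {
  const { drainMinutes = 24 * 60, minGap = 30, maxGap = 6 * 60 } = options;
  if (queueLength <= 0) return null; // nothing to post
  const gap = drainMinutes / queueLength;
  return Math.min(maxGap, Math.max(minGap, gap));
}

console.log(minutesUntilNextTweet(15));  // 96 minutes between tweets
console.log(minutesUntilNextTweet(2));   // capped at 360 minutes
console.log(minutesUntilNextTweet(200)); // floored at 30 minutes
```

The two limits are where the trade-off from the text lives: a lower `minGap` means fresher links but more spam, a higher one the opposite.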

The order

The order of bookmarks posted on Twitter is determined by two factors: the number of tags and the date they were published. More tags equals more importance. Older bookmarks get published sooner, otherwise they would get even more out of date. Should I do it the other way around and post more recent links sooner? This would make some of them more interesting and up-to-date, but others worse. Breaking content or consolidated content?

So many decisions… In this case, the best way probably doesn't exist, but trade-offs can always be decided for the better. Please let me know what you think, your help would be more than appreciated. I could help back if I can.
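How the two factors are combined is not spelled out, so the sketch below assumes the simplest reading: the number of tags decides first, and among bookmarks with the same number of tags the older one goes first. The bookmark shape (`tags`, `savedAt`) is hypothetical.

```javascript
// Order bookmarks for posting: more tags first, then older first.
// Returns a new array and leaves the input untouched.
function orderQueue(bookmarks) {
  return [...bookmarks].sort(
    (a, b) => b.tags.length - a.tags.length || a.savedAt - b.savedAt
  );
}

const queue = [
  { url: "a", tags: ["web"], savedAt: new Date("2011-08-01") },
  { url: "b", tags: ["web", "twitter", "bots"], savedAt: new Date("2011-08-20") },
  { url: "c", tags: ["design", "apps", "ux"], savedAt: new Date("2011-08-10") },
];
console.log(orderQueue(queue).map((bookmark) => bookmark.url)); // [ 'c', 'b', 'a' ]
```

Posting recent links first would only require swapping the second comparison to `b.savedAt - a.savedAt`.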

Determining if an element is in the last row of a table

Mon, 25 Jul 2011 11:57:52 GMT

Once upon a time I stumbled upon a problem, where I needed to calculate if an element is in the last row of a table. Here's the scenario: you have a number of items, which are put in a table from left to right. When the row is full, the items continue in the next row. Imagine an airplane or a theater where people start sitting front-left and continue to the right until they run out of space, then going to the next row and so on. Now we want to know which people are sitting in the last of the populated rows.

A weird problem, but hopefully I will be able to show you some cool results produced by this algorithm someday (yes, it's usable).

The equation has 2 parameters: the total number of elements and the number of columns in a single row. There are a few ways to do it, using division with remainders (the modulo operation). The simple way would be comparing the total number of rows with the element's current row. Another one would be to calculate the number of elements in the last row and see if our element is in those last few. While both seem logically easy, they actually suck, because they contain exceptions (some states need to be handled specifically - if the last row is full or not).

That's why I went for the ultimate way, using abstract mathematics which requires real magic, and since it would be too boring to explain it, just take it if you need it and don't try to understand - to be honest, even I'm not perfectly sure how I did it.

The number of rows way:

lastRow = ((Math.DivRem(currentElementIndex + 1, numberOfColumns, out remainder1) + Convert.ToInt16(remainder1 > 0)) >= Math.DivRem(totalElements, numberOfColumns, out remainder2) + Convert.ToInt16(remainder2 > 0))
* Math.DivRem returns the division result and the remainder, since both are required for the calculation

The number of items in the last row way:

lastRow = (totalElements - (totalElements % numberOfColumns) - (numberOfColumns * Convert.ToInt16((totalElements % numberOfColumns) == 0)) < currentElementIndex + 1)
* % is the modulo - remainder from dividing

The ultimate way:

lastRow = totalElements < ((numberOfColumns + currentElementIndex) - (currentElementIndex % numberOfColumns) + 1)
* % is the modulo - remainder from dividing

Mathematics is awesome.

UPDATE (27.7.2011): Silly me. Overwhelmed by "The ultimate way", I missed the opportunity to simplify "The number of rows way". I guess this is one of the cases which explain why counting starts with 0 instead of 1 in programming.

The improved number of rows way:

lastRow = Math.DivRem(currentElementIndex, numberOfColumns, out result1) >= Math.DivRem(totalElements - 1, numberOfColumns, out result2)
* Math.DivRem returns the division result and the remainder, since we need the number rounded down.
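For anyone who does want to understand it: the "ultimate way" asks whether the total is no larger than the index at which the next row would start. The four variants can be checked against each other by brute force. The sketch below restates them in JavaScript rather than the C# used above (so `Math.floor` and `%` stand in for `Math.DivRem`), with zero-based indexes; the function names are mine.

```javascript
// "The number of rows way": compare the element's row with the number of rows.
function rowsWay(index, total, columns) {
  const rows = (count) => Math.floor(count / columns) + Number(count % columns > 0);
  return rows(index + 1) >= rows(total);
}

// "The number of items in the last row way".
function lastRowItemsWay(index, total, columns) {
  const remainder = total % columns;
  return total - remainder - columns * Number(remainder === 0) < index + 1;
}

// "The ultimate way": the next row would start at or after the total.
function ultimateWay(index, total, columns) {
  return total < columns + index - (index % columns) + 1;
}

// "The improved number of rows way": zero-based row numbers compared directly.
function improvedRowsWay(index, total, columns) {
  return Math.floor(index / columns) >= Math.floor((total - 1) / columns);
}

// Check that all four variants agree for every small table.
function allAgree(maxTotal, maxColumns) {
  for (let total = 1; total <= maxTotal; total++) {
    for (let columns = 1; columns <= maxColumns; columns++) {
      for (let index = 0; index < total; index++) {
        const expected = improvedRowsWay(index, total, columns);
        if (
          rowsWay(index, total, columns) !== expected ||
          lastRowItemsWay(index, total, columns) !== expected ||
          ultimateWay(index, total, columns) !== expected
        ) {
          return false;
        }
      }
    }
  }
  return true;
}

console.log(allAgree(50, 10)); // true
```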

Is Delicious aiming to become the next Twitter?

Thu, 26 May 2011 17:03:41 GMT

The bookmarking service Delicious has had an interesting life. It was one of the first social services available, later bought by Yahoo and almost canceled, and then sold to Avos about a month ago. Avos was founded by the same people who created YouTube (Chad Hurley and Steve Chen), and these guys obviously know what they're doing. A few days after acquiring Delicious, Avos also bought a social media analytics startup Tap11, and here's what they had to say about it:

"Our vision is to create the world's best platform for users to save, share, and discover new content. With the acquisition of Tap11, we will be able to provide consumer and enterprise users with powerful tools to publish and analyze their links’ impact in real-time."

While some bloggers think Avos will start competing against Google and Salesforce.com by analyzing social data, I can imagine a different strategy may be plotting. Let me explain.

Delicious was always ahead of its time, but did not really make it to the broad mainstream. It allows online bookmarks, which you can tag, bundle and keep in a library for later use. It knows asymmetric relationships, so you can check out bookmarks by the people you follow. The bookmarking engine is really powerful, but something was missing. Delicious' biggest problem is its social layer - too weak and of secondary importance. In the meantime, other services such as Digg, Reddit and StumbleUpon took their place on the web and added communities and different types of recommendation to link sharing. And of course, there's Twitter, the current ruler of content and real-time.

Actually, Twitter is slowly becoming a content sharing platform rather than a microblogging platform (I guess microblogging should involve content creation, not sharing). But while your links may bring you audience, they are not categorized and useful to you. Still, most people use Twitter that way, and even authority-measuring services such as PeerIndex and Klout encourage you to share links, because that's what Twitter is all about and that's what will make you influential.

I'm not saying Twitter is not useful, it is very much useful. But imagine having a solid bookmarking platform, very useful for the person who uses it (save). Add a generic social layer of friends and followers, a few comments, perhaps something similar to what YouTube has (share). Now add a hard core mathematical layer which is able to calculate what you'll like based on what you already liked (discover). What you get is something that could be very special, something that could compete even with Twitter. And it could be happening right now in Avos' laboratories.

One guy said that the age of social sharing is coming to an end. I think not, there are loads of information-thirsty people surfing the web. What's really missing is a new innovative and powerful platform, something useful in many different ways, for keeping, dispatching and receiving new, personalized content. Delicious 2.0?

Trademarks and logos are the property of their respective owners.

A few thoughts on content categorization. No surprises there, less is more.

Fri, 29 Apr 2011 12:26:00 GMT

Since I've started collecting bookmarks using Delicious, I've put a lot of effort into their categorization, organizing them in such a way their browsing would be as simple as possible. The service supports two level categorization (tag – bundle) which helps to control massive amounts of links people have gathered. But it's the experimentation with different structures that gives real insight into content categorization, and because this topic was already mentioned and discussed a few times on this blog, it deserves a special mention. Let's begin.

Categories vs. Tags

Observing other blogs, I've noticed a lot of them use both Categories and Tags. While I can understand the SEO (Search Engine Optimization) benefit in having as many different entry points (landing pages) as possible, I don't see any other added value in using both. From the logical point of view, they do the same thing (categorize content), but on a different level. Here's where tag bundles come in handy. With my bookmarks, I use tag bundles such as Wibe, Science, Brands, Work, etc., to combine different tags into groups according to their qualities. And aren't Categories and Tags just another form of the same thing, just two different tag bundles? Perhaps not, but that doesn't change the fact one is probably redundant.

I still see cases when Categories are used as single items (one post is filed under one category), while Tags are always used as multiple items (one post can have many tags). This corresponds with the technical 1:N and M:N database relationship, and even though the second is a bit more complex to create and maintain, it provides much more flexibility. Hierarchy vs. matrix.

Less is more, and intersections rock

Another thing I've noticed is that people use a lot of different tags. Too many to handle. I try to keep the number of tags as low as possible, working rather with intersections of tags (e.g. marketing + twitter) than looking for specific tags, used only a few times. I made a quick calculation on how this works, estimating a model with 10.000 contents and 200 tags, which corresponds with my situation on Delicious:

10.000 contents, 200 unique tags, average 5 tags per content
10.000 contents * 5 tags = 50.000 total tags
50.000 total tags / 200 unique tags = 250 occurrences of each tag (contents per tag)
5/200 probability of first tag * 4/199 probability of second tag = 1/1.990 (0,0005) probability of two specific tags on a single content
or 200! / (2! * (200-2)!) = 19.900 unique combinations of two tags; one bookmark with 5 tags allows 10 pairs of tags, making a combination's probability 10/19.900 = 1/1.990
1/1.990 * 3/198 = 1/131.340 (0,0000076) probability of three specific tags on a content
Result: on average, 5 contents out of 10.000 will contain two desired tags and 0,08 three tags

The model is built on the assumption that all tags are spread evenly, which is far from reality, but you get the picture, the number of contents with multiple tags is pretty low. But if you lower the number of unique tags (e.g. 150 tags instead of 200 would raise the number of contents with a pair of tags from 5 to 8,9) or use the same tags more often (e.g. 6 instead of 5 tags per content would raise the number from 5 to 7,5), the results get even better. Basic mathematics is a powerful tool, and intersections with two, three or more tags are definitely the way to go.
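The arithmetic above can be replayed with a few lines of Python. This is only a sketch of the same even-spread model, and the function name and parameters are made up for illustration: a content with a given number of tags covers some of all the possible tag combinations, and the share it covers is the probability of finding the wanted combination on it.

```python
from math import comb

def expected_matches(contents, unique_tags, tags_per_content, wanted):
    """Expected number of contents that carry all `wanted` specific tags.

    Assumes every tag is equally likely on every content, which is the
    same simplification the model in the text makes.
    """
    # A content with `tags_per_content` tags covers comb(tags_per_content, wanted)
    # of the comb(unique_tags, wanted) possible tag combinations.
    probability = comb(tags_per_content, wanted) / comb(unique_tags, wanted)
    return contents * probability

print(round(expected_matches(10_000, 200, 5, 2), 2))  # 5.03 (pair of tags)
print(round(expected_matches(10_000, 200, 5, 3), 2))  # 0.08 (three tags)
print(round(expected_matches(10_000, 150, 5, 2), 2))  # 8.95 (fewer unique tags)
print(round(expected_matches(10_000, 200, 6, 2), 2))  # 7.54 (more tags per content)
```

Changing the last argument to 4 or 5 shows how quickly intersections of more tags thin out in a collection of this size.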

Applications

I've made a few applications using the techniques mentioned. For general Categories of this blog, I used a combination of both, having Categories behave like Tags, using as few of them as possible (but attaching many to a single post), displaying them as a tag cloud (bottom of the page). I used a similar approach on my iTunes library, abusing song Comments to act as Tags for advanced smart playlists. And some time ago, I developed a simple engine for related content, based on occurrences of different Categories / Tags on my blog posts, acting both as an additional feature for readers and as a tool for internal hyperlinking, used for SEO.

These are a few cases which display the power of simplicity, using as little data as possible to create a lot of information. And while I know this is hard to do, I must continue to pursue this philosophy, may it be in software development or blogging (I ironically failed with this one). Things that are similar on an abstract, logical level, should be the same on the technical level. Try it, you'll be amazed by the results which will present themselves.

Can you believe Watson got the question about Slovenia wrong on Jeopardy?

Wed, 09 Mar 2011 07:33:12 GMT

Slovenia made it to the spotlight again, for the first time after the soccer world cup (when Slovenia was a trending topic on Twitter and a top search on Google). This time, it happened because IBM's supercomputer Watson competed against human champions in the famous TV show Jeopardy. IBM's computers are known to destroy people in various challenges: Deep Blue beat the world champion Garry Kasparov at chess in 1997. But chess is simple for computers to play, because it is pure logic and mathematics – the capability of a player is determined by the number of operations and actions it can calculate in advance. A quiz is a totally different story, where the biggest challenge is semantics – understanding the meaning of words.

In 1950, Alan Turing, one of the greatest pioneers of computing, introduced the Turing test, a methodology that could separate humans from computers using a set of questions, some of them formed in such a way that a computer wouldn't be able to understand and answer them. There are many questions which can't be answered with pure logic; the one I remember from high school goes something like this:

"Jack attended Sally's party, bringing a doll. What was the present?"

The catch is in the connection between party – (birthday) – present – doll, which can't be noticed without the abstract thinking humans are capable of. And today's computers still face the same problem - even though Watson dominated Jeopardy, it failed miserably on the following question about Slovenia:

"As of 2010, Croatia & Macedonia are candidates but this is the only former Yugoslav republic in the EU"

Watson's computing capabilities and knowledge banks are huge, but a question and an answer so obvious to humans presented a huge problem. Watson surely knows which countries are EU members, but it obviously didn't understand the question, thinking it was asked about which country would be next to start negotiating for EU membership, answering Serbia. The right answer was, of course, Slovenia.

The video is also fascinating from the cultural point of view – and extremely creepy. Those who have watched (or read) "2001: A Space Odyssey" may have experienced a slight shiver and carefully waited if Watson would say it: "Hello Dave". Others might have enjoyed this science fiction presentation, but besides Watson's obvious advantage in being the fastest to answer the question, it's clear that computers are still far away from being intelligent. And hopefully they will stay that way.

The chronolog now understands connections between content

Wed, 03 Nov 2010 21:20:56 GMT

I once made a promise that I will try to incorporate as many interesting features as possible into my blog. My previous development sessions were based mostly on interactions of readers with the posts, the peak of it being the Hot on the chronolog algorithm. But now, as the chronolog finally reached critical mass in the amount of content it operates with, the time has come to do something new. The next step is focused on a different functionality, and a few days ago, the chronolog received an algorithm for recognizing relationships between different blog posts.

The connections

The whole concept is based on the occurrences of categories (which are actually tags) on different blog posts, the most obvious being the number of the same tags two different posts share. We did something similar on a web portal we launched a few months ago, and it works pretty well. Sure, the proper way to do it would be using real text mining, where the strength of the relationships would be based on meaning and occurrences of words and external hyperlinks in a specific post. But in this stage, I'm keeping it simple: if two posts share a lot of tags, they appear more related.

The weight

Since some categories (tags) are used more often, they appear in many posts, making these posts too heavily related with each other. The number of categories attached to a single post also varies, giving a post with many tags a much stronger chance to appear as related to another. Therefore the general equation contains two modifiers, which are giving weight to each shared tag between two posts.

Categories that appear only a few times globally have more weight, because they represent a scarcer and therefore more interesting and stronger connection. This takes care of the tags which are used very often, making them not too dominant. On the other hand, the weight of each tag on a post drops with the total number of tags the post has, so posts which have a lot of tags don't become every other post's related post. It may sound confusing, but it's probably a bit simpler to develop than to explain.
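To make the two modifiers a bit more tangible, here is a minimal Python sketch. The exact formulas and constants are not given here, so both are assumptions made purely for illustration: a shared tag counts as 1 divided by its global usage (rare tags weigh more), and the sum is divided by the square root of the product of both posts' tag counts (posts with many tags get diluted). The function name and the sample posts are hypothetical.

```python
from collections import Counter

def related_scores(posts, target):
    """Rank other posts by how related they are to `target`.

    posts: dict mapping a post id to its set of tags.
    Both weighting formulas below are illustrative assumptions.
    """
    # How many posts use each tag globally.
    tag_usage = Counter(tag for tags in posts.values() for tag in tags)
    target_tags = posts[target]
    scores = {}
    for post_id, tags in posts.items():
        if post_id == target:
            continue
        shared = target_tags & tags
        if not shared:
            continue
        # Modifier 1: rarely used tags make a stronger connection.
        rarity = sum(1.0 / tag_usage[tag] for tag in shared)
        # Modifier 2: each tag weighs less on posts that carry many tags.
        dilution = (len(target_tags) * len(tags)) ** 0.5
        scores[post_id] = rarity / dilution
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

posts = {
    "logos": {"design", "web2.0", "marketing"},
    "icons": {"design", "web2.0"},
    "seo": {"marketing", "google", "blogging", "web2.0"},
    "beer": {"fun"},
}
for post_id, score in related_scores(posts, "logos"):
    print(post_id, round(score, 3))
```

In this toy data both "icons" and "seo" share two tags with "logos", but "seo" carries more tags overall, so it ends up ranked lower.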

The results

I was actually quite surprised by the results the algorithm produces (which you can now see on the bottom of every post). As I was playing around a bit, observing how the calculation behaves and playing with constants, I actually found some interesting connections between posts which I didn't notice before. The engine finds quite a strong relationship between the post about using Web 2.0 logos in TV commercials and the one about Round browser icons, both of them being design clichés. The case of Microsoft and Google going social also made it strong, as the two posts are describing the struggle of two technology giants trying to adapt to the new situation. I could go on and on, but then you would probably just say I was doing SEO too hard.

Search Engine Optimization (SEO) is actually another hidden benefit of the feature, something that occurred to me after I had already finished working on it. Google likes it if you have your content internally cross-linked, so what better way to do it than to have automation take care of it. So until SEO dies, this new functionality is actually a double win, because the chronolog became more optimized for crawlers and hopefully more useful for the readers. Even though most of you probably won't even notice.

Twitfluence prototype calculation for measuring Twitter influence

Sun, 01 Aug 2010 11:54:40 GMT

The prototype calculation of Twitfluence uses the data available from the Twitter API to measure your Twitter influence and coolness. The basic technical specifications of the application are available, but I will also be supplying the basic information about how the algorithm works. The actual calculation is already online for beta users, and generally speaking, there are three major components that add up to the score: your followers, your mentions and retweets, and your lists, all accounted as ratios between you and others.

Followers

The strongest component of the calculation is the number of followers you have. In my opinion, your presence on Twitter and getting followers can be influenced by at least the following three major factors concerning you and your Twitter account:

Persona – how known you are. Measured by the number of followers you have, compared to your time on Twitter.
Engagement – how engaged you are. Measured by the number of followers you have, compared to the number of people you follow, and by the number of followers you have, compared to the number of mentions and retweets you’ve made.
Wits – how smart and creative your tweets are. Measured by the number of followers you have compared to the total number of tweets you've made.

For this part, I gave the followers/following ratio the weight of 3, the followers/tweets a weight of 2 and the followers/time a weight of 1. The followers/(mentions + retweets) has a weight of 0.5 and works in the negative way, so people who bother other people get a bit of a minus to their followers result. Besides, those who are able to get the same number of followers without mentioning people, must have a small advantage.

(Needs to be upgraded with taking into account only your mentions and retweets of people who don’t follow you.)

Interaction (mentions, replies, retweets)

The second most important part of the calculation is the ratio between mentions and being mentioned, together with the number of retweets you get with the absolute "reach" of those retweets (measured in the number of people who follow people that retweeted you). A similar reach is also accounted in the mentions and replies. This component of the calculation uses only the data from the last month, also to make Twitfluence a bit dynamic for multiple calculations for a single user over time. To finalize this part, the total number of tweets in the last month also contributes a small score.

(Needs to be upgraded with unique reaches of your retweets and mentions. For now, it just adds them together.)

Lists

Twitter lists are getting used more and more, so they are also considered in the calculation. The number of lists you appear on, the number of people who follow those lists and the number of people, who follow lists you've created are the basic parameters for the calculation. This component adds only a small bit to the final score.

(Needs to be upgraded with unique reaches)

The basic ratio calculation

All ratios in the calculation are based on the same elementary formula, which looks like this:

Generic result = Sqrt(others / you) * Log10(modifier + 10)
Followers = Sqrt(followers / following) * Log10(followers + 10)
Mentioned = Sqrt(mentioned / mentions) * Log10(mentioned + 10)

I've decided to go for this architecture for a number of reasons. For instance, the followers / following and other ratios are used to get an objective value for all Twitter users. This ratio gets square rooted so the differences between people are not so huge. The multiplication is there for adjustment, so that among people with the same ratio, those whose absolute numbers are bigger get more points. The logarithm is used to make this modifier of the absolute number smaller, while + 10 is used so the modifier never drops below 1 (Log10(10) = 1, and the logarithmic function is more stable from there on). This means that the modifier for an absolute number of 10 is around 1.3, for 100 around 2, for 1000 around 3 etc.
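For those who prefer code to formulas, the elementary ratio can be written down in Python roughly like this. It is a sketch of the formula as stated above, not the production code; the function names are made up, and the guard against dividing by zero is my own assumption, since the formula itself doesn't say what happens when you follow nobody.

```python
from math import log10, sqrt

def ratio_score(others, you, modifier):
    """Generic result = Sqrt(others / you) * Log10(modifier + 10)."""
    you = max(you, 1)  # assumption: avoid division by zero
    return sqrt(others / you) * log10(modifier + 10)

def followers_score(followers, following):
    """Followers = Sqrt(followers / following) * Log10(followers + 10)."""
    return ratio_score(followers, following, followers)

# The logarithmic modifier alone, for growing absolute numbers.
for absolute in (10, 100, 1000):
    print(absolute, round(log10(absolute + 10), 2))  # 1.3, 2.04, 3.0

# Same 4:1 ratio, different absolute numbers: the bigger account scores more.
print(round(followers_score(400, 100), 2))
print(round(followers_score(4000, 1000), 2))
```

The last two lines show the adjustment at work: the square-rooted ratio is 2 in both cases, and only the logarithm of the absolute number separates the two accounts.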

Putting it together

The three major components currently have the following weight in the final score:

Followers: around 60%
Mentions and retweets: around 30%
Lists: around 10%

That's about it for now. I've tested the behavior with some real accounts (thanks for the help @TejaSmeja and @jakasibicekaka), together with some projections, and it seems to be working quite OK. But the real test will happen after it analyzes results of actual people, which will allow real insight into the performance and objectivity. Twitfluence will be online soon, and I will be asking you to help with testing the prototype. You are also more than welcome to leave any kind of feedback about the calculation as I've described it.

Let's play.

An approach to statistics and data analysis

Mon, 30 Nov 2009 20:32:06 GMT

When information systems evolve, they become greedier for both operational and advanced strategic statistics and data analysis. This need is a part of a natural evolution. The more data you have, the higher potential for extracting information you have. Looking at business environments using IT platforms, that's what analytics are actually all about - getting useful information from usually bad data. It turns out the task of analytical reporting is not so complex as it seems, but you definitely need a set of different skills / people to make it work.

There are tons of different statistical approaches, methods and theories, but it turns out that for average business needs you only need basic mathematics, where the most complex operations are sometimes logarithms. So, if it's so simple, where does the problem lie? Why do information systems often lack analytical support, which can be used for decision making?

In my opinion there are three main steps to consider when trying to make useful statistics and data analysis, and ignoring or underestimating any one of them will make your reports suck.

Data

Data is the king. If you don't have the data, you might as well give it up. If your data is bad or weak, you might consider rebuilding it. But you should know one thing - the better the structure of your data is, the better your analysis will be. Using a flat database such as a text file or an Excel spreadsheet gives you few analytical opportunities. Relational databases, such as Access, MySQL or SQL offer cross-data querying and advanced reporting, but huge and complex calculations can take a lot of time. For those, a multidimensional OLAP database designed strictly for analysis becomes the only option.

Challenges in this step: Technical

Information

The data discussed above defines the scope of potential information you can deliver. In this step, the main goal is simple - you need to know what you want to know. Business needs, process flow, strategic goals or just plain simple amusement are the main factors that need to be addressed. Having someone who is able to recognize these opportunities is crucial, because data is just numbers, but aggregated data - information - is knowledge. It's quite clear you won't be able to get something if you don't know what you want to get.

Challenges in this step: Analytical

Visualization

A picture can tell a thousand words and this goes a long way for data visualization. Even if you can't use charts, you can color information and use measures such as font size to represent another dimension of information or trends. Besides, always keep in mind that less is more, so you should put irrelevant information in the background and punchlines in the spotlight. Check out different chart types, they're useful for different representations and experimenting with them can display things that don't seem there at first sight. Observe patterns. Try to imagine a playground, where information can satisfy your curiosity and while doing it, it also brings useful and valuable results.

Challenges in this step: Creative

If you have the will, you can do all sorts of crazy stuff with statistics and data analysis, but you should know they sometimes take a lot of time. I'm proud my chronolog already has two nice looking children of these activities. The first one is a simple recommendation engine used for content ranking and the other one a set of reports which offer insight into activity and interactions of the chronolog. What can I say, I like to play around, and it may as well be any information system I can get my hands on. Give me the data and I'll give you information.

Hot on the chronolog - and how it works

Sat, 10 Oct 2009 19:49:41 GMT

When I first published my chronolog, a few people were making remarks about how it resembles FriendFeed, Twitter or Tumblr. I can't deny that. The influences of Web 2.0 are huge both on my personal and business life, so why should the chronolog be any different? It is a mashup of different web services and it displays information from different sources, so it's a kind of a Web 2.0 stream. But besides that, it's also my own personal playground for testing and developing high level services and functionalities, which will hopefully be cool and fun and make the chronolog interesting for all of us. Demonstration of concept and technology, if you like.

I already have a few of those smart features planned, and I can give you a little teaser already. I really look forward to developing the custom view of the chronolog, where advanced users will be able to do a bit of configuration. The prototype is already half developed, but sadly far from production. A different thing I'm working on is a complex set of statistics and analytics, which should give us deeper insight into the chronolog, its data and our interactions with it. This one will probably go out next and it actually inspired the one already complete. From this day forward, the chronolog supports Hot on the chronolog, accessible from the views menu top right, which shows the most interesting posts in the desired time period.

A few Web 2.0 portals (especially those oriented toward social news or social bookmarking) have recommendation engines, which give users access to information based on their interaction with the system. I would like to try that one out too once, but because I don't have registered users, the chronolog probably won't be the environment. What I can give you now is the Hot view, which displays the most important posts based on the interaction (views, likes, comments) of all users of the chronolog. A global recommendation engine of some sort. I'm quite pleased with the algorithm I've developed, it looks like it's working, so you can give it a try.

Some of you will be interested in how it works. The core is a really super mega awesomely complex algorithm that assigns ponders (weights) to different interactions in the selected time span. Well, it's not that complex from the mathematical point of view, but it's still pretty smart. Combining these ponders and the number of interactions, using a few square roots and logarithms, plus a small modifier for insert date (if two posts are tied, the older one appears "stronger"), it calculates which posts are more interesting and relevant and gives them a score accordingly. Simple as that. Besides, it is also able to make that calculation for any time period. You can even hack it by changing the ?d=# in the url to any number of days you like.
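Since the actual formula is not spelled out here, the following Python snippet is only a guess at its general shape, a sketch under loudly stated assumptions: the ponder values, the way the square root and logarithm are combined, and the size of the date modifier are all invented for illustration.

```python
from math import log10, sqrt

# Hypothetical ponders (weights) per interaction type; not the real values.
PONDERS = {"views": 1.0, "likes": 3.0, "comments": 5.0}

def hot_score(interactions, age_days):
    """Score one post from its interaction counts in the selected period.

    interactions: dict such as {"views": 120, "likes": 4, "comments": 2}
    age_days: days since the post was inserted.
    """
    # Square roots dampen huge counts, ponders rank the interaction types.
    weighted = sum(PONDERS[kind] * sqrt(count) for kind, count in interactions.items())
    # The logarithm compresses the scale further; +1 keeps an empty post at 0.
    base = log10(1 + weighted)
    # Tiny date modifier: between tied posts, the older one appears "stronger".
    return base + 1e-6 * age_days

quiet = hot_score({"views": 50, "likes": 0, "comments": 0}, age_days=3)
lively = hot_score({"views": 50, "likes": 4, "comments": 2}, age_days=3)
older_twin = hot_score({"views": 50, "likes": 4, "comments": 2}, age_days=10)
print(quiet < lively < older_twin)
```

Feeding it only the interactions from the last d days would correspond to the ?d=# parameter mentioned above.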

When viewing longer periods (months, years), blog posts will probably take most of the top spots, because they are supported with social networking sites and have the most interactions. In the shorter periods (days, weeks), other types of posts will also take higher ranks. We will see if the algorithm works in the longer term too, when more users will be clicking around, but if needed, the calculation will be changed or modified. Oh, I almost forgot about the design touch I added - the importance of a post is portrayed using transparency, which looks quite cool and is a great example of using design for function.

The chronolog becomes smart. Hope you like it.

Billion = Trillion: who is the one that can't count?

Sun, 21 Jun 2009 18:53:07 GMT

There are a lot of cultural differences around the world and between individual countries of the western civilization. On which side of the road should I drive, how hot the weather is or perhaps most importantly - how big this beer is. The reaches of different measurement and interpretation are immense, so why should counting be any different.

How big is a billion? There are two different ways of naming big numbers, and they are called the Short scale and the Long scale. The long scale numerical system was used first, but in the 17th century, when traditional six-digit groups were split up into three-digit groups, the short scale slowly came into use. Today, the short scale numerical system is in use mainly in English speaking countries, while the long scale is used in central Europe and around the world.

I personally prefer using the long scale, as it is mathematically more correct. Actually, I have no other choice, but it seems easy to represent something you were born into.

So what is the main difference? The long scale numerical system uses the word Billion to represent a million millions or a million squared (1.000.000² = 1.000.000.000.000), and a Trillion is a million to the power of three or a million billions (1.000.000³). On the other hand, the short scale uses "one more" for every new term greater than a million. In this case, Billion represents a thousand millions (1.000.000.000), Trillion is a million millions (1.000.000.000.000) etc., so yup, a long scale billion equals a short scale trillion.
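The two naming systems boil down to two tiny formulas, sketched here in Python. The function names are made up; n is simply the position of the name in the series (1 for million, 2 for billion, 3 for trillion).

```python
NAMES = {1: "million", 2: "billion", 3: "trillion", 4: "quadrillion"}

def long_scale(n):
    """Long scale: every new name is another power of a million."""
    return 10 ** (6 * n)

def short_scale(n):
    """Short scale: every new name is a thousand times the previous one."""
    return 10 ** (3 * n + 3)

for n, name in NAMES.items():
    # Thousands separated with dots, the way numbers are written in this post.
    long_value = f"{long_scale(n):,}".replace(",", ".")
    short_value = f"{short_scale(n):,}".replace(",", ".")
    print(f"{name}: long scale {long_value}, short scale {short_value}")

# The long scale billion and the short scale trillion are the same number.
print(long_scale(2) == short_scale(3))  # True
```

Both scales agree on the million, and drift further apart with every name after it.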

It's an interesting world we live in. And different date formats are a pain in the ass for software developers.