Linearization
For the purposes of illustration, I created a dataset of well-known authors in our field, with the number of hits each name scores in a Google search. When I use the raw data to create a tag cloud, I get the result in Figure 2(a). The tag cloud presents most of the names in approximately the same size. Only a few names jump out, and some are nearly illegible. The reason is that the weights are not distributed evenly over the range of the source data. Most of the authors on my bookshelf have (roughly) the same number of Google hits. Only a few authors have either very many or very few hits. The data appear to follow a normal (or Gaussian) distribution, examples of which you can see in Figure 3. To get a more evenly distributed range of font sizes in the tag cloud, it is necessary to "linearize" the original values. You get a better result when you use a linearized representation, as in Figure 2(b). Technically, linearization means that the font sizes reflect the weights less accurately. But because the tags have differing word lengths, there is already no such thing as an accurate reflection of the weights. Here, we are interested in usability, not accuracy.
The Pareto distribution, or "80-20 rule" (see Figure 4), is also frequently encountered. In this distribution, 80 percent of the weights are in the lowest 20 percent of the range, while the other 20 percent fill the remaining 80 percent of the range, or the other way around. Well-known examples of this distribution include wealth among people, popularity of websites, and the frequency of words in the English language. You need to select the right algorithm for linearization of your dataset. In Figure 2(c), my dataset (which contains a normal distribution) is linearized as if it contained a Pareto distribution. The result can be weird when you select the wrong distribution model. Strangely enough, I've noticed several authors doing exactly the opposite: they linearized datasets that contained Pareto distributions assuming (unknowingly, I suppose) that they were normal distributions. Evidently, statistical knowledge itself is not distributed evenly among software developers.
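One rough way to check which model fits is to apply the 80-20 description literally and measure what share of the weights falls in the lowest 20 percent of the range. The following is only a sketch: the module name DistributionCheck and the function ShareInLowestFifth are hypothetical and not part of the listings, and a single number like this is a hint, not a statistical test.

```vbnet
Imports System
Imports System.Collections.Generic

Public Module DistributionCheck
    'Hypothetical helper (an assumption, not from the article's listings):
    'returns the fraction of weights in the lowest 20 percent of the range.
    Public Function ShareInLowestFifth( _
            ByVal weights As ICollection(Of Decimal)) As Double
        If weights.Count = 0 Then Return 0.0

        'Find the minimum and maximum weight.
        Dim min As Decimal = Decimal.MaxValue
        Dim max As Decimal = Decimal.MinValue
        For Each w As Decimal In weights
            If w < min Then min = w
            If w > max Then max = w
        Next

        'Upper boundary of the lowest fifth of the range.
        Dim threshold As Decimal = min + (max - min) * 0.2D

        'Count the weights at or below that boundary.
        'Note: when all weights are equal, every weight is counted.
        Dim count As Integer = 0
        For Each w As Decimal In weights
            If w <= threshold Then count += 1
        Next
        Return count / weights.Count
    End Function
End Module
```

For the weights 1, 2, 3, 4, and 100, the boundary is 20.8 and the function returns 0.8, which points toward a Pareto-like shape. A value well below that suggests the weights are spread more evenly, or clustered around the middle as in a normal distribution.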
You will need several functions when linearizing multiple types of distributions. Each function only needs one collection of weights as input, and it returns a new (linearized) version of the collection. I suggest you work with generic interfaces for collections so that you can apply the same functions to different types of data sources. It is necessary to specify explicit upper and lower boundaries to the desired range of output values. It also seems proper to work with decimal or real numbers, not integers. Rounding the values to integers should be left to the UI code, in my opinion.
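To illustrate the division of labor suggested here, this is a minimal sketch of what the UI side might do with the linearized values. The module TagCloudUi and the function ToFontSizes are hypothetical names, and the choice to round halves away from zero is an assumption; your UI may prefer a different rule.

```vbnet
Imports System
Imports System.Collections.Generic

Public Module TagCloudUi
    'Hypothetical UI-side helper (an assumption, not from the listings):
    'rounds linearized Decimal sizes to whole font sizes.
    Public Function ToFontSizes( _
            ByVal sizes As ICollection(Of Decimal)) As List(Of Integer)
        Dim result As New List(Of Integer)
        For Each s As Decimal In sizes
            'Round halves away from zero, so 10.5 becomes 11.
            result.Add(CInt(Math.Round(s, MidpointRounding.AwayFromZero)))
        Next
        Return result
    End Function
End Module
```

Because the parameter is typed as ICollection(Of Decimal), the same function accepts a List(Of Decimal), an array of Decimal, or any other collection that implements the interface, which is the point of working with generic collection interfaces.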
Listing Two is my attempt at linearizing a normal distribution; it is partly based on some examples on the Internet. The function calculates the mean and the standard deviation (sd) of the weights and makes the statistically sound assumption that nearly all values (about 95 percent, for a true normal distribution) lie between mean - 2 * sd and mean + 2 * sd. Each weight is then mapped onto a straight line through that range, and results that fall outside the range are clamped to the minimum or maximum size. Listing Three presents an algorithm that linearizes a Pareto distribution. This function first takes the logarithm of each weight, with e as the base. (Diehards among us may want to experiment with other bases, but note that because the log values are subsequently rescaled between their minimum and maximum, the base cancels out and does not change the result of this function.) The remainder of the function maps the log values onto a straight line between the minimum and maximum sizes.
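As a sketch, the two mappings described above can be written as formulas. The notation is an assumption introduced here for illustration: w_i are the input weights, n is their count, and s_min and s_max stand for the minSize and maxSize parameters. Both formulas apply only when the spread is nonzero; in the degenerate case the listings return the middle of the size range instead.

```latex
% Listing Two (normal distribution): mean and population standard deviation
\bar{w} = \frac{1}{n}\sum_{i=1}^{n} w_i, \qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (w_i - \bar{w})^2}

% Straight line through [\bar{w} - 2\sigma, \bar{w} + 2\sigma], clamped to the size range
s_i = \min\!\Bigl(s_{\max},\ \max\!\Bigl(s_{\min},\
      \frac{s_{\min} + s_{\max}}{2}
      + \frac{s_{\max} - s_{\min}}{4\sigma}\,(w_i - \bar{w})\Bigr)\Bigr)

% Listing Three (Pareto distribution), with \ell_i = \log_b w_i
s_i = s_{\min} + (s_{\max} - s_{\min})\,
      \frac{\ell_i - \ell_{\min}}{\ell_{\max} - \ell_{\min}}

% Since \log_b w = \ln w / \ln b, the factor 1/\ln b cancels in the ratio
\frac{\log_b w_i - \log_b w_{\min}}{\log_b w_{\max} - \log_b w_{\min}}
= \frac{\ln w_i - \ln w_{\min}}{\ln w_{\max} - \ln w_{\min}}
```

The last line shows that, for the min-max rescaling used in Listing Three, the logarithm base does not influence the resulting sizes.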
Public Shared Function FromBellCurve( _
        ByVal weights As ICollection(Of Decimal), _
        ByVal minSize As Decimal, ByVal maxSize As Decimal) _
        As ICollection(Of Decimal)

    'First, calculate the mean weight.
    Dim meansum As Decimal = 0
    For Each w As Decimal In weights
        meansum += w
    Next
    Dim mean As Double = meansum / weights.Count

    'Second, calculate the standard deviation of the weights.
    Dim sdsum As Double = 0
    For Each w As Decimal In weights
        sdsum += (w - mean) ^ 2
    Next
    Dim sd As Double = ((1 / weights.Count) * sdsum) ^ 0.5

    'Now calculate the slope of a straight line from -2*sd to +2*sd.
    Dim slope As Double
    If sd > 0 Then
        slope = (maxSize - minSize) / (4 * sd)
    End If

    'Get the value in the middle between minSize and maxSize.
    Dim middle As Double = (minSize + maxSize) / 2

    'Calculate the result for the given deviation from mean.
    Dim output As New List(Of Decimal)
    For Each w As Decimal In weights
        If (sd = 0) Then
            'With sd=0 all tags have the same weight.
            output.Add(CDec(middle))
        Else
            'Calculate the distance from mean for this weight.
            Dim distance As Double = w - mean
            'Calculate the position on the slope for this distance.
            Dim result As Double = CDec(slope * distance + middle)
            'If the tag turned out too small, set minSize.
            If result < minSize Then result = minSize
            'If the tag turned out too big, set maxSize.
            If result > maxSize Then result = maxSize
            output.Add(CDec(result))
        End If
    Next
    Return output
End Function
Public Shared Function FromParetoCurve( _
        ByVal weights As ICollection(Of Decimal), _
        ByVal minSize As Decimal, ByVal maxSize As Decimal) _
        As ICollection(Of Decimal)

    'Convert each weight to its log value.
    Const BASE As Double = Math.E
    Dim logweights As New List(Of Decimal)
    For Each w As Decimal In weights
        logweights.Add(CDec(Math.Log(w, BASE)))
    Next

    'First, find the min and max weight.
    Dim min As Decimal = Decimal.MaxValue
    Dim max As Decimal = Decimal.MinValue
    For Each w As Decimal In logweights
        If w < min Then min = w
        If w > max Then max = w
    Next

    'Now calculate the slope of a straight line, from min to max.
    Dim slope As Double
    If max > min Then
        slope = (maxSize - minSize) / (max - min)
    End If

    'Get the value in the middle between minSize and maxSize.
    Dim middle As Double = (minSize + maxSize) / 2

    'Calculate the result for each of the weights.
    Dim output As New List(Of Decimal)
    For Each w As Decimal In logweights
        If (max <= min) Then
            'With max=min all tags have the same weight.
            output.Add(CDec(middle))
        Else
            'Calculate the distance from the minimum for this weight.
            Dim distance As Double = w - min
            'Calculate the position on the slope for this distance.
            Dim result As Double = CDec(slope * distance + minSize)
            'If the tag turned out too small, set minSize.
            If result < minSize Then result = minSize
            'If the tag turned out too big, set maxSize.
            If result > maxSize Then result = maxSize
            output.Add(CDec(result))
        End If
    Next
    Return output
End Function
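One caveat with the logarithm in FromParetoCurve: Math.Log returns negative infinity for a weight of 0 and NaN for a negative weight, and as far as I can tell the conversion of either value with CDec then fails with an OverflowException. A dataset of search hits can plausibly contain a 0. The following sketch raises every weight to a small positive minimum first. The module WeightGuards, the function WithMinimum, and the parameter minWeight are hypothetical names, and clamping is only one possible policy; filtering such tags out is another.

```vbnet
Imports System
Imports System.Collections.Generic

Public Module WeightGuards
    'Hypothetical helper (an assumption, not part of the listings):
    'raises every weight below minWeight up to minWeight, so that a
    'later Math.Log call only receives positive input.
    Public Function WithMinimum( _
            ByVal weights As ICollection(Of Decimal), _
            ByVal minWeight As Decimal) As List(Of Decimal)
        If minWeight <= 0 Then
            Throw New ArgumentOutOfRangeException( _
                "minWeight", "minWeight must be positive.")
        End If
        Dim output As New List(Of Decimal)
        For Each w As Decimal In weights
            output.Add(Math.Max(w, minWeight))
        Next
        Return output
    End Function
End Module
```

With hit counts, a minimum of 1 is a natural choice: the result of WithMinimum(weights, 1D) can be passed to FromParetoCurve, and a tag with zero hits then simply ends up at the smallest size.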