Wednesday, December 30, 2015

Data visualization with D3.js

What is Data Visualization?

Data visualization is the presentation of data in a pictorial or graphical format. It helps people understand the significance of data by placing it in a visual context. This matters because raw data on its own can be very hard to understand and analyze.

Why Data Visualization?

As large volumes of data are collected and analyzed, decision makers use data visualization tools to see analytical results presented visually, find relationships among variables, communicate concepts and hypotheses to others, and even predict future results. Because of the way the human brain processes information, people grasp the significance of many data points faster when they are displayed in charts and graphs than when they are spread across piles of spreadsheets, flat files or tables in reports. Visualization makes the data easier to interpret, saving time and energy.

What is D3?

D3 is a JavaScript library for manipulating documents based on data, which makes it well suited to interactive visualization. D3 helps bring data to life using HTML, SVG, and CSS. D3 stands for Data-Driven Documents. Here, "documents" refers to the DOM (Document Object Model) structure of an HTML page. D3 allows developers to bind arbitrary data to the DOM and then apply data-driven transformations to the document.

Selections in D3

Before moving into the details, first look at the initial version of our HTML document below. (I'm referring to a local copy of the D3.js library here.)

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Example in D3</title>
    <script src="d3.min.js"></script>
</head>
<body>
<script>
    //D3 code goes here
</script>
</body>
</html>

Similar to jQuery, D3 allows us to select elements from the DOM using CSS selectors, for instance by id, class attribute or tag name. The result of a select operation is a selection that wraps the matched elements.

In D3, select() selects a single element from the DOM: the first one that matches the selector. For example, to color the background of the body tag blue, we can write the following.

d3.select("body").style("background-color", "blue");

We can also append a new DOM element, given by its element name, to each element in the selection. The example below appends an h1 element to the body and sets its content.

d3.select("body").append("h1").html("Support Vector Machines");

In D3.js, the selectAll() method also uses CSS3 selectors to grab DOM elements. Unlike the select() method mentioned previously, selectAll() selects all the elements in the DOM that match the given selector string.

d3.selectAll("p").style("background-color","blue");

The above example selects all the <p> elements available on the page. If there are none, it returns an empty selection. The most important thing here is that we don't need to loop over the set of elements in order to modify them. Instead, we apply the style operator to the selection, and D3 takes care of invoking it on every single element within.
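To make the implicit iteration concrete, here is a minimal sketch in plain JavaScript. It is not D3's implementation: ToySelection and the plain objects standing in for DOM elements are hypothetical names, used only to show how an operator can apply itself to every element and return the selection so that calls can be chained.

```javascript
// Hypothetical stand-in for a D3 selection.
// Plain objects play the role of DOM elements here.
function ToySelection(elements) {
    this.elements = elements;
}

// Applies one style to every wrapped element and returns the selection,
// so the caller never writes an explicit loop and can chain further calls.
ToySelection.prototype.style = function (name, value) {
    this.elements.forEach(function (element) {
        element.style[name] = value;
    });
    return this;
};

var paragraphs = [{ style: {} }, { style: {} }, { style: {} }];
var selection = new ToySelection(paragraphs);

selection.style("background-color", "blue").style("color", "white");

// Every element received both styles.
console.log(paragraphs[2].style);
```

An empty ToySelection behaves the same way: style() simply has nothing to iterate over, which mirrors how an empty selection in D3 is harmless to operate on.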

Scales in D3

Scales transform numbers or discrete values in a certain interval (called the domain) into numbers in another interval (called the range). For instance, suppose we have a dataset whose values are always over 100 and always below 800. We would like to plot it, say, in a bar chart whose bars can be at most 100 pixels long.

The domain (data space) represents the boundaries within which our data lies. For example, if I have an array of numbers with no number smaller than 1 and no number larger than 1000, my domain would be 1 to 1000.

There will not always be a direct mapping between data points and actual pixels on the screen. For example, if we are plotting a graph of sales and the sales figures are in the tens of thousands, it is unlikely that we can draw a bar with the same pixel length as the data value. In that case, we need to specify the boundaries within which the original data can be transformed. These boundaries are called the range.
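The mapping from domain to range can be sketched in plain JavaScript. The linearScale helper below is a hypothetical name, not part of D3; it only illustrates the linear interpolation idea behind a linear scale, under the assumption that the domain and range are each given as a two-element array.

```javascript
// Hypothetical helper: returns a function that maps a value from the
// domain interval to the range interval by linear interpolation.
function linearScale(domain, range) {
    var d0 = domain[0], d1 = domain[1];
    var r0 = range[0], r1 = range[1];
    return function (value) {
        // Relative position of the value within the domain (0 at d0, 1 at d1)
        var t = (value - d0) / (d1 - d0);
        return r0 + t * (r1 - r0);
    };
}

var scale = linearScale([0, 1000], [0, 100]);
console.log(scale(0));    // 0
console.log(scale(250));  // 25
console.log(scale(1000)); // 100

// Values between 100 and 800 drawn within 100 pixels
var barLength = linearScale([100, 800], [0, 100]);
console.log(barLength(450)); // 50
```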

The most common types of scales are quantitative scales and ordinal scales. Quantitative scale functions translate one numeric value into another numeric value using different types of equations, such as linear or logarithmic. Data may not always be numeric, though. It may contain ordinal, categorical or discrete values, such as the letters of the alphabet. Letters are ordinal values (categorical values with a clear ordering): they can be arranged in order, but unlike numbers you cannot derive one letter from another.

d3.scale.linear() - It transforms numeric data in a given dataset into pixel space.
eg: d3.scale.linear().domain([0,1000]).range([0, 100]); 

d3.scale.ordinal() - It transforms data that has discrete values into pixel space.
eg: var scale = d3.scale.ordinal().domain(["A", "B", "C", "D"]).rangePoints([0, 100]);

If you write the output of scale("A"), scale("B"), scale("C") and scale("D") to the console (eg: console.log(scale("A"))), it prints 0, 33.33, 66.67 and 100 respectively (values rounded to two decimal places).

With rangePoints(interval), D3 fits n points, one per category in the domain, within the interval. The first point takes the value at the beginning of the interval and the last point takes the value at the end of the interval.

With rangeBands(interval), D3 fits n bands within the interval. Each category maps to the start of its band, so the value of the last item in the domain is less than the upper bound of the interval.

If we use rangeRound() instead of range(), the outputs of the scale are guaranteed to be integers, which is better for positioning marks on the screen with pixel precision than numbers with decimals. For ordinal scales, rangeRoundBands() plays the same role.
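The difference between points and bands can be sketched without D3. The two helpers below, pointPositions and bandPositions, are hypothetical names; they assume no padding and only reproduce the positions described above, not D3's full behavior.

```javascript
// Hypothetical helper: n points spread over [start, stop].
// The first point sits at start and the last point at stop.
function pointPositions(categories, start, stop) {
    var step = categories.length > 1
        ? (stop - start) / (categories.length - 1)
        : 0;
    return categories.map(function (category, i) {
        return { category: category, position: start + i * step };
    });
}

// Hypothetical helper: n equal-width bands inside [start, stop].
// Each position is the start of a band, so the last one is below stop.
function bandPositions(categories, start, stop) {
    var bandWidth = (stop - start) / categories.length;
    return categories.map(function (category, i) {
        return {
            category: category,
            position: start + i * bandWidth,
            width: bandWidth
        };
    });
}

var letters = ["A", "B", "C", "D"];
var points = pointPositions(letters, 0, 100);
var bands = bandPositions(letters, 0, 100);

console.log(points.map(function (p) { return p.position.toFixed(2); }));
// [ '0.00', '33.33', '66.67', '100.00' ]
console.log(bands.map(function (b) { return b.position; }));
// [ 0, 25, 50, 75 ]
```

With bands, "D" starts at 75 and occupies the last 25 pixels, which is why its value stays below the upper bound of the interval.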

How to visualize a dataset using D3?

In this section I will describe how to plot a horizontal bar chart for the following sample dataset.

 [{
    "tableName": "PROCESS_USAGE_SUMMARY_DATA",
    "timestamp": 1447495522776,
    "values": {
        "processDefKey": ["manualTaskProcess:1:14"],
        "avgExecutionTime": 31066
    }
}, {
    "tableName": "PROCESS_USAGE_SUMMARY_DATA",
    "timestamp": 1447495522890,
    "values": {
        "processDefKey": ["VacationRequest:1:22"],
        "avgExecutionTime": 16215.5
    }
}, {
    "tableName": "PROCESS_USAGE_SUMMARY_DATA",
    "timestamp": 1447495522987,
    "values": {
        "processDefKey": ["OrderProcess:1:10"],
        "avgExecutionTime": 54892
    }
}, {
    "tableName": "PROCESS_USAGE_SUMMARY_DATA",
    "timestamp": 1447495523074,
    "values": {
        "processDefKey": ["LoanProcess:1:6"],
        "avgExecutionTime": 25149
    }
}, {
    "tableName": "PROCESS_USAGE_SUMMARY_DATA",
    "timestamp": 1447495523145,
    "values": {
        "processDefKey": ["SubProcess:1:18"],
        "avgExecutionTime": 54145
    }
}]


This example contains a set of five data items collected from the WSO2 DAS (Data Analytics Server) analytics REST API. To display the dataset in a horizontal bar chart, we need to connect each datum to a bar that represents it by its length. In D3, we can achieve this by applying the data() operator to the selection of bars. Here I display the average execution time of each process against its process id. Therefore, before applying any D3 functionality, I first have to bring the data into an appropriate format as follows. (The array called data holds the above dataset.)

// Keep only the fields needed by the chart
var dataset = [];
for (var i = 0; i < data.length; i++) {
    dataset.push({
        "processDefKey": data[i].values.processDefKey,
        "avgExecutionTime": data[i].values.avgExecutionTime
    });
}


Now you can see that the dataset variable holds the following JSON array.

[{
    "processDefKey": ["manualTaskProcess:1:14"],
    "avgExecutionTime": 31066
}, {
    "processDefKey": ["VacationRequest:1:22"],
    "avgExecutionTime": 16215.5
}, {
    "processDefKey": ["OrderProcess:1:10"],
    "avgExecutionTime": 54892
}, {
    "processDefKey": ["LoanProcess:1:6"],
    "avgExecutionTime": 25149
}, {
    "processDefKey": ["SubProcess:1:18"],
    "avgExecutionTime": 54145
}]
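As an alternative sketch, the same reshaping can be written with Array.prototype.map. The two records below are copied from the sample dataset above purely for illustration; in the real page the data array would hold all five records.

```javascript
// Two records in the same shape as the API response shown earlier
var data = [{
    "tableName": "PROCESS_USAGE_SUMMARY_DATA",
    "timestamp": 1447495522776,
    "values": {
        "processDefKey": ["manualTaskProcess:1:14"],
        "avgExecutionTime": 31066
    }
}, {
    "tableName": "PROCESS_USAGE_SUMMARY_DATA",
    "timestamp": 1447495522890,
    "values": {
        "processDefKey": ["VacationRequest:1:22"],
        "avgExecutionTime": 16215.5
    }
}];

// Build a new array that keeps only the two fields the chart needs
var dataset = data.map(function (record) {
    return {
        "processDefKey": record.values.processDefKey,
        "avgExecutionTime": record.values.avgExecutionTime
    };
});

console.log(JSON.stringify(dataset, null, 4));
```

Compared with the loop, map returns the new array directly, so there is no separate array to declare and push into.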
 
Now create a div element in the .html file like below to render the bar-chart.

<div class="main" style="width: 850px;height: 400px;border-style: solid">
    <h2 style="text-align: center;">Process Id VS Average execution time</h2>
</div>


Before we can add the x and y axes in D3, we need to clear some space in the margins. Margins in D3 are conventionally specified as an object with top, right, bottom and left properties (you can see it below). The outer size of the bar chart area, which includes the margins, is then used to compute the inner size available for the graphical region by subtracting the margins. For example, the values for a 700×400 chart are:

var margins = {top: 30, right: 100, bottom: 30, left: 100};
var width = 700 - margins.left - margins.right;
var height = 400 - margins.top - margins.bottom;
var barPadding = 5;


Here the barPadding variable keeps some space between adjacent rectangles in the bar chart. 700 and 400 are the outer width and height respectively, while the computed inner width and height are 500 and 340. These inner dimensions can be used to initialize scale ranges. To apply the margins to the SVG container, I set the width and height of the SVG element to the outer dimensions and add a group element (the SVG g tag) to offset the origin of the chart area by the top-left margin.
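The margin arithmetic can be checked with a small sketch. The innerSize helper is a hypothetical name introduced here for illustration; it assumes the usual convention that the left and right margins reduce the width while the top and bottom margins reduce the height.

```javascript
// Hypothetical helper: size of the drawing area that remains
// after the margins are subtracted from the outer size.
function innerSize(outerWidth, outerHeight, margins) {
    return {
        // Horizontal margins reduce the width
        width: outerWidth - margins.left - margins.right,
        // Vertical margins reduce the height
        height: outerHeight - margins.top - margins.bottom
    };
}

var size = innerSize(700, 400, { top: 30, right: 100, bottom: 30, left: 100 });
console.log(size); // { width: 500, height: 340 }
```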

var chart = d3.select('.main')
              .append('svg')
              .attr('width', width + margins.left + margins.right)
              .attr('height', height + margins.top + margins.bottom)
              .append('g')
              .attr('transform', 'translate(' + margins.left + ',' + margins.top + ')');


The next step is to add the x and y axes and label them for readability. Here, we define the x and y axes by binding them to the existing x-scale and y-scale and declaring one of the four orientations. Since the x-axis will appear below the bars, it uses the bottom orientation. The y-axis uses the left orientation.

In the domain function we use a helper called d3.max(), which looks at our dataset and figures out the largest value. d3.max() iterates over the entire dataset (an array) for us, using the accessor function to read the value from each item.

// Create a scale for the x-axis based on data
// Domain - from 0 to the largest value in the dataset
// Range - physical range of the scale in pixels


var xScale = d3.scale.linear()
               .domain([0, d3.max(dataset, function(d){
                    return d.avgExecutionTime;
                })]).range([0, width]);
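What d3.max() does with the accessor can be sketched in a few lines. The maxBy helper is a hypothetical name, not a D3 function; the numbers are the average execution times from the sample dataset.

```javascript
// Hypothetical helper: largest value produced by the accessor
// across the array. This sketch returns undefined for an empty array.
function maxBy(array, accessor) {
    var max;
    array.forEach(function (item) {
        var value = accessor(item);
        if (max === undefined || value > max) {
            max = value;
        }
    });
    return max;
}

var dataset = [
    { "avgExecutionTime": 31066 },
    { "avgExecutionTime": 16215.5 },
    { "avgExecutionTime": 54892 },
    { "avgExecutionTime": 25149 },
    { "avgExecutionTime": 54145 }
];

var longest = maxBy(dataset, function (d) {
    return d.avgExecutionTime;
});
console.log(longest); // 54892
```

With this dataset the x-scale domain therefore becomes [0, 54892].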

// Implements the scale as an actual axis
// Orient - places the axis on the bottom of the graph
// Ticks - approximate number of tick marks on the axis


var xAxis = d3.svg.axis()
              .scale(xScale)
              .orient('bottom')
              .ticks(10);


Here the ticks(n) method asks for about n tick marks on the axis. The count is a hint: D3 picks nicely rounded values within the domain, so the actual number of ticks may differ slightly from n.
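To see why the requested count is only a hint, here is a simplified sketch of one way to choose nicely rounded ticks. The niceTicks helper is a hypothetical name and is not claimed to reproduce D3's output in every case; it assumes max is greater than min and picks a step of 1, 2 or 5 times a power of ten.

```javascript
// Hypothetical helper: chooses a step of 1, 2 or 5 times a power of ten
// so that roughly `count` ticks fit between min and max.
function niceTicks(min, max, count) {
    var span = max - min;
    // Largest power of ten not exceeding the raw step
    var step = Math.pow(10, Math.floor(Math.log(span / count) / Math.LN10));
    // How far that power of ten undershoots the raw step
    var error = count / span * step;
    if (error <= 0.15) {
        step *= 10;
    } else if (error <= 0.35) {
        step *= 5;
    } else if (error <= 0.75) {
        step *= 2;
    }

    // Only multiples of the step that lie inside [min, max] become ticks
    var first = Math.ceil(min / step);
    var last = Math.floor(max / step);
    var ticks = [];
    for (var i = first; i <= last; i++) {
        ticks.push(i * step);
    }
    return ticks;
}

var ticks = niceTicks(0, 54892, 10);
console.log(ticks.length); // 11
console.log(ticks[1]);     // 5000
console.log(ticks[ticks.length - 1]); // 50000
```

Asking for 10 ticks over the domain [0, 54892] gives 11 ticks in this sketch, spaced 5000 apart, because round values are preferred over an exact count.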

// Creates a scale for the y-axis based on process definition keys

var yScale = d3.scale.ordinal()
               .domain(dataset.map(function(d){
                    return d.processDefKey;
               })).rangeRoundBands([height, 0]);

// Creates an axis based off the yScale properties


var yAxis = d3.svg.axis()
              .scale(yScale)
              .orient('left');



Now define a tooltip to show additional information (in this case, the average execution time) when the mouse pointer moves over a particular rectangle element.

//add tooltip

var tooltip = d3.select(".main").append("div").attr("class", "d3-tip");
tooltip.append('div').attr('class', 'label');
tooltip.append('div').attr('class', 'contentBox');


The next step is to map the data to the rectangles in the bar chart. For that we can use the selectAll() method, which selects all the existing rectangle elements on the SVG (in SVG, rectangles are defined with the rect element). At the beginning there are no rect elements in the chart; we only have the data array. When we invoke the enter() method, it gives us a virtual selection of placeholders for the data items that have no matching DOM element yet. Everything chained after enter() executes only for those items, that is, where there is a data element but no rect. (In other words, the data elements are entering the picture.)
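The idea behind enter() can be sketched without a DOM. The joinByIndex helper below is a hypothetical name, not D3's implementation; it assumes data is matched to elements by position and reports which data items still need an element (enter), which pairs already match (update) and which elements have no data left (exit).

```javascript
// Hypothetical helper: joins data to existing elements by index.
function joinByIndex(elements, data) {
    var shared = Math.min(elements.length, data.length);
    return {
        // Elements that already have a datum at the same index
        update: data.slice(0, shared).map(function (datum, i) {
            return { element: elements[i], datum: datum };
        }),
        // Data items with no element yet: these are "entering"
        enter: data.slice(shared),
        // Elements left over once the data runs out
        exit: elements.slice(shared)
    };
}

var existingRects = []; // the chart starts with no rect elements
var values = [31066, 16215.5, 54892, 25149, 54145];

var join = joinByIndex(existingRects, values);
console.log(join.enter.length);  // 5, one rect to append per datum
console.log(join.update.length); // 0
console.log(join.exit.length);   // 0
```

Because the chart starts empty, all five data items land in the enter group, which is why the append('rect') call that follows enter() creates exactly five rectangles.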


// Step 1: selectAll.data.enter.append
// Loops through the dataset and appends a rectangle for each value


chart.selectAll('rect')
     .data(dataset)
     .enter()
     .append('rect')

// Step 2: X & Y
// X - Every bar starts at the left edge of the chart area
// Y - Places the bar vertically based on the ordinal y-scale


     .attr('x', 0)
     .attr('y', function(d){
                    return yScale(d.processDefKey);
            })

// Step 3: Height & Width
// Height - Based on barPadding and the number of points in the dataset
// Width - Scaled using avgExecutionTime and the width of the chart area


     .attr('height', (height / dataset.length) - barPadding)
     .attr('width', function(d){
                    return xScale(d.avgExecutionTime);
                })
     .attr('fill', 'steelblue')

// Step 4: Info for hover interaction


     .attr('class', function(d){
                    return d.processDefKey;

                })
     .attr('id', function(d){
                    return d.avgExecutionTime;
                })
                .on("mouseover", function(d) {
                    var pos = d3.mouse(this);
                    console.log(pos);
                    tooltip.transition()
                           .duration(200)
                           .style("left", (d3.event.pageX) + "px")
                           .style("top", (d3.event.pageY - 30) + "px");
                    tooltip.select('.label').html('AVG Execution Time');
                    tooltip.select('.contentBox').html(d.avgExecutionTime + ' ms');
                    tooltip.style('display', 'block');
                })
                .on("mouseout", function() {
                    tooltip.style('display', 'none');
                });
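As a rough check of the geometry, the attributes of each bar can be computed by hand. This sketch makes several assumptions that are not in the code above: an inner chart area of 500 by 340 pixels, plain strings as keys, and a height that divides evenly among the bars, so the rounding offset that rangeRoundBands() may add is ignored.

```javascript
// Assumed inner chart size and padding
var width = 500, height = 340, barPadding = 5;

var dataset = [
    { "processDefKey": "manualTaskProcess:1:14", "avgExecutionTime": 31066 },
    { "processDefKey": "VacationRequest:1:22", "avgExecutionTime": 16215.5 },
    { "processDefKey": "OrderProcess:1:10", "avgExecutionTime": 54892 },
    { "processDefKey": "LoanProcess:1:6", "avgExecutionTime": 25149 },
    { "processDefKey": "SubProcess:1:18", "avgExecutionTime": 54145 }
];

var maxTime = Math.max.apply(null, dataset.map(function (d) {
    return d.avgExecutionTime;
}));
var bandHeight = Math.floor(height / dataset.length);

var bars = dataset.map(function (d, i) {
    return {
        key: d.processDefKey,
        x: 0,
        // The y range runs from height down to 0,
        // so the first item sits in the lowest band.
        y: height - (i + 1) * bandHeight,
        // Linear x-scale with domain [0, maxTime] and range [0, width]
        width: d.avgExecutionTime / maxTime * width,
        height: height / dataset.length - barPadding
    };
});

console.log(bars[0].y);      // 272
console.log(bars[2].width);  // 500, the longest bar spans the full width
console.log(bars[0].height); // 63
```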


As the final step, we render the x-axis and the y-axis once the chart is finished. To avoid overlapping the rectangles, the y-axis is moved 10 pixels to the left. The x and y axis titles are added as shown below.

// Renders the yAxis once the chart is finished
// Moves it to the left 10 pixels so it doesn't overlap


chart.append('g')
     .attr('class', 'axis')
     .attr('transform', 'translate(-10, 0)')
     .call(yAxis);

// Appends the xAxis


chart.append('g')
     .attr('class', 'axis')
     .attr('transform', 'translate(0,' + (height + 10) + ')')
     .call(xAxis);

// Adds xAxis title


chart.append('text')
     .text('AVG Execution Time (ms)')
     .attr('transform', 'translate('+(width/2 - 50)+', ' + (height + 50) + ')');

// Add yAxis title


chart.append('text')
     .text('Process Definition Key')
     .attr('transform', 'translate(-70, -20)');


The CSS below contains the styles that apply to the tooltip, the SVG, the x and y axes and the div element.

.main {
       margin: 0px 25px;
}

svg {
    padding: 20px 40px;
}

.axis path,
.axis line {
      fill: none;
      stroke: black;
      shape-rendering: crispEdges;
}

text,
.axis text {
      font-size: 11px;
}

rect:hover {
      fill: orange;
}

.d3-tip {
        background: #eee;
        border-radius: 10px;
        box-shadow: 0 0 5px #999999;
        color: #333;
        display: none;
        font-size: 11px;
        left: 130px;
        padding: 12px;
        position: absolute;
        text-align: center;
        top: 95px;
        height: 20px;
        width: 100px;
        z-index: 10;
}


The resulting bar chart has five bars, one for each of the five items in our dataset.



References

[1] https://medium.com/@c_behrens/enter-update-exit-6cafc6014c36#.ppi08m9d9
[2] http://www.jeromecukier.net/blog/2011/08/11/d3-scales-and-color/
[3] http://bost.ocks.org/mike/bar/3/