Challenges of large data sets

How do you make a project management application that handles thousands of work-streams like projects, tasks and actions? Many have failed to create a solution that handles that amount of data, and many managers and users have been disappointed by yet another task management app that simply collapsed in front of their eyes under heavy load. Thousands of work-streams is not that much, after all: assemble a small team of 10 software developers plus a few operations people, and you're in the territory of thousands of tasks on the agenda within a year. A much smaller team could reach that number easily within two years of adoption.

Software architecture is largely about building a house of cards that doesn't collapse, and many companies have started investing in in-house work/task management solutions. So what do you do if you want a project management solution that won't collapse after a year or two of regular use? There are three significant challenges: the user experience of managing large workloads, the front end of the solution, and the back end. The user experience is a topic for another post (watch this space); here I'm going to focus on the two remaining, more technical aspects.

In our case we hit all of those issues during the implementation of TaskBeat (see: http://taskbeat.com/). TaskBeat is a project and programme management application whose back end was designed specifically to implement the concepts of TaskBeat's methodology (see: http://taskbeat.com/documents/metodologia.pdf, in Polish). The methodology could be implemented in several ways, but in our case it maps onto a tree structure in the back end of the application. We've tried many strategies for representing that tree in relational and non-relational databases. We first went for a professional relational database* because we wanted a real-time system with no cache or distribution delays.

Relational databases provide the transactional, real-time experience of managing large projects, and the real-time bit was very important in our specific implementation. We didn't want any delays in data refreshes, because instant updates are the nature of what a collaborative platform should be. With a properly engineered solution you get every update immediately. In fact, in TaskBeat you can sit back and enjoy watching work being done in front of your eyes on the NewsFeed screen. I do it rarely, as it's too much information, but every time I do I know the data is not cached and is coming live straight off the back-end database. So relational databases can be very responsive; unfortunately, they're not well suited to accommodating tree-like structures.

We've experimented with different approaches to tackle this issue: most notably the adjacency list (the familiar NodeId/ParentId pair) and the concept known as nested sets. We eventually arrived at yet another way of implementing this in our project, but in any implementation of this kind you should start with those two approaches, thinking through the positive and negative consequences of each one depending on the expected (or actual) usage patterns of your own target audience. If you want the technical bits and some very interesting points, you can read much more on this topic in "Trees and Hierarchies in SQL" by my favourite author, Joe Celko (http://www.amazon.com/Hierarchies-Smarties-Kaufmann-Management-Systems/dp/1558609202).
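To make the trade-off concrete, here's a minimal TypeScript sketch of the two representations (the type and field names are mine for illustration, not TaskBeat's schema). An adjacency list needs a recursive walk to collect a subtree, while nested sets answer the same question with a single range comparison over precomputed left/right bounds:

```typescript
// Adjacency list: each node stores only its parent's id.
interface AdjNode { id: number; parentId: number | null; }

// Collecting a subtree requires a recursive (or iterative) walk —
// in SQL terms, a recursive CTE or repeated self-joins.
function descendantsAdjacency(nodes: AdjNode[], rootId: number): number[] {
  const children = nodes.filter(n => n.parentId === rootId).map(n => n.id);
  return children.concat(...children.map(c => descendantsAdjacency(nodes, c)));
}

// Nested sets: each node stores left/right bounds precomputed by a
// depth-first traversal; a node's subtree is everything strictly inside them.
interface NestedNode { id: number; lft: number; rgt: number; }

function descendantsNested(nodes: NestedNode[], root: NestedNode): number[] {
  return nodes
    .filter(n => n.lft > root.lft && n.rgt < root.rgt)
    .map(n => n.id);
}

// The same three-node chain (1 -> 2 -> 3) in both representations:
const adj: AdjNode[] = [
  { id: 1, parentId: null },
  { id: 2, parentId: 1 },
  { id: 3, parentId: 2 },
];
const nested: NestedNode[] = [
  { id: 1, lft: 1, rgt: 6 },
  { id: 2, lft: 2, rgt: 5 },
  { id: 3, lft: 3, rgt: 4 },
];
const fromAdj = descendantsAdjacency(adj, 1);            // [2, 3]
const fromNested = descendantsNested(nested, nested[0]); // [2, 3]
```

The cost profiles are mirror images: nested sets make subtree reads a cheap range scan but force renumbering on insert, while the adjacency list makes writes trivial and reads recursive. Which one wins depends entirely on your audience's read/write mix, which is exactly the analysis Celko's book walks through.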

In our own case we've been able to combine different approaches and load-balance some of the hard work between the database server and the application server. We've gone for a wildly powerful solution that compiles most of the database scripts to stored procedures and the server scripts to native code (yes, it eventually sits compiled as Intel x64 machine code!). That meant we could shift some of the database queries into the application tier and process them as object queries (yay! see: http://en.wikipedia.org/wiki/LINQ) using blazing-fast native code on super-cheap "wintel" commodity machines, instantly decreasing the cloud cost from a few thousand to less than one hundred dollars a month.
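The object-query idea is easiest to see in miniature. TaskBeat uses LINQ on .NET; the sketch below shows the same shape in TypeScript purely for illustration, with a hypothetical task record whose field names are mine. Instead of asking the database to aggregate, the already-loaded collection is filtered and summed in memory on the application server:

```typescript
// A hypothetical task record — illustrative fields, not TaskBeat's schema.
interface Task { id: number; projectId: number; done: boolean; estimate: number; }

const tasks: Task[] = [
  { id: 1, projectId: 10, done: true,  estimate: 3 },
  { id: 2, projectId: 10, done: false, estimate: 5 },
  { id: 3, projectId: 20, done: false, estimate: 8 },
];

// Instead of a round-trip asking the database "sum the open estimates
// for project 10", the pre-loaded collection is queried LINQ-style:
const openEstimate = tasks
  .filter(t => t.projectId === 10 && !t.done)  // WHERE
  .map(t => t.estimate)                        // SELECT
  .reduce((sum, e) => sum + e, 0);             // SUM
// openEstimate === 5
```

The database is still the source of truth; the point is that cheap CPU on commodity application servers can absorb query work that would otherwise consume the (much more expensive) database tier.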

Another challenge is the front end. You really have to be good with your HTML/CSS, but even more importantly, if you're displaying a lot of data on the browser's canvas you need to know the limits of your browser, as you're dealing with loads of elements. This is something we haven't fully optimised yet. One thing a good architect should take note of is how much of the data actually needs rendering on the page, and when. The "when" is the killer question, as many solutions try to display all the data the moment the list is first shown. You should consider two alternative approaches: first, building parts of the HTML in a detached DOM and attaching the data to the view only when it's needed; and second, incremental loading of data in response to scrolling.
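The "render only when needed" idea boils down to a window calculation. Here's a minimal TypeScript sketch (all names are mine, and it assumes fixed-height rows): given the scroll offset, work out which slice of the full data set actually needs DOM elements; everything outside that window stays detached or is never created at all.

```typescript
// Compute the slice of rows that should be attached to the DOM for a
// fixed-row-height list, given the current scroll position.
function visibleRange(
  totalRows: number,
  rowHeight: number,      // px per row
  viewportHeight: number, // px visible in the scroll container
  scrollTop: number,      // current scroll offset in px
  overscan = 5            // extra rows above/below to avoid flicker
): { start: number; end: number } {
  const first = Math.floor(scrollTop / rowHeight);
  const visible = Math.ceil(viewportHeight / rowHeight);
  return {
    start: Math.max(0, first - overscan),
    end: Math.min(totalRows, first + visible + overscan), // exclusive
  };
}

// 10,000 rows, 30px each, 600px viewport, scrolled down 3,000px:
const range = visibleRange(10000, 30, 600, 3000);
// range.start === 95, range.end === 125 — roughly 30 live DOM nodes
// instead of 10,000, recomputed on every scroll event.
```

In practice you'd wire this to a scroll handler and absolutely position the rendered rows inside a tall spacer element, but the core saving, a constant number of DOM nodes regardless of data size, comes entirely from this calculation.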

To conclude with a final and most important note: watch the real-life scenarios and the actual usage of your solution as it grows. Don't get tangled up in technical details, and never, ever optimise too early by trying to build something perfect on the first attempt. The ability to see the internals of your app's usage may be the most important feature to implement first, as it pays back over time, every time. We're yet to implement everything we want on the front end, as there are clearly some optimisations still to be made, but we deliberately haven't made them earlier because we didn't have the real-world usage data then that we have now.

In our case the choice of an existing methodology dictated the implementation, and we've already benefited from that choice multiple times over the course of the project's lifetime. Because we experimented a lot and didn't optimise too heavily on the front end, we've been able to deliver a much better solution, one that scales with ease to tens of thousands of work-streams under management.

* TaskBeat runs on top of Microsoft SQL Server. It's worth noting that a NoSQL database has also been implemented in the solution, although it covers aspects of the functionality not discussed in this article.