The goal of this project is to design and implement reconfigurable distributed systems to cope with their inherent dynamism and elasticity requirements at large-scale.
Today's datacenters must provide elasticity (the ability for a distributed machines to expand and shrink its resources according to the user demand) and fault-tolerance (the ability to hide the software or hardware failure of a node by replacing it with another). A common solution to these two problems is called reconfiguration, the action of replacing the configuration currently used by clients of the distributed system by a new configuration of more/less nodes that are all non-faulty. The crux is to make sure that the service provided by the distributed system to the clients is not disrupted during a reconfiguration.
The goal of this research project in collaboration with NICTA is to implement a reconfigurable distributed system that will be used at the core of a datacenter. It will exploit reconfigurable quorum systems, where quorums (mutually intersecting sets of servers) are replaced by new ones excluding servers that are unreliable or under-utilized. A consensus protocol should be designed and implemented to totally order the set of consecutive configuration installed in the system. A well-known algorithm used by Microsoft and Google is called Paxos and relies on a leader election that terminates when messages get delivered. While the use of such a leader is appealing to reconfigure in a simple way, it unfortunately limits availability as noted in the open source Yahoo! Zookeeper system. By contrast, reconfiguration requests should be directly routed towards a distributed set of quorum nodes to remedy this issue. Zookeeper could be used as a representative testbed of applications running on Twitter, Facebook, Netflix and Linked In's datacenters to validate the solution.
The opportunity ID for this research opportunity is 1791