Review note for Pregel: A System for Large-Scale Graph Processing
1 - Summary Large graphs have been under analysing for years due to their ubiquity and commercial values, while the existing approaches have many limitations in terms of locality, efficiency, flexibility, etc. Google introduced a vertex-centric computational model framework in this paper that is suitable for large-scale graphs processing on clusters of numerous commodity computers in a manner that developers can easily program with an abstract API without concerning distribution-related details behind it. The paper describes Pregel, the large-scale graph processing model, and associated C++ API, discusses its implementation issues, applications to some algorithms, performances results, and also points out the future directions.
2 - Problem More large-scale graphs are being produced and processed for decades. Custom distributed platform demands considerable efforts to implement and not flexible enough to fit new algorithm or graph representation. Frequently used distributed computing infrastructures are often not suitable for graph processing and can lead to suboptimal performance and usability issues. Single-computer graph algorithm libraries compromise the scalability of problems. There are indeed some existing parallel graph process approaches, but they do not address fault tolerance or other critical issues for distributed systems. None of the options ideally fit the comprehensive purposes of large-scale graph processing, being flexible, scalable, fault-tolerant and efficient.
3 - Solution To address the problems mentioned above, Google presents a vertex-centric system called Pregel. The computational model takes an initialised graph with unique vertex identifiers as input, and a sequence of iterations (a.k.a. supersteps) will then be carried out, in each of which the user-defined functions are invoked in parallel onto vertices, expressing the given algorithm. The vertex-centric philosophy makes sure that the mechanisms for detecting execution order within each iteration are hidden and communications are simply presented as from one iteration to the next. In one iteration, a vertex can receive messages sent from the previous iteration, propagate messages to other vertices along outgoing edges that will be received at the subsequent iteration, modify the states of its own and outgoing edges and mutate graph topology. The supersteps are organised by global synchronization points, and the synchronicity guarantees that the programs intrinsically avoid deadlocks and data races hence have competitive performances compared to asynchronous systems. Many real-world applications, for example, Page Rank, Shortest Paths, etc., have been deployed and more are being devised under Pregel.
4 - Novelty The paper proposed a new solution for large-scale graph processing. Compared to Sawzall, Pig Latin, and Dryad, it hides distribution details and provides a natural API. Compared to MapReduce and other dataflow-based model, the stateful-vertex-focused philosophy make the model more efficient for iterative computation. The idea of synchronous superstep model is actually from Bulk Synchronous Parallel (BSP) models. There are also many other models using BSP implementations, but they are not graph-specific, and their scalability and fault tolerance have been assessed on large clusters. Compared to similar model like Parallel Boost Graph Library, Pregel are more fault tolerant due to explicit messaging mechanism.
5 - Evaluation Pregel are implemented on a cluster of 300 commodity machines and evaluated by a range of experiments on SSSP implementation in terms of runtime. The results show that Pregel is efficient at processing very large graphs with billions of vertices and hundreds of billions of edges on huge cluster. The evaluation is not quite telling or comprehensive, because of some flaws stated below - Checkpointing was disabled in the experiments, and the fault tolerance had not been verified. - Lack of comparisons to other distributed graph processing models, similar or different. - Only the application to SSSP was evaluated, lack of evaluations based on many common graph algorithms.
6 - Opinion A cliché about centralized system is that it can be tricky if the master failed, and this has not been discussed in the paper. And as they concluded, there are still some aspects to be improved. For example, the barrier synchronization can be inefficient in cases that some faster workers need to wait for the slower; there is no particular partitioning mechanisms or models coping with a variety of graph topologies in order to balance the loads among workers. Overall, Pregel is a significant and influential framework in the field of graph processing and also a great infrastructure of distributed computing.
- master-slave architechture
- vertex-focused; stateful (active/inactive)
- explicit messaging mechanism, through directed edges
- global synchronization points for each superstep (iteration); highly straggler-sensitive
- checkpointing before a superstep; kind of like backup
- master regularly send ping message to workers, if not hearing back, mark failure.
- re-assign graph partitions to other workers, recovery from the most recent checkpoint.
- Drawback: too frequent checkpoints might cost more than the expected recovery.