Large Scale Machine Learning and Other Animals: Literature survey of graph databases

Wednesday, February 20, 2013

Literature survey of graph databases

I stumbled upon this tech report: Literature survey of graph database, by Bryan Thompson, Systap. It is a good survey of different graph database platforms, but especially I liked its extensive coverage of the GraphChi framework:

GraphChi (IO Efficient Graph Mining) GraphChi (Kyrola, 2012) is an IO efficient graph mining system that is also designed to accept topology updates based on a Parallel Sliding Window (PSW) algorithm. Each iteration over the graph requires P^2 sequential reads and P^2 sequential writes. Because all IO is sequential, GraphChi may be used with traditional disk or SSD. The system is not designed to answer ad-hoc queries and is not a database in any traditional sense – the isolation semantics of GraphChi are entirely related to the Asynchronous Parallel (ASP) versus Bulk Synchronous Parallel (BSP) processing paradigms. GraphChi does not support either vertex attributes or link attributes. The basic data layout for GraphChi is a storage model that is key-range partitioned by the link target (O) and then stores the links in sorted order (SO). This design was chosen to permit IO efficient vertex programs where the graph was larger than main memory.

While GraphChi does not support cluster-based process, the approach could be extended to a compute cluster. Because of the IO efficient design, the approach is of interest for out-of-core processing in hybrid CPU/GPU architectures. GraphChi operates by applying a utility program to split a data set into P partitions, where the user chooses the value of P with the intention that a single partition will fit entirely into main memory. The edges are assigned to partitions in order to create partitions with an equal #of edges – this provides load balancing and compensates for data skew in the graph (high cardinality vertices). GraphChi reads one partition of P (called the memory partition) into core. This provides the in-edges for all vertices in that partition. Because the partitions are divided into target vertex key-ranges, and because partitions are internally ordered by the source vertex, out-edges for those vertices are guaranteed to lie in a contiguous range of the remaining P-1 partitions. Those key-ranges may vary in their size since the #of out-edges for a vertex is not a constant. Thus, for the current memory partition, GraphChi performs 1 full partition scan plus P-1 partial partition scans.

In addition to the edges (network structure), GraphChi maintains the transient graph computation state for each edge and vertex. The edge and vertex each have a user assignable label consisting of some (user-defined) fixed-length data type. The vertices also have a flag indicating whether or not they are scheduled in a given iteration. The edge state is presumably read with the out-edges, though perhaps from a separate file (the paper does not specify this). The vertex state is presumably read from a separate file (again, this is not specified). After the update() function has been applied to each vertex in the current memory partition, the transient graph state is written back out to the disk. This provides one more dimension of graph computation state that is persisted on disk, presumably in a column paired with the vertex state.

If you like to read the rest of the overview, and also some proposed extensions, you should read the full paper. And of course, you can read about the collaborative filtering toolkit I am writing on top of GraphChi here.

An update: I just got from Bryan Thompson a note about additional useful resources Systap released:
- Large Scale Graph Algorithms on the GPU (Yangzihao Wang and John Owens, UC Davis)
- Graph Pattern Matching, Search, and OLAP (Dr. Xifeng Yan, UCSB)

Large Scale Machine Learning and Other Animals

Wednesday, February 20, 2013

Literature survey of graph databases

No comments:

Post a Comment

Labels

GraphLab Users Google Group

pagerank

google analytics

syntax