10.7916/D87W6M2S
Optimal Splitters for Database Partitioning with Size Bounds
Columbia University
2008
2017-06-16
2018-08-30
Ross, Kenneth A.
Cieslewicz, John
Computer science
Columbia University. Computer Science
Reports
Partitioning is an important step in several database algorithms, including sorting, aggregation, and joins. Partitioning is also fundamental for dividing work into equal-sized (or balanced) parallel subtasks. In this paper, we aim to find, materialize and maintain a set of partitioning elements (splitters) for a data set. Unlike traditional partitioning elements, our splitters define both inequality and equality partitions, which allows us to bound the size of the inequality partitions. We provide an algorithm for determining an optimal set of splitters from a sorted data set and show that it has time complexity O(k lg_2 N), where k is the number of splitters requested and N is the size of the data set. We show how the algorithm can be extended to pairs of tables, so that joins can be partitioned into work units that have balanced cost. We demonstrate experimentally (a) that finding the optimal set of splitters can be done efficiently, and (b) that using the precomputed splitters can improve the time to sort a data set by up to 76%, with particular benefits in the presence of a few heavy hitters.