UC Berkeley EECS Technical Reports


Fully Distributed EM for Very Large Datasets

Authors:
Wolfe, Jason
Haghighi, Aria Delier
Klein, Daniel
Technical Report Identifier: EECS-2007-178
December 22, 2007

Abstract: In EM and related algorithms, the E-step computations distribute easily, because data items are independent given the parameters. For very large datasets, however, even storing all of the parameters on a single node for the M-step can be prohibitive. We present a framework which exploits parameter sparsity to fully distribute the entire EM procedure. Each node interacts with only the subset of parameters relevant to its data, sending messages to other nodes along a junction-tree topology. We demonstrate the effectiveness of our framework over a MapReduce approach on two tasks: word alignment for machine translation, and LDA for topic modeling.
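The core idea in the abstract can be sketched in a few lines of Python. In this toy illustration (not the paper's implementation), each node runs the E-step over only its own data shard and maintains only the parameters relevant to that shard; the cross-node aggregation of expected counts is simulated with a shared dictionary, standing in for the junction-tree message passing the paper actually uses. The IBM Model 1 word-alignment objective is an assumed stand-in for the word-alignment task mentioned.

```python
# Toy sketch: sparsity-exploiting distributed EM (Model 1 style).
# Each node touches only the parameters relevant to its own shard;
# the centralized count aggregation below stands in for the paper's
# junction-tree message passing.
from collections import defaultdict

def relevant_params(shard):
    """Translation parameters (f, e) occurring in this shard's sentence pairs."""
    return {(f, e) for fs, es in shard for f in fs for e in es}

def e_step(shard, t):
    """Expected alignment counts for one shard under Model 1 posteriors."""
    counts = defaultdict(float)
    for fs, es in shard:
        for f in fs:
            z = sum(t[(f, e)] for e in es)   # normalizer over this sentence pair
            for e in es:
                counts[(f, e)] += t[(f, e)] / z
    return counts

def distributed_em(shards, iters=10):
    params = [relevant_params(s) for s in shards]   # sparse per-node subsets
    t = [{p: 1.0 for p in ps} for ps in params]     # uniform (unnormalized) init
    for _ in range(iters):
        local = [e_step(s, ti) for s, ti in zip(shards, t)]
        # Aggregate counts for parameters shared across nodes -- in the
        # actual framework these sums arrive as junction-tree messages.
        total, norm = defaultdict(float), defaultdict(float)
        for c in local:
            for (f, e), v in c.items():
                total[(f, e)] += v
                norm[f] += v
        # M-step: each node renormalizes only its own relevant parameters.
        t = [{(f, e): total[(f, e)] / norm[f] for (f, e) in ps}
             for ps in params]
    return t
```

Note that no node ever stores the full parameter vector, which is the point of the framework: memory per node scales with the parameters its shard touches, not with the global model size.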