UC BERKELEY
EECS technical reports


Supervised Text Region Identification on Historical Documents

Authors:
Eng, Jonathan
Technical Report Identifier: EECS-2015-254
December 18, 2015
EECS-2015-254.pdf

Abstract: We present multi-column text region identification support for Ocular, the unsupervised historical printed document transcription project of Berg-Kirkpatrick et al. (2013). We use structured prediction with rich features defined on the input document and incorporate a transition model based on prior document layout assumptions. Our model is trained using a structured-SVM objective on a randomly selected set of historical documents from The Proceedings of the Old Bailey corpus. For learning, we use loss-augmented Viterbi decoding with a weighted Hamming loss function. We present our suite of features that achieve a 37.4 F1 text score and 39.4 F1 non-text improvement in text region identification over the Ocular baseline text cropper.
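
As an illustration of the learning step named in the abstract, the following is a minimal sketch of loss-augmented Viterbi decoding with a weighted Hamming loss over a label sequence. It is not the report's implementation: the two-label setup (0 = non-text, 1 = text), the function names `hamming_loss` and `viterbi`, the toy scores, and the choice to index the loss weights by the gold label are all assumptions made for the example. In a structured-SVM setting, the per-position scores would come from a weight vector applied to document features rather than from fixed numbers.

```python
from typing import List, Optional, Sequence


def hamming_loss(gold: Sequence[int], pred: Sequence[int],
                 weights: Sequence[float]) -> float:
    """Weighted Hamming loss: add weights[g] for each position where pred != gold.

    Indexing the weight by the gold label is an assumption of this sketch.
    """
    return sum(weights[g] for g, p in zip(gold, pred) if g != p)


def viterbi(unary: Sequence[Sequence[float]],
            trans: Sequence[Sequence[float]],
            gold: Optional[Sequence[int]] = None,
            weights: Optional[Sequence[float]] = None) -> List[int]:
    """Return the highest-scoring label sequence.

    unary[t][y] is the score of label y at position t;
    trans[a][b] is the score of moving from label a to label b.
    If gold and weights are given, every label that disagrees with the gold
    label at position t receives a bonus of weights[gold[t]]. Because the
    Hamming loss decomposes over positions, this makes the decoder return the
    sequence maximizing model score plus loss (loss-augmented decoding).
    """
    n, k = len(unary), len(trans)

    def node(t: int, y: int) -> float:
        score = unary[t][y]
        if gold is not None and weights is not None and y != gold[t]:
            score += weights[gold[t]]
        return score

    best = [node(0, y) for y in range(k)]
    backpointers: List[List[int]] = []
    for t in range(1, n):
        prev, best, ptr = best, [], []
        for y in range(k):
            # Best previous label for arriving at label y.
            a = max(range(k), key=lambda a: prev[a] + trans[a][y])
            best.append(prev[a] + trans[a][y] + node(t, y))
            ptr.append(a)
        backpointers.append(ptr)

    # Follow backpointers from the best final label.
    y = max(range(k), key=lambda y: best[y])
    path = [y]
    for ptr in reversed(backpointers):
        y = ptr[y]
        path.append(y)
    return path[::-1]
```

With no gold sequence, `viterbi` performs ordinary decoding; with one, it returns the most violating sequence under the current scores, which is the quantity a structured-SVM update compares against the gold labeling.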