CSJM v.28, n.3 (84), 2020

Semi-automated workflow for recognition of printed documents with heterogeneous content

Authors: Colesnicov Alexandru, Malahov Ludmila, Cojocaru Svetlana, Burţeva Liudmila
Keywords: platform for heterogeneous document recognition, page layout analysis, non-textual content recognition


The paper discusses problems in digitizing heterogeneous texts. Archives of scanned printed documents are growing dramatically as a result of cultural heritage preservation projects. Manual annotation of scanned document images and page-by-page on-screen reading make these archives difficult and sometimes impossible to use. Existing document processing systems cannot automatically render such documents correctly because of their heterogeneous content. We propose a Web platform that maximizes support for semi-automated operation of all the tools used to recognize heterogeneous documents. Maximizing support means both providing convenient "single window" access to all tools and reducing the manual part of the process as much as possible. The implementation uses convergent technology, which assembles complex software systems from ready-made heterogeneous modules on a single platform.

"Vladimir Andrunachievici" Institute of Mathematics and Computer Science
5 Academiei str., MD-2028, Chisinau
Republic of Moldova


