Grant support

Supported by VINCES Consulting under the project VINCESAI-ARGOS, and by BBforTAI (PID2021-127641OB-I00 MICINN/FEDER). The work of A. Pena is supported by an FPU Fellowship (FPU21/00535) from the Spanish MIU. A. Morales is supported by the Madrid Government (Comunidad de Madrid-Spain) under the Multiannual Agreement with Universidad Autonoma de Madrid in the line of Excellence for the University Teaching Staff in the context of the V PRICIT (Regional Programme of Research and Technological Innovation). VINCES had an active role in the development of the work, through the guidance of the authors belonging to the corporation. The remaining funding sources had no role in or influence on the development of this work.

Analysis of institutional authors

Pena, Alejandro - Corresponding Author
Morales, Aythami - Author
Fierrez, Julian - Author
Ortega-Garcia, Javier - Author

June 2, 2024
Continuous document layout analysis: Human-in-the-loop AI-based data curation, database, and evaluation in the domain of public affairs

Published in: Information Fusion, vol. 108, article 102398, 2024-08-01. DOI: 10.1016/j.inffus.2024.102398

Authors: Pena, Alejandro; Morales, Aythami; Fierrez, Julian; Ortega-Garcia, Javier; Puente, Inigo; Cordova, Jorge; Cordova, Gonzalo

Affiliations

Univ Autonoma Madrid, BiDA Lab, Madrid 28049, Spain - Author
VINCES Consulting, Madrid 28049, Spain - Author

Abstract

In the digital era, the number of digital documents generated each day has been increasing exponentially over the years, to the point where it is unfeasible to process them manually. Thus, there has been growing interest from different sectors in developing tools to process digital documents automatically. Although useful, this task is challenging, due to both the large variability and the multimodal nature inherent to the problem. In most cases, a text-only approach falls short in comprehending the information conveyed by diverse components of varying significance. In this regard, Document Layout Analysis (DLA) has been an interesting research field for many years, whose objective is to detect and classify the basic components of a document. It is therefore a useful task for obtaining a first understanding of how the different components of a document interact with each other. In this work, we used a semi-automatic procedure to annotate digital documents with different layout labels, including 4 basic layout blocks and 4 text categories. We apply this procedure to collect a novel database for DLA in the public affairs domain, the PALdb database, using a set of 24 data sources from the Spanish Administration. The database comprises 37.9K documents with more than 441K document pages, and more than 8M labels associated to 8 layout block units. The results of our experiments validate the proposed text labeling procedure with accuracy up to 99%. We also present a novel application of Quickest Change Detection (QCD) techniques to the DLA domain, which we use to continuously detect changes in the layout of documents from multiple sources.
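
As a rough illustration of the Quickest Change Detection idea mentioned in the abstract, the following minimal sketch implements a generic CUSUM detector, one classical QCD procedure. It is not taken from the paper: the per-document score stream, the Gaussian model with known pre-change and post-change means, and all names (`cusum_change_point`, `scores`, `threshold`) are assumptions made only for this example.

```python
from typing import Optional, Sequence


def cusum_change_point(
    scores: Sequence[float],
    pre_mean: float,
    post_mean: float,
    sigma: float,
    threshold: float,
) -> Optional[int]:
    """Return the index of the first sample at which the CUSUM statistic
    exceeds `threshold`, or None if no change is declared.

    Assumption (hypothetical, for illustration): each score is Gaussian
    with standard deviation `sigma`, with mean `pre_mean` before the
    change and `post_mean` after it.
    """
    stat = 0.0
    for i, x in enumerate(scores):
        # Log-likelihood ratio of "post-change" vs "pre-change" for one sample.
        llr = (post_mean - pre_mean) / sigma**2 * (x - (pre_mean + post_mean) / 2.0)
        # CUSUM recursion: accumulate evidence, resetting at zero.
        stat = max(0.0, stat + llr)
        if stat > threshold:
            return i
    return None
```

In a setting like the one described, `scores` could hypothetically be a per-document measure of how much a newly collected document's layout deviates from a source's usual layout. With `scores = [0.0] * 10 + [1.0] * 10`, `pre_mean=0.0`, `post_mean=1.0`, `sigma=0.5` and `threshold=5.0`, the detector declares a change at index 12, two samples after the true change at index 10.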

Keywords

Document layout analysis; Document understanding; Human-in-the-loop; Legal domain; Natural language processing; QCD-based detection

Quality index

Bibliometric impact. Analysis of the contribution and dissemination channel

The work was published in the journal Information Fusion, which, given its trajectory and the impact it has achieved in recent years according to the agency WoS (JCR), has become a reference in its field. For 2024, the year of publication, no indicators have been calculated yet, but in 2023 the journal ranked 4/197, placing it in Q1 (first quartile) of the category Computer Science, Artificial Intelligence. Notably, the journal is positioned above the 90th percentile.

Regardless of the expected impact determined by the dissemination channel, it is important to highlight the actual observed impact of the contribution itself.

According to the different indexing agencies, the number of citations accumulated by this publication as of 2025-07-18 is:

  • Google Scholar: 1
  • Scopus: 2

Impact and social visibility

From the perspective of influence or social adoption, and based on metrics associated with mentions and interactions provided by agencies specializing in calculating the so-called "Alternative or Social Metrics," we can highlight as of 2025-07-18:

  • The use of this contribution in bookmarks, code forks, additions to favorite lists for recurrent reading, and general views suggests that readers are using the publication as a basis for their current work. This may be an early indicator of future, more formal academic citations. This claim is supported by the "Capture" indicator, which yields a total of 24 (PlumX).

It is essential to present evidence supporting full alignment with institutional principles and guidelines on Open Science and the Conservation and Dissemination of Intellectual Heritage. A clear example of this is:

  • The work has been submitted to a journal whose editorial policy allows Open Access publication.
  • Assignment of a Handle/URN as an identifier within the deposit in the Institutional Repository: https://repositorio.uam.es/handle/10486/712089

Leadership analysis of institutional authors

There is a significant leadership presence, as some of the institution's authors appear as the first or last signer, detailed as follows: First Author (PEÑA ALMANSA, ALEJANDRO).

The author responsible for correspondence tasks has been PEÑA ALMANSA, ALEJANDRO.