offers some discussion on the benefits of ImageMatch for image cataloguing in Rare Books collections.
Evaluating image matching methods for book cover identification, 2019
Humans can identify a book just by looking at its cover, but how can computers do the same? In this paper, we explore different feature detectors and matching methods for book cover identification, and compare their performance in terms of both speed and accuracy. This will allow, for example, libraries to develop interactive services based on pictures of book covers. Only a single image of each book cover needs to be available in a database. Tests were performed by taking into account different transformations of each book cover image. Encouraging results have been achieved.
Second International Conference on Document Image Analysis for Libraries (DIAL'06), 2006
This paper lies in the field of the valorization of ancient heritage books: it concerns the development of suitable assistance tools to help humanists and historians retrieve information from large corpora of digitized documents. The paper presents one part of this ambitious project, a pixel classification method for ancient typewritten documents. The approach relies on the construction and analysis of multiresolution maps. At each of 5 resolutions we construct 5 different characterisation maps. All the maps are based on texture information (correlation of pixel orientations, grey-level pixel density…). After the merging of these 25 maps, each pixel of the original image is described by a vector, which allows a hierarchical classification to be computed. To avoid issues linked to the binarization process, all our maps are computed on grey-level images. The system has been tested on a CESR database of ancient printed books of the Renaissance. The classification results are evaluated through several visual classification illustrations.
Lecture Notes in Computer Science, 2000
Antique printed books constitute a heritage that should be preserved and used. With novel digitising techniques it is now possible to have these books stored in digital format and accessible to a wider public. However, the problem of how to use them remains. DEBORA (Digital accEss to BOoks of the RenAissance) is a European project that aims to develop a system to interact with these books through worldwide networks. The main issue is to build a database accessible through client computers. This will require building accompanying metadata that characterise different components of the books, such as illuminated letters, banners, figures and key words, in order to simplify and speed up remote access. To solve these problems, digital image analysis algorithms for filtering, segmentation, separation of text from non-text, line and word segmentation, and word recognition were developed. Some novel ideas are presented and illustrated through examples.
2017
While historically digital heritage libraries were first powered in image mode, they quickly took advantage of OCR technology to index printed collections and consequently improve the scope and performance of the information retrieval services offered to users. But the access to iconographic resources has not progressed in the same way, and the latter remain in the shadows: manual, incomplete and heterogeneous indexing, data silos by iconographic genre. Today, however, it would be possible to make better use of these resources, especially by exploiting the enormous volumes of OCR produced during the last two decades, and thus valorize these engravings, drawings, photographs, maps, etc. for their own value but also as an attractive entry point into the collections, supporting discovery and serendipity from document to document and collection to collection. This article presents an ETL (extract-transform-load) approach to this need that aims to: Identify and extract iconography wher...
Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Vol 2, 2007
In this article, we propose a method for characterizing images of old documents based on a texture approach. The characterization relies on a multiresolution study of the textures contained in the document images. By extracting five features linked to the frequencies and orientations in the different parts of a page, it is possible to extract and compare elements of a high semantic level without making any hypothesis about the physical or logical structure of the analysed documents. Experiments show the feasibility of building tools to assist navigation or indexing. In these experiments, we emphasize the pertinence of these texture features and the advances they represent for characterizing the content of a deeply heterogeneous corpus.
Journal of the American Institute for Conservation, 2014
Surface texture is a critical feature in the manufacture, marketing, and use of photographic paper. Raking light reveals texture through a stark rendering of highlights and shadows. Though close-up raking light images effectively document surface features of photographic paper, the sheer number and diversity of textures used for historic papers prohibits efficient visual classification. This work provides evidence that automatic, computer-based classification of texture documented with raking light is feasible by demonstrating an encouraging degree of success sorting a set of images made from samples of historic silver gelatin paper. Using this dataset, four university teams applied different image-processing strategies for automatic feature extraction and degree of similarity quantification. All four approaches successfully detected strong affinities and outliers built into the dataset. The creation and deployment of the algorithms was carried out by the teams without prior knowledge of the distributions of similarities and outliers. These results indicate that automatic classification of silver gelatin photographic paper based on close-up texture images is feasible and should be pursued. To encourage the development of other classification schemes, the sample 'training' dataset used in this work is available to other academic researchers at http://www.PaperTextureID.org.
In the context of historical collection conservation and worldwide diffusion, this paper presents an automatic approach to historical book page layout segmentation. In this article, we propose to search for the homogeneous regions in the content of historical digitized books with little a priori knowledge by extracting and analyzing texture features. The novelty of this work lies in the unsupervised clustering of the extracted texture descriptors to find homogeneous regions, i.e. graphic and textual regions, by performing the clustering approach on an entire book instead of processing each page individually. We propose firstly to characterize the content of an entire book by extracting the texture information of each page, as our goal is to compare and index the content of digitized books. The extraction of texture features, computed without any hypothesis on the document structure, is based on two non-parametric tools: the autocorrelation function and multiresolution analysis. Secondly, we perform an unsupervised clustering approach on the extracted features in order to classify automatically the homogeneous regions of book pages. The clustering results are assessed by internal and external accuracy measures. The overall results are quite satisfying. Such analysis would help to construct a computer-aided categorization tool for pages.
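As an illustration of the autocorrelation function named in the abstract above, here is a minimal Python sketch. It is not the authors' feature extractor: the function name `autocorrelation`, the restriction to non-negative lags, and the toy striped patch are assumptions made for this example, and the multiresolution analysis and clustering steps are left out.

```python
from typing import List


def autocorrelation(patch: List[List[float]], dx: int, dy: int) -> float:
    """Normalised autocorrelation of a grey-level patch at the non-negative lag (dx, dy)."""
    height, width = len(patch), len(patch[0])
    values = [v for row in patch for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    if variance == 0:
        return 0.0  # A flat patch carries no texture information.
    acc, count = 0.0, 0
    # Compare every pixel with the pixel shifted by (dx, dy), where both exist.
    for y in range(height - dy):
        for x in range(width - dx):
            acc += (patch[y][x] - mean) * (patch[y + dy][x + dx] - mean)
            count += 1
    return acc / (count * variance) if count else 0.0


# Hypothetical patch: horizontal stripes, a crude stand-in for lines of text.
stripes = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]
horizontal = autocorrelation(stripes, dx=1, dy=0)  # along the stripes
vertical = autocorrelation(stripes, dx=0, dy=1)    # across the stripes
```

On this toy patch the value is high along the stripes and strongly negative across them, which is the kind of directional signal a texture descriptor can use to separate text-like regions from other content.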
Computer Vision and Image Understanding, 1998
The economic feasibility of maintaining large data bases of document images has created a tremendous demand for robust ways to access and manipulate the information these images contain. In an attempt to move toward a paperless office, large quantities of printed documents are often scanned and archived as images, without adequate index information. One way to provide traditional database indexing and retrieval capabilities is to fully convert the document to an electronic representation which can be indexed automatically. Unfortunately, there are many factors which prohibit complete conversion including high cost, low document quality, and the fact that many nontext components cannot be adequately represented in a converted form. In such cases, it can be advantageous to maintain a copy of and use the document in image form. In this paper, we provide a survey of methods developed by researchers to access and manipulate document images without the need for complete and accurate conversion. We briefly discuss traditional text indexing techniques on imperfect data and the retrieval of partially converted documents. This is followed by a more comprehensive review of techniques for the direct characterization, manipulation, and retrieval, of images of documents containing text, graphics, and scene images.
Digital Scholarship in the Humanities (formerly: Literary and Linguistic Computing), 2013
https://arxiv.org/abs/2108.08109, 2021
Illustrations are an essential transmission instrument. For an historian, the first step in studying their evolution in a corpus of similar manuscripts is to identify which ones correspond to each other. This image collation task is daunting for manuscripts separated by many lost copies, spreading over centuries, which might have been completely reorganized and greatly modified to adapt to novel knowledge or belief and include hundreds of illustrations. Our contributions in this paper are threefold. First, we introduce the task of illustration collation and a large annotated public dataset to evaluate solutions, including 6 manuscripts of 2 different texts with more than 2 000 illustrations and 1 200 annotated correspondences. Second, we analyze state of the art similarity measures for this task and show that they succeed in simple cases but struggle for large manuscripts when the illustrations have undergone very significant changes and are discriminated only by fine details. Finally, we show clear evidence that significant performance boosts can be expected by exploiting cycle-consistent correspondences. Our code and data are available on http://imagine.enpc.fr/~shenx/ImageCollation.
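The abstract above does not spell out how cycle-consistent correspondences are exploited, so the following Python sketch only shows the simplest form of the idea: a pair is kept when matching from manuscript A to manuscript B and back returns to the starting illustration. The function name `mutual_matches` and the similarity scores are hypothetical and are not taken from the paper's code.

```python
from typing import List, Tuple


def mutual_matches(sim: List[List[float]]) -> List[Tuple[int, int]]:
    """Return pairs (i, j) where j is the best match of i and i is the best match of j."""
    if not sim or not sim[0]:
        return []
    n_rows, n_cols = len(sim), len(sim[0])
    # Best column for every row (A -> B).
    best_for_row = [max(range(n_cols), key=lambda j: sim[i][j]) for i in range(n_rows)]
    # Best row for every column (B -> A).
    best_for_col = [max(range(n_rows), key=lambda i: sim[i][j]) for j in range(n_cols)]
    # Keep a pair only if going A -> B -> A returns to the start.
    return [(i, j) for i, j in enumerate(best_for_row) if best_for_col[j] == i]


# Hypothetical similarity scores between 3 illustrations of manuscript A (rows)
# and 3 illustrations of manuscript B (columns).
example_sim = [
    [0.9, 0.2, 0.1],
    [0.8, 0.3, 0.2],
    [0.1, 0.2, 0.7],
]
example_pairs = mutual_matches(example_sim)
```

In this example the second illustration of A also prefers the first illustration of B, but that preference is not returned, so the pair is dropped as inconsistent.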
In this article, a complete framework for the comparative analysis of texture features is presented and evaluated for the segmentation and characterization of ancient book pages. Firstly, the content of an entire book is characterized by extracting the texture attributes of each page. The extraction of the texture features is based on a multiresolution analysis. Secondly, a clustering approach is performed in order to classify automatically the homogeneous regions of book pages. Namely, two approaches are compared based on two different statistical categories of texture features, autocorrelation and co-occurrence, in order to segment the content of ancient book pages and find homogeneous regions with little a priori knowledge. By computing several clustering and classification accuracy measures, the results of the comparison show the effectiveness of the proposed framework. Tests on different book contents (text vs. graphics, manuscript vs. printed) show that those texture features are more suitable to distinguish textual regions from graphical ones, than to distinguish text fonts.
2004
2020
Digital humanities still need to unlock much of the potential of image analysis algorithms. Modern deep learning image processing can contribute much to quantifying knowledge about visual components in books. In this study, we report on experiments carried out on historical print. The illustrations in books offer much for humanities research. Object recognition systems can identify the portfolio of objects in book illustrations. In a study with several hundred books, we applied systems to find illustrations and classify them. Results show that persons are shown in illustrations within fiction books with a higher frequency than in non-fiction books. We also show the classification results for an analysis of the printing technology. This expert task cannot yet be perfectly modeled by a CNN. A class activation map analysis can be used to analyze the performance qualitatively.
Nowadays, Digital Libraries have become a widely used service to store and share both born-digital documents and digital versions of works stored by traditional libraries. Document images are intrinsically non-structured, and the structure and semantics of the digitized documents are for the most part lost during the conversion. Several techniques from the Document Image Analysis research area have been proposed in the past to deal with document image retrieval applications. This chapter presents a survey of the more recent techniques applied in the field of recognition and retrieval of text and graphical documents. In particular, we describe techniques related to recognition-free approaches.
Restaurator, 2002
First International Workshop on Document Image Analysis for Libraries, 2004. Proceedings.
Large collections of scanned documents (books and journals) are now available in Digital Libraries. The most common method for retrieving relevant information from these collections is image browsing, but this approach is not feasible for books with more than a few dozen pages. OCR systems can recognize the printed text in the images, in which case retrieval by textual content can be performed. However, the results depend heavily on the quality of the original documents. More sophisticated navigation can be performed when an electronic table of contents of the book is available with links to the corresponding pages. An opposite approach relies on reducing the amount of symbolic information to be extracted at storage time. This is the approach taken by document image retrieval systems. In this paper we describe a system that we developed in order to retrieve information from digitized books and journals belonging to Digital Libraries. The main feature of the system is the ability to combine two principal retrieval strategies in several ways. The first strategy allows a user to find pages with a layout similar to a query page. The second strategy is used to retrieve words in the collection matching a user-defined query, without performing OCR. The combination of these basic strategies allows users to retrieve meaningful pages with little effort during the indexing phase. We describe the basic tools used in the system (layout analysis, layout retrieval, word retrieval) and the integration of these tools for answering complex queries. Experiments carried out on 1287 pages show the effectiveness of the integrated retrieval.
Knjižnica. Revija za področje bibliotekarstva in informacijske znanosti, 2012
Books printed from wooden blocks in the fifteenth century, the so-called xylographa or blockbooks, are exceedingly rare and therefore extremely valuable objects of the European cultural heritage. The paper gives an introduction into the material and the questions examined in current research. Responding to scholars' needs for high-quality reproductions of the blockbooks and their watermarks, a project for the cataloguing and digitisation of the c. 90 blockbooks held at the Bayerische Staatsbibliothek München and other institutions in Bavaria was started in 2009. The aims and methods of the project are described in the paper, and the workflow of the digitisation project carried out at the Bayerische Staatsbibliothek is explained in detail, both with respect to the digitisation procedures for image capture and the detection and documentation of watermarks with an infrared camera. The illustrations exemplify the quality of digital images generated in the project, all of which are accessible on the website of the Munich Digitization Center (MDZ) at the Bayerische Staatsbibliothek München.
2009
In this work we explore the task of authenticating and dating ancient manuscripts by capturing images of pages in near-infrared (NIR) and modelling and then comparing the ink appearance of segmented text. We present a texture feature descriptor to characterize and recognize semi-transparent materials such as the inks found in manuscripts. These textural patterns are different in nature from perceptual entities such as textons, tokens, frequency or repeatability of textural elements. Our ink texture descriptor relates a set of ink features from various first and second-order statistics to semi-liquid and viscous image-based properties of inks. In particular, we propose eigen features from the joint gray-level probabilities and off-diagonal sums of co-occurrence matrices. We test the qualities of the features with a classifier trained with the ink descriptor to show how well it recognizes eight different inks of known composition. Presented with the very same task, the human visual system would fail to spot the difference in ink composition, given that the inks' inter-class and intra-class distances are extremely short.
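To make the co-occurrence terminology in the abstract above concrete, the Python sketch below builds a normalised gray-level co-occurrence matrix (the joint gray-level probabilities for one pixel offset) and sums its off-diagonal entries. Both function names are hypothetical, the single total is only one possible reading of "off-diagonal sums", and the eigen features and classifier from the paper are not reproduced here.

```python
from typing import List


def cooccurrence(img: List[List[int]], levels: int, dx: int = 1, dy: int = 0) -> List[List[float]]:
    """Normalised gray-level co-occurrence matrix for the pixel offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    height, width = len(img), len(img[0])
    total = 0
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                # Count the pair (level at pixel, level at shifted pixel).
                counts[img[y][x]][img[ny][nx]] += 1
                total += 1
    if total == 0:
        return [[0.0] * levels for _ in range(levels)]
    return [[c / total for c in row] for row in counts]


def off_diagonal_sum(p: List[List[float]]) -> float:
    """Total probability of pairs whose two gray levels differ (assumed interpretation)."""
    size = len(p)
    return sum(p[i][j] for i in range(size) for j in range(size) if i != j)


# Hypothetical two-level image: a dark half next to a bright half.
toy_image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
toy_glcm = cooccurrence(toy_image, levels=2)
toy_off_diag = off_diagonal_sum(toy_glcm)
```

A smooth region puts most of its probability mass on the diagonal, so a larger off-diagonal sum indicates more frequent gray-level changes at the chosen offset.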
Lecture Notes in Computer Science
This paper presents a mathematical morphology based methodology to identify and extract several components on antique printed books in order to automatically build metadata. These components were previously classified into five different sets (drop capitals, stripes, figures, annotations and text matter) each one characterised by particular geometric features. Based on that assumption several novel algorithms appealing to morphological operators are proposed. The evaluation of the methodology is performed on pages of XVI century books.
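The abstract above does not describe its algorithms, so the Python sketch below only illustrates the basic morphological operators such a methodology builds on: binary erosion, dilation, and their composition as an opening. The square structuring element, the function names, and the toy page are assumptions made for this example.

```python
from typing import List

Image = List[List[int]]


def erode(img: Image, radius: int) -> Image:
    """Binary erosion with a (2*radius+1) square; pixels outside the image count as background."""
    h, w = len(img), len(img[0])
    offsets = range(-radius, radius + 1)
    return [
        [
            int(all(
                0 <= y + j < h and 0 <= x + i < w and img[y + j][x + i] == 1
                for j in offsets for i in offsets
            ))
            for x in range(w)
        ]
        for y in range(h)
    ]


def dilate(img: Image, radius: int) -> Image:
    """Binary dilation with the same square structuring element."""
    h, w = len(img), len(img[0])
    offsets = range(-radius, radius + 1)
    return [
        [
            int(any(
                0 <= y + j < h and 0 <= x + i < w and img[y + j][x + i] == 1
                for j in offsets for i in offsets
            ))
            for x in range(w)
        ]
        for y in range(h)
    ]


def opening(img: Image, radius: int) -> Image:
    """Erosion followed by dilation: removes components smaller than the square."""
    return dilate(erode(img, radius), radius)


# Hypothetical page: a 3x3 block (a large component such as a figure)
# and one isolated pixel (a small component such as a speck of text).
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
large_only = opening(page, radius=1)
```

Here the opening keeps the 3x3 block and removes the isolated pixel, which shows how a size-based geometric feature can separate one class of page component from another.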
The work conducted in this article presents a structural signature based on texture for the characterization and categorization of digitized historical book pages. The proposed signature does not assume a priori knowledge regarding page layout and content, and hence, it is applicable to a large variety of ancient books. By integrating varying low-level features (e.g. texture) characterizing the different page components (i.e. different text fonts or graphic regions) on the one hand, and structural information describing the page layout on the other hand, the proposed signature provides a rich and holistic description of the layout and content of the analyzed book pages. More precisely, the signature-based characterization approach consists of two stages. The first stage is extracting automatically homogeneous regions. Then, the second one is proposing a graph-based page signature, which is based on the extracted homogeneous regions, reflecting its layout and content. This signature ensures the implementation of numerous applications for managing effectively a corpus or collections of books (e.g. information retrieval in digital libraries according to several criteria or page categorization). To illustrate the effectiveness of the proposed page signature, a detailed experimental evaluation has been conducted in this article for assessing two possible categorization applications, unsupervised page classification and page stream segmentation.