From objects to data

Reading a book gives us information (and fun!), but it is a slow method: if we want to read many books, we need a lot of time at our disposal.

So how can we read many books in a short time and still extract some information from them?

Enter distant reading and co.

Distant reading requires, as a first step, that the library we are interested in is machine readable. Once we have taken a picture of every page of a book, we can use OCR to embed the recognized text in the image. In a second step, we can apply automated processes to extract the structure of the text on each page and save the result in the TEI format.
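
Below is a minimal sketch of these two steps, assuming the pytesseract and Pillow packages are installed on top of a local Tesseract installation; the file name, the language setting, and the flat one-paragraph TEI skeleton are illustrative assumptions, not part of the course material.

```python
# Minimal OCR-to-TEI sketch (assumes: pip install pytesseract pillow,
# plus a local Tesseract installation; "page_001.png" is a placeholder).
import pytesseract
from PIL import Image
from xml.sax.saxutils import escape

# Step 1: recognize the text on one scanned page
# ("deu" assumes the German language data is installed).
text = pytesseract.image_to_string(Image.open("page_001.png"), lang="deu")

# Step 2: save the result in a deliberately minimal TEI skeleton.
tei = f"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>OCR output</title></titleStmt>
    <publicationStmt><p>Unpublished draft</p></publicationStmt>
    <sourceDesc><p>page_001.png</p></sourceDesc></fileDesc></teiHeader>
  <text><body><p>{escape(text)}</p></body></text>
</TEI>"""

with open("page_001.tei.xml", "w", encoding="utf-8") as f:
    f.write(tei)
```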

In the following, we discuss some aspects of availability and bias in these steps, and then turn to some introductory exercises.

OCR pitfalls and opportunities

  • OCR = Optical character recognition

  • Make sources machine-readable

  • Most common programs are good for “modern” (i.e. printed!) texts only (ABBYY / Tesseract)

  • Many other approaches, see e.g. OCR4all

  • Tesseract is open source, and useful tutorials are easy to find online.

For texts spanning a wide period of time, OCR quality will vary a lot

Example 1

As a first example of data quality, have a look at Leo Bergmann's Das Buch der Arbeit from 1855. Although it is set in Fraktur, the text is still recognized and readable. However, the OCR contains several errors, and the text structure is not captured very well.

Source: STABI

Scanning quality

  • OCR quality is improved by higher resolution images

  • But what about file size? ➡️ At 400 dpi, one page ≈ 40 MB (see the estimate below)

  • To study the paper itself / the materiality of sources, even higher resolutions might be necessary

Long-term preservation of raw data raises questions about data quality
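
The 40 MB figure can be checked with a back-of-the-envelope calculation; the A4 page size and the uncompressed 24-bit color depth are assumptions for illustration:

```python
# Rough uncompressed size of one A4 page scanned at 400 dpi in 24-bit color.
# (A4 dimensions and color depth are illustrative assumptions.)
dpi = 400
width_in, height_in = 8.27, 11.69    # A4 in inches
pixels = (width_in * dpi) * (height_in * dpi)
bytes_per_pixel = 3                  # 24-bit RGB, uncompressed
size_mb = pixels * bytes_per_pixel / 1024**2
print(f"{size_mb:.0f} MB per page")  # ~44 MB, in line with the ~40 MB quoted
```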

Example 2

  • What about other languages?

Vespucci, B. et al.; Nota eorum quae in hoc libro continentur ...

Source: ECHO at MPIWG

Existing collections

There are already many existing text collections. Depending on the initial research question, it can be much faster to use one of them as the starting point. To capture the full structure of a text, many collections use the Text Encoding Initiative (TEI) standard, which is based on XML; see e.g. TEI-C for an introduction. A minimal parsing sketch follows the list below.

  • LOC: Crowd-sourced transcriptions

  • Textgrid: German texts in TEI format

  • Newton Project: Works and correspondence of Isaac Newton

  • Verfassungsschutzberichte: All public reports from the Office for the Protection of the Constitution

  • arXiv: Physics preprints of the last decades
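
As a minimal sketch of working with such a collection, the snippet below parses a TEI file with Python's standard library and extracts the title and the plain text; the file name is a hypothetical placeholder for any TEI document, e.g. one downloaded from Textgrid.

```python
# Extract title and plain text from a TEI/XML file (standard library only).
# "goethe_faust.tei.xml" is a placeholder file name.
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

tree = ET.parse("goethe_faust.tei.xml")
root = tree.getroot()

title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
print("Title:", title.text if title is not None else "unknown")

# Concatenate all text inside the <body> element, ignoring markup.
body = root.find(".//tei:body", TEI_NS)
plain_text = "".join(body.itertext()) if body is not None else ""
print(plain_text[:500])
```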

Exercises

Exercise 1

  • Use a PDF you are interested in and convert it to TEI using Grobid (a minimal API sketch follows this list)

  • What result do you get?

  • Can it be converted? If not, why not?

  • Is the information on authors/publishers correct? Probably only if you use an academic paper…
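
One way to run the conversion is Grobid's REST service. The sketch below assumes a Grobid server running locally on its default port 8070 (e.g. started from the project's Docker image) and uses the requests package; the PDF file name is a placeholder.

```python
# Send a PDF to a locally running Grobid server and save the TEI result.
# Assumes a Grobid instance at localhost:8070 (the project's default port).
import requests

with open("my_paper.pdf", "rb") as pdf:  # placeholder file name
    response = requests.post(
        "http://localhost:8070/api/processFulltextDocument",
        files={"input": pdf},
        timeout=120,
    )
response.raise_for_status()

with open("my_paper.tei.xml", "w", encoding="utf-8") as out:
    out.write(response.text)
```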

Exercise 2

JupyterLab can also be installed with pip (see the sketch after this list)…

  • Download and install Anaconda

  • Open JupyterLab

  • Download the material from GitHub

  • Have a look in the folder Working_with_Jupyter
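
For those who prefer pip over Anaconda, a minimal alternative setup could look like this (the generic pip route, not the course's official Anaconda setup):

```
# Install and start JupyterLab without Anaconda
pip install jupyterlab
jupyter lab
```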

Exercise 3

  • Download material text2data

  • Evaluate notebook Text2Data_1.ipynb

    • Do you need to install a new package?

  • Try to load a new text resource (a quick sanity check is sketched below)

    • E.g. from Textgrid, or a PDF you converted using Grobid.
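
Once the notebook runs, a quick sanity check for a newly loaded resource could look like the sketch below: read the plain text and count the most frequent words. The file name is a placeholder, and the actual notebook may of course approach this differently.

```python
# Quick sanity check for a newly loaded text resource: top word frequencies.
# "my_text.txt" is a placeholder for plain text exported from TEI or Grobid.
from collections import Counter
import re

with open("my_text.txt", encoding="utf-8") as f:
    text = f.read().lower()

words = re.findall(r"\w+", text)
print(Counter(words).most_common(10))
```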