Books as knowledge reservoirs

From critical distant reading to networks of ideas


At the core of this online book lies the idea of books as a means of storing knowledge. While we traditionally tap into these reservoirs by getting lost in a good book, the rise of digital methods in the humanities offers new opportunities for more quantitative access to textual data. As with any new method or technology, this brings novel obstacles, inequalities, and valid criticism. The book aims to empower readers to understand both the possibilities and the pitfalls of distant reading methods, whether statistical or based on machine learning, by providing perspectives from data science and critical theory.


The book offers a hands-on approach to every step of the process, from OCR, data cleaning, and keyword analysis to building and analyzing networks of words, persons, or places. For each step, it discusses the specific sources of misinterpretation, the finer points of data inequality, and current criticisms of the methods. To further open up the black box of digital humanities, readers are guided in using open-source technologies and encounter aspects of open science such as data publishing and crowd-assisted humanities.
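
To give a first impression of what the later steps of such a pipeline can look like, here is a minimal sketch in Python. It is not taken from the book's notebooks: the toy sentences, the stop-word list, and the function names `tokenize`, `keyword_counts`, and `cooccurrence_edges` are assumptions made purely for illustration, and only the standard library is used. The sketch cleans a few sentences, counts keywords, and counts how often two words share a sentence, which is one simple way to obtain the edges of a word network.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; in practice the text would come from OCR output.
SENTENCES = [
    "The library stores books and maps.",
    "Books carry ideas across centuries.",
    "Ideas travel through books and letters.",
]

# Tiny illustrative stop-word list (an assumption, not a standard resource).
STOPWORDS = {"the", "and", "through", "across"}


def tokenize(sentence):
    """Lowercase a sentence and keep alphabetic tokens that are not stop words."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]


def keyword_counts(sentences):
    """Count how often each remaining word occurs in the corpus."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts


def cooccurrence_edges(sentences):
    """Count word pairs that appear in the same sentence (undirected edges)."""
    edges = Counter()
    for sentence in sentences:
        # Sorting makes each pair appear in one fixed order, e.g. (a, b) never (b, a).
        unique_words = sorted(set(tokenize(sentence)))
        edges.update(combinations(unique_words, 2))
    return edges


counts = keyword_counts(SENTENCES)
edges = cooccurrence_edges(SENTENCES)
print(counts.most_common(2))       # the two most frequent keywords
print(edges[("books", "ideas")])   # number of sentences containing both words
```

Even this small example involves choices that shape the result, such as which words count as stop words and whether co-occurrence is measured per sentence or per page. These are the kinds of decisions whose consequences the book discusses at each step.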

Target group

The book requires no prior experience in coding or other technical skills. Basic computer skills and a personal laptop are required. This interdisciplinary book is aimed at readers from the humanities who are interested in practical knowledge of digital methods, and at readers from technical domains who are interested in critically assessing their digital skills.


Readers will gain hands-on experience with the full pipeline of distant reading and learn to critically evaluate digital methods in the humanities. Readers are encouraged to write an interactive, digital paper on a specific research question developed while reading the book, putting the acquired digital humanities knowledge to practical use.


This documentation offers several ways to interact with the content.

Annotation and highlighting

Using the annotation service, all pages of the document can be annotated or highlighted. This requires an account with that service. If you set the visibility of a comment to public, it will be visible to all audiences.

Commenting on full page

Using the Utterances app, you can comment at the bottom of each page. Comments are saved as issues on GitHub and can be the basis of a discussion on methods or data, leading to changes and improvements in this book.

Interactive Binder instance

In addition, the top menu lets you run the notebooks on a Binder instance provided by GESIS. There you can change parameters and re-run the examples. Note that some notebooks cannot be run on Binder because of file size.

Source code

The full source code for this Jupyter Book is available online; see the corresponding GitHub button. If you find problems with the analysis or code, feel free to open an issue using the Issue button.


Each page can be downloaded either in text format or as a Jupyter Notebook, depending on its source. All data is published under a CC BY license.

Relevant reading

Bolukbasi, T. et al. (2016). “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings”. Proc. Neural Information Processing Systems. URL

Caliskan, A., Bryson, J. J., and Narayanan, A. (2017). “Semantics derived automatically from language corpora contain human-like biases”. Science 356 (6334), 183-186, URL

Campbell, S., Yu, Z., Connell, S., and Dunne, C. (2018). “Close and Distant Reading via Named Entity Network Visualization: A Case Study of Women Writers Online”. Proc. 3rd Workshop on Visualization for the Digital Humanities (VIS4DH), URL

Düring, M. (2015). “From Hermeneutics to Data to Networks: Data Extraction and Network Visualization of Historical Sources”. The Programming Historian 4, URL

Hovy, D. and Spruit, S.L. (2016). “The Social Impact of Natural Language Processing”. Proc. Association for Computational Linguistics (ACL), 591-598, URL

Lavin, M.J. (2019). “Analyzing Documents with TF-IDF”. The Programming Historian 8, URL

Leek, J.T. and Peng, R.D. (2015). “Reproducible research can still be wrong”, PNAS 112 (6), 1645-1646. URL

Simanowski, R., ed. (2016). Digital Humanities and Digital Media: Conversations on Politics, Culture, Aesthetics and Literacy. Open Humanities Press, URL

Underwood, T. (2017). “A Genealogy of Distant Reading”. Digital Humanities Quarterly 11.2, URL

Walsh, B. and Horowitz, S. (2016). Introduction to Text Analysis: A Coursebook. Github, URL