Overview

The aim of this book is to understand basic digital humanities (DH) technologies and their problems while working with digital texts. To this end, we will use modern text processing tools to build a natural language processing (NLP) pipeline. The results of this pipeline will then be used to build a multilayer network of the entities occurring in the text, which can then be analyzed with recent clustering algorithms.

  • Goal 🎉 A commented script for understanding and critically assessing every step on the way from books to networks.
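
To make the path from text to a multilayer network concrete, here is a minimal sketch in plain Python. It is not the pipeline developed in this book: the entity extraction is a toy heuristic (capitalized words that are not sentence-initial) standing in for a real NLP tool, each source is treated as one network layer, and the function names `extract_entities` and `build_layers` as well as the sample sentences are made up for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def extract_entities(sentence):
    """Toy stand-in for named entity recognition (an assumption, not real NER):
    every capitalized word except the first word of the sentence."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    return sorted({t for t in tokens[1:] if t[0].isupper()})


def build_layers(sources):
    """Build one co-occurrence layer per source.

    Each layer maps an alphabetically ordered entity pair to the number of
    sentences in which both entities occur together.
    """
    layers = {}
    for name, sentences in sources.items():
        edges = Counter()
        for sentence in sentences:
            for pair in combinations(extract_entities(sentence), 2):
                edges[pair] += 1
        layers[name] = edges
    return layers


# Hypothetical sample data: two sources, hence two layers.
sources = {
    "letters": [
        "In 1905 Einstein wrote to Planck in Berlin.",
        "Later Planck answered Einstein.",
    ],
    "diary": ["Today Einstein visited Berlin."],
}

layers = build_layers(sources)
for name, edges in layers.items():
    print(name, dict(edges))
```

Keeping the layers in a dictionary of edge counters makes it easy to compare which connections appear in one source but not in another, which is the basic question a multilayer network is meant to answer.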

Part I: “From objects to data”

How do we get from books and other textual sources to data? What is the technological basis of OCR? Which biases should we be aware of when dealing with existing digital repositories?

  • Working with data

    • Literate programming

    • OCR

    • Biases in source material

  • From text to data

    • Tagging

    • TEI

    • Intro to spaCy

Part II: “The art of the topic”

Introduction to NLP basics and their limitations.

  • Calculating topics

    • Biases between languages

    • Tokens / Lemmata and languages

  • Methods

    • Intro to Gensim

    • Time-dependent topics

  • Visualizations

    • Clouds and Pies

  • Close and distant reading

Part III: “Networks of everything”

Basics of networks and their use in DH; techniques for creating and analyzing multilayer networks.

  • Social networks and Scientometrics

    • Methodological questions

    • Sources of data

  • Methods and Software

    • Basics of networks

    • Intro to igraph

    • Multilayer networks