Overview
The aim of this book is to understand basic digital humanities (DH) technologies and their problems while working with digital texts. To this end, we will use modern text processing tools to build a natural language processing (NLP) pipeline. The results from this pipeline will then be used to build a multilayer network of the entities occurring in the text, which can then be analyzed with recent clustering algorithms.
Goal 🎉 A commented script for understanding and critically assessing all the steps that lead from books to networks.
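To make the path from text to network concrete, here is a minimal sketch in plain Python. It is not the pipeline developed in this book: the entity step is a naive stand-in (capitalized words instead of a real NLP model), the result is a single-layer co-occurrence network rather than a multilayer one, and the names `extract_entities`, `cooccurrence_edges`, and the sample text are hypothetical and used for illustration only.

```python
import re
from collections import Counter
from itertools import combinations

# Assumption: capitalized words stand in for named entities. A real NLP
# pipeline would use a trained model here; this heuristic also picks up
# ordinary words that happen to start a sentence.
ENTITY_PATTERN = re.compile(r"\b[A-Z][a-z]+\b")


def extract_entities(sentence):
    """Return the set of candidate entities found in one sentence."""
    return set(ENTITY_PATTERN.findall(sentence))


def cooccurrence_edges(text):
    """Count how often each pair of entities appears in the same sentence."""
    edges = Counter()
    # Split after sentence-final punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        entities = sorted(extract_entities(sentence))
        for a, b in combinations(entities, 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical sample text, invented for this sketch.
sample = "Kepler wrote to Galileo. Galileo answered Kepler from Padua."
edges = cooccurrence_edges(sample)

for (a, b), weight in sorted(edges.items()):
    print(f"{a} -- {b} (weight {weight})")
```

Each edge weight counts the sentences in which two entities appear together. A weighted edge list of this kind is the sort of input that network libraries and clustering algorithms typically work with.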
Part I: “From objects to data”
How do we get from books and other textual sources to data? What is the technological basis of OCR? Which biases should we be aware of when dealing with existing digital repositories?
Working with data
Literate programming
OCR
Biases in source material
From text to data
Tagging
TEI
Intro to spaCy
Part II: “The art of the topic”
Introduction to NLP basics and their limitations.
Calculating topics
Biases between languages
Tokens / Lemmata and languages
Methods
Intro to Gensim
Time-dependent topics
Visualizations
Clouds and Pies
Close and distant reading
Part III: “Networks of everything”
Basics of networks and their use in DH, and techniques for creating and analyzing multilayer networks.
Social networks and Scientometrics
Methodological questions
Sources of data
Methods and Software
Basics of networks
Intro to igraph
Multilayer networks