Read all works by Tucholsky

The basic dataset for this book is a collection of all works by Kurt Tucholsky. The reason for this choice is simple: building a clean research corpus is a long and difficult process in itself, deeply connected to the research question at hand. For the aims of this book, it is therefore easier to rely on an already existing corpus.

Reading TEI files

Textual corpora are mostly encoded in the TEI XML format. The collected works of Kurt Tucholsky can, for example, be found in the TextGrid Repository.

Source:

TextGrid Repository (2012). Tucholsky, Kurt. Werke. Digitale Bibliothek. TextGrid. https://hdl.handle.net/11858/00-1734-0000-0005-61C5-B
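
Each file in this collection is a teiCorpus document; nested inside are further teiCorpus elements whose TEI children carry an xml:id and an n attribute encoding year and title. The following is a heavily abbreviated, hypothetical sketch of this structure, inferred from the attributes the parsing code below accesses; the real files contain full TEI headers and much richer markup, and the attribute values of the inner div are illustrative:

<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiCorpus>
    <TEI xml:id="tg996.wncv.0"
         n="/Literatur/M/Tucholsky, Kurt/Werke/1927/Ein Pyrenäenbuch">
      <text>
        <body>
          <!-- attribute values below are illustrative -->
          <div xml:id="tg996.3" n="Ein Pyrenäenbuch">... full text ...</div>
        </body>
      </text>
    </TEI>
  </teiCorpus>
</teiCorpus>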

import os
import re
from multiprocessing import Pool, cpu_count
import pandas as pd
from bs4 import BeautifulSoup
import xml.etree.ElementTree as etree

# Collect the names of all files in the TEI source folder.
fileList = os.listdir('../data/tei/')
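
Since the number of files depends on the local copy of the repository, a quick count is a useful first check:

len(fileList)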

Reducing the structure

Since we are not concerned with the finer textual structure, we simply read the full text of every identified work.

First, we look at XML elements of the form ./tei:teiCorpus/tei:TEI and extract the title and publication year from their n attribute. Then, we re-open the file with BeautifulSoup to find the div elements whose xml:id matches the text's id and collect the full text from them.

The resulting list contains all texts by Tucholsky that are available in this resource, together with their titles and publication years. Overall, we obtain around 1,700 texts published between 1917 and 1934.

def getText(path):
    ## Namespaces and regular expressions for extracting the year and
    ## title from the 'n' attribute of each TEI element.
    ns = {'tei': "http://www.tei-c.org/ns/1.0",
          'xml': "http://www.w3.org/XML/1998/namespace"}
    dateRegex = re.compile(r'(?<=/Literatur/M/Tucholsky, Kurt/Werke/)\d{4}(?=/)')
    titleRegex = re.compile(r'(?<=/Literatur/M/Tucholsky, Kurt/Werke/\d{4}/).+')
    basePath = '../data/tei/'
    ###
    filePath = basePath + path
    tempList = []
    if os.path.isfile(filePath) and path.endswith('.xml'):
        # First pass: collect id, year and title for every TEI element.
        tree = etree.parse(filePath)
        root = tree.getroot()
        TEIs = root.findall("./tei:teiCorpus/tei:TEI", ns)
        for el in TEIs:
            tempDict = {}
            tempDict['ids'] = el.attrib["{http://www.w3.org/XML/1998/namespace}id"]
            tei_path = el.attrib['n']
            tempDict['year'] = int(re.findall(dateRegex, tei_path)[0])
            tempDict['title'] = re.findall(titleRegex, tei_path)[0]
            tempList.append(tempDict)
        # Second pass: re-open the file with BeautifulSoup and collect the
        # full text of the div element(s) belonging to each id.
        with open(filePath) as file:
            soup = BeautifulSoup(file, 'lxml')
            for ids in tempList:
                try:
                    # Match div elements whose xml:id contains the id
                    # prefix (e.g. 'tg996'); expect exactly one hit.
                    search = re.compile(ids['ids'].split('.')[0] + '+')
                    elems = soup.findAll('div', {'xml:id': search})
                    assert len(elems) == 1
                    text = elems[0].getText()
                    ids.update({'text': text})
                except (AssertionError, IndexError):
                    # Fallback: match div elements by title instead and
                    # join all parts of the text.
                    print(ids)
                    search = re.compile(re.escape(ids['title']) + '+')
                    elems = soup.findAll('div', {'n': search})
                    textList = [x.getText() for x in elems]
                    text = '\n\n\n\n'.join(textList)
                    ids.update({'text': text})
                    print('Found text for ids')
    return tempList
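
Before running the extraction over all files in parallel, it is worth trying getText on a single file (assuming fileList is non-empty; the index is arbitrary):

sample = getText(fileList[0])
len(sample)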
pool = Pool(cpu_count() - 1)
result = pool.map(getText, fileList)
{'ids': 'tg996.wncv.0', 'year': 1927, 'title': 'Ein Pyrenäenbuch'}
Found text for ids
{'ids': 'tg1603.wnfc.0', 'year': 1931, 'title': 'Schloß Gripsholm'}
Found text for ids
# Flatten the list of per-file lists into one list of dicts.
flatt = [x for y in result for x in y]
dfT = pd.DataFrame(flatt)
dfT.head(2)
            ids  year             title                   text
0  tg203.wkgc.0  1919  Ein Deutschland!  \n\n Ein Deutschl...
1  tg204.wkxr.0  1919    Achtundvierzig  \n\n Achtundvierz...
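
As a quick sanity check, the corpus size and date range stated above can be confirmed directly:

len(dfT), dfT.year.min(), dfT.year.max()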

We can easily check that every id has a corresponding text: the following query should return an empty dataframe.

dfT[dfT.text.isna()]

The resulting dataframe is saved in JSON Lines (JSONL) format for further processing.

dfT.to_json('../data/df_tucholsky.json', orient='records', lines=True)
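
In later steps, the corpus can be read back with the same orientation (a minimal sketch, assuming the path above is unchanged):

dfT = pd.read_json('../data/df_tucholsky.json', orient='records', lines=True)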