Finding and counting words

As always, some cells need to be edited before this exercise runs. These parts are marked by ???.

As in the previous exercise, first load the data from the saved file.

import re
import pandas as pd
data = pd.read_json('../data/df_tucholsky.json',lines=True)

For counting the number of words in each text, we can use different methods.

For example, every string variable in Python comes with built-in methods to work with it, e.g. .split(), .strip(), .startswith()

We can test some of these methods on the first text from the dataframe.

data.head(2)
text = data['text'].iloc[0]

The raw text is returned by evaluating the variable.

text

If you want to see the text correctly formatted, you need to print() it!

print(text)

If we apply .strip(), we remove whitespace and newlines at the beginning and end of the text.

newtext = text.strip()
newtext

The new text starts with the word Der, which can be checked with the method .startswith().

newtext.startswith('Der')
  • Try to use .endswith('WORD') on the variable newtext, such that the function returns True.

newtext.endswith(???)

Counting words

If we apply the method .split(), the text is split at each whitespace, which is the default. You can also pass any string value, and the text will be split at it, for example at each letter e (.split('e')).

splitted = newtext.split()
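The custom-separator variant mentioned above can be sketched on a toy string (an assumed example, not the Tucholsky text):

```python
# A short sentence to demonstrate .split() behaviour.
sentence = "Berlin, den ersten Januar"

# Default: split at any run of whitespace.
print(sentence.split())     # ['Berlin,', 'den', 'ersten', 'Januar']

# Custom separator: split at every letter 'e'; the 'e' itself is removed.
print(sentence.split('e'))  # ['B', 'rlin, d', 'n ', 'rst', 'n Januar']
```

Note that the separator character does not appear in the resulting list.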

The return value of splitting is a list of the split elements. To get the number of list elements, you can use the len() function.

len(splitted)
  • What happens if you split the text on newline characters? ("\n")

Splitting at whitespace almost yields a word list of the text, but some entries are symbols like hyphens, and many words keep trailing punctuation such as dots.

splitted

With regular expressions

If we build a word list or count words based on this list, we will get wrong results. A slightly better way is using regular expressions, a formal notation for describing text patterns.

The Python package re, imported in the first cell, deals with regular expressions.

To find regular expressions, we use re.findall(EXPRESSION, TEXT).

The expression [A-Z] matches any capital letter. Adding a plus, [A-Z]+, matches strings of one or more capital letters. If you want to match a certain number of capital letters, you can add curly brackets with the needed numbers: {2,4} (between two and four), {3} (exactly three), or {2,} (two or more).
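The quantifiers described above can be tried out on a toy string (an assumed example, not the Tucholsky text):

```python
import re

# A sample string containing single capitals and runs of capitals.
sample = "Die USA und die NATO, gegründet 1949 bzw. 1945."

print(re.findall('[A-Z]', sample))      # single capital letters:
                                        # ['D', 'U', 'S', 'A', 'N', 'A', 'T', 'O']
print(re.findall('[A-Z]+', sample))     # runs of one or more capitals:
                                        # ['D', 'USA', 'NATO']
print(re.findall('[A-Z]{2,4}', sample)) # runs of two to four capitals:
                                        # ['USA', 'NATO']
```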

  • Use regular expressions to find all capital letters

  • Are there also two capital letters together?

  • Use the pattern \d to find exactly four digits. What number is this?

  • BONUS: Find an expression to match a date? (e.g. 12.06.2020)

  • BONUS: Can you create an expression to find the publishing metadata of Die Weltbühne (which issue, page)?

Combining capital and small letters with a plus ([A-Za-z]+) finds a number of word-like objects, but some special characters of German are lost (e.g. ß). As a shortcut, one can use \w to find all alphanumeric characters.
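The difference matters for German text, as a small sketch shows (an assumed toy string):

```python
import re

# A German phrase with characters outside [A-Za-z].
german = "Die Straße ist schön"

print(re.findall('[A-Za-z]+', german))  # 'ß' and 'ö' break the words apart:
                                        # ['Die', 'Stra', 'e', 'ist', 'sch', 'n']
print(re.findall(r'\w+', german))       # \w keeps the words intact:
                                        # ['Die', 'Straße', 'ist', 'schön']
```

Python's re module treats \w as Unicode-aware by default for strings, which is why umlauts and ß are matched.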

#re.findall(r'[A-Za-z]+', text)

To find the number of all words in every text, we need to write a small function in Python.

A function always has the form

def NAME(INPUT):
    var = FUNCTION(INPUT)
    return var
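A worked instance of this form (a hypothetical helper, not the exercise solution) returns the number of characters in a text:

```python
# Example of the general function form: input -> computation -> return.
def charNumber(text):
    var = len(text)
    return var

print(charNumber("Die Weltbühne"))  # 13
```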

In our case, the input is a text and the returned variable should be the length of the list of words.

  • Write the function with re.findall(), the input variable text, and return the len of the word list

def wordNumber(text):
    var = len(re.findall(???,text))
    return var

The new function can be applied to the data by using the .apply() function of dataframes.

newdata = data.text.apply(lambda x: wordNumber(x))

This automatically applies the function to every row of the dataframe. To save the information in the dataframe, create a new column.

data['wordCount'] = newdata 

If you want to check your result, uncomment the cell below (i.e. remove the hash sign).

#data.wordCount

Count unique words

To get an overview of the words used in each text, we can use the Counter class from the built-in collections package.

from collections import Counter

Counting the words of one text is simply done by

allwords = re.findall(r'\w+', text)
Counter(allwords)
  • Write a function to return the list of counted words for each text.

  • Bonus: Try to normalize the counts by the number of words in the text.

  • Bonus: Return the counted words ordered by the highest count. Hint: Try the most_common() method of the returned Counter object.
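A minimal sketch of Counter and the most_common() hint on a toy word list (assumed data, not a real text):

```python
from collections import Counter

# Count occurrences in a small list of words.
words = ['die', 'welt', 'die', 'bühne', 'die']
counts = Counter(words)

print(counts['die'])          # 3
print(counts.most_common(1))  # [('die', 3)]
```

Counter behaves like a dictionary from element to count, and most_common(n) returns the n elements with the highest counts, ordered descending.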

Plotting the output of Tucholsky

Using the plotting functionality of Pandas introduced in Exercise 1, we can create an overview of Tucholsky’s “productivity” over the years.

  • First, create a sub-dataframe by selecting the columns for year and word count, making a list of the column names cols = ["year","wordCount"]

  • Second, group the sub-dataframe by the year column

  • Third, calculate the .mean()

  • Then, plot the yearly means as a bar graph by using .plot.bar()
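The four steps can be sketched end to end on a tiny synthetic dataframe (assumed toy data, not the Tucholsky corpus):

```python
import pandas as pd

# Toy data with the same column names as the exercise dataframe.
toy = pd.DataFrame({
    'year': [1920, 1920, 1921],
    'wordCount': [100, 200, 300],
})

cols = ['year', 'wordCount']
subdata = toy[cols]                # 1. select the columns
grouped = subdata.groupby('year')  # 2. group by year
mean = grouped.mean()              # 3. mean word count per year
print(mean)
# mean.plot.bar()                  # 4. plot (requires matplotlib)
```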

Define the columns

#cols =

Select the sub-dataframe.

#subdata = data[cols]

Group the dataframe by years

#grouped = 

Calculate the mean number of words for each year

#mean = grouped

Plot the data as a bar graph

#mean

Save extended data to new file

#data.to_json('PATH_TO_NEW_FILE', orient='records', lines=True)