Load JSONL with pandas¶
This notebook is purposefully not ready to be evaluated! You need to read and check every code and text cell and fill in some missing pieces. Mostly, they are marked by three question marks (???).
At the end, there is a bonus question with a slightly more involved exercise.
Read file into pandas¶
Enter the path to your downloaded corpus file.
import pandas as pd
data = pd.read_json('../data/df_tucholsky.json', lines=True)
The contents of the file are now assigned to the variable data.
data.head()
The size of the dataframe (i.e. table) can be obtained by using
data.shape
The first number gives the number of rows, the second the number of columns.
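As a small illustration of what lines=True does, the sketch below reads a two-record JSONL string instead of a file. The sample records and the names jsonl_text and sample are invented for this example and are not part of the corpus.

```python
import io
import pandas as pd

# Hypothetical two-record JSONL sample: one JSON object per line.
jsonl_text = (
    '{"title": "First text", "year": 1911}\n'
    '{"title": "Second text", "year": 1920}\n'
)

# lines=True tells pandas to parse each line as a separate record.
sample = pd.read_json(io.StringIO(jsonl_text), lines=True)

print(sample.shape)           # (number of rows, number of columns)
print(list(sample.columns))   # column names taken from the JSON keys
```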
Check structure and access data¶
You can access a column by using the column name, e.g. data['name'].
data['ids']
To read a specific row, you can use the indexer .iloc[NUMBER] to access it by position (counting from 0), or you can select rows by content.
Selecting by content happens in two steps.
First, you create a list of True/False values (a boolean mask), and then you reduce the data to the rows where the mask is True.
To create a boolean mask for numbers, you can use comparison operators, e.g. == for equal to or >= for greater than or equal to.
Example: data['COLUMN'] >= 10 or data['COLUMN'] == 17
You can combine several conditions with the element-wise operators & (AND) and | (OR), e.g. cond = (cond1 & cond2). The Python keywords and and or do not work on pandas columns and raise an error.
Example: data[(cond1 & cond2)]
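The two-step selection can be tried out on a small toy table first. This is only a sketch: the dataframe toy and its columns value and kind are made up for illustration and do not exist in the corpus.

```python
import pandas as pd

# Hypothetical toy table; "value" and "kind" are invented column names.
toy = pd.DataFrame({
    "value": [5, 10, 17, 23],
    "kind": ["a", "b", "a", "b"],
})

# Step 1: build boolean masks (one True/False entry per row).
cond1 = toy["value"] >= 10
cond2 = toy["kind"] == "a"

# Step 2: keep only the rows where the (combined) mask is True.
both = toy[cond1 & cond2]     # AND: rows that satisfy both conditions
either = toy[cond1 | cond2]   # OR: rows that satisfy at least one

# Positional access: the third row (counting starts at 0).
third_row = toy.iloc[2]
```

When writing the comparisons inline, each one needs its own parentheses, because & and | bind more tightly than >= and ==.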
Select all rows of the year 1911
BONUS: Select all rows for the years 1920 to 1925
yearMask1 = data.year == ???
data[yearMask1]
yearMask2 = (data.year >=1920) & ???
data[yearMask2].head(2)
To select rows that contain a given text, we can use string matching. To access all rows that contain the text hello, we create a mask cond = data["COLUMN"].str.contains("hello") and apply it to the data as above.
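A minimal sketch of such a text mask on invented data follows. The column greeting is hypothetical. The na and case arguments are optional parameters of .str.contains that may be useful if a column has missing entries or mixed capitalisation.

```python
import pandas as pd

# Hypothetical toy column with one missing entry.
toy = pd.DataFrame({
    "greeting": ["hello world", "Hello there", "goodbye", None],
})

# na=False treats missing values as "no match", so the mask stays
# purely boolean and can be used for row selection.
cond = toy["greeting"].str.contains("hello", na=False)
matches = toy[cond]

# case=False additionally ignores upper/lower case.
cond_any_case = toy["greeting"].str.contains("hello", case=False, na=False)
matches_any_case = toy[cond_any_case]
```

By default the search is case-sensitive and the search string is interpreted as a regular expression.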
Select all rows where the title contains the word Krieg (German for war)
BONUS: Select all rows where the title contains Krieg and which were published before 1924
text = 'test string'
['test' in text]
textMask = data.text.str.contains(???)
data[textMask]
To check whether the text was found correctly, we can simply read the full text of one of the matching works.
data[textMask].text.iloc[2]
Group data by publication year¶
To group a dataset by a certain column, we can use the pandas .groupby method. It returns a grouped object; iterating over it yields tuples of the value of the grouped column and the corresponding sub-dataframe, e.g. (1911, data_for_1911).
To get a specific group, you can use .get_group(VALUE) on the resulting grouped object.
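The behaviour of the grouped object can be explored on a toy table. The dataframe toy and its columns city and value are invented for this sketch.

```python
import pandas as pd

# Hypothetical toy table, invented for illustration.
toy = pd.DataFrame({
    "city": ["Berlin", "Paris", "Berlin", "Paris", "Berlin"],
    "value": [1, 2, 3, 4, 5],
})

grouped_toy = toy.groupby("city")

# Iterating yields (group value, sub-dataframe) pairs.
rows_per_city = {}
for city, sub_df in grouped_toy:
    rows_per_city[city] = len(sub_df)

# Fetch the sub-dataframe of a single group directly.
berlin = grouped_toy.get_group("Berlin")
```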
grouped = data.groupby('year')
Select the group for the year 1917
grouped.get_group(???)
The function .size() returns the size of each group (i.e. its number of rows).
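On an invented toy table (the names toy, city and value are hypothetical), the result of .size() is a pandas Series with one entry per group:

```python
import pandas as pd

# Hypothetical toy table, invented for illustration.
toy = pd.DataFrame({
    "city": ["Berlin", "Paris", "Berlin", "Paris", "Berlin"],
    "value": [1, 2, 3, 4, 5],
})

# One entry per group; the index holds the group values.
sizes = toy.groupby("city").size()
print(sizes)
```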
Get a list of sizes for each year group
grouped.???
Plot resulting dataset¶
Sometimes a plot gives a good overview of a dataset.
For dataframes containing only numbers, it's easy to generate graphs with the pandas built-in function .plot(). Bar charts can be obtained with .plot.bar().
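A minimal bar chart sketch on invented numbers is shown below. It assumes matplotlib is installed, which pandas needs for plotting. The Series counts and its labels are hypothetical, and the explicit Agg backend is only there so the snippet also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Hypothetical counts, invented for illustration.
counts = pd.Series([3, 7, 5], index=["a", "b", "c"], name="count")

# .plot.bar() draws one bar per entry and returns a matplotlib Axes object,
# which can be used to adjust labels afterwards.
ax = counts.plot.bar()
ax.set_xlabel("category")
ax.set_ylabel("count")
```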
Generate a plot of the number of publications per year
Bonus: Generate a plot of the number of publications that were published under the different pseudonyms of Tucholsky (‘Wrobel’, ‘Hauser’, ‘Tiger’, ‘Panter’) and compare them to the overall publication output and to the publications published under his real name. Can you normalize the data for each year?