Natural Language Processing¶
NLP¶
NLP: Natural language processing tries to understand, interpret and manipulate human language.
In Python: the nltk or spacy packages
Routines for part-of-speech tagging, named-entity recognition, tokenization, and more
Basic concepts and routines¶
For a given text:
Recognize the sentence boundaries
In each sentence, find all tokens (i.e. words without punctuation)
Lemmatization and Stemming describe processes for finding the roots of words
The lemma of went is go; for nouns, mice becomes mouse
The stem is the part of a word that does not change, e.g. the stem of produced is produc, because of related forms like production
Identify parts of speech (nouns, verbs, ...) for every lemmatized word
Sentences¶
Recognize the sentence boundaries
import pandas as pd
import nltk
data = pd.read_json('./data/df_tucholsky.json', lines=True)
text = data.text.iloc[0]
sentences = nltk.sent_tokenize(text)  # may require nltk.download('punkt') once
sentences[0]
# output gives
'Der Floh Im Departement du Gard – ganz richtig, da, wo Nîmes liegt und der Pont du Gard: im südlichen Frankreich – da saß in einem Postbüro ein älteres Fräulein als Beamtin, die hatte eine böse Angewohnheit: sie machte ein bißchen die Briefe auf und las sie.'
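By default sent_tokenize uses the English Punkt model; since the Tucholsky text is German, the language can also be passed explicitly. A minimal sketch, reusing the data loaded above:
import pandas as pd
import nltk
data = pd.read_json('./data/df_tucholsky.json', lines=True)
text = data.text.iloc[0]
# select the German Punkt model so boundaries and abbreviations are detected with German rules
sentences = nltk.sent_tokenize(text, language='german')
sentences[0]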
Tokens¶
Recognize the sentence boundaries
In each sentence, find all tokens
could be words only (see the filtering sketch after the example below)
or include punctuation
import pandas as pd
import nltk
data = pd.read_json('./data/df_tucholsky.json', lines=True)
text = data.text.iloc[0]
sentences = nltk.sent_tokenize(text)
nltk.word_tokenize(sentences[0])
# output gives
['Der','Floh','Im','Departement','du','Gard','–','ganz','richtig',',','da',',','wo','Nîmes',
'liegt','und','der','Pont','du','Gard',...]
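If only the word tokens are wanted, the punctuation tokens can be filtered out afterwards; a minimal sketch continuing from the example above (the isalpha() filter is one simple choice and also drops numbers):
tokens = nltk.word_tokenize(sentences[0])
# keep only tokens consisting of letters, i.e. drop punctuation such as ',' and '–'
words_only = [token for token in tokens if token.isalpha()]
words_only[:10]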
Lemmatization¶
Recognize the sentence boundaries
In each sentence, find all tokens
Lemmatization to find the roots of tokens
The lemma of goes is go; for nouns, mice becomes mouse
import nltk
lemmatizer = nltk.stem.WordNetLemmatizer()  # may require nltk.download('wordnet') once
# the WordNet lemmatizer assumes nouns by default,
# so the part of speech must be given for verbs like 'goes'
lemmatizer.lemmatize('goes', pos='v')
# output gives
'go'
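The irregular forms mentioned above (went → go, mice → mouse) are found via WordNet's exception lists, again provided the part of speech fits; a short sketch, continuing with the lemmatizer from above:
lemmatizer.lemmatize('went', pos='v')  # irregular verb, gives 'go'
lemmatizer.lemmatize('mice')           # default part of speech is noun, gives 'mouse'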
Stemming¶
Recognize the sentence boundaries
In each sentence, find all tokens
Stemming to find the roots of tokens
The stem is the part of a token that does not change, e.g. the stem of produced is produc
import nltk
stemmer = nltk.stem.SnowballStemmer('english')
stemmer.stem('produced')
# output gives
'produc'
Lemmatization or Stemming¶
Stemming is the simpler, rule-based process
Lemmatization is computationally more expensive and relies on dictionaries (compare the sketch below)
Both are strongly language-dependent
The large corpora are mostly English, so results for less common languages tend to be poorer
For example: there is no German lemmatizer in standard NLTK (a German stemmer, sketched after the links below, can serve as a rough substitute)
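A quick side-by-side on a few English words shows the difference; a minimal sketch (the word list is only an illustrative choice):
import nltk
stemmer = nltk.stem.SnowballStemmer('english')
lemmatizer = nltk.stem.WordNetLemmatizer()  # may require nltk.download('wordnet') once
for word in ['studies', 'produced', 'mice']:
    # stemming cuts off suffixes by rule, lemmatization looks the word up in WordNet
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word))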
Related links:
Leipzig Corpus Miner (LCM)
Erlangen University Corpus-Linguistik
Hanover Tagger Source
Classical Languages Tool Kit CLTK
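Since standard NLTK has no German lemmatizer, the German Snowball stemmer is at least available as a rough substitute; a minimal sketch (the example words are taken from the Tucholsky sentence above):
import nltk
stemmer_de = nltk.stem.SnowballStemmer('german')
for word in ['Briefe', 'machte', 'Angewohnheit']:
    # the Snowball stemmer only strips suffixes; it does not know irregular forms
    print(word, '->', stemmer_de.stem(word))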
Parts of Speech¶
Recognize the sentence boundaries
In each sentence, find all tokens
Lemmatization or Stemming to find roots of tokens
Identify parts of speech (nouns, verbs, ...) for every transformed token
import spacy
nlp = spacy.load('de_core_news_md')  # install the model once via: python -m spacy download de_core_news_md
doc = nlp(text)  # text loaded with pandas, as before
tokens = [(token.lemma_,token.pos_) for token in doc]
tokens[:10]
# output gives
[('\n\n ', 'SPACE'),
('der', 'DET'),
('Floh', 'NOUN'),
('\n\n ', 'SPACE'),
('Im', 'ADP'),
('Departement', 'NOUN'),
('du', 'PROPN'),
('Gard', 'PROPN'),
('–', 'PUNCT'),
('ganz', 'ADV')]
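The tags can then be used to summarise or filter the document, e.g. counting how often each part of speech occurs or keeping only the nouns; a minimal sketch on the doc from above:
from collections import Counter
# count the coarse-grained POS tags in the document
pos_counts = Counter(token.pos_ for token in doc)
pos_counts.most_common(5)

# keep only the nouns, as lemmata
nouns = [token.lemma_ for token in doc if token.pos_ == 'NOUN']
nouns[:10]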
Key concepts¶
Concordance
a list of all occurrences of a search word, shown aligned in their context (sketched at the end of this section)
derived from that: find words that appear in similar contexts, or the contexts shared by two or more words
Collocations
words that often occur together, e.g. “crystal clear” or “nuclear family”
Expanded version: “n-grams”
groups of n words or characters, e.g. the character bi-grams of “crystal”: “cr”, “ry”, “ys”, “st”, “ta”, “al”
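nltk provides these concepts directly; a minimal sketch on the Tucholsky text from above (the search word 'Briefe' is just an illustrative choice and may yield few matches):
import nltk
tokens = nltk.word_tokenize(text)  # text loaded with pandas, as before
text_obj = nltk.Text(tokens)

# concordance: every occurrence of the word, aligned in its context
text_obj.concordance('Briefe')

# words that occur in similar contexts
# (Text.common_contexts([...]) would give contexts shared by several words)
text_obj.similar('Briefe')

# collocations: pairs of words that occur together unusually often
text_obj.collocations()

# character bi-grams, here of 'crystal'
[''.join(bigram) for bigram in nltk.ngrams('crystal', 2)]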