Natural Language Processing

NLP

Natural language processing (NLP) tries to understand, interpret, and manipulate human language.

  • In Python: the nltk or spacy packages

  • Routines for part-of-speech tagging, named-entity recognition, tokenization, and more
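A minimal spaCy sketch of these routines, assuming the small English model has been installed (python -m spacy download en_core_web_sm); the sample sentence is made up for this illustration:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Kurt Tucholsky worked as a journalist in Berlin.')

# tokenization together with part-of-speech tags
[(token.text, token.pos_) for token in doc]

# named-entity recognition
[(ent.text, ent.label_) for ent in doc.ents]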

Basic concepts and routines

For a given text:

  • Recognize the sentence boundaries

  • In each sentence, find all tokens (i.e. words, possibly including punctuation)

  • Lemmatization and stemming are two processes for finding the roots of words

    • The lemma of went is go; for nouns, mice becomes mouse

    • The stem is the part of a word that does not change, e.g. the stem of produced is produc because of words like production

  • Identify parts of speech (nouns, verbs, …) for every lemmatized word

Sentences

  • Recognize the sentence boundaries

import pandas as pd
import nltk
# the Punkt sentence tokenizer data may need a one-time download: nltk.download('punkt')
data = pd.read_json('./data/df_tucholsky.json', lines=True)
text = data.text.iloc[0]

sentences = nltk.sent_tokenize(text)

sentences[0]

# output:
'Der Floh Im Departement du Gard – ganz richtig, da, wo Nîmes liegt und der Pont du Gard: im südlichen Frankreich – da saß in einem Postbüro ein älteres Fräulein als Beamtin, die hatte eine böse Angewohnheit: sie machte ein bißchen die Briefe auf und las sie.'

Tokens

  • Recognize the sentence boundaries

  • In each sentence, find all tokens

    • could be words only (a filtering sketch follows after the example below)

    • or include punctuation

import pandas as pd
import nltk
data = pd.read_json('./data/df_tucholsky.json', lines=True)
text = data.text.iloc[0]

sentences = nltk.sent_tokenize(text)
nltk.word_tokenize(sentences[0])

# output:
['Der','Floh','Im','Departement','du','Gard','–','ganz','richtig',',','da',',','wo','Nîmes',
'liegt','und','der','Pont','du','Gard',...]
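If only the words are wanted, the punctuation tokens can be filtered out afterwards; a minimal sketch reusing the sentences from the code above and plain str.isalpha() (hyphenated or numeric tokens would need extra rules):

tokens = nltk.word_tokenize(sentences[0])

# keep only purely alphabetic tokens, dropping punctuation such as ',' and '–'
words_only = [token for token in tokens if token.isalpha()]
words_only[:10]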

Lemmatization

  • Recognize the sentence boundaries

  • In each sentence, find all tokens

  • Lemmatization to find the roots of tokens

    • The lemma of goes is go; for nouns, mice becomes mouse

import nltk
# WordNet data may need a one-time download: nltk.download('wordnet')
lemmatizer = nltk.stem.WordNetLemmatizer()
lemmatizer.lemmatize('goes', pos='v')  # pos='v' marks the token as a verb

# output:
'go'
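The WordNet lemmatizer assumes the part of speech noun by default, so verbs are only reduced when pos is given; a small sketch, with the expected results noted as comments reflecting typical WordNet behaviour:

import nltk
lemmatizer = nltk.stem.WordNetLemmatizer()

lemmatizer.lemmatize('goes')           # default pos='n': the verb form stays 'goes'
lemmatizer.lemmatize('goes', pos='v')  # 'go'
lemmatizer.lemmatize('mice')           # irregular noun plural: 'mouse'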

Stemming

  • Recognize the sentence boundaries

  • In each sentence, find all tokens

  • Stemming to find the roots of tokens

    • The stem is the part of a token that does not change, e.g. the stem of produced is produc

import nltk
stemmer = nltk.stem.SnowballStemmer('english')
stemmer.stem('produced')

# output:
'produc'

Lemmatization or Stemming

  • Stemming: a simpler, rule-based process

  • Lemmatization: computationally more expensive, based on dictionaries (see the comparison sketch below)

Strongly language dependent

  • Large corpora are mostly English, so results are poorer for less common languages

  • For example: there is no German lemmatizer in standard NLTK (a German Snowball stemmer does exist, though)
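A minimal sketch comparing the two approaches side by side (the word list is chosen purely for illustration; WordNet data may need a one-time nltk.download('wordnet')):

import nltk

stemmer = nltk.stem.SnowballStemmer('english')
lemmatizer = nltk.stem.WordNetLemmatizer()

# stem and lemma for a few sample words
for word in ['produced', 'production', 'mice', 'mouse']:
    print(word, '->', stemmer.stem(word), '/', lemmatizer.lemmatize(word))

# NLTK does ship Snowball stemmers for other languages, German among them
german_stemmer = nltk.stem.SnowballStemmer('german')
german_stemmer.stem('gelaufen')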


Parts of Speech

  • Recognize the sentence boundaries

  • In each sentence, find all tokens

  • Lemmatization or Stemming to find roots of tokens

  • Identify parts of speech (nouns, verbs, …) for every transformed token

import spacy
# the German model may need a one-time install: python -m spacy download de_core_news_md
nlp = spacy.load('de_core_news_md')
doc = nlp(text)  # text loaded with Pandas, as before
tokens = [(token.lemma_, token.pos_) for token in doc]
tokens[:10]

# output:
[('\n\n                              ', 'SPACE'),
('der', 'DET'),
('Floh', 'NOUN'),
('\n\n                              ', 'SPACE'),
('Im', 'ADP'),
('Departement', 'NOUN'),
('du', 'PROPN'),
('Gard', 'PROPN'),
('–', 'PUNCT'),
('ganz', 'ADV')]

Key concepts

  • Concordance

    • an aligned list of all occurrences of a search word, shown together with their surrounding context

    • derived from that: find words that appear in similar contexts, or return the contexts shared by two or more words

  • Collocations

    • words that often occur together, e.g. “crystal clear” or “nuclear family”

  • Generalized version: “n-grams”

    • groups of n words or characters, e.g. the character bi-grams of crystal: “cr”, “ry”, “ys”, “st”, “ta”, “al” (see the NLTK sketch below)
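A minimal NLTK sketch of these concepts, reusing the Tucholsky text loaded earlier (the search word 'Floh' is simply taken from the first sentence; concordance and similar print their results):

import nltk

tokens = nltk.word_tokenize(text)   # text loaded with Pandas, as before
nltk_text = nltk.Text(tokens)

# concordance: every occurrence of a word, aligned with its surrounding context
nltk_text.concordance('Floh')

# words that appear in contexts similar to the search word
nltk_text.similar('Floh')

# collocations: word pairs that occur together unusually often (ranked by PMI here)
finder = nltk.collocations.BigramCollocationFinder.from_words(tokens)
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder.nbest(bigram_measures.pmi, 10)

# character bi-grams of a single word
list(nltk.ngrams('crystal', 2))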