Natural Language Processing¶
NLP¶
NLP: Natural language processing tries to understand, interpret and manipulate human language.
In Python: the nltk or spacy packages
Routines for part-of-speech tagging, named-entity recognition, tokenization, and more
Basic concepts and routines¶
For a given text:
Recognize the sentence boundaries
In each sentence, find all tokens (i.e. words without punctuation)
Lemmatization and Stemming describe processes for finding the roots of words
The lemma of went is go; for nouns, mice becomes mouse
The stem is the part of a word that does not change, e.g. the stem of produced is produc, because of related forms like production
Identify parts of speech (nouns, verbs, ...) for every lemmatized word
Sentences¶
Recognize the sentence boundaries
import pandas as pd
import nltk
data = pd.read_json('./data/df_tucholsky.json', lines=True)
text = data.text.iloc[0]
sentences = nltk.sent_tokenize(text)  # may require nltk.download('punkt') once
sentences[0]
# output gives
'Der Floh Im Departement du Gard – ganz richtig, da, wo Nîmes liegt und der Pont du Gard: im südlichen Frankreich – da saß in einem Postbüro ein älteres Fräulein als Beamtin, die hatte eine böse Angewohnheit: sie machte ein bißchen die Briefe auf und las sie.'
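By default sent_tokenize uses the English Punkt model; since the Tucholsky text is German, the language can also be passed explicitly. A minimal sketch, reusing the data loaded above:
import pandas as pd
import nltk
data = pd.read_json('./data/df_tucholsky.json', lines=True)
text = data.text.iloc[0]
# select the German Punkt model so boundaries and abbreviations are detected with German rules
sentences = nltk.sent_tokenize(text, language='german')
sentences[0]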
Tokens¶
Recognize the sentence boundaries
In each sentence, find all tokens
could be words only (see the filtering sketch after the example below)
or include punctuation
import pandas as pd
import nltk
data = pd.read_json('./data/df_tucholsky.json', lines=True)
text = data.text.iloc[0]
sentences = nltk.sent_tokenize(text)
nltk.word_tokenize(sentences[0])
# output gives
['Der','Floh','Im','Departement','du','Gard','–','ganz','richtig',',','da',',','wo','Nîmes',
'liegt','und','der','Pont','du','Gard',...]
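If only the word tokens are wanted, the punctuation tokens can be filtered out afterwards; a minimal sketch continuing from the example above (the isalpha() filter is one simple choice and also drops numbers):
tokens = nltk.word_tokenize(sentences[0])
# keep only tokens consisting of letters, i.e. drop punctuation such as ',' and '–'
words_only = [token for token in tokens if token.isalpha()]
words_only[:10]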
Lemmatization¶
Recognize the sentence boundaries
In each sentence, find all tokens
Lemmatization to find the roots of tokens
The lemma of goes is go; for nouns, mice becomes mouse
import nltk
lemmatizer = nltk.stem.WordNetLemmatizer()  # may require nltk.download('wordnet') once
# the WordNet lemmatizer assumes nouns by default,
# so the part of speech must be given for verbs like 'goes'
lemmatizer.lemmatize('goes', pos='v')
# output gives
'go'
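The irregular forms mentioned above (went → go, mice → mouse) are found via WordNet's exception lists, again provided the part of speech fits; a short sketch, continuing with the lemmatizer from above:
lemmatizer.lemmatize('went', pos='v')  # irregular verb, gives 'go'
lemmatizer.lemmatize('mice')           # default part of speech is noun, gives 'mouse'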
Stemming¶
Recognize the sentence boundaries
In each sentence, find all tokens
Stemming to find the roots of tokens
The stem is the part of a token that does not change, e.g. the stem of produced is produc
import nltk
stemmer = nltk.stem.SnowballStemmer('english')
stemmer.stem('produced')
# output gives
'produc'
Lemmatization or Stemming¶
Stemming is the simpler, rule-based process
Lemmatization is computationally more expensive and relies on dictionaries (compare the sketch below)
Both are strongly language-dependent
The large corpora are mostly English, so results for less common languages tend to be poorer
For example: there is no German lemmatizer in standard NLTK (a German stemmer, sketched after the links below, can serve as a rough substitute)
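A quick side-by-side on a few English words shows the difference; a minimal sketch (the word list is only an illustrative choice):
import nltk
stemmer = nltk.stem.SnowballStemmer('english')
lemmatizer = nltk.stem.WordNetLemmatizer()  # may require nltk.download('wordnet') once
for word in ['studies', 'produced', 'mice']:
    # stemming cuts off suffixes by rule, lemmatization looks the word up in WordNet
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word))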
Related links:
Leipzig Corpus Miner (LCM)
Erlangen University Corpus-Linguistik
Hanover Tagger Source
Classical Languages Tool Kit CLTK
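Since standard NLTK has no German lemmatizer, the German Snowball stemmer is at least available as a rough substitute; a minimal sketch (the example words are taken from the Tucholsky sentence above):
import nltk
stemmer_de = nltk.stem.SnowballStemmer('german')
for word in ['Briefe', 'machte', 'Angewohnheit']:
    # the Snowball stemmer only strips suffixes; it does not know irregular forms
    print(word, '->', stemmer_de.stem(word))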
Parts of Speech¶
Recognize the sentence boundaries
In each sentence, find all tokens
Lemmatization or Stemming to find roots of tokens
Identify parts of speech (nouns, verbs, ...) for every transformed token
import spacy
nlp = spacy.load('de_core_news_md')  # install the model once via: python -m spacy download de_core_news_md
doc = nlp(text)  # text loaded with pandas, as before
tokens = [(token.lemma_,token.pos_) for token in doc]
tokens[:10]
# output gives
[('\n\n ', 'SPACE'),
('der', 'DET'),
('Floh', 'NOUN'),
('\n\n ', 'SPACE'),
('Im', 'ADP'),
('Departement', 'NOUN'),
('du', 'PROPN'),
('Gard', 'PROPN'),
('–', 'PUNCT'),
('ganz', 'ADV')]
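The tags can then be used to summarise or filter the document, e.g. counting how often each part of speech occurs or keeping only the nouns; a minimal sketch on the doc from above:
from collections import Counter
# count the coarse-grained POS tags in the document
pos_counts = Counter(token.pos_ for token in doc)
pos_counts.most_common(5)

# keep only the nouns, as lemmata
nouns = [token.lemma_ for token in doc if token.pos_ == 'NOUN']
nouns[:10]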
Key concepts¶
Concordance
a list of all occurrences of a search word, shown aligned in their context (sketched at the end of this section)
derived from that: find words that appear in similar contexts, or the contexts shared by two or more words
Collocations
words that often occur together, e.g. “crystal clear” or “nuclear family”
Expanded version: “n-grams”
groups of n words or characters, e.g. the character bi-grams of “crystal”: “cr”, “ry”, “ys”, “st”, “ta”, “al”
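nltk provides these concepts directly; a minimal sketch on the Tucholsky text from above (the search word 'Briefe' is just an illustrative choice and may yield few matches):
import nltk
tokens = nltk.word_tokenize(text)  # text loaded with pandas, as before
text_obj = nltk.Text(tokens)

# concordance: every occurrence of the word, aligned in its context
text_obj.concordance('Briefe')

# words that occur in similar contexts
# (Text.common_contexts([...]) would give contexts shared by several words)
text_obj.similar('Briefe')

# collocations: pairs of words that occur together unusually often
text_obj.collocations()

# character bi-grams, here of 'crystal'
[''.join(bigram) for bigram in nltk.ngrams('crystal', 2)]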