Getting started with Apache OpenNLP

The Apache OpenNLP library is a machine learning based toolkit for processing natural language text. It supports maximum entropy and perceptron based machine learning, and it contains components for the common natural language processing pipeline: sentence detector, tokenizer, name finder, document categorizer, part-of-speech tagger, chunker, parser, and coreference resolution.
It provides both a command line interface and a Java application programming interface. The latest stable version is 1.9.2, licensed under the Apache License 2.0. In this article, we will walk through its usage with simple application examples.
Tokenizer:
The WhitespaceTokenizer splits text into an array of tokens on whitespace; spaces, tabs, and newline characters all act as separators and do not appear in the tokens. The code snippet below tokenizes a sentence and checks the resulting array of tokens.
public void givenWhitespaceTokenizer_whenTokenize_thenTokensAreDetected()
        throws Exception {
    // WhitespaceTokenizer splits only on whitespace, so punctuation stays attached to words
    WhitespaceTokenizer tokenizer = WhitespaceTokenizer.INSTANCE;
    String[] tokens = tokenizer.tokenize("It is my first attempt, trying to learn apache opennlp.");
    String[] expected = {"It", "is", "my", "first", "attempt,", "trying", "to", "learn", "apache", "opennlp."};
    assertArrayEquals(expected, tokens);
}
If we want punctuation marks to be split out as well, we can use SimpleTokenizer, which breaks the sentence into words and treats each punctuation character as a separate token.
public void givenSimpleTokenizer_whenTokenize_thenTokensAreDetected()
        throws Exception {
    // SimpleTokenizer splits on character classes, so punctuation becomes separate tokens
    SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
    String[] tokens = tokenizer.tokenize("It is my first attempt, trying to learn apache opennlp.");
    String[] expected = {"It", "is", "my", "first", "attempt", ",", "trying", "to", "learn", "apache", "opennlp", "."};
    assertArrayEquals(expected, tokens);
}
Apache OpenNLP provides pre-trained models for basic language processing. The code below tokenizes text using the pre-trained English token model.
public void givenEnglishModel_whenTokenize_thenTokensAreDetected()
        throws Exception {
    // Load the pre-trained English token model and tokenize with the maxent tokenizer
    InputStream inputStream = getClass().getResourceAsStream("/models/en-token.bin");
    TokenizerModel model = new TokenizerModel(inputStream);
    TokenizerME tokenizer = new TokenizerME(model);
    String[] tokens = tokenizer.tokenize("Its my first attempt to learn apache nlp tutorial.");
    String[] expected = {"Its", "my", "first", "attempt", "to", "learn", "apache", "nlp", "tutorial", "."};
    assertArrayEquals(expected, tokens);
}
Named Entity Recognition
Named Entity Recognition (NER) extracts information from unstructured text and categorizes it into predefined groups. Apache OpenNLP provides models for extracting person names, locations, organizations, money amounts, percentages, dates, and times.
Suppose we want to extract cricketers' names from a news article. The code below recognizes person names using the pre-trained model.
public void givenEnglishPersonModel_whenNER_thenPersonsAreDetected()
        throws Exception {
    // Tokenize the text with the pre-trained English token model
    InputStream tokenModelIn = getClass().getResourceAsStream("/models/en-token.bin");
    TokenizerModel tokenizerModel = new TokenizerModel(tokenModelIn);
    TokenizerME tokenizer = new TokenizerME(tokenizerModel);
    String[] tokens =
            tokenizer.tokenize("Legends of the game, masters of their art –" +
                    " Muttiah Muralitharan, Anil Kumble and Shane Warne " +
                    "are the three leading wicket-takers in Tests");

    // Load the pre-trained person name finder model and find person spans in the tokens
    InputStream inputStreamNameFinder = getClass().getResourceAsStream("/models/en-ner-person.bin");
    TokenNameFinderModel model = new TokenNameFinderModel(inputStreamNameFinder);
    NameFinderME nameFinderME = new NameFinderME(model);
    List<Span> spans = Arrays.asList(nameFinderME.find(tokens));
    for (Span span : spans) {
        System.out.println(span.getType() + " " + span + " " + span.getProb());
    }
}
This prints the span type (person), the token positions of each detected name, and the probability of each match:
person [10..12) person 0.8058874647477575
person [13..15) person 0.9360802286465706
person [16..18) person 0.889340515434591
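The spans report token offsets rather than the names themselves. To recover the actual name strings, the spans can be mapped back onto the token array, for example with Span.spansToStrings (a minimal sketch reusing the tokens and nameFinderME from the example above):
// Convert the detected spans back into the underlying token text,
// e.g. "Muttiah Muralitharan", "Anil Kumble", "Shane Warne"
Span[] nameSpans = nameFinderME.find(tokens);
String[] names = Span.spansToStrings(nameSpans, tokens);
for (String name : names) {
    System.out.println(name);
}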
en-ner-person.bin is the pre-trained model for extracting person names. OpenNLP likewise ships a pre-trained model for each of the other entity types, as listed below.
Date name finder model: en-ner-date.bin
Location name finder model: en-ner-location.bin
Money name finder model: en-ner-money.bin
Organization name finder model: en-ner-organization.bin
Percentage name finder model: en-ner-percentage.bin
Person name finder model: en-ner-person.bin
Time name finder model: en-ner-time.bin
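Each of these models plugs into the same TokenNameFinderModel and NameFinderME pipeline; only the model file changes. For example, a location finder could be set up as follows (a sketch, assuming the location model is placed under /models like the others and reusing the tokens array from above):
// Swap in the location model; the rest of the pipeline stays the same
InputStream locationModelIn = getClass().getResourceAsStream("/models/en-ner-location.bin");
TokenNameFinderModel locationModel = new TokenNameFinderModel(locationModelIn);
NameFinderME locationFinder = new NameFinderME(locationModel);
Span[] locationSpans = locationFinder.find(tokens);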
POS Tagger
A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc., although generally computational applications use more fine-grained POS tags like 'noun-plural'.
The tagged text can then be used for further processing. The example below tokenizes a sentence and assigns a part-of-speech tag to each token.
public void parts_of_speech_tagger() throws Exception {
    // Tokenize the sentence with the pre-trained English token model
    InputStream tokenModelIn = getClass().getResourceAsStream("/models/en-token.bin");
    TokenizerModel tokenizerModel = new TokenizerModel(tokenModelIn);
    Tokenizer tokenizer = new TokenizerME(tokenizerModel);
    String[] tokens = tokenizer.tokenize("I am trying to tag the tokens");

    // Load the pre-trained maxent POS model and tag each token
    InputStream posModelIn = getClass().getResourceAsStream("/models/en-pos-maxent.bin");
    POSModel posModel = new POSModel(posModelIn);
    POSTaggerME posTaggerME = new POSTaggerME(posModel);
    String[] tags = posTaggerME.tag(tokens);
    double[] probs = posTaggerME.probs();

    System.out.println("Token\t:\tTag\t:\tProbability\n---------------------------------------------");
    for (int i = 0; i < tokens.length; i++) {
        System.out.println(tokens[i] + "\t:\t" + tags[i] + "\t:\t" + probs[i]);
    }
}
It will print each token together with its tag and probability. The tags come from the Penn Treebank tag set: PRP is a personal pronoun, VBP is a verb in non-3rd person singular present form, and so on.
Token : Tag : Probability
---------------------------------------------
I : PRP : 0.9850802753661616
am : VBP : 0.975984809797987
trying : VBG : 0.9884076110770207
to : TO : 0.9948503758260098
tag : VB : 0.9713875923880564
the : DT : 0.9447257899870084
tokens : NNS : 0.8032102920939485
Sentence Detection
The pre-trained model en-sent.bin can be used to detect sentence boundaries. The example below shows how a paragraph is split into sentences.
public void givenEnglishModel_whenDetect_thenSentencesAreDetected()
        throws Exception {
    String paragraph = "This is a statement. This is another statement."
            + " Now is an abstract word for time, "
            + "that is always flying. And my email address is google@gmail.com.";

    // Load the pre-trained English sentence model and split the paragraph into sentences
    InputStream is = getClass().getResourceAsStream("/models/en-sent.bin");
    SentenceModel model = new SentenceModel(is);
    SentenceDetector sdetector = new SentenceDetectorME(model);
    String[] sentences = sdetector.sentDetect(paragraph);
    Assert.assertArrayEquals("Sentences detected successfully",
            new String[] {
                    "This is a statement.",
                    "This is another statement.",
                    "Now is an abstract word for time, that is always flying.",
                    "And my email address is google@gmail.com."},
            sentences);
}
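If character offsets are needed instead of the sentence strings, SentenceDetectorME also provides sentPosDetect, which returns spans into the original paragraph (a minimal sketch reusing the sdetector and paragraph from above):
// Get sentence boundaries as character offsets into the paragraph
Span[] sentenceSpans = sdetector.sentPosDetect(paragraph);
for (Span span : sentenceSpans) {
    System.out.println(span + " -> " + span.getCoveredText(paragraph));
}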
Command Line (CLI):
All the features of Apache OpenNLP are also available from the command line interface. Download the Apache OpenNLP distribution, untar it, and navigate to its bin directory.
For example, the sentence detector can be run as shown below.
nagappan@nagappan-Latitude-E5450:/opt/apache-opennlp-1.9.2/bin$ ls
brat-annotation-service brat-annotation-service.bat morfologik-addon morfologik-addon.bat opennlp opennlp.bat sampletext.txt sentences.txt
nagappan@nagappan-Latitude-E5450:/opt/apache-opennlp-1.9.2/bin$ cat sampletext.txt
This is a sample text file. We are going to check the number of sentences.
nagappan@nagappan-Latitude-E5450:/opt/apache-opennlp-1.9.2/bin$ ./opennlp SentenceDetector ../models/en-sent.bin < "sampletext.txt"
Loading Sentence Detector model ... done (0.052s)
This is a sample text file.
We are going to check the number of sentences.
Average: 666.7 sent/s
Total: 2 sent
Runtime: 0.003s
Execution time: 0.143 seconds
Reference:
Apache OpenNLP manual documentation - https://opennlp.apache.org/docs/1.9.2/manual/opennlp.html.