StanfordNLP Training Steps Verification and loadClassifier check


I need help verifying the training steps below, and can I add my own classifier to the -loadClassifier list?

-loadClassifier sample-ner-model.ser.gz,classifiers/english.all.3class.distsim.crf.ser.gz,classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz \

sample.txt

The fate of Lehman Brothers, the beleaguered investment bank, hung in the balance on Sunday as Federal Reserve officials and the leaders of major financial institutions continued to gather in emergency meetings trying to complete a plan to rescue the stricken bank. Several possible plans emerged from the talks, held at the Federal Reserve Bank of New York and led by Timothy R. Geithner, the president of the New York Fed, and Treasury Secretary Henry M. Paulson Jr.

Step 1: tokenize

java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer sample.txt > sample.tok

The fate of Lehman Brothers , the beleaguered investment bank , hung in the balance

. . .

president of the New York Fed , and Treasury Secretary Henry M. Paulson Jr. .

Step 2: classify

I need a better command to replace each EOL "\n" with "\tO\n". The Perl chomp is not working for me, so I edited sample.tsv manually.

perl -ne 'chomp; print "$_\tO\n"' sample.tok > sample.tsv

The	O
fate	O
of	O
Lehman	O
Brothers	O
,	O
the	O
beleaguered	O
investment	O
bank	O
,	O
hung	O
in	O
the	O
balance	O
. . .
president	O
of	O
the	O
New	O
York	O
Fed	O
,	O
and	O
Treasury	O
Secretary	O
Henry	O
M.	O
Paulson	O
Jr.	O
.	O
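As an alternative to the Perl one-liner, here is a minimal awk sketch that appends the background label O to every token. It assumes, and the question does not confirm this, that the chomp problem comes from Windows-style CRLF line endings, so it strips a trailing carriage return first. The three-token sample.tok written by printf is only a stand-in for the real tokenizer output.

```shell
# Tiny stand-in for sample.tok (one token per line); the second line
# deliberately ends in CRLF to mimic a file saved on Windows.
printf 'The\nfate\r\nof\n' > sample.tok

# Strip a trailing carriage return if present, skip empty lines,
# then append a tab and the background label O to each token.
awk '{ sub(/\r$/, "") } length($0) > 0 { print $0 "\tO" }' sample.tok > sample.tsv

cat sample.tsv
```

Because awk prints its own newline after each record, there is no need to chomp and re-add "\n" by hand.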

Step 3: adjust properties (sample.prop)

# location of the training file
trainFile = sample.tsv
# location where you would like to save (serialize) your
# classifier; adding .gz at the end automatically gzips the file,
# making it smaller, and faster to load
serializeTo = sample-ner-model.ser.gz
. . .
useTypeySequences=true
wordShape=chris2useLC

Step 4: modify the gold standard (sample.tsv)

The	O
fate	O
of	O
Lehman	ORG
Brothers	ORG
,	O
the	O
beleaguered	O
investment	O
bank	O
,	O
hung	O
in	O
the	O
balance	O
. . .
president	O
of	O
the	O
New	ORG
York	ORG
Fed	ORG
,	O
and	O
Treasury	PERS
Secretary	PERS
Henry	PERS
M.	PERS
Paulson	PERS
Jr.	PERS
.	O
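Since the gold file is edited by hand, a quick sanity check before training can catch lines where the tab was lost or a label was mistyped. The sketch below is not part of Stanford NER; it is a plain awk check, and the four-line sample.tsv plus the output names bad-lines.txt and label-counts.txt are made up for illustration.

```shell
# Tiny stand-in for the hand-labelled sample.tsv; the last line is
# deliberately malformed (a space instead of a tab) to show what gets flagged.
printf 'Lehman\tORG\nBrothers\tORG\n,\tO\nHenry PERS\n' > sample.tsv

# Report every non-empty line that does not have exactly two tab-separated columns.
awk -F'\t' 'NF > 0 && NF != 2 { print "line " NR ": " $0 }' sample.tsv > bad-lines.txt

# Count how often each label occurs among the well-formed lines.
awk -F'\t' 'NF == 2 { print $2 }' sample.tsv | sort | uniq -c > label-counts.txt

cat bad-lines.txt label-counts.txt
```

An unexpected entry in the label counts (for example a digit 0 where the letter O was intended) is a hint that the file needs another pass before training.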

Step 5: train

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop sample.prop

Step 6: test and verify

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier sample-ner-model.ser.gz -testFile sample.tsv

In production, maybe:

java -mx1g edu.stanford.nlp.ie.NERClassifierCombiner -textFile sample.txt -ner.model \
-loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz,classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz \
-outputFormat tabbedEntities -textFile sample.txt > samplenew.tsv

This seems correct to me.

Yes, if you build a new model with Stanford CoreNLP, you can add it to the list.

Note that the models are run in order: the earlier NER taggers in the list tag first, and later models cannot overwrite tags (e.g. ORG, PER) written by previous ones (except O, of course). So where you put your models matters: the closer a model is to the front of the list, the higher its priority.
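The priority rule can be pictured with a toy, token-level merge. This is only a sketch of the idea and not CoreNLP's actual merging code, which is more involved; the files first.tsv and second.tsv and their labels are invented outputs of two hypothetical models over the same four tokens.

```shell
# Hypothetical per-token output of two models for the same tokens.
printf 'Lehman\tORG\nBrothers\tORG\nNew\tO\nYork\tO\n' > first.tsv
printf 'Lehman\tPER\nBrothers\tPER\nNew\tLOC\nYork\tLOC\n' > second.tsv

# Keep the first model's label unless it is O; only then take the
# second model's label for that token.
paste first.tsv second.tsv |
  awk -F'\t' '{ label = ($2 != "O") ? $2 : $4; print $1 "\t" label }' > merged.tsv

cat merged.tsv
```

Here the second model's PER on "Lehman Brothers" is discarded because the first model already tagged those tokens, while its LOC on "New York" survives because the first model left them as O.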

Also, ner.combinationMode = HIGH_RECALL will allow every classifier in the list to apply all of its tags. ner.combinationMode = NORMAL means that only the first classifier that applies a given tag (e.g. ORG, PER) can apply it. You can set ner.combinationMode in the .prop file.
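As a rough sketch of what such a properties file could look like, the snippet below writes one with a heredoc. The file name combined.prop is made up, and using ner.model for the comma-separated model list is an assumption taken from the -ner.model flag in the question; check it against your CoreNLP version before relying on it.

```shell
# Write a hypothetical properties file for the combined NER setup.
# The custom model is listed first so that its tags take priority.
cat > combined.prop <<'EOF'
# comma-separated list of models, applied in this order (assumed property name)
ner.model = sample-ner-model.ser.gz,classifiers/english.all.3class.distsim.crf.ser.gz,classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz
# let every model in the list apply all of its tags
ner.combinationMode = HIGH_RECALL
EOF

cat combined.prop
```

Switching the last line to NORMAL restores the behaviour where only the first model that applies a given tag may use it.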

