Text corpora are often distributed as large files containing a specific document on each new line. For instance, I have a file with 10 million product reviews, one per line, and each review contains multiple sentences.
When processing such files with Stanford CoreNLP from the command line, for instance
java -cp "*" -Xmx16g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma -file test.txt
the output, whether in text or XML format, numbers all sentences from 1 to n, ignoring the original line numbering that separates the documents.
I would like to keep track of the original file's line numbering (e.g. in XML format, to have an output tree like <original_line id=1>, <sentence id=1>, <token id=1>). Or else, I would like to be able to reset the numbering of sentences at the start of each new line in the original file.
I have tried the answer to a similar question about Stanford's POS tagger, without success. Those options do not keep track of the original line numbers.
A quick solution would be to split the original file into multiple files, then process each of them with CoreNLP using the -filelist input option. However, for large files with millions of documents, creating millions of individual files just to preserve the original line/document numbering seems inefficient.
I suppose it would be possible to modify the source code of Stanford CoreNLP, but I am unfamiliar with Java.
Any solution that preserves the original line numbering in the output would be very helpful, whether through the command line or by showing example Java code that would achieve that.
I've dug through the code base, and I can't find a command line flag that will help you.
I wrote some sample Java code that should do the trick.
I put this in DocPerLineProcessor.java, which I put into stanford-corenlp-full-2015-04-20. I also put a file called sample-doc-per-line.txt there, which had 4 sentences per line.
First make sure to compile:
cd stanford-corenlp-full-2015-04-20
javac -cp "*:." DocPerLineProcessor.java
Here is the command to run:
java -cp "*:." DocPerLineProcessor sample-doc-per-line.txt
The output file sample-doc-per-line.txt.xml should be in the desired XML format, but the sentences now carry the line number they're on.
Here is the code:
import java.io.*;
import java.util.*;
import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.trees.TreeCoreAnnotations.*;
import edu.stanford.nlp.util.*;

public class DocPerLineProcessor {
    public static void main(String[] args) throws IOException {
        // set up properties
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse");
        // set up pipeline
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // read in a product review per line
        Iterable<String> lines = IOUtils.readLines(args[0]);
        Annotation mainAnnotation = new Annotation("");
        // add a blank list to put sentences into
        List<CoreMap> blankSentencesList = new ArrayList<CoreMap>();
        mainAnnotation.set(CoreAnnotations.SentencesAnnotation.class, blankSentencesList);
        // process each product review
        int lineNumber = 1;
        for (String line : lines) {
            // annotate each line as its own document
            Annotation annotation = new Annotation(line);
            pipeline.annotate(annotation);
            for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
                // tag the sentence with its source line, then add it to the combined document
                sentence.set(CoreAnnotations.LineNumberAnnotation.class, lineNumber);
                mainAnnotation.get(CoreAnnotations.SentencesAnnotation.class).add(sentence);
            }
            lineNumber += 1;
        }
        PrintWriter xmlOut = new PrintWriter(args[0] + ".xml");
        pipeline.xmlPrint(mainAnnotation, xmlOut);
        // close the writer so any buffered output reaches the file
        xmlOut.close();
    }
}

Now when you run this, the sentence tags will also have the appropriate line number. The sentences still have a global id, but you can mark which line each one came from. This also sets things up so that a newline always ends a sentence.
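If you want the per-line numbering the question asks for (sentence ids that restart on each line), one option is to post-process the XML. Below is a minimal sketch using only the JDK's XML parser. It assumes each <sentence> element carries an "id" attribute and a "line" attribute holding the line number; check that against your actual output before relying on it. The class name SentencesByLine and the method groupByLine are made up for this illustration and are not part of CoreNLP.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of CoreNLP.
public class SentencesByLine {

    /** Maps each original line number to the global sentence ids on it, in document order. */
    public static Map<Integer, List<Integer>> groupByLine(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Map<Integer, List<Integer>> byLine = new TreeMap<>();
        NodeList sentences = doc.getElementsByTagName("sentence");
        for (int i = 0; i < sentences.getLength(); i++) {
            Element s = (Element) sentences.item(i);
            // Assumption: the line number is written as a "line" attribute.
            // Elements named "sentence" without it are skipped.
            if (!s.hasAttribute("line") || !s.hasAttribute("id")) {
                continue;
            }
            int line = Integer.parseInt(s.getAttribute("line"));
            int id = Integer.parseInt(s.getAttribute("id"));
            byLine.computeIfAbsent(line, k -> new ArrayList<>()).add(id);
        }
        return byLine;
    }

    public static void main(String[] args) throws Exception {
        // Tiny hand-written stand-in for the real output file.
        String xml = "<root><document><sentences>"
                + "<sentence id=\"1\" line=\"1\"/>"
                + "<sentence id=\"2\" line=\"1\"/>"
                + "<sentence id=\"3\" line=\"2\"/>"
                + "</sentences></document></root>";
        for (Map.Entry<Integer, List<Integer>> e : groupByLine(xml).entrySet()) {
            List<Integer> ids = e.getValue();
            for (int k = 0; k < ids.size(); k++) {
                // k + 1 is the sentence number within the line
                System.out.println("line " + e.getKey() + ", sentence " + (k + 1)
                        + " (global id " + ids.get(k) + ")");
            }
        }
    }
}
```

The position of a sentence within its line's list, plus one, gives the numbering that resets on each line. This sketch loads the whole document into memory, so for an output file covering millions of reviews a streaming parser such as StAX (javax.xml.stream) would probably be a better fit.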
Please let me know if you need any clarification or if I made any errors transcribing the code.