java - How to extract text from docx with Tika


I'm trying to extract text from a docx file. tika-app does it well, but when I try the same thing in my code the result is empty, and the Tika parser says the content type of the docx file is "application/zip".

How can I do this? Should I use a recursive approach (like this), or is there another way?
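On the "application/zip" result: a docx file is an OOXML package stored in a ZIP container, so a check that looks only at the leading bytes sees a plain ZIP archive. A minimal sketch using only java.util.zip illustrates this. The class name DocxIsZip and its helper methods are hypothetical, and the archive it builds only mimics the layout of a docx package; it is not a valid document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DocxIsZip {

    // Builds a tiny in-memory ZIP whose entry names mimic a docx package.
    // Illustrative only: the entry contents are placeholders.
    static byte[] buildFakeDocx() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : new String[] {"[Content_Types].xml", "word/document.xml"}) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    // True if the data starts with the ZIP local file header signature "PK\3\4".
    static boolean startsWithZipMagic(byte[] data) {
        return data.length >= 4
                && data[0] == 'P' && data[1] == 'K' && data[2] == 3 && data[3] == 4;
    }

    // Lists the entry names in the archive, in stored order.
    static List<String> entryNames(byte[] data) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = buildFakeDocx();
        System.out.println("ZIP magic present: " + startsWithZipMagic(data));
        System.out.println("Entries: " + entryNames(data));
    }
}
```

Only the entries inside the archive distinguish a docx from any other ZIP file, which is why a filename hint or a container-aware detector is needed to get past "application/zip".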

Update: the file's content type is detected correctly if I add the filename to the metadata:

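    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;

    InputStream is = new FileInputStream(myFile);
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    metadata.set(Metadata.RESOURCE_NAME_KEY, myFileFileName); // filename hint for type detection
    ParseContext context = new ParseContext();
    context.set(Parser.class, parser);
    parser.parse(is, handler, metadata, context);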

However, parse() now fails with this error:

java.lang.NoClassDefFoundError: org/apache/poi/openxml4j/exceptions/InvalidFormatException
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)

For me, the most confusing thing about Apache Tika is that your code compiles without tika-parsers.jar but cannot work without it at runtime. Make sure you have installed tika-parsers.jar together with all of its dependencies (there are many).
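One way to confirm whether those jars are actually visible at runtime is to try loading the classes named in the stack trace. The sketch below uses only the standard library. The class name ClasspathCheck and the method isOnClasspath are hypothetical; the two class names being probed are taken from the error message in the question.

```java
public class ClasspathCheck {

    // Returns true if the named class can be loaded by this class's loader.
    // The class is located but not initialized.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing class, or a class whose own dependencies are missing.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            // Parser class from tika-parsers, as named in the stack trace.
            "org.apache.tika.parser.microsoft.ooxml.OOXMLParser",
            // POI class reported as missing in the NoClassDefFoundError.
            "org.apache.poi.openxml4j.exceptions.InvalidFormatException"
        };
        for (String name : required) {
            System.out.println((isOnClasspath(name) ? "found   " : "MISSING ") + name);
        }
    }
}
```

Run it with the same classpath as the application. Any line marked MISSING points to a jar that still has to be added.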

