I'm trying to extract text from a docx file. tika-app does it well, but when I try the same thing in my own code the result is empty, and the Tika parser reports the content type of the docx file as "application/zip".
How can I do this? Should I use a recursive approach (like this), or is there another way?
Update: the file's content type is detected correctly if I add the filename to the metadata:
InputStream is = new FileInputStream(myFile);
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, myFileFilename);
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
parser.parse(is, handler, metadata, context);

Anyway, at parse() I now get this error:

java.lang.NoClassDefFoundError: org/apache/poi/openxml4j/exceptions/InvalidFormatException
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
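For me, the main confusing thing about Apache Tika is that your code compiles without tika-parsers.jar but cannot work without it at runtime. Make sure you have installed tika-parsers.jar and all of its dependencies (there are many). The NoClassDefFoundError above means that one of those dependencies, Apache POI's OOXML support, is missing from the classpath.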
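To see which pieces are missing before calling parse(), a quick classpath check can help. The following is only a minimal sketch: the class name ClasspathCheck and the helper isOnClasspath are made up for illustration, and the list of class names is taken from the code and stack trace in the question rather than from any complete dependency list.

```java
final class ClasspathCheck {

    // Hypothetical helper: returns true if the named class can be located
    // by the class loader that loaded this class.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError raised while loading a dependency
            return false;
        }
    }

    public static void main(String[] args) {
        // Class names taken from the question; this is not an exhaustive list.
        String[] required = {
            "org.apache.tika.parser.AutoDetectParser",                   // tika-core
            "org.apache.tika.parser.microsoft.ooxml.OOXMLParser",        // tika-parsers
            "org.apache.poi.openxml4j.exceptions.InvalidFormatException" // Apache POI (OOXML)
        };
        for (String name : required) {
            System.out.println((isOnClasspath(name) ? "found   " : "MISSING ") + name);
        }
    }
}
```

Run it with the same classpath as the application. Any line marked MISSING points at a jar that still has to be added. If you use a build tool such as Maven, declaring tika-parsers as a dependency normally pulls in the rest transitively.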