ruby - Apache Tika server request to get 'main content' instead of 'plain text'


I am experimenting with Apache Tika: app and server, GUI and command line.

With the Tika app, I can run

    java -jar tika-app-1.7.jar --gui 

and choose 'View' -> 'Main content', or run

    java -jar tika-app-1.7.jar --text-main http://www.cnn.com/2015/07/09/politics/russian-bombers-u-s-intercept-july-4/index.html 

I need the main content, but it seems that in server mode I can only get plain text. I have been checking this guide:

    curl -s "http://amzn.com/b005iwm8pu" | curl -X PUT -T - http://<server_ip>:9998/meta
    curl -s "http://amzn.com/b005iwm8pu" | curl -X PUT -T - http://<server_ip>:9998/tika

Maybe whatever comes after http://<server_ip>:9998/ is the trick? Is there a way to get the main content in server mode?

In the end, the request has to be made in Ruby, against tika-server-1.3.jar. So far it looks like this:

    require "net/http"

    # Tika server endpoint that returns plain text
    tika_prefix = URI('http://<server_ip>:9998/tika')
    url = 'http://www.cnn.com/2015/07/09/politics/russian-bombers-u-s-intercept-july-4/index.html'

    # PUT the document to the server and read back the extracted text
    request = Net::HTTP::Put.new(tika_prefix.to_s)
    request.body = url
    request.content_type = 'text/html'
    http = Net::HTTP.start(tika_prefix.hostname, tika_prefix.port)
    http.request(request).body

This is possible as of today. Tika 1.15 implements the TIKA-2343 feature request, which adds a --text-main equivalent in server mode.
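As a rough sketch of what the server-mode request could look like from Ruby: the code below assumes the new endpoint is exposed at `/tika/main` (my reading of TIKA-2343, so check the endpoint list of your server version) and that the server runs Tika 1.15 or later. The helper names `build_main_content_request` and `main_content` are made up for illustration. It also sends the page's HTML as the request body rather than the URL string.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: builds a PUT request for the main-content endpoint.
# ASSUMPTION: the endpoint path is "/tika/main" (Tika server 1.15+).
def build_main_content_request(tika_base, html)
  endpoint = URI.join(tika_base, "/tika/main")
  request = Net::HTTP::Put.new(endpoint)
  request.body = html
  request.content_type = "text/html"
  request["Accept"] = "text/plain"
  request
end

# Hypothetical helper: downloads the page, then sends its HTML to Tika.
# Net::HTTP.get does not follow redirects, so pass the final page URL.
def main_content(tika_base, page_url)
  html = Net::HTTP.get(URI(page_url))
  request = build_main_content_request(tika_base, html)
  Net::HTTP.start(request.uri.hostname, request.uri.port) do |http|
    http.request(request).body
  end
end
```

Calling `main_content("http://<server_ip>:9998", url)` with a real server address should then return the extracted main content as a string, provided the assumptions above hold.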

vaites/php-apache-tika is the PHP binding for Tika that I use, and I've opened an issue regarding this, so you should be able to see it being implemented soon.

Edit: the PHP binding library now supports this feature.

