java - How to access crawled content from Nutch for content categorisation

I am running Nutch integrated with a Solr search engine, and the Nutch crawl job runs on Hadoop. My next requirement is to run a content categorisation job on the crawled content. How can I access the text content stored in HDFS for the tagging job? I am planning to write the tagging job in Java, so how can I access the content through Java?

The crawled content is stored in the data file in the segments directory, for example: