java - How to access crawled content from Nutch for content categorisation
I am running Nutch integrated with a Solr search engine, and the Nutch crawl job runs on Hadoop. My next requirement is to run a content categorisation job on the crawled content. How can I access the text content stored in HDFS for this tagging job? I am planning to write the tagging job in Java, so how can I access the content through Java?
The crawled content is stored in the data file under the segments directory, for example:
segments\2014...\content\part-00000\data
The file is a Hadoop sequence file. To read it, you can use the code from the Hadoop book or from this answer.
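For illustration, here is a minimal sketch of reading such a sequence file from Java. It assumes the Hadoop 2+ client API (older releases use a different SequenceFile.Reader constructor), and the class name SegmentContentReader and the RecordHandler callback are made up for this example. The key and value classes are taken from the file header, so the Nutch jar must be on the classpath at run time. In Nutch 1.x I would expect the key to be a Text holding the URL and the value to be an org.apache.nutch.protocol.Content, but check this against your own segment.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical helper class, not part of Nutch or Hadoop.
public class SegmentContentReader {

    /** Hypothetical callback, invoked once per record in the file. */
    public interface RecordHandler {
        void handle(Writable key, Writable value) throws IOException;
    }

    /** Reads every record of one sequence file and returns the record count. */
    public static long read(Configuration conf, Path dataFile, RecordHandler handler)
            throws IOException {
        long count = 0;
        try (SequenceFile.Reader reader =
                 new SequenceFile.Reader(conf, SequenceFile.Reader.file(dataFile))) {
            // Instantiate whatever key/value classes the file header declares.
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            // The same key and value objects are refilled on each call to next(),
            // so copy anything you need to keep beyond one iteration.
            while (reader.next(key, value)) {
                handler.handle(key, value);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Picks up core-site.xml from the classpath, so an hdfs:// path works
        // when the cluster configuration is available.
        Configuration conf = new Configuration();
        Path dataFile = new Path(args[0]); // path to a .../content/part-00000/data file
        long n = read(conf, dataFile,
            (key, value) -> System.out.println(key + "\t" + value.getClass().getName()));
        System.out.println(n + " records read");
    }
}
```

Inside the handler you can cast the value to the concrete class the file reports and pass its bytes to the tagging code. If the value is a Nutch Content object, getContent() returns the raw fetched bytes and getContentType() the MIME type.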