Large files - Parsing URLs containing specific filetypes in a MediaWiki dump
I have a large .xml file (about 500 MB), a dump of a site based on MediaWiki.
My goal is to find URL links that contain image filename extensions, group the links by second-level domain, and export the result with the links grouped in that order.
Example: there are many links beginning with domain.com/*.png, host.com/*.png, and image.com/*.png. Grouping them into separate files, one per second-level domain with its links — that's the final result.
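For the grouping step, the second-level domain can be pulled from each URL's hostname. A minimal sketch of that grouping (the function names are mine, and the naive `parts[-2:]` split is an assumption — it mishandles multi-part TLDs like .co.uk):

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Image extensions to look for; extend as needed.
IMAGE_EXT = re.compile(r"\.(png|jpe?g|gif|svg)$", re.IGNORECASE)

def second_level_domain(url):
    """Return e.g. 'host.com' for 'http://img.host.com/a.png'.
    Naive: does not handle multi-part TLDs like .co.uk."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def group_image_links(urls):
    """Keep only URLs whose path ends in an image extension,
    grouped by second-level domain."""
    groups = defaultdict(list)
    for url in urls:
        if IMAGE_EXT.search(urlparse(url).path):
            groups[second_level_domain(url)].append(url)
    return dict(groups)
```

Each group could then be written to its own file, one per domain.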
So you want to parse the links in wikitext. Writing a MediaWiki parser is a pain, so you should use an existing parser.
The easiest way (easiest, not easy) is to import the dump into a MediaWiki install, rebuild the tables if needed, and then export the externallinks table.