Large files - Parsing URLs containing specific filetypes in a MediaWiki dump
I have a large .xml file (about 500 MB), a dump of a site based on MediaWiki.
My goal is to find URL links that contain image filename extensions, group the links by second-level domain, and export the result with the links grouped in that order.
Example: there are many links beginning with domain.com/*.png, host.com/*.png, and image.com/*.png. Grouping them into separate files, one per second-level domain with its links — that's the final result.
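For the grouping step, the second-level domain can be pulled from each URL's hostname. A minimal sketch of that grouping (the function names are mine, and the naive `parts[-2:]` split is an assumption — it mishandles multi-part TLDs like .co.uk):

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Image extensions to look for; extend as needed.
IMAGE_EXT = re.compile(r"\.(png|jpe?g|gif|svg)$", re.IGNORECASE)

def second_level_domain(url):
    """Return e.g. 'host.com' for 'http://img.host.com/a.png'.
    Naive: does not handle multi-part TLDs like .co.uk."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def group_image_links(urls):
    """Keep only URLs whose path ends in an image extension,
    grouped by second-level domain."""
    groups = defaultdict(list)
    for url in urls:
        if IMAGE_EXT.search(urlparse(url).path):
            groups[second_level_domain(url)].append(url)
    return dict(groups)
```

Each group could then be written to its own file, one per domain.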
So you want to parse the links in wikitext. Writing a MediaWiki parser is a pain, so you should use an existing parser.
The easiest way (easiest, not easy) is to import the dump into a MediaWiki install, rebuild the tables if needed, and then export the externallinks table.