Memory leak when using package XML on Windows
Having read "Memory leaks parsing XML in R" (including the linked posts) and this post on R, and given that some time has passed again, I still think this is an unresolved issue that deserves attention, since the XML package is used widely throughout the R universe.
Thus please consider this a follow-up post and/or a reference with an informative yet concise illustration of the problem.
The issue
Parsing XML/HTML documents in a way that lets them be searched with XPath afterwards requires the internal use of C pointers (AFAIU). It seems that, at least on MS Windows (I'm running Windows 8.1, 64 bit), these references are not recognized by the garbage collector. The consumed memory is therefore not released, which leads to a freeze of the R process at some point.
Central findings so far
To me it seems that XML::free and/or gc does/do not recognize all of the memory involved when parsing XML/HTML docs via xmlParse or htmlParse and subsequently processing them with xpathApply or the like:
The reported memory usage of the OS task (Rterm.exe) adds up significantly fast, while the memory of the R process as "seen from within R" (function memory.size) increases only moderately (in comparison, that is). See the list elements mem_r, mem_os and ratio before and after the substantial parsing cycle below.
All in all, even when throwing in everything that has been recommended (free, rm and gc), memory usage still always increases when xmlParse and the like are called. It's just a question of how much. IMHO there must still be something that's not working correctly.
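The cleanup steps named above (free, rm, gc) can be combined into one pattern. The following is only a minimal sketch, assuming the XML package is installed; the helper name parse_and_release and its arguments are hypothetical and not part of the package. Per the findings above, this pattern should not be expected to stop OS-level memory from growing on Windows; it only shows what "throwing in everything" looks like in code.

```r
library(XML)

# Hypothetical helper: parse a file, run one XPath query, then release
# everything that can be released from R code.
parse_and_release <- function(path, xpath) {
  doc <- xmlParse(path)            # builds a C-level document tree
  on.exit({
    free(doc)                      # ask the XML package to free the tree
    rm(doc)                        # drop the R-level reference
    gc()                           # let R collect unreachable objects
  }, add = TRUE)
  # Copy the values out as plain character data so that the result does
  # not keep any reference to the C-level document.
  unlist(xpathApply(doc, xpath, xmlValue))
}
```

Returning plain character vectors rather than node objects is a deliberate choice here: node objects would keep pointing into the document that is being freed.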
Illustration
I borrowed the profiling code from Duncan's Omegahat git repository.
Some preparations:
    Sys.setenv("LANGUAGE"="en")
    require("compiler")
    require("XML")

    > sessionInfo()
    R version 3.1.0 (2014-04-10)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252
    [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
    [5] LC_TIME=German_Germany.1252

    attached base packages:
    [1] compiler  stats     graphics  grDevices utils     datasets  methods
    [8] base

    other attached packages:
    [1] XML_3.98-1.1
Functions that we need:
    getTaskMemoryByPid <- cmpfun(function(
        pid=Sys.getpid()
    ) {
        cmd <- sprintf("tasklist /FI \"pid eq %s\" /FO csv", pid)
        mem <- read.csv(text=shell(cmd, intern = TRUE), stringsAsFactors=FALSE)[,5]
        mem <- as.numeric(gsub("\\.|\\s|K", "", mem))/1000
        mem
    }, options=list(suppressAll=TRUE))

    memoryLeak <- cmpfun(function(
        x=system.file("exampleData", "mtcars.xml", package="XML"),
        n=10000,
        use_text=FALSE,
        xpath=FALSE,
        free_doc=FALSE,
        clean_up=FALSE,
        detailed=FALSE
    ) {
        if(use_text) {
            x <- readLines(x)
        }
        ## Before //
        mem_os <- getTaskMemoryByPid()
        mem_r <- memory.size()
        prof_1 <- memory.profile()
        mem_before <- list(mem_r=mem_r, mem_os=mem_os, ratio=mem_os/mem_r)

        ## Per run //
        mem_perrun <- lapply(1:n, function(ii) {
            doc <- xmlParse(x, asText=use_text)
            if (xpath) {
                res <- xpathApply(doc=doc, path="/blah", fun=xmlValue)
                rm(res)
            }
            if (free_doc) {
                free(doc)
            }
            rm(doc)
            out <- NULL
            if (detailed) {
                out <- list(
                    profile=memory.profile(),
                    size=memory.size()
                )
            }
            out
        })
        has_perrun <- any(sapply(mem_perrun, length) > 0)
        if (!has_perrun) {
            mem_perrun <- NULL
        }

        ## Garbage collect //
        mem_gc <- NULL
        if(clean_up) {
            gc()
            tmp <- gc()
            mem_gc <- list(gc_mb=tmp["Ncells", "(Mb)"])
        }

        ## After //
        mem_os <- getTaskMemoryByPid()
        mem_r <- memory.size()
        prof_2 <- memory.profile()
        mem_after <- list(mem_r=mem_r, mem_os=mem_os, ratio=mem_os/mem_r)

        list(
            before=mem_before,
            perrun=mem_perrun,
            gc=mem_gc,
            after=mem_after,
            comparison_r=data.frame(
                before=prof_1,
                after=prof_2,
                increase=round((prof_2/prof_1)-1, 4)
            ),
            increase_r=(mem_after$mem_r/mem_before$mem_r)-1,
            increase_os=(mem_after$mem_os/mem_before$mem_os)-1
        )
    }, options=list(suppressAll=TRUE))
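A note for anyone re-running this on a current setup: memory.size() is Windows-only and, as far as I know, it is no longer supported from R 4.2.0 on. A rough stand-in for the R-side figure is to sum the used megabytes reported by gc(). This is a sketch only; the helper name r_memory_mb is hypothetical, and the number is not equivalent to what memory.size() reported, so ratios computed with it are not comparable to the ones in this post.

```r
# Hypothetical helper: megabytes currently used by R's own heap, as
# reported by gc(). Note that calling gc() also triggers a collection.
r_memory_mb <- function() {
  usage <- gc()
  sum(usage[, 2])   # column 2 is the "(Mb)" column for used Ncells and Vcells
}

before <- r_memory_mb()
x <- numeric(5e6)   # allocate roughly 40 MB of doubles
after <- r_memory_mb()
after - before      # growth as seen from within R, in Mb
```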
Results
Scenario 1
Quick facts: garbage collection enabled, the XML doc is parsed n times but not searched via xpathApply.
Notice the ratios of OS memory vs. R memory:
before: 1.364832
after: 1.322702
    res <- memoryLeak(clean_up=TRUE, n=50000)
    save(res, file=file.path(tempdir(), "memory-profile-1.rdata"))

    > res
    $before
    $before$mem_r
    [1] 37.42

    $before$mem_os
    [1] 51.072

    $before$ratio
    [1] 1.364832

    $perrun
    NULL

    $gc
    $gc$gc_mb
    [1] 45

    $after
    $after$mem_r
    [1] 63.21

    $after$mem_os
    [1] 83.608

    $after$ratio
    [1] 1.322702

    $comparison_r
                before  after increase
    NULL             1      1   0.0000
    symbol        7387   7392   0.0007
    pairlist    190383 390633   1.0518
    closure       5077  55085   9.8499
    environment   1032  51032  48.4496
    promise       5226 105226  19.1351
    language     54675  54791   0.0021
    special         44     44   0.0000
    builtin        648    648   0.0000
    char          8746   8763   0.0019
    logical       9081   9084   0.0003
    integer      22804  22807   0.0001
    double        2773   2783   0.0036
    complex          1      1   0.0000
    character    44522  94569   1.1241
    ...              0      0      NaN
    any              0      0      NaN
    list         19946  19951   0.0003
    expression       1      1   0.0000
    bytecode     16049  16050   0.0001
    externalptr   1487   1487   0.0000
    weakref        391    391   0.0000
    raw            392    392   0.0000
    S4            1392   1392   0.0000

    $increase_r
    [1] 0.6892036

    $increase_os
    [1] 0.6370614
Scenario 2
Quick facts: garbage collection enabled, free explicitly called, the XML doc is parsed n times but not searched via xpathApply.
Notice the ratios of OS memory vs. R memory:
before: 1.315249
after: 1.222143
    res <- memoryLeak(clean_up=TRUE, free_doc=TRUE, n=50000)
    save(res, file=file.path(tempdir(), "memory-profile-2.rdata"))

    > res
    $before
    $before$mem_r
    [1] 63.48

    $before$mem_os
    [1] 83.492

    $before$ratio
    [1] 1.315249

    $perrun
    NULL

    $gc
    $gc$gc_mb
    [1] 69.3

    $after
    $after$mem_r
    [1] 95.92

    $after$mem_os
    [1] 117.228

    $after$ratio
    [1] 1.222143

    $comparison_r
                before  after increase
    NULL             1      1   0.0000
    symbol        7454   7454   0.0000
    pairlist    392455 592466   0.5096
    closure      55104 105104   0.9074
    environment  51032 101032   0.9798
    promise     105226 205226   0.9503
    language     55592  55592   0.0000
    special         44     44   0.0000
    builtin        648    648   0.0000
    char          8847   8848   0.0001
    logical       9141   9141   0.0000
    integer      23109  23111   0.0001
    double        2802   2807   0.0018
    complex          1      1   0.0000
    character    94775 144781   0.5276
    ...              0      0      NaN
    any              0      0      NaN
    list         20174  20177   0.0001
    expression       1      1   0.0000
    bytecode     16265  16265   0.0000
    externalptr   1488   1487  -0.0007
    weakref        392    391  -0.0026
    raw            393    392  -0.0025
    S4            1392   1392   0.0000

    $increase_r
    [1] 0.5110271

    $increase_os
    [1] 0.4040627
Scenario 3
Quick facts: garbage collection enabled, free explicitly called, the XML doc is parsed n times and searched via xpathApply each time.
Notice the ratios of OS memory vs. R memory:
before: 1.220429
after: 13.15629 (!)
    res <- memoryLeak(clean_up=TRUE, free_doc=TRUE, xpath=TRUE, n=50000)
    save(res, file=file.path(tempdir(), "memory-profile-3.rdata"))

    > res
    $before
    $before$mem_r
    [1] 95.94

    $before$mem_os
    [1] 117.088

    $before$ratio
    [1] 1.220429

    $perrun
    NULL

    $gc
    $gc$gc_mb
    [1] 93.4

    $after
    $after$mem_r
    [1] 124.64

    $after$mem_os
    [1] 1639.8

    $after$ratio
    [1] 13.15629

    $comparison_r
                before  after increase
    NULL             1      1   0.0000
    symbol        7454   7460   0.0008
    pairlist    592458 793042   0.3386
    closure     105104 155110   0.4758
    environment 101032 151032   0.4949
    promise     205226 305226   0.4873
    language     55592  55882   0.0052
    special         44     44   0.0000
    builtin        648    648   0.0000
    char          8847   8867   0.0023
    logical       9142   9162   0.0022
    integer      23109  23112   0.0001
    double        2802   2832   0.0107
    complex          1      1   0.0000
    character   144775 194819   0.3457
    ...              0      0      NaN
    any              0      0      NaN
    list         20174  20177   0.0001
    expression       1      1   0.0000
    bytecode     16265  16265   0.0000
    externalptr   1488   1487  -0.0007
    weakref        392    391  -0.0026
    raw            393    392  -0.0025
    S4            1392   1392   0.0000

    $increase_r
    [1] 0.2991453

    $increase_os
    [1] 13.00485
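To make the scenario 3 figures easier to follow, here is the arithmetic behind ratio, increase_r and increase_os, using only the values printed in the output above. It mirrors the expressions at the end of memoryLeak and should reproduce the reported 13.15629 and 13.00485.

```r
# Values copied from the scenario 3 output (in MB).
before <- list(mem_r = 95.94,  mem_os = 117.088)
after  <- list(mem_r = 124.64, mem_os = 1639.8)

ratio_after <- after$mem_os / after$mem_r          # OS memory per MB seen by R
increase_r  <- after$mem_r  / before$mem_r  - 1    # relative growth seen by R
increase_os <- after$mem_os / before$mem_os - 1    # relative growth seen by the OS

print(c(ratio_after = ratio_after,
        increase_r  = increase_r,
        increase_os = increase_os))
```

In other words, R itself reports growth of about 30 %, while the OS task grows to roughly 14 times its starting size. The gap between the two is the memory that neither free nor gc gets back.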
I also tried different versions. Well, tried to try ;-)
From source, from omegahat.org
FYI: the latest Rtools 3.1 is installed and included in the Windows PATH (e.g. installing stringr from source code worked fine).
    > install.packages("XML", repos="http://www.omegahat.org/R", type="source")
    trying URL 'http://www.omegahat.org/R/src/contrib/XML_3.98-1.tar.gz'
    Content type 'application/x-gzip' length 1543387 bytes (1.5 Mb)
    opened URL
    downloaded 1.5 Mb

    * installing *source* package 'XML' ...
    Please define LIB_XML (and LIB_ZLIB, LIB_ICONV)
    Warning: running command 'sh ./configure.win' had status 1
    ERROR: configuration failed for package 'XML'
    * removing 'r:/home/apps/lsqmapps/apps/r/r-3.1.0/library/xml'
    * restoring previous 'r:/home/apps/lsqmapps/apps/r/r-3.1.0/library/xml'

    The downloaded source packages are in
        'c:\users\rappster_admin\appdata\local\temp\rtmpqfz2ck\downloaded_packages'
    Warning messages:
    1: running command '"r:/home/apps/lsqmapps/apps/r/r-3.1.0/bin/x64/r" CMD INSTALL -l "r:\home\apps\lsqmapps\apps\r\r-3.1.0\library" c:\users\rappst~1\appdata\local\temp\rtmpqfz2ck/downloaded_packages/xml_3.98-1.tar.gz' had status 1
    2: In install.packages("XML", repos = "http://www.omegahat.org/R", :
      installation of package 'XML' had non-zero exit status
GitHub
I did not follow the recommendations in the README of the GitHub repo, as it points to this directory, which only contains a tar.gz of version 3.94-0 (while we're at 3.98-1.1 on CRAN).
Even though it is stated that the GitHub repo is not in the standard R package structure, I tried it anyway with install_github, and failed ;-)
    require("devtools")
    > install_github(repo="XML", username="omegahat")
    Installing github repo XML/master from omegahat
    Downloading master.zip from https://github.com/omegahat/XML/archive/master.zip
    Installing package from c:\users\rappst~1\appdata\local\temp\rtmpqfz2ck/master.zip
    Installing XML
    "r:/home/apps/lsqmapps/apps/r/r-3.1.0/bin/x64/r" --vanilla CMD INSTALL \
      "c:\users\rappster_admin\appdata\local\temp\rtmpqfz2ck\devtools15c82d7c2b4c\xml-master" \
      --library="r:/home/apps/lsqmapps/apps/r/r-3.1.0/library" --with-keep.source \
      --install-tests

    * installing *source* package 'XML' ...
    Please define LIB_XML (and LIB_ZLIB, LIB_ICONV)
    Warning: running command 'sh ./configure.win' had status 1
    ERROR: configuration failed for package 'XML'
    * removing 'r:/home/apps/lsqmapps/apps/r/r-3.1.0/library/xml'
    * restoring previous 'r:/home/apps/lsqmapps/apps/r/r-3.1.0/library/xml'
    Error: Command failed (1)
Whilst it is still in its infancy (only a couple of months old!) and has a few quirks, Hadley Wickham has written a library for XML parsing, xml2, which can be found on GitHub at https://github.com/hadley/xml2. It is restricted to reading rather than writing XML, but for parsing XML I've been experimenting with it and it looks like it does the job, without the memory leaks of the XML package! It provides functions including:
read_xml(): read an XML file
xml_children(): get the child nodes of a node
xml_text(): get the text within a tag
xml_attrs(): get a character vector of the attributes and values of a node, which can be cast to a named list with as.list()
Note that you still need to make sure you rm() the XML node objects after you're done with them and force a garbage collection with gc(); the memory is then released to the O/S (disclaimer: only tested on Windows 7, which seems to be the 'memory leaky' platform anyway).
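As a rough illustration of how those functions fit together, here is a minimal sketch, assuming the xml2 package is installed. The XML snippet and the variable names are made up for the example.

```r
library(xml2)

# Parse from a string literal; read_xml() also accepts a file path.
doc <- read_xml(
  '<cars><car id="1" name="Mazda RX4">21.0</car><car id="2" name="Datsun 710">22.8</car></cars>'
)

nodes  <- xml_children(doc)                # child nodes of the root
values <- xml_text(nodes)                  # text inside each <car> tag
attrs  <- as.list(xml_attrs(nodes[[1]]))   # attributes of the first node as a named list

# Drop every object that still references the document, then collect.
rm(nodes, doc)
invisible(gc())
```

The extracted values and attrs are plain R objects, so they can be kept after the node objects and the document have been removed.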
Hope this helps someone!