Memory leak when using package XML on Windows


Having read "Memory leaks parsing XML in R" (including linked posts) and this post on R, and given that some time has passed again, I still think this is an unresolved issue that deserves attention, as the XML package is used throughout the R universe.

Thus, please consider this a follow-up post and/or a reference with an informative yet concise illustration of the problem.

The issue

Parsing XML/HTML documents in a way that they can be searched with XPath afterwards requires the internal use of C pointers (AFAIU). It seems that, at least on MS Windows (I'm running on Windows 8.1, 64 bit), these references are not recognized by the garbage collector. Thus the consumed memory is not released, which leads to a freeze of the R process at some point.

Central findings so far

To me it seems that XML::free and/or gc does/do not recognize all the memory involved when parsing XML/HTML docs via xmlParse or htmlParse and subsequently processing them with xpathApply or the like:

The reported memory usage of the OS task (Rterm.exe) adds up significantly fast, while the reported memory of the R process "as seen from within R" (function memory.size) increases only moderately (in comparison, that is). See the list elements mem_r, mem_os and ratio before and after a substantial parsing cycle below.

All in all, even when throwing in everything that has been recommended (free, rm and gc), memory usage still always increases when xmlParse and the like are called. It's just a question of by how much. So IMHO there must still be something that's not working correctly.


Illustration

I borrowed the profiling code from Duncan's Omegahat git repository.

Some preparations:

    Sys.setenv("LANGUAGE"="en")
    require("compiler")
    require("XML")

    > sessionInfo()
    R version 3.1.0 (2014-04-10)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252
    [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
    [5] LC_TIME=German_Germany.1252

    attached base packages:
    [1] compiler  stats     graphics  grDevices utils     datasets  methods
    [8] base

    other attached packages:
    [1] XML_3.98-1.1

Functions we need:

    ## Memory of the OS task for a given PID, as reported by Windows' tasklist
    getTaskMemoryByPid <- cmpfun(function(
        pid=Sys.getpid()
    ) {
        cmd <- sprintf("tasklist /FI \"pid eq %s\" /FO csv", pid)
        mem <- read.csv(text=shell(cmd, intern = TRUE), stringsAsFactors=FALSE)[,5]
        ## Strip thousands separator ".", whitespace and the "K" unit
        mem <- as.numeric(gsub("\\.|\\s|K", "", mem))/1000
        mem
    }, options=list(suppressAll=TRUE))

    ## Parse the same document n times and compare memory before and after
    memoryLeak <- cmpfun(function(
        x=system.file("exampleData", "mtcars.xml", package="XML"),
        n=10000,
        use_text=FALSE,
        xpath=FALSE,
        free_doc=FALSE,
        clean_up=FALSE,
        detailed=FALSE
    ) {
        if(use_text) {
            x <- readLines(x)
        }
        ## Before //
        mem_os  <- getTaskMemoryByPid()
        mem_r   <- memory.size()
        prof_1  <- memory.profile()
        mem_before <- list(mem_r=mem_r,
            mem_os=mem_os, ratio=mem_os/mem_r)

        ## Per run //
        mem_perrun <- lapply(1:n, function(ii) {
            doc <- xmlParse(x, asText=use_text)
            if (xpath) {
                res <- xpathApply(doc=doc, path="/blah", fun=xmlValue)
                rm(res)
            }
            if (free_doc) {
                free(doc)
            }
            rm(doc)
            out <- NULL
            if (detailed) {
                out <- list(
                    profile=memory.profile(),
                    size=memory.size()
                )
            }
            out
        })
        has_perrun <- any(sapply(mem_perrun, length) > 0)
        if (!has_perrun) {
            mem_perrun <- NULL
        }

        ## Garbage collect //
        mem_gc <- NULL
        if(clean_up) {
            gc()
            tmp <- gc()
            mem_gc <- list(gc_mb=tmp["Ncells", "(Mb)"])
        }

        ## After //
        mem_os  <- getTaskMemoryByPid()
        mem_r   <- memory.size()
        prof_2  <- memory.profile()
        mem_after <- list(mem_r=mem_r,
            mem_os=mem_os, ratio=mem_os/mem_r)
        list(
            before=mem_before,
            perrun=mem_perrun,
            gc=mem_gc,
            after=mem_after,
            comparison_r=data.frame(
                before=prof_1,
                after=prof_2,
                increase=round((prof_2/prof_1)-1, 4)
            ),
            increase_r=(mem_after$mem_r/mem_before$mem_r)-1,
            increase_os=(mem_after$mem_os/mem_before$mem_os)-1
        )
    }, options=list(suppressAll=TRUE))
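The memory helper above depends on how `tasklist` formats its memory column, and that format follows the Windows locale: the pattern `"\\.|\\s|K"` fits a German-style value such as `51.072 K`, but an English-style `51,072 K` would leave the comma in place and `as.numeric()` would return `NA`. The following is a minimal sketch of a locale-tolerant variant that strips every non-digit character. The sample CSV lines and the helper name `parseTaskMem` are assumptions made for illustration, not real captured output.

```r
## Hypothetical tasklist CSV output (header plus one task), German-style number
sample_de <- c(
  '"Image Name","PID","Session Name","Session#","Mem Usage"',
  '"Rterm.exe","1234","Console","1","51.072 K"'
)
## Same hypothetical task with an English-style thousands separator
sample_en <- sub("51.072 K", "51,072 K", sample_de, fixed = TRUE)

## Hypothetical helper: read the fifth column and keep digits only,
## so that ".", ",", whitespace and the "K" unit are all dropped
parseTaskMem <- function(lines) {
  tab <- read.csv(text = lines, stringsAsFactors = FALSE)
  as.numeric(gsub("[^0-9]", "", tab[[5]])) / 1000
}

mem_de <- parseTaskMem(sample_de)
mem_en <- parseTaskMem(sample_en)
print(c(german = mem_de, english = mem_en))
```

Both calls should yield the same value, so the profiling results would not depend on the locale of the machine running them.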

Results

Scenario 1

Quick facts: garbage collection enabled, XML doc parsed n times but not searched via xpathApply.

Notice the ratios of OS memory vs. R memory:

Before: 1.364832

After: 1.322702

    res <- memoryLeak(clean_up=TRUE, n=50000)
    save(res, file=file.path(tempdir(), "memory-profile-1.rdata"))

    > res
    $before
    $before$mem_r
    [1] 37.42

    $before$mem_os
    [1] 51.072

    $before$ratio
    [1] 1.364832


    $perrun
    NULL

    $gc
    $gc$gc_mb
    [1] 45


    $after
    $after$mem_r
    [1] 63.21

    $after$mem_os
    [1] 83.608

    $after$ratio
    [1] 1.322702


    $comparison_r
                before  after increase
    NULL             1      1   0.0000
    symbol        7387   7392   0.0007
    pairlist    190383 390633   1.0518
    closure       5077  55085   9.8499
    environment   1032  51032  48.4496
    promise       5226 105226  19.1351
    language     54675  54791   0.0021
    special         44     44   0.0000
    builtin        648    648   0.0000
    char          8746   8763   0.0019
    logical       9081   9084   0.0003
    integer      22804  22807   0.0001
    double        2773   2783   0.0036
    complex          1      1   0.0000
    character    44522  94569   1.1241
    ...              0      0      NaN
    any              0      0      NaN
    list         19946  19951   0.0003
    expression       1      1   0.0000
    bytecode     16049  16050   0.0001
    externalptr   1487   1487   0.0000
    weakref        391    391   0.0000
    raw            392    392   0.0000
    S4            1392   1392   0.0000

    $increase_r
    [1] 0.6892036

    $increase_os
    [1] 0.6370614

Scenario 2

Quick facts: garbage collection enabled, free explicitly called, XML doc parsed n times but not searched via xpathApply.

Notice the ratios of OS memory vs. R memory:

Before: 1.315249

After: 1.222143

    res <- memoryLeak(clean_up=TRUE, free_doc=TRUE, n=50000)
    save(res, file=file.path(tempdir(), "memory-profile-2.rdata"))

    > res
    $before
    $before$mem_r
    [1] 63.48

    $before$mem_os
    [1] 83.492

    $before$ratio
    [1] 1.315249


    $perrun
    NULL

    $gc
    $gc$gc_mb
    [1] 69.3


    $after
    $after$mem_r
    [1] 95.92

    $after$mem_os
    [1] 117.228

    $after$ratio
    [1] 1.222143


    $comparison_r
                before  after increase
    NULL             1      1   0.0000
    symbol        7454   7454   0.0000
    pairlist    392455 592466   0.5096
    closure      55104 105104   0.9074
    environment  51032 101032   0.9798
    promise     105226 205226   0.9503
    language     55592  55592   0.0000
    special         44     44   0.0000
    builtin        648    648   0.0000
    char          8847   8848   0.0001
    logical       9141   9141   0.0000
    integer      23109  23111   0.0001
    double        2802   2807   0.0018
    complex          1      1   0.0000
    character    94775 144781   0.5276
    ...              0      0      NaN
    any              0      0      NaN
    list         20174  20177   0.0001
    expression       1      1   0.0000
    bytecode     16265  16265   0.0000
    externalptr   1488   1487  -0.0007
    weakref        392    391  -0.0026
    raw            393    392  -0.0025
    S4            1392   1392   0.0000

    $increase_r
    [1] 0.5110271

    $increase_os
    [1] 0.4040627

Scenario 3

Quick facts: garbage collection enabled, free explicitly called, XML doc parsed n times and searched via xpathApply each time.

Notice the ratios of OS memory vs. R memory:

Before: 1.220429

After: 13.15629 (!)

    res <- memoryLeak(clean_up=TRUE, free_doc=TRUE, xpath=TRUE, n=50000)
    save(res, file=file.path(tempdir(), "memory-profile-3.rdata"))

    > res
    $before
    $before$mem_r
    [1] 95.94

    $before$mem_os
    [1] 117.088

    $before$ratio
    [1] 1.220429


    $perrun
    NULL

    $gc
    $gc$gc_mb
    [1] 93.4


    $after
    $after$mem_r
    [1] 124.64

    $after$mem_os
    [1] 1639.8

    $after$ratio
    [1] 13.15629


    $comparison_r
                before  after increase
    NULL             1      1   0.0000
    symbol        7454   7460   0.0008
    pairlist    592458 793042   0.3386
    closure     105104 155110   0.4758
    environment 101032 151032   0.4949
    promise     205226 305226   0.4873
    language     55592  55882   0.0052
    special         44     44   0.0000
    builtin        648    648   0.0000
    char          8847   8867   0.0023
    logical       9142   9162   0.0022
    integer      23109  23112   0.0001
    double        2802   2832   0.0107
    complex          1      1   0.0000
    character   144775 194819   0.3457
    ...              0      0      NaN
    any              0      0      NaN
    list         20174  20177   0.0001
    expression       1      1   0.0000
    bytecode     16265  16265   0.0000
    externalptr   1488   1487  -0.0007
    weakref        392    391  -0.0026
    raw            393    392  -0.0025
    S4            1392   1392   0.0000

    $increase_r
    [1] 0.2991453

    $increase_os
    [1] 13.00485
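To make the scenario 3 figures easier to read, here is a small sketch that re-derives `ratio`, `increase_r` and `increase_os` from the four memory values reported in the output above, using the same formulas as `memoryLeak`. The variable names are illustrative, and the per-iteration figure at the end is simply the OS-side growth divided by n=50000, in the same units the helper reports.

```r
## Values copied from the scenario 3 output
before <- list(mem_r = 95.94,  mem_os = 117.088)
after  <- list(mem_r = 124.64, mem_os = 1639.8)
n      <- 50000

## OS memory relative to what R itself reports
ratio_before <- before$mem_os / before$mem_r
ratio_after  <- after$mem_os  / after$mem_r

## Relative growth as seen from within R and as seen by the OS
increase_r  <- after$mem_r  / before$mem_r  - 1
increase_os <- after$mem_os / before$mem_os - 1

## Absolute growth that R does not account for, in total and per parse cycle
unaccounted  <- (after$mem_os - before$mem_os) - (after$mem_r - before$mem_r)
per_iteration <- (after$mem_os - before$mem_os) / n

print(round(c(ratio_before = ratio_before, ratio_after = ratio_after,
              increase_r = increase_r, increase_os = increase_os), 5))
print(c(unaccounted = unaccounted, per_iteration = per_iteration))
```

Seen this way, R reports growth of roughly 30% while the OS task grows about thirteenfold, which works out to about 0.03 of the reported memory unit per parse-and-query cycle that R never sees.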

I also tried different versions of the XML package. Well, I tried to try ;-)

From source, omegahat.org

FYI: the latest Rtools 3.1 is installed and included in the Windows PATH (e.g. installing stringr from source code worked fine).

    > install.packages("XML", repos="http://www.omegahat.org/r", type="source")
    trying URL 'http://www.omegahat.org/r/src/contrib/xml_3.98-1.tar.gz'
    Content type 'application/x-gzip' length 1543387 bytes (1.5 Mb)
    opened URL
    downloaded 1.5 Mb

    * installing *source* package 'XML' ...
    Please define LIB_XML (and LIB_ZLIB, LIB_ICONV)
    Warning: running command 'sh ./configure.win' had status 1
    ERROR: configuration failed for package 'XML'
    * removing 'r:/home/apps/lsqmapps/apps/r/r-3.1.0/library/xml'
    * restoring previous 'r:/home/apps/lsqmapps/apps/r/r-3.1.0/library/xml'

    The downloaded source packages are in
        'c:\users\rappster_admin\appdata\local\temp\rtmpqfz2ck\downloaded_packages'
    Warning messages:
    1: running command '"r:/home/apps/lsqmapps/apps/r/r-3.1.0/bin/x64/r" CMD INSTALL -l "r:\home\apps\lsqmapps\apps\r\r-3.1.0\library" c:\users\rappst~1\appdata\local\temp\rtmpqfz2ck/downloaded_packages/xml_3.98-1.tar.gz' had status 1
    2: In install.packages("XML", repos = "http://www.omegahat.org/r",  :
      installation of package 'XML' had non-zero exit status

GitHub

I did not follow the recommendations in the README of the GitHub repo, as it points to this directory, which contains only a tar.gz of version 3.94-0 (while we're at 3.98-1.1 on CRAN).

Even though it is stated that the GitHub repo is not in the standard R package structure, I tried it anyway with install_github, and failed ;-)

    require("devtools")
    > install_github(repo="XML", username="omegahat")
    Installing github repo XML/master from omegahat
    Downloading master.zip from https://github.com/omegahat/xml/archive/master.zip
    Installing package from c:\users\rappst~1\appdata\local\temp\rtmpqfz2ck/master.zip
    Installing XML
    "r:/home/apps/lsqmapps/apps/r/r-3.1.0/bin/x64/r" --vanilla CMD INSTALL  \
      "c:\users\rappster_admin\appdata\local\temp\rtmpqfz2ck\devtools15c82d7c2b4c\xml-master"  \
      --library="r:/home/apps/lsqmapps/apps/r/r-3.1.0/library" --with-keep.source  \
      --install-tests

    * installing *source* package 'XML' ...
    Please define LIB_XML (and LIB_ZLIB, LIB_ICONV)
    Warning: running command 'sh ./configure.win' had status 1
    ERROR: configuration failed for package 'XML'
    * removing 'r:/home/apps/lsqmapps/apps/r/r-3.1.0/library/xml'
    * restoring previous 'r:/home/apps/lsqmapps/apps/r/r-3.1.0/library/xml'
    Error: Command failed (1)

Whilst it is still in its infancy (only a couple of months old!) and has a few quirks, Hadley Wickham has written a library for XML parsing, xml2, which can be found on GitHub at https://github.com/hadley/xml2. It is restricted to reading rather than writing XML, but for parsing XML I've been experimenting with it and it looks like it does the job, without the memory leaks of the XML package! It provides functions including:

  • read_xml() to read an XML file
  • xml_children() to get the child nodes of a node
  • xml_text() to get the text within a tag
  • xml_attrs() to get a character vector of the attributes and values of a node, which can be cast to a named list with as.list()

Note that you still need to ensure that you rm() the XML node objects after you're done with them, and force a garbage collection with gc(), so that the memory is released to the O/S (disclaimer: only tested on Windows 7, but that seems to be the 'memory leaky' platform anyway).

Hope this helps someone!
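As a rough illustration of the functions listed above, here is a minimal sketch that parses a small inline document with xml2, extracts text and attributes, and then removes the node objects and triggers garbage collection as suggested. The XML snippet and all variable names are made up for the example, and it assumes the xml2 package is installed.

```r
library(xml2)

## Made-up sample document; read_xml() also accepts a file path or URL
doc <- read_xml(
  "<cars><car name='Mazda RX4' cyl='6'>21.0</car><car name='Datsun 710' cyl='4'>22.8</car></cars>"
)

## Child nodes of the root, their text content, and one node's attributes
cars  <- xml_children(doc)
mpg   <- as.numeric(xml_text(cars))
attrs <- as.list(xml_attrs(cars[[1]]))

print(mpg)
print(attrs$name)

## Drop every reference to the document and its nodes, then collect
rm(cars, doc)
invisible(gc())
```

Only plain R values (`mpg`, `attrs`) are kept after the cleanup step, so nothing in the session still points at the parsed document.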

