python - Why isn't this regexp working?


I have the source code of a web page, formatted like this:

<span class="l r positive-icon">
turkish
</span>
<span>
the.mist[2007]dvdrip[eng]-axxo
</span>
<span class="l r neutral-icon">
vietnamese
</span>
<span>
the.mist.2007.720p.bluray.x264.yify
</span>

As you can see, there are spans with a class of either "l r positive-icon" or "l r neutral-icon". I want the languages, which sit inside the spans that have a class. The regexp I use gives me an empty list:

alllanguages = re.findall('<span class=".*">\s(.*)\s</span>', alllanguagestags) 

alllanguagestags contains the source code shown above. Can anyone give me a hint?

Don't use regular expressions for this. Use an actual HTML parser. I recommend BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(yourhtml)
# class_=True matches only the <span> tags that have a class attribute
languages = [s.get_text().strip() for s in soup.find_all('span', class_=True)]

Demo:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <span class="l r positive-icon">
... turkish
... </span>
... <span>
... the.mist[2007]dvdrip[eng]-axxo
... </span>
... <span class="l r neutral-icon">
... vietnamese
... </span>
... <span>
... the.mist.2007.720p.bluray.x264.yify
... </span>
... ''')
>>> [s.get_text().strip() for s in soup.find_all('span', class_=True)]
[u'turkish', u'vietnamese']
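If installing a third-party package is not an option, the same "use a parser, not a regexp" idea can be sketched with the standard library's html.parser module. This is only an illustration, not part of the original answer: the names LanguageSpanParser and extract_languages are made up for the example, it targets Python 3, and it assumes the spans of interest are not nested inside one another.

```python
from html.parser import HTMLParser


class LanguageSpanParser(HTMLParser):
    """Collect the text of every <span> that carries a class attribute.

    Hypothetical helper for illustration; assumes spans are not nested.
    """

    def __init__(self):
        super().__init__()
        self.languages = []
        self._in_classed_span = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "span" and "class" in dict(attrs):
            self._in_classed_span = True
            self._buffer = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it
        if self._in_classed_span:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_classed_span:
            self.languages.append("".join(self._buffer).strip())
            self._in_classed_span = False


def extract_languages(html):
    """Return the stripped text of each <span> that has a class attribute."""
    parser = LanguageSpanParser()
    parser.feed(html)
    parser.close()
    return parser.languages


sample = """
<span class="l r positive-icon">
turkish
</span>
<span>
the.mist[2007]dvdrip[eng]-axxo
</span>
<span class="l r neutral-icon">
vietnamese
</span>
<span>
the.mist.2007.720p.bluray.x264.yify
</span>
"""

print(extract_languages(sample))  # ['turkish', 'vietnamese']
```

For anything beyond a simple case like this (nested tags, malformed markup, CSS-style selection), BeautifulSoup is likely to be the more comfortable choice.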
