python - Why isn't this regexp working -

I have the source code of a web page formatted like this:

<span class="l r positive-icon"> turkish </span> <span> the.mist[2007]dvdrip[eng]-axxo </span> <span class="l r neutral-icon"> vietnamese </span> <span> the.mist.2007.720p.bluray.x264.yify  </span>

As you can see, there are spans with a class of either "l r positive-icon" or "l r neutral-icon". I want the languages, which are the text inside the spans that have a class. The regexp I use gives me an empty list:

alllanguages = re.findall('<span class=".*">\s(.*)\s</span>', alllanguagestags)

alllanguagestags contains the source code shown above. Can anyone give me a hint?

Don't use regular expressions for this. Use an actual HTML parser. I recommend you use BeautifulSoup instead:

from bs4 import BeautifulSoup

soup = BeautifulSoup(yourhtml)
# class_=True matches every <span> that has a class attribute at all
languages = [s.get_text().strip() for s in soup.find_all('span', class_=True)]

demo:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <span class="l r positive-icon">
... turkish
... </span>
... <span>
... the.mist[2007]dvdrip[eng]-axxo
... </span>
... <span class="l r neutral-icon">
... vietnamese
... </span>
... <span>
... the.mist.2007.720p.bluray.x264.yify
... </span>
... ''')
>>> [s.get_text().strip() for s in soup.find_all('span', class_=True)]
[u'turkish', u'vietnamese']
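
If installing BeautifulSoup is not an option, the same "parse, don't regex" idea can be sketched with the standard library's html.parser module. This is only a minimal illustration, not part of the answer above: the names LanguageSpanParser and extract_languages are made up for this example, and it assumes the spans are not nested inside each other.

```python
from html.parser import HTMLParser


class LanguageSpanParser(HTMLParser):
    """Collect the text of every <span> that carries a class attribute.

    Hypothetical helper for illustration; assumes spans are not nested.
    """

    def __init__(self):
        super().__init__()
        self.languages = []
        self._open_spans = []  # one bool per open <span>: does it have a class?
        self._buffer = []      # text pieces of the current classed <span>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            has_class = any(name == "class" for name, _ in attrs)
            self._open_spans.append(has_class)
            if has_class:
                self._buffer = []

    def handle_data(self, data):
        # Keep text only while inside a <span> that has a class.
        if self._open_spans and self._open_spans[-1]:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._open_spans:
            if self._open_spans.pop():
                self.languages.append("".join(self._buffer).strip())


def extract_languages(html):
    """Return the stripped text of all classed <span> elements in html."""
    parser = LanguageSpanParser()
    parser.feed(html)
    parser.close()
    return parser.languages


sample = (
    '<span class="l r positive-icon"> turkish </span> '
    '<span> the.mist[2007]dvdrip[eng]-axxo </span> '
    '<span class="l r neutral-icon"> vietnamese </span> '
    '<span> the.mist.2007.720p.bluray.x264.yify  </span>'
)
print(extract_languages(sample))
```

As with the BeautifulSoup version, spans without a class (the release names) are skipped, so only the language names are returned.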
