How do I write this regular expression?


<dt><a name="313"></a>ADHE 313 (6) Organization of Adult Basic Education Programs</dt>

I want to extract ADHE 313 and Organization of Adult Basic Education Programs.


18 replies • 2015-05-29 10:45:43 +08:00

asj

May 28, 2015

Isn't this a job for a CSS/jQuery selector or XPath?

phx13ye

May 28, 2015

<\/a>(.*)(.*?)<\/b>

sicongliu

May 28, 2015

XPath is simpler, but I would like to learn how to do it with a regex.

shoumu

May 28, 2015

Take a look at pyquery; it supports jQuery syntax.

professorz

May 28, 2015

.+<\\/a>(.+)(6)(.+)<\\/b>.+
That is the regex as written for Java.

sicongliu

May 28, 2015

How would I write it in Python?

yiyiwa

May 28, 2015

I tested this in Python. It is incomplete and also returns some empty matches.

'\>([^\<]*)\<'
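
A minimal sketch of how that pattern behaves with re.findall, assuming the sample line from the question is stored in a variable named html (a name chosen here for illustration). The empty strings come from adjacent tags such as "><"; filtering them out is one possible cleanup, not necessarily what the poster had in mind.

```python
import re

# Sample line from the question; the variable name is an assumption.
html = ('<dt><a name="313"></a>ADHE 313 (6) '
        'Organization of Adult Basic Education Programs</dt>')

# Capture whatever sits between a '>' and the next '<'.
raw = re.findall(r'>([^<]*)<', html)

# Adjacent tags ("><") produce empty captures, so drop them.
texts = [t.strip() for t in raw if t.strip()]

print(raw)    # ['', '', 'ADHE 313 (6) Organization of Adult Basic Education Programs']
print(texts)  # ['ADHE 313 (6) Organization of Adult Basic Education Programs']
```

This still leaves the course code, the (6), and the title in one string, so a second step would be needed to split them.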

sicongliu

May 28, 2015

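import re

# Lazy match from </a> up to the first whitespace: gives "ADHE"
m = re.search(r"</a>(.*?)\s", text)
print(m.group(1))

# Lazy match after the closing ")" up to </dt>: gives the course title
m = re.search(r"\)\s*(.*?)</dt>", text)
print(m.group(1))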

sicongliu

May 28, 2015

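What if I want to get ADHE 313?
How do I detect the second space? Searching the string and slicing would do it easily, of course, but I would like to know how to do it with a regex.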

sicongliu

May 28, 2015

m = re.search(r"</a>(.*?)\s+\(", text)
print(m.group(1))

Admittedly this approach is clumsy: if the second space is not followed by "(", it breaks down.
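
One way to stop at the second space without depending on the "(" is to describe the two words themselves. This is only a sketch, assuming the course code is always two runs of non-space characters separated by whitespace; the variable names text and code are illustrative.

```python
import re

# Sample line from the question.
text = ('<dt><a name="313"></a>ADHE 313 (6) '
        'Organization of Adult Basic Education Programs</dt>')

# \S+ is one run of non-space characters, so \S+\s+\S+ reads as
# "first word, whitespace, second word" and ends before the second gap.
m = re.search(r"</a>(\S+\s+\S+)", text)
code = m.group(1) if m else None

print(code)  # ADHE 313
```

If some course codes had more or fewer than two parts, this pattern would need adjusting.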

asj

May 28, 2015

I put together a quick one; it is still far from complete.
(?:<dt.*?>)(?:.*?\/.*?>)([\w ]*)(?:.*?)(?:<\/dt>)

http://regexr.com/3b3bs

May 28, 2015

This task is much simpler without a regex.

page.xpath("//dt/text()") -> ADHE 313 (6)
page.xpath("//dt/b/text()") -> Organization of Adult Basic Education Programs
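
For comparison, here is a rough sketch of the same non-regex idea using only the standard library's html.parser rather than XPath (a swapped-in technique, since page.xpath typically comes from a third-party library such as lxml). It assumes the title is wrapped in a <b> element, as the //dt/b expression implies; the class name DtParser and the variable names are made up for illustration.

```python
from html.parser import HTMLParser

# Assumed markup: the sample line with the title wrapped in <b>.
html = ('<dt><a name="313"></a>ADHE 313 (6) '
        '<b>Organization of Adult Basic Education Programs</b></dt>')


class DtParser(HTMLParser):
    """Collect text directly inside <dt> and inside <dt><b>."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags
        self.dt_text = []   # roughly //dt/text()
        self.b_text = []    # roughly //dt/b/text()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        data = data.strip()
        if not data or not self.stack:
            return
        if self.stack[-1] == "dt":
            self.dt_text.append(data)
        elif self.stack[-2:] == ["dt", "b"]:
            self.b_text.append(data)


parser = DtParser()
parser.feed(html)
parser.close()

print(parser.dt_text)  # ['ADHE 313 (6)']
print(parser.b_text)   # ['Organization of Adult Basic Education Programs']
```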

picasso250

May 28, 2015

/a>([\w ()]+)([\w ]+)
This is the simplest thing that solves your current problem.

picasso250

May 28, 2015

Sorry, the previous one was wrong: it also captured the (6).

/a>(\w+ \d+).+?([\w ]+)
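
Tried against the sample line with re.search, the lazy .+? in this pattern appears to stop as soon as [\w ]+ can match, which is at the 6 inside the parentheses. A possible variant, sketched under the assumption that the number in parentheses is always digits, matches that part explicitly instead; the names first_try and fixed are illustrative.

```python
import re

# Sample line from the question.
text = ('<dt><a name="313"></a>ADHE 313 (6) '
        'Organization of Adult Basic Education Programs</dt>')

# The pattern as posted: the second group ends up holding the "6".
first_try = re.search(r"/a>(\w+ \d+).+?([\w ]+)", text).groups()

# Variant: consume " (digits) " literally, then capture the title.
fixed = re.search(r"/a>(\w+ \d+) \(\d+\) ([\w ]+)", text).groups()

print(first_try)  # ('ADHE 313', '6')
print(fixed)      # ('ADHE 313', 'Organization of Adult Basic Education Programs')
```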

leozy2014

May 28, 2015

print(re.findall(r'</a>(.*?) \(6\) (.*?)</dt>', s))
#[('ADHE 313', 'Organization of Adult Basic Education Programs')]

wmttom

May 28, 2015

Python regex: (?<=>)[\w, ,\(,\)]+?(?= \(|<)

re.findall("(?<=>)[\w, ,\(,\)]+?(?= \(|<)", '<dt><a name="313"></a>ADHE 313 (6) Organization of Adult Basic Education Programs</dt>')

['ADHE 313', 'Organization of Adult Basic Education Programs']

sicongliu

May 29, 2015

Neither of the two replies above seems to work.

sicongliu

May 29, 2015

Sorry, this one does work:

print(re.findall(r'</a>(.*?) \(6\) (.*?)</dt>', s))
#[('ADHE 313', 'Organization of Adult Basic Education Programs')]
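
The working pattern hard-codes \(6\), so it would only match entries whose parenthesised number is 6. A hedged generalisation, assuming every entry has the same </a>CODE (N) TITLE</dt> shape, replaces it with \(\d+\) and captures the number as well. The second entry in the sample below is invented purely for illustration.

```python
import re

s = (
    '<dt><a name="313"></a>ADHE 313 (6) '
    'Organization of Adult Basic Education Programs</dt>\n'
    # Hypothetical second entry, invented for illustration only.
    '<dt><a name="999"></a>DEMO 999 (3) Example Course Title</dt>\n'
)

# Groups: course code, number in parentheses, title.
pattern = re.compile(r'</a>(.*?) \((\d+)\) (.*?)</dt>')
courses = pattern.findall(s)

for code, number, title in courses:
    print(code, number, title)
```

Each tuple in courses holds the three captured groups for one <dt> line, so a page with many entries can be handled in a single findall call.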