How do I handle reserved characters when scraping a web page with XPath?


As the title says: when the page content contains a reserved character such as <, XPath stops working properly.

Take this page, for example:

<!DOCTYPE html>
<html>
<head>
	<title></title>
</head>
<body>
<article>
	123<
</article>
<article>
	dfsfsd
</article>
</body>
</html>

When I try to use

$article = $xpath->query("//article")->item(0);

to extract the first article element, I don't get the correct result.

Is there any way to fix this or work around it?

Supplement 1 · May 6, 2017

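I haven't found a really good solution. At least in PHP, every approach that relies on the native XML library seems to run into this problem, so for now I can only preprocess the markup with a regular expression.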
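For anyone landing here later, the regex preprocessing could look roughly like the sketch below. It is only one possible reading of the workaround, not the original poster's actual code: `escapeStrayLt` is a hypothetical helper name, and the pattern assumes that any < not followed by a letter, /, ! or ? is meant as literal text.

```php
<?php
// Hypothetical helper (not from the thread): escape every "<" that
// cannot be the start of a tag, closing tag, comment, or doctype.
function escapeStrayLt(string $html): string
{
    // Assumption: "<" not followed by a letter, "/", "!" or "?" is literal text.
    return preg_replace('/<(?![A-Za-z\/!?])/', '&lt;', $html);
}

// The page from the question.
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<head>
	<title></title>
</head>
<body>
<article>
	123<
</article>
<article>
	dfsfsd
</article>
</body>
</html>
HTML;

// Collect parser warnings (e.g. about the HTML5 <article> tag) instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML(escapeStrayLt($html));
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$articles = $xpath->query('//article');
$article = $articles->item(0);

// Print the text of the first article with surrounding whitespace removed.
echo trim($article->textContent), "\n";
```

The pattern is deliberately conservative and has gaps: text such as `a<b` still looks like a tag start and is left alone, and a < inside a script block would also be rewritten. It is a stopgap for pages you cannot fix at the source.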

3 replies • 2017-05-06 13:46:05 +08:00

1

binux

May 6, 2017

You need a modern parser.

2

lgh

May 6, 2017 via iPhone

Your page is malformed... the < should be escaped as the entity &lt;

3

billlee

May 6, 2017

You need an HTML parser, not an XML parser.
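To illustrate the difference, here is a minimal sketch that feeds a reduced version of the page to PHP's `DOMDocument` in both modes. The `$markup` string and the variable names are made up for this example. `loadXML` is strict and rejects the bare <, while `loadHTML` tries to recover. How well it recovers depends on the libxml2 version PHP was built against, so the article count it reports may vary. If it still misparses the page, an HTML5-aware parser (for example the `masterminds/html5` package) may be worth trying.

```php
<?php
// Reduced version of the page from the question; the bare "<" is the problem.
$markup = "<html><body><article>123<\n</article><article>dfsfsd</article></body></html>";

// Collect parse errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// XML mode: a bare "<" is a well-formedness error, so loading fails.
$asXml = new DOMDocument();
$xmlOk = $asXml->loadXML($markup);
$xmlErrors = count(libxml_get_errors());
libxml_clear_errors();

// HTML mode: the parser is lenient and attempts to recover from bad markup.
$asHtml = new DOMDocument();
$htmlOk = $asHtml->loadHTML($markup);
libxml_clear_errors();

// How many <article> elements the HTML parser ended up with
// (assumption: this can differ between libxml2 versions).
$found = (new DOMXPath($asHtml))->query('//article')->length;

var_dump($xmlOk, $htmlOk);
echo "article elements seen by the HTML parser: $found\n";
```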