Nokogiri is a really popular XML and HTML library for Ruby. People love Nokogiri not just because it is powerful and fast, but above all because it is flexible and convenient. Nokogiri works perfectly in most respects, but there is a big pitfall when handling XML namespaces!
I ran into a super weird issue while processing XML returned by the Google Data API. The API returned an XML document that I expected to contain 4 entry elements.
I instantiated a Nokogiri::XML DOM from the XML document and then tried to query the DOM with XPath: xml_dom.xpath '//entry'.
I expected entries to be an array with 4 elements, but it was actually empty. After a few tries, I found that the query returned an empty result whenever it included an element name.
It was super weird.
After half an hour of fighting with Nokogiri, I began to realize that it must be related to the namespace.
I found that there is an attribute on the root element of the document: xmlns="http://www.w3.org/2005/Atom", which means that every element in the DOM without an explicit namespace prefix belongs to the namespace http://www.w3.org/2005/Atom by default.
The XPath query is namespace-sensitive! In XPath 1.0, an unprefixed name matches only elements that are in no namespace, so the query needs the qualified name rather than the local name. That means we should query the DOM with the code: xml_dom.xpath '//atom:entry', 'atom' => 'http://www.w3.org/2005/Atom'.
So, in one sentence: XPath in Nokogiri does not apply the document's default namespace to unprefixed names, so when querying a DOM that has a default namespace, we need to specify the namespace explicitly in the XPath query. It is really a hidden requirement, and developers are very likely to overlook it!
So if there is no risk of naming collisions, it is recommended to avoid this kind of “silly” issue by removing the namespaces from the DOM. The Nokogiri::XML::Document class provides the Nokogiri::XML::Document#remove_namespaces! method for this purpose.