Nokogiri is a very popular XML and HTML library for Ruby. People love
Nokogiri not just because it is powerful and fast, but above all because it is flexible and convenient.
Nokogiri works well in most respects, but there is a big pitfall when handling XML namespaces!
I ran into a very strange issue when processing XML returned by the Google Data API. The API returns the following XML document:
I instantiated a
Nokogiri::XML DOM with the XML document, and then tried to query the DOM with XPath:
I expected entries to be an array with 4 elements, but it was actually an empty array. After a few tries, I found that the query yields an empty array as soon as I introduce an element name into the query.
It was very strange.
After half an hour of fighting with Nokogiri, I began to realize that it must be related to the namespace.
Then I noticed an attribute on the root element of the document:
xmlns="http://www.w3.org/2005/Atom", which means that every element without an explicit namespace prefix in the XML DOM is in the namespace
http://www.w3.org/2005/Atom by default.
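And the XPath query is namespace sensitive! In XPath 1.0, an unprefixed name such as entry only matches elements that are in no namespace, so the query needs the namespace-qualified name rather than just the local name. That means we should query the DOM with the code: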
xml_dom.xpath '//atom:entry', 'atom' => 'http://www.w3.org/2005/Atom'.
So, in one sentence: XPath in
Nokogiri doesn’t inherit the default namespace, so when querying a DOM that has a default namespace, we need to specify the namespace explicitly in the XPath query. It is a hidden requirement that developers are very likely to overlook!
So if there is no risk of naming collisions, it is recommended to avoid this kind of “silly” issue by removing the namespaces from the DOM.
The Nokogiri::XML::Document class provides the
Nokogiri::XML::Document#remove_namespaces! method to achieve this.