Task #9572 (closed)
Opened 7 years ago
Closed 7 years ago
Bug: XML parser does not cope with entities in attributes nor correctly escape special characters
| Reported by: | rleigh | Owned by: | mlinkert |
|---|---|---|---|
| Priority: | major | Milestone: | Unscheduled |
| Component: | Bio-Formats | Version: | n.a. |
| Keywords: | xml | Cc: | ardo@…, ajpatterson |
| Resources: | n.a. | Referenced By: | n.a. |
| References: | n.a. | Remaining Time: | n.a. |
| Sprint: | n.a. |
Description
The OME XML parsing code does not correctly escape special characters when writing. It also does not cope with entities when reading (NPE).
Example adding "&" to a name attribute (/ome/team/rleigh/xmlentity/Sample-2008_02-new-amp.ome):
% showinf Sample-2008_02-new-amp.ome Checking file format [OME-XML] Initializing reader Populating metadata Exception in thread "main" java.lang.NullPointerException at loci.formats.services.OMEXMLServiceImpl.transformToLatestVersion(OMEXMLServiceImpl.java:161) at loci.formats.services.OMEXMLServiceImpl.createOMEXMLMetadata(OMEXMLServiceImpl.java:306) at loci.formats.services.OMEXMLServiceImpl.createOMEXMLMetadata(OMEXMLServiceImpl.java:295) at loci.formats.in.OMEXMLReader.initFile(OMEXMLReader.java:267) at loci.formats.FormatReader.setId(FormatReader.java:1178) at loci.formats.ImageReader.setId(ImageReader.java:727) at loci.formats.ReaderWrapper.setId(ReaderWrapper.java:529) at loci.formats.tools.ImageInfo.testRead(ImageInfo.java:988) at loci.formats.tools.ImageInfo.main(ImageInfo.java:1031)
Example adding "&" to a name attribute (/ome/team/rleigh/xmlentity/Sample-2008_02-new-ampreal.ome):
showinf Sample-2008_02-new-ampreal.ome Checking file format [OME-XML] Initializing reader Exception in thread "main" loci.formats.FormatException: Malformed OME-XML at loci.formats.in.OMEXMLReader.initFile(OMEXMLReader.java:241) at loci.formats.FormatReader.setId(FormatReader.java:1178) at loci.formats.ImageReader.setId(ImageReader.java:727) at loci.formats.ReaderWrapper.setId(ReaderWrapper.java:529) at loci.formats.tools.ImageInfo.testRead(ImageInfo.java:988) at loci.formats.tools.ImageInfo.main(ImageInfo.java:1031) Caused by: java.io.IOException at loci.common.xml.XMLTools.parseXML(XMLTools.java:338) at loci.common.xml.XMLTools.parseXML(XMLTools.java:306) at loci.formats.in.OMEXMLReader.initFile(OMEXMLReader.java:237) ... 5 more Caused by: org.xml.sax.SAXParseException: The entity name must immediately follow the '&' in the entity reference. at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:195) at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:174) at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:388) at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1427) at com.sun.org.apache.xerces.internal.impl.XMLScanner.scanAttributeValue(XMLScanner.java:876) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanAttribute(XMLDocumentFragmentScannerImpl.java:1547) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanStartElement(XMLDocumentFragmentScannerImpl.java:1320) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2756) at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:647) at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808) at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737) at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119) at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205) at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522) at javax.xml.parsers.SAXParser.parse(SAXParser.java:395) at javax.xml.parsers.SAXParser.parse(SAXParser.java:198) at loci.common.xml.XMLTools.parseXML(XMLTools.java:330) ... 7 more
The first example is well-formed XML. The second is not.
I'm unsure of the purpose of the stripping of the special characters and numeric entities in sanitizeXML. Why don't entities get handled transparently by the XML parser? Should we really need to handle this specially? And why are numerical entities broken? Surely we can just translate them to the corresponding unicode code point and back on deserialisation/serialisation?
Roger
See also: https://www.openmicroscopy.org/community/viewtopic.php?f=13&t=1455
Change History (5)
comment:1 Changed 7 years ago by rleigh
comment:2 Changed 7 years ago by rleigh
- Cc ajpatterson added
comment:3 Changed 7 years ago by rleigh
With the following trivial testcase, it appears that by default, the SAX parser is correctly processing entities in attributes:
import java.util.Stack;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import loci.common.DataTools;
import loci.common.xml.XMLTools;
class testenc
{
public static void main(String[] args)
{
try
{
String xmlData = DataTools.readFile("/tmp/Sample-2008_02-new-amp.ome");
// System.out.println(xmlData);
xmlData = XMLTools.sanitizeXML(xmlData);
TestHandler h = new TestHandler();
XMLTools.parseXML(xmlData, h);
}
catch (java.io.IOException e)
{
System.out.println("E: " + e);
}
}
}
class TestHandler extends DefaultHandler {
public Stack<String> nameStack = new Stack<String>();
public String cdata = new String();
TestHandler() {
}
// -- DefaultHandler API methods --
public void endElement(String uri,
String localName,
String qName) {
if (!nameStack.empty() && nameStack.peek().equals(qName))
nameStack.pop();
if (cdata != null)
System.out.println(" cdata:" + cdata);
System.out.println("end element: " + qName);
cdata = null;
}
public void characters(char[] ch,
int start,
int length)
{
String s = new String(ch, start, length);
if (cdata == null)
cdata = s;
else
cdata += s;
}
public void startElement(String uri, String localName, String qName,
Attributes attrs) throws SAXException
{
cdata = null;
System.out.println("start element: " + qName);
int length = attrs.getLength();
for (int i=0; i<length; i++) {
String name = attrs.getQName(i);
String value = attrs.getValue(i);
System.out.println(" attr: " + name + "="+value);
}
nameStack.push(qName);
}
}
However, XMLTools.sanitizeXML doesn't preserve printable Unicode characters outside the ASCII range, which is not very useful. Is this needed for any specific formats? What's the use case for this behaviour?
This one line patch corrects that:
diff --git a/components/common/src/loci/common/xml/XMLTools.java b/components/common/src/loci/common/xml/XMLTools.java
index 6665abf..dfd06df 100644
--- a/components/common/src/loci/common/xml/XMLTools.java
+++ b/components/common/src/loci/common/xml/XMLTools.java
@@ -184,7 +184,7 @@ public final class XMLTools {
final char[] c = s.toCharArray();
for (int i=0; i<s.length(); i++) {
if (Character.isISOControl(c[i]) ||
- !Character.isDefined(c[i]) || c[i] > '~')
+ !Character.isDefined(c[i]))
{
c[i] = ' ';
}
Likewise, why do we get rid of invalid &# entities? Surely this should be picked up during parsing as invalid XML, rather than needing to mangle the input?
comment:4 Changed 7 years ago by rleigh
With it being clear that & entities *are* being correctly processed by default, it does imply that the failure to deal with it properly is in the OME-XML upgrading code/transforms or related code reading/writing OME-XML.
comment:5 Changed 7 years ago by mlinkert
- Resolution set to fixed
- Status changed from new to closed
Note that the above examples are testing reading only. When writing OME XML, it should also be correctly substituting entities where required.