View Issue Details
ID: 0010003
Project: CDash
Category:
View Status: public
Date Submitted: 2009-12-07 07:11
Last Update: 2010-04-16 13:39
Reporter: Michal Cihar
Assigned To: Julien Jomier
Priority: normal
Severity: minor
Reproducibility: always
Status: closed
Resolution: fixed
Platform:
OS:
OS Version:
Product Version: 1.5
Target Version:
Fixed in Version: 1.6
Summary: 0010003: Wrong escaping of utf-8
Description: I don't know whether CDash or CTest is at fault here, but I'm unable to display coverage results for some files, e.g. http://cdash.cihar.com/viewCoverageFile.php?buildid=5122&fileid=113919 . It fails with the following error:

Warning: DOMDocument::loadXML(): Input is not proper UTF-8, indicate encoding ! Bytes: 0x82 0x26 0x61 0x6D in Entity, line: 17 in /srv/https/cdash.cihar.com/cdash/common.php on line 39

The problem seems to be over-escaped UTF-8 text:

Copyright © 2003 - 2009 Michal Čiha

The original file contains:

Copyright © 2003 - 2009 Michal Čihař

Attaching full XML which raises this error.
Tags: No tags attached.
Attached Files:
cdash.xml (263,827 bytes) 2009-12-07 07:11
20091207-1415.tar.bz2 (876,393 bytes) 2009-12-07 09:24
backup.c.gcov (10,351 bytes) 2009-12-07 10:08

 Relationships
related to 0008647 (closed, Brad King): CMake CTest Dev: CTest submitting invalid XML to dashboard resulting in tests getting dropped

  Notes
(0018702)
Michal Cihar (reporter)
2009-12-07 09:04

I need to rerun the tests, so it will take some time.

Meanwhile I noticed that cmake 2.6-patch 4 just messes up the text, but cdash is able to display that:

https://cdash.cihar.com/viewCoverageFile.php?buildid=5125&fileid=113158

(Copyright <-62><-87> 2003 - 2009 Michal <-60><-116>iha<-59><-103>)

While using cmake 2.8, cdash is not able to display it at all as mentioned before.
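
The numbers in the angle-bracket form above appear to be the signed 8-bit values of the original UTF-8 bytes (for example, -62 is 0xC2 and -87 is 0xA9). This is an inference from the sample, not documented behaviour. Under that assumption, a minimal Python sketch can reverse the notation; the function name `decode_angle_brackets` is hypothetical:

```python
import re


def decode_angle_brackets(text: str) -> str:
    """Reverse the <-NN> notation, assuming each token is one signed byte."""
    out = bytearray()
    pos = 0
    for match in re.finditer(r"<(-\d+)>", text):
        # Copy the plain ASCII text that precedes the token.
        out += text[pos:match.start()].encode("ascii")
        # Map the signed value (-128..-1) back to an unsigned byte (0x80..0xFF).
        out.append(int(match.group(1)) & 0xFF)
        pos = match.end()
    out += text[pos:].encode("ascii")
    return out.decode("utf-8")


sample = "Copyright <-62><-87> 2003 - 2009 Michal <-60><-116>iha<-59><-103>"
print(decode_angle_brackets(sample))
```

If the assumption holds, the recovered bytes form valid UTF-8, which would explain why the 2.6 output looks messy but remains displayable.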
(0018703)
Michal Cihar (reporter)
2009-12-07 09:26

XML files attached. For example, check the entry for ./python/gammu/src/convertors/backup.c in CoverageLog-0.xml; it is encoded as:

Copyright &#xc2;&#xa9; 2003 - 2009 Michal &#xc4;&#x8c;iha&#xc5;&#x99;
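
To make the pattern concrete, the sketch below reproduces this string by writing one character reference per UTF-8 byte instead of one per character. It only illustrates the observed output and is not CTest's actual code; `escape_per_byte` is a hypothetical name, and XML special characters such as `&` and `<` are ignored for brevity.

```python
def escape_per_byte(text: str) -> str:
    """Emit one &#xNN; reference per non-ASCII UTF-8 byte (the pattern seen above)."""
    parts = []
    for byte in text.encode("utf-8"):
        # ASCII bytes pass through; every other byte becomes its own reference.
        parts.append(chr(byte) if byte < 0x80 else f"&#x{byte:x};")
    return "".join(parts)


# U+00A9 is the copyright sign, U+010C and U+0159 are the accented letters in the name.
original = "Copyright \u00a9 2003 - 2009 Michal \u010ciha\u0159"
print(escape_per_byte(original))
```

Each two-byte character turns into two references, so an XML parser reads back two unrelated characters instead of one.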
(0018704)
Julien Jomier (manager)
2009-12-07 09:38

Thanks a lot. I can replicate the error, I'll see what I can do.
(0018705)
Julien Jomier (manager)
2009-12-07 10:01

I have added a small fix to force the encoding of the file. It seems that the output from the coverage tool (gcov?) is non-UTF-8. CDash should be able to parse and display the file now, but the CTest XML still uses a non-UTF-8 encoding, so the characters are not displayed correctly. Is the machine running the coverage tool using a non-UTF-8 encoding by default?
(0018706)
Michal Cihar (reporter)
2009-12-07 10:08

No, the machine is using UTF-8 locales. However, what I see in the XML was originally UTF-8; something encoded the two bytes of each multibyte character separately. If I run gcov manually, I still get the correct file.
(0018707)
Michal Cihar (reporter)
2009-12-07 11:08

About the related bug: this issue is not about dropped results; CDash is just unable to display coverage for some files.
(0018708)
Brad King (manager)
2009-12-07 11:11

There is some history here. At one time CTest only supported the "C" locale, but some tools would produce bad characters in their output (especially in failed tests that print garbage data). To prevent the bad text from messing up the XML files sent to CDash (actually Dart at the time), we converted non-basic characters to the angle-bracket syntax that CTest 2.6 produces.

CMake 2.8 adds support for arbitrary locales, assuming UTF-8 encoding. See issue 0008647. CTest now converts valid UTF-8 bytes into "&#xNN;", and still uses the angle brackets for truly bad characters. However, the result is still a valid UTF-8 XML file which we send to CDash. The XML spec allows the &#xNN; syntax to encode bytes, but the syntax is evaluated by XML parsers so no tools loading the files ever see the actual '&' '#' ... ';' characters...they see the UTF-8 encoded strings.

The problem in this issue lies somewhere between when CDash receives the XML file and when the user's browser displays the HTML that CDash generates. I participated in an investigation of this earlier this year, but we never finished. IIRC the problem has to do with PHP and/or MySQL locale and string handling settings on the server. Somewhere a string gets interpreted with the wrong encoding on input, and then encoded to UTF-8 on output, leading to the double-encoding. Check how these data are stored in the database on the server.

FYI, a work-around to this is to set "LC_ALL=C" in your test environment.
(0018709)
Michal Cihar (reporter)
2009-12-07 11:29

Well it looks like it is double encoded in the database:

Copyright \xc3\x82\xc2\xa9 2003 - 2009 Michal \xc3\x84\xc2\x8ciha\xc3\x85\xc2\x99
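
One way to arrive at exactly these bytes, shown as a sketch rather than a diagnosis of the server: parse the per-byte character references from CoverageLog-0.xml with a standard XML parser and store the result as UTF-8. The element name `line` is made up for the example.

```python
import xml.etree.ElementTree as ET

# The per-byte references as they appear in the submitted XML.
escaped = "Copyright &#xc2;&#xa9; 2003 - 2009 Michal &#xc4;&#x8c;iha&#xc5;&#x99;"

# An XML parser resolves each reference to its own code point (U+00C2, U+00A9, ...).
parsed = ET.fromstring(f"<line>{escaped}</line>").text

# Storing that text as UTF-8 spends two bytes on every one of those code points.
stored = parsed.encode("utf-8")
print(stored)

# The damage is reversible: map each code point back to a single byte,
# then decode the byte string as the UTF-8 it originally was.
repaired = stored.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired)
```

The same byte pattern also results from reading raw UTF-8 as Latin-1 and re-encoding it, so the database content alone does not show which step introduced it.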
(0018710)
Brad King (manager)
2009-12-07 11:58
edited on: 2009-12-07 11:59

Now that I'm getting back into this and remembering more about my investigation for issue 0008647, I retract my claim that CTest's output is correct. The reporter of that issue was happy with the LC_ALL=C workaround and we never had time to finish addressing the issue.

The XML spec does allow the &#xNN; syntax to support characters whose UTF-8 byte sequences cannot be represented in the "input device":

  http://www.w3.org/TR/REC-xml/#NT-Char

However, it means that the ENTIRE character should be encoded in one token, not just the individual bytes.

Issue 0008647 was actually reported against a development version of CMake (after 2.6, before 2.8). In that version we had already replaced the angle-bracket syntax with the &#xNN; encoding of each byte. That issue was about INVALID values of NN in such encodings (the link above specifies valid values). After that fix, the XML files were at least well-formed, and the LC_ALL=C workaround was used to deal with the fact that the characters were wrong.

Now my memory gets a little fuzzy. I think the reason we didn't remove the &#xNN; encoding from CTest altogether was that the PHP/MySQL encoding troubles caused CDash to treat raw UTF-8 input bytes incorrectly, but I don't remember for sure (this needs a separate investigation on the CDash side). As I state in issue 0008647, the full fix for CTest is to use the iconv library to deal with tool output and produce UTF-8 encoded XML files for submission to CDash. Doing this right cross-platform is non-trivial.

We at least need to stop mangling the output by encoding each byte of a UTF-8 character separately. Perhaps we can avoid iconv for now by assuming UTF-8 output from tools (which users can enforce with LC_* environment variables), and analyzing the byte sequences to convert whole UTF-8 characters to valid &#xNN; escape sequences.
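
The approach described here can be sketched in Python as follows, assuming tool output is UTF-8. This is an illustration only, not the eventual CTest implementation: `escape_utf8_output` and the `[bad byte 0xNN]` placeholder are hypothetical, and escaping of `&`, `<` and control characters is left out.

```python
def escape_utf8_output(raw: bytes) -> str:
    """Escape whole UTF-8 characters; flag bytes that are not valid UTF-8."""
    parts = []
    # surrogateescape maps each undecodable byte 0xNN to the code point U+DCNN.
    for ch in raw.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Invalid byte: emit a visible placeholder (format is an assumption).
            parts.append(f"[bad byte 0x{cp - 0xDC00:02X}]")
        elif cp < 0x80:
            parts.append(ch)
        else:
            # One reference for the entire character, not one per byte.
            parts.append(f"&#x{cp:X};")
    return "".join(parts)


print(escape_utf8_output("Michal \u010ciha\u0159".encode("utf-8")))
print(escape_utf8_output(b"a\x82b"))
```

Leaning on the language's own decoder keeps the sketch short; a C or C++ tool would need its own byte-sequence validation to get the same effect.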

(0018711)
Brad King (manager)
2009-12-07 12:21

The code in CTest that currently deals with this has one simple goal: ensure that the document transmitted to CDash consists only of *valid* unicode characters. Currently it does this by pretending each byte is a unicode character index and encoding it accordingly (except for invalid indices which get the angle-bracket syntax). This ensures that the document has *valid* characters, but not that it has the *correct* characters.

There is a TODO comment in the code that talks about dealing with encodings properly. This means that it needs to understand the encoding of the characters produced in tool output. CTest still needs to recognize and replace *invalid* characters and also send the correct valid characters with either UTF-8 encoding or with the xml-specific &#xNN; syntax.
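
The difference between *valid* and *correct* can be shown with the Char production from the XML 1.0 spec linked earlier. In the sketch below (helper names are hypothetical), every byte of a UTF-8 string passes the validity check when treated as a character index, yet the resulting text differs from the original.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if cp is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


original = "\u010ciha\u0159"  # the surname from the sample, with both accents
raw = original.encode("utf-8")

# Pretend each byte is a character index, as described in the note above.
per_byte_text = "".join(chr(b) for b in raw)

print(all(is_xml_char(b) for b in raw))  # every byte is a valid XML character
print(per_byte_text == original)         # but the text is not the original
```

A validity check alone therefore cannot catch this bug; the encoder has to decode whole characters before deciding what to emit.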
(0018725)
Brad King (manager)
2009-12-08 15:45

The following commits to CMake fix this on the CTest side, and test it.

CTest: Do not munge UTF-8 output in XML files
/cvsroot/CMake/CMake/Source/CMakeLists.txt,v <-- Source/CMakeLists.txt
new revision: 1.435; previous revision: 1.434
/cvsroot/CMake/CMake/Source/cmXMLSafe.cxx,v <-- Source/cmXMLSafe.cxx
new revision: 1.7; previous revision: 1.6
/cvsroot/CMake/CMake/Source/cm_utf8.c,v <-- Source/cm_utf8.c
initial revision: 1.1
/cvsroot/CMake/CMake/Source/cm_utf8.h,v <-- Source/cm_utf8.h
initial revision: 1.1

Test UTF-8 decoding
/cvsroot/CMake/CMake/Tests/CMakeLib/CMakeLists.txt,v <-- Tests/CMakeLib/CMakeLists.txt
new revision: 1.2; previous revision: 1.1
/cvsroot/CMake/CMake/Tests/CMakeLib/testUTF8.cxx,v <-- Tests/CMakeLib/testUTF8.cxx
initial revision: 1.1

Test XML encoding with UTF-8 character validation
/cvsroot/CMake/CMake/Tests/CMakeLib/CMakeLists.txt,v <-- Tests/CMakeLib/CMakeLists.txt
new revision: 1.3; previous revision: 1.2
/cvsroot/CMake/CMake/Tests/CMakeLib/testXMLSafe.cxx,v <-- Tests/CMakeLib/testXMLSafe.cxx
initial revision: 1.1
(0019567)
Julien Jomier (manager)
2010-02-21 12:34

Brad King and Dave Cole put a fix in CDash 1.6.

 Issue History
Date Modified Username Field Change
2009-12-07 07:11 Michal Cihar New Issue
2009-12-07 07:11 Michal Cihar File Added: cdash.xml
2009-12-07 08:49 Julien Jomier Status new => assigned
2009-12-07 08:49 Julien Jomier Assigned To => Julien Jomier
2009-12-07 08:49 Julien Jomier Product Version 1.6 => 1.5
2009-12-07 08:49 Julien Jomier Description Updated
2009-12-07 09:04 Michal Cihar Note Added: 0018702
2009-12-07 09:24 Michal Cihar File Added: 20091207-1415.tar.bz2
2009-12-07 09:26 Michal Cihar Note Added: 0018703
2009-12-07 09:38 Julien Jomier Note Added: 0018704
2009-12-07 10:01 Julien Jomier Note Added: 0018705
2009-12-07 10:08 Michal Cihar Note Added: 0018706
2009-12-07 10:08 Michal Cihar File Added: backup.c.gcov
2009-12-07 10:58 Brad King Relationship added related to 0008647
2009-12-07 11:08 Michal Cihar Note Added: 0018707
2009-12-07 11:11 Brad King Note Added: 0018708
2009-12-07 11:29 Michal Cihar Note Added: 0018709
2009-12-07 11:58 Brad King Note Added: 0018710
2009-12-07 11:59 Brad King Note Edited: 0018710
2009-12-07 12:21 Brad King Note Added: 0018711
2009-12-08 15:45 Brad King Note Added: 0018725
2010-02-21 12:34 Julien Jomier Note Added: 0019567
2010-02-21 12:34 Julien Jomier Status assigned => resolved
2010-02-21 12:34 Julien Jomier Fixed in Version => 1.6
2010-02-21 12:34 Julien Jomier Resolution open => fixed
2010-04-16 13:39 Julien Jomier Status resolved => closed

