Notes

(0018702) Michal Cihar - 2009-12-07 09:04
I need to rerun the tests, so it will take some time.
Meanwhile I noticed that CMake 2.6 patch 4 just mangles the text, but CDash is still able to display it:
https://cdash.cihar.com/viewCoverageFile.php?buildid=5125&fileid=113158
(Copyright <-62><-87> 2003 - 2009 Michal <-60><-116>iha<-59><-103>)
With CMake 2.8, CDash is not able to display it at all, as mentioned before.

(0018703) Michal Cihar - 2009-12-07 09:26
XML files attached; for example, check the entry for ./python/gammu/src/convertors/backup.c in CoverageLog-0.xml. It is encoded as:
Copyright © 2003 - 2009 Michal Čihař

(0018704) Julien Jomier - 2009-12-07 09:38
Thanks a lot. I can replicate the error; I'll see what I can do.

(0018705) Julien Jomier - 2009-12-07 10:01
I have added a small fix to force the encoding of the file. It seems that the output from the coverage tool (gcov?) is not UTF-8. CDash should be able to parse and display the file now, but the ctest XML still uses a non-UTF-8 encoding, so the characters are not displayed correctly. Is the machine running the coverage tool using a non-UTF-8 encoding by default?

(0018706) Michal Cihar - 2009-12-07 10:08
No, the machine is using UTF-8 locales. However, what I see in the XML was originally UTF-8; it is just that the two bytes of each multibyte character were encoded separately. If I run gcov manually, I still get a correct file.

(0018707) Michal Cihar - 2009-12-07 11:08
Regarding the related bug: it is not about dropping results; CDash is just unable to display coverage for some files.

(0018708) Brad King - 2009-12-07 11:11
There is some history here. At one time CTest only supported the "C" locale, but some tools would produce bad characters in their output (especially failed tests that print garbage data). To prevent the bad text from corrupting the XML files sent to CDash (actually Dart at the time), we converted non-basic characters to the angle-bracket syntax that CTest 2.6 produces.
CMake 2.8 adds support for arbitrary locales, assuming UTF-8 encoding; see issue 0008647. CTest now converts valid UTF-8 bytes into "&#xNN;" references, and still uses the angle brackets for truly bad characters. However, the result is still a valid UTF-8 XML file, which we send to CDash. The XML spec allows the &#xNN; syntax to encode bytes, but the syntax is evaluated by XML parsers, so no tool loading the files ever sees the literal '&' '#' ... ';' characters; it sees the UTF-8 encoded strings.
The problem in this issue lies somewhere between when CDash receives the XML file and when the user's browser displays the HTML that CDash generates. I participated in an investigation of this earlier this year, but we never finished. IIRC the problem has to do with PHP and/or MySQL locale and string-handling settings on the server: somewhere a string gets interpreted with the wrong encoding on input and then encoded to UTF-8 on output, leading to the double encoding. Check how these data are stored in the database on the server.
FYI, a workaround is to set "LC_ALL=C" in your test environment.

(0018709) Michal Cihar - 2009-12-07 11:29
Well, it looks like it is double-encoded in the database:
Copyright \xc3\x82\xc2\xa9 2003 - 2009 Michal \xc3\x84\xc2\x8ciha\xc3\x85\xc2\x99

(0018710) Brad King - 2009-12-07 11:58 (edited 2009-12-07 11:59)
Now that I'm getting back into this and remembering more about my investigation of issue 0008647, I retract my claim that CTest's output is correct. The reporter of that issue was happy with the LC_ALL=C workaround, and we never had time to finish addressing the issue.
The XML spec does allow the &#xNN; syntax to support characters whose UTF-8 byte sequences cannot be represented on the "input device":
http://www.w3.org/TR/REC-xml/#NT-Char
However, it means that the ENTIRE character should be encoded in one reference, not the individual bytes.
Issue 0008647 was actually reported against a development version of CMake (after 2.6, before 2.8). In that version we had already replaced the angle-bracket syntax with the &#xNN; encoding of each byte. That issue was about INVALID values of NN in such encodings (the link above specifies the valid values). After that fix the XML files were at least well-formed, and the LC_ALL=C workaround was used to deal with the fact that the characters were wrong.
Now my memory gets a little fuzzy. I think the reason we didn't remove the &#xNN; encoding from CTest altogether was that the PHP/MySQL encoding troubles caused CDash to treat raw UTF-8 input bytes incorrectly, but I don't remember for sure (this needs a separate investigation on the CDash side). As I state in issue 0008647, the full fix for CTest is to use the iconv library to deal with tool output and produce UTF-8 encoded XML files for submission to CDash. Doing this right cross-platform is non-trivial.
We at least need to stop mangling the output by encoding each byte of a UTF-8 character separately. Perhaps we can avoid iconv for now by assuming UTF-8 output from tools (which users can enforce with the LC_* environment variables), and analyzing the byte sequences to convert whole UTF-8 characters to valid &#xNN; escape sequences.

(0018711) Brad King - 2009-12-07 12:21
The code in CTest that currently deals with this has one simple goal: ensure that the document transmitted to CDash consists only of *valid* Unicode characters. Currently it does this by pretending each byte is a Unicode code point and encoding it accordingly (except for invalid values, which get the angle-bracket syntax). This ensures that the document has *valid* characters, but not that it has the *correct* characters.
There is a TODO comment in the code about dealing with encodings properly. That means understanding the encoding of the characters produced in tool output. CTest still needs to recognize and replace *invalid* characters, and also to send the correct valid characters, either UTF-8 encoded or in the XML-specific &#xNN; syntax.

(0018725) Brad King - 2009-12-08 15:45
The following commits to CMake fix this on the CTest side and test it:
CTest: Do not munge UTF-8 output in XML files
/cvsroot/CMake/CMake/Source/CMakeLists.txt,v <-- Source/CMakeLists.txt
new revision: 1.435; previous revision: 1.434
/cvsroot/CMake/CMake/Source/cmXMLSafe.cxx,v <-- Source/cmXMLSafe.cxx
new revision: 1.7; previous revision: 1.6
/cvsroot/CMake/CMake/Source/cm_utf8.c,v <-- Source/cm_utf8.c
initial revision: 1.1
/cvsroot/CMake/CMake/Source/cm_utf8.h,v <-- Source/cm_utf8.h
initial revision: 1.1
Test UTF-8 decoding
/cvsroot/CMake/CMake/Tests/CMakeLib/CMakeLists.txt,v <-- Tests/CMakeLib/CMakeLists.txt
new revision: 1.2; previous revision: 1.1
/cvsroot/CMake/CMake/Tests/CMakeLib/testUTF8.cxx,v <-- Tests/CMakeLib/testUTF8.cxx
initial revision: 1.1
Test XML encoding with UTF-8 character validation
/cvsroot/CMake/CMake/Tests/CMakeLib/CMakeLists.txt,v <-- Tests/CMakeLib/CMakeLists.txt
new revision: 1.3; previous revision: 1.2
/cvsroot/CMake/CMake/Tests/CMakeLib/testXMLSafe.cxx,v <-- Tests/CMakeLib/testXMLSafe.cxx
initial revision: 1.1

(0019567) Julien Jomier - 2010-02-21 12:34
Brad King and Dave Cole put a fix into CDash 1.6.