MantisBT - CMake
View Issue Details
0015326CMakeCMakepublic2014-12-24 12:482015-06-01 08:38
Kwabena W. Agyeman 
Clinton Stimpson 
normalfeaturealways
closedfixed 
CMake 3.1 
 
0015326: CMake can't read UTF-16 source files
I'm trying to do dependency parsing in cmake for parallax propeller spin files. This involves opening up *.spin code files and parsing them using the file(STRINGS ... REGEX ...) command. Most spin files are saved as ASCII/UTF-8 files and there are no problems parsing these files. However, some spin files are saved in UTF-16 format which allows the use of special schematic characters in the files. Using the file(STRINGS ... REGEX ...) command does not work on UTF-16 files as cmake is unable to parse them.
The below code should find the "DAT" string in the attached file...

file(STRINGS "Quadrature Encoder.spin" OUT REGEX "^[\t ]*[Dd][Aa][Tt].*$")
Saving the problem UTF-16 files as UTF-8 files fixes the issue. But, I can't just convert the source files. There's a lot of legacy spin code I need to support that uses the UTF-16 format for the special schematic characters.
No tags attached.
? Quadrature Encoder.spin (32,690) 2014-12-24 12:48
https://public.kitware.com/Bug/file/5332/Quadrature%20Encoder.spin
patch 0001-add-initial-file-STRINGS-utf16-support.patch (2,882) 2015-01-03 14:48
https://public.kitware.com/Bug/file/5337/0001-add-initial-file-STRINGS-utf16-support.patch
patch 0001-file-STRINGS-UTF-16-and-UTF-32-convertible-to-ASCII.patch (3,649) 2015-01-11 15:49
https://public.kitware.com/Bug/file/5341/0001-file-STRINGS-UTF-16-and-UTF-32-convertible-to-ASCII.patch
patch 0001-Add-test-for-file-STRINGS-with-UTF-16-and-UTF-32-enc.patch (3,461) 2015-01-14 00:36
https://public.kitware.com/Bug/file/5343/0001-Add-test-for-file-STRINGS-with-UTF-16-and-UTF-32-enc.patch
patch 0001-Compare-file-STRINGS-encodings.patch (4,045) 2015-01-16 19:54
https://public.kitware.com/Bug/file/5348/0001-Compare-file-STRINGS-encodings.patch
patch 0002-try2-file-STRINGS-to-strings-comparison.patch (1,493) 2015-01-16 20:05
https://public.kitware.com/Bug/file/5349/0002-try2-file-STRINGS-to-strings-comparison.patch
patch 0002-try3-file-STRINGS-to-strings-comparison.patch (1,643) 2015-01-16 20:13
https://public.kitware.com/Bug/file/5350/0002-try3-file-STRINGS-to-strings-comparison.patch
patch 0001-StringFileTest-read-files-using-same-encoding.-Add-t.patch (2,508) 2015-01-23 21:00
https://public.kitware.com/Bug/file/5361/0001-StringFileTest-read-files-using-same-encoding.-Add-t.patch
Issue History
2014-12-24 12:48Kwabena W. AgyemanNew Issue
2014-12-24 12:48Kwabena W. AgyemanFile Added: Quadrature Encoder.spin
2014-12-24 13:21Clinton StimpsonNote Added: 0037527
2014-12-24 15:07Kwabena W. AgyemanNote Added: 0037528
2014-12-24 15:08Kwabena W. AgyemanNote Edited: 0037528bug_revision_view_page.php?bugnote_id=37528#r1664
2015-01-03 14:48Justin BorodinskyFile Added: 0001-add-initial-file-STRINGS-utf16-support.patch
2015-01-03 14:52Justin BorodinskyNote Added: 0037547
2015-01-03 23:35Clinton StimpsonNote Added: 0037548
2015-01-04 10:44Kwabena W. AgyemanNote Added: 0037551
2015-01-08 12:18Kwabena W. AgyemanNote Added: 0037634
2015-01-09 23:38Justin BorodinskyNote Added: 0037658
2015-01-10 16:00Clinton StimpsonNote Added: 0037660
2015-01-11 15:48Justin BorodinskyNote Added: 0037667
2015-01-11 15:49Justin BorodinskyFile Added: 0001-file-STRINGS-UTF-16-and-UTF-32-convertible-to-ASCII.patch
2015-01-14 00:36Clinton StimpsonFile Added: 0001-Add-test-for-file-STRINGS-with-UTF-16-and-UTF-32-enc.patch
2015-01-14 00:43Clinton StimpsonNote Added: 0037681
2015-01-16 19:54Justin BorodinskyNote Added: 0037727
2015-01-16 19:54Justin BorodinskyFile Added: 0001-Compare-file-STRINGS-encodings.patch
2015-01-16 20:05Justin BorodinskyNote Added: 0037728
2015-01-16 20:05Justin BorodinskyFile Added: 0002-try2-file-STRINGS-to-strings-comparison.patch
2015-01-16 20:13Justin BorodinskyFile Added: 0002-try3-file-STRINGS-to-strings-comparison.patch
2015-01-21 21:45Clinton StimpsonNote Added: 0037766
2015-01-23 20:59Justin BorodinskyNote Added: 0037801
2015-01-23 21:00Justin BorodinskyFile Added: 0001-StringFileTest-read-files-using-same-encoding.-Add-t.patch
2015-01-26 10:40Clinton StimpsonNote Added: 0037812
2015-01-26 10:40Clinton StimpsonStatusnew => resolved
2015-01-26 10:40Clinton StimpsonResolutionopen => fixed
2015-01-26 10:40Clinton StimpsonAssigned To => Clinton Stimpson
2015-01-26 20:46Justin BorodinskyNote Added: 0037828
2015-01-27 11:37Brad KingNote Added: 0037840
2015-06-01 08:38Robert MaynardNote Added: 0038862
2015-06-01 08:38Robert MaynardStatusresolved => closed

Notes
(0037527)
Clinton Stimpson   
2014-12-24 13:21   
UTF-8 support was added recently, and UTF-16 could be added as well with a patch to cmFileCommand.cxx.

As a workaround, you can probably add a cmake command to copy the UTF-16 files to your build directory and convert them to UTF-8 to run file(STRINGS...), but still use the original UTF-16 elsewhere.
(0037528)
Kwabena W. Agyeman   
2014-12-24 15:07   
(edited on: 2014-12-24 15:08)
I will just wait until UTF-16 support is added. I hope this bug gets pulled into the next release since it shouldn't be too much trouble.

I can't depend on any external commands being available for a workaround.

(0037547)
Justin Borodinsky   
2015-01-03 14:52   
I uploaded a patch to master, tested with your file, with the commands below. Currently only tested on Windows.

cmake_minimum_required(VERSION 3.0)
include(CMakePrintHelpers)
project(cmake_test)

file(STRINGS "Quadrature Encoder_BE.spin" OUT_BE REGEX "^[\t ]*[Dd][Aa][Tt].*$" ENCODING UTF-16)
file(STRINGS "Quadrature Encoder.spin" OUT REGEX "^[\t ]*[Dd][Aa][Tt].*$" ENCODING UTF-16)

file(STRINGS "CMakeLists.txt" OUT2 REGEX "project")

cmake_print_variables(OUT_BE OUT OUT2)
(0037548)
Clinton Stimpson   
2015-01-03 23:35   
Thanks for providing a proposed patch.

However, I see some issues with the patch:

1. it doesn't work on non-Windows platforms where wchar_t is 32 bits and is commonly used for UTF-32. Encoding::ToNarrow() does a UTF-16 to narrow conversion on Windows and a UTF-32 to narrow conversion on other platforms. Encoding::ToNarrow() cannot be used unconditionally on all platforms to handle UTF-16.

2. I see no handling for files without a BOM. It should handle both binary and text files, with and without BOM. For example, on Linux, I can compile a binary, or make a binary file, which contains any combination of UTF-8, UTF-16LE, UTF-16BE, UTF-32BE, UTF-32LE, and run the command "strings -e b file" to extract only the UTF-16BE strings.
For example, see https://sourceware.org/binutils/docs/binutils/strings.html [^] for how encoding options are specified. I think it would be good to have CMake behave the same as GNU's string tool.

3. There are no checks for invalid UTF-16 sequences. This is especially important to distinguish UTF-16 embedded in a binary file.

4. It fails on UTF-16 surrogate pairs. Instead of reading 2 bytes, one needs to read 4 bytes from the file before converting to UTF-8.
(0037551)
Kwabena W. Agyeman   
2015-01-04 10:44   
Thanks for working on this! :)
(0037634)
Kwabena W. Agyeman   
2015-01-08 12:18   
@Justin Borodinsky

It would preferable if CMake could autodetect UTF-16 files. They all have a BOM of 0xFE followed by 0xFF. So, it's easy to detect them out from regular text files. I assume CMake can already handle the UTF-8 BOM. http://en.wikipedia.org/wiki/Byte_order_mark [^]
(0037658)
Justin Borodinsky   
2015-01-09 23:38   
@Clinton

Thanks for the feedback, a few questions:
1 - I can use char16_t and facets to handle the UTF-16: http://en.cppreference.com/w/cpp/locale/codecvt [^] Let me know if you have any thoughts on using this approach.
2 - strings extracts printable characters using the ASCII code page only (encoded in various forms as you mentioned). It appears from the UTF-8 extraction currently in place that the intent is not to limit files to these codes only. See the STRING_ISGRAPHIC macro in strings.c.
3 - strings distinguishes printable characters by looking for a sequence of printable characters. Invalid UTF-16 is simply skipped. Should CMake do the same?

@Kwabena
Using the BOM to identify the whole file as a certain encoding would eliminate the need to identify strings from a binary file
(0037660)
Clinton Stimpson   
2015-01-10 16:00   
Justin,

char16_t is only available with C++11. CMake still needs to be compatible with older compilers, so I don't think that is an option.

Its interesting that GNU strings will extract only UTF-16 that can be converted to ASCII (even some valid UTF-16 such as "Ūnĭcōde̽" is ignored). I guess the same for CMake is fine. I previously thought it did more than that.

It seems those limited requirements will simplify the implementation.
(0037667)
Justin Borodinsky   
2015-01-11 15:48   
Hi Clinton,

Yes, the implementation is simple now. Attached is a new proposed patch, used as follows:

cmake_minimum_required(VERSION 3.0)
include(CMakePrintHelpers)
project(cmake_test)

file(STRINGS "Quadrature Encoder_BE.spin" OUT_BE REGEX "^[\t ]*[Dd][Aa][Tt].*$")
file(STRINGS "Quadrature Encoder.spin" OUT REGEX "^[\t ]*[Dd][Aa][Tt].*$")
file(STRINGS "CMakeLists.txt" OUT2 REGEX "project")

#Output should match strings | grep error
file(STRINGS "/usr/lib64/libm.so" OUT_BIN REGEX "error")

#Output should match strings -e L, default strings length is 4
file(STRINGS "/usr/lib64/libm.so" OUT_BIN_UTF32LE REGEX "...." ENCODING UTF-32LE)

cmake_print_variables(OUT_BE OUT OUT2 OUT_BIN OUT_BIN_UTF32LE)

Let me know what you think. Thanks.
(0037681)
Clinton Stimpson   
2015-01-14 00:43   
Hi Justin,

I've attached a patch which adds a test to CMake to cover the new functionality in your patch.
The test currently fails and I think your patch needs some fixing.

For example, I'm getting strings back out like this:
"Hello World;; ;;"
or
"H;e;l;l;o;
  ;W;o;r;l;d;;; "

The test files have 4 lines to them (including a BOM):
1. Hello World
2. Hello World in chinese
3. Hello World in Russian
4. Hello World with a corruption

I think there is a second problem where "strings -e {l,L,b,B}" all ignore the 4th line, but I think your patch is sometimes picking up the 4th line.
(0037727)
Justin Borodinsky   
2015-01-16 19:54   
Hi Clinton,

Thanks for the test case, I should of put one together with the patch.

To compare the output to strings you would need to add REGEX "...." since strings by default only outputs a sequence of at least 4 chars. Attached is a slightly modified version of your test, to compare the output to strings. For the 16 calls made the outputs match between StringFileTest and the runstrings.sh script.
(0037728)
Justin Borodinsky   
2015-01-16 20:05   
Apologies, didn't mean to step on your patch with the minor mods, attached is a patch to your patch in reference to the note above
(0037766)
Clinton Stimpson   
2015-01-21 21:45   
Justin,

I hope I have the right set of patches together, and when I run the test, I get a failure.
Are you seeing this?

...
28: UTF-32LE.txt as UTF-32LE: Hello World
28: UTF-32LE.txt as UTF-32BE:
28: CMake Error at CMakeLists.txt:78 (message):
28: file(STRINGS) incorrectly read from UTF-32LE.txt file []
28:
28:
28: UTF-32BE.txt as UTF-16LE:
28: UTF-32BE.txt as UTF-16BE:

I have my set of patches applied here:
https://github.com/clintonstimpson/CMake/tree/file-strings-utf-16 [^]
(0037801)
Justin Borodinsky   
2015-01-23 20:59   
Hi Clinton,

Thanks, yes those were the right set of patches. I am seeing the same failure, but I suppose my question is why does each file need to be read using every other encoding? For example, the failure is when it's reading the UTF-32LE file as UTF-32BE, and it's expecting to match "Hello World", but receives "". The output of 'strings' when reading the UTF-32LE file as UTF-32BE also produces no output.

Attached is another patch to your file-strings-utf16 branch which just simplifies the tests a bit (and also removes the temporary shell script). Is this set of tests sufficient, or was there a reason for the two inner loops I'm missing?
(0037812)
Clinton Stimpson   
2015-01-26 10:40   
I was just trying to be thorough in testing, but if you don't think its really necessary, that is fine.

Thanks for your work on this. I've merged it for you.
http://www.cmake.org/gitweb?p=cmake.git;a=commit;h=d4506d17 [^]
(0037828)
Justin Borodinsky   
2015-01-26 20:46   
Great, thank you for your help.
(0037840)
Brad King   
2015-01-27 11:37   
The revised commit implementing this has now been merged to 'master':

 file: Teach STRINGS to support UTF-16 and UTF-32 encodings
 http://cmake.org/gitweb?p=cmake.git;a=commitdiff;h=1f77a700 [^]

Thanks!
(0038862)
Robert Maynard   
2015-06-01 08:38   
Closing resolved issues that have not been updated in more than 4 months.