MantisBT - CMake |
View Issue Details |
|
ID | Project | Category | View Status | Date Submitted | Last Update |
0015326 | CMake | CMake | public | 2014-12-24 12:48 | 2015-06-01 08:38 |
|
Reporter | Kwabena W. Agyeman | |
Assigned To | Clinton Stimpson | |
Priority | normal | Severity | feature | Reproducibility | always |
Status | closed | Resolution | fixed | |
Platform | | OS | | OS Version | |
Product Version | CMake 3.1 | |
Target Version | | Fixed in Version | | |
|
Summary | 0015326: CMake can't read UTF-16 source files |
Description | I'm trying to do dependency parsing in cmake for parallax propeller spin files. This involves opening up *.spin code files and parsing them using the file(STRINGS ... REGEX ...) command. Most spin files are saved as ASCII/UTF-8 files and there are no problems parsing these files. However, some spin files are saved in UTF-16 format which allows the use of special schematic characters in the files. Using the file(STRINGS ... REGEX ...) command does not work on UTF-16 files as cmake is unable to parse them.
|
Steps To Reproduce | The below code should find the "DAT" string in the attached file...
file(STRINGS "Quadrature Encoder.spin" OUT REGEX "^[\t ]*[Dd][Aa][Tt].*$")
|
Additional Information | Saving the problem UTF-16 files as UTF-8 files fixes the issue. But, I can't just convert the source files. There's a lot of legacy spin code I need to support that uses the UTF-16 format for the special schematic characters. |
Tags | No tags attached. |
Relationships | |
Attached Files | Quadrature Encoder.spin (32,690) 2014-12-24 12:48 https://public.kitware.com/Bug/file/5332/Quadrature%20Encoder.spin
0001-add-initial-file-STRINGS-utf16-support.patch (2,882) 2015-01-03 14:48 https://public.kitware.com/Bug/file/5337/0001-add-initial-file-STRINGS-utf16-support.patch
0001-file-STRINGS-UTF-16-and-UTF-32-convertible-to-ASCII.patch (3,649) 2015-01-11 15:49 https://public.kitware.com/Bug/file/5341/0001-file-STRINGS-UTF-16-and-UTF-32-convertible-to-ASCII.patch
0001-Add-test-for-file-STRINGS-with-UTF-16-and-UTF-32-enc.patch (3,461) 2015-01-14 00:36 https://public.kitware.com/Bug/file/5343/0001-Add-test-for-file-STRINGS-with-UTF-16-and-UTF-32-enc.patch
0001-Compare-file-STRINGS-encodings.patch (4,045) 2015-01-16 19:54 https://public.kitware.com/Bug/file/5348/0001-Compare-file-STRINGS-encodings.patch
0002-try2-file-STRINGS-to-strings-comparison.patch (1,493) 2015-01-16 20:05 https://public.kitware.com/Bug/file/5349/0002-try2-file-STRINGS-to-strings-comparison.patch
0002-try3-file-STRINGS-to-strings-comparison.patch (1,643) 2015-01-16 20:13 https://public.kitware.com/Bug/file/5350/0002-try3-file-STRINGS-to-strings-comparison.patch
0001-StringFileTest-read-files-using-same-encoding.-Add-t.patch (2,508) 2015-01-23 21:00 https://public.kitware.com/Bug/file/5361/0001-StringFileTest-read-files-using-same-encoding.-Add-t.patch |
|
Issue History |
Date Modified | Username | Field | Change |
2014-12-24 12:48 | Kwabena W. Agyeman | New Issue | |
2014-12-24 12:48 | Kwabena W. Agyeman | File Added: Quadrature Encoder.spin | |
2014-12-24 13:21 | Clinton Stimpson | Note Added: 0037527 | |
2014-12-24 15:07 | Kwabena W. Agyeman | Note Added: 0037528 | |
2014-12-24 15:08 | Kwabena W. Agyeman | Note Edited: 0037528 | bug_revision_view_page.php?bugnote_id=37528#r1664 |
2015-01-03 14:48 | Justin Borodinsky | File Added: 0001-add-initial-file-STRINGS-utf16-support.patch | |
2015-01-03 14:52 | Justin Borodinsky | Note Added: 0037547 | |
2015-01-03 23:35 | Clinton Stimpson | Note Added: 0037548 | |
2015-01-04 10:44 | Kwabena W. Agyeman | Note Added: 0037551 | |
2015-01-08 12:18 | Kwabena W. Agyeman | Note Added: 0037634 | |
2015-01-09 23:38 | Justin Borodinsky | Note Added: 0037658 | |
2015-01-10 16:00 | Clinton Stimpson | Note Added: 0037660 | |
2015-01-11 15:48 | Justin Borodinsky | Note Added: 0037667 | |
2015-01-11 15:49 | Justin Borodinsky | File Added: 0001-file-STRINGS-UTF-16-and-UTF-32-convertible-to-ASCII.patch | |
2015-01-14 00:36 | Clinton Stimpson | File Added: 0001-Add-test-for-file-STRINGS-with-UTF-16-and-UTF-32-enc.patch | |
2015-01-14 00:43 | Clinton Stimpson | Note Added: 0037681 | |
2015-01-16 19:54 | Justin Borodinsky | Note Added: 0037727 | |
2015-01-16 19:54 | Justin Borodinsky | File Added: 0001-Compare-file-STRINGS-encodings.patch | |
2015-01-16 20:05 | Justin Borodinsky | Note Added: 0037728 | |
2015-01-16 20:05 | Justin Borodinsky | File Added: 0002-try2-file-STRINGS-to-strings-comparison.patch | |
2015-01-16 20:13 | Justin Borodinsky | File Added: 0002-try3-file-STRINGS-to-strings-comparison.patch | |
2015-01-21 21:45 | Clinton Stimpson | Note Added: 0037766 | |
2015-01-23 20:59 | Justin Borodinsky | Note Added: 0037801 | |
2015-01-23 21:00 | Justin Borodinsky | File Added: 0001-StringFileTest-read-files-using-same-encoding.-Add-t.patch | |
2015-01-26 10:40 | Clinton Stimpson | Note Added: 0037812 | |
2015-01-26 10:40 | Clinton Stimpson | Status | new => resolved |
2015-01-26 10:40 | Clinton Stimpson | Resolution | open => fixed |
2015-01-26 10:40 | Clinton Stimpson | Assigned To | => Clinton Stimpson |
2015-01-26 20:46 | Justin Borodinsky | Note Added: 0037828 | |
2015-01-27 11:37 | Brad King | Note Added: 0037840 | |
2015-06-01 08:38 | Robert Maynard | Note Added: 0038862 | |
2015-06-01 08:38 | Robert Maynard | Status | resolved => closed |
Notes |
|
(0037527)
|
Clinton Stimpson
|
2014-12-24 13:21
|
|
UTF-8 support was added recently, and UTF-16 could be added as well with a patch to cmFileCommand.cxx.
As a workaround, you can probably add a cmake command to copy the UTF-16 files to your build directory and convert them to UTF-8 to run file(STRINGS...), but still use the original UTF-16 elsewhere. |
|
|
(0037528)
|
Kwabena W. Agyeman
|
2014-12-24 15:07
(edited on: 2014-12-24 15:08) |
|
I will just wait until UTF-16 support is added. I hope this bug gets pulled into the next release since it shouldn't be too much trouble.
I can't depend on any external commands being available for a workaround.
|
|
|
(0037547)
|
Justin Borodinsky
|
2015-01-03 14:52
|
|
I uploaded a patch to master, tested with your file, with the commands below. Currently only tested on Windows.
cmake_minimum_required(VERSION 3.0)
include(CMakePrintHelpers)
project(cmake_test)
file(STRINGS "Quadrature Encoder_BE.spin" OUT_BE REGEX "^[\t ]*[Dd][Aa][Tt].*$" ENCODING UTF-16)
file(STRINGS "Quadrature Encoder.spin" OUT REGEX "^[\t ]*[Dd][Aa][Tt].*$" ENCODING UTF-16)
file(STRINGS "CMakeLists.txt" OUT2 REGEX "project")
cmake_print_variables(OUT_BE OUT OUT2) |
|
|
(0037548)
|
Clinton Stimpson
|
2015-01-03 23:35
|
|
Thanks for providing a proposed patch.
However, I see some issues with the patch:
1. it doesn't work on non-Windows platforms where wchar_t is 32 bits and is commonly used for UTF-32. Encoding::ToNarrow() does a UTF-16 to narrow conversion on Windows and a UTF-32 to narrow conversion on other platforms. Encoding::ToNarrow() cannot be used unconditionally on all platforms to handle UTF-16.
2. I see no handling for files without a BOM. It should handle both binary and text files, with and without BOM. For example, on Linux, I can compile a binary, or make a binary file, which contains any combination of UTF-8, UTF-16LE, UTF-16BE, UTF-32BE, UTF-32LE, and run the command "strings -e b file" to extract only the UTF-16BE strings.
For example, see https://sourceware.org/binutils/docs/binutils/strings.html [^] for how encoding options are specified. I think it would be good to have CMake behave the same as GNU's string tool.
3. There are no checks for invalid UTF-16 sequences. This is especially important to distinguish UTF-16 embedded in a binary file.
4. It fails on UTF-16 surrogate pairs. Instead of reading 2 bytes, one needs to read 4 bytes from the file before converting to UTF-8. |
|
|
(0037551)
|
Kwabena W. Agyeman
|
2015-01-04 10:44
|
|
Thanks for working on this! :) |
|
|
(0037634)
|
Kwabena W. Agyeman
|
2015-01-08 12:18
|
|
@Justin Borodinsky
It would preferable if CMake could autodetect UTF-16 files. They all have a BOM of 0xFE followed by 0xFF. So, it's easy to detect them out from regular text files. I assume CMake can already handle the UTF-8 BOM. http://en.wikipedia.org/wiki/Byte_order_mark [^] |
|
|
(0037658)
|
Justin Borodinsky
|
2015-01-09 23:38
|
|
@Clinton
Thanks for the feedback, a few questions:
1 - I can use char16_t and facets to handle the UTF-16: http://en.cppreference.com/w/cpp/locale/codecvt [^] Let me know if you have any thoughts on using this approach.
2 - strings extracts printable characters using the ASCII code page only (encoded in various forms as you mentioned). It appears from the UTF-8 extraction currently in place that the intent is not to limit files to these codes only. See the STRING_ISGRAPHIC macro in strings.c.
3 - strings distinguishes printable characters by looking for a sequence of printable characters. Invalid UTF-16 is simply skipped. Should CMake do the same?
@Kwabena
Using the BOM to identify the whole file as a certain encoding would eliminate the need to identify strings from a binary file |
|
|
(0037660)
|
Clinton Stimpson
|
2015-01-10 16:00
|
|
Justin,
char16_t is only available with C++11. CMake still needs to be compatible with older compilers, so I don't think that is an option.
Its interesting that GNU strings will extract only UTF-16 that can be converted to ASCII (even some valid UTF-16 such as "Ūnĭcōde̽" is ignored). I guess the same for CMake is fine. I previously thought it did more than that.
It seems those limited requirements will simplify the implementation. |
|
|
(0037667)
|
Justin Borodinsky
|
2015-01-11 15:48
|
|
Hi Clinton,
Yes, the implementation is simple now. Attached is a new proposed patch, used as follows:
cmake_minimum_required(VERSION 3.0)
include(CMakePrintHelpers)
project(cmake_test)
file(STRINGS "Quadrature Encoder_BE.spin" OUT_BE REGEX "^[\t ]*[Dd][Aa][Tt].*$")
file(STRINGS "Quadrature Encoder.spin" OUT REGEX "^[\t ]*[Dd][Aa][Tt].*$")
file(STRINGS "CMakeLists.txt" OUT2 REGEX "project")
#Output should match strings | grep error
file(STRINGS "/usr/lib64/libm.so" OUT_BIN REGEX "error")
#Output should match strings -e L, default strings length is 4
file(STRINGS "/usr/lib64/libm.so" OUT_BIN_UTF32LE REGEX "...." ENCODING UTF-32LE)
cmake_print_variables(OUT_BE OUT OUT2 OUT_BIN OUT_BIN_UTF32LE)
Let me know what you think. Thanks. |
|
|
(0037681)
|
Clinton Stimpson
|
2015-01-14 00:43
|
|
Hi Justin,
I've attached a patch which adds a test to CMake to cover the new functionality in your patch.
The test currently fails and I think your patch needs some fixing.
For example, I'm getting strings back out like this:
"Hello World;; ;;"
or
"H;e;l;l;o;
;W;o;r;l;d;;; "
The test files have 4 lines to them (including a BOM):
1. Hello World
2. Hello World in chinese
3. Hello World in Russian
4. Hello World with a corruption
I think there is a second problem where "strings -e {l,L,b,B}" all ignore the 4th line, but I think your patch is sometimes picking up the 4th line. |
|
|
(0037727)
|
Justin Borodinsky
|
2015-01-16 19:54
|
|
Hi Clinton,
Thanks for the test case, I should of put one together with the patch.
To compare the output to strings you would need to add REGEX "...." since strings by default only outputs a sequence of at least 4 chars. Attached is a slightly modified version of your test, to compare the output to strings. For the 16 calls made the outputs match between StringFileTest and the runstrings.sh script. |
|
|
(0037728)
|
Justin Borodinsky
|
2015-01-16 20:05
|
|
Apologies, didn't mean to step on your patch with the minor mods, attached is a patch to your patch in reference to the note above |
|
|
(0037766)
|
Clinton Stimpson
|
2015-01-21 21:45
|
|
Justin,
I hope I have the right set of patches together, and when I run the test, I get a failure.
Are you seeing this?
...
28: UTF-32LE.txt as UTF-32LE: Hello World
28: UTF-32LE.txt as UTF-32BE:
28: CMake Error at CMakeLists.txt:78 (message):
28: file(STRINGS) incorrectly read from UTF-32LE.txt file []
28:
28:
28: UTF-32BE.txt as UTF-16LE:
28: UTF-32BE.txt as UTF-16BE:
I have my set of patches applied here:
https://github.com/clintonstimpson/CMake/tree/file-strings-utf-16 [^] |
|
|
(0037801)
|
Justin Borodinsky
|
2015-01-23 20:59
|
|
Hi Clinton,
Thanks, yes those were the right set of patches. I am seeing the same failure, but I suppose my question is why does each file need to be read using every other encoding? For example, the failure is when it's reading the UTF-32LE file as UTF-32BE, and it's expecting to match "Hello World", but receives "". The output of 'strings' when reading the UTF-32LE file as UTF-32BE also produces no output.
Attached is another patch to your file-strings-utf16 branch which just simplifies the tests a bit (and also removes the temporary shell script). Is this set of tests sufficient, or was there a reason for the two inner loops I'm missing? |
|
|
(0037812)
|
Clinton Stimpson
|
2015-01-26 10:40
|
|
|
|
(0037828)
|
Justin Borodinsky
|
2015-01-26 20:46
|
|
Great, thank you for your help. |
|
|
(0037840)
|
Brad King
|
2015-01-27 11:37
|
|
|
|
(0038862)
|
Robert Maynard
|
2015-06-01 08:38
|
|
Closing resolved issues that have not been updated in more than 4 months. |
|