[cmake-developers] [PATCH] Improve encoding handling on Windows

Sat Jul 2 19:48:33 EDT 2016

2016-07-01 17:41 GMT+03:00 Mike Gelfand <mikedld at mikedld.com>:

>
> Since you already have "bool SystemTools::GetEnv(const char* key,
> std::string& result)", another option would be to use it everywhere and
> maybe introduce something like "bool SystemTools::HasEnv(const char*
> key)" for those several cases where you only need to check the existence.
>
>
I implemented this, it's actually really nice API to work with.

2016-07-02 1:54 GMT+03:00 <clinton at elemtech.com>:

>
> ...
>
> > +#if defined(_WIN32)
> > +  // Kinda hack, with MSVC (and MinGW) for some reason std::wcout
> > +  // (and all other std::w*) fails once it encounters non-ASCII
> > +  // string unless locale is set.
> > +  // Note that even with this, seems it can't output characters
> > +  // which aren't present in ANSI codepage no matter what encoding
> > +  // is used for console.
> > +  // Also once any character outside of ANSI codepage is tried to
> > +  // be outputed then after there anymore won't be output from
> > +  // any of std::w* functions.
> > +  _wsetlocale(LC_ALL, L"");
> > +#endif
>
> The _wsetlocale() may be a problem.
>
> See:
> https://gitlab.kitware.com/cmake/cmake/commit/87be2e142
> https://cmake.org/Bug/view.php?id=16099
>
>
>
Indeed, good catch, thanks, I didn't thought about this. But even now
most these locale aware functions like tolower/toupper and others are
wrong because internally we use UTF-8 and there 1 character can take
more than 1 byte so these functions won't work correctly for some strings
even if we don't set any locale.
Now here we actually set it only on Windows because there just isn't any
other way. Without setting locale we get only ASCII support and can't
output even ANSI characters. With locale we can atleast output ANSI
characters.
Currently Microsoft C++ library doesn't support UTF-8/UTF-16 locales.
Only way to output Unicode would be implement our own std::wcout which
would use wide WinAPI to write to console.

Anyway quick fix is to always use English locale then number parsing
will be expected and set user's codepage.

  std::wstring locale = L"English_United States.";
  locale += std::to_wstring(GetACP());
  _wsetlocale(LC_ALL, locale.c_str());

Of course proper Unicode support will be needed some day, but for now
this is still an improvement.

> @@ -30,8 +30,23 @@ namespace @KWSYS_NAMESPACE@
> >       typedef std::basic_filebuf<CharType,Traits> my_base_type;
> >       basic_filebuf *open(char const *s,std::ios_base::openmode mode)
> >       {
> > +        std::wstring wstr = Encoding::ToWide(s);
> > +        const wchar_t *ws = wstr.c_str();
> > +#if defined(_MSC_VER) && _MSC_VER >= 1400
> > +        const wchar_t *ss = ws;
> > +#else
> > +        const char *ss = 0;
> > +        size_t length = std::wcstombs(0, ws, 0);
> > +        if (length != size_t(-1)) {
> > +          std::vector<char> str(length + 1);
> > +          ss = str.data();
> > +          std::wcstombs((char *)ss, ws, str.size());
> > +        } else {
> > +          ss = s;
> > +        }
> > +#endif
> >         return static_cast<basic_filebuf*>(
> > -          my_base_type::open(Encoding::ToWide(s).c_str(), mode)
> > +          my_base_type::open(ss, mode)
> >           );
>
> Yes.  It makes sense to convert from utf-8 to code page when we are not
> using the wide apis.
> This would at least give you support for the current code page, beyond
> ascii.   Beyond that, the wide functions should be used.
> Which compiler are you trying to support here, which doesn't give a wide
> open()?
>
>
Only MSVC have ofstream::open(const wchar_t *) so for MinGW need to use
ofstream::open(const char *) or use wofstream::open(const wchar_t *) which
would
require quite big changes.

> >       }
> >   };
> > diff --git a/Source/kwsys/ProcessWin32.c b/Source/kwsys/ProcessWin32.c
> > index 2b93e69..208e725 100644
> > --- a/Source/kwsys/ProcessWin32.c
> > +++ b/Source/kwsys/ProcessWin32.c
> > @@ -181,7 +181,7 @@ struct kwsysProcessPipeData_s
> >   /* ------------- Data managed per call to Execute ------------- */
> >
> >   /* Buffer for data read in this pipe's thread.  */
> > -  char DataBuffer[KWSYSPE_PIPE_BUFFER_SIZE];
> > +  char DataBuffer[KWSYSPE_PIPE_BUFFER_SIZE*2];
>
> This "*2" assumes all characters cmake sees will fit in the Basic
> Multilingual Plane.
> Are we OK assuming that?
> If the conversion from a code page to utf-16 results in surrogate pairs,
> there may not be enough space.
>
>
Not really, it assumes character string from ANSI code page encoded
to internal encoding (UTF-8) will not take more than 2x space. Which I
think is a pretty good guess, because here we're processing output
from a process and so output will be paths and texts usually in
ASCII with not many characters which would require more than 2 bytes.
And it will still work if say 1/4th uses 4 bytes and 1 byte for rest 3/4th.

1024 * 1/4 * 4 bytes + 1024 * 3/4 * 1 byte (1792) < 1024 * 2 (2048)

But of course can increase it to 4x, only I think it's very unlikely
that it will be used.

>
> >
> >   /* The length of the data stored in the buffer.  */
> >   DWORD DataLength;
> > @@ -1626,6 +1626,25 @@ void kwsysProcessPipeThreadReadPipe(kwsysProcess*
> cp,
> > kwsysProcessPipeData* td)
> >       KWSYSPE_DEBUG((stderr, "read closed %d\n", td->Index));
> >       }
> >
> > +    if (td->DataLength > 0) {
> > +        UINT codepage = GetConsoleCP();
> > +        if (!codepage) {
> > +            codepage = GetACP();
>
>
> Can this be stored in kwsysProcessCreateInformation, and not computed ever
> time we read data?
>
>
>
Not kwsysProcessCreateInformation, because we can't access it from this
function, but
we can use kwsysProcess and I'll change to that.

>
> > +        }
> > +        if (codepage != KWSYS_ENCODING_DEFAULT_CODEPAGE) {
> > +            int wlength = MultiByteToWideChar(codepage, 0,
> td->DataBuffer,
> > td->DataLength, NULL, 0);
> > +            wchar_t* wdata = malloc(wlength * sizeof(wchar_t));
> > +            int r = MultiByteToWideChar(codepage, 0, td->DataBuffer,
> > td->DataLength, wdata, wlength);
> > +            if (r > 0) {
> > +                r =
> WideCharToMultiByte(KWSYS_ENCODING_DEFAULT_CODEPAGE, 0,
> > wdata, wlength, td->DataBuffer, KWSYSPE_PIPE_BUFFER_SIZE * 2, NULL,
> NULL);
> > +                if (r > 0) {
> > +                    td->DataLength = r;
> > +                }
> > +            }
> > +            free(wdata);
> > +        }
> > +    }
>
> How about avoiding a malloc()/free() each time we read an array?
>
>
I don't really understand what you mean, I used malloc there because we
can't know beforehand how many bytes we actually will need before DataBuffer
is filled. Also I don't see there any array reading. All process output is
read with
single ReadFile call so I don't see how malloc could cause any overhead. Do
you
want that this buffer is added to kwsysProcessPipeData as static array?

> ProcessWin32.c already uses Encoding.h.  How about using the Encoding
> instead of
> WideCharToMultiByte(...)?  I'm fine either way.
>
>
There we get length from td->DataLength and so I assume we might get
non-null
terminated data in which case we can't use those Encoding functions because
they
expect source to be null-terminated.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://public.kitware.com/pipermail/cmake-developers/attachments/20160703/b4efa4cd/attachment.html>