[cmake-developers] CMake daemon for user tools

Mon Jan 25 11:05:55 EST 2016

> -----Original Message-----
> From: Milian Wolff [mailto:mail at milianw.de]
> Sent: Saturday, January 23, 2016 15:41
> To: cmake-developers at cmake.org
> Cc: James Johnston
> Subject: Re: [cmake-developers] CMake daemon for user tools
> 
> You are aware that modern std::string is SSO'ed? I'm running on such a
> system.
> Another reason why you should not reinvent the wheel and keep relying on
> the STL wherever possible.
>
<snip>
> Qt has such a class, it's called QVarLengthArray, and I've also been able
> to apply it in multiple occasions to good effect. That said, when you look at

Yeah, but std::string's small-string optimization is implementation
dependent, and so is the size of its internal buffer.  Maybe it tends to be
optimal for CMake.  Then again, maybe a larger buffer is needed.  I don't
know.  The flexible option would be something that works like
QVarLengthArray, where the inline size is chosen per use.  Different
variables might have different optimal sizes.

Some sample small-string sizes for gcc/clang/VC++:
http://stackoverflow.com/a/28003328/562766
Note that none of them is large enough to store an absolute path, and
absolute paths are maybe common (???) in CMake.  There's also a fair bit of
variation; if CMake wants consistent performance in a section of code across
compilers, it would need a way to explicitly specify the small string size.
For example, some are large enough to store typical target names, and some
maybe are not.
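
To check what a given toolchain actually provides, a minimal C++ sketch
along these lines would do.  SsoCapacity and FitsInline are made-up helper
names, not CMake code, and the value reported is implementation specific
rather than anything the standard guarantees:

```cpp
#include <cstddef>
#include <string>

// Capacity of a default-constructed std::string.  On implementations that
// use the small-string optimization, this is the number of characters that
// fit in the inline buffer without a heap allocation.  The value is
// implementation specific.
inline std::size_t SsoCapacity()
{
  return std::string().capacity();
}

// True if `s` is short enough to fit in the inline buffer reported above.
// (Hypothetical helper, for illustration only.)
inline bool FitsInline(const std::string& s)
{
  return s.size() <= SsoCapacity();
}
```

Feeding FitsInline a few representative target names and absolute paths
from a real build tree would show quickly whether the default buffer is big
enough on each compiler.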

There is also boost::container::small_vector in addition to QVarLengthArray:
http://www.boost.org/doc/libs/1_60_0/doc/html/boost/container/small_vector.html

> Just run cmake (or the daemon) through a profiler and check the results.
> Doing so for the daemon (built with RelWithDebInfo) on the LLVM build dir
> and recording it with `perf --call-graph lbr` I get these hotspots when
> looking at the results with `perf report -g graph --no-children`:
> 
> +    8.67%  cmake      cmake                [.] cmGlobalGenerator::FindGeneratorTargetImpl
> +    4.21%  cmake      libc-2.22.so         [.] _int_malloc
> +    2.67%  cmake      cmake                [.] cmCommandArgument_yylex
> +    2.09%  cmake      libc-2.22.so         [.] _int_free
> +    2.06%  cmake      libc-2.22.so         [.] __memcmp_sse4_1
> +    1.84%  cmake      libc-2.22.so         [.] malloc
> 
> This already shows you that you can gain a lot by reducing the number of
> allocations done. Heaptrack is a good tool for that.

The next question would be: who is calling malloc?  Or rather, what % of the
callers are std::string, std::vector, or other STL classes vs. custom data
structures?  And after that: what is the size of those mallocs, for each
caller?

(Sorry, I don't currently have an environment set up with a profiler to test
this myself.)
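
Short of a heap profiler, one way to get at the sizes for a particular class
of caller is to swap in an instrumented allocator.  This is only a sketch
under my own assumptions: CountingAllocator, AllocationHistogram and
CountedString are invented names, nothing like this exists in CMake as far
as I know, and the histogram is not thread safe:

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Histogram: requested byte count -> number of allocations of that size.
// The map itself uses std::allocator, so recording does not recurse.
inline std::map<std::size_t, std::size_t>& AllocationHistogram()
{
  static std::map<std::size_t, std::size_t> histogram;
  return histogram;
}

// Minimal allocator that records every request, then forwards to
// std::allocator.  (Hypothetical name, for illustration only.)
template <class T>
struct CountingAllocator
{
  using value_type = T;

  CountingAllocator() = default;
  template <class U>
  CountingAllocator(const CountingAllocator<U>&) noexcept {}

  T* allocate(std::size_t n)
  {
    ++AllocationHistogram()[n * sizeof(T)];
    return std::allocator<T>().allocate(n);
  }

  void deallocate(T* p, std::size_t n) noexcept
  {
    std::allocator<T>().deallocate(p, n);
  }
};

// All instances are interchangeable, since they carry no state.
template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept
{
  return true;
}

template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept
{
  return false;
}

// A string type whose heap allocations show up in the histogram.
using CountedString =
  std::basic_string<char, std::char_traits<char>, CountingAllocator<char>>;
```

Strings that fit in the inline buffer never reach the allocator, so they do
not appear in the histogram at all; what is left is the distribution of
sizes that actually hit the heap.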

> Similarly, someone should
> investigate cmGlobalGenerator::FindGeneratorTargetImpl. That does a lot of
> string comparisons to find targets from my quick glance, so indeed could be
> sped up with a smarter string class.
> 
> But potentially you could also get a much quicker lookup by storing a hash
> map of target name to cmGeneratorTarget.

Indeed; there has got to be a way to reduce the complexity of that function
in terms of the number of targets compared, if not the cost of the low-level
string comparison itself as well.  For example, if target names are
short-ish, the string class has a large enough SSO buffer, and the
underlying string class makes use of vector CPU instructions for comparison,
then there is probably very little left to gain at the string level, and the
improvement would have to come from such a hash map.  (On the other hand, if
some of those assumptions are not true on some common CMake platforms...)
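
To make the comparison concrete, here is a rough C++ sketch of the two
lookup strategies.  Target is a stand-in I made up for cmGeneratorTarget,
and FindLinear, BuildIndex and FindIndexed are hypothetical names; I have
not checked how FindGeneratorTargetImpl is actually written:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for cmGeneratorTarget; only the name matters here.
struct Target
{
  std::string Name;
};

// Linear scan: one string comparison per target until a match is found.
inline const Target* FindLinear(const std::vector<Target>& targets,
                                const std::string& name)
{
  for (const Target& t : targets) {
    if (t.Name == name) {
      return &t;
    }
  }
  return nullptr;
}

// Name -> target index.  The pointers refer into `targets`, so the vector
// must not be resized while the index is in use.
using TargetIndex = std::unordered_map<std::string, const Target*>;

inline TargetIndex BuildIndex(const std::vector<Target>& targets)
{
  TargetIndex index;
  index.reserve(targets.size());
  for (const Target& t : targets) {
    // emplace does not overwrite, so the first target with a given name
    // wins, matching what the linear scan returns.
    index.emplace(t.Name, &t);
  }
  return index;
}

// Hashed lookup: one hash plus, on average, one string comparison.
inline const Target* FindIndexed(const TargetIndex& index,
                                 const std::string& name)
{
  auto it = index.find(name);
  return it == index.end() ? nullptr : it->second;
}
```

The index costs one pass over the targets to build, so it only pays off if
the same set of targets is queried many times, which seems likely given how
hot that function is in the profile.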

> Seems like there's more than enough areas one could (and should) optimize
> CMake.

Indeed.  Another idea - probably unrelated to the string allocations issue,
but still - that came to mind: what if link-time code
generation/optimization is turned on?  IIRC this is not the default in
CMake.  Maybe CMake is sufficiently well-organized (e.g. small function
implementations moved to header files) that whatever needs to be inlined
across translation units is already being inlined.  Then again, maybe it's
not.  I've seen other projects rely on this feature to keep a clean
organization, with implementations in .CPP files, without sacrificing
performance; when you turn off LTCG there, performance takes a major hit...

Also IIRC there are still a few optimizations that are turned off when CMake
is built with RelWithDebInfo instead of Release.  I forget the specifics at
the moment, but e.g. when you ask Visual C++ to turn on debug symbols, it
changes the default values of some optimization flags.  So a cursory
examination of the flags wouldn't reveal all cases.

However, one of my bigger performance gripes, as a primarily Windows
developer right now, is the process creation overhead, especially during
configuration.  I think that completely dominates any CMake code being run
internally.  It would be nice if that could be parallelized on my 6-core
hyper-threaded CPU, but doing so properly probably isn't so easy...

Best regards,

James Johnston