Size reduction of "hello world" bundle compiled by javascript-backend
Long-running issue (more like epic than issue)
Motivation
Hello World bundle size is:
6224 -rw-r--r-- 1 gulinserge staff 5398132 Mar 23 20:39 all.js
Reducing this ~5.4MB bundle would help GHC reach a wider range of web development applications.
Proposal
The size reduction strategy can be considered along the following broad directions:
- JavaScript-centric. Use widely known existing tools to reduce the JS bundle size, such as (at least) Google Closure Compiler and UglifyJS. Here we gather information on how existing bundle compression tools work with hello-world. Where they fail for reasonable cases, we apply fixes and improvements to the javascript-backend build process.
- Environment-related. JavaScript code runs in a rich environment where many basic things are already defined and implemented. For example, browsers already provide some Unicode and locale support via Intl, which can probably be reused. So our hello-world bundle could shrink by dropping functionality that environments already implement through the ECMAScript standard.
- Translation-leaned. GHCJS had built-in tooling for better deduplication called Compactor. That approach could be ported to the GHC JavaScript backend and enabled in the same fashion via a compilation option like -dedupe.
Currently this issue is about the first direction (JavaScript-centric), as the other options still require a deeper dive on my side.
JavaScript-centric
Using existing JavaScript tools to reduce the size of the bundle output is a straightforward task. Honestly, it is the first thing that comes to mind.
Compare how much UglifyJS and Closure Compiler each save for hello-world:
$ uglifyjs --compress --mangle --output ./HelloJS.jsexe/all.min.uglifyjs.js -- ./HelloJS.jsexe/all.js
$ ls -als ./HelloJS.jsexe
...
4968 -rw-r--r-- 1 gulinserge staff 5084786 Mar 24 19:11 all.min.uglifyjs.js
Running Closure Compiler requires some changes to the GHC code (see the origin issue; they will be prepared as an MR once the preceding MR is applied), because otherwise Closure Compiler raises errors.
What if we run without these changes?
At first we hit the following:
./HelloJS.jsexe/all.js:12611:9: ERROR - [JSC_VAR_MULTIPLY_DECLARED_ERROR] Variable h$rts_isProfiled declared more than once. First occurrence: ./HelloJS.jsexe/all.js:8723:9
12611| function h$rts_isProfiled() {
The origin issue has some investigation into how it happened.
Commit !11447 (diffs) added a duplicate of h$rts_isProfiled, which was already defined by 08d8e9ef.
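A minimal sketch of why such a duplicate goes unnoticed without a static checker. It assumes both definitions end up concatenated into one script, as in all.js; the function bodies are placeholders, not the real RTS code. In plain JavaScript the later declaration silently replaces the earlier one, and Closure Compiler flags this as JSC_VAR_MULTIPLY_DECLARED_ERROR.

```javascript
// Two hypothetical sources that both define the same top-level function.
// The bodies are placeholders, not the real RTS code.
const firstDefinition = "function h$rts_isProfiled() { return 'first'; }";
const secondDefinition = "function h$rts_isProfiled() { return 'second'; }";

// Concatenate them the way a bundle concatenates its inputs, then evaluate.
// The Function constructor runs the body as non-strict function code.
const runBundle = new Function(
  firstDefinition + "\n" + secondDefinition + "\nreturn h$rts_isProfiled();"
);

// No error is raised at runtime: the later declaration wins.
const winner = runBundle();
console.log(winner);
```

Because nothing fails at runtime, only a static pass over the whole bundle can point at the second definition.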
Then, if we fix it locally (as the MR suggests), we hit the following on the latest master (with the MR applied):
./HelloJS.jsexe/all.js:7795:8: ERROR - [JSC_UNDEFINED_VARIABLE] variable java is undeclared
7795| java.lang.System.out.print(s);
^^^^
./HelloJS.jsexe/all.js:8432:27: ERROR - [JSC_UNDEFINED_VARIABLE] variable Buffer is undeclared
8432| h$fs.read(real_fd, Buffer.alloc(n), 0, n, pos, function(err, bytesRead, nbuf) {
^^^^^^
./HelloJS.jsexe/all.js:8495:8: ERROR - [JSC_UNDEFINED_VARIABLE] variable putstr is undeclared
8495| putstr(h$decodeUtf8(buf, n, buf_offset));
^^^^^^
./HelloJS.jsexe/all.js:8499:8: ERROR - [JSC_UNDEFINED_VARIABLE] variable printErr is undeclared
8499| printErr(h$decodeUtf8(buf, n, buf_offset));
^^^^^^^^
./HelloJS.jsexe/all.js:8507:37: ERROR - [JSC_UNDEFINED_VARIABLE] variable debug is undeclared
8507| var h$base_stderrLeftover = { f: debug, val: null };
^^^^^
./HelloJS.jsexe/all.js:9747:68: ERROR - [JSC_UNDEFINED_VARIABLE] variable null_ is undeclared
9747| var p = (((ptr_d).arr && (ptr_d).arr[off]) ? (ptr_d).arr[off] : null_); var o = (ptr_d).dv.getInt32(off,true);;
^^^^^
./HelloJS.jsexe/all.js:10273:26: ERROR - [JSC_UNDEFINED_VARIABLE] variable j is undeclared
10273| for(j=x.length-1;j>=0;j--) {
^
./HelloJS.jsexe/all.js:12247:9: ERROR - [JSC_UNDEFINED_VARIABLE] variable ptr is undeclared
12247| return ptr = h$initHeapBufferLen(str_d, str_o, str_d.len);
^^^
./HelloJS.jsexe/all.js:12498:25: ERROR - [JSC_UNDEFINED_VARIABLE] variable h$setCcs_e is undeclared
12498| if(h$stack[h$sp] !== h$setCcs_e) {
^^^^^^^^^^
./HelloJS.jsexe/all.js:12808:11: ERROR - [JSC_UNDEFINED_VARIABLE] variable h$vt_int is undeclared
12808| case h$vt_int:
^^^^^^^^
./HelloJS.jsexe/all.js:12941:22: ERROR - [JSC_UNDEFINED_VARIABLE] variable h$checkInvariants_e is undeclared
12941| h$stack[h$sp] = h$checkInvariants_e;
^^^^^^^^^^^^^^^^^^^
./HelloJS.jsexe/all.js:13834:33: ERROR - [JSC_UNDEFINED_VARIABLE] variable arr is undeclared
13834| return h$charCodeArrayToString(arr);
^^^
./HelloJS.jsexe/all.js:13925:12: ERROR - [JSC_UNDEFINED_VARIABLE] variable h$EINVAL is undeclared
13925| h$errno = h$EINVAL;
^^^^^^^^
./HelloJS.jsexe/all.js:14112:2: ERROR - [JSC_UNDEFINED_VARIABLE] variable h$fds is undeclared
14112| h$fds[fd].waitRead.push(h$currentThread);
^^^^^
./HelloJS.jsexe/all.js:24517:13: ERROR - [JSC_UNDEFINED_VARIABLE] variable h$stg_cloneMyStackzh is undeclared
24517| if(c) {var g=h$stg_cloneMyStackzh();
^^^^^^^^^^^^^^^^^^^^
./HelloJS.jsexe/all.js:24518:6: ERROR - [JSC_UNDEFINED_VARIABLE] variable h$stg_decodeStackzh is undeclared
24518| var i=h$stg_decodeStackzh(g);
^^^^^^^^^^^^^^^^^^^
./HelloJS.jsexe/all.js:25814:6: ERROR - [JSC_UNDEFINED_VARIABLE] variable h$libdwLookupLocation is undeclared
25814| var h=h$libdwLookupLocation(d,e,a,c,f,g);
^^^^^^^^^^^^^^^^^^^^^
./HelloJS.jsexe/all.js:26282:21: ERROR - [JSC_UNDEFINED_VARIABLE] variable h$libdwPoolRelease is undeclared
26282| h$r3=h$mkFunctionPtr(h$libdwPoolRelease);
^^^^^^^^^^^^^^^^^^
./HelloJS.jsexe/all.js:26290:2: ERROR - [JSC_UNDEFINED_VARIABLE] variable h$libdwPoolTake is undeclared
26290| a=h$libdwPoolTake();
^^^^^^^^^^^^^^^
./HelloJS.jsexe/all.js:26307:2: ERROR - [JSC_UNDEFINED_VARIABLE] variable h$libdwGetBacktrace is undeclared
26307| c=h$libdwGetBacktrace(a,b);
^^^^^^^^^^^^^^^^^^^
./HelloJS.jsexe/all.js:26335:21: ERROR - [JSC_UNDEFINED_VARIABLE] variable h$backtraceFree is undeclared
26335| h$r3=h$mkFunctionPtr(h$backtraceFree);
^^^^^^^^^^^^^^^
./HelloJS.jsexe/all.js:33154:5: ERROR - [JSC_UNDEFINED_VARIABLE] variable h$fdReady is undeclared
33154| h$r1=h$fdReady(a,1,f,(c>>>0),0);
^^^^^^^^^
./HelloJS.jsexe/all.js:37760:5: ERROR - [JSC_UNDEFINED_VARIABLE] variable h$lookupIPE is undeclared
37760| h$r1=h$lookupIPE(a,c,d,0);
^^^^^^^^^^^
./HelloJS.jsexe/all.js:47403:0: ERROR - [JSC_UNDEFINED_VARIABLE] variable h$reportStackOverflow is undeclared
47403| h$reportStackOverflow(c);
^^^^^^^^^^^^^^^^^^^^^
./HelloJS.jsexe/all.js:47407:8: ERROR - [JSC_UNDEFINED_VARIABLE] variable h$reportHeapOverflow is undeclared
47407| case(2):h$reportHeapOverflow();
^^^^^^^^^^^^^^^^^^^^
./HelloJS.jsexe/all.js:57067:4: ERROR - [JSC_VAR_MULTIPLY_DECLARED_ERROR] Variable fs declared more than once. First occurrence: ./HelloJS.jsexe/all.js:7767:8
57067| var fs, nodePath;
^^
26 error(s), 87 warning(s)
make: *** [Makefile:8: closure] Error 26
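Several of the JSC_UNDEFINED_VARIABLE reports above (for example j, ptr and arr) look like assignments to or uses of names that were never declared. The sketch below shows why that matters beyond the compiler complaining, under the assumption that the bundle is eventually run as strict code (the Closure Compiler invocation used in this issue passes --emit_use_strict). The loop is modeled on the reported line, and the counter name is made up for the example.

```javascript
// Evaluate a snippet and report whether it throws, and with which error type.
function tryRun(body) {
  try {
    new Function(body)();
    return "ok";
  } catch (err) {
    return err.name;
  }
}

// Hypothetical loop modeled on the reported line: the counter is never declared.
const loop =
  "var x = [1, 2, 3];" +
  "for (sketchCounter = x.length - 1; sketchCounter >= 0; sketchCounter--) {}";

// Strict code rejects the assignment to an undeclared name.
const strictResult = tryRun('"use strict"; ' + loop);

// Non-strict code accepts it and silently creates a global variable.
const sloppyResult = tryRun(loop);

// Clean up the global leaked by the non-strict run.
delete globalThis.sketchCounter;

console.log(strictResult, sloppyResult);
```

The strict run is done first on purpose: once the non-strict run has leaked the global, the name would resolve and the strict run would no longer fail.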
Once all changes are applied to the master branch of GHC, it will be possible to run Google Closure Compiler:
$ google-closure-compiler --isolation_mode IIFE --assume_function_wrapper --emit_use_strict --externs ./HelloJS.jsexe/all.js.externs --compilation_level ADVANCED_OPTIMIZATIONS ./HelloJS.jsexe/all.js ./hello.externs.js --js_output_file ./HelloJS.jsexe/all.min.js
$ ls -als ./HelloJS.jsexe
...
4160 -rw-r--r-- 1 gulinserge staff 3775580 Mar 23 16:50 all.min.js
For reference, ./hello.externs.js is as follows:
/**
 * @externs
 */
/** @type {*} */
Module.HEAP8;
if (typeof process !== 'undefined') {
  /** @type {*} */
  process.stdin.on;
}
/** @type {*} */
var __dirname = typeof __dirname != 'undefined' ? __dirname : undefined;
Why don't we test SWC?
SWC is a good tool, but it supports tree shaking only for CommonJS-enabled bundle output, which is not currently available with the GHC javascript-backend. If and when we support it, we could try this compiler as well, and it will very probably do the job better than the others.
| | Origin | UglifyJS (+ mangle) | Google Closure Compiler (Advanced) |
|---|---|---|---|
| Size (bytes) | 5398132 | 5084786 | 3775580 |
It is no surprise that Closure Compiler beats UglifyJS here. Google Closure Compiler performs deep static code analysis to find dead code and eliminate it, which is also why it was able to point us at the errors above.
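To put the table in relative terms, the savings can be computed directly from the byte counts. The snippet below is plain arithmetic over the three sizes reported above; the helper name is made up for the example.

```javascript
// Sizes in bytes, taken from the table above.
const sizes = {
  origin: 5398132,
  uglifyjs: 5084786,
  closureAdvanced: 3775580,
};

// Percentage saved relative to the original bundle, rounded to one decimal.
function reductionPercent(original, reduced) {
  return Math.round(((original - reduced) / original) * 1000) / 10;
}

const uglifySaving = reductionPercent(sizes.origin, sizes.uglifyjs);
const closureSaving = reductionPercent(sizes.origin, sizes.closureAdvanced);

console.log(uglifySaving, closureSaving); // 5.8 30.1
```

So UglifyJS saves about 5.8% and Closure Compiler in advanced mode about 30.1% of the original size.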
Little summary
Using existing tools to compress generated bundles is useful. To make it happen, we still need to change the GHC code so that its output works with Google Closure Compiler. To avoid hitting such issues again, it would be nice to add Google Closure Compiler to the test suite and run it at least over the hello-world bundle, so that mistakes in the JavaScript code that static checks can detect are caught automatically.
Environment-related
A middle way of optimization is to make good use of the environment in which our JavaScript bundles are supposed to run. I reviewed our code for supported environments and, it seems, we are going to support as many as possible.
function h$isNode() {
  return h$isNode_;
}
function h$isJvm() {
  return h$isJvm_;
}
function h$isJsShell() {
  return h$isJsShell_;
}
function h$isJsCore() {
  return h$isJsCore_;
}
function h$isBrowser() {
  return h$isBrowser_;
}
function h$isGHCJSi() {
  return h$isGHCJSi_;
}
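The accessors above only return precomputed flags. For readers unfamiliar with how such flags are usually derived, here is a hedged sketch of typical feature checks. The function detectEnvironment and its checks are assumptions based on common idioms, not the actual GHC RTS logic, and only three of the six environments are covered.

```javascript
// True for non-null objects (typeof null is "object", so check explicitly).
function isObject(value) {
  return typeof value === "object" && value !== null;
}

// Hypothetical detection over a global-like object `g`.
// These are common idioms, not the actual GHC RTS implementation.
function detectEnvironment(g) {
  // Node.js exposes process.versions.node as a version string.
  const isNode =
    isObject(g.process) &&
    isObject(g.process.versions) &&
    typeof g.process.versions.node === "string";
  // JVM-hosted engines expose a global `java` package object (assumed check).
  const isJvm = !isNode && isObject(g.java);
  // Browsers expose both `window` and `document`.
  const isBrowser =
    !isNode && !isJvm && isObject(g.window) && isObject(g.document);
  return { isNode, isJvm, isBrowser };
}

// Usage: pass the real global object.
console.log(detectEnvironment(globalThis));
```

Taking the global object as a parameter keeps the checks testable with fake environments.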
Besides the wide list of JVM ones (Rhino, Nashorn, GraalVM), browser support requires even more effort because of the very large number of differences between browsers.
So the first step here is to define what exactly we are going to support, and to leave the details to tools that specialize in graceful degradation on the modern web. My point of view is simple: we could target the latest ECMAScript and leave the rest to tools like babel. That's how the JavaScript world copes with its huge number of little differences.
If we take that route, the problem becomes simpler.
Okay, suppose we have limited our support to the latest ECMAScript only. Why do I bring it up at all? Even after applying the first direction (JavaScript-centric) we are still dealing with ~3.8MB. That is still quite a large volume for an average browser app. Let's take a deeper look into our bundle.
I did the following:
- Processed our bundle via Google Closure Compiler (with all patches enabled) but with the --debug option. It produces output without dead code but also without name mangling.
- Counted the characters on every line. The most interesting part is the content of out.js. Its structure is very simple: every line contains one expression translated from Haskell. For example: var h$ghczminternalZCGHCziInternalziUnicodezizdtrModule4_1=h$rawStringData([103,104,99,45,105,110,116,101,114,110,97,108]);
- Calculated the sum of all characters (without newlines), divided each line's character count by that sum, and selected a threshold to limit noise.
- Drew a graph of the accumulated weighted line lengths.
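The measurement steps above can be sketched as a small function. This is a reconstruction under assumptions, not the script actually used: the function name is made up, and the threshold is interpreted as "drop lines whose share of the total is below this value".

```javascript
// For each line, compute its share of all characters (newlines excluded)
// and the running sum of those shares, suitable for plotting.
// Assumption: lines with a share below `threshold` are dropped as noise.
function cumulativeLineWeights(source, threshold = 0) {
  const lengths = source.split("\n").map((line) => line.length);
  const total = lengths.reduce((sum, len) => sum + len, 0);
  const points = [];
  let cumulative = 0;
  lengths.forEach((len, index) => {
    const weight = total === 0 ? 0 : len / total;
    if (weight < threshold) return; // skip noise
    cumulative += weight;
    points.push({ line: index + 1, weight, cumulative });
  });
  return points;
}

// Usage on a tiny made-up input: one long line dominates the total.
const samplePoints = cumulativeLineWeights("aaaa\nbb\nbb");
console.log(samplePoints);
```

A line that holds most of the bundle shows up as a steep jump in the cumulative value.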
It has a very interesting structure.
In the image above you will note that the size grows slowly most of the time, but there is an area where it grows very fast. I took a look into it and found the following:
h$ghczminternalZCGHCziInternalziUnicodeziCharziUnicodeDataziGeneralCategoryzilvl_1=h$rawStringData([25,25,25,25,25,25,25,25,25,25,25,25,25,25,25,25,25......
That is, GHC.Internal.Unicode.Char.UnicodeData.GeneralCategory.lvl_1. It is related to this. One of these files takes up to 3.5MB in sources.
Here is my assumption: we could avoid bundling the Unicode data into JavaScript builds because modern browsers and environments provide Intl. I cut this line out of my minified bundle, and its size became 0.5MB.
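As a hedged illustration of the Unicode data that engines already ship: besides Intl, ECMAScript 2018 regular expressions support Unicode property escapes, which can answer General_Category questions without a bundled table. Whether this can actually replace the GHC table is an open question; the helper below is a made-up sketch, not a proposed implementation.

```javascript
// Check whether a single code point belongs to a Unicode General_Category,
// using the engine's built-in Unicode tables (ES2018 property escapes).
// `category` is a short alias such as "Lu" (uppercase letter) or "Nd" (decimal digit).
// An unknown category makes the RegExp constructor throw a SyntaxError.
function hasGeneralCategory(ch, category) {
  const pattern = new RegExp("^\\p{" + category + "}$", "u");
  return pattern.test(ch);
}

// Usage:
console.log(hasGeneralCategory("A", "Lu")); // true
console.log(hasGeneralCategory("a", "Lu")); // false
console.log(hasGeneralCategory("5", "Nd")); // true
```

Note that this answers "is this character in category X?" rather than "which category is this character in?", so a full replacement would need one check per category or a different mechanism.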
Little summary
The JavaScript environment is rich but very fragmented. In the best case, though, we could drastically reduce our ~5.4MB bundle to a notably small ~0.5MB. That is roughly TEN times smaller. To make it happen we should decide which JavaScript environments we support, how, and with which limitations. After that we could step into the riskier part: swapping GHC internals for environment implementations. For an average web app where size is critical the total win is about 10x.
Translation-leaned
This is the last direction I can currently see, but not the least. Our Haskell->JavaScript translation process is not perfect. It leaves out things that are very important in modern JavaScript development, such as CommonJS module wrapping, source map generation, removal of duplication that originates in the Haskell source, and type information for JavaScript tools. I think there are other things, but these came to mind first.
Each of the points above could have a positive impact on JavaScript code compression. For example, CommonJS module wrapping opens the door to modern JavaScript compilers such as SWC. See the list for details. Source maps open the door to modern JavaScript tooling that analyzes what takes up space in the bundle (besides the nice-to-have feature of showing sources right in the browser, with breakpoints).
As a first attempt, though, I would point you to the existing -dedupe option in GHCJS, which is implemented by Compactor. Its results are described on Reddit. In my view, it is good to have anything that could also help with size reduction.
It would be nice to port it to the GHC JavaScript backend and see how it goes. This issue is not about that (it still requires deep investigation on my part), but it looks pretty promising.
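To make the general idea of deduplication concrete, here is a toy sketch. It is explicitly not how GHCJS's Compactor works (that still needs investigation, as noted above); it only shows that definitions with identical bodies can collapse to one canonical name. All names and bodies are made up.

```javascript
// Toy deduplication: definitions with identical bodies collapse to one canonical name.
// This only illustrates the general idea; it is not GHCJS's Compactor.
function dedupeDefinitions(definitions) {
  const canonicalByBody = new Map(); // body text -> first name seen with that body
  const renames = {};                // duplicate name -> canonical name
  const kept = {};                   // canonical name -> body
  for (const [name, body] of Object.entries(definitions)) {
    if (canonicalByBody.has(body)) {
      renames[name] = canonicalByBody.get(body);
    } else {
      canonicalByBody.set(body, name);
      kept[name] = body;
    }
  }
  return { kept, renames };
}

// Usage with hypothetical generated definitions:
const dedupeResult = dedupeDefinitions({
  h$sketchA: "h$rawStringData([103,104,99])",
  h$sketchB: "h$rawStringData([1,2,3])",
  h$sketchC: "h$rawStringData([103,104,99])",
});
console.log(dedupeResult);
```

A real implementation would also have to rewrite every reference to a renamed definition and compare structure rather than raw text.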
Little summary
This direction has barely been explored so far, but I bet we would hit two targets with one shot here: gain some size reduction and improve the overall development experience of using the GHC javascript-backend.
Summary
The suggested changes already have the potential for a 10x improvement for hello-world. I am open to any questions and, as you may guess, most sentences in this issue could be expanded into a longer description later. I hope that, with the blessing of the GHC community, we can improve the current situation with bundle size. As you can see, code size reduction is not only about size. It points us toward further-reaching things that could be a game changer for the role of GHC in modern client-side web development.