Skip to content
GitLab
Projects Groups Snippets
  • /
  • Help
    • Help
    • Support
    • Community forum
    • Submit feedback
  • Sign in / Register
  • GHC GHC
  • Project information
    • Project information
    • Activity
    • Labels
    • Members
  • Repository
    • Repository
    • Files
    • Commits
    • Branches
    • Tags
    • Contributors
    • Graph
    • Compare
    • Locked Files
  • Issues 5,241
    • Issues 5,241
    • List
    • Boards
    • Service Desk
    • Milestones
    • Iterations
  • Merge requests 567
    • Merge requests 567
  • CI/CD
    • CI/CD
    • Pipelines
    • Jobs
    • Schedules
    • Test Cases
  • Deployments
    • Deployments
    • Releases
  • Analytics
    • Analytics
    • Value stream
    • CI/CD
    • Code review
    • Insights
    • Issue
    • Repository
  • Wiki
    • Wiki
  • Snippets
    • Snippets
  • Activity
  • Graph
  • Create a new issue
  • Jobs
  • Commits
  • Issue Boards
Collapse sidebar
  • Glasgow Haskell CompilerGlasgow Haskell Compiler
  • GHCGHC
  • Issues
  • #16014
Closed
Open
Issue created Dec 08, 2018 by Andreas Klebinger@AndreasKDeveloper

Avoid unnecessary code duplication from String Literals.

When we compile a function like this:

foo :: Int -> String
foo 1 = "1_String"
foo 2 = "2_String"
foo 3 = "3_String"
foo 4 = "4_String"
foo 5 = "5_String"
foo _ = "unknown_string"

We get a few things.

Obviously there are the actual strings:

-- RHS size: {terms: 1, types: 0, coercions: 0, joins: 0/0}
Foo.foo10 :: GHC.Prim.Addr#
Foo.foo10 = "1_String"#

There is no way around these so that's fine.

Then we end up with a function to unpack the actual string.

-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
Foo.foo9 :: [Char]
[GblId,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=False, ConLike=True,
         WorkFree=False, Expandable=True, Guidance=IF_ARGS [] 20 0}]
Foo.foo9 = GHC.CString.unpackCString# Foo.foo10

We want to have a closure for each of these. Otherwise we end up constructing a new string each time, which would be a waste. So on the surface this is fine. However we end up with one of these for each string.

Looking at the Cmm/Asm the functions are not THAT small.

[Foo.foo9_entry() //  [R1]
         { info_tbls: [(c2hj,
                        label: Foo.foo9_info
                        rep: HeapRep static { Thunk }
                        srt: Nothing)]
           stack_info: arg_space: 8 updfr_space: Just 8
         }
     {offset
       c2hj: // global
           if ((Sp + -16) < SpLim) (likely: False) goto c2hk; else goto c2hl;
       c2hk: // global
           R1 = R1;
           call (stg_gc_enter_1)(R1) args: 8, res: 0, upd: 8;
       c2hl: // global
           (_c2hg::I64) = call "ccall" arg hints:  [PtrHint,
                                                    PtrHint]  result hints:  [PtrHint] newCAF(BaseReg, R1);
           if (_c2hg::I64 == 0) goto c2hi; else goto c2hh;
       c2hi: // global
           call (I64[R1])() args: 8, res: 0, upd: 8;
       c2hh: // global
           I64[Sp - 16] = stg_bh_upd_frame_info;
           I64[Sp - 8] = _c2hg::I64;
           R2 = Foo.foo10_bytes;
           Sp = Sp - 16;
           call GHC.CString.unpackCString#_info(R2) args: 24, res: 0, upd: 24;
     }
 }

In actual code size (HEAD/-O2) the code + info table is about 100Byte.

To give an idea in my current ghc-stage2 I have 12,409 calls to unpackCString, and I assume 99% of them are from actual strings. So that comes out to about one megabyte. Out of a binary size of 84MB, so surprisingly significant actually!

Now what can we do about this!

I think the right way to go here is to define a unpackString function which does the actual work. Then we jump into that from a small per-string wrapper.

So we would end up with code like:

[Foo.foo9_entry() //  [R1]
    c2hj: 
        // R1 = Closure pointer
        R2 = Foo.foo10_bytes;
        call (unpackString)(R1,R2) args: 8, res: 0, upd: 8;
 }

Which should be well under 50 bytes. It does cause an extra indirection when unpacking the String, but if the performance of that matters using String is likely the wrong choice to begin with.

How to actually get GHC to generate code like that I'm not sure yet.

Trac metadata
Trac field Value
Version 8.6.2
Type Task
TypeOfFailure OtherFailure
Priority normal
Resolution Unresolved
Component Compiler
Test case
Differential revisions
BlockedBy
Related
Blocking
CC
Operating system
Architecture
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Assignee
Assign to
Time tracking