GHC issues
https://gitlab.haskell.org/ghc/ghc/-/issues
2021-02-10T08:21:58Z
https://gitlab.haskell.org/ghc/ghc/-/issues/13624
loadObj() does not respect alignment
Trevor L. McDonell
This is perhaps known, but I'll write it down here in case somebody else runs into this problem as well.
Since `loadObj()` just `mmap()`s the entire object file and decodes it *in place*, it does not respect the alignment requirements specified in the section headers. This is a problem for instructions that require aligned memory operands, e.g. some SSE and AVX loads and stores.
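To make the failure mode concrete, here is a minimal sketch of the check a loader would need, and one possible way to satisfy it for a data section. All three helper names are hypothetical and are not taken from the RTS linker. The sketch assumes the section's required alignment (for ELF, the `sh_addralign` field) is a power of two, and that copying is acceptable for read-only data such as a constant pool. Executable sections would additionally need memory mapped with execute permission, which this does not cover.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: is `addr` a multiple of `align` (a power of two)? */
static int is_aligned(uintptr_t addr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}

/* Hypothetical helper: when the file is decoded in place, a section lives at
 * the mapping base plus its file offset. The base is page aligned, so the
 * section's alignment is determined entirely by the file offset. */
static uintptr_t section_addr_in_place(uintptr_t image_base, size_t file_offset)
{
    return image_base + file_offset;
}

/* Hypothetical helper: copy a data section into freshly allocated memory that
 * honours the requested alignment. Returns NULL on allocation failure.
 * The caller releases the copy with free(). */
static void *copy_section_aligned(const void *src, size_t size, size_t align)
{
    void *dst = NULL;

    /* posix_memalign requires a power-of-two multiple of sizeof(void *). */
    if (align < sizeof(void *))
        align = sizeof(void *);
    if (posix_memalign(&dst, align, size) != 0)
        return NULL;
    memcpy(dst, src, size);
    return dst;
}
```

A loader could call `is_aligned(section_addr_in_place(base, offset), align)` for each section and fall back to `copy_section_aligned` only when the check fails, so that already-aligned sections keep being used in place. Relocations would then have to be resolved against the copied address rather than the mapped one.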
The attached `map.ll` program is `map (+1)` over an array of floating point numbers. In particular, the core loop is 8-way SIMD vectorised and 4-way unrolled, processing 32 elements per loop iteration. A tail loop handles any remainder one element at a time.
You can compile it using `llc -filetype=obj -mcpu=native map.ll`. For a CPU with AVX instructions (Sandy Bridge or later) you should get the following:
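The loop structure can be sketched in scalar C as follows. This is only an illustration of the split between the 32-element core loop and the tail loop. The function name and signature are hypothetical and do not match the calling convention of the attached `map.ll`.

```c
#include <stddef.h>

/* Hypothetical scalar model of `map (+1)`.
 * Returns how many elements were handled by the 32-wide core loop, so the
 * caller can see whether the vectorised path would have been entered. */
static size_t map_plus_one(const float *in, float *out, size_t n)
{
    size_t core_n = n & ~(size_t)31; /* largest multiple of 32 that is <= n */
    size_t i = 0;

    /* Core loop: 32 elements per iteration (8 lanes x 4-way unrolling in the
     * real code; written as a plain inner loop here). */
    for (; i < core_n; i += 32)
        for (size_t j = 0; j < 32; j++)
            out[i + j] = in[i + j] + 1.0f;

    /* Tail loop: the remaining n % 32 elements, one at a time. */
    for (; i < n; i++)
        out[i] = in[i] + 1.0f;

    return core_n;
}
```

With fewer than 32 elements `core_n` is 0, so only the tail loop runs. In the compiled object the tail loop uses the scalar `vmovss` load rather than the aligned vector load of the core loop.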
```
$ objdump -d map.o
Disassembly of section .text:
0000000000000000 <map>:
0: 49 89 f3 mov %rsi,%r11
3: 49 29 fb sub %rdi,%r11
6: 0f 8e f9 00 00 00 jle 105 <map+0x105>
c: 49 83 fb 20 cmp $0x20,%r11
10: 0f 82 bd 00 00 00 jb d3 <map+0xd3>
16: 4d 89 da mov %r11,%r10
19: 49 83 e2 e0 and $0xffffffffffffffe0,%r10
1d: 4d 89 d9 mov %r11,%r9
20: 49 83 e1 e0 and $0xffffffffffffffe0,%r9
24: 0f 84 a9 00 00 00 je d3 <map+0xd3>
2a: 49 01 fa add %rdi,%r10
2d: 48 8d 44 ba 60 lea 0x60(%rdx,%rdi,4),%rax
32: 49 8d 7c b8 60 lea 0x60(%r8,%rdi,4),%rdi
37: c5 fc 28 05 00 00 00 vmovaps 0x0(%rip),%ymm0 # 3f <map+0x3f>
3e: 00
3f: 4c 89 c9 mov %r9,%rcx
42: 66 66 66 66 66 2e 0f data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
49: 1f 84 00 00 00 00 00
50: c5 f8 10 4f a0 vmovups -0x60(%rdi),%xmm1
55: c5 f8 10 57 c0 vmovups -0x40(%rdi),%xmm2
5a: c5 f8 10 5f e0 vmovups -0x20(%rdi),%xmm3
5f: c5 f8 10 27 vmovups (%rdi),%xmm4
63: c4 e3 75 18 4f b0 01 vinsertf128 $0x1,-0x50(%rdi),%ymm1,%ymm1
6a: c4 e3 6d 18 57 d0 01 vinsertf128 $0x1,-0x30(%rdi),%ymm2,%ymm2
71: c4 e3 65 18 5f f0 01 vinsertf128 $0x1,-0x10(%rdi),%ymm3,%ymm3
78: c4 e3 5d 18 67 10 01 vinsertf128 $0x1,0x10(%rdi),%ymm4,%ymm4
7f: c5 f4 58 c8 vaddps %ymm0,%ymm1,%ymm1
83: c5 ec 58 d0 vaddps %ymm0,%ymm2,%ymm2
87: c5 e4 58 d8 vaddps %ymm0,%ymm3,%ymm3
8b: c5 dc 58 e0 vaddps %ymm0,%ymm4,%ymm4
8f: c4 e3 7d 19 48 b0 01 vextractf128 $0x1,%ymm1,-0x50(%rax)
96: c5 f8 11 48 a0 vmovups %xmm1,-0x60(%rax)
9b: c4 e3 7d 19 50 d0 01 vextractf128 $0x1,%ymm2,-0x30(%rax)
a2: c5 f8 11 50 c0 vmovups %xmm2,-0x40(%rax)
a7: c4 e3 7d 19 58 f0 01 vextractf128 $0x1,%ymm3,-0x10(%rax)
ae: c5 f8 11 58 e0 vmovups %xmm3,-0x20(%rax)
b3: c4 e3 7d 19 60 10 01 vextractf128 $0x1,%ymm4,0x10(%rax)
ba: c5 f8 11 20 vmovups %xmm4,(%rax)
be: 48 83 e8 80 sub $0xffffffffffffff80,%rax
c2: 48 83 ef 80 sub $0xffffffffffffff80,%rdi
c6: 48 83 c1 e0 add $0xffffffffffffffe0,%rcx
ca: 75 84 jne 50 <map+0x50>
cc: 4d 39 cb cmp %r9,%r11
cf: 75 05 jne d6 <map+0xd6>
d1: eb 32 jmp 105 <map+0x105>
d3: 49 89 fa mov %rdi,%r10
d6: 4c 29 d6 sub %r10,%rsi
d9: 4a 8d 04 92 lea (%rdx,%r10,4),%rax
dd: 4b 8d 0c 90 lea (%r8,%r10,4),%rcx
e1: c5 fa 10 05 00 00 00 vmovss 0x0(%rip),%xmm0 # e9 <map+0xe9>
e8: 00
e9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
f0: c5 fa 58 09 vaddss (%rcx),%xmm0,%xmm1
f4: c5 fa 11 08 vmovss %xmm1,(%rax)
f8: 48 83 c0 04 add $0x4,%rax
fc: 48 83 c1 04 add $0x4,%rcx
100: 48 ff ce dec %rsi
103: 75 eb jne f0 <map+0xf0>
105: c5 f8 77 vzeroupper
108: c3 retq
```
The attached `test.c` will load the object file and try to execute it. The `#define N` on line 7 will change the size of the array. For fewer than 32 elements this works as expected (where the input array is \[0..N-1\]):
```
$ ./build.sh
+ llc-4.0 -filetype=obj -mcpu=native map.ll
+ ghc --make -no-hs-main test.c
$ ./a.out
array size is 31
calling function...
ok
1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 13.0 14.0 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0 25.0 26.0 27.0 28.0 29.0 30.0 31.0
```
For 32 or more elements (i.e. when the core loop is entered) the program will (almost certainly) segfault.
```
$ lldb a.out
(lldb) target create "a.out"
Current executable set to 'a.out' (x86_64).
(lldb) run
Process 7294 launched: '<snip>/a.out' (x86_64)
array size is 32
calling function...
Process 7294 stopped
* thread #1: tid = 0xc41676, 0x000000010019f207, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
frame #0: 0x000000010019f207
-> 0x10019f207: vmovaps 0xe1(%rip), %ymm0
0x10019f20f: movq %r9, %rcx
0x10019f212: nopw %cs:(%rax,%rax)
0x10019f220: vmovups -0x60(%rdi), %xmm1
```
The `VMOVAPS` instruction with a `%ymm` register requires its memory operand to be 32-byte aligned. Here it is attempting to load 8 floats from one of the const sections (the ones for the +1), but because the section was not loaded at the required alignment, the load faults.
I've tested this on x86_64 macOS (Mach-O) and Ubuntu (ELF). I don't have any other systems to test on.
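The misalignment can be read directly off the lldb output above. A RIP-relative operand is addressed relative to the start of the *next* instruction, so the load target is `0x10019f20f + 0xe1`. The sketch below does that arithmetic. Both helper names are hypothetical, and it assumes lldb prints the raw displacement encoded in the instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: effective address of a RIP-relative operand, given the
 * address of the following instruction and the signed 32-bit displacement. */
static uint64_t rip_relative_target(uint64_t next_insn, int32_t disp)
{
    return next_insn + (uint64_t)(int64_t)disp;
}

/* Hypothetical helper: how many bytes `addr` lies past the previous multiple
 * of `align` (a power of two). Zero means the address is aligned. */
static uint64_t misalignment(uint64_t addr, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & (align - 1);
}
```

For the trace above, `rip_relative_target(0x10019f20f, 0xe1)` gives `0x10019f2f0`, and `misalignment(0x10019f2f0, 32)` is 16. The constant is therefore 16-byte aligned but not 32-byte aligned, which is consistent with the general protection fault on the 256-bit aligned load.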
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ----------------------- |
| Version | 8.0.1 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Runtime System (Linker) |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
8.10.2
Artem Pyanykh