GHC issues
https://gitlab.haskell.org/ghc/ghc/-/issues
2021-10-04T14:19:35Z

https://gitlab.haskell.org/ghc/ghc/-/issues/16810
Use explicit lazy binding around undefined in inlinable functions
2021-10-04T14:19:35Z
kazu-yamamoto
# Summary
`undefined` in inlinable functions is unintentionally evaluated if the function is called strictly. An example:
```
{-# INLINE alloca #-}
alloca :: forall a b . Storable a => (Ptr a -> IO b) -> IO b
alloca =
  allocaBytesAligned (sizeOf (undefined :: a)) (alignment (undefined :: a))
```
# Steps to reproduce
Use `Foreign.Marshal.Alloc.alloca` with the `Strict` language extension.
# Expected behavior
It should allocate memory without evaluating `undefined`. If `undefined` is used in inlinable functions, lazy bindings must be used explicitly for GHC 8.0 or later.
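
For illustration only, here is a minimal sketch of what an explicit lazy pattern looks like in a module compiled with `Strict`. The names `sizeStrict` and `sizeLazy` are hypothetical and are not part of `base`; the report does not say what shape the actual fix would take.

```haskell
{-# LANGUAGE Strict #-}
module Main (main, sizeStrict, sizeLazy) where

-- Hypothetical example functions, not taken from base.

-- Under Strict, a plain variable pattern is treated as if it carried a
-- bang, so this definition forces its argument before returning.
sizeStrict :: Int -> Int
sizeStrict _proxy = 8

-- The explicit `~` restores the ordinary lazy behaviour: the argument is
-- never forced, so passing `undefined` as a type proxy is safe.
sizeLazy :: Int -> Int
sizeLazy ~_proxy = 8

main :: IO ()
main = print (sizeLazy (undefined :: Int))
```

Calling `sizeStrict (undefined :: Int)` instead would evaluate the `undefined`, which is the kind of failure described above.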
# Environment
* GHC version used: 8.0 or later
Optional:
8.9

https://gitlab.haskell.org/ghc/ghc/-/issues/15972
Test nofib tests for correctness in CI
2021-03-05T16:28:26Z
Ben Gamari
Apparently the result of the `spectral/fft2` nofib test changed at some point in the last few months. The difference is likely acceptable given the nature of the test but it's nevertheless worrying that we are only finding this out now. We really should test nofib tests for correctness as part of CI.
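
As a rough sketch of what such a correctness check could look like, the snippet below compares a benchmark's actual output with a recorded expected output and reports the first differing line. The function `firstMismatch` and the sample strings are hypothetical; this is not how nofib or GHC's CI is actually wired up, and an exact comparison like this may be too strict for floating-point output such as that of `fft2`.

```haskell
module Main (main, firstMismatch) where

-- | Hypothetical helper: return the first line (1-based) at which the
-- expected and actual outputs differ, or Nothing if they are identical.
firstMismatch :: String -> String -> Maybe (Int, String, String)
firstMismatch expected actual = go 1 (lines expected) (lines actual)
  where
    go :: Int -> [String] -> [String] -> Maybe (Int, String, String)
    go _ [] [] = Nothing
    go n (e : es) (a : rest)
      | e == a = go (n + 1) es rest
      | otherwise = Just (n, e, a)
    go n (e : _) [] = Just (n, e, "<missing line>")
    go n [] (a : _) = Just (n, "<missing line>", a)

main :: IO ()
main = do
  -- Made-up sample data standing in for a recorded and a fresh run.
  let expected = "1.0\n2.0\n"
      actual = "1.0\n2.5\n"
  case firstMismatch expected actual of
    Nothing -> putStrLn "output matches"
    Just (n, e, a) ->
      putStrLn ("line " ++ show n ++ ": expected " ++ show e ++ ", got " ++ show a)
```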
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ---------------------- |
| Version | 8.6.2 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | high |
| Resolution | Unresolved |
| Component | Continuous Integration |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
8.9
Ben Gamari

https://gitlab.haskell.org/ghc/ghc/-/issues/16052
Core optimizations for memset on a small range
2020-03-25T10:12:22Z
Andrew Martin
I've been doing some API bindings lately that require zeroing out memory before poking values into the appropriate places. Sometimes, these are small data structures. For instance, on Linux, the internet socket struct `sockaddr_in` is 16 bytes. Here's an example (not involving `sockaddr_in`) of the kind of situation that arises:
```haskell
{-# language MagicHash #-}
{-# language UnboxedTuples #-}

module FillArray
  ( fill
  ) where

import GHC.Exts
import GHC.IO

data ByteArray = ByteArray ByteArray#

fill :: IO ByteArray
fill = IO $ \s0 -> case newByteArray# 24# s0 of
  (# s1, m #) -> case setByteArray# m 0# 24# 0# s1 of
    s2 -> case writeWord8Array# m 4# 14## s2 of
      s3 -> case writeWord8Array# m 5# 15## s3 of
        s4 -> case unsafeFreezeByteArray# m s4 of
          (# s5, r #) -> (# s5, ByteArray r #)
```
This `fill` function allocates a 24-byte array, sets everything to zero, and then writes the numbers 14 and 15 to elements 4 and 5 respectively. With `-O2`, here's the relevant part of the core we get:
```haskell
fill1
fill1
  = \ s0_a140 ->
      case newByteArray# 24# s0_a140 of { (# ipv_s16i, ipv1_s16j #) ->
      case setByteArray# ipv1_s16j 0# 24# 0# ipv_s16i of s2_a143
      { __DEFAULT ->
      case writeWord8Array# ipv1_s16j 4# 14## s2_a143 of s3_a144
      { __DEFAULT ->
      case writeWord8Array# ipv1_s16j 5# 15## s3_a144 of s4_a145
      { __DEFAULT ->
      case unsafeFreezeByteArray# ipv1_s16j s4_a145 of
      { (# ipv2_s16p, ipv3_s16q #) ->
      (# ipv2_s16p, ByteArray ipv3_s16q #)
      }
      }
      }
      }
      }
```
And, here's the relevant assembly:
```asm
fill1_info:
_c1kL:
    addq $56,%r12
    cmpq 856(%r13),%r12
    ja _c1kP
_c1kO:
    movq $stg_ARR_WORDS_info,-48(%r12)
    movq $24,-40(%r12)
    leaq -48(%r12),%rax
    subq $8,%rsp
    leaq 16(%rax),%rdi
    xorl %esi,%esi
    movl $24,%edx
    movq %rax,%rbx
    xorl %eax,%eax
    call memset
    addq $8,%rsp
    movb $14,20(%rbx)
    movb $15,21(%rbx)
    movq $ByteArray_con_info,-8(%r12)
    movq %rbx,(%r12)
    leaq -7(%r12),%rbx
    jmp *(%rbp)
_c1kP:
    movq $56,904(%r13)
    movl $fill1_closure,%ebx
    jmp *-8(%r13)
    .size fill1_info, .-fill1_info
```
What a bummer that using `memset` for something as small as setting three machine words (on a 64-bit platform) results in a `call` instruction getting generated. Why not simply generate three `movq` instructions for the zero initialization instead?
Currently, users can work around this by translating their `setByteArray#` call into several `writeWordArray#` calls. This manual unrolling obscures the meaning of the code and is not portable across architectures (you have to use `CPP` to make it work on both 32-bit and 64-bit platforms). I'd like to add a cmm-to-assembly optimization to GHC that does the unrolling instead, so that the user can write more natural code.
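
To make the workaround concrete, here is a sketch of the zero initialization from `fill` written with word-sized stores instead of `setByteArray#`. It assumes a 64-bit platform, where 24 bytes are exactly three machine words; the name `zeroThreeWords` is made up for this example, and the byte writes from `fill` are left out to keep it short.

```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
module Main (main, zeroThreeWords) where

import GHC.Exts
import GHC.IO (IO (..))

data ByteArray = ByteArray ByteArray#

-- Hypothetical hand-unrolled replacement for `setByteArray# m 0# 24# 0#`.
-- Assumption: 64-bit platform, so three word writes cover all 24 bytes.
-- On a 32-bit platform this would need six writes, hence the CPP problem.
zeroThreeWords :: IO ByteArray
zeroThreeWords = IO $ \s0 -> case newByteArray# 24# s0 of
  (# s1, m #) -> case writeWordArray# m 0# 0## s1 of
    s2 -> case writeWordArray# m 1# 0## s2 of
      s3 -> case writeWordArray# m 2# 0## s3 of
        s4 -> case unsafeFreezeByteArray# m s4 of
          (# s5, r #) -> (# s5, ByteArray r #)

main :: IO ()
main = do
  ByteArray arr <- zeroThreeWords
  -- Read the three words back to check that they were zeroed.
  print [W# (indexWordArray# arr i) | I# i <- [0, 1, 2]]
```

Note that the indices passed to `writeWordArray#` count words, not bytes, which is one more way the unrolled version drifts away from the original `setByteArray#` call.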
Specifically, here's what I'm thinking:
- This only happens when the offset into the `ByteArray#` and the length of the range are constants that are multiples of the machine word size. So, `setByteArray# arr 8# 16# x` is eligible on 32-bit and 64-bit platforms. And `setByteArray# arr 4# 8# x` is eligible only on a 32-bit platform. And `setByteArray# arr 16# y x` is not eligible on any platform.
- This only happens when the `call memset` instruction has a range of 32 bytes or less.
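
The two conditions above can be sketched as a small predicate. This is only a model of the proposal, not code from GHC: `eligible` is a hypothetical name, the word size is passed in bytes, and `Nothing` stands for an offset or length that is not a compile-time constant.

```haskell
module Main (main, eligible) where

-- | Hypothetical model of the proposed eligibility rule.
-- wordSize: machine word size in bytes (4 or 8).
-- mOff, mLen: offset and length in bytes, if known at compile time.
eligible :: Int -> Maybe Int -> Maybe Int -> Bool
eligible wordSize mOff mLen = case (mOff, mLen) of
  (Just off, Just len) ->
    off `mod` wordSize == 0      -- offset is word-aligned
      && len `mod` wordSize == 0 -- length is a whole number of words
      && len <= 32               -- range is 32 bytes or less
  _ -> False                     -- non-constant offset or length

main :: IO ()
main = do
  -- The three examples from the list above.
  print (eligible 4 (Just 8) (Just 16), eligible 8 (Just 8) (Just 16))
  print (eligible 4 (Just 4) (Just 8), eligible 8 (Just 4) (Just 8))
  print (eligible 4 (Just 16) Nothing, eligible 8 (Just 16) Nothing)
```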
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 8.6.3 |
| Type | Task |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
8.9
Artem Pyanykh