Skip to content

GitLab

  • Projects
  • Groups
  • Snippets
  • Help
    • Loading...
  • Help
    • Help
    • Support
    • Community forum
    • Submit feedback
  • Sign in / Register
GHC
GHC
  • Project overview
    • Project overview
    • Details
    • Activity
    • Releases
  • Repository
    • Repository
    • Files
    • Commits
    • Branches
    • Tags
    • Contributors
    • Graph
    • Compare
    • Locked Files
  • Issues 4,252
    • Issues 4,252
    • List
    • Boards
    • Labels
    • Service Desk
    • Milestones
    • Iterations
  • Merge Requests 398
    • Merge Requests 398
  • Requirements
    • Requirements
    • List
  • CI / CD
    • CI / CD
    • Pipelines
    • Jobs
    • Schedules
  • Security & Compliance
    • Security & Compliance
    • Dependency List
    • License Compliance
  • Operations
    • Operations
    • Incidents
    • Environments
  • Analytics
    • Analytics
    • CI / CD
    • Code Review
    • Insights
    • Issue
    • Repository
    • Value Stream
  • Wiki
    • Wiki
  • Snippets
    • Snippets
  • Members
    • Members
  • Collapse sidebar
  • Activity
  • Graph
  • Create a new issue
  • Jobs
  • Commits
  • Issue Boards
  • Glasgow Haskell Compiler
  • GHCGHC
  • Issues
  • #16052

Closed
Open
Opened Dec 15, 2018 by Andrew Martin@andrewthadDeveloper

Core optimizations for memset on a small range

I've been doing some API bindings lately that require zeroing out memory before poking values into the appropriate places. Sometimes, these are small data structures. For instance, on linux, the internet socket struct sockaddr_in is 16 bytes. Here's an example (not involving sockaddr_in) of the kind of situation that arises:

{-# language MagicHash #-}
{-# language UnboxedTuples #-}

module FillArray
  ( fill
  ) where

import GHC.Exts
import GHC.IO

data ByteArray = ByteArray ByteArray#

fill :: IO ByteArray
fill = IO $ \s0 -> case newByteArray# 24# s0 of
  (# s1, m #) -> case setByteArray# m 0# 24# 0# s1 of
    s2 -> case writeWord8Array# m 4# 14## s2 of
      s3 -> case writeWord8Array# m 5# 15## s3 of
        s4 -> case unsafeFreezeByteArray# m s4 of
          (# s5, r #) -> (# s5, ByteArray r #)

This fill function allocates a 24-byte array, sets everything to zero, and then writes the numbers 14 and 15 to elements 4 and 5 respectively. With -O2, here's the relevant part of the core we get:

fill1
fill1
  = \ s0_a140 ->
      case newByteArray# 24# s0_a140 of { (# ipv_s16i, ipv1_s16j #) ->
      case setByteArray# ipv1_s16j 0# 24# 0# ipv_s16i of s2_a143
      { __DEFAULT ->
      case writeWord8Array# ipv1_s16j 4# 14## s2_a143 of s3_a144
      { __DEFAULT ->
      case writeWord8Array# ipv1_s16j 5# 15## s3_a144 of s4_a145
      { __DEFAULT ->
      case unsafeFreezeByteArray# ipv1_s16j s4_a145 of
      { (# ipv2_s16p, ipv3_s16q #) ->
      (# ipv2_s16p, ByteArray ipv3_s16q #)
      }
      }
      }
      }
      }

And, here's the relevant assembly:

fill1_info:
_c1kL:
        addq $56,%r12
        cmpq 856(%r13),%r12
        ja _c1kP
_c1kO:
        movq $stg_ARR_WORDS_info,-48(%r12)
        movq $24,-40(%r12)
        leaq -48(%r12),%rax
        subq $8,%rsp
        leaq 16(%rax),%rdi
        xorl %esi,%esi
        movl $24,%edx
        movq %rax,%rbx
        xorl %eax,%eax
        call memset
        addq $8,%rsp
        movb $14,20(%rbx)
        movb $15,21(%rbx)
        movq $ByteArray_con_info,-8(%r12)
        movq %rbx,(%r12)
        leaq -7(%r12),%rbx
        jmp *(%rbp)
_c1kP:
        movq $56,904(%r13)
        movl $fill1_closure,%ebx
        jmp *-8(%r13)
        .size fill1_info, .-fill1_info

What a bummer that using memset for something as small setting three machine words (on a 64 bit platform) results in a call instruction getting generated. Why not simply generate three movb instructions for the zero initialization instead?

Currently, users can work around this by translating their setByteArray# call to several writeWordArray# calls. This optimization obscures the meaning of written code and is not portable across architectures (so you have to use CPP to make it work on 32 bit and 64 bit). I'd like to add a cmm-to-assembly optimization to GHC that does unrolling instead so that the user can write more natural code.

Specifically, here's what I'm thinking:

  • This only happens when the offset into the ByteArray# and the length of the range are constant that are multiples of the machine word size. So, setByteArray# arr 8# 16# x is eligible on 32-bit and 64-bit platforms. And setByteArray# arr 4# 8# x is eligible only on a 32-bit platform. And setByteArray# arr 16# y x is not eligible on any platform.
  • This only happens when the call memset instruction has a range of 32 bytes or less.
Trac metadata
Trac field Value
Version 8.6.3
Type Task
TypeOfFailure OtherFailure
Priority normal
Resolution Unresolved
Component Compiler
Test case
Differential revisions
BlockedBy
Related
Blocking
CC
Operating system
Architecture
Edited Apr 06, 2019 by Ben Gamari
Assignee
Assign to
8.9
Milestone
8.9
Assign milestone
Time tracking
None
Due date
None
Reference: ghc/ghc#16052