```wiki
smvm m v = [: sumP [: x * (v !: i) | (i,x) <- row :] | row <- m :]
```

Here the variable `v` is constant in the array comprehensions and will be replicated while lifting the expression `v !: i`. In other words, for every single element in a `row`, lifting implies the allocation of a separate copy of the entire array `v`, only to perform a single indexing operation on that copy of `v`. More precisely, in the lifted code, lifted indexing (which we usually denote by `(!:^)`) is applied to a nested array consisting of multiple copies of `v`; i.e., it is applied to the result of `replicatePA (lengthPA row) v`.

This is clearly wasted effort and space. However, the situation is even worse in Ben's pathological example:

```wiki
treeLookup table xx
  ...
    half = len `div` 2
    s1   = sliceP 0 half xx
    s2   = sliceP half half xx
  in concatP (mapP (treeLookup table) [:s1, s2:])
  -- or better: concatP (mapP (treeLookup table) (segmentP [:half, len - half:] xx))
```

Here `table` is constant in `mapP (treeLookup table) [:s1, s2:]`; hence, the entire `table` gets duplicated on each level of the recursion, leading to space consumption that is exponential in the depth of the recursion.

## What's happening here?

Replication of scalars and arrays is always a waste of time and space. However, it is particularly problematic if the replicated structure is consumed by an indexing operation, as this can change the asymptotic work complexity of the vectorised program. The same holds not just for indexing, but for any operation that consumes only a small part of its input array(s). In other words, if a replicated structure is consumed in its entirety (for example by a fold), the asymptotic work complexity of replication matches that of consuming the structure; for operations that consume only a small part of their input, it does not. Hence, lifting, which introduces the replication, increases the asymptotic work complexity in the latter case.

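
To see the complexity point in miniature, here is a list-based sketch in plain Haskell (not the DPH API; both function names are ours): the naive version mirrors what lifting produces, indexing into the result of replicating `v` once per index, while the shared version indexes a single copy of `v`. In a strict array implementation the naive version would materialise each copy; Haskell lists share them, so this only models the shape of the computation, not its cost.

```haskell
-- Sketch only: list-based stand-ins for lifted indexing (names are ours).

-- Mirrors the lifted code: one copy of v per element of is.
-- With strict arrays this costs O(length is * length v) space.
naiveLiftedIndex :: [a] -> [Int] -> [a]
naiveLiftedIndex v is = zipWith (!!) (replicate (length is) v) is

-- What we want instead: every index draws from the single, shared v.
sharedLiftedIndex :: [a] -> [Int] -> [a]
sharedLiftedIndex v is = map (v !!) is
```

Both produce the same result; only the amount of data touched differs.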
## Fixing the problem: Plan A (Repeated Segments)

Generally, we don't want to copy replicated data at all: it's a waste of space! We definitely don't want to do it for large data structures, and in particular, we don't want to do it for arrays; after all, that can change the asymptotic work complexity of a program. To keep it simple, we are for the moment focusing on avoiding the replication of arrays, as this is where our practical problems are currently coming from. (Some cases of replicated scalars are also eliminated by fusion.)

NB: We will have to revisit the replication of scalar structures, as such scalar structures may be large trees.

### Repeated segments

A replicated array is always represented by a segmented array; more precisely, by a segmented array in which a contiguous sequence of segments contains the same data. For example, we have

```wiki
replicatePA 3 [:1, 2, 3:] = [:[:1, 2, 3:], [:1, 2, 3:], [:1, 2, 3:]:]
where
  [:[:1, 2, 3:], [:1, 2, 3:], [:1, 2, 3:]:] = ([:3, 3, 3:], [:1, 2, 3, 1, 2, 3, 1, 2, 3:])
```

and

```wiki
expandPA [:2, 3:] [:[:1, 2:], [:3:]:] = [:[:1, 2:], [:1, 2:], [:3:], [:3:], [:3:]:]
where
  [:[:1, 2:], [:1, 2:], [:3:], [:3:], [:3:]:] = ([:2, 2, 1, 1, 1:], [:1, 2, 1, 2, 3, 3, 3:])
```

NB: `expandPA` is lifted replication; it is the name we used in HtM.

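
The two equations above can be checked with a small list-based model of length-segmented arrays (plain Haskell; `replicateSeg` and `expandSeg` are our illustrative stand-ins for `replicatePA` and `expandPA`, not the DPH API):

```haskell
-- A segmented array as (segment lengths, flat data).
type Seg a = ([Int], [a])

-- Model of replicatePA: n copies of one array become n equal segments.
replicateSeg :: Int -> [a] -> Seg a
replicateSeg n xs = (replicate n (length xs), concat (replicate n xs))

-- Model of expandPA (lifted replication): segment i is repeated (ns !! i) times.
expandSeg :: [Int] -> Seg a -> Seg a
expandSeg ns (lens, xs) =
  ( concat (zipWith replicate ns lens)
  , concat (zipWith (\n seg -> concat (replicate n seg)) ns segs)
  )
  where
    segs = splitSegs lens xs
    -- Cut the flat data back into its segments.
    splitSegs []       _  = []
    splitSegs (l : ls) ys = let (s, rest) = splitAt l ys in s : splitSegs ls rest
```

Note that this model does copy the data; the point of the repeated-segments representation below is precisely to avoid those copies.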
### Collapse repeated segments

In practice, segment descriptors store more information than just the segment lengths. At the least, they additionally store the start position of each segment in the data array. In the conventional representation, an invariant is that the start positions are such that segments don't overlap. To represent arrays with repeated segments more efficiently, we propose to relax that invariant: all start positions of a contiguous sequence of repeated segments are the same; i.e., the segment data is stored only once per sequence of repeated segments.

Then, we have for `[:[:1, 2, 3:], [:1, 2, 3:], [:1, 2, 3:]:]`,

```wiki
start: [:0, 0, 0:]
len:   [:3, 3, 3:]
data:  [:1, 2, 3:]
```

and for `[:[:1, 2:], [:1, 2:], [:3:], [:3:], [:3:]:]`,

```wiki
start: [:0, 0, 2, 2, 2:]
len:   [:2, 2, 1, 1, 1:]
data:  [:1, 2, 3:]
```

This is merely a change in the array representation that does not affect vectorisation.

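
A list-based sketch of this relaxed representation (our names, not the DPH API): replication stores the segment data only once and repeats only the `(start, length)` entries:

```haskell
-- A segmented array as (starts, lengths, flat data); repeated segments
-- share a start position, so their data appears only once.
type SegR a = ([Int], [Int], [a])

-- Replication: all n segments start at 0 over a single copy of xs.
replicateSegR :: Int -> [a] -> SegR a
replicateSegR n xs = (replicate n 0, replicate n (length xs), xs)

-- Read the nested array back out, one segment per (start, length) pair.
toNested :: SegR a -> [[a]]
toNested (starts, lens, xs) = zipWith (\s l -> take l (drop s xs)) starts lens
```

For example, `replicateSegR 3 [1,2,3]` yields `([0,0,0], [3,3,3], [1,2,3])`, matching the first example above, yet holds only one copy of the data.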
## Operations on arrays with repeated segments

As multiple segments may overlap in arrays with repeated segments, array consumers need to be adapted to work correctly in this situation.

### Lifted indexing

In the `smvm` example, a replicated array is consumed by lifted indexing to extract the matching elements of the vector for all non-zero elements of the matrix. Using just a lengths array as a segment descriptor, without overlapping segments, lifted indexing might be implemented as follows:

```wiki
(as_len, as_data) !:^ is = bpermutePA as_data ((prescanPA (+) 0 as_len) +^ is)
```


With overlapping segments, we have

```wiki
(as_start, as_len, as_data) !:^ is = bpermutePA as_data (as_start +^ is)
```


In the case of `smvm`, where the first argument is produced by `replicatePA (lengthPA row) v`, we have `as_start = replicatePA (lengthPA row) 0` and `as_data = v`. In other words, lifted indexing draws from a single copy of `v`, which is what we wanted.

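
Both definitions can be played through on the list model (a sketch; `bpermute` stands in for `bpermutePA`, `zipWith (+)` for `(+^)`, and a `scanl` for `prescanPA`):

```haskell
-- Gather data at the given positions (stand-in for bpermutePA).
bpermute :: [a] -> [Int] -> [a]
bpermute xs = map (xs !!)

-- Without start positions: recover segment offsets via a prescan of lengths.
indexLifted :: ([Int], [a]) -> [Int] -> [a]
indexLifted (lens, xs) is =
  bpermute xs (zipWith (+) (init (scanl (+) 0 lens)) is)

-- With (possibly shared) start positions: no prescan, no copied data.
indexLiftedR :: ([Int], [Int], [a]) -> [Int] -> [a]
indexLiftedR (starts, _lens, xs) is = bpermute xs (zipWith (+) starts is)
```

On the replicated array from `smvm`, `indexLifted` needs the data copied out nine times, while `indexLiftedR` reads the same answers from the single three-element copy.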
### Reduction

> TODO rl's example
>
> ```wiki
> f xs is = mapP g is
>   where
>     g i = sumP xs + i
> ```


IDEA: Work on the data array of the segmented array with repeated segments (but without repeated data), then copy the per-segment results according to the repetition information. This avoids reducing the same data multiple times. We do something similar for scans, but instead of copying the results, we keep them in a segmented array with the same **repeated** segment information.

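
A sketch of this idea for segmented sums on the list model (our names; repetitions are detected as runs of equal `(start, length)` pairs): each distinct segment is reduced once and the result is copied per repetition, instead of re-reducing the shared data.

```haskell
import Data.List (group)

-- Segmented sum over (starts, lengths, flat data); repeated segments share
-- the same (start, length) pair, so each run is reduced exactly once.
sumSegsR :: ([Int], [Int], [Int]) -> [Int]
sumSegsR (starts, lens, xs) =
  concat
    [ replicate (length run) segSum       -- copy the result per repetition
    | run <- group (zip starts lens)      -- one run per distinct segment
    , let (start, len) = head run
          segSum       = sum (take len (drop start xs))
    ]
```

For `([0,0,0], [3,3,3], [1,2,3])` the data is summed once and the result `6` copied three times.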
### Splitting and combining (for lifted conditions)

Due to lifted conditionals (or, more generally, case constructs), arrays with repeated segments may be split (packed) and combined. Arrays with repeated segments can be split (or packed) by including a repeated segment in the result exactly if at least one of its repetitions is included. This can be determined by disjunctively combining all flags for one sequence of repeated segments.


Arrays with repeated segments cannot always be combined without duplicating the data corresponding to repeated segments (after all, a disjoint segment may be inserted into a sequence of repeated segments). For simplicity, we may want to expand all repeated segments in `combinePA`. (It seems that this should not lead to unexpected blow-ups, as the repeated data is now part of a function's result and should be accounted for in its complexity estimate.)

### Splitting and joining (for distribution across threads)

Our idea is to continue to split the **data** array evenly across threads. That may spread the segment descriptor (the length and start arrays) very unevenly, as repeated segments have an entry for each repetition in the segmentation information, but not in the data.

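
The even split of the data array can be sketched as follows (our name `splitEven`; chunk sizes differ by at most one, irrespective of segment boundaries):

```haskell
-- Split a flat data array into n near-even chunks, one per thread.
splitEven :: Int -> [a] -> [[a]]
splitEven n xs = go xs sizes
  where
    (q, r) = length xs `divMod` n
    -- The first r chunks get one extra element.
    sizes  = replicate r (q + 1) ++ replicate (n - r) q
    go _  []       = []
    go ys (s : ss) = let (chunk, rest) = splitAt s ys in chunk : go rest ss
```

On real distributed arrays each thread would of course receive a slice, not a copy; this only illustrates the chunking, which ignores where segments begin and end.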
TODO Roman believes splitting arrays with repeated segments is a problem. To me it doesn't seem to be that much of a problem. (Keep in mind that we only need a restricted number of operations on arrays with repeated segments; all their consumers are homomorphisms, as discussed in Plan B.)

### Multiple levels of nesting (unconcat and concat)

TODO What if we have segment descriptors on top of one with repeated segments?

## Fixing the problem: Plan B (Lifted Application)

The applications of `replicatePA` and `expandPA` that introduce segmented arrays with repeated segments in Plan A are in the definition of `mapP`: specifically, `replicatePA` is used in `mapP_S` and `expandPA` in `mapP_L` (see page 403 of HtM):

```wiki
mapP_S :: (a :-> b) -> PA a :-> PA b
mapP_S (Clo env _ fl) xss
  = fl (replicatePA (lengthPA xss) env) xss

mapP_L :: PA (a :-> b) -> PA (PA a) -> PA (PA b)
mapP_L (AClo env _ fl) xss
  = unconcatPA xss (fl (expandPA xss env) (concatPA xss))
```


In both cases, we replicate the environment of a closure before we apply the lifted version of the function represented by the closure. This is important, as it guarantees that the consumers of these occurrences of `replicatePA` and `expandPA` process the (topmost) segment structure in a homomorphic manner (after all, we are implementing a `map` function here)!


The idea behind Plan A is that `replicatePA` and `expandPA` produce a segmented array that encodes the replication without unnecessarily copying data, and that the consumer, the lifted function `fl`, processes segmented arrays with encoded replication in a special manner.

## Related work

- The work-efficient vectorisation paper by Prins et al. Their approach only works in a first-order language.
- Blelloch's work on the space-behaviour of data-parallel programs.

---

# OLD MATERIAL

## A plan to fix the problem

Unfortunately, not only indices blow out, the length of replicated arrays may al…

## Concrete implementation of replicated arrays

The DPH library is built on the [vector](http://hackage.haskell.org/package/vector) package (that provides high-performance sequential arrays). This package heavily relies on a cheap representation of sliced arrays, i.e., arrays of which a subarray is extracted. Such array slices are not copied, but represented by a reference to the original array together with markers for the start and end of the slice.

### Replicating and slicing
When implementing replicated arrays, we need to take into account (1) that a replicated array may be a sliced vector and (2) that the partitioning of a parallel array across multiple threads requires slicing that parallel array.

Is this really an issue? After all, we don't replicate thread-local slices of parallel arrays (which in turn may be sliced vectors), but replicate parallel arrays (which are already distributed over multiple threads).


## Related work

- The work-efficient vectorisation paper by Prins et al. Their approach only works in a first-order language.
- Blelloch's work on the space-behaviour of data-parallel programs.