Parallel Programming as in Ct, BSGP and FP

26. August 2008 06:36 by CarlosLoria in General  //  Tags: ,   //   Comments (0)

Parallel programming has always been an interesting and recognized as a challenging area; it is again quite full of life due to the proliferation nowadays of parallel architectures available at the "regular" PC level such as multi-core processors, SIMD architectures as well as co-processors like stream processors as we have in (Graphical Processors Units (GPUs).

A key question derived from such a trend is naturally how parallel processors development would be influencing programming paradigms, languages and patterns in the near future. How programming languages should allow regular programmers to take real advantage of parallelism without adding another huge dimension of complexity to the already intricate software development process, without penalizing performance of software engineering tasks or application performance, portability and scalability. 

Highly motivated by these questions, we want to present a modest overview in this post using our understanding of some interesting references and personal points of view.

General Notions
Some months ago someone working in Intel Costa Rica asked me about multi-core and programming languages during an informal interview, actually as a kind of test, which I probably did not pass, I am afraid to say. However, the question remains actually very interesting (and is surprisingly related to some of our previous posts on ADTs and program understanding via functional programming, FP). So, I want to try to give myself another chance and try to cover some part of the question in a humble fashion. More exactly, I just want to remind some nice well-known programming notions that are behind the way the matter has been attacked by industry initiatives.

For such a purpose, I have found a couple of interesting sources to build upon as well as recent events that exactly address the mentioned issue and serve us as a motivation. First, precisely Intel had announced last year the Ct programming model (Ghuloum et al) and we also have a report on the BSGP programming language developed by people of Microsoft Research (Hou et al)  We refer to the sources for gathering more specific details about both developments. We do not claim these are the only ones, of course; we just use them as interesting examples.

Ct is actually an API over C/C++ (its name stands for C/C++ throughput) for portability and backward compatibility; it is intended for being used in multi-core and Tera-scale Intel architectures but practicable on more specific ones like GPUs, at least as a principle. On the other side BSGP (based on the BSP model of Valiant) addresses GPU programming mainly. The language has its own primitives (spawn, barrier, among the most important) but again looks like C at the usual language level. Thus, the surface does not reveal any pointer to FP, in appearance.

In the general case, the main concern with respect to parallel programming is clear: how can we program using a parallel language without compromising/constraining/deforming the algorithmic expression due to a particular architecture, at least not too much. What kind of uniform programming high-level concepts can be used that behave scalable, portable and even predictable in different parallel architectures? In a way that program structure can be derived, understood and maintained without enormous efforts and costs. In a way the compiler is still able to efficiently map them to the specific target, as much as possible.

Naturally, we already have general programming notions like threads, so parallel/concurrent programming usually get expressed using such concepts strongly supported by the multi-core architectures. Likewise, SIMD offers data-parallelism without demanding particular mind from the programmer. But data "locality" is in this case necessary in order to become effective. Such a requirement turns out not realistic in (so-called irregular) algorithms using dynamic data structures where data references (aliases and indirections) are quite normal (sparse matrices and trees for instance). Thus memory low-latency might result computationally useless in such circumstances, as well-known.

On the other hand, multithreading requires efficient (intra/inter-core) synchronization and a coherent inter-core memory communication. Process/task decomposition should minimize latency due to synchronization. Beside that data-driven decomposition is in general easier to grasp than task-driven. Thus, such notions of parallelism are still involved with specific low-level programming considerations to get a good sense of balance between multi-core and SPMD models and even MPMD.

Plenty of literature back to the eighties and the nineties already shows several interesting approaches in achieving data parallelism (DP) that take these mentioned issues into consideration, uniformly; thus, it is not surprising that Ct is driven by the notion of nested data parallelism (NDP).  We notice that BSPG addresses GPU architectures (stream based) in order to build simpler programming model than, for instance, in CUDA. NDP is not explicitly supported but similar principles can be recognized, as we explain later on.

Interestingly, that NDP seminal works on these subject directly points to the paradigm of functional-programming (FP), specifically prototyped in the Nesl language of Blelloch and corresponding derivations based on Haskell (Nepal, by Chakravarty, Keller et al).

FP brings us, as you might realize, back to our ADTs, so my detour is not too far away from my usual biased stuff as I promised above. As Peyton Jones had predicted, FP will be more popular due to its intrinsic parallel nature. And even though developments like Ct and BSGP are not explicitly expressed as FP models (I am sorry to say), my basic understanding of the corresponding primitives and semantics was actually easier only when I was able to relocate it in its FP-ADT origin. Thus, to get an idea of what an addReduce or a pack of Ct mean, or similarly a reduce(op, x) of BSGP, was simpler by thinking in terms of FP. But this is naturally just me.

Parallelism Requirements and FP combinators
Parallel programs can directly indicate, using primitives or the like, where parallel task take place, where task synchronize. But it would be better that the intentional program structure gets not hidden by low-level parallel constraints. In fact, we have to assure that program control structure remains intentionally sequential (deterministic) because otherwise reasoning and predicting behavior can become hard. For instance, to reason about whether we add more processors (cores) we can guarantee that programs (essentially) performance scale. In addition, debugging is usually easier if programmer thinks sequentially during program development and testing.

Hence, if we want to preserve natural programming structures during coding and not be using special (essentially compiler targeting) statements or similar to specify parallel control, we have to write programs in a way that the compiler can derive parallelism opportunities from program structure. For such a purpose we would need to make use of some sort of regular programming structures, in other words, kind of patterns to formulate algorithms. That is exactly the goal the Ct development pursues.

As FP theory decade ago has shown that a family of so-called combinators does exist for expressing regular and frequent algorithmic solution patterns. Combinators are a kind of building-blocks for more general algorithms; they base on homomorphisms on (abstract) data types. Main members of such family are the map and reduce (aka. fold) combinators. They are part of a theory of lists or the famous Bird-Meertens formalism (BMF), as you might know. Symbolically, map can be defined as follows:

map(f, [x1,...,xn]) = [f(x1),..., f(xn)]  (for n>= 0)

Where as usual brackets denote lists and f is an operation being applied to each element in a list collecting the results in a new list preserving the original order (map of an empty list is the empty list). As easy to see, many algorithms iterating over collections by this way can be an instance of map. My favorite example for my former FP students was: "to raise the final notes of every student in class in a 5%". But more serious algorithms of linear algebra are also quite related to the map-pattern, so they are ubiquitous in graphic or simulation applications (pointing to GPUs).

Map is naturally parallel; the n-applications of operation "f" could take place simultaneously in pure FP (no side-effects). Hence, expressing tasks using map allows parallelism. Moreover, map can be assimilated as a for-each iteration block able to run in parallel. List comprehensions in FP are another very natural form to realize map.

Nesl original approach which is followed by Nepal (a Haskell version for supporting NDP) and Ct is to provide a special data type (parallel list or array) that implicitly implements a map. In Nepal such lists are denoted by bracket-colon notation, [: :] and called parallel-arrays. In Ct such a type is provided by the API and is called TVEC. Simply using a data type for communicating parallelism to the compiler is quite simple to follow and does not obscure the proper algorithm structure. In Nepal notation, just as simple example, we would write using parallel array comprehensions something like this:

[: x*x | x <- [: 1, 2, 3 :] :]

This denotes the map of the square function over the parallel list [:1, 2, 3:]. Notice that this is very similar to the corresponding expression using regular lists. However, it is interesting to notice that in contrast to normal lists parallel-arrays are not inductively definable, that would suggest trying to process them sequentially, and that would make no sense. Using flat parallel arrays allows DP. However, and that is the key issue, parallel-arrays (TVECs also) can be nested, in such  a case a flattening process takes places internally in order to apply in parallel a parallel function over a nested structure (a parallel-array of parallel-arrays) which is not possible in DP. Standard and illustrating examples of NDP are divide-and-conquer algorithms, like the quicksort (qsort). Using NDP lets the flattening handle recursive calls as parallel calls, too. For instance, as key part of a qsort algorithm we calculate:

[: qsort(x) | x <-[: lessThan, MoreThan :]:]

Assume the lessThan and moreThan are parallel arrays result of the previous splitting phase. In this case the flattening stage will "distribute" (replicate) the recursive calls between data structure. The final result after flattening will be a DP requiring a number of threads that depends on the number of array partitions (O(log(n))) not on the number of qsort-recursive calls (O(n)); which can help to balance the number of required threads at the end independently of the order of the original array. This example shows that NDP can avoid potential degeneration of control (task) based parallelism.

Evidently, not every algorithm is a map-instance; many of them are sequential reductions from a data type into a "simpler" one. For instance, we have the length of a list which reduces (collapses) a list into a number or similarly the sum of all elements of a list. In such a case the threads computing independent results must be synchronized into a combined result. This corresponds to the reduce class of operations. Reduce (associative fold) is symbolically defined as follows:

reduce (e, @, [x1, ..., xn]) = e @x1@...@xn
where "@" denotes here an associative infix operator (function of two arguments).

Reduce can be harder to realize in parallel in comparison to map due its own nature, in general; scans/tree-contraction techniques offer one option providing reduce as a primitive.  Bottom-line is that they can be cleanly separated of big previous map-steps. In other words, many algorithms are naturally composed of "map"-phases followed by reduce-phases so particular parallelization and optimization techniques can be employed based on this separation at the compilation level.

We notice that map-steps can be matched with the superstep-notion which is very proper of the BSP model (which lies beneath the BSPG language). On the other hand reduce-phases are comparable with after-barrier code in this language or points where threads must combine their outputs.  However, BSGP does not have flattening, as we could observe in the report.

If the implemented algorithm naturally splits into well-defines map (superstep)-reduce (barrier) phases then NDP-thinking promotes better threading performance and more readability of the code intentionality. In addition, many reduce operations can be very efficiently implemented as operations on arrays according to previously calculated indexing parallel arrays (as masks) which are directly supported in Ct (for instance, partition, orReduce, etc.) according to the source.  In addition standard scan algorithms are supported in the library, thus scan-based combinations can be used.

In a forthcoming (in a hopefully shorter) second part of this post, we will be briefly discussing some applications of NDP to some pattern-matching schemas.

ADTs, OOP, rules and Pattern-matching (1)

6. June 2008 12:57 by CarlosLoria in General  //  Tags: , , ,   //   Comments (0)

 Abstract (or algebraic) Data Types (ADTs) are well-known modeling techniques for specification of data structures in an algebraic style.
Such specifications result useful because they allow formal reasoning on Software at the specification level. That opens the possibility of automated verification and code generation for instance as we may find within the field of program transformation (refactoring, modernization, etc.).  However, ADTs are more naturally associated with a declarative programming languages (functional (FP) and logic programming (LP)) style than an object-oriented programming one (OOP).  A typical ADT for a tree data structure could be as follows using some hypothetical formalism (we simplify the model to keep things simple):

    adt TreeADT<Data>{

    import Integer, Boolean, String;
        sort Tree;
        oper Empty : -> Tree;
        oper Node : Data, Tree, Tree|-> Tree;
        oper isEmpty, isNode : Tree-> Boolean;
        oper mirror : Tree -> Tree
        oper toString : -> String
        isEmpty(Empty()) = true;
        isEmpty(Node(_,_,_)) = false;

        mirror(Empty()) = Empty()
        mirror(Node(i, l, r)) = Node(i, mirror(r), mirror(l))

As we see in this example, ADTs strongly exploit the notion of rules and pattern-matching based programming as a concise way to specify requirements for functionality.  The signature part is the interface of the type (also called sort in this jargon) by means of operators; the semantics part indicates which equalities are invariant over operations for every implementation of this specification. Semantics allows recognizing some special kinds of operations. For instance, Empty and Node are free, in the sense that no equation reduces them. They are called constructors.
If we read equations from left to right, we can consider them declarations of functions where the formals are allowed to be complex structures. We also use wildcards (‘_’) to discard those parts of the parameters which are irrelevant in context. It is evident that a normal declaration without using patterns would probably require more code and consequently obscure the original intention expressed by the ADT. In general, we would need destructors (getters), that is, functions to access ADT parts. Hence, operation mirror might look like as follows:
    Tree mirror(Tree t){
        if(t.isEmpty()) return t;
        return new Node(  t.getInfo(),
                                    t.getLeft().mirror() );
Thus, some clarity usually could get lost when generating code OOP from ADTs: among other things, because navigation (term destruction, decomposition) needs to be explicit in the implementation.

Rules and pattern-matching has not yet been standard in main stream OOP languages with respect to processing general data objects.  It has been quite common the use of regular expressions over strings as well as pattern-matching over semi structured data especially querying and transforming XML data. Such processing tends to occur via APIs and external tools outside of the proper language. Hence, the need for rules does exist in very practical applications, thus, during the last years several proposals have been developed to extend OOP languages with rules and pattern-matching. Some interesting attempts to integrate ADTs with OOP which are worthy to mention are Tom ( ) and Scala ( ). An interesting place to look at with quite illustrating comparative examples is

We will start in this blog a sequence of posts studying these two models of integration of rules, pattern-matching and OOP in some detail. We first start with a possible version of the ADT above using Scala. In this case the code generation has to be done manually by the implementer. But the language for modeling and implementing is the same which can be advantageous. Scala explicit support (generalized) ADTs, so we practically obtain a direct translation.
    object TreeADT{
        trait Tree{
           def isEmpty:boolean = false
           override def toString() = ""
           def mirror : Tree = this
           def height : int = 0
        case class Node(info:Any, left:Tree, right:Tree)
               extends Tree{
           override def toString() =
              "("+info+left.toString() + right.toString()+")"


        object Empty extends Tree{
           override def isEmpty = true

        def isEmpty(t : Tree) : boolean =
              t match {
            case Empty => true
            case _        => false

        def mirror(t : Tree):Tree =
               t match{
               case Empty       => t
               case Node(i, l, r) => Node(i,mirror(r), mirror(l))