Dynamic branch speculation in a speculative parallelization architecture for computer clusters

This article describes a technique of path unfolding for conditional branches in parallel programs executed on clusters. Unfolding the paths that follow control structures makes it possible to break the control dependencies existing in the code and, consequently, to obtain a high degree of parallelism through the use of idle CPUs. The main challenge of this technique is dealing with sequences of control statements. When a control statement appears in a path after a branch, a new conditional block needs to be opened, creating a new code split before the previous one is resolved. Such subsequent code splits increase the cost of speculation management, resulting in reduced benefits. Several decision techniques have been developed for improving code splitting and speculation efficiency in single-machine architectures. The main contribution of this paper is to apply such techniques to a cluster of single-processor systems and evaluate them in that environment. Our results demonstrate that code splitting, in conjunction with branch speculation and the use of statistical information, improves performance as measured by the number of processes executed per time unit. This improvement is particularly significant when the parallelized programs contain iterative structures in which conditions are repeatedly evaluated. Copyright © 2012 John Wiley & Sons, Ltd.


INTRODUCTION
The main obstacle to improving the performance of speculative multithreaded architectures is the limited degree of parallelization imposed by the intrinsic dependencies that exist among parallel threads. These dependencies can be data dependencies or control dependencies. One response to this obstacle is an architecture-aware compiler that parallelizes sequential applications without being constrained by the data and control dependencies present in the program.
An SMT processor is able to issue instructions from multiple threads in the same cycle, thus allowing multiple hardware contexts, or threads, to dynamically share the resources of a superscalar architecture. In [8], the available SMT resources are applied using the so-called threaded multipath execution (TME) technique to achieve high instruction-level parallelism when there are only one or a few processes running. By following both possible paths of a conditional branch, in the best case, the misspeculation penalty can be completely eliminated if there is significant progress down the correct path when the branch is resolved. TME uses unused (spare) contexts to execute threads in alternative paths of conditional branches. As a result, the SMT's resources can be better utilized, and the probability of executing the correct path is increased. TME can also provide significantly higher processor utilization than conventional superscalar processors.
There are also combined compiler and architecture techniques to control the multithreaded execution of branches and loop iterations [24]. These techniques can be applied by a compiler to replace branches with speculative execution of both branch paths and speculative execution of loop iterations. The resulting code needs to be tuned to a specific architecture. In [25], a compiler technique called simultaneous speculation scheduling was proposed in combination with a 'minimal' multithreaded execution model to enable speculative execution of alternative program paths. Some hardware implementations apply different forms of eager execution. The 'nanothreaded' DanSoft processor [26] implements a multipath execution model using confidence information from a static branch speculation mechanism. The PolyPath architecture [7] enhances a superscalar processor architecture with a limited multipath execution feature that employs eager execution. In [25], eager execution is used in a simultaneous multithreaded processor model. Finally, Ref. [27] proposed Disjoint Eager Execution, which assigns resources to the branch paths whose results are most likely to be used, that is, branches with the highest cumulative execution probability.

SPECULATION VERSUS MULTIPATH EXECUTION OF BRANCH INSTRUCTIONS
How processors handle conditional statements is very important for their performance, considering that one of every seven executed instructions is a branch. Low-level techniques are used to optimize the time needed to evaluate the condition and the address of the jump, but this is not enough: waiting for the evaluation of a condition blocks the system and reduces its performance. Decreasing this cost can be accomplished with one of two methods: condition speculation and path unfolding.
The use of a speculative method to evaluate a condition often results in having to manage new conditions before the previous ones are resolved. The system must also be able to remove the path for which the branch has not been correctly predicted. However, it should also be taken into account that in most cases, it is highly probable that an instruction is executed more than once. Moreover, most of the methods used in predicting jump conditions have a very high rate of success. In studies like Refs. [28,29], the application of speculation methods, regardless of whether they are implemented in hardware or software, has shown an accuracy of over 85%. These results motivated us to introduce speculative execution that relies on speculation about what will happen when a conditional statement is executed, including the case when the condition guards a loop body.
Speculations can be divided into two groups: static and dynamic. In the static case, the branch prediction is fixed: the branch is always predicted as taken (or as not taken), or the decision depends only on information that the compiler puts into the code. The disadvantage is that static speculation does not adapt to the instruction's behavior. In dynamic speculation, the decision depends on the instruction's behavior during execution and therefore requires historical information about the address of the instruction to which the jump is directed. The BTB (branch target buffer) method is an example of dynamic speculation, in which the jump instruction target address and historical information (the latter typically limited to just 2 bits) are stored in a buffer [30].
Different variants of this method were introduced [31], each proposing a different way of making the speculation decision (one-bit predictor, two-bit predictor, etc.). The more bits a predictor uses, the higher the cost of the necessary hardware. Hennessy and Patterson [32] conducted a study using two bits of history. They showed that for programs in SPEC89, the speculation errors ranged from 1% (nasa7, tomcatv) to 9% (spice), 12% (gcc), and even 18% (eqntott) when the BTB table had up to 4096 entries. Other methods rely on correlation-based predictors [34] that speculate about a branch outcome taking into account the behavior of other branches. These methods are motivated by the observation that the outcome of a branch is often affected by the outcomes of recently executed branches. Other types of predictors, like two-level adaptive predictors [35] and hybrid predictors [29], also work with information collected from other jumps (Table I).
One of the most important factors in deciding how to use speculation is the speculation's confidence level [10,11]. Considering that only a fixed percentage of accuracy can be achieved for all conditions, the rate of success in opening new paths at level n can be expressed as

percentage of success at level n = (percentage of success)^n

The confidence levels obtained in different studies [10,11] show that the confidence goes down considerably when speculations are made at an increasing number of levels, as shown in Figure 1 and Table II. It is also worth mentioning that different branches may have different rates of speculation success.

Table I. Comparison of different methods of speculation using the SPECint95 benchmark [33]; the labels in the misspeculation rate column denote: SAg, a variant of the second-level branch history method; gshare, a two-level adaptive predictor with a globally shared history buffer and pattern history table; combined, a two-bit predictor combined with the gshare predictor.

Path unfolding, or eager execution [5][6][7][8], proceeds down both paths of a branch, so no speculation is made. When a branch is resolved, all operations on the nontaken path are discarded. This method allows the system to take advantage of a parallel hardware architecture: if there are idle processors, the two paths of a branch can be executed without waiting for the result of the control condition. However, as we demonstrate later, parallelizing splits in the condition structures by unfolding paths is not always beneficial. Each additional split must duplicate the split condition data structures, so it increases the cost of speculation management while decreasing the gain of splitting. For example, consider an extreme case with a scheme in which all nodes are branches.
If all paths are open, as shown in Figure 2, the number of processors needed to run all levels would be

number of processors = 2^(actual level + 1) − 1

If only the first branch splits, as shown in Figure 3, and the other branches use speculation about which path to take to execute just one path, the number of needed processors can be expressed as

number of processors = 2 × actual level + 1

This is because one-level unfolding executes the branch and splits only the first level; afterwards, a processor on each path executes it in parallel with the other paths.
If we generalize this formula, assuming that we split n = level_unfolding levels (see Figure 4) and speculate about the paths taken at the remaining levels, the number of needed processors would be

if level_unfolding > level_actual: number of processors = 2^(level_actual + 1) − 1
else: number of processors = 2^(level_unfolding + 1) − 1 + (level_actual − level_unfolding) × 2^level_unfolding

The number of needed processors, shown in Figure 5 and Table III, demonstrates that from the third level on, the number of processors required to split all possible branches is large (15 or more). However, if only the first path is split and the other branches use speculation to execute one path, the number of needed processors is small (at most 7). Consequently, the majority of techniques rely on mixed methods. Many of those use confidence estimation [36] to control speculation about the branch. For example, after a branch, if the confidence level in the speculation is low, both of its paths will be executed; otherwise, speculation is used to execute the path predicted to be taken [37,38].

A MIXED METHOD WITH CONFIDENCE ESTIMATION AND SINGLE UNFOLDING
We have developed an execution environment that allows two execution threads to be unfolded when a branch is found. A replication of the control structures is required to schedule the two branches, as shown in Figure 6. This replication is carried out automatically when a branch is reached; later, when the branch is resolved, these structures are merged back.
The unfolding of threads of branch execution allows us to take advantage of the parallel architecture of the execution system. Processors that would otherwise be idle execute processes of the two paths of a branch (without waiting for the value of the condition). As discussed above, the generalization of this approach to nested conditional processes may not always be beneficial. Hence, we have chosen to study the optimal number of unfolding paths open simultaneously in the speculative parallelization architecture for computer clusters.
Let us consider a parallel architecture with 11 processors executing 11 processes with the following control diagram (without data dependencies), in which the loops at the bottom are executed three times (Figure 7).
Because the sequential execution of branches would again be (2), the gain is higher in (3) than in (1) if a limited number of processors are available.
In the above case, the high gain in (3) is achieved because the speculation predicted the conditions correctly. If the speculation predictions of the conditions are incorrect, for example, when condition B is predicted to follow the 'no' path, the result of unfolding the first condition and speculating about the outcomes of the second and third conditions is:

{A,B,I,D,J,F,L,F,L,F,L} {C,E,E,E}    (4)

Comparing (3) to (4), a reduction of single-unfolding performance is observed. Taking into account this loss and the gain obtained by successful speculation, and adding the extra cost imposed by opening all branches, it is clear that in this case it is more efficient not to open more than one branch simultaneously. Clearly, when several processors are available, all processes belonging to the same depth will be executed in parallel, without increasing the depth (with 19 processors available, unfolding all levels will result in single-step execution). Yet, while many branches are executed, only one of them is correct. Hence, the more branches are opened, the smaller the gain and the higher the management cost.
If there are as many processors as potential processes, for nested branches without loops, a binary execution tree will emerge. An example of such a tree with branches nested to level 3 is shown in Figure 8.
The percentages in the execution tree define the probabilities of following each path. It is assumed that an unlimited number of processes can be executed in parallel. The unfolding of one versus three branches is compared using the following notation:

Cpi_sc = the execution time obtained with no path unfolding but with speculation on the longest path. The best case is if we speculate about executing paths A, B, C, D, which can be executed in parallel on four processors in one step; this is the end of execution in 1/8 of all cases. In the remaining 7/8 of all cases, the correct path is executed in one additional step, because all the conditions are known and all the partial results have been obtained.

Cpi_sc = 0.125 × 1 cycle + 0.875 × 2 cycles = 1.875 cycles

Cpi_e1 = the execution time obtained with single unfolding and speculation on the longest path. In one step, paths A, B are executed by unfolding while paths C, D are executed by speculation. This gives the final result in half of the cases, when branch 1 resolves to 'no'. In an additional 1/8 of all cases, the result is also obtained in one step, when all three branches resolve to 'yes'. However, in the remaining 3/8 of all cases, it is only necessary to execute either path F or path G, and we know which one because the results of conditions B and C are known at this point, so the result is computed in two steps.

Cpi_e1 = 0.625 × 1 cycle + 0.375 × 2 cycles + Cges_1 = 1.375 cycles + Cges_1

Cpi_e3 = the execution time obtained with triple unfolding. Here, all seven processes are executed in one step and the right combination is selected to get the result, because the values of all branches are known at that point of execution:

Cpi_e3 = 1 cycle + Cges_3

Cges_3 = the time of management with triple unfolding
Cges_1 = the time of management with single unfolding (which is smaller than Cges_3)

The unit of execution time is the time to execute a process. Hence, the larger the processes are, the more beneficial the unfolding is. Clearly, when Cges_1 < 0.5, single unfolding is more beneficial than pure speculation. Additionally, it could be argued that Cges_3 > 3 × Cges_1, because triple unfolding needs to duplicate at least three times more data than single unfolding. Under this assumption, triple unfolding is better than single unfolding (and, of course, in this case it is also better than speculative execution) when Cges_1 < 0.1875.
This result confirms that in a scheme without loop structures, single unfolding is not always better than speculation nor is triple or multiple unfolding always worse than single unfolding. The ratio between the time of management of unfolding and the execution time of the processes determines if single unfolding is better than the other methods.

This analysis is even more precise when more is known about the conditions of the branches. For example, if the 'no' paths have probabilities of only 10% and the 'yes' paths 90%, then we get:

Cpi_sc = 0.729 × 1 cycle + 0.271 × 2 cycles = 1.271 cycles
Cpi_e1 = 0.829 × 1 cycle + 0.171 × 2 cycles + Cges_1 = 1.171 cycles + Cges_1
Cpi_e3 = 1 cycle + Cges_3

In this case, single unfolding is best only if 0.0855 < Cges_1 < 0.1, a narrow interval, whereas triple unfolding is better for the much wider interval Cges_1 < 0.0855.
These results confirm that when more levels unfold, the time needed for managing unfolding must be low to achieve a gain over the other methods.
Finally, Figure 9 compares the unfolding performance of all levels versus the unfolding performance of one level as a function of the number of processors used. As can be seen in this figure, when all levels are being unfolded, increasingly more processors are needed to descend to the subsequent level, because it is necessary to run all processes of that level:

number of processors = 2^(level + 1) − 1

For example, seven processors are needed to unfold the second level.
If only the first level splits and the others are speculated about, the number of processors needed to execute is smaller.

number of processors = 2 × level + 1
In this case, obtaining the second level results requires five processors, as shown in Table IV.
Comparison of the two models run with 15 processors shows that unfolding all levels stops at the third level, when all available processors are used. In contrast, with single-level unfolding, there are enough processors to execute up to 7 levels in parallel. However, the use of speculation in single unfolding limits the guarantee of execution correctness to only the first-level path. Therefore, in the worst case (when all speculations happen to be wrong), single-level splitting loses two levels compared with unfolding all levels. Conversely, if all the speculations are correct, single unfolding gains four levels over unfolding all levels. Considering that speculation success rates up to the third level are quite high, it is very likely that single unfolding will not lose any levels to misspeculation.
For these reasons, we choose to use a mixed-method approach that does not allow more than one level of splitting paths.

Table IV. Number of processors needed at each level for the two unfolding strategies.
Level:                   1  2  3   4   5   6    7    8    9    10
Unfolding at all levels: 3  7  15  31  63  127  255  511  1023 2047
Unfolding at one level:  3  5  7   9   11  13   15   17   19   21

In this paper, we compare four methods (two of them unfolding branches and the remaining two speculating on branches without unfolding them) to see which gives the best results. We also evaluate the method without splitting to measure the benefits obtained by the four investigated methods.

SPECULATIVE PARALLELIZATION ARCHITECTURE FOR COMPUTER CLUSTERS
Speculative parallelization architecture for computer clusters [13][14][15][39][40][41] achieves parallelism by using speculation in distributed environments, allowing the parallel execution of a sequential program in a computer cluster. It simulates the behavior of a superscalar system by implementing instruction-level parallelism that attempts to break true data and control dependencies by speculating on future data values and future branch results, respectively. Speculation is based on the fact that program behavior is usually repetitive and consequently predictable, as demonstrated in studies of branches [4], memory dependencies, and data values [42]. Software speculation has recently shown promising results in parallelizing such programs [33,43,44]. The relevant techniques can be classified into two types: software speculation, in which compilers carry out the necessary coding and the resulting speculation cannot be applied dynamically [45][46][47]; and hardware speculation, which requires duplicated hardware elements, for example, extra registers to store provisional values until they are resolved [42,48,49].
The above techniques allow the processor to divide program execution into several parallel threads and therefore increase the program's degree of parallelism. Moore's Law (processing power doubles every 18 months) and Gilder's Law (bandwidth triples every 12 months) indicate that the speed of information transmission and synchronization between workstations increases faster than processing speed. These premises make the idea of transporting speculation techniques to a distributed environment composed of cheap workstations attractive. The complete design of the speculative parallelization architecture for computer clusters system [13][14][15][39] consists of three subsystems. The parallelizing subsystem (see Figure 10) transforms the original sequential program into the parallel format needed by the execution environment. The program is divided into blocks that can be executed in parallel. Either two or three programs (depending on the type of the original program) are generated as a result of the translation process: a farmer, a worker, and optionally a farmer/worker. A prototype implementation of this subsystem automatically transforms C code into MSSPACC (Master/Slave Speculative Parallelization Architecture for Computer Clusters) format C code by splitting loops and conditions into the corresponding blocks with their input and output variables. The description of its implementation is omitted here for the sake of brevity.‡ When dividing a sequential program into blocks, it is very important to choose the correct block size, because it can affect system performance significantly. We are currently working on enhancing this aspect of the parallelizing subsystem, following three options for optimizing block size: (i) user annotations of the block boundaries (the easiest but the least automatic choice), (ii) statistical information collected prior to parallelization, and (iii) a dynamic subsystem that can join blocks to improve system performance.
‡ For details, see University of Girona Technical Report IIiA 12-02-RR titled 'The parallelizing subsystem implementation,' by J. Puiggalí Figure 10. The parallelizing subsystem.
The farmer manages the parallelism and the speculation of the system. The worker runs at each of the processors; it contains the code of one of the blocks into which the sequential program has been divided. The farmer/worker program can reduce the farmer bottleneck by distributing the tasks to some of the other processors, each of which works then as a subfarmer.
The execution subsystem (see Figure 11) applies speculation to run the parallelized applications in a computer cluster composed of single-processor machines running PVM (parallel virtual machine). The execution environment behaves like a superscalar processor, where the blocks are like the instructions into which the sequential program has been divided, and the processors on which the worker program runs are like the functional units. The following data speculation mechanisms are used: data value speculation [50], last value predictor [50], stride predictor [51], and context-based value predictor [52]. Control dependencies are managed with branch speculation techniques based on a BTB with 2-bit history [53]. Blocks executed because of incorrectly predicted values or wrong branch speculations are discarded, and their execution is restarted from the last stable point. The simulation subsystem (see Figure 11) evaluates the impact of technological evolution or the effects of using computer clusters larger than those currently available. The simulation can run on a single workstation, using the information obtained from the single-processor execution (the trace of the program) and the cluster execution model (the execution cost of the different blocks).
The study and development of both subsystems was initiated simultaneously. The parallelizing subsystem is currently being designed. The execution subsystem has already been developed in C on PVM. It runs on computer clusters of up to 20 PCs (personal computers). The design of the execution subsystem is based on both theoretical analysis and a new simulation subsystem that has been used extensively [13][14][15]. This allows the extrapolation of the results to PVM subsystem configurations of ideal clusters, that is, those that are not actually available. The simulation uses the runtime, transmission, and control values obtained from actual executions in the cluster [13,14]. The sequential execution times were obtained from the execution subsystem and from the simulator output. To analyze and validate the performance, synthetic programs were used. However, pending access to the actual parallelization subsystem, two real applications were manually adapted (the travelling salesman problem [13] and a program to generate virtual scenes illuminated by radiosity [41]).
In our recent work [39], the execution subsystem has been enhanced to allow out-of-order execution (OoOE) [54,55]. The introduction of OoOE in the processor design implies that the execution of instructions can start at any time, and the final result will not be affected even when there is a blockage caused by data dependencies. This takes advantage of instruction cycles that would otherwise be wasted and so yields an improvement in system performance. In current computer architectures, OoOE is a paradigm already used in many microprocessors.

DUAL PATH EXECUTION OF A SYNTHETIC PROGRAM
In this section, we describe how a synthetic program is used to measure the efficiency of unfolding two paths. The program has a loop before the branch and two loops inside each branch (see Figure 12). We use a simulation tool [14] that takes into account the overhead of each technique to obtain the results.
During the simulation, the result of the first branch was delayed to allow the other branches to be executed before knowing their results. The first version of the synthetic program has no data dependencies, while the second version has exactly two such dependencies (function 2 and function 4).

In both cases, the dependencies can be addressed through speculation in two iterations. The control dependency created by the loop is solved by speculation, so there is no delay of the execution because of such dependencies. The resulting algorithm is shown in Figure 12.
The first part of the experiment was carried out on both synthetic program versions, assuming that the condition is true and using two approaches: one with unfolding paths and the other without. In the second case, the condition is resolved through speculation (the speculation predicts that the condition is true). The execution times obtained are shown in Table V. Table V demonstrates that the synthetic program without data dependencies, executed with a small number of processors, performs better with speculation without unfolding the branch. Yet, as the number of processors increases, the difference decreases until the execution times are equal. This is because in speculation without unfolding, if the chosen path is the correct one, all executing processes contribute towards the progress of the computation. In contrast, with unfolding, both paths are opened after the branch, so some processes will be assigned to a path that has been started but does not need to be executed. This is reflected in the results shown in Table V. The difference in the execution times of these two methods decreases when the number of available processors increases, because the unused processors can execute the processes corresponding to the erroneous branch without delaying the execution of the correct path. With 24 processors available, the execution times are identical in both methods, but the number of processes started without unfolding is 33, while with unfolding it is 58.
Table V. Execution times of a synthetic program when the speculation correctly predicts the condition; columns contain the absolute difference, while % columns contain the percentage of the difference between execution times with and without unfolding paths.

In the synthetic program with data dependencies, even when the branch speculation is correct, the executed path of the branch still has a dependency that requires a second iteration to speculate on the data value. On the other hand, the method that unfolds new execution paths at branches assigns the paths for which the speculation incorrectly predicted data to idle processors. Thus, as shown in Table V, there is almost no difference between the performance of the method speculating on branches and the method unfolding new execution paths at branches. The same experiment was carried out assuming the contrary outcome of the condition, that is, that the condition is false; the results obtained are shown in Table VI.
In this case, unfolding the paths gives good results regardless of whether there are data dependencies in the synthetic program. The results for the synthetic program without data dependencies are better than those for the version with data dependencies, because in the latter the incorrect path has data dependencies and the process executing that path is blocked until the dependency is resolved.
The improvement achieved by unfolding paths versus speculating on branches reaches 17.45% using 14 processors for the synthetic program without data dependencies and 15.52% for the synthetic program with data dependencies. This is because in speculation without unfolding, the system starts the execution of processes in the incorrect path and proceeds until the value of the condition is obtained. Once the misspeculation is detected, the processes on the incorrect path are erased. However, in speculation with unfolding, the system starts process execution of both paths and later keeps the one that was started with the correct value of the condition.
According to Table VII, with speculation predicting the incorrect value and the synthetic program with data dependencies, the number of processes executed in a system with 15 processors is smaller with path unfolding than without. When the number of processors is larger than 15, the number of executed processes stays the same, regardless of the number of processors. This is because in the system using path unfolding with unused processors, all processes executing both paths of a branch start execution before the value of the condition is available. The difference in time between the two methods of dealing with branches arises because, without path unfolding, the system must wait for the condition evaluation to start the processes of the correct path, thereby wasting CPU time. The presence of data dependencies in the synthetic program improves the relative performance of path unfolding because the execution of an incorrect path with such data dependencies is blocked until the dependency is resolved.

Table VI. Execution times of a synthetic program when the speculation incorrectly predicts the condition; columns contain the absolute difference, while % columns contain the percentage of the difference between execution times with and without unfolding paths.

Figure 13. Comparison of the absolute differences between execution times when speculating correctly and incorrectly on the outcome of a condition.

Figure 14. Comparison of the relative differences between execution times with correct and incorrect speculation.

Figure 15. Number of processes created without data dependencies.

Figure 16. Number of processes created with data dependencies.

Comparing the performance of the two possibilities, unfolding paths of branches or not (see Figures 13 and 14), we observe that the loss that can take place using unfolding in the worst cases is smaller than the benefit that unfolding can obtain in the best cases. Moreover, this loss decreases with the number of available processors, becoming zero for a large number of processors, while the benefit is sustained. This is due to the use of processors that would otherwise stay inactive (see Figures 15 and 16). This demonstrates that the use of path unfolding is beneficial in our approach.

ADAPTING THE UNFOLDING PATHS TO REPETITIVE STRUCTURES
The behavior of a condition in a repetitive structure is very different from the previously considered case because, in speculation without path unfolding, the BTB method can be used to decrease the probability of selecting the incorrect path. As an example, consider the synthetic program from Figure 17, which contains a repetitive structure. A condition is present in function 3, which is called four times during execution. We assume that the correct path corresponds to the condition yielding false. We also use a 2-bit BTB that initially defaults to the condition yielding true. We have the following cases when comparing the behavior of each speculative method: (1) Speculation using the BTB for path selection (see Figure 18). In this method, the first time the iteration (process 3) is reached, the BTB assumes that the condition will yield true and process 4 is executed.
Other processes are also started and execute until process 3 finishes its execution. At this point, it can be seen that a wrong execution was carried out. Subsequently, the scheduler eliminates all processes executed erroneously from the point where the error took place. In the second iteration the same thing happens again because the BTB continues to indicate that the condition will yield true. In the third iteration, the BTB predicts the condition correctly and, from this point forward, it continues to predict all speculations correctly.
(2) Speculation without the use of the BTB for path selection (see Figure 19). If the BTB is not used, speculation is always conducted by choosing the branch set by default. In our case, the selection in all four iterations will be incorrect. Therefore, the processes will execute incorrectly after the evaluation of the condition (block 3).
(3) Speculation with unfolding paths (see Figure 20). As we have explained previously, in this method every time a branch is found and no unfolding is currently active, the branch will be unfolded and its two paths will be executed in parallel. In this example, we would be executing in parallel four iterations of the two paths. When the branch is evaluated, the incorrect path would be automatically eliminated.
A comparison of the three methods reveals that speculation without BTB obtains the worst results because it must eliminate the blocks and redo the state of all iterations. The other two methods, speculation with BTB and path unfolding, perform well when the number of iterations is small. However, as the number of iterations increases, the advantage of unfolding is reduced because speculation with BTB becomes increasingly successful. This implies that, to maintain good performance, speculation with path unfolding requires a larger number of processors than speculation with BTB does (Table VIII).
To overcome this disadvantage of path unfolding on iterative structures, we introduced a modification of the method: the use of historical statistics of branch behavior.
Two parameters, the number of passes through the condition and a threshold on the percentage of identical outcomes, dictate which method the predictor uses. A branch whose outcome is statistically stable is likely to take the same path repeatedly; for such branches, we do not unfold the paths because we expect unfolding to be unnecessary. It is important to decide what value to use as the threshold, in other words, what percentage of identical outcomes we consider statistically significant. The decision tree for making this decision is shown in Figure 21.
If we apply this modification of the path unfolding method to the previous example, assuming a threshold of three passes and over 75% identical outcomes, the result corresponds to Figure 22.

Figure 21. Scheme of the mixed system for speculation with unfolding paths.

As can be observed, the modification preserves the benefits that we obtained with the path unfolding method versus the speculation-with-BTB method. There is also an added benefit: applying this modification to the speculation-with-BTB method improves its speculation success rate. In conclusion, the modification improves both of the methods discussed above. We executed the synthetic program in the simulator, obtaining the results for the four methods shown in Figure 23.

Execution time
The plot in this figure starts at six processors because the very high execution times with fewer processors would distort the graph. The graph demonstrates that speculation without BTB obtains the worst result: its execution times are very high because of the large number of executed paths that are ultimately erased. The path unfolding method performs very well. Although it executes faster than speculation with BTB, the difference is reduced when many processors are available, because from the second iteration onwards the speculation with BTB finds the correct path. Finally, the mixed path unfolding method always yields the fastest execution. Figure 24 and Table IX show the number of processes started by each method. Clearly, speculation without BTB starts the largest number of processes but also makes the most mistakes. Path unfolding and speculation with BTB produce values quite similar to each other. The mixed path unfolding method is the best because it combines the advantages of the previous two methods.

The number of started processes
As shown in Figure 25 and Table X, the behavior of the methods in terms of the number of erased processes is very similar to their behavior in terms of the number of started processes, so the same conclusions apply.

Matrix vector multiplication
In this example, a real algorithm is used instead of the synthetic one shown in the previous section. We selected the following algorithm for dense matrix (of size n × n) multiplication (Figure 26).
It contains three nested loops. To parallelize this algorithm, the parallelizing subsystem uses the two internal loops as code for the worker and the external loop to define the number of execution times (controlled by the farmer process).§

Table IX. Number of processes started by each method as a function of the number of CPUs.
1  136  136  136  136
2  154  154  146  144
3  171  171  155  151
4  195  195  167  160
5  210  206  180  170
6  255  206  180  170
7  237  228  184  174
8  246  271  189  179
9  275  268  183  173

There is a dependence caused by variable 'i'. This data dependency would cause blocking until the current iteration has ended. Some parallel implementations resolve this by applying the 'loop unrolling' technique (i.e., unfolding all the iterations). Instead, MSSPACC uses speculation in a dynamic way. Therefore, the results obtained by MSSPACC are equivalent to those obtained by a parallel execution implementing 'loop unrolling'. The original version of the problem is a sequential one, written in 'C' and compiled without loop unrolling or heavy optimization. In the comparisons, we report only execution times. Different matrix sizes have been used to observe the system performance: N = 500, 1000, 2000. A cluster of 22 Intel (Corporation, Santa Clara, CA, USA) Core 2 Duo E4700 2.60 GHz machines with 1 GB RAM each was used. The comparison does not include supercomputers or multiprocessor systems because the proposed system is not intended to compete with explicitly parallel programs executed on multiprocessors or scalar processors. Instead, MSSPACC aims at extracting and exploiting parallelism from sequential programs executed on computer clusters. Table XI shows the actual values of the executions. The results (execution times) of the three experiments are normalized with respect to the sequential execution time, which is shown in the second column.
The results show that the speculation management time is small relative to the run time of the blocks. Thus, after applying loop unrolling to the parallelization of matrix multiplication on a cluster, the execution times of the resulting code under PVM or MPI would be similar to the MSSPACC execution times.
In conclusion, MSSPACC performs well, being at least as good as a system using 'loop unrolling' in the parallelization of matrix multiplication on a cluster. The main difference is that MSSPACC executes dynamically using speculation.

Figure 29. Selected 'travelling salesman problem' algorithm.

Travelling salesman problem
The first example showed that, using MSSPACC, we can automatically parallelize a sequential algorithm for execution on a cluster, obtaining the same results as explicit parallelization via loop unrolling would achieve. The second example shows a case in which a parallel algorithm restricted by dependencies benefits from the speculation introduced by the MSSPACC system using the farmer/worker model. We use the well-known 'travelling salesman problem' (also studied in [56]), which calculates the shortest Hamiltonian circuit in a graph [57][58][59]. The problem is NP-hard. We selected the following optimized algorithm [13], designed for parallel execution with or without speculation, run on a cluster of 20 Pentium III 1.7 GHz computers with 512 MB RAM (Figure 29). Figure 30 shows the execution times as a function of the number of start cities (varying from 3 to 10) and the execution method: parallel execution without speculation and parallel execution with speculation. In the latter, two implementations of different sizes are used: a farmer with three workers and three subfarmers, each of which also supervises three workers, denoted the (1/3)*4 system, with a total of 16 processors; and a similar system with four subfarmers, denoted the (1/4)*3 system, with 20 processors. As can be observed in Figure 30, the speculative execution method reduces the execution time drastically. Data and control dependencies limit the maximum parallelism that the algorithm can efficiently use. In this example, the speculation is able to predict the values of the induction variables easily, which significantly increases the degree of parallelism of the program. On the other hand, the use of different numbers of farmer-worker groups does not offer a significant improvement of the execution time.

CONCLUSIONS
Unfolding paths in the branch structures allows us to break the control dependencies existing in the code and obtain a high degree of parallelism through the use of currently inactive CPUs. The main
challenge of implementing this technique is to deal efficiently with multiple branches. In longer branch sequences, opening two new paths by unfolding a conditional branch before the previous unfolding is resolved would increase the cost of management, thereby reducing the benefits of such unfolding. To avoid this drawback, we propose to suppress the unfolding of additional branches until the current branch is resolved and to apply speculation to the subsequent branches instead.
In this paper, we have compared four possible implementations of dealing with a branch. Two of them use speculation without splitting paths (one with historical information about the behavior of the condition and one without). Two others split paths when a new branch is encountered (one with historical information about the behavior of the condition and one without).
The results demonstrate that the use of unfolding combined with speculation using statistical information (the BTB technique) achieves the best time performance and the highest number of processes executed correctly. These gains are especially high for iterative structures in which the conditions are repeatedly executed. This is due to not splitting the very high percentage of branches that are predicted correctly. In contrast, splitting branches without BTB executes more paths that must be later discarded and therefore gives worse results. When the number of CPUs is large, the results of splitting without BTB improve because the discarded processes use processors that would be otherwise idle which helps this method to match the results of the technique without splitting.
In future research, we will study how performance evolves when process sizes vary at runtime. The environment will also be modified to enable higher degrees of parallelism and to evaluate the impact of system enhancements on performance. Finally, the technology that we developed should be easily extendable from clusters to multicore processors, providing an interesting and easy way of parallelizing sequential codes even in the presence of data dependencies. Our future work will include exploring this direction of our system's evolution.