The art of multistreaming

One of the most efficient methods for improving long-running batch performance on the IBM i platform is multistreaming, involving the submission of multiple instances of the processing program, each handling its own subset of data. Typically, this is achieved by creating a multistreaming wrapper around the original process. While there are various approaches to implement multistreaming, regardless of the chosen method, it requires addressing three key questions and selecting from the associated sets of options.

Question 1. What is the runtime model employed?

Two options are available in this context: submitting processing streams as batch jobs or threads. The latter method is seldom used and, in reality, provides few advantages. While it is true that threads typically consume fewer resources, the difference in the case of, for instance, 10 threads versus 10 jobs for a 30-minute batch is negligible. Additionally, the thread model implies thread-safety of the processes, a criterion met by very few legacy batch programs.

Question 2. What mechanisms are utilised to logically partition the processed data?

This option presents a more interesting choice. The widely adopted approach involves the creation of temporary data subsets, one for each stream. While this method is applicable excluisively to read-only data and requires additional time and resources for subset creation, it continues to be the preferred option for many IBM i users.

A dynamic modification of this approach involves the creation of a conveyor mechanism, similar to the mapping procedure in the MapReduce programming model. The conveyor retrieves data from primary files (tables) and subsequently conducts filtering and sorting into queues associated with the active stream processes. To implement this method, the original batch programs typically need modification to receive data from queues instead of directly from files (tables).

The second method relies on the OPNQRYF command. In this approach, each stream executes the command with a unique filter and subsequently overrides the program’s primary file file with the the query file. This method is effective for both physical and logical primary files, but it is suited for OPM and ILE native data access only.

If the access to the primary datasets of the process relies on the use of SQL, a more suitable approach involves constructing temporary views with appropriate selections, one for each stream, and subsequently overriding the primary tables/views in the batch process with these views. However, similar to the previous method, this approach has its drawbacks—it is not compatible with OPM and ILE native access programs utilising keyed logical files, making it less universal in its applicability.

The most comprehensive approach is to generate temporary views, one for each stream, and then execute OPNQRYF commands for retrieving or updating the data in those views. This method stands out as the optimal way to manage heterogeneous batch processes that employ both native and SQL data access.

Question 3. What types of data breakdown are most suitable for the given process?

Relative record number (RRN)

The most straightforward file or table breakdown involves splitting data by the relative record number. Each stream is assigned an equal range of record numbers for processing, making it effective even with logical files and views. However, the challenge lies in achieving a truly balanced breakdown. In cases with a substantial number of deleted records in a file, some streams may be allocated significantly larger volumes than others.

Another drawback of the RRN breakdown is its potential incompatibility with the algorithm employed by the batch process. For instance, if the process is configured to calculate totals by branch or region, all records corresponding to a particular branch or region must be allocated to the same stream. This is a requirement that the RRN breakdown clearly violates.

Field (column) value ranges

A good method in this scenario involves breaking down the data by values of a specific database file (table) field (column). For instance, cost centers with IDs from 0001 to 1000 may be allocated to the first stream, 1001 to 2000 to the second, and so forth. While this approach enhances compatibility with the batch process algorithm compared to the RRN breakdown, it does come with its own set of drawbacks.

Firstly, defining groups of key value ranges for multistreaming may present a challenge. In many cases, this activity is manual and, moreover, needs to be periodically repeated to account for the dynamic nature of the values of the key field (column).

Secondly, achieving volume balancing in this scenario can be even more challenging. It is possible that the ‘head office cost centre’ of the company accumulates more business volumes than all other cost centers combined. In such cases, breaking down by cost centre ranges may not significantly contribute to effective multistreaming, as a majority of the batch processing could end up being allocated to a single processing stream.

Virtual field (column) value ranges

An interesting alternative for defining the breakdown is by the values of a virtual field or column. The concept of a virtual field-based breakdown is straightforward in theory, although the actual implementation may introduce complexities.

Elaborating on this approach, consider a scenario where a chosen physical or logical file selected for splitting includes an additional decimal field. The values of this field for various records are randomly assigned within the 1-999 range but remain consistently identical for records sharing a specific key, such as, for example, the cost centre number. In this setup, streams can be allocated based on the values of this virtual field. The random nature of the values in the virtual field might assist in auto-balancing to a signigficant extent.

One of the possible ways to implement the virtual field multistraming is as follows.

1. Create a new table, VFMS, with two columns: the key from the primary table, chosen for grouping the records into streams, and a decimal (3,0) field designated for the virtual field implementation.

CREATE TABLE VFMS(KEY1 CHARACTER (10 ) NOT NULL WITH DEFAULT,
VFIELD DECIMAL (3 , 0) NOT NULL WITH DEFAULT)

where KEY1 is the key selected.

2. Before initiating each batch run, ensure the worktable content is up-to-date, i.e. in sync with the data in the primary table, by executing the following SQL statements.

DELETE FROM VFMS B WHERE NOT EXISTS(SELECT * FROM PRIMARY A WHERE a.KEY1=b.KEY1)

INSERT INTO VFMS SELECT DISTINCT KEY,0 FROM PRIMARY A WHERE NOT EXISTS(SELECT * FROM VFMSWORK B WHERE a.KEY1=b.KEY1)

UPDATE VFMS SET VFIELD=INT(ABS(RAND()-0.000001)*999+1) WHERE VFIELD=0

The VFMS table can now be seamlessly joined with the PRIMARY table using the values of the KEY1 field, establishing the foundation for the virtual field implementation.

At runtime, the subsequent statement would be executed for each stream:

CREATE VIEW QTEMP/PRIMARY AS SELECT * FROM PRIMARY a WHERE ((EXISTS(SELECT * FROM VFMS B WHERE ( A.KEY1=b.KEY1)
AND (B.VFIELD >= lowerstreamlimit ) AND (B.VFIELD <= higherstreamlimit ))) OR ( NOT EXISTS(SELECT * FROM PRIMARY B WHERE ( A.KEY1=b.KEY1)))

This temporary view will then serve as the source for the data to be processed by the given stream.

3. ROWNUMBER() function

A similar result can be achieved using the SQL OLAP ROW_NUMBER() function. This approach provides a notable benefit by obviating the necessity for an extra VFMS table. The randomisation of values for allocating records to a stream is executed by the SQL engine during the creation of the temporary view in QTEMP. However, the resulting SQL statements become somewhat intricate, rendering the generated code more challenging to comprehend and maintain.

4. Hiererchical data breakdown

At the pinnacle of sophistication lies the hierarchical breakdown approach, where the data in the primary table(s) is initially segmented by value ranges of a certain field (column), e.g. cost centre, and each range containing a single value undergoes further breakdown using either another field (e.g., account number), or a virtual field. This method exhibits a high probability of delivering a robust breakdown while still preserving compatibility with the batch algorithm.

Multistreaming, of course, presents other challenges, such as establishing the correct environment, including file overrides, QTEMP objects, and open files for each of the submitted process streams. Additionally, when the batch generates spool reports, careful consideration must be given to the optimal method of merging the spool files produced by the individual stream jobs.

All the previously mentioned methods and more are integral features of the iSTREAM LP Generic Multistreaming Option. This instrumental application is generally capable of multistreaming many existing batch processes without requiring any modifications to the original programs.

The art of multistreaming

Who are we?

How to contact us