2. Introductory Example

import numpy as np
import theano
import theano.tensor as T
import synkhronos as synk

synk.fork()  # initialize master GPU here; additional GPUs in other processes
x = T.matrix('x')
y = T.vector('y')
z = T.mean(x.dot(y), axis=0)
f_th = theano.function(inputs=[x, y], outputs=z)  # single-GPU reference
f = synk.function(inputs=[x], bcast_inputs=[y], outputs=z)  # multi-GPU
synk.distribute()  # replicate functions (and variables) onto all GPUs

x_dat = np.random.randn(100, 10).astype('float32')
y_dat = np.random.randn(10).astype('float32')
x_synk = synk.data(x_dat)  # place data in OS shared memory
y_synk = synk.data(y_dat)
r_th = f_th(x_dat, y_dat)
r = f(x_synk, y_synk)  # x is scattered across GPUs; y is broadcast whole

assert np.allclose(r, r_th)
print("All assertions passed.")

On a dual-GPU machine, this prints:

Synkhronos attempting to use 2 of 2 detected GPUs...
Using cuDNN version 6020 on context None
Preallocating 5677/8110 Mb (0.700000) on cuda1
Using cuDNN version 6020 on context None
Preallocating 5679/8113 Mb (0.700000) on cuda0
Mapped name None to device cuda1: GeForce GTX 1080 (0000:01:00.0)
Mapped name None to device cuda0: GeForce GTX 1080 (0000:02:00.0)
Synkhronos: 2 GPUs initialized, master rank: 0
Synkhronos distributing functions...
...distribution complete (0 s).
All assertions passed.

The program flow is:

  1. Call synkhronos.fork().
  2. Build Theano variables and graphs.
  3. Build functions through Synkhronos instead of Theano.
  4. Call synkhronos.distribute().
  5. Manage input data / run program with functions.

2.1. In More Detail

Import Theano in CPU mode; fork() then initializes the master GPU in the main process and additional GPUs in worker processes. All Theano variables built thereafter live in the master process, as in a single-GPU program. distribute() replicates all functions, and their variables, to the worker processes and their GPUs.

A function’s inputs are scattered by splitting them evenly along the 0-th dimension. In this example, data parallelism applies across the 0-th dimension of the variable x. A function’s bcast_inputs are broadcast whole to all workers, as with the variable y in the example.
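The scatter/broadcast semantics can be sketched in plain NumPy (this is not the Synkhronos implementation, just an illustration; n_gpu = 2 matches the example output above):

```python
import numpy as np

n_gpu = 2
x_dat = np.random.randn(100, 10).astype('float32')
y_dat = np.random.randn(10).astype('float32')

# Scatter: each worker receives an even slice of x along the 0-th dimension.
x_shards = np.array_split(x_dat, n_gpu, axis=0)
# Broadcast: every worker uses the full y.
worker_results = [np.mean(shard.dot(y_dat), axis=0) for shard in x_shards]

# Averaging the per-worker means reproduces the single-device result
# (the shards are equal-sized here, so a simple average is exact).
assert np.allclose(np.mean(worker_results), np.mean(x_dat.dot(y_dat), axis=0))
```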

All explicit inputs to functions must be of type synkhronos.Data, rather than NumPy arrays. The underlying memory of these objects resides in OS shared memory, so all processes have access to it. They present an interface similar to NumPy arrays; see demos/demo_2.py.
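The underlying idea can be sketched with the Python standard library (this uses multiprocessing.shared_memory for illustration, not the Synkhronos API): an array whose buffer lives in OS shared memory can be viewed by any process that attaches to it, without copying.

```python
import numpy as np
from multiprocessing import shared_memory

src = np.random.randn(4, 3).astype('float32')

# Allocate an OS shared-memory segment and view it as a NumPy array.
shm = shared_memory.SharedMemory(create=True, size=src.nbytes)
shared_view = np.ndarray(src.shape, dtype=src.dtype, buffer=shm.buf)
shared_view[:] = src  # write once; any process attaching by shm.name sees it

result = shared_view.copy()
assert np.array_equal(result, src)

shm.close()
shm.unlink()  # free the segment when done
```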

The Synkhronos function is computed simultaneously on all GPUs, including the master. By default, outputs are reduced by averaging across GPUs, so the comparison against the single-GPU Theano function result passes. Other reduce operations are available: sum, prod, max, min, or None for no reduction.
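In plain NumPy terms, the reduction modes combine the per-GPU partial outputs as follows (Synkhronos performs the equivalent across devices; the values here are illustrative):

```python
import numpy as np

partials = np.array([1.5, 2.5], dtype='float32')  # one partial output per GPU

avg = partials.mean()      # default: reduce and average
total = partials.sum()     # 'sum'
largest = partials.max()   # 'max'

assert avg == 2.0
assert total == 4.0
assert largest == 2.5
```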

2.2. Distribute

After all functions are constructed, calling distribute() pickles all functions (and their shared variable data) in the master and unpickles them in all workers. This may take a few moments. Pickling all functions together preserves correspondences among variables used in multiple functions in each worker.
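The reason for pickling everything together can be demonstrated with plain pickle (this stands in for Theano functions with a NumPy array playing the role of a shared variable): objects serialized in one payload keep their shared references after unpickling, while separate payloads yield independent copies.

```python
import pickle
import numpy as np

shared_param = np.zeros(3)        # stands in for a Theano shared variable
f1 = {'param': shared_param}      # two "functions" using the same variable
f2 = {'param': shared_param}

# Pickled together: the shared variable remains one object after loading.
f1_together, f2_together = pickle.loads(pickle.dumps((f1, f2)))
assert f1_together['param'] is f2_together['param']

# Pickled separately: the correspondence is lost.
f1_alone = pickle.loads(pickle.dumps(f1))
f2_alone = pickle.loads(pickle.dumps(f2))
assert f1_alone['param'] is not f2_alone['param']
```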

Currently, distribute() can be called only once. In the future, it may be automated or made callable multiple times. Synkhronos data objects can be created before or after distributing, but only after forking.