3. Theano Shared Variable Example

This example demonstrates management of Theano shared variables in the master and workers. First, the setup, again using two GPUs:

import numpy as np
import theano
import theano.tensor as T
import synkhronos as synk

synk.fork(2)
s_init = np.ones(2, dtype='float32')
x = T.matrix('x')
s = theano.shared(s_init, name='s')
f = synk.function([x], updates=[(s, T.sum(x * s, axis=0))])
synk.distribute()
x_dat = synk.data(np.array([[1, 1],
                            [2, 2],
                            [3, 3],
                            [4, 4]]).astype('float32'))
print("\ns initial:\n", s.get_value())

f.as_theano(x_dat.data)
print("\ns after Theano call:\n", s.get_value())

The resulting output is the correct answer:

s initial:
 [ 1.  1.]

s after Theano call:
 [ 10.  10.]
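
As a quick check, the update s <- sum(x * s, axis=0) applied to the full data batch in plain NumPy gives the same result:

# Plain NumPy equivalent of the Theano update over all four rows:
x_np = np.array([[1, 1], [2, 2], [3, 3], [4, 4]], dtype='float32')
print(np.sum(x_np * s_init, axis=0))   # -> [ 10.  10.]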

Continuing with a reset and a call to the Synkhronos function, we inspect the results using gather(), one way to collect the shared variable's values from all GPUs into the master:

s.set_value(s_init)
f(x_dat)
print("\nlocal s after reset and Synkhronos call:\n", s.get_value())

gathered_s = synk.gather(s, nd_up=1)
print("\ngathered s:\n", gathered_s)

synk.reduce(s, op="sum")
print("\nlocal s after in-place reduce:\n", s.get_value())

gathered_s = synk.gather(s, nd_up=1)
print("\ngathered s after reduce:\n", gathered_s)

The resulting output:

local s after reset and Synkhronos call:
 [ 3.  3.]

gathered s:
 [[ 3.  3.]
 [ 7.  7.]]

local s after in-place reduce:
 [ 10.  10.]

gathered s after reduce:
 [[ 10.  10.]
 [  7.   7.]]
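
These per-GPU values follow from the even split of the input data across the two GPUs: the master works on the first two rows and the worker on the last two, consistent with the gathered values above. A quick NumPy check of the per-GPU computation:

# Each GPU applies the update to its half of the scattered data:
x_np = np.array([[1, 1], [2, 2], [3, 3], [4, 4]], dtype='float32')
print(np.sum(x_np[:2] * s_init, axis=0))   # master (GPU 0): [ 3.  3.]
print(np.sum(x_np[2:] * s_init, axis=0))   # worker (GPU 1): [ 7.  7.]
# The in-place reduce then sums these into the master: [ 3.  3.] + [ 7.  7.] = [ 10.  10.]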

Lastly, to propagate the result to all workers and observe the effect, reset s on all GPUs with a broadcast, call the Synkhronos function again, and all-reduce the result:

synk.broadcast(s, s_init)
f(x_dat)
synk.all_reduce(s, op="sum")
gathered_s = synk.gather(s, nd_up=1)
print("\ngathered s after reset broadcast, Synkhronos call, "
    "and all-reduce:\n", gathered_s)
gathered s after local reset, broadcast, Synkhronos call, and all-reduce:
 [[ 10.  10.]
 [ 10.  10.]]

Notice the use of broadcast() to set the same values on all GPUs before the call, and of all_reduce() to leave the summed result on every GPU afterward.

3.1. Notes on Collectives

Collectives can be called on any Theano shared variable used in a Synkhronos function. CPU- and GPU-based collectives are available through the same interface. Results of a GPU collective communication may be returned as a new GPU array in the master, but no collective can create a new array (not associated with a Theano shared variable) in a worker.
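
For example, the array returned by synk.gather() above may live on the GPU rather than in host memory. A minimal sketch of bringing it to the host, assuming the returned object supports NumPy conversion (e.g. a pygpu GpuArray):

gathered_s = synk.gather(s, nd_up=1)    # new array in the master; may be a GPU array
print(type(gathered_s))                 # GPU array type or numpy.ndarray
gathered_np = np.asarray(gathered_s)    # explicit copy to host memory if needed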

Synkhronos also provides an averaging reduction. The avg reduce operation is not present in NCCL; Synkhronos performs a sum and then multiplies by the reciprocal of the number of GPUs.
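
For example, assuming the op keyword accepts "avg" (as the name above suggests), averaging s across GPUs mirrors the sum-based calls already shown:

# Sketch of the averaging reduction; equivalent to a sum followed by scaling by 1 / n_gpu:
synk.reduce(s, op="avg")       # average written in-place into the master's s
synk.all_reduce(s, op="avg")   # or: average left in place on every GPU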

3.1.1. Theano Shared Variable Sizes

Beware that the NCCL collectives assume the variable has the same shape on each GPU, but it is possible for the shapes to differ in Synkhronos. In particular, gather and all_gather may silently leave off data or add extra data without raising an exception; in this case, use CPU-based gather operations. See demos/demo_3.py for more about manipulating GPU variables in workers.
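
A hypothetical illustration of the hazard (a sketch only; the exact result depends on the shapes involved):

# Suppose the master's copy of s is resized while the workers' copies keep shape (2,):
s.set_value(np.ones(3, dtype='float32'))
gathered_s = synk.gather(s, nd_up=1)    # NCCL assumes equal shapes on every GPU, so the
                                        # result may silently drop or pad data; no exception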