Decentralized Parallel Stochastic Gradient Descent (D-PSGD)

Decentralized Parallel Stochastic Gradient Descent (D-PSGD) is the baseline algorithm that enables learning in decentralized setups by combining SGD with gossip averaging. Note that we sometimes refer to this algorithm as DSGD.

Notations:

n = total number of agents.
$N_i$ = set of neighbors of $i$ including itself.
$x_i^k$ = model parameters of agent $i$ at iteration $k$.
$W$ = weight matrix or mixing matrix.
$\eta$ = learning rate.
$K$ = total number of iterations.
Stochastic local loss function at agent $i$: $F_i(x,d)$.
Local loss function: $f_i(x)$.
Global loss function: $f(x) = \frac{1}{n} \sum \limits_{i=1}^n f_i(x)$.

D-PSGD Algorithm:

At the beginning of the training, all the agents are initialized to the same model parameters and hyperparameters are synchronized. At every iteration of D-PSGD, each agent executes three main steps: a. Compute the gradients on a mini-batch of local data and update the local model parameters, b. Communicate the model parameters with the neighbors, and c. Update the model parameters via the gossip averaging step. The pseudo-code for D-PSGD is given below.

for $k = 0,1,…,K-1$ parallelly on each agent $i$ do:

Randomly sample $d_i^k$ from local data of the $i^{th}$ agent

Compute local stochastic gradients: $g_i^k = \nabla F_i(x_i^k, d_i^k)$

Update the model parameters using SGD step: $x_i^{k+\frac{1}{2}} = x_i^k - \eta g_i^k$

Communicate $x_i^{k+\frac{1}{2}}$ to neighbors $N_i$

Update the model parameters using the gossip averaging step: $x_i^{k+1} = \sum_{j \in N_i} w_{ij} x_j^{k+\frac{1}{2}}$

end for

Note that we could also perform steps 4 and 5 before step 3 i.e., communicate and do the gossip averaging before performing the local SGD step. It can be shown that both versions of the algorithm have the same convergence rate.

Convergence:

D-PSGD algorithm has a convergence rate of $O(\frac{1}{\sqrt{nK}})$ under the following constraints:

Underlying graph topology G is strongly connected with self-loops. Mixing matrix $W$ is doubly stochastic and symmetric.
The communication is synchronous i.e., each agent sends and receives information at the same time.
The loss function $f_i$ is assumed to be L-smooth for all $i$ i.e., $| | \nabla f_i(y)-f_i(x) | | \leq L | | y-x | |$.
Unbiased stochastic gradient: $f_i(x) = \mathbb{E}_{d_i \in D_i}[F_i(x; d_i)]$.
Bounded stochastic noise: $\mathbb{E}_{d_i \in D_i} | | \nabla F_i(x,d_i) - \nabla f_i(x) | | \leq \sigma^2$.
Bounded outer gradient variance: $\frac{1}{n} \sum \limits_{i=1}^n | | \nabla f_i(x) - \nabla f(x) | | \leq \zeta^2$.

Constraint 6 indicates that the data across the agents is Independent and Identically Distributed (IID). It can also be interpreted as the bound on heterogeneity in the data distribution across the agents.

Bottlenecks:

Communication overhead: Every agent needs to communicate to all its neighbors at every iteration ($| N_i | \times$ communication). But in a centralized setup, each agent has to only communicate to the server and back ($2 \times$ communication).
Undirected and static graphs: D-PSGD does not work for directed and time-varying graphs.
IID data: D-PSGD either diverges or has performance degradation for non-IID data.
Synchronous Communication.
D-PSGD does not guarantee differential privacy.

A PyTorch implementation of D-PSGD can be found here.

Additional info:

The gossip averaging step can be rewritten as $x_i^{k+1} = x_j^{k+\frac{1}{2}} + \gamma \sum_{j \in N_i} w_{ij} (x_j^{k+\frac{1}{2}}-x_i^{k+\frac{1}{2}})$. Here, $\gamma$ is referred to as the averaging rate and is usually set to $1$ when there is no communication compression.

There exists another variant of D-PSGD known as Decentralized Federated averaging (DFedAvg). The main difference is that the DFedAvg algorithm has multiple local SGD updates before each communication round similar to FedAvg. In contrast, the D-PSGD algorithm communicates after every local update.

Gossip Averaging

Stochastic Gradient Push (SGP)