Stochastic Gradient Push (SGP)
In the previous post, we discussed a baseline decentralized learning algorithm known as D-PSGD. One of the bottlenecks of D-PSGD is that it requires the connectivity across the agents to be static and undirected. However, in practical scenarios, the connectivity across the agents can change over time. This calls for a decentralized algorithm that converges on time-varying and directed graph structures. Stochastic Gradient Push (SGP) achieves this by combining PushSum with stochastic gradient updates. The main difference between D-PSGD and SGP is that SGP communicates an additional scalar called the ps-weight (PushSum weight) along with the model parameters among the neighboring agents. The ps-weight keeps track of the total mixing weight received from the neighborhood in the current iteration.
Let’s first understand SGP with a simple example. Consider an agent-$i$ with neighbors $j,k$.
- In D-PSGD, agent-$i$ assumes that it always receives information from $j,k$, and that the mixing is a properly weighted average (row-stochastic nature of the mixing matrix). In particular, agent-$i$ simply multiplies its own information by its self-weight ($w_{ii}x_i$) and combines it with the information stored in its receiving buffer ($w_{ij}x_j+w_{ik}x_k$). That is, agent-$i$ updates $x_i = w_{ii}x_i+w_{ij}x_j+w_{ik}x_k$, assuming that the receiving buffer always has information from $j,k$ and that $w_{ii}+w_{ij}+w_{ik}=1$. This assumption breaks when we have time-varying and/or directed graphs. To circumvent this, SGP sends the weight values along with the required information.
- In SGP, agent-$i$ has two receiving buffers: buffer1 for the required information ($w_{ij}x_j+w_{ik}x_k$) and buffer2 for the weights ($w_{ij}+w_{ik}$). To update, agent-$i$ adds its weighted self-information to buffer1 and its self-weight to buffer2. The update for agent-$i$ is then $x_i = (w_{ii}x_i+w_{ij}x_j+w_{ik}x_k)/(w_{ii}+w_{ij}+w_{ik})$. Now suppose agent-$j$'s connection to $i$ is broken for some reason. In this case, D-PSGD's update becomes $x_i = w_{ii}x_i+w_{ik}x_k$, whereas SGP's update is $x_i = (w_{ii}x_i+w_{ik}x_k)/(w_{ii}+w_{ik})$. The SGP update remains a proper weighted average, while the D-PSGD update does not. This normalization is the key mechanism that lets SGP handle time-varying and directed graphs (see the numeric sketch below).
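To make this concrete, here is a minimal numeric sketch of the two update rules when the link from $j$ to $i$ drops. The uniform weights $w_{ii}=w_{ij}=w_{ik}=1/3$ and all variable names are illustrative choices, not part of any implementation.

```python
# Toy setup: agent-i with neighbors j, k and (assumed) uniform mixing weights.
w_ii, w_ij, w_ik = 1/3, 1/3, 1/3
x_i, x_j, x_k = 1.0, 2.0, 3.0

# The link i <- j is broken this iteration, so only k's message arrives.
recv_buffer = w_ik * x_k                # buffer1: received weighted models
recv_weight = w_ik                      # buffer2: received mixing weights

# D-PSGD-style update: silently assumes all neighbors were heard from,
# so the effective weights sum to 2/3 instead of 1.
x_i_dpsgd = w_ii * x_i + recv_buffer

# SGP-style update: normalize by the weights actually received,
# which restores a proper weighted average of {x_i, x_k}.
x_i_sgp = (w_ii * x_i + recv_buffer) / (w_ii + recv_weight)

print(x_i_dpsgd)  # 1.333... (biased low because of the missing weight)
print(x_i_sgp)    # 2.0      (correct weighted average of x_i and x_k)
```

The D-PSGD-style result is biased because the effective weights no longer sum to one, while the SGP-style update recovers the intended weighted average.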
Notations:
- $n$ = total number of agents.
- $N_i^k$ = set of neighbors of $i$ including itself at iteration $k$.
- $x_i^k$ = model parameters of agent $i$ at iteration $k$.
- $z_i^k$ = de-biased model parameters of agent $i$ at iteration $k$.
- $u_i^k$ = PushSum weight of agent $i$ at iteration $k$.
- $W^k$ = weight matrix or mixing matrix at iteration $k$.
- $\eta$ = learning rate.
- $K$ = total number of iterations.
- Stochastic local loss function at agent $i$: $F_i(x,d)$.
- Local loss function: $f_i(x)$.
- Global loss function: $f(x) = \frac{1}{n} \sum \limits_{i=1}^n f_i(x)$.
SGP Algorithm:
At the beginning of training, all agents are initialized with the same model parameters, $u_i^0=1$ for all $i$, and the hyperparameters are synchronized. The pseudo-code for SGP is given below.
for $k = 0,1,…,K-1$ parallelly on each agent $i$ do:
- Randomly sample $d_i^k$ from local data of the $i^{th}$ agent
- Compute local stochastic gradients: $g_i^k = \nabla F_i(z_i^k, d_i^k)$
- Update the model parameters using SGD step: $x_i^{k+\frac{1}{2}} = x_i^k - \eta g_i^k$
- Communicate $x_i^{k+\frac{1}{2}}, u_i^k$ to neighbors $N_i^k$
- Update the model parameters using the gossip averaging step: $x_i^{k+1} = \sum_{j \in N_i^k} w_{ij}^k x_j^{k+\frac{1}{2}}$
- Update the PushSum weight: $u_i^{k+1} = \sum_{j \in N_i^k} w_{ij}^k u_j^{k}$
- Debias model weights: $z_i^{k+1} = x_i^{k+1} / u_i^{k+1}$
end for
Note that the gradients ($g_i$) are evaluated at the de-biased parameters ($z_i$), but the updates are performed on the original model parameters ($x_i$). The non-zero entries of the mixing matrix $W^k$ define the communication topology at iteration $k$, so SGP can leverage various communication topologies, including sparse, directed (asymmetric), or time-varying networks. At every iteration, each agent $i$ only needs to know/choose its outgoing mixing weights, i.e., the $i^{th}$ column of $W^k$, and the agents maintain column stochasticity (each column of $W^k$ sums to one). A PyTorch implementation of SGP can be found here.
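To ground the pseudo-code, here is a minimal single-process sketch of one synchronous SGP iteration that simulates all $n$ agents in memory. It is not the linked PyTorch implementation: `sgp_step`, `loss_fn`, and `data_batches` are hypothetical placeholders, and the mixing matrix `W` is assumed to be column stochastic for the current iteration.

```python
import torch

def sgp_step(models, u, data_batches, loss_fn, W, lr):
    """One synchronous SGP iteration over all agents, simulated in-process.

    models: list of flat parameter tensors x_i
    u:      1-D tensor of PushSum weights u_i
    W:      column-stochastic mixing matrix for this iteration, W[i, j] = w_ij^k
    loss_fn, data_batches: placeholders supplied by the caller (assumptions).
    """
    n = len(models)

    # 1) Local SGD step: gradients are evaluated at the de-biased
    #    parameters z_i = x_i / u_i, but applied to x_i.
    half = []
    for i in range(n):
        z = (models[i] / u[i]).detach().requires_grad_(True)
        loss = loss_fn(z, data_batches[i])
        loss.backward()
        half.append(models[i] - lr * z.grad)

    # 2) Gossip averaging of models and PushSum weights with the
    #    (possibly sparse, directed, time-varying) mixing matrix W^k.
    new_models = [sum(W[i, j] * half[j] for j in range(n)) for i in range(n)]
    new_u = W @ u

    # 3) De-biased parameters used for evaluation and the next gradient step.
    z_next = [new_models[i] / new_u[i] for i in range(n)]
    return new_models, new_u, z_next
```

Note how the gradient is taken at $z_i = x_i / u_i$ while the SGD and gossip steps operate on $x_i$, mirroring the remark above.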
Convergence:
The SGP algorithm has a convergence rate of $O(\frac{1}{\sqrt{nK}})$ under the following assumptions:
- The loss function $f_i$ is assumed to be L-smooth for all $i$, i.e., $\| \nabla f_i(y) - \nabla f_i(x) \| \leq L \| y-x \|$.
- Unbiased stochastic gradients: $f_i(x) = \mathbb{E}_{d_i \sim D_i}[F_i(x; d_i)]$, so that $\mathbb{E}_{d_i \sim D_i}[\nabla F_i(x; d_i)] = \nabla f_i(x)$.
- Bounded stochastic noise: $\mathbb{E}_{d_i \sim D_i} \| \nabla F_i(x; d_i) - \nabla f_i(x) \|^2 \leq \sigma^2$.
- Bounded outer gradient variance: $\frac{1}{n} \sum \limits_{i=1}^n \| \nabla f_i(x) - \nabla f(x) \|^2 \leq \zeta^2$.
- Let $E^k$ be the set of edges at iteration $k$. We assume that there exist finite positive integers $B$ and $\Delta$ such that the graph with edge set $\cup_{k=lB}^{(l+1)B-1} E^k$ is strongly connected and has diameter at most $\Delta$ for every $l \geq 0$ (see the sketch after this list).
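The last assumption only requires connectivity over windows of $B$ iterations, not at every single iteration. Below is a small sketch of one illustrative (assumed) choice, a directed one-peer scheme where each agent pushes to a single out-neighbor at distance $2^{(k \bmod \log_2 n)}$: individual iterations need not be strongly connected, but the union over $B = \log_2 n$ consecutive iterations is.

```python
import numpy as np

def one_peer_W(n, k):
    """Column-stochastic mixing matrix for iteration k: agent i keeps half of its
    mass (w_ii = 1/2) and pushes the other half to the single out-neighbor
    (i + 2**(k % log2(n))) % n. An illustrative choice, not the only valid one."""
    offset = 2 ** (k % int(np.log2(n)))
    W = 0.5 * np.eye(n)
    for i in range(n):
        W[(i + offset) % n, i] = 0.5      # column i sums to 1
    return W

def strongly_connected(adj):
    """Reachability check; adj is boolean and contains self-loops, so adj^n > 0
    everywhere iff the directed graph is strongly connected."""
    reach = np.linalg.matrix_power(adj.astype(int), adj.shape[0])
    return bool((reach > 0).all())

n, B = 8, 3                                     # B = log2(n) iterations per window
graphs = [one_peer_W(n, k) > 0 for k in range(B)]
union = np.logical_or.reduce(graphs)

print([strongly_connected(g) for g in graphs])  # e.g. [True, False, False]
print(strongly_connected(union))                # True: the B-iteration window is connected
```

For this scheme, the union over a window contains the full directed ring, so it is strongly connected even though most individual iterations are not.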
Note that SGP can also be shown to converge with a bounded delay between the updates. In particular, each agent may perform up to $\tau$ gradient updates while waiting for information from its neighbors, which hides the communication overhead by overlapping gradient computation with communication. As a result, the gossip averaging and PushSum weight updates may incorporate outdated messages, i.e., agent $i$ receives $x_j^{k'+\frac{1}{2}}$ and $u_j^{k'}$ with $k-k' \leq \tau$. As long as the delay $\tau$ remains bounded, SGP is still guaranteed to converge. A toy illustration of this robustness to bounded delays is sketched below.
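As a sanity check on the bounded-delay claim, here is a toy simulation of PushSum averaging alone (no gradients), where every pushed message may arrive up to $\tau$ iterations late; the de-biased ratios $x_i/u_i$ still approach the true average. This is only an illustration of the mechanism, with an assumed rotating one-peer topology, not the overlapping implementation described in the paper.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
n, tau, T = 8, 3, 200
x = rng.normal(size=n)              # initial values; the consensus target is x.mean()
target = x.mean()
u = np.ones(n)                      # PushSum weights, u_i^0 = 1
in_flight = deque()                 # messages in transit: (arrival_iter, dest, dx, du)

for k in range(T):
    # Deliver every message whose (possibly delayed) arrival iteration has come.
    for _ in range(len(in_flight)):
        arrival, dest, dx, du = in_flight.popleft()
        if arrival <= k:
            x[dest] += dx
            u[dest] += du
        else:
            in_flight.append((arrival, dest, dx, du))

    # Each agent keeps half of its (x, u) mass and pushes the other half to a
    # rotating out-neighbor; the message is delayed by 0..tau extra iterations.
    for i in range(n):
        dest = (i + 2 ** (k % 3)) % n
        delay = rng.integers(0, tau + 1)
        in_flight.append((k + 1 + delay, dest, x[i] / 2, u[i] / 2))
        x[i] /= 2
        u[i] /= 2

print(np.abs(x / u - target).max())     # small residual that keeps shrinking as T grows
```

Because every sent message is eventually delivered, the total $x$ and $u$ mass is conserved, which is why the ratios still converge to the correct average despite the staleness.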