Need very different learning rate for manual updates vs. using model


I am currently just trying to write some pedagogical material, in which I borrow from some common examples that have been reworked numerous times on the web.

I have a simple bit of code where I manually create tensors for layers, and update them within a loop. E.g.:

w1 = torch.randn(D_in, H, dtype=torch.float, requires_grad=True)
w2 = torch.randn(H, D_out, dtype=torch.float, requires_grad=True)

learning_rate = 1e-6
for t in range(501):
    # Forward pass: linear -> ReLU (via clamp) -> linear
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    loss.backward()
    # Update the weights in-place outside of autograd, then reset gradients
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()

This works great. Then I construct similar code using actual modules:

model = torch.nn.Sequential(
          torch.nn.Linear(D_in, H),
          torch.nn.ReLU(),
          torch.nn.Linear(H, D_out),
        )
loss_fn = torch.nn.MSELoss(reduction='sum')
learning_rate = 1e-4
for t in range(501):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    model.zero_grad()
    loss.backward()
    # Manual SGD step; .data bypasses autograd so the update isn't tracked
    for param in model.parameters():
        param.data -= learning_rate * param.grad

This also works great.

BUT there is a difference here. If I use a 1e-4 learning rate in the manual case, the loss explodes: it becomes large, then inf, then nan. So that's no good. If I use a 1e-6 learning rate in the model case, the loss decreases far too slowly.

Basically I'm just trying to understand why the learning rate means something very different in these two snippets, which are otherwise equivalent.

1 answer


The crucial difference is the initialization of the weights. Your handcrafted model draws its weights from a standard normal via torch.randn, while nn.Linear initializes its weight matrix with a much smaller, fan-in-scaled range (roughly uniform in ±1/sqrt(in_features) by default). With larger initial weights the gradients are correspondingly larger, so the manual version needs a much smaller learning rate. I'm pretty sure that if you construct both models and copy the weight matrices from one to the other, you'll get consistent behavior.
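For instance (a minimal sketch, assuming the model, D_in, H, and D_out from your snippets), you could seed the hand-rolled tensors from the nn.Linear initialization and then run your manual loop unchanged:

# Copy the nn.Linear initial weights into the manual tensors.
# nn.Linear stores its weight as (out_features, in_features), so transpose
# to match the (in, out) layout used by x.mm(w1).
w1 = model[0].weight.detach().t().clone().requires_grad_(True)
w2 = model[2].weight.detach().t().clone().requires_grad_(True)

Starting both loops from the same point, they should tolerate roughly the same learning rate.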

Additionally, please note that the two models are not strictly equivalent: your handcrafted model has no bias terms, which also matters.
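If you want an apples-to-apples comparison in the other direction, one option (an untested sketch) is to drop the biases from the module version:

model = torch.nn.Sequential(
          torch.nn.Linear(D_in, H, bias=False),
          torch.nn.ReLU(),
          torch.nn.Linear(H, D_out, bias=False),
        )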