Need very different learning rate for manual updates vs. using model

Refresh

2 weeks ago

Views

4 time

0

I am currently just trying to write some pedagogical material, in which I borrow from some common examples that have been reworked numerous times on the web.

I have a simple bit of code where I manually create tensors for layers, and update them within a loop. E.g.:

w1 = torch.randn(D_in, H, dtype=torch.float, requires_grad=True)
w2 = torch.randn(H, D_out, dtype=torch.float, requires_grad=True)

learning_rate = 1e-6
for t in range(501):
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    loss.backward()
    w1 -= learning_rate * w1.grad
    w2 -= learning_rate * w2.grad
    w1.grad.zero_()
    w2.grad.zero_()

This works great. Then I construct similar code using actual modules:

model = torch.nn.Sequential(
          torch.nn.Linear(D_in, H),
          torch.nn.ReLU(),
          torch.nn.Linear(H, D_out),
        )
learning_rate = 1e-4
for t in range(501):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    model.zero_grad()
    loss.backward()
    for param in model.parameters():
        param.data -= learning_rate * param.grad

This also works great.

BUT there is a difference here. If I use a 1e-4 LR in the manual case, the loss explodes, become large, then inf, then nan. So that's no good. If I use a 1e-6 LR in the model case, the loss decreases far too slowly.

Basically I'm just trying to understand why learning rate means something very different in these two snippets which are otherwise equivalent.

0 answers