I'm writing some pedagogical material, borrowing from common examples that have been reworked many times on the web.
I have a simple bit of code where I manually create tensors for the layers and update them within a loop. E.g.:
    w1 = torch.randn(D_in, H, dtype=torch.float, requires_grad=True)
    w2 = torch.randn(H, D_out, dtype=torch.float, requires_grad=True)

    learning_rate = 1e-6
    for t in range(501):
        y_pred = x.mm(w1).clamp(min=0).mm(w2)
        loss = (y_pred - y).pow(2).sum()
        loss.backward()
        # Update weights outside of autograd, then clear the accumulated grads
        with torch.no_grad():
            w1 -= learning_rate * w1.grad
            w2 -= learning_rate * w2.grad
            w1.grad.zero_()
            w2.grad.zero_()
This works great. Then I construct similar code using actual modules:
    model = torch.nn.Sequential(
        torch.nn.Linear(D_in, H),
        torch.nn.ReLU(),
        torch.nn.Linear(H, D_out),
    )

    learning_rate = 1e-4
    for t in range(501):
        y_pred = model(x)
        loss = loss_fn(y_pred, y)

        model.zero_grad()
        loss.backward()
        for param in model.parameters():
            param.data -= learning_rate * param.grad
This also works great.
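One concrete difference between the two snippets is how the weights start out: torch.randn draws unit-variance values, while nn.Linear applies its own default initialization. Here's a quick way to compare the two scales side by side (just a sketch; the dimensions are example values I've assumed, since they aren't fixed above):

    import torch

    # Example dimensions (assumed for illustration; not from the snippets above)
    D_in, H = 1000, 100

    # Manual initialization: standard normal, so std ~= 1 per weight
    w1_manual = torch.randn(D_in, H)

    # Module initialization: whatever nn.Linear's default scheme produces
    linear = torch.nn.Linear(D_in, H)

    print("manual w1 std:       ", w1_manual.std().item())
    print("nn.Linear weight std:", linear.weight.std().item())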
BUT there is a difference here. If I use a 1e-4 learning rate in the manual case, the loss explodes: it grows large, then goes to inf, then nan. So that's no good. And if I use a 1e-6 learning rate in the model case, the loss decreases far too slowly.
Basically, I'm trying to understand why the learning rate means something very different in these two snippets, which are otherwise equivalent.
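For reference, here is a self-contained way to compare the two setups on the very first step by printing gradient magnitudes. Everything here is an assumption on my part: the data and dimensions are placeholders, and since loss_fn isn't defined above, I've substituted an explicit sum-of-squares so both versions optimize the same objective:

    import torch

    # Assumed setup: dimensions and data are placeholders, not from the post
    N, D_in, H, D_out = 64, 1000, 100, 10
    x = torch.randn(N, D_in)
    y = torch.randn(N, D_out)

    # Manual version: one forward/backward pass
    w1 = torch.randn(D_in, H, requires_grad=True)
    w2 = torch.randn(H, D_out, requires_grad=True)
    loss = (x.mm(w1).clamp(min=0).mm(w2) - y).pow(2).sum()
    loss.backward()
    print("manual loss:", loss.item(), " w1 grad norm:", w1.grad.norm().item())

    # Module version with the same explicit sum-of-squares loss
    model = torch.nn.Sequential(
        torch.nn.Linear(D_in, H),
        torch.nn.ReLU(),
        torch.nn.Linear(H, D_out),
    )
    loss = (model(x) - y).pow(2).sum()
    loss.backward()
    print("module loss:", loss.item(), " first-layer grad norm:",
          model[0].weight.grad.norm().item())

If the two gradient norms come out orders of magnitude apart, that would at least explain why the same learning rate can't work for both.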