# Need very different learning rate for manual updates vs. using model

*March 2019*

I am currently just trying to write some pedagogical material, in which I borrow from some common examples that have been reworked numerous times on the web.

I have a simple bit of code where I manually create tensors for layers, and update them within a loop. E.g.:

```python
# D_in, H, D_out and the data tensors x, y are assumed to be defined above
w1 = torch.randn(D_in, H, dtype=torch.float, requires_grad=True)
w2 = torch.randn(H, D_out, dtype=torch.float, requires_grad=True)

learning_rate = 1e-6
for t in range(501):
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    loss.backward()
    with torch.no_grad():  # manual SGD step
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()
```

This works great. Then I construct similar code using actual modules:

```python
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')
learning_rate = 1e-4
for t in range(501):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    model.zero_grad()
    loss.backward()
    with torch.no_grad():  # manual SGD step over the module's parameters
        for param in model.parameters():
            param -= learning_rate * param.grad
```

This also works great.

BUT there is a difference here. If I use a 1e-4 LR in the manual case, the loss explodes: it becomes large, then inf, then nan. So that's no good. If I use a 1e-6 LR in the model case, the loss decreases far too slowly.
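To make the blow-up concrete, here is a minimal self-contained repro of the manual case with the larger learning rate. The sizes `N, D_in, H, D_out` are assumed (the post does not state them); with `torch.randn`-initialized weights, a 1e-4 step diverges within a handful of iterations:

```python
import math
import torch

torch.manual_seed(0)

# Hypothetical sizes, in the spirit of the standard two-layer-net example
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

learning_rate = 1e-4  # too large for unit-variance randn weights
for t in range(20):
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    loss.backward()
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()

print(loss.item())  # grows without bound, ending in inf/nan
```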

Basically I'm just trying to understand why the learning rate means something very different in these two otherwise equivalent snippets.

The crucial difference is the initialization of the weights. The weight matrix in an `nn.Linear` is initialized sensibly, scaled down by the layer's fan-in, whereas `torch.randn` gives unit-variance entries. I'm fairly sure that if you construct both models and copy the weight matrices from one to the other, you'll get consistent behavior.
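A quick sketch of the scale difference: `nn.Linear` uses a Kaiming-uniform init whose bound works out to roughly `1/sqrt(fan_in)`, so its entries are far smaller than `torch.randn`'s. The sizes here are assumed for illustration:

```python
import torch

torch.manual_seed(0)
D_in, H = 1000, 100  # assumed sizes

# Manual init: unit-variance entries
w_manual = torch.randn(D_in, H)

# nn.Linear init: Kaiming-uniform, bound ~ 1/sqrt(fan_in)
linear = torch.nn.Linear(D_in, H)
w_module = linear.weight  # shape (H, D_in)

print(w_manual.std().item())  # ≈ 1.0
print(w_module.std().item())  # ≈ 0.018, dozens of times smaller

# Copying the module's initialization into the manual tensor
# should make the two setups respond alike to a given learning rate:
w1 = linear.weight.detach().t().clone().requires_grad_(True)
```

With weights this much smaller, activations and hence gradients are smaller too, which is why the module version tolerates (and needs) the larger 1e-4 step.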