This usually indicates a bug. #215

Open
macmmm81 opened this issue May 12, 2018 · 0 comments
I'm using Ubuntu 16.04 (amd64) with CUDA on a GTX 970. This happened during training:

th train.lua -data_dir data/xyz -rnn_size 512 -num_layers 3 -dropout 0.5 -batch_size 150
using CUDA on GPU 0...
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 190, val: 10, test: 0
vocab size: 78
creating an lstm with 3 layers
setting forget gate biases to 1 in LSTM layer 1
setting forget gate biases to 1 in LSTM layer 2
setting forget gate biases to 1 in LSTM layer 3
number of parameters in the model: 5454926
cloning rnn
cloning criterion
1/9500 (epoch 0.005), train_loss = 4.37936428, grad/param norm = 8.3750e-01, time/batch = 0.7845s
2/9500 (epoch 0.011), train_loss = 4.27539656, grad/param norm = 9.4067e-01, time/batch = 0.1877s
3/9500 (epoch 0.016), train_loss = 4.23837694, grad/param norm = 9.0943e-01, time/batch = 0.1698s
4/9500 (epoch 0.021), train_loss = 3.66106952, grad/param norm = 7.9275e-01, time/batch = 0.1681s
5/9500 (epoch 0.026), train_loss = 5.62951912, grad/param norm = 2.1572e+00, time/batch = 0.1689s
6/9500 (epoch 0.032), train_loss = 3.60939394, grad/param norm = 6.0602e-01, time/batch = 0.1679s
7/9500 (epoch 0.037), train_loss = 3.53268805, grad/param norm = 7.0176e-01, time/batch = 0.1682s
8/9500 (epoch 0.042), train_loss = 3.53769521, grad/param norm = 7.2233e-01, time/batch = 0.1686s
9/9500 (epoch 0.047), train_loss = 3.48180217, grad/param norm = 4.9382e-01, time/batch = 0.1681s
10/9500 (epoch 0.053), train_loss = 3.39289360, grad/param norm = 2.8195e-01, time/batch = 0.1682s
11/9500 (epoch 0.058), train_loss = 3.39644097, grad/param norm = 3.4993e-01, time/batch = 0.1674s
12/9500 (epoch 0.063), train_loss = 3.41690555, grad/param norm = 3.3962e-01, time/batch = 0.1668s
13/9500 (epoch 0.068), train_loss = 3.38753114, grad/param norm = 3.0437e-01, time/batch = 0.1666s
14/9500 (epoch 0.074), train_loss = 3.34901290, grad/param norm = 1.5515e-01, time/batch = 0.1672s
15/9500 (epoch 0.079), train_loss = 3.35579167, grad/param norm = 1.6236e-01, time/batch = 0.1662s
16/9500 (epoch 0.084), train_loss = 3.35414872, grad/param norm = 1.1574e-01, time/batch = 0.1667s
17/9500 (epoch 0.089), train_loss = 3.30944307, grad/param norm = 1.3277e-01, time/batch = 0.1664s
18/9500 (epoch 0.095), train_loss = 3.31451027, grad/param norm = 1.2377e-01, time/batch = 0.1674s
19/9500 (epoch 0.100), train_loss = 3.34253825, grad/param norm = 1.6845e-01, time/batch = 0.1665s
20/9500 (epoch 0.105), train_loss = 3.37940360, grad/param norm = 2.1698e-01, time/batch = 0.1676s
21/9500 (epoch 0.111), train_loss = 3.34893832, grad/param norm = 2.4687e-01, time/batch = 0.1665s
22/9500 (epoch 0.116), train_loss = 3.31333926, grad/param norm = 1.6044e-01, time/batch = 0.1668s
23/9500 (epoch 0.121), train_loss = 3.31715436, grad/param norm = 1.2095e-01, time/batch = 0.1665s
24/9500 (epoch 0.126), train_loss = 3.29283764, grad/param norm = 8.9565e-02, time/batch = 0.1667s
25/9500 (epoch 0.132), train_loss = 3.30552235, grad/param norm = 9.6753e-02, time/batch = 0.1664s
26/9500 (epoch 0.137), train_loss = 3.31862877, grad/param norm = 9.3730e-02, time/batch = 0.1666s
27/9500 (epoch 0.142), train_loss = 3.31212713, grad/param norm = 1.0088e-01, time/batch = 0.1670s
28/9500 (epoch 0.147), train_loss = 3.30230550, grad/param norm = 1.0366e-01, time/batch = 0.1665s
29/9500 (epoch 0.153), train_loss = 3.34291882, grad/param norm = 1.6979e-01, time/batch = 0.1667s
30/9500 (epoch 0.158), train_loss = 3.31770495, grad/param norm = 1.1363e-01, time/batch = 0.1672s
31/9500 (epoch 0.163), train_loss = 3.32960788, grad/param norm = 1.1513e-01, time/batch = 0.1655s
32/9500 (epoch 0.168), train_loss = 3.29852395, grad/param norm = 9.4206e-02, time/batch = 0.1668s
33/9500 (epoch 0.174), train_loss = 3.30228260, grad/param norm = 8.0267e-02, time/batch = 0.1672s
34/9500 (epoch 0.179), train_loss = 3.29601358, grad/param norm = 8.9820e-02, time/batch = 0.1663s
35/9500 (epoch 0.184), train_loss = 3.30065951, grad/param norm = 9.2121e-02, time/batch = 0.1671s
36/9500 (epoch 0.189), train_loss = 3.30256522, grad/param norm = 1.0573e-01, time/batch = 0.1721s
37/9500 (epoch 0.195), train_loss = 3.28089412, grad/param norm = 1.0760e-01, time/batch = 0.1702s
38/9500 (epoch 0.200), train_loss = 3.29966182, grad/param norm = 1.1570e-01, time/batch = 0.1712s
39/9500 (epoch 0.205), train_loss = 3.30055597, grad/param norm = 1.0634e-01, time/batch = 0.1701s
40/9500 (epoch 0.211), train_loss = 3.29188272, grad/param norm = 1.2511e-01, time/batch = 0.1702s
41/9500 (epoch 0.216), train_loss = 3.34403879, grad/param norm = 1.4596e-01, time/batch = 0.1683s
42/9500 (epoch 0.221), train_loss = 3.32186155, grad/param norm = 1.2661e-01, time/batch = 0.1716s
43/9500 (epoch 0.226), train_loss = 3.29106616, grad/param norm = 9.1877e-02, time/batch = 0.1667s
44/9500 (epoch 0.232), train_loss = 3.28822525, grad/param norm = 8.0090e-02, time/batch = 0.1685s
45/9500 (epoch 0.237), train_loss = 3.27224633, grad/param norm = 5.8864e-02, time/batch = 0.1687s
46/9500 (epoch 0.242), train_loss = 3.30062181, grad/param norm = 5.4615e-02, time/batch = 0.1668s
47/9500 (epoch 0.247), train_loss = 3.27088080, grad/param norm = 5.3451e-02, time/batch = 0.1679s
48/9500 (epoch 0.253), train_loss = 3.29304520, grad/param norm = 6.2469e-02, time/batch = 0.1697s
49/9500 (epoch 0.258), train_loss = 3.26731319, grad/param norm = 7.5385e-02, time/batch = 0.1705s
50/9500 (epoch 0.263), train_loss = 3.29232175, grad/param norm = 7.4970e-02, time/batch = 0.1700s
51/9500 (epoch 0.268), train_loss = 3.27075258, grad/param norm = 7.3532e-02, time/batch = 0.1687s
52/9500 (epoch 0.274), train_loss = 3.28982578, grad/param norm = 6.6603e-02, time/batch = 0.1668s
53/9500 (epoch 0.279), train_loss = 3.27747978, grad/param norm = 5.7288e-02, time/batch = 0.1744s
54/9500 (epoch 0.284), train_loss = 3.27050254, grad/param norm = 4.7779e-02, time/batch = 0.1712s
55/9500 (epoch 0.289), train_loss = 3.24442264, grad/param norm = 6.0415e-02, time/batch = 0.1694s
56/9500 (epoch 0.295), train_loss = 3.27334135, grad/param norm = 7.2149e-02, time/batch = 0.1675s
57/9500 (epoch 0.300), train_loss = 3.30901288, grad/param norm = 5.3891e-02, time/batch = 0.1669s
58/9500 (epoch 0.305), train_loss = 3.28041874, grad/param norm = 4.6167e-02, time/batch = 0.1669s
59/9500 (epoch 0.311), train_loss = 3.29131866, grad/param norm = 6.2633e-02, time/batch = 0.1673s
60/9500 (epoch 0.316), train_loss = 3.27999372, grad/param norm = 5.9514e-02, time/batch = 0.1674s
61/9500 (epoch 0.321), train_loss = 3.26930972, grad/param norm = 5.6056e-02, time/batch = 0.1660s
62/9500 (epoch 0.326), train_loss = 3.29838292, grad/param norm = 5.3868e-02, time/batch = 0.1911s
63/9500 (epoch 0.332), train_loss = 3.32926925, grad/param norm = 4.7501e-02, time/batch = 0.1680s
64/9500 (epoch 0.337), train_loss = 3.33108291, grad/param norm = 6.0609e-02, time/batch = 0.1671s
65/9500 (epoch 0.342), train_loss = 3.30681935, grad/param norm = 9.0101e-02, time/batch = 0.1675s
66/9500 (epoch 0.347), train_loss = 3.27353297, grad/param norm = 9.1639e-02, time/batch = 0.1671s
67/9500 (epoch 0.353), train_loss = 3.27974243, grad/param norm = 8.6865e-02, time/batch = 0.1669s
68/9500 (epoch 0.358), train_loss = 3.27348555, grad/param norm = 8.7908e-02, time/batch = 0.1670s
69/9500 (epoch 0.363), train_loss = 3.27757451, grad/param norm = 1.0084e-01, time/batch = 0.1671s
70/9500 (epoch 0.368), train_loss = 3.24274239, grad/param norm = 1.0522e-01, time/batch = 0.1669s
71/9500 (epoch 0.374), train_loss = 3.29016263, grad/param norm = 1.1343e-01, time/batch = 0.1657s
72/9500 (epoch 0.379), train_loss = 3.28759299, grad/param norm = 7.7749e-02, time/batch = 0.1671s
73/9500 (epoch 0.384), train_loss = 3.27209283, grad/param norm = 8.4547e-02, time/batch = 0.1673s
74/9500 (epoch 0.389), train_loss = 3.29661244, grad/param norm = 8.3727e-02, time/batch = 0.1675s
75/9500 (epoch 0.395), train_loss = 3.30226967, grad/param norm = 7.6405e-02, time/batch = 0.1681s
76/9500 (epoch 0.400), train_loss = 3.27048019, grad/param norm = 6.0030e-02, time/batch = 0.1671s
77/9500 (epoch 0.405), train_loss = 3.27307415, grad/param norm = 6.1161e-02, time/batch = 0.1670s
78/9500 (epoch 0.411), train_loss = 3.30788460, grad/param norm = 4.7150e-02, time/batch = 0.1678s
79/9500 (epoch 0.416), train_loss = 3.23867290, grad/param norm = 5.6643e-02, time/batch = 0.1670s
80/9500 (epoch 0.421), train_loss = 3.24106041, grad/param norm = 6.3663e-02, time/batch = 0.1672s
81/9500 (epoch 0.426), train_loss = 3.29023742, grad/param norm = 6.2061e-02, time/batch = 0.1661s
82/9500 (epoch 0.432), train_loss = 3.28714083, grad/param norm = 8.6508e-02, time/batch = 0.1670s
83/9500 (epoch 0.437), train_loss = 3.24114305, grad/param norm = 8.3357e-02, time/batch = 0.1673s
84/9500 (epoch 0.442), train_loss = 3.23749672, grad/param norm = 6.5806e-02, time/batch = 0.1680s
85/9500 (epoch 0.447), train_loss = 3.26320014, grad/param norm = 5.1585e-02, time/batch = 0.1671s
86/9500 (epoch 0.453), train_loss = 3.25714559, grad/param norm = 5.0439e-02, time/batch = 0.1672s
87/9500 (epoch 0.458), train_loss = 3.28365665, grad/param norm = 5.9689e-02, time/batch = 0.1678s
88/9500 (epoch 0.463), train_loss = 3.30418399, grad/param norm = 5.8412e-02, time/batch = 0.1672s
89/9500 (epoch 0.468), train_loss = 3.27882853, grad/param norm = 7.6653e-02, time/batch = 0.1673s
90/9500 (epoch 0.474), train_loss = 3.30339777, grad/param norm = 7.4193e-02, time/batch = 0.1672s
91/9500 (epoch 0.479), train_loss = 3.26227548, grad/param norm = 8.8452e-02, time/batch = 0.1657s
92/9500 (epoch 0.484), train_loss = 3.27803219, grad/param norm = 6.9058e-02, time/batch = 0.1673s
93/9500 (epoch 0.489), train_loss = 3.29130895, grad/param norm = 5.6408e-02, time/batch = 0.1672s
94/9500 (epoch 0.495), train_loss = 3.27140417, grad/param norm = 6.2443e-02, time/batch = 0.1662s
95/9500 (epoch 0.500), train_loss = 3.29444590, grad/param norm = 7.6994e-02, time/batch = 0.1671s
96/9500 (epoch 0.505), train_loss = 3.27935846, grad/param norm = 7.8517e-02, time/batch = 0.1674s
97/9500 (epoch 0.511), train_loss = 3.30168230, grad/param norm = 6.7420e-02, time/batch = 0.1675s
98/9500 (epoch 0.516), train_loss = 3.28696365, grad/param norm = 6.3729e-02, time/batch = 0.1672s
99/9500 (epoch 0.521), train_loss = 3.27213309, grad/param norm = 5.0801e-02, time/batch = 0.1671s
100/9500 (epoch 0.526), train_loss = 3.31575204, grad/param norm = 5.7709e-02, time/batch = 0.1679s
101/9500 (epoch 0.532), train_loss = 3.33272940, grad/param norm = 6.2874e-02, time/batch = 0.1655s
102/9500 (epoch 0.537), train_loss = 3.28259263, grad/param norm = 4.7357e-02, time/batch = 0.1673s
103/9500 (epoch 0.542), train_loss = 3.21872404, grad/param norm = 5.3481e-02, time/batch = 0.1675s
104/9500 (epoch 0.547), train_loss = 3.28609322, grad/param norm = 5.2550e-02, time/batch = 0.1674s
105/9500 (epoch 0.553), train_loss = 3.25417182, grad/param norm = 6.0065e-02, time/batch = 0.1670s
106/9500 (epoch 0.558), train_loss = 3.25648448, grad/param norm = 6.4893e-02, time/batch = 0.1677s
107/9500 (epoch 0.563), train_loss = 3.27050845, grad/param norm = 9.6045e-02, time/batch = 0.1673s
108/9500 (epoch 0.568), train_loss = 3.28357748, grad/param norm = 7.5897e-02, time/batch = 0.1672s
109/9500 (epoch 0.574), train_loss = 3.25293633, grad/param norm = 6.2269e-02, time/batch = 0.1678s
110/9500 (epoch 0.579), train_loss = 3.24648492, grad/param norm = 6.4532e-02, time/batch = 0.1671s
111/9500 (epoch 0.584), train_loss = 3.24045322, grad/param norm = 7.7416e-02, time/batch = 0.1657s
112/9500 (epoch 0.589), train_loss = 3.44701844, grad/param norm = 3.8647e-01, time/batch = 0.1673s
113/9500 (epoch 0.595), train_loss = 3.37163268, grad/param norm = 1.4331e-01, time/batch = 0.1672s
114/9500 (epoch 0.600), train_loss = 3.30353757, grad/param norm = 8.6149e-02, time/batch = 0.1675s
115/9500 (epoch 0.605), train_loss = 3.26892823, grad/param norm = 6.9791e-02, time/batch = 0.1670s
116/9500 (epoch 0.611), train_loss = 3.23298790, grad/param norm = 5.0945e-02, time/batch = 0.1680s
117/9500 (epoch 0.616), train_loss = 3.29405540, grad/param norm = 7.9216e-02, time/batch = 0.1675s
118/9500 (epoch 0.621), train_loss = 3.33130064, grad/param norm = 6.9648e-02, time/batch = 0.1675s
119/9500 (epoch 0.626), train_loss = 3.26412619, grad/param norm = 8.3202e-02, time/batch = 0.1678s
120/9500 (epoch 0.632), train_loss = 3.23351569, grad/param norm = 7.6535e-02, time/batch = 0.1672s
121/9500 (epoch 0.637), train_loss = 3.24036087, grad/param norm = 1.0496e-01, time/batch = 0.1660s
122/9500 (epoch 0.642), train_loss = 3.26624198, grad/param norm = 1.0500e-01, time/batch = 0.1680s
123/9500 (epoch 0.647), train_loss = 3.24303390, grad/param norm = 7.3783e-02, time/batch = 0.1674s
124/9500 (epoch 0.653), train_loss = 3.26324116, grad/param norm = 8.0746e-02, time/batch = 0.1673s
125/9500 (epoch 0.658), train_loss = 3.22223159, grad/param norm = 6.1081e-02, time/batch = 0.1678s
126/9500 (epoch 0.663), train_loss = 3.21170182, grad/param norm = 6.6848e-02, time/batch = 0.1672s
127/9500 (epoch 0.668), train_loss = 3.23330923, grad/param norm = 7.1474e-02, time/batch = 0.1674s
128/9500 (epoch 0.674), train_loss = 3.23585022, grad/param norm = 9.4587e-02, time/batch = 0.1671s
129/9500 (epoch 0.679), train_loss = 3.23146753, grad/param norm = 9.9437e-02, time/batch = 0.1677s
130/9500 (epoch 0.684), train_loss = 3.20319606, grad/param norm = 6.6245e-02, time/batch = 0.1671s
131/9500 (epoch 0.689), train_loss = 3.19942729, grad/param norm = 7.1896e-02, time/batch = 0.1660s
132/9500 (epoch 0.695), train_loss = 3.20248689, grad/param norm = 8.4354e-02, time/batch = 0.1675s
133/9500 (epoch 0.700), train_loss = 3.17376431, grad/param norm = 7.8533e-02, time/batch = 0.1672s
134/9500 (epoch 0.705), train_loss = 3.17502874, grad/param norm = 8.7593e-02, time/batch = 0.1672s
135/9500 (epoch 0.711), train_loss = 3.17025857, grad/param norm = 9.5012e-02, time/batch = 0.1679s
136/9500 (epoch 0.716), train_loss = 3.14070362, grad/param norm = 1.2444e-01, time/batch = 0.1671s
137/9500 (epoch 0.721), train_loss = 3.15086621, grad/param norm = 1.3916e-01, time/batch = 0.1675s
138/9500 (epoch 0.726), train_loss = 3.13574181, grad/param norm = 1.3700e-01, time/batch = 0.1680s
139/9500 (epoch 0.732), train_loss = 3.12685347, grad/param norm = 1.7680e-01, time/batch = 0.1677s
140/9500 (epoch 0.737), train_loss = 3.20465042, grad/param norm = 1.9024e-01, time/batch = 0.1674s
141/9500 (epoch 0.742), train_loss = 3.23466569, grad/param norm = 1.6682e-01, time/batch = 0.1664s
142/9500 (epoch 0.747), train_loss = 3.16408956, grad/param norm = 1.1285e-01, time/batch = 0.1675s
143/9500 (epoch 0.753), train_loss = 3.23678133, grad/param norm = 1.6472e-01, time/batch = 0.1675s
144/9500 (epoch 0.758), train_loss = 3.15593347, grad/param norm = 1.0136e-01, time/batch = 0.1679s
145/9500 (epoch 0.763), train_loss = 3.07189437, grad/param norm = 6.9025e-02, time/batch = 0.1674s
146/9500 (epoch 0.768), train_loss = 3.04447907, grad/param norm = 8.9196e-02, time/batch = 0.1676s
147/9500 (epoch 0.774), train_loss = 3.05330240, grad/param norm = 1.2684e-01, time/batch = 0.1680s
148/9500 (epoch 0.779), train_loss = 3.05118311, grad/param norm = 1.6778e-01, time/batch = 0.1675s
149/9500 (epoch 0.784), train_loss = 3.05733690, grad/param norm = 2.0024e-01, time/batch = 0.1672s
150/9500 (epoch 0.789), train_loss = 3.02148660, grad/param norm = 1.7343e-01, time/batch = 0.1671s
151/9500 (epoch 0.795), train_loss = 3.00491152, grad/param norm = 1.1277e-01, time/batch = 0.1660s
152/9500 (epoch 0.800), train_loss = 3.21407479, grad/param norm = 5.5699e-01, time/batch = 0.1674s
153/9500 (epoch 0.805), train_loss = 3.32414680, grad/param norm = 3.4037e-01, time/batch = 0.1676s
154/9500 (epoch 0.811), train_loss = 3.27463062, grad/param norm = 2.1281e-01, time/batch = 0.1675s
155/9500 (epoch 0.816), train_loss = 3.06618207, grad/param norm = 9.3634e-02, time/batch = 0.1672s
156/9500 (epoch 0.821), train_loss = 3.02144634, grad/param norm = 6.2442e-02, time/batch = 0.1673s
157/9500 (epoch 0.826), train_loss = 2.95256076, grad/param norm = 9.0091e-02, time/batch = 0.1683s
158/9500 (epoch 0.832), train_loss = 2.98143281, grad/param norm = 1.1424e-01, time/batch = 0.1672s
159/9500 (epoch 0.837), train_loss = 2.94519732, grad/param norm = 8.5923e-02, time/batch = 0.1678s
160/9500 (epoch 0.842), train_loss = 2.94131310, grad/param norm = 1.0222e-01, time/batch = 0.1677s
161/9500 (epoch 0.847), train_loss = 2.93474691, grad/param norm = 9.8161e-02, time/batch = 0.1662s
162/9500 (epoch 0.853), train_loss = 2.90784111, grad/param norm = 1.2394e-01, time/batch = 0.1675s
163/9500 (epoch 0.858), train_loss = 3.07049302, grad/param norm = 4.6730e-01, time/batch = 0.1682s
164/9500 (epoch 0.863), train_loss = 3.03834999, grad/param norm = 2.4975e-01, time/batch = 0.1677s
165/9500 (epoch 0.868), train_loss = 2.93912546, grad/param norm = 1.3055e-01, time/batch = 0.1672s
166/9500 (epoch 0.874), train_loss = 2.90044387, grad/param norm = 1.3705e-01, time/batch = 0.1681s
167/9500 (epoch 0.879), train_loss = 2.86155170, grad/param norm = 9.9814e-02, time/batch = 0.1675s
168/9500 (epoch 0.884), train_loss = 2.85820678, grad/param norm = 1.4036e-01, time/batch = 0.1678s
169/9500 (epoch 0.889), train_loss = 2.85024148, grad/param norm = 1.2539e-01, time/batch = 0.1673s
170/9500 (epoch 0.895), train_loss = 2.84438551, grad/param norm = 1.3375e-01, time/batch = 0.1679s
171/9500 (epoch 0.900), train_loss = 2.82365211, grad/param norm = 1.1715e-01, time/batch = 0.1658s
172/9500 (epoch 0.905), train_loss = 2.82259930, grad/param norm = 1.1469e-01, time/batch = 0.1677s
173/9500 (epoch 0.911), train_loss = 2.79364060, grad/param norm = 1.2573e-01, time/batch = 0.1679s
174/9500 (epoch 0.916), train_loss = 2.81964291, grad/param norm = 2.2931e-01, time/batch = 0.1676s
175/9500 (epoch 0.921), train_loss = 2.89575767, grad/param norm = 3.9404e-01, time/batch = 0.1674s
176/9500 (epoch 0.926), train_loss = 2.85975401, grad/param norm = 2.3007e-01, time/batch = 0.1679s
177/9500 (epoch 0.932), train_loss = 2.84219930, grad/param norm = 3.1699e-01, time/batch = 0.1675s
178/9500 (epoch 0.937), train_loss = 2.90646558, grad/param norm = 1.7903e-01, time/batch = 0.1676s
179/9500 (epoch 0.942), train_loss = 2.79684171, grad/param norm = 1.0125e-01, time/batch = 0.1683s
180/9500 (epoch 0.947), train_loss = 2.79794559, grad/param norm = 1.1338e-01, time/batch = 0.1673s
181/9500 (epoch 0.953), train_loss = 2.77543377, grad/param norm = 9.7725e-02, time/batch = 0.1661s
182/9500 (epoch 0.958), train_loss = 2.73258759, grad/param norm = 8.6316e-02, time/batch = 0.1684s
183/9500 (epoch 0.963), train_loss = 2.72224693, grad/param norm = 7.8525e-02, time/batch = 0.1676s
184/9500 (epoch 0.968), train_loss = 2.69406896, grad/param norm = 1.0477e-01, time/batch = 0.1677s
185/9500 (epoch 0.974), train_loss = 2.72115661, grad/param norm = 2.1522e-01, time/batch = 0.1676s
186/9500 (epoch 0.979), train_loss = 2.83466770, grad/param norm = 4.1144e-01, time/batch = 0.1677s
187/9500 (epoch 0.984), train_loss = 2.80450796, grad/param norm = 2.2144e-01, time/batch = 0.1674s
188/9500 (epoch 0.989), train_loss = 2.73207733, grad/param norm = 1.3759e-01, time/batch = 0.1677s
189/9500 (epoch 0.995), train_loss = 2.70650164, grad/param norm = 1.0955e-01, time/batch = 0.1675s
190/9500 (epoch 1.000), train_loss = 2.71835902, grad/param norm = 1.9360e-01, time/batch = 0.1674s
191/9500 (epoch 1.005), train_loss = 2.80218973, grad/param norm = 1.9382e-01, time/batch = 0.1660s
192/9500 (epoch 1.011), train_loss = 2.76163413, grad/param norm = 1.5825e-01, time/batch = 0.1676s
193/9500 (epoch 1.016), train_loss = 2.72953795, grad/param norm = 1.4570e-01, time/batch = 0.1676s
194/9500 (epoch 1.021), train_loss = 2.65865903, grad/param norm = 1.3027e-01, time/batch = 0.1673s
195/9500 (epoch 1.026), train_loss = 2.65947048, grad/param norm = 1.2876e-01, time/batch = 0.1681s
196/9500 (epoch 1.032), train_loss = 2.63516525, grad/param norm = 1.5726e-01, time/batch = 0.1679s
197/9500 (epoch 1.037), train_loss = 2.65844296, grad/param norm = 1.9885e-01, time/batch = 0.1677s
198/9500 (epoch 1.042), train_loss = 2.66414303, grad/param norm = 2.0874e-01, time/batch = 0.1679s
199/9500 (epoch 1.047), train_loss = 2.66406861, grad/param norm = 1.5419e-01, time/batch = 0.1673s
200/9500 (epoch 1.053), train_loss = 2.61099810, grad/param norm = 1.0986e-01, time/batch = 0.1676s
201/9500 (epoch 1.058), train_loss = 2.60735827, grad/param norm = 1.3129e-01, time/batch = 0.1670s
202/9500 (epoch 1.063), train_loss = 2.64200086, grad/param norm = 1.9102e-01, time/batch = 0.1673s
203/9500 (epoch 1.068), train_loss = 2.71690869, grad/param norm = 2.1875e-01, time/batch = 0.1674s
204/9500 (epoch 1.074), train_loss = 2.72020114, grad/param norm = 2.4086e-01, time/batch = 0.1679s
205/9500 (epoch 1.079), train_loss = 2.70175859, grad/param norm = 2.4647e-01, time/batch = 0.1674s
206/9500 (epoch 1.084), train_loss = 2.70258088, grad/param norm = 2.2883e-01, time/batch = 0.1673s
207/9500 (epoch 1.089), train_loss = 2.64367289, grad/param norm = 1.6997e-01, time/batch = 0.1673s
208/9500 (epoch 1.095), train_loss = 2.58395955, grad/param norm = 1.1052e-01, time/batch = 0.1676s
209/9500 (epoch 1.100), train_loss = 2.57970488, grad/param norm = 8.6612e-02, time/batch = 0.1674s
210/9500 (epoch 1.105), train_loss = 2.57031104, grad/param norm = 7.7425e-02, time/batch = 0.1673s
211/9500 (epoch 1.111), train_loss = 2.51584864, grad/param norm = 8.2444e-02, time/batch = 0.1660s
212/9500 (epoch 1.116), train_loss = 2.54540197, grad/param norm = 1.5164e-01, time/batch = 0.1679s
213/9500 (epoch 1.121), train_loss = 2.65656773, grad/param norm = 1.8741e-01, time/batch = 0.1673s
214/9500 (epoch 1.126), train_loss = 2.64668360, grad/param norm = 1.6577e-01, time/batch = 0.1673s
215/9500 (epoch 1.132), train_loss = 2.60635977, grad/param norm = 1.3135e-01, time/batch = 0.1672s
216/9500 (epoch 1.137), train_loss = 2.55764280, grad/param norm = 1.0156e-01, time/batch = 0.1671s
217/9500 (epoch 1.142), train_loss = 2.52958158, grad/param norm = 1.7944e-01, time/batch = 0.1677s
218/9500 (epoch 1.147), train_loss = 2.66934768, grad/param norm = 4.2714e-01, time/batch = 0.1675s
219/9500 (epoch 1.153), train_loss = 2.78780397, grad/param norm = 3.2570e-01, time/batch = 0.1675s
220/9500 (epoch 1.158), train_loss = 2.61872714, grad/param norm = 1.4087e-01, time/batch = 0.1680s
221/9500 (epoch 1.163), train_loss = 2.57246022, grad/param norm = 1.0808e-01, time/batch = 0.1659s
222/9500 (epoch 1.168), train_loss = 2.55501173, grad/param norm = 1.2637e-01, time/batch = 0.1679s
223/9500 (epoch 1.174), train_loss = 2.56719114, grad/param norm = 1.2544e-01, time/batch = 0.1679s
224/9500 (epoch 1.179), train_loss = 2.57267858, grad/param norm = 1.1674e-01, time/batch = 0.1672s
225/9500 (epoch 1.184), train_loss = 2.52594102, grad/param norm = 1.0010e-01, time/batch = 0.1676s
226/9500 (epoch 1.189), train_loss = 2.50786604, grad/param norm = 9.6824e-02, time/batch = 0.1676s
227/9500 (epoch 1.195), train_loss = 2.47613595, grad/param norm = 8.4033e-02, time/batch = 0.1683s
228/9500 (epoch 1.200), train_loss = 2.47743491, grad/param norm = 7.6428e-02, time/batch = 0.1680s
229/9500 (epoch 1.205), train_loss = 2.44575243, grad/param norm = 8.4128e-02, time/batch = 0.1691s
230/9500 (epoch 1.211), train_loss = 2.46043536, grad/param norm = 1.3657e-01, time/batch = 0.1698s
231/9500 (epoch 1.216), train_loss = 2.54583766, grad/param norm = 2.1191e-01, time/batch = 0.1670s
232/9500 (epoch 1.221), train_loss = 2.55123938, grad/param norm = 2.4174e-01, time/batch = 0.1693s
233/9500 (epoch 1.226), train_loss = 2.55736139, grad/param norm = 1.5332e-01, time/batch = 0.1695s
234/9500 (epoch 1.232), train_loss = 2.51756784, grad/param norm = 8.4411e-02, time/batch = 0.1693s
235/9500 (epoch 1.237), train_loss = 2.49024255, grad/param norm = 8.1373e-02, time/batch = 0.1688s
236/9500 (epoch 1.242), train_loss = 2.51164486, grad/param norm = 9.1590e-02, time/batch = 0.1699s
237/9500 (epoch 1.247), train_loss = 2.43962918, grad/param norm = 9.3577e-02, time/batch = 0.1692s
238/9500 (epoch 1.253), train_loss = 2.47471191, grad/param norm = 1.4283e-01, time/batch = 0.1693s
239/9500 (epoch 1.258), train_loss = 2.52366169, grad/param norm = 2.1839e-01, time/batch = 0.1695s
240/9500 (epoch 1.263), train_loss = 2.49762992, grad/param norm = 2.0596e-01, time/batch = 0.1692s
241/9500 (epoch 1.268), train_loss = 2.48226185, grad/param norm = 1.4365e-01, time/batch = 0.1684s
242/9500 (epoch 1.274), train_loss = 2.46790069, grad/param norm = 1.2924e-01, time/batch = 0.1698s
243/9500 (epoch 1.279), train_loss = 2.46328343, grad/param norm = 1.1304e-01, time/batch = 0.1687s
244/9500 (epoch 1.284), train_loss = 2.46148424, grad/param norm = 1.0504e-01, time/batch = 0.1690s
245/9500 (epoch 1.289), train_loss = 2.46717345, grad/param norm = 1.2091e-01, time/batch = 0.1692s
246/9500 (epoch 1.295), train_loss = 2.45412949, grad/param norm = 1.8271e-01, time/batch = 0.1687s
247/9500 (epoch 1.300), train_loss = 2.53487761, grad/param norm = 1.8479e-01, time/batch = 0.1693s
248/9500 (epoch 1.305), train_loss = 2.50844855, grad/param norm = 1.9040e-01, time/batch = 0.1689s
249/9500 (epoch 1.311), train_loss = 2.54239161, grad/param norm = 1.9795e-01, time/batch = 0.1693s
250/9500 (epoch 1.316), train_loss = 2.47290387, grad/param norm = 1.1617e-01, time/batch = 0.1695s
251/9500 (epoch 1.321), train_loss = 2.43684184, grad/param norm = 1.2325e-01, time/batch = 0.1673s
252/9500 (epoch 1.326), train_loss = 2.48042869, grad/param norm = 1.1871e-01, time/batch = 0.1688s
253/9500 (epoch 1.332), train_loss = 2.42779499, grad/param norm = 1.1851e-01, time/batch = 0.1692s
254/9500 (epoch 1.337), train_loss = 2.41981034, grad/param norm = 1.1964e-01, time/batch = 0.1689s
255/9500 (epoch 1.342), train_loss = 2.40988381, grad/param norm = 1.1759e-01, time/batch = 0.1697s
256/9500 (epoch 1.347), train_loss = 2.42414392, grad/param norm = 1.1517e-01, time/batch = 0.1691s
257/9500 (epoch 1.353), train_loss = 2.41504769, grad/param norm = 9.3772e-02, time/batch = 0.1691s
258/9500 (epoch 1.358), train_loss = 2.38073062, grad/param norm = 9.8476e-02, time/batch = 0.1696s
259/9500 (epoch 1.363), train_loss = 2.41499375, grad/param norm = 1.1082e-01, time/batch = 0.1692s
260/9500 (epoch 1.368), train_loss = 2.41186593, grad/param norm = 1.3049e-01, time/batch = 0.1690s
261/9500 (epoch 1.374), train_loss = 2.42577524, grad/param norm = 1.3320e-01, time/batch = 0.1681s
262/9500 (epoch 1.379), train_loss = 2.37677479, grad/param norm = 1.4066e-01, time/batch = 0.1690s
263/9500 (epoch 1.384), train_loss = 2.46353652, grad/param norm = 2.6515e-01, time/batch = 0.1691s
264/9500 (epoch 1.389), train_loss = 2.50882018, grad/param norm = 1.9886e-01, time/batch = 0.1698s
265/9500 (epoch 1.395), train_loss = 2.44381573, grad/param norm = 1.0059e-01, time/batch = 0.1685s
266/9500 (epoch 1.400), train_loss = 2.42095001, grad/param norm = 8.4894e-02, time/batch = 0.1688s
267/9500 (epoch 1.405), train_loss = 2.40441904, grad/param norm = 9.0272e-02, time/batch = 0.1697s
268/9500 (epoch 1.411), train_loss = 2.41696162, grad/param norm = 1.3976e-01, time/batch = 0.1690s
269/9500 (epoch 1.416), train_loss = 2.46135211, grad/param norm = 1.8336e-01, time/batch = 0.1693s
270/9500 (epoch 1.421), train_loss = 2.49511308, grad/param norm = 1.4453e-01, time/batch = 0.1688s
271/9500 (epoch 1.426), train_loss = 2.38353456, grad/param norm = 8.4670e-02, time/batch = 0.1677s
272/9500 (epoch 1.432), train_loss = 2.38605347, grad/param norm = 8.6230e-02, time/batch = 0.1693s
273/9500 (epoch 1.437), train_loss = 2.38462159, grad/param norm = 9.5548e-02, time/batch = 0.1688s
274/9500 (epoch 1.442), train_loss = 2.34584232, grad/param norm = 1.1090e-01, time/batch = 0.1689s
275/9500 (epoch 1.447), train_loss = 2.38295779, grad/param norm = 1.1715e-01, time/batch = 0.1691s
276/9500 (epoch 1.453), train_loss = 2.38875564, grad/param norm = 1.2981e-01, time/batch = 0.1691s
277/9500 (epoch 1.458), train_loss = 2.35181808, grad/param norm = 1.2308e-01, time/batch = 0.1699s
278/9500 (epoch 1.463), train_loss = 2.38032283, grad/param norm = 9.9003e-02, time/batch = 0.1687s
279/9500 (epoch 1.468), train_loss = 2.35522679, grad/param norm = 8.7687e-02, time/batch = 0.1691s
280/9500 (epoch 1.474), train_loss = 2.37795367, grad/param norm = 1.0974e-01, time/batch = 0.1700s
281/9500 (epoch 1.479), train_loss = 2.38125779, grad/param norm = 1.5710e-01, time/batch = 0.1675s
282/9500 (epoch 1.484), train_loss = 2.40863575, grad/param norm = 1.6572e-01, time/batch = 0.1688s
283/9500 (epoch 1.489), train_loss = 2.38205218, grad/param norm = 1.3572e-01, time/batch = 0.1687s
284/9500 (epoch 1.495), train_loss = 2.36070232, grad/param norm = 1.0946e-01, time/batch = 0.1694s
285/9500 (epoch 1.500), train_loss = 2.35871841, grad/param norm = 1.0983e-01, time/batch = 0.1688s
286/9500 (epoch 1.505), train_loss = 2.41650473, grad/param norm = 1.3359e-01, time/batch = 0.1690s
287/9500 (epoch 1.511), train_loss = 2.38341896, grad/param norm = 1.2712e-01, time/batch = 0.1703s
288/9500 (epoch 1.516), train_loss = 2.36407164, grad/param norm = 1.1777e-01, time/batch = 0.1689s
289/9500 (epoch 1.521), train_loss = 2.35431231, grad/param norm = 1.2503e-01, time/batch = 0.1694s
290/9500 (epoch 1.526), train_loss = 2.37336952, grad/param norm = 1.5647e-01, time/batch = 0.1696s
291/9500 (epoch 1.532), train_loss = 2.39668250, grad/param norm = 1.3588e-01, time/batch = 0.1675s
292/9500 (epoch 1.537), train_loss = 2.34678617, grad/param norm = 1.0435e-01, time/batch = 0.1692s
293/9500 (epoch 1.542), train_loss = 2.31226676, grad/param norm = 9.0857e-02, time/batch = 0.1694s
294/9500 (epoch 1.547), train_loss = 2.31894064, grad/param norm = 9.6763e-02, time/batch = 0.1686s
295/9500 (epoch 1.553), train_loss = 2.33343775, grad/param norm = 9.9626e-02, time/batch = 0.1693s
296/9500 (epoch 1.558), train_loss = 2.29599771, grad/param norm = 1.0849e-01, time/batch = 0.1691s
297/9500 (epoch 1.563), train_loss = 2.30961491, grad/param norm = 1.3189e-01, time/batch = 0.1690s
298/9500 (epoch 1.568), train_loss = 2.33088882, grad/param norm = 1.3767e-01, time/batch = 0.1689s
299/9500 (epoch 1.574), train_loss = 2.32800305, grad/param norm = 1.3919e-01, time/batch = 0.1691s
300/9500 (epoch 1.579), train_loss = 2.35925177, grad/param norm = 1.2985e-01, time/batch = 0.1691s
301/9500 (epoch 1.584), train_loss = 2.33194792, grad/param norm = 1.0772e-01, time/batch = 0.1674s
302/9500 (epoch 1.589), train_loss = 2.30412334, grad/param norm = 1.1887e-01, time/batch = 0.1696s
303/9500 (epoch 1.595), train_loss = 2.35141436, grad/param norm = 1.2923e-01, time/batch = 0.1690s
304/9500 (epoch 1.600), train_loss = 2.34429995, grad/param norm = 1.1307e-01, time/batch = 0.1691s
305/9500 (epoch 1.605), train_loss = 2.28654737, grad/param norm = 9.4583e-02, time/batch = 0.1692s
306/9500 (epoch 1.611), train_loss = 2.28441040, grad/param norm = 8.3021e-02, time/batch = 0.1688s
307/9500 (epoch 1.616), train_loss = 2.26679949, grad/param norm = 8.9095e-02, time/batch = 0.1687s
308/9500 (epoch 1.621), train_loss = 2.29492144, grad/param norm = 1.2933e-01, time/batch = 0.1687s
309/9500 (epoch 1.626), train_loss = 2.30607644, grad/param norm = 1.4136e-01, time/batch = 0.1697s
310/9500 (epoch 1.632), train_loss = 2.29290429, grad/param norm = 1.1710e-01, time/batch = 0.1691s
311/9500 (epoch 1.637), train_loss = 2.26664954, grad/param norm = 8.5261e-02, time/batch = 0.1674s
312/9500 (epoch 1.642), train_loss = 2.25645036, grad/param norm = 6.4060e-02, time/batch = 0.1691s
313/9500 (epoch 1.647), train_loss = 2.24618306, grad/param norm = 6.6768e-02, time/batch = 0.1691s
314/9500 (epoch 1.653), train_loss = 2.27145154, grad/param norm = 8.2980e-02, time/batch = 0.1692s
315/9500 (epoch 1.658), train_loss = 2.29957117, grad/param norm = 1.1043e-01, time/batch = 0.1697s
316/9500 (epoch 1.663), train_loss = 2.35174397, grad/param norm = 1.3420e-01, time/batch = 0.1689s
317/9500 (epoch 1.668), train_loss = 2.34038885, grad/param norm = 1.0814e-01, time/batch = 0.1693s
318/9500 (epoch 1.674), train_loss = 2.30957868, grad/param norm = 7.4915e-02, time/batch = 0.1697s
319/9500 (epoch 1.679), train_loss = 2.30160265, grad/param norm = 6.8551e-02, time/batch = 0.1687s
320/9500 (epoch 1.684), train_loss = 2.26268365, grad/param norm = 7.9227e-02, time/batch = 0.1692s
321/9500 (epoch 1.689), train_loss = 2.28016263, grad/param norm = 1.2030e-01, time/batch = 0.1678s
322/9500 (epoch 1.695), train_loss = 2.31945728, grad/param norm = 1.6381e-01, time/batch = 0.1692s
323/9500 (epoch 1.700), train_loss = 2.32628205, grad/param norm = 1.4136e-01, time/batch = 0.1691s
324/9500 (epoch 1.705), train_loss = 2.29548016, grad/param norm = 9.2443e-02, time/batch = 0.1700s
325/9500 (epoch 1.711), train_loss = 2.25172894, grad/param norm = 8.4460e-02, time/batch = 0.1691s
326/9500 (epoch 1.716), train_loss = 2.26378751, grad/param norm = 9.2974e-02, time/batch = 0.1693s
327/9500 (epoch 1.721), train_loss = 2.22517570, grad/param norm = 9.7820e-02, time/batch = 0.1694s
328/9500 (epoch 1.726), train_loss = 2.22391799, grad/param norm = 8.2037e-02, time/batch = 0.1687s
329/9500 (epoch 1.732), train_loss = 2.20956461, grad/param norm = 7.0078e-02, time/batch = 0.1691s
330/9500 (epoch 1.737), train_loss = 2.23581047, grad/param norm = 6.8547e-02, time/batch = 0.1694s
331/9500 (epoch 1.742), train_loss = 2.22350775, grad/param norm = 8.3162e-02, time/batch = 0.1676s
332/9500 (epoch 1.747), train_loss = 2.25299297, grad/param norm = 1.1441e-01, time/batch = 0.1698s
333/9500 (epoch 1.753), train_loss = 2.26323633, grad/param norm = 1.2287e-01, time/batch = 0.1701s
334/9500 (epoch 1.758), train_loss = 2.26779199, grad/param norm = 9.6329e-02, time/batch = 0.1691s
335/9500 (epoch 1.763), train_loss = 2.19086006, grad/param norm = 8.1474e-02, time/batch = 0.1689s
336/9500 (epoch 1.768), train_loss = 2.21199599, grad/param norm = 9.8913e-02, time/batch = 0.1691s
337/9500 (epoch 1.774), train_loss = 2.23367218, grad/param norm = 1.1997e-01, time/batch = 0.1683s
338/9500 (epoch 1.779), train_loss = 2.31054295, grad/param norm = 1.6884e-01, time/batch = 0.1692s
339/9500 (epoch 1.784), train_loss = 2.34693111, grad/param norm = 1.7690e-01, time/batch = 0.1693s
340/9500 (epoch 1.789), train_loss = 2.27534557, grad/param norm = 1.1911e-01, time/batch = 0.1701s
341/9500 (epoch 1.795), train_loss = 2.25239140, grad/param norm = 1.3636e-01, time/batch = 0.1673s
342/9500 (epoch 1.800), train_loss = 2.24585765, grad/param norm = 1.3846e-01, time/batch = 0.1692s
343/9500 (epoch 1.805), train_loss = 2.17387921, grad/param norm = 9.6608e-02, time/batch = 0.1698s
344/9500 (epoch 1.811), train_loss = 2.20758753, grad/param norm = 8.4712e-02, time/batch = 0.1690s
345/9500 (epoch 1.816), train_loss = 2.16322903, grad/param norm = 8.7759e-02, time/batch = 0.1690s
346/9500 (epoch 1.821), train_loss = 2.19312935, grad/param norm = 8.1304e-02, time/batch = 0.1699s
347/9500 (epoch 1.826), train_loss = 2.19382331, grad/param norm = 9.0339e-02, time/batch = 0.1690s
348/9500 (epoch 1.832), train_loss = 2.19797687, grad/param norm = 9.7263e-02, time/batch = 0.1688s
349/9500 (epoch 1.837), train_loss = 2.19412732, grad/param norm = 7.8655e-02, time/batch = 0.1697s
350/9500 (epoch 1.842), train_loss = 2.18391168, grad/param norm = 6.5144e-02, time/batch = 0.1688s
351/9500 (epoch 1.847), train_loss = 2.17512759, grad/param norm = 6.2824e-02, time/batch = 0.1678s
352/9500 (epoch 1.853), train_loss = 2.16151414, grad/param norm = 6.5225e-02, time/batch = 0.1693s
353/9500 (epoch 1.858), train_loss = 2.18454594, grad/param norm = 7.4765e-02, time/batch = 0.1693s
354/9500 (epoch 1.863), train_loss = 2.16541689, grad/param norm = 8.1228e-02, time/batch = 0.1690s
355/9500 (epoch 1.868), train_loss = 2.16169660, grad/param norm = 1.0157e-01, time/batch = 0.1697s
356/9500 (epoch 1.874), train_loss = 2.22497355, grad/param norm = 1.3158e-01, time/batch = 0.1687s
357/9500 (epoch 1.879), train_loss = 2.20618470, grad/param norm = 1.4374e-01, time/batch = 0.1692s
358/9500 (epoch 1.884), train_loss = 2.21218154, grad/param norm = 1.1912e-01, time/batch = 0.1689s
359/9500 (epoch 1.889), train_loss = 2.17410570, grad/param norm = 7.4946e-02, time/batch = 0.1693s
360/9500 (epoch 1.895), train_loss = 2.13761853, grad/param norm = 6.6347e-02, time/batch = 0.1691s
361/9500 (epoch 1.900), train_loss = 2.12603896, grad/param norm = 6.3467e-02, time/batch = 0.1672s
362/9500 (epoch 1.905), train_loss = 2.15620481, grad/param norm = 6.5260e-02, time/batch = 0.1692s
363/9500 (epoch 1.911), train_loss = 2.11674849, grad/param norm = 6.8263e-02, time/batch = 0.1695s
364/9500 (epoch 1.916), train_loss = 2.13061324, grad/param norm = 8.1959e-02, time/batch = 0.1693s
365/9500 (epoch 1.921), train_loss = 2.12896523, grad/param norm = 8.2872e-02, time/batch = 0.1697s
366/9500 (epoch 1.926), train_loss = 2.14974713, grad/param norm = 8.2680e-02, time/batch = 0.1692s
367/9500 (epoch 1.932), train_loss = 2.14153547, grad/param norm = 7.8719e-02, time/batch = 0.1696s
368/9500 (epoch 1.937), train_loss = 2.15340056, grad/param norm = 7.9807e-02, time/batch = 0.1698s
369/9500 (epoch 1.942), train_loss = 2.15053487, grad/param norm = 7.6839e-02, time/batch = 0.1688s
370/9500 (epoch 1.947), train_loss = 2.15090379, grad/param norm = 8.6387e-02, time/batch = 0.1694s
371/9500 (epoch 1.953), train_loss = 2.17413546, grad/param norm = 9.2852e-02, time/batch = 0.1683s
372/9500 (epoch 1.958), train_loss = 2.13467910, grad/param norm = 9.3191e-02, time/batch = 0.1687s
373/9500 (epoch 1.963), train_loss = 2.15774345, grad/param norm = 1.0283e-01, time/batch = 0.1695s
374/9500 (epoch 1.968), train_loss = 2.15074333, grad/param norm = 1.1219e-01, time/batch = 0.1698s
375/9500 (epoch 1.974), train_loss = 2.13477335, grad/param norm = 9.7862e-02, time/batch = 0.1689s
376/9500 (epoch 1.979), train_loss = 2.12953307, grad/param norm = 6.9019e-02, time/batch = 0.1688s
377/9500 (epoch 1.984), train_loss = 2.08940562, grad/param norm = 6.9708e-02, time/batch = 0.1696s
378/9500 (epoch 1.989), train_loss = 2.14820351, grad/param norm = 7.9948e-02, time/batch = 0.1696s
379/9500 (epoch 1.995), train_loss = 2.13690062, grad/param norm = 7.4366e-02, time/batch = 0.1691s
380/9500 (epoch 2.000), train_loss = 2.12444094, grad/param norm = 6.7229e-02, time/batch = 0.1697s
381/9500 (epoch 2.005), train_loss = 2.19733582, grad/param norm = 6.6125e-02, time/batch = 0.1675s
382/9500 (epoch 2.011), train_loss = 2.11979829, grad/param norm = 7.9547e-02, time/batch = 0.1689s
383/9500 (epoch 2.016), train_loss = 2.14908387, grad/param norm = 1.0015e-01, time/batch = 0.1704s
384/9500 (epoch 2.021), train_loss = 2.16144784, grad/param norm = 1.1224e-01, time/batch = 0.1695s
385/9500 (epoch 2.026), train_loss = 2.20153716, grad/param norm = 1.1283e-01, time/batch = 0.1693s
386/9500 (epoch 2.032), train_loss = 2.16543418, grad/param norm = 1.0081e-01, time/batch = 0.1694s
387/9500 (epoch 2.037), train_loss = 2.14024927, grad/param norm = 8.7125e-02, time/batch = 0.1693s
388/9500 (epoch 2.042), train_loss = 2.09341615, grad/param norm = 7.3131e-02, time/batch = 0.1690s
389/9500 (epoch 2.047), train_loss = 2.11119495, grad/param norm = 7.1604e-02, time/batch = 0.1693s
390/9500 (epoch 2.053), train_loss = 2.10058323, grad/param norm = 6.9514e-02, time/batch = 0.1690s
391/9500 (epoch 2.058), train_loss = 2.10787559, grad/param norm = 5.5379e-02, time/batch = 0.1672s
392/9500 (epoch 2.063), train_loss = 2.03389045, grad/param norm = 5.0330e-02, time/batch = 0.1696s
393/9500 (epoch 2.068), train_loss = 2.05779152, grad/param norm = 5.5753e-02, time/batch = 0.1689s
394/9500 (epoch 2.074), train_loss = 2.09085763, grad/param norm = 5.7103e-02, time/batch = 0.1692s
395/9500 (epoch 2.079), train_loss = 2.07912022, grad/param norm = 6.3371e-02, time/batch = 0.1688s
396/9500 (epoch 2.084), train_loss = 2.12627938, grad/param norm = 6.0436e-02, time/batch = 0.1694s
397/9500 (epoch 2.089), train_loss = 2.10913948, grad/param norm = 7.4728e-02, time/batch = 0.1698s
398/9500 (epoch 2.095), train_loss = 2.09446039, grad/param norm = 1.2840e-01, time/batch = 0.1693s
399/9500 (epoch 2.100), train_loss = 2.15564540, grad/param norm = 1.4231e-01, time/batch = 0.1689s
400/9500 (epoch 2.105), train_loss = 2.13715231, grad/param norm = 1.0472e-01, time/batch = 0.1698s
401/9500 (epoch 2.111), train_loss = 2.04814420, grad/param norm = 7.7260e-02, time/batch = 0.1678s
402/9500 (epoch 2.116), train_loss = 2.06672402, grad/param norm = 6.3045e-02, time/batch = 0.1695s
403/9500 (epoch 2.121), train_loss = 2.05114076, grad/param norm = 5.7376e-02, time/batch = 0.1701s
404/9500 (epoch 2.126), train_loss = 2.08165313, grad/param norm = 5.4455e-02, time/batch = 0.1691s
405/9500 (epoch 2.132), train_loss = 2.07677452, grad/param norm = 5.7086e-02, time/batch = 0.1692s
406/9500 (epoch 2.137), train_loss = 2.10919469, grad/param norm = 6.0713e-02, time/batch = 0.1696s
407/9500 (epoch 2.142), train_loss = 2.06248607, grad/param norm = 6.4338e-02, time/batch = 0.1692s
408/9500 (epoch 2.147), train_loss = 2.06789037, grad/param norm = 7.2860e-02, time/batch = 0.1691s
409/9500 (epoch 2.153), train_loss = 2.07315750, grad/param norm = 7.3565e-02, time/batch = 0.1702s
410/9500 (epoch 2.158), train_loss = 2.06759863, grad/param norm = 6.7623e-02, time/batch = 0.1685s
411/9500 (epoch 2.163), train_loss = 2.07092345, grad/param norm = 8.2426e-02, time/batch = 0.1680s
412/9500 (epoch 2.168), train_loss = 2.05095377, grad/param norm = 7.8403e-02, time/batch = 0.1699s
413/9500 (epoch 2.174), train_loss = 2.02770036, grad/param norm = 8.7164e-02, time/batch = 0.1692s
414/9500 (epoch 2.179), train_loss = 2.11526294, grad/param norm = 1.0210e-01, time/batch = 0.1692s
415/9500 (epoch 2.184), train_loss = 2.06162541, grad/param norm = 1.0046e-01, time/batch = 0.1696s
416/9500 (epoch 2.189), train_loss = 2.07784310, grad/param norm = 8.9228e-02, time/batch = 0.1689s
417/9500 (epoch 2.195), train_loss = 2.03505677, grad/param norm = 6.7889e-02, time/batch = 0.1692s
418/9500 (epoch 2.200), train_loss = 2.07345061, grad/param norm = 7.4336e-02, time/batch = 0.1698s
419/9500 (epoch 2.205), train_loss = 2.03232973, grad/param norm = 8.4112e-02, time/batch = 0.1679s
420/9500 (epoch 2.211), train_loss = 2.05735177, grad/param norm = 8.2435e-02, time/batch = 0.1694s
421/9500 (epoch 2.216), train_loss = 2.03737761, grad/param norm = 7.5686e-02, time/batch = 0.1674s
422/9500 (epoch 2.221), train_loss = 2.03024358, grad/param norm = 6.1652e-02, time/batch = 0.1685s
423/9500 (epoch 2.226), train_loss = 2.03189111, grad/param norm = 5.7557e-02, time/batch = 0.1695s
424/9500 (epoch 2.232), train_loss = 2.06050623, grad/param norm = 6.2893e-02, time/batch = 0.1693s
425/9500 (epoch 2.237), train_loss = 2.02888651, grad/param norm = 7.4358e-02, time/batch = 0.1702s
426/9500 (epoch 2.242), train_loss = 2.06969070, grad/param norm = 8.0966e-02, time/batch = 0.1690s
427/9500 (epoch 2.247), train_loss = 2.06824634, grad/param norm = 9.4836e-02, time/batch = 0.1699s
428/9500 (epoch 2.253), train_loss = 2.06225039, grad/param norm = 9.7613e-02, time/batch = 0.1701s
429/9500 (epoch 2.258), train_loss = 2.07608420, grad/param norm = 7.9984e-02, time/batch = 0.1687s
430/9500 (epoch 2.263), train_loss = 2.03506608, grad/param norm = 7.6970e-02, time/batch = 0.1693s
431/9500 (epoch 2.268), train_loss = 2.00406190, grad/param norm = 6.7939e-02, time/batch = 0.1682s
432/9500 (epoch 2.274), train_loss = 2.00990194, grad/param norm = 5.4428e-02, time/batch = 0.1694s
433/9500 (epoch 2.279), train_loss = 1.99752350, grad/param norm = 4.7198e-02, time/batch = 0.1694s
434/9500 (epoch 2.284), train_loss = 2.04182310, grad/param norm = 4.6068e-02, time/batch = 0.1696s
435/9500 (epoch 2.289), train_loss = 1.99085840, grad/param norm = 5.8305e-02, time/batch = 0.1690s
436/9500 (epoch 2.295), train_loss = 2.01940316, grad/param norm = 7.5548e-02, time/batch = 0.1694s
437/9500 (epoch 2.300), train_loss = 2.03830709, grad/param norm = 7.2778e-02, time/batch = 0.1698s
438/9500 (epoch 2.305), train_loss = 2.04007906, grad/param norm = 6.0774e-02, time/batch = 0.1691s
439/9500 (epoch 2.311), train_loss = 1.99201864, grad/param norm = 6.5663e-02, time/batch = 0.1689s
440/9500 (epoch 2.316), train_loss = 2.03984353, grad/param norm = 7.8163e-02, time/batch = 0.1696s
441/9500 (epoch 2.321), train_loss = 2.03469516, grad/param norm = 8.6206e-02, time/batch = 0.1678s
442/9500 (epoch 2.326), train_loss = 2.05842210, grad/param norm = 8.1727e-02, time/batch = 0.1689s
443/9500 (epoch 2.332), train_loss = 2.03501492, grad/param norm = 7.5565e-02, time/batch = 0.1700s
444/9500 (epoch 2.337), train_loss = 1.99769835, grad/param norm = 6.3255e-02, time/batch = 0.1691s
445/9500 (epoch 2.342), train_loss = 1.96149824, grad/param norm = 5.1880e-02, time/batch = 0.1691s
446/9500 (epoch 2.347), train_loss = 1.95617614, grad/param norm = 4.6940e-02, time/batch = 0.1693s
447/9500 (epoch 2.353), train_loss = 1.98595711, grad/param norm = 5.0623e-02, time/batch = 0.1694s
448/9500 (epoch 2.358), train_loss = 1.96133847, grad/param norm = 5.9436e-02, time/batch = 0.1695s
449/9500 (epoch 2.363), train_loss = 1.97892025, grad/param norm = 7.3154e-02, time/batch = 0.1686s
450/9500 (epoch 2.368), train_loss = 1.99704723, grad/param norm = 8.0311e-02, time/batch = 0.1698s
451/9500 (epoch 2.374), train_loss = 1.99042685, grad/param norm = 7.1340e-02, time/batch = 0.1676s
452/9500 (epoch 2.379), train_loss = 1.97674540, grad/param norm = 5.9390e-02, time/batch = 0.1695s
453/9500 (epoch 2.384), train_loss = 1.99500454, grad/param norm = 6.1047e-02, time/batch = 0.1693s
454/9500 (epoch 2.389), train_loss = 1.97899568, grad/param norm = 6.4092e-02, time/batch = 0.1693s
455/9500 (epoch 2.395), train_loss = 2.00439911, grad/param norm = 6.6185e-02, time/batch = 0.1701s
456/9500 (epoch 2.400), train_loss = 2.00213442, grad/param norm = 6.4996e-02, time/batch = 0.1690s
457/9500 (epoch 2.405), train_loss = 2.00432563, grad/param norm = 5.5966e-02, time/batch = 0.1698s
458/9500 (epoch 2.411), train_loss = 1.99580312, grad/param norm = 5.5648e-02, time/batch = 0.1690s
459/9500 (epoch 2.416), train_loss = 1.98826416, grad/param norm = 5.5029e-02, time/batch = 0.1697s
460/9500 (epoch 2.421), train_loss = 1.99678118, grad/param norm = 5.8621e-02, time/batch = 0.1701s
461/9500 (epoch 2.426), train_loss = 1.97675613, grad/param norm = 6.0451e-02, time/batch = 0.1679s
462/9500 (epoch 2.432), train_loss = 1.99692520, grad/param norm = 7.1479e-02, time/batch = 0.1695s
463/9500 (epoch 2.437), train_loss = 2.00612945, grad/param norm = 6.7842e-02, time/batch = 0.1698s
464/9500 (epoch 2.442), train_loss = 1.97794791, grad/param norm = 5.8713e-02, time/batch = 0.1687s
465/9500 (epoch 2.447), train_loss = 1.96926267, grad/param norm = 5.9240e-02, time/batch = 0.1695s
466/9500 (epoch 2.453), train_loss = 1.98670712, grad/param norm = 6.5272e-02, time/batch = 0.1697s
467/9500 (epoch 2.458), train_loss = 1.96926095, grad/param norm = 7.8357e-02, time/batch = 0.1694s
468/9500 (epoch 2.463), train_loss = 2.01704074, grad/param norm = 9.2227e-02, time/batch = 0.1695s
469/9500 (epoch 2.468), train_loss = 1.98767720, grad/param norm = 9.1166e-02, time/batch = 0.1698s
470/9500 (epoch 2.474), train_loss = 1.99060822, grad/param norm = 7.4611e-02, time/batch = 0.1695s
471/9500 (epoch 2.479), train_loss = 1.96081599, grad/param norm = 6.8998e-02, time/batch = 0.1675s
472/9500 (epoch 2.484), train_loss = 1.96887892, grad/param norm = 5.5681e-02, time/batch = 0.1699s
473/9500 (epoch 2.489), train_loss = 1.97682938, grad/param norm = 4.6408e-02, time/batch = 0.1693s
474/9500 (epoch 2.495), train_loss = 1.95999201, grad/param norm = 4.6819e-02, time/batch = 0.1693s
475/9500 (epoch 2.500), train_loss = 1.94560495, grad/param norm = 5.1971e-02, time/batch = 0.1704s
476/9500 (epoch 2.505), train_loss = 1.91802483, grad/param norm = 5.7015e-02, time/batch = 0.1697s
477/9500 (epoch 2.511), train_loss = 1.97389381, grad/param norm = 6.5461e-02, time/batch = 0.1692s
478/9500 (epoch 2.516), train_loss = 1.97972420, grad/param norm = 6.6176e-02, time/batch = 0.1692s
479/9500 (epoch 2.521), train_loss = 1.98788434, grad/param norm = 7.0251e-02, time/batch = 0.1702s
480/9500 (epoch 2.526), train_loss = 2.01638247, grad/param norm = 8.6020e-02, time/batch = 0.1691s
481/9500 (epoch 2.532), train_loss = 2.04740840, grad/param norm = 7.4106e-02, time/batch = 0.1676s
482/9500 (epoch 2.537), train_loss = 1.97773888, grad/param norm = 5.6706e-02, time/batch = 0.1684s
483/9500 (epoch 2.542), train_loss = 1.94007581, grad/param norm = 6.0240e-02, time/batch = 0.1697s
484/9500 (epoch 2.547), train_loss = 1.95144682, grad/param norm = 5.7884e-02, time/batch = 0.1696s
485/9500 (epoch 2.553), train_loss = 1.93781355, grad/param norm = 5.0974e-02, time/batch = 0.1701s
486/9500 (epoch 2.558), train_loss = 1.89508384, grad/param norm = 4.6867e-02, time/batch = 0.1693s
487/9500 (epoch 2.563), train_loss = 1.91421286, grad/param norm = 5.1458e-02, time/batch = 0.1690s
488/9500 (epoch 2.568), train_loss = 1.94030935, grad/param norm = 6.5399e-02, time/batch = 0.1699s
489/9500 (epoch 2.574), train_loss = 1.95183559, grad/param norm = 7.4217e-02, time/batch = 0.1688s
490/9500 (epoch 2.579), train_loss = 1.92077214, grad/param norm = 6.7566e-02, time/batch = 0.1692s
491/9500 (epoch 2.584), train_loss = 1.93496325, grad/param norm = 5.8619e-02, time/batch = 0.1685s
492/9500 (epoch 2.589), train_loss = 1.89806318, grad/param norm = 5.0039e-02, time/batch = 0.1693s
493/9500 (epoch 2.595), train_loss = 1.92080993, grad/param norm = 5.6614e-02, time/batch = 0.1691s
494/9500 (epoch 2.600), train_loss = 1.94387876, grad/param norm = 5.8456e-02, time/batch = 0.1699s
495/9500 (epoch 2.605), train_loss = 1.93259562, grad/param norm = 6.3409e-02, time/batch = 0.1695s
496/9500 (epoch 2.611), train_loss = 1.95020174, grad/param norm = 6.7488e-02, time/batch = 0.1700s
497/9500 (epoch 2.616), train_loss = 1.94207162, grad/param norm = 6.4943e-02, time/batch = 0.1703s
498/9500 (epoch 2.621), train_loss = 1.93701231, grad/param norm = 5.8499e-02, time/batch = 0.1691s
499/9500 (epoch 2.626), train_loss = 1.90927963, grad/param norm = 5.0635e-02, time/batch = 0.1695s
500/9500 (epoch 2.632), train_loss = 1.90199680, grad/param norm = 4.7572e-02, time/batch = 0.1701s
501/9500 (epoch 2.637), train_loss = 1.87797126, grad/param norm = 4.5660e-02, time/batch = 0.1676s
502/9500 (epoch 2.642), train_loss = 1.92205492, grad/param norm = 5.0526e-02, time/batch = 0.1695s
503/9500 (epoch 2.647), train_loss = 1.93068647, grad/param norm = 5.3872e-02, time/batch = 0.1703s
504/9500 (epoch 2.653), train_loss = 1.92930004, grad/param norm = 5.7283e-02, time/batch = 0.1695s
505/9500 (epoch 2.658), train_loss = 1.92525824, grad/param norm = 6.1745e-02, time/batch = 0.1693s
506/9500 (epoch 2.663), train_loss = 1.93505249, grad/param norm = 6.2759e-02, time/batch = 0.1690s
507/9500 (epoch 2.668), train_loss = 1.92402983, grad/param norm = 5.9564e-02, time/batch = 0.1695s
508/9500 (epoch 2.674), train_loss = 1.96721907, grad/param norm = 5.9207e-02, time/batch = 0.1687s
509/9500 (epoch 2.679), train_loss = 1.95285371, grad/param norm = 5.9355e-02, time/batch = 0.1696s
510/9500 (epoch 2.684), train_loss = 1.91990795, grad/param norm = 5.8127e-02, time/batch = 0.1700s
511/9500 (epoch 2.689), train_loss = 1.92139935, grad/param norm = 5.5599e-02, time/batch = 0.1674s
512/9500 (epoch 2.695), train_loss = 1.90295196, grad/param norm = 4.7926e-02, time/batch = 0.1694s
513/9500 (epoch 2.700), train_loss = 1.90097223, grad/param norm = 4.4270e-02, time/batch = 0.1700s
514/9500 (epoch 2.705), train_loss = 1.91268325, grad/param norm = 4.3302e-02, time/batch = 0.1695s
515/9500 (epoch 2.711), train_loss = 1.93182104, grad/param norm = 4.6179e-02, time/batch = 0.1693s
516/9500 (epoch 2.716), train_loss = 1.91054800, grad/param norm = 5.1577e-02, time/batch = 0.1701s
517/9500 (epoch 2.721), train_loss = 1.87174448, grad/param norm = 6.8137e-02, time/batch = 0.1691s
518/9500 (epoch 2.726), train_loss = 1.90458450, grad/param norm = 7.3484e-02, time/batch = 0.1694s
519/9500 (epoch 2.732), train_loss = 1.88732736, grad/param norm = 7.2377e-02, time/batch = 0.1703s
520/9500 (epoch 2.737), train_loss = 1.94510425, grad/param norm = 6.7922e-02, time/batch = 0.1695s
521/9500 (epoch 2.742), train_loss = 1.88885537, grad/param norm = 6.1867e-02, time/batch = 0.1678s
522/9500 (epoch 2.747), train_loss = 1.90226313, grad/param norm = 6.0708e-02, time/batch = 0.1697s
523/9500 (epoch 2.753), train_loss = 1.89149776, grad/param norm = 4.9649e-02, time/batch = 0.1695s
524/9500 (epoch 2.758), train_loss = 1.91792841, grad/param norm = 4.6505e-02, time/batch = 0.1692s
525/9500 (epoch 2.763), train_loss = 1.85387295, grad/param norm = 4.6291e-02, time/batch = 0.1706s
526/9500 (epoch 2.768), train_loss = 1.87974305, grad/param norm = 4.6410e-02, time/batch = 0.1692s
527/9500 (epoch 2.774), train_loss = 1.87084890, grad/param norm = 5.7665e-02, time/batch = 0.1694s
528/9500 (epoch 2.779), train_loss = 1.90143973, grad/param norm = 5.4203e-02, time/batch = 0.1700s
529/9500 (epoch 2.784), train_loss = 1.86293674, grad/param norm = 5.1877e-02, time/batch = 0.1696s
530/9500 (epoch 2.789), train_loss = 1.87207224, grad/param norm = 5.2311e-02, time/batch = 0.1695s
531/9500 (epoch 2.795), train_loss = 1.87906315, grad/param norm = 5.2316e-02, time/batch = 0.1684s
532/9500 (epoch 2.800), train_loss = 1.87079769, grad/param norm = 4.8031e-02, time/batch = 0.1691s
533/9500 (epoch 2.805), train_loss = 1.81728657, grad/param norm = 4.0500e-02, time/batch = 0.1700s
534/9500 (epoch 2.811), train_loss = 1.85256853, grad/param norm = 4.1415e-02, time/batch = 0.1694s
535/9500 (epoch 2.816), train_loss = 1.83228573, grad/param norm = 4.2823e-02, time/batch = 0.1695s
536/9500 (epoch 2.821), train_loss = 1.84020066, grad/param norm = 4.5536e-02, time/batch = 0.1701s
537/9500 (epoch 2.826), train_loss = 1.87035671, grad/param norm = 4.5981e-02, time/batch = 0.1696s
538/9500 (epoch 2.832), train_loss = 1.85629068, grad/param norm = 4.7473e-02, time/batch = 0.1696s
539/9500 (epoch 2.837), train_loss = 1.85260880, grad/param norm = 4.7161e-02, time/batch = 0.1692s
540/9500 (epoch 2.842), train_loss = 1.87969712, grad/param norm = 5.4972e-02, time/batch = 0.1690s
541/9500 (epoch 2.847), train_loss = 1.91051828, grad/param norm = 6.8706e-02, time/batch = 0.1685s
542/9500 (epoch 2.853), train_loss = 1.90196894, grad/param norm = 7.6208e-02, time/batch = 0.1694s
543/9500 (epoch 2.858), train_loss = 1.88901695, grad/param norm = 5.8855e-02, time/batch = 0.1693s
544/9500 (epoch 2.863), train_loss = 1.87920945, grad/param norm = 4.5858e-02, time/batch = 0.1700s
545/9500 (epoch 2.868), train_loss = 1.84687009, grad/param norm = 4.0043e-02, time/batch = 0.1693s
546/9500 (epoch 2.874), train_loss = 1.83711427, grad/param norm = 3.6636e-02, time/batch = 0.1691s
547/9500 (epoch 2.879), train_loss = 1.81276314, grad/param norm = 3.8621e-02, time/batch = 0.1702s
548/9500 (epoch 2.884), train_loss = 1.85328957, grad/param norm = 4.5395e-02, time/batch = 0.1698s
549/9500 (epoch 2.889), train_loss = 1.85911679, grad/param norm = 5.6605e-02, time/batch = 0.1687s
550/9500 (epoch 2.895), train_loss = 1.84301453, grad/param norm = 5.9792e-02, time/batch = 0.1703s
551/9500 (epoch 2.900), train_loss = 1.85224115, grad/param norm = 6.5236e-02, time/batch = 0.1674s
552/9500 (epoch 2.905), train_loss = 1.90556280, grad/param norm = 6.8136e-02, time/batch = 0.1705s
553/9500 (epoch 2.911), train_loss = 1.82926530, grad/param norm = 5.9384e-02, time/batch = 0.1701s
554/9500 (epoch 2.916), train_loss = 1.83039433, grad/param norm = 4.8793e-02, time/batch = 0.1689s
555/9500 (epoch 2.921), train_loss = 1.82266969, grad/param norm = 4.2242e-02, time/batch = 0.1696s
556/9500 (epoch 2.926), train_loss = 1.84657590, grad/param norm = 4.5730e-02, time/batch = 0.1695s
557/9500 (epoch 2.932), train_loss = 1.84131256, grad/param norm = 4.8838e-02, time/batch = 0.1694s
558/9500 (epoch 2.937), train_loss = 1.86687087, grad/param norm = 5.3601e-02, time/batch = 0.1694s
559/9500 (epoch 2.942), train_loss = 1.86916329, grad/param norm = 6.2592e-02, time/batch = 0.1692s
560/9500 (epoch 2.947), train_loss = 1.87084273, grad/param norm = 5.9372e-02, time/batch = 0.1698s
561/9500 (epoch 2.953), train_loss = 1.87519591, grad/param norm = 4.9984e-02, time/batch = 0.1678s
562/9500 (epoch 2.958), train_loss = 1.85490094, grad/param norm = 4.5317e-02, time/batch = 0.1691s
563/9500 (epoch 2.963), train_loss = 1.84892019, grad/param norm = 4.5259e-02, time/batch = 0.1701s
564/9500 (epoch 2.968), train_loss = 1.84495144, grad/param norm = 4.4997e-02, time/batch = 0.1693s
565/9500 (epoch 2.974), train_loss = 1.83812212, grad/param norm = 4.5136e-02, time/batch = 0.1694s
566/9500 (epoch 2.979), train_loss = 1.85847751, grad/param norm = 5.1556e-02, time/batch = 0.1704s
567/9500 (epoch 2.984), train_loss = 1.83807545, grad/param norm = 5.1985e-02, time/batch = 0.1691s
568/9500 (epoch 2.989), train_loss = 1.86318513, grad/param norm = 4.9326e-02, time/batch = 0.1696s
569/9500 (epoch 2.995), train_loss = 1.87010762, grad/param norm = 4.3792e-02, time/batch = 0.1701s
570/9500 (epoch 3.000), train_loss = 1.86753634, grad/param norm = 4.1799e-02, time/batch = 0.1690s
571/9500 (epoch 3.005), train_loss = 1.95537313, grad/param norm = 4.1294e-02, time/batch = 0.1676s
572/9500 (epoch 3.011), train_loss = 1.85101116, grad/param norm = 4.4363e-02, time/batch = 0.1698s
573/9500 (epoch 3.016), train_loss = 1.87318309, grad/param norm = 4.4513e-02, time/batch = 0.1695s
574/9500 (epoch 3.021), train_loss = 1.83887270, grad/param norm = 4.6123e-02, time/batch = 0.1690s
575/9500 (epoch 3.026), train_loss = 1.86353029, grad/param norm = 4.7851e-02, time/batch = 0.1697s
576/9500 (epoch 3.032), train_loss = 1.86116394, grad/param norm = 4.7900e-02, time/batch = 0.1692s
577/9500 (epoch 3.037), train_loss = 1.81750501, grad/param norm = 4.5996e-02, time/batch = 0.1694s
578/9500 (epoch 3.042), train_loss = 1.81265029, grad/param norm = 4.7966e-02, time/batch = 0.1701s
579/9500 (epoch 3.047), train_loss = 1.83816731, grad/param norm = 4.4847e-02, time/batch = 0.1690s
580/9500 (epoch 3.053), train_loss = 1.83383073, grad/param norm = 4.5210e-02, time/batch = 0.1692s
581/9500 (epoch 3.058), train_loss = 1.85912119, grad/param norm = 4.4638e-02, time/batch = 0.1679s
582/9500 (epoch 3.063), train_loss = 1.80435811, grad/param norm = 5.1897e-02, time/batch = 0.1694s
583/9500 (epoch 3.068), train_loss = 1.81365465, grad/param norm = 6.0729e-02, time/batch = 0.1694s
584/9500 (epoch 3.074), train_loss = 1.85385983, grad/param norm = 7.0284e-02, time/batch = 0.1695s
585/9500 (epoch 3.079), train_loss = 1.84989352, grad/param norm = 6.9419e-02, time/batch = 0.1692s
586/9500 (epoch 3.084), train_loss = 1.90681555, grad/param norm = 6.7327e-02, time/batch = 0.1695s
587/9500 (epoch 3.089), train_loss = 1.88156702, grad/param norm = 6.6523e-02, time/batch = 0.1695s
588/9500 (epoch 3.095), train_loss = 1.82472332, grad/param norm = 5.1840e-02, time/batch = 0.1687s
589/9500 (epoch 3.100), train_loss = 1.82327357, grad/param norm = 4.4981e-02, time/batch = 0.1697s
590/9500 (epoch 3.105), train_loss = 1.86524602, grad/param norm = 4.0010e-02, time/batch = 0.1692s
591/9500 (epoch 3.111), train_loss = 1.77854895, grad/param norm = 3.9045e-02, time/batch = 0.1683s
592/9500 (epoch 3.116), train_loss = 1.79578777, grad/param norm = 4.2107e-02, time/batch = 0.1694s
593/9500 (epoch 3.121), train_loss = 1.81334460, grad/param norm = 4.2988e-02, time/batch = 0.1693s
594/9500 (epoch 3.126), train_loss = 1.83485083, grad/param norm = 4.5583e-02, time/batch = 0.1700s
595/9500 (epoch 3.132), train_loss = 1.84276733, grad/param norm = 4.8822e-02, time/batch = 0.1694s
596/9500 (epoch 3.137), train_loss = 1.87020520, grad/param norm = 4.3081e-02, time/batch = 0.1691s
597/9500 (epoch 3.142), train_loss = 1.80029679, grad/param norm = 3.6300e-02, time/batch = 0.1697s
598/9500 (epoch 3.147), train_loss = 1.78793058, grad/param norm = 4.1272e-02, time/batch = 0.1697s
599/9500 (epoch 3.153), train_loss = 1.81768034, grad/param norm = 4.2506e-02, time/batch = 0.1696s
600/9500 (epoch 3.158), train_loss = 1.80982726, grad/param norm = 3.8675e-02, time/batch = 0.1700s
601/9500 (epoch 3.163), train_loss = 1.83936877, grad/param norm = 4.3528e-02, time/batch = 0.1684s
602/9500 (epoch 3.168), train_loss = 1.80401972, grad/param norm = 4.4485e-02, time/batch = 0.1698s
603/9500 (epoch 3.174), train_loss = 1.79040774, grad/param norm = 4.3550e-02, time/batch = 0.1700s
604/9500 (epoch 3.179), train_loss = 1.86320946, grad/param norm = 4.4685e-02, time/batch = 0.1691s
605/9500 (epoch 3.184), train_loss = 1.80377782, grad/param norm = 4.6752e-02, time/batch = 0.1691s
606/9500 (epoch 3.189), train_loss = 1.83076673, grad/param norm = 4.7664e-02, time/batch = 0.1704s
607/9500 (epoch 3.195), train_loss = 1.79920444, grad/param norm = 4.1736e-02, time/batch = 0.1691s
608/9500 (epoch 3.200), train_loss = 1.80921168, grad/param norm = 4.3235e-02, time/batch = 0.1699s
609/9500 (epoch 3.205), train_loss = 1.78054574, grad/param norm = 4.7414e-02, time/batch = 0.1698s
610/9500 (epoch 3.211), train_loss = 1.81060990, grad/param norm = 4.7518e-02, time/batch = 0.1699s
611/9500 (epoch 3.216), train_loss = 1.81145240, grad/param norm = 4.3713e-02, time/batch = 0.1676s
612/9500 (epoch 3.221), train_loss = 1.80558528, grad/param norm = 3.5515e-02, time/batch = 0.1705s
613/9500 (epoch 3.226), train_loss = 1.81210315, grad/param norm = 3.8165e-02, time/batch = 0.1690s
614/9500 (epoch 3.232), train_loss = 1.82753631, grad/param norm = 4.4103e-02, time/batch = 0.1693s
615/9500 (epoch 3.237), train_loss = 1.81629027, grad/param norm = 4.7005e-02, time/batch = 0.1699s
616/9500 (epoch 3.242), train_loss = 1.83512748, grad/param norm = 4.6471e-02, time/batch = 0.1699s
617/9500 (epoch 3.247), train_loss = 1.81731710, grad/param norm = 4.5593e-02, time/batch = 0.1691s
618/9500 (epoch 3.253), train_loss = 1.80058848, grad/param norm = 4.3190e-02, time/batch = 0.1694s
619/9500 (epoch 3.258), train_loss = 1.81200947, grad/param norm = 4.3211e-02, time/batch = 0.1696s
620/9500 (epoch 3.263), train_loss = 1.77170477, grad/param norm = 4.7676e-02, time/batch = 0.1693s
621/9500 (epoch 3.268), train_loss = 1.76843055, grad/param norm = 5.0132e-02, time/batch = 0.1676s
622/9500 (epoch 3.274), train_loss = 1.80059041, grad/param norm = 5.5416e-02, time/batch = 0.1695s
623/9500 (epoch 3.279), train_loss = 1.81564609, grad/param norm = 5.9781e-02, time/batch = 0.1697s
624/9500 (epoch 3.284), train_loss = 1.83986687, grad/param norm = 6.5311e-02, time/batch = 0.1699s
625/9500 (epoch 3.289), train_loss = 1.78611576, grad/param norm = 6.4258e-02, time/batch = 0.1694s
626/9500 (epoch 3.295), train_loss = 1.78806943, grad/param norm = 4.8388e-02, time/batch = 0.1700s
627/9500 (epoch 3.300), train_loss = 1.81478295, grad/param norm = 3.9739e-02, time/batch = 0.1695s
628/9500 (epoch 3.305), train_loss = 1.80561443, grad/param norm = 3.8853e-02, time/batch = 0.1690s
629/9500 (epoch 3.311), train_loss = 1.76836010, grad/param norm = 4.2556e-02, time/batch = 0.1701s
630/9500 (epoch 3.316), train_loss = 1.80009086, grad/param norm = 4.2463e-02, time/batch = 0.1691s
631/9500 (epoch 3.321), train_loss = 1.80444720, grad/param norm = 4.2388e-02, time/batch = 0.1679s
632/9500 (epoch 3.326), train_loss = 1.82457549, grad/param norm = 4.2321e-02, time/batch = 0.1705s
633/9500 (epoch 3.332), train_loss = 1.80834429, grad/param norm = 3.9830e-02, time/batch = 0.1691s
634/9500 (epoch 3.337), train_loss = 1.78631243, grad/param norm = 3.7436e-02, time/batch = 0.1699s
635/9500 (epoch 3.342), train_loss = 1.76927459, grad/param norm = 3.6327e-02, time/batch = 0.1700s
636/9500 (epoch 3.347), train_loss = 1.76974510, grad/param norm = 3.8331e-02, time/batch = 0.1690s
637/9500 (epoch 3.353), train_loss = 1.78676890, grad/param norm = 4.2768e-02, time/batch = 0.1695s
638/9500 (epoch 3.358), train_loss = 1.76189095, grad/param norm = 4.1431e-02, time/batch = 0.1703s
639/9500 (epoch 3.363), train_loss = 1.75680068, grad/param norm = 4.4314e-02, time/batch = 0.1696s
640/9500 (epoch 3.368), train_loss = 1.77559983, grad/param norm = 5.2606e-02, time/batch = 0.1694s
641/9500 (epoch 3.374), train_loss = 1.80623517, grad/param norm = 5.6592e-02, time/batch = 0.1686s
642/9500 (epoch 3.379), train_loss = 1.78997317, grad/param norm = 4.6973e-02, time/batch = 0.1694s
643/9500 (epoch 3.384), train_loss = 1.77873374, grad/param norm = 3.8688e-02, time/batch = 0.1697s
644/9500 (epoch 3.389), train_loss = 1.77711238, grad/param norm = 3.7824e-02, time/batch = 0.1703s
645/9500 (epoch 3.395), train_loss = 1.77372936, grad/param norm = 3.6224e-02, time/batch = 0.1691s
646/9500 (epoch 3.400), train_loss = 1.80325129, grad/param norm = 3.9344e-02, time/batch = 0.1692s
647/9500 (epoch 3.405), train_loss = 1.80599959, grad/param norm = 3.8560e-02, time/batch = 0.1695s
648/9500 (epoch 3.411), train_loss = 1.79874394, grad/param norm = 3.8413e-02, time/batch = 0.1697s
649/9500 (epoch 3.416), train_loss = 1.79162630, grad/param norm = 3.9938e-02, time/batch = 0.1693s
650/9500 (epoch 3.421), train_loss = 1.80168123, grad/param norm = 3.5983e-02, time/batch = 0.1692s
651/9500 (epoch 3.426), train_loss = 1.76331738, grad/param norm = 3.6479e-02, time/batch = 0.1681s
652/9500 (epoch 3.432), train_loss = 1.77353806, grad/param norm = 3.9183e-02, time/batch = 0.1689s
653/9500 (epoch 3.437), train_loss = 1.80122052, grad/param norm = 4.2918e-02, time/batch = 0.1693s
654/9500 (epoch 3.442), train_loss = 1.80728120, grad/param norm = 4.4306e-02, time/batch = 0.1698s
655/9500 (epoch 3.447), train_loss = 1.77951903, grad/param norm = 4.7183e-02, time/batch = 0.1691s
656/9500 (epoch 3.453), train_loss = 1.78742464, grad/param norm = 4.6887e-02, time/batch = 0.1697s
657/9500 (epoch 3.458), train_loss = 1.76388273, grad/param norm = 4.8532e-02, time/batch = 0.1705s
658/9500 (epoch 3.463), train_loss = 1.80837691, grad/param norm = 4.9809e-02, time/batch = 0.1694s
659/9500 (epoch 3.468), train_loss = 1.78282047, grad/param norm = 5.2433e-02, time/batch = 0.1695s
660/9500 (epoch 3.474), train_loss = 1.79175293, grad/param norm = 5.3311e-02, time/batch = 0.1699s
661/9500 (epoch 3.479), train_loss = 1.76551908, grad/param norm = 4.3421e-02, time/batch = 0.1676s
662/9500 (epoch 3.484), train_loss = 1.78204399, grad/param norm = 3.9790e-02, time/batch = 0.1697s
663/9500 (epoch 3.489), train_loss = 1.79288179, grad/param norm = 3.9606e-02, time/batch = 0.1702s
664/9500 (epoch 3.495), train_loss = 1.79020545, grad/param norm = 4.2361e-02, time/batch = 0.1694s
665/9500 (epoch 3.500), train_loss = 1.76409814, grad/param norm = 4.0439e-02, time/batch = 0.1698s
666/9500 (epoch 3.505), train_loss = 1.75201327, grad/param norm = 4.0866e-02, time/batch = 0.1699s
667/9500 (epoch 3.511), train_loss = 1.80777675, grad/param norm = 4.2955e-02, time/batch = 0.1692s
668/9500 (epoch 3.516), train_loss = 1.78990136, grad/param norm = 3.8255e-02, time/batch = 0.1693s
669/9500 (epoch 3.521), train_loss = 1.75501078, grad/param norm = 3.4115e-02, time/batch = 0.1707s
670/9500 (epoch 3.526), train_loss = 1.77833203, grad/param norm = 3.5530e-02, time/batch = 0.1692s
671/9500 (epoch 3.532), train_loss = 1.81386884, grad/param norm = 3.5689e-02, time/batch = 0.1682s
672/9500 (epoch 3.537), train_loss = 1.78632731, grad/param norm = 3.3735e-02, time/batch = 0.1704s
673/9500 (epoch 3.542), train_loss = 1.75348978, grad/param norm = 3.8147e-02, time/batch = 0.1690s
674/9500 (epoch 3.547), train_loss = 1.77019059, grad/param norm = 3.8867e-02, time/batch = 0.1692s
675/9500 (epoch 3.553), train_loss = 1.74064991, grad/param norm = 3.8122e-02, time/batch = 0.1693s
676/9500 (epoch 3.558), train_loss = 1.73402519, grad/param norm = 3.6855e-02, time/batch = 0.1697s
677/9500 (epoch 3.563), train_loss = 1.73643238, grad/param norm = 3.8880e-02, time/batch = 0.1691s
678/9500 (epoch 3.568), train_loss = 1.76063218, grad/param norm = 3.9919e-02, time/batch = 0.1694s
679/9500 (epoch 3.574), train_loss = 1.76949089, grad/param norm = 4.3115e-02, time/batch = 0.1706s
680/9500 (epoch 3.579), train_loss = 1.74841327, grad/param norm = 4.3680e-02, time/batch = 0.1695s
681/9500 (epoch 3.584), train_loss = 1.77670381, grad/param norm = 4.4488e-02, time/batch = 0.1678s
682/9500 (epoch 3.589), train_loss = 1.72950514, grad/param norm = 4.1484e-02, time/batch = 0.1702s
683/9500 (epoch 3.595), train_loss = 1.75080619, grad/param norm = 4.4720e-02, time/batch = 0.1691s
684/9500 (epoch 3.600), train_loss = 1.76947150, grad/param norm = 4.4697e-02, time/batch = 0.1698s
685/9500 (epoch 3.605), train_loss = 1.75174712, grad/param norm = 4.0703e-02, time/batch = 0.1704s
686/9500 (epoch 3.611), train_loss = 1.75587249, grad/param norm = 3.6475e-02, time/batch = 0.1695s
687/9500 (epoch 3.616), train_loss = 1.75002475, grad/param norm = 3.3846e-02, time/batch = 0.1691s
688/9500 (epoch 3.621), train_loss = 1.75332332, grad/param norm = 3.3035e-02, time/batch = 0.1706s
689/9500 (epoch 3.626), train_loss = 1.71537692, grad/param norm = 3.4892e-02, time/batch = 0.1691s
690/9500 (epoch 3.632), train_loss = 1.76720503, grad/param norm = 3.9488e-02, time/batch = 0.1692s
691/9500 (epoch 3.637), train_loss = 1.71771265, grad/param norm = 4.6831e-02, time/batch = 0.1683s
692/9500 (epoch 3.642), train_loss = 1.77643880, grad/param norm = 5.6509e-02, time/batch = 0.1694s
693/9500 (epoch 3.647), train_loss = 1.78119175, grad/param norm = 6.1088e-02, time/batch = 0.1697s
694/9500 (epoch 3.653), train_loss = 1.78204781, grad/param norm = 5.3512e-02, time/batch = 0.1701s
695/9500 (epoch 3.658), train_loss = 1.73908107, grad/param norm = 3.6930e-02, time/batch = 0.1691s
696/9500 (epoch 3.663), train_loss = 1.74140477, grad/param norm = 3.4874e-02, time/batch = 0.1693s
697/9500 (epoch 3.668), train_loss = 1.73128996, grad/param norm = 3.5438e-02, time/batch = 0.1699s
698/9500 (epoch 3.674), train_loss = 1.77941173, grad/param norm = 3.6272e-02, time/batch = 0.1694s
699/9500 (epoch 3.679), train_loss = 1.77278441, grad/param norm = 3.7084e-02, time/batch = 0.1692s
700/9500 (epoch 3.684), train_loss = 1.76013648, grad/param norm = 4.3550e-02, time/batch = 0.1695s
701/9500 (epoch 3.689), train_loss = 1.76204141, grad/param norm = 4.2121e-02, time/batch = 0.1676s
702/9500 (epoch 3.695), train_loss = 1.73786248, grad/param norm = 3.4464e-02, time/batch = 0.1696s
703/9500 (epoch 3.700), train_loss = 1.74876335, grad/param norm = 3.4461e-02, time/batch = 0.1694s
704/9500 (epoch 3.705), train_loss = 1.76601402, grad/param norm = 3.3359e-02, time/batch = 0.1698s
705/9500 (epoch 3.711), train_loss = 1.77734517, grad/param norm = 3.4287e-02, time/batch = 0.1694s
706/9500 (epoch 3.716), train_loss = 1.75355460, grad/param norm = 3.5843e-02, time/batch = 0.1697s
707/9500 (epoch 3.721), train_loss = 1.69308728, grad/param norm = 3.7051e-02, time/batch = 0.1694s
708/9500 (epoch 3.726), train_loss = 1.72896867, grad/param norm = 3.7128e-02, time/batch = 0.1692s
709/9500 (epoch 3.732), train_loss = 1.70397047, grad/param norm = 3.7675e-02, time/batch = 0.1694s
710/9500 (epoch 3.737), train_loss = 1.77687540, grad/param norm = 3.6867e-02, time/batch = 0.1702s
711/9500 (epoch 3.742), train_loss = 1.71585302, grad/param norm = 3.4215e-02, time/batch = 0.1677s
712/9500 (epoch 3.747), train_loss = 1.69915559, grad/param norm = 3.5681e-02, time/batch = 0.1691s
713/9500 (epoch 3.753), train_loss = 1.72366100, grad/param norm = 3.4640e-02, time/batch = 0.1699s
714/9500 (epoch 3.758), train_loss = 1.74328835, grad/param norm = 3.2972e-02, time/batch = 0.1688s
715/9500 (epoch 3.763), train_loss = 1.70238537, grad/param norm = 3.2979e-02, time/batch = 0.1696s
716/9500 (epoch 3.768), train_loss = 1.72236604, grad/param norm = 3.3106e-02, time/batch = 0.1697s
717/9500 (epoch 3.774), train_loss = 1.72975173, grad/param norm = 4.6342e-02, time/batch = 0.1694s
718/9500 (epoch 3.779), train_loss = 1.77293656, grad/param norm = 5.0032e-02, time/batch = 0.1694s
719/9500 (epoch 3.784), train_loss = 1.73741289, grad/param norm = 4.6442e-02, time/batch = 0.1701s
720/9500 (epoch 3.789), train_loss = 1.74295212, grad/param norm = 4.6059e-02, time/batch = 0.1695s
721/9500 (epoch 3.795), train_loss = 1.75090654, grad/param norm = 4.7042e-02, time/batch = 0.1677s
722/9500 (epoch 3.800), train_loss = 1.73434170, grad/param norm = 4.5166e-02, time/batch = 0.1699s
723/9500 (epoch 3.805), train_loss = 1.68175491, grad/param norm = 3.8617e-02, time/batch = 0.1694s
724/9500 (epoch 3.811), train_loss = 1.68830641, grad/param norm = 3.6323e-02, time/batch = 0.1693s
725/9500 (epoch 3.816), train_loss = 1.68355467, grad/param norm = 3.2484e-02, time/batch = 0.1702s
726/9500 (epoch 3.821), train_loss = 1.69512617, grad/param norm = 3.3447e-02, time/batch = 0.1690s
727/9500 (epoch 3.826), train_loss = 1.71940731, grad/param norm = 3.4875e-02, time/batch = 0.1693s
728/9500 (epoch 3.832), train_loss = 1.71243436, grad/param norm = 3.3385e-02, time/batch = 0.1690s
729/9500 (epoch 3.837), train_loss = 1.70550854, grad/param norm = 3.1973e-02, time/batch = 0.1697s
730/9500 (epoch 3.842), train_loss = 1.71293974, grad/param norm = 3.4717e-02, time/batch = 0.1693s
731/9500 (epoch 3.847), train_loss = 1.74060954, grad/param norm = 3.6998e-02, time/batch = 0.1681s
732/9500 (epoch 3.853), train_loss = 1.72798499, grad/param norm = 3.6935e-02, time/batch = 0.1698s
733/9500 (epoch 3.858), train_loss = 1.70368571, grad/param norm = 3.6158e-02, time/batch = 0.1691s
734/9500 (epoch 3.863), train_loss = 1.73405166, grad/param norm = 3.8619e-02, time/batch = 0.1696s
735/9500 (epoch 3.868), train_loss = 1.70903681, grad/param norm = 4.1595e-02, time/batch = 0.1700s
736/9500 (epoch 3.874), train_loss = 1.71198540, grad/param norm = 4.4412e-02, time/batch = 0.1695s
737/9500 (epoch 3.879), train_loss = 1.69003823, grad/param norm = 4.5561e-02, time/batch = 0.1693s
738/9500 (epoch 3.884), train_loss = 1.71900720, grad/param norm = 4.2549e-02, time/batch = 0.1702s
739/9500 (epoch 3.889), train_loss = 1.72340558, grad/param norm = 4.1286e-02, time/batch = 0.1691s
740/9500 (epoch 3.895), train_loss = 1.67800264, grad/param norm = 4.0967e-02, time/batch = 0.1691s
741/9500 (epoch 3.900), train_loss = 1.68586576, grad/param norm = 4.1443e-02, time/batch = 0.1684s
742/9500 (epoch 3.905), train_loss = 1.73222011, grad/param norm = 4.2077e-02, time/batch = 0.1694s
743/9500 (epoch 3.911), train_loss = 1.68591813, grad/param norm = 4.2611e-02, time/batch = 0.1692s
744/9500 (epoch 3.916), train_loss = 1.68500135, grad/param norm = 4.0575e-02, time/batch = 0.1700s
745/9500 (epoch 3.921), train_loss = 1.67930648, grad/param norm = 3.0952e-02, time/batch = 0.1689s
746/9500 (epoch 3.926), train_loss = 1.70040827, grad/param norm = 3.1877e-02, time/batch = 0.1700s
747/9500 (epoch 3.932), train_loss = 1.68964041, grad/param norm = 3.2007e-02, time/batch = 0.1698s
748/9500 (epoch 3.937), train_loss = 1.72014636, grad/param norm = 3.3790e-02, time/batch = 0.1692s
749/9500 (epoch 3.942), train_loss = 1.71274555, grad/param norm = 3.9030e-02, time/batch = 0.1696s
750/9500 (epoch 3.947), train_loss = 1.73294349, grad/param norm = 3.9198e-02, time/batch = 0.1696s
751/9500 (epoch 3.953), train_loss = 1.73659447, grad/param norm = 4.3977e-02, time/batch = 0.1677s
752/9500 (epoch 3.958), train_loss = 1.73771305, grad/param norm = 4.4042e-02, time/batch = 0.1695s
753/9500 (epoch 3.963), train_loss = 1.71932483, grad/param norm = 3.7578e-02, time/batch = 0.1700s
754/9500 (epoch 3.968), train_loss = 1.69500429, grad/param norm = 3.3688e-02, time/batch = 0.1695s
755/9500 (epoch 3.974), train_loss = 1.70278014, grad/param norm = 3.2687e-02, time/batch = 0.1695s
756/9500 (epoch 3.979), train_loss = 1.71677118, grad/param norm = 3.4197e-02, time/batch = 0.1695s
757/9500 (epoch 3.984), train_loss = 1.69726287, grad/param norm = 3.3633e-02, time/batch = 0.1702s
758/9500 (epoch 3.989), train_loss = 1.72809759, grad/param norm = 3.3131e-02, time/batch = 0.1695s
759/9500 (epoch 3.995), train_loss = 1.72194815, grad/param norm = 3.0292e-02, time/batch = 0.1697s
760/9500 (epoch 4.000), train_loss = 1.73746883, grad/param norm = 3.1133e-02, time/batch = 0.1699s
761/9500 (epoch 4.005), train_loss = 1.84910239, grad/param norm = 3.6232e-02, time/batch = 0.1675s
762/9500 (epoch 4.011), train_loss = 1.72923987, grad/param norm = 4.0027e-02, time/batch = 0.1695s
763/9500 (epoch 4.016), train_loss = 1.75710629, grad/param norm = 3.7150e-02, time/batch = 0.1701s
764/9500 (epoch 4.021), train_loss = 1.70714529, grad/param norm = 3.4849e-02, time/batch = 0.1696s
765/9500 (epoch 4.026), train_loss = 1.74504089, grad/param norm = 3.6244e-02, time/batch = 0.1693s
766/9500 (epoch 4.032), train_loss = 1.73616397, grad/param norm = 3.9313e-02, time/batch = 0.1701s
767/9500 (epoch 4.037), train_loss = 1.68321520, grad/param norm = 3.5906e-02, time/batch = 0.1693s
768/9500 (epoch 4.042), train_loss = 1.68415693, grad/param norm = 3.3436e-02, time/batch = 0.1693s
769/9500 (epoch 4.047), train_loss = 1.70829377, grad/param norm = 3.4259e-02, time/batch = 0.1698s
770/9500 (epoch 4.053), train_loss = 1.70832944, grad/param norm = 3.4296e-02, time/batch = 0.1695s
771/9500 (epoch 4.058), train_loss = 1.73836421, grad/param norm = 3.2955e-02, time/batch = 0.1678s
772/9500 (epoch 4.063), train_loss = 1.67311016, grad/param norm = 3.2127e-02, time/batch = 0.1698s
773/9500 (epoch 4.068), train_loss = 1.68532416, grad/param norm = 3.3327e-02, time/batch = 0.1698s
774/9500 (epoch 4.074), train_loss = 1.71948500, grad/param norm = 3.4482e-02, time/batch = 0.1696s
775/9500 (epoch 4.079), train_loss = 1.69068276, grad/param norm = 3.1240e-02, time/batch = 0.1702s
776/9500 (epoch 4.084), train_loss = 1.75061802, grad/param norm = 3.2074e-02, time/batch = 0.1692s
777/9500 (epoch 4.089), train_loss = 1.73364853, grad/param norm = 3.4344e-02, time/batch = 0.1692s
778/9500 (epoch 4.095), train_loss = 1.69672415, grad/param norm = 3.7365e-02, time/batch = 0.1702s
779/9500 (epoch 4.100), train_loss = 1.69509566, grad/param norm = 3.6946e-02, time/batch = 0.1688s
780/9500 (epoch 4.105), train_loss = 1.74267907, grad/param norm = 3.8598e-02, time/batch = 0.1694s
781/9500 (epoch 4.111), train_loss = 1.66515404, grad/param norm = 4.1729e-02, time/batch = 0.1685s
782/9500 (epoch 4.116), train_loss = 1.68998295, grad/param norm = 4.6062e-02, time/batch = 0.1693s
783/9500 (epoch 4.121), train_loss = 1.69910591, grad/param norm = 4.6570e-02, time/batch = 0.1695s
784/9500 (epoch 4.126), train_loss = 1.71080169, grad/param norm = 4.0959e-02, time/batch = 0.1697s
785/9500 (epoch 4.132), train_loss = 1.70382817, grad/param norm = 3.7693e-02, time/batch = 0.1692s
786/9500 (epoch 4.137), train_loss = 1.73845530, grad/param norm = 3.9224e-02, time/batch = 0.1695s
787/9500 (epoch 4.142), train_loss = 1.69863048, grad/param norm = 4.0038e-02, time/batch = 0.1693s
788/9500 (epoch 4.147), train_loss = 1.69344724, grad/param norm = 3.9740e-02, time/batch = 0.1695s
789/9500 (epoch 4.153), train_loss = 1.71512477, grad/param norm = 3.7185e-02, time/batch = 0.1691s
790/9500 (epoch 4.158), train_loss = 1.69265177, grad/param norm = 3.2910e-02, time/batch = 0.1696s
791/9500 (epoch 4.163), train_loss = 1.70233213, grad/param norm = 3.3815e-02, time/batch = 0.1675s
792/9500 (epoch 4.168), train_loss = 1.66747679, grad/param norm = 3.4603e-02, time/batch = 0.1695s
793/9500 (epoch 4.174), train_loss = 1.67292951, grad/param norm = 3.3639e-02, time/batch = 0.1693s
794/9500 (epoch 4.179), train_loss = 1.73572232, grad/param norm = 3.1533e-02, time/batch = 0.1695s
795/9500 (epoch 4.184), train_loss = 1.68390242, grad/param norm = 3.2948e-02, time/batch = 0.1706s
796/9500 (epoch 4.189), train_loss = 1.70645335, grad/param norm = 3.4675e-02, time/batch = 0.1696s
797/9500 (epoch 4.195), train_loss = 1.68173735, grad/param norm = 3.3376e-02, time/batch = 0.1690s
798/9500 (epoch 4.200), train_loss = 1.70523606, grad/param norm = 3.1556e-02, time/batch = 0.1701s
799/9500 (epoch 4.205), train_loss = 1.65286048, grad/param norm = 2.9314e-02, time/batch = 0.1692s
800/9500 (epoch 4.211), train_loss = 1.70290766, grad/param norm = 3.1852e-02, time/batch = 0.1693s
801/9500 (epoch 4.216), train_loss = 1.68540107, grad/param norm = 3.1607e-02, time/batch = 0.1682s
802/9500 (epoch 4.221), train_loss = 1.69307506, grad/param norm = 3.1378e-02, time/batch = 0.1693s
803/9500 (epoch 4.226), train_loss = 1.70052163, grad/param norm = 3.3597e-02, time/batch = 0.1690s
804/9500 (epoch 4.232), train_loss = 1.70445857, grad/param norm = 3.3929e-02, time/batch = 0.1700s
805/9500 (epoch 4.237), train_loss = 1.68909266, grad/param norm = 3.2400e-02, time/batch = 0.1692s
806/9500 (epoch 4.242), train_loss = 1.69284573, grad/param norm = 3.3771e-02, time/batch = 0.1697s
807/9500 (epoch 4.247), train_loss = 1.70507795, grad/param norm = 3.5486e-02, time/batch = 0.1697s
808/9500 (epoch 4.253), train_loss = 1.68410130, grad/param norm = 3.4042e-02, time/batch = 0.1691s
809/9500 (epoch 4.258), train_loss = 1.69357463, grad/param norm = 3.6804e-02, time/batch = 0.1694s
810/9500 (epoch 4.263), train_loss = 1.65414316, grad/param norm = 3.7683e-02, time/batch = 0.1699s
811/9500 (epoch 4.268), train_loss = 1.64966578, grad/param norm = 3.5812e-02, time/batch = 0.1676s
812/9500 (epoch 4.274), train_loss = 1.66857878, grad/param norm = 3.3379e-02, time/batch = 0.1696s
813/9500 (epoch 4.279), train_loss = 1.68510762, grad/param norm = 3.2478e-02, time/batch = 0.1703s
814/9500 (epoch 4.284), train_loss = 1.70041412, grad/param norm = 3.3275e-02, time/batch = 0.1692s
815/9500 (epoch 4.289), train_loss = 1.65687522, grad/param norm = 3.7639e-02, time/batch = 0.1694s
816/9500 (epoch 4.295), train_loss = 1.67861206, grad/param norm = 3.7732e-02, time/batch = 0.1697s
817/9500 (epoch 4.300), train_loss = 1.70076107, grad/param norm = 3.1151e-02, time/batch = 0.1702s
818/9500 (epoch 4.305), train_loss = 1.69641311, grad/param norm = 3.1220e-02, time/batch = 0.1691s
819/9500 (epoch 4.311), train_loss = 1.66745172, grad/param norm = 3.5026e-02, time/batch = 0.1699s
820/9500 (epoch 4.316), train_loss = 1.69968755, grad/param norm = 3.5276e-02, time/batch = 0.1701s
821/9500 (epoch 4.321), train_loss = 1.69734132, grad/param norm = 3.4001e-02, time/batch = 0.1675s
822/9500 (epoch 4.326), train_loss = 1.69687637, grad/param norm = 3.3011e-02, time/batch = 0.1692s
823/9500 (epoch 4.332), train_loss = 1.68392624, grad/param norm = 3.2806e-02, time/batch = 0.1699s
824/9500 (epoch 4.337), train_loss = 1.68316220, grad/param norm = 3.5902e-02, time/batch = 0.1692s
825/9500 (epoch 4.342), train_loss = 1.67767254, grad/param norm = 3.7929e-02, time/batch = 0.1696s
826/9500 (epoch 4.347), train_loss = 1.66085459, grad/param norm = 3.4367e-02, time/batch = 0.1704s
827/9500 (epoch 4.353), train_loss = 1.67712373, grad/param norm = 3.7813e-02, time/batch = 0.1694s
828/9500 (epoch 4.358), train_loss = 1.64296116, grad/param norm = 3.4517e-02, time/batch = 0.1696s
829/9500 (epoch 4.363), train_loss = 1.62750533, grad/param norm = 3.0746e-02, time/batch = 0.1700s
830/9500 (epoch 4.368), train_loss = 1.66052310, grad/param norm = 3.3042e-02, time/batch = 0.1693s
831/9500 (epoch 4.374), train_loss = 1.66827621, grad/param norm = 3.4641e-02, time/batch = 0.1680s
832/9500 (epoch 4.379), train_loss = 1.66162167, grad/param norm = 3.2866e-02, time/batch = 0.1706s
833/9500 (epoch 4.384), train_loss = 1.67696622, grad/param norm = 3.2193e-02, time/batch = 0.1696s
834/9500 (epoch 4.389), train_loss = 1.66974986, grad/param norm = 3.3135e-02, time/batch = 0.1697s
835/9500 (epoch 4.395), train_loss = 1.66859882, grad/param norm = 3.1017e-02, time/batch = 0.1702s
836/9500 (epoch 4.400), train_loss = 1.69916052, grad/param norm = 3.3128e-02, time/batch = 0.1697s
837/9500 (epoch 4.405), train_loss = 1.71771247, grad/param norm = 3.6049e-02, time/batch = 0.1699s
838/9500 (epoch 4.411), train_loss = 1.68828653, grad/param norm = 3.2667e-02, time/batch = 0.1709s
839/9500 (epoch 4.416), train_loss = 1.69377450, grad/param norm = 3.1784e-02, time/batch = 0.1693s
840/9500 (epoch 4.421), train_loss = 1.71324383, grad/param norm = 3.2364e-02, time/batch = 0.1697s
841/9500 (epoch 4.426), train_loss = 1.67354369, grad/param norm = 3.5824e-02, time/batch = 0.1686s
842/9500 (epoch 4.432), train_loss = 1.66145486, grad/param norm = 3.4614e-02, time/batch = 0.1692s
843/9500 (epoch 4.437), train_loss = 1.69032756, grad/param norm = 3.4409e-02, time/batch = 0.1693s
844/9500 (epoch 4.442), train_loss = 1.69635531, grad/param norm = 3.4377e-02, time/batch = 0.1693s
845/9500 (epoch 4.447), train_loss = 1.67571386, grad/param norm = 3.5496e-02, time/batch = 0.1693s
846/9500 (epoch 4.453), train_loss = 1.68532317, grad/param norm = 3.5646e-02, time/batch = 0.1700s
847/9500 (epoch 4.458), train_loss = 1.66196614, grad/param norm = 3.3800e-02, time/batch = 0.1694s
848/9500 (epoch 4.463), train_loss = 1.70427778, grad/param norm = 3.3331e-02, time/batch = 0.1700s
849/9500 (epoch 4.468), train_loss = 1.67554153, grad/param norm = 3.5593e-02, time/batch = 0.1693s
850/9500 (epoch 4.474), train_loss = 1.67805842, grad/param norm = 3.6657e-02, time/batch = 0.1691s
851/9500 (epoch 4.479), train_loss = 1.66183797, grad/param norm = 3.7103e-02, time/batch = 0.1685s
852/9500 (epoch 4.484), train_loss = 1.68452298, grad/param norm = 3.6480e-02, time/batch = 0.1694s
853/9500 (epoch 4.489), train_loss = 1.69640654, grad/param norm = 3.4951e-02, time/batch = 0.1702s
854/9500 (epoch 4.495), train_loss = 1.70235688, grad/param norm = 3.5472e-02, time/batch = 0.1701s
855/9500 (epoch 4.500), train_loss = 1.66385414, grad/param norm = 3.3043e-02, time/batch = 0.1693s
856/9500 (epoch 4.505), train_loss = 1.63838851, grad/param norm = 3.3034e-02, time/batch = 0.1694s
857/9500 (epoch 4.511), train_loss = 1.69584212, grad/param norm = 3.1342e-02, time/batch = 0.1699s
858/9500 (epoch 4.516), train_loss = 1.65695025, grad/param norm = 2.8823e-02, time/batch = 0.1696s
859/9500 (epoch 4.521), train_loss = 1.65720232, grad/param norm = 2.8859e-02, time/batch = 0.1696s
860/9500 (epoch 4.526), train_loss = 1.67545436, grad/param norm = 2.8094e-02, time/batch = 0.1700s
861/9500 (epoch 4.532), train_loss = 1.70237213, grad/param norm = 2.9727e-02, time/batch = 0.1680s
862/9500 (epoch 4.537), train_loss = 1.68935789, grad/param norm = 3.0745e-02, time/batch = 0.1693s
863/9500 (epoch 4.542), train_loss = 1.65004142, grad/param norm = 3.0606e-02, time/batch = 0.1706s
864/9500 (epoch 4.547), train_loss = 1.67958582, grad/param norm = 3.1302e-02, time/batch = 0.1697s
865/9500 (epoch 4.553), train_loss = 1.65028798, grad/param norm = 3.1387e-02, time/batch = 0.1691s
866/9500 (epoch 4.558), train_loss = 1.64009621, grad/param norm = 3.0035e-02, time/batch = 0.1705s
867/9500 (epoch 4.563), train_loss = 1.64705458, grad/param norm = 3.0802e-02, time/batch = 0.1698s
868/9500 (epoch 4.568), train_loss = 1.65758240, grad/param norm = 3.1543e-02, time/batch = 0.1695s
869/9500 (epoch 4.574), train_loss = 1.67436466, grad/param norm = 3.3443e-02, time/batch = 0.1695s
870/9500 (epoch 4.579), train_loss = 1.65568807, grad/param norm = 3.1547e-02, time/batch = 0.1696s
871/9500 (epoch 4.584), train_loss = 1.66286589, grad/param norm = 3.3907e-02, time/batch = 0.1676s
872/9500 (epoch 4.589), train_loss = 1.63673788, grad/param norm = 3.6569e-02, time/batch = 0.1694s
873/9500 (epoch 4.595), train_loss = 1.65604325, grad/param norm = 4.0041e-02, time/batch = 0.1699s
874/9500 (epoch 4.600), train_loss = 1.66198657, grad/param norm = 4.2221e-02, time/batch = 0.1695s
875/9500 (epoch 4.605), train_loss = 1.64800436, grad/param norm = 3.7810e-02, time/batch = 0.1693s
876/9500 (epoch 4.611), train_loss = 1.66949524, grad/param norm = 3.4123e-02, time/batch = 0.1694s
877/9500 (epoch 4.616), train_loss = 1.66157489, grad/param norm = 2.9860e-02, time/batch = 0.1696s
878/9500 (epoch 4.621), train_loss = 1.65762391, grad/param norm = 2.7941e-02, time/batch = 0.1697s
879/9500 (epoch 4.626), train_loss = 1.62202005, grad/param norm = 2.9629e-02, time/batch = 0.1697s
880/9500 (epoch 4.632), train_loss = 1.67640835, grad/param norm = 3.2587e-02, time/batch = 0.1693s
881/9500 (epoch 4.637), train_loss = 1.61996679, grad/param norm = 3.0431e-02, time/batch = 0.1679s
882/9500 (epoch 4.642), train_loss = 1.66797994, grad/param norm = 3.5335e-02, time/batch = 0.1699s
883/9500 (epoch 4.647), train_loss = 1.67702560, grad/param norm = 3.4762e-02, time/batch = 0.1693s
884/9500 (epoch 4.653), train_loss = 1.67072403, grad/param norm = 3.2903e-02, time/batch = 0.1693s
885/9500 (epoch 4.658), train_loss = 1.65085541, grad/param norm = 3.3333e-02, time/batch = 0.1703s
886/9500 (epoch 4.663), train_loss = 1.66171013, grad/param norm = 3.4721e-02, time/batch = 0.1694s
887/9500 (epoch 4.668), train_loss = 1.65543686, grad/param norm = 3.1265e-02, time/batch = 0.1692s
888/9500 (epoch 4.674), train_loss = 1.67150947, grad/param norm = 2.9587e-02, time/batch = 0.1703s
889/9500 (epoch 4.679), train_loss = 1.66555226, grad/param norm = 2.8626e-02, time/batch = 0.1695s
890/9500 (epoch 4.684), train_loss = 1.65576789, grad/param norm = 3.0017e-02, time/batch = 0.1699s
891/9500 (epoch 4.689), train_loss = 1.66307118, grad/param norm = 3.2335e-02, time/batch = 0.1685s
892/9500 (epoch 4.695), train_loss = 1.64577079, grad/param norm = 3.6755e-02, time/batch = 0.1696s
893/9500 (epoch 4.700), train_loss = 1.67214643, grad/param norm = 4.1275e-02, time/batch = 0.1693s
894/9500 (epoch 4.705), train_loss = 1.68925711, grad/param norm = 4.0973e-02, time/batch = 0.1699s
895/9500 (epoch 4.711), train_loss = 1.70573308, grad/param norm = 3.8978e-02, time/batch = 0.1695s
896/9500 (epoch 4.716), train_loss = 1.66014789, grad/param norm = 3.2731e-02, time/batch = 0.1690s
897/9500 (epoch 4.721), train_loss = 1.60224290, grad/param norm = 3.0528e-02, time/batch = 0.1703s
898/9500 (epoch 4.726), train_loss = 1.64024791, grad/param norm = 2.9035e-02, time/batch = 0.1695s
899/9500 (epoch 4.732), train_loss = 1.62264446, grad/param norm = 2.8376e-02, time/batch = 0.1691s
900/9500 (epoch 4.737), train_loss = 1.68484296, grad/param norm = 2.8212e-02, time/batch = 0.1696s
901/9500 (epoch 4.742), train_loss = 1.62961482, grad/param norm = 2.8651e-02, time/batch = 0.1675s
902/9500 (epoch 4.747), train_loss = 1.60998358, grad/param norm = 3.2463e-02, time/batch = 0.1690s
903/9500 (epoch 4.753), train_loss = 1.64074933, grad/param norm = 3.3213e-02, time/batch = 0.1698s
904/9500 (epoch 4.758), train_loss = 1.66365141, grad/param norm = 3.2918e-02, time/batch = 0.1694s
905/9500 (epoch 4.763), train_loss = 1.62952644, grad/param norm = 3.2209e-02, time/batch = 0.1695s
906/9500 (epoch 4.768), train_loss = 1.64807916, grad/param norm = 2.9828e-02, time/batch = 0.1697s
907/9500 (epoch 4.774), train_loss = 1.63682451, grad/param norm = 3.3154e-02, time/batch = 0.1702s
908/9500 (epoch 4.779), train_loss = 1.65010587, grad/param norm = 3.1245e-02, time/batch = 0.1696s
909/9500 (epoch 4.784), train_loss = 1.62872632, grad/param norm = 3.1844e-02, time/batch = 0.1695s
910/9500 (epoch 4.789), train_loss = 1.63452202, grad/param norm = 3.1856e-02, time/batch = 0.1704s
911/9500 (epoch 4.795), train_loss = 1.63445446, grad/param norm = 2.9030e-02, time/batch = 0.1679s
912/9500 (epoch 4.800), train_loss = 1.63899488, grad/param norm = 2.8729e-02, time/batch = 0.1698s
913/9500 (epoch 4.805), train_loss = 1.60223852, grad/param norm = 2.6382e-02, time/batch = 0.1700s
914/9500 (epoch 4.811), train_loss = 1.60930083, grad/param norm = 2.8544e-02, time/batch = 0.1693s
915/9500 (epoch 4.816), train_loss = 1.60181721, grad/param norm = 2.7536e-02, time/batch = 0.1695s
916/9500 (epoch 4.821), train_loss = 1.60203920, grad/param norm = 2.8407e-02, time/batch = 0.1701s
917/9500 (epoch 4.826), train_loss = 1.63418029, grad/param norm = 2.8586e-02, time/batch = 0.1694s
918/9500 (epoch 4.832), train_loss = 1.62646056, grad/param norm = 2.7871e-02, time/batch = 0.1698s
919/9500 (epoch 4.837), train_loss = 1.61954283, grad/param norm = 2.9192e-02, time/batch = 0.1699s
920/9500 (epoch 4.842), train_loss = 1.64262913, grad/param norm = 3.0302e-02, time/batch = 0.1694s
921/9500 (epoch 4.847), train_loss = 1.66500191, grad/param norm = 3.1456e-02, time/batch = 0.1678s
922/9500 (epoch 4.853), train_loss = 1.64307425, grad/param norm = 3.1235e-02, time/batch = 0.1698s
923/9500 (epoch 4.858), train_loss = 1.61042687, grad/param norm = 3.0643e-02, time/batch = 0.1696s
924/9500 (epoch 4.863), train_loss = 1.63392434, grad/param norm = 3.0235e-02, time/batch = 0.1694s
925/9500 (epoch 4.868), train_loss = 1.61581607, grad/param norm = 2.8926e-02, time/batch = 0.1699s
926/9500 (epoch 4.874), train_loss = 1.60089006, grad/param norm = 2.7676e-02, time/batch = 0.1693s
927/9500 (epoch 4.879), train_loss = 1.59109243, grad/param norm = 2.8219e-02, time/batch = 0.1697s
928/9500 (epoch 4.884), train_loss = 1.62405933, grad/param norm = 2.7545e-02, time/batch = 0.1696s
929/9500 (epoch 4.889), train_loss = 1.62654609, grad/param norm = 3.0146e-02, time/batch = 0.1683s
930/9500 (epoch 4.895), train_loss = 1.60034422, grad/param norm = 2.8618e-02, time/batch = 0.1693s
931/9500 (epoch 4.900), train_loss = 1.59652648, grad/param norm = 2.8139e-02, time/batch = 0.1676s
932/9500 (epoch 4.905), train_loss = 1.64072244, grad/param norm = 2.8822e-02, time/batch = 0.1690s
933/9500 (epoch 4.911), train_loss = 1.59011396, grad/param norm = 2.8430e-02, time/batch = 0.1694s
934/9500 (epoch 4.916), train_loss = 1.60858842, grad/param norm = 2.9285e-02, time/batch = 0.1696s
935/9500 (epoch 4.921), train_loss = 1.60042659, grad/param norm = 2.9119e-02, time/batch = 0.1706s
936/9500 (epoch 4.926), train_loss = 1.63264688, grad/param norm = 3.2313e-02, time/batch = 0.1697s
937/9500 (epoch 4.932), train_loss = 1.63113268, grad/param norm = 3.3992e-02, time/batch = 0.1692s
938/9500 (epoch 4.937), train_loss = 1.65355922, grad/param norm = 3.6723e-02, time/batch = 0.1700s
939/9500 (epoch 4.942), train_loss = 1.64333221, grad/param norm = 4.1211e-02, time/batch = 0.1693s
940/9500 (epoch 4.947), train_loss = 1.66768390, grad/param norm = 3.9564e-02, time/batch = 0.1700s
941/9500 (epoch 4.953), train_loss = 1.65494865, grad/param norm = 3.2262e-02, time/batch = 0.1684s
942/9500 (epoch 4.958), train_loss = 1.66035280, grad/param norm = 3.1301e-02, time/batch = 0.1692s
943/9500 (epoch 4.963), train_loss = 1.64335900, grad/param norm = 3.1343e-02, time/batch = 0.1694s
944/9500 (epoch 4.968), train_loss = 1.61113292, grad/param norm = 2.9278e-02, time/batch = 0.1703s
945/9500 (epoch 4.974), train_loss = 1.62212906, grad/param norm = 2.7581e-02, time/batch = 0.1697s
946/9500 (epoch 4.979), train_loss = 1.63742517, grad/param norm = 2.8057e-02, time/batch = 0.1696s
947/9500 (epoch 4.984), train_loss = 1.62823877, grad/param norm = 2.8806e-02, time/batch = 0.1706s
948/9500 (epoch 4.989), train_loss = 1.64514560, grad/param norm = 2.9481e-02, time/batch = 0.1691s
949/9500 (epoch 4.995), train_loss = 1.65037564, grad/param norm = 2.7383e-02, time/batch = 0.1696s
950/9500 (epoch 5.000), train_loss = 1.66151847, grad/param norm = 2.8288e-02, time/batch = 0.1702s
951/9500 (epoch 5.005), train_loss = 1.77705599, grad/param norm = 3.1189e-02, time/batch = 0.1672s
952/9500 (epoch 5.011), train_loss = 1.64908596, grad/param norm = 3.1772e-02, time/batch = 0.1690s
953/9500 (epoch 5.016), train_loss = 1.66444999, grad/param norm = 3.1093e-02, time/batch = 0.1705s
954/9500 (epoch 5.021), train_loss = 1.62766813, grad/param norm = 2.7987e-02, time/batch = 0.1694s
955/9500 (epoch 5.026), train_loss = 1.66020779, grad/param norm = 2.9121e-02, time/batch = 0.1695s
956/9500 (epoch 5.032), train_loss = 1.65251416, grad/param norm = 3.0783e-02, time/batch = 0.1692s
957/9500 (epoch 5.037), train_loss = 1.60465912, grad/param norm = 2.8084e-02, time/batch = 0.1702s
958/9500 (epoch 5.042), train_loss = 1.61021670, grad/param norm = 2.8620e-02, time/batch = 0.1695s
959/9500 (epoch 5.047), train_loss = 1.62438451, grad/param norm = 2.7550e-02, time/batch = 0.1692s
960/9500 (epoch 5.053), train_loss = 1.63201022, grad/param norm = 2.7815e-02, time/batch = 0.1698s
961/9500 (epoch 5.058), train_loss = 1.64732937, grad/param norm = 2.7494e-02, time/batch = 0.1680s
962/9500 (epoch 5.063), train_loss = 1.59930218, grad/param norm = 2.7709e-02, time/batch = 0.1691s
963/9500 (epoch 5.068), train_loss = 1.59902252, grad/param norm = 2.9308e-02, time/batch = 0.1705s
964/9500 (epoch 5.074), train_loss = 1.63271731, grad/param norm = 2.8591e-02, time/batch = 0.1697s
965/9500 (epoch 5.079), train_loss = 1.62295301, grad/param norm = 2.9227e-02, time/batch = 0.1702s
966/9500 (epoch 5.084), train_loss = 1.67438715, grad/param norm = 2.9675e-02, time/batch = 0.1712s
967/9500 (epoch 5.089), train_loss = 1.67603414, grad/param norm = 3.2029e-02, time/batch = 0.1698s
968/9500 (epoch 5.095), train_loss = 1.61446275, grad/param norm = 3.4218e-02, time/batch = 0.1692s
969/9500 (epoch 5.100), train_loss = 1.62601659, grad/param norm = 3.2989e-02, time/batch = 0.1704s
970/9500 (epoch 5.105), train_loss = 1.67570079, grad/param norm = 3.3527e-02, time/batch = 0.1694s
971/9500 (epoch 5.111), train_loss = 1.58885965, grad/param norm = 3.1664e-02, time/batch = 0.1682s
972/9500 (epoch 5.116), train_loss = 1.62055286, grad/param norm = 3.0688e-02, time/batch = 0.1708s
973/9500 (epoch 5.121), train_loss = 1.62080072, grad/param norm = 3.0555e-02, time/batch = 0.1694s
974/9500 (epoch 5.126), train_loss = 1.63117465, grad/param norm = 2.9566e-02, time/batch = 0.1695s
975/9500 (epoch 5.132), train_loss = 1.62619622, grad/param norm = 2.9124e-02, time/batch = 0.1706s
976/9500 (epoch 5.137), train_loss = 1.65507337, grad/param norm = 2.7847e-02, time/batch = 0.1696s
977/9500 (epoch 5.142), train_loss = 1.60157510, grad/param norm = 2.7896e-02, time/batch = 0.1694s
978/9500 (epoch 5.147), train_loss = 1.60030244, grad/param norm = 3.2202e-02, time/batch = 0.1704s
979/9500 (epoch 5.153), train_loss = 1.65531791, grad/param norm = 3.5065e-02, time/batch = 0.1697s
980/9500 (epoch 5.158), train_loss = 1.63282693, grad/param norm = 3.2780e-02, time/batch = 0.1695s
981/9500 (epoch 5.163), train_loss = 1.62918163, grad/param norm = 3.1312e-02, time/batch = 0.1689s
982/9500 (epoch 5.168), train_loss = 1.59889682, grad/param norm = 2.9411e-02, time/batch = 0.1697s
983/9500 (epoch 5.174), train_loss = 1.59179234, grad/param norm = 2.8294e-02, time/batch = 0.1696s
984/9500 (epoch 5.179), train_loss = 1.67189718, grad/param norm = 2.8097e-02, time/batch = 0.1696s
985/9500 (epoch 5.184), train_loss = 1.61507277, grad/param norm = 2.8062e-02, time/batch = 0.1694s
986/9500 (epoch 5.189), train_loss = 1.63122194, grad/param norm = 2.8934e-02, time/batch = 0.1697s
987/9500 (epoch 5.195), train_loss = 1.61473113, grad/param norm = 2.7248e-02, time/batch = 0.1698s
988/9500 (epoch 5.200), train_loss = 1.62698888, grad/param norm = 2.7496e-02, time/batch = 0.1694s
989/9500 (epoch 5.205), train_loss = 1.59136685, grad/param norm = 2.8788e-02, time/batch = 0.1698s
990/9500 (epoch 5.211), train_loss = 1.62315706, grad/param norm = 2.8934e-02, time/batch = 0.1698s
991/9500 (epoch 5.216), train_loss = 1.61051164, grad/param norm = 2.8938e-02, time/batch = 0.1686s
992/9500 (epoch 5.221), train_loss = 1.62514684, grad/param norm = 2.7790e-02, time/batch = 0.1693s
993/9500 (epoch 5.226), train_loss = 1.61459715, grad/param norm = 2.8618e-02, time/batch = 0.1695s
994/9500 (epoch 5.232), train_loss = 1.63715502, grad/param norm = 2.9638e-02, time/batch = 0.1705s
995/9500 (epoch 5.237), train_loss = 1.60970640, grad/param norm = 2.9066e-02, time/batch = 0.1694s
996/9500 (epoch 5.242), train_loss = 1.62929763, grad/param norm = 3.0512e-02, time/batch = 0.1697s
997/9500 (epoch 5.247), train_loss = 1.62241317, grad/param norm = 2.8995e-02, time/batch = 0.1705s
998/9500 (epoch 5.253), train_loss = 1.61970294, grad/param norm = 2.8253e-02, time/batch = 0.1692s
999/9500 (epoch 5.258), train_loss = 1.64353000, grad/param norm = 2.8038e-02, time/batch = 0.1695s
evaluating loss over split index 2
1/10...
2/10...
3/10...
4/10...
5/10...
6/10...
7/10...
8/10...
9/10...
10/10...
saving checkpoint to cv/lm_lstm_epoch5.26_1.4981.t7
1000/9500 (epoch 5.263), train_loss = 1.58477830, grad/param norm = 2.6291e-02, time/batch = 0.1699s
1001/9500 (epoch 5.268), train_loss = 1.71356881, grad/param norm = 2.7578e-02, time/batch = 0.1680s
1002/9500 (epoch 5.274), train_loss = 1.61021095, grad/param norm = 2.6320e-02, time/batch = 0.1696s
1003/9500 (epoch 5.279), train_loss = 1.63006354, grad/param norm = 2.6899e-02, time/batch = 0.1698s
1004/9500 (epoch 5.284), train_loss = 1.63025910, grad/param norm = 2.8030e-02, time/batch = 0.1693s
1005/9500 (epoch 5.289), train_loss = 1.58297621, grad/param norm = 3.1094e-02, time/batch = 0.1704s
1006/9500 (epoch 5.295), train_loss = 1.60252831, grad/param norm = 3.1455e-02, time/batch = 0.1692s
1007/9500 (epoch 5.300), train_loss = 1.64592949, grad/param norm = 3.1501e-02, time/batch = 0.1691s
1008/9500 (epoch 5.305), train_loss = 1.63222276, grad/param norm = 2.9764e-02, time/batch = 0.1702s
1009/9500 (epoch 5.311), train_loss = 1.59896771, grad/param norm = 3.0159e-02, time/batch = 0.1694s
1010/9500 (epoch 5.316), train_loss = 1.61894051, grad/param norm = 3.1067e-02, time/batch = 0.1698s
1011/9500 (epoch 5.321), train_loss = 1.62694973, grad/param norm = 3.1807e-02, time/batch = 0.1679s
1012/9500 (epoch 5.326), train_loss = 1.62513082, grad/param norm = 3.1709e-02, time/batch = 0.1683s
1013/9500 (epoch 5.332), train_loss = 1.62816175, grad/param norm = 3.1928e-02, time/batch = 0.1693s
1014/9500 (epoch 5.337), train_loss = 1.61013783, grad/param norm = 3.0056e-02, time/batch = 0.1696s
1015/9500 (epoch 5.342), train_loss = 1.60113346, grad/param norm = 2.8828e-02, time/batch = 0.1702s
1016/9500 (epoch 5.347), train_loss = 1.58151573, grad/param norm = 2.9244e-02, time/batch = 0.1693s
1017/9500 (epoch 5.353), train_loss = 1.61923838, grad/param norm = 2.9916e-02, time/batch = 0.1691s
1018/9500 (epoch 5.358), train_loss = 1.57432884, grad/param norm = 2.8124e-02, time/batch = 0.1703s
1019/9500 (epoch 5.363), train_loss = 1.57148386, grad/param norm = 2.9131e-02, time/batch = 0.1693s
1020/9500 (epoch 5.368), train_loss = 1.59019744, grad/param norm = 3.2256e-02, time/batch = 0.1697s
1021/9500 (epoch 5.374), train_loss = 1.60132719, grad/param norm = 3.2881e-02, time/batch = 0.1683s
1022/9500 (epoch 5.379), train_loss = 1.61264314, grad/param norm = 3.2418e-02, time/batch = 0.1698s
1023/9500 (epoch 5.384), train_loss = 1.60490526, grad/param norm = 3.1095e-02, time/batch = 0.1713s
1024/9500 (epoch 5.389), train_loss = 1.61464143, grad/param norm = 2.8774e-02, time/batch = 0.1716s
1025/9500 (epoch 5.395), train_loss = 1.60085696, grad/param norm = 2.6535e-02, time/batch = 0.1706s
1026/9500 (epoch 5.400), train_loss = 1.62712472, grad/param norm = 2.7542e-02, time/batch = 0.1706s
1027/9500 (epoch 5.405), train_loss = 1.63245245, grad/param norm = 2.7683e-02, time/batch = 0.1716s
1028/9500 (epoch 5.411), train_loss = 1.62059371, grad/param norm = 2.6345e-02, time/batch = 0.1710s
1029/9500 (epoch 5.416), train_loss = 1.62355596, grad/param norm = 2.7788e-02, time/batch = 0.1710s
1030/9500 (epoch 5.421), train_loss = 1.63999403, grad/param norm = 2.6580e-02, time/batch = 0.1718s
1031/9500 (epoch 5.426), train_loss = 1.60270127, grad/param norm = 2.7163e-02, time/batch = 0.1678s
1032/9500 (epoch 5.432), train_loss = 1.58458433, grad/param norm = 2.8376e-02, time/batch = 0.1713s
1033/9500 (epoch 5.437), train_loss = 1.62882029, grad/param norm = 2.9634e-02, time/batch = 0.1719s
1034/9500 (epoch 5.442), train_loss = 1.62154590, grad/param norm = 2.7524e-02, time/batch = 0.1711s
1035/9500 (epoch 5.447), train_loss = 1.59972684, grad/param norm = 2.8254e-02, time/batch = 0.1708s
1036/9500 (epoch 5.453), train_loss = 1.63209802, grad/param norm = 2.8794e-02, time/batch = 0.1718s
1037/9500 (epoch 5.458), train_loss = 1.60023825, grad/param norm = 2.7858e-02, time/batch = 0.1709s
1038/9500 (epoch 5.463), train_loss = 1.62562241, grad/param norm = 2.6270e-02, time/batch = 0.1712s
1039/9500 (epoch 5.468), train_loss = 1.60725322, grad/param norm = 2.6803e-02, time/batch = 0.1708s
1040/9500 (epoch 5.474), train_loss = 1.61762009, grad/param norm = 2.6867e-02, time/batch = 0.1709s
1041/9500 (epoch 5.479), train_loss = 1.59191772, grad/param norm = 2.6272e-02, time/batch = 0.1680s
1042/9500 (epoch 5.484), train_loss = 1.60194592, grad/param norm = 2.8042e-02, time/batch = 0.1713s
1043/9500 (epoch 5.489), train_loss = 1.61496329, grad/param norm = 2.7836e-02, time/batch = 0.1714s
1044/9500 (epoch 5.495), train_loss = 1.62394652, grad/param norm = 2.7476e-02, time/batch = 0.1713s
1045/9500 (epoch 5.500), train_loss = 1.60691760, grad/param norm = 2.6228e-02, time/batch = 0.1712s
1046/9500 (epoch 5.505), train_loss = 1.57751009, grad/param norm = 2.8069e-02, time/batch = 0.1710s
1047/9500 (epoch 5.511), train_loss = 1.63151429, grad/param norm = 2.9989e-02, time/batch = 0.1712s
1048/9500 (epoch 5.516), train_loss = 1.61945364, grad/param norm = 2.7706e-02, time/batch = 0.1711s
1049/9500 (epoch 5.521), train_loss = 1.61136587, grad/param norm = 2.6306e-02, time/batch = 0.1717s
1050/9500 (epoch 5.526), train_loss = 1.61016889, grad/param norm = 2.6714e-02, time/batch = 0.1710s
1051/9500 (epoch 5.532), train_loss = 1.64574303, grad/param norm = 2.7422e-02, time/batch = 0.1678s
1052/9500 (epoch 5.537), train_loss = 1.62995796, grad/param norm = 2.8214e-02, time/batch = 0.1714s
1053/9500 (epoch 5.542), train_loss = 1.60001548, grad/param norm = 3.1045e-02, time/batch = 0.1716s
1054/9500 (epoch 5.547), train_loss = 1.61024321, grad/param norm = 3.1720e-02, time/batch = 0.1716s
1055/9500 (epoch 5.553), train_loss = 1.58601625, grad/param norm = 3.0616e-02, time/batch = 0.1718s
1056/9500 (epoch 5.558), train_loss = 1.59756952, grad/param norm = 2.8764e-02, time/batch = 0.1711s
1057/9500 (epoch 5.563), train_loss = 1.58861377, grad/param norm = 3.0921e-02, time/batch = 0.1709s
1058/9500 (epoch 5.568), train_loss = 1.61496736, grad/param norm = 3.4332e-02, time/batch = 0.1720s
1059/9500 (epoch 5.574), train_loss = 1.62116094, grad/param norm = 3.2935e-02, time/batch = 0.1715s
1060/9500 (epoch 5.579), train_loss = 1.58575966, grad/param norm = 3.1127e-02, time/batch = 0.1710s
1061/9500 (epoch 5.584), train_loss = 1.60926470, grad/param norm = 2.7237e-02, time/batch = 0.1685s
1062/9500 (epoch 5.589), train_loss = 1.56545387, grad/param norm = 2.4923e-02, time/batch = 0.1713s
1063/9500 (epoch 5.595), train_loss = 1.57381443, grad/param norm = 2.6798e-02, time/batch = 0.1709s
1064/9500 (epoch 5.600), train_loss = 1.58619361, grad/param norm = 2.8047e-02, time/batch = 0.1717s
1065/9500 (epoch 5.605), train_loss = 1.59352496, grad/param norm = 2.9392e-02, time/batch = 0.1710s
1066/9500 (epoch 5.611), train_loss = 1.60184913, grad/param norm = 2.9223e-02, time/batch = 0.1715s
1067/9500 (epoch 5.616), train_loss = 1.59964491, grad/param norm = 2.9870e-02, time/batch = 0.1717s
1068/9500 (epoch 5.621), train_loss = 1.60653246, grad/param norm = 2.8714e-02, time/batch = 0.1717s
1069/9500 (epoch 5.626), train_loss = 1.56526207, grad/param norm = 2.5508e-02, time/batch = 0.1706s
1070/9500 (epoch 5.632), train_loss = 1.61220764, grad/param norm = 2.7056e-02, time/batch = 0.1712s
1071/9500 (epoch 5.637), train_loss = 1.56570190, grad/param norm = 2.6016e-02, time/batch = 0.1687s
1072/9500 (epoch 5.642), train_loss = 1.60773491, grad/param norm = 2.7178e-02, time/batch = 0.1707s
1073/9500 (epoch 5.647), train_loss = 1.60162445, grad/param norm = 2.6090e-02, time/batch = 0.1710s
1074/9500 (epoch 5.653), train_loss = 1.60375334, grad/param norm = 2.6196e-02, time/batch = 0.1717s
1075/9500 (epoch 5.658), train_loss = 1.57944025, grad/param norm = 2.7137e-02, time/batch = 0.1712s
1076/9500 (epoch 5.663), train_loss = 1.58492365, grad/param norm = 2.9692e-02, time/batch = 0.1709s
1077/9500 (epoch 5.668), train_loss = 1.57500605, grad/param norm = 2.6682e-02, time/batch = 0.1714s
1078/9500 (epoch 5.674), train_loss = 1.61110756, grad/param norm = 2.5913e-02, time/batch = 0.1708s
1079/9500 (epoch 5.679), train_loss = 1.60189772, grad/param norm = 2.7624e-02, time/batch = 0.1708s
1080/9500 (epoch 5.684), train_loss = 1.60074795, grad/param norm = 2.8561e-02, time/batch = 0.1714s
1081/9500 (epoch 5.689), train_loss = 1.60558249, grad/param norm = 2.9131e-02, time/batch = 0.1679s
1082/9500 (epoch 5.695), train_loss = 1.60236510, grad/param norm = 2.9468e-02, time/batch = 0.1708s
1083/9500 (epoch 5.700), train_loss = 1.60292503, grad/param norm = 2.9190e-02, time/batch = 0.1714s
1084/9500 (epoch 5.705), train_loss = 1.63112989, grad/param norm = 2.9433e-02, time/batch = 0.1704s
1085/9500 (epoch 5.711), train_loss = 1.64094972, grad/param norm = 2.9567e-02, time/batch = 0.1711s
1086/9500 (epoch 5.716), train_loss = 1.59924020, grad/param norm = 2.5955e-02, time/batch = 0.1722s
1087/9500 (epoch 5.721), train_loss = 1.54466332, grad/param norm = 2.5541e-02, time/batch = 0.1709s
1088/9500 (epoch 5.726), train_loss = 1.58856917, grad/param norm = 2.6235e-02, time/batch = 0.1710s
1089/9500 (epoch 5.732), train_loss = 1.58076847, grad/param norm = 2.7265e-02, time/batch = 0.1717s
1090/9500 (epoch 5.737), train_loss = 1.61948328, grad/param norm = 2.7359e-02, time/batch = 0.1710s
1091/9500 (epoch 5.742), train_loss = 1.57474860, grad/param norm = 2.6192e-02, time/batch = 0.1678s
1092/9500 (epoch 5.747), train_loss = 1.56139510, grad/param norm = 2.6241e-02, time/batch = 0.1718s
1093/9500 (epoch 5.753), train_loss = 1.58093657, grad/param norm = 2.6137e-02, time/batch = 0.1707s
1094/9500 (epoch 5.758), train_loss = 1.59754069, grad/param norm = 2.7335e-02, time/batch = 0.1712s
1095/9500 (epoch 5.763), train_loss = 1.56909133, grad/param norm = 2.6058e-02, time/batch = 0.1707s
1096/9500 (epoch 5.768), train_loss = 1.58684452, grad/param norm = 2.5606e-02, time/batch = 0.1708s
1097/9500 (epoch 5.774), train_loss = 1.58023751, grad/param norm = 3.1555e-02, time/batch = 0.1714s
1098/9500 (epoch 5.779), train_loss = 1.60079907, grad/param norm = 2.9827e-02, time/batch = 0.1712s
1099/9500 (epoch 5.784), train_loss = 1.59114204, grad/param norm = 2.8857e-02, time/batch = 0.1707s
1100/9500 (epoch 5.789), train_loss = 1.58133795, grad/param norm = 2.7795e-02, time/batch = 0.1712s
1101/9500 (epoch 5.795), train_loss = 1.57104543, grad/param norm = 2.7598e-02, time/batch = 0.1680s
1102/9500 (epoch 5.800), train_loss = 1.59475090, grad/param norm = 2.7435e-02, time/batch = 0.1717s
1103/9500 (epoch 5.805), train_loss = 1.54757036, grad/param norm = 2.4524e-02, time/batch = 0.1707s
1104/9500 (epoch 5.811), train_loss = 1.55600042, grad/param norm = 2.5618e-02, time/batch = 0.1708s
1105/9500 (epoch 5.816), train_loss = 1.54878673, grad/param norm = 2.4830e-02, time/batch = 0.1710s
1106/9500 (epoch 5.821), train_loss = 1.54537115, grad/param norm = 2.5483e-02, time/batch = 0.1708s
1107/9500 (epoch 5.826), train_loss = 1.57810329, grad/param norm = 2.6457e-02, time/batch = 0.1710s
1108/9500 (epoch 5.832), train_loss = 1.57032775, grad/param norm = 2.6506e-02, time/batch = 0.1716s
1109/9500 (epoch 5.837), train_loss = 1.57145359, grad/param norm = 2.8057e-02, time/batch = 0.1711s
1110/9500 (epoch 5.842), train_loss = 1.58273715, grad/param norm = 2.8427e-02, time/batch = 0.1707s
1111/9500 (epoch 5.847), train_loss = 1.58339036, grad/param norm = 2.7252e-02, time/batch = 0.1686s
1112/9500 (epoch 5.853), train_loss = 1.59040709, grad/param norm = 2.6443e-02, time/batch = 0.1714s
1113/9500 (epoch 5.858), train_loss = 1.55637665, grad/param norm = 2.5121e-02, time/batch = 0.1712s
1114/9500 (epoch 5.863), train_loss = 1.58591290, grad/param norm = 2.5833e-02, time/batch = 0.1721s
1115/9500 (epoch 5.868), train_loss = 1.56070707, grad/param norm = 2.7682e-02, time/batch = 0.1707s
1116/9500 (epoch 5.874), train_loss = 1.55226213, grad/param norm = 3.0444e-02, time/batch = 0.1707s
1117/9500 (epoch 5.879), train_loss = 1.54949378, grad/param norm = 2.8687e-02, time/batch = 0.1717s
1118/9500 (epoch 5.884), train_loss = 1.56988122, grad/param norm = 2.6302e-02, time/batch = 0.1710s
1119/9500 (epoch 5.889), train_loss = 1.58171265, grad/param norm = 2.6720e-02, time/batch = 0.1710s
1120/9500 (epoch 5.895), train_loss = 1.53821108, grad/param norm = 2.5877e-02, time/batch = 0.1717s
1121/9500 (epoch 5.900), train_loss = 1.53973559, grad/param norm = 2.5799e-02, time/batch = 0.1679s
1122/9500 (epoch 5.905), train_loss = 1.59484873, grad/param norm = 2.6945e-02, time/batch = 0.1714s
1123/9500 (epoch 5.911), train_loss = 1.53801155, grad/param norm = 2.5474e-02, time/batch = 0.1711s
1124/9500 (epoch 5.916), train_loss = 1.55928077, grad/param norm = 2.6353e-02, time/batch = 0.1700s
1125/9500 (epoch 5.921), train_loss = 1.55370684, grad/param norm = 2.5800e-02, time/batch = 0.1710s
1126/9500 (epoch 5.926), train_loss = 1.58016362, grad/param norm = 2.6609e-02, time/batch = 0.1707s
1127/9500 (epoch 5.932), train_loss = 1.56180913, grad/param norm = 2.5708e-02, time/batch = 0.1720s
1128/9500 (epoch 5.937), train_loss = 1.59221177, grad/param norm = 2.6835e-02, time/batch = 0.1710s
1129/9500 (epoch 5.942), train_loss = 1.57146146, grad/param norm = 2.8855e-02, time/batch = 0.1706s
1130/9500 (epoch 5.947), train_loss = 1.58948232, grad/param norm = 2.7271e-02, time/batch = 0.1714s
1131/9500 (epoch 5.953), train_loss = 1.59540340, grad/param norm = 2.4874e-02, time/batch = 0.1679s
1132/9500 (epoch 5.958), train_loss = 1.59555402, grad/param norm = 2.5575e-02, time/batch = 0.1707s
1133/9500 (epoch 5.963), train_loss = 1.57336476, grad/param norm = 2.5382e-02, time/batch = 0.1714s
1134/9500 (epoch 5.968), train_loss = 1.56566529, grad/param norm = 2.5334e-02, time/batch = 0.1711s
1135/9500 (epoch 5.974), train_loss = 1.56498470, grad/param norm = 2.4828e-02, time/batch = 0.1710s
1136/9500 (epoch 5.979), train_loss = 1.58861147, grad/param norm = 2.7582e-02, time/batch = 0.1718s
1137/9500 (epoch 5.984), train_loss = 1.57496490, grad/param norm = 2.8181e-02, time/batch = 0.1711s
1138/9500 (epoch 5.989), train_loss = 1.59986017, grad/param norm = 2.8200e-02, time/batch = 0.1713s
1139/9500 (epoch 5.995), train_loss = 1.59599957, grad/param norm = 2.8425e-02, time/batch = 0.1720s
1140/9500 (epoch 6.000), train_loss = 1.60858932, grad/param norm = 2.7920e-02, time/batch = 0.1715s
1141/9500 (epoch 6.005), train_loss = 1.73984857, grad/param norm = 3.0110e-02, time/batch = 0.1679s
1142/9500 (epoch 6.011), train_loss = 1.58995754, grad/param norm = 3.1278e-02, time/batch = 0.1718s
1143/9500 (epoch 6.016), train_loss = 1.63107630, grad/param norm = 2.9110e-02, time/batch = 0.1709s
1144/9500 (epoch 6.021), train_loss = 1.56470308, grad/param norm = 2.6084e-02, time/batch = 0.1710s
1145/9500 (epoch 6.026), train_loss = 1.61440370, grad/param norm = 2.6398e-02, time/batch = 0.1717s
1146/9500 (epoch 6.032), train_loss = 1.59324164, grad/param norm = 2.7083e-02, time/batch = 0.1707s
1147/9500 (epoch 6.037), train_loss = 1.55868971, grad/param norm = 2.4698e-02, time/batch = 0.1708s
1148/9500 (epoch 6.042), train_loss = 1.56878395, grad/param norm = 2.5445e-02, time/batch = 0.1716s
1149/9500 (epoch 6.047), train_loss = 1.57619318, grad/param norm = 2.5382e-02, time/batch = 0.1706s
1150/9500 (epoch 6.053), train_loss = 1.58639514, grad/param norm = 2.6805e-02, time/batch = 0.1704s
1151/9500 (epoch 6.058), train_loss = 1.60905317, grad/param norm = 2.7297e-02, time/batch = 0.1693s
1152/9500 (epoch 6.063), train_loss = 1.54873623, grad/param norm = 2.5672e-02, time/batch = 0.1710s
1153/9500 (epoch 6.068), train_loss = 1.55265088, grad/param norm = 2.6312e-02, time/batch = 0.1710s
1154/9500 (epoch 6.074), train_loss = 1.57865137, grad/param norm = 2.8123e-02, time/batch = 0.1714s
1155/9500 (epoch 6.079), train_loss = 1.56338261, grad/param norm = 2.4877e-02, time/batch = 0.1714s
1156/9500 (epoch 6.084), train_loss = 1.62495264, grad/param norm = 2.5598e-02, time/batch = 0.1707s
1157/9500 (epoch 6.089), train_loss = 1.61622963, grad/param norm = 2.7610e-02, time/batch = 0.1712s
1158/9500 (epoch 6.095), train_loss = 1.56827593, grad/param norm = 2.8256e-02, time/batch = 0.1705s
1159/9500 (epoch 6.100), train_loss = 1.58569304, grad/param norm = 2.5937e-02, time/batch = 0.1713s
1160/9500 (epoch 6.105), train_loss = 1.61280133, grad/param norm = 2.6153e-02, time/batch = 0.1711s
1161/9500 (epoch 6.111), train_loss = 1.53331320, grad/param norm = 2.4454e-02, time/batch = 0.1687s
1162/9500 (epoch 6.116), train_loss = 1.55069329, grad/param norm = 2.5777e-02, time/batch = 0.1708s
1163/9500 (epoch 6.121), train_loss = 1.55052384, grad/param norm = 2.6012e-02, time/batch = 0.1709s
1164/9500 (epoch 6.126), train_loss = 1.57263940, grad/param norm = 2.6088e-02, time/batch = 0.1708s
1165/9500 (epoch 6.132), train_loss = 1.58153190, grad/param norm = 2.6584e-02, time/batch = 0.1708s
1166/9500 (epoch 6.137), train_loss = 1.59831739, grad/param norm = 2.5384e-02, time/batch = 0.1706s
1167/9500 (epoch 6.142), train_loss = 1.56047926, grad/param norm = 2.4924e-02, time/batch = 0.1726s
1168/9500 (epoch 6.147), train_loss = 1.54751500, grad/param norm = 2.6674e-02, time/batch = 0.1706s
1169/9500 (epoch 6.153), train_loss = 1.59067149, grad/param norm = 2.6495e-02, time/batch = 0.1713s
1170/9500 (epoch 6.158), train_loss = 1.58233874, grad/param norm = 2.4550e-02, time/batch = 0.1716s
1171/9500 (epoch 6.163), train_loss = 1.58411314, grad/param norm = 2.7291e-02, time/batch = 0.1681s
1172/9500 (epoch 6.168), train_loss = 1.55041063, grad/param norm = 2.7612e-02, time/batch = 0.1712s
1173/9500 (epoch 6.174), train_loss = 1.56140073, grad/param norm = 2.9202e-02, time/batch = 0.1714s
1174/9500 (epoch 6.179), train_loss = 1.62295443, grad/param norm = 2.9280e-02, time/batch = 0.1710s
1175/9500 (epoch 6.184), train_loss = 1.56843423, grad/param norm = 2.9110e-02, time/batch = 0.1709s
1176/9500 (epoch 6.189), train_loss = 1.60055677, grad/param norm = 2.9198e-02, time/batch = 0.1717s
1177/9500 (epoch 6.195), train_loss = 1.57765967, grad/param norm = 2.7748e-02, time/batch = 0.1709s
1178/9500 (epoch 6.200), train_loss = 1.58147513, grad/param norm = 2.7744e-02, time/batch = 0.1708s
1179/9500 (epoch 6.205), train_loss = 1.54784093, grad/param norm = 2.8430e-02, time/batch = 0.1710s
1180/9500 (epoch 6.211), train_loss = 1.58361659, grad/param norm = 2.8003e-02, time/batch = 0.1720s
1181/9500 (epoch 6.216), train_loss = 1.55975837, grad/param norm = 2.7251e-02, time/batch = 0.1677s
1182/9500 (epoch 6.221), train_loss = 1.58656626, grad/param norm = 2.5014e-02, time/batch = 0.1709s
1183/9500 (epoch 6.226), train_loss = 1.56164750, grad/param norm = 2.4552e-02, time/batch = 0.1716s
1184/9500 (epoch 6.232), train_loss = 1.57823615, grad/param norm = 2.5413e-02, time/batch = 0.1712s
1185/9500 (epoch 6.237), train_loss = 1.56140816, grad/param norm = 2.4587e-02, time/batch = 0.1711s
1186/9500 (epoch 6.242), train_loss = 1.58552214, grad/param norm = 2.5522e-02, time/batch = 0.1717s
1187/9500 (epoch 6.247), train_loss = 1.57030070, grad/param norm = 2.5697e-02, time/batch = 0.1713s
1188/9500 (epoch 6.253), train_loss = 1.56635670, grad/param norm = 2.4729e-02, time/batch = 0.1711s
1189/9500 (epoch 6.258), train_loss = 1.57043375, grad/param norm = 2.4269e-02, time/batch = 0.1717s
1190/9500 (epoch 6.263), train_loss = 1.53808373, grad/param norm = 2.4381e-02, time/batch = 0.1707s
1191/9500 (epoch 6.268), train_loss = 1.53994649, grad/param norm = 2.4872e-02, time/batch = 0.1683s
1192/9500 (epoch 6.274), train_loss = 1.55982841, grad/param norm = 2.4594e-02, time/batch = 0.1717s
1193/9500 (epoch 6.279), train_loss = 1.57214562, grad/param norm = 2.5002e-02, time/batch = 0.1709s
1194/9500 (epoch 6.284), train_loss = 1.56773368, grad/param norm = 2.5845e-02, time/batch = 0.1712s
1195/9500 (epoch 6.289), train_loss = 1.54561214, grad/param norm = 2.9292e-02, time/batch = 0.1714s
1196/9500 (epoch 6.295), train_loss = 1.55530141, grad/param norm = 2.8230e-02, time/batch = 0.1710s
1197/9500 (epoch 6.300), train_loss = 1.58661561, grad/param norm = 2.4611e-02, time/batch = 0.1712s
1198/9500 (epoch 6.305), train_loss = 1.57494878, grad/param norm = 2.4050e-02, time/batch = 0.1718s
1199/9500 (epoch 6.311), train_loss = 1.54837640, grad/param norm = 2.4839e-02, time/batch = 0.1708s
1200/9500 (epoch 6.316), train_loss = 1.57026181, grad/param norm = 2.6232e-02, time/batch = 0.1710s
1201/9500 (epoch 6.321), train_loss = 1.58378760, grad/param norm = 2.5740e-02, time/batch = 0.1687s
1202/9500 (epoch 6.326), train_loss = 1.57102396, grad/param norm = 2.5527e-02, time/batch = 0.1711s
1203/9500 (epoch 6.332), train_loss = 1.57766544, grad/param norm = 2.5115e-02, time/batch = 0.1713s
1204/9500 (epoch 6.337), train_loss = 1.56476935, grad/param norm = 2.5721e-02, time/batch = 0.1711s
1205/9500 (epoch 6.342), train_loss = 1.55041074, grad/param norm = 2.5041e-02, time/batch = 0.1706s
1206/9500 (epoch 6.347), train_loss = 1.55436800, grad/param norm = 2.5340e-02, time/batch = 0.1707s
1207/9500 (epoch 6.353), train_loss = 1.58356120, grad/param norm = 2.7574e-02, time/batch = 0.1712s
1208/9500 (epoch 6.358), train_loss = 1.54088088, grad/param norm = 2.5341e-02, time/batch = 0.1709s
1209/9500 (epoch 6.363), train_loss = 1.52250770, grad/param norm = 2.6208e-02, time/batch = 0.1708s
1210/9500 (epoch 6.368), train_loss = 1.55153673, grad/param norm = 2.9915e-02, time/batch = 0.1713s
1211/9500 (epoch 6.374), train_loss = 1.56547657, grad/param norm = 3.1736e-02, time/batch = 0.1680s
1212/9500 (epoch 6.379), train_loss = 1.58293375, grad/param norm = 3.1457e-02, time/batch = 0.1714s
1213/9500 (epoch 6.384), train_loss = 1.55846135, grad/param norm = 2.8564e-02, time/batch = 0.1710s
1214/9500 (epoch 6.389), train_loss = 1.57061131, grad/param norm = 2.6531e-02, time/batch = 0.1718s
1215/9500 (epoch 6.395), train_loss = 1.55018004, grad/param norm = 2.4736e-02, time/batch = 0.1712s
1216/9500 (epoch 6.400), train_loss = 1.59280874, grad/param norm = 2.4181e-02, time/batch = 0.1709s
1217/9500 (epoch 6.405), train_loss = 1.58309348, grad/param norm = 2.4618e-02, time/batch = 0.1712s
1218/9500 (epoch 6.411), train_loss = 1.57834624, grad/param norm = 2.3766e-02, time/batch = 0.1710s
1219/9500 (epoch 6.416), train_loss = 1.56666605, grad/param norm = 2.4147e-02, time/batch = 0.1710s
1220/9500 (epoch 6.421), train_loss = 1.59191423, grad/param norm = 2.3964e-02, time/batch = 0.1717s
1221/9500 (epoch 6.426), train_loss = 1.55453262, grad/param norm = 2.4590e-02, time/batch = 0.1679s
1222/9500 (epoch 6.432), train_loss = 1.54508754, grad/param norm = 2.6057e-02, time/batch = 0.1708s
1223/9500 (epoch 6.437), train_loss = 1.57658995, grad/param norm = 2.6704e-02, time/batch = 0.1721s
1224/9500 (epoch 6.442), train_loss = 1.59198083, grad/param norm = 2.6873e-02, time/batch = 0.1710s
1225/9500 (epoch 6.447), train_loss = 1.55912389, grad/param norm = 2.6679e-02, time/batch = 0.1711s
1226/9500 (epoch 6.453), train_loss = 1.58403724, grad/param norm = 2.6577e-02, time/batch = 0.1713s
1227/9500 (epoch 6.458), train_loss = 1.55439860, grad/param norm = 2.5479e-02, time/batch = 0.1708s
1228/9500 (epoch 6.463), train_loss = 1.58069977, grad/param norm = 2.5863e-02, time/batch = 0.1715s
1229/9500 (epoch 6.468), train_loss = 1.56863065, grad/param norm = 2.6022e-02, time/batch = 0.1710s
1230/9500 (epoch 6.474), train_loss = 1.56931305, grad/param norm = 2.6948e-02, time/batch = 0.1709s
1231/9500 (epoch 6.479), train_loss = 1.55011195, grad/param norm = 2.6195e-02, time/batch = 0.1678s
1232/9500 (epoch 6.484), train_loss = 1.57551002, grad/param norm = 2.6773e-02, time/batch = 0.1718s
1233/9500 (epoch 6.489), train_loss = 1.58969910, grad/param norm = 2.5897e-02, time/batch = 0.1710s
1234/9500 (epoch 6.495), train_loss = 1.58992211, grad/param norm = 2.6129e-02, time/batch = 0.1711s
1235/9500 (epoch 6.500), train_loss = 1.55782776, grad/param norm = 2.4416e-02, time/batch = 0.1713s
1236/9500 (epoch 6.505), train_loss = 1.54011432, grad/param norm = 2.5565e-02, time/batch = 0.1716s
1237/9500 (epoch 6.511), train_loss = 1.58537027, grad/param norm = 2.5914e-02, time/batch = 0.1713s
1238/9500 (epoch 6.516), train_loss = 1.56039585, grad/param norm = 2.4451e-02, time/batch = 0.1709s
1239/9500 (epoch 6.521), train_loss = 1.56065989, grad/param norm = 2.3829e-02, time/batch = 0.1716s
1240/9500 (epoch 6.526), train_loss = 1.55923576, grad/param norm = 2.2959e-02, time/batch = 0.1714s
1241/9500 (epoch 6.532), train_loss = 1.59549124, grad/param norm = 2.4361e-02, time/batch = 0.1679s
1242/9500 (epoch 6.537), train_loss = 1.58152433, grad/param norm = 2.5870e-02, time/batch = 0.1721s
1243/9500 (epoch 6.542), train_loss = 1.55773341, grad/param norm = 2.4856e-02, time/batch = 0.1710s
1244/9500 (epoch 6.547), train_loss = 1.57046490, grad/param norm = 2.4631e-02, time/batch = 0.1713s
1245/9500 (epoch 6.553), train_loss = 1.53541806, grad/param norm = 2.4299e-02, time/batch = 0.1721s
1246/9500 (epoch 6.558), train_loss = 1.54894270, grad/param norm = 2.3665e-02, time/batch = 0.1711s
1247/9500 (epoch 6.563), train_loss = 1.53760676, grad/param norm = 2.3848e-02, time/batch = 0.1710s
1248/9500 (epoch 6.568), train_loss = 1.55027062, grad/param norm = 2.4383e-02, time/batch = 0.1714s
1249/9500 (epoch 6.574), train_loss = nan, grad/param norm = 4.4345e+01, time/batch = 0.1712s
loss is NaN. This usually indicates a bug. Please check the issues page for existing issues, or create a new issue, if none exist. Ideally, please state: your operating system, 32-bit/64-bit, your blas version, cpu/cuda/cl?
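
For anyone triaging this: in the log above the grad/param norm sits around 2.4e-02 for many iterations and then jumps to 4.4345e+01 on iteration 1249, the same step where train_loss becomes nan. Below is a minimal sketch for locating that point in a saved log. It assumes the exact line format shown above; the names `parse_line` and `find_first_nan` and the regular expression are illustrative helpers, not part of train.lua.

```python
import math
import re

# Assumed line format (copied from the log above):
# 1249/9500 (epoch 6.574), train_loss = nan, grad/param norm = 4.4345e+01, time/batch = 0.1712s
LINE_RE = re.compile(
    r"^(\d+)/(\d+) \(epoch ([\d.]+)\), train_loss = (\S+), "
    r"grad/param norm = (\S+), time/batch = ([\d.]+)s$"
)


def parse_line(line):
    """Return a dict for a training-progress line, or None for any other line."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "iter": int(m.group(1)),
        "epoch": float(m.group(3)),
        "loss": float(m.group(4)),       # float() accepts the string "nan"
        "grad_norm": float(m.group(5)),
    }


def find_first_nan(lines):
    """Return (first record with NaN loss, last record before it).

    The first element is None if no NaN loss is found.
    """
    prev = None
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue  # skip checkpoint/evaluation messages and other text
        if math.isnan(rec["loss"]):
            return rec, prev
        prev = rec
    return None, prev
```

Calling `find_first_nan(open("train.log"))` on a saved copy of the output (the file name is hypothetical) returns the failing record together with the last healthy one, so the two grad/param norms can be compared directly. Here the jump is about three orders of magnitude within a single step.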
