Merge pull request #98 from YueZhengMeng/master
Added the formula derivation for Exercise 3.4.1, Question 2, Part 2: matching the variance of the softmax distribution with the second derivative
KMnO4-zx authored Jun 26, 2024
2 parents a638f73 + f982aed commit 9dfc91c
Showing 6 changed files with 7,091 additions and 39 deletions.
29 changes: 26 additions & 3 deletions docs/ch03/ch03.md
@@ -691,12 +691,35 @@ $$
$$
\begin{aligned}
\mathrm{Var}_{\mathrm{softmax}(\mathbf{o})}
- &= \sum_{j=1}^q (\mathrm{softmax}(\mathbf{o})_j - E[\mathrm{softmax}(\mathbf{o})_j])^2 \\
- &= \sum_{j=1}^q (\mathrm{softmax}(\mathbf{o})_j - \frac{1}{q}\sum_{k=1}^q \mathrm{softmax}(\mathbf{o})_k)^2 \\
- &= \sum_{j=1}^q (\mathrm{softmax}(\mathbf{o})_j - \frac{1}{q})^2 \\
+ &= \frac{1}{q} \sum_{j=1}^q (\mathrm{softmax}(\mathbf{o})_j - E[\mathrm{softmax}(\mathbf{o})_j])^2 \\
+ &= \frac{1}{q} \sum_{j=1}^q (\mathrm{softmax}(\mathbf{o})_j - \frac{1}{q}\sum_{k=1}^q \mathrm{softmax}(\mathbf{o})_k)^2 \\
+ &= \frac{1}{q} \sum_{j=1}^q (\mathrm{softmax}(\mathbf{o})_j - \frac{1}{q})^2 \\
\end{aligned}
$$

  Expanding, and using $\sum_{j=1}^q \mathrm{softmax}(\mathbf{o})_j = 1$:
$$
\begin{aligned}
\mathrm{Var}_{\mathrm{softmax}(\mathbf{o})}
&= \frac{1}{q} \sum_{j=1}^q (\mathrm{softmax}(\mathbf{o})_j - \frac{1}{q})^2 \\
&=\frac{1}{q}\left[(\mathrm{softmax}(\boldsymbol{o})_1-\frac{1}{q})^2+(\mathrm{softmax}(\boldsymbol{o})_2-\frac{1}{q})^2+\dots+(\mathrm{softmax}(\boldsymbol{o})_q-\frac{1}{q})^2\right]\\
&=\frac{1}{q}(\sum^q_{j=1}\mathrm{softmax}^2(\boldsymbol{o})_j-\frac{2}{q}\sum^q_{j=1}\mathrm{softmax}(\boldsymbol{o})_j+\sum_{j=1}^q \frac{1}{q^2})\\
&=\frac{1}{q}(\sum^q_{j=1}\mathrm{softmax}^2(\boldsymbol{o})_j -\frac{2}{q} +\frac{1}{q})\\
&=\frac{1}{q}\sum^q_{j=1}\mathrm{softmax}^2(\boldsymbol{o})_j -\frac{1}{q^2}
\end{aligned}
$$

  Matching with the second derivative, where $\partial_{o_j}^2 l(\mathbf{y}, \hat{\mathbf{y}}) = \mathrm{softmax}(\mathbf{o})_j - \mathrm{softmax}^2(\mathbf{o})_j$:
$$
\begin{aligned}
\mathrm{Var}_{\mathrm{softmax}(\mathbf{o})}&=\frac{1}{q}\sum^q_{j=1}\mathrm{softmax}^2(\boldsymbol{o})_j -\frac{1}{q^2}\\
&=-\frac{1}{q}(1-\sum^q_{j=1}\mathrm{softmax}^2(\boldsymbol{o})_j)+\frac{1}{q} -\frac{1}{q^2}\\
&=-\frac{1}{q}(\sum^q_{j=1}\mathrm{softmax}(\boldsymbol{o})_j-\sum^q_{j=1}\mathrm{softmax}^2(\boldsymbol{o})_j)+\frac{1}{q} -\frac{1}{q^2}\\
&=-\frac{1}{q}\sum^q_{j=1}(\mathrm{softmax}(\boldsymbol{o})_j-\mathrm{softmax}^2(\boldsymbol{o})_j)+\frac{1}{q} -\frac{1}{q^2}\\
&=-\frac{1}{q}\sum^q_{j=1}\partial_{o_j}^2 l(\mathbf{y}, \hat{\mathbf{y}}) +\frac{q-1}{q^2}\\
\end{aligned}
$$
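
  A quick numerical check of this identity — not part of the original derivation, just a minimal sketch using PyTorch with an arbitrary logit vector:

```python
import torch

q = 5
o = torch.randn(q)
s = torch.softmax(o, dim=0)

# Left-hand side: variance of the softmax outputs around their mean 1/q
var = ((s - 1.0 / q) ** 2).mean()

# Right-hand side: -1/q * sum_j d^2 l / d o_j^2 + (q-1)/q^2,
# using d^2 l / d o_j^2 = s_j * (1 - s_j) for the cross-entropy loss
matched = -(s * (1 - s)).sum() / q + (q - 1) / q ** 2

print(torch.allclose(var, matched))  # True
```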

### Exercise 3.4.2

Suppose we have three classes that occur with equal probability, i.e., the probability vector is $\displaystyle (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$.
6 changes: 3 additions & 3 deletions docs/ch04/ch04.md
@@ -1605,12 +1605,12 @@ $$
    By the chain rule, the latter two expressions evaluate to:
$$
\frac{\partial J}{\partial \mathbf{b}^{(1)}}
- = \text{prod}\left(\frac{\partial J}{\partial \mathbf{h}}, \frac{\partial \mathbf{h}}{\partial \mathbf{b}^{(2)}}\right)
- = \frac{\partial J}{\partial \mathbf{h}}.
+ = \text{prod}\left(\frac{\partial J}{\partial \mathbf{z}}, \frac{\partial \mathbf{z}}{\partial \mathbf{b}^{(1)}}\right)
+ = \frac{\partial J}{\partial \mathbf{z}}.
$$

$$
- \frac{\partial J}{\partial \mathbf{b}^{(1)}}
+ \frac{\partial J}{\partial \mathbf{b}^{(2)}}
= \text{prod}\left(\frac{\partial J}{\partial \mathbf{o}}, \frac{\partial \mathbf{o}}{\partial \mathbf{b}^{(2)}}\right)
= \frac{\partial J}{\partial \mathbf{o}}.
$$
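
Both bias-gradient identities can be sanity-checked with autograd. Below is a minimal sketch, assuming a one-hidden-layer network with made-up sizes and an arbitrary scalar loss (not from the original text):

```python
import torch

x = torch.randn(4)
W1, b1 = torch.randn(3, 4), torch.zeros(3, requires_grad=True)
W2, b2 = torch.randn(2, 3), torch.zeros(2, requires_grad=True)

z = W1 @ x + b1          # pre-activation
z.retain_grad()
h = torch.relu(z)        # hidden activation
o = W2 @ h + b2          # output
o.retain_grad()
J = (o ** 2).sum()       # any scalar loss works for this check

J.backward()
print(torch.allclose(b1.grad, z.grad))  # dJ/db1 == dJ/dz
print(torch.allclose(b2.grad, o.grad))  # dJ/db2 == dJ/do
```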
120 changes: 119 additions & 1 deletion docs/ch08/ch08.md
@@ -699,7 +699,125 @@ d2l.plot([zipf_one, zipf_two, zip_three],

![svg](output_69_0.svg)

**Alternative solution:**
TODO: Since the figures added by this solution would disrupt the ordering of the existing figures, only the code is given here.
For the linear equation
$$\log n_i = -\alpha \log i + c $$
we can estimate $\alpha$ and $c$ by least squares.

```python
from d2l import torch as d2l
tokens = d2l.tokenize(d2l.read_time_machine())
# Each text line is not necessarily a sentence or paragraph, so concatenate all lines into one corpus
corpus = [token for line in tokens for token in line]
vocab = d2l.Vocab(corpus)

freqs = [freq for token, freq in vocab.token_freqs]

bigram_tokens = [pair for pair in zip(corpus[:-1], corpus[1:])]
bigram_vocab = d2l.Vocab(bigram_tokens)
trigram_tokens = [triple for triple in zip(corpus[:-2], corpus[1:-1], corpus[2:])]
trigram_vocab = d2l.Vocab(trigram_tokens)
bigram_freqs = [freq for token, freq in bigram_vocab.token_freqs]
trigram_freqs = [freq for token, freq in trigram_vocab.token_freqs]
```

```python
import numpy as np

def estimate_coefficients(x, y):
    # Compute the means of x and y
    mean_x = np.mean(x)
    mean_y = np.mean(y)

    # Numerator and denominator of the least-squares slope
    numerator = np.sum((x - mean_x) * (y - mean_y))
    denominator = np.sum((x - mean_x)**2)

    # Slope a
    a = numerator / denominator

    # Intercept c
    c = mean_y - a * mean_x

    return a, c

def compute_zipf(freqs):
    freqs = np.array(freqs)
    index = np.array(range(1, len(freqs) + 1))
    a, c = estimate_coefficients(np.log(index), np.log(freqs))
    return a, c
```
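
For reference, the same degree-1 least-squares fit can be obtained with `np.polyfit`; this equivalent shortcut is not in the original code:

```python
def compute_zipf_polyfit(freqs):
    # polyfit with deg=1 returns [slope, intercept] of the log-log fit
    index = np.arange(1, len(freqs) + 1)
    a, c = np.polyfit(np.log(index), np.log(np.array(freqs)), deg=1)
    return a, c
```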

```python
# Zipf's law exponent for unigrams
# Note: alpha here corresponds to the negative alpha in the formula above
zipf_one_alpha, zipf_one_const = compute_zipf(freqs)
zipf_one_alpha, zipf_one_const
```

```python
# Validate the fitted unigram Zipf exponent
_freqs = np.exp(zipf_one_alpha * np.log(np.array(range(1, len(freqs) + 1))) + zipf_one_const)
d2l.plot([freqs,_freqs], xlabel='token: x',
ylabel='frequency: n(x)', xscale='log', yscale='log', legend=['real', 'fit'])
```

```python
# Zipf's law exponent for bigrams
zipf_two_alpha, zipf_two_const = compute_zipf(bigram_freqs)
zipf_two_alpha, zipf_two_const
```

```python
# Validate the fitted bigram Zipf exponent
_bigram_freqs = np.exp(zipf_two_alpha * np.log(np.array(range(1, len(bigram_freqs) + 1))) + zipf_two_const)
d2l.plot([bigram_freqs,_bigram_freqs], xlabel='token: x',
ylabel='frequency: n(x)', xscale='log', yscale='log', legend=['real', 'fit'])
```

```python
# Zipf's law exponent for trigrams
zipf_three_alpha, zipf_three_const = compute_zipf(trigram_freqs)
zipf_three_alpha, zipf_three_const
```

```python
# Validate the fitted trigram Zipf exponent
_trigram_freqs = np.exp(zipf_three_alpha * np.log(np.array(range(1, len(trigram_freqs) + 1))) + zipf_three_const)
d2l.plot([trigram_freqs,_trigram_freqs], xlabel='token: x',
ylabel='frequency: n(x)', xscale='log', yscale='log', legend=['real', 'fit'])
```
The parameters fitted for bigrams and trigrams match the data poorly because of the many low-frequency tokens.
Keeping only the top half of the high-frequency bigrams and the top quarter of the high-frequency trigrams, then refitting the Zipf exponent, gives a much better fit.

```python
# Drop the lower-frequency half and refit the bigram Zipf exponent
fit_count = len(bigram_freqs) // 2
zipf_two_alpha, zipf_two_const = compute_zipf(bigram_freqs[:fit_count])
zipf_two_alpha, zipf_two_const
```

```python
# Validate the refitted bigram Zipf exponent
_bigram_freqs = np.exp(zipf_two_alpha * np.log(np.array(range(1, len(bigram_freqs) + 1))) + zipf_two_const)
d2l.plot([bigram_freqs,_bigram_freqs], xlabel='token: x',
ylabel='frequency: n(x)', xscale='log', yscale='log', legend=['real', 'fit'])
```

```python
# Drop the bottom three quarters of low-frequency tokens and refit the trigram Zipf exponent
fit_count = len(trigram_freqs) // 4
zipf_three_alpha, zipf_three_const = compute_zipf(trigram_freqs[:fit_count])
zipf_three_alpha, zipf_three_const
```

```python
# Validate the refitted trigram Zipf exponent
_trigram_freqs = np.exp(zipf_three_alpha * np.log(np.array(range(1, len(trigram_freqs) + 1))) + zipf_three_const)
d2l.plot([trigram_freqs,_trigram_freqs], xlabel='token: x',
ylabel='frequency: n(x)', xscale='log', yscale='log', legend=['real', 'fit'])
```
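
To quantify "better fit" beyond inspecting the plots, one could compute the R² of each log-log fit on the points actually used for fitting. `zipf_r2` is a hypothetical helper, not part of the original solution:

```python
def zipf_r2(freqs, alpha, const):
    # R^2 of the linear fit log n_i = alpha * log i + const
    log_n = np.log(np.array(freqs))
    log_i = np.log(np.arange(1, len(freqs) + 1))
    residual = log_n - (alpha * log_i + const)
    return 1 - np.sum(residual ** 2) / np.sum((log_n - log_n.mean()) ** 2)

# e.g. fit quality on the truncated trigram frequencies
print(zipf_r2(trigram_freqs[:fit_count], zipf_three_alpha, zipf_three_const))
```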

### Exercise 8.3.4

@@ -2362,7 +2480,7 @@
$$\begin{align}
M^k \cdot x
&= M^k \sum_{i=1}^n \alpha_i v_i \\
&= \sum_{i=1}^n \alpha_i M^k v_i \\
&= \sum_{i=1}^n \lambda_i^k \alpha_i v_i
\end{align}$$
-   Moreover, since the eigenvalues $\lambda_i$ of $M$ satisfy $|\lambda_i| \geq |\lambda_{i+1}|$, $lambda_1^k >> lambda_i$, i.e., $\lambda_1^k$ has the largest weight.
+   Moreover, since the eigenvalues $\lambda_i$ of $M$ satisfy $|\lambda_i| \geq |\lambda_{i+1}|$, we have $\lambda_1^k \gg \lambda_i^k$ for $i > 1$, i.e., the $\lambda_1^k$ term carries the largest weight.
  Therefore $M^k \cdot x \approx \lambda_1^k \alpha_1 v_1$; that is, with high probability the result lies close to the line spanned by the eigenvector $v_1$.
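
A small numerical illustration of this dominance — not in the original text; it uses a symmetric matrix so that the eigenvalues and eigenvectors are real:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
M = (A + A.T) / 2                            # symmetric => real eigen-decomposition
eigvals, eigvecs = np.linalg.eigh(M)
v1 = eigvecs[:, np.argmax(np.abs(eigvals))]  # dominant eigenvector

x = rng.normal(size=4)
for _ in range(100):                         # x <- Mx, normalized to avoid overflow
    x = M @ x
    x /= np.linalg.norm(x)

print(abs(x @ v1))                           # close to 1: x aligns with v1
```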

3. What does the above result imply for the gradients in recurrent neural networks?
43 changes: 40 additions & 3 deletions notebooks/ch03/ch03.ipynb
@@ -2331,13 +2331,50 @@
"$$ \n",
"\\begin{aligned} \n",
"\\mathrm{Var}_{\\mathrm{softmax}(\\mathbf{o})} \n",
"&= \\sum_{j=1}^q (\\mathrm{softmax}(\\mathbf{o})_j - E[\\mathrm{softmax}(\\mathbf{o})_j])^2 \\\\ \n",
"&= \\sum_{j=1}^q (\\mathrm{softmax}(\\mathbf{o})_j - \\frac{1}{q}\\sum_{k=1}^q \\mathrm{softmax}(\\mathbf{o})_k)^2 \\\\ \n",
"&= \\sum_{j=1}^q (\\mathrm{softmax}(\\mathbf{o})_j - \\frac{1}{q})^2 \\\\\n",
"&= \\frac{1}{q} \\sum_{j=1}^q (\\mathrm{softmax}(\\mathbf{o})_j - E[\\mathrm{softmax}(\\mathbf{o})_j])^2 \\\\ \n",
"&= \\frac{1}{q} \\sum_{j=1}^q (\\mathrm{softmax}(\\mathbf{o})_j - \\frac{1}{q}\\sum_{k=1}^q \\mathrm{softmax}(\\mathbf{o})_k)^2 \\\\ \n",
"&= \\frac{1}{q} \\sum_{j=1}^q (\\mathrm{softmax}(\\mathbf{o})_j - \\frac{1}{q})^2 \\\\\n",
"\\end{aligned} \n",
"$$"
]
},
{
"cell_type": "markdown",
"source": [
"  展开为: \n",
"$$\n",
"\\begin{aligned}\n",
"\\mathrm{Var}_{\\mathrm{softmax}(\\mathbf{o})} \n",
"&= \\frac{1}{q} \\sum_{j=1}^q (\\mathrm{softmax}(\\mathbf{o})_j - \\frac{1}{q})^2 \\\\\n",
"&=\\frac{1}{q}\\left[(\\mathrm{softmax}(\\boldsymbol{o})_1-\\frac{1}{q})^2+(\\mathrm{softmax}(\\boldsymbol{o})_2-\\frac{1}{q})^2+\\dots+(\\mathrm{softmax}(\\boldsymbol{o})_q-\\frac{1}{q})^2\\right]\\\\\n",
"&=\\frac{1}{q}(\\sum^q_{j=1}\\mathrm{softmax}^2(\\boldsymbol{o})_j-\\frac{2}{q}\\sum^q_{j=1}\\mathrm{softmax}(\\boldsymbol{o})_j+\\sum_{j=1}^q \\frac{1}{q^2})\\\\\n",
"&=\\frac{1}{q}(\\sum^q_{j=1}\\mathrm{softmax}^2(\\boldsymbol{o})_j -\\frac{2}{q} +\\frac{1}{q})\\\\\n",
"&=\\frac{1}{q}\\sum^q_{j=1}\\mathrm{softmax}^2(\\boldsymbol{o})_j -\\frac{1}{q^2}\n",
"\\end{aligned}\n",
"$$"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"  与二阶导数匹配为: \n",
"$$\n",
"\\begin{aligned}\n",
"\\mathrm{V\\ ar}(o)&=\\frac{1}{q}\\sum^q_{j=1}\\mathrm{softmax}^2(\\boldsymbol{o})_j -\\frac{1}{q^2}\\\\\n",
"&=-\\frac{1}{q}(1-\\sum^q_{j=1}\\mathrm{softmax}^2(\\boldsymbol{o})_j)+\\frac{1}{q} -\\frac{1}{q^2}\\\\\n",
"&=-\\frac{1}{q}(\\sum^q_{j=1}\\mathrm{softmax}(\\boldsymbol{o})_j-\\sum^q_{j=1}\\mathrm{softmax}^2(\\boldsymbol{o})_j)+\\frac{1}{q} -\\frac{1}{q^2}\\\\\n",
"&=-\\frac{1}{q}\\sum^q_{j=1}(\\mathrm{softmax}(\\boldsymbol{o})_j-\\mathrm{softmax}^2(\\boldsymbol{o})_j)+\\frac{1}{q} -\\frac{1}{q^2}\\\\\n",
"&=-\\frac{1}{q}\\sum^q_{j=1}\\partial_{o_j}^2 l(\\mathbf{y}, \\hat{\\mathbf{y}}) +\\frac{q-1}{q^2}\\\\\n",
"\\end{aligned}\n",
"$$"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"metadata": {},
11 changes: 4 additions & 7 deletions notebooks/ch04/ch04.ipynb
@@ -3,10 +3,7 @@
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
"collapsed": true
},
"source": [
"# 第4章 多层感知机"
@@ -44529,12 +44526,12 @@
"    根据链式子法则,后面两个式子的结果为:\n",
"$$\n",
"\\frac{\\partial J}{\\partial \\mathbf{b}^{(1)}}\n",
"= \\text{prod}\\left(\\frac{\\partial J}{\\partial \\mathbf{h}}, \\frac{\\partial \\mathbf{h}}{\\partial \\mathbf{b}^{(2)}}\\right) \n",
"= \\frac{\\partial J}{\\partial \\mathbf{h}}.\n",
"= \\text{prod}\\left(\\frac{\\partial J}{\\partial \\mathbf{z}}, \\frac{\\partial \\mathbf{z}}{\\partial \\mathbf{b}^{(1)}}\\right) \n",
"= \\frac{\\partial J}{\\partial \\mathbf{z}}.\n",
"$$\n",
"\n",
"$$\n",
"\\frac{\\partial J}{\\partial \\mathbf{b}^{(1)}}\n",
"\\frac{\\partial J}{\\partial \\mathbf{b}^{(2)}}\n",
"= \\text{prod}\\left(\\frac{\\partial J}{\\partial \\mathbf{o}}, \\frac{\\partial \\mathbf{o}}{\\partial \\mathbf{b}^{(2)}}\\right) \n",
"= \\frac{\\partial J}{\\partial \\mathbf{o}}.\n",
"$$\n"