Deep Learning "Bible" Notes Index: Diagrams + Intuition (first four chapters complete)
This is attempt number N at reading this book (I have already given up N-1 times).
What makes this round of note videos different: a diagram for the knowledge points in every passage.
Below are the notes accumulated so far.
I have picked up and put down this book many times; this time I finally saw an expert say that it is not an introductory text.
Link to the English edition of the deep learning book
Chinese-English glossary of deep learning terms
Mathematical notation used in deep learning
Answers to interview questions are hidden in this book
(Domestic) expert-guided read-through (text and images)
(Overseas) expert-guided read-through (videos)
Table of contents of the "flower book" (the community's nickname for the Deep Learning book)
1 Introduction (notes)
[v] 1.1 Who Should Read This Book? . . . . . . . . . . . . . . . . . . . . 8
[v] 1.2 Historical Trends in Deep Learning . . . . . . . . . . . . . . . . . 12
## I Applied Math and Machine Learning Basics
2 Linear Algebra (notes)
[v] 2.1 Scalars, Vectors, Matrices and Tensors . . . . . . . . . . . . . . . 29 (notes)
[v] 2.2 Multiplying Matrices and Vectors . . . . . . . . . . . . . . . . . . 32 (notes)
[v] 2.3 Identity and Inverse Matrices . . . . . . . . . . . . . . . . . . . . 34 (notes)
[v] 2.4 Linear Dependence and Span . . . . . . . . . . . . . . . . . . . . 35 (notes)
[v] 2.5 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 (notes)
Brief overview: matrix multiplication, linear dependence, identity and inverse matrices, norms
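These linear-algebra basics are easy to verify numerically. A minimal NumPy sketch (the matrix and vectors here are arbitrary examples, not from the book):

```python
import numpy as np

# A matrix-vector product Ax is a linear combination of A's columns.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([1.0, -1.0])
Ax = A @ x  # equals 1*A[:,0] + (-1)*A[:,1]

# The inverse undoes the multiplication: A^{-1}(Ax) recovers x.
x_back = np.linalg.inv(A) @ Ax

# Common norms: L1 (sum of |v_i|), L2 (Euclidean), L-infinity (max |v_i|).
v = np.array([3.0, -4.0])
l1 = np.linalg.norm(v, 1)         # 7.0
l2 = np.linalg.norm(v, 2)         # 5.0
linf = np.linalg.norm(v, np.inf)  # 4.0
```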
[v] 2.6 Special Kinds of Matrices and Vectors . . . . . . . . . . . . . . . 38 (notes)
Brief overview: special matrices and vectors.mp4
[v] 2.7 Eigendecomposition . . . . . . . . . . . . . . . . . . . . . . . . . . 40 (notes)
For now I understand this only at the level of translation; intuition is still a long way off.
[v] 2.8 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . 42 (notes)
For now I understand this only at the level of translation; intuition is still a long way off.
[v] 2.9 The Moore-Penrose Pseudoinverse . . . . . . . . . . . . . . . . . . 43 (notes)
For now I understand this only at the level of translation; intuition is still a long way off.
[v] 2.10 The Trace Operator . . . . . . . . . . . . . . . . . . . . . . . . . 44 (notes)
For now I understand this only at the level of translation; intuition is still a long way off.
[v] 2.11 The Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 (notes)
For now I understand this only at the level of translation; intuition is still a long way off.
Understand it as an area? A matrix expands and contracts space along its eigenvectors.
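The area intuition can be checked numerically: the determinant equals the product of the eigenvalues and measures how much the matrix scales area (volume in higher dimensions). A small NumPy sketch using an arbitrary diagonal matrix:

```python
import numpy as np

# This matrix stretches the x-axis by 2 and the y-axis by 3,
# so a unit square maps to a 2-by-3 rectangle: area scaled by 6.
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigvals = np.linalg.eigvals(A)  # the scaling factors along the eigenvectors
det = np.linalg.det(A)          # det(A) = product of eigenvalues = 6
```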
[v] 2.12 Example: Principal Components Analysis . . . . . . . . . . . . . 45 (notes)
For now I understand this only at the level of translation; intuition is still a long way off.
3 Probability and Information Theory
[v] 3.1 Why Probability? . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 (notes)
[v] 3.2 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 54 (notes)
Diagrams: probability, random variable, probability distribution
Why probability? What are frequentist probability and Bayesian probability? What are random variables and probability distributions?
[v] 3.3 Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . 54 (notes)
Diagrams: probability mass function and probability density function
How should we understand a probability mass function? How should we understand a probability density function?
Extra content
Diagram: derivative; diagram: integral
[v] 3.4 Marginal Probability . . . . . . . . . . . . . . . . . . . . . . . . . 56 (notes)
Diagram: marginal probability, from p(x,y) to p(x)
[v] 3.5 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . 57 (notes)
[v] 3.6 The Chain Rule of Conditional Probabilities . . . . . . . . . . . . 57 (notes)
[v] 3.7 Independence and Conditional Independence . . . . . . . . . . . . 58 (notes)
Conditional-probability formulas commonly seen in deep learning
[v] 3.8 Expectation, Variance and Covariance . . . . . . . . . . . . . . . 58 (notes)
Diagrams: expectation (the mean under a probability distribution), variance, covariance, and the covariance matrix
[v] 3.9 Common Probability Distributions . . . . . . . . . . . . . . . . . 60 (notes)
Diagrams: Bernoulli, multinoulli, and Gaussian distributions
How to understand the Bernoulli distribution and the formulas for its mean and variance
Diagrams: exponential and Laplace distributions (worked through in English); help sought in English on the Dirac distribution and the empirical distribution; brief overview: mixture distribution models (worked through in English)
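The Bernoulli mean and variance formulas, E[x] = p and Var(x) = p(1 - p), can be confirmed by summing directly over the two outcomes. A small sketch with an arbitrary p:

```python
import numpy as np

# Bernoulli(p): P(x=1) = p, P(x=0) = 1 - p.
# Mean:     E[x] = 1*p + 0*(1-p) = p
# Variance: E[(x - E[x])^2] = p - p^2 = p*(1-p)
p = 0.3
values = np.array([0.0, 1.0])
probs = np.array([1 - p, p])

mean = np.sum(values * probs)                 # expectation under the distribution
var = np.sum((values - mean) ** 2 * probs)    # variance under the distribution
```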
[v] 3.10 Useful Properties of Common Functions . . . . . . . . . . . . . . 65
Diagram: logistic sigmoid
What is the logistic sigmoid? What is saturation? What is the Bernoulli distribution? Diagram: the softplus function. What properties does softplus have? Which properties (algebraic identities) of the sigmoid and softplus functions should we memorize?
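Two of the identities worth memorizing can be checked numerically: sigmoid(x) + sigmoid(-x) = 1, and softplus(x) - softplus(-x) = x. A minimal NumPy sketch:

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid: squashes the reals into (0, 1);
    # saturates (gradient near zero) for large |x|.
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    # Smooth approximation of max(0, x); always positive.
    # Its derivative is exactly the sigmoid.
    return np.log1p(np.exp(x))

x = np.linspace(-5.0, 5.0, 11)
# Identities: sigmoid(x) + sigmoid(-x) == 1, softplus(x) - softplus(-x) == x
```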
[v] 3.11 Bayes' Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
[v] 3.12 Technical Details of Continuous Variables . . . . . . . . . . . . . 69
What paradoxes can continuous variables present?
What are measure zero and positive measure? What is distortion of space by a function?
[v] 3.13 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Diagram: entropy
What is entropy? What is binary cross-entropy?
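Both quantities are short one-liners, so they are easy to play with directly. A minimal sketch (in nats, i.e. natural log; the values 0.5, 0.9 are arbitrary examples):

```python
import numpy as np

def entropy(p):
    # Shannon entropy of a Bernoulli(p), in nats:
    # H(p) = -p*log(p) - (1-p)*log(1-p); maximal at p = 0.5 (log 2).
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def binary_cross_entropy(y, p):
    # BCE between a label y in {0, 1} and a predicted probability p:
    # lower when the prediction matches the label confidently.
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

h_max = entropy(0.5)  # a fair coin is maximally uncertain: log(2) nats
```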
[v] 3.14 Structured Probabilistic Models . . . . . . . . . . . . . . . . . . . 73
Brief overview: structured probabilistic models (worked through in English)
4 Numerical Computation
Diagram: machine learning by nature devours computational resources
[v] 4.1 Overflow and Underflow . . . . . . . . . . . . . . . . . . . . . . . 78
Diagram: underflow and overflow, the problems caused by the limits of numerical precision
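A classic illustration of this: computing log(sum(exp(x_i))) naively overflows once any x_i is large, while the standard log-sum-exp trick stays stable. A minimal NumPy sketch:

```python
import numpy as np

def logsumexp(x):
    # Subtracting the max before exponentiating keeps every exponent <= 0,
    # so exp() cannot overflow, and the largest term is exactly exp(0) = 1.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1000.0])
# Naive np.log(np.sum(np.exp(x))) overflows to inf, since exp(1000) > float64 max.
stable = logsumexp(x)  # correct answer: 1000 + log(2)
```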
[v] 4.2 Poor Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Diagram: poor conditioning
[v] 4.3 Gradient-Based Optimization . . . . . . . . . . . . . . . . . . . . 80
Diagram: gradient-based optimization
Jacobian matrix, 5 videos (Khan Academy)
Hessian matrix (Khan Academy)
Curvature intuition
Brief overview: beyond the gradient
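As a toy complement to the gradient-based optimization notes: minimizing f(x) = (x - 3)^2 by repeatedly stepping against the gradient. The learning rate and step count below are arbitrary choices for the sketch:

```python
# Plain gradient descent on f(x) = (x - 3)^2.
# Gradient: f'(x) = 2*(x - 3); update rule: x <- x - lr * f'(x).
def grad(x):
    return 2.0 * (x - 3.0)

x = 0.0    # arbitrary starting point
lr = 0.1   # learning rate (step size)
for _ in range(100):
    x -= lr * grad(x)
# x converges toward the minimizer x = 3: each step shrinks the error by 0.8.
```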
[ ] 4.4 Constrained Optimization . . . . . . . . . . . . . . . . . . . . . . 91
Constrained optimization and the Lagrangian (Khan Academy)
[ ] 4.5 Example: Linear Least Squares . . . . . . . . . . . . . . . . . . . 94
Matrix inverse (Khan Academy)
5 Machine Learning Basics
Diagram: what makes machine learning distinctive
[v] 5.1 Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 97 (notes)
Diagram: classification problems and probability distributions
Probabilistic modeling/learning: classification with missing data; relevant course: Introduction to Statistical Learning (course site)
[ ] 5.2 Capacity, Overfitting and Underfitting . . . . . . . . . . . . . . . 108
[ ] 5.3 Hyperparameters and Validation Sets . . . . . . . . . . . . . . . . 118
[ ] 5.4 Estimators, Bias and Variance . . . . . . . . . . . . . . . . . . . . 120
[ ] 5.5 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . 129
[ ] 5.6 Bayesian Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 133
[ ] 5.7 Supervised Learning Algorithms . . . . . . . . . . . . . . . . . . . 137
[ ] 5.8 Unsupervised Learning Algorithms . . . . . . . . . . . . . . . . . 142
[ ] 5.9 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . 149
[ ] 5.10 Building a Machine Learning Algorithm . . . . . . . . . . . . . . 151
[ ] 5.11 Challenges Motivating Deep Learning . . . . . . . . . . . . . . . . 152
## II Deep Networks: Modern Practices
6 Deep Feedforward Networks
[ ] 6.1 Example: Learning XOR . . . . . . . . . . . . . . . . . . . . . . . 167
[ ] 6.2 Gradient-Based Learning . . . . . . . . . . . . . . . . . . . . . . . 172
[ ] 6.3 Hidden Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
[ ] 6.4 Architecture Design . . . . . . . . . . . . . . . . . . . . . . . . . . 193
[ ] 6.5 Back-Propagation and Other Differentiation Algorithms . . . . . 200
[ ] 6.6 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
7 Regularization for Deep Learning
[ ] 7.1 Parameter Norm Penalties . . . . . . . . . . . . . . . . . . . . . . 226
[ ] 7.2 Norm Penalties as Constrained Optimization . . . . . . . . . . . . 233
[ ] 7.3 Regularization and Under-Constrained Problems . . . . . . . . . 235
[ ] 7.4 Dataset Augmentation . . . . . . . . . . . . . . . . . . . . . . . . 236
[ ] 7.5 Noise Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
[ ] 7.6 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . . . 240
[ ] 7.7 Multitask Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 241
[ ] 7.8 Early Stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
[ ] 7.9 Parameter Tying and Parameter Sharing . . . . . . . . . . . . . . 249
[ ] 7.10 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . 251
[ ] 7.11 Bagging and Other Ensemble Methods . . . . . . . . . . . . . . . 253
[ ] 7.12 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
[ ] 7.13 Adversarial Training . . . . . . . . . . . . . . . . . . . . . . . . . 265
[ ] 7.14 Tangent Distance, Tangent Prop and Manifold Tangent Classifier . 267
8 Optimization for Training Deep Models
[ ] 8.1 How Learning Differs from Pure Optimization . . . . . . . . . . . 272
[ ] 8.2 Challenges in Neural Network Optimization . . . . . . . . . . . . 279
[ ] 8.3 Basic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
[ ] 8.4 Parameter Initialization Strategies . . . . . . . . . . . . . . . . . 296
[ ] 8.5 Algorithms with Adaptive Learning Rates . . . . . . . . . . . . . 302
[ ] 8.6 Approximate Second-Order Methods . . . . . . . . . . . . . . . . 307
[ ] 8.7 Optimization Strategies and Meta-Algorithms . . . . . . . . . . . 313
9 Convolutional Networks 326
[ ] 9.1 The Convolution Operation . . . . . . . . . . . . . . . . . . . . . 327
[ ] 9.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
[ ] 9.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
[ ] 9.4 Convolution and Pooling as an Infinitely Strong Prior . . . . . . . 339
[ ] 9.5 Variants of the Basic Convolution Function . . . . . . . . . . . . 342
[ ] 9.6 Structured Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . 352
[ ] 9.7 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
[ ] 9.8 Efficient Convolution Algorithms . . . . . . . . . . . . . . . . . . 356
[ ] 9.9 Random or Unsupervised Features . . . . . . . . . . . . . . . . . 356
[ ] 9.10 The Neuroscientific Basis for Convolutional Networks . . . . . . 358
[ ] 9.11 Convolutional Networks and the History of Deep Learning . . . . 365
10 Sequence Modeling: Recurrent and Recursive Nets 367
[ ] 10.1 Unfolding Computational Graphs . . . . . . . . . . . . . . . . . . 369
[ ] 10.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . 372
[ ] 10.3 Bidirectional RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . 388
[ ] 10.4 Encoder-Decoder Sequence-to-Sequence Architectures . 390
[ ] 10.5 Deep Recurrent Networks . . . . . . . . . . . . . . . . . . . . . . 392
[ ] 10.6 Recursive Neural Networks . . . . . . . . . . . . . . . . . . . . . . 394
[ ] 10.7 The Challenge of Long-Term Dependencies . . . . . . . . . . . . . 396
[ ] 10.8 Echo State Networks . . . . . . . . . . . . . . . . . . . . . . . . . 399
[ ] 10.9 Leaky Units and Other Strategies for Multiple Time Scales . 402
[ ] 10.10 The Long Short-Term Memory and Other Gated RNNs . . . . . . 404
[ ] 10.11 Optimization for Long-Term Dependencies . . . . . . . . . . . . . 408
[ ] 10.12 Explicit Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
11 Practical Methodology 416
[ ] 11.1 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 417
[ ] 11.2 Default Baseline Models . . . . . . . . . . . . . . . . . . . . . . . 420
[ ] 11.3 Determining Whether to Gather More Data . . . . . . . . . . . . 421
[ ] 11.4 Selecting Hyperparameters . . . . . . . . . . . . . . . . . . . . . . 422
[ ] 11.5 Debugging Strategies . . . . . . . . . . . . . . . . . . . . . . . . . 431
[ ] 11.6 Example: Multi-Digit Number Recognition . . . . . . . . . . . . . 435
12 Applications 438
[ ] 12.1 Large-Scale Deep Learning . . . . . . . . . . . . . . . . . . . . . . 438
[ ] 12.2 Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
[ ] 12.3 Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 453
[ ] 12.4 Natural Language Processing . . . . . . . . . . . . . . . . . . . . 456
[ ] 12.5 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 473
## III Deep Learning Research 482
13 Linear Factor Models 485
[ ] 13.1 Probabilistic PCA and Factor Analysis . . . . . . . . . . . . . . . 486
[ ] 13.2 Independent Component Analysis (ICA) . . . . . . . . . . . . . . 487
[ ] 13.3 Slow Feature Analysis . . . . . . . . . . . . . . . . . . . . . . . . 489
[ ] 13.4 Sparse Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
[ ] 13.5 Manifold Interpretation of PCA . . . . . . . . . . . . . . . . . . . 496
14 Autoencoders 499
[ ] 14.1 Undercomplete Autoencoders . . . . . . . . . . . . . . . . . . . . 500
[ ] 14.2 Regularized Autoencoders . . . . . . . . . . . . . . . . . . . . . . 501
[ ] 14.3 Representational Power, Layer Size and Depth . . . . . . . . . . . 505
[ ] 14.4 Stochastic Encoders and Decoders . . . . . . . . . . . . . . . . . . 506
[ ] 14.5 Denoising Autoencoders . . . . . . . . . . . . . . . . . . . . . . . 507
[ ] 14.6 Learning Manifolds with Autoencoders . . . . . . . . . . . . . . . 513
[ ] 14.7 Contractive Autoencoders . . . . . . . . . . . . . . . . . . . . . . 518
[ ] 14.8 Predictive Sparse Decomposition . . . . . . . . . . . . . . . . . . 521
[ ] 14.9 Applications of Autoencoders . . . . . . . . . . . . . . . . . . . . 522
15 Representation Learning 524
[ ] 15.1 Greedy Layer-Wise Unsupervised Pretraining . . . . . . . . . . . 526
[ ] 15.2 Transfer Learning and Domain Adaptation . . . . . . . . . . . . . 534
[ ] 15.3 Semi-Supervised Disentangling of Causal Factors . . . . . . . . . 539
[ ] 15.4 Distributed Representation . . . . . . . . . . . . . . . . . . . . . . 544
[ ] 15.5 Exponential Gains from Depth . . . . . . . . . . . . . . . . . . . 550
[ ] 15.6 Providing Clues to Discover Underlying Causes . . . . . . . . . . 552
16 Structured Probabilistic Models for Deep Learning 555
[ ] 16.1 The Challenge of Unstructured Modeling . . . . . . . . . . . . . . 556
[ ] 16.2 Using Graphs to Describe Model Structure . . . . . . . . . . . . . 560
[ ] 16.3 Sampling from Graphical Models . . . . . . . . . . . . . . . . . . 577
[ ] 16.4 Advantages of Structured Modeling . . . . . . . . . . . . . . . . . 579
[ ] 16.5 Learning about Dependencies . . . . . . . . . . . . . . . . . . . . 579
[ ] 16.6 Inference and Approximate Inference . . . . . . . . . . . . . . . . 580
[ ] 16.7 The Deep Learning Approach to Structured Probabilistic Models . 581
17 Monte Carlo Methods 587
[ ] 17.1 Sampling and Monte Carlo Methods . . . . . . . . . . . . . . . . 587
[ ] 17.2 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . 589
[ ] 17.3 Markov Chain Monte Carlo Methods . . . . . . . . . . . . . . . . 592
[ ] 17.4 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
[ ] 17.5 The Challenge of Mixing between Separated Modes . . . . . . . . 597
18 Confronting the Partition Function 603
[ ] 18.1 The Log-Likelihood Gradient . . . . . . . . . . . . . . . . . . . . 604
[ ] 18.2 Stochastic Maximum Likelihood and Contrastive Divergence . . . 605
[ ] 18.3 Pseudolikelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
[ ] 18.4 Score Matching and Ratio Matching . . . . . . . . . . . . . . . . 615
[ ] 18.5 Denoising Score Matching . . . . . . . . . . . . . . . . . . . . . . 617
[ ] 18.6 Noise-Contrastive Estimation . . . . . . . . . . . . . . . . . . . . 618
[ ] 18.7 Estimating the Partition Function . . . . . . . . . . . . . . . . . . 621
19 Approximate Inference 629
[ ] 19.1 Inference as Optimization . . . . . . . . . . . . . . . . . . . . . . 631
[ ] 19.2 Expectation Maximization . . . . . . . . . . . . . . . . . . . . . . 632
[ ] 19.3 MAP Inference and Sparse Coding . . . . . . . . . . . . . . . . . 633
[ ] 19.4 Variational Inference and Learning . . . . . . . . . . . . . . . . . 636
[ ] 19.5 Learned Approximate Inference . . . . . . . . . . . . . . . . . . . 648
20 Deep Generative Models 651
[ ] 20.1 Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . 651
[ ] 20.2 Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . 653
[ ] 20.3 Deep Belief Networks . . . . . . . . . . . . . . . . . . . . . . . . . 657
[ ] 20.4 Deep Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . 660
[ ] 20.5 Boltzmann Machines for Real-Valued Data . . . . . . . . . . . . . 673
[ ] 20.6 Convolutional Boltzmann Machines . . . . . . . . . . . . . . . . . 679
[ ] 20.7 Boltzmann Machines for Structured or Sequential Outputs . . . . 681
[ ] 20.8 Other Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . 683
[ ] 20.9 Back-Propagation through Random Operations . . . . . . . . . . 684
[ ] 20.10 Directed Generative Nets . . . . . . . . . . . . . . . . . . . . . . . 688
[ ] 20.11 Drawing Samples from Autoencoders . . . . . . . . . . . . . . . . 707
[ ] 20.12 Generative Stochastic Networks . . . . . . . . . . . . . . . . . . . 710
[ ] 20.13 Other Generation Schemes . . . . . . . . . . . . . . . . . . . . . . 712
[ ] 20.14 Evaluating Generative Models . . . . . . . . . . . . . . . . . . . . 713
[ ] 20.15 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 716
Recommended reading: