This is the Nth time I have picked up this book and started reading it (I have already given up N-1 times).

What is different this time: I am making note videos, and every passage's knowledge points get a diagram.

Below are the notes accumulated previously.


I have picked up and abandoned this book many times; this time I finally saw an expert say that it is not an introductory text.

deep learning book: link to the English edition

Deep learning vocabulary: Chinese-English glossary

Mathematical notation used in deep learning

The answers to interview questions are hidden in this book

(Chinese) expert-guided read-through (text and images)

(Overseas) expert-guided read-through (videos)


Table of contents of the 花書 (the Deep Learning book)

1 Introduction (notes)

[x] 1.1 Who Should Read This Book? . . . . . . . . . . . . . . . . . . . . 8

[x] 1.2 Historical Trends in Deep Learning . . . . . . . . . . . . . . . . . 12

## I Applied Math and Machine Learning Basics

2 Linear Algebra (notes)

[v] 2.1 Scalars, Vectors, Matrices and Tensors . . . . . . . . . . . . . . . 29 (notes)

[v] 2.2 Multiplying Matrices and Vectors . . . . . . . . . . . . . . . . . . 32 (notes)

[v] 2.3 Identity and Inverse Matrices . . . . . . . . . . . . . . . . . . . . 34 (notes)

[v] 2.4 Linear Dependence and Span . . . . . . . . . . . . . . . . . . . . 35 (notes)

[v] 2.5 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 (notes)

Brief review: matrix multiplication, linear dependence, identity and inverse matrices, and norms
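
A minimal numpy sketch tying these four ideas together (the matrices and vectors here are made up for illustration):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
x = np.array([1.0, 2.0])

# Matrix-vector multiplication: each entry of A @ x is the dot product
# of one row of A with x.
print(A @ x)                                # [4. 7.]

# Identity and inverse: multiplying A by its inverse recovers the identity.
A_inv = np.linalg.inv(A)
print(np.allclose(A_inv @ A, np.eye(2)))    # True

# Linear dependence: B's second column is 2x its first, so rank(B) = 1
# and B has no inverse.
B = np.array([[1.0, 2.0],
              [2.0, 4.0]])
print(np.linalg.matrix_rank(B))             # 1

# Norms of x: L1, L2 (Euclidean), and L-infinity (max norm).
print(np.linalg.norm(x, 1), np.linalg.norm(x), np.linalg.norm(x, np.inf))
```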

[v] 2.6 Special Kinds of Matrices and Vectors . . . . . . . . . . . . . . . 38 (notes)

Brief review video: special matrices and vectors.mp4

[v] 2.7 Eigendecomposition . . . . . . . . . . . . . . . . . . . . . . . . . . 40 (notes)

So far I only understand this at a translation level; real intuition is still a long way off.

[v] 2.8 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . 42 (notes)

So far I only understand this at a translation level; real intuition is still a long way off.

[v] 2.9 The Moore-Penrose Pseudoinverse . . . . . . . . . . . . . . . . . . 43 (notes)

So far I only understand this at a translation level; real intuition is still a long way off.

[v] 2.10 The Trace Operator . . . . . . . . . . . . . . . . . . . . . . . . . 44 (notes)

So far I only understand this at a translation level; real intuition is still a long way off.

[v] 2.11 The Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 (notes)

So far I only understand this at a translation level; real intuition is still a long way off.

Understand it as an area (or volume)? How a matrix expands and contracts space, via its eigenvalues.
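
A quick numerical check of that intuition, with an arbitrarily chosen matrix: the determinant equals the product of the eigenvalues, and its absolute value is the factor by which the matrix scales area (volume in higher dimensions).

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])

eigenvalues = np.linalg.eigvals(A)
print(np.linalg.det(A))       # 6.0
print(np.prod(eigenvalues))   # 6.0 as well: det(A) = product of eigenvalues

# |det(A)| = 6 means A maps the unit square to a parallelogram of area 6;
# a determinant of 0 would mean space gets flattened onto a lower dimension.
```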

[v] 2.12 Example: Principal Components Analysis . . . . . . . . . . . . . 45 (notes)

So far I only understand this at a translation level; real intuition is still a long way off.

3 Probability and Information Theory

[v] 3.1 Why Probability? . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 (notes)

[v] 3.2 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 54 (notes)

Diagrams: probability, random variable, probability distribution

Why probability? What are frequentist probability and Bayesian probability? What are a random variable and a probability distribution?

[v] 3.3 Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . 54 (notes)

Diagrams: the probability mass function and the probability density function

How should one understand a probability mass function? How should one understand a probability density function?
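
The defining properties from section 3.3 are worth writing out: a PMF assigns probabilities directly, while a PDF only yields probabilities after integration.

```latex
% Probability mass function P over a discrete random variable x:
\forall x : \; 0 \le P(x) \le 1, \qquad \sum_{x} P(x) = 1

% Probability density function p over a continuous random variable x:
\forall x : \; p(x) \ge 0, \qquad \int p(x)\,dx = 1
% Note that p(x) itself may exceed 1; probabilities come from integrating p
% over a set, e.g. P(a \le x \le b) = \int_a^b p(x)\,dx.
```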

Extra content

Diagrams: derivative and integral

[v] 3.4 Marginal Probability . . . . . . . . . . . . . . . . . . . . . . . . . 56 (notes)

Diagram: marginal probability, going from p(x, y) to p(x)
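
A tiny numerical illustration of the sum rule P(x) = Σ_y P(x, y), using a made-up joint table:

```python
import numpy as np

# Made-up joint distribution P(x, y): rows index values of x, columns values of y.
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

p_x = joint.sum(axis=1)   # sum out y: P(x) = [0.3, 0.7]
p_y = joint.sum(axis=0)   # sum out x: P(y) = [0.4, 0.6]
print(p_x, p_y)
print(joint.sum())        # 1.0: the joint table is a valid distribution
```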

[v] 3.5 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . 57 (notes)

[v] 3.6 The Chain Rule of Conditional Probabilities . . . . . . . . . . . . 57 (notes)

[v] 3.7 Independence and Conditional Independence . . . . . . . . . . . . 58 (notes)

Conditional probability formulas commonly used in deep learning
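
The formulas from sections 3.5-3.7 that keep reappearing:

```latex
% Conditional probability (defined only when P(x) > 0):
P(y \mid x) = \frac{P(y, x)}{P(x)}

% Chain rule of conditional probabilities:
P(x^{(1)}, \dots, x^{(n)}) = P(x^{(1)}) \prod_{i=2}^{n} P\bigl(x^{(i)} \mid x^{(1)}, \dots, x^{(i-1)}\bigr)

% Independence and conditional independence:
x \perp y \;\Leftrightarrow\; P(x, y) = P(x)\,P(y)
x \perp y \mid z \;\Leftrightarrow\; P(x, y \mid z) = P(x \mid z)\,P(y \mid z)
```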

[v] 3.8 Expectation, Variance and Covariance . . . . . . . . . . . . . . . 58 (notes)

Diagrams: expectation (the mean under a probability distribution), variance, covariance, and the covariance matrix
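
The definitions behind those diagrams (discrete case shown; replace sums with integrals for densities):

```latex
% Expectation: the average of f(x) weighted by the probability of each x.
\mathbb{E}_{x \sim P}[f(x)] = \sum_{x} P(x)\, f(x)

% Variance: the expected squared deviation from that mean.
\mathrm{Var}\bigl(f(x)\bigr) = \mathbb{E}\Bigl[\bigl(f(x) - \mathbb{E}[f(x)]\bigr)^{2}\Bigr]

% Covariance, and the covariance matrix of a random vector x:
\mathrm{Cov}\bigl(f(x), g(y)\bigr) = \mathbb{E}\Bigl[\bigl(f(x) - \mathbb{E}[f(x)]\bigr)\bigl(g(y) - \mathbb{E}[g(y)]\bigr)\Bigr]
\mathrm{Cov}(\mathbf{x})_{i,j} = \mathrm{Cov}(x_i, x_j), \qquad \mathrm{Cov}(\mathbf{x})_{i,i} = \mathrm{Var}(x_i)
```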

[v] 3.9 Common Probability Distributions . . . . . . . . . . . . . . . . . 60 (notes)

Diagrams: the Bernoulli, multinoulli, and Gaussian distributions

How to understand the mean and variance formulas of the Bernoulli distribution
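
Both formulas follow in a couple of lines, because a Bernoulli variable only takes the values 0 and 1:

```latex
% Bernoulli distribution with parameter \phi: P(x=1) = \phi, P(x=0) = 1 - \phi.
\mathbb{E}[x] = 1 \cdot \phi + 0 \cdot (1 - \phi) = \phi

% Because x \in \{0, 1\}, x^2 = x, so \mathbb{E}[x^2] = \phi and therefore
\mathrm{Var}(x) = \mathbb{E}[x^{2}] - \bigl(\mathbb{E}[x]\bigr)^{2} = \phi - \phi^{2} = \phi(1 - \phi)
```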

Diagrams: the exponential and Laplace distributions (thinking-in-English version). Asking for help in English: the Dirac distribution and the empirical distribution. Brief review: mixture distribution models (thinking-in-English version).

[v] 3.10 Useful Properties of Common Functions . . . . . . . . . . . . . . 65

Diagram: the logistic sigmoid

What is the logistic sigmoid? What is saturation? What is the Bernoulli distribution? Diagram: the softplus function. What properties does softplus have? Which properties (formula rearrangements) of the sigmoid and softplus functions do we need to remember?
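
The properties listed in section 3.10 that are worth memorizing, written out:

```latex
\sigma(x) = \frac{1}{1 + \exp(-x)}, \qquad \zeta(x) = \log\bigl(1 + \exp(x)\bigr)

\frac{d}{dx}\sigma(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr), \qquad 1 - \sigma(x) = \sigma(-x)

\log \sigma(x) = -\zeta(-x), \qquad \frac{d}{dx}\zeta(x) = \sigma(x), \qquad \zeta(x) - \zeta(-x) = x

\sigma^{-1}(x) = \log\frac{x}{1 - x} \;\;\text{(the logit)}, \qquad \zeta^{-1}(x) = \log\bigl(\exp(x) - 1\bigr)
```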

[v] 3.11 Bayes' Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

[v] 3.12 Technical Details of Continuous Variables . . . . . . . . . . . . . 69

What paradoxes can continuous variables run into?

What are measure zero and positive measure? What is the distortion of space by a function?

[v] 3.13 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 71

Diagram: entropy

What is entropy? What is binary cross-entropy?
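
Entropy as defined in section 3.13, plus the binary cross-entropy loss commonly derived from it for a Bernoulli label y and prediction ŷ:

```latex
% Shannon entropy: the expected self-information of a distribution P.
H(\mathrm{x}) = -\mathbb{E}_{x \sim P}[\log P(x)] = -\sum_{x} P(x) \log P(x)

% Binary cross-entropy between a label y \in \{0, 1\} and a prediction \hat{y} \in (0, 1):
\ell(y, \hat{y}) = -\bigl(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\bigr)
```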

[v] 3.14 Structured Probabilistic Models . . . . . . . . . . . . . . . . . . . 73

Brief review: structured probabilistic models (thinking-in-English version)

4 Numerical Computation

Diagram: machine learning by nature devours computational resources

[v] 4.1 Overflow and Underflow . . . . . . . . . . . . . . . . . . . . . . . 78

Diagrams: underflow and overflow, the problems caused by the limits of numerical precision
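
The standard example from section 4.1 is softmax: computed naively it overflows for large inputs (and underflows for very negative ones), but subtracting the maximum first leaves the result unchanged and keeps the exponentials in range. A minimal numpy sketch:

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(x)               # exp(1000) overflows to inf
    return e / e.sum()

def softmax_stable(x):
    z = x - x.max()             # the shift cancels between numerator and denominator
    e = np.exp(z)
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
print(softmax_naive(x))         # [nan nan nan] plus an overflow warning
print(softmax_stable(x))        # [0.09003057 0.24472847 0.66524096]
```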

[v] 4.2 Poor Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . 80

Diagram: poor conditioning
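
A quick illustration with made-up matrices; np.linalg.cond reports the condition number from the singular values, which is close in spirit to the eigenvalue ratio max |λi / λj| used in section 4.2. A large value means that solving a linear system with the matrix greatly amplifies small input errors.

```python
import numpy as np

well_conditioned = np.array([[2.0, 0.0],
                             [0.0, 1.0]])
ill_conditioned = np.array([[1.0, 1.0],
                            [1.0, 1.0001]])   # nearly singular

print(np.linalg.cond(well_conditioned))   # 2.0
print(np.linalg.cond(ill_conditioned))    # ~4e4: tiny input changes -> huge output changes
```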

[v] 4.3 Gradient-Based Optimization . . . . . . . . . . . . . . . . . . . . 80

Diagram: gradient-based optimization (a small gradient-descent sketch follows the pointers below)

Jacobian matrix 5 videos (khanacademy)

Hessian matrix (khanacademy)

curvature intuition

Brief review: beyond the gradient
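
A minimal gradient-descent sketch for section 4.3; the quadratic objective, step size, and iteration count below are arbitrary choices for illustration:

```python
import numpy as np

# Minimize f(x) = 0.5 * ||A x - b||^2, whose gradient is A^T (A x - b).
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
b = np.array([2.0, 3.0])

def grad(x):
    return A.T @ (A @ x - b)

x = np.zeros(2)
lr = 0.1                      # learning rate (step size epsilon)
for _ in range(200):
    x = x - lr * grad(x)      # move against the gradient, the steepest-descent direction

print(x)                      # close to the true minimizer [1., 3.]
```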

[ ] 4.4 Constrained Optimization . . . . . . . . . . . . . . . . . . . . . . 91

constrained optimization and Lagrangian khanacademy

[ ] 4.5 Example: Linear Least Squares . . . . . . . . . . . . . . . . . . . 94

matrix inverse khanacademy

5 Machine Learning Basics

Diagram: the characteristics of machine learning

[v] 5.1 Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 97 (notes)

Diagram: classification problems and probability distributions

Probabilistic modeling/learning: classification with missing data. Relevant course: Introduction to Statistical Learning (course site)

[ ] 5.2 Capacity, Overfitting and Underfitting . . . . . . . . . . . . . . . 108

[ ] 5.3 Hyperparameters and Validation Sets . . . . . . . . . . . . . . . . 118

[ ] 5.4 Estimators, Bias and Variance . . . . . . . . . . . . . . . . . . . . 120

[ ] 5.5 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . 129

[ ] 5.6 Bayesian Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 133

[ ] 5.7 Supervised Learning Algorithms . . . . . . . . . . . . . . . . . . . 137

[ ] 5.8 Unsupervised Learning Algorithms . . . . . . . . . . . . . . . . . 142

[ ] 5.9 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . 149

[ ] 5.10 Building a Machine Learning Algorithm . . . . . . . . . . . . . . 151

[ ] 5.11 Challenges Motivating Deep Learning . . . . . . . . . . . . . . . . 152

## II Deep Networks: Modern Practices

6 Deep Feedforward Networks

[ ] 6.1 Example: Learning XOR . . . . . . . . . . . . . . . . . . . . . . . 167

[ ] 6.2 Gradient-Based Learning . . . . . . . . . . . . . . . . . . . . . . . 172

[ ] 6.3 Hidden Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

[ ] 6.4 Architecture Design . . . . . . . . . . . . . . . . . . . . . . . . . . 193

[ ] 6.5 Back-Propagation and Other Differentiation Algorithms . . . . . . 200

[ ] 6.6 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220

7 Regularization for Deep Learning

[ ] 7.1 Parameter Norm Penalties . . . . . . . . . . . . . . . . . . . . . . 226

[ ] 7.2 Norm Penalties as Constrained Optimization . . . . . . . . . . . . 233

[ ] 7.3 Regularization and Under-Constrained Problems . . . . . . . . . 235

[ ] 7.4 Dataset Augmentation . . . . . . . . . . . . . . . . . . . . . . . . 236

[ ] 7.5 Noise Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . 238

[ ] 7.6 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . . . 240

[ ] 7.7 Multitask Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 241

[ ] 7.8 Early Stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241

[ ] 7.9 Parameter Tying and Parameter Sharing . . . . . . . . . . . . . . 249

[ ] 7.10 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . 251

[ ] 7.11 Bagging and Other Ensemble Methods . . . . . . . . . . . . . . . 253

[ ] 7.12 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255

[ ] 7.13 Adversarial Training . . . . . . . . . . . . . . . . . . . . . . . . . 265

[ ] 7.14 Tangent Distance, Tangent Prop and Manifold Tangent Classifier . 267

8 Optimization for Training Deep Models

[ ] 8.1 How Learning Differs from Pure Optimization . . . . . . . . . . . 272

[ ] 8.2 Challenges in Neural Network Optimization . . . . . . . . . . . . 279

[ ] 8.3 Basic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 290

[ ] 8.4 Parameter Initialization Strategies . . . . . . . . . . . . . . . . . 296

[ ] 8.5 Algorithms with Adaptive Learning Rates . . . . . . . . . . . . . 302

[ ] 8.6 Approximate Second-Order Methods . . . . . . . . . . . . . . . . 307

[ ] 8.7 Optimization Strategies and Meta-Algorithms . . . . . . . . . . . 313

9 Convolutional Networks 326

[ ] 9.1 The Convolution Operation . . . . . . . . . . . . . . . . . . . . . 327

[ ] 9.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329

[ ] 9.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335

[ ] 9.4 Convolution and Pooling as an Infinitely Strong Prior . . . . . . . 339

[ ] 9.5 Variants of the Basic Convolution Function . . . . . . . . . . . . 342

[ ] 9.6 Structured Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . 352

[ ] 9.7 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354

[ ] 9.8 Efficient Convolution Algorithms . . . . . . . . . . . . . . . . . . 356

[ ] 9.9 Random or Unsupervised Features . . . . . . . . . . . . . . . . . 356

[ ] 9.10 The Neuroscientific Basis for Convolutional Networks . . . . . . . 358

[ ] 9.11 Convolutional Networks and the History of Deep Learning . . . . 365

10 Sequence Modeling: Recurrent and Recursive Nets 367

[ ] 10.1 Unfolding Computational Graphs . . . . . . . . . . . . . . . . . . 369

[ ] 10.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . 372

[ ] 10.3 Bidirectional RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . 388

[ ] 10.4 Encoder-Decoder Sequence-to-Sequence Architectures . 390

[ ] 10.5 Deep Recurrent Networks . . . . . . . . . . . . . . . . . . . . . . 392

[ ] 10.6 Recursive Neural Networks . . . . . . . . . . . . . . . . . . . . . . 394

[ ] 10.7 The Challenge of Long-Term Dependencies . . . . . . . . . . . . . 396

[ ] 10.8 Echo State Networks . . . . . . . . . . . . . . . . . . . . . . . . . 399

[ ] 10.9 Leaky Units and Other Strategies for Multiple Time Scales . 402

[ ] 10.10 The Long Short-Term Memory and Other Gated RNNs . . . . . . 404

[ ] 10.11 Optimization for Long-Term Dependencies . . . . . . . . . . . . . 408

[ ] 10.12 Explicit Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . 412

11 Practical Methodology 416

[ ] 11.1 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 417

[ ] 11.2 Default Baseline Models . . . . . . . . . . . . . . . . . . . . . . . 420

[ ] 11.3 Determining Whether to Gather More Data . . . . . . . . . . . . 421

[ ] 11.4 Selecting Hyperparameters . . . . . . . . . . . . . . . . . . . . . . 422

[ ] 11.5 Debugging Strategies . . . . . . . . . . . . . . . . . . . . . . . . . 431

[ ] 11.6 Example: Multi-Digit Number Recognition . . . . . . . . . . . . . 435

12 Applications 438

[ ] 12.1 Large-Scale Deep Learning . . . . . . . . . . . . . . . . . . . . . . 438

[ ] 12.2 Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . 447

[ ] 12.3 Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 453

[ ] 12.4 Natural Language Processing . . . . . . . . . . . . . . . . . . . . 456

[ ] 12.5 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 473

## III Deep Learning Research 482

13 Linear Factor Models 485

[ ] 13.1 Probabilistic PCA and Factor Analysis . . . . . . . . . . . . . . . 486

[ ] 13.2 Independent Component Analysis (ICA) . . . . . . . . . . . . . . 487

[ ] 13.3 Slow Feature Analysis . . . . . . . . . . . . . . . . . . . . . . . . 489

[ ] 13.4 Sparse Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492

[ ] 13.5 Manifold Interpretation of PCA . . . . . . . . . . . . . . . . . . . 496

14 Autoencoders 499

[ ] 14.1 Undercomplete Autoencoders . . . . . . . . . . . . . . . . . . . . 500

[ ] 14.2 Regularized Autoencoders . . . . . . . . . . . . . . . . . . . . . . 501

[ ] 14.3 Representational Power, Layer Size and Depth . . . . . . . . . . . 505

[ ] 14.4 Stochastic Encoders and Decoders . . . . . . . . . . . . . . . . . . 506

[ ] 14.5 Denoising Autoencoders . . . . . . . . . . . . . . . . . . . . . . . 507

[ ] 14.6 Learning Manifolds with Autoencoders . . . . . . . . . . . . . . . 513

[ ] 14.7 Contractive Autoencoders . . . . . . . . . . . . . . . . . . . . . . 518

[ ] 14.8 Predictive Sparse Decomposition . . . . . . . . . . . . . . . . . . 521

[ ] 14.9 Applications of Autoencoders . . . . . . . . . . . . . . . . . . . . 522

15 Representation Learning 524

[ ] 15.1 Greedy Layer-Wise Unsupervised Pretraining . . . . . . . . . . . 526

[ ] 15.2 Transfer Learning and Domain Adaptation . . . . . . . . . . . . . 534

[ ] 15.3 Semi-Supervised Disentangling of Causal Factors . . . . . . . . . 539

[ ] 15.4 Distributed Representation . . . . . . . . . . . . . . . . . . . . . . 544

[ ] 15.5 Exponential Gains from Depth . . . . . . . . . . . . . . . . . . . 550

[ ] 15.6 Providing Clues to Discover Underlying Causes . . . . . . . . . . 552

16 Structured Probabilistic Models for Deep Learning 555

[ ] 16.1 The Challenge of Unstructured Modeling . . . . . . . . . . . . . . 556

[ ] 16.2 Using Graphs to Describe Model Structure . . . . . . . . . . . . . 560

[ ] 16.3 Sampling from Graphical Models . . . . . . . . . . . . . . . . . . 577

[ ] 16.4 Advantages of Structured Modeling . . . . . . . . . . . . . . . . . 579

[ ] 16.5 Learning about Dependencies . . . . . . . . . . . . . . . . . . . . 579

[ ] 16.6 Inference and Approximate Inference . . . . . . . . . . . . . . . . 580

[ ] 16.7 The Deep Learning Approach to Structured Probabilistic Models . 581

17 Monte Carlo Methods 587

[ ] 17.1 Sampling and Monte Carlo Methods . . . . . . . . . . . . . . . . 587

[ ] 17.2 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . 589

[ ] 17.3 Markov Chain Monte Carlo Methods . . . . . . . . . . . . . . . . 592

[ ] 17.4 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596

[ ] 17.5 The Challenge of Mixing between Separated Modes . . . . . . . . 597

18 Confronting the Partition Function 603

[ ] 18.1 The Log-Likelihood Gradient . . . . . . . . . . . . . . . . . . . . 604

[ ] 18.2 Stochastic Maximum Likelihood and Contrastive Divergence . . . 605

[ ] 18.3 Pseudolikelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . 613

[ ] 18.4 Score Matching and Ratio Matching . . . . . . . . . . . . . . . . 615

[ ] 18.5 Denoising Score Matching . . . . . . . . . . . . . . . . . . . . . . 617

[ ] 18.6 Noise-Contrastive Estimation . . . . . . . . . . . . . . . . . . . . 618

[ ] 18.7 Estimating the Partition Function . . . . . . . . . . . . . . . . . . 621

19 Approximate Inference 629

[ ] 19.1 Inference as Optimization . . . . . . . . . . . . . . . . . . . . . . 631

[ ] 19.2 Expectation Maximization . . . . . . . . . . . . . . . . . . . . . . 632

[ ] 19.3 MAP Inference and Sparse Coding . . . . . . . . . . . . . . . . . 633

[ ] 19.4 Variational Inference and Learning . . . . . . . . . . . . . . . . . 636

[ ] 19.5 Learned Approximate Inference . . . . . . . . . . . . . . . . . . . 648

20 Deep Generative Models 651

[ ] 20.1 Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . 651

[ ] 20.2 Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . 653

[ ] 20.3 Deep Belief Networks . . . . . . . . . . . . . . . . . . . . . . . . . 657

[ ] 20.4 Deep Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . 660

[ ] 20.5 Boltzmann Machines for Real-Valued Data . . . . . . . . . . . . . 673

[ ] 20.6 Convolutional Boltzmann Machines . . . . . . . . . . . . . . . . . 679

[ ] 20.7 Boltzmann Machines for Structured or Sequential Outputs . . . . 681

[ ] 20.8 Other Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . 683

[ ] 20.9 Back-Propagation through Random Operations . . . . . . . . . . 684

[ ] 20.10 Directed Generative Nets . . . . . . . . . . . . . . . . . . . . . . . 688

[ ] 20.11 Drawing Samples from Autoencoders . . . . . . . . . . . . . . . . 707

[ ] 20.12 Generative Stochastic Networks . . . . . . . . . . . . . . . . . . . 710

[ ] 20.13 Other Generation Schemes . . . . . . . . . . . . . . . . . . . . . . 712

[ ] 20.14 Evaluating Generative Models . . . . . . . . . . . . . . . . . . . . 713

[ ] 20.15 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 716

