為什麼Eigen里的矩陣加法要比矩陣-向量乘法慢？

為什麼在Eigen中，n*n矩陣加法要比n*n矩陣乘以n*1向量慢？（大約前者是後者時間兩倍）
理論上前者只需要n^2次加法，但是後者需要n*(n-1)次加法和n^2次乘法。
#include & #include & #include & #include & #include & #include & #include & #include &

using namespace Eigen;
using namespace std;

int main()
{
const int l=100;
MatrixXf m=MatrixXf::Random(l,l);
MatrixXf n=MatrixXf::Random(l,l);
VectorXf v=VectorXf::Random(l,1);

MatrixXf qq=MatrixXf::Random(l,1);
MatrixXf pp=MatrixXf::Random(l,l);

auto start = chrono::steady_clock::now();
for(int j=0;j&<10000;j++) qq=m*v; auto end = chrono::steady_clock::now(); double time_duration=chrono::duration_cast&(end - start).count();
std::cout &(end1 - start1).count();
std::cout &
Test 1: Without any optimization:
compile command: g++-8 -test.cpp -o test
run command: ./test

Elapsed time in seconds : 0.323s
Elapsed time in seconds : 0.635s
Test 2: With -march=native optimization:
g++-8 test.cpp -march=native -o test
run command: ./test
Elapsed time in seconds : 0.21s
Elapsed time in seconds : 0.372s
Test 3: With -O3 optimization:
compile command: g++-8 -test.cpp -O3 -o test
run command: ./test

Elapsed time in seconds : 0.009s
Elapsed time in seconds : 0.016s
Test 4: With -march=native, -O3 optimization:
compile command: g++-8 -test.cpp -march=native -O3 -o test
run command: ./test
Elapsed time in seconds : 0.008s
Elapsed time in seconds : 0.016s
具體說明參見我在stackoverflow的提問：Why Matrix Addition is slower than Matrix-Vector Multiplication in Eigen?

我做了兩個小實驗：

把VectorXf v的type換成了MatrixXf，前者的時間明顯超過了後者，開了O3的話前者是後者的兩倍，說明matrix-matrix乘法和matrix-vector的乘法實現上不同；
我打開了src/core/products，打開了matrixmatrix和matrixvector的兩個general的header file，發現前者對於sequential的方法沒有太多優化（如果沒有OpenMP），而後者對於可以vectorizable的某些情況進行了優化。

於是我直接把 l=100 改成了 l=97，得到了

./test

Elapsed time in seconds : 0.013s

Elapsed time in seconds : 0.016s

優化基本消失，而當我再改回 l = 100 的時候，得到了

./test

Elapsed time in seconds : 0.009s

Elapsed time in seconds : 0.017s

而具體並行優化的方法，可以看到是調用了pstore/pset1/pmadd/predux/pmul等方法，而這些都是用了指令集優化（這裡面每一個並行的操作，都在SSE/AVX/AVX512/GPU/NEON/ZVector等多種架構里定義了）。

說明當matrix乘以vector並且dimension是4或者8的倍數的時候，Eigen會直接對其進行指令集上的並行優化，但具體要看方法的實現。而matrix-matrix的sequential的方法里，並沒有這些優化。