之前提到的暴力搜索雖然使用了並行和指令優化的策略來提升查詢性能,但數據一旦變大,效率還是很難達到要求,於是就有了常規搜索中經常出現的倒排索引出現

數據結構

struct IndexIVF: Index {
size_t nlist; ///< number of possible key values
size_t nprobe; ///< number of probes at query time

Index * quantizer; ///< quantizer that maps vectors to inverted lists

/**
* = 0: use the quantizer as index in a kmeans training
* = 1: just pass on the training set to the train() of the quantizer
* = 2: kmeans training on a flat index + add the centroids to the quantizer
*/
char quantizer_trains_alone;
bool own_fields; ///< whether object owns the quantizer

ClusteringParameters cp; ///< to override default clustering params
Index *clustering_index; ///< to override index used during clustering

std::vector < std::vector<long> > ids; ///< Inverted lists for indexes

size_t code_size; ///< code size per vector in bytes
std::vector < std::vector<uint8_t> > codes; // binary codes, size nlist

/// map for direct access to the elements. Enables reconstruct().
bool maintain_direct_map;
std::vector <long> direct_map;

/** The Inverted file takes a quantizer (an Index) on input,
* which implements the function mapping a vector to a list
* identifier. The pointer is borrowed: the quantizer should not
* be deleted while the IndexIVF is in use.
*/
IndexIVF (Index * quantizer, size_t d, size_t nlist,
MetricType metric = METRIC_INNER_PRODUCT);

重點關注

nlist: 倒排表的長度,對應聚類的中心點數

nprobe: 每次查詢需要查的倒排list的個數

quantizer:量化器,或者說是碼錶,存儲所有的聚類中心點

cp: 聚類配置(見文檔-Faiss聚類實現)

ids: 倒排向量ID表,由嵌套的vector實現,一個二維數組

code_size: 向量的位元組長度

codes: 倒排向量code表,向量的code就是向量本身

初始化

faiss::IndexFlatL2 quantizer(d); // the other index
faiss::IndexIVFFlat index(&quantizer, d, nlist, faiss::METRIC_L2);
IndexIVFFlat::IndexIVFFlat (Index * quantizer,
size_t d, size_t nlist, MetricType metric):
IndexIVF (quantizer, d, nlist, metric)
{
code_size = sizeof(float) * d;
}

向量的 code是向量本身 d維float, 所以code size 是sizeof(float) * d

IndexIVF::IndexIVF (Index * quantizer, size_t d, size_t nlist,
MetricType metric):
Index (d, metric),
nlist (nlist),
nprobe (1),
quantizer (quantizer),
quantizer_trains_alone (0),
own_fields (false),
clustering_index (nullptr),
ids (nlist),
maintain_direct_map (false)
{
FAISS_THROW_IF_NOT (d == quantizer->d);
is_trained = quantizer->is_trained && (quantizer->ntotal == nlist);
// Spherical by default if the metric is inner_product
if (metric_type == METRIC_INNER_PRODUCT) {
cp.spherical = true;
}
// here we set a low # iterations because this is typically used
// for large clusterings (nb this is not used for the MultiIndex,
// for which quantizer_trains_alone = true)
cp.niter = 10;
cp.verbose = verbose;
code_size = 0; // let sub-classes set this
codes.resize(nlist);
}

quantizer:量化器

d:向量維度

nlist:聚類個數

MetricType: 向量距離衡量類型:傳入的是METRIC_L2

訓練

1) 輸入

index.train(nb, xb);

nb: 待索引向量的個數

xb: 待索引向量的數組

2) 訓練初聚類

Clustering clus (d, nlist, cp);
quantizer->reset();
.....
clus.train (n, x, *quantizer);
.....

n: 待索引向量的個數

x: 待索引向量的數組

關於聚類的細節請看文檔 Faiss的聚類實現

最後將所有(nlist個)聚類中心(向量)存儲在量化器quantizer中

3) 訓練量化殘差

void IndexIVF::train_residual(idx_t /*n*/, const float* /*x*/) {
if (verbose)
printf("IndexIVF: no residual training
");
// does nothing by default
}

啥也不幹

所以訓練就是一次粗聚類

索引

1) 輸入

index.add(nb, xb);

nb: 待索引向量的個數

xb: 待索引向量的數組

2) 遍歷找到每個向量最近的聚類中心點

void Index::assign (idx_t n, const float * x, idx_t * labels, idx_t k)
{
float * distances = new float[n * k];
ScopeDeleter<float> del(distances);
search (n, x, k, distances, labels);
}

nb: 待索引向量的個數

xb: 待索引向量的數組

labels: 存放每個向量距離最近的中心點ID

k: 1 (取 top 1)

其中k=1, 即查詢最近的一個

使用ScopeDeleter<float> del(distances),是因為不需要distances, 只需要labels, 函數退出,distances回收掉,留下最近的聚類中心點id數組labels

search (n, x, k, distances, labels); 的細節見文檔 Faiss暴力搜索實現

3) 建立倒排表

3.1) 遍歷所有向量

for (size_t i = 0; i < n; i++) {

3.2) 建立倒排id list

long id = xids ? xids[i] : ntotal + i;
long list_no = idx [i];
.....
ids[list_no].push_back (id);

idx是步驟2)返回的那個距離每個向量最近的聚類中心ID 數組

取出向量對應的那個距離最近的聚類中心ID

將向量ID加入到該聚類中心對應的倒排list

3.3) 建立倒排code list

const float *xi = x + i * d;
/* store the vectors */
size_t ofs = codes[list_no].size();
codes[list_no].resize(ofs + code_size);
memcpy(codes[list_no].data() + ofs,xi, code_size);

倒排code存放的就是向量本身(d個float)

搜索

1) 輸入

index.search(nq, xq, k, D, I);

nq: 查詢向量個數

xq: 查詢向量數組

k: 每個查詢向量返回個數 top k

D:存放返回的top k個距離

I: 存放返回的top k 個索引向量ID

2) 粗查詢(查詢聚類中心)

long * idx = new long [n * nprobe];
ScopeDeleter<long> del (idx);
float * coarse_dis = new float [n * nprobe];
ScopeDeleter<float> del2 (coarse_dis);

quantizer->search (n, x, nprobe, coarse_dis, idx);

nprobe: 返回幾個最近聚類中心點

quantizer->search (n, x, nprobe, coarse_dis, idx); 實現細節見文檔 Faiss暴力搜索實現

搜出查詢向量最近的npobe個聚類中心點ID及對應的距離

3) 查詢倒排表前的距離類型判斷

void IndexIVFFlat::search_preassigned (idx_t n, const float *x, idx_t k,
const idx_t *idx,
const float * /* coarse_dis */,
float *distances, idx_t *labels,
bool store_pairs) const
{
if (metric_type == METRIC_INNER_PRODUCT) {
float_minheap_array_t res = {
size_t(n), size_t(k), labels, distances};
search_knn_inner_product (*this, n, x, idx, &res, store_pairs);

} else if (metric_type == METRIC_L2) {
float_maxheap_array_t res = {
size_t(n), size_t(k), labels, distances};
search_knn_L2sqr (*this, n, x, idx, &res, store_pairs);
}
}

和暴力搜索一樣,初始化返回的最大堆,進入實質查詢

4) 查詢倒排表

void search_knn_L2sqr (const IndexIVFFlat &ivf,
size_t nx,
const float * x,
const long * keys,
float_maxheap_array_t * res,
bool store_pairs)
{
const size_t k = res->k;
size_t nlistv = 0, ndis = 0;
size_t d = ivf.d;
#pragma omp parallel for reduction(+: nlistv, ndis)
for (size_t i = 0; i < nx; i++) {
const float * xi = x + i * d;
const long * keysi = keys + i * ivf.nprobe;
float * __restrict disi = res->get_val (i);
long * __restrict idxi = res->get_ids (i);
maxheap_heapify (k, disi, idxi);

for (size_t ik = 0; ik < ivf.nprobe; ik++) {
long key = keysi[ik]; /* select the list */
if (key < 0) {
// not enough centroids for multiprobe
continue;
}
FAISS_THROW_IF_NOT_FMT (
key < (long) ivf.nlist,
"Invalid key=%ld at ik=%ld nlist=%ld
",
key, ik, ivf.nlist);

nlistv++;
const size_t list_size = ivf.ids[key].size();
const float * list_vecs = (const float*)(ivf.codes[key].data());

for (size_t j = 0; j < list_size; j++) {
const float * yj = list_vecs + d * j;
float disij = fvec_L2sqr (xi, yj, d);
if (disij < disi[0]) {
maxheap_pop (k, disi, idxi);
long id = store_pairs ? (key << 32 | j) : ivf.ids[key][j];
maxheap_push (k, disi, idxi, disij, id);
}
}
ndis += list_size;
}
maxheap_reorder (k, disi, idxi);
}
indexIVFFlat_stats.nq += nx;
indexIVFFlat_stats.nlist += nlistv;
indexIVFFlat_stats.ndis += ndis;
}

本段代碼寫的極為精簡,所以copy過來慢慢講,和暴力搜索過程極為類似

4.1) 初始化統計變數

size_t nlistv = 0, ndis = 0;

nlistv : 統計遍歷的倒排list個數

ndis: 統計比對過的向量個數

4.2) 開啟openmp

#pragma omp parallel for reduction(+: nlistv, ndis)
for (size_t i = 0; i < nx; i++) {

累計各線程計算的nlistv和ndis, 形成全局的nlistv和ndis

開啟並行的查詢

4.3) 準備好當前查詢的變數

const float * xi = x + i * d;
const long * keysi = keys + i * ivf.nprobe;
float * __restrict disi = res->get_val (i);
long * __restrict idxi = res->get_ids (i);
maxheap_heapify (k, disi, idxi);

定位到當前的查詢向量

定位到當前查詢向量對應的粗聚類中心ID

定位到當前查詢對應的返回最大堆

4.4) 開始遍歷top k 個粗聚類中心

for (size_t ik = 0; ik < ivf.nprobe; ik++) {

4.5) 確定但前的聚類中心ID

long key = keysi[ik]; /* select the list */
if (key < 0) {
// not enough centroids for multiprobe
continue;
}
FAISS_THROW_IF_NOT_FMT (
key < (long) ivf.nlist,
"Invalid key=%ld at ik=%ld nlist=%ld
",
key, ik, ivf.nlist);

nlistv++;

獲取當前對應的聚類中心ID並校驗,然後累加nlistv (一個聚類中心點對應一個倒排list)

4.6) 核心的計算

const size_t list_size = ivf.ids[key].size();
const float * list_vecs = (const float*)(ivf.codes[key].data());

for (size_t j = 0; j < list_size; j++) {
const float * yj = list_vecs + d * j;
float disij = fvec_L2sqr (xi, yj, d);
if (disij < disi[0]) {
maxheap_pop (k, disi, idxi);
long id = store_pairs ? (key << 32 | j) : ivf.ids[key][j];
maxheap_push (k, disi, idxi, disij, id);
}
}
ndis += list_size;

獲取當前聚類中心對應的倒排list

遍歷倒排list的每個向量

和查詢向量一一計算距離fvec_L2sqr (xi, yj, d) 細節見 文檔 Faiss暴力搜索實現

將結果放入返回的最大堆

最後累加比對的向量的個數

4.7) 最大堆排序

maxheap_reorder (k, disi, idxi);

細節見文檔 Faiss暴力搜索實現

4.8) 更新統計

indexIVFFlat_stats.nq += nx;
indexIVFFlat_stats.nlist += nlistv;
indexIVFFlat_stats.ndis += ndis;

小結

在暴力搜索的基礎上,通過聚類將索引數據進行了分片,

和查詢和暴力搜索相比,通過代價很小的粗查詢(只需要比較少量的聚類中心點)迅速定位到某些分片,

使得計算向量距離的次數大大縮小,從而提升了性能


推薦閱讀:
相關文章