
 


In our continued effort to equip developers and organizations with advanced search tools, we are thrilled to announce the launch of several new features in the latest preview API for Azure AI Search. These enhancements are designed to optimize vector index size and to give you more granular control over, and insight into, your search index when building Retrieval-Augmented Generation (RAG) applications.


 


MRL Support for Quantization


Matryoshka Representation Learning (MRL) is a technique that introduces a different form of vector compression, one that complements and works independently of existing quantization methods. MRL gives you the flexibility to truncate embeddings without significant semantic loss, offering a balance between vector size and information retention.

This technique works by training embedding models so that information density increases towards the beginning of the vector. As a result, even when using only a prefix of the original vector, much of the key information is preserved, allowing for shorter vector representations without a substantial drop in performance.
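To make the mechanics concrete, here is a minimal sketch (plain NumPy, with a random vector standing in for a real model output) of keeping only a prefix of an embedding and re-normalizing it, since cosine similarity assumes unit-length vectors:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of an MRL embedding and re-normalize.

    MRL-trained models concentrate information at the front of the vector,
    so the prefix remains a usable embedding on its own.
    """
    prefix = vec[:dim]
    return prefix / np.linalg.norm(prefix)

# Example: shrink a 3072-dim embedding (e.g., text-embedding-3-large) to 1024 dims.
full = np.random.rand(3072).astype(np.float32)  # stand-in for a real embedding
short = truncate_embedding(full, 1024)
print(short.shape)  # (1024,)
```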

OpenAI has integrated MRL into their ‘text-embedding-3-small’ and ‘text-embedding-3-large’ models, making them adaptable for use in scenarios where compressed embeddings are needed while maintaining high retrieval accuracy. You can read more about the underlying research in the official paper [1] or learn about the latest OpenAI embedding models in their blog.
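In practice you rarely need to truncate manually: the OpenAI embeddings API exposes a `dimensions` parameter that returns a shortened, re-normalized embedding directly. A quick sketch with the `openai` Python client:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Request a 1024-dim embedding from a model whose native size is 3072.
resp = client.embeddings.create(
    model="text-embedding-3-large",
    input="Azure AI Search supports MRL-compressed vectors.",
    dimensions=1024,  # MRL truncation handled server-side
)
embedding = resp.data[0].embedding
print(len(embedding))  # 1024
```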


 


Storage Compression Comparison


Table 1.1 below highlights the different configurations for vector compression, comparing standard uncompressed vectors, Scalar Quantization (SQ), and Binary Quantization (BQ) with and without MRL. The compression ratio demonstrates how efficiently the vector index size can be optimized, yielding significant cost savings. You can find more about our vector index size limits here: Service limits for tiers and SKUs – Azure AI Search | Microsoft Learn.


 


Table 1.1: Vector Index Size Compression Comparison

| Configuration | *Compression Ratio |
| --- | --- |
| Uncompressed | 1x (baseline) |
| SQ | 4x |
| BQ | 28x |
| **MRL + SQ (1/2 and 1/3 truncation dimension respectively) | 8x-12x |
| **MRL + BQ (1/2 and 1/3 truncation dimension respectively) | 64x-96x |



 


Note: Compression ratios depend on embedding dimensions and truncation. For instance, using “text-embedding-3-large” with 3072 dimensions truncated to 1024 dimensions can result in 96x compression with Binary Quantization.


*All compression methods listed above may see slightly lower effective compression ratios due to overhead introduced by the index data structures. See "Memory overhead from selected algorithm" for more details.


**The compression impact of MRL depends on the chosen truncation dimension. We recommend truncating to either 1/2 or 1/3 of the original dimensions to preserve embedding quality (see Table 1.2 below).
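To pair MRL with quantization in Azure AI Search, the truncation dimension is configured on the compression entry of the index's vectorSearch section. The sketch below shows the relevant fragment of an index definition; property names such as `truncationDimension` and the rescoring settings follow the preview REST reference, so verify them against the API version you target:

```python
import requests

# Hypothetical service details; replace with your own.
endpoint = "https://<service>.search.windows.net"
index_name = "docs-index"
api_key = "<admin-key>"
api_version = "2024-11-01-preview"  # assumption: preview version with MRL support

index_definition = {
    "name": index_name,
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {
            "name": "contentVector",
            "type": "Collection(Edm.Single)",
            "dimensions": 3072,  # native size of text-embedding-3-large
            "vectorSearchProfile": "compressed-profile",
            "searchable": True,
        },
    ],
    "vectorSearch": {
        "algorithms": [{"name": "hnsw-1", "kind": "hnsw"}],
        "compressions": [
            {
                "name": "bq-mrl",
                "kind": "binaryQuantization",
                "truncationDimension": 1024,        # MRL: keep the first 1/3 of the dimensions
                "rerankWithOriginalVectors": True,  # oversample + rerank to recover quality
                "defaultOversampling": 2.0,         # note: newer previews may group these
                                                    # under a rescoring options object
            }
        ],
        "profiles": [
            {"name": "compressed-profile", "algorithm": "hnsw-1", "compression": "bq-mrl"}
        ],
    },
}

requests.put(
    f"{endpoint}/indexes/{index_name}?api-version={api_version}",
    json=index_definition,
    headers={"api-key": api_key, "Content-Type": "application/json"},
)
```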


 


Quality Retention


Table 1.2 provides a detailed view of the quality retention when using MRL with quantization across different models and configurations. The results show the impact on Mean NDCG@10 across a subset of MTEB datasets, indicating that even high levels of compression can preserve up to 99% of search quality, particularly with BQ and MRL.


 


Table 1.2: Impact of MRL on Mean NDCG@10 Across MTEB Subset

| Model Name | Original Dimension | MRL Dimension | Quantization Algorithm | No Rerank (% Δ) | Rerank 2x Oversampling (% Δ) |
| --- | --- | --- | --- | --- | --- |
| OpenAI text-embedding-3-small | 1536 | 512 | SQ | -2.00% (Δ = 1.155) | -0.0004% (Δ = 0.0002) |
| OpenAI text-embedding-3-small | 1536 | 512 | BQ | -15.00% (Δ = 7.5092) | -0.11% (Δ = 0.0554) |
| OpenAI text-embedding-3-small | 1536 | 768 | SQ | -2.00% (Δ = 0.8128) | -1.60% (Δ = 0.8128) |
| OpenAI text-embedding-3-small | 1536 | 768 | BQ | -10.00% (Δ = 5.0104) | -0.01% (Δ = 0.0044) |
| OpenAI text-embedding-3-large | 3072 | 1024 | SQ | -1.00% (Δ = 0.616) | -0.02% (Δ = 0.0118) |
| OpenAI text-embedding-3-large | 3072 | 1024 | BQ | -7.00% (Δ = 3.9478) | -0.58% (Δ = 0.3184) |
| OpenAI text-embedding-3-large | 3072 | 1536 | SQ | -1.00% (Δ = 0.3184) | -0.08% (Δ = 0.0426) |
| OpenAI text-embedding-3-large | 3072 | 1536 | BQ | -5.00% (Δ = 2.8062) | -0.06% (Δ = 0.0356) |



 


Table 1.2 reports the relative point differences in Mean NDCG@10 between indexes using different MRL dimensions (1/2 and 1/3 of the original dimensions) and an uncompressed index, across OpenAI text-embedding models.


 


Key Takeaways:



  • 99% Search Quality with BQ + MRL + Oversampling: Combining Binary Quantization (BQ) with oversampling and Matryoshka Representation Learning (MRL) retained 99% of the original search quality in the dataset and embedding combinations we tested, even at up to 96x compression, making it ideal for reducing storage while maintaining high retrieval performance.

  • Flexible Embedding Truncation: MRL enables dynamic embedding truncation with minimal accuracy loss, providing a balance between storage efficiency and search quality.

  • No Latency Impact Observed: Our experiments also indicated that using MRL had no noticeable latency impact, supporting efficient performance even at high compression rates.


For more details on how MRL works and how to implement it, visit the MRL Support Documentation.


 


Targeted Vector Filtering


Targeted Vector Filtering allows you to apply filters specifically to the vector component of hybrid search queries. This fine-grained control ensures that your filters enhance the relevance of vector search results without inadvertently affecting keyword-based searches.
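As a sketch of what this looks like in a query, assuming the preview REST syntax in which each entry in `vectorQueries` can carry its own `filterOverride` that replaces the top-level `filter` for that vector query (verify the property name against the preview reference):

```python
import requests

endpoint = "https://<service>.search.windows.net"  # hypothetical service details
index_name = "docs-index"
api_key = "<query-key>"
api_version = "2024-11-01-preview"  # assumption: preview version exposing filterOverride

query = {
    "search": "managed identity setup",   # keyword half of the hybrid query
    "filter": "language eq 'en'",         # applies to the keyword results
    "vectorQueries": [
        {
            "kind": "vector",
            "fields": "contentVector",
            "vector": [0.0123, -0.0456],  # truncated for readability; pass the full embedding
            "k": 50,
            # Assumption: replaces the top-level filter for this vector query only.
            "filterOverride": "category eq 'how-to'",
        }
    ],
}

resp = requests.post(
    f"{endpoint}/indexes/{index_name}/docs/search?api-version={api_version}",
    json=query,
    headers={"api-key": api_key, "Content-Type": "application/json"},
)
print(resp.json()["value"][:3])
```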


 


Sub-Scores


Sub-Scores provide granular scoring information for each recall set contributing to the final search results. In hybrid search scenarios, where multiple factors like vector similarity and text relevance play a role, Sub-Scores offer transparency into how each component influences the overall ranking.
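A sketch of requesting this breakdown, assuming the preview query `debug` parameter and a per-document `@search.documentDebugInfo` field in the response (the exact parameter values and response shape are described in the preview reference):

```python
import requests

endpoint = "https://<service>.search.windows.net"  # hypothetical service details
index_name = "docs-index"
api_key = "<query-key>"
api_version = "2024-11-01-preview"  # assumption: preview version exposing sub-scores

query = {
    "search": "managed identity setup",
    "vectorQueries": [
        {"kind": "vector", "fields": "contentVector", "vector": [0.0123, -0.0456], "k": 50}
    ],
    "debug": "vector",  # assumption: asks the service to emit per-recall-set scoring details
    "top": 3,
}

resp = requests.post(
    f"{endpoint}/indexes/{index_name}/docs/search?api-version={api_version}",
    json=query,
    headers={"api-key": api_key, "Content-Type": "application/json"},
)
for doc in resp.json()["value"]:
    # Each hit carries its fused score plus the per-component breakdown.
    print(doc["@search.score"], doc.get("@search.documentDebugInfo"))
```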


 


Text Split Skill by Tokens


The Text Split Skill by Tokens feature enhances your ability to process and manage large text data by splitting text based on token counts. This gives you more precise control over passage (chunk) length, leading to more targeted indexing and retrieval, particularly for documents with extensive content.
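A minimal skillset fragment illustrating this, with the token-related property names (`unit`, `azureOpenAITokenizerParameters`) taken as assumptions from the preview skill reference; check the documentation for the exact names in your API version:

```python
# A sketch of a Text Split skill configured to chunk by tokens rather than characters.
split_skill = {
    "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
    "textSplitMode": "pages",
    "maximumPageLength": 512,        # interpreted as a token budget per chunk
    "pageOverlapLength": 64,         # token overlap between consecutive chunks
    "unit": "azureOpenAITokens",     # assumption: preview switch from characters to tokens
    "azureOpenAITokenizerParameters": {
        "encoderModelName": "cl100k_base"  # tokenizer used to count tokens
    },
    "inputs": [{"name": "text", "source": "/document/content"}],
    "outputs": [{"name": "textItems", "targetName": "chunks"}],
}
```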


For any questions or to share your feedback, feel free to reach out through our Azure Search · Community page.


 


Getting started with Azure AI Search 



 


References:
[1] Kusupati, A., Bhatt, G., Rege, A., Wallingford, M., Sinha, A., Ramanujan, V., Howard-Snyder, W., Chen, K., Kakade, S., Jain, P., & Farhadi, A. (2024). Matryoshka Representation Learning. arXiv preprint arXiv:2205.13147. Retrieved from https://arxiv.org/abs/2205.13147
