**---------- Forwarded message ----------**
**From:** Top Information Retrieval Papers of the Week <recsys@substack.com>
**To:** pcarter@fastmail.com
**Date:** Fri, Mar 20, 2026, 6:06 PM
**Subject:** A Production System for Podcast Discovery, A Fully Open-Source Frontier Search Agent, and More!

View this post on the web at https://recsys.substack.com/p/a-production-system-for-podcast-discovery

Stay ahead of the curve with the latest advancements and discoveries in information retrieval. This week's newsletter highlights the following research:

- Deploying an LLM-Based Podcast Recommender with Semantic IDs [ https://substack.com/redirect/91552d27-f58e-439a-bbc0-0035b74cdb0c?j=eyJ1IjoibHI2ZWwifQ.c5D8ZKYyDRfMY9QPFK3dWk-zel5v6FKdRTN7mU1D8QY ], from Spotify
- Learning Retrieval Models with Sparse Autoencoders [ https://substack.com/redirect/de39369e-b13c-4c95-ac3e-bb081ff84f34?j=eyJ1IjoibHI2ZWwifQ.c5D8ZKYyDRfMY9QPFK3dWk-zel5v6FKdRTN7mU1D8QY ], from NAVER
- A Fully Open-Source Frontier Search Agent [ https://substack.com/redirect/0dd59366-a475-474b-a9e6-70a81a91c5f3?j=eyJ1IjoibHI2ZWwifQ.c5D8ZKYyDRfMY9QPFK3dWk-zel5v6FKdRTN7mU1D8QY ], from SJTU
- A Unified Language Model for Large Scale Search, Recommendation, and Reasoning [ https://substack.com/redirect/3aab87cb-9dda-4ddf-9fb9-b17fb3d1ff53?j=eyJ1IjoibHI2ZWwifQ.c5D8ZKYyDRfMY9QPFK3dWk-zel5v6FKdRTN7mU1D8QY ], from Spotify
- A Survey of Negative Sampling in Dense Information Retrieval [ https://substack.com/redirect/451b0b25-93ce-4645-970f-a78d3252e17a?j=eyJ1IjoibHI2ZWwifQ.c5D8ZKYyDRfMY9QPFK3dWk-zel5v6FKdRTN7mU1D8QY ], from the University of Innsbruck
- Open, Efficient, and Multilingual Text Embeddings at Scale [ https://substack.com/redirect/6db15149-8b84-44a4-bc4d-626d9df5af78?j=eyJ1IjoibHI2ZWwifQ.c5D8ZKYyDRfMY9QPFK3dWk-zel5v6FKdRTN7mU1D8QY ], from Ant Group
- Adapting Model Editing for Cold-Start Generative Recommendation [ https://substack.com/redirect/504d4d31-ca08-4c53-a878-873be4ab570a?j=eyJ1IjoibHI2ZWwifQ.c5D8ZKYyDRfMY9QPFK3dWk-zel5v6FKdRTN7mU1D8QY ], from Renmin University
- Production-Ready Multimodal Late-Interaction Retrieval Without Specialized Infrastructure [ https://substack.com/redirect/5cbabd72-69a6-4c2b-a170-5813de0293af?j=eyJ1IjoibHI2ZWwifQ.c5D8ZKYyDRfMY9QPFK3dWk-zel5v6FKdRTN7mU1D8QY ], from Apple
- Overcoming Modality Collapse in VLM Embedders for Sequential Recommendation [ https://substack.com/redirect/a62dd8cf-0582-4123-9fc4-4afed18c12c0?j=eyJ1IjoibHI2ZWwifQ.c5D8ZKYyDRfMY9QPFK3dWk-zel5v6FKdRTN7mU1D8QY ], from Pohang University
- Index-Time Reasoning for Multi-Hop Question Answering Without Graphs or Iterative Retrieval [ https://substack.com/redirect/8668049b-6a38-4e77-8f10-cd65bab7d7a6?j=eyJ1IjoibHI2ZWwifQ.c5D8ZKYyDRfMY9QPFK3dWk-zel5v6FKdRTN7mU1D8QY ], from Continuum AI

**[1] Deploying Semantic ID-based Generative Retrieval for Large-Scale Podcast Discovery at Spotify**

This paper from Spotify presents GLIDE (Grounded LLM for Interest Discovery rEcommendations), a production-scale generative recommender system for podcast discovery. The core idea is to frame recommendation as an instruction-following task over a discretized catalog using Semantic IDs (SIDs): short token sequences derived from residual K-Means quantization of episode content embeddings, which let an LLM (a Llama 3.2 1B backbone) generate valid catalog items directly rather than select from a fixed item set.

To handle personalization without ballooning prompt length, the model injects long-term collaborative-filtering user embeddings as soft prompts alongside recent listening history and natural-language task instructions. Training proceeds in two stages: first grounding the SID tokens in the LLM's semantic space via bidirectional translation, then instruction-tuning with multi-task learning that distinguishes "familiar" from "unfamiliar" non-habitual discovery objectives.
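The Semantic IDs above come from residual K-Means quantization of content embeddings: quantize a vector with a first codebook, subtract the chosen centroid, and quantize the remainder with the next codebook. A minimal self-contained sketch (illustrative only: the codebooks here are random, whereas the real system fits them to episode embeddings with K-Means):

```python
import numpy as np

def rq_encode(vec, codebooks):
    """Residual quantization: at each level, pick the nearest centroid,
    subtract it, and quantize the remainder with the next codebook.
    The tuple of centroid indices is the item's Semantic ID."""
    residual = np.asarray(vec, dtype=float).copy()
    sid = []
    for cb in codebooks:                        # cb: (K, d) centroid matrix
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        sid.append(idx)
        residual = residual - cb[idx]
    return tuple(sid)

def rq_decode(sid, codebooks):
    """Approximate reconstruction: sum of the chosen centroids."""
    return sum(cb[i] for i, cb in zip(sid, codebooks))

rng = np.random.default_rng(0)
d, K, levels = 8, 4, 3                          # toy sizes
codebooks = [rng.normal(size=(K, d)) for _ in range(levels)]
v = rng.normal(size=d)                          # stand-in content embedding
sid = rq_encode(v, codebooks)                   # e.g. a 3-token SID
```

Because each level has only K entries, a catalog of millions of items is covered by a few short token sequences, which is what lets the LLM emit items token by token.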
In a 21-day online A/B test across ~20M impressions, GLIDE increased non-habitual podcast streams by 5.4% and new-show discovery by 14.3%, while staying within production latency and cost constraints.

📚 https://arxiv.org/abs/2603.17540

**[2] Learning Retrieval Models with Sparse Autoencoders**

This paper from NAVER introduces SPLARE (SParse LAtent REtrieval), a new approach to Learned Sparse Retrieval (LSR) that replaces the vocabulary-space projection used by models like SPLADE with representations derived from pre-trained Sparse Autoencoders (SAEs). The key motivation is that SAE latent features are largely monosemantic, language-agnostic, and scalable in dimensionality. Vocabulary-constrained approaches inherently lack these properties, which the authors argue is why LSR models have fallen behind dense retrievers on multilingual benchmarks.

Practically, SPLARE inserts a frozen pre-trained SAE (from the Llama Scope suite built on Llama-3.1-8B) at an intermediate transformer layer (~2/3 depth), applies SPLADE-style max-pooling over the resulting sparse token activations, and fine-tunes only the LLM backbone via LoRA. Training uses KL-divergence distillation from a cross-encoder teacher plus FLOPS regularization for sparsity, with Top-K pooling applied at inference.

The resulting SPLARE-7B model achieves top-10 performance on MMTEB (Multilingual, v2) retrieval and ranks as the best LSR model on that benchmark, with particularly strong gains on multilingual and cross-lingual tasks, though it lags on code retrieval due to the general-purpose nature of available SAEs.
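The pooling step above is easy to picture in code. A rough numpy sketch of SPLADE-style max-pooling over per-token sparse activations, followed by inference-time Top-K pooling (a simplification: it omits SPLADE's log-saturation and uses random stand-in activations rather than SAE outputs):

```python
import numpy as np

def splade_pool(token_acts):
    """SPLADE-style pooling: elementwise max over token positions of
    non-negative sparse activations gives one vector per text."""
    return np.maximum(token_acts, 0.0).max(axis=0)

def top_k(vec, k):
    """Keep only the k largest entries, zeroing the rest
    (inference-time Top-K pooling for extra sparsity)."""
    out = np.zeros_like(vec)
    idx = np.argsort(vec)[-k:]
    out[idx] = vec[idx]
    return out

rng = np.random.default_rng(1)
doc_acts = rng.normal(size=(5, 32))   # 5 tokens x 32 latent features
q_acts = rng.normal(size=(3, 32))     # 3 query tokens
doc_vec = top_k(splade_pool(doc_acts), k=8)
q_vec = top_k(splade_pool(q_acts), k=8)
score = float(q_vec @ doc_vec)        # sparse dot-product relevance score
```

The resulting vectors are mostly zeros, so the dot product reduces to a small inverted-index lookup at serving time, which is the practical appeal of LSR.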
📚 https://arxiv.org/abs/2603.13277

**[3] OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data**

This paper from SJTU introduces OpenSeeker, a fully open-source search agent that achieves frontier-level performance on web search benchmarks while releasing its complete training data, model weights, and synthesis pipeline. The system addresses two core challenges through novel data-synthesis methods:

- A fact-grounded QA synthesis framework reverse-engineers the web graph by expanding from seed pages, extracting entity subgraphs, and applying entity obfuscation to produce complex multi-hop reasoning questions whose difficulty can be controlled by tuning subgraph size.
- A denoised trajectory synthesis approach uses retrospective summarization to give the teacher model clean context during generation, but then trains the student on raw, noisy tool responses, forcing it to internalize robust information extraction.

Trained via simple SFT on just 11.7k synthesized samples using Qwen3-30B-A3B, OpenSeeker scores 29.5% on BrowseComp (nearly double DeepDive's 15.3%), 48.4% on BrowseComp-ZH (beating Tongyi DeepResearch's 46.7%, despite the latter using continual pre-training, SFT, and RL), 74.0% on xbench-DeepSearch, and 59.4% on WideSearch, all in a single training run with no hyperparameter tuning.
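The fact-grounded synthesis idea (chain linked facts and hide the bridge entity so that no single lookup answers the question) can be shown with a toy two-hop example. This is purely illustrative: the paper works over web-scale entity subgraphs with LLM-generated phrasings, and the entities, relations, and template below are invented for the sketch:

```python
def synthesize_two_hop(triples, seed):
    """Toy 2-hop QA synthesis: walk two linked (subject, relation,
    object) facts from a seed entity, then obfuscate the bridge
    entity by describing it via its relation to the seed."""
    facts = {(s, r): o for s, r, o in triples}
    for (s, r1), bridge in facts.items():       # hop 1: seed -> bridge
        if s != seed:
            continue
        for (s2, r2), answer in facts.items():  # hop 2: bridge -> answer
            if s2 == bridge:
                question = f"What is the {r2} of the {r1} of {seed}?"
                return question, answer
    return None

triples = [
    ("Novel A", "author", "Writer B"),
    ("Writer B", "nationality", "Country C"),
]
qa = synthesize_two_hop(triples, "Novel A")
# The bridge entity "Writer B" never appears in the question,
# so answering requires resolving both hops.
```

Difficulty scales with the subgraph: longer chains and more obfuscated entities force more retrieval steps, which is the control knob the authors describe.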
📚 https://arxiv.org/abs/2603.15594
👨🏽💻 https://github.com/rui-ye/OpenSeeker
🤗 https://huggingface.co/OpenSeeker/OpenSeeker-v1-30B-SFT
🤗 https://huggingface.co/datasets/OpenSeeker/OpenSeeker-v1-Data

**[4] A Unified Language Model for Large Scale Search, Recommendation, and Reasoning**

NEO is a framework from Spotify that adapts a pre-trained decoder-only LLM into a single, tool-free, catalog-grounded generator capable of jointly handling recommendation, search, and reasoning over a heterogeneous catalog of 10M+ items (podcasts, audiobooks, artists, etc.). The key idea is to treat catalog items as a new "modality" by representing them as semantic identifiers (SIDs), which are interleaved with natural language in both input and output.

Training follows a three-stage recipe inspired by multimodal alignment:

- Stage 1 learns the SIDs via embedding discretization.
- Stage 2 aligns SID tokens with the LLM's linguistic space through bidirectional objectives (SID→text verbalization, text→SID grounded retrieval, SID→type disambiguation) while freezing the backbone weights.
- Stage 3 performs multi-task instruction tuning across recommendation, retrieval, interest profiling, and "recsplanation" (recommendation + explanation) tasks.

Natural-language prompts steer the task, target entity type, and output format, while constrained decoding via a prefix trie guarantees that generated SIDs always map to valid catalog items.
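The prefix-trie constraint is a small, self-contained mechanism. A minimal sketch with toy token IDs (not NEO's actual vocabulary or decoding loop):

```python
class SIDTrie:
    """Prefix trie over the valid Semantic-ID sequences of a catalog.
    During constrained decoding, only children of the node reached by
    the tokens generated so far are allowed, so every completed
    sequence maps to a real catalog item."""

    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def allowed_next(self, prefix):
        """Token IDs that may legally follow `prefix` (empty set if
        the prefix itself is not in the catalog)."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return set()
            node = node[tok]
        return set(node)

catalog = [(1, 4, 2), (1, 4, 7), (3, 0, 5)]   # toy 3-token SID catalog
trie = SIDTrie(catalog)
```

In a real decoder, logits for tokens outside `allowed_next(prefix)` would be masked to negative infinity at each step, which is how generation is kept on-catalog without any post-hoc filtering.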
In offline experiments using Qwen3-0.6B as the backbone, NEO consistently outperforms mature production baselines, including a GNN+two-tower recommender and a dense retrieval system.

📚 https://arxiv.org/abs/2603.17533

**[5] Negative Sampling Techniques in Information Retrieval: A Survey**

This survey from the University of Innsbruck tackles negative sampling in dense retrieval, where neural encoders learn query and document embeddings via contrastive learning. The effectiveness of this process hinges critically on which "negative" (irrelevant) examples the model trains against. The authors propose a taxonomy dividing techniques into sampling-based and data-centric approaches: the former spans random/in-batch negatives, static BM25-mined hard negatives, dynamic model-based mining (ANCE), cluster-based diversity sampling, and principled selection (TriSampler), while the latter covers LLM-driven data augmentation and fully synthetic dataset generation. A central theme is the false-negative problem, in which hard-mined passages are often genuinely relevant but unlabeled; it is addressed through cross-encoder or LLM-based filtering and loss-regularization techniques such as label smoothing.

📚 https://arxiv.org/abs/2603.18005

**[6] F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World**

This paper from Ant Group introduces F2LLM-v2, a family of 8 general-purpose multilingual text embedding models (80M–14B parameters) designed to address two persistent problems in the field: English-centric training bias and a lack of training transparency among top-performing models.
The authors curated 60 million samples from 157 publicly available sources spanning 282 natural languages and 40+ programming languages, deliberately prioritizing real-world data breadth over benchmark optimization. Training follows a two-stage pipeline: the first stage builds a broad semantic foundation on 27M large-scale retrieval samples, while the second fine-tunes on 18M task-diverse samples with task-specific instructions. The three smallest models (80M, 160M, 330M) are obtained by pruning the 0.6B model along the hidden-size, MLP-intermediate-size, and layer-depth dimensions, with knowledge distillation applied to recover performance lost during pruning. Matryoshka Representation Learning is applied throughout, enabling flexible embedding truncation without catastrophic quality loss.

Evaluated across 17 MTEB benchmarks covering 430 tasks, F2LLM-v2-14B achieves state of the art on 11 of them, while the smaller variants consistently outperform comparably sized Qwen3-Embedding and EmbeddingGemma models, particularly on language-specific benchmarks. All models, training code, data, and intermediate checkpoints are publicly released.

📚 https://arxiv.org/abs/2603.19223
🤗 https://huggingface.co/collections/codefuse-ai/f2llm

**[7] Bringing Model Editing to Generative Recommendation in Cold-Start Scenarios**

This paper from Renmin University identifies and addresses cold-start collapse in generative recommendation (GR), a phenomenon in which recommendation accuracy for newly introduced items plummets to near zero because the autoregressive model has never observed their semantic ID (SID) patterns during training.
The authors show that GR models can often predict the first SID token of a cold-start item correctly but then default to familiar, seen patterns for subsequent tokens, revealing latent capability that goes unexploited. Drawing on model-editing techniques from NLP (e.g., ROME, MEMIT), they propose GenRecEdit, a framework that applies training-free knowledge injection to GR. Doing so requires overcoming two domain-specific obstacles: GR sequences lack the explicit subject–object binding that NLP editing relies on, and cold-start SID tokens don't exhibit the stable co-occurrence patterns of natural-language phrases. GenRecEdit addresses these through three mechanisms:

- Position-wise knowledge preparation, which constructs pseudo interaction histories for cold-start items via similarity-based retrieval of warm-item neighbors and decomposes edit targets into individual token positions.
- A locate-then-edit framework that uses linear probing classifiers to identify the most discriminative layer for each SID position, then computes closed-form weight updates balancing knowledge injection against preservation of existing capabilities (controlled by a hyperparameter).
- A One-One triggering policy that activates only the position-relevant edited layer during each decoding step, preventing cross-position interference.

Experiments on three Amazon datasets show GenRecEdit substantially improves cold-start recommendation while requiring just 9.5% of the wall-clock time needed for full retraining.

📚 https://arxiv.org/abs/2603.14259

**[8] AMES: Approximate Multi-modal Enterprise Search via Late Interaction Retrieval**

This paper from Apple presents AMES (Approximate Multimodal Enterprise Search), which argues that fine-grained multimodal retrieval doesn't actually require specialized ANN infrastructure to work well.
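AMES builds on late-interaction (ColBERT-style) MaxSim scoring. As background, a minimal numpy sketch of exact MaxSim plus a Top-M variant that keeps only the strongest per-query-token alignments; this is a simplification in the spirit of the paper's approximation, not its exact formulation, and the embeddings are random stand-ins:

```python
import numpy as np

def maxsim(query_toks, doc_toks):
    """Late-interaction MaxSim: each query token takes the similarity
    of its best-matching document token; the document score is the
    sum over query tokens."""
    sims = query_toks @ doc_toks.T            # (Q, D) token-level sims
    return float(sims.max(axis=1).sum())

def top_m_maxsim(query_toks, doc_toks, m):
    """Approximate MaxSim: keep only the m strongest per-token
    alignments, suppressing low-signal query tokens."""
    sims = query_toks @ doc_toks.T
    best_per_token = np.sort(sims.max(axis=1))
    return float(best_per_token[-m:].sum())

rng = np.random.default_rng(2)
q = rng.normal(size=(4, 16))    # 4 query tokens x 16 dims
doc = rng.normal(size=(9, 16))  # 9 doc "child" embeddings (tokens/patches/frames)
approx = top_m_maxsim(q, doc, m=2)
exact = maxsim(q, doc)
```

With `m` equal to the query length the approximation reduces to exact MaxSim, which is why it works as a cheap first stage followed by exact reranking.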
The core idea is to represent text tokens, image patches, and video frames as child documents under parent retrievable units within Apache Solr's native parent-child indexing scheme, then run a two-stage pipeline. Stage 1 issues parallel per-token ANN queries, applies a per-document Top-M MaxSim approximation (retaining only the M strongest token-level alignments to suppress low-signal query tokens), and fuses modality-specific scores via robust normalization (median/MAD). Stage 2 reranks the top-N candidates using full exact MaxSim computed via batched PyTorch matrix operations on GPU or Apple Silicon.

The architecture is deliberately backend-agnostic: Solr is the reference implementation, but any vector-enabled system supporting ANN search and parent-child grouping works. It also preserves enterprise necessities such as metadata filtering and access control without modification.

📚 https://arxiv.org/abs/2603.13537

**[9] VLM2Rec: Resolving Modality Collapse in Vision-Language Model Embedders for Multimodal Sequential Recommendation**

This paper from Pohang University presents VLM2Rec to address a fundamental problem that arises when Vision-Language Models (VLMs) are repurposed as embedding encoders for multimodal sequential recommendation. Standard contrastive supervised fine-tuning (SFT), while necessary to inject collaborative-filtering signals into the representation space, paradoxically amplifies the inherent modality collapse of VLMs: the model takes a shortcut by optimizing primarily through the dominant text modality while the vision modality's gradient signal degrades, ultimately making visual information act as negative transfer in the fused embeddings. The authors demonstrate this through dropout experiments, gradient-dynamics tracking, and geometric analysis of alignment/uniformity/separability metrics across modalities.
To fix this, they propose two objective-level interventions: Weak-modality Penalized Contrastive Learning (LWPCL), which dynamically estimates a per-user modality gap and amplifies the contrastive penalty on negatives in proportion to how far the weak modality lags behind, and Cross-modal Relational Topology Regularization (LCRTR), which uses bidirectional KL divergence to align the relative similarity-ranking structures across modalities. Experiments show consistent improvements on both direct retrieval and downstream SR-model initialization tasks, with strong generalization in cross-domain and cold-start settings, and robustness across VLM families from 2B to 32B parameters.

📚 https://arxiv.org/abs/2603.17450

**[10] IndexRAG: Bridging Facts for Cross-Document Reasoning at Index Time**

IndexRAG, from Continuum AI, tackles multi-hop question answering in RAG systems by moving cross-document reasoning from inference time to the indexing phase. The core insight is that connections between documents are determined by content rather than by queries, so they can be precomputed. During offline indexing, the system extracts atomic knowledge units (AKUs) and entities from each document, identifies "bridge entities" shared across multiple documents, and then prompts an LLM to generate "bridging facts": new retrievable text units that explicitly encode multi-hop reasoning chains (e.g., combining "Film X was directed by Y" and "Y was born in Z" into "The director of Film X was born in Z"). These bridging facts are stored alongside AKUs in a flat vector store, so at inference time only a single retrieval pass and one LLM call are needed, with no graph traversal or iterative reasoning. A balanced context-selection mechanism caps the number of bridging facts to prevent them from crowding out longer AKUs in the top-k results.
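The bridge-entity step of this pipeline is simple to sketch: any entity mentioned in more than one document is a candidate anchor for a bridging fact. A toy illustration (invented document IDs and entities; in the paper an LLM then fuses the AKUs sharing each bridge entity into new text units):

```python
def bridge_entities(doc_entities):
    """Map each entity that appears in more than one document to the
    set of documents mentioning it. These shared entities are the
    anchors for index-time bridging facts."""
    seen = {}
    for doc_id, ents in doc_entities.items():
        for e in ents:
            seen.setdefault(e, set()).add(doc_id)
    return {e: docs for e, docs in seen.items() if len(docs) > 1}

doc_entities = {
    "d1": {"Film X", "Person Y"},
    "d2": {"Person Y", "City Z"},
    "d3": {"City Z"},
}
bridges = bridge_entities(doc_entities)
# "Person Y" links d1 and d2, so their AKUs (e.g. "Film X was directed
# by Y" and "Y was born in Z") would be fused into one bridging fact.
```

Because this runs entirely offline over document content, no query-time graph traversal is needed: the fused facts simply sit in the same flat vector store as the AKUs.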
📚 https://arxiv.org/abs/2603.16415

**Extras: Benchmarks**

⏱️ Total Recall QA: A Verifiable Evaluation Suite for Deep Research Agents

TRQA (Total Recall QA) is a benchmark for evaluating deep research agents on multi-step information retrieval and reasoning tasks. It is built around a formulation in which answering a query requires retrieving all relevant documents from a corpus and aggregating information across them, enabling precise and verifiable evaluation. It includes three datasets derived from Wikipedia/Wikidata and a synthetic e-commerce domain to reduce data-contamination effects.

📝 https://arxiv.org/abs/2603.18516
👨🏽💻 https://github.com/mahta-r/total-recall-qa

I hope this weekly roundup of top papers has provided you with valuable insights and a glimpse into the exciting advancements taking place in the field. Remember to look deeper into the papers that pique your interest. Subscribe for free to receive new posts like this!

I also blog about Machine Learning, Deep Learning, MLOps, and Software Engineering, exploring topics such as Natural Language Processing, Large Language Models, and Recommendation Systems, and conducting in-depth analyses drawing on the latest research papers.