Chandar Research Lab (CRL) Annual Symposium 2025
We welcome you to the sixth annual CRL symposium!
The CRL symposium is an annual event that showcases highlights of the research conducted in the Chandar Lab over the past year. The symposium will also feature a keynote; this year's keynote talk will be given by Prof. Subbarao Kambhampati (Arizona State University).
Date: August 19 and 20, 2025
Time: 9 am to 5 pm EDT
Mode: Hybrid (both remote and in-person)
Address: 6650 Saint-Urbain, Montréal, QC H2S 3H1
Room: Mila Agora
How to register? Please register on Eventbrite (it takes one minute) if you plan to attend either in person or remotely.
Contact: mathieu.reymond@mila.quebec
Day 1 (August 19)
Time | Speaker | Topic | Abstract |
---|---|---|---|
9:00am - 9:15am | Sarath Chandar | Opening remarks | A welcome message with an overview of various research activities at CRL. |
9:15am - 9:45am | Ali Rahimi-Kalahroudi & Reza Bayat | Steering Large Language Model Activations in Sparse Spaces | A key challenge in AI alignment is guiding large language models (LLMs) to follow desired behaviors at test time. Activation steering—modifying internal activations during inference—shows promise, but dense spaces suffer from superposition, limiting control and interpretability. We introduce Sparse Activation Steering (SAS), which operates in sparse latent spaces to isolate behavior-specific features via a contrastive prompt-pairing approach. On Gemma LLMs, SAS matches dense steering in effectiveness while offering feature compositionality and interpretability. Scaling studies show SAS vectors become increasingly sparse, trending toward ideal monosemanticity. (An illustrative steering sketch follows the Day 1 schedule.) |
9:45am - 10:15am | Jerry Huang | Do Robot Snakes Dream like Electric Sheep? Investigating the Effects of Architectural Inductive Biases on Hallucination | The prominence of large language models (LLMs) has led to various questions regarding their reliability, such as their tendency to hallucinate false or misleading information, as well as their efficiency. Yet these two factors are rarely considered in tandem. Do changes in LLM architecture exacerbate hallucinations or influence how and where they occur? We evaluate how these architecture-based inductive biases affect hallucination, revealing that the situations in which hallucinations occur can change with architecture. These findings highlight the need to better understand both problems in conjunction with each other, and to consider how to design more universal techniques for handling hallucinations. |
10:15am - 10:45am | Coffee break | | |
10:45am - 11:15am | Ekaterina Lobacheva | Understanding Language Model Scaling Laws in Terms of Training Dynamics via Loss Deceleration and Zero-Sum Learning | We study how scaling affects language model training dynamics, focusing on loss deceleration—an abrupt slowdown in the rate of loss improvement early in training. We attribute this effect to zero-sum learning (ZSL), a training dynamic where per-example gradients become systematically opposed, causing destructive interference in per-example loss changes. As a result, improving loss on one subset of examples degrades it on another, bottlenecking overall progress. Scaling up the model mitigates loss deceleration by (1) decreasing the loss at which deceleration occurs and (2) improving the rate of loss improvement after deceleration by reducing ZSL. (A toy gradient-opposition probe follows the Day 1 schedule.) |
11:15am - 11:45am | Darshan Patil | Experimental Design for Nonstationary Optimization | With the growing interest in nonstationary optimization and plasticity research, there is also a growing need to properly define experimental design and hyperparameter search protocols to enable principled research. In this work, we perform an extensive empirical study looking at the performance of representative plasticity methods across different settings and architectures. We examine several core experiment design choices made by the community, such as how to select hyperparameters that transfer, whether we should be focusing on training or testing accuracy when studying plasticity, and how to do a resource-efficient hyperparameter search. |
11:45am - 12:15pm | Istabrak Abbes | Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models | Training large language models (LLMs) typically involves pretraining on massive corpora, only to restart the process entirely when new data becomes available. A more efficient and resource-conserving approach would be continual pretraining, where models are updated with new data rather than retraining from scratch. However, the introduction of new data often causes distribution shifts, leading to performance degradation on previously learned tasks. In this paper, we take a deeper look at two popular proposals for addressing this distribution shift within the continual learning literature: experience replay and gradient alignment. (A minimal replay sketch follows the Day 1 schedule.) |
12:15pm - 1:30pm | Lunch break | | |
1:30pm - 2:00pm | Lola Le Breton | NeoBERT: A Next-Generation BERT | Recent innovations in architecture, pre-training, and fine-tuning have led to the remarkable in-context learning and reasoning abilities of auto-regressive LLMs. In contrast, encoders like BERT and RoBERTa have not seen the same level of progress. To bridge this gap, we introduce NeoBERT, a next-generation encoder that redefines the capabilities of bidirectional models by integrating state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. |
2:00pm - 2:30pm | Istabrak Abbes | Small Encoders Can Rival Large Decoders in Detecting Groundedness | We address the challenge of detecting groundedness in large language model (LLM) queries—determining whether a query is answerable from a given context—prior to expensive LLM inference. Our results show that lightweight encoders like RoBERTa and NomicBERT, fine-tuned on curated datasets, match the accuracy of LLMs such as LLaMA3 8B and GPT-4o in groundedness detection, while drastically reducing inference cost and latency. (An encoder-as-gate sketch follows the Day 1 schedule.) |
2:30pm - 3:00pm | Behnoush Khavari | Expressivity and Generalization Performance of Linear Recurrent Models | We review the current state of linear recurrent models (LRNNs)/SSMs in terms of their expressivity on several sets of synthetic tasks, including state-tracking tasks. These synthetic tasks are known to be representative of model performance on important real-world tasks. We present results that add to the evidence that most current LRNNs/SSMs fail to effectively solve such tasks. Finally, we go through newly proposed solutions to this deficiency. (A tiny state-tracking example follows the Day 1 schedule.) |
3:00pm - 3:30pm | Coffee break | | |
3:30pm - 4:00pm | Prashant Govindarajan | CrystalGym: A New Benchmark for Materials Discovery Using Reinforcement Learning | We introduce CrystalGym, an open-source reinforcement learning (RL) environment for crystalline material discovery using direct feedback from density functional theory (DFT), which enables property optimization for challenging targets like band gap, bulk modulus, and density. Through extensive benchmarking of value- and policy-based RL algorithms, we highlight differences in sample efficiency and convergence. CrystalGym offers a practical test bed for advancing RL methods under time-consuming reward signals, promoting interdisciplinary research at the intersection of machine learning and materials science. (A generic environment-loop sketch follows the Day 1 schedule.) |
4:00pm - 4:30pm | Kamran Chitsaz | NovoMolGen: Rethinking Molecular Language Model Pretraining | We present NovoMolGen, a family of transformer-based molecular language models pretrained on 1.5B molecules. We systematically study the impact of tokenization, model size, and data scale on molecular generation. Our results reveal a weak correlation between pretraining metrics and downstream performance, unlike in general NLP models. NovoMolGen achieves state-of-the-art results across both unconstrained and goal-directed generation tasks. |
4:30pm - 5:00pm | Davide Baldelli | CADmium: Fine-tuning Code Language Models for Text-Driven Sequential CAD Design | Computer-aided design (CAD) is crucial across engineering domains but remains a largely manual and time-consuming process. We introduce CADmium, a large-scale dataset of over 170k CAD models paired with high-quality GPT-4.1-generated descriptions, which enables fine-tuning of large language models to generate CAD sequences from natural language. To better assess generated outputs, we propose geometric and topological metrics, and demonstrate through detailed experiments that CADmium can significantly accelerate the CAD modeling process. |
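
The sketches below are our own minimal, illustrative readings of some of the Day 1 abstracts, not the speakers' implementations. First, for Sparse Activation Steering: a hedged sketch of deriving a steering vector from contrastive prompt pairs in a sparse latent space. The `encode`/`decode` callables (e.g. from a pretrained sparse autoencoder) and all hyperparameters are assumptions.

```python
# Illustrative contrastive steering in a sparse latent space (not the
# SAS implementation from the talk; encode/decode are assumed to come
# from a pretrained sparse autoencoder).
import torch

def steering_vector(h_pos, h_neg, encode, k=32):
    """h_pos / h_neg: [n_pairs, d_model] hidden states collected from
    contrastive prompt pairs (with / without the target behavior)."""
    z_pos, z_neg = encode(h_pos), encode(h_neg)   # [n_pairs, d_sparse]
    diff = (z_pos - z_neg).mean(0)                # mean contrast per feature
    top = diff.abs().topk(k).indices              # keep k strongest features
    v = torch.zeros_like(diff)
    v[top] = diff[top]                            # sparse, behavior-specific
    return v

def steer(h, v, decode, alpha=4.0):
    """Shift a dense activation along the decoded sparse direction."""
    return h + alpha * decode(v)
```

Sparsity is what buys compositionality here: two behavior vectors with disjoint supports can simply be added.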
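
For the loss-deceleration talk, a toy probe for zero-sum learning: if per-example gradients are systematically opposed, their pairwise cosine similarities skew negative, so an update that helps one example hurts another. A sketch assuming a model small enough that per-example gradients fit in memory:

```python
# Toy zero-sum learning probe: average pairwise cosine similarity of
# per-example gradients. Strongly negative values suggest destructive
# interference between examples. Illustrative only.
import torch

def pairwise_grad_cosine(model, loss_fn, examples):
    grads = []
    for x, y in examples:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters()
                       if p.grad is not None])
        grads.append(g / (g.norm() + 1e-8))       # unit-normalize
    G = torch.stack(grads)                        # [n, n_params]
    C = G @ G.T                                   # cosine similarities
    off_diag = C[~torch.eye(len(C), dtype=torch.bool)]
    return off_diag.mean()                        # << 0: zero-sum dynamics
```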
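
For the continual pre-training talk, experience replay in its simplest form: each batch on the new corpus mixes in a fraction of sequences sampled from the old corpus. The batch size and replay fraction below are illustrative, not the paper's settings.

```python
# Minimal experience-replay batching for continual pretraining:
# mix new-corpus sequences with sequences replayed from the old corpus.
import random

def replay_batches(new_data, old_data, batch_size=32, replay_frac=0.25):
    n_replay = int(batch_size * replay_frac)
    n_new = batch_size - n_replay
    random.shuffle(new_data)                      # shuffles in place
    for i in range(0, len(new_data) - n_new + 1, n_new):
        batch = new_data[i:i + n_new] + random.sample(old_data, n_replay)
        random.shuffle(batch)
        yield batch                               # train on the mixed batch
```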
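
For the groundedness talk, a sketch of the encoder-as-gate idea: fine-tune a small encoder as a binary classifier over (context, query) pairs and only invoke the expensive LLM when the query looks answerable. The checkpoint name and threshold are placeholders, and the fine-tuning loop is omitted.

```python
# Groundedness as binary sequence classification with a small encoder.
# The classifier is assumed to have been fine-tuned on labeled
# (context, query) pairs; shown here is only the inference-time gate.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("roberta-base")
clf = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)

def is_grounded(context: str, query: str, threshold: float = 0.5) -> bool:
    """Return True if the query looks answerable from the context."""
    inputs = tok(context, query, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = clf(**inputs).logits.softmax(-1)
    return probs[0, 1].item() >= threshold

# Gate expensive LLM calls: only run the big model on grounded queries.
# if is_grounded(ctx, q): answer = big_llm(ctx, q)
```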
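
For the linear-recurrence talk, the simplest example of a state-tracking task: parity of a bit stream. Each output depends on the entire prefix through a hidden state, which is exactly the kind of computation many current LRNNs/SSMs are argued to struggle with.

```python
# Parity as a state-tracking task: the target at step t is the XOR of
# all inputs up to t, so the model must carry a 1-bit state forever.
def parity_targets(bits):
    state, out = 0, []
    for b in bits:
        state ^= b
        out.append(state)
    return out

assert parity_targets([1, 0, 1, 1]) == [1, 1, 0, 1]
```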
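
For CrystalGym, a generic agent-environment loop of the kind such a benchmark exposes. All names are hypothetical, not the CrystalGym API; the point is that each `step` hides a DFT calculation, so rewards are extremely slow and sample efficiency dominates.

```python
# Generic RL interaction loop for a materials-discovery environment.
# Names are illustrative placeholders, not the CrystalGym interface.
def run_episode(env, agent):
    obs = env.reset()                     # initial crystal-structure state
    done, total_reward = False, 0.0
    while not done:
        action = agent.act(obs)           # e.g. modify composition/lattice
        obs, reward, done, info = env.step(action)
        # The reward comes from a DFT calculation, so each step is
        # expensive; algorithms must learn from very few samples.
        agent.observe(obs, reward, done)
        total_reward += reward
    return total_reward
```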
Day 2 (August 20)
Time | Speaker | Topic | Abstract |
---|---|---|---|
9:00am - 9:30am | Xutong Zhao | Boosting LLM Reasoning via Spontaneous Self-Correction | We propose SPOC, a spontaneous self-correction approach that enables LLMs to generate interleaved solutions and verifications in a single inference pass, with generation dynamically terminated based on verification outcomes, thereby effectively scaling inference time compute. Experiments on mathematical reasoning benchmarks show that SPOC significantly improves performance. Notably, SPOC boosts the accuracy of Llama-3.1-8B and 70B Instruct models, achieving gains of 8.8% and 11.6% on MATH500, 10.0% and 20.0% on AMC23, and 3.3% and 6.7% on AIME24, respectively. (An illustrative control-flow sketch follows the Day 2 schedule.) |
9:30am - 10:00am | Gabriele Prato | Effect of Document Packing on the Latent Multi-Hop Reasoning Capabilities of Large Language Models | The standard practice for training LLMs involves packing multiple documents together to optimize computational efficiency. However, the impact of this process on the models' capabilities remains largely unexplored. To address this gap, we investigate how different document-packing strategies influence the latent multi-hop reasoning abilities of LLMs. Our findings indicate that packing improves model performance compared to training on individual documents. To further understand the underlying mechanisms, we conduct an extensive ablation study, identifying key factors that explain the advantages of packing. (A packing sketch follows the Day 2 schedule.) |
10:00am - 10:30am | Mathieu Reymond | GRPO-λ | LLMs are increasingly deployed for tasks requiring complex reasoning, prompting significant interest in improving their reasoning abilities through post-training, e.g., using RL. We introduce GRPO-λ, an extension of the state-of-the-art GRPO post-training method that enhances credit assignment using a critic-free approximation of the temporal-difference error and token-level eligibility traces. Across models from 1.5B to 7B parameters and four math reasoning datasets, GRPO-λ shows 30-40% better RL training performance and an over 3-point average improvement on benchmarks. These gains are consistent across both the LLaMA-3.1 and Qwen-2.5 architectures. (A token-weighting sketch follows the Day 2 schedule.) |
10:30am - 11:00am | Coffee break | | |
11:00am - 12:00pm | Subbarao Kambhampati (Arizona State University) | Keynote: (How) Do LLMs Reason? | Large Language Models, auto-regressively trained on the digital footprints of humanity, have shown impressive abilities in generating coherent text completions for a vast variety of prompts. While they excelled from the beginning at producing completions in an appropriate style, factuality and reasoning/planning abilities remained their Achilles heel (premature claims notwithstanding). More recently, a breed of approaches dubbed "reasoning models" (LRMs) has emerged. These approaches leverage two broad and largely independent ideas: (i) test-time inference, which involves getting the base LLMs to do more work than simply providing the most likely completion, including using them in generate-and-test approaches such as LLM-Modulo (which pair LLM generation with a bank of verifiers), and (ii) post-training methods, which go beyond simple auto-regressive training on web corpora by collecting, filtering, and training on derivational traces (often anthropomorphically referred to as "chains of thought" and "reasoning traces"), and modifying the base LLM with them using supervised fine-tuning or reinforcement learning methods. Their success on benchmarks notwithstanding, there are significant questions and misunderstandings about these methods, including whether they can provide correctness guarantees, whether they do adaptive computation, whether the intermediate tokens they generate can be viewed as reasoning traces in any meaningful sense, and whether they are costly Rube Goldberg reasoning machines that incrementally compile verifier signal into the generator or truly the start of a golden era of general-purpose System 1+2 AI systems. Drawing from our ongoing work in planning, I will present a broad perspective on these approaches and their promise and limitations. |
12:00pm - 1:30pm | Lunch break | | |
1:30pm - 2:00pm | Artem Zholus | V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning | Joint embedding-predictive architecture (JEPA) is a fundamental approach to learning world models that understand the physical world. By scaling world-model pretraining to over a million hours of internet videos, we build V-JEPA 2, which excels at motion understanding, human-action anticipation, and video question answering. We show how action-conditioned post-training on just 62 hours of unlabeled robot videos enables zero-shot generalization in robotic control through planning in the latent space for tasks such as pick-and-place. |
2:00pm - 2:30pm | Antoine Clavaud | The Challenges of Learning Streaming Representations for Reinforcement Learning | Deep reinforcement learning methods are inherently unstable, especially when replay buffers are not allowed. Recent work has shown that it is possible to design deep streaming RL agents that learn without replay, but these agents have not used any self-supervised representation learning techniques, which are commonly added to boost sample efficiency. We investigate how to add self-predictive representations in the streaming RL setting, in the hope of improving sample efficiency and strengthening our understanding of how to learn good representations from a stream of data. |
2:30pm - 3:00pm | Hadi Nekoei | A Generalist Hanabi Agent | We present the first agent capable of both playing all Hanabi settings simultaneously and generalizing zero-shot to novel partners and game configurations. Achieving this through self-play, without complex MARL methods, demonstrates a task-agnostic approach to generalization that we believe is valuable to the MARL community. |
3:00pm - 3:30pm | Coffee break | | |
3:30pm - 4:00pm | Artem Zholus | TAPNext: Tracking Any Point (TAP) as Next Token Prediction | Tracking Any Point (TAP) is a computer vision problem where a physical point is queried in a video frame and the model needs to find the position of the same point in the rest of the video. We present TAPNext, a novel computer vision model that achieves SOTA in both TAP quality and speed despite operating fully online. Notably, TAPNext discards all previous inductive biases for the task and builds atop generic blocks such as ViT and State-Space Model (SSM), a type of recurrent network. The paper will be presented at ICCV 2025. |
4:00pm - 4:30pm | Naga Karthik | Monitoring morphometric drift in lifelong learning segmentation of the spinal cord | We present a real-world, lifelong-learning-in-deployment scenario applied to medical image segmentation. First, we develop a spinal cord segmentation model trained on a multi-institutional cohort gathering data from 75 clinical sites worldwide. We then introduce a lifelong learning framework that automatically monitors the performance drift of the various spinal cord segmentation models developed over time. The framework uses GitHub Actions workflows and is triggered each time a new model is released. Our model is open source and accessible via the Spinal Cord Toolbox (v7.0 and above). (A drift-check sketch follows the Day 2 schedule.) |
4:30pm - | Closing remarks and social | | |
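
A few illustrative sketches for Day 2 talks follow; as with Day 1, these are our simplified readings of the abstracts, not the speakers' code. For SPOC, the control flow of spontaneous self-correction: the model interleaves solution and verification segments, and generation terminates once a verification passes. `generate_until` and the delimiters are hypothetical stand-ins for sampling from the fine-tuned model.

```python
# Illustrative control flow for spontaneous self-correction (not the
# SPOC training recipe). generate_until is a hypothetical helper that
# samples from the fine-tuned LLM until a delimiter is produced.
def spoc_answer(llm, problem, max_rounds=4):
    transcript = problem
    for _ in range(max_rounds):
        solution = llm.generate_until(transcript, stop="</solution>")
        verdict = llm.generate_until(transcript + solution, stop="</verify>")
        transcript += solution + verdict
        if "correct" in verdict.lower():   # verification passed:
            return solution                # terminate generation early
        # otherwise the model retries with a fresh solution segment
    return solution                        # fall back to the last attempt
```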
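
For the document-packing talk, the standard packing baseline: tokenized documents are concatenated with an end-of-text separator and sliced into fixed-length training sequences. This is one strategy among those the talk compares.

```python
# Standard document packing: concatenate tokenized documents separated
# by an end-of-text token, then slice into fixed-length sequences.
def pack_documents(docs, eos_id, seq_len=2048):
    stream = []
    for tokens in docs:                   # docs: iterable of token-id lists
        stream.extend(tokens)
        stream.append(eos_id)             # document boundary marker
    # Drop the ragged tail so every training sequence is full length.
    return [stream[i:i + seq_len]
            for i in range(0, len(stream) - seq_len + 1, seq_len)]
```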
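
For GRPO-λ, a hedged guess at the shape of token-level eligibility traces on top of a critic-free GRPO objective: each token's contribution to the policy-gradient loss is weighted by a (γλ)-decayed trace, with λ=1 recovering uniform GRPO-style weighting. This is our illustration of the general idea, not the paper's exact algorithm.

```python
# Eligibility-style token weighting for a critic-free policy-gradient
# loss (our sketch of the general shape, not the GRPO-lambda method).
import torch

def token_weights(seq_len, gamma=1.0, lam=0.95):
    """Tokens closer to the terminal reward get more credit;
    lam=1 recovers uniform weighting over the sequence."""
    t = torch.arange(seq_len)
    return (gamma * lam) ** (seq_len - 1 - t)

def weighted_pg_loss(logprobs, advantage, gamma=1.0, lam=0.95):
    """logprobs: [T] per-token log-probs of the sampled response;
    advantage: scalar group-normalized advantage, as in GRPO."""
    w = token_weights(len(logprobs), gamma, lam)
    return -(w * logprobs * advantage).mean()
```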
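
For the spinal-cord talk, a sketch of the kind of drift check a GitHub Actions workflow might run on each model release: compare a morphometric measure (e.g. cross-sectional area) from the new model's segmentations against stored reference values, and fail the job when the mean shift exceeds a tolerance. File names, the metric, and the tolerance are placeholders, not the actual framework.

```python
# Placeholder drift check a CI workflow could run on each release.
import json
import statistics
import sys

def check_drift(new_path, ref_path, tolerance=0.05):
    with open(new_path) as f:
        new = json.load(f)       # {subject_id: cross-sectional area, mm^2}
    with open(ref_path) as f:
        ref = json.load(f)
    shifts = [abs(new[s] - ref[s]) / ref[s] for s in ref if s in new]
    mean_shift = statistics.mean(shifts)
    print(f"mean morphometric shift: {mean_shift:.2%}")
    return mean_shift <= tolerance

if __name__ == "__main__":
    ok = check_drift("new_csa.json", "reference_csa.json")
    sys.exit(0 if ok else 1)     # non-zero exit fails the Actions job
```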