06-23-2026
I recently attended PLDI 2026 in Boulder and thought I would take some time to reflect on some of the interesting things I learned while I was there. Overall I find conference travel very expensive (bost in time, and in cash), so I want to make the most of that investment by reflecting on the trip a bit for myself and for others. My goal is to shine a light on some of the interesting papers or prsentations I enjoyed. If anything sounds interesting here, PLDI is fully recorded. I am hoping to write similar reflections for subsequent conferences I attend. The highlight for this trip was definitely John Li’s and Jack Czenszak’s distinguished paper award for the symbolic sets paper. My plan is to dedicate exactly 3 hours to this post and ship it, so it is what it is!

A photo of boulder
ML compilers, some thoughts on industry engagement
Entertainingly, one of the first reflections I have is on what was missing at PLDI. Saman Amarasinghe gave an interesting opening keynote titled “Programming Language Design and Implementation for the Machine Learning Era”. I thought the most interesting part of this talk was a provocation to the programming languages community: why aren’t machine learning compilers a part of the PL community? Traditionally, the PL community has had lots of work on compilers and compiler optimization, but much of this work has shifted to other communities. He highlighted that many of the important and widely-cited papers on ML compilers, for example MLIR: Scaling compiler infrastructure for domain specific computation, were published in venues besides PLDI. He described how he asked many of the authors of these papers why the papers were not in PLDI. Strikingly, several of the authors of these highly-influential papers said they had been rejected from PLDI, and so went to other venues like OSDI. Others said they simply did not submit to PLDI because they did not perceive that there would be interest within the PLDI community.
I think there’s another possibility, which Saman did not discuss but I think is quite important generally for the PLDI community. PLDI has very little industry engagement relative to many other conferences. I counted 3 companies with booths: Amazon Web Services, Apple, and Jane Street. I noticed a few industry papers, including some from Microsoft Research, Galois, NVidia, and Draper Labs, but not nearly as many as in other conferences. I think it would be good for PLDI to increase its industry engagement. This is easier said than done: I think it requires a full-scale effort on behalf of organizers and reviewers to encourage submission of industry papers. An industry paper track could be a good idea: our current page “limit” (effectively minimum) of 20+ pages is quite onerous for time-strapped company employees, and the review expectation of a half-page of inference rules doesn’t usually align with company priorities, so company employees have a harder time justifying these theoretical developments. A separate dedicated review process for these papers would help. PLDI should also invite more industry speakers, and consider drawing more conference organization positions from industry to make sure their needs are met. I think this would be healthy for the community.
Some quantum questions
This was perhaps the PLDI of quantum. There was Aws’s keynote, a dedicated quantum computing track, and several papers in workshops and other tracks talking about quantum programming. I have lots of opinions about quantum programming: I have an old paper talking about programming, it has lots of connections with probabilistic reasoning, and many people in my life (like my wife!) have worked on physical aspects of quantum computers. So, you might think that I’m an enthusiast, but I will go on record here as saying I’m a skeptic. Explaining all my concerns and questions would blow my 3-hour budget, so I’ll briefly list some here and save a longer discussion for perhaps a future post:
- Narrowness and brittleness of quantum advantage. There are two strong applications of quantum algorithms I know of that have thoroughly stood the test of time: discrete logarithm (breaking lots of crypto) and universal quantum simulation (Seth Loyd’s paper here is a canonical reference). New applications — like quantum machine learning — come and go: as soon as a target is identified, the classical computing tortoise slowly catches up to quantum hare. I am still skeptical that interesting computing applications beyond these two domains exist: perhaps this blog post will inspire some email relieving me of my enduring ignorance.
- Physical realizability. We’ve made real strides in building quantum computers with more qubits, and from what I understand some applications in the NISQ setting are opening up. But, we remain a long way off from general quantum computers requiring many qubits. Partly this is due to error correction: error correction consumes some qubits, effectively multiplying the number of qubits required to build a working computer. Surface error correcting codes have very high overhead but are well-understood and have been physically realized; more promising codes like quantum LPDC codes have lower overhead but are less well-understood and have not been physically realized (yet). We are also still rapidly iterating on the physical design of the qubit, with neutral atoms emerging as a new candidate with its own slew of complexities. In short, the model of the computer itself feels quite unsettled, so it’s hard to ask interesting questions of what it means to design programming languages for it in my opinion.
There are undoubtedly fun programming languages problems in quantum: it has a fun exotic semantics (I like that!), there are lots of things to think about in terms of optimization and physical layout. But, I’m wary of dedicating too much effort to these downstream questions before we resolve these foundational questions I listed above. There’s more to say here, but I’ll leave it at that. I spent a nice amount of time talking with Aws and Charles, two of UW Madison’s quantum of quantum. My skepticisms endure, but it was nice to hear their opinions on these above concerns (but, I remain unconvinced)!
GPU programming (is still too hard)
There were several talks on GPU programming I thought were pretty interesting. I would really like to utilize GPUs more in my probabilistic programming work, but I find them terribly hard to program. This is because the kinds of computations I want to do are not matrix multiplications, and it’s pretty hard to get GPUs to cooperate with workloads that do not look like matrix multiplications.
There were a couple papers I thought were particularly interesting in this category:
Abhinav Jangda had an interesting paper on implementing Strassen-like multiplication using CUDA. Strassen’s algorithm is a matrix multiplication strategy that has better asymptotic complexity than the naive \(O(n^3)\) strategy (where \(n\) is the dimension of the matrix). Strassen is harder to implement than naive matrix multiplication, so implementing it in CUDA is nice! But, a key downside of Strassen’s algorithm is less numerically stable than naive matrix multiplication (look at the algorithm: there are subtractions, which can give catastrophic cancellation). I am not sure how practical this tradeoff is, especially with quantization for LLM workloads.
There was a nice distinguished paper talk from Manya Bansal on typed perspectives for GPU programming. The goal here is to make high-level GPU programming easier. GPU programming has a challenging control-plane hierarchy. Processing units are hierarchically organized: thread is a single unit of compute, warps are some threads that get run together, warps work together to implement a GPU kernel. When writing GPU code, you typically write hyper-cursed C++ programs that do things like ask for the current thread ID and branch based on what that is. This means that the same program describes all the behaviors that take place across different levels of the control hierarchy. As you can imagine, this makes it hard to disambiguate between thread-local code, warp-local code, and glue code that wires everything up. This paper creates a type system to help separate these notions and prevent silly mistakes, but one wonders: is this a bandaid over a flawed programming model to begin with? Why is it still so hard to write good GPU programs?
Floating point error (finally done right?)
Speaking of floating point error: it sucks. I work in probabilistic programming and I regularly encounter floating point error. I firmly believe that the way we think about floating point error today is similar to how we thought about undefined behavior in programming languages 5–10 years ago: as a compromise in language design that we simply do not have the performance and cognitive budget to overcome. Today we are paying the price for the compromise of undefined behavior in language design: we are slowly, painfully excorsizing undefined behavior from our programming languages at tremendous expense, and we will need to do the same for floating point error (especially in the scientific computing community, where the problem of floating point error looms like a boogeyman where only the most guru-ish programmers even know how to state the problems). We write all sorts of programs that have floating point errors in them and pretend like it’s okay, and I’m quite passionate that we need to improve on this abysmal situation.
So, I’m relieved to see the programming languages community is starting to have some genuinely good ideas in this space. I really enjoyed Laura Zielinski and Justin Hsu’s distinguished paper on compositional backward error analysis for floating point programs.
The key idea conceptual framework for floating-point error analysis is summarized in this diagram:
Exact Function f
Input (x) ────────────────────────────────────────► Exact Output (y)
│ │
│ (Backward Error: Δx) │ (Forward Error: Δy)
▼ ▼
Perturbed Input (x + Δx) ───────────────────────────► Computed Output (ŷ)
Exact Function fThere are two ways to think about the approximation error of floating point:
- The first is forward error, which asks “How close is my computed output to the true output?”. For a real-valued input
x, run an exact functionf(x)to get a real-valued outputy. Then, compare that with the floating-point computed outputŷ. The differencey - ŷis the forward error. - The second is backward error, which asks “What exact problem did my algorithm actually solve?”, which is a bit less intuitive. For a given computed output
ŷ, one finds a quantityΔxto perturb the input by so thatf(x + Δx) = ŷ. This quantityΔxis the backward error. IfΔxis on the order of machine epsilon precision, then the algorithm is called backward stable.
These two forms of error are related by the condition number, which “measures how much the output value of the function can change for a small change in the input argument”. This is closely related to a derivative: for a function in 1 variable, the condition number is the maximum absolute value of the derivative, \(\max_{x} f'(x)\). These quantities are all related by the following key equation:
$$\text{Forward Error} \le \text{Condition Number} \times \text{Backward Error}$$
This gives some hint about why backward error is central in numerical methods: if you have a small backward error, it cancels out the condition number and crunches the forward error to 0, which is the ideal situation.
I got a bit sidetracked here setting up the problem, and didn’t leave myself enough time to really discuss the paper itself. That’s OK: the paper is one of the better-written papers I’ve seen in a long time, so I encourage you to check it out. They give a new definition for backward error analysis that enables compositional reasoning about backward error. This would have been enough for a good result. Then, they wrap this new compositional definition up in a beatiful functional programming abstraction (lenses). It’s not done. Then, they give a semantic characterization of this analysis using a categorical framework and characterize some nice universal properties of this category (no, it isn’t Cartesian-closed so you model STLC in it, but it is nice). Wow, that’s a good result. Then, they implement an automated analysis using egglog that puts this framework to work by computing from an input numerical expression the set of all possible numerical computations paired with their backward errors. And that’s how you get a distinguished paper!
Probabilistic programming
I genuinely feel that the probabilistic programming papers at this year’s PLDI are some of the best I’ve seen in years, and it’s very exciting to me. I’ll comment on two that I particularly liked.
Incremental computation
Incremental Computation for Efficient Programmable Inference in Probabilistic Programs. Markov-Chain Monte Carlo inference is a local-search style inference strategy that works by perturbing some part of a probabilistic program and then rerunning it. This perturbation might only change a small part of the previous run, which can be a lot of wasted computation. This paper fixes this issue via the incremental lambda calculus. The essence of the idea an incrementalizing program transformation that augments functions \(\lambda x . e\) with an extra argument \(\lambda x . \lambda \Delta x . e\), where \(\Delta x\) is an “input difference”. Once the program has been transformed in this way, one can reuse prior function computations by varying \(\Delta x\) and keeping \(x\) fixed. This paper had to do some non-trivial generalizations to the incremental lambda calculus to make it fast enough to be useful. There are also some interesting semantics involving name generation, which is becoming an increasingly widely-used semantic component for PPLs (is that our fault ? I am not sure…) To me, this paper is a good idea that most MCMC inference implementations should probably adopt.
Gradient estimation
GradInf: Gradient Estimation as Probabilistic Inference I really enjoyed this paper and the talk by Gaurav. Gradient estimation refers generally to the problem of estimating the gradient of some function from samples. A particularly common and challenging instance of gradient estimation occurrs in machine learning and statistics, where one wishes to estimate the gradient of an expected value where direct computation of the expected value is too expensive. Concretely, for some two-dimensional function \(f : \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}\), probability distribution \(\mu\) on \(\mathbb{R}\), estimate:
$$\frac{\text{d}}{\text{d}y} \mathbb{E}_{x \sim \mu}[f(x, y)]$$
In machine learning, the distribution \(\mu\) is typically on the data, and the parameter \(y\) is a model parameter one is trying to optimize. Just to quickly ground ourselves, let’s consider \(f(x,y) = x^2 + y\), and let \(\mu\) be a uniform distribution on the interval \([0,1]\). Then, we can compute the above analytically:
$$ \begin{align*} \frac{\text{d}}{\text{d}y} \mathbb{E}_{x \sim \mu}[x^2 + y] &= \frac{\text{d}}{\text{d}y} \int_0^1 x^2 + y ~ \text{d}x \\ &= \frac{\text{d}}{\text{d}y} 1+y \\ &= 1 \end{align*} $$
(It’s been a very embarassingly long time since I’ve done an analytic calculation like above, let me know if it’s wrong…) So, the challenge of gradient estimation is to compute the above without evaluating the expecetation \(\mathbb{E}\) analytically. This is a central problem that shows up across machine learning, and there are many approaches, the most famous probably being the REINFORCE estimator.
I’ve always found gradient estimators to be a very confusing topic in machine learning. There is a zoo of different estimators with many different properties, and it seems only the experts and gurus really know what’s going on. This paper brings a tremendous amount of clarity to the many different kinds of gradient estimation techniques by dividing the problem up into 4 separate design steps: (1) choose a coupling, (2) factorize the resulting distribution, (3) perform probabilistic inference and automatic differentiation. Table is 1 a very impressive figure: it shows how a great number of gradient estimation techniques can be derived by making different choices for how to implement the 3 steps above. I love when a very complicated idea can be broken down to its essence, and I will definitely be returning to this paper if (when?) I need to understand gradient estimation better.
Bits and bobbles
- In homage to Phil Zucker, I’ll conclude this post with a barely-formatted bullet-pointed list of what I didn’t have time to polish. This goes against every instict I have as a perfectionist academic writer of ACM-format-compliant papers, but I’ve gotten value out of his bits and bobbles, maybe you’ll get value out of these.
- I blew my budget on this post. I was having a lot of fun. I spent 6 hours on it. I really enjoyed writing it though, so I think next time I will consider making the post during the conference.
- Apologies if I did not mention an interaction or talk; I ran out of time to write more.
- I really enjoyed several interactions at PLDI. It was fun to catch up with Max Bernstein; he encouraged me to just blog more and not polish it. I did the first part, but the second part is quite hard.