The Role of Theory in Trustworthy & Interpretable AI

Overview

Recent progress in generative AI (large language models, diffusion models, and beyond) has been transformative, yet it has largely proceeded without engaging with the theoretical foundations developed by the theoretical computer science (TCS) community. At the same time, the deployment of these models has created pressing challenges around trustworthiness: How do we attribute model behavior to training data? How do we verify whether content is AI-generated? How do we align AI systems with the diverse preferences of society? How do we ensure safety and interpretability?

We believe theory has a key role to play in addressing these questions. Indeed, a growing body of recent work demonstrates that ideas from cryptography, social choice theory, learning theory, and statistics yield concrete, practical tools for modern AI systems. This workshop will showcase these successes and make the case that trustworthy AI is a natural and fertile area for harnessing theoretical perspectives for practical impact.

Organizers

Noah Golowich

Allen Liu

Abhishek Shetty

Schedule

Monday, June 22

Tutorial by Noah Golowich, Allen Liu, and Abhishek Shetty 8:30–9:30am MDT

Tutorial on Language Models and Generative AI

Talk by Andrew Ilyas 9:30–10:30am MDT

Predictive models for AI systems

Abstract

Building AI systems entails making a number of extraordinarily high-dimensional design choices, such as what data to train on, what hyperparameters to use, and how to order training data. A natural stepping stone towards trustworthy AI systems is to be able to predict how these design choices affect model behavior. In particular, predictability enables us to design targeted interventions that steer our systems towards desirable behavior. In this talk, we will discuss a family of approaches towards building predictive models of AI systems, with applications to data selection, data editing, and privacy auditing.

Tuesday, June 23

Talk by Paul Christiano 8:30–9:30am MDT

Theoretical approaches to AI alignment

Abstract

I’ll survey a range of theoretically grounded approaches for aligning AI systems with human intent. The first half of the talk will discuss black-box methods that use checks and balances to prevent bad behavior. The second half will explore white-box methods that leverage model weights to make more accurate predictions about out-of-distribution behavior. Within each category I’ll review a few existing theoretical and empirical results and state some important open problems.

Talk by Miranda Christ 9:30–10:30am MDT

Pseudorandom Codes and Cryptographic Watermarks for AI-Generated Content

Abstract

Watermarks involve embedding hidden patterns in AI-generated content to facilitate its detection. Intuitively, it seems inevitable that the stronger the watermark is, the more it must alter the content, leading to quality degradation. However, a recent line of work draws on cryptography to construct watermarks that are both robust to modification of the content and provably quality-preserving: the watermarked model is computationally indistinguishable from the original model, implying that quality is maintained under any efficiently computable metric.

The key ingredient is a new cryptographic primitive called a pseudorandom error-correcting code, also an interesting theoretical object in its own right. A pseudorandom code is an error-correcting code whose codewords are computationally indistinguishable from uniformly random strings, while still being efficiently decodable from a constant rate of errors.

In this talk, I will survey common approaches to watermarking, focusing on those using pseudorandom codes. I'll end with some open questions that I find interesting, ranging from theory questions about pseudorandom codes to engineering questions that could yield more practical watermarks.

Wednesday, June 24

Talk by Cynthia Dwork 8:30–9:30am MDT

Equitable Evaluation via Elicitation

Abstract

Data are not simply given; they are created. Scoring and classification algorithms are trained on, and operate over, representations of individuals obtained via a representation mapping whose design directly affects what is given. This representation is not neutral, even when individuals decide how to represent themselves. For example, on a professional networking platform, individuals with similar qualifications and skills may vary in their outward manner: some tend toward self-promotion while others are modest to the point of omitting crucial information. Comparing the self-descriptions of equally qualified job seekers with different self-presentation styles is therefore problematic.

These differences are often correlated with gender and geography, and women and people from less individualistic cultures are typically advised to “speak more like (American) men.” In this work, we provide a method for obtaining a more equitable representation mapping via interaction. Specifically, we build a prototype interactive AI for skill elicitation that permits accurate determination of skills while simultaneously allowing individuals to speak in their own voice.

To obtain sufficient training data, we train an LLM to act as synthetic humans.

Joint work with Elbert Du, Lunjia Hu, Reid McIlroy-Young, Han Shao, and Linjun Zhang.

Talk by Sam Gunn 9:30–10:30am MDT

How to sketch a learning algorithm

Abstract

How does the choice of training data influence an AI model? This question is of central importance to interpretability, privacy, data attribution, and basic science. At its core is the data deletion problem: after a reasonable amount of precomputation, quickly predict how the model would behave if a given subset of training data had been deleted.

I will present a data deletion scheme capable of predicting model outputs to arbitrary precision in the deep learning setting. A bound on the error can be derived from a simple assumption about the stability of the AI model. In contrast to the assumptions made by prior work, ours appears to be fully compatible with deep learning.

We believe that this method opens up a range of new possibilities for theory in AI. For instance, it gives the first machine unlearning scheme with provable security in the deep learning setting.