Brandon Reagen
Assistant Professor

Brandon Reagen is an Assistant Professor in the Department of Electrical and Computer Engineering, with affiliation appointments in Computer Science. He earned a PhD in computer science from Harvard in 2018 and received his undergraduate degrees in computer systems engineering and applied mathematics from the University of Massachusetts, Amherst, in 2012.
A computer architect by training, Brandon focuses his research on designing specialized hardware accelerators for applications including deep learning and privacy-preserving computation. He has made several contributions that ease the use of accelerators as general architectural constructs, including benchmarking, simulation infrastructure, and System on a Chip (SoC) design. He has led the way in highly efficient and accurate deep learning accelerator design with his studies of principled unsafe optimizations, and his work has been published in conferences spanning computer architecture, machine learning, computer-aided design, and circuits.
Prior to joining NYU, he was a research scientist with Facebook’s AI Infrastructure Research team, working on privacy-preserving machine learning and systems for neural recommendation. During his PhD he was a Siebel Scholar (2018) and was selected as a 2018 Rising Star in Computer Architecture by Georgia Tech.
Research News
Cracking the code of private AI: The role of entropy in secure language models
Large Language Models (LLMs) have rapidly become an integral part of our digital landscape, powering everything from chatbots to code generators. However, as these AI systems increasingly rely on proprietary, cloud-hosted models, concerns over user privacy and data security have escalated. How can we harness the power of AI without exposing sensitive data?
A recent study, Entropy-Guided Attention for Private LLMs by Nandan Kumar Jha, a Ph.D. candidate at the NYU Center for Cybersecurity (CCS), and Brandon Reagen, Assistant Professor in the Department of Electrical and Computer Engineering and a member of CCS, introduces a novel approach to making AI more secure. The paper was presented at the AAAI Workshop on Privacy-Preserving Artificial Intelligence in early March.
The researchers delve into a fundamental, yet often overlooked, property of neural networks: entropy — the measure of information uncertainty within a system. Their work proposes that by understanding entropy’s role in AI architectures, we can improve the privacy, efficiency, and reliability of LLMs.
The Privacy Paradox in AI
When we interact with AI models — whether asking a virtual assistant for medical advice or using AI-powered legal research tools — our input data is typically processed in the cloud. This means user queries, even if encrypted in transit, are ultimately decrypted for processing by the model. This presents a fundamental privacy risk: sensitive data could be exposed, either unintentionally through leaks or maliciously via cyberattacks.
To design efficient private LLMs, researchers must rethink the architecture these models are built on, because the nonlinear operations that make computing on encrypted data so costly are also the ones the models depend on. Simply removing those nonlinearities, however, destabilizes training and disrupts the core functionality of components like the attention mechanism.
“Nonlinearities are the lifeblood of neural networks,” says Jha. “They enable models to learn rich representations and capture complex patterns.”
The field of Private Inference (PI) aims to solve this problem by allowing AI models to operate directly on encrypted data, so the model provider never sees the user's raw input and the user learns nothing about the model beyond its output. However, PI comes with significant computational costs. Encryption methods that protect privacy also make computation more complex, leading to higher latency and energy consumption, two major roadblocks to practical deployment.
To tackle these challenges, Jha and Reagen’s research focuses on the nonlinear transformations within AI models. In deep learning, nonlinear functions like activation functions play a crucial role in shaping how models process information. The researchers explore how these nonlinearities affect entropy — specifically, the diversity of information being passed through different layers of a transformer model.
“Our work directly tackles this challenge and takes a fundamentally different approach to privacy,” says Jha. “It removes nonlinear operations while preserving as much of the model’s functionality as possible.”
Using Shannon’s entropy as a quantitative measure, they reveal two key failure modes that occur when nonlinearity is removed (both are illustrated in the sketch following the list):
- Entropy Collapse (Deep Layers): In the absence of nonlinearity, later layers in the network fail to retain useful information, leading to unstable training.
- Entropic Overload (Early Layers): Without proper entropy control, earlier layers fail to efficiently utilize the Multi-Head Attention (MHA) mechanism, reducing the model’s ability to capture diverse representations.
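In spirit, both failure modes can be spotted by measuring the Shannon entropy of each head's post-softmax attention weights. The short PyTorch sketch below illustrates that measurement; it is not the authors' released code, and the tensor shapes and toy inputs are assumptions.

```python
import torch

def attention_entropy(attn_weights: torch.Tensor) -> torch.Tensor:
    """Mean Shannon entropy (in nats) of each head's attention distribution.

    attn_weights: post-softmax attention of shape (batch, heads, q_len, k_len),
    so each row along the last dimension sums to 1.
    """
    eps = 1e-9
    ent = -(attn_weights * (attn_weights + eps).log()).sum(dim=-1)  # (batch, heads, q_len)
    return ent.mean(dim=(0, 2))                                     # average per head

# Toy extremes: uniform attention vs. near one-hot attention over 16 keys.
k_len = 16
uniform = torch.full((1, 1, 4, k_len), 1.0 / k_len)   # high-entropy, "entropic overload" regime
one_hot = torch.zeros(1, 1, 4, k_len)
one_hot[..., 0] = 1.0                                 # low-entropy, "entropy collapse" regime

print(attention_entropy(uniform))  # ~log(16) ≈ 2.77 nats
print(attention_entropy(one_hot))  # ~0 nats
```

Tracking these per-head values across layers is what reveals overload in the early layers and collapse in the deep ones.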
This insight is new — it suggests that entropy isn’t just a mathematical abstraction but a key design principle that determines whether a model can function properly.
A New AI Blueprint
Armed with these findings, the researchers propose an entropy-guided attention mechanism that dynamically regulates information flow in transformer models. Their approach consists of Entropy Regularization — a new technique that prevents early layers from being overwhelmed by excessive information — and PI-Friendly Normalization — alternative methods to standard layer normalization that help stabilize training while preserving privacy.
By strategically regulating the entropy of attention distributions, they were able to maintain coherent, trainable behavior even in drastically simplified models. This keeps the attention weights meaningful and avoids the degenerate patterns that commonly arise once nonlinearity is removed, in which a disproportionate number of heads either collapse to near one-hot attention (low entropy) or diffuse attention uniformly (high entropy), both of which impair the model's ability to focus and generalize.
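As a rough illustration only (the actual regularizer and the PI-friendly normalization are specified in the paper and its released code), an entropy-guided penalty of this general shape could be added to the training objective; the per-layer targets and the weight `lambda_ent` here are hypothetical.

```python
import torch

def entropy_penalty(attn_weights: torch.Tensor, target: float, eps: float = 1e-9) -> torch.Tensor:
    """Penalize deviation of the mean attention entropy from a target value."""
    ent = -(attn_weights * (attn_weights + eps).log()).sum(dim=-1).mean()
    return (ent - target) ** 2

# Illustrative use inside a training step, with targets chosen per layer
# as a fraction of the maximum possible entropy log(k_len):
# loss = task_loss + lambda_ent * sum(
#     entropy_penalty(attn, tgt) for attn, tgt in zip(attention_maps, layer_targets))
```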
This work bridges the gap between information theory and neural architecture design, establishing entropy dynamics as a principled guide for developing efficient privacy-preserving LLMs. It represents a crucial step toward making privacy-preserving AI practical, offering a roadmap for models that are not only more private but also computationally efficient.
The team has also open-sourced their implementation, inviting researchers and developers to experiment with their entropy-guided approach.
arXiv:2501.03489v2 [cs.LG] 8 Jan 2025
DeepReDuce: ReLU Reduction for Fast Private Inference
This research was led by Brandon Reagen, assistant professor of computer science and electrical and computer engineering, with Nandan Kumar Jha, a Ph.D. student under Reagen, and Zahra Ghodsi, who obtained her Ph.D. at NYU Tandon under Siddharth Garg, Institute associate professor of electrical and computer engineering.
Concerns surrounding data privacy are changing how companies use and store users’ data, and lawmakers are passing legislation to strengthen users’ privacy rights. Deep learning is the core driver of many applications impacted by these concerns: it provides high utility in classifying, recommending, and interpreting user data to build user experiences, and it requires large amounts of private user data to do so. Private inference (PI) is a solution that provides strong privacy guarantees while preserving the utility of neural networks to power applications.
Homomorphic data encryption, which allows inferences to be made directly on encrypted data, is a solution that addresses the rise of privacy concerns for personal, medical, military, government and other sensitive information. However, the primary challenge facing private inference is that computing on encrypted data levies an impractically high penalty on latency, stemming mostly from non-linear operators like ReLU (rectified linear activation function).
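To make that cost asymmetry concrete, the sketch below uses the open-source TenSEAL library (chosen here only for illustration, not necessarily the toolchain behind this work): a linear operation can be evaluated directly on CKKS-encrypted data, while a nonlinearity such as ReLU has no native ciphertext operation and must fall back on far more expensive protocols or approximations.

```python
import tenseal as ts

# CKKS context for approximate arithmetic over encrypted real numbers.
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2 ** 40
context.generate_galois_keys()

x = [0.5, -1.2, 3.0, 0.7]   # private user features
w = [0.1, 0.4, -0.2, 0.3]   # public model weights

enc_x = ts.ckks_vector(context, x)   # encrypt the user's input

# A linear operation (dot product) runs directly on the ciphertext: cheap.
enc_y = enc_x.dot(w)
print(enc_y.decrypt())               # ≈ [-0.82]

# ReLU(enc_y) has no equivalent ciphertext instruction; it requires garbled
# circuits, secret sharing, or a polynomial approximation, which is where
# most private-inference latency comes from.
```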
Solving this challenge requires new optimization methods that minimize a network's ReLU count while preserving accuracy. One approach is to eliminate the ReLU operations that contribute little to the accuracy of inferences.
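A hypothetical first step (not the authors' tooling) is simply to count how many ReLU activations a standard network evaluates per inference, since that count is what drives private-inference latency. The model and input size below are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def count_relu_activations(model: nn.Module, input_shape=(1, 3, 32, 32)) -> int:
    """Total number of ReLU activations evaluated in one forward pass."""
    counts, hooks = [], []
    for m in model.modules():
        if isinstance(m, nn.ReLU):
            hooks.append(m.register_forward_hook(
                lambda mod, inp, out: counts.append(out.numel())))
    with torch.no_grad():
        model(torch.randn(input_shape))
    for h in hooks:
        h.remove()
    return sum(counts)

model = resnet18(num_classes=10)
print(f"ReLU activations per inference: {count_relu_activations(model):,}")
```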
“What we are trying to do there is rethink how neural nets are designed in the first place,” said Reagen. “You can skip a lot of these time- and computationally expensive ReLU operations and still get high-performing networks at 2 to 4 times faster run time.”
The team proposed DeepReDuce, a set of optimizations for the judicious removal of ReLUs to reduce private inference latency. The researchers tested this by dropping ReLUs from classic networks to significantly reduce inference latency while maintaining high accuracy.
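A heavily simplified sketch of that idea is below: selected ReLU modules are swapped for identities and the pruned network is then retrained to recover accuracy. DeepReDuce itself chooses what to drop with a much more careful, stage-wise criterion; the choice of stages here is arbitrary and purely illustrative.

```python
import torch.nn as nn
from torchvision.models import resnet18

def drop_relus(module: nn.Module) -> None:
    """Replace every nn.ReLU inside `module` with nn.Identity, in place."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.Identity())
        else:
            drop_relus(child)

model = resnet18(num_classes=10)

# Illustrative culling: strip all ReLUs from two of the four residual stages,
# then retrain or fine-tune the network on the original task.
drop_relus(model.layer3)
drop_relus(model.layer4)

remaining = sum(isinstance(m, nn.ReLU) for m in model.modules())
print(f"ReLU modules remaining: {remaining}")
```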
The team found that, compared to the state-of-the-art for private inference, DeepReDuce improved accuracy and reduced ReLU count by up to 3.5% (iso-ReLU count) and 3.5× (iso-accuracy), respectively.
The work extends an innovation called CryptoNAS, described in an earlier paper whose authors include Ghodsi and a third Ph.D. student, Akshaj Veldanda. CryptoNAS optimizes the use of ReLUs much as one might rearrange rocks in a stream to optimize the flow of water: it rebalances the distribution of ReLUs in the network and removes redundant ones.
The investigators will present their work on DeepReDuce at the 2021 International Conference on Machine Learning (ICML) from July 18-24, 2021.