Neural Image Understanding
Ongoing Research
A neuroscience-driven research pipeline that decomposes image memorability into brain-region contributions, discovers that object-selective regions drive memorability while scene regions suppress it, and applies this to engineer more memorable images via LoRA fine-tuning of diffusion models.
What if you could use a computational model of the human visual cortex to understand — and engineer — what makes images stick in memory? This research builds a complete pipeline from brain modeling to prescriptive image generation, grounded in fMRI data from real human subjects.
The Vision
Memorability is not random. Some images are almost universally remembered; others are forgotten within seconds. The question is whether that signal lives in the image itself, in the brain's response to it, or both — and whether we can use that understanding to make images deliberately more memorable.
This project builds toward a prescriptive tool that takes any image and shifts it toward the neural signature of memorability, grounded in actual measurements of the human visual cortex.
TRIBEv2: A Computational Visual Cortex
TRIBEv2 is a brain-predictive encoder trained to map visual input to predicted cortical responses. It pairs V-JEPA2 (1.2B parameters) — Meta's self-supervised video encoder — with an FmriEncoder that maps those visual features to predicted activity across 20,484 cortical vertices.
The model was trained on the Natural Scenes Dataset (NSD): 73,000 natural images viewed by 8 subjects during fMRI scanning, providing a rich ground truth for how the brain responds to real-world visual content.
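As a rough sketch of the readout stage (not the actual implementation), the FmriEncoder amounts to a learned map from pooled encoder features to 20,484 vertex responses. The feature width below is an assumed placeholder, and the real head is more involved:

```python
import torch
import torch.nn as nn

N_VERTICES = 20_484  # cortical vertices predicted per image
FEAT_DIM = 1024      # assumed pooled V-JEPA2 feature width; placeholder value

class FmriReadoutSketch(nn.Module):
    """Toy stand-in for the FmriEncoder head: pooled features -> vertex responses."""

    def __init__(self, feat_dim: int = FEAT_DIM, n_vertices: int = N_VERTICES):
        super().__init__()
        self.head = nn.Sequential(nn.LayerNorm(feat_dim), nn.Linear(feat_dim, n_vertices))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, feat_dim) embeddings from the frozen visual encoder
        return self.head(features)

feats = torch.randn(2, FEAT_DIM)         # placeholder for frozen V-JEPA2 output
pred_bold = FmriReadoutSketch()(feats)   # (2, 20484) predicted vertex activity
```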
Key Regions of Interest
- FFA — Fusiform Face Area: face processing
- LOC — Lateral Occipital Complex: object selectivity
- PPA — Parahippocampal Place Area: place and scene encoding
- RSC — Retrosplenial Complex: scene navigation and spatial context
- V1 — Primary Visual Cortex: low-level edge and orientation detection
- STS / ATL — Superior Temporal Sulcus and Anterior Temporal Lobe: social and semantic processing
Together, these regions span the full ventral and dorsal visual hierarchy, giving a rich multi-dimensional view of how any image is processed across the cortex.
Brain-Guided Image Generation
The first experimental direction: use TRIBEv2's predicted brain responses as a reward signal to steer Stable Diffusion — maximizing the predicted response of a target ROI during generation.
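In schematic form, the guidance loop differentiates the predicted ROI response with respect to the latent and nudges the latent uphill. The sketch below replaces the Stable Diffusion decoder and TRIBEv2 with toy linear modules; `decode`, `roi_response`, and all shapes are placeholders:

```python
import torch

# Toy stand-ins: in the real pipeline `decode` is the SD VAE decoder and
# `roi_response` is TRIBEv2's predicted mean activity over the target ROI.
decode = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(4 * 8 * 8, 3 * 32 * 32))
roi_response = torch.nn.Linear(3 * 32 * 32, 1)

guidance_scale = 50.0                         # illustrative; the text reports 50-150
latent = torch.randn(1, 4, 8, 8, requires_grad=True)

reward = roi_response(decode(latent)).mean()  # predicted target-ROI activity
reward.backward()                             # d(reward)/d(latent)

with torch.no_grad():
    latent += guidance_scale * latent.grad    # push the latent toward higher activity
```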
The Adversarial Gradient Problem
Backpropagating reward gradients through the 1.2B-parameter encoder amplifies high-frequency components. The generated images quickly degrade into adversarial noise that triggers high predicted brain responses without resembling anything meaningful. Over 200 experiments explored solutions: gradient clipping, noise schedules, latent-space regularization, and more.
The Fix: Gradient Truncation + Blur
The working solution combines two interventions:
- Gradient truncation at encoder layer 20 — stops the gradient before the deepest, most artifact-prone layers
- Gaussian blur (σ=1.0) on the brain gradient — smooths the update signal to suppress high-frequency amplification
Together, these enable guidance scales up to 3× higher (50 → 150) without artifact collapse — enough to visibly shift image content toward the target brain region's preferences.
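A minimal sketch of the two interventions, with toy conv blocks standing in for the encoder. It assumes "truncation at layer 20" means the reward is read off layer-20 features, so deeper layers never enter the backward pass:

```python
import torch
from torchvision.transforms.functional import gaussian_blur

TRUNCATE_AT = 20  # read the reward off layer 20; deeper layers stay out of backprop
SIGMA = 1.0       # blur width for the brain gradient (from the text)

# Toy stand-ins: 24 conv blocks for the encoder, a linear readout for the ROI.
blocks = torch.nn.ModuleList(torch.nn.Conv2d(3, 3, 3, padding=1) for _ in range(24))
readout = torch.nn.Linear(3 * 32 * 32, 1)

image = torch.randn(1, 3, 32, 32, requires_grad=True)
x = image
for block in blocks[:TRUNCATE_AT]:            # truncated forward pass
    x = block(x)
reward = readout(x.flatten(1)).mean()
reward.backward()

# Smooth the pixel-space gradient before using it as the guidance signal.
smooth_grad = gaussian_blur(image.grad, kernel_size=[5, 5], sigma=[SIGMA, SIGMA])
```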
CLIP Proxy Pipeline
The cleaner alternative: train lightweight linear probes (512 → N_voxels) on CLIP image features to approximate brain responses. CLIP gradients are smooth and well-behaved — no adversarial artifacts.
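A sketch of one such probe, assuming ViT-B/32's 512-dimensional CLIP embeddings and a placeholder voxel count; the real probes are fit per ROI against TRIBEv2's predicted responses:

```python
import torch
import torch.nn as nn

CLIP_DIM = 512     # ViT-B/32 image-embedding width
N_VOXELS = 1_000   # placeholder per-ROI voxel count; varies by region

probe = nn.Linear(CLIP_DIM, N_VOXELS)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

# Placeholder data: CLIP image embeddings paired with voxel responses.
clip_feats = torch.randn(256, CLIP_DIM)
voxel_resp = torch.randn(256, N_VOXELS)

for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(probe(clip_feats), voxel_resp)
    loss.backward()
    opt.step()
```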
Cross-evaluation validates the proxy: images generated by maximizing PPA proxy activation score highest on PPA brain activation according to TRIBEv2. The proxy generalizes.
The Memorability Discovery
This is the centerpiece finding of the project. Rather than optimizing for a single brain region, the question became: which brain regions predict whether an image will be remembered?
To answer it, we scored 58,000 images from the LaMem memorability dataset through TRIBEv2, decomposing each image's predicted cortical response by ROI and correlating those responses with human memorability scores.
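The decomposition itself is straightforward: average the predicted response over each ROI's vertices, then correlate that per-image scalar with the memorability scores. A sketch with placeholder data and hypothetical ROI masks (the real run covers all 58,000 LaMem images):

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder data; the real analysis uses 58,000 images x 20,484 vertices.
n_images, n_vertices = 2_000, 20_484
pred = np.random.randn(n_images, n_vertices).astype(np.float32)
memorability = np.random.rand(n_images)          # LaMem scores

# Hypothetical ROI masks: boolean vectors over the vertex dimension.
idx = np.arange(n_vertices)
roi_masks = {"FFA": idx < 300, "RSC": (idx >= 300) & (idx < 600)}

for roi, mask in roi_masks.items():
    roi_signal = pred[:, mask].mean(axis=1)      # mean ROI response per image
    r, p = pearsonr(roi_signal, memorability)
    print(f"{roi}: r = {r:+.2f} (p = {p:.1e})")
```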
A Cortical Dissociation
The result is a clean, consistent dissociation across the visual hierarchy:
Object- and socially selective regions predict memorability (positive correlation):
- FFA (faces): r = +0.23
- ATL (semantic/social): r = +0.21
- STS (social processing): r = +0.20
- LOC (objects): r = +0.05
- V1 (low-level): r = +0.06 (near zero — early visual cortex does not drive memorability)
Scene-selective regions suppress memorability (negative correlation):
- RSC (scene navigation): r = −0.49
- PCC (posterior cingulate): r = −0.42
- PPA (places): r = −0.37
The story: the ventral social and object pathway drives what we remember, while the dorsal scene-navigation pathway suppresses it. This aligns with prior behavioral work showing that objects and faces are more memorable than landscapes, but here the effect is read directly off the brain's predicted response.
Validation: Real fMRI Data (BOLD5000)
To rule out the possibility that these correlations are an artifact of the model, the same analysis was run on BOLD5000 — actual fMRI responses from human subjects viewing the same images. The dissociation holds on real brain data:
- LOC: r = +0.27 on actual brain responses
- RSC: r = −0.35 on actual brain responses
Two independent predictors, TRIBEv2 (trained on NSD) and a second model trained directly on BOLD5000, confirm the same signs across all ROIs.
Ruling Out Category Confounds
One alternative explanation: scene images are simply less memorable than object images, and ROI responses just detect category. Partial correlations controlling for image category (scene vs. object) preserve all signs and magnitudes, ruling out this confound.
The brain-response pattern within each category predicts memorability — not just which category the image belongs to.
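A common way to compute such a partial correlation is to residualize both variables on the covariate and correlate the residuals. A sketch with synthetic data, where `category` is a scene-vs-object indicator:

```python
import numpy as np
from scipy.stats import pearsonr

def partial_corr(x, y, z):
    """Pearson r between x and y after regressing out covariate z from both."""
    Z = np.column_stack([np.ones_like(z), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return pearsonr(rx, ry)[0]

# Placeholder data: ROI response, memorability, scene-vs-object indicator.
rng = np.random.default_rng(0)
roi, mem = rng.standard_normal(500), rng.standard_normal(500)
category = rng.integers(0, 2, 500).astype(float)  # 0 = object, 1 = scene

print(partial_corr(roi, mem, category))
```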
Spatial Coherence
Vertex-level analysis of RSC shows that 99.2% of RSC vertices have a negative memorability correlation — a highly coherent spatial pattern across the region, not a noisy average masking heterogeneous effects.
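Given per-vertex correlations, the coherence check itself is one line; the values below are placeholders for the real vertex-level r's:

```python
import numpy as np

# Placeholder per-vertex correlations within RSC; the real values come from
# correlating each vertex's predicted response with memorability across images.
vertex_r = np.random.uniform(-0.6, 0.05, size=800)
print(f"{(vertex_r < 0).mean():.1%} of RSC vertices correlate negatively")
```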
Engineering Memorability
The research question becomes prescriptive: given the discovered brain signature of memorability, can we use it to make images more memorable?
The approach: fine-tune a LoRA adapter for Stable Diffusion that shifts generated images toward the memorable brain signature — maximizing object-selective responses (LOC↑, FFA↑) while suppressing scene-navigation responses (RSC↓, PPA↓).
DRaFT-1 + CLIP Proxy
Training uses DRaFT-1 (Direct Reward Fine-Tuning, truncated to one step), a reward-guided diffusion fine-tuning method that backpropagates the reward through only the final denoising step rather than the full sampling trajectory. The reward signal comes from the CLIP linear probes trained to approximate brain responses: clean gradients, no adversarial artifacts.
A composite reward function combines contributions from four brain regions, weighted by their memorability correlations, providing a single training signal that captures the full cortical dissociation.
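A sketch of such a composite reward, assuming the four ROIs named earlier and using their reported correlations as weights (the exact weighting scheme is an assumption):

```python
import torch

# Assumed weighting: each ROI's memorability correlation, signs from the analysis.
ROI_WEIGHTS = {"LOC": +0.05, "FFA": +0.23, "RSC": -0.49, "PPA": -0.37}

def composite_reward(proxy_out: dict[str, torch.Tensor]) -> torch.Tensor:
    """Weighted sum of per-ROI proxy activations: object ROIs up, scene ROIs down."""
    return sum(w * proxy_out[roi].mean() for roi, w in ROI_WEIGHTS.items())

# Placeholder per-ROI proxy activations for a batch of 4 generated images.
proxy_out = {roi: torch.randn(4, 100) for roi in ROI_WEIGHTS}
reward = composite_reward(proxy_out)  # scalar fed to the DRaFT-1 update
```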
This work is currently in progress — before/after generation results coming soon.
What's Next
- Video memorability decomposition: Extend the analysis to the Memento10K dataset to ask whether the same cortical dissociation predicts memorability for dynamic content.
- Domain transfer: Do the same brain signatures predict memorability for ads, product photos, and artwork — or is the effect specific to natural images?
- Behavioral validation: An MTurk study to confirm that brain-predicted memorable images are actually remembered more by human participants — closing the loop from model to behavior.
- Prescriptive tool: The end goal is a system that takes any image or prompt and produces a neurally optimized version that is more memorable by design.