A Workshop at AAAI 2026
27th January 2026
Singapore Expo, Singapore
Welcome to the AAAI'26 Workshop on Shaping Responsible Synthetic Data in the Era of Foundation Models (RSD).
Foundation models (LLMs and multimodal FMs) are increasingly supplemented with synthetic and LLM-generated data: to extend training corpora, fill coverage gaps, and meet privacy and fairness requirements. At the same time, these models serve as powerful generators of synthetic data for downstream applications, such as training specialized ML models, augmenting datasets in privacy-sensitive domains, and generating test cases for system validation.
But synthetic datasets, whether used to train FMs or generated by them for other uses, introduce a range of risks: legal (copyright, consent), security- and privacy-related (leakage, membership inference), ethical (bias amplification), and technical (model collapse, quality degradation).
This workshop examines how synthetic data can be responsibly generated and used to fuel, test, and govern foundation models across their lifecycle (pre-training, fine-tuning, evaluation, auditing), as well as how FM-generated synthetic data impacts downstream applications and systems, and what technical, ethical, and regulatory guardrails are needed across this synthetic data ecosystem.
Topics of interest include:
Synthetic data for pre-training and fine-tuning (RLHF/RLAIF), continual evaluation, and self-training or bootstrapping loops, along with their limitations.
Synthetic counterfactuals and narratives for debugging; quantifying uncertainty; cross-domain benchmarks spanning tabular, time-series, text, vision, and multimodal data.
Synthetic adversarial probes, jailbreak tests, and edge-case simulations; preventing model collapse or shortcut learning through robust real/synthetic mixes.
Cross-disciplinary examinations from law, policy, ethics, and the social sciences; defining what synthetic data is, and questioning its appropriateness and societal impacts; tensions between technological possibility and human authenticity; critical assessments of synthetic data's role in shaping future AI systems.
Metric suites for fidelity, utility, and privacy; responsible generation protocols (e.g., source curation, prompt filtering, DP noise, provenance/watermarking, disclosure “data cards”); validation pipelines and audit checklists; open-source vs. commercial generators.
Differential privacy and other PETs; leakage and membership-inference risks; consent, copyright, and provenance concerns; comparative perspectives on regulation.
This is a non-archival workshop. While accepted papers will be accessible on the workshop website, we do not publish formal proceedings. You may submit work that is:
We are looking for reviewers for the workshop. If you would like to volunteer as a reviewer, please fill out the Call for Reviewers Google Form.
Coming soon
Email us at aaai26-responsiblesyntheticdata@googlegroups.com