Shaping Responsible Synthetic Data in the Era of Foundation Models

A Workshop at AAAI 2026

27th January 2026

Singapore Expo, Singapore

Register for Workshop via AAAI'26

RSD @ AAAI 2026

Welcome to the AAAI'26 Workshop on Shaping Responsible Synthetic Data in the Era of Foundation Models (RSD).

Foundation models (LLMs and multimodal FMs) are increasingly supplemented with synthetic and LLM-generated data to extend training corpora, fill coverage gaps, and navigate privacy and fairness requirements. At the same time, these models serve as powerful generators of synthetic data for downstream applications, such as training specialized ML models, augmenting datasets in privacy-sensitive domains, and generating test cases for system validation.

But synthetic datasets, whether used to train FMs or generated by them for other uses, introduce a plethora of risks: legal (copyright, consent), security and privacy (leakage, membership inference), ethical (bias amplification), and technical (model collapse, quality degradation).

This workshop examines how synthetic data can be responsibly generated and used to fuel, test, and govern foundation models across their lifecycle (pre-training, fine-tuning, evaluation, auditing). It also examines how FM-generated synthetic data impacts downstream applications and systems, and what technical, ethical, and regulatory guardrails are needed across this synthetic data ecosystem.

Topics of Interest (including but not limited to)

Lifecycle Uses & LLM‑Driven Generation

Synthetic data for pre-training and fine-tuning (RLHF/RLAIF), continual evaluation, and self-training or bootstrapping loops—along with their limitations.

Explainability, Interpretability & Uncertainty

Synthetic counterfactuals and narratives for debugging; quantifying uncertainty; cross-domain benchmarks spanning tabular, time-series, text, vision, and multimodal data.

Safety, Robustness & Red‑Teaming

Synthetic adversarial probes, jailbreak tests, and edge-case simulations; preventing model collapse or shortcut learning through robust real/synthetic mixes.

Fairness, Bias & Representation

Detecting and mitigating bias amplification in synthetic data; improving representation of underrepresented groups and domains; fairness-aware generation, auditing, and evaluation.

Critical Perspectives on Synthetic Data

Cross-disciplinary examinations from law, policy, ethics, and social sciences; defining what counts as synthetic data and questioning its appropriateness and societal impacts; tensions between technological possibility and human authenticity; critical assessments of synthetic data's role in shaping future AI systems.

Standards, Metrics & Tooling for Trustworthy Use

Metric suites for fidelity, utility, and privacy; responsible generation protocols (e.g., source curation, prompt filtering, DP noise, provenance/watermarking, disclosure “data cards”); validation pipelines and audit checklists; open-source vs. commercial generators.

Privacy, Security & Data Governance

Differential privacy and other PETs; leakage and membership-inference risks; consent, copyright, and provenance concerns; comparative perspectives on regulation.

Call for Papers

Important Dates

  • Submission Due Date: October 27th, 2025 AoE
  • Notification of Acceptance: November 10th, 2025 AoE
  • Workshop Date: January 27th, 2026, Singapore

Submission Instructions

  • Papers should be no more than 4 pages in length, though additional pages for references and appendices are permitted. Note that reviewers are not required to read appendices during their evaluation.
  • Submissions must be in a single PDF, formatted with the official AAAI LaTeX template, and uploaded via OpenReview.
  • Submissions are double-blind. Please ensure your paper is fully anonymized. Papers that are not anonymized or that exceed the page limit will be automatically rejected.
  • There is no rebuttal period. Final decisions are based only on the initial submission and reviewer feedback. Rejected or withdrawn papers will remain private. Reviews will not be published.

This is a non-archival workshop. While accepted papers will be accessible on the workshop website, we do not publish formal proceedings. You may submit work that is:

  • Previously presented conference/workshop papers: Significant updates are required.
  • Journal papers not previously presented at conferences/workshops: Submissions are welcome if they offer novel value for the community.
  • Dual submission to other workshops: Generally allowed.

Workshop Program

In-person location: Peridot 204, Singapore EXPO - 1 Expo Drive, Singapore 486150.

Poster sessions will be held near the room. Assigned poster numbers: WS51-WS60.

Local Time (UTC+8) Activity
09:00AM - 09:05AM Opening Remarks
09:05AM - 09:45AM Invited Talk by Xiaokui Xiao: Responsible Synthetic Data with Differential Privacy
09:45AM - 10:30AM Poster Session 1
10:30AM - 11:00AM Coffee Break
11:00AM - 11:40AM Invited Talk by Daniel Ramage: Synthetic and Federated: toward provably private inspectable data
11:40AM - 12:20PM Invited Talk by Jörg Drechsler: Synthetic Data – A Statistician’s Perspective
12:20PM - 12:30PM Spotlight Talk: Robust Tabular Foundation Models
12:30PM - 02:00PM Lunch Break
02:00PM - 02:45PM Panel Discussion with Ghim Eng Yap, Yev Meyer, Anthony Tung, and Uzair Javaid (moderated by Shlomi Hod): Responsible Synthetic Data in the Real World
02:45PM - 03:30PM Poster Session 2
03:30PM - 04:00PM Coffee Break
04:00PM - 04:40PM Invited Talk by Nitin Kohli: Enabling Humanitarian Applications with Targeted Differential Privacy
04:40PM - 05:20PM Invited Talk by Junnan Li: Evaluating and Advancing Computer-Use Agents
05:20PM - 05:30PM Spotlight Talk: MIA-EPT: Membership Inference Attack via Error Prediction for Tabular Data
05:30PM Closing Remarks

Accepted Papers

Spotlight Presentations

(Each talk, including Q&A, is 10 minutes.)

  • Robust Tabular Foundation Models
    Matthew Peroni, Franck Le, Vadim Sheinin
  • MIA-EPT: Membership Inference Attack via Error Prediction for Tabular Data
    Eyal German, Daniel Samira, Yuval Elovici, Asaf Shabtai

Morning Session

  • Beyond Surface-Level Similarity: Hierarchical Contamination Detection for Synthetic Training Data in Foundation Models
    Sushant Mehta
  • When AI Cannot Reproduce Itself: Citation Drift as a Reproducibility Failure in Scientific LLMs
    Gokul Srinath Seetha Ram
  • Equilibrium Dynamics and Mitigation of Gender Bias in Synthetically Generated Data
    Ashish Kattamuri, Arpita Vats, Rahul Raja, Harshwardhan Fartale, Akshata Kishore Moharir, Ishita Prasad
  • FedEvolve: Evolutionary Tabular Data Synthesis in Vertical Federated Learning Systems
    Adethya Srinivasan, Han Wu
  • CountSteer: Steering Attention for Object Counting in Diffusion Models
    Hyemin Boo, Hyoryung Kim, MyungJin Lee, Seunghyeon Lee, Jiyoung Lee, Jang-Hwan Choi, Hyunsoo Cho
  • Improving Synthetic Data Generation with LLMs through Strategic Comparisons
    Yao Rong, Shuo Yang, Gjergji Kasneci, Enkelejda Kasneci
  • Robust Tabular Foundation Models
    Matthew Peroni, Franck Le, Vadim Sheinin
  • RobustSora: De-Watermarked Benchmark for Robust AI-Generated Video Detection
    Zhuo Wang, Xiliang Liu, Ligang Sun
  • Evaluating Large Language Models on Rare Disease Diagnosis: A Case Study using House M.D.
    Arsh Gupta, Ajay Narayanan Sridhar, Bonam Mingole, Amulya Yadav

Afternoon Session

  • Who Benefits from Alignment? Measuring Disparate Impact in RLHF with Synthetic Populations
    Ibrahim Berber, Cerag Oguztuzun
  • Biologically-Informed Hybrid Membership Inference Attacks on Generative Genomic Models
    Asia Belfiore, Jonathan Passerat-Palmbach, Dmitrii Usynin
  • An Interpretability-Guided Framework for Responsible Synthetic Data Generation in Emotional Text
    Paula Joy B. Martinez, Jose Marie Antonio Miñoza, Sebastian C. Ibañez
  • Attribute-Aware Controlled Product Generation with LLMs for E-commerce
    Virginia Negri, Víctor Martínez, Sergio A. Balanya, Subburam Rajaram
  • Backdoors in Conditional Diffusion: Threats to Responsible Synthetic Data Pipelines
    Raz Lapid, Almog Dubin
  • "Hiding in Plain Sight": Designing Synthetic Dialog Generation for Uncovering Socially Situated Norms
    Chengfei Wu, Dan Goldwasser
  • Expert-guided Clinical Text Augmentation via Query-Based Model Collaboration
    Dongkyu Cho, Miao Zhang, Rumi Chunara
  • MIA-EPT: Membership Inference Attack via Error Prediction for Tabular Data
    Eyal German, Daniel Samira, Yuval Elovici, Asaf Shabtai

Workshop Organizers

Contact Us

Email us at aaai26-responsiblesyntheticdata@googlegroups.com