PRIVIMAGE: Differentially Private Synthetic Image Generation using Diffusion Models with Semantic-Aware Pretraining

Kecen Li*,1, Chen Gong*,2, Zhixiang Li3, Yuzhong Zhao1, Xinwen Hou1, Tianhao Wang2
1Chinese Academy of Sciences, 2University of Virginia, 3University of Bristol
*Indicates Equal Contribution
USENIX Security 2024

Abstract

Differential Privacy (DP) image data synthesis leverages DP techniques to generate synthetic images that can replace sensitive data, allowing organizations to share and utilize synthetic images without privacy concerns. Previous methods incorporate advanced generative-model techniques and pre-training on a public dataset to produce high-quality DP image data, but suffer from unstable training and massive computational resource demands. This paper proposes a novel DP image synthesis method, termed PRIVIMAGE, which meticulously selects pre-training data to enable the efficient creation of DP datasets with high fidelity and utility. PRIVIMAGE first establishes a semantic query function using a public dataset. This function then queries the semantic distribution of the sensitive dataset, guiding the selection of public data with analogous semantics for pre-training. Finally, we pre-train an image generative model on the selected data and fine-tune it on the sensitive dataset using Differentially Private Stochastic Gradient Descent (DP-SGD). PRIVIMAGE allows us to train a lightly parameterized generative model, which reduces the noise added to gradients during DP-SGD training and enhances training stability. Extensive experiments demonstrate that PRIVIMAGE uses only 1% of the public dataset for pre-training and 7.6% of the parameters of the state-of-the-art method's generative model, yet achieves superior synthetic performance and conserves more computational resources. On average, PRIVIMAGE achieves 6.8% lower FID and 13.2% higher Classification Accuracy than the state-of-the-art method.

Figure 1: An example of using the semantic query function to retrieve the semantic distribution of the sensitive dataset. We first train the semantic query function on the public dataset, then use it to obtain the semantic distribution of the sensitive dataset. To ensure privacy, we add Gaussian noise to the query results.

PRIVIMAGE

DP image synthesis aims to generate synthetic images that resemble real data while ensuring the original dataset remains private. With DP image synthesis, organizations can share and utilize synthetic images, facilitating various downstream tasks without privacy concerns. Diffusion models have demonstrated potential in DP image synthesis. Dockhorn et al. advocated training diffusion models with DP-SGD, a widely adopted method for training models that satisfy DP. Drawing inspiration from the success of pre-training and fine-tuning across many challenging computer vision tasks, Ghalebikesabi et al. proposed first pre-training diffusion models on a public dataset and then fine-tuning them on the sensitive dataset. They attained state-of-the-art (SOTA) results on datasets more intricate than those used by prior methods.
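To make the DP-SGD training mentioned above concrete, here is a minimal numpy sketch of a single DP-SGD update: each per-sample gradient is clipped to a norm bound, the clipped gradients are averaged, and calibrated Gaussian noise is added. The function name and hyperparameters are illustrative, not the paper's implementation; in practice one would use per-sample gradients from a framework such as Opacus.

```python
import numpy as np

def dp_sgd_step(params, per_sample_grads, lr=0.1, clip_norm=1.0,
                noise_multiplier=1.0, rng=None):
    """One illustrative DP-SGD update.

    1. Clip each per-sample gradient to L2 norm <= clip_norm.
    2. Average the clipped gradients.
    3. Add Gaussian noise scaled by noise_multiplier * clip_norm / batch_size.
    """
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    avg = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(per_sample_grads),
                       size=avg.shape)
    return params - lr * (avg + noise)
```

The clipping bound caps each example's influence on the update, which is what makes the Gaussian noise sufficient for a DP guarantee; a smaller model means fewer gradient coordinates receiving noise, which is the intuition behind PRIVIMAGE's lightly parameterized generator.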

We highlight that the dataset with a semantic distribution similar to the sensitive dataset is more suitable for pre-training. Building on this observation, we present PRIVIMAGE, an end-to-end solution to meticulously and privately select a small subset of the public dataset whose semantic distribution aligns with the sensitive one, and train a DP generative model that significantly outperforms SOTA solutions.

PRIVIMAGE consists of three steps. As shown in Figure 1, we first derive a foundational semantic query function from the public dataset. This function could be an image captioning method or a straightforward image classifier. In our experiments, we implement it as a CNN classifier trained with Cross-Entropy loss. Secondly, PRIVIMAGE uses the semantic query function to extract the semantics of each sensitive image. The frequencies of these extracted semantics form a semantic distribution, which is used to select data from the public dataset for pre-training. To make this query satisfy DP, we add Gaussian noise to the queried semantic distribution. Finally, we pre-train image generative models on the selected dataset and fine-tune the pre-trained models on the sensitive dataset with DP-SGD.
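The second step above can be sketched as a noisy histogram query (the Gaussian mechanism) followed by selecting the public classes with the largest noisy counts. The function names, the top-k selection rule, and the noise scale `sigma` are illustrative assumptions; in the actual method, `sigma` would be calibrated to the query's sensitivity and the privacy budget.

```python
import numpy as np

def noisy_semantic_histogram(predicted_labels, num_classes, sigma=1.0, rng=None):
    """Count how often the semantic query function maps sensitive images to
    each public-dataset class, then add Gaussian noise (Gaussian mechanism)."""
    rng = rng or np.random.default_rng(0)
    counts = np.bincount(predicted_labels, minlength=num_classes).astype(float)
    return counts + rng.normal(0.0, sigma, size=num_classes)

def select_pretraining_classes(noisy_counts, k):
    """Keep the k public classes with the largest noisy counts; public images
    from these classes form the (small) pre-training subset."""
    top_k = np.argsort(noisy_counts)[::-1][:k]
    return sorted(top_k.tolist())
```

Because only the noisy histogram touches the sensitive data, the subsequent selection of public images is post-processing and consumes no extra privacy budget.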

Experimental Results

We extensively validate PRIVIMAGE on several popular DP benchmark datasets, namely, (conditional) CIFAR-10 and (unconditional) CelebA (downsampled to 32x32 and 64x64 resolutions, respectively). We measure sample quality via FID. On CIFAR-10, we also assess the utility of class-labeled generated data by training classifiers on synthesized samples and computing class prediction accuracy (CA) on real test data. As is standard practice, we consider logistic regression (LR), MLP, and CNN classifiers. Table 1 gives the quantitative results, and PRIVIMAGE surpasses all the baselines. Figures 2, 3, and 4 show synthetic images from different methods. See the paper for details.
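The CA evaluation protocol described above (train on synthetic, test on real) can be sketched with a toy stand-in classifier. This nearest-centroid model is only a placeholder for the LR/MLP/CNN classifiers used in the paper, and all names here are illustrative.

```python
import numpy as np

def downstream_accuracy(syn_x, syn_y, real_x, real_y):
    """Train a nearest-centroid classifier on synthetic data and report
    accuracy on real test data (stand-in for the paper's LR/MLP/CNN)."""
    classes = np.unique(syn_y)
    # One centroid per class, computed from the synthetic samples only.
    centroids = np.stack([syn_x[syn_y == c].mean(axis=0) for c in classes])
    # Assign each real test point to its nearest synthetic centroid.
    dists = np.linalg.norm(real_x[:, None, :] - centroids[None, :, :], axis=2)
    preds = classes[np.argmin(dists, axis=1)]
    return float(np.mean(preds == real_y))
```

The key point of the protocol is that the real data appear only at test time: a higher CA means the synthetic images preserved enough class-discriminative structure to train a useful classifier.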

Table 1: FID and CA of PRIVIMAGE and four baselines on CIFAR-10, CelebA32, and CelebA64 with ε = 10, 5, 1. Due to space limitations, CeA32 and CeA64 refer to CelebA32 and CelebA64, respectively. The best performance in each column is highlighted in bold.

Figure 2: Examples of Synthetic CIFAR-10 images with ε = 10.

Figure 3: Examples of Synthetic CelebA32 and CelebA64 images with ε = 10.

Figure 4: Examples of Synthetic Camelyon17 images with ε = 10.

BibTeX

@article{li2023privimage,
  title={PRIVIMAGE: Differentially Private Synthetic Image Generation using Diffusion Models with Semantic-Aware Pretraining},
  author={Kecen Li and Chen Gong and Zhixiang Li and Yuzhong Zhao and Xinwen Hou and Tianhao Wang},
  journal={arXiv preprint arXiv:2307.09756},
  year={2023}
}