Learning a logistic model from aggregated data


Alexandre Gilotte and David Rohde

Recent moves towards a privacy-aware internet have increased the need for methods that learn from aggregated data, where both the labels and the features are observed only through aggregated queries.
Aggregated data can easily be made differentially private by adding noise, and can then be shared with a third party with the guarantee that no personal information is leaked. However, it is not clear how a third party should use such data to learn a model, as classical supervised learning methods do not apply. In this paper we explain how existing methods for Markov random fields can be applied to fit a model of the joint distribution of labels and features from aggregated data (exploiting the presence of sufficient statistics), and how the conditional distribution of the fitted model can be used to predict the labels. We then show how to modify the training objective to improve the quality of the learned conditional distribution. Finally, we show experimentally on a public online advertising dataset that our method can perform close to a logistic regression trained with full access to the disaggregated dataset.
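The idea of fitting a joint model from aggregated counts and then predicting with its conditional can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy data, the choice of statistics (per-feature counts, label count, and feature-label co-occurrence counts), the Laplace noise scale, the plain gradient-ascent fit, and all names such as `Phi`, `target`, and `predict_proba`. The number of records `n` is assumed to be known.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: d binary features and a binary label.
n, d = 5000, 3
X = rng.integers(0, 2, size=(n, d))
true_w = np.array([1.5, -1.0, 0.5])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_w - 0.3)))).astype(int)

# Aggregated queries: counts of x_j, of y, and of x_j * y.
# These are the only quantities the learner below gets to see.
exact = np.concatenate([X.sum(0), [y.sum()], (X * y[:, None]).sum(0)]).astype(float)

# Laplace noise on the counts; the scale is illustrative only and is
# not a calibrated privacy guarantee.
epsilon = 1.0
noisy = exact + rng.laplace(scale=(2 * d + 1) / epsilon, size=exact.shape)
target = noisy / n  # noisy empirical mean of the sufficient statistics

# Enumerate all 2^(d+1) joint states (x_1..x_d, y) and their statistics.
states = np.array(list(itertools.product([0, 1], repeat=d + 1)))
xs, ys = states[:, :d], states[:, d:]
Phi = np.hstack([xs, ys, xs * ys]).astype(float)  # shape (2^(d+1), 2d+1)

# Log-linear joint model p(x, y) proportional to exp(theta . phi(x, y)).
# The log-likelihood gradient is target - E_theta[phi], so fitting
# needs only the aggregated statistics (moment matching).
theta = np.zeros(Phi.shape[1])
for _ in range(5000):
    logits = Phi @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    theta += 0.5 * (target - probs @ Phi)

def predict_proba(x):
    """p(y=1 | x) under the fitted joint model; it has a logistic form."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-(theta[d] + x @ theta[d + 1:])))

print(predict_proba([1, 0, 1]))
```

Because this toy model contains no interactions between features, its conditional distribution is only an approximation of the logistic model that generated the data. The sketch is meant to show the mechanics of learning from sufficient statistics, not to reproduce the quality of the method described above.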