
Unsupervised learning is one of the most fundamental problems of machine learning and, more generally, of artificial intelligence. In a broad sense, it amounts to learning some unobserved latent structure over data. This structure may be of interest per se, or may serve as an important stepping stone in a complex data analysis pipeline, since large amounts of unlabeled data are more common than costly labeled data. Arguably, one of the cornerstones of unsupervised learning is clustering, where the aim is to recover a partition of the data into homogeneous groups. Besides vanilla clustering, unsupervised learning encompasses a large variety of related problems, such as hierarchical clustering, where the group structure is more complex and reveals both the backbone and the fine-grained organization of the data; segmentation, where the shape of the clusters is constrained by side information; and ranking or seriation problems, where no actual cluster structure exists but there is some implicit ordering between the data. All these problems have already found countless applications, and interest in these methods keeps strengthening due to the amount of available unlabeled data. We can for instance cite crowdsourcing - where individuals answer a subset of questions and where, depending on the context, one might want to cluster them according to their field of expertise, rank them according to their performance, or seriate them according to their affinities. Such problems are extremely relevant for recommender systems - where individuals are users and questions are items - and for social network analysis.

The analysis of unsupervised learning procedures has a long history rooted in both the computer science and mathematics communities. In response to recent bridges between these two communities, groundbreaking advances have been made in the theoretical foundations of vanilla clustering. We believe that these recent advances hold the key to deep impacts on the broader field of unsupervised learning, because of the pervasive nature of clustering. In this proposal, we first aim at propagating these recent groundbreaking advances in vanilla clustering to problems where the latent structure is either more complex or more constrained. We will consider problems of increasing latent structure complexity - starting from hierarchical clustering and heading toward ranking, seriation, and segmentation - and propose new algorithms that build on each other, focusing on the interfaces between these problems. As a result, we expect to provide new methods that are valid under weaker assumptions than those usually made - e.g. parametric assumptions - while also being able to adapt to the unknown intrinsic difficulty of the problem.

Moreover, many modern unsupervised learning applications are essentially of an online nature - and sometimes decisions have to be made sequentially on top of that. For instance, consider a recommender system that sequentially recommends items to users. In this context, where sequential, active recommendations are made, it is important to leverage the latent structure underlying the individuals. While both the field of unsupervised learning and that of sequential, active learning are thriving, research at their crossroads has mostly been conducted separately by each community - leading to procedures that can be improved.
A second aim of this proposal will then be to bring together the fields of unsupervised learning and active learning, in order to propose new algorithms that are more efficient at sequentially leveraging the unknown latent structure. We will consider the same unsupervised learning problems as in the batch learning part of the proposal. We will focus on developing algorithms that fully take advantage of new advances in clustering, and of our own future work in batch learning.
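As a purely illustrative sketch of the vanilla clustering setting behind the crowdsourcing example above: the synthetic worker-by-question answer matrix, the planted expertise groups, and the use of scikit-learn's KMeans are all assumptions made for this toy example, not part of the proposal.

```python
# Toy crowdsourcing data: three latent groups of workers, each "expert"
# on its own block of questions (high accuracy there, lower elsewhere).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_workers, n_questions, n_groups = 90, 60, 3

groups = rng.integers(n_groups, size=n_workers)                     # hidden fields of expertise
blocks = np.repeat(np.arange(n_groups), n_questions // n_groups)    # question topics
# Probability of a correct answer: 0.9 on a worker's own topic, 0.55 elsewhere.
p_correct = np.where(groups[:, None] == blocks[None, :], 0.9, 0.55)
answers = (rng.random((n_workers, n_questions)) < p_correct).astype(float)

# Vanilla clustering: recover a partition of the workers into homogeneous
# groups from the answer matrix alone (number of groups assumed known here).
kmeans = KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit(answers)
print("estimated cluster sizes:", np.bincount(kmeans.labels_))
```

The same answer matrix could equally feed the ranking task (ordering workers by performance) or the seriation task (ordering them by affinities) mentioned above, which is precisely the kind of shared latent structure the proposal targets.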
Modern statistics and machine learning problems often feature data sitting in an ambient space of high dimension. Yet, methods such as random forests or deep neural networks have recently enabled remarkable performance even in such complex settings. One main reason is that the data can often be explained through a low-dimensional structure, hidden to the statistician. In such settings, Bayesian methods such as spike-and-slab variable selection priors, Bayesian additive regression trees (BART), Bayesian deep neural networks or deep Gaussian processes are routinely used by statisticians as well as in physics, astronomy and genomics applications. Among the reasons for the popularity of Bayesian algorithms, one can mention: their flexibility, in that it is relatively easy to model the unknown structure underlying the data through the prior distribution; the broad range of computational methods available, including variational approximations; and their ability to quantify uncertainty through so-called credible sets. While there are many empirical successes, there is an important need for understanding and validation of such methods. From a mathematical perspective, one would like to be able to understand and demonstrate under which conditions such algorithms effectively work. The BACKUP project aims at providing theoretical backup for such modern statistical algorithms, along three research avenues. First, new results will be obtained for high-dimensional models and latent variable settings using Bayesian posterior distributions, tackling important recent questions of multiple testing and variable selection. Second, foundational results will be obtained for complex methods such as random forests and Bayesian deep neural networks, both for posteriors and their variational approximations. Third, we will address the fundamental question of uncertainty quantification, by deriving optimal, efficient confidence sets from well-chosen Bayesian credible regions.
This project will combine the expertise of many specialists in contact and symplectic topology. Our research objectives will concentrate around two specific themes: Lefschetz fibrations and open book decompositions on the one hand, and persistent homology on the other hand. Lefschetz fibrations and open book decompositions are central tools for understanding the structure of symplectic and contact manifolds. We plan to use these notions in order to derive new constraints on the topology of Lagrangian submanifolds. We also plan to gain a better understanding of their holomorphic curve invariants and of their properties, using the very recent theory of convex hypersurfaces. Exciting new results in C^0 symplectic topology were obtained using persistent homology. We plan to use persistent homology to extract richer information from homological invariants constructed using holomorphic curves, generating families, or sheaves.
The last decades have witnessed very fast and deep developments in the field of evolution partial differential equations (and particularly dispersive equations). These major advances make accessible questions that appeared completely out of reach a few years ago and open the very exciting prospect of studying deep dynamical properties of solutions of partial differential equations. On the other hand, specialists of dynamical systems have successfully extended methods and ideas, developed for the study of finite dimensional models, to infinite dimensional ones. This has led to spectacular results concerning the long time behaviour of solutions of some non-linear partial differential equations, especially in one space dimension. The tools developed in the PDE context to handle non-linear PDEs could also lead to major breakthroughs when combined with some dynamical properties of the equations. Finally, the study of partial differential equations in the presence of randomness, a topic originating in ideas from statistical mechanics, has also recently seen spectacular results. The goal of this project is to blend together ideas from these three points of view, to gain new insights on the behaviour of solutions of partial differential equations in some asymptotic regimes: long time behaviour; existence and scattering; stability and instability of particular solutions; blow-up.
With the enormous amounts of data involved in training a machine learning system, as well as the trend to push artificial intelligence to mobile devices and embedded systems, taking computation and memory constraints into account is of primary interest from the outset when designing machine learning methods. A defining feature of modern approaches to artificial intelligence is that the different types of constraints coming from statistical efficiency (i.e. training-data efficiency) and computational efficiency (i.e. memory and compute-time efficiency) should be considered simultaneously, not separately. While thinking about efficient algorithms in the two above senses was present from the origins of the field of machine learning, joint consideration of these issues has accelerated significantly in the last decade. The overarching objective of this chair will be to combine expertise and tools from optimisation, statistics, and theoretical computer science to take into account structure hidden in the data, and to design provably reliable, statistically efficient, and computationally efficient algorithms exploiting such structure. One key aspect that will be considered is adaptivity, i.e. the automatic tuning of the statistical models or hyperparameters involved, as well as of the type and amount of computational acceleration (such as parallelization or low-dimensional representations), driven by the data itself.

For the teaching objectives, a central tenet is to expose master's students to the relevant parts of the fields mentioned above (optimisation, statistics, theoretical computer science) and to put them in a position to understand, control and ultimately shape (and not only use) the latest developments in the field of artificial intelligence. Interaction between the different fields will be emphasized, so that students' thinking and skill sets are shaped by a principle of mutual cross-overs as much as by specialized courses on technical and up-to-date topics.