Nexus Trimester - 2016 - Inference Problems Theme

Date(s) 28/04/2024

When your big data seems too small: accurate inferences beyond the empirical distribution 2/2

By Gregory Valiant

We discuss three problems related to the general challenge of making accurate inferences about a complex distribution in the regime in which the amount of data (i.e., the sample size) is too small for the empirical distribution of the samples to be an accurate representation of the underlying distribution.

The first problem we consider is the following basic learning task: given independent draws from an unknown distribution over a discrete support, output an approximation of the distribution that is as accurate as possible in L1 distance (i.e., total variation distance). Perhaps surprisingly, it is often possible to “de-noise” the empirical distribution of the samples to return an approximation of the true distribution that is significantly more accurate than the empirical distribution, without relying on any prior assumptions on the distribution. We present an instance-optimal learning algorithm which optimally performs this de-noising for every distribution for which such a de-noising is possible. One curious implication of our techniques is an algorithm for accurately estimating the number of new domain elements that would be seen given a new, larger sample of size up to n*log n. (Extrapolation beyond this sample size is provably information-theoretically impossible without additional assumptions on the distribution.) While these results are applicable generally, we highlight an adaptation of this general approach to some problems in genomics (e.g., quantifying the number of unobserved protein-coding variants).

The second problem we consider is the task of accurately estimating the eigenvalues of the covariance matrix of a (high-dimensional, real-valued) distribution, the “population spectrum”. (These eigenvalues contain basic information about the distribution, including the presence or lack of low-dimensional structure in the distribution and the applicability of many higher-level machine learning and multivariate statistical tools.) As we show, even in the regime where the sample size is linear or sublinear in the dimensionality of the distribution, and hence the eigenvalues and eigenvectors of the empirical covariance matrix are misleading, accurate approximations to the true population spectrum are possible.

The final problem we discuss is the problem of recovering a low-rank approximation to a matrix of probabilities P, given access to an observed matrix of “counts” obtained via independent samples from the distribution defined by P. This problem can be viewed as a generalization of “community detection”, and is relevant to several recent machine learning efforts, including the work on constructing “word embeddings”.

This talk is based on four papers, which are joint work with Paul Valiant, with Paul Valiant and James Zou, with Weihao Kong, and with Qingqing Huang, Sham Kakade, and Weihao Kong.
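
To make the first problem concrete, the sketch below shows how the empirical distribution misses probability mass when the sample is small, and how simple statistics of the sample can still say something about the unseen part. It does not implement the instance-optimal algorithm from the talk. It uses the classical Good-Turing and Good-Toulmin estimators as stand-ins, and the Good-Toulmin estimate is only reliable for extrapolation to a new sample of at most the original size (t <= 1), not the n*log n range mentioned above. The uniform test distribution, the function names, and the parameters `support_size`, `n`, and `seed` are assumptions chosen for illustration.

```python
import random
from collections import Counter


def fingerprint(sample):
    """Return a Counter mapping i -> number of distinct elements seen exactly i times."""
    return Counter(Counter(sample).values())


def good_turing_missing_mass(sample):
    """Classical Good-Turing estimate of the total probability of unseen elements: F_1 / n."""
    return fingerprint(sample)[1] / len(sample)


def good_toulmin_new_elements(sample, t=1.0):
    """Classical Good-Toulmin estimate of how many new distinct elements
    would appear in t * n further draws. Only reliable for t <= 1."""
    return -sum((-t) ** i * f_i for i, f_i in fingerprint(sample).items())


def simulate(support_size=2000, n=1000, seed=0):
    """Draw two samples of size n from a uniform distribution (an assumption
    made for illustration) and compare the estimates with the truth."""
    rng = random.Random(seed)
    first = [rng.randrange(support_size) for _ in range(n)]
    second = [rng.randrange(support_size) for _ in range(n)]
    seen = set(first)
    return {
        # The empirical distribution assigns zero mass to every unseen element;
        # under the uniform distribution the true unseen mass is this fraction.
        "missing_mass_true": 1 - len(seen) / support_size,
        "missing_mass_estimate": good_turing_missing_mass(first),
        # New distinct elements actually observed in an equally large second sample.
        "new_elements_actual": len(set(second) - seen),
        "new_elements_estimate": good_toulmin_new_elements(first, t=1.0),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

With a sample half the size of the support, most of the probability mass lies on elements the empirical distribution has never seen, yet the count of elements seen exactly once already gives a usable estimate of that mass.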

Information about the video

Date of recording 14/03/2016
Date of publication 08/04/2016
Institution IHP
Format MP4

Domain(s)

Computer Science

All the collection videos

47:29

published on March 28, 2016

Sketching and Embeddings 1/2

By Alex Andoni

01:02:28

published on March 28, 2016

Sketching and Embeddings 2/2

By Alex Andoni

40:14

published on March 28, 2016

Dynamics of Randomized Row-Action Methods for High-Dimensional Estimation

By Yue Lu

56:03

published on March 28, 2016

Sampling from log-concave non-smooth densities or when Moreau meets Langevin

By Eric Moulines

43:00

published on March 28, 2016

Testing properties of distributions over big domains: An introduction

By Ronitt Rubinfeld

48:15

published on March 28, 2016

Testing properties of distributions over big domains: information theoretic quantities

By Ronitt Rubinfeld

47:05

published on March 28, 2016

Convex Programming in Small Space

By Sudipto Guha

50:07

published on March 28, 2016

New Algorithms for Heavy Hitters in Data Streams 1/2

By David Woodruff

49:51

published on March 28, 2016

New Algorithms for Heavy Hitters in Data Streams 2/2

By David Woodruff

33:38

published on March 28, 2016

Challenges in Streaming XML

By Christian Konrad

50:53

published on March 28, 2016

Approximating matchings in sublinear space

By Michael Kapralov

42:31

published on March 28, 2016

Communication Complexity of Learning Discrete Distributions

By Krzysztof Onak

40:56

published on March 28, 2016

Data Reduction for Clustering on Streams

By Harry Lang

49:14

published on March 28, 2016

Stream, sketching and Big Data – 1/2

By Graham Cormode

46:40

published on March 28, 2016

Stream, sketching and Big Data – 2/2

By Graham Cormode

42:02

published on March 28, 2016

Testing Cluster Structure of Graphs

By Christian Sohler

45:26

published on March 28, 2016

Nonasymptotic guarantees for sampling from a log-concave density

By Arnak Dalalyan

43:15

published on March 28, 2016

Communication Complexity with a more-than-usual emphasis on Upper Bounds. Part 1: Setting the stage

By Amit Chakrabarti

56:17

published on March 28, 2016

Communication Complexity with a more-than-usual emphasis on Upper Bounds. Part 2: The Information Complexity Paradigm

By Amit Chakrabarti

46:15

published on March 28, 2016

Streaming sums and symmetric norms

By Stephen Chestnut

01:20:39

published on April 8, 2016

Analysis of double covers of factor graphs

By Pascal Vontobel

56:46

published on April 8, 2016

Spatial coupling as a proof technique

By Nicolas Macris

38:53

published on April 8, 2016

Testing temporal causality and estimating directed information

By Ioannis Kontoyiannis

52:44

published on April 8, 2016

An Optimal Affine Invariant Smooth Minimization Algorithm

By Alexandre d'Aspremont

49:26

published on April 8, 2016

The Area Theorem and Capacity-Achieving Codes – 1/2

By Ruediger Urbanke

38:36

published on April 8, 2016

The Area Theorem and Capacity-Achieving Codes – 2/2

By Ruediger Urbanke

58:04

published on April 8, 2016

Near-optimal message-passing algorithms for crowdsourcing

By Sewoong Oh

41:57

published on April 8, 2016

Efficient Monte Carlo Methods for the Potts Model at Low Temperature

By Mehdi Molkaraie

50:24

published on April 8, 2016

(Arguably) Hard on Average Optimization Problems and the Overlap Gap Property

By David Gamarnik

54:23

published on April 8, 2016

Factor Graphs, Belief Propagation, and Density Evolution – 1/2

By Henry Pfister

53:21

published on April 8, 2016

Factor Graphs, Belief Propagation, and Density Evolution – 2/2

By Henry Pfister

46:52

published on April 8, 2016

Group-testing: Together we are one

By Sidharth Jaggi

51:24

published on April 8, 2016

Iteratively-Decoded Erasure-Correcting Coding for Distributed Storage

By Iryna Andriyanova

48:24

published on April 8, 2016

Understanding the MMSE of compressed sensing one measurement at a time

By Galen Reeves

01:02:39

published on April 8, 2016

New Algorithmic Techniques for Massive graphs – 1/2

By Andrew McGregor

39:25

published on April 8, 2016

New Algorithmic Techniques for Massive graphs – 2/2

By Andrew McGregor

47:33

published on April 8, 2016

Density estimation via piecewise polynomial approximation in sample near-linear time

By Ilias Diakonikolas

56:01

published on April 8, 2016

When your big data seems too small: accurate inferences beyond the empirical distribution 1/2

By Gregory Valiant

56:17

published on April 8, 2016

When your big data seems too small: accurate inferences beyond the empirical distribution 2/2

By Gregory Valiant

46:18

published on April 8, 2016

How to estimate the mean of a random variable? - Part 1

By Gábor Lugosi

39:48

published on April 8, 2016

How to estimate the mean of a random variable? - Part 2

By Gábor Lugosi

Copyright Carmin.tv 2024
