Genotyping assays are a cost-effective way of measuring genotypes at hundreds of thousands of positions across the genome in many individuals. However, there are in excess of 10 million common genetic variants in human populations, and far more rare variants. This can be treated as a missing data problem, in which a dataset contains a few hundred thousand known variables and many millions of unknown variables. Genetic imputation makes probabilistic inferences about the unknown variables by matching haplotypes within the sample to haplotypes from a sequenced reference dataset.
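The haplotype-matching idea can be illustrated with a toy sketch. This is not how production imputation software works (real tools use hidden Markov models over much larger panels); it only shows the principle of weighting reference haplotypes by how well they agree with the typed sites and then averaging their alleles at the untyped sites. The function name `impute_haplotype`, the `error_rate` parameter, and the five-site reference panel are all assumptions made for illustration.

```python
def impute_haplotype(observed, reference, error_rate=0.01):
    """Toy imputation of one haplotype.

    observed:  dict mapping site index -> allele (0 or 1) at typed sites
    reference: list of reference haplotypes, each a list of 0/1 alleles
    Returns a list with P(allele == 1) for every site.
    """
    # Weight each reference haplotype by its agreement with the typed sites,
    # allowing a small (assumed) per-site mismatch probability.
    weights = []
    for ref_hap in reference:
        w = 1.0
        for site, allele in observed.items():
            w *= (1 - error_rate) if ref_hap[site] == allele else error_rate
        weights.append(w)
    total = sum(weights)

    probs = []
    for site in range(len(reference[0])):
        if site in observed:
            # Typed sites are taken as known.
            probs.append(float(observed[site]))
        else:
            # Untyped sites: weighted share of reference haplotypes carrying allele 1.
            carrying = sum(w for w, h in zip(weights, reference) if h[site] == 1)
            probs.append(carrying / total)
    return probs


# Hypothetical reference panel: 4 sequenced haplotypes, 5 variant sites.
reference_panel = [
    [0, 0, 1, 0, 1],
    [0, 1, 1, 0, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
]

# The genotyping array measured only sites 0, 2 and 4 in this sample.
typed = {0: 0, 2: 1, 4: 1}

dosages = impute_haplotype(typed, reference_panel)
print(dosages)
```

In this example the sample matches the first two reference haplotypes at every typed site. Both carry allele 0 at site 3, so that site is imputed with high confidence, whereas they disagree at site 1, so the imputed probability there is close to one half. The output is a probability rather than a hard genotype call, which reflects the probabilistic nature of the inference.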
Jointly analysing genome-wide association study (GWAS) results from multiple cohorts is key to increasing sample sizes to the point that small effects can be detected at stringent genome-wide p-value thresholds. However, different cohorts often use different genotyping platforms, which measure different sets of variants, so the data cannot simply be combined in their entirety. Genetic imputation has been crucial for enabling very large GWAS sample sizes because, if each cohort imputes to the same reference panel, they all have a common set of variants across the genome. Genetic imputation is now very effective at inferring common variants, but rare variants tend to have low imputation accuracy.
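As a rough illustration of why a shared reference panel helps, the sketch below compares the variants two cohorts have in common before and after imputation. The variant identifiers, the cohort sets, and the helper `variants_after_imputation` are hypothetical, and the sketch assumes that every panel variant is imputed successfully, which is optimistic for rare variants.

```python
# Hypothetical variant identifiers present in a shared reference panel.
reference_panel_variants = {"rs1", "rs2", "rs3", "rs4", "rs5", "rs6"}

# Two cohorts genotyped on different (hypothetical) platforms.
cohort_a_typed = {"rs1", "rs2", "rs4"}
cohort_b_typed = {"rs2", "rs3", "rs6"}

# Without imputation, only variants typed on both platforms can be meta-analysed.
shared_before = cohort_a_typed & cohort_b_typed


def variants_after_imputation(typed, panel):
    """Variants available after imputing to the panel (assumes all are imputed)."""
    return typed | panel


# After imputing both cohorts to the same panel, the overlap is the whole panel.
shared_after = (
    variants_after_imputation(cohort_a_typed, reference_panel_variants)
    & variants_after_imputation(cohort_b_typed, reference_panel_variants)
)

print(sorted(shared_before))
print(sorted(shared_after))
```

In practice, variants with poor imputation quality would be filtered out of each cohort before meta-analysis, so the real overlap is smaller than the full panel.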