## Highlights

- •Mathematical model reveals impact of mRNA-seq read depth on gene expression analysis
- •Modularity in gene expression facilitates robust transcriptional program extraction
- •Model suggests dramatic increases in sample multiplexing for many applications
- •Read depth calculator determines parameters for optimal experimental design

## Summary

## Graphical Abstract

## Introduction

^{6}reads per cell) currently requires 2 weeks of sequencing on an Illumina HiSeq 4000.

## Results

### Statistical Properties of Gene Expression Data Determine the Accuracy of Principal Component Analysis at Low Read Depth

*k*-means, spectral clustering, and Locally Linear Embedding, are naturally related to PCA or its generalization, Kernel PCA (

_{i}are the associated eigenvalues that equal the variance of the data projected onto the component (

*pc*

_{i}) and shallow $\left(\stackrel{\u02c6}{p{c}_{i}}\right)$ principal components,

where

**C**and

**Ĉ**are the covariance matrices obtained from deep and shallow mRNA-seq data, respectively. Equation 1 can be used to model the impact of shallow sequencing on any given mRNA-seq dataset. Moreover, qualitative analysis of the equation reveals the key factors that determine whether low depth profiling will accurately identify transcriptional programs. As expected, this equation indicates that the principal component error depends on generic features including read depth and sample number, as these affect the difference between the shallow and deep covariance matrices in the numerator of Equation 1 (see the Supplemental Information, section 2.1). However, Equation 1 also reveals that the principal component error depends on a system-specific property: the relative magnitude of the principal values (captured by λ

_{i}− λ

_{j}). Since the principal values correspond to the variance in the data along a principal component, this term quantifies whether the information in the gene expression data is concentrated among a few transcriptional programs. When genes covary along a small number of principal axes, the dataset has an effective low dimensionality, i.e., the data are concentrated on a low-dimensional sub-space, and transcriptional programs can be extracted even in the presence of sequencing noise.

### Mouse Tissues Can Be Distinguished at Low Depth in Bulk mRNA-Seq Samples

^{7}reads per sample) to profile gene expression of 19 different mouse tissues with a biological replicate (

_{1}is separated from other principal values by at least λ

_{1}− λ

_{2}= 5 × 10

^{−6}, more than two orders of magnitude greater than the minimum separation of λ

_{25}from other principal values (1.5 × 10

^{−8}) (Figure 2B). Therefore, the 25

^{th}principal component requires almost four million reads, 140 times more than the first principal component, to be recovered with the same 80% accuracy.

^{7}reads per sample, the first three principal components have many significant functional enrichments with the second and third principal components enriched for neural and hematopoietic processes, respectively (Figure 2C; see Figure S1C for first principal component). These functional enrichments corroborate the separation seen when the gene expression profiles from each tissue are projected onto the second and third principal components (see the Experimental Procedures). Neural tissues (cerebellum, cortex, olfactory, and embryonic day 14.5 [E14.5] brain) project along the second principal component while the hematopoietic tissues (spleen, liver, thymus, bone marrow, and E14.5 liver) project along the third principal component (Figure 2D).

^{−4}(negative predictive value and positive predictive value in Figures S1D and S1E). To put this result in perspective, using only 32,000 reads per sample (corresponding to PCA accuracies of 81%, 79%, and 75% for the first three principal components, respectively) would allow a faithful recapitulation of functional enrichments while still multiplexing thousands of samples, rather than dozens, in a single Illumina HiSeq sequencing lane. Additionally, this low number of reads was still sufficient to separate the different cell types (Figure 2D). We obtained similar results when working in FPKM units, suggesting that the broad conclusions of our analysis are insensitive to gene expression units (Figures S1F, S1G, and S1H).

### Transcriptional States in Single Cells Are Distinguishable with Less Than 1,000 Transcripts per Cell

### Gene Expression Covariance Induces Tolerance to Shallow Sequencing Noise

^{2}= 0.13; Figure S3D).

### Large-Scale Survey Reveals that Shallow mRNA-Seq Is Widely Applicable due to Gene-Gene Covariance

### The Read Depth Calculator: A Quantitative Framework for Selecting Optimal mRNA-Seq Read Depth and Number of Biological Samples

_{i+1}/λ

_{i}is small (as defined in the Supplemental Information, section 2.1), an assumption justified by our large-scale microarray survey (see Figures S5C and S5D). These assumptions enable us to provide simple guidelines for making important experimental decisions, for example, choosing read depth,

*N*:

where

*n*is the number of biological samples and κ is a constant that can be estimated from existing data (see the Supplemental Information, section 2.1 for a derivation of this equation and its limitations). This relationship can be understood intuitively. First, Equation 2 states that the principal component error decreases with read depth, a consequence of the well-known fact that the signal-to-noise ratio of a Poisson random variable is proportional to $\sqrt{N}$. The read depth also depends on λ

_{i}, which comes from the λ

_{i}− λ

_{j}term of Equation 1. Finally, the influence of the sample number

*n*on read depth follows from the definition of covariance as an average over samples. (Figure S5E shows that

*n*is approximately statistically uncorrelated with principal values across the microarray datasets.)

^{−5}(median principal value from the 226 human microarray datasets) and 100 samples where 80% PCA accuracy is tolerable requires less than 5,000 reads per experiment or less than 500,000 reads in total, occupying less than 0.125% of a single sequencing lane in the Illumina HiSeq 4000.

## Discussion

*Caenorhabditis elegans*, 40 times over, in a single 400,000,000 read lane on the Illumina HiSeq 4000. Because shallow mRNA-based screens would provide information at the level of transcriptional programs and not individual genes, complementing these experiments by careful profiling of specific genes with targeted mRNA-seq (

## Experimental Procedures

### Simulated Shallow Sequencing through Down-sampling of Reads

### Finding Genes Significantly Associated with a Principal Component

### Gene Set Enrichment Analysis

^{−4}.

## Author Contributions

## Acknowledgments

## Supplemental Information

- Document S1. Supplemental Experimental Procedures and Figures S1–S5

## References

- Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays.
*Proc. Natl. Acad. Sci. USA.*1999; 96: 6745-6750 - Singular value decomposition for genome-wide expression data processing and modeling.
*Proc. Natl. Acad. Sci. USA.*2000; 97: 10101-10106 - Learning eigenfunctions links spectral embedding and kernel PCA.
*Neural Comput.*2004; 16: 2197-2219 - Iterative signature algorithm for the analysis of large-scale gene expression data.
*Phys. Rev. E Stat. Nonlin. Soft Matter Phys.*2003; 67: 031902 - Learning biological networks: from modules to dynamics.
*Nat. Chem. Biol.*2008; 4: 658-664 - Stable signal recovery from incomplete and inaccurate measurements.
*Commun. Pure Appl. Math.*2006; 59: 1207-1223 Ding, C., and He, X. (2004). K-means clustering via principal component analysis. ICML Proceedings of the 21st International Conference on Machine Learning (ACM), p. 29.

- Compressed sensing.
*IEEE Trans. Inf. Theory.*2006; 52: 1289-1306 - Single-pixel imaging via compressive sampling.
*IEEE Signal Process. Mag.*2008; 25: 83-91 - Gene Expression Omnibus: NCBI gene expression and hybridization array data repository.
*Nucleic Acids Res.*2002; 30: 207-210 - Cluster analysis and display of genome-wide expression patterns.
*Proc. Natl. Acad. Sci. USA.*1998; 95: 14863-14868 - Expression profiling. Combinatorial labeling of single cells for gene expression cytometry.
*Science.*2015; 347: 1258367 Ham, J., Lee, D.D., Mika, S., and Schölkopf, B. (2004). A kernel view of the dimensionality reduction of manifolds. ICML Proceedings of the 21st International Conference on Machine Learning (ACM), p. 47.

- Reducing the dimensionality of data with neural networks.
*Science.*2006; 313: 504-507 - Dynamic modeling of gene expression data.
*Proc. Natl. Acad. Sci. USA.*2001; 98: 1693-1698 - Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types.
*Science.*2014; 343: 776-779 - Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells.
*Cell.*2015; 161: 1187-1201 - Exploring the shallow end; estimating information content in transcriptomics studies.
*Front. Plant Sci.*2012; 3: 213 - Deconstructing transcriptional heterogeneity in pluripotent stem cells.
*Nature.*2014; 516: 56-61 - Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets.
*Cell.*2015; 161: 1202-1214 - On spectral clustering: analysis and an algorithm.in: Advances in Neural Information Processing Systems. MIT Press, 2001: 849-856
- Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma.
*Science.*2014; 344: 1396-1401 - Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex.
*Nat. Biotechnol.*2014; 32: 1053-1058 - What is principal component analysis?.
*Nat. Biotechnol.*2008; 26: 303-304 - Nonlinear dimensionality reduction by locally linear embedding.
*Science.*2000; 290: 2323-2326 Saxe, A.M., Mcclelland, J.L., and Ganguli, S. (2013). Learning hierarchical category structure in deep neural networks. Proceedings of the 35th Annual Meeting of the Cognitive Science Society, pp. 1271–1276.

- Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data.
*Nat. Genet.*2003; 34: 166-176 - Gene expression profiling identifies molecular subtypes of gliomas.
*Oncogene.*2003; 22: 4918-4923 - Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells.
*Nature.*2013; 498 (advance online publication): 236-240 - Single-cell RNA-seq reveals dynamic paracrine control of cellular variation.
*Nature.*2014; 510: 363-369 - Principles of Quantum Mechanics.Springer Science & Business Media, 2012
- A map of the cis-regulatory sequences in the mouse genome.
*Nature.*2012; 488: 116-120 - Matrix Perturbation Theory.Academic Press, 1990
- Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.
*Proc. Natl. Acad. Sci. USA.*2005; 102: 15545-15550 - Defining cell types and states with single-cell genomics.
*Genome Res.*2015; 25: 1491-1498 - Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq.
*Nature.*2014; 509: 271-375 - Visualizing data using t-SNE.
*J. Mach. Learn. Res.*2008; 9: 85 - Analysis of human transcriptomes.
*Nat. Genet.*1999; 23: 387-388 - Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq.
*Science.*2015; 347: 1138-1142

## Article Info

### Publication History

### Identification

### Copyright

### User License

Creative Commons Attribution – NonCommercial – NoDerivs (CC BY-NC-ND 4.0) |## Permitted

### For non-commercial purposes:

- Read, print & download
- Redistribute or republish the final article
- Text & data mine
- Translate the article (private use only, not for distribution)
- Reuse portions or extracts from the article in other works

## Not Permitted

- Sell or re-use for commercial purposes
- Distribute translations or adaptations of the article

Elsevier's open access license policy

### ScienceDirect

Access this article on ScienceDirect## Related Articles

## Comments

#### Cell Press commenting guidelines

To submit a comment for a journal article, please use the space above and note the following:

- We will review submitted comments within 2 business days.
- This forum is intended for constructive dialog. Comments that are commercial or promotional in nature, pertain to specific medical cases, are not relevant to the article for which they have been submitted, or are otherwise inappropriate will not be posted.
- We recommend that commenters identify themselves with full names and affiliations.
- Comments must be in compliance with our Terms & Conditions.
- Comments will not be peer-reviewed.