What is RNA-seq data
The authors want to thank all data contributing teams for providing the unique SEQC benchmark datasets, and Ms. Ying Sha at Georgia Institute of Technology for insightful feedback.
Correspondence to May D.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Tong, L. Impact of RNA-seq data analysis algorithms on gene expression estimation and downstream prediction. Sci Rep 10.
Received: 22 December. Accepted: 27 August. Published: 21 October.
Abstract
To use next-generation sequencing technology such as RNA-seq for medical and health applications, choosing proper analysis methods for biomarker identification remains a critical challenge for most users.
Introduction
The first phase of the FDA-led microarray quality control project (MAQC-I) investigated the reliability of microarray platforms for gene expression estimation [1].
These datasets were used to investigate the joint impact of pipeline components on downstream gene expression-based prediction in a two-phase study. (1) In Phase 1, we developed three metrics (accuracy, precision, and reliability) for assessing the performance of a representative set of RNA-seq pipelines (Fig.).
The functional annotation of these long non-coding RNAs is more challenging, as their conservation is often less pronounced than that of protein-coding genes. These resources can be used for similarity-based annotation of short non-coding RNAs, but no standard functional annotation procedures are yet available for other RNA types such as the long non-coding RNAs.
The integration of RNA-seq data with other types of genome-wide data Fig.
Integrative analyses that incorporate RNA-seq data as the primary gene expression readout, which is then compared with other genomic experiments, are becoming increasingly prevalent. Below, we discuss some of the additional challenges posed by such analyses.
These associations can unravel the genetic basis of complex traits such as height, disease susceptibility or even features of genome architecture. Large expression quantitative trait loci (eQTL) studies have shown that genetic variation affects the expression of most genes. First, RNA-seq can identify variants that affect transcript processing. Second, reads that overlap heterozygous SNPs can be mapped to maternal and paternal chromosomes, enabling quantification of allele-specific expression within an individual. Allele-specific signals provide additional information about a genetic effect on transcription, and a number of computational methods have recently become available that leverage these signals to boost power for association mapping. One challenge of this approach is the computational burden, as billions of gene-SNP associations need to be tested; bootstrapping or permutation-based approaches are frequently used. Many studies have focused on testing only SNPs in the cis region surrounding the gene in question, and computationally efficient approaches have been developed recently to allow extremely swift mapping of eQTLs genome-wide. Moreover, the combination of RNA-seq and re-sequencing can be used both to remove false positives when inferring fusion genes [88] and to analyze copy number alterations.
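The permutation-based testing of gene-SNP associations mentioned above can be sketched for a single pair. The following minimal Python example assumes genotypes coded as 0/1/2 allele counts and uses the absolute Pearson correlation as the test statistic. The function names (`pearson`, `permutation_pvalue`) and the toy data are hypothetical, and published eQTL tools use more elaborate models with covariates.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a constant vector
    return sxy / (sxx * syy) ** 0.5


def permutation_pvalue(genotypes, expression, n_perm=1000, seed=0):
    """Empirical p-value: how often does shuffled expression correlate
    with genotype at least as strongly as the observed data?"""
    rng = random.Random(seed)
    observed = abs(pearson(genotypes, expression))
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-expression pairing
        if abs(pearson(genotypes, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy data (invented): expression rises with allele count.
genotypes = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
expression = [5.1, 4.8, 5.3, 5.0, 6.9, 7.2, 7.0, 6.8, 9.1, 8.8, 9.3, 9.0]
p_value = permutation_pvalue(genotypes, expression)
print(f"empirical p-value: {p_value:.4f}")
```

Because every gene-SNP pair needs its own set of permutations, the cost grows quickly with the number of pairs, which is one reason for restricting tests to the cis region.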
Pairwise integration of DNA methylation and RNA-seq data has, for the most part, consisted of analyzing the correlation between differentially expressed genes (DEGs) and methylation patterns. General linear models, logistic regression models and empirical Bayes models have been attempted, among other modeling approaches. The statistically significant correlations that were observed, however, accounted for relatively small effects. An interesting shift away from focusing on individual gene-CpG methylation correlations is to use a network-interaction-based approach to analyze RNA-seq in relation to DNA methylation. This approach identifies one or more sets of genes (also called modules) that have coordinated differential expression and differential methylation.
The combination of RNA-seq and transcription factor (TF) chromatin immunoprecipitation sequencing (ChIP-seq) data can be used to remove false positives in ChIP-seq analysis and to suggest the activating or repressive effect of a TF on its target genes. In addition, ChIP-seq experiments involving histone modifications have been used to understand the general role of these epigenomic changes on gene expression.
Integration of open chromatin data, such as that from FAIRE-seq and DNase-seq, with RNA-seq has mostly been limited to verifying the expression status of genes that overlap a region of interest. DNase-seq can be used for genome-wide footprinting of DNA-binding factors, and this, in combination with the actual expression of genes, can be used to infer active transcriptional networks.
This analysis is challenging, however, because of the very noisy nature of miRNA target predictions, which hampers analyses based on correlations between miRNAs and their target genes. Associations might be found in databases such as mirWalk and miRBase, which offer target prediction according to various algorithms.
Nevertheless, pairwise integration of proteomics and RNA-seq can be used to identify novel isoforms. Unreported peptides can be predicted from RNA-seq data and then used to complement databases normally queried in mass spectrometry, as done by Low et al. Furthermore, post-translational editing events may be identified if peptides that are present in the mass spectrometry analysis are absent from the expressed genes of the RNA-seq dataset.
Integration of transcriptomics with metabolomics data has been used to identify pathways that are regulated at both the gene expression and the metabolite level, and tools are available that visualize results within the pathway context (MassTRIX, Paintomics, VANTED v2, and SteinerNet).
Integration of more than two genomic data types is still in its infancy and not yet extensively applied to functional sequencing techniques, but there are already some tools that combine several data types. Paintomics can integrate any type of functional genomics data into pathway analysis, provided that the features can be mapped onto genes or metabolites. In all cases, integration of different datasets is rarely straightforward because each data type is analyzed separately with its own tailored algorithms that yield results in different formats. Tools that facilitate format conversions and the extraction of relevant results can help; examples of such workflow construction software packages include Anduril, Galaxy and Chipster. Anduril was developed for building complex pipelines with large datasets that require automated parallelization. The strength of Galaxy and Chipster is their usability; visualization is a key component of their design.
Simultaneous or integrative visualization of the data in a genome browser is extremely useful for both data exploration and interpretation of results. Browsers can display in tandem mappings from most next-generation sequencing technologies, while adding custom tracks such as gene annotation, nucleotide variation or ENCODE datasets.
For proteomics integration, the PG Nexus pipeline converts mass spectrometry data to mappings that are co-visualized with RNA-seq alignments.
RNA-seq has become the standard method for transcriptome analysis, but the technology and tools are continuing to evolve. It should be noted that the agreement between results obtained from different tools is still unsatisfactory and that results are affected by parameter settings, especially for genes that are expressed at low levels. The two major highlights in the current application of RNA-seq are the construction of transcriptomes from small amounts of starting materials and better transcript identification from longer reads. The state of the art in both of these areas is changing rapidly, but we will briefly outline what can be done now and what can be expected in the near future.
Newer protocols such as Smart-seq and Smart-seq2 have enabled us to work from very small amounts of starting mRNA that, with proper amplification, can be obtained from just a single cell. The resulting single-cell libraries enable the identification of new, uncharacterized cell types in tissues. They also make it possible to measure a fascinating phenomenon in molecular biology, the stochasticity of gene expression in otherwise identical cells within a defined population. In this context, single-cell studies are meaningful only when a set of individual cell libraries is compared with the cell population, with the aim of identifying subgroups of multiple cells with distinct combinations of expressed genes. Differences may be due to naturally occurring factors such as stage of the cell cycle, or may reflect rare cell types such as cancer stem cells.
Recent rapid progress in methodologies for single-cell preparation, including the availability of single-cell platforms such as the Fluidigm C1 [8], has increased the number of individual cells analyzed from a handful to 50 to 90 per condition up to cells at a time. Other methods, such as DROP-seq, can profile more than 10, cells at a time. This increased number of single-cell libraries in each experiment directly allows for the identification of smaller subgroups within the population.
The small amount of starting material and the PCR amplification limit the depth to which single-cell libraries can be sequenced productively, often to less than a million reads. Deeper sequencing for scRNA-seq will do little to improve quantification, as the number of individual mRNA molecules in a cell is small (in the order of –, transcripts) and only a fraction of them are successfully reverse-transcribed to cDNA [8]. Deeper sequencing is, however, potentially useful for discovering and measuring allele-specific expression, as additional reads could provide useful evidence.
Single-cell transcriptomes typically include about – expressed genes, which is far fewer than are counted in the transcriptomes of the corresponding pooled populations. The inclusion of added reference transcripts and the use of unique molecular identifiers (UMIs) have been applied to overcome amplification bias and to improve gene quantification. Methods that can quantify gene-level technical variation allow us to focus on biological variation that is likely to be of interest.
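How UMIs counteract amplification bias can be illustrated with a small sketch: reads that share the same cell, gene and UMI are treated as PCR copies of one original molecule and are counted once. The record layout (cell, gene, UMI tuples) and the function name `count_umis` are assumptions made for illustration, and the sketch ignores sequencing errors within UMIs, which dedicated tools correct for.

```python
from collections import defaultdict


def count_umis(reads):
    """Return the number of distinct UMIs observed per (cell, gene) pair.

    `reads` is an iterable of (cell, gene, umi) tuples (hypothetical layout).
    """
    seen = defaultdict(set)
    for cell, gene, umi in reads:
        seen[(cell, gene)].add(umi)  # duplicates collapse inside the set
    return {key: len(umis) for key, umis in seen.items()}


# Toy reads (invented for illustration).
reads = [
    ("cell1", "GAPDH", "AACG"),
    ("cell1", "GAPDH", "AACG"),  # PCR duplicate of the read above
    ("cell1", "GAPDH", "TTGA"),
    ("cell1", "ACTB", "CCGT"),
    ("cell2", "GAPDH", "AACG"),  # same UMI but a different cell
]
umi_counts = count_umis(reads)
for (cell, gene), n in sorted(umi_counts.items()):
    print(cell, gene, n)
```

Counting molecules rather than reads means that a transcript amplified many times contributes no more than one that was amplified once.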
Typical quality-control steps involve setting aside libraries that contain few reads, libraries that have a low mapping rate, and libraries that have zero expression levels for housekeeping genes, such as GAPDH and ACTB, that are expected to be expressed at a detectable level.
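The three quality-control criteria above can be expressed as a simple filter. This is a minimal sketch: the dictionary layout, the function name `passes_qc` and the thresholds (50,000 reads, 50% mapping rate) are assumptions chosen for illustration, not recommended values, and appropriate cut-offs depend on the protocol.

```python
def passes_qc(lib, min_reads=50_000, min_mapping_rate=0.5,
              housekeeping=("GAPDH", "ACTB")):
    """Return True if a single-cell library passes all three checks.

    `lib` is a dict with keys "total_reads", "mapped_reads" and "counts"
    (gene -> read count). Thresholds are illustrative assumptions.
    """
    total = lib["total_reads"]
    if total == 0 or total < min_reads:
        return False  # too few reads
    if lib["mapped_reads"] / total < min_mapping_rate:
        return False  # low mapping rate
    # Housekeeping genes are expected to be detectable in every cell.
    return all(lib["counts"].get(gene, 0) > 0 for gene in housekeeping)


# Toy libraries (invented for illustration).
libraries = {
    "cellA": {"total_reads": 400_000, "mapped_reads": 320_000,
              "counts": {"GAPDH": 120, "ACTB": 85}},
    "cellB": {"total_reads": 12_000, "mapped_reads": 10_000,
              "counts": {"GAPDH": 3, "ACTB": 2}},
    "cellC": {"total_reads": 350_000, "mapped_reads": 90_000,
              "counts": {"GAPDH": 60, "ACTB": 44}},
    "cellD": {"total_reads": 500_000, "mapped_reads": 450_000,
              "counts": {"GAPDH": 0, "ACTB": 40}},
}
kept = [name for name, lib in libraries.items() if passes_qc(lib)]
print(kept)
```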
Depending on the chosen single-cell protocol and the aims of the experiment, different bulk RNA-seq pipelines and tools can be used for different stages of the analysis, as reviewed by Stegle et al. Single-cell libraries are typically analyzed by mapping to a reference transcriptome (using a program such as RSEM) without any attempt at new transcript discovery, although at least one package maps to the genome (Monocle). While mapping onto the genome does result in a higher overall read-mapping rate, studies that are focused on gene expression alone with fewer reads per cell tend to use mapping to the reference transcriptome for the sake of simplicity.
Other single-cell methods have been developed to measure single-cell DNA methylation and single-cell open chromatin using ATAC-seq.
At present, we can measure only one functional genomic data type at a time in the same single cell, but we can expect that in the near future we will be able to recover the transcriptome of a single cell simultaneously with additional functional data.
The major limitation of short-read RNA-seq is the difficulty in accurately reconstructing expressed full-length transcripts from the assembly of reads. Long-read technologies, such as Pacific Biosciences (PacBio) SMRT and Oxford Nanopore, that were initially applied to genome sequencing are now being used for transcriptomics and have the potential to overcome this assembly problem. Long-read sequencing provides amplification-free, single-molecule sequencing of cDNAs that enables recovery of full-length transcripts without the need for an assembly step. PacBio adds adapters to the cDNA molecule and creates a circularized structure that can be sequenced with multiple passes within one single long read. As one barcode corresponds to a limited number of molecules, assembly is greatly simplified and unambiguous reconstruction to long contigs is possible. This approach has recently been published for RNA-seq analysis.
PacBio RNA-seq is the long-read approach with the most publications to date. The technology has proven useful for unraveling isoform diversity at complex loci, and for determining allele-specific expression from single reads. Nevertheless, long-read sequencing has its own set of limitations, such as a still high error rate that limits de novo transcript identification and forces the technology to leverage the reference genome. Moreover, the relatively low throughput of SMRT cells hampers the quantification of transcript expression. These two limitations can be addressed by matching PacBio experiments with regular, short-read RNA-seq. The accurate and abundant Illumina reads can be used both to correct long-read sequencing errors and to quantify transcript levels. Updates in PacBio chemistry are increasing sequencing lengths to produce reads with a sufficient number of passes over the cDNA molecule to autocorrect sequencing errors. This will eventually improve sequencing accuracy and allow for genome-free determination of isoform-resolved transcriptomes.
Three factors determine the number of replicates required in an RNA-seq experiment. The first factor is the variability in the measurements, which is influenced by the technical noise and the biological variation. While reproducibility in RNA-seq is usually high at the level of sequencing [1, 45], other steps such as RNA extraction and library preparation are noisier and may introduce biases in the data that can be minimized by adopting good experimental procedures (Box 2). Biological variability is particular to each experimental system and is harder to control. Nevertheless, biological replication is required if inference on the population is to be made, with three replicates being the minimum for any inferential analysis.
For a proper statistical power analysis, estimates of the within-group variance and gene expression levels are required. This information is typically not available beforehand but can be obtained from similar experiments. The exact power will depend on the method used for differential expression analysis, and software packages exist that provide a theoretical estimate of power over a range of variables, given the within-group variance of the samples, which is intrinsic to the experiment. Table 1 shows an example of statistical power calculations over a range of fold-changes (or effect sizes) and number of replicates in a human blood RNA-seq sample sequenced at 30 million mapped reads. It should be noted that these estimates apply to the average gene expression level, but as dynamic ranges in RNA-seq data are large, the probability that highly expressed genes will be detected as differentially expressed is greater than that for low-count genes.
For methods that return a false discovery rate (FDR), the proportion of genes that are highly expressed out of the total set of genes being tested will also influence the power of detection after multiple testing correction. Filtering out genes that are expressed at low levels prior to differential expression analysis reduces the severity of the correction and may improve the power of detection [20]. Increasing sequencing depth can also improve statistical power for lowly expressed genes [10], and for any given sample there exists a level of sequencing at which power improvement is best achieved by increasing the number of replicates.
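Why filtering low-expression genes softens the multiple testing correction can be seen from a small worked example. The sketch below implements the Benjamini-Hochberg adjustment, one common FDR procedure (the text does not name a specific one), and applies it with and without a count filter. The gene names, mean counts, p-values and the filter threshold of 10 are all invented for illustration, and in practice the filter should not depend on the test statistic itself.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def n_discoveries(genes, alpha=0.05):
    """Number of genes with adjusted p-value <= alpha."""
    adjusted = benjamini_hochberg([p for _, p in genes.values()])
    return sum(1 for q in adjusted if q <= alpha)


# gene -> (mean read count, raw p-value); toy values, not real data.
genes = {
    "geneA": (500, 0.001), "geneB": (300, 0.012),
    "geneC": (150, 0.020), "geneD": (200, 0.600),
    "low1": (2, 0.30), "low2": (1, 0.45), "low3": (3, 0.50),
    "low4": (0, 0.70), "low5": (4, 0.80), "low6": (1, 0.90),
}
filtered = {g: v for g, v in genes.items() if v[0] >= 10}

n_all = n_discoveries(genes)
n_filtered = n_discoveries(filtered)
print(f"discoveries without filtering: {n_all}, with filtering: {n_filtered}")
```

With ten tests, only the smallest p-value survives the adjustment; after removing the six low-count genes, the same raw p-values are corrected for four tests and three genes pass the 0.05 threshold.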
Tools such as Scotty are available to calculate the best trade-off between sequencing depth and replicate number given some budgetary constraints.
RNA-seq library preparation and sequencing procedures include a number of steps (RNA fragmentation, cDNA synthesis, adapter ligation, PCR amplification, bar-coding, and lane loading) that might introduce biases into the resulting data. For bias minimization, we recommend following the suggestions made by Van Dijk et al. Another option, when samples are individually barcoded and multiple Illumina lanes are needed to achieve the desired sequencing depth, is to include all samples in each lane, which would minimize any possible lane effect.
Mapping to a reference genome allows for the identification of novel genes or transcripts, and requires the use of a gapped or spliced mapper, as reads may span splice junctions. The challenge is to identify splice junctions correctly, especially when sequencing errors or differences with the reference exist or when non-canonical junctions and fusion transcripts are sought. One of the most popular RNA-seq mappers, TopHat, follows a two-step strategy in which unspliced reads are first mapped to locate exons, then unmapped reads are split and aligned independently to identify exon junctions. Important parameters to consider during mapping are the strandedness of the RNA-seq library, the number of mismatches to accept, the length and type of reads (single-end, SE, or paired-end, PE), and the length of sequenced fragments. In addition, existing gene models can be leveraged by supplying an annotation file to some read mappers in order to map exon coordinates accurately and to help in identifying splicing events. The choice of gene model can also have a strong impact on the quantification and differential expression analysis. We refer the reader to [30] for a comprehensive comparison of RNA-seq mappers.
If the transcriptome annotation is comprehensive (for example, in mouse or human), researchers may choose to map directly to a FASTA-format file of all transcript sequences for all genes of interest. In this case, no gapped alignment is needed and unspliced mappers such as Bowtie can be used (Fig.). Mapping to the transcriptome is generally faster but does not allow de novo transcript discovery.
Many statistical methods are available for detecting differential gene or transcript expression from RNA-seq data, and a major practical challenge is how to choose the most suitable tool for a particular data analysis job. This enables a direct assessment of the sensitivity and specificity of the methods as well as their FDR control. As simulations typically rely on specific statistical distributions or on limited experimental datasets, and as spike-in datasets represent only technical replicates with minimal variation, comparisons using simulated datasets have been complemented with more practical comparisons in real datasets with true biological replicates [64].
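When the truly differentially expressed genes are known, as in simulated or spike-in datasets, sensitivity, specificity and the realized FDR of a method follow directly from the overlap between its calls and the truth. The following minimal sketch shows the arithmetic; the function name `evaluate_calls` and the gene identifiers are hypothetical.

```python
def evaluate_calls(all_genes, true_de, called_de):
    """Compare a method's differential expression calls with known truth."""
    all_genes, true_de, called_de = set(all_genes), set(true_de), set(called_de)
    tp = len(true_de & called_de)               # correctly called
    fp = len(called_de - true_de)               # called but not truly DE
    fn = len(true_de - called_de)               # truly DE but missed
    tn = len(all_genes - true_de - called_de)   # correctly left uncalled
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "fdr": fp / (tp + fp) if tp + fp else 0.0,
    }


# Toy benchmark (invented): 10 genes, 4 truly differentially expressed.
all_genes = [f"g{i}" for i in range(10)]
true_de = {"g0", "g1", "g2", "g3"}
called_de = {"g0", "g1", "g2", "g7"}
metrics = evaluate_calls(all_genes, true_de, called_de)
print(metrics)
```

Comparing the realized FDR with the nominal threshold a method was run at indicates whether the method is conservative or too liberal.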
As yet, no clear consensus has been reached regarding the best practices and the field is continuing to evolve rapidly. However, some common findings have been made in multiple comparison studies and in different study settings. First, specific caution is needed with all the methods when the number of replicate samples is very small or for genes that are expressed at very low levels [55, 64]. Among the tools, limma has been shown to perform well under many circumstances and it is also the fastest to run [56, 63, 64]. DESeq and edgeR perform similarly in ranking genes but are often relatively conservative or too liberal, respectively, in controlling FDR [63]. SAMseq performs well in terms of FDR but presents an acceptable sensitivity only when the number of replicates is relatively high, at least 10 [20, 55]. Cuffdiff and Cuffdiff2 have performed surprisingly poorly in the comparisons [56, 63]. This probably reflects the fact that detecting differential expression at the transcript level remains challenging and involves uncertainties in assigning the reads to alternative isoforms. In a recent comparison, BitSeq compared favorably to other transcript-level packages such as Cuffdiff2.
Besides the actual performance, other issues affecting the choice of the tool include ease of installation and use, computational requirements, and quality of documentation and instructions. Finally, an important consideration when choosing an analysis method is the experimental design. While some of the differential expression tools can only perform a pair-wise comparison, others such as edgeR [57], limma-voom [55], DESeq [48], DESeq2 [58], and maSigPro can perform multiple comparisons, include different covariates or analyze time-series data.
Massive mining of publicly available RNA-seq data from human and mouse. Nat Commun 9.
Received: 14 January. Accepted: 08 March. Published: 10 April.
Sample search with reduced dimensionality
To enable reliable similarity search of signatures within the ARCHS4 data matrix, the matrix is compressed into a lower dimensional representation.
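The idea of searching in a compressed space can be sketched as follows. The text here does not state which compression method is used, so this example substitutes a Gaussian random projection as a stand-in and ranks samples by cosine similarity in the reduced space. All names (`random_projection_matrix`, `project`, `most_similar`), the dimensions and the toy profiles are assumptions for illustration only.

```python
import math
import random


def random_projection_matrix(n_features, n_components, seed=0):
    """Gaussian random matrix mapping n_features -> n_components."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_components)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_features)]
            for _ in range(n_components)]


def project(vector, matrix):
    """Compress one expression profile into the low-dimensional space."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def most_similar(query, samples, matrix):
    """Name of the stored sample closest to the query after compression."""
    q = project(query, matrix)
    return max(samples, key=lambda name: cosine(q, project(samples[name], matrix)))


# Toy data: three centred profiles over 50 genes, compressed to 10 dimensions.
rng = random.Random(1)
samples = {name: [rng.gauss(0.0, 1.0) for _ in range(50)]
           for name in ("sample_A", "sample_B", "sample_C")}
matrix = random_projection_matrix(n_features=50, n_components=10)
# The query is a slightly noisy copy of sample_B.
query = [v + rng.gauss(0.0, 0.01) for v in samples["sample_B"]]
best = most_similar(query, samples, matrix)
print(best)
```

In a real setting the stored samples would be projected once and cached, so that each query requires only one projection and a scan over short vectors.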
Prediction of biological function and protein interactions
Gene-gene co-expression correlations across all human genes can be utilized to predict gene function and protein-protein interactions (PPI) by exploiting the fact that genes that co-express have the tendency to also share function and physically interact.
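The co-expression principle described above can be sketched as a simple scoring rule: a gene is scored for a function by its average correlation with genes already annotated with that function. This is a minimal illustration under stated assumptions, not the method used by the authors; the gene names, the two gene sets and the function `function_score` are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def function_score(gene, annotated_genes, expression):
    """Mean co-expression of `gene` with genes annotated to one function."""
    others = [g for g in annotated_genes if g != gene]
    if not others:
        return 0.0
    return mean(pearson(expression[gene], expression[g]) for g in others)


# Toy expression profiles across five samples (invented values).
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 11.0],
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],
    "geneX": [1.5, 2.5, 3.1, 4.2, 5.3],  # unannotated query gene
}
set_one = {"geneA", "geneB"}  # hypothetical gene set for function 1
set_two = {"geneC"}           # hypothetical gene set for function 2

score_one = function_score("geneX", set_one, expression)
score_two = function_score("geneX", set_two, expression)
print(f"function 1: {score_one:.3f}, function 2: {score_two:.3f}")
```

The query gene tracks the first gene set and runs opposite to the second, so the first function receives the higher score.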
Competing interests: The authors declare no competing interests.