Machine Learning for Single Cell RNA Sequencing Data Analysis: An Unsupervised Learning Approach towards Subclonal Cell Population Identification for Target Therapy Applications Against Tumor Heterogeneity

By Anastasia Dunca

In this work, I investigated how effectively machine learning algorithms support statistical analysis of single-cell RNA sequencing (scRNAseq) data, with the goal of informing immunotherapy strategies. Single-cell RNA sequencing is a group of bioinformatics methods that quantify the amount of RNA in individual cells. This is particularly useful for studying heterogeneity in cell populations, specifically tumor heterogeneity: the phenomenon of cells sharing phenotypic similarities but behaving differently. Using single-cell RNA sequencing, gene expression profiles can be built for individual cells to identify these convergent cell types. Identifying these cell types is especially useful in immunotherapy applications for cancer patients, where it can help combat resistance to drug treatment.

Machine learning is used primarily to analyze and visualize the data so that gene expression patterns can be interpreted and discovered. In the research, two datasets were examined: one with counts of each gene per cell and one with counts per million (CPM) of each gene per cell. Because scRNAseq datasets are unlabeled and high-dimensional, this is a problem for unsupervised learning. Two different pipelines were applied to each dataset, and the resulting graphs were compared. The first pipeline consisted of filtering the data, dimensionality reduction with principal component analysis (PCA) using a scree plot for component selection, projection with UMAP, and then cluster analysis using several clustering methods such as Ward and DBSCAN. The second pipeline consisted of filtering the data, dimensionality reduction with PCA, and K-Means clustering, evaluated with the silhouette method and an elbow plot. Two clustering approaches, graph-based clustering and K-Means, were used to gauge the consistency of the cluster analysis as well as which gave the clearest results for data analysis. After cluster generation, exploratory data analysis was performed against the “truth” data for each set to see what the machine learning methods were able to identify for the researchers.
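To make the second pipeline concrete, here is a minimal sketch in Python with scikit-learn. It is not the code used in this project: the synthetic count matrix, the filtering threshold, the log transform, the number of principal components, and every function name below are assumptions chosen for illustration. The UMAP projection from the first pipeline is left out because it needs the separate umap-learn package, but a Ward clustering is included to show one way of checking whether two methods agree.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, silhouette_score


def make_toy_counts(n_cells_per_group=60, n_genes=200, seed=0):
    """Synthetic stand-in for a cells x genes count matrix (not real data)."""
    rng = np.random.default_rng(seed)
    groups = []
    for g in range(3):
        rates = np.full(n_genes, 0.2)        # mostly zeros and small counts
        rates[g * 20:(g + 1) * 20] = 10.0    # hypothetical marker genes
        groups.append(rng.poisson(rates, size=(n_cells_per_group, n_genes)))
    return np.vstack(groups)


def preprocess(counts, min_cells=3):
    """Drop rarely detected genes, convert to counts per million, then log."""
    keep = (counts > 0).sum(axis=0) >= min_cells
    filtered = counts[:, keep]
    cpm = filtered / filtered.sum(axis=1, keepdims=True) * 1e6
    return np.log1p(cpm)


def cluster_scan(X, n_components=5, k_values=range(2, 7), seed=0):
    """PCA, then K-Means over several k with elbow and silhouette values."""
    pca = PCA(n_components=n_components, random_state=seed)
    embedding = pca.fit_transform(X)
    scores = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embedding)
        # inertia_ feeds the elbow plot; silhouette_score the silhouette method
        scores[k] = (km.inertia_, silhouette_score(embedding, km.labels_))
    # explained_variance_ratio_ holds the values a scree plot would show
    return embedding, pca.explained_variance_ratio_, scores


def method_agreement(embedding, k, seed=0):
    """Adjusted Rand index between K-Means and Ward labels (1.0 = identical)."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embedding)
    ward = AgglomerativeClustering(n_clusters=k, linkage="ward").fit(embedding)
    return adjusted_rand_score(km.labels_, ward.labels_)


if __name__ == "__main__":
    X = preprocess(make_toy_counts())
    embedding, scree, scores = cluster_scan(X)
    best_k = max(scores, key=lambda k: scores[k][1])
    print("scree values:", np.round(scree, 3))
    print("k with highest silhouette:", best_k)
    print("K-Means vs Ward agreement:", method_agreement(embedding, best_k))
```

On real scRNAseq data the scree values would guide how many components to keep, and the elbow and silhouette values would be plotted against k rather than reduced to a single maximum.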

Ultimately, I wanted to try a wide variety of machine learning clustering algorithms to analyze scRNAseq data in a way that would be helpful to biologists. I wanted to take a dataset full of zeros and small numbers and group the inputs into a visually and numerically appealing result that biologists could use to identify cancerous or treatment-resistant genes.
