DNA-based molecular classifiers for the profiling of gene expression signatures | Journal of Nanobiotechnology

AdminTeam April 19, 2024

Modular and programmable transformation of signatures

The substantial heterogeneity and intricate secondary structures of RNAs significantly restrict the general applicability of DNA-based computation in gene expression signature profiling. Moreover, RNAs are typically found at attomolar to femtomolar concentrations in tissue and blood samples, so a pre-amplification step is needed before the computation reactions become observable. Herein, we developed a strategy based on asymmetric PCR and associative strand displacement to modularly amplify and transform gene expression signatures into programmable inputs (Fig. 1B).

Asymmetric PCR was employed so that the amount of amplified product is nearly linear in the logarithm of the initial RNA concentration. Using miRNA-21 as an example, we first employed a commercial kit that performs poly(A) tailing and reverse transcription simultaneously to generate first-strand cDNA. Subsequently, the generated cDNA was amplified with a specific primer and a universal primer, where the specific primer acts as the excess primer and the universal primer as the limiting primer. With the melting temperature (Tm) and stoichiometric ratio of the limiting and excess primers adjusted appropriately, the initial exponential phase of the reaction generates double-stranded amplicons until the limiting primer is exhausted, after which the reaction switches to synthesizing only the single-stranded DNA (ssDNA) extended from the excess primer [33]. At a specific cycle number, the ratios of generated ssDNA are consistent with those of the logarithmic initial concentrations of the RNAs (the proof is provided in Additional file 1: Text S1). Fig. 1C–E show the ssDNA generated from a series of initial miRNA concentrations ranging from 0.1 to 10 pM. The results demonstrate a linear relationship between the logarithm of the initial miRNA concentration and the ssDNA produced by asymmetric PCR, confirming the feasibility of this method for the subsequent molecular classifier.
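The log-linear behavior can be illustrated with a deliberately idealized model, which is not the proof given in Text S1. The sketch below assumes perfect doubling during the exponential phase until the limiting primer is used up, followed by a linear phase in which each cycle adds one ssDNA copy per template. The function name, the 100 nM limiting-primer concentration, and the 25-cycle endpoint are illustrative assumptions.

```python
import math

def ssdna_after_cycles(c0_pm, limiting_primer_pm, cycles):
    """Idealized ssDNA yield (pM) of asymmetric PCR after a fixed cycle number.

    Assumptions (not from the paper): 100% efficiency, a continuous cycle
    count, and a sharp switch from exponential to linear amplification.
    """
    # Number of doubling cycles before the limiting primer is exhausted.
    n_exp = math.log2(limiting_primer_pm / c0_pm)
    if cycles <= n_exp:
        return 0.0  # still in the exponential, double-stranded phase
    # Linear phase: every remaining cycle copies each template once.
    return limiting_primer_pm * (cycles - n_exp)

limiting = 1e5  # 100 nM limiting primer, hypothetical value
for c0 in (0.1, 1.0, 10.0):  # pM, the range shown in Fig. 1C-E
    print(c0, ssdna_after_cycles(c0, limiting, cycles=25))
```

In this model each tenfold increase in the initial concentration adds the same amount of ssDNA at the chosen endpoint, which is the linear dependence on the logarithmic concentration described above.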

Next, we designed associative strand displacement to modularly decouple sequence constraints between RNAs and the subsequent DNA-based molecular classifiers. As shown in Fig. 1B, the two split modules are each partially complementary to the generated ssDNA, and their remaining parts together form a complete strand that triggers the following strand displacement. Through associative strand displacement, heterogeneous RNAs were transformed into a programmable sequence, making the downstream circuit universal. We first investigated the effect of the hybridization length with ssDNA on the yield of strand displacement; each module was designed with at least 13 bases complementary to the ssDNA to ensure a high yield (Additional file 1: Fig. S12). In addition, the split position and the junction length were optimized to minimize leakage during conversion. According to the results shown in Additional file 1: Figs. S13 and S14, we placed the split position 4 nt away from the toehold region and eliminated the junction between the two modules. Under optimal conditions, ssDNA was efficiently translated into programmable input for subsequent molecular classification (Fig. 1F, G). In general, these processing steps transform signatures into programmable inputs while preserving their original concentration relationship (Fig. 1H).

Arbitrary weight assignment to signatures

In molecular classifiers, each gene expression signature contributes differently to the disease state, and a corresponding numerical weight is assigned to each signature in the machine learning model in silico. To implement arbitrary weight assignment at the molecular level, we designed a DNA catalytic system with an inhibitor, as shown in Fig. 2A. The system is analogous to the irreversible competitive inhibition model of enzymatic reactions (Fig. 2B):

$$\begin{aligned} Input+Amplifier \xrightarrow {k}Input + Output \end{aligned}$$

(1)

$$\begin{aligned} Input+Inhibitor \xrightarrow {k} Waste \end{aligned}$$

(2)

In an ideal situation, the final concentration of Output can be computed by integrating the corresponding differential equations:

$$\begin{aligned} \lim \limits _{t\rightarrow \infty }[Output](t)=[Input]_{0} \frac{[Amplifier]_0}{[Inhibitor]_0} \end{aligned}$$

(3)

As a consequence, we can assign an exact weight to each signature by adjusting the initial concentrations of the Amplifier and Inhibitor (see Additional file 1: Text S2 for details).
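One way to see why Eq. (3) holds is sketched here under stated assumptions: mass-action kinetics, the shared rate constant $k$ of Eqs. (1) and (2), and $[Inhibitor]_0 > [Input]_0$. This is not necessarily the derivation given in Text S2. The rate equations are:

$$\begin{aligned} \frac{d[Amplifier]}{dt}&=-k[Input][Amplifier], \\ \frac{d[Inhibitor]}{dt}&=-k[Input][Inhibitor], \\ \frac{d[Input]}{dt}&=-k[Input][Inhibitor], \\ \frac{d[Output]}{dt}&=k[Input][Amplifier] \end{aligned}$$

Because the Amplifier and Inhibitor are consumed with the same per-molecule rate $k[Input]$, their ratio stays at its initial value, so that:

$$\begin{aligned} \frac{d[Output]}{dt}=-\frac{[Amplifier]_0}{[Inhibitor]_0}\frac{d[Input]}{dt} \end{aligned}$$

Integrating from $t=0$ to $t\rightarrow \infty$, where the excess Inhibitor drives $[Input]$ to zero, gives $[Output]=[Input]_0\,[Amplifier]_0/[Inhibitor]_0$, in agreement with Eq. (3).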

To experimentally validate this strategy, we designed an entropy-driven catalytic system as the Amplifier and a cascade reaction as the corresponding Inhibitor, with the two maintaining a consistent reaction rate [34, 35]. The Amplifier is catalyzed by inputs to release output strands, which then react with double-stranded fluorescent reporters so that their concentration can be determined. We first implemented weights (W) of 2.5, 3.5 or 4.5 for a series of input concentrations ([Input] = 1, 2, 3 or 4 nM). Kinetic fluorescence measurements were performed after adding inputs to the competitive inhibition system, and we found that the final signal was linearly proportional to the stoichiometric ratio of $[Amplifier]_0$ to $[Inhibitor]_0$ for all input concentrations (Fig. 2C, D). The relationship between input concentration and normalized signal was fitted to the linear equation $[Signal] = W \times [Input]$, and the coefficients of determination ($R^2$) were greater than 0.98 for all weights.

To further demonstrate that this mechanism can be used to assign an arbitrary weight to varying concentrations of input, we simulated the competitive inhibition system using ordinary differential equations (ODEs) (see Additional file 1: Text S3 for details). As shown in Fig. 2E, different weights were achieved by adjusting the concentration of inhibitor, and the performance remained consistent across various input concentrations. Then, we experimentally verified the simulated results, and the concentrations of Output and weights corresponding to different input concentrations demonstrated the precise weighting of input by the DNA-based competitive inhibition system (Fig. 2F, G).
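A minimal sketch of such a simulation is given below. It integrates Eqs. (1) and (2) as single bimolecular steps with a forward-Euler scheme, which may be simpler than the model in Text S3. The rate constant, time step, and concentrations are hypothetical values chosen only for illustration, and the Inhibitor is assumed to be in excess over the Input.

```python
def simulate_weighting(input0, amplifier0, inhibitor0,
                       k=1e-3, dt=1.0, steps=20000):
    """Forward-Euler integration of the competitive inhibition system.

    Concentrations are in nM, k in 1/(nM*s), dt in s. All default values
    are illustrative assumptions, not parameters from the paper.
    Returns the final Output concentration.
    """
    inp, amp, inh, out = input0, amplifier0, inhibitor0, 0.0
    for _ in range(steps):
        # Both fluxes are computed from the state at the start of the step.
        catalysis = k * inp * amp * dt   # Input + Amplifier -> Input + Output
        inhibition = k * inp * inh * dt  # Input + Inhibitor -> Waste
        amp -= catalysis
        out += catalysis
        inp -= inhibition
        inh -= inhibition
    return out

# Weight W = [Amplifier]_0 / [Inhibitor]_0 = 25 / 10 = 2.5
for input0 in (1.0, 2.0, 3.0, 4.0):
    output = simulate_weighting(input0, amplifier0=25.0, inhibitor0=10.0)
    print(input0, output, output / input0)
```

Changing `inhibitor0` while holding `amplifier0` fixed changes the ratio `output / input0`, mirroring how the weights are tuned through the inhibitor concentration.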

Mathematical operations for the molecular classifier

To construct a complete linear classifier, it is essential to implement mathematical operations that sum the weighted inputs and compare the resulting sum with a predefined threshold value, thereby obtaining the desired logistic response [36] (Fig. 3A). In DNA computation, arithmetic summation can be naturally implemented through the production of identical output strands. Herein, for each input we designed output strands that contain the same domain, allowing all of them to react with the fluorescent reporters. The final fluorescence signal thus indicates the summation of weighted inputs:

$$\begin{aligned} \lim \limits _{t\rightarrow \infty }[Signal](t)=\sum \limits _{i}W_i\times [Input_i] \end{aligned}$$

(4)

Two-input and three-input summation systems were designed to verify the summation of weighted inputs, and the response signals were found to be consistent with the results of mathematical calculations (Fig. 3B, C). Simultaneously, another class of inputs, which exhibit a negative correlation with the outcome, yielded outputs containing distinct sequences for the negative reporters. The concentrations of different output strands individually represent the cumulative contributions of positive and negative inputs.

Then, a comparison between the output strands was implemented to generate the final result. The comparison is conveniently accomplished by an annihilation reaction in which the summed output strands for positive and negative inputs are consumed at a stoichiometric ratio of 1:1 (Additional file 1: Figs. S15, S16). We carried out the annihilation reaction using a DNA cooperative hybridization mechanism [37]. As shown in Fig. 3D, one of the output strands is reversibly incorporated into the annihilator through toehold binding. In the presence of the other output strand, the two outputs and the annihilator irreversibly collapse into two waste molecules. Because the annihilation efficiency depends strongly on the length of the toeholds on the annihilator, toeholds of sufficient length were designed to ensure complete consumption of the minority species. In practice, HEX- and ROX-labeled reporters report the outputs associated with positive and negative weights, respectively. Annihilation reactions with a series of output concentrations ranging from 0 to 50 nM illustrate the successful implementation of subtraction.

We experimentally tested the main mathematical operations of the molecular classifier. Taking a simple linear classifier $[Signal]=1.5\times [Input_1]-2\times [Input_4]$ as an example, we combined a range of concentrations of each input to characterize the response. The fluorescence signals of 36 different input combinations were recorded after the inputs were added to the corresponding molecular computing system. Fig. 3E, F show the endpoint fluorescence measurements in the HEX and ROX channels. Notably, a significant increase in fluorescence was observed in the HEX channel only when the weighted $[Input_1]$ exceeded the weighted $[Input_4]$, with no fluorescence signal detected in the ROX channel, and vice versa. For the combinations in which the two weighted inputs were equal, which lie on the diagonal, both fluorescence signals were low (Fig. 3G). These observations indicate that the proposed design performs the intended mathematical operations reliably.
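The expected channel readout for this example can be sketched numerically as weighted summation followed by ideal 1:1 annihilation. The function name and the 6 x 6 grid of input concentrations are hypothetical (the source reports 36 combinations but the concentrations used are not reproduced here), and the model ignores kinetics, leakage, and reporter saturation.

```python
def classify_channels(weights, inputs):
    """Return the ideal residual (HEX, ROX) output after 1:1 annihilation.

    weights: signed weight per input name; inputs: concentration per name (nM).
    Positive weights feed the HEX-reported output, negative weights the ROX one.
    """
    positive = sum(w * inputs[name] for name, w in weights.items() if w > 0)
    negative = sum(-w * inputs[name] for name, w in weights.items() if w < 0)
    annihilated = min(positive, negative)  # the minority species is consumed
    return positive - annihilated, negative - annihilated

weights = {"Input1": 1.5, "Input4": -2.0}
# Hypothetical levels chosen so that equal weighted inputs fall on the diagonal.
input1_levels = [0, 4, 8, 12, 16, 20]  # nM
input4_levels = [0, 3, 6, 9, 12, 15]   # nM

grid = {
    (c1, c4): classify_channels(weights, {"Input1": c1, "Input4": c4})
    for c1 in input1_levels
    for c4 in input4_levels
}
for (c1, c4), (hex_nm, rox_nm) in grid.items():
    print(f"Input1={c1:>2} nM, Input4={c4:>2} nM -> HEX={hex_nm}, ROX={rox_nm}")
```

In this idealized picture at most one channel is non-zero for any combination, and both are zero when the weighted inputs are equal.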

Validation of the HCC diagnosis using synthetic miRNAs

To develop an effective classifier model for the in silico diagnosis of HCC, publicly available serum miRNA expression data from GEO, corresponding to 345 HCC patients and 958 healthy individuals, were used for classifier construction (details of the results are provided in Additional file 1: Text S4). First, differential expression analysis was used to identify miRNAs that were differentially expressed between the cancer and healthy groups. A total of 67 up-regulated and 174 down-regulated miRNA candidates showed expression differences greater than fourfold. Then, a random-forest-based algorithm was applied to assess the relevance of each signature, and miRNAs were ranked by Mean Decrease Accuracy and Mean Decrease Gini. We subsequently built SVM classifiers consisting of the 1 to 10 top-ranked miRNAs in the training set, and selected a minimal set of miRNAs that maintained classifier accuracy. It should be noted that the weight of each miRNA was kept to one decimal place. In addition, the misclassification penalty for HCC samples was set to twice that for healthy individuals, because an early diagnosis of HCC is crucial for improving its prognosis. Finally, we selected a classifier including five miRNAs with weights ranging from −2.6 to 2.4 (Fig. 4C). The classifier discriminates between HCC and healthy samples with an area under the curve (AUC) of 0.9904 in the training dataset (171 HCC and 479 healthy samples) (Fig. 4A). The classifier model was further validated using an additional 174 HCC and 479 healthy samples, resulting in an AUC of 0.9871 (Fig. 4B). The classifier demonstrated excellent specificity and sensitivity, and its small number of inputs and one-decimal weights made it suitable for implementation at the molecular level.
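For reference, the doubled misclassification penalty can be written as a class-weighted soft-margin objective. This is the standard textbook form for a linear SVM and is given as a sketch only; the symbols $w$, $b$, $C$, $c_{y_i}$, $x_i$ and $y_i$ are notation introduced here, and the exact formulation used by the authors is described in Text S4.

$$\begin{aligned} \min _{w,b}\ \frac{1}{2}\Vert w\Vert ^2+C\sum _{i}c_{y_i}\max \left( 0,\,1-y_i\left( w^{\top }x_i+b\right) \right) ,\qquad c_{HCC}=2\,c_{healthy} \end{aligned}$$

Here $x_i$ is the vector of expression levels of the selected miRNAs for sample $i$, $y_i=+1$ for HCC and $y_i=-1$ for healthy samples, and a sample is called HCC when $w^{\top }x+b>0$. The components of $w$ play the role of the per-miRNA weights that the DNA circuit implements.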

Next, we implemented the classifier by designing transformational and computational DNA circuits for the miRNA inputs, and used synthetic miRNAs to evaluate the performance of the resulting molecular classifier. Ten HCC patients and ten healthy individuals that were correctly classified by the in silico classifier were selected, and their miRNA profiles were reproduced in vitro. After transformation and DNA computation as described above, fluorescence signals in the HEX and ROX channels were measured for each sample. The results showed that the expected signal was observed in the intended channel, while the signal remained near the background in the other channel (Fig. 4D). Moreover, we observed a robust correlation between the normalized signal intensity and the corresponding classifier output estimated in silico for each sample, indicating that our molecular classifier reproduced the SVM model (Additional file 1: Fig. S17).

Profiling clinical samples by molecular classifier

Finally, we verified the effectiveness of the molecular classifier for profiling HCC clinical samples. The general workflow is shown in Fig. 5A. miRNAs were first extracted from the plasma of each sample with a commercial kit, and reverse transcription and asymmetric PCR were subsequently performed to generate ssDNA, which was then transformed into the corresponding inputs and processed by the established molecular classifier. HCC patients were discriminated from healthy individuals by monitoring the fluorescence signals in the HEX and ROX channels. The entire procedure takes approximately 3–4 h to complete.

The profiling results for 17 patients with HCC and 18 healthy individuals are shown in Fig. 5B and Additional file 1: Fig. S18. Of the 17 patients with HCC, 15 were diagnosed correctly (sensitivity of 88.2%), and 3 of the 18 healthy individuals were misdiagnosed (specificity of 83.3%). The total accuracy of the classifier for HCC diagnosis in clinical samples was 85.7% (Fig. 5C). These results demonstrate the potential of our method for clinical diagnosis.
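The reported percentages follow directly from the counts above, as the short check below shows. The function and argument names are illustrative; the counts are those stated in the text (15 of 17 HCC samples called correctly, 3 of 18 healthy samples misdiagnosed).

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)           # HCC samples called HCC
    specificity = tn / (tn + fp)           # healthy samples called healthy
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# 17 HCC: 15 correct, 2 missed; 18 healthy: 15 correct, 3 misdiagnosed.
sens, spec, acc = diagnostic_metrics(tp=15, fn=2, tn=15, fp=3)
print(f"sensitivity {sens:.1%}, specificity {spec:.1%}, accuracy {acc:.1%}")
# prints: sensitivity 88.2%, specificity 83.3%, accuracy 85.7%
```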

Our approach introduces several improvements intended to drive the adoption and implementation of molecular classifiers in clinical settings. First, asymmetric PCR followed by associative strand displacement was used to modularly decouple sequence constraints between RNAs and molecular classifiers, which enables the use of molecular classifiers across a wide range of gene expression signatures. Furthermore, for RNA transcripts with intricate secondary structures, associative strand displacement can be accomplished by hybridizing helper strands adjacent to the targeted region on ssDNA [21]. Second, the competitive inhibition system enables precise weight assignment for different inputs, which better aligns with the continuous optimization process of machine learning and accurately captures the importance of individual RNAs. This molecular implementation should accelerate the application of machine learning models in personalized diagnostics. Finally, by adjusting the sequences of DNA domains to control the reaction rate and orthogonality for different inputs, the molecular classifier could in principle be scaled to dozens or even hundreds of gene expression signatures. Overall, with the decreasing cost of synthetic DNA and advancements in microfluidics technology, an effective diagnostic model and a powerful DNA-based molecular classifier can be integrated into a completely automated classification workflow. This integration may facilitate a standardized testing process in low-resource settings.

Nevertheless, more efforts are needed to propel molecular classifiers from research settings to routine clinical practice. For instance, the introduction of an automated system could shorten the turnaround time of experiments and minimize human errors [38]. More testing of the classifier in large and diverse patient populations should be performed to ensure its robustness and generalizability. Optimization of the classifier’s parameters and algorithms may also be performed to enhance its predictive power. We believe future highly integrated DNA-based molecular classifiers may offer universality and scalability by allowing for the encoding of higher valence numbers and hence the detection of larger panels of biomarkers.