This study evaluates our novel feature extraction and data integration method for the accurate and interpretable classification of biological samples based on their mRNA and miRNA expression profiles. The main idea is to use knowledge of miRNA targets to better approximate the actual amount of protein synthesized in the sample. The raw mRNA and miRNA expression features are enriched or replaced by new aggregated features that model mRNA-miRNA regulation. The underlying hypothesis is that the resulting sample profile gets closer to the phenotype being predicted. The proposed subtractive aggregation method (SubAgg) directly implements a simple mRNA-miRNA interaction model in which mRNA expression is modified using the expression of its targeting miRNAs. This method makes the simplifying assumption that all individual miRNAs carry equal weight, which suits small sample sizes where learning individual weights may lead to overfitting. Its SVD-based modification (SVDAgg) allows different subtractive weights for different miRNAs, learned by singular value decomposition (SVD). The two proposed knowledge-based subtractive methods were compared with the most straightforward way of integrating mRNA and miRNA data, namely merging the two respective datasets. We classified myelodysplastic syndrome patients under various experimental settings and compared the straightforward concatenation with SubAgg and SVDAgg. The results suggest that the knowledge-based approaches outperform the concatenation benchmark and that features derived from the mRNA-miRNA target relation can improve classification performance.
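
To make the equal-weight subtractive idea concrete, the following is a minimal sketch and not the implementation evaluated in the study. It assumes that expression values are stored as sample-by-feature matrices, that miRNA targets are given as a binary gene-by-miRNA matrix, and that the adjustment is a plain subtraction of the summed targeting miRNA expression. The function name `subtractive_aggregation`, the `scale` parameter, and the toy numbers are hypothetical; the exact formula and any normalization used by SubAgg are not specified in this text.

```python
import numpy as np


def subtractive_aggregation(mrna, mirna, targets, scale=1.0):
    """Equal-weight subtractive aggregation (illustrative sketch only).

    mrna    : (n_samples, n_genes) mRNA expression matrix
    mirna   : (n_samples, n_mirnas) miRNA expression matrix
    targets : (n_genes, n_mirnas) binary matrix; targets[g, m] == 1
              if miRNA m is a known regulator of gene g
    scale   : hypothetical global factor applied to the subtracted term
    """
    mrna = np.asarray(mrna, dtype=float)
    mirna = np.asarray(mirna, dtype=float)
    targets = np.asarray(targets, dtype=float)

    # For every sample and gene, sum the expression of all miRNAs that
    # target the gene. Each miRNA contributes with the same weight.
    repression = mirna @ targets.T  # shape: (n_samples, n_genes)

    # Assumed interaction model: adjusted mRNA = mRNA - scale * repression.
    return mrna - scale * repression


# Toy data: 2 samples, 2 genes, 3 miRNAs (made-up values).
mrna = [[5.0, 3.0],
        [4.0, 6.0]]
mirna = [[1.0, 2.0, 0.0],
         [0.0, 1.0, 1.0]]
targets = [[1, 1, 0],   # gene 0 is targeted by miRNA 0 and miRNA 1
           [0, 0, 1]]   # gene 1 is targeted by miRNA 2

adjusted = subtractive_aggregation(mrna, mirna, targets)
print(adjusted)
```

Under these assumptions, the adjusted matrix can either replace the raw mRNA features or be appended to them before classification. A weighted variant in the spirit of SVDAgg would replace the binary entries of `targets` with learned per-miRNA weights, which this sketch does not attempt.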

Performance on gene set level features.
Performance on gene level features.
SubAgg.