Mathematical Analysis of Raman Spectra Data Arrays Using Machine Learning Algorithms

This paper is devoted to the application of mathematical methods for the classification and differentiation of low-resolution Raman light scattering spectral data arrays for complex biological samples such as human platelets. Spectral data arrays consisting of 1266 spectra from 4 groups of patients, totaling 152 people, were analyzed. A random forest algorithm was used. Potential biomarkers of differences between patient groups were identified, on which the given algorithms were tested. Using the random forest algorithm to classify spectra of healthy patients without therapy and patients with cardiovascular pathologies without therapy, we achieved an accuracy of 83.4%. Classification of healthy patients on and off therapy gave an accuracy of 76.26%, and classification of patients with cardiovascular pathologies on and off therapy gave an accuracy of 70%. © 2023 Journal of Biomedical Photonics & Engineering.


Introduction
Cardiovascular diseases (CVDs) are the leading cause of death worldwide and in the Russian Federation. An estimated 17.9 million people died of cardiovascular disease in 2016, accounting for 31% of all deaths worldwide. Eighty-five percent of these deaths were caused by heart attack and stroke [1]. Cardiovascular disease is the most frequent cause of hospitalization and disability among the population of the Russian Federation. At the same time, about 40% of people in Russia die at an active working age (25-64 years) [2].
The development of new methods for diagnosing such diseases and detecting their risks can significantly increase the efficiency and accuracy of prevention. The application of machine learning algorithms to the analysis of blood Raman spectra is a relatively novel method for detecting cardiovascular diseases.
Research into the application of machine learning to the processing of Raman spectra for the diagnosis of cardiovascular diseases is at an early stage of development. For example, machine learning has been applied to early warning of heart attacks using surface-enhanced Raman spectroscopy [3].
A random forest algorithm was chosen to classify Raman spectra by patient group and to select the most significant spectral shifts [4][5][6][7]. This choice is due to the ability of the method to classify observations, to select the most significant features, and to handle large amounts of data. It is also effective and relatively fast, which is important when large amounts of data must be analyzed in a short time. The originality of the approach lies in the first application of machine learning to the processing of an array of complex Raman light scattering spectra of human platelets.
The purpose of this paper is to address the important problem of differentiating Raman spectra [8][9][10][11] of patients with and without cardiovascular pathologies, and of detecting spectral markers of platelet changes caused by pathology and by medication.
This article describes the results of applying a machine learning algorithm to the processing of spectral data arrays for different groups of patients: healthy patients, patients with pathologies of cardiovascular disease, healthy patients receiving therapy, and patients with pathologies of cardiovascular disease receiving therapy. The applicability of the random forest algorithm is shown.
29 Jun 2023 © J-BPE 020308-2

SERS Experiment
For the platelet study using SERS, fresh venous blood samples were collected from healthy volunteers and patients with CVD in a vacuum tube containing EDTA (BD Vacutainer®). The samples were centrifuged at 60 g for 15 min to separate platelet-rich plasma, which was then centrifuged again at 60 g for 15 min to precipitate leukocytes and erythrocytes. In the last step, platelets were precipitated by centrifugation of the supernatant at 1500 g for 15 min. All centrifugation steps were performed at 4 °C.
Fresh venous blood samples for SERS were taken in a vacuum tube containing sodium citrate (Vacutainer®, 4.5 ml, with 3.2% sodium citrate). SERS spectra were obtained on a Centaur U HR Raman spectrometer (NanoScanTechnology LTD, Russia) using a Cobolt Samba diode-pumped solid-state laser with an excitation wavelength of λ = 532 nm and a power at the sample of 45 mW. The optical scheme included an Olympus BX 41 microscope with a 100X objective (NA 0.9). The spectrometer monochromator (ZAO Solar LS, Belarus) had a focal length of 284 mm with a diffraction grating of 1200 gr/mm and was equipped with an IDus 401 CCD camera (Andor, UK). The resolution of the camera was 1024 × 256 pixels. The spectrometer had a spectral resolution of 2.5 cm⁻¹. A 1 × 25 µm laser spot was positioned manually on the platelet mass. Rayleigh scattering was eliminated with a reflector filter. A 5 μl drop of platelet-rich plasma was applied to a previously prepared titanium-based nanostructured substrate, the fabrication of which is described in detail in Ref. [12], dried for 5 min at room temperature, and then placed on the microscope stage. For each sample, spectra were collected from ten different locations on the drop, each averaged over three acquisitions. The exposure time of the CCD camera was set to 70 s. Before each experiment, the spectrometer was calibrated against silicon using a static spectrum centered at a Raman shift of 520.1 cm⁻¹ and acquired for 1 s. After registration, the spectra were saved in .txt format and in a special format (.ngs) on a PC connected to the Raman spectrometer. Because they can support plasmon resonance, roughened titanium surfaces with spherical gold nanoparticles were used to amplify the Raman signal. The plasmon absorption maxima were λ = 530 nm and λ = 570 nm for the gold nanoparticles and the rough Ti surface, respectively. For such structures, signal amplification of up to three orders of magnitude (10³ times) was recorded.
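The relation between the excitation wavelength and a Raman shift, which underlies the silicon calibration mentioned above, can be illustrated with a short Python sketch. It uses the standard definition of the Raman shift as the difference of reciprocal wavelengths; the function names are hypothetical and the code is not part of the authors' acquisition software.

```python
def raman_shift_cm1(excitation_nm, scattered_nm):
    """Raman shift in cm^-1 from excitation and scattered wavelengths in nm.

    1 nm^-1 equals 1e7 cm^-1, hence the conversion factor.
    """
    return 1e7 * (1.0 / excitation_nm - 1.0 / scattered_nm)


def scattered_wavelength_nm(excitation_nm, shift_cm1):
    """Wavelength in nm at which a given Raman shift appears (Stokes side)."""
    return 1.0 / (1.0 / excitation_nm - shift_cm1 * 1e-7)


# Values taken from the text: 532 nm excitation, silicon line at 520.1 cm^-1.
silicon_nm = scattered_wavelength_nm(532.0, 520.1)
print(f"Silicon calibration line appears near {silicon_nm:.2f} nm")
```

Under these assumptions, the silicon line used for calibration falls at roughly 547 nm for 532 nm excitation.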
The study focused on the Raman spectra of the following groups of patients: healthy without therapy (group 1); healthy on therapy (aspirin, clopidogrel) (group 2); patients with cardiovascular disease without therapy (group 3); patients with cardiovascular disease on therapy (group 4). The total number of study participants was 152. Subjects were divided into 4 groups: healthy volunteers not receiving AT (group 1) and receiving acetylsalicylic acid (ASA) (group 2); patients with cardiovascular disease (CVD) without AT (group 3) and receiving AT (group 4). Aggregometry and SERS measurements of platelets were performed in all subjects. An original optical sensor based on a nanostructured titanium surface modified with gold particles was developed to obtain SERS spectra of platelets.
Fig. 1 Example of Raman spectra for healthy patients (blue line), patients with CVD on therapy (red line), group of healthy patients on therapy (green line) [12].
Fig. 2 Decision tree in the classification of spectra by groups of healthy patients without therapy (group 1) and patients with cardiovascular disease pathologies without therapy (group 3).
Spectral data were read from the instrument and written to .txt and .csv files. For the subsequent analysis, the spectral data were made uniform by mapping them onto a single frequency grid using the Parcer program. Uniform tables were created for each group, and a unified frequency grid with 5 cm⁻¹ spacing was generated. The program, written in C++, automatically converted the files and assembled the data array, assigning each spectrum to grid cells over the range from 400 cm⁻¹ to 1800 cm⁻¹ in 5 cm⁻¹ increments. Thus, all spectral features were aligned with the designated grid.
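The mapping of a spectrum onto the common grid can be sketched as follows. This is a minimal Python illustration, not the authors' C++ Parcer program: the function names are hypothetical, and averaging all points that fall nearest to a grid node is an assumed binning rule, since the text does not state how values within a cell were combined.

```python
import math

GRID_START = 400.0  # cm^-1, lower bound given in the text
GRID_STOP = 1800.0  # cm^-1, upper bound given in the text
GRID_STEP = 5.0     # cm^-1, grid spacing given in the text


def grid_nodes(start=GRID_START, stop=GRID_STOP, step=GRID_STEP):
    """Return the grid nodes start, start + step, ..., stop."""
    n = int(round((stop - start) / step)) + 1
    return [start + i * step for i in range(n)]


def resample_to_grid(shifts, intensities,
                     start=GRID_START, stop=GRID_STOP, step=GRID_STEP):
    """Average the measured intensities assigned to each grid node.

    Each point goes to its nearest node; points outside the grid are
    dropped, and nodes that receive no points are returned as None.
    """
    n = int(round((stop - start) / step)) + 1
    sums = [0.0] * n
    counts = [0] * n
    for shift, value in zip(shifts, intensities):
        idx = math.floor((shift - start) / step + 0.5)  # nearest node index
        if 0 <= idx < n:
            sums[idx] += value
            counts[idx] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Toy spectrum: two points fall on the 400 cm^-1 node, one on 405 cm^-1.
row = resample_to_grid([400.0, 401.0, 404.0, 1800.0], [1.0, 3.0, 5.0, 7.0])
print(len(row), row[:3], row[-1])
```

Each resampled spectrum then has the same length (one value per grid node), so spectra from different files can be stacked as rows of a single table with one column per spectral shift.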

Spectral Data Processing with Machine Learning
The Random Forest module of Statistica 13, an implementation of the random forest classifier developed by Breiman [13] (the algorithm is also applicable to regression tasks), was chosen to process the spectral data.
To set up the random forest classification tasks, a group column containing the name of the patient group for each observation was added to the data table. The dependent variable was the patient group, defined by condition and medication intake, and the spectral values at shifts between 400 cm⁻¹ and 1800 cm⁻¹ were taken as continuous predictors. Next, a test sample of spectral observations was set aside using the option for selecting the proportion of test and training samples. In our study, the test sample represents 30% of all observations. The random forest method in Statistica 13 defines a margin function that measures the extent to which the mean number of votes for the correct class exceeds the mean number of votes for any other class present in the dependent variable. This measure provides not only a convenient way to make predictions but also a way to attach a confidence score to those predictions. The accuracy of the classifier was determined as the ratio of the number of correct responses to the total number of responses.
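The vote-based measure and the accuracy definition described above can be illustrated with a small Python sketch. It follows Breiman's general formulation of the margin (share of trees voting for the correct class minus the largest share for any other class); the function names and the toy votes are hypothetical, and the internals of the Statistica 13 module are not reproduced here.

```python
from collections import Counter


def forest_predict(tree_votes):
    """Predicted class: the label that receives the most tree votes."""
    return Counter(tree_votes).most_common(1)[0][0]


def forest_margin(tree_votes, true_label):
    """Share of votes for the true class minus the largest share for any other class.

    Positive values mean the forest classifies the observation correctly;
    values near 1 mean it does so with near-unanimous votes.
    """
    counts = Counter(tree_votes)
    total = len(tree_votes)
    correct_share = counts.get(true_label, 0) / total
    other_shares = [c / total for label, c in counts.items() if label != true_label]
    return correct_share - max(other_shares, default=0.0)


def accuracy(predicted, actual):
    """Ratio of correct responses to the total number of responses."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Toy example: 10 trees vote on one spectrum that truly belongs to group 1.
votes = ["group 1"] * 7 + ["group 3"] * 3
print(forest_predict(votes), forest_margin(votes, "group 1"))
```

In this toy case the margin is positive because the majority of trees vote for the true group; had the spectrum belonged to group 3, the same votes would give a negative margin and a misclassification.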

Results and Discussions
An array of spectral data from four groups of patients was accumulated based on the results of spectral imaging. An example of the obtained spectra is shown in Fig. 1.
Table 1 Classification matrix for separating healthy patients without therapy from CVD patients without therapy.

True class | Predicted healthy without therapy | Predicted CVD without therapy
True healthy without therapy | 37.5% | 10.07%
True CVD without therapy | 6.53% | 45.9%
Note: Diagonal cells give the share of correctly classified spectra (highlighted in green in the original); off-diagonal cells give the share of misclassified spectra (highlighted in red in the original).
Table 2 The most important spectral shifts and their interpretation for separating healthy patients without therapy from CVD patients without therapy.
Table 3 Classification matrix for separating healthy patients without therapy from healthy patients on therapy.

True class | Predicted healthy without therapy | Predicted healthy on therapy
True healthy without therapy | 69.38% | 11.88%
True healthy on therapy | 11.88% | 6.88%
Note: Diagonal cells give the share of correctly classified spectra (highlighted in green in the original); off-diagonal cells give the share of misclassified spectra (highlighted in red in the original).
Table 4 The most important spectral shifts and their interpretation for the classification of healthy patients without and on therapy.

Spectral Data Processing with Machine Learning
The data were classified into patient groups using a random forest algorithm. Let us first consider the differentiation of spectra between healthy patients without therapy (group 1) and patients with cardiovascular pathologies without therapy (group 3). Fig. 2 shows one of the decision trees, which indicates the most informative spectral lines.
Most of the data were correctly identified by the random forest algorithm. The accuracy of the algorithm in classifying the observations into groups of healthy patients without therapy and patients with cardiovascular pathology without therapy was 83.4%, as shown in Table 1.
In carrying out this classification, the spectral shifts most significant for separating the observations into groups were identified; they are presented in Table 2.
Note: The number of correctly classified spectral data by patient group is highlighted in green. The number of misclassified spectral data is highlighted in red.
Table 6 The most important spectral shifts and their interpretation for the classification of CVD patients without and on therapy.

Spectral shift, cm⁻¹ | Significance of the trait, relative units | Interpretation | Reference
1565 | 0.00662 | C-N stretching | [20]
1425 | 0.00655 | The band at ~1420 cm⁻¹ observed for saturated and Z-unsaturated FAs is hardly seen in the spectra of the unsaturated compounds and presumably accounts for the shoulder at approximately 1422 cm⁻¹ | [21]
1055 | 0.00615 | Lipid hydrocarbon chain | [14]
1440 | 0.00602 | CH2 bend (lipids) | [16][17]
1385 | 0.00582 | Hydrogen bonding in protein | [22]
1125 | 0.00576 | ν(Cβ-methyl) | [23]

A random forest algorithm was also used to classify healthy patients on and off therapy, with an accuracy of 76.26%, as can be seen in Table 3.
The most significant spectral shifts for the classification of healthy patients without and on therapy, together with their interpretation, are shown in Table 4.
When the patients with cardiovascular pathologies on and off therapy were classified, an accuracy of 70% was achieved, as shown in Table 5.
Table 6 lists the most important spectral shifts and their interpretation for the classification of patients with cardiovascular pathologies without and on therapy.

Conclusions
This research demonstrated initial results on the development of machine learning methods for the differentiation of Raman spectra in patients with and without cardiovascular pathologies.
These methods were applied to spectral data arrays consisting of 1266 spectra for different groups of patients: healthy patients, patients with cardiovascular pathologies, healthy patients receiving therapy, and patients with cardiovascular pathologies receiving therapy. The applicability of the random forest algorithm was shown. Potential biomarkers of differences between patient groups, on which the given algorithms were tested, were identified. The random forest algorithm classified spectra of healthy patients without therapy and patients with cardiovascular pathologies without therapy with an accuracy of 83.4%. Classification of healthy patients on and off therapy gave an accuracy of 76.26%, and classification of patients with cardiovascular pathologies on and off therapy gave an accuracy of 70%.

Table 5 Classification matrix for separating CVD patients without therapy from CVD patients on therapy.