Near-Infrared Spectroscopy Enables Arthroscopic Histologic Grading of Human Knee Articular Cartilage

Purpose To develop the means to estimate cartilage histologic grades and proteoglycan content in ex vivo arthroscopy using near-infrared spectroscopy (NIRS). Methods In this experimental study, arthroscopic NIR spectral measurements were performed on both knees of 9 human cadavers, followed by osteochondral block extraction and in vitro measurements: reacquisition of spectra and reference measurements (proteoglycan content, and three histologic scores). A hybrid model, combining principal component analysis and linear mixed-effects model (PCA-LME), was trained for each reference to investigate its relationship with in vitro NIR spectra. The performance of the PCA-LME model was validated with ex vivo spectra before and after the exclusion of outlying spectra. Model performance was evaluated based on Spearman rank correlation (ρ) and root-mean-square error (RMSE). Results The PCA-LME models performed well (independent test: average ρ = 0.668, RMSE = 0.892, P < .001) in the prediction of the reference measurements based on in vitro data. The performance on ex vivo arthroscopic data was poorer but improved substantially after outlier exclusion (independent test: average ρ = 0.462 to 0.614, RMSE = 1.078 to 0.950, P = .019 to .008). Conclusions NIRS is capable of nondestructive evaluation of cartilage integrity (i.e., histologic scores and proteoglycan content) under similar conditions as in clinical arthroscopy. Clinical Relevance There are clear clinical benefits to the accurate assessment of cartilage lesions in arthroscopy. Visual grading is the current standard of care. However, optical techniques, such as NIRS, may provide a more objective assessment of cartilage damage.

C urrent diagnostic measures of musculoskeletal disorders involve clinical examination and imaging, e.g., x-ray imaging and magnetic resonance imaging, 1 which lack sensitivity to pinpoint localized defects and their severity (structural damage). 2 Conditions and injuries requiring medical interventions (e.g., a debridement, or cartilage repair) are treated in minimally invasive arthroscopy, often revealing previously unobserved focal cartilage defects. 3 As current arthroscopic measures of evaluation are qualitative and subjective (poor interobserver reliability), 4 a quantitative arthroscopic tool could be beneficial for the objective evaluation of chondral lesions.
Histology is the gold-standard measure of cartilage degeneration that is used to assess structural and compositional changes within the tissue. 5 The most commonly used histologic grading systems are the modified Mankin score (0-13), 6 Osteoarthritis Research Society International (OARSI) grading (0-6), 5 and International Cartilage Repair Society (ICRS) grading (0-4). 7 However, the practical clinical value of histology is limited due to invasive tissue extraction, which can greatly jeopardize cartilage integrity. Therefore, a nondestructive technique capable of providing an evaluation similar to histology would be of great value.
Optical techniques, such as near-infrared spectroscopy (NIRS) 8 and optical coherence tomography, 9 are capable of beyond-surface evaluation, making these techniques potentially superior to the current standard of visual evaluation during arthroscopy. The techniques use the nonionizing region of light (i.e., no deleterious effects) and the measurement can be performed in seconds. Previously, NIRS has been successfully applied for tissue diagnostics in the laboratory environment [10][11][12][13] and more recently in a few in vivo applications 14,15 by using custom sterilizable probes similar to the traditional arthroscopic hook. Studies also have evaluated the histologic properties of cartilage via NIRS with moderate-to-strong correlations, especially the Mankin score, which is the most common reference for tissue integrity in animals (bovine 16 and ovine 17 ) and humans 16,18,19 in the laboratory environment. In addition, several NIRS studies have focused on the estimation of cartilage composition, especially proteoglycan (PG) content. 8,12,20,21 The clinical in situ application of NIRS requires a pretrained multivariate model (e.g., chemometrics and neural networks), arising from the overlapping nature of spectral peaks in the NIR range, developed using a library of both spectral and reference measurements. 22 Lately, convolutional neural networks (CNNs) have outperformed classical chemometrics methods, 23 such as principal component regression. These approaches, however, are yet to account for the inherent dependencies within the data that can arise, for example, from repeated measures from the same subject 13,24 despecially with valuable human data. Linear mixed-effects (LME) modeling can resolve this dependency problem and is often paired with a dimensionality reduction technique, such as principal component analysis (PCA), 25 due to multicollinearity (i.e., a high number of collinear variables in spectral data).
In addition to the impactful role of the modeling approach, spectral preprocessing has a substantial impact on the signal interpretation and the accuracy of the resulting model. 26 Currently, the optimal preprocessing pipelines are based on an expert opinion and, to some extent, trial-and-error. To ease the decision on optimal pipeline, open-source preprocessing pipelines, such as nippy, 27 have become available to explore vast combinations of different preprocessing operators.
The purpose of this study was to develop the means to estimate cartilage histologic grades and PG content in ex vivo arthroscopy using NIRS. We hypothesized that NIRS would estimate cartilage lesion severity and PG content during an ex vivo arthroscopy.

Methods
In this experimental study, NIR spectra were collected in both ex vivo arthroscopy and in vitro, and the extracted samples were subjected to extensive reference measurements. Ex vivo spectral measurements from several standardized locations (n ¼ 19 [9 in the femur, 8 in the tibia, and 2 in the patella]) were recorded from both knees of human cadavers (N ¼ 9, age ¼ 68.4 AE 7.5 years) by an experienced orthopaedic surgeon (no living cartilage was assessed). 13 The inclusion criteria for the donors were that they were scheduled for a medical obduction, had no history of knee surgery (also visually verified), and had no infectious risks. The examinations and sample extraction of this study were performed before the medical obduction and post-haste after postmortem (max 4 days) during which time the donors were stored at 4 to 7 C. In the arthroscopies, the inferior extremity of the cadaver was freely movable on a straight table and stabilized with a lateral post on the femur to allow valgisating or varisating forces created by the surgeon, to apply the setting of normal knee arthroscopy. Anteromedial and anterolateral 1 cm parapatellar interchangeable portals were created for the conventional arthroscope (4 mm, 30 inclination; Karl Storz GmbH & Co, Tuttlingen, Germany) and the novel NIRS probe, respectively. The knee joint was filled with saline using a hand pump. If necessary, an arthroscopic shaver was used to flush the intra-articular space and resect the liposynovia, thereby clearing the visualization of the areas under examination. The NIRS probe was aligned perpendicular and in contact with the cartilage under the examination based on the visualization of the conventional arthroscope. After the measurements, the condyles of the tibia and femur, and patella were harvested, followed by the extraction of cylindrical osteochondral plugs (diameter ¼ 8 mm, the total number of extracted plugs ¼ 303 after exclusion of 39 plugs due to completely eroded cartilage) with a drill punch machine. The plugs were subjected to in vitro spectral measurements and histologic evaluation. The local research ethics committee (decision number 134/ 2015, Research Ethics Committee of the Northern Savo Hospital District, Kuopio University Hospital, Kuopio, Finland) approved the study. The followed procedures were by the ethical standards of the responsible committee on human experimentation (institutional and national) and with the Helsinki Declaration of 1975, as revised in 2000.

Near-Infrared Spectroscopy
The spectra were collected both ex vivo and in vitro (Fig 1) with the hardware, consisting of 2 spectrometers Avantes BV), and an arthroscopic optical probe. 28 The custom-made probe resembles the conventional arthroscopic hook and has a total of 114 optical fibers (fiber diameter ¼ 100 mm) within the sterilizable stainless-steel housing (outer diameter ¼ 3.25 mm). Thinner fibers (with a smaller minimum bend radius) were used to support the hook-based design.
During the ex vivo spectral measurements, joint cavities were irrigated and distended with saline similarly to routine clinical arthroscopy to enhance the visibility of the articulating surfaces and to ease probe alignment. The probe was aligned perpendicular and in contact with cartilage surface under the guidance of a conventional endoscope (4 mm, 30 inclination, Karl Storz GmbH & Co.) to prevent spectral saturation from the fluid environment (effective absorber of NIR light). A total of 15 spectra were recorded per location with each spectrum consisting of 10 coadded spectra (acquisition time per location ¼ 2.4 seconds).
The spectral measurements were repeated in vitro on osteochondral plugs extracted from the same ex vivo measurement locations. Contrary to the ex vivo measurements where optimal probe alignment could not be always ensured, in in vitro measurements the sample plugs were fixed into a goniometer (#55-841; Edmund Optics Inc., Barrington, NJ) to achieve reliable contact between the probe and the sample surface.

Spectral Preprocessing
Before preprocessing, data from the spectral region of 0 mm was excluded due to the poor signal-to-noise ratio (high water absorption).

Histology
The osteochondral plugs were halved and one half was decalcified in ethylenediaminetetraacetic acid, cut into 3-mm thick sections (n ¼ 3), and stained with Safranin-O (attracted by PGs). Digital densitometry system, consisting of a light microscope (Nikon Microphot-FXA; Nikon Co., Tokyo, Japan) with a monochromatic light source (wavelength 492 AE 8 nm) and a 12-bit CCD camera (ORCA-ER; Hamamatsu Photonics K.K., Hamamatsu, Japan), was used to determine sections optical density (OD w PG content). The system was calibrated with neutral density filters (0-3.0). The severity of OA was evaluated with 3 histologic grading systems: modified Mankin score, 6 OARSI grading, 5 and ICRS grading 7 (Fig 2). Four independent assessors (M.P., R.S., M.H., and N.H.) scored the sections in a randomized order and the final score for each sample was determined as the average over the 3 sections. The same sections were used for digital densitometry and histologic grading.

Regression Analysis
Before regression analysis, spectral (multivariate) and reference parameter (univariate) outliers were investigated. The in vitro spectra were visualized to ensure the absence of hardware-related recording errors. In the preprocessing of ex vivo spectra, 10 hardware-related outliers were identified and excluded. For univariate values, the normality of distribution (non-normal for all references) was determined with the one-sample KolmogoroveSmirnov test and, thus, any values exceeding 3 median absolute deviations from the reference median were excluded. PCA was used due to its ability to reduce the high dimensionality and collinearity of datasets, thereby enabling less computationally exhaustive modeling and outlier estimation, as well as reducing the chances of overfitting. PCA scores along with the nested data (i.e., patient, left/right, femur/tibia/patella, and measurement site) were used as inputs for the LME model.
In the modeling, 8 cadavers were assigned as the training set and a single cadaver was assigned as the independent test set. The test set was subsequently changed (9 iterations) until all cadavers were used (also known as nested cross-validation). The in vitro models were calibrated and optimized using 10-fold crossvalidation with the number of PCA scores limited to 12. The model with the smallest root-mean-square error of cross-validation (RMSECV) was selected to minimize overfitting. The optimal preprocessing pipeline was selected based on the highest median Spearman rank correlation in the independent test set.

Arthroscopic Outlier Detection
Due to the relatively narrow joint cavities and limited field of view in the ex vivo measurements, optimal contact between the cartilage surface and probe could not always be ensured. Furthermore, the high water content of cartilage makes the spectral separation between the measurements with good and bad contact especially challenging and, thus, a classifier was trained to identify spectra with non-optimal contact. The performance of classifiers, including fine k-nearest neighbors (kNN), weighted kNN, and support vector machines (SVM), was investigated due to their superior performance in the initial testing. The classifier optimization was performed as follows: the cross-validated PCA-LME model was used to predict the properties based on the ex vivo spectra of the 8 cadavers (same as in training). If the error between the predicted and reference value was greater than a set threshold (2 Â RMSECV, 3 Â RMSECV, or half set as outliers), the label was set to 1 (¼ outlier), otherwise to 0. These labels along with PCA scores (N ¼ 12) of ex vivo spectra were then used to train a 10-fold cross-validated classifier, which was used to classify the remaining independent arthroscopic measurements (one cadaver). The effect of preprocessing pipeline on classifier performance was also investigated. Ultimately, the performance of the same retained locations in both in vitro and ex vivo should be equaldthis was used as an indicator (along with classifier accuracy and F1-score) to determine the optimal combination of algorithm, threshold, and preprocessing. The combinations that classified <10% or >90% of ex vivo spectra as outliers were not included as these were not considered realistic.

Statistical Analysis
The PCA-LME model performance was evaluated based on performance in calibration (Spearman rank correlation [r], RMSECV) and the independent test (r, RMSE). As a statistic, the median was chosen over the average as it is less susceptible to outliers. Model and classifier training were performed in MATLAB (R2020b, MathWorks, Natick, MA). The level of significance was set at P < .05. Data of the current study are available from the corresponding author on reasonable request.

Results
The distributions of reference properties were all nonnormal with a single outlier detected with the Mankin score (Fig 2). The optimal PCA-LME models had a moderate performance with the histology scores (i.e., ICRS, OARSI, and Mankin) and slightly poorer performance with PG content (Table 1: PCA-LME). The optimal preprocessing pipelines always included scatter correction (i.e., LSNV); furthermore, for the histologic scores, the combination of spectral regions of 0.70 to 1.375 and 1.525 to 1.90 mm was optimal ( Table 2). The optimal preprocessing pipelines for ICRS and Mankin scores were identical.
Before outlier classification, the initial performance of the PCA-LME models on ex vivo spectra was assessed (Table 1: All Test). The optimal combination of preprocessing, classifier algorithm, and threshold substantially improved model performance on the ex vivo performance of the independent test set (r ¼ 0.462 to 0.614, RMSE ¼ 1.078 to 0.950, P ¼ .019 to .008, Fig 3). To better compare in vitro and ex vivo performance in the test set, a comparison of same retained locations revealed slightly different performance (r ¼ 0.660, RMSE ¼ 0.849, P ¼ .001 vs r ¼ 0.578, RMSE ¼ 0.951, P ¼ .018, respectively). In addition, to estimate model reliability, the prediction error was assessed in 4 classes (i.e., dividing reference ranges to four equally spaced subranges), which revealed the prediction error to be smallest with the 2 middle classes. Although the percentage of outlier spectra (Table 1: OutlierS) was relatively high, the exclusion percentage of measurement locations was substantially lower (Table 1: OutlierN). The variability in the percentage of outliers between reference properties relates to the differences in spectral preprocessing, the performance of the PCA-LME model (RMSECV as a metric for data labeling), and the accuracy of the classifier (prediction reliability).
The optimal preprocessing pipelines for classification systematically included scatter correction (i.e., LSNV) and smoothing (i.e., no derivative preprocessing, Table 2). The outlier thresholds of 3 Â RMSECV and 50% were optimal for the histology scores and PG content, respectively. The optimal classifiers for PG content, ICRS, OARSI, and modified Mankin score were fine-kNN, SVM, weighted kNN, and fine-kNN, respectively. Overall, none of the algorithms performed systematically better than the others and SVM classified more spectra as outliers than the kNN algorithms. We also investigated classifier performance when using the same preprocessing pipeline as in the modelling; however, the ex vivo performance was systematically worse compared with the optimized pipeline (r ¼ 0.452 to 0.614, RMSE ¼ 1.032 to 0.950, P ¼ .027 to .008).
Outlier exclusion decreased the standard deviation of ex vivo spectra by 23.3% and also decreased the absorption at 1.4 mm (water peak , Fig 1), depicting the exclusion of spectra with saline interference (i.e., more water). Most importantly, the visualization of ex vivo spectra before and after outlier exclusion confirmed that extreme spectra were excluded.

Discussion
In this study, we demonstrate that NIRS is capable of estimating cartilage lesion severity and PG content during an ex vivo arthroscopy, thus validating the hypothesis. The in vitro performance of PCA-LME models was moderate to strong and the ex vivo arthroscopic performance was slightly poorer, which was nevertheless substantially improved by excluding outlying spectra. We therefore, believe the technique could provide previously unobtainable diagnostic information during arthroscopic surgery.
NIRS has been previously applied to estimate cartilage histologic scores in animals 29 and humans 16,18,19 in vitro. In addition, ICRS scores based on visual evaluation in arthroscopy have been associated with in vitro 19,30 and in vivo 31 NIRS measurements. The human studies 16,18,19,30,31 have employed varying spectral ranges and analysis techniques with none accounting for the spatial dependency caused by multiple measurements per subject or validating model performance by independent testing. McGoverin et al. 16 30 reported a negative relationship (r ¼ e0.47) between arthroscopic ICRS score and a spectral characteristic value (estimated as a linear combination of 2 spectral peaks), whereas Spahn et al. 31 reported a positive association with spectral ratio (ratio of the same 2 peaks). Direct comparison of the aforementioned studies is limited due to their simplistic analysis approach. Overall, the findings of this study agree with in vitro performance of previous studies and extend technique validity and application for in vivo application.
Previous studies associating PG content with NIR spectra have focused on bovine 20,21 and equine cartilage. 8,12 Few studies have also investigated the PG content of engineered cartilage constructs. [32][33][34] The bovine studies of Brown et al. 20 and Afara et al. 21 both used the spectral region 0.8 to 2.5 mm, PLS regression, and assessed PG content after various stages of artificial depletion. Brown et al. 20 presented a significant distinction between normal and PG depleted samples, whereas Afara et al. 21 demonstrated a superior cross-validated performance (calibration R 2 ¼ 93.76, RMSECV ¼ 0.573) based on data within the spectral regions 0.8 to 1.0 mm and 1.55 to 1.84 mm. The calibration correlation of Afara et al. 21 is substantially greater than that found in this study (r ¼ 0.739, RMSECV ¼ 0.185), whereas the crossvalidated errors are similar, considering the ranges of OD values (0.2-1.5) and PG scores (0-4) in this and their study, 21 respectively. The higher correlation could be explained by their use of leave-one-out (vs k-fold) cross-validation and smaller sample size. Sarin et al. 8 applied a similar cross-validation scheme and also validated the performance for arthroscopy on equine. The validation performance of their CNN (r ¼ 0.691, RMSECV ¼ 0.274) was fairly similar. Furthermore, the validation performance on arthroscopic spectra was inferior to the aforementioned performance, similar to the present study. Most importantly, it's evident that prediction of PG content is possible and reproducible both for equine and human cartilage; although, their thickness varies greatly (0.14-1.36 mm and 1.23-5.90 mm, respectively 8,13 ). Due to the superior performance of sophisticated machinelearning approaches over conventional chemometrics techniques, 35 the combination of CNN and protocols accounting for data dependency could benefit future studies.
The NIRS literature of the aforementioned publications provides the groundwork for clinical adaptation of NIRS in cartilage assessment. Nowadays, arthroscopic cartilage evaluation usually relies on visual estimation and instrument palpation, which has limited reliability in cartilage defect grading. 4,36 Here, NIRS enables the estimation of PG content and defect severity in a situation resembling that of arthroscopic surgery without the need for destructive sample extraction. This nondestructive evaluation could enable the objective classification of cartilage defects and further enable the monitoring of potential treatment options in cartilage repair. In addition, the technique enables the mapping of the extent of traumatic cartilage injuries that would assist in the treatment decision-making and open new possibilities to evaluate the prognosis of a cartilage defect in joint trauma. For example, post-traumatic cartilage degeneration is a well-known consequence of an anterior cruciate ligament rupture, even after a successful anterior cruciate ligament reconstruction. 37,38 NIRS could enable the evaluation of this post-traumatic process in the knee and other joints.
Outlier detection is essential in both the initial model training and validation, followed by ensuring the validity of new data (i.e., no unreliable predictions). For this study, median-based statistics were chosen due to their inherent property of being less sensitive to outliers (compared with average). Estimation of spectral (multivariate) outliers is challenging and, thus, dimension reduction techniques, such as PCA, have been popular. 8,13,15 Furthermore, due to the high water content of cartilage, the spectral separation between the measurements with poor and good contact is especially challenging and requires further validation (i.e., a designated study). Therefore, the spectral outliers highly resemble the nonoutlier spectra. In 2 studies by Sarin et al., 8 a 3-dimensional volume was created based on PCA scores of in vitro data and arthroscopic spectra falling outside this volume were deemed outliers. The exclusion percentages were 4.5% to 23.5% and 3.1%. 8,15 Prakash et al. 13 used a similar outlier exclusion as in this study with similar accuracy of 50% to 70% in the test set but a relatively lower percentage of outliers (33%). Interestingly, after outlier exclusion, their correlation coefficients improved but error variance substantially increased. In this study, both the error and its variance decreased after outlier exclusion. Several studies 14,31,39 focusing on in vivo spectral measurements do not provide any means for outlier exclusion, although highlighting the difficulty of in vivo measurements. 39 In future studies, outlier detection may also benefit from the latest innovations, such as extended isolation forests. 40

Limitations
Some limitations were evident in this study. The number of cadavers was relatively low and could lead to limited range (nonrepresentative) of reference properties; however, several locations were assessed in extensive laboratory measurements, thereby sufficiently increasing the number of observations. A relatively high percentage of outlier spectra depicts the challenge in probe alignment within the joint cavity to ensure optimal contact with the cartilage surface, as also highlighted by Spahn et al. 39 However, the percentage of outlier locations (i.e., all 15 spectra excluded) was substantially lower, indicating that a successful measurement was acquired in most of the locations.

Conclusions
NIRS is capable of nondestructive evaluation of cartilage integrity (i.e., histologic scores and PG content) under similar conditions as in clinical arthroscopy.