# Advantages of deep learning with convolutional neural network in detecting disc displacement of the temporomandibular joint in magnetic resonance imaging

The research protocol for this study has been reviewed to ensure it adheres to the principles of the Declaration of Helsinki and has been approved by the Institutional Review Board of Kyung Hee University Dental Hospital in Seoul, South Korea (KHD IRB, IRB No-KH-DT21022). Informed consent was obtained from all participants.

### Study population

Figure 1 shows a flow chart of the current study. The study population consisted of 1,260 patients with TMD (861 males and 399 females; mean age = 37.33 ± 18.83 years) who visited Kyung Hee University Dental Hospital between January 2017 and July 2021. A TMD specialist with > 7 years of experience diagnosed TMD based on the diagnostic criteria for TMD Axis I.22

The exclusion criteria were serious previous injuries such as unstable multiple traumas and facial fractures; systemic diseases that may affect the TMJ, such as rheumatoid disease and systemic osteoarthritis; psychological problems; pregnancy; and psychiatric or neurological disorders. Cases where the TMJ discs were not observed on MRI and where neither signal strength nor contour could define the structure as a TMJ disc were also excluded.

Of the total of 1,260 patients (2,520 TMJs), 2,051 bilateral proton density MRI images from 1,026 patients (81.4%) who visited the hospital between January 2017 and January 2021 formed the training set, while 468 images from 234 patients (18.6%) who visited the hospital between February 2021 and July 2021 formed the evaluation dataset. When training the CNN models, 20% of the training set was used for validation (Fig. 1).

### MRI image acquisition

All patients underwent MRI examinations of the bilateral TMJ. The MR images were acquired using a 3.0T MRI system (Genesis Signa; GE Medical System) with a 6 cm × 8 cm surface coil. All scans involved sagittal oblique sections of ≤ 3 mm, a field of view of 15 cm, and a matrix of 256 × 224. T2-weighted images (T2WIs) were acquired using a 2,650/82 TR/TE sequence; T1-weighted images (T1WIs) were acquired using a 650/14 TR/TE sequence; and proton density images were obtained using a 2,650/82 TR/TE sequence. Spin echo sagittal MR images were acquired using an axial localizer.

### Accurate determination of TMJ disc displacement

The left and right TMJs of each patient were assessed separately for ADD. The MRI observation indicators for the TMJ in patients with TMD were:23

1. Non-ADD: the posterior ligament of the articular disc was at the 12 o'clock position relative to the condylar apex in the closed-mouth position, and the junction of the posterior band and the bilaminar zone lay between the 10 and 12 o'clock positions.

2. ADD: the posterior band of the disc was displaced anteriorly, beyond the normal range, in the closed-mouth position (Fig. 2).

The ADD was determined for the right and left sides of each patient. All T1-weighted images (T1WIs), T2WIs, and proton density images were consulted to determine the presence or absence of ADD as labels for the deep learning models. All MRI examinations and interpretations were performed by two investigators with > 7 years of experience in head and neck MRI. Internal consistency was assessed using Cronbach's α, and test–retest reliability was assessed using the intraclass correlation coefficient (ICC); the ICC was 0.91. Any disagreement in the MRI measurements for ADD was resolved through discussion until a consensus was reached. No posterior disc displacement was observed in this study.

### Interpreting ADD with CNN models

Because the CNN models performed better on proton density MR images than on T2WIs and T1WIs, we used proton density MR images. The input MR images were preprocessed as follows: each image was first scaled down to 224 × 224 pixels and then converted to a three-channel image, with the same grayscale image copied into each channel. As a result, the dimensions of the inputs were 224 × 224 × 3.
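The preprocessing step above can be sketched in a few lines of numpy. This is an illustrative sketch only: nearest-neighbour downscaling stands in for whatever interpolation was actually used, and the input array is a dummy image, not a real scan.

```python
import numpy as np

def preprocess(image: np.ndarray, size: int = 224) -> np.ndarray:
    """Scale a grayscale MR image down to size x size (nearest-neighbour
    for brevity) and replicate it across three identical channels."""
    h, w = image.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = image[rows[:, None], cols]     # (size, size)
    return np.stack([resized] * 3, axis=-1)  # (size, size, 3)

# A dummy 512 x 448 "scan" stands in for a real proton density image.
dummy = np.random.rand(512, 448)
x = preprocess(dummy)
print(x.shape)  # (224, 224, 3)
```

Replicating the single grayscale channel three times simply matches the three-channel input shape that ImageNet-pre-trained models expect.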

The pre-trained VGG16 model was used for image classification. VGG16 is a CNN architecture that achieved top results in the 2014 ILSVRC competition and is regarded as one of the best vision model architectures to date. VGG16 trained a network twice as deep as the earlier 8-layer AlexNet model, halving the error rate.24 VGG16 consists of 13 convolutional layers and three fully connected layers, using 3 × 3 convolutional filters with a stride of 1 and padding of 1, 2 × 2 max pooling, and rectified linear unit (ReLU) activations.25 We chose this model for its simple structure, since our interest lay not only in achieving high AUC scores but also in analyzing the learned features and activation maps.
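The dimension bookkeeping behind this architecture is easy to verify: 3 × 3 convolutions with stride 1 and padding 1 preserve spatial size, while each 2 × 2 max pool halves it. The short sketch below assumes the standard VGG16 block layout (five blocks of 2, 2, 3, 3, 3 convolutions, each followed by a pool), which is the usual configuration rather than anything stated in the text.

```python
def conv_out(n: int, k: int = 3, stride: int = 1, pad: int = 1) -> int:
    """Output size of a square convolution: floor((n + 2p - k)/s) + 1."""
    return (n + 2 * pad - k) // stride + 1

def pool_out(n: int, k: int = 2, stride: int = 2) -> int:
    """Output size of a square max pool."""
    return (n - k) // stride + 1

size = 224
# Standard VGG16: five blocks of (2, 2, 3, 3, 3) conv layers,
# each block followed by a 2x2 max pool.
for convs in (2, 2, 3, 3, 3):
    for _ in range(convs):
        size = conv_out(size)  # 3x3, stride 1, pad 1 -> size unchanged
    size = pool_out(size)      # 2x2 pool -> size halved
print(size)  # 7: the 7x7 feature map fed to the fully connected layers
```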

Three different machine learning schemes were tested. The first, "fine-tuning", trained all layers of the pre-trained model. The second, "from scratch", trained the model without applying pre-trained weights. The third, "freeze", trained only the last layer of the pre-trained model, keeping the other layers frozen. The evaluation statistics were the AUC and accuracy.26 The accuracy, specificity, and sensitivity of the models were obtained at the optimal cutoff value determined by Youden's index calculated on the validation set.27 All three schemes applied the same data augmentation techniques with 32 samples per batch. The fine-tuning and from-scratch models used a learning rate of 1e-4 with 15 and 30 epochs, respectively, while the freeze model used a learning rate of 5e-4 with 150 epochs. All three schemes used the Adam optimizer.

### Ensemble model with data augmentation

An additional ensemble method was used to test whether the prediction performance of the single fine-tuned model could be improved. Three different data augmentation techniques were applied to train three fine-tuning models, and their predicted outputs were averaged (Fig. 3). This "data ensemble" derives from the idea that training on diversified data improves generalization performance more than applying a single CNN model.28
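Operationally, the data ensemble is just an average of the three models' class probabilities, followed by thresholding. A minimal sketch, with fabricated probabilities standing in for the outputs of the three fine-tuned models (the augmentation names and 0.5 cutoff are illustrative assumptions):

```python
import numpy as np

# Hypothetical ADD probabilities from three models trained with
# different augmentation pipelines, for four test images.
preds_flip = np.array([0.92, 0.15, 0.60, 0.40])
preds_rotate = np.array([0.88, 0.25, 0.70, 0.35])
preds_shift = np.array([0.90, 0.20, 0.50, 0.45])

# The ensemble prediction is the element-wise mean of the three outputs.
ensemble = np.mean([preds_flip, preds_rotate, preds_shift], axis=0)
labels = (ensemble >= 0.5).astype(int)  # illustrative 0.5 cutoff
print(ensemble)
print(labels)
```

Averaging probabilities (rather than majority-voting hard labels) preserves each model's confidence and tends to smooth out individual models' errors.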

### Visual analysis to specify significant regions using Grad-CAM

To better understand the learned features of the fine-tuning and from-scratch models, we analyzed their Grad-CAM images. We compared the same sample images that both models correctly predicted as positive. Grad-CAM identifies the region most significant for a prediction by computing the importance weight $${\alpha }_{k}^{c}$$, the spatial mean of the gradient $$\frac{\partial {y}^{c}}{\partial {A}^{k}}$$, where $${y}^{c}$$ is the logit for class $$c$$ and $${A}^{k}$$ is the $$k$$-th activation map. The Grad-CAM heat map $${L}^{c}$$ is then obtained as follows:

$${L}^{c}=ReLU\left({\sum }_{k}{\alpha }_{k}^{c}{A}^{k}\right).$$

Because the heat map $${L}^{c}$$ visualizes the pixels that change $${y}^{c}$$ the most, overlaying the heat map on the input image reveals the region most important for the prediction. For visualization, we present representative Grad-CAM images.
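Given the activation maps $$A^k$$ and the pooled gradient weights $$\alpha_k^c$$, the heat map formula above is a weighted sum followed by a ReLU. A numpy sketch with toy tensors (the activation values and weights are made up; in practice the gradients come from backpropagating the class logit):

```python
import numpy as np

def grad_cam(activations: np.ndarray, alphas: np.ndarray) -> np.ndarray:
    """L^c = ReLU(sum_k alpha_k^c * A^k).

    activations: (K, H, W) stack of activation maps A^k
    alphas:      (K,) importance weights alpha_k^c
    """
    weighted = np.tensordot(alphas, activations, axes=1)  # (H, W)
    return np.maximum(weighted, 0.0)                      # ReLU

# Toy example: two 2x2 activation maps and their weights.
A = np.array([[[1.0, -2.0],
               [0.5, 3.0]],
              [[2.0, 1.0],
               [-1.0, 0.0]]])
alpha = np.array([0.5, -1.0])
print(grad_cam(A, alpha))  # [[0.  0.  ], [1.25 1.5 ]]
```

The ReLU keeps only the regions whose activations push the class score up; the resulting map is then upsampled to the input resolution and overlaid on the MR image.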

### Validation of MRI findings by human experts

Finally, we compared the prediction results of the CNN models with those of two human experts evaluating the same test set. Under the same conditions as the CNN models, using only proton density MR images of the TMJ of patients with TMD, the experts diagnosed either non-ADD or ADD. The accuracy, specificity, and sensitivity were compared between the CNN models and the human experts. The human experts were blinded to each other's readings and relied on their knowledge and experience when interpreting the MR images. Their ICC was 0.84.

### Statistical methods

Descriptive statistics are reported as means ± standard deviations or as numbers with percentages, as appropriate. To analyze the distribution of categorical data, we used χ² tests for equality of proportions, Fisher's exact tests, and Bonferroni corrections. All statistical analyses were performed using IBM SPSS Statistics for Windows, version 22.0 (IBM Corp., Armonk, NY, USA), R version 4.0.2 (R Foundation for Statistical Computing, Vienna, Austria), and Python version 3.9.7 (Python Software Foundation, DE, USA). A receiver operating characteristic (ROC) curve was plotted and the AUC was calculated for each model, with AUC = 0.5 indicating no discrimination, 0.6 ≥ AUC > 0.5 indicating poor discrimination, 0.7 ≥ AUC > 0.6 acceptable discrimination, 0.9 ≥ AUC > 0.7 excellent discrimination, and AUC > 0.9 outstanding discrimination.29 McNemar's test was used to compare the prediction accuracies of the CNN models with those of the human experts. Statistical significance was set at a two-tailed p-value of < 0.05.
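The AUC itself can be computed without plotting the ROC curve at all, as the probability that a randomly chosen ADD case receives a higher score than a randomly chosen non-ADD case (the Mann–Whitney formulation). A small sketch with toy scores, not the study's data:

```python
import numpy as np

def auc(y_true, scores) -> float:
    """AUC as P(score_pos > score_neg), counting ties as 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Toy test labels (1 = ADD) and model scores.
labels = [0, 0, 1, 1, 1]
scores = [0.2, 0.6, 0.4, 0.7, 0.9]
print(auc(labels, scores))  # 5 of 6 positive/negative pairs ranked correctly
```

This pairwise definition is equivalent to the area under the plotted ROC curve, which is why AUC is threshold-free while accuracy, sensitivity, and specificity all depend on the chosen cutoff.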
