Research on gastric cancer lymph node metastasis prediction model based on machine learning algorithms
Abstract
Objective: To construct and validate prediction models for lymph node metastasis in gastric cancer based on four machine learning (ML) algorithms.
Methods: Clinical data of 531 patients who underwent radical gastrectomy for gastric cancer were collected retrospectively, and the patients were randomly divided in a 3:1 ratio into a training set (399 patients) and a test set (132 patients). Univariate analysis was used to select the feature variables associated with lymph node metastasis. Logistic regression, random forest, K-nearest neighbor, and support vector machine models were then built, and variable importance was ranked. All ML models were validated in the test set, receiver operating characteristic (ROC) curves were plotted, and the optimal ML model was determined from the area under the curve (AUC), sensitivity, specificity, and accuracy. A nomogram was constructed from the variable importance ranking of the optimal ML model, and its discrimination, calibration, and clinical applicability were evaluated with ROC curves, calibration curves, and decision curves.
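The 3:1 split described in the Methods can be sketched as follows. The source says only that patients were randomly divided in a 3:1 ratio. Sampling each class separately and rounding each class's training share up are assumptions, chosen here because they reproduce the reported set sizes (399 and 132) and the per-class counts in Table 1. The function name `stratified_split` and the seed are hypothetical.

```python
import math
import random


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Split indices into train/test, sampling each class separately.

    Assumption: each class contributes ceil(train_fraction * class_size)
    cases to the training set; the source does not state its procedure.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_train = math.ceil(train_fraction * len(shuffled))
        train.extend(shuffled[:n_train])
        test.extend(shuffled[n_train:])
    return sorted(train), sorted(test)


# Class sizes from Table 2: 243 patients without and 288 with metastasis.
labels = [0] * 243 + [1] * 288
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 399 132
```

With these class sizes the training set receives 183 and 216 cases and the test set 60 and 72, the same counts as the lymph node metastasis row of Table 1.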
Results: Among the four ML models, the random forest model was optimal. In the training set its accuracy, sensitivity, and specificity were 72.7%, 69.9%, and 75.0%, respectively, with an AUC of 0.803; in the test set they were 64.4%, 66.7%, and 62.5%, respectively, with an AUC of 0.751. A nomogram was constructed from the variables of the random forest model. ROC curves showed AUCs of 0.721 and 0.776 for the nomogram in the training set and test set, respectively, and calibration curves and decision curves showed good calibration and clinical applicability in both sets.
Conclusion: The random forest model is the optimal model among the four ML algorithms. The nomogram built on it can predict the risk of lymph node metastasis in gastric cancer with relatively good accuracy, thereby better guiding clinical diagnosis and treatment decisions.
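Accuracy is a class-size-weighted average of sensitivity and specificity, so the three figures reported for each model can be cross-checked against one another. A minimal sketch follows. `accuracy_from_rates` is a hypothetical helper, and the test-set class sizes (60 and 72) are taken from Table 1. The source does not state which class is treated as positive. The reported test-set accuracy of 64.4% for the random forest model is reproduced when sensitivity is computed over the class of 60 and specificity over the class of 72, so that pairing is used here as an assumption.

```python
def accuracy_from_rates(sensitivity, specificity, n_positive, n_negative):
    """Return the overall accuracy implied by sensitivity and specificity."""
    correct = sensitivity * n_positive + specificity * n_negative
    return correct / (n_positive + n_negative)


# Random forest model, test set (reported: sensitivity 66.7%,
# specificity 62.5%, accuracy 64.4%).
# Assumption: sensitivity refers to the class of 60 cases and
# specificity to the class of 72 cases.
acc = accuracy_from_rates(0.667, 0.625, n_positive=60, n_negative=72)
print(f"{acc:.3f}")  # 0.644
```

The same check on the training set, with class sizes 183 and 216, gives about 0.727, which matches the reported 72.7%.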
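Table 1. Comparison of clinical data between patients in the training set and the test set [n (%)] [M (P25, P75)]

| Clinical data | Category | Training set (n=399) | Test set (n=132) | χ²/Z | P |
|---|---|---|---|---|---|
| Sex | Male | 312 (78.2) | 102 (77.3) | 0.049 | 0.825 |
| | Female | 87 (21.8) | 30 (22.7) | | |
| Age (years) | | 58.0 (50.0, 63.5) | 59.0 (50.5, 64.0) | -0.650 | 0.516 |
| Ethnicity | Han | 304 (76.2) | 106 (80.3) | 2.365 | 0.500 |
| | Hui | 42 (10.5) | 8 (6.1) | | |
| | Tibetan | 37 (9.3) | 13 (9.8) | | |
| | Other | 16 (4.0) | 5 (3.8) | | |
| Body mass index (kg/m²) | | 22.5 (20.4, 24.9) | 22.5 (20.7, 24.9) | -0.086 | 0.932 |
| Hypertension | No | 348 (87.2) | 116 (87.9) | 0.0393 | 0.843 |
| | Yes | 51 (12.8) | 16 (12.1) | | |
| Diabetes | No | 382 (95.7) | 121 (91.7) | 3.294 | 0.070 |
| | Yes | 17 (4.3) | 11 (8.3) | | |
| Albumin (g/L) | | 39.3 (36.8, 41.9) | 39.3 (36.7, 41.5) | -0.684 | 0.494 |
| Tumor markers | AFP (ng/mL) | 2.2 (1.7, 3.1) | 2.3 (1.6, 3.3) | -0.106 | 0.916 |
| | CEA (ng/mL) | 2.0 (1.3, 3.5) | 2.2 (1.4, 3.7) | -1.169 | 0.242 |
| | CA125 (U/mL) | 11.0 (8.1, 16.4) | 11.1 (8.4, 14.6) | -0.461 | 0.645 |
| | CA199 (U/mL) | 7.2 (3.3, 18.4) | 9.8 (3.2, 24.7) | -1.524 | 0.127 |
| Tumor diameter | < 2 cm | 40 (10.0) | 10 (7.6) | 0.698 | 0.404 |
| | ≥ 2 cm | 359 (90.0) | 122 (92.4) | | |
| Tumor location | Upper third of stomach | 93 (23.3) | 40 (30.3) | 2.590 | 0.274 |
| | Middle third of stomach | 282 (70.7) | 85 (64.4) | | |
| | Lower third of stomach | 24 (6.0) | 7 (5.3) | | |
| Gross type | Ulcerative | 350 (87.7) | 115 (87.1) | 1.355 | 0.508 |
| | Protruding | 30 (7.5) | 13 (9.8) | | |
| | Infiltrative | 19 (4.8) | 4 (3.0) | | |
| Differentiation | Well differentiated | 29 (7.3) | 17 (12.9) | 4.803 | 0.091 |
| | Moderately differentiated | 221 (55.4) | 63 (47.7) | | |
| | Poorly differentiated | 149 (37.3) | 52 (39.4) | | |
| Lymphovascular invasion | No | 148 (37.1) | 53 (40.2) | 0.394 | 0.530 |
| | Yes | 251 (62.9) | 79 (59.8) | | |
| Perineural invasion | No | 174 (43.6) | 64 (48.5) | 0.953 | 0.329 |
| | Yes | 225 (56.4) | 68 (51.5) | | |
| T stage | T1a | 13 (3.3) | 6 (4.5) | 5.143 | 0.273 |
| | T1b | 5 (1.3) | 4 (3.0) | | |
| | T2 | 122 (30.6) | 44 (33.3) | | |
| | T3 | 32 (8.0) | 5 (3.8) | | |
| | T4 | 227 (56.9) | 73 (55.3) | | |
| Lymph node metastasis | No | 183 (45.9) | 60 (45.5) | 0.006 | 0.935 |
| | Yes | 216 (54.1) | 72 (54.5) | | |

AFP: alpha-fetoprotein; CEA: carcinoembryonic antigen; CA125: carbohydrate antigen 125; CA199: carbohydrate antigen 199.

Table 2. Comparison of clinical data between patients without and with lymph node metastasis [n (%)] [M (P25, P75)]

| Feature | Category | No lymph node metastasis (n=243) | Lymph node metastasis (n=288) | χ²/Z | P |
|---|---|---|---|---|---|
| Sex | Male | 192 (79.0) | 222 (77.1) | 0.285 | 0.593 |
| | Female | 51 (21.0) | 66 (22.9) | | |
| Age (years) | | 58.0 (50.0, 63.0) | 59.0 (50.0, 64.0) | -0.593 | 0.553 |
| Ethnicity | Han | 195 (80.2) | 215 (74.7) | 2.370 | 0.499 |
| | Hui | 20 (8.2) | 30 (10.4) | | |
| | Tibetan | 20 (8.2) | 30 (10.4) | | |
| | Other | 8 (3.3) | 13 (4.5) | | |
| Body mass index (kg/m²) | | 22.4 (20.7, 24.7) | 22.5 (20.2, 25.0) | -0.189 | 0.850 |
| Hypertension | No | 213 (87.7) | 251 (87.2) | 0.030 | 0.862 |
| | Yes | 30 (12.3) | 37 (12.8) | | |
| Diabetes | No | 231 (95.1) | 272 (94.4) | 0.101 | 0.751 |
| | Yes | 12 (4.9) | 16 (5.6) | | |
| Albumin (g/L) | | 39.4 (36.8, 41.5) | 39.3 (36.6, 42.1) | -0.065 | 0.948 |
| Tumor markers | AFP (ng/mL) | 2.3 (1.8, 3.2) | 2.2 (1.6, 3.0) | -1.219 | 0.223 |
| | CEA (ng/mL) | 2.0 (1.3, 3.1) | 2.0 (1.3, 4.0) | -1.328 | 0.184 |
| | CA125 (U/mL) | 10.5 (7.8, 14.3) | 11.4 (8.4, 17.0) | -2.385 | 0.017 |
| | CA199 (U/mL) | 7.3 (3.3, 14.4) | 9.2 (3.2, 28.3) | -1.927 | 0.054 |
| Tumor diameter | < 2 cm | 29 (11.9) | 21 (7.3) | 3.330 | 0.068 |
| | ≥ 2 cm | 214 (88.1) | 267 (92.7) | | |
| Tumor location | Upper third of stomach | 59 (24.3) | 74 (25.7) | 0.203 | 0.903 |
| | Middle third of stomach | 169 (69.5) | 198 (68.8) | | |
| | Lower third of stomach | 15 (6.2) | 16 (5.6) | | |
| Gross type | Ulcerative | 216 (88.9) | 249 (86.5) | 8.582 | 0.014 |
| | Protruding | 23 (9.5) | 20 (6.9) | | |
| | Infiltrative | 4 (1.6) | 19 (6.6) | | |
| Differentiation | Well differentiated | 24 (9.9) | 22 (7.6) | 1.318 | 0.517 |
| | Moderately differentiated | 132 (54.3) | 152 (52.8) | | |
| | Poorly differentiated | 87 (35.8) | 114 (39.6) | | |
| Lymphovascular invasion | No | 139 (57.2) | 62 (21.5) | 71.299 | < 0.001 |
| | Yes | 104 (42.8) | 226 (78.5) | | |
| Perineural invasion | No | 140 (57.6) | 98 (34.0) | 29.644 | < 0.001 |
| | Yes | 103 (42.4) | 190 (66.0) | | |
| T stage | T1a | 16 (6.6) | 3 (1.0) | 41.015 | < 0.001 |
| | T1b | 6 (2.5) | 3 (1.0) | | |
| | T2 | 100 (41.2) | 66 (22.9) | | |
| | T3 | 16 (6.6) | 21 (7.3) | | |
| | T4 | 105 (43.2) | 195 (67.7) | | |

Table 3. Predictive performance of the machine learning models in the training set and the test set

| Data set | Model | Accuracy (%) | Sensitivity (%) | Specificity (%) | AUC |
|---|---|---|---|---|---|
| Training set | Logistic regression | 69.2 | 60.7 | 76.4 | 0.727 |
| | Random forest | 72.7 | 69.9 | 75.0 | 0.803 |
| | K-nearest neighbor | 68.9 | 58.5 | 77.8 | 0.772 |
| | Support vector machine | 71.9 | 64.5 | 78.2 | 0.792 |
| Test set | Logistic regression | 71.9 | 66.7 | 76.4 | 0.766 |
| | Random forest | 64.4 | 66.7 | 62.5 | 0.751 |
| | K-nearest neighbor | 58.3 | 48.3 | 66.7 | 0.612 |
| | Support vector machine | 60.6 | 55.0 | 65.3 | 0.637 |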
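The χ² values reported in Tables 1 and 2 for two-category variables can be recomputed from the counts. The sketch below applies a Pearson chi-square test without continuity correction to the sex row of Table 1. The source does not name the test variant, so the uncorrected Pearson form is an assumption, used because it agrees with the reported values for this row. `chi_square_2x2` is a hypothetical helper.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    statistic = numerator / denominator
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Sex row of Table 1: male 312 vs 102, female 87 vs 30
# (training set vs test set).
stat, p = chi_square_2x2(312, 102, 87, 30)
print(f"chi2 = {stat:.3f}, P = {p:.3f}")
```

For these counts the statistic comes out at about 0.049 and the P value at about 0.82, in line with the reported 0.049 and 0.825.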