Research on gastric cancer lymph node metastasis prediction model based on machine learning algorithms

SHI Haomin, YAN Su, QIAO Mengmeng, YANG Huilian

Citation: SHI Haomin, YAN Su, QIAO Mengmeng, YANG Huilian. Research on gastric cancer lymph node metastasis prediction model based on machine learning algorithms[J]. Journal of Clinical Medicine in Practice, 2024, 28(1): 41-47, 61. DOI: 10.7619/jcmp.20233076


Details
    Corresponding author: YANG Huilian, E-mail: yanghuilian7005@163.com

  • CLC number: R735.2; R319; R322.2


    Abstract:
    Objective 

    To establish and validate a prediction model for gastric cancer lymph node metastasis based on four machine learning (ML) algorithms.

    Methods 

    Clinical data of 531 patients who underwent radical gastrectomy were collected retrospectively, and the patients were randomly divided into a training set (399 patients) and a test set (132 patients) at a ratio of 3∶1. Univariate analysis was used to screen for variables associated with lymph node metastasis in gastric cancer. Logistic regression, random forest, K-nearest neighbor, and support vector machine models were then established, and the variables were ranked by importance in each model. All ML models were validated in the test set, and receiver operating characteristic (ROC) curves were plotted. The optimal ML model was determined based on the area under the curve (AUC), sensitivity, specificity, and accuracy. A nomogram was constructed based on the variable importance ranking of the optimal ML model, and its discrimination, calibration, and clinical applicability were evaluated using ROC curves, calibration curves, and decision curves.
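The 3∶1 split described above can be sketched as a stratified random split. This is a minimal illustration only: the article does not name the software or the splitting rule, so the per-class ceiling rule, the function name `stratified_split`, and the seed are assumptions. The class sizes 243 and 288 are the group sizes reported in Table 2; with them, this rule happens to give the reported 399/132 set sizes.

```python
import math
import random


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Split indices into train/test, sampling each class separately.

    Assumption: each class contributes ceil(train_fraction * class_size)
    patients to the training set. The article does not state its rule.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random order within the class
        n_train = math.ceil(train_fraction * len(indices))
        train_idx.extend(indices[:n_train])
        test_idx.extend(indices[n_train:])
    return sorted(train_idx), sorted(test_idx)


# 0 = no lymph node metastasis (n=243), 1 = lymph node metastasis (n=288)
labels = [0] * 243 + [1] * 288
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 399 132
```

Splitting within each class keeps the proportion of metastasis cases nearly identical in both sets, which is consistent with the balanced distributions reported in Table 1.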

    Results 

    Among the four ML models, the random forest model was identified as the optimal model. In the training set, its accuracy, sensitivity, and specificity were 72.7%, 69.9%, and 75.0%, respectively, with an AUC of 0.803; in the test set, they were 64.4%, 66.7%, and 62.5%, respectively, with an AUC of 0.751. A nomogram was constructed from the variables of the random forest model. ROC curves showed that the AUCs of the nomogram in the training set and test set were 0.721 and 0.776, respectively. Calibration curves and decision curves showed that the nomogram had good calibration and clinical applicability in both sets.

    Conclusion 

    The random forest model is the optimal model among the four ML models. The nomogram based on the random forest model can predict the risk of lymph node metastasis in gastric cancer with reasonable accuracy, and can thereby better guide clinical diagnosis and treatment decisions.

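  • Figure 1   Correlation heatmap of the selected feature variables

    Figure 2   Confusion matrices and ROC curves of the different algorithm models in the training set

    A: confusion matrix of the logistic regression model; B: confusion matrix of the random forest model; C: confusion matrix of the K-nearest neighbor model; D: confusion matrix of the support vector machine model; E: ROC curve of the logistic regression model; F: ROC curve of the random forest model; G: ROC curve of the K-nearest neighbor model; H: ROC curve of the support vector machine model.

    Figure 3   Confusion matrices and ROC curves of the different algorithm models in the test set

    A: confusion matrix of the logistic regression model; B: confusion matrix of the random forest model; C: confusion matrix of the K-nearest neighbor model; D: confusion matrix of the support vector machine model; E: ROC curve of the logistic regression model; F: ROC curve of the random forest model; G: ROC curve of the K-nearest neighbor model; H: ROC curve of the support vector machine model.

    Figure 4   Variable importance rankings of the four machine learning algorithm models

    Figure 5   Nomogram for predicting the risk of lymph node metastasis in gastric cancer

    Figure 6   ROC curves of the nomogram for predicting the risk of lymph node metastasis in patients with gastric cancer

    A: training set; B: test set.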
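For readers less familiar with the AUC reported for these ROC curves, the sketch below computes it directly from its rank interpretation: the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The function name `auc_from_scores` and the six toy scores are illustrative assumptions, not data from the study.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical predicted risks, not study data).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = auc_from_scores(labels, scores)
print(round(auc, 3))  # 8 of 9 pairs are ordered correctly -> 0.889
```

The pairwise loop is quadratic in the sample size, which is fine for a cohort of a few hundred patients; a sort-based rank formula gives the same value more efficiently.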

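    Figure 7   Calibration curves and decision curves of the nomogram

    A: calibration curve, training set; B: calibration curve, test set; C: decision curve, training set; D: decision curve, test set.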
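A decision curve plots net benefit against the threshold probability. The sketch below uses the standard net benefit formula, TP/N minus FP/N weighted by the odds of the threshold, alongside the "treat all" reference strategy. The function names and the example counts are hypothetical; the article does not report the values underlying Figure 7.

```python
def net_benefit(tp, fp, n, threshold):
    """Net benefit of a model at a given threshold probability."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    odds = threshold / (1 - threshold)
    return tp / n - (fp / n) * odds


def net_benefit_treat_all(prevalence, threshold):
    """Net benefit of treating every patient as positive."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    odds = threshold / (1 - threshold)
    return prevalence - (1 - prevalence) * odds


# Hypothetical counts at a threshold of 0.2 (not study data).
nb_model = net_benefit(tp=40, fp=20, n=100, threshold=0.2)
nb_all = net_benefit_treat_all(prevalence=0.5, threshold=0.2)
print(round(nb_model, 3), round(nb_all, 3))
```

A model is clinically useful at a given threshold when its net benefit exceeds both the "treat all" value and zero, the latter being the net benefit of treating no one.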

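    Table 1   Comparison of clinical data between the training set and the test set [n (%) or M (P25, P75)]

    | Clinical data | Category | Training set (n=399) | Test set (n=132) | χ2/Z | P |
    |---|---|---|---|---|---|
    | Sex | | 312 (78.2) | 102 (77.3) | 0.049 | 0.825 |
    | | | 87 (21.8) | 30 (22.7) | | |
    | Age/years | | 58.0 (50.0, 63.5) | 59.0 (50.5, 64.0) | -0.650 | 0.516 |
    | Ethnicity | Han | 304 (76.2) | 106 (80.3) | 2.365 | 0.500 |
    | | Hui | 42 (10.5) | 8 (6.1) | | |
    | | Tibetan | 37 (9.3) | 13 (9.8) | | |
    | | Other | 16 (4.0) | 5 (3.8) | | |
    | Body mass index/(kg/m2) | | 22.5 (20.4, 24.9) | 22.5 (20.7, 24.9) | -0.086 | 0.932 |
    | Hypertension | | 348 (87.2) | 116 (87.9) | 0.0393 | 0.843 |
    | | | 51 (12.8) | 16 (12.1) | | |
    | Diabetes | | 382 (95.7) | 121 (91.7) | 3.294 | 0.070 |
    | | | 17 (4.3) | 11 (8.3) | | |
    | Albumin/(g/L) | | 39.3 (36.8, 41.9) | 39.3 (36.7, 41.5) | -0.684 | 0.494 |
    | Tumor markers | AFP/(ng/mL) | 2.2 (1.7, 3.1) | 2.3 (1.6, 3.3) | -0.106 | 0.916 |
    | | CEA/(ng/mL) | 2.0 (1.3, 3.5) | 2.2 (1.4, 3.7) | -1.169 | 0.242 |
    | | CA125/(U/mL) | 11.0 (8.1, 16.4) | 11.1 (8.4, 14.6) | -0.461 | 0.645 |
    | | CA199/(U/mL) | 7.2 (3.3, 18.4) | 9.8 (3.2, 24.7) | -1.524 | 0.127 |
    | Tumor diameter | < 2 cm | 40 (10.0) | 10 (7.6) | 0.698 | 0.404 |
    | | ≥2 cm | 359 (90.0) | 122 (92.4) | | |
    | Tumor location | Upper third of stomach | 93 (23.3) | 40 (30.3) | 2.590 | 0.274 |
    | | Middle third of stomach | 282 (70.7) | 85 (64.4) | | |
    | | Lower third of stomach | 24 (6.0) | 7 (5.3) | | |
    | Gross type | Ulcerative | 350 (87.7) | 115 (87.1) | 1.355 | 0.508 |
    | | Protruding | 30 (7.5) | 13 (9.8) | | |
    | | Infiltrative | 19 (4.8) | 4 (3.0) | | |
    | Differentiation | Well differentiated | 29 (7.3) | 17 (12.9) | 4.803 | 0.091 |
    | | Moderately differentiated | 221 (55.4) | 63 (47.7) | | |
    | | Poorly differentiated | 149 (37.3) | 52 (39.4) | | |
    | Vascular invasion | | 148 (37.1) | 53 (40.2) | 0.394 | 0.530 |
    | | | 251 (62.9) | 79 (59.8) | | |
    | Nerve invasion | | 174 (43.6) | 64 (48.5) | 0.953 | 0.329 |
    | | | 225 (56.4) | 68 (51.5) | | |
    | T stage | T1a | 13 (3.3) | 6 (4.5) | 5.143 | 0.273 |
    | | T1b | 5 (1.3) | 4 (3.0) | | |
    | | T2 | 122 (30.6) | 44 (33.3) | | |
    | | T3 | 32 (8.0) | 5 (3.8) | | |
    | | T4 | 227 (56.9) | 73 (55.3) | | |
    | Lymph node metastasis | | 183 (45.9) | 60 (45.5) | 0.006 | 0.935 |
    | | | 216 (54.1) | 72 (54.5) | | |

    AFP: alpha-fetoprotein; CEA: carcinoembryonic antigen; CA125: carbohydrate antigen 125; CA199: carbohydrate antigen 199.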

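    Table 2   Comparison of clinical data between the group without lymph node metastasis and the group with lymph node metastasis [n (%) or M (P25, P75)]

    | Characteristic | Category | No lymph node metastasis (n=243) | Lymph node metastasis (n=288) | χ2/Z | P |
    |---|---|---|---|---|---|
    | Sex | | 192 (79.0) | 222 (77.1) | 0.285 | 0.593 |
    | | | 51 (21.0) | 66 (22.9) | | |
    | Age/years | | 58.0 (50.0, 63.0) | 59.0 (50.0, 64.0) | -0.593 | 0.553 |
    | Ethnicity | Han | 195 (80.2) | 215 (74.7) | 2.370 | 0.499 |
    | | Hui | 20 (8.2) | 30 (10.4) | | |
    | | Tibetan | 20 (8.2) | 30 (10.4) | | |
    | | Other | 8 (3.3) | 13 (4.5) | | |
    | Body mass index/(kg/m2) | | 22.4 (20.7, 24.7) | 22.5 (20.2, 25.0) | -0.189 | 0.850 |
    | Hypertension | | 213 (87.7) | 251 (87.2) | 0.030 | 0.862 |
    | | | 30 (12.3) | 37 (12.8) | | |
    | Diabetes | | 231 (95.1) | 272 (94.4) | 0.101 | 0.751 |
    | | | 12 (4.9) | 16 (5.6) | | |
    | Albumin/(g/L) | | 39.4 (36.8, 41.5) | 39.3 (36.6, 42.1) | -0.065 | 0.948 |
    | Tumor markers | AFP/(ng/mL) | 2.3 (1.8, 3.2) | 2.2 (1.6, 3.0) | -1.219 | 0.223 |
    | | CEA/(ng/mL) | 2.0 (1.3, 3.1) | 2.0 (1.3, 4.0) | -1.328 | 0.184 |
    | | CA125/(U/mL) | 10.5 (7.8, 14.3) | 11.4 (8.4, 17.0) | -2.385 | 0.017 |
    | | CA199/(U/mL) | 7.3 (3.3, 14.4) | 9.2 (3.2, 28.3) | -1.927 | 0.054 |
    | Tumor diameter | < 2 cm | 29 (11.9) | 21 (7.3) | 3.330 | 0.068 |
    | | ≥2 cm | 214 (88.1) | 267 (92.7) | | |
    | Tumor location | Upper third of stomach | 59 (24.3) | 74 (25.7) | 0.203 | 0.903 |
    | | Middle third of stomach | 169 (69.5) | 198 (68.8) | | |
    | | Lower third of stomach | 15 (6.2) | 16 (5.6) | | |
    | Gross type | Ulcerative | 216 (88.9) | 249 (86.5) | 8.582 | 0.014 |
    | | Protruding | 23 (9.5) | 20 (6.9) | | |
    | | Infiltrative | 4 (1.6) | 19 (6.6) | | |
    | Differentiation | Well differentiated | 24 (9.9) | 22 (7.6) | 1.318 | 0.517 |
    | | Moderately differentiated | 132 (54.3) | 152 (52.8) | | |
    | | Poorly differentiated | 87 (35.8) | 114 (39.6) | | |
    | Vascular invasion | | 139 (57.2) | 62 (21.5) | 71.299 | < 0.001 |
    | | | 104 (42.8) | 226 (78.5) | | |
    | Nerve invasion | | 140 (57.6) | 98 (34.0) | 29.644 | < 0.001 |
    | | | 103 (42.4) | 190 (66.0) | | |
    | T stage | T1a | 16 (6.6) | 3 (1.0) | 41.015 | < 0.001 |
    | | T1b | 6 (2.5) | 3 (1.0) | | |
    | | T2 | 100 (41.2) | 66 (22.9) | | |
    | | T3 | 16 (6.6) | 21 (7.3) | | |
    | | T4 | 105 (43.2) | 195 (67.7) | | |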
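The χ2 values for two-level variables in Table 2 can be checked with a Pearson chi-square test on a 2×2 table. The sketch below assumes no continuity correction (the article does not say which variant was used) and takes the four vascular invasion counts from Table 2; the function name `chi_square_2x2` is illustrative. Under that assumption, the result agrees with the reported 71.299.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and the p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Vascular invasion rows of Table 2: 139 vs 62 (first row), 104 vs 226 (second row).
chi2, p_value = chi_square_2x2(139, 62, 104, 226)
print(round(chi2, 3), p_value < 0.001)  # 71.299 True
```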

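    Table 3   Predictive performance of the different machine learning algorithm models in the training set and the test set

    | Dataset | Model | Accuracy/% | Sensitivity/% | Specificity/% | AUC |
    |---|---|---|---|---|---|
    | Training set | Logistic regression | 69.2 | 60.7 | 76.4 | 0.727 |
    | | Random forest | 72.7 | 69.9 | 75.0 | 0.803 |
    | | K-nearest neighbor | 68.9 | 58.5 | 77.8 | 0.772 |
    | | Support vector machine | 71.9 | 64.5 | 78.2 | 0.792 |
    | Test set | Logistic regression | 71.9 | 66.7 | 76.4 | 0.766 |
    | | Random forest | 64.4 | 66.7 | 62.5 | 0.751 |
    | | K-nearest neighbor | 58.3 | 48.3 | 66.7 | 0.612 |
    | | Support vector machine | 60.6 | 55.0 | 65.3 | 0.637 |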
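Accuracy, sensitivity, and specificity in Table 3 are all derived from the four cells of a confusion matrix. The sketch below shows the calculation. The cell counts are hypothetical: they were chosen only because they sum to the 132 test-set patients and round to the random forest test-set row (64.4%, 66.7%, 62.5%); the actual confusion matrices appear in Figures 2 and 3 and are not reproduced in this text.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Hypothetical counts (assumption, not taken from the article's figures).
metrics = classification_metrics(tp=40, fn=20, tn=45, fp=27)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.1f}%")
```

Because accuracy mixes both classes, two models with similar accuracy can differ considerably in sensitivity and specificity, which is why the table reports all three alongside the AUC.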
Publication history
  • Received: 2023-09-25
  • Revised: 2023-11-28
  • Published online: 2024-01-22
  • Issue date: 2024-01-14
