Text Detection and Recognition for Robot Localization

Raisi, Z.; Zelek, J.

doi:10.22061/jecei.2023.9857.658

Journal of Electrical and Computer Engineering Innovations (JECEI)
Article 11, Volume 12, Issue 1, Farvardin 2024, Pages 163-174
Article type: Original Research Paper
Digital Object Identifier (DOI): 10.22061/jecei.2023.9857.658
Authors
Z. Raisi*¹; J. Zelek²
¹University of Waterloo, Waterloo, Canada and Chabahar Maritime University, Chabahar, Iran.
²Systems Design Engineering Department, University of Waterloo, Canada.
Received: 05 Tir 1402; Revised: 14 Shahrivar 1402; Accepted: 17 Shahrivar 1402 (Solar Hijri calendar)
Abstract
Background and Objectives: Signage is everywhere, and a robot should be able to use signs to help it localize, including through Visual Place Recognition (VPR), and to map its surroundings. Robust text detection and recognition in the wild is challenging because of pose, irregular text instances, illumination variations, viewpoint changes, and occlusion.
Methods: This paper proposes an end-to-end scene text spotting model that simultaneously outputs text strings and their bounding boxes. The model combines a pre-trained Vision Transformer (ViT)-based architecture with a multi-task transformer-based text detector that is better suited to the VPR task. The central contribution is an end-to-end scene text spotting framework that captures irregular and occluded text regions in a variety of challenging places. To address occlusion, we first equip the ViT backbone with a masked autoencoder (MAE) so that it can capture partially occluded characters. We then use a multi-task prediction head that handles arbitrarily shaped text instances with polygon bounding boxes.
Results: The architecture was evaluated for VPR through several experiments on the challenging Self-Collected Text Place (SCTP) benchmark dataset, using Precision and Recall as the evaluation metrics. On this benchmark, the final model achieved Recall = 0.93 and Precision = 0.8.
Conclusion: The initial experimental results show that the proposed model outperforms state-of-the-art (SOTA) methods on the SCTP dataset, which supports the robustness of the proposed end-to-end scene text detection and recognition model.
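The Precision and Recall figures reported in the abstract follow the standard definitions, Precision = TP / (TP + FP) and Recall = TP / (TP + FN). The sketch below is only an illustration of those definitions. The `MatchCounts` container and the counts in it are hypothetical, and the paper's exact matching criterion on SCTP is not described in the abstract, so what counts as a true positive here is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchCounts:
    """Outcome counts for one evaluation run (hypothetical container)."""
    true_positives: int   # reported matches that are correct
    false_positives: int  # reported matches that are wrong
    false_negatives: int  # correct matches that were missed


def precision(c: MatchCounts) -> float:
    """Fraction of reported matches that are correct: TP / (TP + FP)."""
    reported = c.true_positives + c.false_positives
    return c.true_positives / reported if reported else 0.0


def recall(c: MatchCounts) -> float:
    """Fraction of correct matches that were found: TP / (TP + FN)."""
    relevant = c.true_positives + c.false_negatives
    return c.true_positives / relevant if relevant else 0.0


# Illustrative counts only; these are NOT taken from the paper.
counts = MatchCounts(true_positives=8, false_positives=2, false_negatives=1)
print(f"precision={precision(counts):.2f} recall={recall(counts):.2f}")
```

Both functions return 0.0 when the denominator is empty, which is one common convention; other evaluations treat that case as undefined.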
Keywords
Text Detection; Text Recognition; Robotics Localization; Deep Learning; Visual Place Recognition
