Journal of Electrical and Computer Engineering Innovations (JECEI)
Articles in Press, Accepted, Available Online from 15 Dey 1403 (4 January 2025). Full Text (1.26 MB)
Article Type: Original Research Paper
DOI: 10.22061/jecei.2024.11256.784
Authors
Z. Raisi*; V. M. Nazarzehi Had; E. Sarani; R. Damani
Electrical Engineering Department, Chabahar Maritime University, Chabahar, Iran.
Received: 16 Shahrivar 1403 (6 September 2024)
Revised: 07 Dey 1403 (27 December 2024)
Accepted: 10 Dey 1403 (30 December 2024)
Abstract
Background and Objectives: Research on right-to-left scripts, particularly Persian text recognition in wild images, is limited by the lack of a comprehensive benchmark dataset. Applying state-of-the-art (SOTA) techniques developed on existing Latin or multilingual datasets often results in poor recognition performance on Persian scripts. This study aims to bridge this gap by introducing a comprehensive dataset for Persian text recognition and evaluating SOTA models on it.

Methods: We propose a Farsi (Persian) text recognition (FATR) dataset, which includes challenging images captured in various indoor and outdoor environments. Additionally, we introduce FATR-Synth, the largest synthetic Persian text dataset, containing over 200,000 cropped word images designed for pre-training scene text recognition models. We evaluate five SOTA deep learning-based scene text recognition models on the proposed datasets using the standard word recognition accuracy (WRA) metric, and qualitatively compare these recent architectures on challenging sample images from the FATR dataset.

Results: Our experiments demonstrate that the performance of SOTA recognition models declines significantly when tested on the FATR dataset. However, when trained on synthetic and real-world Persian text datasets, these models show improved performance on Persian scripts.

Conclusion: The FATR dataset expands the resources available for Persian text recognition and improves model performance. The proposed datasets, trained models, and code are available at https://github.com/zobeirraisi/FATDR.
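As a concrete illustration of the evaluation protocol summarized above, the following minimal sketch computes the word recognition accuracy (WRA) metric: the fraction of cropped word images whose predicted transcription exactly matches the ground-truth label. This is not the authors' released code (see the GitHub link above); the helper names and the Persian codepoint normalization are illustrative assumptions.

```python
# Minimal WRA sketch: WRA = (# exactly matched words) / (# words).
# The normalization step is an assumed preprocessing choice, common for
# Persian labels that mix Arabic and Persian codepoint variants.

def normalize_fa(s: str) -> str:
    """Strip whitespace and unify Arabic/Persian letter variants
    (ARABIC YEH U+064A -> FARSI YEH U+06CC, KAF U+0643 -> KEHEH U+06A9)."""
    return s.strip().replace("\u064a", "\u06cc").replace("\u0643", "\u06a9")

def word_recognition_accuracy(predictions: list[str], ground_truths: list[str]) -> float:
    """Exact-match word accuracy over paired predictions and labels."""
    assert len(predictions) == len(ground_truths)
    if not ground_truths:
        return 0.0
    correct = sum(normalize_fa(p) == normalize_fa(g)
                  for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

# Hypothetical example: 2 of 3 crops match exactly -> WRA = 0.667
preds = ["چابهار", "دریا", "کتب"]
gts = ["چابهار", "دریا", "کتاب"]
print(f"WRA = {word_recognition_accuracy(preds, gts):.3f}")
```

In the same spirit, a synthetic cropped-word generator like the one behind FATR-Synth could render Persian vocabulary onto simple backgrounds. The sketch below is a hedged illustration using Pillow, not the FATR-Synth pipeline itself; the font file name is an assumption, and correct Persian letter joining additionally requires Pillow built with libraqm (or pre-shaping the text with arabic_reshaper and python-bidi).

```python
# Hedged sketch of synthetic word-crop rendering: draw one Persian word
# on a light random background and save the tight crop.
import random
from PIL import Image, ImageDraw, ImageFont

def render_word_crop(word: str, font_path: str, out_path: str) -> None:
    font = ImageFont.truetype(font_path, size=random.randint(24, 48))
    left, top, right, bottom = font.getbbox(word)  # tight text extents
    w, h = right - left, bottom - top
    gray = random.randint(160, 255)                # light random background
    img = Image.new("RGB", (w + 16, h + 16), (gray, gray, gray))
    ImageDraw.Draw(img).text((8 - left, 8 - top), word, font=font, fill=(0, 0, 0))
    img.save(out_path)

# "Vazirmatn-Regular.ttf" is an assumed Persian font on the local path.
render_word_crop("چابهار", "Vazirmatn-Regular.ttf", "word_00001.png")
```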
Keywords
Persian scripts; Scene text recognition; Real-world datasets; Synthetic images; Deep learning
References
[1] Y. Zhu, C. Yao, X. Bai, “Scene text detection and recognition: Recent advances and future trends,” Front. Comput. Sci., 10(1): 19-36, 2016.
[2] H. Lin, P. Yang, F. Zhang, “Review of scene text detection and recognition,” Arch. Comput. Methods Eng., 27: 433-454, 2020.
[3] Z. Raisi, M. A. Naiel, P. Fieguth, S. Wardell, J. Zelek, “Text detection and recognition in the wild: A review,” arXiv preprint arXiv:2006.04305, 2020.
[4] Z. Raisi, J. Zelek, “Text detection and recognition for robot localization,” J. Electr. Comput. Eng. Innov., 12(1): 163-174, 2024.
[5] K. Wang, B. Babenko, S. Belongie, “End-to-end scene text recognition,” in Proc. 2011 International Conference on Computer Vision: 1457-1464, 2011.
[6] A. Bissacco, M. Cummins, Y. Netzer, H. Neven, “PhotoOCR: Reading text in uncontrolled conditions,” in Proc. 2013 IEEE International Conference on Computer Vision: 785-792, 2013.
[7] Z. Raisi, V. M. Nazarzehi, “A transformer-based approach with contextual position encoding for robust Persian text recognition in the wild,” J. AI Data Min., 12(3): 455-464, 2024.
[8] Z. Raisi, G. Younes, J. Zelek, “Arbitrary shape text detection using transformers,” in Proc. 2022 26th International Conference on Pattern Recognition (ICPR): 3238-3245, 2022.
[9] M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, “Deep structured output learning for unconstrained text recognition,” arXiv preprint arXiv:1412.5903, 2015.
[10] B. Shi, X. Bai, C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,” IEEE Trans. Pattern Anal. Mach. Intell., 39(11): 2298-2304, 2016.
[11] B. Shi, X. Wang, P. Lyu, C. Yao, X. Bai, “Robust scene text recognition with automatic rectification,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 4168-4176, 2016.
[12] W. Liu, C. Chen, K. Y. K. Wong, Z. Su, J. Han, “STARNet: A spatial attention residue network for scene text recognition,” in Proc. British Machine Vision Conference (BMVC): 43.1-43.13, 2016.
[13] F. Borisyuk, A. Gordo, V. Sivakumar, “Rosetta: Large scale system for text detection and recognition in images,” in Proc. 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining: 71-79, 2018.
[14] J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, H. Lee, “What is wrong with scene text recognition model comparisons? Dataset and model analysis,” in Proc. IEEE/CVF International Conference on Computer Vision (ICCV): 4715-4723, 2019.
[15] C. Ma, L. Sun, J. Wang, Q. Huo, “DQ-DETR: Dynamic queries enhanced detection transformer for arbitrary shape text detection,” in Proc. International Conference on Document Analysis and Recognition: 243-260, 2023.
[16] A. Rahman, A. Ghosh, C. Arora, “UTRNet: High-resolution Urdu text recognition in printed documents,” in Proc. International Conference on Document Analysis and Recognition: 305-324, 2023.
[17] F. Alimoradi, F. Rahmani, L. Rabiei, M. Khansari, M. Mazoochi, “Synthesizing an image dataset for text detection and recognition in images,” J. Inf. Commun. Technol., 53(53): 78, 2023 [In Farsi].
[18] A. Rashtehroudi, A. Ranjkesh, A. Shahbahrami, “PESTD: A large-scale Persian-English scene text dataset,” Multimedia Tools Appl., 82: 34793-34808, 2023.
[19] S. Kheirinejad, N. Riaihi, R. Azmi, “Persian text-based traffic sign detection with convolutional neural network: A new dataset,” in Proc. 2020 10th International Conference on Computer and Knowledge Engineering (ICCKE): 060-064, 2020.
[20] M. Rahmati, M. Fateh, M. Rezvani, A. Tajary, V. Abolghasemi, “Printed Persian OCR system using deep learning,” IET Image Process., 14(15): 3920-3931, 2020.
[21] A. Fateh, M. Rezvani, A. Tajary, M. Fateh, “Persian printed text line detection based on font size,” Multimedia Tools Appl., 82(2): 2393-2418, 2023.
[22] T. E. De Campos, B. R. Babu, M. Varma, et al., “Character recognition in natural images,” in Proc. Fourth International Conference on Computer Vision Theory and Applications (VISAPP), 7: 273-280, 2009.
[23] K. Wang, S. Belongie, “Word spotting in the wild,” in Proc. European Conference on Computer Vision: 591-604, 2010.
[24] L. Neumann, J. Matas, “Real-time scene text localization and recognition,” in Proc. 2012 IEEE Conference on Computer Vision and Pattern Recognition: 3538-3545, 2012.
[25] F. Zhan, S. Lu, “ESIR: End-to-end scene text recognition via iterative image rectification,” in Proc. 2019 IEEE Conference on Computer Vision and Pattern Recognition: 2059-2068, 2019.
[26] M. Sawaki, H. Murase, N. Hagita, “Automatic acquisition of context-based image templates for degraded character recognition in scene images,” in Proc. 15th International Conference on Pattern Recognition (ICPR), 4: 15-18, 2000.
[27] Y. F. Pan, X. Hou, C. L. Liu, “Text localization in natural scene images based on conditional random field,” in Proc. 2009 10th International Conference on Document Analysis and Recognition: 6-10, 2009.
[28] N. Dalal, B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1: 886-893, 2005.
[29] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vision, 60(2): 91-110, 2004.
[30] J. A. Suykens, J. Vandewalle, “Least squares support vector machine classifiers,” Neural Process. Lett., 9(3): 293-300, 1999.
[31] J. Almazán, A. Gordo, A. Fornés, E. Valveny, “Word spotting and recognition with embedded attributes,” IEEE Trans. Pattern Anal. Mach. Intell., 36(12): 2552-2566, 2014.
[32] A. Graves, S. Fernandez, F. Gomez, J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proc. 23rd International Conference on Machine Learning: 369-376, 2006.
[33] Z. Wan, F. Xie, Y. Liu, X. Bai, C. Yao, “2D-CTC for scene text recognition,” arXiv:1907.09705v1, 2019.
[34] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, X. Bai, “ASTER: An attentional scene text recognizer with flexible rectification,” IEEE Trans. Pattern Anal. Mach. Intell., 41(9): 2035-2048, 2018.
[35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, “Attention is all you need,” in Proc. 31st Conference on Neural Information Processing Systems (NIPS 2017): 5998-6008, 2017.
[36] Z. Raisi, M. A. Naiel, G. Younes, S. Wardell, J. Zelek, “2LSPE: 2D learnable sinusoidal positional encoding using transformer for scene text recognition,” in Proc. 2021 18th Conference on Robots and Vision (CRV): 119-126, 2021.
[37] Z. Qiao, Z. Ji, Y. Yuan, J. Bai, “Decoupling visual semantic features learning with dual masked autoencoder for self-supervised scene text recognition,” in Proc. International Conference on Document Analysis and Recognition: 261-279, 2023.
[38] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, L. P. De Las Heras, “ICDAR 2013 robust reading competition,” in Proc. 2013 12th International Conference on Document Analysis and Recognition: 1484-1493, 2013.
[39] A. Mishra, K. Alahari, C. V. Jawahar, “Scene text recognition using higher order language priors,” in Proc. BMVC, 2012.
[40] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al., “ICDAR 2015 competition on robust reading,” in Proc. 2015 13th International Conference on Document Analysis and Recognition (ICDAR), 2015.
[41] A. Risnumawan, P. Shivakumara, C. S. Chan, C. L. Tan, “A robust arbitrary text detection system for natural scene images,” Expert Syst. Appl., 41(18): 8027-8048, 2014.
[42] T. Q. Phan, P. Shivakumara, S. Tian, C. L. Tan, “Recognizing text with perspective distortion in natural scenes,” in Proc. IEEE International Conference on Computer Vision (ICCV): 569-576, 2013.
[43] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proc. 13th European Conference on Computer Vision: 740-755, 2014.
[44] A. Gupta, A. Vedaldi, A. Zisserman, “Synthetic data for text localisation in natural images,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition: 2315-2324, 2016.
[45] M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, “Synthetic data and artificial neural networks for natural scene text recognition,” arXiv preprint arXiv:1406.2227, 2014.
[46] M. Iwamura, N. Morimoto, K. Tainaka, D. Bazazian, L. Gomez, D. Karatzas, “ICDAR2017 robust reading challenge on omnidirectional video,” in Proc. 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 1: 1448-1453, 2017.
[47] Y. Sun, Z. Ni, C. K. Chng, Y. Liu, C. Luo, C. C. Ng, J. Han, E. Ding, J. Liu, D. Karatzas, et al., “ICDAR 2019 competition on large-scale street view text with partial labeling - RRC-LSVT,” in Proc. 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019.
[48] W. Wu, Y. Zhao, Z. Li, J. Li, M. Z. Shou, U. Pal, D. Karatzas, X. Bai, “ICDAR 2023 competition on video text reading for dense and small text,” in Proc. International Conference on Document Analysis and Recognition: 405-419, 2023.
[49] R. Zhang, Y. Zhou, Q. Jiang, Q. Song, N. Li, K. Zhou, L. Wang, D. Wang, M. Liao, M. Yang, et al., “ICDAR 2019 robust reading challenge on reading Chinese text on signboard,” in Proc. 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019.
[50] C. K. Chng, Y. Liu, Y. Sun, C. C. Ng, C. Luo, Z. Ni, C. Fang, S. Zhang, J. Han, E. Ding, J. Liu, D. Karatzas, C. S. Chan, L. Jin, “ICDAR 2019 robust reading challenge on arbitrary-shaped text - RRC-ArT,” in Proc. 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019.
[51] Z. Wan, J. Zhang, L. Zhang, J. Luo, C. Yao, “On vocabulary reliance in scene text recognition,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition: 11425-11434, 2020.
[52] M. Tounsi, I. Moalla, A. M. Alimi, F. Lebourgeois, “Arabic characters recognition in natural scenes using sparse coding for feature representations,” in Proc. 2015 13th International Conference on Document Analysis and Recognition (ICDAR): 1036-1040, 2015.
[53] M. Tounsi, I. Moalla, A. M. Alimi, “ARASTI: A database for Arabic scene text recognition,” in Proc. 2017 1st International Workshop on Arabic Script Analysis and Recognition (ASAR): 140-144, 2017.
[54] M. Jain, M. Mathew, C. Jawahar, “Unconstrained OCR for Urdu using deep CNN-RNN hybrid networks,” in Proc. 2017 4th IAPR Asian Conference on Pattern Recognition (ACPR): 747-752, 2017.
[55] N. Sabbour, F. Shafait, “A segmentation-free approach to Arabic and Urdu OCR,” in Proc. Document Recognition and Retrieval XX, 8658: 215-226, 2013.
[56] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” Soviet Physics Doklady, 10: 707-710, 1966.
[57] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, 1(2): 3, 2022.
[58] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
[59] G. Team, R. Anil, S. Borgeaud, Y. Wu, J. B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al., “Gemini: A family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023.
[60] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W. Y. Lo, et al., “Segment anything,” in Proc. IEEE/CVF International Conference on Computer Vision: 4015-4026, 2023.
[61] A. Kortylewski, Q. Liu, A. Wang, Y. Sun, A. Yuille, “Compositional convolutional neural networks: A robust and interpretable model for object recognition under occlusion,” arXiv preprint arXiv:2006.15538, 2020.
[62] Z. Raisi, J. Zelek, “Occluded text detection and recognition in the wild,” in Proc. 2022 19th Conference on Robots and Vision (CRV): 140-150, 2022.
[63] A. Faraji, M. Saeed, H. Nezamabadi-pour, "Introducing a database for Farsi document image understanding and segmentation," J. Mach. Vision Image Process., 10(2): 31-46, 2023 [In Persian].