Actor Double Critic Architecture for Dialogue System

Saffari, Y.; Salimi Sartakhti, J.

doi:10.22061/jecei.2023.9346.614

Journal of Electrical and Computer Engineering Innovations (JECEI)
Volume 11, Issue 2, Mehr 2023, Pages 363-372
Article type: Original Research Paper
Digital Object Identifier (DOI): 10.22061/jecei.2023.9346.614
Authors
Y. Saffari; J. Salimi Sartakhti^*
Department of Electrical and Computer Engineering, University of Kashan, Kashan, Iran.
Received: 05 Azar 1401; Revised: 18 Dey 1401; Accepted: 10 Esfand 1401 (Solar Hijri calendar)
Abstract
Background and Objectives: Most recent dialogue policy learning methods are based on reinforcement learning (RL). However, basic RL algorithms such as the deep Q-network (DQN) have drawbacks in environments with large state and action spaces, such as dialogue systems. Most policy-based methods are slow because they estimate the action value by computing the sum of discounted rewards for each action. In value-based RL methods, function approximation errors lead to overestimated values and, ultimately, suboptimal policies. Some works try to resolve these problems by combining RL methods, but most of them were applied in game environments or focused only on combining DQN variants. This paper presents, for the first time in a dialogue system, a method that combines actor-critic and double DQN, named Double Actor-Critic (DAC), which significantly improves the stability, speed, and performance of dialogue policy learning.

Methods: To overcome the slow learning of standard DQN, the critic unit of the actor-critic approximates the value function and evaluates the quality of the policy used by the actor, which allows the actor to learn the policy faster. To overcome the overestimation issue of DQN, double DQN is employed. Finally, to obtain smoother updates, a heuristic loss is introduced that takes the minimum of the actor-critic loss and the double DQN loss.

Results: Experiments on a movie ticket booking task show that the proposed method learns more stably, without the drop that follows overestimation, and reaches the learning threshold in fewer training episodes.

Conclusion: Unlike previous works, which mostly proposed combinations of DQN variants, this study combines DQN variants with actor-critic to benefit from both policy-based and value-based RL methods and to overcome their two main issues, slow learning and overestimation. Experimental results show that, as a dialogue policy learner, the proposed method can conduct more accurate conversations with a user.
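
The two mechanisms named in the Methods summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact loss definitions, so the function names, the list-based Q-values, and the treatment of both losses as plain scalars are assumptions made for illustration. The first function shows the standard double DQN target, in which the online network selects the next action and the target network evaluates it. The second shows a minimum-of-two-losses rule in the spirit of the heuristic loss described above.

```python
from typing import Sequence


def vanilla_dqn_target(reward: float, gamma: float, done: bool,
                       q_target_next: Sequence[float]) -> float:
    """Standard DQN target: the target network both selects and evaluates."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward: float, gamma: float, done: bool,
                      q_online_next: Sequence[float],
                      q_target_next: Sequence[float]) -> float:
    """Double DQN target: online network selects, target network evaluates."""
    if done:
        return reward
    # Action selection uses the online network's estimates.
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # Action evaluation uses the target network's estimate for that action.
    return reward + gamma * q_target_next[best_action]


def heuristic_loss(actor_critic_loss: float, double_dqn_loss: float) -> float:
    """Assumed form of the heuristic: keep the smaller of the two losses."""
    return min(actor_critic_loss, double_dqn_loss)


if __name__ == "__main__":
    # Hypothetical next-state Q-values for three dialogue actions.
    q_online_next = [1.0, 2.0, 0.5]
    q_target_next = [1.5, 0.8, 3.0]
    print(vanilla_dqn_target(1.0, 0.9, False, q_target_next))
    print(double_dqn_target(1.0, 0.9, False, q_online_next, q_target_next))
    print(heuristic_loss(0.3, 0.5))
```

In this toy example the double DQN target is lower than the standard DQN target, because the action chosen by the online network is not the one the target network rates highest. This decoupling is what reduces overestimation.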
Keywords
Actor-Critic; Dialogue system; DQN; Actor Double Critic
