A Framework for Grammar Inferencing using LLM Heuristics
Publisher
Department of Computer Science and Engineering(CSE), Islamic University of Technology(IUT), Board Bazar, Gazipur-1704, Bangladesh
Abstract
Grammar inference is a major topic in the field of programming languages. To create and enhance the ecosystem of any programming language (which generally consists of optimized compilers, well-defined maintenance, and comprehensive documentation), inferring a grammar for that language is unavoidable. Automated inference of grammars from strings of code has been studied for some time, evolving through various stages: from template-based systems to black-box-dependent models and, more recently, to neural network-based approaches. We introduce a novel technique in this continuum: LLM-based (Large Language Model) systems. This approach leverages the capabilities of large language models to enhance performance, offering a fresh perspective and a potentially transformative impact on grammar inference. We harness the LLM's ability to generalize and to consume domain-specific knowledge in order to infer grammars in a context-aware manner. Our method outperforms the state-of-the-art TREEVADA approach in both precision and recall, by 8% and 28% respectively, and also makes the overall process more general and constraint-free. However, our evaluation suffers from the absence of a language entirely unseen by the LLMs. Because our baseline approach, TREEVADA, requires some form of lexer-parser setup within its oracle, comparing the two methods on such a novel language is currently precluded from our work.
Description
Supervised by
Mr. Md. Jubair Ibna Mostafa,
Assistant Professor,
Department of Computer Science and Engineering (CSE)
Islamic University of Technology (IUT)
Board Bazar, Gazipur, Bangladesh
This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Software Engineering, 2024.
Keywords
Grammar Inference, LLM, Black Box, CFG, LLM Heuristics