Security event logs describe the state of an information system and make it possible to detect anomalies in user behavior and cybersecurity incidents. The article reviews the existing event logs (application, system, and security event logs) and their division into types. Automated analysis of security event log data is difficult, however, because the logs contain a large amount of unstructured data collected from various sources. The article therefore formulates and describes the problem of analyzing information security event logs.
To address this problem, the article considers data clustering methods and algorithms that are new and comparatively little studied: random forest, incremental clustering, and the IPLoM (Iterative Partitioning Log Mining) algorithm. The random forest algorithm builds decision trees on samples of the data, obtains a prediction from each tree, and selects the final answer by voting. Averaging over many trees reduces overfitting. The algorithm is also used for regression and classification problems. Incremental clustering defines clusters as groups of objects that belong to the same class or concept, which is a specific set of pairs. The clusters it defines may overlap, which allows a degree of "fuzziness" for samples that lie on the boundaries between clusters. The IPLoM algorithm uses the unique characteristics of log messages to partition the log iteratively, which helps to extract message types efficiently.
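The iterative partitioning idea behind IPLoM can be illustrated with a minimal Python sketch. It is a simplified reading of the published algorithm, not the authors' implementation: it groups messages by token count, splits each group on the token position with the fewest distinct values, and then derives a message type for each partition by replacing variable positions with `*`. The bijection-search step and the threshold parameters of the full algorithm are omitted, and the function names and sample log lines are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample log lines, used only for illustration.
SAMPLE_LOGS = [
    "Connection from 10.0.0.1 closed",
    "Connection from 10.0.0.2 closed",
    "User alice logged in",
    "User bob logged in",
    "Disk quota exceeded for user carol",
]


def partition_by_token_count(messages):
    """Step 1: group whitespace-tokenized messages by number of tokens."""
    groups = defaultdict(list)
    for message in messages:
        tokens = message.split()
        if tokens:  # skip blank lines
            groups[len(tokens)].append(tokens)
    return list(groups.values())


def partition_by_token_position(partition):
    """Step 2: split on the token position with the fewest distinct values."""
    width = len(partition[0])
    distinct = [len({tokens[i] for tokens in partition}) for i in range(width)]
    split_at = distinct.index(min(distinct))
    groups = defaultdict(list)
    for tokens in partition:
        groups[tokens[split_at]].append(tokens)
    return list(groups.values())


def template_of(partition):
    """Final step: keep constant tokens, replace variable positions with '*'."""
    width = len(partition[0])
    parts = []
    for i in range(width):
        values = {tokens[i] for tokens in partition}
        parts.append(values.pop() if len(values) == 1 else "*")
    return " ".join(parts)


def mine_templates(messages):
    """Run the simplified pipeline and return one template per partition."""
    templates = []
    for by_count in partition_by_token_count(messages):
        for by_position in partition_by_token_position(by_count):
            templates.append(template_of(by_position))
    return templates


if __name__ == "__main__":
    for template in mine_templates(SAMPLE_LOGS):
        print(template)
```

On the sample lines, the two connection messages collapse into the type `Connection from * closed` and the two login messages into `User * logged in`. A partition that holds a single message keeps all of its tokens as constants.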
The authors express their deep gratitude to Doctor of Technical Sciences, Professor Viktor Matveevich Belov for his valuable advice and comments during the work on the article.
Sidorova D.N., Pivkin E.N. Algoritmy i metody klasterizatsii dannykh v analize zhurnalov sobytii informatsionnoi bezopasnosti [Algorithms and methods of data clustering in the analysis of information security event logs]. Bezopasnost' tsifrovykh tekhnologii = Digital Technology Security, 2022, no. 1 (104), pp. 41–60. DOI: 10.17212/2782-2230-2022-1-41-60.