Using a model collection of Tajik literary works, this paper studies whether the author of a text fragment of minimal size extracted from the collection can be determined. The model collection consists of works of classical poetry and modern prose in the Tajik language, written in Cyrillic script. Each work is associated with a digital portrait (DP), the frequency distribution of its character bigrams. Bigrams are quite acceptable quantitative characteristics for identifying the authors of texts. The task is carried out with a γ-classifier, which identifies the authors of texts from the frequencies of alphabetic bigrams with a sufficiently high degree of efficiency. The mathematical model of the γ-classifier is a triad: the first component is the digital portrait of a text, that is, the distribution of bigram frequencies in the text; the second is the formulas for calculating distances between the DPs of texts; the third is a machine learning algorithm. The algorithm was tuned on a table of pairwise distances between all works of the model collection; tuning consisted in determining an optimal value of the real parameter γ at which the error of violating the “homogeneity” hypothesis is minimized. It was found that the γ-classifier can identify the authors of works in the Tajik language from their digital portraits. Using the metric classifier with the nearest-neighbor method (nearest in terms of distance), it was possible to identify the authors of a sequence of progressively shorter text fragments, from 7000 words (40,000 characters) down to 20 words (100 characters). The minimum sample size, in words or characters, needed to recognize the author of a Tajik text has been determined.
The results of experiments on the minimum sample size, in words (characters), needed to recognize the author of a text are described.
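The digital portrait and nearest-neighbor steps described above can be illustrated with a minimal Python sketch. It is not the γ-classifier itself: the abstract does not give the γ-dependent distance formula, so a plain L1 distance between bigram frequency distributions is used here as a stand-in. Counting bigrams only within words, and the function names `digital_portrait`, `l1_distance` and `nearest_author`, are likewise assumptions made for illustration.

```python
from collections import Counter


def digital_portrait(text):
    """Relative frequencies of letter bigrams, counted inside words only (assumption)."""
    # Keep letters (including Cyrillic), lower-case them, turn everything else into spaces.
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    counts = Counter()
    for word in cleaned.split():
        for i in range(len(word) - 1):
            counts[word[i:i + 2]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bigram: n / total for bigram, n in counts.items()}


def l1_distance(p, q):
    """Stand-in distance between two portraits; the paper's formula depends on γ."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def nearest_author(fragment, labelled_texts):
    """Return the author whose text has the portrait closest to the fragment's."""
    fragment_dp = digital_portrait(fragment)
    best = min(
        labelled_texts,
        key=lambda item: l1_distance(fragment_dp, digital_portrait(item[1])),
    )
    return best[0]
```

Here `labelled_texts` is a hypothetical list of `(author, text)` pairs. Shortening `fragment` step by step and checking where the returned author stops being correct mirrors, in spirit, the experiment with decreasing fragment sizes.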
Kosimov A.A. O raspoznavanii avtora tekstovogo fragmenta na osnove chastotnosti bukvennykh bigramm [On the recognition of the author of a text fragment based on the frequency of alphabetic bigrams]. Sistemy analiza i obrabotki dannykh = Analysis and Data Processing Systems, 2022, no. 1 (85), pp. 73–82. DOI: 10.17212/2782-2001-2022-1-73-82.