Using machine learning to study the population life quality: methodological aspects
https://doi.org/10.26425/2658-347X-2022-5-1-87-97
Abstract
Assessing the quality of life of the population is an important and topical sociological task. Machine learning, used to classify the digital traces of social network users, makes it possible to build a basis for calculating a subjective quality-of-life index. The article reviews, step by step, the stages of applying machine learning algorithms to assess the quality of life of the population in the regions of the Russian Federation, as well as ways of improving neural network accuracy. To train the neural network, the authors compiled a labelled dataset extracted from regional communities of the social network “VKontakte”. Various approaches to text vectorisation, publicly available neural network models pre-trained on large Russian-language text corpora, and metrics for evaluating the algorithms’ results were analysed. Computational experiments with different algorithms were carried out; based on their results, the Rubert-tiny model was selected for its high training and classification speed. After tuning the model parameters, an F1-macro score of 0.545 was achieved. The computational experiments were carried out using Python scripts. Typical errors that the neural network makes during automatic content classification were examined. The results of the study can be used to calculate an online activity index for VKontakte users from various Russian regions, on the basis of which the subjective quality-of-life index will be calculated in the future. Improving the accuracy of the neural network will make it possible to obtain more reliable data for assessing the quality of life in Russian regions from users’ digital traces.
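The F1-macro metric mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' script: the class names and the toy label lists are hypothetical, and the function name `f1_macro` is chosen here for illustration. The sketch computes F1 separately for each class and then takes the unweighted mean, so rare classes count as much as frequent ones.

```python
from collections import Counter


def f1_macro(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for true_label, pred_label in zip(y_true, y_pred):
        if true_label == pred_label:
            tp[true_label] += 1
        else:
            fp[pred_label] += 1  # predicted this class, but it was wrong
            fn[true_label] += 1  # missed the class that was actually there

    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_class.append(2 * tp[label] / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical labels for six posts (not data from the study)
y_true = ["positive", "negative", "neutral", "neutral", "negative", "neutral"]
y_pred = ["positive", "neutral", "neutral", "neutral", "negative", "negative"]

score = f1_macro(y_true, y_pred)
print(f"F1-macro: {score:.3f}")
```

In this toy case the per-class F1 values are 0.5 (negative), about 0.667 (neutral) and 1.0 (positive), so the macro average is about 0.722.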
About the Authors
E. V. Shchekotin
Russian Federation
Evgeniy V. Shchekotin, Cand. Sci. (Philos.), Assoc. Prof., Head of the laboratory
Novosibirsk
V. L. Goiko
Russian Federation
Vyacheslav L. Goiko, Head of the laboratory
Tomsk
P. A. Basina
Russian Federation
Polina A. Basina, Analyst
Tomsk
V. V. Bakulin
Russian Federation
Vyacheslav V. Bakulin, Analyst
Tomsk
For citations:
Shchekotin E.V., Goiko V.L., Basina P.A., Bakulin V.V. Using machine learning to study the population life quality: methodological aspects. Digital Sociology. 2022;5(1):87-97. (In Russ.) https://doi.org/10.26425/2658-347X-2022-5-1-87-97