Construction of Lorenz Curves Based on Empirical Distribution Laws of Economic Indicators

The quality of a constructed Lorenz curve depends on the features of the information used. As a rule, this information is a sample of values of the studied indicator, which is checked for unevenness. Economic indicators of income and cost, and the features of their samples, are considered. A feature of the cost indicator is highlighted: its sample contains a clot, that is, a concentration of values on a small segment of the entire sample range. It is shown that the established order of constructing empirical laws from such samples does not give the desired effect when constructing Lorenz curves, because the information content of the sample is lost in the places of the clot. The purpose of this article is to improve the quality of the Lorenz curve by increasing the information content of a sample with a clot, applying a clustering procedure when constructing the empirical law. A step-by-step clustering procedure is proposed for dividing the entire sample range into intervals to construct an empirical distribution law; this is the element of novelty of the study. A specific example shows how this procedure improves the quality of the Lorenz curve. In addition, it is shown that Lorenz curves for economic indicators can be constructed directly from the empirical distribution law while taking its features into account.


Introduction
The issue of constructing Lorenz curves is relevant; this is confirmed by their use in various fields related to economics, biology, medicine, technology, etc. Lorenz curves are of interest because they support the analysis of economic inequality, uneven distribution of income, processes of stratification of mixtures, uneven costs of research work, etc. Researchers are attracted by how clearly the curves show the deviation of the actual distribution of an indicator from the uniform one, and how this deviation changes after a control action is applied to the indicator.
There are known works on the construction of empirical laws and Lorenz curves, in which the construction was carried out in different interpretations in accordance with the features of the object of study. In general, following the classical view of the process of constructing Lorenz curves, four stages can be distinguished. At the first stage, a set of values of the unevenness indicator is formed. The features of the sample are associated with the nature of the indicator under consideration (biological, economic, technical, chemical, etc.). The distribution of the random variables over the entire sample range is analyzed. At the second stage, a histogram of the indicator (an empirical law) is constructed, on the basis of which a theoretical distribution law is selected according to a goodness-of-fit criterion. At the third stage, the Lorenz function is derived from the theoretical law; it reflects the uneven distribution of the indicator relative to the line of equal distribution. A conclusion is made about which type of Lorenz curve and which unevenness criterion to use. The fourth stage is the visual representation on a graph of the Lorenz curve, which characterizes the uneven distribution of the indicator under consideration. By varying the parameters of the Lorenz curve, the researcher interprets the results of the study and draws the necessary conclusions.
The work [1] is aimed at obtaining all possible types of Lorenz curves and automating the calculation of their parameters with the developed lorenz program. At the first stage of constructing the Lorenz curve, the program fully supports variance estimation for complex samples; at subsequent stages it estimates Lorenz curves and concentration curves for specific data. This means that calculated data are always available for Lorenz curves, and the results are displayed as a graph only when necessary. Calculated dependencies for constructing the Lorenz curve from a sample of values of a specific economic indicator are presented. This is valuable when constructing a Lorenz curve by simulation modeling, when a specific distribution law of the economic indicator is given. Then, using an algorithm for simulating random variables according to the selected distribution law, presented, for example, in [2], the calculated values of the Lorenz curve can be obtained and plotted. However, this method of constructing the Lorenz curve is quite complex and requires additional research, especially if the original sample has specific features, such as the presence of clots of random variables.
The work [3] reflects all the stages of constructing Lorenz curves. Their implementation requires a fairly representative sample of random variables, which is not always available in real conditions of economic indicator research, and time must be spent on selecting the theoretical distribution, which involves additional research. At the same time, it is necessary to check that the essence of the studied process of unevenness of economic indicators corresponds to the features of the selected distribution law. Deriving the expression for the Lorenz function itself involves quite complex transformations, during which an inaccuracy or an outright error can be made that will distort the appearance of the Lorenz curve. For example, expressions for the Lorenz function are given in [4], but only for the uniform and lognormal laws, for which the derivation does not cause much difficulty.
In [5,6], attention is paid to a quantity derived from the Lorenz curve, the Gini coefficient, and to its behavior in assessing the uneven distribution of economic indicators. Empirical income distributions of the world's countries over several years are analyzed, and it is shown that the Gini indices are centered on the value of 33.33%, which corresponds to the Gini index of the uniform distribution, and that the Lorenz curves of these distributions are consistent with the Lorenz curves of lognormal distributions. This agrees with the research in [7], where the Gini coefficient of income inequality in a rural region is 35%. In effect, these works examine the third stage of constructing the Lorenz curve in detail.
The works [8,9] present the results of studies dealing directly with Lorenz curves and with the relationship between their appearance and the original empirical law. However, nothing is said about such an economic indicator as cost or about the unevenness of its distribution. There are no examples of empirical laws of this indicator or of the corresponding Lorenz curves.
In [10], a method is given for constructing the Lorenz curve from the empirical distribution law under the assumption of uniform density within each group interval, which leads to an overestimation of the total average income. It is shown that the average absolute errors of the Gini index estimates obtained by two methods that use group averages are significantly smaller than those of the histogram-based method. This makes it possible to find the average values when dividing the histogram intervals into equal parts and to continue the construction further. However, there are no examples of constructing a Lorenz curve for samples in which the random variables are concentrated on a small interval, 8-10 times smaller than the span of the entire sample. In addition, the construction of Lorenz curves based on trigonometric functions, a new generalized Weibull-Pareto distribution, a four-parameter distribution called a non-standard generalized power distribution, and other distribution laws has been considered [11-15]. All of these works take into account the features of the source data, which are considered in specific examples. However, none of them gives examples of samples with a concentration of random variables on a small interval. Meanwhile, when constructing the Lorenz curve, different situations arise related to the information conditions of the study. One of these situations is a strong spread of the random variables. In conventional statistical practice, such values are recognized as anomalous and simply discarded, and the distribution laws are then constructed without them. An example of this situation is a sample of the cost of research and development (R&D) works taken from the R&D register for 2019 [16]. Research related to strategic studies of economic development can be allocated funds tens of times higher than the cost of conventional local research, which makes up the majority of the register.
Therefore, such situations must be taken into account when constructing the Lorenz curve. In addition, the selection of the theoretical distribution law is based primarily on an empirical histogram, whose shape and information content depend strongly on the number and length of the histogram intervals. The situation is complicated by the fact that economic indicators, for example income or the cost of household work, commonly have an asymmetric distribution and so-called clots of random variables in the sample. The term clot is used by researchers informally, so in what follows a clot is understood as a finite interval of possible values of the studied parameter that contains them with a probability close to one (for example, from 0.7 to 0.9) and that is 8-10 times smaller than the entire sample range. Constructing an empirical distribution law for a sample with such a clot is very problematic, since it is known that using large intervals when constructing an empirical law entails a loss of information [4]. This loss of information affects the quality of the Lorenz curve.
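The working definition of a clot given above can be turned into a simple check. The following is a minimal sketch, not a procedure taken from the cited works: the function name `find_clot`, the sliding-window search, and the default thresholds (`shrink=8`, `min_share=0.7`, taken from the lower ends of the ranges quoted in the definition) are assumptions made for illustration.

```python
from bisect import bisect_right

def find_clot(values, shrink=8, min_share=0.7):
    """Return (left, right, share) for the densest window whose width is
    (sample range) / shrink, or None if no such window holds at least
    min_share of the sample. Names and defaults are illustrative only."""
    xs = sorted(values)
    width = (xs[-1] - xs[0]) / shrink
    best = None
    for i, left in enumerate(xs):
        # Number of sample values inside the closed window [left, left + width].
        j = bisect_right(xs, left + width)
        share = (j - i) / len(xs)
        if best is None or share > best[2]:
            best = (left, left + width, share)
    return best if best[2] >= min_share else None

# Hypothetical sample: eight small values and two large ones.
sample = [0.5] * 8 + [10, 30]
print(find_clot(sample))
```

Only windows that start at a sample value are examined, which is enough to find the densest closed window of a given width. A sample spread evenly over its range returns `None` under these thresholds.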
Thus, there is a contradiction between the presence of samples of random variables with clots in economics and the lack of methods and procedures for preserving the information content of the sample in specific applications, for example, when constructing Lorenz curves. With this in mind, the refined goal of this study is to develop a procedure for constructing an empirical law for samples of random variables with clots that preserves the information content of the sample when constructing the Lorenz curve.
In [17-19], the formation of intervals for an empirical law is considered using the so-called binning procedure, which is associated with a decrease in the size of the histogram intervals. However, when the length of the interval is changed, there is some uncertainty in choosing the degree of decrease (or increase); in addition, these works solved a problem of non-parametric statistics, which diverted attention and increased the time needed for research. For example, Sturges' formula, which gives the dependence of the number of intervals on the sample size, is currently considered outdated for many reasons (see [20]). The formulas of Scott [17] and of Freedman and Diaconis [21] are mainly used to calculate the length of intervals: the former from the number of measurements and the sample variance, the latter from the number of measurements and the difference between the upper and lower quartiles. In [18,19,22-24], the length of intervals is calculated using the risk function, Shannon entropy, and other calculation methods. However, the length of the histogram intervals usually turns out to be the same over the entire sample range, and this leads to a loss of information in the empirical histogram exactly in the places of a clot of random variables.
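For orientation, the three classical rules mentioned above can be sketched as follows. This is an illustrative sketch using their commonly quoted textbook forms (Sturges: k = 1 + log2 n, rounded up here; Scott: h = 3.49 s n^(-1/3); Freedman-Diaconis: h = 2 IQR n^(-1/3)); the function names are chosen here, and the quartile convention is the default of Python's `statistics.quantiles`, which may differ from the one used in the cited works.

```python
import math
import statistics

def sturges_bins(n):
    """Number of intervals by Sturges' rule for a sample of size n."""
    return 1 + math.ceil(math.log2(n))

def scott_width(values):
    """Interval length by Scott's rule: sample standard deviation and size."""
    return 3.49 * statistics.stdev(values) * len(values) ** (-1 / 3)

def freedman_diaconis_width(values):
    """Interval length by the Freedman-Diaconis rule: interquartile range and size."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return 2 * (q3 - q1) * len(values) ** (-1 / 3)

# Hypothetical sample, used only to show the calls.
data = [1, 2, 3, 4, 5, 6, 7, 8]
print(sturges_bins(len(data)), scott_width(data), freedman_diaconis_width(data))
```

All three rules return a single number for the whole sample, so every interval has the same length regardless of where the values are concentrated.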

Materials and Methods
It seems that the best way out of this situation, taking into account the selected features of the known methods for constructing Lorenz curves, as well as techniques for choosing the length of intervals when constructing an empirical histogram, is to build an empirical distribution law for the initial sample and obtain the Lorenz curve itself on its basis. At the same time, an empirical histogram for sampling random variables must be constructed using the clustering procedure. This procedure involves ordering random variables in the places of the clot in relatively uniform intervals by their number. This will help better account for the clot information and improve the quality of the resulting Lorenz curve. In addition, you can adjust the degree of reduction in the size of the interval with a clot by setting how many times to reduce the interval and determining the number of clustering stages.
The algorithm for forming intervals for constructing an empirical histogram in this case is as follows.
1. Defining the first cluster: the entire set of random variables in the sample is split into the selected number of intervals. As a rule, at the first stage the length of each interval (the interval step) is chosen to be the same:
$$h = \frac{x_{\max} - x_{\min}}{N}, \quad (1)$$
where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the random variables in the sample, and $N$ is the number of intervals. Researchers usually (for small samples, up to 100 values) choose 8-10 intervals. This is most likely due to the researcher's ability to hold in working memory the number of random variables in each interval. Well-known formulas can also be used. Thus, $N$ intervals are formed in the first cluster, $N_{1kl} = N$.
2. If the sample of random variables has a clot, additional clusters must be formed. If the clot falls in one interval, two clusters are formed: a cluster that covers the interval with the clot, and a cluster that includes all the other intervals (the first cluster has one interval, and the second has $N-1$ intervals). The selected interval is divided into $n_1$ subintervals (parts), depending on the number of random variables that fall within it. The length of each subinterval, provided they are equal, is calculated using the expression
$$h_1 = \frac{h}{n_1}. \quad (2)$$
Thus, two clusters are formed, with $n_1$ intervals in the first and $N-1$ intervals in the second.
3. The total number of intervals of the entire set of random variables in the sample is determined, taking into account the two clusters formed:
$$N_{2kl} = n_1 + (N - 1). \quad (3)$$
4. If the problem of removing the clot is not solved in the first cluster, a third cluster is formed. The subinterval of the first cluster that contains the clot is divided into $n_2$ parts, depending on whether the clot is removed or not. The length of each part is determined by the expression
$$h_2 = \frac{h_1}{n_2}. \quad (4)$$
Three clusters are formed, with $n_2$ intervals in the first, $(n_1 - 1)$ in the second, and $(N - 1)$ in the third.
5. The total number of intervals of the entire set of random variables for three clusters is determined:
$$N_{3kl} = n_2 + (n_1 - 1) + (N - 1). \quad (5)$$
Thus, a two-stage procedure of clustering intervals is implemented: at the zero level, a single cluster containing $N$ intervals is formed; at the first stage, two clusters are formed, with $n_1$ subintervals (parts) in the first and $(N-1)$ intervals in the second; at the second stage, three clusters are formed, with $n_2$ parts in the first, $(n_1 - 1)$ in the second, and $(N-1)$ in the third.
As can be seen from the above algorithm, to smooth out clots of random variables, the range of the entire sample of random variables is divided into intervals of different lengths, reflected by the dependencies (1), (2), (4). This algorithm can be used locally to eliminate several clots of random variables in complex samples used to construct Lorenz curves.
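The staged splitting described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the interval to be split at each stage is taken to be the one with the most sample values (a simple stand-in for "the interval with the clot"), and the sample in the usage lines is invented so that its range is 0 to 30 with most values below 3.

```python
from bisect import bisect_right

def equal_edges(x_min, x_max, n):
    """Edges of n equal intervals on [x_min, x_max] (uniform step)."""
    step = (x_max - x_min) / n
    return [x_min + i * step for i in range(n)] + [x_max]

def split_interval(edges, index, parts):
    """Replace interval `index` by `parts` equal subintervals."""
    inner = equal_edges(edges[index], edges[index + 1], parts)
    return edges[:index] + inner + edges[index + 2:]

def densest_interval(values, edges):
    """Index of the interval holding the most values (assumed clot location)."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        # The last interval is closed on the right so that x_max is counted.
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    return counts.index(max(counts))

def clustered_edges(values, n, parts_per_stage):
    """Zero stage: n equal intervals. Each later stage splits the densest
    interval into the requested number of equal parts."""
    edges = equal_edges(min(values), max(values), n)
    for parts in parts_per_stage:
        edges = split_interval(edges, densest_interval(values, edges), parts)
    return edges

# Hypothetical sample with most values concentrated near zero.
sample = [0.0, 0.1, 0.2, 0.3, 0.5, 0.9, 1.5, 2.5, 7, 12, 20, 30]
edges = clustered_edges(sample, n=10, parts_per_stage=[3, 3])
print(len(edges) - 1, edges[:6])
```

With N = 10 and three parts at each of two stages, the interval count grows from 10 to 12 to 14, and the step near the clot shrinks from 3 to 1 to about 0.33, while the intervals away from the clot keep their original length.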
That the Lorenz curve constructed from the empirical distribution law is practically no different from the Lorenz curve constructed from the theoretical law can be illustrated by constructing the Lorenz curve for one of the variants of the forecast distribution of per capita income of Russia's population until 2025 (see example 1). The construction of the Lorenz curve for a sample containing a clot using the clustering procedure can be shown on the source data from the R&D register for 2019 (see example 2).
Expressions for the Lorenz curve function. It is known [3-5] that the Lorenz curve for the continuous case has the following form:
$$L(\alpha) = \frac{\int_0^{x_\alpha} x f(x)\,dx}{\int_0^{\infty} x f(x)\,dx}, \quad (6)$$
where $x_\alpha$ is the $\alpha$ quantile of a random variable $X$, $f(x)$ is the probability density of the distribution of $X$, $\alpha = F(x_\alpha)$, and $F(x_\alpha)$ is the distribution function of $X$.
For the discrete case corresponding to an empirical law,
$$L(\alpha) = \frac{\sum_{j=1}^{i} x_j p_j}{\sum_{j=1}^{N} x_j p_j}, \quad (7)$$
where $N$ is the number of intervals of the empirical histogram of the random variable $X$, $x_j$ is the mathematical expectation of the random variable for the $j$-th interval of the histogram, $p_j$ is the probability that the random variable falls in the $j$-th interval, $\alpha = F(x_{\alpha i})$ is the value of the distribution function at the point $x_{\alpha i}$, and $x_{\alpha i}$ is the $\alpha$ quantile of $X$ at the right border of the $i$-th interval.
Expressions (6) and (7) show that the value of the Lorenz curves at point x α is the ratio of the mathematical expectation of random variables for the interval [0, x α ] to the mathematical expectation of the entire set of random variables that the empirical histogram is based on.
The order of construction of the Lorenz curve according to the empirical distribution law is as follows. 1. Select a set of source data, check their reliability, representativeness of the sample, and so on. 2. Intervals are formed for constructing an empirical histogram, parameters are calculated for it, and the histogram itself is constructed. If there is a strong concentration of random variables in the places of clots, the clustering procedure is used when constructing an empirical histogram to increase the information content of the sample and the quality of the Lorenz curve graph. 3. The right boundary points of the intervals are fixed, and the values of the points of the Lorenz curve are calculated for them using expression (7). The Lorenz curve is plotted from the calculated points.
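Step 3 of this order can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names are chosen here, the interval midpoint is assumed to stand in for the mean of each interval, and the Gini estimate by the trapezoid rule is added only as a convenient summary of the resulting points. The edges do not have to be equally spaced.

```python
from itertools import accumulate

def lorenz_points(edges, counts):
    """Points (alpha, L) of the empirical Lorenz curve at the right boundary
    of each histogram interval, using midpoints as interval means."""
    total = sum(counts)
    mids = [(a + b) / 2 for a, b in zip(edges, edges[1:])]
    probs = [c / total for c in counts]
    cum_p = list(accumulate(probs))                               # alpha values
    cum_xp = list(accumulate(m * p for m, p in zip(mids, probs)))  # partial means
    mean = cum_xp[-1]
    return [(0.0, 0.0)] + [(a, s / mean) for a, s in zip(cum_p, cum_xp)]

def gini(points):
    """Gini index as 1 minus twice the area under the piecewise-linear curve."""
    area2 = sum((a1 - a0) * (l0 + l1)
                for (a0, l0), (a1, l1) in zip(points, points[1:]))
    return 1 - area2

# Hypothetical two-interval histogram with equal counts.
pts = lorenz_points([0, 1, 2], [1, 1])
print(pts)        # [(0.0, 0.0), (0.5, 0.25), (1.0, 1.0)]
print(gini(pts))  # 0.25
```

Because the function only needs interval edges and counts, it can be applied unchanged to a histogram whose intervals were refined in the place of a clot.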
Adhering to this order of construction, note that it is not necessary to compute the average value for each histogram interval from the available sample; the group average, i.e. the middle of the interval, can be used. This assumes that the values in the interval are evenly distributed. In [10] it is shown that, for the construction of Lorenz curves, this assumption improves the overall smoothness of the curve.
Example 1. A variant of the forecast distribution of the Russian population by average per capita money income until 2025 was obtained from the original data [25] (see table 1). It is required to construct an empirical histogram of the income distribution, choose a theoretical distribution law, construct the corresponding Lorenz curves, and draw a conclusion about the identity of the curves. In Fig. 1, an empirical histogram is constructed from the initial data, and a theoretical distribution law, the lognormal law, is selected. A characteristic feature of economic indicators, the asymmetry of the distribution, is reflected.
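For the lognormal law selected in Example 1, the parameters can be obtained from a sample mean and variance by the method of moments, and the Lorenz curve of a lognormal law has the known closed form L(α) = Φ(Φ⁻¹(α) − σ). The sketch below is illustrative: the function names are chosen here, and the inputs 3.736 and 4.3933 are the sample moments quoted for table 1 in this example, assumed to be the sample mean and variance.

```python
import math
from statistics import NormalDist

def lognormal_params(mean, variance):
    """Method-of-moments estimates (mu, sigma^2) of a lognormal law."""
    sigma2 = math.log(1 + variance / mean ** 2)
    mu = math.log(mean) - sigma2 / 2
    return mu, sigma2

def lognormal_lorenz(alpha, sigma2):
    """Lorenz curve of a lognormal law for 0 < alpha < 1."""
    z = NormalDist()  # standard normal distribution
    return z.cdf(z.inv_cdf(alpha) - math.sqrt(sigma2))

mu, sigma2 = lognormal_params(3.736, 4.3933)
print(mu, sigma2)
print([lognormal_lorenz(a, sigma2) for a in (0.2, 0.5, 0.8)])
```

The computed parameters agree, up to rounding, with the values μ = 1.1812 and σ² = 0.2737 quoted in this example. The Lorenz curve depends only on σ, not on μ, so the curve is fixed once the variance of the logarithm is known.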
The lognormal law is represented as:
$$f(x) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right), \quad x > 0.$$
In accordance with table 1, the sample mean is $\mu^* = 3.736$ and the sample variance is $\sigma^{*2} = 4.3933$; then $\mu = 1.1812$ and $\sigma^2 = 0.2737$. It can be seen, even without checking by a goodness-of-fit criterion, that the approximation is quite acceptable. Using dependencies (6) and (7)
In the first approximation, the empirical law can be approximated by an exponential distribution. However, this is highly debatable because of the significant clot of random variables, and more research is needed. Figure 2 shows that the Lorenz curve corresponding to an empirical histogram with a significant clot of random variables in the interval [0; 3] does not fully describe the uneven distribution of the R&D cost for 2019: the information contained in the place of the clot is not taken into account. To take this information into account, the two-stage clustering algorithm described above is used. First, the zero clustering stage is implemented (one cluster with 10 intervals); it does not reveal the information content of the sample because the clot in the first interval is too large, i.e. information is lost. The subsequent clustering stages are used to extract this information. The zero-stage cluster is transformed: the step of the first interval changes from 3 million rubles to 1 million rubles. There are now 12 intervals instead of 10 (3 in the first cluster and 9 in the second). Thus, the first stage of clustering is implemented. Then the first cluster of the first stage is transformed: the step of its first interval changes from 1 million rubles to 0.33 million rubles. There are now 14 intervals instead of 12 (3 in the first cluster, 2 in the second, and 9 in the third). By setting the number of intervals at each stage of clustering, one can control the degree of reduction in the length of the histogram intervals at the location of the clot, which is the distinctive feature of the proposed algorithm.
Table 4 shows the results of intermediate calculations of the first and second stages of clustering for constructing the Lorenz curve. Based on them, figure 3 shows the empirical distribution law and the corresponding Lorenz curve. It is clear that the situation has improved: the Lorenz curve describes the unevenness of the R&D cost distribution much better.

Conclusions
In this research on the construction of Lorenz curves from the empirical distribution law of economic indicators, the following were used: methods of descriptive statistics, the method of histograms, the data processing technique of binning (which converts the initial values into ranges with small intervals, or bins), and methods of clustering theory. The construction of empirical distribution laws for economic indicators takes into account the features associated with asymmetry and the presence of clots of random variables in the initial sample. Features of random variable samples are currently observed and discussed, for example, in [26]. It is shown that when constructing Lorenz curves, it is possible to avoid the laborious procedure of selecting the theoretical distribution law and to obtain Lorenz curves for complex samples with clots of random variables directly from the empirical distribution laws.
Thus, we propose a rather original method for constructing the Lorenz curve, which significantly expands the range of applied problems for estimating the uneven distribution of various economic indicators. Specific examples show that the main features of the empirical laws of distribution of economic indicators are taken into account; the asymmetry is reflected in example 1, and the presence of clots in the sample under study in example 2.
The novelty of the research should be emphasized: the Lorenz curve is constructed from the empirical distribution using the developed clustering procedure, which makes it possible to vary the length of the histogram intervals and significantly extends the application of Lorenz curves to samples of random variables with clots. At the same time, it is not necessary to carry out labor-intensive work on selecting the theoretical distribution law and deriving a specific expression for the Lorenz function.