Choice of Strata Boundaries for Allocation Proportional to Stratum Cluster Totals in Stratified Cluster Sampling

In survey planning, cluster sampling is sometimes unavoidable because of the spatial relationship between the elements of a population, the physical features of the land over which the elements are dispersed, or the unavailability of a reliable list of elements. At the same time, a technique and strategy are required to ensure the precision of the sample in representing the parent population. Although several theoretical and practical works have been done on cluster sampling, stratified sampling and stratified cluster sampling, the problem of stratified cluster sampling for a study variable based on an auxiliary variable, which is required in practice, has so far not been approached. This paper deals, for the first time, with the problem of optimum stratification of a population of clusters, in cluster sampling with clusters of equal size, for a characteristic y under study based on a highly correlated concomitant variable x, for allocation proportional to stratum cluster totals under a superpopulation model. Equations giving optimum strata boundaries (OSB) for dividing a population whose sampling unit is a cluster are obtained by minimising the sampling variance of the estimator of the population mean. As the equations are implicit in nature, a few methods of finding approximately optimum strata boundaries (AOSB) are deduced from the equations giving OSB. In deriving the equations, mathematical tools of calculus and algebra are used in addition to statistical methods of finding the conditional expectation of variance. All the proposed methods of stratification are empirically examined on live data, the population of villages in Lunglei and Serchhip districts of Mizoram State, India, and are found to perform efficiently in stratifying the population. The proposed methods may provide a practically feasible solution in planning socio-economic surveys.


Introduction
In stratified sampling, a heterogeneous population is divided into a number of groups called strata, which are internally homogeneous, and a sample is selected from each stratum using a suitable sample selection method; the method is used for administrative convenience and for enhancing the precision with which the sample represents the parent population. On the other hand, when a reliable list of elements (units) of the population is difficult to obtain, or the elements are spatially dispersed in such a way that surveying the elements selected by simple random sampling requires a great deal of energy, time and cost, cluster sampling or area sampling is employed: contiguous elements, or elements that can be conveniently surveyed together without much extra effort, are grouped into clusters, and the clusters are then taken as the sampling units of the population. The strategy used in cluster sampling for enhancing its precision is to make the population within each cluster as heterogeneous as possible and to increase inter-cluster homogeneity as much as possible. The formation of clusters primarily depends on the spatial relationships between elements in terms of geographical contiguity, good connectivity and convenience in surveying together, with less energy, resources and time spent surveying the elements within a cluster, in addition to scheming for increased intra-cluster heterogeneity and inter-cluster homogeneity. When the clusters are considered as the sampling units of the population and then stratified by the methods of stratified sampling, inter-cluster homogeneity is increased within the strata of clusters, which in turn serves the purpose of the scheming in cluster sampling, namely enhancing the precision with which the sample represents the parent population.
In stratified sampling, ever since Dalenius [1] introduced the problem of finding optimum strata boundaries (OSB) based on the Tschuprow [2] and Neyman [3] optimum allocation (TNOA) for enhancing homogeneity within strata, research in the area has grown steadily. A number of researchers, inter alia, Dalenius and Gurney [4], Mahalanobis [5], Hansen et al. [6], Dalenius and Hodges [7,8] and Ekman [9], took up the work, initially using the study variable as the stratification variable. As the use of the study variable as the stratification variable is unrealistic, many workers, mostly in later years, extended the work of finding OSB and approximately optimum strata boundaries (AOSB) by using an auxiliary variable that is highly correlated with the study variable. Dalenius [10], Taga [11], Singh and Sukhatme [12], Singh [13][14][15][16], Singh and Prakash [17] and Yadava and Singh [18], to mention a few among many, worked on the problem of finding OSB and AOSB based on an auxiliary variable for various allocations under different sampling designs. The problem of optimum stratification was again considered from the perspective of more than one study variable by, inter alia, Ghosh [19], Gupta and Seth [20] and Rizvi et al. [21,22], whereas Danish and Rizvi [23] approached the problem from the perspective of two auxiliary variables with one study variable.
It is pertinent to mention that, in the development of allocation of sample size to strata in stratified sampling, after Tschuprow [2] and Neyman [3] proposed TNOA based on the study variable, it was Hanurav [24] and Rao [25] who introduced the use of an auxiliary variable under a superpopulation model considered by them. Gupt and Rao [26] obtained an allocation of sample size to strata for probability proportional to size under the superpopulation model. Gupt [27,28] modified the aforesaid superpopulation model into a more general form and hence obtained a few generalised model-based allocations; Gupt and Ahamed [29,30] obtained a few methods of stratification for some of the generalised model-based allocations under simple random sampling with and without replacement (SRSWR and SRSWOR), in the form of equations giving OSB and solutions to the equations giving AOSB. Gupt et al. [31] also obtained methods of stratification giving OSB and AOSB for the auxiliary variable optimum allocation obtained by Hanurav [24].
In the area of stratified cluster sampling, Mehta and Mandowara [32] considered the problem of finding OSB and AOSB in stratifying a population based on the study variable for TNOA, proportional and equal allocation under the SRSWOR design.
For the first time, we have introduced in this paper the problem of optimum stratification for a characteristic under study based on a highly correlated auxiliary variable in stratified cluster sampling with clusters of equal size under the following superpopulation model which is a modified form of the model used by Hanurav [24] and Rao [25].
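where the parameters of model (1), among them $\sigma^2$ with $\sigma^2 > 0$, are the superpopulation parameters, and the script letters $\mathcal{E}$, $\mathcal{V}$ and $\mathcal{C}$ denote conditional expectation, variance and covariance given the $x$'s, respectively.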
The crux of this paper is to address simultaneously the unavoidable conditions of the spatial relationship of elements that lead to the use of cluster sampling, and the scheming for increased inter-cluster homogeneity and intra-cluster heterogeneity that raises the precision of the sampling.
We use information on the auxiliary variable, which is highly correlated with the study variable, to stratify a population whose units are clusters, the clusters being formed by grouping the elements in the way discussed above. The allocation and sample selection procedure used in this work are allocation proportional to stratum total and SRSWR; the results also hold for SRSWOR when the finite population correction is neglected.
The paper has six sections. Section 2 deals with obtaining the conditional expectation of the population variance between cluster means. In Section 3, the derivation of equations giving OSB is presented. In Section 4, a few methods of finding AOSB are presented. In Section 5, all the proposed methods of stratification are illustrated empirically on live data and the results are discussed. Section 6 gives the conclusion.

Expression for Conditional Expectation of Population Variance between Cluster Means
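Consider a population consisting of $N$ clusters of $M$ elements each, from which a sample of $n$ clusters is to be selected by SRSWR. Let $y_{ij}$ be the value of the characteristic under study for the $j$th element in the $i$th cluster, $j = 1, 2, \ldots, M$; $i = 1, 2, \ldots, N$. Then the mean square between the cluster means is
$$S_b^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot}\right)^2,$$
where $\bar{y}_{i\cdot}$ and $\bar{y}_{\cdot\cdot}$ are the mean of the $i$th cluster and the mean of the cluster means. $S_b^2$ can again be expressed as
$$S_b^2 = \frac{1}{M^2}\,S_c^2, \qquad (2)$$
where $S_c^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(y_{i\cdot} - \bar{y}_c\right)^2$ is the mean square of the cluster totals $y_{i\cdot} = \sum_{j=1}^{M} y_{ij}$ and $\bar{y}_c$ is the mean of the cluster totals. Taking the conditional expectation of (2) given the $x$'s gives (3). Using (1) in (3) and simplifying, we get (4).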
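The relation between the mean square of cluster means and the mean square of cluster totals can be checked numerically. The following is a minimal sketch, assuming the divisor N - 1 for a mean square and clusters of equal size; the function names and the toy population are hypothetical and are not taken from the paper's data.

```python
from statistics import mean


def mean_square(values):
    """Mean square about the mean of `values`, with divisor (N - 1) (assumed)."""
    centre = mean(values)
    return sum((v - centre) ** 2 for v in values) / (len(values) - 1)


def between_cluster_mean_squares(clusters):
    """Return (mean square of cluster means, mean square of cluster totals).

    `clusters` is a list of equally sized lists of element values.
    """
    sizes = {len(c) for c in clusters}
    if len(sizes) != 1:
        raise ValueError("all clusters must have the same size")
    cluster_means = [mean(c) for c in clusters]
    cluster_totals = [sum(c) for c in clusters]
    return mean_square(cluster_means), mean_square(cluster_totals)


# Hypothetical toy population: 4 clusters of 2 elements each.
toy_clusters = [[10.0, 14.0], [7.0, 9.0], [20.0, 16.0], [5.0, 11.0]]
ms_means, ms_totals = between_cluster_mean_squares(toy_clusters)
cluster_size = len(toy_clusters[0])

# The mean square of the means equals that of the totals divided by
# the squared cluster size.
print(ms_means, ms_totals / cluster_size ** 2)
```

Because each cluster mean is the cluster total divided by the common cluster size, the two printed values agree up to rounding.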

Derivation of Methods of Finding Optimum Strata Boundaries
The conditional expectation of $S_b^2$ given in (4) can be expressed as (5), in terms of the mean square of the cluster totals and the mean of the cluster totals of the population for the $x$ variable. From this step onwards, we take $y$ and $x$ to denote the study variable and the auxiliary variable, respectively, with cluster totals as the units of the population; (5) can then be rewritten as (6). For stratification, we divide the population of $N$ units into $L$ strata such that $\sum_{h=1}^{L} N_h = N$, and a sample of size $n_h$ is selected by SRSWR from the $h$th stratum of size $N_h$ such that $\sum_{h=1}^{L} n_h = n$. The sampling variance of the estimator of the population mean in stratified sampling for the study variable is
$$V(\bar{y}_{st}) = \sum_{h=1}^{L} \frac{W_h^2\,\sigma_{hy}^2}{n_h}, \qquad (7)$$
where $W_h = N_h/N$ is the weight of the $h$th stratum and $\sigma_{hy}^2$ is the variance of $y$ in the $h$th stratum.
Since allocation proportional to the stratum total of the auxiliary variable is considered, we have
$$n_h = n\,\frac{W_h\,\mu_{hx}}{\mu_x}, \qquad (8)$$
where $\mu_{hx}$ is the mean of the $h$th stratum and $\mu_x$ is the population mean of the $x$ variable. From (6), (7) and (8), we get (9). Minimising this variance with respect to the strata boundaries, considering the first term of (11) and using (10), we obtain (12). Similarly, the second term of (11) is obtained as (13). From (11), (12) and (13), we get (14). Equations (14) give OSB for the estimation variable $y$ based on the auxiliary variable $x$. We call (14) the exact equations.
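To make the allocation proportional to stratum totals concrete, the following minimal sketch computes the stratum sample sizes and the resulting variance of the stratified mean under SRSWR for a given set of boundaries. It assumes that the stratum sample size is the total sample size times the stratum's share of the total of the auxiliary variable, that the variance is the usual sum of squared stratum weights times stratum variances divided by stratum sample sizes, and that sample sizes may be fractional. All function names and the toy data are hypothetical.

```python
from statistics import pvariance


def split_into_strata(x_values, y_values, boundaries):
    """Group (x, y) pairs by stratum; a unit's stratum index is the number
    of interior boundaries that its x value exceeds."""
    strata = [[] for _ in range(len(boundaries) + 1)]
    for x, y in zip(x_values, y_values):
        strata[sum(x > b for b in boundaries)].append((x, y))
    return [s for s in strata if s]  # drop empty strata


def allocation_prop_to_totals(x_values, y_values, boundaries, n):
    """Stratum sample sizes proportional to the stratum totals of x
    (left fractional; rounding is a separate practical step)."""
    strata = split_into_strata(x_values, y_values, boundaries)
    total_x = sum(x_values)
    return [n * sum(x for x, _ in s) / total_x for s in strata]


def variance_prop_to_totals(x_values, y_values, boundaries, n):
    """Variance of the stratified mean of y under SRSWR with the
    allocation above: sum over strata of W_h^2 * var_h(y) / n_h."""
    strata = split_into_strata(x_values, y_values, boundaries)
    allocation = allocation_prop_to_totals(x_values, y_values, boundaries, n)
    size = len(x_values)
    return sum(
        (len(s) / size) ** 2 * pvariance([y for _, y in s]) / n_h
        for s, n_h in zip(strata, allocation)
    )


# Hypothetical toy data: x is the auxiliary variable, y the study variable.
toy_x = [1, 2, 3, 4, 10, 12, 14, 20]
toy_y = [2, 3, 5, 6, 15, 19, 22, 31]
toy_allocation = allocation_prop_to_totals(toy_x, toy_y, [5.0], n=4)
toy_variance = variance_prop_to_totals(toy_x, toy_y, [5.0], n=4)
print(toy_allocation, toy_variance)
```

Evaluating `variance_prop_to_totals` for alternative sets of boundaries gives a direct way to compare stratifications of the same population.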

Approximation Based on Series Expansion
Since the exact equations (14) are implicit, i.e., they involve parameters which are themselves functions of the OSB, they are difficult to solve for the OSB when stratifying populations. Therefore, in this section, we derive solutions of equations (14) which give AOSB. Singh and Sukhatme [12] developed a technique that uses Ekman's [33] identity to obtain series expansions of the conditional mean and variance. Gupt and Ahamed [29,30] and Gupt et al. [31] used the technique to obtain series expansions of the conditional mean and variance for some functions. The same technique is used in this paper, for which we assume the continuity and the existence of the first three partial derivatives of the function concerned with respect to $x$, for all $x \in (x_{h-1}, x_{h+1})$ and all values of $h$. To expand the right hand side of equations (14) about the point $x_h$, we work with the stratum width $x_{h+1} - x_h$, and all the derivatives are evaluated at $x_h$. The series expansions of the conditional mean and variance are thus obtained as (15) and (16). From (15) we get (17), and from (15) and (16) we get (18). Using (15), (17) and (18) in the exact equations (14), the right hand side of equations (14) can be rewritten as a series expansion; similarly, the left hand side of equations (14) can be obtained. Equations (14) therefore reduce to (19), in which $x$ is written in place of the variable $x_h$. Both sides are then raised to the power $3/2$ and expanded by the binomial theorem. The identity proposed by Singh and Sukhatme [12] and used by, inter alia, Gupt and Ahamed [29,30] and Gupt et al. [31] is given in (20). Using the identity (20) in (19), and since the equality holds for all the strata $h = 1, 2, 3, \ldots, L$, we get (21) and (22). The AOSB corresponding to the exact equations (14) are given by the two equivalent methods (21) and (22).
The constants $C_1$ and $C_2$ can be approximately evaluated from integrals over $(a, b)$, the second of which involves the cube root of the function, where $b$ and $a$ are assumed to be the upper and lower bounds of the points of stratification $x_h$, i.e., $a \le x_h \le b$. We can use (21) and (22) in finding AOSB, i.e., the $x_h$'s, by fixing the lower boundary $x_{h-1}$ every time.
Thus, the above analysis leads to the following theorem.
Theorem 1: If the function is bounded and possesses its first two partial derivatives for all values of $x$ in $(a, b)$, then, for a given number of strata, taking equal intervals on the cumulative of the cube root of the function gives AOSB.
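The rule stated in Theorem 1 can be sketched numerically: integrate the cube root of a non-negative function over the range, then place the boundaries where the cumulative integral reaches equal fractions of its total. The sketch below is a generic illustration under stated assumptions. The argument `func` stands in for the function of the theorem, whose exact form is not reproduced here, and the trapezoidal rule, the grid size and `example_func` are hypothetical choices, not the authors' procedure.

```python
from math import exp


def cum_cube_root_boundaries(func, lower, upper, n_strata, grid_size=2000):
    """Boundaries giving equal intervals on the cumulative of the cube
    root of `func` over [lower, upper], using the trapezoidal rule."""
    step = (upper - lower) / grid_size
    xs = [lower + i * step for i in range(grid_size + 1)]
    values = [func(x) for x in xs]
    if any(v < 0 for v in values):
        raise ValueError("func must be non-negative on the range")
    roots = [v ** (1.0 / 3.0) for v in values]

    # Cumulative integral of the cube root at each grid point.
    cumulative = [0.0]
    for i in range(1, len(xs)):
        cumulative.append(
            cumulative[-1] + 0.5 * (roots[i - 1] + roots[i]) * step
        )
    total = cumulative[-1]

    boundaries = []
    i = 1
    for k in range(1, n_strata):
        target = k * total / n_strata
        while cumulative[i] < target:
            i += 1
        # Linear interpolation inside the grid cell that contains the target.
        low_c, high_c = cumulative[i - 1], cumulative[i]
        frac = (target - low_c) / (high_c - low_c) if high_c > low_c else 0.0
        boundaries.append(xs[i - 1] + frac * step)
    return boundaries


def example_func(x):
    """Hypothetical stand-in for the function of the theorem."""
    return x * exp(-x)


print(cum_cube_root_boundaries(example_func, 0.0, 8.0, n_strata=4))
```

For a constant function the rule reduces to equal interval stratification, which gives a quick sanity check of the implementation.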

Other Approximations Using Assumptions on Coefficient of Variation
We now deduce some further methods of approximation from the exact equations (14); these approximation methods are still implicit but easy to use. We proceed as follows. Equations (14) can be rewritten as (23), where $C_h^2$ and $C_{(h+1)}^2$ are the squares of the coefficients of variation of the $h$th and $(h+1)$th strata.
If we consider the squares of the coefficients of variation to be approximately equal for every two successive strata, i.e., $C_h^2 \approx C_{(h+1)}^2$, and approximately equal to the arithmetic mean of the squares of the coefficients of variation of the two consecutive strata, then AOSB are given by (24), where $\bar{C}^2_{\left(h+\frac{1}{2}\right)} = \frac{1}{2}\left(C_h^2 + C_{(h+1)}^2\right)$. If we consider the squares of the coefficients of variation for every two successive strata to be approximately equal to the geometric mean of the squares of the coefficients of variation of the two consecutive strata, AOSB are given by (25), where $\tilde{C}^2_{\left(h+\frac{1}{2}\right)} = C_h\,C_{(h+1)}$. If we consider the squares of the coefficients of variation to be negligibly small relative to unity, AOSB are given by (26), i.e., by the geometric mean of the means of consecutive strata. Thus, we have obtained (24), (25) and (26) as approximations to equations (14) that give AOSB.
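Because these approximations remain implicit, they have to be solved iteratively. The following minimal sketch does this for method (26), which the paper describes as giving each boundary by the geometric mean of the means of consecutive strata. The fixed-point scheme, the equal-interval starting boundaries and the stopping rule are assumptions made for illustration; the paper does not specify its iteration. The values of the stratification variable are assumed positive.

```python
from math import sqrt
from statistics import mean


def geometric_mean_boundaries(x_values, n_strata, max_iter=100, tol=1e-9):
    """Iterate boundary_h = sqrt(mean_h * mean_(h+1)) until the boundaries
    stop changing; starts from equal-interval boundaries (an assumption)."""
    if n_strata < 2:
        raise ValueError("at least two strata are needed")
    low, high = min(x_values), max(x_values)
    width = (high - low) / n_strata
    boundaries = [low + k * width for k in range(1, n_strata)]

    for _ in range(max_iter):
        # Assign each unit to the stratum whose range contains its x value.
        groups = [[] for _ in range(n_strata)]
        for x in x_values:
            groups[sum(x > b for b in boundaries)].append(x)
        if any(not g for g in groups):
            raise ValueError("a stratum became empty; try fewer strata")

        means = [mean(g) for g in groups]
        updated = [
            sqrt(means[h] * means[h + 1]) for h in range(n_strata - 1)
        ]
        if max(abs(u - b) for u, b in zip(updated, boundaries)) < tol:
            return updated
        boundaries = updated
    return boundaries


# Hypothetical toy values of the stratification variable.
toy_values = [1.0, 2.0, 3.0, 20.0, 22.0, 24.0]
toy_boundaries = geometric_mean_boundaries(toy_values, n_strata=2)
print(toy_boundaries)
```

The same loop can serve methods (24) and (25) once their update formulas are substituted for the geometric mean of the stratum means.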

Empirical Illustration
The proposed exact equations (14) and methods of approximation (22), (24), (25) and (26) are illustrated by stratifying the population of two districts of Mizoram, a state of India, namely Lunglei and Serchhip, in which villages are taken as the elements of a cluster. The data on villages are taken from Village Profile & Development Indicators [34]. There are 193 villages in the two districts. We take the number of households of a village as the study variable y and the population of the village as the auxiliary variable x. The correlation between the study variable and the stratification variable is sufficiently high, namely 0.9604. The data are shown in Appendix I.
Mizoram is a hilly state of India; 88.93% of its total geographical area is covered by hilly forests, and its villages and towns are spread over the hilly terrain. There are rolling hills, tiny valleys, rivers and lakes in the state. Villages are connected mostly by minor hill roads and by a few major ones. In many cases, the geographical distance between two villages may be short, but they are separated by rivers, lakes, streams and swamps in the thick rainforest. The road connecting the two villages may therefore be extremely long, requiring a great deal of energy and time to travel from one to the other. Considering the physical features of the land and the pattern of distribution of villages, stratified cluster sampling may be an appropriate sampling design in survey planning. We therefore use Google Earth Pro and a Geographic Information System to locate the villages, rivers, minor roads, major roads, hill ranges and altitudes while forming the clusters. The clusters are formed not only by combining the villages connected by the shortest roads but also by taking into account other constraints such as variation in altitude and separation by rivers, lakes and thick forest cover. The formation of clusters is shown in Appendix II.
In illustrating the methods of approximation (21) and (22), since we have shown theoretically that the two methods are equivalent, we use (22) for convenience. To use (22), we require the probability density function (PDF) that the auxiliary variable follows. For fitting a suitable distribution, we use the data on the variable x with each value divided by 1000. We have two sets of data: one in which each cluster is made of two villages, i.e., cluster size 2, and the other in which each cluster is made of three villages, i.e., cluster size 3. We try to find the most suitable PDF that the variable x of the live data follows. The fitdistrplus package in R is used to fit a number of known PDFs to the data on the variable x of both populations, using Maximum Likelihood Estimation (MLE), Moment Matching Estimation (MME) and Quantile Matching Estimation (QME) one after another.
Of all the PDFs we have tried to fit to the data, the Gamma probability density function (GPDF) is found to fit the data of both populations best; the decision on the best fit is made by considering simultaneously the values of LL (log likelihood), AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion) and the standard errors (s.e.) of the parameters.
Thus, the PDF followed by the variable x is the GPDF (27), whose two parameters are positive and which is defined for all x ∈ (0, ∞). The two populations are characterised by the estimates of the shape parameter and of the second parameter obtained in fitting the GPDF to the data for cluster size 2 and for cluster size 3. We use the above PDF (27), along with the estimates of the parameters, in illustrating approximation method (22) in stratifying both the populations. Numerical integration and differentiation methods are used in working out the approximation method in stratifying the populations.
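The fitting and model comparison described above were carried out with the fitdistrplus package in R. Purely as an illustration of the quantities involved, the following minimal sketch fits a gamma distribution by moment matching and computes the log likelihood, AIC and BIC. It assumes the shape and rate parameterisation and the population (divisor n) variance; the function names and the toy data are hypothetical, and the sketch does not reproduce the paper's estimates.

```python
from math import lgamma, log
from statistics import mean, pvariance


def fit_gamma_moments(data):
    """Moment-matching estimates (shape, rate) of a gamma distribution."""
    m, v = mean(data), pvariance(data)
    return m * m / v, m / v


def gamma_log_likelihood(data, shape, rate):
    """Log likelihood of positive `data` under a gamma(shape, rate) model."""
    return sum(
        shape * log(rate) - lgamma(shape) + (shape - 1) * log(x) - rate * x
        for x in data
    )


def information_criteria(log_lik, n_params, n_obs):
    """Return (AIC, BIC) for a fitted model."""
    aic = 2 * n_params - 2 * log_lik
    bic = n_params * log(n_obs) - 2 * log_lik
    return aic, bic


# Hypothetical toy values of the auxiliary variable (already rescaled).
toy_data = [0.5, 1.0, 1.5, 2.0, 3.0]
shape_hat, rate_hat = fit_gamma_moments(toy_data)
log_lik = gamma_log_likelihood(toy_data, shape_hat, rate_hat)
aic, bic = information_criteria(log_lik, n_params=2, n_obs=len(toy_data))
print(shape_hat, rate_hat, log_lik, aic, bic)
```

Repeating the last three steps for each candidate distribution and comparing the resulting LL, AIC and BIC values mirrors the selection logic described in the text.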
We examine the performance of all the proposed methods of stratification in the stratified cluster sampling design in the two sampling frames. We first illustrate the methods on the population in which each cluster is formed by combining two villages, and then on the population in which each cluster is formed by combining three villages; the results are shown in Tables 1, 2, 3 and 4. For the population of clusters of size two, we present the points of stratification due to all the proposed methods of stratification in Table 1 and the variances and relative efficiencies due to the methods in Table 2. Similarly, for the population of clusters of size three, we present the corresponding results in Tables 3 and 4. Each of the two populations is stratified into two, three, four, five and six strata by using all the proposed stratification methods and equal interval stratification. The efficiencies of the proposed methods are compared with that of equal interval stratification in both the populations for each number of strata considered, L = 2, 3, 4, 5 and 6. In the population of clusters of size two, the exact equations (14) perform with higher efficiencies at L = 2 and 5, and with much higher efficiencies at L = 3, 4 and 6, than equal interval stratification. The approximation methods (22), (24), (25) and (26) perform in the same way when compared with equal interval stratification. Approximation methods (22) and (26) are found to be relatively better in overall performance than the other proposed methods of stratification.
In the population of clusters of size three, the exact equations (14) and the approximation methods (22), (24), (25) and (26) perform with considerably higher efficiencies than equal interval stratification for all the numbers of strata considered except L = 3. At L = 3, all the proposed methods of stratification other than (22) perform with the same efficiency as equal interval stratification, and method (22) performs with slightly lower efficiency than equal interval stratification. For all other numbers of strata, however, method (22) performs with strikingly high efficiencies, and all the other proposed methods of stratification perform with more or less the same efficiency as method (22). Although the proposed methods of stratification perform well in both the populations, they perform relatively better in the population of clusters of size two than in the population of clusters of size three.

Conclusions
In stratified cluster sampling with clusters of equal size, the proposed methods of stratification are seen to perform well, with efficiencies that are in most cases higher than those of equal interval stratification. The unavoidable use of cluster sampling, owing to the nature of the spatial relationship between elements of a population or the unavailability of a proper sampling frame, and the strategy for ensuring the precision of the estimator of the population parameter are simultaneously addressed in the stratified cluster sampling design presented in this paper. The exact equations (14) and the approximation methods (24), (25) and (26) all perform with more or less the same efficiencies; rather interestingly, the approximation method (26), in which AOSB are given by the geometric mean of the means of consecutive strata, is found to perform slightly better than the other three proposed methods of stratification, (14), (24) and (25). The approximation method (22), which takes the form of a definite integral of a function defined according to the population used, performs best overall. Although the methods of approximation (24), (25) and (26) are implicit, they are easy to use. Therefore, all these proposed methods of stratification may be useful in the practical planning of socio-economic surveys.