International Journal of Statistics and Applications

p-ISSN: 2168-5193    e-ISSN: 2168-5215

2012;  2(1): 1-10

doi: 10.5923/j.statistics.20120201.01

Comparison of the Efficiency of the Various Algorithms in Stratified Sampling when the Initial Solutions are Determined with Geometric Method

Şebnem Er

Quantitative Methods Department, Istanbul University, Istanbul, 34320, Turkey

Correspondence to: Şebnem Er , Quantitative Methods Department, Istanbul University, Istanbul, 34320, Turkey.

Email:

Copyright © 2012 Scientific & Academic Publishing. All Rights Reserved.

Abstract

The main aim of this paper is to examine the efficiency of Genetic Algorithm (GA) of Keskintürk and Er (2007)[1], Kozak’s (2004) Random Search[2] and Lavallée and Hidiroglou’s (1988) Iterative Algorithm method[3] on determination of the stratum boundaries that minimize the variance of the estimate. Initial starting boundaries of the mentioned algorithms are obtained randomly. Here, it is aimed to reach better results in a shorter period of time by utilizing the initial boundaries obtained from Gunning and Horgan’s (2004) geometric method[4] compared to the random initial boundaries. Three algorithms are applied on various populations with both random and geometric initial boundaries and their performances are compared. With the stratification of 11 heterogenous populations that have different properties, higher variance of the estimates or infeasible solutions can be observed once the initial boundaries are obtained with geometric method.

Keywords: Stratified sampling, Stratum boundaries, Genetic algorithm, Random search, Iterative method

Cite this paper: Şebnem Er , "Comparison of the Efficiency of the Various Algorithms in Stratified Sampling when the Initial Solutions are Determined with Geometric Method", International Journal of Statistics and Applications, Vol. 2 No. 1, 2012, pp. 1-10. doi: 10.5923/j.statistics.20120201.01.

1. Introduction

In stratified sampling,in order to gain more precision than other methods of sampling, a heterogeneous population is divided into subpopulations, each of which is internally homogeneous. As a result the main problem arising in stratified sampling is to obtain the optimum boundaries. Several numerical and computational methods have been developed for this purpose. Some apply to highly skewed populations and some apply to any kind of populations. An early and very simple method is the cumulative square root of the frequency method (cum√f) of Dalenius & Hodges in 1959[5]. More recently Lavallée & Hidiroglou algorithm[3] and Gunning & Horgan's (2004) geometric method[4] have been proposed for highly skewed populations whereas Kozak's (2004) random search method[2] and Keskinturk & Er's (2007) genetic algorithm (GA) method[1] have been proposed for even non-skewed populations. Very recently, Brito et.all[6] proposed an exact algorithm for the stratification problem with only proportional allocation based on the concept of minimum path in graphs and they called their method StratPath. Moreover, developed an iterated local search method to solve the stratification problem of variables with any distribution with Neyman allocation[7].All these methods aim to achieve the optimum boundaries that maximise the level of precision or equivalently minimise the variance of the estimate or the sample size required to reach a level of precision and some of them are available in the stratification package stratification for use with the statistical programming environment R[8]; freely available on the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=stratification.
The main aim of this research is to compare the efficiency ratios of the Lavallée ve Hidiroglou iterative method, Kozak’s random search method and Keskinturk and Er’s genetic algorithm approach when the initial boundaries are obtained either randomly or from the geometric method of Gunning and Horgan, and to examine the performances of the three methods. The predetermined total sample size (n) is allocated using Neyman[9] optimum allocation method. The paper is structured as follows: In the second section the exact solution of Dalenius[10] and the methods that are developed in order to approximately solve the Dalenius equations are briefly explained. In the third section, the results obtained with Lavallée and Hidiroglou’s iterative method, Kozak’s random search method and Keskintürk and Er’s genetic algorithm are given when the initial boundaries are obtained randomly or from the geometric method of Gunning and Horgan and the performance of the algorithms are compared.

2. Dalenius’ (1950) Exact Solution

(1)
(2)
is estimated by Cochran as [11]
(3)
(4)
(5)
(6)
(7)
where the true variance is
(8)
(9)
It is well-known that this variance of the estimate is minimum
(10)
when total sample size n is allocated using Neyman’s optimum allocation method [9]:
(11)
Therefore the variance of the estimate is a function of the boundaries . As a result, it is very difficult to find the boundaries that minimise the variance of the estimate. Dalenius (1950)[10] has shown that the variance of the estimate obtained with Neyman’s optimum allocation method is optimum or in other words minimum, when the stratum boundaries satisfy the following equations:
(12)
It is very difficult to find the stratum boundaries that satisfy these equations remembered as Dalenius equations since these equations include and that both vary with stratum boundaries. As a result, there have been many approximations and algorithms proposed for solving Dalenius equations. The widely known simple method among the proposals is the cumulative square root frequency method of Dalenius and Hodges (1959) 5]. Then, in 1988 Lavallée and Hidiroglou’s iterative approach[3], in 2004 Gunning and Horgan’s geometric method [4] and Kozak’s random search method[2], in 2007 Keskintürk and Er’s genetic algorithm method[1] are developed in order to find the stratum boundaries. Among these methods, geometric method is the simplest method that does not include any complex algorithms. Therefore, the main aim of this research paper is to set the initial boundaries of the proposed algorithms with geometric method and compare the efficiencies of the algorithms when the boundaries are obtained with or without geometric method since it is believed that these algorithms would reach the solution in a shorter period once they start searching the entire space at a reasonable point. The details of the approaches and algorithms of these methods could be obtained from the original papers of Dalenius and Hodges’ (1959)[5], Gunning and Horgan (2004)[4], Kozak (2004)[2] and Keskintürk and Er’s (2007)[1]. All of these methods could be applied in R statistical environment using stratification[12] and GA4stratification[13] packages but the GA results given in this studyare obtained in Matlab 7.0 since in the package there is no option for setting the initial boundaries with non-random results.

3. Application

3.1. Populations for Stratification

In this paper, many populations are used for stratification with different skewness, kurtosis, mean, standard deviation and size properties.Those populations that are available in the R stratification[12] and GA4Stratification[13] packages are used for stratification. Each of the populations are divided into 3, 4, 5 and 6 strata and the boundaries are obtained using Lavallée and Hidiroglou, Kozak and GA methods with random and geometric initial boundaries.
Pop1: An accounting population of debtors in an Irish firm (Debtors).
Pop2: The population in thousands of US cities in 1940 (UScities).
Pop3: The number of students in four-year US colleges in 1952-1953 (UScolleges).
Pop4: The resources in millions of dollars of large commercial US banks (USbanks).
Pop5: Number of municipal employees of 284 municipalities in Sweden in 1984 (ME84).
Pop6: Population in thousands of 284 municipalities in Sweden in 1975 (P75).
Pop7: Real estate values in millions of kronor according to 1984 assessment of 284 municipalities in Sweden in 1984 (REV84)
Pop8: Simulated Data from the Monthly Retail Trade Survey of Statistics Canada (MRTS)
Pop9: Household income before taxes from the 2001 Survey of Household Spending carried out by Statistics Canada (HHINCTOT)
Pop10: Net sales data of 487 Turkish manufacturing firms among the largest 500 firms in 2004 by Istanbul Chamber of Industry (ICI) (iso2004)
Pop11: Net sales data of 485 Turkish manufacturing firms among the largest 500 firms in 2005 by Istanbul Chamber of Industry (ICI) (iso2005)
The boxplots of the populations are displayed between Figures 1 and 3, and the summary statistics of the populations are given in Table 2.
Referring the descriptive statistics in Table 2 and boxplots in Figures 1-3, we see that the populations to be stratified are highly heterogenous which makes stratified sampling efficient to use. For comparison, the initial boundaries are obtained with both random initial boundaries and with geometric method. The populations are divided into 3, 4, 5 and 6 strata and the total sample size is determined as 100 for Pop1-Pop11. For genetic algorithm, the number of iterations is set to 10000, the GA population size to 35, the crossover rate to 0.99 and the mutation rate to 0.15. For efficiency (efficiency – eff) comparisons of the ratio of variance of the estimates or the ratios of squares of coefficient of variations (CV) are calculated and given in Appendix 1. Since Lavallée and Hidiroglou’s (LH) method is based on sampling all of the elements in the last stratum (take-all top stratum), the following efficiency ratios are calculated if GA and Kozak’s methods provide a take-all top stratum solution:
(13)
(14)
(15)
From the efficiency and the coefficient of variation ratios given in Table 3in Appendix 1 and from the strata and sample sizes given in Table 5 in Appendix 2, it can be seen that the algorithms compared in this paper provide very close results and that the stratum boundaries are very close to each other when the initial boundaries are set randomly.When we look at the summary of the results given in Table 1, we see that the number of cases where GA or Kozak is better than the other one does not differ much and the gains in efficiencies are close to each other.
Table 1. Number of Cases where the Chosen Algorithm Gives Better Results and the Range of the Efficiency Gain (Random Initials)
     
On the other hand, the results are different with higher coefficient of variations when the initial boundaries are obtained with geometric method (Table 4).Moreover, when the initial boundaries are set to be found with geometric method, many infeasible or nonconverged results are obtained. For example, when we look at Table 4 where the initial boundaries are obtained with geometric method, we see that the coefficient of variations for GA increases in 32 cases among 44 cases. Yet some of these increases in the CVs result from a nonconverged or an infeasible solution. Only in 4 cases there is a gain in efficiency ranging in between ‰0.01 (CV falling from 0.01437 to 0.01436 for H=5 for Pop3-UScolleges) and %0.186 (falling from 0.02485 to 0.02299 for H=5 for Pop8-MRTS), which could be counted as a very minor gain. The results for L&H and Kozak’s are more or less the same with the results obtained for GA. When the initial boundaries are obtained with geometric method, with each of Kozak’s and L&H’s methods there is an efficiency gain in only 5 cases, which are again minor. For these reasons, Lavallée and Hidiroglou’s iterative method, Kozak’s random search method and Keskintürk and Er’s genetic algorithms give more efficient results when the initial boundaries are set randomly due to their nature. As a result, it can be concluded that starting with geometric initial boundaries does not have much contribution on the efficiency ratios or on the stratum boundaries for the computational methods. As proposed by Horgan (2011) [14], in order to obtain feasible solutions in some data sets,some modifications should be applied before utilising the geometric method. Horgan (2011) [14] suggests that the data should be analysed before applying the stratified sampling scheme if there are extreme outliers. In this paper the revitised version of the geometric method is not applied since the algorithms examined here already give good results with random initials. Furthermore, if any researcher wants to use the geometric initial boundaries for data sets with extreme outliers, modified version of the geometric method should be used.
Figure 1. Boxplots of Pop1-Pop4
Figure 2. Boxplots of Pop5-Pop9
Figure 3. Boxplots of Pop10-Pop11
Table 2. Summary Statistics of the Populations
     

4. Conclusions

Stratified sampling is a sampling methodology used for heterogeneous populations in order to gain more precision than other methods of sampling. This paper examines the improvement in the efficiency ratios and stratum boundaries obtained with Lavallée and Hidiroglou [3], Kozak [2] and Keskintürk and Er’s (2007) [1] methods once the initial boundaries are obtained with geometric method. With the stratification of 16 heterogenous populations that have different properties, higher variance of the estimates or infeasible solutions can be observed. As a result, researchers should be much more rigorous when using geometric method for the initial boundaries in algorithmic methods or else use the modified version of geometric method once the data has very extreme values.

ACKNOWLEDGEMENTS

I would like to thank the reviewer of this article whose comments and suggestions have helped improve the paper.

APPENDIX 1

Table 3. The efficiency and coefficient of variation ratios of LH, GA and Kozak’s methods when the initial boundaries are obtained randomly
     
Table 4. The coefficient of variation ratios of LH, GA and Kozak’s methods when the initial boundaries are obtained with geometric method
     

APPENDIX 2

Table 5. Size of the strata (Nh) and the sample sizes (nh) obtained fromLH, GA and Kozak’s methods when the initial boundaries are obtained randomly
     
Table 5. Continues: Size of the strata (Nh) and the sample sizes (nh) obtained fromLH, GA and Kozak’s methods when the initial boundaries are obtained randomly
     

References

[1]  Keskintürk, T., Er, Ş., A Genetic Algorithm Approach to Determine Stratum Boundaries and Sample Sizes of Each Stratum in Stratified Sampling. Computational Statistics & Data Analysis, 52, 1, pp.53-67, 2007.
[2]  Kozak, M., Optimal Stratification Using Random Search Method in Agricultural Surveys. Statistics in Transition, 6, 5, pp.797-806, 2004.
[3]  Lavallée, P., Hidiroglou, M., On the Stratication of Skewed Populations, Survey Methodology, 14, 1, pp.33-43, 1988.
[4]  Gunning, P., Horgan, J.M., A New Algorithm for the Construction of Stratum Boundaries in Skewed Populations. Survey Methodology, 30, 2, 2004.
[5]  Dalenius, Tore, Hodges, Joseph L.Jr., “Minimum Variance Stratification”, Journal of the American Statistical Association, 54 (285), pp.88-101, 1959.
[6]  Brito, J., Maculan, N. Lila, M., Montenegro, F. An Exact Algorithm for the Stratication Problem with Proportional Allocation. Optimization Letters, 4, 2, pp.185-195, 2010.
[7]  Brito, J., Ochi, L., Montenegro, F., Maculan, N. An Iterative Local Search Approach Applied to the Optimal StratificationProblem. International Transactions in Operational Research. 17, 6, pp.753-764, 2010.
[8]  R Development Core Team. R: A language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, (URL) http://www.r-project.org. 2005.
[9]  Neyman, Jerzy. “On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection”, Journal of the Royal Statistical Society, 97 (4), pp.558-625, 1934.
[10]  Dalenius, T. The problem of optimum stratification, Skandinavisk Aktuarietidskrift, pp. 203-213, 1950.
[11]  Cochran, W. G., Sampling Techniques, 3rd ed., John Wiley & Sons, Inc. USA., 1977.
[12]  R: stratification. http://CRAN.R-project.org/package=stratification
[13]  R: GA4Stratification.http://CRAN.R-project.org/package=GA4stratification
[14]  Horgan, J.M., “Geometric Stratification Revitised”. ISI World Congress 2011 Proceedings, 2011.