Central Statistics Office Ireland


EDITING AND CALIBRATION IN SURVEY PROCESSING


MACRO-EDITING, MICRO-EDITING AND CALIBRATION USING ANNUAL SERVICES INQUIRY DATA

 

Statistical Methods & Development

SMD-37
September 2000


Table of Contents


1 Introduction

2.1 The Hidiroglou-Berthelot Method (with Imputation)

4 Calibration

5 Survey Processing Conclusions

5.2.1 Best Practice Suggestions

Appendix I List of Businesses and Size Categories Used in the Project

Appendix II Annual Services Edits

Appendix III Chernikova’s Localisation Algorithm

 

 


Editing and Calibration Techniques for Survey Processing


1 Introduction


This report looks at the background theory and application of (i) automatic editing and (ii) grossing methods in survey processing systems. A typical survey processing system involves conducting the sample, followed in turn by data entry, editing, grossing and publication of results. For the purposes of this report the emphasis is on methods for editing and grossing. Other methodologies such as OCR (optical character recognition) that enhance the speed of data capture have been addressed in previous SMD reports and are currently in use in some units in CSO.
In general the editing and correction process is resource intensive in terms of human involvement. Examining every return that has been flagged in error can be tedious. It also means that every return, no matter how insignificant, gets the same attention as those that are vital to results. So, for example, firms with sales of £10,000 get dealt with in a similar fashion to those with sales of £10 million. This traditional approach, while suiting a more manually based editing and correction system, can be made more efficient if automatic, that is computer based, methods are adopted. In particular, if the significant survey returns are validated manually and the bulk of the less significant returns are processed automatically, savings can be made while quality is maintained or possibly improved.
When sample survey returns are clean they require grossing up to population totals. In some surveys there may be a need for grossing based on two or more variables. When this situation arises some form of post stratification is often employed. This procedure adjusts the grossing factors to ensure balance in the survey results. To avoid this ad hoc means of adjustment it is possible to generate a single set of grossing factors that ensures results agree under different grossing methods. This procedure is known as calibration.

 

1.1 Project Outline


In this report automatic editing methods and calibration are applied to returns taken from the Annual Services Inquiry of 1996 (referred to hereafter as ASI).
Macro-editing, that is editing a group of returns at, say, the stratum level is examined first. Macro-editing as applied here involves using the data itself to set upper and lower bounds for range checks. This process highlights significant errors while smaller errors with less impact are not changed. Imputation is then used to attempt to emulate the results currently obtained from the inquiry.
Second, micro-editing (i.e. record level editing) using a Generalised Edit and Imputation System (GEIS), similar to the one developed at Statistics Canada for business surveys, is applied. In this case edits are treated as linear constraints and optimum correction and imputation strategies are chosen and applied. SAS code for this purpose is available. However, in practice the implementation of this type of system is not trivial.
Finally the report assesses calibration methods. Calibration is used to circumvent the need for both persons engaged and number of enterprise grossing factors in the ASI. Calibration is also assessed where additional information such as VAT receipts is available. This situation could arise where National Accounts, for example, might wish to adjust Annual Services’ aggregates to ensure that VAT totals agree with the figures received from the Revenue Commissioners.

 

1.2 The Annual Services Inquiry


The ASI measures the principal trading aggregates of the retail, wholesale, real estate, renting, business and other selected services sectors. The survey sample is selected from CSO’s Business Register, with enterprises as the target respondents. Stratification into four size groups is based on Total Persons Engaged on the Business Register.

The four size groups are:


• 1-4 persons engaged
• 5-9 persons engaged
• 10-19 persons engaged
• 20 or more persons engaged


The selected sample has a proportion of enterprises (approximately 25%) “rolled over” from the previous year. All enterprises with 20 or more persons engaged are rolled over and certain others having a large Total Sales are also likely to be rolled over.
The sample for 1996 was 10,184 enterprises (from a population for all sectors of 65,417) and a field force of 25 field officers helped to obtain an overall response rate of 93%.
The report of the ASI for 1995 and 1996 gives a full description.

 


1.3 Existing Editing Procedure


The editing procedure currently in use mainly involves range checks applied in ratio form to highlight conflicting data on an enterprise’s return. An example would be:


VAT on Sales / Total Sales > 0.23


Returns flagged in this way are then checked and where appropriate corrected by Annual Services Unit (ASU). The unit comprises 18 persons and some or all of these may be involved in the editing and correction procedure depending on the workload at any particular time.
Edits similar to the example above are warning or soft errors. These edits are coded in SAS and applied to the returns held in SAS datasets. Soft errors may require correction but more generally the highlighted values require validation. This procedure very often involves contacting the relevant enterprise to check the consistency of the information returned. The validation, editing and correction effort requires about 50% of the unit’s resources.


1.4 Scope of the Project


In this report attention is focussed on retail (excluding sellers of motor vehicles) and wholesale enterprises. Results presented here will therefore not be identical to those released in the report of the ASI for 1995 and 1996 which also included Real Estate, Renting, Business and Selected Services.
For the purposes of this project we used data for 3,211 enterprises for which the principal aggregates were available from the clean (i.e. edited) datasets for 1995 and 1996 and the dirty (i.e. unedited) dataset for 1996. These were classified according to 13 types of retail business and 7 types of wholesale business, and cross-classified by the four persons engaged size groups. These classifications are listed in Appendix I.


2 Macro-editing Methods


Common problems with micro-editing procedures are that data are subjected to unnecessary edit checks or to checks with bounds that are too narrow. Resources may be devoted to many error corrections that have little effect on estimates. Macro-editing procedures aim to reduce the number of records identified as errors by addressing the problem of too narrow bounds on edit checks. Using the data itself to set upper and lower bounds for checks, the aim of these procedures is that significant errors are identified, while smaller errors (with less impact on the results) are either not changed or are corrected automatically using, for example, a GEIS described later in this paper.
Details of interactive and graphical macro-editing procedures, for example the Top-Down method, the Box-Plot method, the Box method, are available in SMD-34 Techniques for Data Editing. For the purposes of this assessment of macro-editing techniques on the ASI, two non-graphical non-interactive methods were applied, the Hidiroglou-Berthelot method and the Aggregate method.

 


2.1 The Hidiroglou-Berthelot Method (with Imputation)


The Hidiroglou-Berthelot (H-B) edit procedure aims to identify big period-on-period changes in a variable (i.e. change outliers) by using information from the data itself and to allow for the typically skewed distributions of variables in business surveys.
Take a variable X measured in two consecutive time periods denoted by X(t) and X(t+1). Upper and lower bounds for a ratio check of the variable are generated from the data. Ratio changes outside these bounds (outliers) are flagged as errors. The method uses ratios as its starting point, instead of differences [X(t+1) – X(t)], to ensure that large relative changes are identified.
Calculate R = X(t+1)/X(t), the relative change for each observation of the data. As the relative change can increase or decrease, a translation of the ratios into positive and negative values about the median is performed as follows:


S = (R – Rmedian)/R            if 0 < R < Rmedian
S = (R – Rmedian)/Rmedian      if R ≥ Rmedian


Half of the S values will now be less than zero. As emphasis on the magnitude of the observation may be important, a second transformation is performed:


E = S * [max(X(t), X(t+1))]**U


The method proposes that U be a value between 0 and 1. If U = 0 then no emphasis is placed on the magnitude of the observation and E=S; if U = 1, then full emphasis is placed on the magnitude. By the choice of U, more importance can be put on a relatively small change in a large value (U close to 1) than on a relatively large change in a small value (U close to 0).
Any E values that are too small or too big are considered outliers or errors by the method, as their trend is different from the overall trend of other observations. Upper and lower limits for the E values are constructed using:


Dq1 = max[Emedian – Eq1, |A*Emedian|]
Dq3 = max[Eq3 – Emedian, |A*Emedian|]


as


Upper limit = Emedian + C*Dq3
Lower limit = Emedian – C*Dq1


where A is an arbitrary value suggested by H-B to be 0.05. The A*Emedian term protects against the detection of too many outliers in situations where the E values are tightly clustered around the median. C is a constant that controls the width of the interval.
The example below illustrates the calculation of the S and E values for artificial data using C = 2 and A = 0.05.

Obs.  X(t)  X(t+1)  X(t+1) - X(t)  R = X(t+1)/X(t)       S  E [U=1]  E [U=0.4]
   1    10       2             -8             0.20   -5.45   -54.50     -13.69
   2    19      10             -9             0.53   -1.45   -27.57      -4.71
   3    30      25             -5             0.83   -0.55   -16.44      -2.14
   4     8       5             -3             0.63   -1.06    -8.51      -2.44
   5    15      14             -1             0.93   -0.38    -5.73      -1.13
   6    17      22              5             1.29    0.00     0.07       0.01
   7    20      30             10             1.50    0.16     4.88       0.63
   8    10      20             10             2.00    0.55    11.01       1.82
   9    25      50             25             2.00    0.55    27.52       2.63
  10     2      12             10             6.00    3.65    43.81       9.87
  11     5      30             25             6.00    3.65   109.53      14.23


With U=1 we get Emedian=0.07, Eq1=-16.44, Eq3=27.52, the upper bound is 54.9 and the lower bound is –33. Observations 1 and 11 are identified as outliers.


When U=0.4 we get Emedian=0.01, Eq1=-2.44, Eq3=2.63, the upper bound is 5.25 and the lower bound is -4.89. Observations 1 and 11 are again identified. Observation 10 is now also identified. Less emphasis has been placed on the magnitude of the observations by choice of U, and observation 10’s ratio of 6.00 is now seen as significant, even though the values of X(t) and X(t+1) are not large.
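The worked example can be reproduced with a short Python sketch. This is an illustration only, not the code used in the project: the function name hb_outliers is hypothetical, and the quartile convention (Python's "exclusive" method, which picks the 3rd and 9th ordered values of 11) is an assumption chosen because it matches the quartiles quoted for U=1. The sketch uses unrounded intermediate values, so its E values differ slightly from the tabulated ones, which appear to be based on a rounded median ratio.

```python
import statistics

def hb_outliers(x_prev, x_curr, U=0.4, C=2.0, A=0.05):
    """Return 1-based observation numbers flagged by the H-B ratio edit.

    Assumes all values are strictly positive.
    """
    # Relative change and its median
    R = [c / p for p, c in zip(x_prev, x_curr)]
    r_med = statistics.median(R)
    # Symmetric transformation about the median ratio
    S = [(r - r_med) / r if r < r_med else (r - r_med) / r_med for r in R]
    # Size transformation: U controls the weight given to magnitude
    E = [s * max(p, c) ** U for s, p, c in zip(S, x_prev, x_curr)]
    # Quartiles of E (assumed convention: "exclusive" method)
    e_q1, e_med, e_q3 = statistics.quantiles(E, n=4, method="exclusive")
    d_q1 = max(e_med - e_q1, abs(A * e_med))
    d_q3 = max(e_q3 - e_med, abs(A * e_med))
    lower, upper = e_med - C * d_q1, e_med + C * d_q3
    return [i + 1 for i, e in enumerate(E) if e < lower or e > upper]

# Artificial data from the example table
x_prev = [10, 19, 30, 8, 15, 17, 20, 10, 25, 2, 5]
x_curr = [2, 10, 25, 5, 14, 22, 30, 20, 50, 12, 30]
print(hb_outliers(x_prev, x_curr, U=1.0))  # observations flagged with U = 1
print(hb_outliers(x_prev, x_curr, U=0.4))  # observations flagged with U = 0.4
```

Under these assumptions the sketch flags observations 1 and 11 for U=1, and observations 1, 10 and 11 for U=0.4, in line with the discussion above.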


It is worthwhile to compare these results with those based on normal theory. The mean and standard deviation of R are 2.2 and 2.1 respectively. These would give with two standard deviations –2.0 < R < 6.4 and so no outliers are identified. However, based on inspection alone, it is clear that observations 10 and 11 are questionable.


2.1.1 Application to Annual Services Inquiry


The H-B method was repeatedly applied to the Total Sales of 1,231 retail and wholesale enterprises responding to the 1996 survey, rolled-over from 1995. Enterprises with 1996 Total Sales over £100 million were excluded from the trial of the method, as these large respondents would be carefully edited separately.
A Total Sales ratio is tested under the H-B method. The frequency distribution of ratios of unedited 1996 Total Sales to edited 1995 Total Sales for all rolled-over enterprises is given in Table 2.1.


Table 2.1 Frequencies of Ratios of Total Sales 1996 to Total Sales 1995

Ratio 1996/1995   Frequency   % Frequency
missing                 206            17
0.0 - 0.5                22             2
0.5 - 0.7                27             2
0.7 - 1.0               250            20
1.0 - 1.1               361            29
1.1 - 1.2               197            16
1.2 - 1.3                79             6
1.3 - 1.5                47             4
1.5 - 5.0                33             3
5.0 & over                9             1

Total                 1,231           100


Approximately 25% of ratios lie outside 0.7 and 1.5, these being roughly the bounds applied by ASU in their ratio check on Total Sales. Total Sales is missing in 1996 for 17% of rolled-over enterprises.
One hundred subsets from the 1,231 rolled-over enterprises were selected randomly, each containing 500 enterprises and the H-B method with different test values for the parameters applied. Where the H-B method flagged an enterprise’s Total Sales as an error, the mean Total Sales for the enterprises of that size group and business sector was used to impute a corrected Total Sales.
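The imputation step can be sketched as follows. This is a minimal illustration under stated assumptions: the report does not say whether flagged or missing values were excluded when the cell mean was calculated, so the sketch assumes the mean is taken over unflagged, non-missing returns in the same size group and business sector. The function name mean_impute and the sample figures are hypothetical.

```python
from collections import defaultdict

def mean_impute(cells, sales, flagged):
    """Replace flagged Total Sales values with the mean of their cell.

    cells   : list of (size group, business sector) labels, one per enterprise
    sales   : list of reported Total Sales (None if missing)
    flagged : set of indices flagged as errors by the edit
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for i, (cell, value) in enumerate(zip(cells, sales)):
        # Assumption: only unflagged, non-missing returns contribute to the mean
        if i not in flagged and value is not None:
            sums[cell] += value
            counts[cell] += 1
    imputed = list(sales)
    for i in flagged:
        cell = cells[i]
        if counts[cell]:  # leave the value unchanged if the cell has no donors
            imputed[i] = sums[cell] / counts[cell]
    return imputed

# Hypothetical returns: the third enterprise has been flagged by the edit
cells = [("1-4", "Food"), ("1-4", "Food"), ("1-4", "Food"), ("5-9", "Food")]
sales = [100.0, 120.0, 5000.0, 300.0]
print(mean_impute(cells, sales, flagged={2}))
```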

 

 

2.1.2 Results of Application


The results from the application of the assessment methodology are shown in Table 2.2 for six combinations of the H-B parameters. The averages over the one hundred subsets of the number of errors identified and the total corrected Total Sales are shown. The errors flagged by ASU are all instances of change to Total Sales 1996, resulting from any one of the following three checks:


• The components of Total Sales do not add to reported Total Sales.
• The ratio Total Sales96/Total Sales95 is > 1.5 or < 1/1.5.
• Total Sales is greater than £5million.


The number of overlapping errors identified by both the H-B method and ASU is also shown. When the results are examined it is clear that a small number of errors account for the bulk of the changes to Total Sales 1996.


Table 2.2 Averages and Standard Deviations for 100 Replications of Hidiroglou-Berthelot Method to Total Sales

Columns (A = 0.05 throughout):
(1) Number of errors flagged by ASU
(2) Corrected Total Sales after ASU flagged errors corrected (£m)
(3) Number of errors flagged by H-B method
(4) Corrected Total Sales after H-B flagged errors corrected (£m)
(5) Number of errors identified by ASU and H-B (overlapping errors)
(6) Corrected Total Sales of overlapping errors as % of corrected Total Sales of all ASU flagged errors

H-B parameters   (1)           (2)             (3)           (4)             (5)           (6)
U      C         Avg  St dev   Avg    St dev   Avg  St dev   Avg    St dev   Avg  St dev   Avg  St dev
0.2    10         66       7   1,663     115    38       5   1,608     124    24       4   87%      3%
0.2    20         66       7   1,663     115    29       5   1,629     122    21       4   86%      3%
0.4    10         66       7   1,663     115    39       5   1,580     121    24       4   87%      3%
0.4    20         66       7   1,663     115    30       5   1,614     123    22       4   86%      4%
1.0    10         66       7   1,663     115    46       6   1,500     120    25       4   88%      4%
1.0    20         66       7   1,663     115    36       5   1,544     119    23       4   87%      4%


The average number of errors flagged in a subset under the H-B method was lower than the number identified by ASU. Average corrected overall Total Sales of a subset and its standard deviation under the H-B method can be reasonably compared to that resulting from ASU corrections, even when a simple imputation method (mean imputation) was used. As more importance was given to the magnitude of Total Sales by increasing the parameter U (so that emphasis was placed on editing ratio changes of the larger enterprises) the number of errors identified by the method increased. However, where more imputation occurred, the comparison of Total Sales worsened (see results for U=1.0 above). The level of corrections explained by overlapping errors was encouraging at 86% or higher.


Results are presented in Table 2.2 for six combinations of parameter values. Because of the robustness of the technique, variations in findings for different combinations of the H-B parameters U and C were not marked. The parameter A was set at 0.05 as suggested.


Concentrating on the Food and Drink sector, the method was also applied to Total Sales in 1996 for 350 rolled-over enterprises. As before, 100 subsets were selected randomly, each containing approximately 150 enterprises. Also, as before, any enterprises with Total Sales 1996 over £100 million were excluded. The results are shown in Table 2.3.


Table 2.3 Averages and Standard Deviations for 100 Replications of the Hidiroglou-Berthelot method to Total Sales in the Food and Drink Sector

Columns (A = 0.05 throughout):
(1) Number of errors flagged by ASU
(2) Corrected Total Sales after ASU flagged errors corrected (£m)
(3) Number of errors flagged by H-B method
(4) Corrected Total Sales after H-B flagged errors corrected (£m)
(5) Number of errors identified by ASU and H-B (overlapping errors)
(6) Corrected Total Sales of overlapping errors as % of corrected Total Sales of all ASU flagged errors

H-B parameters   (1)           (2)             (3)           (4)             (5)           (6)
U      C         Avg  St dev   Avg    St dev   Avg  St dev   Avg    St dev   Avg  St dev   Avg  St dev
0.2    10         19       4     242      30    11       3     240      30     6       2   94%      2%
0.2    20         19       4     242      30     7       2     239      30     4       1   92%      2%
0.4    10         19       4     242      30    11       3     240      30     6       2   94%      2%
0.4    20         19       4     242      30     7       2     239      30     4       1   92%      2%
1.0    10         19       4     242      30    12       4     233      27     6       2   94%      2%
1.0    20         19       4     242      30     8       3     239      30     5       2   94%      2%


The H-B method was successful in identifying the larger errors in Total Sales 1996 for the Food and Drink sector. As seen when examining the results for all sectors, a small number of errors accounted for the majority of the change to Total Sales 1996 for Food and Drink enterprises. The level of overlap between the H-B correction to Total Sales 1996 and the ASU correction was over 90% for the various parameter combinations above. Differences in results for different combinations of the method’s parameters were once again not marked, as was also seen in the Table 2.2 findings for all sectors.

 

 


2.1.3 Summary of Results


Overall, the method did reduce the number of enterprises identified with error Total Sales values, while still flagging the larger corrections made by ASU. Because of their impact on results, larger enterprises will always need the attention of experts. For smaller rolled-over enterprises the H-B method described could be sufficient to edit the data.
The method as described required data in two consecutive time periods, but variables measured in the same time period could be edited using the method, if a relationship existed.
Trials of the method over more time periods would be ideal to fully explore different H-B parameter combinations. To apply the method to a full set of survey variables, different parameter combinations would need to be investigated for each variable.
For simplicity, mean imputation was chosen as the imputation method. A more sophisticated imputation method should be investigated if the method were to be applied in a live situation.

 

2.2 The Aggregate Method


The idea behind the Aggregate method is simple. The method identifies errors in a two-stage approach. Initially, an edit is applied to aggregate or grouped data. Any group failing the check is flagged. The check is then applied to all observations in flagged groups and any observations failing the check are corrected. As a macro-editing method, the acceptance bounds should be based on the distribution of the edit check applied to the data. In the Aggregate method the acceptance bounds are set manually, by visual examination of the output of the application of the edit check value.
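The two-stage logic can be sketched in Python as below. This is an illustration under assumptions rather than the procedure actually coded for the project: the function name aggregate_edit and the sample data are hypothetical, the check is a simple ratio of current to previous values with the same bounds at both stages, and all previous-period values are assumed to be strictly positive.

```python
from collections import defaultdict

def aggregate_edit(records, lower=0.7, upper=1.5):
    """Two-stage Aggregate edit on a ratio check.

    records : list of (group, previous_value, current_value)
    Returns the indices of records that fail the check AND belong to a
    group whose aggregate ratio also fails the check.
    """
    def fails(prev, curr):
        ratio = curr / prev
        return ratio < lower or ratio > upper

    # Stage 1: apply the check to group totals
    totals = defaultdict(lambda: [0.0, 0.0])
    for group, prev, curr in records:
        totals[group][0] += prev
        totals[group][1] += curr
    flagged_groups = {g for g, (p, c) in totals.items() if fails(p, c)}

    # Stage 2: apply the check to individual records in flagged groups only
    return [i for i, (g, p, c) in enumerate(records)
            if g in flagged_groups and fails(p, c)]

# Hypothetical data: group "A" fails in aggregate, group "B" does not
records = [
    ("A", 100.0, 110.0),    # ratio 1.1, passes
    ("A", 50.0, 200.0),     # ratio 4.0, fails in a flagged group
    ("B", 100.0, 160.0),    # ratio 1.6, fails but group "B" passes overall
    ("B", 1000.0, 1000.0),  # ratio 1.0, passes
]
print(aggregate_edit(records))
```

In this example only the second record is flagged. The third record would fail a record-level check, but its group total passes, so it is left untouched; this is how the method reduces the number of records sent for correction.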


2.2.1 Application to Annual Services Inquiry


The Aggregate method was applied to two of the ASI edit checks:


• A check on the ratio of Total Sales for 1996 to Total Sales for 1995 for rolled-over companies
• A check on the ratio of closing stock 1996 to opening stock in 1996.


For the ratio check of 1996 Total Sales to 1995 Total Sales, as for the H-B method, the Aggregate method was applied to the 1,231 retail and wholesale enterprises rolled-over from 1995. Also, as for the H-B method, any enterprises with Total Sales 1996 over £100 million were excluded. Randomly selected subsets of approximately 500 enterprises were generated 100 times and edited at group level and then at a micro level within the groups flagged as failing the check.
The check was applied to the ratio of Total Sales 1996 to Total Sales 1995 for the 80 strata (4 size groups by 20 business types). This is more or less the level of detail for which results are published. Any group with a ratio outside acceptable bounds was flagged. The ratios of all enterprises in flagged groups were calculated and flagged if outside these same bounds. The corrected values discovered by ASU were used to change the Total Sales for those enterprises flagged. If ASU had not altered the return, no change was made.
Initially, the acceptable bounds for ratios were set for the method as in the ASU edit procedures: an upper bound for the ratio of 1.5 and a lower bound of 0.7. An examination of the frequency distribution of the Total Sales 1996/Total Sales 1995 ratios (see Table 2.1) supports this choice of upper and lower bounds for a ratio check of the data. For the purposes of testing the method, other bounds were also tried.


2.2.2 Results of Application


The findings over the 100 repetitions are shown in Table 2.4 for various acceptance bounds on the edit check. For comparison, findings for applications of the check to all data in a straightforward micro-edit check, with no initial application to grouped data, are also presented.


Table 2.4 Averages and Standard Deviations over 100 Replications of the Aggregate Method to Total Sales

Bounds on check:                                           >1.6 or <0.6         >1.5 or <0.7         >1.3 or <0.8
                                                           Average   St dev     Average   St dev     Average   St dev
Number of fails – check applied with no initial grouping        47        5          59        6          88        7
Number of above fails changed by ASU                            27        4          29        4          31        4
Total Sales after ASU changes                              £1,662m    £116m     £1,663m    £115m     £1,663m    £115m
Number of fails – Aggregate method                              18        4          22        4          28        5
Number of Aggregate identified fails changed by ASU             12        1           8        1          14        2
Total Sales after Aggregate changes                        £1,672m    £113m     £1,661m    £115m     £1,655m    £112m

Under the Aggregate method even the most stringent (>1.3, <0.8) bound identified fewer fails on average (28) than the (>1.5, <0.7) bound applied in the straightforward micro approach (59 fails, on average).
From examination of Table 2.4 and as seen in the H-B method, a small number of changes accounted for the majority of the corrections to initially reported Total Sales. The number of changes made to failing observations is given in Table 2.4 above as not all ratios failing the check were in error.
It is worth reiterating that no imputation method was used when testing the Aggregate method. Where the method identified an error, the correct value for Total Sales 1996 as identified by ASU was used. The aim was to concentrate on comparing the number of enterprises flagged as having failed the Total Sales check under the usual ASU method and the Aggregate method.
The second edit check tested using the Aggregate method was that opening stock and closing stock should not differ greatly in a year. This edit was internal to 1996 responses, and was run on 100 randomly selected subsets of all replies to the 1996 ASI from the retail and wholesale sectors. No enterprises were excluded. Each subset contained approximately 1,200 enterprises. The check was run initially on the 80 strata and then on all enterprises in any of the 80 strata failing the check. Results are shown in Table 2.5.

 


Table 2.5 Averages and Standard Deviations for 100 Replications of the Aggregate Method to Opening and Closing Stock

Bounds on check:                                           >1.6 or <0.6         >1.4 or <0.7         >1.2 or <0.8
                                                           Average   St dev     Average   St dev     Average   St dev
Number of fails – check applied with no initial grouping       226       11         308       13         504       18
Number of above fails changed by ASU                           141        9         142        9         142        9
Opening + Closing Stock total after ASU changes              £632m     £45m       £633m     £45m       £633m     £45m
Number of fails – Aggregate method                              45       10          59        9         146       27
Number of Aggregate identified fails changed by ASU              9        2          15        2          39        5
Opening + Closing Stock total after Aggregate changes        £628m     £44m       £630m     £45m       £629m     £45m

 

In their present system ASU imposes upper and lower acceptance bounds of 1.4 and 0.7 respectively on the ratio of closing stock versus opening stock. A visual examination of the stock ratio supports this choice of bounds. With these bounds, the number of records failing under the Aggregate method was lower, on average, than that identified by a standard micro-editing approach to the check. Other acceptance bounds were tested for comparison. From examining Table 2.5, the impact on opening and closing stock of only correcting the fewer records flagged under the Aggregate method is small.
As with the check on Total Sales, a small number of changes account for the majority of the change to Opening Stock and Closing Stock. No imputation method was used when testing the Aggregate method. Where the method identified an error, the corrected value for opening or closing stock 1996 as identified by ASU was used.


2.2.3 Summary of Results


Given the simplicity of the Aggregate method, application of this method could be attractive. Application of different bounds at group level and at micro level would be an option where different levels of accuracy are required for the aggregated published data and the micro data.
Once the grouping for the group level check is at the level of aggregation published, then the problem of non-detection of compensating errors in the group level check is addressed with regard to published results.
At present sections check tables before publication for cells showing significant increases or decreases. The Aggregate method automates and streamlines this process somewhat. It should therefore be considered as a valuable checking methodology.


3 Micro-editing using Generalised Edit and Imputation


3.1 Background


The Generalised Edit and Imputation System (GEIS) is an automatic micro-editing (i.e. record level) and imputation system developed at Statistics Canada for continuous numerical data. It was designed to handle situations where the data edit checks could be written as linear equalities or inequalities. Recently other statistical offices have built their own GEIS; these include the CherryPi system developed at Statistics Netherlands and the SAS based AGGIES system developed by NASS at the US Department of Agriculture.


GEIS are normally composed of the following (non-trivial to implement) three parts:


• Edit Analysis: This involves looking at the consistency of the edit checks to determine redundancy and the scope of the edits to ensure completeness. This part is, in a sense, inactive, as it is done only once when the edit checks are being specified.

• Error Localisation: When a record fails any one or more of the edit checks, the question that arises is what combination of variables should be changed to ensure the record will pass all edits. In order to find a suitable combination the Fellegi & Holt rule is invoked. It states that “changes should be made to as few variables as possible”. Clearly the use of this rule is appealing as it ensures that as much of the original data remains unchanged as possible.

• Imputation: Variables that are highlighted for amendment by error localisation are imputed (using a nearest neighbour or other rule).

 


3.2 Error Localisation Theory


3.2.1 Linear Edit Checks


Consider the following ratio edit check that specifies “VAT on Sales should be less than or equal to 21% of Total Sales”. This is normally written as a failure edit for programming as follows:


VAT on Sales / Total Sales > 0.21 -> Fail


In contrast linear edits are written as statements of truth. To re-cast the fail ratio edit as a true linear edit is straightforward:
1. Rewrite the edit as a true statement (note the ≤ sign):


VAT on Sales / Total Sales ≤ 0.21 -> True


2. Multiply through by the denominator (Total Sales) and take everything over to left hand side except constant values giving:


1 * VAT on Sales - 0.21 * Total Sales ≤ 0


The ratio inequality edit is now in linear form with constant coefficients of 1 and –0.21 for VAT on Sales and Total Sales respectively. In Appendix II a list of edits taken from the ASI is given.
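The re-casting of a ratio edit as a linear edit can be illustrated with a short Python sketch. The helper name ratio_edit_to_linear is hypothetical, and the conversion is only valid under the assumption that the denominator is strictly positive, since multiplying through by a negative value would reverse the inequality. The variable names TSALE96 and VAT96 are those used for Total Sales 1996 and VAT on Sales in the ASI.

```python
def ratio_edit_to_linear(numerator, denominator, bound, variables):
    """Convert the edit  numerator / denominator <= bound  into a row of
    coefficients and a right hand side for the linear form  a . x <= b.

    Assumes the denominator variable is strictly positive.
    """
    row = [0.0] * len(variables)
    row[variables.index(numerator)] = 1.0
    row[variables.index(denominator)] = -bound
    return row, 0.0

def passes(row, rhs, record):
    """True if the record satisfies the linear edit row . record <= rhs."""
    return sum(a * x for a, x in zip(row, record)) <= rhs

variables = ["TSALE96", "VAT96"]
row, rhs = ratio_edit_to_linear("VAT96", "TSALE96", 0.21, variables)
print(row, rhs)                         # coefficients -0.21 and 1, right hand side 0
print(passes(row, rhs, [1000.0, 150.0]))  # VAT is 15% of sales
print(passes(row, rhs, [1000.0, 300.0]))  # VAT is 30% of sales
```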
Linear edits in 2 dimensions can be graphed by straight line constraints in the x-y plane. Consider the following list of linear edits for the variables x and y. The edits are also given in matrix form for convenience with the coefficient matrix labelled A and right hand side labelled b.

Linear Constraint      Matrix Formulation: A x ≤ b

x ≤ 4                  |  1   0 |             |  4 |
x + y ≤ 8              |  1   1 |   | x |  ≤  |  8 |
3x + 2y ≤ 18           |  3   2 |   | y |     | 18 |
-x – 2y ≤ -8           | -1  -2 |             | -8 |

Looking at Figure 3.a any record having combinations of x and y values (e.g. x=2, y=4) that lie inside the shaded area, called the feasible region, simultaneously satisfy all the edits. On the other hand a record with values given by (x=8,y=1) lies outside the shaded area and fails to satisfy one or more of the edit constraints.


Figure 3.a Inequalities Plotted as Constraint Lines in the x-y Plane
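Whether a record lies inside the feasible region can be checked directly from the matrix form. The following Python sketch is illustrative only; the function name failed_edits is hypothetical. It evaluates each row of A x ≤ b for the two records discussed above.

```python
# Coefficient matrix A and right hand side b for the four example edits
A = [[1, 0],
     [1, 1],
     [3, 2],
     [-1, -2]]
b = [4, 8, 18, -8]

def failed_edits(A, b, record):
    """Return the (zero-based) indices of the edits that the record fails."""
    failures = []
    for i, (row, rhs) in enumerate(zip(A, b)):
        lhs = sum(a * x for a, x in zip(row, record))
        if lhs > rhs:
            failures.append(i)
    return failures

print(failed_edits(A, b, (2, 4)))  # record inside the feasible region
print(failed_edits(A, b, (8, 1)))  # record outside the feasible region
```

The record (2, 4) fails no edits, while (8, 1) fails the first three edits and passes only the fourth.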


The central problem of error localisation is to determine the minimum number of changes that must be made to a failed record so that it can be made to pass all edits. Looking once again at the x-y plot it is easy to see that the failing initial point (i.e. record) must be moved to the closest point of the feasible region. That point is the record with values (4,2) and is a corner or extremal point of the feasible region. This suggests that to ensure the initial point passes the edits we should subtract 4 from the x-value and add 1 to the y-value. Thus the minimum number of changes required is 2; this number is known as the minimum cardinality solution among all possible solutions.
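For a problem this small the minimum cardinality can be confirmed by brute force. The Python sketch below is not Chernikova's algorithm and is not the method used in GEIS: it simply tries every subset of variables in order of increasing size, holds the other variables at their reported values, and tests whether the remaining system has a non-negative solution using a basic Fourier-Motzkin elimination. The function names feasible and localise are hypothetical, and the approach is only practical for a handful of variables.

```python
from itertools import combinations

def feasible(rows):
    """Fourier-Motzkin test: do the inequalities coeffs . v <= rhs have a solution?"""
    rows = [(list(a), rhs) for a, rhs in rows]
    nvars = len(rows[0][0]) if rows else 0
    for k in range(nvars):
        pos = [(a, r) for a, r in rows if a[k] > 1e-12]
        neg = [(a, r) for a, r in rows if a[k] < -1e-12]
        rest = [(a, r) for a, r in rows if abs(a[k]) <= 1e-12]
        for ap, rp in pos:
            for an, rn in neg:
                sp, sn = -an[k], ap[k]  # positive multipliers that cancel variable k
                a = [sp * u + sn * v for u, v in zip(ap, an)]
                a[k] = 0.0
                rest.append((a, sp * rp + sn * rn))
        rows = rest
    # Only constant inequalities 0 <= rhs remain
    return all(r >= -1e-9 for _, r in rows)

def localise(A, b, record):
    """Return all smallest sets of variable indices whose values can be
    changed (keeping values non-negative) so that the record passes A x <= b."""
    n = len(record)
    for size in range(n + 1):
        solutions = []
        for subset in combinations(range(n), size):
            rows = []
            for a, rhs in zip(A, b):
                fixed = sum(a[j] * record[j] for j in range(n) if j not in subset)
                rows.append(([a[j] for j in subset], rhs - fixed))
            for idx in range(size):  # non-negativity of the changed variables
                coeffs = [0.0] * size
                coeffs[idx] = -1.0
                rows.append((coeffs, 0.0))
            if feasible(rows):
                solutions.append(subset)
        if solutions:
            return solutions
    return []

A = [[1, 0], [1, 1], [3, 2], [-1, -2]]
b = [4, 8, 18, -8]
print(localise(A, b, [8, 1]))  # both variables (indices 0 and 1) must change
print(localise(A, b, [2, 4]))  # empty set: the record already passes
```

For the failing record (8, 1) neither x alone nor y alone can be changed to satisfy all four edits, so the only minimum cardinality solution changes both variables, in agreement with the graphical argument.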


3.2.2 General Problem Formulation


Consider a record called x having n variables labelled x1, x2, …, xn representing (positive) data values of a respondent to a survey. The data fields are subject to a set of linear edits defined by the system of inequalities given in matrix form by:


A * x ≤ b


If one or more of these edits is not satisfied by the respondent record x then the problem is to determine the minimum number of variables to change so that the resulting record will satisfy the set of linear edits specified in the matrix A.
The corrected record can be represented as x + y – z, where y is a positive correction and z a negative correction with y, z ≥ 0. Note that x is a fixed known input record and therefore is not a correction; the new quantity y – z is the unknown correction to be determined and must satisfy the three inequalities:


A * (x + y – z) ≤ b
x + y - z ≥ 0
y, z ≥ 0

 

 


3.2.2.1 Cardinality and Chernikova’s Algorithm


The problem of error localisation is to determine the minimum number of fields to change. This is equivalent to minimising the cardinality, that is the number of non-zero elements of the correction record y – z. Mathematically the cardinality function N is defined as:

 

N(y – z) = δ(y1 – z1) + δ(y2 – z2) + … + δ(yn – zn)


where δ(t) = 1 if t ≠ 0 and δ(t) = 0 if t = 0.


To solve this problem it is important to note that the solution is of dimension 2n, that is the dimension of the space of both y and z. The error localisation module in GEIS at Statistics Canada uses Chernikova’s algorithm to generate all possible extremal point solutions (Note: The Simplex method for finding the minimum/maximum of a linear function subject to a set of linear constraints cannot be used to solve the error localisation problem. This is due to the fact that the cardinality function to be minimised is not linear).
When the enumeration of all possible solutions is complete the minimum cardinality can easily be computed. In Appendix III SAS code for error localisation using Chernikova’s algorithm is given.
Because the algorithm generates all extremal points it has proved to be very inefficient at solving even the smallest of problems. As a consequence a number of heuristics (i.e. rules of thumb) have been incorporated into the algorithm to increase its efficiency.
Even with the heuristics the error localisation module in GEIS at Statistics Canada appears only to be able to manage a combination of about 40 edit checks on about 40 variables. As a consequence it is only suitable for relatively small surveys. For larger surveys the edits are partitioned into disjoint groups. In this situation the solution obtained only approximates the best possible one.
At Statistics Netherlands error localisation is solved using another algorithm, called the Fourier-Motzkin Algorithm. It appears to be able to handle survey edit systems that are somewhat larger.


3.3 Error localisation for ASI 1996

3.3.1 Background and Edit Checks


To examine error localisation in a survey situation it was decided to apply an adapted version of the SAS code for Chernikova’s Algorithm to returns received in the ASI 1996. The code was adapted to cater for different combinations arising in the edits. In all, 23 edits were chosen and these involved 19 variables. Most of the edit checks left out were those relating to dates and as these did not interact with the actual data there was no loss of complexity.
The coefficient matrix and right hand side associated with the linear edit checks are given in Table 3.2. In the coefficient matrix in Table 3.2 the columns relate to the variables listed in Table 3.1.


Table 3.1 List of Annual Services Variables included in Test

Column number   Annual Services variable   Description
 1              CP_ACQ96                   Capital Acquisitions
 2              CP_DIS96                   Capital Disposals
 3              CSTOCK96                   Closing Stock
 4              MPROP97                    No. of Proprietors (March 1997)
 5              MSALE96                    Manufacturing Sales
 6              OSALE96                    Other Sales
 7              OSTOCK96                   Opening Stock
 8              PUR96                      Purchases
 9              PURDR96                    Purchases for Direct Resale
10              PUROGS96                   Purchases of Other Goods & Services
11              PURVAT96                   VAT on Purchases
12              RSALE96                    Retail Sales
13              SPROP97                    No. of Proprietors (September 1996)
14              TOTEMP96                   Total Employment
15              TSALE95                    Total Sales 1995
16              TSALE96                    Total Sales 1996
17              VAT96                      VAT on Sales
18              WAGSAL96                   Wages & Salaries
19              WSALE96                    Wholesale Sales



Table 3.2 Coefficient Matrix and Right Hand Side of Linear Edit Checks for ASI Edit 1996