Stochastic Latent Residual Approach for Consistency Model Assessment

Hypoglycaemia is a condition in which blood sugar levels in the body are too low. It is usually a side effect of insulin treatment in diabetic patients. Symptoms of hypoglycaemia vary not only between individuals but also within individuals, making it difficult for patients to recognize their hypoglycaemia episodes. Because of this, and because the symptoms are not exclusive to hypoglycaemia, it is very important for patients to be able to identify that they are having a hypoglycaemia episode. Consistency models are statistical models that quantify the consistency of individual symptoms reported during hypoglycaemia. Because several variations of the consistency model exist, it is important to identify which model best fits the data. The aim of this paper is to assess and verify the models. We developed an assessment method based on stochastic latent residuals and performed posterior predictive checking as the model verification. A grouped symptom consistency model with a multiplicative form of the symptom propensity and episode intensity threshold was found to fit the data better and to have more reliable predictive ability than the other models. This model can assist patients and medical practitioners in quantifying patients' symptom-reporting capability, and hence promote awareness of hypoglycaemia episodes so that corrective action can be taken quickly.


Introduction
Hypoglycaemia is a condition of low blood glucose level, i.e. below 4 mmol/L. It is a common side effect of insulin treatment in diabetic patients. It is crucial to treat a hypoglycaemia episode promptly to avoid a severe hypoglycaemia episode, in which the patient needs other people's help to recover. However, it is not easy for the patient to identify a hypoglycaemia episode because symptoms of hypoglycaemia vary within individuals. A given symptom does not covary equally with blood glucose levels [1], implying a degree of between-subject variability. Individuals experiencing various symptoms of hypoglycaemia are not necessarily able to recognize a hypoglycaemic episode, because an individual's ability to recognize hypoglycaemia is significantly correlated with the number of symptoms reported per episode [2]. There is marked variability in the reported symptoms between episodes of hypoglycaemia [3], but that study was limited to child respondents. A consistency model was developed to quantify the consistency with which adult patients report the symptoms of hypoglycaemia [4]. Zulkafli et al. [5] then introduced the grouped symptoms model as one of the consistency estimation models. This model adds another source of variation to symptom reporting by distributing the 26 symptoms into several groups according to their causes. Another functional form was briefly introduced as an alternative for use in the consistency models [5].
With several consistency models developed, the challenge is to evaluate the performance of each model before making decisions on which model can give better consistency estimates.
Residual analysis is one prominent way of validating a statistical model. Cox and Snell [6] introduced a general definition of residuals for non-linear models. Deviance, Pearson and Anscombe residuals are examples of the types of residuals commonly used in residual analysis. However, these residuals have unknown sampling distributions, which affects the interpretation of the analysis [7].
Previous work on measuring the performance of statistical models for diabetic data has used the coefficient of determination ($R^2$) as a goodness-of-fit measure [8] and robust methods [9]. However, that work does not apply the concept of latent residuals.
Latent residual analysis has been used to analyze binary response variables in a regression framework [10,11]. The $\chi^2$ test for latent model testing is sensitive to the distributional properties of the observed variables [12]. The test also has a high probability of Type I error with complex models [13]. Therefore, the intent of this paper is to present a method for assessing the adequacy of a stochastic model with latent variables, utilising the concept of stochastic latent residuals. Another important aim of this work is to develop a model which can be used to predict values of interest with quantified confidence. A good predictive model enables us to predict how consistent a patient is in reporting hypoglycaemia, given some of his/her specific characteristics. This can be used to assist early detection of hypoglycaemia and to give the necessary advice to the patient. Therefore, the second objective of this paper is to examine the consistency model's predictive capability by employing a validation approach relying on the posterior predictive distribution.

Materials and Methods
The methods of model assessment discussed in this paper are applied to data collected from 66 diabetic patients where each subject is given a unique ID number [14]. Each patient recorded his/her symptoms in each hypoglycaemia episode experienced for a duration of 9 to 12 months.

The Consistency Model
A consistency model was developed under a Bayesian approach [4]. The observed variable $y_{ijk}$ takes value 1 if patient $i = 1, \ldots, I$ reports symptom $j = 1, \ldots, J$ in episode $k = 1, \ldots, K$. Otherwise, $y_{ijk}$ takes value 0. We have $y_{ijk} \sim \text{Bernoulli}(p_{ijk})$, where $p_{ijk}$ is the probability that symptom $j$ is reported in episode $k$ by patient $i$. A threshold, $T_{ijk}$, is defined for patient $i$ reporting symptoms at episode $k$, and symptom $j$ is considered as reported when the threshold is exceeded by a functional form $h(a_{ij}, b_{ik})$, i.e. $T_{ijk} \le h(a_{ij}, b_{ik})$. Here $a_{ij}$ and $b_{ik}$ are latent variables which correspond to the propensity of symptom $j$ for patient $i$ and the intensity of episode $k$ in patient $i$, respectively.
The threshold, $T_{ijk}$, is assumed to follow a log-normal distribution, $T_{ijk} \sim \text{Log-Normal}(0, \sigma_i^2)$, where $\sigma_i^2$ is the parameter associated with the variability of symptoms reported by individual patient $i$. The consistency estimate is defined as = 1 100+ 2 .
Each parameter is assigned a prior distribution. This consistency model was later expanded by separating the symptoms into different groups according to their causes in order to have an additional source of variation [5], and the prior corresponding to the symptom propensity was modified accordingly. Earlier work on the consistency model assumed a multiplicative threshold form, $h(a_{ij}, b_{ik}) = a_{ij} b_{ik}$, where $a_{ij}$ is the symptom propensity and $b_{ik}$ is the episode intensity [4]. Later, an additive option for the functional form was introduced, i.e. $h(a_{ij}, b_{ik}) = a_{ij} + b_{ik}$, and their differences were briefly discussed [5].
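As an illustration of how the two functional forms translate into reporting probabilities, note that a threshold following a Log-Normal(0, σ²) distribution falls below a positive value h with probability Φ(log h / σ), where Φ is the standard normal distribution function. The sketch below evaluates this probability for the multiplicative and additive forms. The function and argument names (`reporting_probability`, `propensity`, `intensity`, `form`) and the numerical values are assumptions made for illustration only and are not taken from [4] or [5].

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def reporting_probability(propensity: float, intensity: float,
                          sigma: float, form: str = "multiplicative") -> float:
    """P(threshold <= h) for a Log-Normal(0, sigma^2) threshold.

    `propensity`, `intensity` and `form` are illustrative names, not the
    authors' notation. Both latent inputs are assumed to be positive.
    """
    if form == "multiplicative":
        h = propensity * intensity
    elif form == "additive":
        h = propensity + intensity
    else:
        raise ValueError("form must be 'multiplicative' or 'additive'")
    return normal_cdf(math.log(h) / sigma)


# Hypothetical values chosen only for illustration.
p_mult = reporting_probability(0.5, 1.2, sigma=0.8)
p_add = reporting_probability(0.5, 1.2, sigma=0.8, form="additive")
print(round(p_mult, 3), round(p_add, 3))
```

For the same latent values the two forms can give quite different reporting probabilities, which is one reason the choice of functional form needs to be assessed against the data.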

Stochastic Latent Residual
The stochastic latent residuals, $r$, are the quantities that would give rise to the observed data under the considered model. Following the concept of generalised residuals [6], the data $y$ can be regarded as generated through a functional model, $g(\cdot)$ [15], depending on the vector of all model parameters and latent variables, say $\theta$, i.e.

$$y = g(\theta, r), \qquad (1)$$

where $r \sim U(0, 1)$ are generalised residuals. Then, in the general case, (1) can be inverted to give the stochastic latent residuals in terms of $y$ and $\theta$. For the assumed discrete model we have

$$y_{ijk} = \mathbb{1}\{r_{ijk} \le p_{ijk}\},$$

where $r_{ijk} \sim U(0, 1)$ and $\mathbb{1}\{\cdot\}$ is the indicator function. This implies that, under the assumed model,

$$r_{ijk} = \begin{cases} U_1, & y_{ijk} = 1, \\ U_2, & y_{ijk} = 0, \end{cases}$$

where $U_1 \sim U(0, p_{ijk})$ and $U_2 \sim U(p_{ijk}, 1)$. Therefore, if the model is adequate, $r_{ijk} \sim U(0, 1)$ and a p-value for testing the hypothesis of this uniform distribution can be obtained. To implement this method, 10,000 MCMC iterations were run for this model and the latent residual, $r_{ijk}$, was obtained for each subject at each iteration such that

$$r_{ijk} \sim \begin{cases} U(0, \hat{p}_{ijk}), & y_{ijk} = 1, \\ U(\hat{p}_{ijk}, 1), & y_{ijk} = 0, \end{cases}$$

where $\hat{p}_{ijk}$ is the estimated probability of patient $i$ reporting symptom $j$ at episode $k$ at each iteration. Therefore, if the tested hypothesis is correct, $r_{ijk} \sim U(0, 1)$.
A Kolmogorov-Smirnov goodness-of-fit test was conducted on each posterior sample of residuals obtained in each MCMC iteration, resulting in a corresponding p-value for each iteration $m = 1, 2, 3, \ldots, 10{,}000$. This gives a posterior distribution of the p-value given the observed data $y$.
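The residual construction and the per-iteration test can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the "posterior draws" are fixed vectors of probabilities standing in for MCMC output, the data are simulated, and the Kolmogorov-Smirnov p-value uses the asymptotic Kolmogorov series with a commonly used small-sample adjustment rather than an exact calculation. All function names are hypothetical.

```python
import math
import random


def latent_residuals(y, p_hat, rng):
    """Draw one residual per observation: U(0, p) if y = 1, U(p, 1) if y = 0."""
    return [rng.uniform(0.0, p) if obs == 1 else rng.uniform(p, 1.0)
            for obs, p in zip(y, p_hat)]


def ks_uniform_pvalue(sample):
    """One-sample KS test against Uniform(0, 1) with an asymptotic p-value."""
    xs = sorted(sample)
    n = len(xs)
    d_plus = max((i + 1) / n - x for i, x in enumerate(xs))
    d_minus = max(x - i / n for i, x in enumerate(xs))
    d = max(d_plus, d_minus)
    # Commonly used small-sample adjustment of the Kolmogorov statistic.
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:  # the series below is numerically 1 in this range
        return 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                 for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * series))


def proportion_small_pvalues(y, p_hat_draws, rng, level=0.05):
    """Share of iterations whose KS p-value falls below `level`."""
    pvals = [ks_uniform_pvalue(latent_residuals(y, p_hat, rng))
             for p_hat in p_hat_draws]
    return sum(pv < level for pv in pvals) / len(pvals)


rng = random.Random(1)
n_obs, n_iter = 500, 200
true_p = 0.2
y = [1 if rng.random() < true_p else 0 for _ in range(n_obs)]

# Stand-ins for MCMC draws: one vector of probabilities per iteration.
adequate = [[true_p] * n_obs for _ in range(n_iter)]
misspecified = [[0.6] * n_obs for _ in range(n_iter)]

print(proportion_small_pvalues(y, adequate, rng))
print(proportion_small_pvalues(y, misspecified, rng))
```

When the assumed probabilities match the data-generating ones, the residuals look uniform and few p-values fall below 0.05. When they do not, the residuals pile up on one side of the assumed probability and almost every iteration yields a small p-value.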

Posterior Predictive Checking
This approach is commonly used for checking the model's suitability, and is based on work that was elaborated in [16] and later expanded in [17]. The purpose of the analysis is to compare the observed data with values predicted from the model.
The observations, $y_{ijk}$, are binary data that take value 1 if patient $i$ reported symptom $j$ in episode $k$, and value zero otherwise.
The predicted data, $y^{(\mathrm{pred})}_{ijk}$, are defined as the data that would be obtained if the same model were used for prediction. Ten percent of the total number of observations are randomly selected and used as the validation sample. Then, the examined model is fitted to the remaining data. The fitted model is subsequently used to predict symptom reporting for the sampled patient episodes, $y^{(\mathrm{pred})}_{ijk}$ for $i, j, k$ in the sample. Recall that $y_{ijk}$ is Bernoulli distributed with probability $p_{ijk}$. The posterior probability, $p_{ijk}$, is sampled from the fitted model and is used to obtain the reporting prediction, $y^{(\mathrm{pred})}_{ijk}$.
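The prediction step described above can be sketched as follows: for each posterior draw of the reporting probabilities on the held-out observations, Bernoulli predictions are simulated and summed, giving a posterior predictive distribution for the total number of reportings. The posterior draws here are hypothetical jittered values, not output from the fitted consistency model, and the function names are assumptions made for illustration.

```python
import random


def predictive_totals(p_draws, rng):
    """Simulate Bernoulli predictions for each posterior draw of the
    reporting probabilities and return the predicted totals."""
    return [sum(1 for p in p_vec if rng.random() < p) for p_vec in p_draws]


def central_interval(values, level=0.95):
    """Empirical central interval taken from the sorted draws."""
    xs = sorted(values)
    last = len(xs) - 1
    return xs[int((1 - level) / 2 * last)], xs[int((1 + level) / 2 * last)]


rng = random.Random(7)
n_holdout, n_draws = 100, 1000

# Hypothetical posterior draws: every held-out observation has a
# reporting probability jittered around 0.1 (illustration only).
p_draws = [[min(1.0, max(0.0, rng.gauss(0.1, 0.02))) for _ in range(n_holdout)]
           for _ in range(n_draws)]

totals = predictive_totals(p_draws, rng)
mean_total = sum(totals) / len(totals)
low, high = central_interval(totals)
print(mean_total, (low, high))
```

The observed total for the validation sample can then be compared with the mean and the interval of the predicted totals.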
Consequently, we compare the total number of predicted reportings to the total number of observed reportings. Accordingly, the distributions of $y$ and $y^{(\mathrm{pred})}$ were compared. Four other measures are used to assess and describe the usefulness of the model's predictions [18]. The measures are related to sensitivity, specificity and predictive values. Here, sensitivity is defined as the proportion of experienced symptoms that are correctly predicted as being reported by the model, whereas specificity is the proportion of symptoms that have not been experienced which are correctly predicted as not reported by the model. Ideally, a good predictive model should have high sensitivity and specificity. However, these two measures often trade off against each other: as sensitivity increases, specificity tends to decrease, and vice versa. The probability of the model giving a correct prediction was evaluated using the positive predictive value (PPV) and the negative predictive value (NPV),

$$\mathrm{PPV} = \Pr\big(y_{ijk} = 1 \mid y^{(\mathrm{pred})}_{ijk} = 1\big), \qquad \mathrm{NPV} = \Pr\big(y_{ijk} = 0 \mid y^{(\mathrm{pred})}_{ijk} = 0\big),$$

for $i, j, k$ in the sample. PPV is the proportion of symptoms with a positive prediction that were correctly classified as reported. PPV measures the probability of patient $i$ truly experiencing symptom $j$ at episode $k$ given that the model predicts the symptom is likely to be experienced.
NPV is the proportion of symptoms with a negative reporting prediction that were correctly classified as absent. NPV measures the chance of patient $i$ having symptom $j$ not present at episode $k$ given that the model predicts that it is not likely to be reported. The True Positive Rate (TPR) corresponds to the sensitivity defined above. The True Negative Rate (TNR), also called specificity, represents the capacity of the model to predict that symptom $j$ is not reported at episode $k$ when the symptom is truly absent.
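The four measures can be computed from paired observed and predicted binary outcomes as in the sketch below. The function name and the small example vectors are hypothetical and serve only to show the arithmetic.

```python
def predictive_measures(observed, predicted):
    """Return PPV, NPV, TPR (sensitivity) and TNR (specificity)
    for paired binary observations and predictions."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)

    def ratio(num, den):
        # Undefined when a denominator is zero (e.g. no positive predictions).
        return num / den if den else float("nan")

    return {
        "PPV": ratio(tp, tp + fp),
        "NPV": ratio(tn, tn + fn),
        "TPR": ratio(tp, tp + fn),
        "TNR": ratio(tn, tn + fp),
    }


# Hypothetical validation sample of eight symptom-episode pairs.
observed = [1, 1, 0, 0, 0, 0, 1, 0]
predicted = [1, 0, 0, 0, 1, 0, 1, 0]
print(predictive_measures(observed, predicted))
```

In this toy sample there are two true positives, four true negatives, one false positive and one false negative, so PPV and TPR both equal 2/3 while NPV and TNR both equal 4/5.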

Model Assessment
As a preliminary check, we examine the histogram of the residuals for each patient under the grouped symptom model with the multiplicative threshold $h(a_{ij}, b_{ik}) = a_{ij} b_{ik}$, where $a_{ij}$ is the symptom propensity and $b_{ik}$ is the episode intensity. Recall that patient $i$ reports symptom $j$ at episode $k$ when the threshold, $T_{ijk}$, satisfies $T_{ijk} \le a_{ij} b_{ik}$. Thus, the observed variable, $y_{ijk}$, is equal to 1 when symptom $j$ is reported at episode $k$ by patient $i$. Otherwise, $y_{ijk}$ takes value zero. Figure 1 presents the histogram for one patient, Subject 4028. The distribution pattern suggests that the residuals do follow a Uniform(0,1) distribution. To further confirm the distribution of the residuals, we also check the histogram of p-values for this patient. For comparison purposes, Figure 3 presents histograms of p-values for another patient, Subject 5088, when using the two thresholds, the multiplicative $h(a_{ij}, b_{ik}) = a_{ij} b_{ik}$ and the additive $h(a_{ij}, b_{ik}) = a_{ij} + b_{ik}$. Observing the posterior distributions of the p-values for Subject 5088, it can be seen that there is no strong evidence against the models tested, although there appears to be more evidence against the model when the additive threshold is fitted. This is evidenced by the higher concentration of p-values close to zero.
As implied earlier, to have strong evidence against a tested model, i.e. to reject the hypothesis that the model is adequate, the posterior p-values should be very small. Therefore, as a measure of model goodness of fit, the proportion of p-values less than 0.05, Pr(p-value < 0.05), is calculated for each subject. For comparison purposes, cases with greater Pr(p-value < 0.05) show stronger evidence against the model fit.
For 67% of the 66 subjects, the proportions of p-values below 0.05 suggest a better fit of the model with the multiplicative threshold, $h(a_{ij}, b_{ik}) = a_{ij} b_{ik}$. Bar plots in Figure 4 display the Pr(p-value < 0.05) obtained from the grouped symptoms model when using different thresholds for Subjects 3022, 4028, 5088, 4045, 4023 and 2013. For these patients, Pr(p-value < 0.05) under the additive threshold $h(a_{ij}, b_{ik}) = a_{ij} + b_{ik}$ (yellow bars) is higher than under the multiplicative threshold (red bars). The same procedure was repeated to compare the models with and without grouped symptoms. For both models the multiplicative threshold is used, and the proportion Pr(p-value < 0.05) is calculated. Only seven patients show a higher Pr(p-value < 0.05) when the grouped symptoms model is used compared to the model without grouped symptoms. This suggests that the model with grouped symptoms fits the data better.

Model Verification
The posterior predictive checking approach was applied to study the predictive ability of the core model (without grouped symptoms) and the grouped symptoms model, using the multiplicative threshold $h(a_{ij}, b_{ik}) = a_{ij} b_{ik}$. The plots in Figure 5 show the posterior distributions of the total predicted number of symptom reportings for Subject 4045, with blue (dotted) lines marking the total number of predicted reportings and red (solid) lines marking the total observed value. The symptom reportings for patient 4045 are very well predicted by the grouped symptoms model, as indicated by the blue and red lines that almost overlap. The predicted total was 15.26 with 95% CI (9, 22), and the total observed number of symptom reportings is 15. The non-grouped symptoms model also made a good prediction, although it slightly overestimated the total (17.63).
We also test the performance of different thresholds with the core model. Figure 6 gives the posterior distributions of the total predicted number of symptom reportings for Subject 5009. With each threshold, the predicted distributions comfortably contain the total observed number of reportings (represented by red solid lines). This indicates that, for this patient, we cannot distinguish between the three threshold models in terms of their predictive ability. Note that the graphs for all patients exhibit a similar trend.

Finally, the predictive performance of the different models is compared using data from all patients in the analysis, i.e. the models with and without grouped symptoms using the multiplicative threshold $h(a_{ij}, b_{ik}) = a_{ij} b_{ik}$ and the additive threshold $h(a_{ij}, b_{ik}) = a_{ij} + b_{ik}$. The results are provided in Table 1. Among the four models, the model with grouped symptoms and the multiplicative threshold gives the predicted value closest to the observed number of symptoms reported. Figure 7 shows the posterior distributions of the predicted number of symptoms reported. The total number of symptoms predicted to be reported is 754.3, with a 95% credible interval of (713, 797), which contains the observed number of reported symptoms, 771. The other three models considered here do not perform well in terms of this prediction, with the corresponding posterior predictive distributions failing to contain the true value.

Regarding the other four predictive measures presented in Table 1, the models explored here do not display substantial differences. It is also clear that prediction of symptoms not being experienced (NPV, TNR) is much more successful than prediction of reported symptoms (PPV, TPR). The fact that the developed models perform better in predicting that symptoms will not be reported may be explained by the nature of the data, where the frequency of reported symptoms is relatively low (771/7033).
The proportion of symptoms with a positive prediction that were correctly classified as reported is highest when using the core model with the multiplicative threshold $h(a_{ij}, b_{ik}) = a_{ij} b_{ik}$, i.e. PPV = 0.413. The chance of a symptom not being present in an episode, given that the model predicts it will not be reported, is also highest with this model (NPV = 0.933).

Conclusions
This paper discusses the assessment of models with different thresholds using the concept of stochastic latent residuals. It was found that the grouped symptoms model with the multiplicative threshold, $h(a_{ij}, b_{ik}) = a_{ij} b_{ik}$, fits the data better. From the model verification, in which posterior predictive checking was used to determine which model is best at predicting symptom reporting, it is concluded that the grouped symptoms model with the multiplicative threshold has more reliable predictive ability than the other models.