# Model Selection and Averaging of Health Costs in Treatment Groups Essay


Chapter 3: Model Selection and Averaging of Health Costs in Episode Treatment Groups

3.1 Data

We use ETG cost data from a major national health insurance company. The data set contains 33 million observations from 9 million claimants; each observation represents the total cost per claimant per year on one ETG. Policyholders with no claim cost on a given ETG have no zero record in the data set. There are 347 ETGs in all, including 320 non-routine ETGs such as AIDS, hemophilia, and personality disorder. We consider only the non-routine ETGs in this thesis, because little information can be gained from routine ETGs such as routine exams, vaccinations, conditional exams, and other preventive services. Basic summary statistics for several randomly selected ETGs are shown in Table 3.1 for illustration. Different ETGs have widely varying claim frequencies, means, and standard deviations.

Figure 3.1: Histograms of loss (left panel) and log-loss (right panel) for three ETGs

Histograms of these costs on both the original and the log scale give insight into the skewness and tail thickness of the data. Using that information, we choose plausible candidate distributions; specifically, we consider the lognormal, gamma, Lomax, and log-skew-t distributions. Although nearly all ETGs show a similar shape, with a heavy tail and right skewness on the original scale, the histograms of the log-scale costs vary across ETGs. Histograms for three randomly selected ETGs are shown in Figure 3.1. The total cost per claimant per year on each ETG is measured in dollars, so all values in the data set are positive.

3.2 Model Selection

Proper model selection for ETG-based costs is essential to adequately price and manage risk in health insurance. The optimal model (or the model probabilities) can change depending on the disease. As discussed in the introduction, model averaging lets us average the fits of a number of models instead of picking a single best model, giving the analyst greater insight into the relative merits of the competing models.

3.2.1 AIC and BIC Weights

Following the recommendations of Akaike (1978) and Burnham and Anderson (2002), we compute the change in AIC and BIC values relative to the best candidate model. In particular, we compute

$$
w_k \;=\; \frac{\exp(-\Delta_k/2)}{\sum_{j=1}^{K} \exp(-\Delta_j/2)},
\qquad
\Delta_k = \mathrm{AIC}_k - \min_{1 \le j \le K} \mathrm{AIC}_j,
\qquad k = 1, \dots, K,
\tag{3.1}
$$

where K denotes the number of candidate models. The weights w_k are known as AIC weights or Akaike weights; the analogous weights computed from BIC differences are called BIC weights. For illustrative purposes, the AIC values and Akaike weights of the four models for selected ETGs are provided in Table 3.2.
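Equation (3.1) can be computed directly from a vector of AIC values; a minimal sketch, using hypothetical AIC values for the four candidate models on one ETG:

```python
import numpy as np

def akaike_weights(aic):
    """Equation (3.1): w_k = exp(-Delta_k / 2) / sum_j exp(-Delta_j / 2)."""
    delta = np.asarray(aic, dtype=float) - np.min(aic)
    w = np.exp(-delta / 2.0)
    return w / w.sum()

# Hypothetical AIC values for lognormal, gamma, log-skew-t, and Lomax fits:
w = akaike_weights([10120.3, 10254.8, 10118.1, 10131.0])
```

Subtracting the minimum AIC before exponentiating keeps the computation numerically stable even when the raw AIC values are in the tens of thousands.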

For the randomly selected ETGs, the preferred distributions for some ETGs such as ETG-1301 and ETG-3868 are immediately apparent. The log-skew-t distribution is also dominant for ETG-2080 and ETG-3144, indicating that AIC values and Akaike weights strongly favor the log-skew-t distribution for most of these data sets. However, there are exceptions. For ETG-2070, the probability is split between two distributions: 0.882 for the lognormal model and 0.118 for the log-skew-t. For ETG-4370, the probability spreads across all four distributions: 0.002 for the lognormal, 0.087 for the gamma, 0.816 for the log-skew-t, and 0.095 for the Lomax.

3.2.2 Bayesian Inference and Parallel Model Selection

The Bayesian approach allows one to learn about the whole distribution of quantities of interest rather than just a point estimate of the parameters, which can be very useful in actuarial science. Rather than trying to identify the single best model, the parallel model selection method proposed by Congdon (2006) provides the posterior probability of each model being the best, enabling model averaging and offering deeper insight into the relationships between the models. The uncertainty in the model selection process can also be modeled explicitly.

Table 3.3: Prior distribution settings for the candidate models

We used the LaplacesDemon package in R to run parallel MCMC algorithms. Several algorithms were tried and compared, including Hit-and-Run Metropolis (Chen and Schmeiser, 1993), the No-U-Turn Sampler (Hoffman and Gelman, 2014; Bai, 2009), and Hamiltonian Monte Carlo (Neal, 2011). In most cases we ran three chains in parallel, where a sequence X_1, X_2, … of random elements of some set is a Markov chain if the conditional distribution of X_{n+1} given X_1, …, X_n depends only on X_n. The three MCMC chains were initialized with different starting values. When doing model selection, non-informative priors would overly penalize complex models, so we set our priors to be semi-informative: we examine the data, or the maximum likelihood estimates (MLEs) of the candidate model parameters, and seek hyperparameters that put most of the probability mass on a reasonable range around those parameter estimates. The prior distributions for the parameters of the candidate models are given in Table 3.3. Two other important settings are the burn-in and the thinning. Burn-in refers to discarding an initial portion of the Markov chain sample so that the effect of the initial values on the posterior inference is minimized. Thinning reduces sample autocorrelation by keeping only every k-th simulated draw from each chain. In practice, the robustness of the priors varies across distributions and prior choices. For example, our current choice for the lognormal distribution is very robust. The priors for the Lomax and log-skew-t distributions are relatively robust: they work well for almost all the ETGs but need longer to achieve convergence, so we usually assign a larger number of iterations and more burn-in samples for them. The prior for the gamma model had a moderately large impact on the results; our current choice is relatively robust and works well for almost all the ETGs.
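LaplacesDemon is an R package; as a language-neutral illustration of the burn-in and thinning mechanics described above, here is a minimal random-walk Metropolis sampler in Python. The lognormal target, flat priors, step size, and chain lengths are hypothetical choices for this sketch, not the thesis's actual settings:

```python
import numpy as np

def metropolis(log_post, init, n_iter, burn_in, thin, step, seed=0):
    """Random-walk Metropolis with burn-in and thinning (illustrative only)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(init, dtype=float)
    lp = log_post(theta)
    draws = []
    for i in range(n_iter):
        proposal = theta + step * rng.standard_normal(theta.shape)
        lp_prop = log_post(proposal)
        if np.log(rng.uniform()) < lp_prop - lp:        # Metropolis accept/reject
            theta, lp = proposal, lp_prop
        if i >= burn_in and (i - burn_in) % thin == 0:  # discard burn-in, keep every thin-th draw
            draws.append(theta.copy())
    return np.asarray(draws)

# Toy target: posterior of (mu, log sigma) for lognormal costs under flat priors.
rng = np.random.default_rng(1)
logy = np.log(rng.lognormal(mean=7.0, sigma=1.2, size=500))

def log_post(theta):
    mu, log_sigma = theta
    return -logy.size * log_sigma - np.sum((logy - mu) ** 2) / (2 * np.exp(2 * log_sigma))

draws = metropolis(log_post, init=[0.0, 0.0], n_iter=20_000, burn_in=5_000, thin=10, step=0.1)
```

In practice one would run several such chains from dispersed starting values, as the thesis does, and compare them to diagnose convergence before trusting the posterior draws.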

We applied parallel model selection to several randomly selected ETGs; the posterior model probabilities are given in Table 3.4. The preferred distributions for some ETGs such as hemophilia, AIDS, and agranulocytosis are immediately apparent. The lognormal distribution is also dominant for lung transplant and many others. For personality disorder, the probability is split between two distributions: 0.783 for the lognormal model and 0.217 for the log-skew-t.

In addition to improving understanding of the data, these probabilities can be used for model averaging. When one model is dramatically better than the others, knowing the best model is sufficient. When the candidate models fit some data sets comparably well, a simulation should account for that model uncertainty by drawing a proportion of the simulations from each of the models that fits the data well. For example, to simulate future ETG cost streams for personality disorder, 78.3% of the samples can be drawn from the lognormal distribution and 21.7% from the log-skew-t. Under standard methods, the proper model proportions are unknown.
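The model-averaged simulation can be sketched as follows. The posterior probabilities are taken from Table 3.4, but the distribution parameters are hypothetical, and the log-skew-t sampler shown is one standard construction (exponentiating an Azzalini-type skew-t built from a skew-normal draw divided by the square root of a scaled chi-square), not necessarily the thesis's exact parameterization:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def r_log_skew_t(n, xi, omega, alpha, df, rng):
    """Log-skew-t draws: exp(xi + omega * skew-normal / sqrt(chi2_df / df))."""
    z = stats.skewnorm.rvs(alpha, size=n, random_state=rng)
    w = rng.chisquare(df, size=n) / df
    return np.exp(xi + omega * z / np.sqrt(w))

def simulate_averaged(n, probs, samplers, rng):
    """Draw n costs, picking a component model per draw with its posterior probability."""
    which = rng.choice(len(probs), size=n, p=probs)
    out = np.empty(n)
    for k, sampler in enumerate(samplers):
        idx = which == k
        out[idx] = sampler(idx.sum())
    return out

# Posterior model probabilities for personality disorder (Table 3.4);
# the distribution parameters below are hypothetical.
probs = [0.783, 0.217]
samplers = [
    lambda m: rng.lognormal(mean=7.5, sigma=1.1, size=m),
    lambda m: r_log_skew_t(m, xi=7.0, omega=0.9, alpha=3.0, df=5.0, rng=rng),
]
costs = simulate_averaged(10_000, probs, samplers, rng)
```

Drawing the model indicator per simulated cost, rather than fixing one model, propagates the model uncertainty into the simulated cost stream.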

Table 3.4: Posterior model probabilities using parallel model selection for selected ETGs.

| ETG Code | ETG Description | lognormal | gamma | log-skew-t | Lomax |
| --- | --- | --- | --- | --- | --- |
| 1301 | AIDS | 0 | 0 | 1 | 0 |
| 1635 | Hyper-functioning adrenal gland | 0 | 0 | 1 | 0 |
| 1640 | Hypo-functioning parathyroid gland | 0 | 0 | 1 | 0 |
| 2068 | Agranulocytosis | 0 | 0 | 1 | 0 |
| 2070 | Hemophilia | 1 | 0 | 0 | 0 |
| 2080 | Anemia of chronic diseases | 0 | 0 | 1 | 0 |
| 2082 | Iron anemia | 0 | 0 | 1 | 0 |
| 2394 | Personality disorder | 0.783 | 0 | 0.217 | 0 |
| 3868 | Congestive heart failure | 0.45 | 0 | 0.55 | 0 |
| 4370 | Lung transplant | 0.999 | 0 | 0.001 | 0 |
| 4744 | Trauma of stomach or esophagus | 0 | 0 | 1 | 0 |
| 7112 | Juvenile arthritis | 0.999 | 0 | 0.001 | 0 |
3.2.3 Random Forest

In this subsection we present the procedure and highlight the benefit of Bayesian model averaging over traditional methods. Ideally we would apply the Bayesian approach to all the ETG cost data (more than 33 million samples). However, completing Bayesian inference and model selection on all ETGs takes a long time, so a faster approach is desirable for huge data sets. Random forests are an ensemble learning method for classification that works by constructing many decision trees at training time and outputting the class that is the mode of the classes output by the individual trees. The method grows a multitude of classification trees, and each tree outputs a classification.

We can think of the trees as voting for the classification, with the random forest choosing the classification with the most votes. As mentioned earlier, if we treat all the ETGs following one distribution as one cluster, selecting the best distribution is equivalent to putting the ETGs into the best cluster. In this case, we can extract features from each data set and use those features for random forest (RF) classification. We do not need to look at each data point, only some summary statistics, which saves a great deal of time. Our experiments also show that RF is extremely fast compared to the MLE approach (e.g., the system time for MLE is roughly 120 times that of RF). We can carry out RF model selection in the following three steps:

• Step 1: Domain-Specific Feature Extraction.

We extract 12 features (mean, median, standard deviation, interquartile range, median absolute deviation, 10th, 25th, 75th, and 90th percentiles, coefficient of variation, skewness, and kurtosis) from each data set, on both the original and the log scale. We therefore have 24 features in all for random forest model selection. The data are saved as one row per data set, with 24 columns per row. Essentially, there are two types of features:

– Moment-based features (e.g., mean, standard deviation, coefficient of variation, skewness, and kurtosis) for the raw data, and the same measures for the log data.

– Percentile-based features (e.g., 10th, 25th, 50th, 75th, and 90th percentiles, median absolute deviation, and interquartile range) for the raw data, and the same measures for the log data.
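The Step 1 extraction can be sketched directly from the feature list above; this is a plausible reading of the 12 features, since the thesis does not give exact formulas (e.g., whether kurtosis is excess kurtosis is an assumption here):

```python
import numpy as np
from scipy import stats

def extract_features(x):
    """The 12 summary features listed above, for one cost sample."""
    q10, q25, q50, q75, q90 = np.percentile(x, [10, 25, 50, 75, 90])
    sd = np.std(x, ddof=1)
    return np.array([
        np.mean(x), q50, sd,                 # mean, median, standard deviation
        q75 - q25,                           # interquartile range
        np.median(np.abs(x - q50)),          # median absolute deviation
        q10, q25, q75, q90,                  # 10th, 25th, 75th, 90th percentiles
        sd / np.mean(x),                     # coefficient of variation
        stats.skew(x), stats.kurtosis(x),    # skewness, (excess) kurtosis
    ])

def features_24(x):
    """One row of the feature matrix: 12 raw-scale plus 12 log-scale features."""
    return np.concatenate([extract_features(x), extract_features(np.log(x))])
```

Because all ETG costs are positive, the log-scale features are always well defined.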

• Step 2: Random Forest Training for Prediction.

Create a moderate-sized data set (e.g., 600 observations for each distribution) with known response labels to train the random forest. Our experiments show that the number of observations can be chosen as roughly the square of the number of variables in the random forest to achieve a reasonable out-of-bag error rate. We have 24 covariates here (24² = 576), so a data set with 600 observations per distribution is sufficient.

• Step 3: Random Forest Model Selection.

Apply the random forest trained in Step 2 to the original data set, using the features generated in Step 1.
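The three steps can be sketched end to end with scikit-learn. The distribution parameters and the 500-cost sample size per simulated data set are hypothetical; the labels, 600 training samples per distribution, and m = 6 variables per split (`max_features=6`) follow the text:

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

def feats(x):
    """Compact version of the 24 Step-1 features (12 raw-scale, 12 log-scale)."""
    def f12(v):
        q = np.percentile(v, [10, 25, 50, 75, 90])
        s = np.std(v, ddof=1)
        return [np.mean(v), q[2], s, q[3] - q[1], np.median(np.abs(v - q[2])),
                q[0], q[1], q[3], q[4], s / np.mean(v),
                stats.skew(v), stats.kurtosis(v)]
    return np.array(f12(x) + f12(np.log(x)))

# Step 2: 600 labelled training samples per distribution; each "observation" is
# one simulated data set of 500 costs. All parameter values are hypothetical.
def simulate(label, n=500):
    if label == 0:
        return rng.lognormal(7.0, 1.2, n)      # lognormal
    if label == 1:
        return rng.gamma(2.0, 1500.0, n)       # gamma
    return rng.pareto(2.5, n) * 800.0          # Lomax (NumPy's pareto is Pareto II)

X = np.array([feats(simulate(lab)) for lab in (0, 1, 2) for _ in range(600)])
y = np.repeat([0, 1, 2], 600)

# m = 6 variables tried at each split, as in the text (max_features=6).
rf = RandomForestClassifier(n_estimators=200, max_features=6,
                            oob_score=True, random_state=0).fit(X, y)

# Step 3: classify a new ETG cost data set by its extracted features.
pred = rf.predict([feats(rng.lognormal(7.0, 1.2, 500))])[0]
```

The `oob_score_` attribute gives the out-of-bag accuracy, the complement of the out-of-bag error rate reported in Table 3.5.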

In Step 1, we first use the two groups of features (moment-based and percentile-based) separately, and find that the approach based on moment-type features outperforms the percentile-type approach in distinguishing distributions. Moreover, using both moment-based and percentile-based features achieves the lowest out-of-bag error rate and the best performance in distinguishing distributions. These findings are summarized in Table 3.5.

Table 3.5: Performance of moment-based features versus percentile-based features.

| Candidate Models Used | Feature Choice | Out-of-Bag Error Rate |
| --- | --- | --- |
| lognormal, gamma, Lomax | Moment-based features only | 0.25% |
| | Percentile-based features only | 1.00% |
| | Both types of features | 0.08% |
| lognormal, gamma, Lomax, log-skew-t | Moment-based features only | 3.53% |
| | Percentile-based features only | 13.63% |
| | Both types of features | 2.01% |

The performance of RF also depends on the difficulty of the task. If the clusters have obviously distinct features (e.g., there is a huge difference between the lognormal, gamma, and Lomax distributions), RF recognizes that and the misclassification rate is very low. But if the clusters are quite similar, distinguishing the models is harder: the more candidate distributions with similar features, the worse the random forest performs.

Table 3.5 shows the RF classification results on the training data, and Table 3.6 shows the results on the testing data. Since RF grows many classification trees, we set the number of cases in the training set to 4,000, sampling 4,000 cases at random, with replacement, from the original data set. We also have 24 input variables; typically a number m ≤ 24 is specified such that at each node, m variables are selected at random out of the 24 and the best split on these m is used to split the node. The value of m is held constant while the forest grows. Here we chose the optimal m as 6, determined by experimentation.

Figure 3.2: Multidimensional scaling plots of the proximity matrix for two scenarios.

Multidimensional scaling is an ordination technique for visualizing the degree of similarity between individual cases in a data set. It aims to place each object in an n-dimensional space such that the between-object distances are preserved as well as possible. In Figure 3.2, the statistical features of each data set are represented by a point in a two-dimensional space. The points are arranged so that the distance between each pair of points reflects the similarity of the corresponding pair of objects: two similar objects are represented by points that are close together, and two dissimilar objects by points that are far apart. Tables 3.5 and 3.6 tell us that when only three distributions (gamma, lognormal, Lomax) are considered, they are easily distinguishable. When the log-skew-t distribution is added to the mix, more similarity is introduced, because some points with different labels lie close together. The most difficult task is therefore the classification of all four distributions (lognormal, gamma, log-skew-t, Lomax), because points from different distributions cannot be easily distinguished.
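Plots in the style of Figure 3.2 can be produced from the random forest's proximity matrix, where the proximity of two cases is the fraction of trees in which they land in the same leaf. This is a sketch under assumptions: the simulated distributions, their parameters, and the reduced four-feature summary are all made up for illustration:

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import MDS

rng = np.random.default_rng(0)

# 100 simulated cost samples per distribution (hypothetical parameters),
# each summarized by four log-scale moment features.
def feats(x):
    lx = np.log(x)
    return [lx.mean(), lx.std(ddof=1), stats.skew(lx), stats.kurtosis(lx)]

samples = ([rng.lognormal(7.0, 1.2, 400) for _ in range(100)]
           + [rng.gamma(2.0, 1500.0, 400) for _ in range(100)]
           + [rng.pareto(2.5, 400) * 800.0 for _ in range(100)])
X = np.array([feats(s) for s in samples])
y = np.repeat([0, 1, 2], 100)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# RF proximity of two cases: the fraction of trees in which they share a leaf.
leaves = rf.apply(X)                                        # (300, 100) leaf indices
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# Metric MDS embeds 1 - proximity as a dissimilarity in two dimensions.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(1.0 - prox)
```

Scatter-plotting `coords` colored by `y` then shows whether the three distribution clusters separate, as in the left panel of Figure 3.2.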

3.3 Results

To determine how well the methods work in our setting, we set up a simulation study. To begin, we use the MLE approach to fit the four distributions to the same real ETG data. We then use these MLE-fitted models to simulate four random samples of 600 observations each, following the lognormal, gamma, log-skew-t, and Lomax distributions, respectively. After that we apply the three model selection methodologies (AIC weights, RF, Bayesian) to the simulated data sets and check how accurately each approach identifies the true model. Our findings are summarized in Table 3.8.
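A scaled-down version of the AIC-weights leg of this check can be sketched with SciPy MLE fits. Only three of the four candidates appear here, since SciPy ships no log-skew-t distribution, and the "true" lognormal parameters are hypothetical rather than fitted to real ETG data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.lognormal(mean=7.0, sigma=1.2, size=600)   # "true" model: lognormal

candidates = {"lognormal": stats.lognorm, "gamma": stats.gamma, "Lomax": stats.lomax}
aic = {}
for name, dist in candidates.items():
    params = dist.fit(y, floc=0)                   # MLE with location fixed at 0
    k = len(params) - 1                            # free parameters (loc is fixed)
    aic[name] = 2 * k - 2 * np.sum(dist.logpdf(y, *params))

best_aic = min(aic.values())
w = {m: np.exp(-(a - best_aic) / 2) for m, a in aic.items()}
total = sum(w.values())
weights = {m: v / total for m, v in w.items()}
best = max(weights, key=weights.get)
```

Repeating this for samples drawn from each candidate in turn, and tabulating how often each true model wins, produces one row block of a Table 3.8-style accuracy matrix.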

Table 3.8: Model selection accuracy: AIC weights, random forest, Bayesian.

In each 4×4 matrix in Table 3.8, if the probabilities on the diagonal are close to 100%, the metric accurately selects the true model. From the results, we can observe and compare the degree of model uncertainty and the predictive power of the different metrics. Although it is the most computationally intensive of the three methods, the Bayesian approach performs best on average: it identifies the lognormal and log-skew-t distributions exactly, and it is only slightly less certain about the gamma and Lomax than AIC weights. AIC weights also do a good job on average. Random forest performs slightly worse than the other two, but it still identifies the best-fitting model almost surely, and its efficiency is valuable for big data sets at little cost in accuracy.

Next, we apply the random forest and AIC weights metrics to perform the model selection exercise for all 320 ETGs. We did not apply Bayesian parallel model selection in this second step because we only had access to an 8 GB ThinkPad with a 2.50 GHz Intel quad-core processor. Based on our experience, assuming the data set has fewer than 5,000 observations and the chains converge, it takes about 2 hours to fit all the candidate distributions for a single ETG. Sometimes the chains do not converge, and more time is needed to either increase the number of iterations or recheck the prior distributions. The approximate time to complete Bayesian inference and model selection on all ETGs is 4 weeks, so Bayesian parallel model selection does not scale to big data without supercomputers. Although MLE is usually considered an efficient method, it still takes about 4 hours in all to finish model selection for all the ETGs. In contrast, random forest feature classification can be done within 2 minutes. The difference arises because AIC weights use every observation for inference and model selection, whereas the random forest does model selection on the extracted features, a far smaller data set than the original. Extracting the features from the original data set also takes a small amount of time, but the total time for feature extraction plus classification with the random forest is still much less than per-observation inference. Table 3.9 shows the speed comparison among the methodologies.

Table 3.9: Speed comparison (on all 320 ETGs).

| Model Selection Methodology | Time |
| --- | --- |
| Random Forest | ~2 minutes |
| AIC and BIC | ~4 hours |
| Bayesian | ~4 weeks |

Now we explore how consistently the RF and AIC methodologies select the same model across all 320 ETGs. First, in Table 3.10, we use only three distributions (lognormal, gamma, Lomax) as candidates for model selection; these three have obviously distinct features. In the 3×3 matrix, RF and AIC agree on the 197 ETG model selections on the diagonal. For some ETGs, AIC prefers the lognormal where RF selects the Lomax.

Table 3.10: Comparison of model assignments by RF and AIC for all 320 ETGs (three candidate models used).

Rows give the distribution selected by RF; columns give the distribution selected by AIC.

| Distribution Selected by RF | lognormal | gamma | Lomax | RF Total |
| --- | --- | --- | --- | --- |
| lognormal | 100 | 11 | 19 | 130 |
| gamma | 1 | 5 | 3 | 9 |
| Lomax | 87 | 2 | 92 | 181 |
| AIC Total | 188 | 18 | 114 | 320 |

Next, in Table 3.11, we use four distributions (lognormal, gamma, Lomax, log-skew-t). AIC has an evident preference for the log-skew-t distribution, selecting it for 292 of the 320 ETGs. Random forest also selects the log-skew-t distribution for most ETGs, but at the same time it assigns 131 ETGs to the lognormal distribution. One common theme is that neither metric selects the gamma distribution for any ETG. That is understandable: compared to the other candidates, the gamma is relatively light-tailed, and given the heavy tails of most ETG costs, once the log-skew-t distribution is among the candidates, no metric will select the gamma as the best model.