Model Selection and Averaging of Health Costs in Treatment Groups

Data

The data used are ETG (Episode Treatment Group) cost data from a top national health insurance company. They include 33 million observations from 9 million claimants. Each observation represents the annual total cost per claimant for each ETG. Policyholders without claim costs for specific ETGs are not included in the dataset.

There are a total of 347 ETGs, including 320 non-routine ETGs such as AIDS, hemophilia, and personality disorder. This thesis focuses on these non-routine ETGs because routine ETGs such as routine tests, vaccinations, conditional tests, and other preventive services do not provide much useful information. Table 3.1 shows basic summary statistics for several randomly selected ETGs as an example. Different ETGs vary in claim frequencies, means, and standard deviations.


Figure 2.1: Histograms of loss (left panel) and log-loss (right panel) for three ETGs.

The histograms of costs on both the original and log scales provide insight into the skewness and thickness of the tails of the data. Using this information, we select plausible candidate distributions.

Specifically, we consider the lognormal, gamma, Lomax, and log-skew-t distributions as candidates. Although all the ETGs have a similar shape, with a heavy tail and right skewness on the original scale, the histograms of the costs on the log scale vary among ETGs. Figure 2.1 shows the histograms for three randomly selected ETGs. The total cost per claimant per year for each ETG is measured in dollars, so all values in the data set are positive.
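As a quick illustration, this exploratory step can be reproduced with a few lines of R. This is a minimal sketch: etg_cost is a placeholder vector standing in for one ETG's annual claim costs, not the actual data.

# Minimal sketch (R): inspect one ETG's cost distribution on both scales.
set.seed(1)
etg_cost <- rlnorm(5000, meanlog = 7, sdlog = 1.2)   # placeholder positive costs

par(mfrow = c(1, 2))
hist(etg_cost, breaks = 50, main = "Loss", xlab = "Annual cost ($)")
hist(log(etg_cost), breaks = 50, main = "Log-loss", xlab = "log(annual cost)")
par(mfrow = c(1, 1))

Heavy right skewness on the original scale and a roughly symmetric or mildly skewed shape on the log scale are what motivate the lognormal and log-skew-t candidates.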

Model Selection

Proper model selection for these ETG-based costs is essential to adequately price and manage risks in health insurance. The optimal model (or the model probabilities) can change depending on the disease.

As mentioned previously, model averaging allows us to average the fits for multiple models rather than selecting a single best model. This approach provides the analyst with a better understanding of the relative merits of the competing models.

AIC and BIC Weights

Following the recommendations of Akaike (1978) and Burnham and Anderson (2002), we calculate the differences in AIC (or BIC) values relative to the top-performing candidate model, Δ_i = AIC_i - AIC_min, and convert them into weights w_i = exp(-Δ_i/2) / Σ_{k=1}^{K} exp(-Δ_k/2), where K represents the number of candidate models. These weights are known as AIC weights or Akaike weights.
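The weight calculation above is easy to carry out directly. The following is a minimal sketch in R; the aic vector holds hypothetical AIC values for the four candidate models and is not taken from Table 3.2.

# Minimal sketch (R): Akaike weights from a set of AIC values.
aic <- c(lognormal = 10234.1, gamma = 10310.7, log.skew.t = 10221.4, Lomax = 10286.9)  # hypothetical

delta <- aic - min(aic)        # differences from the best candidate
w <- exp(-delta / 2)
w <- w / sum(w)                # Akaike weights, summing to 1
round(w, 3)

BIC weights follow the same recipe, with BIC values in place of AIC.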

Table 3.2 presents the AIC values and Akaike weights for the four candidate models on selected ETGs as an example. Notably, certain ETGs, such as ETG-1301 and ETG-3868, exhibit distinct best-fitting distributions. For ETG-2080 and ETG-3144, the log-skew-t distribution appears dominant. In other words, most of these data sets heavily support the log-skew-t distribution based on their AIC values and Akaike weights. Nevertheless, there are cases where this is not true.

For ETG-2070, the lognormal model receives a weight of 0.882 and the log-skew-t distribution a weight of 0.118. For ETG-4370, the weights are 0.002 for the lognormal model, 0.087 for the gamma distribution, 0.816 for the log-skew-t distribution, and 0.095 for the Lomax distribution.

Bayesian Inference and Parallel Model Selection

The Bayesian approach allows for learning about the entire distribution of the quantities of interest rather than just estimating parameters, which is particularly useful in actuarial science. Instead of identifying a single best model, the parallel model selection method proposed by Congdon (2006) provides the posterior probability of each model being the best, enabling model averaging and deeper insight into the relationships among the models. The uncertainty in the model selection process can also be explicitly modeled.
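One reading of the Congdon (2006) calculation can be sketched as follows. This is an illustration, not the thesis code: logL is assumed to be a T x K matrix of total log-likelihoods, one column per candidate model, each evaluated at that model's own MCMC draw at each of T post-burn-in iterations, and prior holds the prior model probabilities.

# Hedged sketch (R): posterior model probabilities from parallel MCMC output,
# in the spirit of Congdon (2006).
congdon_weights <- function(logL, prior = rep(1 / ncol(logL), ncol(logL))) {
  lw <- sweep(logL, 2, log(prior), "+")   # add log prior model probabilities
  lw <- lw - apply(lw, 1, max)            # stabilise before exponentiating
  w  <- exp(lw) / rowSums(exp(lw))        # iteration-level model weights
  colMeans(w)                             # estimated posterior model probabilities
}

Averaging the iteration-level weights over the retained draws gives one estimate of the probability that each candidate model is the best.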

We used the LaplacesDemon package in R to implement the parallel MCMC algorithms. Multiple algorithms were tested and compared, including Hit-and-Run Metropolis (Chen and Schmeiser, 1993), the No-U-Turn Sampler (Hoffman and Gelman, 2014; Bai, 2009), and Hamiltonian Monte Carlo (Neal, 2011). In most cases, we ran three chains concurrently, each a Markov chain of random elements X_1, X_2, ..., in which the conditional distribution of X_{n+1} given X_1, ..., X_n depends on X_n only. We initialized the three MCMC chains with different starting values. For model selection, non-informative priors penalize complex models excessively, so we opted for semi-informative priors.
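To make the workflow concrete, here is a hedged sketch of fitting a single candidate model (the lognormal) to one ETG with LaplacesDemon. Everything shown is illustrative: y is placeholder data, the priors and iteration counts are not the thesis settings, and the Algorithm and Specs arguments should be checked against the package documentation before use. Repeating the call with different Initial.Values gives the three chains mentioned above.

# Hedged sketch (R): one candidate model, one ETG, with LaplacesDemon.
library(LaplacesDemon)

set.seed(1)
y <- rlnorm(2000, meanlog = 7, sdlog = 1.2)    # placeholder annual costs

MyData <- list(N = length(y), y = y,
               mon.names = "LP",
               parm.names = c("mu", "log.sigma"))

Model <- function(parm, Data) {
  mu    <- parm[1]
  sigma <- exp(parm[2])                        # keeps sigma positive
  # Semi-informative priors, loosely centred near the MLEs (illustrative only).
  lp.prior <- dnorm(mu, mean = 7, sd = 5, log = TRUE) +
              dnorm(parm[2], mean = 0, sd = 2, log = TRUE)
  LL <- sum(dlnorm(Data$y, meanlog = mu, sdlog = sigma, log = TRUE))
  LP <- LL + lp.prior
  list(LP = LP, Dev = -2 * LL, Monitor = LP,
       yhat = rlnorm(length(Data$y), mu, sigma), parm = parm)
}

Initial.Values <- c(mean(log(y)), log(sd(log(y))))

Fit <- LaplacesDemon(Model, Data = MyData, Initial.Values,
                     Iterations = 20000, Status = 1000, Thinning = 10,
                     Algorithm = "HARM", Specs = NULL)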

Afterwards, we examine the data or the maximum likelihood estimates (MLEs) of the candidate model parameters and attempt to determine hyperparameters that concentrate most of the prior probability on a reasonable range around those parameter estimates. The prior distributions for the parameters of the candidate models are provided. The other two important settings are the burn-in sample and the thinned sample. Burn-in refers to discarding an initial portion of a Markov chain sample so that the influence of the starting values on the posterior inference is minimized. Thinning is introduced to reduce sample autocorrelation by keeping only every K-th simulated draw from each sequence. In practice, the robustness of the priors varies among distributions and prior choices.

Our current choice of prior for the lognormal distribution is highly robust. The priors for the Lomax and log-skew-t distributions are also fairly robust, though comparatively less so. These priors work well for almost all the ETGs, but those models require more time to achieve convergence, so we usually allocate a larger number of iterations and more burn-in samples for them. The prior for the gamma model had a moderately significant impact on the results. Overall, our current choice of priors is relatively robust and effective for almost all the ETGs.

We applied parallel model selection to a variety of randomly chosen ETGs. The resulting model probabilities are provided in Table 3.3. Clear patterns for certain ETGs, such as hemophilia, AIDS, and agranulocytosis, can be immediately observed. The dominant distribution for lung transplant and many others is also the lognormal distribution. For personality disorder, the probability is divided between two distributions: 0.783 for the lognormal model and 0.217 for the log-skew-t model. These probabilities, in addition to enhancing our understanding of the data, can be employed for model averaging.

When one model fits markedly better than the others, knowing the best model is enough. However, when multiple models fit certain data sets similarly well, a simulation should account for this uncertainty by drawing a proportion of the simulated values from each competing model. For instance, to simulate future ETG cost streams for personality disorder, 78.3% of the samples can be drawn from a lognormal distribution and 21.7% from a log-skew-t distribution. In traditional methods, the correct proportions for the models are unknown.
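A model-averaged simulation of this kind is straightforward to code. The sketch below is illustrative: the lognormal and log-skew-t parameter values are made up, and the log-skew-t draw assumes the rst() function and parameterisation of the sn package.

# Hedged sketch (R): drawing future costs with model-averaged proportions.
library(sn)

simulate_mixed <- function(n, p = c(lognormal = 0.783, log.skew.t = 0.217)) {
  which_model <- sample(names(p), size = n, replace = TRUE, prob = p)
  out <- numeric(n)
  n.ln <- sum(which_model == "lognormal")
  n.st <- n - n.ln
  out[which_model == "lognormal"]  <- rlnorm(n.ln, meanlog = 7, sdlog = 1.3)
  out[which_model == "log.skew.t"] <- exp(rst(n.st, xi = 6.8, omega = 1.1,
                                              alpha = 2, nu = 5))
  out
}

sims <- simulate_mixed(10000)   # roughly 78.3% lognormal draws, 21.7% log-skew-t draws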

Table 3.4 presents the posterior model probabilities for specific ETGs, obtained using parallel model selection. The table includes the ETG Code, ETG Description, and the lognormal, gamma, log-skew-t, and Lomax values:

ETG Code  ETG Description                  lognormal  gamma  log-skew-t  Lomax
1635      Hyper-functioning adrenal gland  0          0      1           0

In this section, we discuss the process and advantages of Bayesian model averaging compared to traditional methods. Our goal is to apply the Bayesian approach to obtain cost information for all ETGs, which involves over 33 million samples. Performing Bayesian inference and model selection on all ETGs is a time-consuming task.

Therefore, a faster approach is desired for processing large data sets. Random forests are a machine learning technique that builds multiple decision trees and uses their collective output to determine a category. Each tree in the forest generates a classification, and the random forest selects the category with the most votes. In the context of ETGs, selecting the best distribution for each ETG is equivalent to putting that ETG into the optimal cluster.

The random forest (RF) classification method allows us to work with specific features extracted from each data set. Instead of examining every data point, we can use summary statistics, which saves time. Our experiments show that RF is much faster than the MLE approach; for instance, the system time for MLE is approximately 120 times longer than for RF.

We can perform our RF model selection using the following three steps:


• Step 1: Domain-Specific Feature Extraction.

We extract 12 features (mean, median, standard deviation, interquartile range, average absolute deviation, the 10th, 25th, 75th, and 90th percentiles, coefficient of variation, skewness, and kurtosis) from each data set on both the original and log scales. This gives a total of 24 features for random forest model selection. The data are stored as one row per data set, with 24 columns in each row. There are essentially two groups of features: moment-based features (mean, standard deviation, coefficient of variation, skewness, and kurtosis), calculated for the raw data and the log-data, and percentile-based features (median, the 10th, 25th, 75th, and 90th percentiles, average absolute deviation, and interquartile range), also calculated for the raw data and the log-data. A sketch of all three steps follows the list below.

• Step 2: Random Forest Training for Prediction.

Build a moderately sized training data set (e.g., 600 observations for each distribution) with known response labels to train the random forest. The number of observations can be chosen as roughly the square of the number of variables to achieve a sensible out-of-bag error rate; there are 24 covariates here, so a data set with 600 observations will be sufficient.

• Step 3: Random Forest Model Selection.

Apply the trained random forest from Step 2 to the original data sets, using the features generated in Step 1.
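The three steps can be sketched in R with the randomForest package. This is an illustration under simplifying assumptions: only two of the candidate families are simulated for training, the parameter values are made up, and the helper names (features_one_scale, extract_features, sim_features) are hypothetical rather than the thesis code.

# Hedged sketch (R): feature extraction, RF training, and RF model selection.
library(randomForest)

# Step 1: the 12 summary features on one scale.
features_one_scale <- function(x) {
  m <- mean(x); s <- sd(x)
  c(mean = m, median = median(x), sd = s, iqr = IQR(x),
    aad = mean(abs(x - m)),                              # average absolute deviation
    q10 = quantile(x, 0.10, names = FALSE),
    q25 = quantile(x, 0.25, names = FALSE),
    q75 = quantile(x, 0.75, names = FALSE),
    q90 = quantile(x, 0.90, names = FALSE),
    cv = s / m,
    skew = mean((x - m)^3) / s^3,
    kurt = mean((x - m)^4) / s^4)
}
extract_features <- function(x) c(raw = features_one_scale(x),
                                  log = features_one_scale(log(x)))  # 24 features

# Step 2: a labelled training set (600 simulated data sets per family) and the RF.
sim_features <- function(rdist, n_sets = 600, n_obs = 500) {
  t(replicate(n_sets, extract_features(rdist(n_obs))))
}
set.seed(42)
X <- rbind(sim_features(function(n) rlnorm(n, meanlog = 7, sdlog = 1.2)),
           sim_features(function(n) rgamma(n, shape = 2, rate = 0.001)))
y <- factor(rep(c("lognormal", "gamma"), each = 600))
rf <- randomForest(x = X, y = y, ntree = 500, mtry = 6, proximity = TRUE)

# Step 3: classify a real ETG by its extracted features.
etg_cost <- rlnorm(3000, meanlog = 7.5, sdlog = 1.1)     # placeholder ETG data
predict(rf, newdata = t(extract_features(etg_cost)))

Adding the log-skew-t and Lomax families to the training step follows the same pattern, with their own random generators.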

In Step 1, we employ two groups of features: moment-based features and percentile-based features. We find that the approach based on moment-type features is more effective at separating distributions than the percentile-type approach. By using both moment-based and percentile-based features, we achieve the lowest out-of-bag error rate and the best performance in separating distributions. These findings are summarized in Table 3.5: performance of moment-based features versus percentile-based features.


Candidate models used: lognormal, gamma, Lomax

Feature choice                          Out-of-bag error rate
Moment-based features only              0.25%
Percentile-based features only          1.00%
Both types of features                  0.08%

The performance of RF also depends on the difficulty of the task. If the clusters have distinct characteristics (such as the noticeable differences among the lognormal, gamma, and Lomax distributions), RF recognizes this and has a very low misclassification rate. However, if the clusters are similar, it becomes more difficult to differentiate between the models: the more candidate distributions with similar features, the worse the random forest performs. Table 3.5 displays the RF classification results on training data, while Table 3.6 shows the results on testing data. As RF grows multiple classification trees, we set the number of instances in the training set to 4,000, sampling 4,000 instances with replacement from the original data set.

Our model includes 24 input variables. Typically, a value of m much smaller than 24 is chosen, and at each node m variables are randomly selected out of the 24; the node is split using the best split among these m variables. The value of m remains constant while the forest is grown. In this case, we determined through experimentation that the optimal value of m is 6.
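One way to arrive at such a value is a small grid search over mtry using the out-of-bag error, for example with tuneRF() from the randomForest package. The call below is a hedged sketch reusing the X and y objects from the earlier sketch; the starting value and step factor are arbitrary.

# Hedged sketch (R): searching for a good mtry (m) by out-of-bag error.
set.seed(42)
tuneRF(x = X, y = y, mtryStart = 4, stepFactor = 1.5, improve = 0.01,
       ntreeTry = 500, trace = TRUE, plot = FALSE)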


Figure 3.2: Multidimensional scaling plots of the proximity matrix for two scenarios.

Multidimensional scaling is a technique that visualizes the similarity between individual instances in a dataset.

The objective is to position each object in an N-dimensional space so that the distances between objects are preserved as well as possible. Figure 3.2 illustrates this by representing each data set with a point in a two-dimensional space. The points are arranged so that the distances between pairs of points reflect the similarities between the objects: similar objects are denoted by closely positioned points, while dissimilar objects are represented by points that are farther apart. Tables 3.5 and 3.6 demonstrate that it is easy to distinguish among the three distributions (gamma, lognormal, Lomax). However, when the log-skew-t distribution is introduced, more overlap arises, as points with different shapes become close together.

Therefore, it is clear that the most difficult task is categorizing all four distributions (lognormal, gamma, log-skew-t, Lomax), because it is not easy to distinguish the points from the different distributions.
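For reference, the proximity-based MDS view itself can be produced directly from a fitted random forest. The sketch below reuses the rf and y objects from the earlier sketch (the forest must be grown with proximity = TRUE); MDSplot() is the randomForest helper, and the cmdscale() line is an equivalent base-R route.

# Hedged sketch (R): multidimensional scaling of the RF proximity matrix.
MDSplot(rf, fac = y, k = 2, pch = 20)

# Equivalent classical MDS on 1 - proximity:
coords <- cmdscale(1 - rf$proximity, k = 2)
plot(coords, col = as.integer(y), pch = 20,
     xlab = "Dimension 1", ylab = "Dimension 2")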

Results

To assess how well the methods perform in our scenarios, we conducted a simulation study. Firstly, we used the maximum likelihood estimation (MLE) approach to fit four distributions to the same actual ETG data. Then, we used these MLE-fitted models to generate four random samples, each containing 600 observations following one of the lognormal, gamma, log-skew-t, and Lomax distributions, respectively. Subsequently, we applied three model selection methodologies (AIC weights, RF, Bayesian) to the simulated data sets and examined how accurately each approach identified the true model. The results are summarized in Table 3.8.
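The data-generation step of this simulation can be sketched as follows. The parameter values are illustrative stand-ins for the fitted MLEs, the log-skew-t sample assumes the rst() parameterisation of the sn package, and the Lomax sample assumes rpareto() from the actuar package (Pareto type II).

# Hedged sketch (R): 600 simulated observations from each candidate family.
library(sn)
library(actuar)

set.seed(2017)
sim <- list(
  lognormal  = rlnorm(600, meanlog = 7, sdlog = 1.2),
  gamma      = rgamma(600, shape = 1.5, rate = 0.0008),
  log.skew.t = exp(rst(600, xi = 6.8, omega = 1.1, alpha = 2, nu = 5)),
  Lomax      = rpareto(600, shape = 2.5, scale = 2000))

Each of the three selection methods is then applied to every simulated data set, and the chosen model is compared with the family that generated it.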

Each 4×4 matrix in Table 3.8 indicates that a method accurately selects the true model when the probabilities on its diagonal are close to 100%. From the results, we can observe and compare the level of uncertainty and the predictive power of the different metrics. Although it is the most computationally intensive method, the Bayesian approach performs best on average: it accurately identifies the lognormal and log-skew-t distributions and is only slightly less certain about the gamma and Lomax distributions than AIC weights. AIC weights perform well on average. Random forest performs slightly worse than the other two, but it can still almost always identify the best-fitting model, and its efficiency is valuable when dealing with large data sets, without sacrificing much accuracy.

Next, we applied the random forest and AIC weights metrics to perform the model selection exercise for all 320 ETGs. We did not use Bayesian parallel model selection in this second step because we only have access to an 8 GB ThinkPad with a 2.50 GHz Intel quad-core processor for modeling. In our experience, assuming the data set has fewer than 5,000 observations and the chains converge, it takes about 2 hours to fit all five distributions for an individual ETG. Sometimes the chains do not converge, and we then need more time to either increase the number of iterations or recheck the prior distributions.

The estimated time to complete Bayesian inference and model selection on all ETGs is 4 weeks. Therefore, Bayesian parallel model selection does not work well for large data sets without supercomputers. Although MLE is typically seen as an efficient method, it still takes about 4 hours in total to complete model selection for all the ETGs. In contrast, random forest feature-based classification can be done within 2 minutes.

This can be explained by the fact that, when using AIC weights, every observation is used for inference and model selection. For random forest, on the other hand, model selection is done on the extracted features of each data set, which form a much smaller set than the original data. Extracting the features from the original data set also takes some time, but compared to performing inference and model selection on every observation, the total time for feature extraction and classification with the random forest is still much less. Table 3.9 presents the speed comparison among the methodologies.

Table 3.9: Speed comparison (all 320 ETGs).

Model Selection Methodology    Time
Random Forest                  ~2 minutes
AIC and BIC                    ~4 hours
Bayesian                       ~4 weeks

Now we explore how consistent the RF and AIC methodologies are in choosing the same model (for all 320 ETGs).

First, in Table 3.10, we use only three distributions (lognormal, gamma, Lomax) as candidates for model selection. These three distributions have clearly distinguishable features. In the 3×3 matrix, RF and AIC agree on the model choices for the 197 ETGs on the diagonal. For some ETGs, compared with RF, AIC prefers the lognormal to the Lomax distribution.

Table 3.10 shows the comparison of the model assignments by RF and AIC for all 320 ETGs when three candidate models are used. The table includes the distributions selected by RF and AIC, as well as the RF totals.

The table also shows cases where RF selected the lognormal distribution while AIC selected the gamma distribution. Additionally, RF assigned 11 ETGs to the Lomax distribution, whereas AIC assigned 100 ETGs to the lognormal distribution.

Further analysis, shown in Table 3.10, reveals that when four distributions (lognormal, gamma, Lomax, log-skew-t) are considered, AIC has a strong preference for the log-skew-t distribution: it selects this model for 292 of the 320 ETGs. The random forest also selects the log-skew-t distribution for most ETGs, but it assigns 131 ETGs to the lognormal distribution.

The gamma distribution is not chosen for any ETG in this study because it is relatively light-tailed compared to the other distributions. This is understandable, since most ETG costs are heavy-tailed. When the log-skew-t distribution is available as an option, no metric selects the gamma distribution as the best model.
