AS Statistics Coursework – Correlation Coefficient Essay Example

The aim of this investigation is to assess how strongly two variables depend on each other, given that they are expected to be strongly correlated. To obtain reliable results, it is necessary to collect relevant data and apply statistical methods such as correlation coefficients and regression lines. Any anomalies that could affect these coefficients and lines should also be taken into account. This study investigates the relationship between the height and weight of boys and girls in year 11.

Weight is expected to increase with height, and the influence of gender will also be considered. The research will focus on year 11 students at Wilnecote High School, both boys and girls. To ensure a representative sample, a random selection method will be used. A total of 35 boys and 35 girls will be chosen from this year group. Each student will be assigned a number (1 to 135 for boys and 1 to 124 for girls) so that students can be selected using a calculator's random number function. First, 35 of the 135 boys will be selected; the process will then be repeated for the girls, selecting 35 of the 124 girls in the year group. The weight and height of each selected student will be recorded. Having obtained the names of the selected boys and girls, I arranged to attend one of their assemblies to collect the data.
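The numbering and selection described above can be sketched in Python. This is only an illustration of simple random sampling without replacement: the study itself used a calculator's random function, and the function name `draw_sample` and the seeds are assumptions introduced here.

```python
import random


def draw_sample(population_size, sample_size, seed=None):
    """Pick student numbers from 1..population_size without replacement.

    Hypothetical helper: it mirrors the numbering scheme in the text,
    not the calculator procedure actually used.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


# 35 of the 135 boys and 35 of the 124 girls (seeds chosen arbitrarily).
boys_sample = draw_sample(135, 35, seed=1)
girls_sample = draw_sample(124, 35, seed=2)

print(boys_sample)
print(girls_sample)
```

Sampling without replacement matters here: a calculator's random function can repeat a number, in which case the repeat would have to be discarded and another number drawn.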

After introducing myself to the students, I asked those in the sample to stay behind. I explained that the purpose of the activity was my coursework, and then asked whether anyone objected to having their weight and height measured.

To ensure measurement consistency and accuracy, I asked all students to remove their shoes and any additional clothing except for their uniforms. This approach aimed to eliminate variations in weight caused by clothing and height due to shoe heels.

For weighing, I used bathroom scales, which I assumed to be accurate; students were weighed in uniform but without shoes. To measure height, I fixed a tape measure to the wall with Blu Tack and measured to the nearest centimetre. To minimize measurement errors, I stood on a stool to read the tape and made sure that every student stood with feet together against the wall with proper posture, neither slouching nor standing on tiptoes.

This standardized process was applied to every student so that the test was fair and unbiased. The collected data was organized into tables, which were later transferred into separate Excel sheets, one for boys and one for girls.

A correlation coefficient always lies between -1 and 1; however, it is uncommon for real data to give exactly -1, 0, or 1. The scatter diagrams could indicate a strong negative or positive correlation, or show no correlation at all. Alternatively, there might be a non-linear pattern where the points follow a quadratic curve.

The correlation coefficient "r" is used to measure the strength of the correlation between height (x) and weight (y) in year 11 boys and girls. Strong positive correlations typically range from 0.7 to 0.9, while strong negative correlations fall within the -0.7 to -0.9 range. I will calculate "r" for both groups and analyze the results to explain the strength of each correlation and any differences between them. To find the linear equation representing the relationship between x and y, I will use linear regression, based on the sums Sxx, Syy, and Sxy. Regression is a method for determining the function that best describes the points on a scatter diagram.
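As a minimal sketch of how r and the regression line follow from Sxx, Syy, and Sxy, the Python below uses the standard definitions r = Sxy / sqrt(Sxx × Syy), b = Sxy / Sxx, and a = ȳ − b x̄. The function names and the four data points are hypothetical, chosen only to make the example runnable; they are not the coursework data.

```python
import math


def summary_sums(xs, ys):
    """Return the means of x and y together with Sxx, Syy and Sxy."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return mean_x, mean_y, sxx, syy, sxy


def correlation(xs, ys):
    """Product-moment correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    _, _, sxx, syy, sxy = summary_sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)


def regression_line(xs, ys):
    """Least squares line y = a + b*x, returned as the pair (a, b)."""
    mean_x, mean_y, sxx, _, sxy = summary_sums(xs, ys)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical heights (m) and weights (kg), for illustration only.
heights = [1.50, 1.60, 1.70, 1.80]
weights = [45.0, 52.0, 61.0, 66.0]

r = correlation(heights, weights)
a, b = regression_line(heights, weights)
print(f"r = {r:.3f}, line: y = {a:.2f} + {b:.2f}x")
```

Computing the three sums once and reusing them keeps the correlation and the regression line consistent with each other, which is also how the calculation is usually laid out by hand.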

The regression line passes through the mean point (x̄, ȳ). In linear regression, the function is a straight line, f(x) = a + bx, where a = ȳ − b x̄. The least squares regression line lies as close as possible to all the points on the graph, in the sense that it has the smallest squared error, which is the sum of the squared deviations (Σd²). Squaring makes every deviation positive, so deviations above and below the line cannot cancel each other out. Here each d is the residual of a point from the regression line, with the data points written as (x1, y1), (x2, y2), and so on, and it can be shown that the line above is the one for which Σd² is smallest.
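The claim that Σd² is minimized by this line can be checked with a short derivation. The following is the standard textbook argument, written out as a sketch; the symbol S(a, b) for the sum of squared residuals is introduced here and does not appear in the coursework.

```latex
\begin{align*}
S(a, b) &= \sum_{i=1}^{n} d_i^{2}
         = \sum_{i=1}^{n} \bigl(y_i - (a + b x_i)\bigr)^{2} \\
\frac{\partial S}{\partial a} &= -2 \sum_{i=1}^{n} \bigl(y_i - a - b x_i\bigr) = 0
  \quad\Longrightarrow\quad a = \bar{y} - b\,\bar{x} \\
\frac{\partial S}{\partial b} &= -2 \sum_{i=1}^{n} x_i \bigl(y_i - a - b x_i\bigr) = 0
  \quad\Longrightarrow\quad
  b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}
    = \frac{S_{xy}}{S_{xx}}
\end{align*}
```

The first condition is what forces the line through the mean point: substituting x = x̄ into y = a + bx gives y = ȳ.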

The residual (d) of each point is given by di = yi − (a + b xi). My hypothesis is that both boys and girls will show a moderate to strong positive correlation, but that the girls' correlation will be weaker, approximately 0.15 less than the boys'. Furthermore, I predict that boys will show smaller residuals and a steeper regression line than girls, because boys develop muscle while girls accumulate fat during puberty, and muscle weighs more. I assume that shorter students are in the early stages of puberty and have not developed as much as taller students. Finally, I propose that weight is highly dependent on height.

For the year 11 boys in my sample, I found a relatively strong positive correlation (r = 0.63) between height and weight, suggesting that weight depends on height to a considerable extent.

The correlation between height and weight for year 11 girls (r = 0.34) was considerably lower than that of the boys. The difference in the range of heights may explain this disparity: the boys' heights run from 1.50 m to 2.00 m (a range of 0.50 m), while the girls' run from 1.40 m to 1.73 m (a range of 0.33 m), so the boys' range is 17 cm wider. With less spread in the girls' heights, extreme cases of overweight or underweight individuals have a stronger impact on the correlation. Societal factors such as dieting may also influence the result. Girls at this age are often exposed to images of thin celebrities and glamour models, which could affect their body image and their choices about weight. Conversely, boys of this age are typically less likely to take up activities intended to alter weight, such as jogging.

I expected the correlation to be stronger for the boys than for the girls, but not by this much. The size of the difference could be attributed to the random selection process.

The least squares regression line for the girls, in the form y = a + bx, is y = -12.23 + 39x. The gradient (slope) indicates that two year 11 girls whose heights differ by 1 metre should theoretically differ in weight by 39 kg. However, this value seems slightly high when considering that the average weight of a year 11 girl is 51 kg.

Upon closer examination of the regression line, it becomes clear that it is only reliable within the data set from which it was derived. For instance, according to the line equation, a girl with a height of 0 m should weigh -12.23 kg, which is clearly impossible. Predicting outside the range of the data is called extrapolation and can give highly inaccurate results. Predictions should instead be made by interpolation, using x-values within the range of the data.
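The distinction between interpolation and extrapolation can be made explicit in code. The sketch below uses the girls' line y = -12.23 + 39x and the girls' height range of 1.40 m to 1.73 m quoted in this essay; the function name `predict_weight` and the choice to raise an error outside the range are assumptions made for illustration.

```python
# Girls' regression line and observed height range, as quoted in the essay.
GIRLS_A = -12.23
GIRLS_B = 39.0
GIRLS_HEIGHT_RANGE = (1.40, 1.73)  # metres


def predict_weight(height_m, a=GIRLS_A, b=GIRLS_B, height_range=GIRLS_HEIGHT_RANGE):
    """Predict weight (kg) from height (m), refusing to extrapolate."""
    low, high = height_range
    if not low <= height_m <= high:
        raise ValueError(
            f"height {height_m} m is outside the observed range "
            f"{low} m to {high} m; prediction would be an extrapolation"
        )
    return a + b * height_m


# Interpolation: 1.60 m lies inside the observed range.
print(round(predict_weight(1.60), 2))  # 50.17

# Extrapolation: a height of 0 m is rejected rather than returning -12.23 kg.
try:
    predict_weight(0.0)
except ValueError as error:
    print(error)
```

Raising an error is one possible design; returning the value with a warning would also work, but an error makes it harder to use an extrapolated figure by accident.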

To evaluate the accuracy of my line of best fit, I randomly selected three girls and substituted each girl's height for x in the line equation to obtain a theoretical value for y (weight). I then compared this theoretical value with her actual weight.

One method for identifying anomalous and influential points is to look at the scatter graph and pick out anomalies by eye. Another approach is to use the least squares regression line and look for points that lie unusually far from it.

For the boys, the least squares regression line is y = -56.34 + 66.61x. The gradient (b) suggests that two year 11 boys whose heights differ by 1 metre should theoretically differ in weight by 66.61 kg. This value appears unreasonably high: I am 1.78 m tall, have not grown over the past year, and have gained only 3 kg, going from 67 kg to 70 kg. According to this regression line, a year 11 boy only 78 cm tall would have a predicted weight of about -4.4 kg, which is impossible. The regression line is therefore only applicable within the range of the data (between the lowest and highest x-values). Extrapolation, which means predicting outside this range, can lead to incorrect results. For instance, on the boys' graph, when height = 0 m, the least squares regression line predicts a weight of -56.34 kg, which is clearly wrong because a weight cannot be negative. It is therefore advisable only to interpolate (use x-values within the range) when using the least squares regression line.

To assess the accuracy of my line of best fit, I selected three boys at random and substituted their heights into the line equation to obtain a theoretical weight (y), which I then compared with their actual weight. I did not consider the comparison with the girls entirely fair, because the boys showed a stronger correlation than the girls and should therefore, in theory, show a lower percentage error from the least squares line.
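The substitution check described above can be sketched as follows. The essay does not state how percentage error was defined, so the formula used here (absolute difference divided by actual weight) is an assumption, as are the function names. The figures of 1.78 m and 70 kg are the author's own height and weight quoted earlier, used purely as an illustration and not as one of the three sampled boys.

```python
# Boys' regression line as quoted in the essay: y = -56.34 + 66.61x.
BOYS_A = -56.34
BOYS_B = 66.61


def predicted_weight(height_m, a=BOYS_A, b=BOYS_B):
    """Theoretical weight (kg) from substituting a height (m) into the line."""
    return a + b * height_m


def percentage_error(actual_kg, predicted_kg):
    """Assumed definition: absolute difference as a percentage of actual weight."""
    return abs(actual_kg - predicted_kg) / actual_kg * 100


height_m, actual_kg = 1.78, 70.0
predicted_kg = predicted_weight(height_m)
error_pct = percentage_error(actual_kg, predicted_kg)

print(f"predicted {predicted_kg:.2f} kg, actual {actual_kg:.0f} kg, "
      f"error {error_pct:.1f}%")
```

For these figures the line predicts 62.23 kg against an actual 70 kg, an error of about 11.1% under the assumed definition.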

The variation in the boys' data compared to the girls' data is likely due to random selection. However, I have demonstrated that both the boys' and girls' least squares regression lines accurately represent their respective datasets. Both boys and girls display positive correlations of at least moderate strength. Additionally, both groups show steep gradients for their regression lines and low residuals, indicating a clear dependency between height and weight. In other words, taller year 11 students (regardless of gender) tend to weigh more within the given boundaries. When reviewing my hypotheses:
1. I hypothesized that both boys and girls would show moderate to strong positive correlations. The boys showed a relatively strong correlation (r = 0.63), whereas the girls' correlation was only weak to moderate (r = 0.34); however, both were positive.
2. I predicted that the girls would show a weaker correlation than the boys, with a difference of approximately 0.15. The girls did have a weaker correlation, but the difference was 0.29, larger than the 0.15 predicted. This could be due to the sample not representing the entire year group.
3. I expected the boys to have a smaller overall range of residuals. Comparing the graphs, this appears accurate. However, when the residuals were calculated from the equations, the boys' percentage errors covered a wide range, from 0.79% to 15.67%.
4. I predicted that the boys' regression line would be steeper than the girls', because boys tend to build muscle while girls gain fat during puberty, and muscle is heavier. I also assumed that shorter students were in early puberty and had not developed as much as taller students. The results confirmed the steeper slope for the boys, but this does not prove that my reasoning was correct. Other factors that I may have overlooked, such as participation in sports, could have influenced the results.
5. I proposed that height and weight are closely related and that weight depends on height. The positive correlations and regression gradients for both groups support this within the range of the data.

To improve accuracy, it would have been beneficial to consider different year groups and to collect data from a larger sample, since larger samples give more precise results. Gathering secondary data on height and weight from national sources would also have been useful, rather than relying solely on local data.

To make the results more reliable, year 11 students from different schools could be measured in order to draw on a more diverse range of sources. More precise measuring instruments, capable of measuring to three or even four decimal places, could be used. A stratified sample could also be used to account for the imbalance between male and female students in year 11. Taking multiple smaller samples and combining their averages would give more accurate readings and reduce the effect of anomalies. The background of the subjects could also be explored, to analyze how environmental factors such as wealth influence the trends in the data. Any of these approaches would improve accuracy, but increasing the sample size appears to be the most feasible and would have a significant impact on reliability.

Using the entire year group is one option; another is to use a stratified sample, which would address the potential error caused by the unequal numbers of boys and girls in the year group.
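A stratified sample of the kind suggested here would allocate places in proportion to the size of each group. The sketch below applies proportional allocation to the year group sizes given earlier (135 boys and 124 girls) and keeps the overall sample size of 70; the function name `proportional_allocation` is hypothetical, and simple rounding is an assumption that happens to sum correctly for these numbers but may not in general.

```python
def proportional_allocation(strata_sizes, total_sample):
    """Allocate a sample across strata in proportion to their sizes.

    Uses plain rounding, so the allocations are not guaranteed to add up
    to total_sample for every input.
    """
    population = sum(strata_sizes.values())
    return {
        name: round(total_sample * size / population)
        for name, size in strata_sizes.items()
    }


year_group = {"boys": 135, "girls": 124}
allocation = proportional_allocation(year_group, total_sample=70)
print(allocation)  # {'boys': 36, 'girls': 34}
```

Compared with the 35 and 35 actually used, proportional allocation would move one place from the girls to the boys, so the effect of the imbalance in this year group is small.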
