Tuesday, April 4, 2017

Assignment 4

Goals and Background

The purpose of this assignment is to become familiar with "z" and "t" tests as methods of hypothesis and significance testing of gathered data. Deciding which test to use, calculating the test statistic, and interpreting the results are discussed in this post using various examples.

Hypothesis and significance testing is a useful way to support the interpretation of research data and conclusions. While the testing does not prove the research to be fact, it does provide additional evidence to support one's theories. The result of a hypothesis test only tells you whether there is a statistically significant difference between two sets of data and nothing more. The results DO NOT tell you how they are different. To understand how they are different you must return to the original data entered into the equation.

Terms

Steps of Hypothesis Testing
  1. State the null hypothesis
  2. State the alternative hypothesis
  3. Choose a statistical test
  4. Choose the significance level (α)
  5. Calculate test statistic
  6. Make a decision about the null & alternative hypothesis

Null Hypothesis

A null hypothesis states there is NO significant difference between the sample mean (the mean of a smaller portion of a larger data set) and the mean of the entire population against which you are comparing the data. An example would be comparing one Major League Baseball (MLB) team (sample mean) against the MLB as a whole (entire population).

Alternative Hypothesis

An alternative hypothesis states there IS a significant difference between the sample mean (derived from personal data) and the mean of the entire population against which you are comparing the data.

*The selection of the null or alternative hypothesis only tells you whether there is a difference and DOES NOT tell you how large the difference is.*

**Either "Reject" or "Fail to Reject" the null hypothesis; NEVER "accept" the null hypothesis. Additionally, do not "Reject" or "Fail to Reject" the alternative hypothesis.**

Type I Errors

A Type I error occurs when a null hypothesis that is actually true is rejected. It is also known as a false positive.

Type II Errors

A Type II error is the opposite: a null hypothesis that is actually false is not rejected. It is also known as a false negative.

Significance Level (α)

The significance level is set during hypothesis testing and is the probability of a Type I error occurring when the null hypothesis is true. The level follows from the confidence level: α = 1 - confidence level. A typical significance level is .05. If the significance level is set to .05, a true null hypothesis will be incorrectly rejected only 5% of the time. The significance level is used to determine the critical value. For a one-tailed test you can leave the significance level alone; for a two-tailed test you have to divide the level in half so it is split between the two tails. For example, given a confidence level of 95% you would divide the remaining 5% by 2, which gives an area of .025 in each tail.
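
As a rough illustration of the halving rule described above, the sketch below turns a confidence level into the area placed in each rejection tail. The function name `tail_area` is my own and is not part of the assignment.

```python
def tail_area(confidence_level: float, tails: int) -> float:
    """Return the rejection-region area placed in each tail.

    confidence_level: e.g. 0.95 for a 95% confidence level
    tails: 1 for a one-tailed test, 2 for a two-tailed test
    """
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    alpha = 1.0 - confidence_level  # overall significance level
    return alpha / tails            # split across both tails if two-tailed


print(round(tail_area(0.95, 1), 3))  # 0.05
print(round(tail_area(0.95, 2), 3))  # 0.025
```

The per-tail area is the value looked up in a z or t chart to find the critical value.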

Confidence Intervals (CI) 

The Confidence Interval (CI) is the range of values which falls between the critical values of a two-tailed test, or the majority of the normal distribution on one side of the critical value in a one-tailed test (Fig. 1).

(Fig. 1) The white area under the normal distribution would be the Confidence Interval (CI) of a hypothesis test. The image shows a one tailed & a two tailed test. Image collected from https://www2.ccrb.cuhk.edu.hk/stat/User%20guidance,%20definition%20and%20terminology%20(Online%20Help).htm

Critical Value (CV)

The Critical Value (CV) is the exact numeric location on the normal distribution which divides the "reject" region (black area of Fig. 1) from the "fail to reject" region (white area of Fig. 1). For an upper-tail test, if the test statistic is below the CV the test fails to reject the null hypothesis, and if it is above the CV the test rejects the null hypothesis. For a lower-tail test the comparison is reversed, and for a two-tailed test the null hypothesis is rejected when the test statistic falls beyond either CV.
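
A minimal sketch of the decision rule, assuming a z-test so that the critical value can be computed from the standard normal distribution instead of read from a chart. The helper names `z_critical` and `reject_null` are hypothetical, and the computed values carry more decimals than a printed chart.

```python
from statistics import NormalDist


def z_critical(alpha: float, tails: int) -> float:
    """Positive critical z value for an overall significance level alpha."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    # A two-tailed test places alpha / 2 in each tail.
    return NormalDist().inv_cdf(1.0 - alpha / tails)


def reject_null(test_stat: float, critical: float, tail: str) -> bool:
    """Apply the decision rule; tail is 'two', 'upper', or 'lower'."""
    if tail == "two":
        return abs(test_stat) > critical
    if tail == "upper":
        return test_stat > critical
    if tail == "lower":
        return test_stat < -critical
    raise ValueError("tail must be 'two', 'upper', or 'lower'")


print(round(z_critical(0.05, 1), 3))  # 1.645
print(round(z_critical(0.05, 2), 3))  # 1.96
```

`reject_null` takes the positive critical value in every case and handles the sign itself, so a lower-tail test compares the statistic against the negated value.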

z-test

A z-test is used to determine whether the mean of a sample is statistically different from the mean of the population it is compared against (Fig. 2). One of the data sets must be a smaller set of samples (sample mean) compared to the other (population mean). Z-tests are generally used for sample sizes (n) larger than 30. For the purposes of this assignment we will be using the same chart as we did in Assignment 3 to determine the CV for our z-tests.
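
Since the formula in Fig. 2 is an image, here it is written out as it is applied in the calculations later in this post. The symbols are my own labels: x̄ is the sample mean, μ is the population (hypothesized) mean, σ is the standard deviation, and n is the number of samples.

```latex
z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}
```

The denominator, σ/√n, is the standard error of the mean. In the worked problems below, the standard deviation used is the one reported for the sample.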

(Fig. 2) Formula used for a z-test. Image collected from http://isoconsultantpune.com/hypothesis-testing/.

t-test

T-tests utilize the same formula as a z-test, but when determining the CV one must utilize Degrees of Freedom. The Degrees of Freedom are calculated by subtracting 1 from the total number of test samples (n); the CV is then found by locating the value in the chart which corresponds to the Degrees of Freedom and your significance level (α). T-tests are used for sample sizes smaller than 30. For the purposes of this assignment we will be using a chart provided in our book to determine the CV (Fig. 3).
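
The choice between the two tests and the Degrees of Freedom calculation can be sketched as below. The function names are hypothetical. Treating a sample size of exactly 30 as a z-test is my assumption, since the rule of thumb above only covers sizes larger or smaller than 30. The small lookup table contains only the two chart values quoted later in this post, not the full chart.

```python
def choose_test(n: int, threshold: int = 30) -> str:
    """Return 'z' for large samples and 't' for small ones.

    Assumption: n equal to the threshold is treated as a z-test.
    """
    return "z" if n >= threshold else "t"


def degrees_of_freedom(n: int) -> int:
    """Degrees of Freedom for a one-sample t-test."""
    return n - 1


# (degrees of freedom, area in each tail) -> critical t value.
# Only the two chart values used in this post are included.
T_CHART = {
    (22, 0.025): 2.074,
    (16, 0.05): 1.746,
}

n = 23
print(choose_test(n))                            # t
print(T_CHART[(degrees_of_freedom(n), 0.025)])   # 2.074
```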

(Fig. 3) T-test chart from Statistical Methods for Geography by Peter A. Rogerson utilized for this assignment.


Part I: t and z test

Question 1

I was provided a chart which contained the Interval Type, Confidence Interval (labeled as Confidence Level in this case), and the number of test samples (n) (Fig. 4). From this information I was to determine which test type was appropriate and the Significance Level (α). Using the descriptions from the above terms I was able to determine the missing values which are highlighted in gold in Fig. 4. Notice for the two-tailed tests there are 2 z or t values.

(Fig. 4) Table with the provided information highlighted in blue and the information I determined highlighted in gold.

Question 2

The following question/scenario was provided to me:
1. A Department of Agriculture and Live Stock Development organization in Kenya estimate that yields in a certain district should approach the following amounts in metric tons (averages based on data from the whole country) per hectare: groundnuts, 0.57; cassava, 3.7; and beans, 0.29. A survey of 23 farmers had the following results: (10 pts)

                 μ        σ
Ground Nuts      0.52     0.3
Cassava          3.3      .75
Beans            0.34     0.12

a. Test the hypothesis for each of these products. Assume that each are 2 tailed with a Confidence Level of 95% *Use the appropriate test
b. Be sure to present the null and alternative hypotheses for each as well as conclusions
c. What are the probabilities values for each crop?
d. What are the similarities and differences in the results

I followed the steps of hypothesis testing to answer the majority of the questions.
  1. State the null hypothesis
    1. There is no difference between the average of Ground Nuts harvested (metric tons/hectare) from the 23 sampled farmers compared to the whole country average of harvested Ground Nuts (metric tons/hectare).
    2. There is no difference between the average of Cassava harvested (metric tons/hectare) from the 23 sampled farmers compared to the whole country average of harvested Cassava (metric tons/hectare).
    3. There is no difference between the average of Beans harvested (metric tons/hectare) from the 23 sampled farmers compared to the whole country average of harvested Beans (metric tons/hectare).
  2. State the alternative hypothesis
    1. There is a difference between the average of Ground Nuts harvested (metric tons/hectare) from the 23 sampled farmers compared to the whole country average of harvested Ground Nuts (metric tons/hectare).
    2. There is a difference between the average of Cassava harvested (metric tons/hectare) from the 23 sampled farmers compared to the whole country average of harvested Cassava (metric tons/hectare).
    3. There is a difference between the average of Beans harvested (metric tons/hectare) from the 23 sampled farmers compared to the whole country average of harvested Beans (metric tons/hectare).
  3. Choose a statistical test
    1. Since there are 23 farmers (n=23) I will utilize a t-test.
  4. Choose the significance level (α)
    1. Based on a Confidence Level of 95% the significance level (α) is equal to .05. Because I need to perform a 2 tailed test, it is split into .025 in each tail.
    2. The critical value is then determined to be ±2.074 based on .025 in each tail and the Degrees of Freedom (23-1=22) chart.
  5. Calculate test statistic
    1. Ground nuts: t=(.52-.57)/(.3/sqrt(23))=-.05/.063=-.794
    2. Cassava: t=(3.3-3.7)/(.75/sqrt(23))=-.4/.156=-2.56
    3. Beans: t=(.34-.29)/(.12/sqrt(23))=.05/.025=2
  6. Make a decision about the null & alternative hypothesis
    1. Ground nuts: Failed to reject the null hypothesis
    2. Cassava: Reject the null hypothesis
    3. Beans: Failed to reject the null hypothesis
The probability values for each crop are as follows:
  1. Ground nuts: .78344
  2. Cassava: between .98938 & .99144 
  3. Beans: .97037
The results are similar in that both Ground Nuts and Beans failed to reject the null hypothesis. The failure to reject tells us there is no statistically significant difference between the sample mean and the hypothesized mean for either crop. The result differs for Cassava, which did reject the null hypothesis. The rejection of the null hypothesis tells me there is a significant difference between the sample mean and the country average. Examining the probabilities for each crop, you can see the Bean probability is only about 2% lower than the Cassava probability, and the Bean test statistic (2) falls just short of the critical value (2.074). Additional samples may need to be acquired to provide a better representation of the Bean crop in the district.
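
The test statistics and decisions for the three crops can be reproduced with a short script. This is a sketch using the survey figures from the question and the critical value of 2.074 read from the chart; the variable and function names are my own.

```python
from math import sqrt

# Country-wide expected yields (metric tons/hectare) from the question.
NATIONAL_MEAN = {"Ground nuts": 0.57, "Cassava": 3.7, "Beans": 0.29}
# Survey results: crop -> (sample mean, standard deviation).
SURVEY = {"Ground nuts": (0.52, 0.3), "Cassava": (3.3, 0.75), "Beans": (0.34, 0.12)}
N = 23              # farmers surveyed
T_CRITICAL = 2.074  # two-tailed, 95% confidence level, 22 Degrees of Freedom


def t_statistic(sample_mean: float, hypothesized_mean: float, sd: float, n: int) -> float:
    """One-sample t statistic: (sample mean - hypothesized mean) / standard error."""
    return (sample_mean - hypothesized_mean) / (sd / sqrt(n))


results = {}
for crop, (mean, sd) in SURVEY.items():
    t = t_statistic(mean, NATIONAL_MEAN[crop], sd, N)
    reject = abs(t) > T_CRITICAL  # two-tailed decision rule
    results[crop] = (t, reject)
    decision = "Reject" if reject else "Fail to reject"
    print(f"{crop}: t = {t:.3f} -> {decision} the null hypothesis")
```

Because the script does not round the standard error before dividing, the Ground nuts statistic comes out near -.80 instead of the hand-calculated -.794. The decisions are unchanged.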

Based on these results it is not wise to utilize the country wide estimates for Cassava harvests in this district. The Cassava harvests should be examined more extensively so the district can set its own local yield estimates. The country wide yield estimates for Beans and Ground Nuts are acceptable for the district.

Question 3

The following question/scenario was provided to me:
A researcher suspects that the level of a particular stream’s pollutant is higher than the allowable limit of 4.2 mg/l.  A sample of n= 17 reveals a mean pollutant level of 6.4 mg/l, with a standard deviation of 4.4.  What are your conclusions?  (one tailed test, 95% Significance Level) Please follow the hypothesis testing steps.  What is the corresponding probability value of your calculated answer? 
Again I followed the steps of hypothesis testing to answer the majority of the question.
  1. State the null hypothesis
    1. The mean pollutant level of the specific stream is not higher than the allowable limit for stream pollutants (4.2 mg/l).
  2. State the alternative hypothesis
    1. The mean pollutant level of the specific stream is higher than the allowable limit for stream pollutants (4.2 mg/l).
  3. Choose a statistical test
    1. With an n value of 17 I will be utilizing a t-test.
  4. Choose the significance level (α)
    1. With a Confidence level of 95% and performing a one-tailed test the significance level is equal to .05. Utilizing the significance level and the Degrees of Freedom (17-1=16) the critical value is equal to 1.746.
  5. Calculate test statistic
    1. t=(6.4-4.2)/(4.4/sqrt(17))=2.2/1.067=2.062
  6. Make a decision about the null & alternative hypothesis
    1. I reject the null hypothesis.
The corresponding probability value of my calculated answer is between .97403 & .97858, which leaves between roughly .021 and .026 in the upper tail. Because that tail probability is smaller than the significance level of .05, it agrees with rejecting the null hypothesis: if the stream's true mean were at the allowable limit, a sample mean this high would be expected only about 2-3% of the time. At the 95% confidence level I can conclude the stream's pollutant level is higher than the allowable limit. Looking back at the original data I can see the pollutant level of the sample stream (6.4 mg/l) is higher than the allowable limit (4.2 mg/l). The result would make me ask why the pollutants of the stream are higher. I would look for possible sources contributing to the increased pollution.
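
The stream calculation can be checked with a small one-tailed helper. This is a sketch only; the function name `upper_tail_t_test` is hypothetical and the critical value of 1.746 is the one read from the chart for 16 Degrees of Freedom.

```python
from math import sqrt


def upper_tail_t_test(sample_mean: float, limit: float, sd: float, n: int,
                      critical: float) -> tuple[float, bool]:
    """Return (t statistic, reject?) for a test of 'mean is higher than limit'."""
    t = (sample_mean - limit) / (sd / sqrt(n))
    return t, t > critical


t, reject = upper_tail_t_test(sample_mean=6.4, limit=4.2, sd=4.4, n=17, critical=1.746)
print(round(t, 3))  # 2.062
print(reject)       # True
```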


Part II

I was provided a shapefile with average housing values by Block Group for Eau Claire County Wisconsin. I was then provided a shapefile with average housing values by Block Group for just the City of Eau Claire. I was given the following question to answer from the data:  "Is the average value of homes for the City of Eau Claire block groups significantly different from the block groups for Eau Claire County?"

Utilizing the housing value information attached to the shapefile I calculated the mean for both the city and the entire county. The average for the city block groups was $151,876.51 and the average for the entire county was $169,438.13. The standard deviation for the city block groups was $49,706.92. There were 53 block groups within the city of Eau Claire. With this information I was prepared to complete the steps for hypothesis testing.

Steps for hypothesis testing:
  1. State the null hypothesis
    1. There is no difference between the average housing values for the city of Eau Claire, WI compared to the entire county of Eau Claire, WI.
  2. State the alternative hypothesis
    1. There is a difference between the average housing values for the city of Eau Claire, WI compared to the entire county of Eau Claire, WI.
  3. Choose a statistical test
    1. There are 53 block groups in the city of Eau Claire, so I will be utilizing a z-test.
  4. Choose the significance level (α)
    1. I was given a Confidence level of 95% and will be using a one-tailed test, so the significance level is .05. Using this information I determined a critical value of -1.64 (negative because, as you will see, my calculated test statistic is a negative value).
  5. Calculate test statistic
    1. z=(151876.51-169438.13)/(49706.92/sqrt(53))=-17561.62/6827.77=-2.57
  6. Make a decision about the null & alternative hypothesis
    1. I will reject the null hypothesis.
Rejecting the null hypothesis indicates there is a statistically significant difference between the City of Eau Claire average housing values and those of the entire Eau Claire County. The hypothesis test does not tell us much about the difference, other than that the city average is below the county average based on the negative test statistic. Looking back at the calculated mean values you can see the city block group average is $17,561.62 less than the average for Eau Claire County as a whole.
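
The z-test above can be reproduced with Python's standard library, which also gives the lower-tail probability without a chart. This is a sketch using the means, standard deviation, and block group count reported earlier; the variable names are my own. The computed critical value (about -1.645) carries one more decimal than the -1.64 taken from the chart.

```python
from math import sqrt
from statistics import NormalDist

city_mean = 151_876.51    # average value, City of Eau Claire block groups
county_mean = 169_438.13  # average value, all Eau Claire County block groups
city_sd = 49_706.92       # standard deviation of the city block groups
n = 53                    # number of city block groups
alpha = 0.05              # one-tailed, 95% confidence level

std_error = city_sd / sqrt(n)
z = (city_mean - county_mean) / std_error

critical = NormalDist().inv_cdf(alpha)  # lower-tail critical value
p_lower = NormalDist().cdf(z)           # probability of a z this low or lower
reject = z < critical

print(round(z, 2))  # -2.57
print(reject)       # True
```

The lower-tail probability `p_lower` comes out at roughly half a percent, well under the .05 significance level, which is consistent with rejecting the null hypothesis.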

I created a map to display the variation of housing values across the entire county (Fig. 5). I used a standard deviation classification to try to locate a pattern which would support the above hypothesis test. The map plainly displays that all of the block groups in the county which are below the average are located in the City of Eau Claire. The map, combined with the statistical test stating there is a difference between the city and the county as a whole, provides the evidence needed to investigate why there is a difference between the two.

(Fig. 5) Display of average home values by county block groups of Eau Claire County, Wisconsin.
