
Wednesday, April 26, 2017

Assignment 5

Goals and Background

The primary purpose of this assignment is to practice calculating correlation statistics in IBM SPSS and to interpret the SPSS results alongside an Excel scatterplot of the same data.

Correlation measures the association between two variables. A Pearson correlation coefficient is always calculated for one pair of variables at a time. The result tells the analyst both the strength and the direction of the association. Correlation values always fall between -1 and 1, and the closer the value is to -1 or 1, the stronger the association. A value of 0 indicates no linear association, also called a null relationship. In a positive relationship, the value of one variable increases as the value of the other increases. In a negative relationship, the value of one variable increases as the value of the other decreases.
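To make the -1 to 1 range concrete, here is a minimal sketch of the Pearson calculation in Python. The function name `pearson_r` and the small data lists are my own illustrations and are not part of the assignment data; SPSS does this calculation for you.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of the deviations from each mean
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical values, not taken from the assignment data
hours = [1, 2, 3, 4, 5]
score_up = [2, 4, 6, 8, 10]    # rises as hours rise
score_down = [10, 8, 6, 4, 2]  # falls as hours rise

print(pearson_r(hours, score_up))
print(pearson_r(hours, score_down))
```

The first pair moves together perfectly and the second pair moves in opposite directions perfectly, so the two results sit at the two ends of the range.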

Part 1

I was provided census tract data covering several population categories for Milwaukee, WI. My directions were to explain the patterns found using the correlation matrix.

The data was provided to me in an Excel Spreadsheet. 

Below were the categories of data I was provided:
  • White  = White Pop. for the Census Tracts in Milwaukee County
  • Black = Black Pop
  • Hispanic = Hispanic Pop
  • MedInc = Median Household Income
  • Manu = Number of Manufacturing Employees
  • Retail = Number of Retail Employees
  • Finance = Number of Finance Employees

I opened the Excel file in SPSS and ran a bivariate Pearson correlation on all of the data categories. Correlation can only be calculated for two categories at a time, so SPSS creates a correlation matrix comparing every category against every other category (Fig. 1).
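The idea of a correlation matrix can be sketched as running the two-variable calculation once for every pair of columns. The code below is only an illustration of that idea, not how SPSS works internally. The column names match the data codes above, but the numbers are made up because the real spreadsheet is not reproduced here, and `correlation_matrix` is a hypothetical helper name.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient for one pair of columns."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

def correlation_matrix(columns):
    """Return {(name_a, name_b): r} for every pair of columns, in both orders."""
    matrix = {(name, name): 1.0 for name in columns}
    for a, b in combinations(columns, 2):
        r = pearson_r(columns[a], columns[b])
        matrix[(a, b)] = r
        matrix[(b, a)] = r  # the matrix is symmetric
    return matrix

# Made-up tract values for illustration only
tracts = {
    "White": [1200, 3400, 2100, 4000, 800],
    "Manu": [150, 420, 260, 510, 90],
    "MedInc": [31000, 52000, 40000, 61000, 28000],
}

matrix = correlation_matrix(tracts)
for (a, b), r in sorted(matrix.items()):
    print(f"{a:>6} vs {b:<6} r = {r:+.3f}")
```

Each cell still describes only two categories, which is why the matrix is read one pair at a time.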

(Fig. 1) Correlation Matrix created in SPSS from the Excel Spreadsheet I was provided. Note that each cell reports the correlation between only two categories.

When analyzing the correlation matrix, one needs to look at the Pearson Correlation value. The closer the value is to -1 or 1, the stronger the correlation. The highest Pearson Correlation value, .735, is between White and Manu. The value indicates a high positive correlation between the white population and the number of manufacturing employees. To better understand the correlation, I created a scatterplot in Excel to visualize the trend (Fig. 2).

(Fig. 2) Scatterplot created in Excel displaying the correlation trend between the white population and manufacturing jobs.
Analyzing the scatterplot, you can see there is a positive relationship between the white population and manufacturing jobs. The one thing correlation results cannot do is identify the cause of the trend. The trend could be merely coincidental, which is also called a spurious relationship.

Looking back at the matrix, note there are some negative values, which designate negative relationships. I created a scatterplot to display the relationship between the black population and the median household income, which had the strongest negative value (Fig. 3).

(Fig. 3) Scatterplot created in Excel displaying the correlation trend between the black population and median household income.
Looking at Fig. 3, you can see the negative correlation is not as distinct as the positive correlation in Fig. 2, which is also reflected in the Pearson value. -.417 is farther from -1 than .735 is from 1, which explains the difference between the two scatterplots.




Part 2

Introduction

I was provided data containing Democratic votes and voter turnout for all the counties in Texas from the Texas Election Commission (TEC) for the 1980 and 2016 Presidential Elections. The data codes are as follows:
  • VTP80 = Voter Turnout 1980
  • VTP16 = Voter Turnout 2016
  • PRES80D = % Democratic Vote 1980
  • PRES16D = % Democratic Vote 2016
I was also instructed to download the percent Hispanic population for all the counties in Texas from the U.S. Census Bureau's 2015 ACS data.

Scenario 

The TEC has requested an analysis of the election patterns to determine if there is clustering of voting patterns and voter turnout. The TEC wants to present my findings to the governor to see if the election patterns are different than they were 36 years ago.

Methods

I used SPSS to run a bivariate correlation using the Pearson method on the data I was provided. The resulting correlation matrix can be seen in Fig. 5.

I created two scatterplots of the correlation data to help better understand what the matrix is detailing. The first scatterplot displays the correlation between the 1980 percent Democratic vote and the 1980 voter turnout (Fig. 6). The second scatterplot displays the correlation between the 2016 percent Democratic vote and the 2016 voter turnout (Fig. 7).

The second form of analysis used a program called GeoDa to perform Spatial Autocorrelation on both years of voting data and the Hispanic population. Spatial Autocorrelation differs from regular correlation in that it measures how a single variable correlates with itself across space by comparing each area with its neighbors. In this case it compares neighboring counties to each other to determine whether they are "more alike" (positive), "unlike" (negative), or "random" (no spatial autocorrelation).

Spatial Autocorrelation in GeoDa was used to calculate Moran's I, an indicator of spatial autocorrelation. Moran's I ranges between -1 and 1. The closer the value is to 1, the more clustered the variable is. The closer the value is to -1, the more dispersed (less clustered) the variable is. Fig. 4 shows examples of spatial autocorrelation of exactly -1 and 1. Very rarely, if ever, will you see exact (perfect) spatial autocorrelation. I used a map of the Texas counties to display the spatial autocorrelation calculated by GeoDa along with a Moran's I chart and value. The maps are color coded to display the clustering of the counties. Bright blue denotes areas of low value surrounded by other areas of low value. Bright red denotes areas of high value surrounded by other areas of high value. Light blue denotes areas of low value surrounded by areas of high value. Light red or periwinkle denotes areas of high value surrounded by areas of low value. The light blue and light red areas are the outliers in the clustering analysis.
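To show where a Moran's I value comes from, here is a minimal sketch of the global Moran's I calculation on a toy example. It assumes simple binary neighbor weights (1 if two areas touch, 0 otherwise), which may not match the weights settings used in GeoDa, and the four areas and their values are hypothetical.

```python
def morans_i(values, neighbors):
    """Global Moran's I with binary contiguity weights.

    values    -- one number per area
    neighbors -- dict mapping each area index to the indices of its neighbors
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Cross-products of deviations over every neighboring pair (weight = 1)
    cross = sum(dev[i] * dev[j] for i, nbrs in neighbors.items() for j in nbrs)
    # Total weight S0 = number of directed neighbor links
    s0 = sum(len(nbrs) for nbrs in neighbors.values())
    return (n / s0) * cross / sum(d * d for d in dev)

# Four hypothetical areas in a row: 0-1-2-3
row = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

clustered = morans_i([1, 1, 5, 5], row)    # like values sit side by side
alternating = morans_i([1, 5, 1, 5], row)  # high and low values alternate

print(round(clustered, 3), round(alternating, 3))
```

The row with like values side by side gives a positive result, while the alternating row gives -1, the same two extremes of clustering and dispersion described above.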

(Fig. 4) Graphic displaying less clustered (-1) and more clustered (1) spatial autocorrelation. Image obtained from https://glenbambrick.com/tag/morans-i/.



Results
(Fig. 5) Correlation matrix for TEC data.

(Fig. 6) Scatterplot of 1980 voting data comparing the percent of democratic vote to the voter turnout.
Analyzing the scatterplot of the 1980 voting data comparing the percent Democratic vote to the voter turnout, you can see there is a negative correlation: as voter turnout decreases, the percent Democratic vote increases. The Pearson's value of -.612 from SPSS tells us this is a moderate correlation between the two variables.

(Fig. 7) Scatterplot of 2016 voting data comparing the percent of democratic vote to the voter turnout.
Analyzing the scatterplot of the 2016 voting data, you can see the same negative correlation as in the 1980 data, though it is weaker. The weaker correlation is backed up by the Pearson's value of -.530, which is still a moderate correlation but is lower in magnitude than the 1980 value.

1980 Percent Democratic Vote
(Fig. 8) Moran's I value and chart for 1980 Democratic vote.


(Fig. 9) LISA Cluster Map for 1980 Democratic vote.

Investigating the Spatial Autocorrelation of the 1980 Democratic vote, we obtained a Moran's I of .575 (Fig. 8). The value tells us there is positive Spatial Autocorrelation in the Democratic vote between counties. However, the value is only about halfway to 1, so the clustering is moderate rather than strong. The map displays clustering of high Democratic voting in the southern tip of the state and in a few counties in far eastern Texas (Fig. 9). A large cluster of counties with a low Democratic vote lies in the northern portion of the state, extending down slightly into the central portion.

2016 Percent Democratic Vote
(Fig. 10) Moran's I value and chart for 2016 Democratic vote.


(Fig. 11) LISA Cluster Map for 2016 Democratic vote.

Analyzing the 2016 Democratic vote, we see an increase in the Moran's I value to .685 (Fig. 10). The value tells us there is an increase in clustering over the 1980 data. The increased clustering can be seen in Fig. 11, where more of the counties are colored in. Note the shift of the Low-Low (blue) cluster from the west side to the east side of north central Texas. The map displays a shift in the voting tendencies of those counties. The counties with high Democratic voting are still primarily located in the southern tip of the state, just as in the 1980 data.

1980 Voter Turnout
(Fig. 12) Moran's I value and chart for 1980 voter turnout.


(Fig. 13) LISA Cluster Map for 1980 voter turnout.

The 1980 voter turnout analysis has a Moran's I value of .468, so the clustering is weaker overall but noteworthy in specific areas (Fig. 12). Even fewer counties are colored in, which is tied directly to the Moran's I value being lower than in the previous two examples (Fig. 13). You can see a large cluster of counties with low voter turnout in the southern tip of the state. Compared with the 1980 Democratic vote map, the low voter turnout relates back to the first scatterplot, which showed the negative correlation between voter turnout and Democratic vote. The same can be said for the counties in the eastern part of the state which had low voter turnout.

2016 Voter Turnout

(Fig. 14) Moran's I value and chart for 2016 voter turnout.


(Fig. 15) LISA Cluster Map for 2016 voter turnout.

The 2016 voter turnout has the lowest Moran's I of all the data thus far with a value of .287 (Fig. 14). Again there is clustering in a few areas throughout the state, but for the most part no distinct patterns are seen in the map (Fig. 15). Again we see low voter turnout in the southern tip of the state, which ties back to the high Democratic vote from the previous map.

Percent Hispanic Population

(Fig. 16) Moran's I value and chart for 2015 percent Hispanic Population.


(Fig. 17) LISA Cluster Map for 2015 percent Hispanic Population.
Analyzing the percent Hispanic population across Texas reveals the highest clustering of all the data. The Moran's I value is .779 (Fig. 16), a high value indicating strong clustering of counties with low percentages of Hispanic population and strong clustering of counties with high percentages of Hispanic population. There is a high percentage of Hispanics along the southern edge of Texas, which borders Mexico (Fig. 17). The northeast portion of Texas has lower than average Hispanic populations, with the exception of the one periwinkle county in the middle. My professor stated there is a large carpet factory in that county which employs many Hispanic workers, thus many Hispanics live there.

Discussion and Conclusion

Comparing the 1980 voter turnout LISA map to the 1980 Democratic vote LISA map, you can better see the correlation using the Spatial Autocorrelation display (Fig. 18). The southern tip of Texas shows a distinct correlation between voter turnout and the Democratic vote. The northern portion of the state has some correlation between the two, but not nearly as high as the southern tip. The same could be said about the eastern portion of the state. There was definitely some clustering in the voting patterns in Texas during the 1980 election, as displayed by the LISA maps.

(Fig. 18) LISA maps for 1980 voter turnout (Left) and 1980 Democratic vote (Right).

The 2016 comparison of voter turnout to the Democratic vote displays a similar pattern, though the Democratic vote pattern is shifted a bit to the east (Fig. 19). The shift could be attributed to the growing population closer to Dallas and Fort Worth, Texas, which are located in the northeastern portion of the main body of Texas (Fig. 20).

(Fig. 19) LISA maps for 2016 voter turnout (Left) and 2016 Democratic vote (Right).


(Fig. 20) Locations of major cities in Texas taken from Google Maps.

The analysis I provide will be useful for targeting campaigns for specific parties (Republican or Democratic). Additionally, the areas with low voter turnout could be targeted for "Get out and vote" campaigns. Additional research is required to identify the factors causing the clustering in specific locations and to explain why the negative correlation holds true in some counties but not others. Lastly, the counties which are "outliers" that go against the clustering of surrounding counties could be investigated for specific reasons.

Tuesday, April 4, 2017

Assignment 4

Goals and Background

The purpose of the following assignment is to become familiar with "z" and "t" tests as a method of hypothesis and significance testing of gathered data. Deciphering which test to use, calculation of the actual test, and interpretation of the data will be discussed in the following post utilizing various examples.

Hypothesis and significance testing is a useful way to support the interpretations of research data and conclusions. While the testing does not prove the research to be fact, it does provide additional evidence to support one's theories. The result of a hypothesis test only tells you whether there is a difference between two sets of data and nothing more. The results DO NOT tell you how they are different. To understand how they are different, you must return to the original data entered in the equation for analysis.

Terms

Steps of Hypothesis Testing
  1. State the null hypothesis
  2. State the alternative hypothesis
  3. Choose a statistical test
  4. Choose the significance level (α)
  5. Calculate test statistic
  6. Make a decision about the null & alternative hypothesis

Null Hypothesis

A null hypothesis states there is NO significant difference between the sample mean (the mean of a smaller portion of a larger data set) and the mean of the entire population against which you are comparing the data. An example would be comparing one Major League Baseball (MLB) team (sample mean) against the MLB as a whole (entire population).

Alternative Hypothesis

An alternative hypothesis states there IS a significant difference between the sample mean (derived from personal data) and the mean of the entire population against which you are comparing the data.

*The selection of the null or alternative hypothesis only tells you there is a difference and DOES NOT tell you how large the difference is.*

**Either "Reject" or "Fail to Reject" the null hypothesis, NEVER "accept" the null hypothesis. Additionally, do not "Reject" or "Fail to Reject" the alternative hypothesis.

Type I Errors

Type I errors happen when a true null hypothesis (one that should not have been rejected) is rejected, which is also known as a false positive.

Type II Errors

A Type II error is the opposite: failing to reject a false null hypothesis, which is also known as a false negative.

Significance Level (α)

The significance level is set during hypothesis testing and is the probability of a Type I error occurring when the null hypothesis is true. The level is based on the confidence level: α = 1 - confidence level. A typical significance level is .05. With a significance level of .05, a true null hypothesis will not be rejected 95% of the time. The significance level is used to determine the critical value. For a one-tailed test the whole significance level sits in one tail, but for a two-tailed test you have to divide the level in half. For example, with a confidence level of 95% you would divide the remaining 5% by 2, which gives .025 in each tail.
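As a rough illustration of how the tails change the critical value, the sketch below uses Python's standard-library `statistics.NormalDist` to look up z critical values in place of a printed chart. The function name `z_critical` is my own, and it only covers the z (normal) case, not the t chart.

```python
from statistics import NormalDist

def z_critical(confidence_level, tails):
    """Critical z value for a one- or two-tailed test at the given confidence level."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    alpha = 1 - confidence_level
    # A two-tailed test splits alpha evenly between both tails
    tail_area = alpha / tails
    return NormalDist().inv_cdf(1 - tail_area)

print(round(z_critical(0.95, tails=1), 3))
print(round(z_critical(0.95, tails=2), 3))
```

At a 95% confidence level the one-tailed critical value is about 1.645 and the two-tailed critical value is about 1.96, because the two-tailed test leaves only .025 in each tail.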

Confidence Intervals (CI) 

The Confidence Interval (CI) is the range of values which falls between the critical values of a two-tailed test, or the majority of the normal distribution on one side of the critical value of a one-tailed test (Fig. 1).

(Fig. 1) The white area under the normal distribution would be the Confidence Interval (CI) of a hypothesis test. The image shows a one tailed & a two tailed test. Image collected from https://www2.ccrb.cuhk.edu.hk/stat/User%20guidance,%20definition%20and%20terminology%20(Online%20Help).htm

Critical Value (CV)

The Critical Value (CV) is the exact numeric location on the normal distribution which divides the "reject" region (black area of Fig. 1) from the "fail to reject" region (white area of Fig. 1). If the test statistic falls short of the CV, in the white area, then the test fails to reject the null hypothesis. If the test statistic falls beyond the CV, in the black area, then the test rejects the null hypothesis. For a two-tailed test this means comparing the absolute value of the test statistic to the CV.
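The comparison against the critical value can be sketched as a small decision function. The function name `decide`, the `tail` labels, and the example numbers are all hypothetical and only illustrate the rule described above.

```python
def decide(test_statistic, critical_value, tail="two"):
    """Return the decision about the null hypothesis.

    tail -- "two", "upper" (reject for large positive values),
            or "lower" (reject for large negative values)
    """
    cv = abs(critical_value)
    if tail == "two":
        reject = abs(test_statistic) > cv
    elif tail == "upper":
        reject = test_statistic > cv
    elif tail == "lower":
        reject = test_statistic < -cv
    else:
        raise ValueError("tail must be 'two', 'upper', or 'lower'")
    return "reject" if reject else "fail to reject"

# Illustrative numbers only
print(decide(-2.3, 1.96, tail="two"))
print(decide(1.2, 1.645, tail="upper"))
```

Note the function only ever returns "reject" or "fail to reject", never "accept", in keeping with the rule stated in the terms above.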

z-test

A z-test is used to determine whether a sample mean is statistically different from a population mean (Fig. 2). One of the data sets must be a smaller sample (sample mean) drawn from the other (population mean). Z-tests are generally used for sample sizes (n) larger than 30. For the purposes of this assignment we will be using the same chart as we did in Assignment 3 to determine the CV for our z-tests.
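Written out in text form, the test statistic used in the calculations later in this post has the following shape. The symbol names are my own labels: x̄ is the sample mean, μ is the population (hypothesized) mean, σ is the standard deviation, and n is the sample size.

```latex
z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}
```

The denominator, σ/√n, is the standard error, so the statistic expresses how many standard errors the sample mean lies from the population mean.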

(Fig. 2) Formula used for a z-test. Image collected from http://isoconsultantpune.com/hypothesis-testing/.

t-test

T-tests use the same formula as a z-test, but the CV is determined using Degrees of Freedom. The Degrees of Freedom are calculated by subtracting 1 from the total number of samples (n), and the CV is the chart value corresponding to those Degrees of Freedom and your significance level (α). T-tests are used for sample sizes smaller than 30. For the purposes of this assignment we will be using a chart provided in our book to determine the CV (Fig. 3).

(Fig. 3) T-test chart from Statistical Methods for Geography by Peter A. Rogerson utilized for this assignment.


Part 1: t and z test

Question 1

I was provided a chart which contained the Interval Type, Confidence Interval (labeled as Confidence Level in this case), and the number of test samples (n) (Fig. 4). From this information I was to determine which test type was appropriate and the Significance Level (α). Using the descriptions from the above terms I was able to determine the missing values which are highlighted in gold in Fig. 4. Notice for the two-tailed tests there are 2 z or t values.

(Fig. 4) Table with the provided information highlighted in blue and the information I determined highlighted in gold.

Question 2

The following question/scenario was provided to me:
1. A Department of Agriculture and Live Stock Development organization in Kenya estimate that yields in a certain district should approach the following amounts in metric tons (averages based on data from the whole country) per hectare: groundnuts, 0.57; cassava, 3.7; and beans, 0.29. A survey of 23 farmers had the following results: (10 pts)

                    μ        σ
    Ground Nuts     0.52     0.3
    Cassava         3.3      .75
    Beans           0.34     0.12

a. Test the hypothesis for each of these products. Assume that each are 2 tailed with a Confidence Level of 95% *Use the appropriate test
b. Be sure to present the null and alternative hypotheses for each as well as conclusions
c. What are the probabilities values for each crop?
d. What are the similarities and differences in the results

I followed the steps of hypothesis testing to answer the majority of the questions.
  1. State the null hypothesis
    1. There is no difference between the average of Ground Nuts harvested (metric tons/hectare) from the 23 sampled farmers compared to the whole country average of harvested Ground Nuts (metric tons/hectare).
    2. There is no difference between the average of Cassava harvested (metric tons/hectare) from the 23 sampled farmers compared to the whole country average of harvested Cassava (metric tons/hectare).
    3. There is no difference between the average of Beans harvested (metric tons/hectare) from the 23 sampled farmers compared to the whole country average of harvested Beans (metric tons/hectare).
  2. State the alternative hypothesis
    1. There is a difference between the average of Ground Nuts harvested (metric tons/hectare) from the 23 sampled farmers compared to the whole country average of harvested Ground Nuts (metric tons/hectare).
    2. There is a difference between the average of Cassava harvested (metric tons/hectare) from the 23 sampled farmers compared to the whole country average of harvested Cassava (metric tons/hectare).
    3. There is a difference between the average of Beans harvested (metric tons/hectare) from the 23 sampled farmers compared to the whole country average of harvested Beans (metric tons/hectare).
  3. Choose a statistical test
    1. Since there are 23 farmers (n=23) I will utilize a t-test.
  4. Choose the significance level (α)
    1. Based on a Confidence Level of 95%, the significance level (α) is .05. Because the test is 2 tailed, α is split in half, leaving .025 in each tail.
    2. The critical value is then determined to be 2.074 based on .025 in each tail and the Degrees of Freedom (23-1=22) chart.
  5. Calculate test statistic
    1. Ground nuts: t=(.52-.57)/(.3/sqrt(23))=-.05/.063=-.794
    2. Cassava: t=(3.3-3.7)/(.75/sqrt(23))=-.4/.156=-2.56
    3. Beans: t=(.34-.29)/(.12/sqrt(23))=.05/.025=2
  6. Make a decision about the null & alternative hypothesis
    1. Ground nuts: Fail to reject the null hypothesis
    2. Cassava: Reject the null hypothesis
    3. Beans: Fail to reject the null hypothesis
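The three test statistics and decisions above can be reproduced with a short sketch. The means, standard deviations, sample size, and critical value all come from the question and steps above; the function and variable names are my own. The code carries full precision, so the Ground Nuts statistic may differ slightly in the third decimal from a hand calculation that rounds the standard error first.

```python
import math

def t_statistic(sample_mean, hypothesized_mean, std_dev, n):
    """One-sample t statistic: (sample mean - hypothesized mean) / standard error."""
    return (sample_mean - hypothesized_mean) / (std_dev / math.sqrt(n))

N_FARMERS = 23
CRITICAL_T = 2.074  # two-tailed, 95% confidence, 22 degrees of freedom (from the steps above)

# crop: (sample mean, country-wide estimate, standard deviation)
crops = {
    "Ground nuts": (0.52, 0.57, 0.30),
    "Cassava": (3.30, 3.70, 0.75),
    "Beans": (0.34, 0.29, 0.12),
}

results = {}
for crop, (sample_mean, country_mean, sd) in crops.items():
    t = t_statistic(sample_mean, country_mean, sd, N_FARMERS)
    # Two-tailed test: compare the absolute value to the critical value
    decision = "reject" if abs(t) > CRITICAL_T else "fail to reject"
    results[crop] = (t, decision)
    print(f"{crop}: t = {t:.3f}, {decision} the null hypothesis")
```

Only Cassava lands beyond the critical value, which matches the decisions listed in step 6.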
The probability values for each crop are as follows:
  1. Ground nuts: .78344
  2. Cassava: between .98938 & .99144 
  3. Beans: .97037
The similarity between the results is that both Ground Nuts and Beans failed to reject the null hypothesis. The failure to reject tells us there is no statistically significant difference between the sample mean and the country-wide estimate for either crop. The difference in the results is Cassava, which did reject the null hypothesis. The rejection tells me there is a statistically significant difference between the sample mean and the country average. Examining the probabilities for each crop, you can see the Bean probability differs from the Cassava probability by only about 2%. Additional samples may need to be acquired to provide a better representation of the Bean crop in Kenya.

Based on the original statement, it is not wise to use the country-wide estimates for Cassava harvests in this district. The Cassava harvests should be examined more extensively to set a local yield estimate. The country-wide yield estimates for Beans and Ground Nuts are acceptable for the district.

Question 3

The following question/scenario was provided to me:
A researcher suspects that the level of a particular stream’s pollutant is higher than the allowable limit of 4.2 mg/l.  A sample of n= 17 reveals a mean pollutant level of 6.4 mg/l, with a standard deviation of 4.4.  What are your conclusions?  (one tailed test, 95% Significance Level) Please follow the hypothesis testing steps.  What is the corresponding probability value of your calculated answer? 
Again I followed the steps of hypothesis testing to answer the majority of the question.
  1. State the null hypothesis
    1. There is no difference between the sample mean of the specific stream and the allowable limit of stream pollutants.
  2. State the alternative hypothesis
    1. The sample mean pollutant level of the specific stream is higher than the allowable limit of stream pollutants.
  3. Choose a statistical test
    1. With an n value of 17, I will be using a t-test.
  4. Choose the significance level (α)
    1. With a Confidence Level of 95% and a one-tailed test, the significance level is .05. Using the significance level and 16 Degrees of Freedom (17-1=16), the critical value is 1.746.
  5. Calculate test statistic
    1. t=(6.4-4.2)/(4.4/sqrt(17))=2.2/1.067=2.062
  6. Make a decision about the null & alternative hypothesis
    1. I reject the null hypothesis.
The corresponding probability value of my calculated answer is between .97403 & .97858. The probability tells me that between 97.4% and 97.8% of the distribution falls below my test statistic, which supports rejecting the null hypothesis. So I can say with ~97% confidence that the stream sample differs from the allowable limit of pollutants. Looking back at the original data, I can see the pollutant level of the sample stream is higher than the allowable limit. The result would make me ask why the pollutants in the stream are higher, and I would look for possible sources contributing to the increased pollution.


Part 2

I was provided a shapefile with average housing values by Block Group for Eau Claire County, Wisconsin. I was then provided a shapefile with average housing values by Block Group for just the City of Eau Claire. I was given the following question to answer from the data: "Is the average value of homes for the City of Eau Claire block groups significantly different from the block groups for Eau Claire County?"

Utilizing the housing value information attached to the shapefile I calculated the mean for both the city and the entire county. The average for the city block groups was $151,876.51 and the average for the entire county was $169,438.13. The standard deviation for the city block groups was $49,706.92. There were 53 block groups within the city of Eau Claire. With this information I was prepared to complete the steps for hypothesis testing.

Steps for hypothesis testing:
  1. State the null hypothesis
    1. There is no difference between the average housing values for the city of Eau Claire, WI compared to the entire county of Eau Claire, WI.
  2. State the alternative hypothesis
    1. There is a difference between the average housing values for the city of Eau Claire, WI compared to the entire county of Eau Claire, WI.
  3. Choose a statistical test
    1. There are 53 block groups in the city of Eau Claire, so I will be utilizing a z-test.
  4. Choose the significance level (α)
    1. I was given a Confidence Level of 95% and will be using a one-tailed test, so the significance level is .05. Using this information I determined a critical value of -1.64 (negative because, as you will see, my calculated test statistic is a negative value).
  5. Calculate test statistic
    1. z=(151876.51-169438.13)/(49706.92/sqrt(53))=-17561.62/6827.77=-2.57
  6. Make a decision about the null & alternative hypothesis
    1. I will reject the null hypothesis.
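The z calculation above can be checked with a short sketch. The means, standard deviation, block group count, and critical value are the ones reported in this post; the lower-tail probability is an extra value computed here with Python's standard-library `statistics.NormalDist` and is not a figure from the original assignment.

```python
import math
from statistics import NormalDist

city_mean = 151_876.51     # mean of the city block groups
county_mean = 169_438.13   # mean of all county block groups
city_sd = 49_706.92        # standard deviation of the city block groups
n = 53                     # number of city block groups

standard_error = city_sd / math.sqrt(n)
z = (city_mean - county_mean) / standard_error

critical_z = -1.64                  # lower-tail critical value used in the steps above
p_lower_tail = NormalDist().cdf(z)  # area to the left of z under the standard normal curve

print(f"z = {z:.2f}, lower-tail p = {p_lower_tail:.4f}")
print("reject" if z < critical_z else "fail to reject")
```

Because z falls below the critical value, the sketch reaches the same decision as step 6.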
Rejecting the null hypothesis indicates there is a statistically significant difference between the City of Eau Claire average housing values and those of Eau Claire County as a whole. The hypothesis test does not tell us much about the difference besides the city being below the county average, based on the negative test statistic. Looking back at the calculated mean values, you can see the city block group average is $17,561.62 less than the average for Eau Claire County as a whole.

I created a map to display the variation of housing values across the entire county (Fig. 5). I used a standard deviation classification to try to locate a pattern which would support the above hypothesis test. The map plainly displays that all of the block groups in the county which are below the average are located in the City of Eau Claire. The map, combined with the statistical test stating there is a difference between the city and the county as a whole, provides the evidence needed to investigate why there is a difference between the two.

(Fig. 5) Display of average home values by county block groups of Eau Claire County, Wisconsin.