Goals and Background
The primary purpose of this assignment is to practice calculating correlation statistics using IBM SPSS software and to interpret the SPSS results alongside an Excel scatterplot of the same data.
Correlation is the measurement of the association between two variables. A single Pearson correlation value describes only one pair of variables at a time. Correlation results inform the analyst about both the strength and the direction of the association. Correlation results are always between -1 and 1. The closer the value is to -1 or 1, the stronger the association. If the correlation result is 0, there is no linear association, which is also called a null relationship. A positive relationship is when the value of one variable increases as the value of the other variable increases. A negative relationship is when the value of one variable increases as the value of the other variable decreases.
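To make the definition concrete, here is a minimal Python sketch of the Pearson calculation. It is not how SPSS is implemented internally; the function name `pearson_r` and the sample values are made up for illustration and are not the assignment data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of each pair's deviations from the two means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sum of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

# Made-up illustrative values (hypothetical, not the assignment data).
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]   # goes up whenever x goes up
falling = [10, 8, 6, 4, 2]  # goes down whenever x goes up

print(pearson_r(x, rising))   # perfect positive relationship
print(pearson_r(x, falling))  # perfect negative relationship
```

Real data almost never falls exactly on a line, so values from an actual dataset land somewhere between these two extremes.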
Part 1
I was provided census tract data covering several population categories for Milwaukee, WI. My directions were to explain the patterns found using the correlation matrix.
The data was provided to me in an Excel spreadsheet.
Below are the categories of data I was provided:
- White = White Pop. for the Census Tracts in Milwaukee County
- Black = Black Pop
- Hispanic = Hispanic Pop
- MedInc = Median Household Income
- Manu = Number of Manufacturing Employees
- Retail = Number of Retail Employees
- Finance = Number of Finance Employees
I opened the Excel file in SPSS and performed a Bi-variate correlation using Pearson Correlation on all of the categories of the data. Each correlation is calculated for two categories at a time, so SPSS creates a correlation matrix comparing every category against every other category (Fig. 1).
(Fig. 1) Correlation Matrix created in SPSS from the Excel Spreadsheet I was provided. Note that each comparison only calculates the correlation between two categories.
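The idea of a correlation matrix can be sketched in a few lines of Python: compute the Pearson value for every pair of columns and store the results in a grid. This is only an illustration of the concept, not the SPSS procedure itself. The column values below are hypothetical stand-ins for the census tract data, and `pearson_r` and `correlation_matrix` are names I chose for the example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

def correlation_matrix(columns):
    """Pearson r for every pair of named columns, as a nested dict."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names}
            for a in names}

# Hypothetical values for six tracts (NOT the real Milwaukee data).
data = {
    "White":  [1200, 3400, 2500, 400, 3100, 900],
    "Manu":   [150, 420, 300, 90, 380, 160],
    "MedInc": [31000, 62000, 48000, 24000, 57000, 29000],
}

matrix = correlation_matrix(data)
for row_name, row in matrix.items():
    # Each cell is the correlation between exactly two categories.
    print(row_name, {col: round(r, 3) for col, r in row.items()})
```

The diagonal is always 1 because each category is compared with itself, and the matrix is mirrored across the diagonal because the correlation of A with B equals the correlation of B with A.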
When analyzing the correlation matrix, one needs to look at the Pearson Correlation value. The closer the value is to -1 or 1, the stronger the correlation. The highest Pearson Correlation value is .735, between White and Manu. This value indicates a high positive correlation between the white population and the number of manufacturing employees. To better understand the correlation, I created a scatterplot in Excel to visualize the trend (Fig. 2).
(Fig. 2) Scatterplot created in Excel displaying the correlation trend between the white population and manufacturing jobs.
Analyzing the scatterplot, you can see there is a positive relationship between the white population and manufacturing jobs. The one thing correlation results cannot do is identify the cause of the trend. The trend could be merely coincidental, which is called a spurious relationship.
Looking back at the matrix, note that there are some negative values, which designate negative relationships. I created a scatterplot to display the relationship between the black population and the median household income, which had the strongest negative value (Fig. 3).
(Fig. 3) Scatterplot created in Excel displaying the correlation trend between the black population and median household income.
Looking at Fig. 3, you can see the negative correlation is not as distinct as the positive correlation in Fig. 2, which is also reflected in the Pearson values. A value of -.417 is farther from -1 than .735 is from 1, which explains the difference between the two scatterplots.
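One common way to compare the strength of two correlations, which was not part of the original assignment, is to square each Pearson value. The result (r squared) is the share of the variation in one variable that a straight-line relationship with the other accounts for. The short sketch below applies this to the two values reported above; the dictionary names are my own.

```python
# Pearson values taken from the SPSS matrix discussed above.
pearson_values = {
    "White vs. Manu": 0.735,
    "Black vs. MedInc": -0.417,
}

# Squaring removes the sign, leaving only the strength of the fit.
r_squared = {label: r * r for label, r in pearson_values.items()}

for label, r in pearson_values.items():
    print(f"{label}: r = {r:+.3f}, r^2 = {r_squared[label]:.3f}")
```

By this measure the White and Manu relationship accounts for roughly half of the variation, while the Black and MedInc relationship accounts for less than a fifth, which fits the looser cloud of points in Fig. 3.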
Part 2
Introduction
I was provided data containing Democratic votes and voter turnout for all the counties in Texas from the Texas Election Commission (TEC) for the 1980 and 2016 Presidential Elections. The data codes are as follows:
- VTP80 = Voter Turnout 1980
- VTP16 = Voter Turnout 2016
- PRES80D = % Democratic Vote 1980
- PRES16D = % Democratic Vote 2016
I was also instructed to download the percent Hispanic population for all the counties in Texas from the U.S. Census 2015 ACS data.
Scenario
The TEC has requested an analysis of the election patterns to determine if there is clustering of the voting patterns and voter turnout. The TEC wants to present my findings to the governor to see if the election patterns are different than they were 36 years ago.
Methods
I utilized SPSS to run Bi-variate correlation using the Pearson method on the data I was provided. The resulting correlation matrix can be seen in Fig. 5.
I created two scatterplots of the correlation data to help better understand what the matrix is detailing. The first scatterplot displays the correlation between the 1980 percent Democratic vote and the 1980 voter turnout (Fig. 6). The second scatterplot displays the correlation between the 2016 percent Democratic vote and the 2016 voter turnout (Fig. 7).
The second form of analysis was performed with a program called GeoDa, which I used to run Spatial Autocorrelation on both years of voting data and the Hispanic population. Spatial Autocorrelation differs from regular correlation in that it calculates the correlation of a single variable through space, against neighboring areas of the same type. Spatial Autocorrelation compares neighboring areas (counties in this case) to each other to determine if they are "more alike" (positive), "unlike" (negative), or "random" (no spatial autocorrelation).
Spatial Autocorrelation with GeoDa was used to calculate Moran's I. Moran's I is an indicator of spatial autocorrelation. The results of Moran's I range between -1 and 1. The closer the value is to 1, the more clustered the variable is said to be. The closer the value is to -1, the more dispersed (less clustered) the variable is said to be. Examining Fig. 4, you can see examples of exact -1 and 1 spatial autocorrelation. Very rarely, if ever, will you see exact (perfect) spatial autocorrelation. I utilized a map of the Texas counties to display the spatial autocorrelation calculated by GeoDa along with a Moran's I chart and value. The maps are color coded to display the clustering of the counties. Bright blue denotes areas of low value surrounded by other areas of low value. Bright red denotes areas of high value surrounded by other areas of high value. Light blue denotes areas of low value surrounded by areas of high value. Light red or periwinkle denotes areas of high value surrounded by areas of low value. The light blue and light red areas are the outliers in the clustering analysis.
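The calculation behind Moran's I and the four map categories can be sketched in plain Python. This is a simplified illustration under stated assumptions, not a reproduction of GeoDa: it uses a made-up 4 x 4 grid instead of Texas counties, treats cells that share an edge as neighbors with equal weights, and skips the significance testing that decides which areas a LISA map actually colors. All function names are hypothetical.

```python
def edge_neighbors(rows, cols):
    """Map each grid cell index to the cells that share an edge with it."""
    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            adjacent = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    adjacent.append(nr * cols + nc)
            neighbors[r * cols + c] = adjacent
    return neighbors

def morans_i(values, neighbors):
    """Global Moran's I with a weight of 1 for each neighbor pair."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Products of deviations for every (area, neighbor) pair.
    cross = sum(dev[i] * dev[j] for i in neighbors for j in neighbors[i])
    total_weight = sum(len(adj) for adj in neighbors.values())
    return (n / total_weight) * cross / sum(d * d for d in dev)

def quadrant_labels(values, neighbors):
    """Label each area by its own value and its neighbors' average."""
    mean = sum(values) / len(values)
    labels = []
    for i, value in enumerate(values):
        lag = sum(values[j] for j in neighbors[i]) / len(neighbors[i])
        own = "High" if value > mean else "Low"
        around = "High" if lag > mean else "Low"
        labels.append(f"{own}-{around}")
    return labels

grid = edge_neighbors(4, 4)
# Made-up patterns: high values on the left half vs. alternating cells.
clustered = [1, 1, 0, 0] * 4
checkerboard = [(r + c) % 2 for r in range(4) for c in range(4)]

print(round(morans_i(clustered, grid), 3))      # positive: like next to like
print(round(morans_i(checkerboard, grid), 3))   # negative: unlike neighbors
print(quadrant_labels(clustered, grid)[:4])
```

In the clustered pattern every cell is labeled High-High or Low-Low, which corresponds to the bright red and bright blue areas on the maps. In the checkerboard pattern every cell is High-Low or Low-High, the outlier categories.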
Results
(Fig. 5) Correlation matrix for TEC data.
(Fig. 6) Scatterplot of 1980 voting data comparing the percent Democratic vote to the voter turnout.
Analyzing the scatterplot of the 1980 voting data, which compares the percent Democratic vote to the voter turnout, you can see there is a negative correlation. The negative correlation shows that as voter turnout decreases, the percent Democratic vote increases. The Pearson value of -.612 from SPSS tells us this is a moderate correlation between the two variables.
(Fig. 7) Scatterplot of 2016 voting data comparing the percent Democratic vote to the voter turnout.
Analyzing the scatterplot of the 2016 voting data, you can see the same negative correlation as in the 1980 data, though it is not as strong. The decrease in correlation is backed up by the Pearson value of -.530, which is still a moderate correlation but is weaker than the 1980 value.
1980 Percent Democratic Vote
(Fig. 8) Moran's I value and chart for 1980 Democratic vote.
(Fig. 9) LISA Cluster Map for 1980 Democratic vote.
Investigating the Spatial Autocorrelation of the 1980 Democratic vote, we obtained a Moran's I of .575 (Fig. 8). The value tells us there is some positive Spatial Autocorrelation in the Democratic vote across the counties. However, the value is only about halfway to 1, so the clustering is not very strong. The map displays clustering of high Democratic voting in the southern tip of the state and in a few counties in far eastern Texas (Fig. 9). A large cluster of counties with a low Democratic vote sits in the northern portion of the state, extending down slightly into the central portion.
2016 Percent Democratic Vote
(Fig. 10) Moran's I value and chart for 2016 Democratic vote.
(Fig. 11) LISA Cluster Map for 2016 Democratic vote.
Analyzing the 2016 Democratic vote, we see an increase in the Moran's I value to .685 (Fig. 10). The value tells us there is an increase in clustering over the 1980 data. The increased clustering can be seen in Fig. 11, where more of the counties are designated by the various colors. Note the shift of the Low-Low (blue) counties from the west side to the east side of north central Texas. The map displays a shift in the voting tendencies of those counties. The counties with high Democratic voting are still primarily located in the southern tip of the state, just as in the 1980 data.
1980 Voter Turnout
(Fig. 12) Moran's I value and chart for 1980 voter turnout.
(Fig. 13) LISA Cluster Map for 1980 voter turnout.
The 1980 voter turnout analysis shows clustering in certain areas, but with a Moran's I value of .468 the clustering is minimal overall, though noteworthy in specific areas (Fig. 12). Even fewer counties are colored in than in the previous two maps, which ties directly to the lower Moran's I value (Fig. 13). You can see a large cluster of counties with low voter turnout in the southern tip of the state. Compared with the 1980 Democratic vote map, this low turnout relates back to the first scatterplot, which showed the negative correlation between voter turnout and Democratic vote. The same can be said for the counties in the eastern part of the state which had low voter turnout.
2016 Voter Turnout
(Fig. 14) Moran's I value and chart for 2016 voter turnout.
(Fig. 15) LISA Cluster Map for 2016 voter turnout.
The 2016 voter turnout has the lowest Moran's I of all the data thus far, with a value of .287 (Fig. 14). Again there is clustering in a few areas throughout the state, but for the most part no specific patterns are seen in the map (Fig. 15). Again we see low voter turnout in the southern tip of the state, which ties back to the high Democratic vote shown in the previous maps.
Percent Hispanic Population
(Fig. 16) Moran's I value and chart for 2015 percent Hispanic Population.
(Fig. 17) LISA Cluster Map for 2015 percent Hispanic Population.
The percent Hispanic population across Texas displays the highest clustering of all the data. The Moran's I value is .779 (Fig. 16), a high value which indicates strong clustering of counties with low percentages of Hispanic population and strong clustering of counties with high percentages of Hispanic population. There is a high percentage of Hispanics along the southern edge of Texas, which borders Mexico (Fig. 17). The northeast portion of Texas has lower than average Hispanic populations, with the exception of the one periwinkle county in the middle. My professor stated there is a large carpet factory in that county which employs many Hispanic workers, so many Hispanics live there.
Discussion and Conclusion
Comparing the 1980 voter turnout LISA map to the 1980 Democratic vote LISA map, you can better see the correlation using the Spatial Autocorrelation display (Fig. 18). The southern tip of Texas shows a distinct correlation between voter turnout and the Democratic vote. The northern portion of the state has some correlation between the two, but not nearly as much as the southern tip. The same could be said about the eastern portion of the state. There was definitely some clustering in the voting patterns in Texas during the 1980 election, as displayed by the LISA maps.
(Fig. 18) LISA maps for 1980 voter turnout (Left) and 1980 Democratic vote (Right).
The 2016 comparison of the voter turnout to the Democratic vote displays a similar pattern, though the Democratic vote clusters have shifted to the east a bit (Fig. 19). The shift could be attributed to the growing population closer to Dallas and Fort Worth, Texas, which are located in the northeastern portion of the main body of Texas (Fig. 20).
(Fig. 19) LISA maps for 2016 voter turnout (Left) and 2016 Democratic vote (Right).
(Fig. 20) Locations of major cities in Texas taken from Google Maps.
The analysis I provide will be useful for targeting campaigns for specific parties (Republican or Democratic). Additionally, the areas with low voter turnout could be targeted for "Get out and vote" campaigns. Additional research is required to identify the factors causing the clustering in specific locations, and to explain why the negative correlation holds true in some counties but not others. Lastly, the counties which are "outliers" that go against the clustering of surrounding counties could be investigated to find the specific reasons.