The Differences Slavery Made: A Close Analysis of Two American Communities

View : Methods Overview | GIS | Statistics

Methodology for Creating and Analyzing a Census/GIS Database for Augusta Co., Virginia, and Franklin Co., Pennsylvania, 1860-1870:

Overview, Outline, and Detailed Discussion of Plans and Procedures for Data Automation

Aaron Sheehan-Dean, November 2001
Overview

The foundation of the Census/GIS database for the Valley of the Shadow project was the 1860 U.S. Census. We utilized records from the Population, Agriculture, Manufacturing and Slaveholder schedules for both counties (though the latter only applies to Augusta). All of the aggregate information used in the article is drawn from analysis of the census files, imported into the Statistical Package for the Social Sciences (SPSS). The most challenging part of the process involved aggregating the individual census information by household so that it could be linked into the GIS and then exported back into SPSS for additional analysis. A lengthy syntax was written to accomplish this task (more below).

Importing the Census files into SPSS

Virginia Center for Digital History (VCDH) staff, over the course of several years, entered the full 1860 U.S. Census into an Access database, which receives queries on the Valley of the Shadow website. An early, and important, step toward making meaningful analysis of Augusta data possible, was linking the slaveholder census with the agricultural schedule. We did this by joining households according to name, using the location of households as one of the checks to ensure that we linked the correct households across the two census schedules.

The census files were initially entered into a Microsoft Excel file. This file can be saved in database (.dbf) format and read by SPSS, which is the application we used to perform all of our statistical analysis. The Excel files were cleaned of all non-numerical data, saved in .dbf format and reopened in SPSS, where they were saved as .sav files. At this point, the census files consisted of individual names, variables identifying where that person was found in the census (for example, the Population census includes a family number, dwelling number and a page number for each entry), and whatever specific information was recorded for that individual in the original census report.

Aggregate census data by household

Since our maps for both counties identified residences by household name, we needed to aggregate the census data by households. In constructing the GIS for each county, we created a unique "map-id" number for each household connected to its residence on the map. Since last names, and even first name-last name combinations, are repeated so often in both censuses, we used the map-id number as the basis for aggregating households.

For Franklin, the 42,360 individuals needed to be aggregated into 7,709 families. For Augusta, 21,784 individuals needed to be aggregated into 3,740 families. This step required us to write an extensive SPSS syntax script expressly for our project. We also wrote a syntax aggregating households within the agricultural census. The population census syntax language is given below, with a small sample of how households were listed (there was one line of syntax for each residence in the GIS; 2800 for Augusta and 5000 for Franklin).

SAMPLE SYNTAX:
get file="g:\Ecai\SPSS Files\franklinspss\frankpopcen.sav".
missing values all ().
numeric age2.
compute count=1.
variable labels age2 "Age Group".
value labels age2
1 "0 - 4 years old"
2 "5 - 9 years old"
3 "10 - 14 years old"
4 "15 - 19 years old"
5 "20 - 24 years old"
6 "25 - 29 years old"
7 "30 - 39 years old"
8 "40 - 49 years old"
9 "50 - 59 years old"
10 "60 - 69 years old"
11 "70 - 79 years old"
12 "80 or more years old".
if age <5 age2 =1.
if age >4 and age <10 age2=2.
if age >9 and age <15 age2=3.
if age >14 and age <20 age2=4.
if age >19 and age <25 age2=5.
if age >24 and age <30 age2=6.
if age >29 and age <40 age2=7.
if age >39 and age <50 age2=8.
if age >49 and age <60 age2=9.
if age >59 and age <70 age2=10.
if age >69 and age <80 age2=11.
if age >79 age2=12.
execute.
numeric female.
numeric male.
if (sex eq 0) female=1.
if (sex eq 1) male=1.
execute.
numeric w1.
numeric w2.
numeric w3.
numeric w4.
numeric w5.
numeric w6.
numeric w7.
numeric w8.
numeric w9.
numeric w10.
numeric w11.
numeric w12.
numeric b1.
numeric b2.
numeric b3.
numeric b4.
numeric b5.
numeric b6.
numeric b7.
numeric b8.
numeric b9.
numeric b10.
numeric b11.
numeric b12.
numeric m1.
numeric m2.
numeric m3.
numeric m4.
numeric m5.
numeric m6.
numeric m7.
numeric m8.
numeric m9.
numeric m10.
numeric m11.
numeric m12.
execute.
variable labels
w1 "White 0-4 years old"
w2 "White 5-9 years old"
w3 "White 10-14 years old"
w4 "White 15-19 years old"
w5 "White 20-24 years old"
w6 "White 25-29 years old"
w7 "White 30-39 years old"
w8 "White 40-49 years old"
w9 "White 50-59 years old"
w10 "White 60-69 years old"
w11 "White 70-79 years old"
w12 "White 80 or more years old"
b1 "Black 0-4 years old"
b2 "Black 5-9 years old"
b3 "Black 10-14 years old"
b4 "Black 15-19 years old"
b5 "Black 20-24 years old"
b6 "Black 25-29 years old"
b7 "Black 30-39 years old"
b8 "Black 40-49 years old"
b9 "Black 50-59 years old"
b10 "Black 60-69 years old"
b11 "Black 70-79 years old"
b12 "Black 80 or more years old"
m1 "Mulatto 0-4 years old"
m2 "Mulatto 5-9 years old"
m3 "Mulatto 10-14 years old"
m4 "Mulatto 15-19 years old"
m5 "Mulatto 20-24 years old"
m6 "Mulatto 25-29 years old"
m7 "Mulatto 30-39 years old"
m8 "Mulatto 40-49 years old"
m9 "Mulatto 50-59 years old"
m10 "Mulatto 60-69 years old"
m11 "Mulatto 70-79 years old"
m12 "Mulatto 80 or more years old".
execute.
if (age2 = 1 and color eq 1) w1 = 1.
if (age2 = 1 and color eq 2) b1 = 1.
if (age2 = 1 and color eq 3) m1 = 1.
if (age2 = 2 and color eq 1) w2 = 1.
if (age2 = 2 and color eq 2) b2 = 1.
if (age2 = 2 and color eq 3) m2 = 1.
if (age2 = 3 and color eq 1) w3 = 1.
if (age2 = 3 and color eq 2) b3 = 1.
if (age2 = 3 and color eq 3) m3 = 1.
if (age2 = 4 and color eq 1) w4 = 1.
if (age2 = 4 and color eq 2) b4 = 1.
if (age2 = 4 and color eq 3) m4 = 1.
if (age2 = 5 and color eq 1) w5 = 1.
if (age2 = 5 and color eq 2) b5 = 1.
if (age2 = 5 and color eq 3) m5 = 1.
if (age2 = 6 and color eq 1) w6 = 1.
if (age2 = 6 and color eq 2) b6 = 1.
if (age2 = 6 and color eq 3) m6 = 1.
if (age2 = 7 and color eq 1) w7 = 1.
if (age2 = 7 and color eq 2) b7 = 1.
if (age2 = 7 and color eq 3) m7 = 1.
if (age2 = 8 and color eq 1) w8 = 1.
if (age2 = 8 and color eq 2) b8 = 1.
if (age2 = 8 and color eq 3) m8 = 1.
if (age2 = 9 and color eq 1) w9 = 1.
if (age2 = 9 and color eq 2) b9 = 1.
if (age2 = 9 and color eq 3) m9 = 1.
if (age2 = 10 and color eq 1) w10 = 1.
if (age2 = 10 and color eq 2) b10 = 1.
if (age2 = 10 and color eq 3) m10 = 1.
if (age2 = 11 and color eq 1) w11 = 1.
if (age2 = 11 and color eq 2) b11 = 1.
if (age2 = 11 and color eq 3) m11 = 1.
if (age2 = 12 and color eq 1) w12 = 1.
if (age2 = 12 and color eq 2) b12 = 1.
if (age2 = 12 and color eq 3) m12 = 1.
execute.
numeric id.
if (township eq "Antrim" and family = 1185 and dwelling = 1196) id = 1998.
if (township eq "Antrim" and family = 1222 and dwelling = 1235) id = 3265.
...

if (township eq "Chambersburg" and family = 1571) id = 5705.
if (township eq "Chambersburg" and family = 1572) id = 4555.
execute.
select if (id > 0).
sort cases by id.
execute.
variable labels
realest "Real Estate Value"
perest "Personal Estate Value"
count "Number in Household".
execute.
AGGREGATE
/OUTFILE="g:\Ecai\SPSS Files\franklinspss\franklin.sav"
/BREAK=id
/realest = SUM(realest) /perest = SUM(perest)
/w1 = SUM(w1) /w2 = SUM(w2)
/w3 = SUM(w3) /w4 = SUM(w4) /w5 = SUM(w5) /w6 = SUM(w6) /w7 = SUM(w7)
/w8 = SUM(w8) /w9 = SUM(w9) /w10 = SUM(w10) /w11 = SUM(w11)
/w12 = SUM(w12) /b1 = SUM(b1) /b2 = SUM(b2) /b3 = SUM(b3) /b4 = SUM(b4)
/b5 = SUM(b5) /b6 = SUM(b6) /b7 = SUM(b7) /b8 = SUM(b8)
/b9 = SUM(b9) /b10 = SUM(b10) /b11 = SUM(b11) /b12 = SUM(b12)
/m1 = SUM(m1) /m2 = SUM(m2) /m3 = SUM(m3)
/m4 = SUM(m4) /m5 = SUM(m5) /m6 = SUM(m6)
/m7 = SUM(m7) /m8 = SUM(m8) /m9 = SUM(m9) /m10 = SUM(m10)
/m11 = SUM(m11) /m12 = SUM(m12) /count = SUM(count)
/female = SUM(female) /male = SUM(male).
execute.

Writing a syntax of this length and with this much repetition demanded shortcuts. We used the Excel file made for creating the GIS and wrote each line of script into an Excel spreadsheet which had been stripped down to location, page number, number on page, and map-id numbers. The map-id numbers were the essential piece, since the script would combine information about all residents according to the map-id they shared. The file was copied out of Excel and pasted into Microsoft Word, in order to create sentence structure that SPSS would recognize. The "find and replace" command is an excellent tool for reshaping the sentence back into an acceptable format for SPSS.

The other aggregating function that we needed to perform with the census material was combining the information from the population and agriculture censuses into one SPSS file. Since we had aggregated individual data by households using the map-id numbers, we could quite easily merge variables from these two separate files using the SPSS "merge files" command and relying on the map-id as the key variable.

Categorizing data for analysis

In order to link the household census data to the GIS, and indeed, in order to perform any meaningful analysis of the data itself, we needed to organize much of the material around new variables. For the Population census, the most important aggregating was on wealth. Accordingly, we combined real estate wealth and personal property wealth into one new variable called "household wealth."

We chose to use quintile divisions for most of the household and farm wealth analysis. We chose this because it reflects the distribution of wealth within each county rather than using arbitrary lines around which to group households. Using the "frequency" command in SPSS, we were able to obtain quintile breakdowns for variables like household wealth, farm value, farm size, etc. These frequency values were then used as the demarcation points for creating new variables (using the SPSS "transform into new variable" command) with households organized into quintile cohorts.

For slaveholding in Augusta County, we did not use quintiles to create slaveholder categories. Rather, we relied upon prevailing methods in the literature, combined with an awareness of slavery's presence in the county. Although the county as a whole had the same number of slaveholders as the average Southern county in 1860 (25%), Augusta residents held smaller numbers of slaves than many in Virginia's Piedmont (the fertile counties immediately to the west) or in the Deep South cotton belt. Our subdivisions for slaveholders were as follows: 0, 1, 2-5, 6-10, 11-20, 21-30, 31+. We also coded households as either slaveholders or non-slaveholders in order to facilitate a broader comparison of the two groups.

A final aspect of census data preparation was unrelated to the household aggregation and grouping but followed many of the same procedures. In order to analyze the political landscape of each place, we culled all personal names that appeared in any of the four newspapers for the two counties for the years 1860 and 1861. These lists were brought into Microsoft Word, where they were scanned for relevance to political activity. The majority of names that appeared in the newspaper came from jury rolls but in both places, the newspapers made many reports of political meetings, particularly around the 1860 presidential election. An Excel spreadsheet for each county was created with the following variables: name, party identity, candidate, date of mention, location. We then cross-referenced these lists against the GIS residence list, using names and then locations (when available) to link individuals and their party identities. Since these people appeared, often more than once, at party events, we have classified them as "political activists."

For Franklin party, we grouped the Republican, People's and Union parties under the Republican label in our analysis. The People's party was the label used by Republicans throughout much of Pennsylvania. The Union party was more properly a "fusion" party, comprised of Republicans and some Democrats. Since the Democrats attack the Union party in their literature, we treated them as Republicans as well. We analyzed the socio-economic backgrounds of these activists using standard quantitative methods. After creating the original file (matched GIS names with the census), we went back to the census and drew out age, occupation, and household as well. We grouped occupations into five categories: farmer, laborer, artisan, professional and business (coded 1-5 ascending). For rank in household, we assumed that men above age 18 with jobs who were ranked as fourth or higher in their households were probably boarders. The majority of those we identified in this way were in the 40s and 50s and almost certainly living alone in a rooming house.

Importing data into the Geographic Information System (GIS) and exporting from GIS with added variables

Once the Census data files for each county had been aggregated by household, matched with the appropriate map-id (from the digitizing during the creation of the GIS) and new variables facilitating analysis had been added, the resulting files were joined to the respective GIS systems for Franklin and Augusta. * See the GIS_Procedures page for a more detailed discussion of how the GIS was created and how we conducted the spatial and geographic analysis of the households.

We used the GIS to add data calibrating the geographical and spatial relations between points on the map (private residences as well as public institutions and commercial establishments, roads, railroads, etc.) and natural features (rivers, elevation, soil type, etc.). These were done through the creation of buffers around points or line features (a standard GIS approach) or, in the case of polygon items (as with the digital elevation models or the soil type coverages), through assigning variables denoting location inside specific polygons. For both Augusta and Franklin we added the following variables to the Census database: proximity to the railroad and railroad depots; proximity to a major road; proximity to a church; proximity to a school; proximity to a town (all with 1 mile buffers around the relevant points or lines); elevation; soil type; and voting precinct.

Although Arc software comes with its own statistical analysis software, we chose to use SPSS for the subsequent analysis of the Census/GIS dataset. From ArcView, we exported the table for the residence coverage (which included all the new geographic variable information). ArcView will only export tables in .dbf format. These can be opened in SPSS and then converted to .sav files, the standard format for SPSS data analysis. Once converted, the resulting SPSS files need to be double checked to make sure that value labels were maintained as they were initially entered before importing the dataset into the GIS. This step is important since correct value labels are essential to performing accurate statistical analysis of the material.

During the course of analyzing the GIS/Census dataset, we continued to add variables as new queries occurred to us. The new tables generated within the GIS as a result of these queries could be exported and the new variables added to the existing SPSS files through the "merge variables" command in SPSS.

Analysis

All of the quantitative analysis we performed on the dataset was done using standard statistical procedures, such as frequencies, cross-tabs, correlations and regressions. For subsequent analyses, we needed to aggregate variables or create new ones. Below is a discussion of new variables created using SPSS.

Crop production: In order to ascertain levels of crop production we recoded the wheat and corn as a percent of total grain by running frequencies and taking three equal groups. Those breaks were used in recoding to create three groups with low, middle and high investment in wheat and corn production.

Crop value: We assigned per bushel values for corn and wheat based on newspaper advertisements placed by merchants seeking to buy these crops from farmers, in each county. For both counties we looked at advertisements in both spring and fall. For Augusta County, we obtained our values from ads in the Staunton Spectator placed by merchants from Richmond and Staunton and for Franklin County, we obtained our values from ads in the Chambersburg Valley Spirit place by merchants from Philadelphis and Chambersburg. Once we had obtained a base value for each crop, we could compute total crop values and per acre production, in dollars, for all those households for which we had Agricultural Census information.

Property-owning: In the SPSS files for the full pop censuses, we have a "missing value" for entries below 1. Running a simple frequency on real estate values (sorted by rank in household) turns up a sizable number of "missing" data that is not included in the frequency table. We decided that the missing data should be zero. The vast majority of missing cases in both counties are ones where someone has personal property and no real estate wealth.

Age Cohorts: We elaborated the census household matches to include information on the age of the head of the household and the ages of those men in the household who were at or above voting age. We then averaged these ages (including the head and excluding the head) to obtain an average age for all eligible voters in each household. We used these figures to compare those precincts that gave proportionally higher levels of support to each of the candidates.


Citation: Key = TM2
Historiography Tools