Parsing data into pieces
** Download the accompanying Excel Workbook referenced in this post. **
Every week I spend a couple of hours exploring public-use data sets. Because most of my work has centered on the welfare of children, I tend to gravitate towards data sets that capture child health, education, and developmental outcomes. Oftentimes, these data sets are nested. Nested data are data where observations can be grouped based on some common feature(s). Examples include:
Children who have the same parents (children nested within families);
Students who have the same teacher (students nested within classrooms); or
Patients who see the same physician (patients nested within physicians).
This weekend, I was playing around with NYC Open Data, and I was looking at the 2013 - 2018 Demographic Snapshot data sets for New York City schools and districts. These data include total enrollments broken down by gender, race, and other demographic characteristics at the school and district levels.
Wanting to merge the school and district data together, I searched for a unique identifier that was common between the two data sets. I found that each school in New York City has a unique DBN, which is an acronym for District, Borough, and School Number.
For example, the first record in the school data set has the following DBN number: 01M015
which means the school is in Administrative District one (1) within the borough of Manhattan, and the school’s unique identification number is 015.
Because I wanted to merge school data with district data, I needed to extract the first two characters of the DBN; 01 from our example. The programmer in me was like eh, I could do this in a minute in R. And I did. But because I love sharing how to use Excel for all things data, I decided to replicate my results in Excel.
First, I downloaded the school and district data sets to my computer and combined them into a single Excel Workbook (different tabs). I then removed data for the years 2013-14; 2014-15; 2015-16; and 2016-17 from either tab.
Next, I toggled over to the ‘school data’ tab, inserted a new column to the right of the DBN column, and, in cell B1, entered ‘Administrative District’ as the new column’s header.
On to parsing the data!
To extract the first two strings of each school’s DBN, I used two functions:
LEFT and
VALUE
The LEFT function allows you to extract elements of a string starting with the leftmost character. The function has two arguments (1) cell location; and (2) the number of characters to extract.
Using the VALUE function, you can convert text or string that is a numerical value into a number. The function has one main argument called ‘text’.
I entered the formula =VALUE(LEFT(A2,2)) into cell B2. (This formula tells Excel that I not only want to extract the first two elements of the DBN located in cell A2, but I want to convert that text into a number.) Then, I double-clicked the fill handle to apply the formula to the rest of the column.
Finally, I created a new tab called ‘Full’ and copy-pasted the first three columns of the school worksheet into this new tab.
Now I can use VLOOKUP to merge the two data sets together!
Vlookup is a built-in Excel function that finds a value in a column of data and returns a corresponding value from another column. VLOOKUP has three main arguments:
The value or string to look up (aka. "lookup value"
A table where the first column contains the lookup value
A column number in the table from which to retrieve a value or string
An optional argument (range_lookup) controls whether an approximate (True) or exact (False) match is required. If you are completely unfamiliar with VLOOKUP, Microsoft Office has a tutorial on how to use the function. (Note: a weird quirk of VLOOKUP is that the lookup value must be the first column in the range for the function to work correctly.)
Onto merging the data sets!
First, I created a new column in the Full tab called Total School Enrollment which will contain information from the Total Enrollment column in the School tab (i.e., the number of students at each school in the School data set). I populated this column by entering:
=VLOOKUP(A2,school!$A$1:$AN$1845,5,FALSE)
into cell D2 (Full tab). Then double-click the fill handle to apply the formula to the rest of the column
Here is another example. Add values to a new column called Total District Enrollment by entering:
=VLOOKUP(B2,district!$A$1:$AL$36,3,FALSE)
into cell E2 (Full tab). Then double-click the fill handle to apply the formula to the rest of the column
So you see, parsing and blending data in Excel is easy to do when you have a common unique identifier.