Data Preprocessing in R — Processing Obituaries for Gephi

Following my interest in data processing and in character development, I’ve been working with Drs. Mark Alfano, of the University of Oregon Dept. of Philosophy, and Andrew Higgens, of the University of Illinois at Urbana‑Champaign, to learn about different communities’ moral priorities by examining the obituaries that they publish. Mark, the project’s PI (Principal Investigator), originally made a post about the project here.

Currently, the project involves four primary steps:

  1. We gather a large number of obituaries to read, through web scraping, using obituary databases (yes, they exist), or simply reading daily newspapers.
  2. Mark serves as an expert coder, reading through each obituary and noting all of the descriptors that could be used to describe the character of the deceased.
  3. We process the data to make them usable with Gephi, a piece of open-source software for network analysis and visualization.
  4. Andrew visualizes the data, and we look for patterns.

These are early days for this project; it will likely expand in the coming months to have broader outputs and goals. For now, though, I’m taking a moment to describe step 3 above: how a table of character descriptions is converted into a form that Gephi can process. Along the way, we turn a table like this (where each row is an obituary, and 0 = Female and 1 = Male):

gender traits
0 civic, gentle, integrity, conscientious, sharp
1 optimist, devoted, inspiring, honest, courageous, friend, honest, encouraging
0 philanthropist
1 kind, feisty, pragmatic, kind
0 mysterious, committed, social, entrepreneur, fair, musician, honorable, athlete, loyal
0 compassionate, cook, journalist
1
1 patient, sassy, expert, charming, thoughtful
1 humorous, writer, generous
1
1 researcher, cosmopolitan, fierce, peace movement, intrepid, engaging, feminist
1 professor, happy, vibrant, coach, kind, happy
0 considerate, lawyer, enthusiastic

…into a network graph like this:

This network map comes from this post, and shows gender differences in obituaries published in the Eugene, OR, Register-Guard newspaper during Dec. 2013 (red = female deceased, blue = male deceased).
Network map of Eugene, OR sex differences in obituaries

For this type of analysis, Gephi requires a dataset formatted following several rules:

  1. There should be three columns: Gender, Source (i.e., “From”), and Target (i.e., “To”).
  2. The “Target” and “Source” columns should always be in the same alphabetical order, so that Gephi always draws the line going in the same direction (here, from “Family” to “Loving,” and not from “Loving” to “Family”).
  3. We shouldn’t use repeated terms from the same row (so if “kind” was used twice in the same obituary, only use the first instance).

For the map above, one row of the processed dataset could look like this:

Target Source Gender
family loving 0

This tells Gephi “Draw a line from ‘Family’ to ‘Loving’, and color it red (for Female, since 0 = female).”
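
As a minimal sketch of these rules in R (separate from the full script below; the variable names here are only for illustration), a single pair of terms can be de-duplicated with unique() and alphabetized with sort() before being laid out in the three columns:

```r
# Illustrative only: one obituary (gender 0) that mentions "loving" twice
# and "family" once.
rawTerms <- c("loving", "family", "loving")

# Rule 3: keep only the first instance of each term.
distinctTerms <- unique(rawTerms)

# Rule 2: put the pair in alphabetical order, so that the line is always
# drawn in the same direction.
orderedPair <- sort(distinctTerms)

# Rule 1: lay the pair out alongside the gender code.
edge <- data.frame(
	Target = orderedPair[1],
	Source = orderedPair[2],
	Gender = 0,
	stringsAsFactors = FALSE
)
print(edge)
```

This gives the same single row as the example above (family, loving, 0).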

With these rules in mind, below is a heavily commented script for R to automatically process the dataset.

This code is released under an MIT license. A simple summary of what that means can be found here.

# Script for taking comma-separated strings (within cells of a CSV) and combining them into one-way relationships for easy import into Gephi
# Jacob Levernier
# Jan. 2014 ff.
# Released under an MIT License (http://opensource.org/licenses/MIT).

# Clear R's memory, setting a fresh foundation from which to run this script.
rm(list=ls())

#################
# CONFIGURATION SETTINGS -- EDIT THESE
#################

# Working directory: The directory on your computer in which the data file can be found, and into which you would like to save the output from this script. Include a trailing slash ('/') at the end:
workingDirectoryToSet <- "/path/to/Obituaries Project/Data/"

# Data file name within the working directory:
# NOTE: The data file should be a Comma-Separated Values file with comma delimiters and double-quotes (") around text fields.
# The data file should have two columns: "gender", and "traits". The traits column should be a comma-separated string of terms to be combined.
# The CSV file SHOULD have a header row (i.e., a row with column names).
dataFileName <- "Example_Data_CSV.csv"

# The filename to which to write the output from this script:
fileNameToWriteOutputTo <- "Example_Data_Output_CSV.csv"


#################
# END CONFIGURATION SETTINGS
#################



#################
# PROCESSING STEPS -- DO NOT EDIT THESE
#################

# Set the working directory 
setwd(workingDirectoryToSet)

dataSetToParse <- read.csv(dataFileName, header = TRUE, sep=",")

# This is good for debugging purposes
# colnames(dataSetToParse)

# Split the traits column by commas, and save each row's now-parsed traits list into a column of the dataSetToParse object called "split":
dataSetToParse[["split"]] <- strsplit(as.character(dataSetToParse[["traits"]]), ", ")

# A good introduction to R data frames is at http://www.r-tutor.com/r-introduction/data-frame

# Create an empty data frame to which each row's combinations will be appended (its columns will end up being "Target", "Source", "Gender", and "OriginalRowNumber"):
fullDataFrame <- data.frame()

# Loop through every row in the dataSetToParse:
for (i in 1:nrow(dataSetToParse)) {
	# If there are at least two distinct trait terms in this row (a pair needs two different terms)...
	if(length(unique(as.character(dataSetToParse[["split"]][[i]]))) > 1) {

		# Make all pairwise (hence "2") combinations of terms from the "split" column, only using unique terms from that row (i.e., throw out terms listed more than once in that row).
		combinationOfTraitTermsFromRow <- combn(unique(as.character(dataSetToParse[["split"]][[i]])), 2) # unique() is to get rid of duplicate combinations from duplicate terms (and edges/connections between duplicate terms).

		# Transpose (in the matrix algebra sense) the list of combinations so that it's in a vertical format.
		transposedCombination <- t(combinationOfTraitTermsFromRow)
		
		# Make the list into a dataframe, which makes it easier to process below:
		transposedCombination <- as.data.frame(transposedCombination)

		# Get the "gender" column from the row of the original dataset, and list it alongside every combination of terms (from that row).
		transposedCombination[["Gender"]] <- dataSetToParse[["gender"]][[i]]
		
	 	# Get the number of the row from the original dataset, and list it alongside every combination of terms (from that row).
		transposedCombination[["OriginalRowNumber"]] <- i+1 # The +1 is to account for the fact that the original CSV file had a header row that got taken out when we imported the data.
		
		# Assign names to all of the columns. These will get wiped out in a moment and then will have to be re-set, but I think that including this line here makes things easier to understand:
		names(transposedCombination) <- c("Target","Source", "Gender", "OriginalRowNumber")
		
		# Take a look (This is good for debugging purposes)
		#View(transposedCombination)
		
		# Tack on transposedCombination to the end of the object called fullDataFrame (which we established as an empty data frame before going into this row-by-row loop, above):
		fullDataFrame <- rbind(fullDataFrame, transposedCombination)
		
		# Take a look (This is good for debugging purposes)
		#View(fullDataFrame)
		
		# Wipe the objects that we were using within the row, since everything from the row has been written to the fullDataFrame object now (as of a few lines above):
		rm(combinationOfTraitTermsFromRow,transposedCombination)
	} else { # If the row has fewer than two trait terms to pair up
		# This is good for debugging purposes
		#cat("Skipping row ",i," because it does not include any trait terms...\n") # Give some output to the user.
	}
}

# Take a look (This is good for debugging purposes)
#View(fullDataFrame)

# Apply, to every row, specifically to the "Source" and "Target" columns, the sort() function, in order to alphabetize them (within that row of fullDataFrame, i.e., within that combination of terms). Transpose this so that it's in a vertical format.
dataFrameAlphabetizingRows <- t(apply(fullDataFrame[,c("Source","Target")], 1, sort))

# Take a look (This is good for debugging purposes)
#View(dataFrameAlphabetizingRows)

# Put together dataFrameAlphabetizingRows (which has the alphabetized "Source" and "Target" columns) with the corresponding Gender value from fullDataFrame:
fullDataFrameWithAlphabetizedRows <- as.data.frame(cbind(dataFrameAlphabetizingRows,fullDataFrame[["Gender"]],fullDataFrame[["OriginalRowNumber"]]))

# (Re-)Assign names to this new dataframe:
names(fullDataFrameWithAlphabetizedRows) <- c("Target","Source", "Gender", "OriginalRowNumber")

# This is good for debugging purposes
#View(fullDataFrameWithAlphabetizedRows)

# Write a CSV file for this:
write.table(fullDataFrameWithAlphabetizedRows, file=fileNameToWriteOutputTo, row.names = FALSE, col.names = TRUE, sep=",", eol="\r\n") # The eol/"end of line" marker is set to \r\n here. \r ("carriage return") on its own was the line ending on older (pre-OS X) Mac machines, \n ("line feed" or "newline") on its own is the line ending on Unix machines, and \r\n together is the line ending on Windows machines (see, e.g., https://stackoverflow.com/questions/1761051/difference-between-n-and-r/1761086#1761086)


#################
# END PROCESSING STEPS
#################
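
To try the script out, an input file in the format described in its configuration comments could be generated like this. This is only a sketch: it writes three of the example rows to R's temporary directory (an assumption, made so that nothing on your computer gets overwritten), and then reads the file back in the same way the script does.

```r
# Three rows from the example table at the start of this post.
exampleData <- data.frame(
	gender = c(0, 1, 0),
	traits = c(
		"civic, gentle, integrity, conscientious, sharp",
		"kind, feisty, pragmatic, kind",
		"philanthropist"
	),
	stringsAsFactors = FALSE
)

# Write to a temporary location (an assumption for this sketch; the script
# itself reads from whatever working directory you configure).
exampleFile <- file.path(tempdir(), "Example_Data_CSV.csv")

# write.csv() adds a header row and, by default, puts double-quotes around
# text fields, which keeps the commas inside "traits" from being read as
# column separators.
write.csv(exampleData, file = exampleFile, row.names = FALSE)

# Read the file back in the same way the script does.
readBack <- read.csv(exampleFile, header = TRUE, sep = ",", stringsAsFactors = FALSE)
print(readBack)
```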

A few points to note about the script:

  1. There are several lines that are commented-out (with # at the beginning of the line). These can be un-commented (by deleting the #). Each is there to show how the previous step changed the data.
  2. The strsplit() call splits a string like "civic, gentle, integrity, conscientious, sharp" into a vector like c("civic", "gentle", "integrity", "conscientious", "sharp"). The if() check at the top of the loop then tests whether that vector has more than 1 item in it, to decide whether it should try to make combinations of the items within the vector.
  3. In two places (right after the combn() call, and again when alphabetizing with apply()), R takes horizontal data and makes it vertical using the t() (transpose) function. It does this because Gephi requires its input to be vertically-oriented.
    An excellent visual explanation of matrix transposition is here.
  4. R’s combn() function produces all pairwise combinations of traits for each row. It’s a hugely powerful function, and worth knowing. By changing the 2 to a 3 in the combn() call above, combn() would create all three-way combinations of traits. Wow!
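
The functions mentioned in these notes can be tried on their own at the R console. Here is a small sketch using the first row of the example table (the variable names are just for illustration, and are not the ones used in the script):

```r
traitString <- "civic, gentle, integrity, conscientious, sharp"

# strsplit() returns a list with one element per input string; [[1]] pulls
# out the vector of terms for our single string.
traitVector <- strsplit(traitString, ", ")[[1]]
print(length(traitVector)) # 5 terms

# combn() returns one combination per COLUMN, so all pairs of 5 terms come
# back as a matrix with 2 rows and 10 columns...
pairsWide <- combn(traitVector, 2)
print(dim(pairsWide))

# ...and t() flips that into one combination per ROW (10 rows, 2 columns).
pairsTall <- t(pairsWide)
print(pairsTall)

# Changing the 2 to a 3 gives all three-way combinations instead.
triplesTall <- t(combn(traitVector, 3))
print(dim(triplesTall))
```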

From the dataset given at the start of this post, the script will create a new CSV file that looks like this:

Target Source Gender OriginalRowNumber
civic gentle 0 2
civic integrity 0 2
civic conscientious 0 2
civic sharp 0 2
gentle integrity 0 2
conscientious gentle 0 2
gentle sharp 0 2
conscientious integrity 0 2
integrity sharp 0 2
conscientious sharp 0 2
devoted optimist 1 3
…
considerate enthusiastic 0 14
enthusiastic lawyer 0 14

The full output is 120 rows long. Automating in this way can save a lot of time vs. processing these data by hand.
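
That row count can be checked without running the whole script, since a row with n distinct traits contributes choose(n, 2) pairs. A short sketch, with the traits column from the table at the start of this post typed in by hand (empty strings stand in for the two obituaries with no traits listed):

```r
traits <- c(
	"civic, gentle, integrity, conscientious, sharp",
	"optimist, devoted, inspiring, honest, courageous, friend, honest, encouraging",
	"philanthropist",
	"kind, feisty, pragmatic, kind",
	"mysterious, committed, social, entrepreneur, fair, musician, honorable, athlete, loyal",
	"compassionate, cook, journalist",
	"",
	"patient, sassy, expert, charming, thoughtful",
	"humorous, writer, generous",
	"",
	"researcher, cosmopolitan, fierce, peace movement, intrepid, engaging, feminist",
	"professor, happy, vibrant, coach, kind, happy",
	"considerate, lawyer, enthusiastic"
)

# Count the distinct terms in each row (repeated terms are only counted once).
distinctCounts <- vapply(
	strsplit(traits, ", "),
	function(terms) length(unique(terms)),
	integer(1)
)

# Each row contributes "n choose 2" pairs; rows with 0 or 1 terms contribute none.
pairsPerRow <- choose(distinctCounts, 2)
print(pairsPerRow)
print(sum(pairsPerRow)) # 120
```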

If you’re not familiar with R but would like to be, I recommend the introductory materials on the UOregon R Club blog, starting here. For a quick reference, I also find Learn X in Y minutes useful.
