Comparing datasets, reporting only non-duplicated rows

27 Feb

The following is in response to a question on the R-Help mailing list.

Consider two datasets –

reported <-
structure(list(Product = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
3L, 4L, 5L, 5L), .Label = c("Cocoa", "Coffee C", "GC", "Sugar No 11",
"ZS"), class = "factor"), Price = c(2331, 2356, 2440, 2450, 204.55,
205.45, 17792, 24.81, 1273.5, 1276.25), Nbr.Lots = c(-61L, -61L,
5L, 1L, 40L, 40L, -1L, -1L, -1L, 1L)), .Names = c("Product",
"Price", "Nbr.Lots"), row.names = c(1L, 2L, 3L, 4L, 6L, 7L, 5L,
10L, 8L, 9L), class = "data.frame")

exportfile <-
structure(list(Product = c("Cocoa", "Cocoa", "Cocoa", "Coffee C",
"Coffee C", "GC", "Sugar No 11", "ZS", "ZS"), Price = c(2331,
2356, 2440, 204.55, 205.45, 17792, 24.81, 1273.5, 1276.25),
Nbr.Lots = c(-61, -61, 6, 40, 40, -1, -1, -1, 1)), .Names = c("Product",
"Price", "Nbr.Lots"), row.names = c(NA, 9L), class = "data.frame")

Two possible solutions -

# Solution 1: stack the datasets, key each row, and drop rows whose
# key appears among the duplicated rows
m  <- rbind(reported, exportfile)
m$key <- do.call(paste, m)
m1  <- m[duplicated(m), ]
res <- m[which(!is.element(m$key, m1$key)), ]


# Solution 2: key each dataset separately and keep the rows unique to each side
reported$key   <- do.call(paste, reported)
exportfile$key <- do.call(paste, exportfile)
a   <- reported[which(!is.element(reported$key, exportfile$key)), ]
b   <- exportfile[which(!is.element(exportfile$key, reported$key)), ]
res <- rbind(a, b)

R – Recoding variables reliably and systematically

12 Nov

Survey datasets typically require a fair bit of repetitive recoding of variables. Errors in recoding can be reduced by writing functions carefully (see some tips here), and by automating and systematizing the naming and application of the recode function (which can be custom) –

fromlist <- c("var1", "var2", "var3", "var4", "var5")
tolist   <- paste(fromlist, "recoded", sep="")
data[, tolist] <- sapply(data[, fromlist],
                         function(x) car::recode(x, "recode.directions"))

Simple functions can also be applied directly to each column of a data.frame. For instance,

data[, tolist] <- !is.na(data[, fromlist])
data[, tolist] <- abs(data[, fromlist] - .5)

Working with modestly large datasets in R

2 Nov

Even modestly large (< 1 GB) datasets can quickly overwhelm modern personal computers. Working with such datasets in R can be still more frustrating because of how R uses memory. Here are a few tips on how to work with modestly large datasets in R.

Setting Memory Limits
On Windows, right-click the R shortcut and, in the Target field, set the maximum vector size as follows:

"path\to\Rgui.exe" --max-vsize=4800M

(Deprecated as of R 2.14.)

Alternately, use

utils::memory.limit(size=4800)

in your .Rprofile.

Type mem.limits() to check the maximum vector size.

Reading in CSVs
Either specify column classes manually, or get the data type for each column by reading in the first few rows – enough that the data types can be inferred correctly – and reusing the classes R assigns.

# Read the first 10 rows to get the classes
ads5    <- read.csv("data.csv", header=T, nrows=10)
classes <- sapply(ads5, class)

Specifying the number of rows in the dataset (even a number modestly greater than the actual count) can be useful.

read.csv("data.csv", header=T, nrows=N, colClasses=classes)  

Improvements in performance are not always stupendous but given the low cost of implementation, likely worthwhile.

Selective Reading
You can selectively read columns by specifying "NULL" (the string) in colClasses for the columns you don't want read.
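A minimal sketch of the idea (the file and column names here are invented):

```r
# Write a small five-column file, then read only columns 2 and 5.
# "NULL" (the string, not the R object NULL) tells read.csv to skip a column.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:3, b = letters[1:3], c = 4:6, d = 7:9,
                     e = c(1.5, 2.5, 3.5)), tmp, row.names = FALSE)
classes <- c("NULL", "character", "NULL", "NULL", "numeric")
kept <- read.csv(tmp, header = TRUE, colClasses = classes)
names(kept)  # only "b" and "e" are read in
```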
Alternately, you can rely on cut. For instance,

data <- read.table(pipe("cut -f 2,5 -d, data.csv"), sep=",", header=TRUE)

Opening Connections
Trying to read a large csv directly can end in disaster. Open a connection first to reduce memory demands.

abc <- file("data.csv")
bbc <- read.csv(abc)


library(sqldf)
f  <- file("data.csv")
Df <- sqldf("select * from f", dbname=tempfile(),
            file.format=list(header=T, row.names=F))

Problems include an inability to deal with fields that contain commas, etc.

Using Filehash

The filehash package stores objects on the hard drive rather than in memory. You can access the data either with with(), if dealing with an environment variable, or directly via dbLoad(), which mimics the functionality of attach(). Downside: it is tremendously slow.

dumpDF(read.csv("data.csv", header=T, nrows=N, colClasses=classes), dbName="db01")
ads <- db2env(db ="db01")

Selecting Columns

Use subset(data, select=columnList) rather than data[, columnList].

Reducing Errors in Survey Analysis

22 May

Analysis of survey data is hard to automate because of the immense variability across survey instruments – different variables, differently coded, and named in ways that often defy even the most fecund imagination. What often replaces complete automation is ad hoc automation – quickly coded functions, e.g. to recode a variable to lie within a particular range, applied by intelligent people frustrated by the lack of complete automation and bored by the repetitiveness of the task. Ad hoc automation attracts mistakes, for functions are often coded without rigor, and useful alerts and warnings are not included.

One way to reduce mistakes is to prevent them from happening. Carefully coded functions with robust error checking and handling, and alerts (and passive verbose outputs) that are cognizant of our own biases, and bounded attention, can reduce mistakes. Functions applied most routinely typically need the most attention.

Let’s use the example of recoding a variable to lie between 0 and 1 (in R) to illustrate how one may think about coding such a function. A few things one may want to consider:

  1. Data type: Is the variable numeric, ordinal, or categorical? Let’s say we want to constrain our function to handle only numeric variables. Some numeric variables may be coded as ‘character.’ We may want to deal with these issues seamlessly, and possibly issue warnings (or passive outputs) when improper data types are used.
  2. Range: The range the variable takes in the data may not span the entire domain. We may want to account for that, perhaps by printing out the range the variable takes and also allowing the user to input the true range.
  3. Missing values: A variety of functions we may rely on when recoding can fail (quietly) when confronted with missing values, for example, range(x). We may want to alert the user to the issue, but still handle missing values seamlessly.
  4. Displaying data: A user may not see the actual data, so we may want to show some of it by default. Efficient summaries (fivenum, mean, median, etc.) or displaying a few initial items may be useful.

A function that addresses some of the issues:

zero1 <- function(x, minx=NA, maxx=NA) {
  # Test whether x is a double, or can be coerced into a double
  stopifnot(identical(typeof(as.numeric(x)), 'double'))
  if (typeof(x) == 'character') x <- as.numeric(x)
  print(head(x))  # display the first few items
  print(paste("Range:", paste(range(x, na.rm=T), collapse=" ")))  # range the variable takes in the data
  res <- rep(NA, length(x))
  if (!is.na(minx)) res <- (x - minx)/(maxx - minx)
  if (is.na(minx))  res <- (x - min(x, na.rm=T))/(max(x, na.rm=T) - min(x, na.rm=T))
  res
}

These tips also apply to canned functions available in R (and to those writing them), and to functions in other statistical packages that routinely do not display alerts or other secondary information that may help reduce mistakes. One can always build on top of canned functions. For example, the recode function (car package) can be wrapped so that, by default, it passively displays the correlation between the recoded variable and the original variable.
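As a sketch of the idea – a hypothetical wrapper (the name and the simple linear recode are mine, not part of car) that recodes to 0–1 and passively reports the correlation with the original variable:

```r
# Hypothetical wrapper: linear 0-1 recode that passively reports
# the correlation between the recoded and the original variable
recode_verbose <- function(x, minx = min(x, na.rm = TRUE),
                              maxx = max(x, na.rm = TRUE)) {
  res <- (x - minx) / (maxx - minx)
  message("Correlation with original: ",
          round(cor(x, res, use = "pairwise.complete.obs"), 3))
  res
}
out <- recode_verbose(c(10, 20, 30, NA))  # message: Correlation with original: 1
```

A correlation far from 1 (for a monotone recode) is a cheap flag that something went wrong.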

In addition to writing better functions, one may also want to do some post hoc checking. Before we move to some ways of doing post hoc checks, a caveat – post hoc methods are only good at detecting aberrations among the variables you test, and they are costly and far from comprehensive.

  1. Using prior knowledge:
    1. Identify beforehand how some variables relate to each other. For example, education is typically correlated with political knowledge, race with partisan preferences, etc. Test these hypotheses. In some cases, these can also be diagnostic of sampling biases.
    2. Over the course of an experiment you may have hypotheses about how variables change across time. For example, constraint across attitude indices typically increases over the course of a treatment designed to produce learning. Test these priors.
  2. Characteristics of the coded variable: If using multiple datasets, check whether the number of levels of a categorical variable is the same across datasets. If not, investigate. Cross-tabulations across merged data are a quick way to diagnose problems, which can range from varying codes for missing data to missing levels.
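A minimal sketch of the cross-tabulation check (the two waves and the stray missing-data code "99" are invented for illustration):

```r
# Stack two waves and cross-tabulate a categorical variable against its
# source; a level present in only one wave (here the stray code "99")
# shows up immediately as a zero cell
wave1  <- data.frame(educ = c("HS", "BA", "MA"), src = "wave1")
wave2  <- data.frame(educ = c("HS", "BA", "99"), src = "wave2")
merged <- rbind(wave1, wave2)
table(merged$educ, merged$src)
```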

help(R): matrix indexing

19 Jan

In R, some people want to treat a matrix like a data frame. It is a bad idea. And no – one cannot use the dollar sign to pick out a column of a matrix. The underlying data structure of a matrix in R is a vector, whereas a data.frame is of type list (check with typeof(object)) and carries within it vectors, reachable using the dollar sign followed by the name of the vector (the dollar sign only accesses elements within lists).

> a <- data.frame(ab=1:2, bb=1:2)
> b <- matrix(ncol=2, nrow=2)
> colnames(b) <- c("ab", "bb")
> a$ab
[1] 1 2

> b$bb
Error in b$bb : $ operator is invalid for atomic vectors

# here's why
> dput(a)
structure(list(ab = 1:2, bb = 1:2), .Names = c("ab", "bb"), row.names = c(NA,
-2L), class = "data.frame")

> dput(b)
structure(c(NA, NA, NA, NA), .Dim = c(2L, 2L), .Dimnames = list(NULL, c("ab", "bb")))

# list
> c <- list(ab=c(1,2,3), bb=c(1,2,3))
> dput(c)
structure(list(ab = c(1, 2, 3), bb = c(1, 2, 3)), .Names = c("ab",
"bb"))

# so you can do
> c$ab
[1] 1 2 3

# for matrices, data frames, etc. one can however use
> b[, "bb"]
[1] NA NA

# or, for data frames and lists,
> a[["ab"]]
[1] 1 2
# for people wishing to do so, here's why treating matrices as data frames is a bad idea -
> a <- matrix(ncol=2, nrow=2)
> a[,1] <- 1:2
> a[,2] <- c("abc", "bbc")
> str(a[,1])
 chr [1:2] "1" "2"
> a[,1] <- as.numeric(a[,1])
> str(a[,1])
 chr [1:2] "1" "2"

# Notes on coercion
> a <- matrix(ncol=2, nrow=3)
> a$b <- 1:2
Warning message:
In a$b <- 1:2 : Coercing LHS to a list
> typeof(a)
[1] "list"