Category Archives: R

R-Tip: Missing weights, weighted.mean

There are instances where sampling weights are not only unknown but, in fact, cannot be known (unless one makes certain unsavory assumptions). Under those circumstances, weights for certain respondents can be ‘missing’. Typically, the strategy is to code those weights as 0. However, if you retain them as NA, weighted.mean() and its kin are wont to give you NA as an answer, even if you set na.rm=T.

To get non-NA answers, either set the NAs to zero or estimate the mean over respondents with non-missing weights.
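
For instance (na.rm=T drops missing values of x, but not missing weights) –

x <- c(10, 20, 30)
w <- c(1, NA, 2)

weighted.mean(x, w, na.rm=T) # NA, despite na.rm=T

# Option 1: code missing weights as 0
w0 <- ifelse(is.na(w), 0, w)
weighted.mean(x, w0) # 23.33

# Option 2: estimate over respondents with non-missing weights
ok <- !is.na(w)
weighted.mean(x[ok], w[ok]) # 23.33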

Comparing datasets, reporting only non-duplicated rows

The following is in response to a question on the R-help list.

Consider two datasets –

reported <-
structure(list(Product = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
3L, 4L, 5L, 5L), .Label = c("Cocoa", "Coffee C", "GC", "Sugar No 11",
"ZS"), class = "factor"), Price = c(2331, 2356, 2440, 2450, 204.55,
205.45, 17792, 24.81, 1273.5, 1276.25), Nbr.Lots = c(-61L, -61L,
5L, 1L, 40L, 40L, -1L, -1L, -1L, 1L)), .Names = c("Product",
"Price", "Nbr.Lots"), row.names = c(1L, 2L, 3L, 4L, 6L, 7L, 5L,
10L, 8L, 9L), class = "data.frame")

exportfile <-
structure(list(Product = c("Cocoa", "Cocoa", "Cocoa", "Coffee C",
"Coffee C", "GC", "Sugar No 11", "ZS", "ZS"), Price = c(2331,
2356, 2440, 204.55, 205.45, 17792, 24.81, 1273.5, 1276.25), Nbr.Lots = c(-61,
-61, 6, 40, 40, -1, -1, -1, 1)), .Names = c("Product", "Price",
"Nbr.Lots"), row.names = c(NA, 9L), class = "data.frame")

Two possible solutions –
A.
m <- rbind(reported, exportfile)
m$key <- do.call(paste, m)
m1 <- m[duplicated(m$key),]
res <- m[is.na(match(m$key, m1$key)),]

B.

exportfile$key <- do.call(paste, exportfile)
reported$key <- do.call(paste, reported)
a <- reported[is.na(match(reported$key, exportfile$key)),]
b <- exportfile[is.na(match(exportfile$key, reported$key)),]
res <- rbind(a, b)

R – Recoding variables reliably and systematically

Survey datasets typically require a fair bit of repetitive recoding of variables. Errors in recoding can be reduced by writing functions carefully (see some tips here), and by systematizing the naming of variables and the application of the recode function (which can be custom) –

fromlist <- c("var1", "var2", "var3", "var4", "var5")
tolist <- paste(fromlist, "recoded", sep="")
data[,tolist] <- sapply(data[,fromlist], function(x) car::recode(x, "recode.directions"))

Simple functions can also be directly applied to each column of a data.frame. For instance,
data[,tolist] <- !is.na(data[,fromlist])
data[,tolist] <- abs(data[,fromlist] - .5)
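
For instance, with made-up 1-to-5 items and recode directions collapsing them into three levels (assumes the car package is installed) –

library(car)

data <- data.frame(var1=sample(1:5, 20, replace=T), var2=sample(1:5, 20, replace=T))
fromlist <- c("var1", "var2")
tolist <- paste(fromlist, "recoded", sep="")

# collapse 1-5 scales to low/mid/high
data[,tolist] <- sapply(data[,fromlist], function(x) car::recode(x, "1:2='low'; 3='mid'; 4:5='high'"))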

Working with modestly large datasets in R

Even modestly large (< 1 GB) datasets can quickly overwhelm modern personal computers. Working with such datasets in R can be still more frustrating because of how R uses memory. Here are a few tips on how to work with modestly large datasets in R.

Setting Memory Limits
On Windows, right click R and in the Target field set the maximum vector size and memory size as follows –
"path\to\Rgui.exe" --max-vsize=4800M (deprecated as of R 2.14).

Alternately, use utils::memory.limit(size=4800) in .Rprofile.

Type in mem.limits() to check the maximum vector size.

Reading in CSVs
Either specify column classes manually, or get the data type of each column by reading in the first few rows – enough that the data types can be inferred correctly – and reusing the classes R assigns.

# Read the first 10 rows to get the classes
ads5 <- read.csv("data.csv", header=T, nrows=10)
classes <- sapply(ads5, class)

Specifying the number of rows in the dataset (even a modestly greater number than is actually there) can also be useful.
read.csv("data.csv", header=T, nrows=N, colClasses=classes)

Improvements in performance are not always stupendous but given the low cost of implementation, likely worthwhile.

Selective Reading
You can selectively read columns by specifying colClasses="NULL" for the columns you don’t want read.
Alternately, you can rely on cut. For instance,
data <- read.table(pipe("cut -f 2,5 -d, data.csv"), sep=",", header=T)
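
A sketch of the colClasses route, assuming a four-column csv of which only the first and third columns are wanted –

# "NULL" skips a column entirely
classes <- c("numeric", "NULL", "character", "NULL")
data <- read.csv("data.csv", header=T, colClasses=classes)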

Opening Connections
Trying to read a csv directly can end in disaster. Open a connection first to reduce memory demands.
abc <- file("data.csv")
bbc <- read.csv(abc)

Using SQLDF
library(sqldf)
f <- file("data.csv")
Df <- sqldf("select * from f", dbname=tempfile(), file.format=list(header=T, row.names=F))
Problems include an inability to deal with fields that contain commas, etc.

Using Filehash

The filehash package stores data on the hard drive. You can access the data either using with(), if dealing with an environment variable, or directly via dbLoad(), which mimics the functionality of attach(). Downside: it is tremendously slow.

library(filehash)
dumpDF(read.csv("data.csv", header=T, nrows=N, colClasses=classes), dbName="db01")
ads <- db2env(db="db01")
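
Usage then looks something like this (someVar is a hypothetical column in data.csv) –

# compute on the disk-backed data without loading all of it into memory
with(ads, mean(someVar, na.rm=T))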

Selecting Columns
Use subset(data, select=columnList) rather than data[, columnList].
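
For instance, with hypothetical columns var1 and var2 –

smalldata <- subset(data, select=c(var1, var2))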

Reducing Errors in Survey Analysis

Analysis of survey data is hard to automate because of the immense variability across survey instruments – different variables, coded differently, and named in ways that often defy even the most fecund imagination. What often replaces complete automation is ad hoc automation – quickly coded functions (to recode a variable to lie within a particular range, say) applied by intelligent people frustrated by the lack of complete automation and bored by the repetitiveness of the task. Ad hoc automation attracts mistakes, for the functions are often coded without rigor, and useful alerts and warnings are not included.

One way to reduce mistakes is to prevent them from happening. Carefully coded functions with robust error checking and handling, and alerts (and passive verbose outputs) that are cognizant of our own biases, and bounded attention, can reduce mistakes. Functions applied most routinely typically need the most attention.

Let’s use the example of recoding a variable to lie between 0 and 1 (in R) to illustrate how one may think about coding such a function. A few things one may want to consider –

  1. Data type: Is the variable numeric, ordinal, or categorical? Let’s say we want to constrain our function to handling only numeric variables. Some numeric variables may be coded as ‘character’. We may want to deal with these issues seamlessly, and possibly issue warnings (or passive outputs) when improper data types are used.
  2. Range: The range the variable takes in the data may not span its entire domain. We may want to account for that, perhaps by printing out the range the variable takes in the data while also allowing the user to input the true range.
  3. Missing Values: A variety of functions we may rely on when recoding can ‘fail’ (quietly) when fronted with missing values – range(x), for example. We may want to alert the user to the issue, but still handle missing values seamlessly.
  4. Display: A user may not see the actual data, so we may want to show the user some of the data by default. Efficient summaries of the data (fivenum, mean, median, etc.) or displaying a few initial items may be useful.

A function that addresses some of the issues:

zero1 <- function(x, minx=NA, maxx=NA){
  # Coerce character input to numeric (as.numeric() warns if coercion fails),
  # then stop unless x is numeric
  if (is.character(x)) x <- as.numeric(x)
  stopifnot(is.numeric(x))
  print(head(x)) # display the first few items
  print(paste("Range:", paste(range(x, na.rm=T), collapse=" "))) # the range the variable takes in the data
  res <- rep(NA, length(x))
  if (!is.na(minx)) res <- (x - minx)/(maxx - minx)
  if (is.na(minx)) res <- (x - min(x, na.rm=T))/(max(x, na.rm=T) - min(x, na.rm=T))
  res
}
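
For instance, applying it to a short vector with a missing value –

> zero1(c(2, 4, 6, NA))
[1]  2  4  6 NA
[1] "Range: 2 6"
[1] 0.0 0.5 1.0  NA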

These tips also apply to canned functions available in R (and to those writing them), and to functions in other statistical packages that routinely do not display alerts or other secondary information that may help reduce mistakes. One can always build on top of canned functions. For example, the recode function (car package) can be wrapped so that it passively displays the correlation between the recoded variable and the original variable by default.
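
A minimal sketch of such a wrapper – the name recode2 is made up, and it assumes the recode directions yield numeric output –

library(car)

recode2 <- function(x, directions){
  res <- car::recode(x, directions)
  # passive alert: how closely does the recoded variable track the original?
  print(paste("Correlation:", round(cor(as.numeric(x), as.numeric(res), use="pairwise.complete.obs"), 2)))
  res
}

# e.g., reversing a 1-to-5 scale should yield a correlation of -1
recode2(c(1, 5, 3, 2), "1=5; 2=4; 4=2; 5=1")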

In addition to writing better functions, one may also want to do some post-hoc checking. Before we move to ways of doing post-hoc checks, a caveat – post-hoc methods are only good at detecting aberrations among the variables you test, and they are costly and far from comprehensive.

  1. Using prior knowledge:
    1. Identify beforehand how some variables relate to each other. For example, education is typically correlated with political knowledge, race with partisan preferences, etc. Test these ‘hypotheses’. In some cases, these can also be diagnostic of sampling biases.
    2. In an experiment, you may have hypotheses about how variables change over time. For example, ‘constraint’ typically increases across attitude indices over the course of a treatment designed to produce learning. Test these priors.
  2. Characteristics of the coded variable: If using multiple datasets, check whether the number of ‘levels’ of a categorical variable is the same across datasets. If not, investigate. Cross-tabulations across merged data are a quick way to diagnose problems, which can range from varying codes for missing data to missing ‘levels’ (see the sketch below).
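
A sketch of such a check, with two made-up datasets carrying the same categorical variable –

d1 <- data.frame(party=factor(c("Dem", "Rep", "Ind")))
d2 <- data.frame(party=factor(c("Dem", "Rep", NA)))

# are the 'levels' the same across datasets?
setdiff(levels(d1$party), levels(d2$party)) # "Ind"

# cross-tabulate after merging, showing missing data explicitly
d1$source <- "d1"; d2$source <- "d2"
m <- rbind(d1, d2)
table(m$source, m$party, useNA="ifany")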

help(R): matrix indexing

In R, some want to treat a matrix like a data frame. It is a bad idea. And no – one cannot use the dollar sign to pick out a column of a matrix. The underlying data structure for a matrix in R is a vector, whereas a data.frame is of type list (check with typeof(object)) and carries within it vectors, reachable using the dollar sign followed by the name of the vector (a way to access elements within lists).

> a <- data.frame(ab=1:2, bb=1:2)
> b <- matrix(ncol=2, nrow=2)
> colnames(b) <- c("ab", "bb")

> a$ab
[1] 1 2

> b$bb
Error in b$bb : $ operator is invalid for atomic vectors

# here's why
> dput(a)
structure(list(ab = 1:2, bb = 1:2), .Names = c("ab", "bb"), row.names = c(NA,
-2L), class = "data.frame")

> dput(b)
structure(c(NA, NA, NA, NA), .Dim = c(2L, 2L), .Dimnames = list(NULL, c("ab", "bb")))

# list
> c <- list(ab=c(1,2,3), bb=c(1,2,3))

> dput(c)
structure(list(ab = c(1, 2, 3), bb = c(1, 2, 3)), .Names = c("ab",
"bb"))

# so you can do
c$ab
[1] 1 2 3

# for matrices, data frames, etc. one can however use
a[,"ab"]
b[,"ab"]

# or
b[,2]

# for people wishing to do so, here's why treating matrices as data frames is a bad idea -
> a <- matrix(ncol=2, nrow=2)
> a[,1] <- 1:2
> a[,2] <- c("abc", "bbc")
> str(a[,1])
chr [1:2] "1" "2"
> a[,1] <- as.numeric(a[,1])
> str(a[,1])
chr [1:2] "1" "2"

# Notes on coercion
> a <- matrix(ncol=2, nrow=3)
> a$b <- 1:2
Warning message:
In a$b <- 1:2 : Coercing LHS to a list
> typeof(a)
[1] "list"