Wednesday, 29 January 2014

R: hard things are easy and easy things are hard

Statisticians looooove R, and I guess I can see why. For them, it makes hard things easy. For instance, multidimensional scaling:

Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset. It refers to a set of related ordination techniques used in information visualization, in particular to display the information contained in a distance matrix. An MDS algorithm aims to place each object in N-dimensional space such that the between-object distances are preserved as well as possible. Each object is then assigned coordinates in each of the N dimensions. The number of dimensions of an MDS plot N can exceed 2 and is specified a priori. Choosing N=2 optimizes the object locations for a two-dimensional scatterplot.[1]
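To make that definition a bit more concrete, here is a rough sketch of one member of the MDS family, classical (Torgerson) MDS, written in Python with NumPy. The function name `classical_mds` and the toy points are made up for illustration, and this is not a claim about how any particular package implements its MDS plot.

```python
import numpy as np


def classical_mds(dist, n_dims=2):
    """Place objects in n_dims dimensions from a square distance matrix.

    Hypothetical helper: classical (Torgerson) MDS, one of several MDS variants.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-centre the squared distances to get a Gram (inner product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the largest n_dims.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]
    # Clip tiny negative eigenvalues caused by rounding or non-Euclidean input.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Toy data: four points on a rectangle, reduced to their pairwise distances.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
coords = classical_mds(dist, n_dims=2)
print(coords)
```

The recovered coordinates can come out rotated or mirrored relative to the originals; only the between-object distances are meant to be preserved.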

In edgeR, this is:

plotMDS(y)

Pretty cool! And you'll probably want to run that once per experiment to visualise your samples and make sure they cluster as expected. But do you know what I do more often than visualising clustering in multi-dimensional space? Concatenating strings! Which is:

a = paste(a, b, sep='')

Which is longer to type! How about sorting a dataframe by column? How about:

dd = dd[with(dd, order(col)), ]

The above in Python / Pandas is:

a += b
df = df.sort(col)
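A note for anyone running this on a newer pandas: `DataFrame.sort` was removed in later releases in favour of `sort_values`. Here is a minimal runnable sketch of the same two operations, assuming a toy dataframe whose column names (`col`, `name`) are invented for the example:

```python
import pandas as pd

# String concatenation with plain Python operators.
a = "foo"
b = "bar"
a += b

# Toy dataframe; the column names are made up for illustration.
df = pd.DataFrame({"col": [3, 1, 2], "name": ["c", "a", "b"]})

# Sort rows by a column; sort_values is the newer spelling of df.sort(col).
df = df.sort_values("col")
print(a)
print(df)
```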

I totally agree with John D. Cook:

I’d rather do math in a general-purpose language than try to do general-purpose programming in a math language.

That and I gag a little whenever I type a '.' inside a variable name. Let's drag the statisticians towards programming languages rather than letting them drag us into R!

4 comments:

  1. I get your point, but your R examples could be written more succinctly:

    `paste0(a, b)` instead of `paste(a, b, sep='')`

    `arrange(dd, desc(col), b)` instead of `dd[with(dd, order(-col, b)), ]`

  2. Of course, if you do a lot of string pasting with the same options, then write a new function that has sep='' as the default, and write a new ordering function that works the way *you* think it should. No one is stopping you. Don't complain about how something works, instead make it better if it is useful. I would use Hadley Wickham as an example. He takes things that he does a lot (string manipulation, working with dates, cleaning data are some examples) and makes them better (stringr, lubridate, dplyr are matching packages).

    Pandas is built on top of Python. edgeR is built on top of R. You can write your own functions to do things in a way that is easiest for you.

  3. Thanks Mike, I simplified the examples.

    As for paste0, I think that's a really great example of why I don't like R: it seems the solution to everything is "import a function into global scope".

    Almost every language higher than C has the '+' string concatenation syntactic sugar. Sugar matters when you spend all day doing these things. And to compare R vs Pandas dataframes: in Pandas, dataframe operations are object methods, which is just a lot nicer to use.

  4. aRrgh: a newcomer's (angry) guide to R "R.... suffers from ... decades .. of stupid hacks from a community containing, to a first-order approximation, zero software engineers. "

    R The Master Troll of statistical languages
