Interesting packages taken from R/Pharma
A few months ago I attended the R/Pharma conference in Cambridge, MA.
As a takeaway I thought about my own projects and how I could improve them with solutions others had presented. In R, solutions usually come as R packages. I am an R-Shiny programmer in a regulated environment, so the list below will mainly help you if you provide a) Shiny apps, b) statistical packages, or c) validated solutions. Let's go through the R packages I did not know before and now find really useful:
Packrat
What most of my co-workers produce are reports. Actually, tons of statistical reports. As we work in a regulated environment, all reports are double-checked: you program them, and someone else programs them independently, too. You do not want to waste time just because an update of a mathematical package led to a difference in a single number. There is a really nice solution for that.
Packrat allows you to store all the packages you are using for a certain session/project. The main guide for packrat can be found in the RStudio blog post describing it.
Packrat will not only store all packages, but also all project files. It is integrated into RStudio's user interface and allows you to share projects with co-workers really quickly.
The main drawback I see is the need for a server where you store all these packages. This should be solved by RStudio's new Package Manager. Another disadvantage is the incompatibility with some packages: I noticed that I could not use the BH package under R 3.4.2 with packrat and had to find a workaround for that.
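If you want an idea of the core workflow, a minimal sketch looks like this (the project path is just a placeholder):
# initialize packrat for a project: creates a private library and a lockfile
packrat::init("~/projects/my-report")
# install packages as usual; they go into the project's private library
install.packages("dplyr")
# record the exact package versions in packrat/packrat.lock
packrat::snapshot()
# on a co-worker's machine (or years later):
# reinstall exactly the versions recorded in the lockfile
packrat::restore()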
Diffdf
I have to tell you that I have wasted nearly 30% of some of my days comparing data.frames. It's an important task when testing statistical output or any calculation done with a statistical application you compiled. In pharmaceutical and diagnostic applications, one of the most relevant aspects is the validity of data: we have to ensure the validity of the data we use in clinical studies, in quality assurance, or when simply getting data from a co-worker. For me this task has not only been hard, but even harder to document.
The diffdf package by Kieran Martin really solved that task. It not only provides a neat interface, but also well-arranged output.
The basic diffdf example looks like this:
library(diffdf)
iris2 <- iris
# change the first three diagonal values to 1, 4, 9
for (i in 1:3) iris2[i, i] <- i^2
# add a column that does not exist in iris
iris2$new_var <- "hello"
# change the class of one column
class(iris2$Species) <- "some class"
diffdf(iris, iris2)
You can see that basically one column is newly introduced, three values are changed in three different numeric columns, and the type of one column is changed. All three changes are displayed in a separate output. Additionally, things that did not change are also mentioned, which can be really helpful in case you do not check for full equality of the data frames you are comparing.
Differences found between the objects!
A summary is given below.
There are columns in BASE and COMPARE with different classes !!
All rows are shown in table below
==================================
VARIABLE CLASS.BASE CLASS.COMP
----------------------------------
Species factor some class
----------------------------------
There are columns in COMPARE that are not in BASE !!
All rows are shown in table below
=========
COLUMNS
---------
new_var
---------
Not all Values Compared Equal
All rows are shown in table below
=================================
Variable No of Differences
---------------------------------
Sepal.Length 1
Sepal.Width 1
Petal.Length 1
---------------------------------
All rows are shown in table below
============================================
VARIABLE ..ROWNUMBER.. BASE COMPARE
--------------------------------------------
Sepal.Length 1 5.1 1
--------------------------------------------
All rows are shown in table below
===========================================
VARIABLE ..ROWNUMBER.. BASE COMPARE
-------------------------------------------
Sepal.Width 2 3 4
-------------------------------------------
All rows are shown in table below
============================================
VARIABLE ..ROWNUMBER.. BASE COMPARE
--------------------------------------------
Petal.Length 3 1.3 9
--------------------------------------------
The output is easily readable and covers all the information you need to do the expected: compare two data frames. What I really like is the quick feedback on how many differences were observed. In case you have a lot of differences, say you added +1 to every value of a column, you can immediately see this in the summary.
Additionally, the detailed information, giving not only the value difference but also the position of the value in the table, is a huge advantage. When analyzing large cohorts of patients, a difference can show up in measurement 99,880, and you do not want to scroll through a table of “matches” until you find this one difference. This detail view is therefore a huge advantage over other packages.
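For documentation and automated testing, you can also capture the comparison as an object and check it programmatically. A minimal sketch, assuming the iris2 example from above (the file name is made up):
# store the comparison instead of printing it
issues <- diffdf(iris, iris2)
# TRUE if any differences were found - handy inside test scripts
diffdf_has_issues(issues)
# write the full comparison report to a text file for documentation
diffdf(iris, iris2, file = "iris_comparison.txt")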
Archivist
An R package designed to improve the management of results of data analysis. Key functionalities of this package include:
(i) management of local and remote repositories which contain R objects and their meta-data (objects’ properties and relations between them);
(ii) archiving R objects to repositories;
(iii) sharing and retrieving objects (and their pedigree) by their unique hooks;
(iv) searching for objects with specific properties or relations to other objects;
(v) verification of an object’s identity and the context of its creation.
This can be really important for reproducible data analytics. In pharmacological projects you often have to reproduce cases after a really long time. The archivist package allows you to store models, data sets and whole R objects, which can also be functions or expressions, in files. You can then put these files into long-term data storage, and even after 10 years, using packrat + archivist, you’ll be able to reproduce your study.
Example for task (iii) — retrieve models
This example retrieves a list of linear models stored in a remote repository and compares them by BIC:
library(archivist)
# retrieve all objects of class "lm" from a remote GitHub repository
models <- asearch("pbiecek/graphGallery", patterns = "class:lm")
# compare the retrieved models by their BIC
modelsBIC <- sapply(models, BIC)
sort(modelsBIC)
Example for tasks (i) and (ii) — store objects locally
A data.frame comes within my repository at https://github.com/zappingseb/RPharma2018packages in the arepo folder. Your task is to create a new data.frame, store it in the arepo_new folder, and add it to the restored data.frame. If everything works out, the sum of the data.frames comes out as 2 at position (1,1).
# create a new local repository and make it the default
repo <- "arepo_new"
createLocalRepo(repoDir = repo, default = TRUE)
# store a new data.frame in the default (arepo_new) repository
df <- data.frame(x = c(1, 2), y = c(2, 3))
saveToRepo(df)
# switch to the existing arepo repository and restore
# the old data.frame by its md5 hash
setLocalRepo("arepo")
df2 <- loadFromLocalRepo("4a7369a8c51cb1e7efda0b46dad8195e", value = TRUE)
# check: position (1,1) of the sum should be 2
df_test <- df + df2
print(df_test[1, 1] == 2)
You can see in this task that my old data.frame is stored not only as a data.frame, but also under a distinct and reproducible md5 hash. This makes it incredibly easy to find things again a few years later and to show that it’s exactly the piece you needed.
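Searching a repository by object properties (task (iv)) works in a similar way. A minimal sketch, assuming the arepo_new repository from above exists:
# list the md5 hashes of all data.frames stored in the local repository
hashes <- searchInLocalRepo(pattern = "class:data.frame", repoDir = "arepo_new")
# inspect the repository contents via their tags
showLocalRepo(repoDir = "arepo_new", method = "tags")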
logR
The logR package can be used to log the steps of your analysis. In case you have a lot of steps in your analysis and need to know how long each takes, what its status was (error, warning) and what the exact call was, you can use logR to store everything that was done. logR connects to a Postgres database and logs all steps of your analysis there. I can highly recommend logR in case you’re not sure whether your analysis will go through on a second run. logR checks each one of your steps, so every failure is stored. If your next step only runs because of some environment variable that was set, you will definitely see this. Here is the basic example for logR from the author:
library(logR)
# setup connection, default to env vars: `POSTGRES_DB`, etc.
# if you have docker then: docker run --rm -p 127.0.0.1:5432:5432 -e POSTGRES_PASSWORD=postgres --name pg-logr postgres:9.5
logR_connect()
# [1] TRUE
# create logr table
logR_schema()
# make some logging and calls
logR(1+2) # OK
#[1] 3
logR(log(-1)) # warning
#[1] NaN
f = function() stop("an error")
logR(r <- f()) # stop
#NULL
g = function(n) data.frame(a=sample(letters, n, TRUE))
logR(df <- g(4)) # out rows
# a
#1 u
#2 c
#3 w
#4 p
# try CTRL+C / 'stop' button to interrupt
logR(Sys.sleep(15))
# wrapper to: dbReadTable(conn = getOption("logR.conn"), name = "logr")
logR_dump()
# logr_id logr_start expr status alert logr_end timing in_rows out_rows mail message cond_call cond_message
#1: 1 2016-02-08 16:35:00.148 1 + 2 success FALSE 2016-02-08 16:35:00.157 0.000049163 NA NA FALSE NA NA NA
#2: 2 2016-02-08 16:35:00.164 log(-1) warning TRUE 2016-02-08 16:35:00.171 0.000170801 NA NA FALSE NA log(-1) NaNs produced
#3: 3 2016-02-08 16:35:00.180 r <- f() error TRUE 2016-02-08 16:35:00.187 0.000136896 NA NA FALSE NA f() an error
#4: 4 2016-02-08 16:35:00.197 df <- g(4) success FALSE 2016-02-08 16:35:00.213 0.000696145 NA 4 FALSE NA NA NA
#5: 5 2016-02-08 16:35:00.223 Sys.sleep(15) interrupt TRUE 2016-02-08 16:35:05.434 5.202319000 NA NA FALSE NA NA NA
RInno — Shiny Apps as Windows Applications
We often build Shiny apps that need local PC settings to perform well. For example, we built a Shiny app that accesses a MySQL database via the Active Directory login of the user. To get the Active Directory credentials without a login window, we just ran the Shiny app locally. As not all users in the department knew how to run R + runApp(), RInno sounds like a great solution to me.
RInno packs your Shiny app into an .exe file that your users can run directly on their PC. This would also allow them to use fancy ggplot functionalities on locally stored Excel files. This can be really important in case data is under security protection and cannot be uploaded to a server. The tutorial given by the developer helps a lot in understanding the issue and how to solve it.
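As far as I understand it, the basic RInno workflow boils down to a few calls (the app name and directory below are placeholders):
library(RInno)
# download and install Inno Setup (needed once, Windows only)
install_inno()
# build an installer framework around the Shiny app in app_dir
create_app(app_name = "myapp", app_dir = "path/to/app")
# compile the Inno Setup script into a single setup .exe
compile_iss()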
Yardstick
A package containing all the measures you might need to evaluate the predictive power of a statistical model. I came across this package in one of the first sessions. Whenever we thought of a proper way to measure the difference between our model and the data, we discussed a lot of different approaches. Of course it’s simple to write sqrt(mean((x - y)^2)), but it looks way better in yardstick: data %>% rmse(truth, estimate). With yardstick you know where your data is coming from, and you can easily swap out the rmse function, while in the hand-written example you need to re-code the whole functionality. Yardstick will save a lot of discussions in our team in the future. I’m happy it came out.