{"id":4543,"date":"2019-12-02T16:52:08","date_gmt":"2019-12-02T14:52:08","guid":{"rendered":"https:\/\/engel-wolf.com\/?p=4543"},"modified":"2019-12-03T00:25:15","modified_gmt":"2019-12-02T22:25:15","slug":"does-using-dtplyr-make-sense-for-all-datasets","status":"publish","type":"post","link":"https:\/\/engel-wolf.com\/?p=4543","title":{"rendered":"Does using dtplyr make sense for all datasets? (UPDATED)"},"content":{"rendered":"\n<h3 class=\"wp-block-heading\">When Hadley Wickham published <a href=\"https:\/\/www.tidyverse.org\/blog\/2019\/11\/dtplyr-1-0-0\/\">dtplyr 1.0.0<\/a> last week, I was super excited. Moreover, I wanted to know, if it really is a useful package. Therefore I performed some tiny research being presented in this blog entry.<\/h3>\n\n\n\n<p>This article was updated at December 2nd 2019 due to upcoming issues:  <a href=\"https:\/\/github.com\/zappingseb\/dtplyrtest\/issues?q=is%3Aissue+is%3Aclosed\">https:\/\/github.com\/zappingseb\/dtplyrtest\/issues?q=is%3Aissue+is%3Aclosed<\/a> <br>Many thanks to:<br><a href=\"https:\/\/github.com\/stefanfritsch\">Stefan Fritsch<\/a>, <a href=\"http:\/\/www.spaetzle.de\/infos.html\">Dirk Raetzel<\/a>, <a href=\"https:\/\/github.com\/jangorecki\">Jan Gorecki<\/a> for qualified review of the article and the code.<\/p>\n\n\n\n<p>Ok. So <em>dtplyr <\/em>gives you the speed of <em>data.table<\/em> with the nice API of <em>dplyr<\/em>. This is the main feature. You will see this in the following code. The speedup of <em>dtplyr<\/em> should allow converting variables in large scale data sets by simple calls. 
So I first constructed an example call, partly taken from Hadley Wickham's <a href=\"https:\/\/www.tidyverse.org\/blog\/2019\/11\/dtplyr-1-0-0\/\">blog entry<\/a>: <\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: r; title: ; notranslate\" title=\"\">\nlibrary(dplyr)\n\nn &lt;- 100\n\n# Create a vector of random 15-letter words\nwords &lt;- vapply(1:15, FUN = function(x) {\n  paste(sample(LETTERS, 15, TRUE), collapse = &quot;&quot;)\n}, FUN.VALUE = character(1))\n\n# Create an example dataset\ndf &lt;- data.frame(\n  word = sample(words, n, TRUE),\n  letter = sample(LETTERS, n, TRUE),\n  id = 1:n,\n  numeric_int = sample(1:100, n, TRUE),\n  numeric_double = sample(seq(from = 0, to = 2, by = 1\/(10^6)), n, TRUE),\n  missing = sample(c(NA, 1:100), n, TRUE)\n)\n\n# Summarise the numeric values\ndf %&gt;%\n  filter(!is.na(missing)) %&gt;%\n  group_by(letter) %&gt;%\n  summarise(n = n(), delay = mean(numeric_int))\n<\/pre><\/div>\n\n\n<p>The call shown is plain <em>dplyr<\/em>. With <em>dtplyr<\/em>, the data first has to be wrapped in a lazy data table, and appending <code>%&gt;% as_tibble()<\/code> at the end of the <em>dplyr<\/em> pipeline triggers the evaluation. In <em>data.table<\/em> the call looks totally different.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: r; title: ; notranslate\" title=\"\">\n# dtplyr call\nds_lazy &lt;- lazy_dt(df)\nds_lazy %&gt;%\n  filter(!is.na(missing)) %&gt;%\n  group_by(letter) %&gt;%\n  summarise(n = n(), delay = mean(numeric_int)) %&gt;%\n  as_tibble()\n\n# data.table call\nDT &lt;- as.data.table(df)\nDT&#91;!is.na(missing), .(n = .N, delay = mean(numeric_int)), letter]\n<\/pre><\/div>\n\n\n<p>To evaluate the execution speed I created test datasets of different sizes, the smallest containing 100 rows and the largest 10E7 rows. Afterwards I ran the data conversion operations and logged the execution time using the <em>bench<\/em> package with 1000 iterations.  
All calculations were performed on an <a href=\"https:\/\/aws.amazon.com\/ec2\/instance-types\/t3\/\">AWS t3.2xlarge<\/a> machine hosting the <em><a href=\"https:\/\/hub.docker.com\/r\/rocker\/tidyverse\">rocker\/tidyverse<\/a><\/em> docker image. Here is the result of the calculations, plotted on a linear scale:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"665\" src=\"https:\/\/engel-wolf.com\/wp-content\/uploads\/num_overall_lin_2-1-1024x665.png\" alt=\"\" class=\"wp-image-4563\" srcset=\"https:\/\/engel-wolf.com\/wp-content\/uploads\/num_overall_lin_2-1-1024x665.png 1024w, https:\/\/engel-wolf.com\/wp-content\/uploads\/num_overall_lin_2-1-300x195.png 300w, https:\/\/engel-wolf.com\/wp-content\/uploads\/num_overall_lin_2-1-768x498.png 768w, https:\/\/engel-wolf.com\/wp-content\/uploads\/num_overall_lin_2-1-462x300.png 462w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption>The execution time of data transformations using dplyr, dtplyr and data.table, plotted on a linear scale.<\/figcaption><\/figure>\n\n\n\n<p>One can clearly see that <em>dtplyr<\/em> (ds_lazy &#8230;) and <em>data.table<\/em> (red line, calc_check_dt) scale nearly linearly with the dataset size. Both seem to be really good solutions when it comes to large datasets. The difference between the two at 10E7 rows can be explained by the low number of iterations used for benchmarking. This post is not meant to provide perfect benchmarks, but rather some quick insights into dtplyr, so a reasonable computing time was preferred.<\/p>\n\n\n\n<p>Please note that converting and copying a large <em>data.frame<\/em> into a data.table or tibble may take some time. When using <em>data.table<\/em>, all datasets should therefore be data.tables right from the start.<\/p>\n\n\n\n<p>From the linear plot you cannot tell at what point it pays off to use dtplyr instead of dplyr. 
This is shown in the following plot on a logarithmic scale:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"665\" src=\"https:\/\/engel-wolf.com\/wp-content\/uploads\/num_overall_log_2-1-1024x665.png\" alt=\"\" class=\"wp-image-4564\" srcset=\"https:\/\/engel-wolf.com\/wp-content\/uploads\/num_overall_log_2-1-1024x665.png 1024w, https:\/\/engel-wolf.com\/wp-content\/uploads\/num_overall_log_2-1-300x195.png 300w, https:\/\/engel-wolf.com\/wp-content\/uploads\/num_overall_log_2-1-768x498.png 768w, https:\/\/engel-wolf.com\/wp-content\/uploads\/num_overall_log_2-1-462x300.png 462w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption>The execution time of dtplyr, dplyr and data.table for the conversion of different datasets, plotted on a logarithmic scale.<\/figcaption><\/figure>\n\n\n\n<p>The above image clearly shows that it is worth using <em>dtplyr<\/em> over <em>dplyr<\/em> if the dataset contains more than <strong>10E4<\/strong> rows. So if you never work with datasets of this size, it might not be worth considering <em>dtplyr<\/em> at all. On the other hand, you can see that the <em>data.table<\/em> package itself is always the fastest solution.<\/p>\n\n\n\n<p>As my project deals not only with numeric transformations of data but also with character transformations, I wanted to know whether the <em>dtplyr<\/em> package could be helpful for those as well. 
The conversion I performed for character strings looks like this:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: r; title: ; notranslate\" title=\"\">\n# dplyr code\ndf %&gt;%\n  filter(letter == &quot;A&quot;) %&gt;%\n  mutate(word_new = paste0(word, numeric_int)) %&gt;%\n  select(word_new)\n\n# data.table code\nDT&#91;letter == &quot;A&quot;, .(word_new = paste0(word, numeric_int))]\n<\/pre><\/div>\n\n\n<p>The calculation results are similar to those obtained with the numeric conversions.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"665\" src=\"https:\/\/engel-wolf.com\/wp-content\/uploads\/char_overall_lin_2-1-1024x665.png\" alt=\"\" class=\"wp-image-4565\" srcset=\"https:\/\/engel-wolf.com\/wp-content\/uploads\/char_overall_lin_2-1-1024x665.png 1024w, https:\/\/engel-wolf.com\/wp-content\/uploads\/char_overall_lin_2-1-300x195.png 300w, https:\/\/engel-wolf.com\/wp-content\/uploads\/char_overall_lin_2-1-768x498.png 768w, https:\/\/engel-wolf.com\/wp-content\/uploads\/char_overall_lin_2-1-462x300.png 462w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>In this plot the difference between <em>dplyr<\/em> and <em>data.table<\/em> is barely visible. 
All methods show computing times that grow linearly with the dataset size.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"665\" src=\"https:\/\/engel-wolf.com\/wp-content\/uploads\/char_overall_log_2-1-1024x665.png\" alt=\"\" class=\"wp-image-4566\" srcset=\"https:\/\/engel-wolf.com\/wp-content\/uploads\/char_overall_log_2-1-1024x665.png 1024w, https:\/\/engel-wolf.com\/wp-content\/uploads\/char_overall_log_2-1-300x195.png 300w, https:\/\/engel-wolf.com\/wp-content\/uploads\/char_overall_log_2-1-768x498.png 768w, https:\/\/engel-wolf.com\/wp-content\/uploads\/char_overall_log_2-1-462x300.png 462w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption>The execution time of large-scale character conversions with dtplyr, dplyr and data.table.<\/figcaption><\/figure>\n\n\n\n<p>For character transformations the performance of <em>dplyr<\/em> is quite good. Using <em>dtplyr<\/em> only pays off for datasets with more than <strong>10E5<\/strong> rows. For the largest dataset tested here, with 10E7 rows, <em>dtplyr<\/em> is twice as fast as <em>dplyr<\/em>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Conclusion<\/h3>\n\n\n\n<p>The code you need to write to use <em>dtplyr<\/em> is really simple, especially if you are a <em>dplyr<\/em> native like me and do not want to switch to <em>data.table<\/em>. But there are many use cases where the <em>dtplyr<\/em> package is not necessary or not even useful. So before you start writing fancy <em>dtplyr<\/em> pipelines, think about whether you really need it. 
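<\/p>\n\n\n\n<p>To make the benchmarking setup concrete, here is a rough sketch of how such a comparison can be run with the <em>bench<\/em> package. The iteration count matches the one used above; the exact call is my assumption and not taken from the project repository:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: r; title: ; notranslate\" title=\"\">\nlibrary(dplyr)\nlibrary(dtplyr)\nlibrary(data.table)\n\n# Illustrative benchmark of the three approaches on one dataset (df from above)\nDT &lt;- as.data.table(df)\nds_lazy &lt;- lazy_dt(df)\n\nbench::mark(\n  dplyr = df %&gt;%\n    filter(!is.na(missing)) %&gt;%\n    group_by(letter) %&gt;%\n    summarise(n = n(), delay = mean(numeric_int)),\n  dtplyr = ds_lazy %&gt;%\n    filter(!is.na(missing)) %&gt;%\n    group_by(letter) %&gt;%\n    summarise(n = n(), delay = mean(numeric_int)) %&gt;%\n    as_tibble(),\n  data.table = DT&#91;!is.na(missing), .(n = .N, delay = mean(numeric_int)), letter],\n  iterations = 1000,\n  check = FALSE  # results differ in class (tibble vs. data.table)\n)\n<\/pre><\/div>\n\n\n\n<p>The <code>median<\/code> column of the tibble returned by <code>bench::mark()<\/code> can then be compared across dataset sizes.<\/p>\n\n\n\n<p>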
<\/p>\n\n\n\n<p>All code for this project is available at:  <a href=\"https:\/\/github.com\/zappingseb\/dtplyrtest\">https:\/\/github.com\/zappingseb\/dtplyrtest<\/a> &#8211; Please see updates from <a href=\"https:\/\/github.com\/zappingseb\/dtplyrtest\/commit\/96503599fc3f8548d94344235b206f9b13ef6e98\">last commit<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>When Hadley Wickham published dtplyr 1.0.0 last week, I was super excited. Moreover, I wanted to know, if it really is a useful package. Therefore I performed some tiny research [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":4566,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"ngg_post_thumbnail":0,"footnotes":""},"categories":[1],"tags":[424,441,440,438,439,391,384,381],"_links":{"self":[{"href":"https:\/\/engel-wolf.com\/index.php?rest_route=\/wp\/v2\/posts\/4543"}],"collection":[{"href":"https:\/\/engel-wolf.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/engel-wolf.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/engel-wolf.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/engel-wolf.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=4543"}],"version-history":[{"count":8,"href":"https:\/\/engel-wolf.com\/index.php?rest_route=\/wp\/v2\/posts\/4543\/revisions"}],"predecessor-version":[{"id":4569,"href":"https:\/\/engel-wolf.com\/index.php?rest_route=\/wp\/v2\/posts\/4543\/revisions\/4569"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/engel-wolf.com\/index.php?rest_route=\/wp\/v2\/media\/4566"}],"wp:attachment":[{"href":"https:\/\/engel-wolf.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=4543"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/engel-wolf.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=4543"},{"taxonomy":"post_tag","embeddable":true,"href"
:"https:\/\/engel-wolf.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=4543"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}