`rqdatatable`

is an implementation of the `rquery`

piped
Codd-style relational algebra hosted on `data.table`

.
`rquery`

allow the expression of complex transformations as a
series of relational operators and `rqdatatable`

implements
the operators using `data.table`

.

A `Python`

version of
`rquery`

/`rqdatatable`

is under initial
development as `data_algebra`

.

For example scoring a logistic regression model (which requires grouping, ordering, and ranking) is organized as follows. For more on this example please see “Let’s Have Some Sympathy For The Part-time R User”.

`library("rqdatatable")`

```
## Loading required package: wrapr
## Loading required package: rquery
```

```
# data example
<- build_frame(
dL "subjectID", "surveyCategory" , "assessmentTotal" |
1 , "withdrawal behavior", 5 |
1 , "positive re-framing", 2 |
2 , "withdrawal behavior", 3 |
2 , "positive re-framing", 4 )
```

```
<- 0.237
scale
# example rquery pipeline
<- local_td(dL) %.>%
rquery_pipeline extend_nse(.,
probability :=
exp(assessmentTotal * scale)) %.>%
normalize_cols(.,
"probability",
partitionby = 'subjectID') %.>%
pick_top_k(.,
k = 1,
partitionby = 'subjectID',
orderby = c('probability', 'surveyCategory'),
reverse = c('probability', 'surveyCategory')) %.>%
rename_columns(., c('diagnosis' = 'surveyCategory')) %.>%
select_columns(., c('subjectID',
'diagnosis',
'probability')) %.>%
orderby(., cols = 'subjectID')
```

We can show the expanded form of query tree.

`cat(format(rquery_pipeline))`

```
mk_td("dL", c(
"subjectID",
"surveyCategory",
"assessmentTotal")) %.>%
extend(.,
probability := exp(assessmentTotal * 0.237)) %.>%
extend(.,
probability := probability / sum(probability),
partitionby = c('subjectID'),
orderby = c(),
reverse = c()) %.>%
extend(.,
row_number := row_number(),
partitionby = c('subjectID'),
orderby = c('probability', 'surveyCategory'),
reverse = c('probability', 'surveyCategory')) %.>%
select_rows(.,
row_number <= 1) %.>%
rename_columns(.,
c('diagnosis' = 'surveyCategory')) %.>%
select_columns(.,
c('subjectID', 'diagnosis', 'probability')) %.>%
order_rows(.,
c('subjectID'),
reverse = c(),
limit = NULL)
```

And execute it using `data.table`

.

`ex_data_table(rquery_pipeline)`

```
## subjectID diagnosis probability
## 1 1 withdrawal behavior 0.6706221
## 2 2 positive re-framing 0.5589742
```

One can also apply the pipeline to new tables.

```
build_frame(
"subjectID", "surveyCategory" , "assessmentTotal" |
7 , "withdrawal behavior", 5 |
7 , "positive re-framing", 20 ) %.>%
rquery_pipeline
```

```
## subjectID diagnosis probability
## 1 7 positive re-framing 0.9722128
```

Initial bench-marking of `rqdatatable`

is very favorable
(notes here).

To install `rqdatatable`

please use
`install.packages("rqdatatable")`

.

Some related work includes:

`data.table`

`Polars`

`data algebra`

`disk.frame`

`dbplyr`

`dplyr`

`dtplyr`

`maditr`

`nc`

`poorman`

`rquery`

`SparkR`

`sparklyr`

`sqldf`

`table.express`

`tidyfast`

`tidyfst`

`tidyquery`

`tidyr`

`tidytable`

(formerly`gdt`

/`tidydt`

)

–

Note `rqdatatable`

has an “immediate mode” which allows
direct application of pipelines stages without pre-assembling the
pipeline. “Immediate mode” is a convenience for ad-hoc analyses, and has
some negative performance impact, so we encourage users to build
pipelines for most work. Some notes on the issue can be found here.

`rqdatatable`

implements the `rquery`

grammar
in the style of a “Turing or Cook reduction” (implementing the result in
terms of multiple oracle calls to the related system).

`rqdatatable`

is intended for “simple column names”, in
particular as `rqdatatable`

often uses `eval()`

to
work over `data.table`

escape characters such as
“`\`

” and “`\\`

” are not reliable in column names.
Also `rqdatatable`

does not support tables with no
columns.