- `time_hours <- function(mins) mins / 60` worked, but `time_hours_rounded <- function(mins) round(mins / 60)` did not; now both work. These are automatic translations rather than true user-defined functions (UDFs); for UDFs, see `register_scalar_function()` (see the sketch after this list). (#41223)
- `mutate()` expressions can now include aggregations, such as `x - mean(x)`. (#41350)
- `summarize()` supports more complex expressions and correctly handles cases where column names are reused in expressions.
- The `na_matches` argument to the `dplyr::*_join()` functions is now supported. This argument controls whether `NA` values are considered equal when joining. (#41358)
- Calling `pull` on grouped datasets now returns the expected column. (#43172)
- Bindings for `base::prod` have been added so you can use it in your dplyr pipelines (i.e., `tbl |> summarize(prod(col))`) without having to pull the data into R (@m-muecke, #38601).
- Calling `dimnames` or `colnames` on `Dataset` objects now returns a useful result rather than just `NULL` (#38377).
- The `code()` method on Schema objects now takes an optional `namespace` argument which, when `TRUE`, prefixes names with `arrow::`, making the output more portable (@orgadish, #38144).
- `SystemRequirements` (#39602).
- Improved handling when `sub`, `gsub`, `stringr::str_replace`, and `stringr::str_replace_all` are passed a length > 1 vector of values in `pattern` (@abfleishman, #39219).
- Added documentation to `?open_dataset` describing how to use the ND-JSON support added in arrow 13.0.0 (@Divyansh200102, #38258).
- When working with S3 (`s3_bucket`, `S3FileSystem`), the debug log level for S3 can be set with the `AWS_S3_LOG_LEVEL` environment variable. See `?S3FileSystem` for more information. (#38267)
- Using arrow together with duckdb (i.e., `to_duckdb()`) no longer results in warnings when quitting your R session. (#38495)
- Set `LIBARROW_BINARY=true` for the old behavior (#39861).
- The R package can be built against a mismatched Arrow C++ version (opt in with `ARROW_R_ALLOW_CPP_VERSION_MISMATCH=true`) and requires at least Arrow C++ 13.0.0 (#39739).
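To illustrate the UDF route mentioned in the first item, here is a rough sketch using `register_scalar_function()`. The function name, schema, and column names are invented for the example, and it assumes the documented calling convention in which the registered function receives a kernel context as its first argument; treat it as a sketch rather than the definitive API.

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Register a scalar UDF that the query engine can call.
# auto_convert = TRUE converts Arrow inputs to R vectors before calling the function.
register_scalar_function(
  "time_hours_rounded_udf",                      # hypothetical name
  function(context, mins) round(mins / 60),      # first argument is the kernel context
  in_type = schema(mins = float64()),
  out_type = float64(),
  auto_convert = TRUE
)

arrow_table(mins = c(30, 95, 125)) |>
  mutate(hours = time_hours_rounded_udf(mins)) |>
  collect()
```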
- When opening a dataset with `open_dataset()`, the partition variables are now included in the resulting dataset (#37658).
- `write_csv_dataset()` now wraps `write_dataset()` and mirrors the syntax of `write_csv_arrow()` (@dgreiss, #36436).
- `open_delim_dataset()` now accepts a `quoted_na` argument to allow empty strings to be parsed as NA values (#37828).
- `schema()` can now be called on `data.frame` objects to retrieve their inferred Arrow schema (#37843); see the sketch below.
- `read_csv2_arrow()` (#38002).
- Documentation for `CsvParseOptions` object creation now contains more information about default values (@angela-li, #37909).
- Pattern modifiers (`fixed()`, `regex()`, etc.) now allow variables to be reliably used in their arguments (#36784).
- Newly exposed `ParquetReaderProperties` allow users to work with Parquet files with unusually large metadata (#36992).
- Error messages for `add_filename()` are improved (@amoeba, #37372).
- `create_package_with_all_dependencies()` now properly escapes paths on Windows (#37226).
- Objects that inherit only from `data.frame` and no other classes now have the `class` attribute dropped, so file reading functions and `arrow_table()` always return tibbles, making the type of returned objects consistent. Calling `as.data.frame()` on Arrow Tabular objects now always returns a `data.frame` object (#34775).
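A minimal sketch of calling `schema()` on a `data.frame`, as noted above; the data frame is invented and the printed types are only what one would typically expect (exact output may differ by version):

```r
library(arrow)

df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# Inspect the Arrow schema that would be inferred for this data.frame
schema(df)
#> typically shows x as int32 and y as string
```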
- `open_dataset()` now works with ND-JSON files (#35055).
- Calling `schema()` on multiple Arrow objects now returns the object's schema (#35543).
- The `.by`/`by` argument is now supported in the arrow implementation of dplyr verbs (@eitsupi, #35667).
- `dplyr::case_when()` now accepts a `.default` parameter to match the update in dplyr 1.1.0 (#35502).
- `arrow_array()` can be used to create Arrow Arrays (#36381).
- `scalar()` can be used to create Arrow Scalars (#36265).
- Prevented a crash when calling `RecordBatchReader::ReadNext()` from DuckDB from the main R thread (#36307).
- Calling `set_io_thread_count()` with `num_threads` < 2 now warns (#36304).
- `strptime()` in arrow will return a timezone-aware timestamp if `%z` is part of the format string (#35671).
- Combining `group_by()` and `across()` now matches dplyr (@eitsupi, #35473).
- The `read_parquet()` and `read_feather()` functions can now accept URL arguments (#33287, #34708).
- The `json_credentials` argument in `GcsFileSystem$create()` now accepts a file path containing the appropriate authentication token (@amoeba, #34421, #34524).
- The `$options` member of `GcsFileSystem` objects can now be inspected (@amoeba, #34422, #34477).
- The `read_csv_arrow()` and `read_json_arrow()` functions now accept literal text input wrapped in `I()` to improve compatibility with `readr::read_csv()` (@eitsupi, #18487, #33968); see the sketch below.
- `$` and `[[` can now be used in dplyr expressions (#18818, #19706).
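A minimal sketch of the `I()` literal-input support mentioned above; the CSV text is invented for the example:

```r
library(arrow)

csv_text <- "x,y\n1,a\n2,b"

# Wrap literal text in I() so it is treated as data rather than a file path
read_csv_arrow(I(csv_text))
```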
- The query engine now uses `FetchNode` and `OrderByNode` to improve performance and simplify building query plans from dplyr expressions (#34437, #34685).
- `arrow_table()` (#35038, #35039).
- Converting a `data.frame` with `NULL` column names to a `Table` now works (#15247, #34798).
- The `open_csv_dataset()` family of functions (#33998, #34710).
- The `dplyr::n()` function is now mapped to the `count_all` kernel to improve performance and simplify the R implementation (#33892, #33917).
- Improved the `s3_bucket()` filesystem helper with `endpoint_override` and fixed surprising behaviour that occurred when passing some combinations of arguments (@cboettig, #33904, #34009).
- No longer errors when a `schema` is supplied and `col_names = TRUE` in `open_csv_dataset()` (#34217, #34092).
- `open_csv_dataset()` allows a schema to be specified. (#34217)
- `dplyr:::check_names()` (#34369).
- `map_batches()` is lazy by default; it now returns a `RecordBatchReader` instead of a list of `RecordBatch` objects unless `lazy = FALSE`. (#14521)
- `open_csv_dataset()`, `open_tsv_dataset()`, and `open_delim_dataset()` all wrap `open_dataset()`; they don't provide new functionality, but allow for readr-style options to be supplied, making it simpler to switch between individual file-reading and dataset functionality. (#33614)
- A `col_names` parameter allows specification of column names when opening a CSV dataset. (@wjones127, #14705)
- The `parse_options`, `read_options`, and `convert_options` parameters for reading individual files (`read_*_arrow()` functions) and datasets (`open_dataset()` and the new `open_*_dataset()` functions) can be passed in as lists. (#15270)
- `read_csv_arrow()`. (#14930)
- `join_by()` has been implemented for dplyr joins on Arrow objects (equality conditions only); see the sketch after this list. (#33664)
- `dplyr::group_by()`/`dplyr::summarise()` calls are used. (#14905)
- `dplyr::summarize()` works with division when the divisor is a variable. (#14933)
- `dplyr::right_join()` correctly coalesces keys. (#15077)
- `lubridate::with_tz()` and `lubridate::force_tz()` (@eitsupi, #14093)
- `stringr::str_remove()` and `stringr::str_remove_all()` (#14644)
- `POSIXlt` objects. (#15277)
- `Array$create()` can create Decimal arrays. (#15211)
- `StructArray$create()` can be used to create `StructArray` objects. (#14922)
- `lubridate::as_datetime()` on Arrow objects can handle time in sub-seconds. (@eitsupi, #13890)
- `head()` can be called after `as_record_batch_reader()`. (#14518)
- `as.Date()` can go from `timestamp[us]` to `timestamp[s]`. (#14935)
- `check_dots_empty()`. (@daattali, #14744)
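A minimal sketch of the `join_by()` support noted above (equality conditions only); the tables and key names are invented for illustration:

```r
library(arrow)
library(dplyr)

orders    <- arrow_table(id = c(1L, 2L, 3L), total = c(10, 20, 30))
customers <- arrow_table(id = c(1L, 2L), name = c("a", "b"))

orders %>%
  left_join(customers, by = join_by(id)) %>%   # equality join on `id`
  collect()
```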
Minor improvements and fixes:

- The `.data` pronoun in `dplyr::group_by()` (#14484)

Several new functions can be used in queries:

- `dplyr::across()` can be used to apply the same computation across multiple columns, and the `where()` selection helper is supported in `across()`.
- `add_filename()` can be used to get the filename a row came from (only available when querying a `Dataset`).
- The `slice_*` family: `dplyr::slice_min()`, `dplyr::slice_max()`, `dplyr::slice_head()`, `dplyr::slice_tail()`, and `dplyr::slice_sample()`.

The package now has documentation that lists all dplyr methods and R function mappings that are supported on Arrow data, along with notes about any differences in functionality between queries evaluated in R versus in Acero, the Arrow query engine. See `?acero`.

A few new features and bugfixes were implemented for joins:

- The `keep` argument is now supported, allowing separate columns for the left and right hand side join keys in join output.
- Full joins now coalesce the join keys (when `keep = FALSE`), avoiding the issue where the join keys would be all `NA` for rows in the right hand side without any matches on the left.

Some changes to improve the consistency of the API:

- In a future release, `dplyr::pull()` will return a `ChunkedArray` instead of an R vector by default. The current default behavior is deprecated. To update to the new behavior now, specify `pull(as_vector = FALSE)` or set `options(arrow.pull_as_vector = FALSE)` globally (see the sketch below).
- Calling `dplyr::compute()` on a query that is grouped returns a `Table` instead of a query object.

Finally, long-running queries can now be cancelled and will abort their computation immediately.
- `as_arrow_array()` can now take `blob::blob` and `vctrs::list_of`, which convert to binary and list arrays, respectively. Also fixed an issue where `as_arrow_array()` ignored the `type` argument when passed a `StructArray`.
- The `unique()` function works on `Table`, `RecordBatch`, `Dataset`, and `RecordBatchReader`.
- `write_feather()` can take `compression = FALSE` to choose writing uncompressed files.
- Also, a breaking change for IPC files in `write_dataset()`: passing `"ipc"` or `"feather"` to `format` will now write files with a `.arrow` extension instead of `.ipc` or `.feather`.
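A quick sketch of the `.arrow` extension change for IPC/Feather output of `write_dataset()`; the path and data are illustrative:

```r
library(arrow)

df <- data.frame(x = 1:10, grp = rep(c("a", "b"), 5))

# Writes IPC (Feather V2) files; output files now end in ".arrow"
write_dataset(df, "out_dir", format = "feather", partitioning = "grp")

list.files("out_dir", recursive = TRUE)
#> e.g. "grp=a/part-0.arrow", "grp=b/part-0.arrow" (exact names may vary)
```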
As of version 10.0.0, arrow requires C++17 to build. This means that:

- `R >= 4.0` is required; version 9.0.0 was the last version to support R 3.6.
- On systems whose default system compiler is gcc 4.8, you can still build `arrow`, but you first need to install a newer compiler. See `vignette("install", package = "arrow")` for guidance. Note that you only need the newer compiler to build `arrow`: installing a binary package, as from RStudio Package Manager, or loading a package you've already installed works fine with the system defaults.

- `dplyr::union` and `dplyr::union_all` (#13090)
- `dplyr::glimpse` (#13563)
- `show_exec_plan()` can be added to the end of a dplyr pipeline to show the underlying plan, similar to `dplyr::show_query()`. `dplyr::show_query()` and `dplyr::explain()` also work and show the same output, but may change in the future. (#13541)
- User-defined functions are supported in queries; use `register_scalar_function()` to create them. (#13397)
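A minimal sketch of appending `show_exec_plan()` to a pipeline, as described above; the data and column names are invented:

```r
library(arrow)
library(dplyr)

arrow_table(x = 1:10, grp = rep(c("a", "b"), 5)) %>%
  filter(x > 3) %>%
  group_by(grp) %>%
  summarize(total = sum(x)) %>%
  show_exec_plan()   # prints the underlying execution plan for the query
```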
- `map_batches()` returns a `RecordBatchReader` and requires that the function it maps returns something coercible to a `RecordBatch` through the `as_record_batch()` S3 function. It can also run in streaming fashion if passed `.lazy = TRUE`. (#13170, #13650)
- Functions can be called with package namespace prefixes (e.g. `stringr::`, `lubridate::`) within queries. For example, `stringr::str_length` will now dispatch to the same kernel as `str_length`. (#13160)
- Support for the `lubridate::parse_date_time()` datetime parser: (#12589, #13196, #13506)
  - `orders` with year, month, day, hours, minutes, and seconds components are supported.
  - The `orders` argument in the Arrow binding works as follows: `orders` are transformed into `formats` which subsequently get applied in turn. There is no `select_formats` parameter and no inference takes place (as is the case in `lubridate::parse_date_time()`).
- Additional `lubridate` date and datetime parsers such as `lubridate::ymd()`, `lubridate::yq()`, and `lubridate::ymd_hms()` (#13118, #13163, #13627)
- `lubridate::fast_strptime()` (#13174)
- `lubridate::floor_date()`, `lubridate::ceiling_date()`, and `lubridate::round_date()` (#12154)
- `strptime()` supports the `tz` argument to pass timezones. (#13190)
- `lubridate::qday()` (day of quarter)
- `exp()` and `sqrt()`. (#13517)
- New functions `read_ipc_file()` and `write_ipc_file()` are added. These functions are almost the same as `read_feather()` and `write_feather()`, but differ in that they only target IPC files (Feather V2 files), not Feather V1 files.
- `read_arrow()` and `write_arrow()`, deprecated since 1.0.0 (July 2020), have been removed. Instead of these, use `read_ipc_file()` and `write_ipc_file()` for IPC files, or `read_ipc_stream()` and `write_ipc_stream()` for IPC streams. (#13550)
- `write_parquet()` now defaults to writing Parquet format version 2.4 (was 1.0). Previously deprecated arguments `properties` and `arrow_properties` have been removed; if you need to deal with these lower-level properties objects directly, use `ParquetFileWriter`, which `write_parquet()` wraps. (#13555)
- `write_dataset()` preserves all schema metadata again. In 8.0.0, it would drop most metadata, breaking packages such as sfarrow. (#13105)
- Reading and writing functions (such as `write_csv_arrow()`) will automatically (de-)compress data if the file path contains a compression extension (e.g. `"data.csv.gz"`). This works locally as well as on remote filesystems like S3 and GCS. (#13183)
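To illustrate the automatic (de-)compression based on the file extension; the file name and data are invented:

```r
library(arrow)

df <- data.frame(x = 1:5, y = letters[1:5])

# The ".gz" extension triggers gzip compression on write...
write_csv_arrow(df, "data.csv.gz")

# ...and decompression on read
read_csv_arrow("data.csv.gz")
```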
- `FileSystemFactoryOptions` can be provided to `open_dataset()`, allowing you to pass options such as which file prefixes to ignore. (#13171)
- By default, `S3FileSystem` will not create or delete buckets. To enable that, pass the configuration option `allow_bucket_creation` or `allow_bucket_deletion`. (#13206)
- `GcsFileSystem` and `gs_bucket()` allow connecting to Google Cloud Storage. (#10999, #13601)
- The `$num_rows()` method returns a double (previously integer), avoiding integer overflow on larger tables. (#13482, #13514)
- `arrow.dev_repo` for nightly builds of the R package and prebuilt libarrow binaries is now https://nightlies.apache.org/arrow/r/.
- `open_dataset()`:
  - A `skip` argument for skipping header rows in CSV datasets.
  - `UnionDataset`.
- `{dplyr}` queries:
  - `RecordBatchReader`. This allows, for example, results from DuckDB to be streamed back into Arrow rather than materialized before continuing the pipeline.
  - `dplyr::rename_with()`.
  - `dplyr::count()` returns an ungrouped dataframe.
- `write_dataset()` has more options for controlling row group and file sizes when writing partitioned datasets, such as `max_open_files`, `max_rows_per_file`, `min_rows_per_group`, and `max_rows_per_group`.
- `write_csv_arrow()` accepts a `Dataset` or an Arrow dplyr query.
- `option(use_threads = FALSE)` no longer crashes R. That option is set by default on Windows.
- `dplyr` joins support the `suffix` argument to handle overlap in column names.
- Filtering with `is.na()` no longer misses any rows.
- `map_batches()` correctly accepts `Dataset` objects.
- `read_csv_arrow()`'s readr-style type `T` is mapped to `timestamp(unit = "ns")` instead of `timestamp(unit = "s")`.
- `{lubridate}` features and fixes:
  - `lubridate::tz()` (timezone),
  - `lubridate::semester()`,
  - `lubridate::dst()` (daylight savings time boolean),
  - `lubridate::date()`,
  - `lubridate::epiyear()` (year according to epidemiological week calendar),
  - `lubridate::month()` works with integer inputs.
  - `lubridate::make_date()` & `lubridate::make_datetime()` + `base::ISOdatetime()` & `base::ISOdate()` to create date-times from numeric representations.
  - `lubridate::decimal_date()` and `lubridate::date_decimal()`
  - `lubridate::make_difftime()` (duration constructor)
  - `?lubridate::duration` helper functions, such as `lubridate::dyears()`, `lubridate::dhours()`, `lubridate::dseconds()`.
  - `lubridate::leap_year()`
  - `lubridate::as_date()` and `lubridate::as_datetime()`
  - `base::difftime` and `base::as.difftime()`
  - `base::as.Date()` to convert to date
  - `base::format()`
  - `strptime()` returns `NA` instead of erroring in case of format mismatch, just like `base::strptime()`.
- Added `as_arrow_array()` and `as_arrow_table()` for main Arrow objects. This includes Arrow tables, record batches, arrays, chunked arrays, record batch readers, schemas, and data types. This allows other packages to define custom conversions from their types to Arrow objects, including extension arrays; see the sketch below.
- See `?new_extension_type`.
- Arrow arrays can be created from any object where `vctrs::vec_is()` returns TRUE (i.e., any object that can be used as a column in a `tibble::tibble()`), provided that the underlying `vctrs::vec_data()` can be converted to an Arrow Array.
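A small sketch of the conversion generics mentioned above; the values are invented for illustration:

```r
library(arrow)

# Convert plain R objects to Arrow structures via the new generics
arr <- as_arrow_array(c(1.5, 2.5, NA))      # Arrow Array (double)
tab <- as_arrow_table(data.frame(x = 1:3))  # Arrow Table

arr$type
tab$schema
```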
Arrow arrays and tables can be easily concatenated:

- Arrays can be concatenated with `concat_arrays()` or, if zero-copy is desired and chunking is acceptable, using `ChunkedArray$create()`.
- ChunkedArrays can be concatenated with `c()`.
- RecordBatches and Tables support `cbind()`.
- Tables support `rbind()`. `concat_tables()` is also provided to concatenate tables while unifying schemas.

- `sqrt()`, `log()`, and `exp()` can be used with Arrow arrays and scalars.
- `read_*` and `write_*` functions support R Connection objects for reading and writing files.
- `median()` and `quantile()` will warn only once about approximate calculations regardless of interactivity.
- `Array$cast()` can cast StructArrays into another struct type with the same field names and structure (or a subset of fields) but different field types.
- Fixed a bug where `set_io_thread_count()` would set the CPU count instead of the IO thread count.
- `RandomAccessFile` has a `$ReadMetadata()` method that provides useful metadata provided by the filesystem.
- The `grepl` binding returns `FALSE` for `NA` inputs (previously it returned `NA`), to match the behavior of `base::grepl()`.
- `create_package_with_all_dependencies()` works on Windows and Mac OS, instead of only Linux.
- New `{lubridate}` features: `week()`, more of the `is.*()` functions, and the label argument to `month()` have been implemented.
- Complex expressions inside `summarize()`, such as `ifelse(n() > 1, mean(y), mean(z))`, are supported.
- You can use `tibble` and `data.frame` to create columns of tibbles or data.frames respectively (e.g. `... %>% mutate(df_col = tibble(a, b)) %>% ...`).
- Dictionary columns (R `factor` type) are supported inside of `coalesce()`.
- `open_dataset()` accepts the `partitioning` argument when reading Hive-style partitioned files, even though it is not required.
- The experimental `map_batches()` function for custom operations on datasets has been restored.
- CSV reading supports an `encoding` argument when reading.
- `open_dataset()` correctly ignores byte-order marks (BOMs) in CSVs, as was already true for reading single files.
- `head()` no longer hangs on large CSV datasets.
- `write_csv_arrow()` now follows the signature of `readr::write_csv()`.
- Added a `$code()` method on a `schema` or `type`. This allows you to easily get the code needed to create a schema from an object that already has one.
- The `Duration` type has been mapped to R's `difftime` class.
- The `decimal256()` type is supported. The `decimal()` function has been revised to call either `decimal256()` or `decimal128()` based on the value of the `precision` argument.
- `write_parquet()` uses a reasonable guess at `chunk_size` instead of always writing a single chunk. This improves the speed of reading and writing large Parquet files.
- `write_parquet()` no longer drops attributes for grouped data.frames.
- S3 file systems can be created with `proxy_options`.
- The Linux build now uses `pkg-config` to search for system dependencies (such as `libz`) and link to them if present. This new default will make building Arrow from source quicker on systems that have these dependencies installed already. To retain the previous behavior of downloading and building all dependencies, set `ARROW_DEPENDENCY_SOURCE=BUNDLED`.
- `glue`, which `arrow` depends on transitively, has dropped support for it.
- Bindings for `str_count()` in dplyr queries.

There are now two ways to query Arrow data:
`dplyr::summarize()`, both grouped and ungrouped, is now implemented for Arrow Datasets, Tables, and RecordBatches. Because data is scanned in chunks, you can aggregate over larger-than-memory datasets backed by many files. Supported aggregation functions include `n()`, `n_distinct()`, `min()`, `max()`, `sum()`, `mean()`, `var()`, `sd()`, `any()`, and `all()`. `median()` and `quantile()` with one probability are also supported and currently return approximate results using the t-digest algorithm.

Along with `summarize()`, you can also call `count()`, `tally()`, and `distinct()`, which effectively wrap `summarize()`.

This enhancement does change the behavior of `summarize()` and `collect()` in some cases: see "Breaking changes" below for details.
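A minimal sketch of an aggregation over a (potentially larger-than-memory) dataset, per the paragraph above; the path and column names are invented:

```r
library(arrow)
library(dplyr)

ds <- open_dataset("path/to/parquet_dir")   # many files, scanned in chunks

ds %>%
  group_by(year) %>%
  summarize(
    n_rows   = n(),
    avg_fare = mean(fare, na.rm = TRUE)
  ) %>%
  collect()
```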
In addition to `summarize()`, mutating and filtering equality joins (`inner_join()`, `left_join()`, `right_join()`, `full_join()`, `semi_join()`, and `anti_join()`) are also supported natively in Arrow.

Grouped aggregation and (especially) joins should be considered somewhat experimental in this release. We expect them to work, but they may not be well optimized for all workloads. To help us focus our efforts on improving them in the next release, please let us know if you encounter unexpected behavior or poor performance.

New non-aggregating compute functions include string functions like `str_to_title()` and `strftime()` as well as compute functions for extracting date parts (e.g. `year()`, `month()`) from dates. This is not a complete list of additional compute functions; for an exhaustive list of available compute functions see `list_compute_functions()`.

We've also worked to fill in support for all data types, such as `Decimal`, for functions added in previous releases. All type limitations mentioned in previous release notes should no longer be valid, and if you find a function that is not implemented for a certain data type, please report an issue.

If you have the duckdb package installed, you can hand off an Arrow Dataset or query object to DuckDB for further querying using the `to_duckdb()` function. This allows you to use duckdb's `dbplyr` methods, as well as its SQL interface, to aggregate data. Filtering and column projection done before `to_duckdb()` is evaluated in Arrow, and duckdb can push down some predicates to Arrow as well. This handoff does not copy the data; instead it uses Arrow's C interface (just like passing arrow data between R and Python). This means no serialization or data copying costs are incurred.

You can also take a duckdb `tbl` and call `to_arrow()` to stream data to Arrow's query engine. This means that in a single dplyr pipeline, you could start with an Arrow Dataset, evaluate some steps in DuckDB, then evaluate the rest in Arrow.
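A sketch of the DuckDB round trip described above; the dataset path and column names are invented, and it assumes the duckdb package is installed:

```r
library(arrow)
library(dplyr)
library(duckdb)

ds <- open_dataset("path/to/dataset")

ds %>%
  filter(status == "ok") %>%     # evaluated in Arrow before the handoff
  to_duckdb() %>%                # no-copy handoff via Arrow's C interface
  group_by(category) %>%
  summarize(n = n()) %>%         # evaluated by DuckDB
  to_arrow() %>%                 # stream the result back into Arrow
  collect()
```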
Breaking changes:

- Row order of query results is no longer deterministic. If you need a stable sort order, you should explicitly `arrange()` the query result. For calls to `summarize()`, you can set `options(arrow.summarise.sort = TRUE)` to match the current `dplyr` behavior of sorting on the grouping columns.
- `dplyr::summarize()` on an in-memory Arrow Table or RecordBatch no longer eagerly evaluates. Call `compute()` or `collect()` to evaluate the query.
- `head()` and `tail()` also no longer eagerly evaluate, both for in-memory data and for Datasets. Also, because row order is no longer deterministic, they will effectively give you a random slice of data from somewhere in the dataset unless you `arrange()` to specify sorting.
- For `sf` columns and similar custom row-level metadata, we recommend converting the columns before saving (e.g. with `sf::st_as_binary(col)`) or using the sfarrow package, which handles some of the intricacies of this conversion process. We have plans to improve this and re-enable custom metadata like this in the future when we can implement the saving in a safe and efficient way. If you need to preserve the pre-6.0.0 behavior of saving this metadata, you can set `options(arrow.preserve_row_level_metadata = TRUE)`. We will be removing this option in a coming release. We strongly recommend avoiding this workaround if possible since the results will not be supported in the future and can lead to surprising and inaccurate results. If you run into a custom class besides sf columns that is impacted by this, please report an issue.
- A minimal build can be selected by setting `LIBARROW_MINIMAL=true`. This will have the core Arrow/Feather components but excludes Parquet, Datasets, compression libraries, and other optional features.
- The `create_package_with_all_dependencies()` function (also available on GitHub without installing the arrow package) will download all third-party C++ dependencies and bundle them inside the R source package. Run this function on a system connected to the network to produce the "fat" source package, then copy that .tar.gz package to your offline machine and install. Special thanks to @karldw for the huge amount of work on this.
- System dependencies (such as `libz`) can be used by setting `ARROW_DEPENDENCY_SOURCE=AUTO`. This is not the default in this release (`BUNDLED`, i.e. download and build all dependencies) but may become the default in the future.
- The JSON reading functions (`read_json_arrow()`) are now optional and still on by default; set `ARROW_JSON=OFF` before building to disable them.
- ALTREP can be disabled with `options(arrow.use_altrep = FALSE)`.
- `Field` objects can now be created as non-nullable, and `schema()` now optionally accepts a list of `Field`s.
- `write_parquet()` no longer errors when used with a grouped data.frame.
- `case_when()` now errors cleanly if an expression is not supported in Arrow.
- `open_dataset()` now works on CSVs without header rows.
- Fixed an issue where the readr-style types `T` and `t` were reversed in `read_csv_arrow()`.
- `log(..., base = b)` where `b` is something other than 2, e, or 10.
- `Table$create()` now has alias `arrow_table()`.

This patch version contains fixes for some sanitizer and compiler warnings.
There are now more than 250 compute functions available for use in `dplyr::filter()`, `mutate()`, etc. Additions in this release include:

- String operations: `strsplit()` and `str_split()`; `strptime()`; `paste()`, `paste0()`, and `str_c()`; `substr()` and `str_sub()`; `str_like()`; `str_pad()`; `stri_reverse()`
- Date/time operations: `lubridate` methods such as `year()`, `month()`, `wday()`, and so on
- Math: logarithms (`log()` et al.); trigonometry (`sin()`, `cos()`, et al.); `abs()`; `sign()`; `pmin()` and `pmax()`; `ceiling()`, `floor()`, and `trunc()`
- Conditional functions, with some limitations: `ifelse()` and `if_else()` for all but `Decimal` types; `case_when()` for logical, numeric, and temporal types only; `coalesce()` for all but lists/structs. Note also that in this release, factors/dictionaries are converted to strings in these functions.
- `is.*` functions are supported and can be used inside `relocate()`

The print method for `arrow_dplyr_query` now includes the expression and the resulting type of columns derived by `mutate()`.
`transmute()` now errors if passed arguments `.keep`, `.before`, or `.after`, for consistency with the behavior of `dplyr` on `data.frame`s.

- `write_csv_arrow()` to use Arrow to write a data.frame to a single CSV file.
- `write_dataset(format = "csv", ...)` to write a Dataset to CSVs, including with partitioning.
- `reticulate::py_to_r()` and `r_to_py()` methods. Along with the addition of the `Scanner$ToRecordBatchReader()` method, you can now build up a Dataset query in R and pass the resulting stream of batches to another tool in process.
- Arrow objects can be exported and imported via the C interface (e.g. `Array$export_to_c()`, `RecordBatch$import_from_c()`), similar to how they are in `pyarrow`. This facilitates their use in other packages. See the `py_to_r()` and `r_to_py()` methods for usage examples.
- Converting a `data.frame` to an Arrow `Table` uses multithreading across columns.
- ALTREP can be disabled by setting `options(arrow.use_altrep = FALSE)`.
- `is.na()` now evaluates to `TRUE` on `NaN` values in floating point number fields, for consistency with base R.
- `is.nan()` now evaluates to `FALSE` on `NA` values in floating point number fields and `FALSE` on all values in non-floating point fields, for consistency with base R.
- Additional methods for `Array`, `ChunkedArray`, `RecordBatch`, and `Table`: `na.omit()` and friends, `any()`/`all()`.
- Scalar inputs to `RecordBatch$create()` and `Table$create()` are recycled.
- `arrow_info()` includes details on the C++ build, such as compiler version.
- `match_arrow()` now converts `x` into an `Array` if it is not a `Scalar`, `Array` or `ChunkedArray` and no longer dispatches to `base::match()`.
- The full Linux build (`LIBARROW_MINIMAL=false`) includes both jemalloc and mimalloc, and it still has jemalloc as default, though this is configurable at runtime with the `ARROW_DEFAULT_MEMORY_POOL` environment variable.
- The environment variables `LIBARROW_MINIMAL`, `LIBARROW_DOWNLOAD`, and `NOT_CRAN` are now case-insensitive in the Linux build script.
Many more `dplyr` verbs are supported on Arrow objects:

- `dplyr::mutate()` is now supported in Arrow for many applications. For queries on `Table` and `RecordBatch` that are not yet supported in Arrow, the implementation falls back to pulling data into an in-memory R `data.frame` first, as in the previous release. For queries on `Dataset` (which can be larger than memory), it raises an error if the function is not implemented. The main `mutate()` features that cannot yet be called on Arrow objects are (1) `mutate()` after `group_by()` (which is typically used in combination with aggregation) and (2) queries that use `dplyr::across()`.
- `dplyr::transmute()` (which calls `mutate()`).
- `dplyr::group_by()` now preserves the `.drop` argument and supports on-the-fly definition of columns.
- `dplyr::relocate()` to reorder columns.
- `dplyr::arrange()` to sort rows.
- `dplyr::compute()` to evaluate the lazy expressions and return an Arrow Table. This is equivalent to `dplyr::collect(as_data_frame = FALSE)`, which was added in 2.0.0.

Over 100 functions can now be called on Arrow objects inside a `dplyr` verb:

- String functions `nchar()`, `tolower()`, and `toupper()`, along with their `stringr` spellings `str_length()`, `str_to_lower()`, and `str_to_upper()`, are supported in Arrow `dplyr` calls. `str_trim()` is also supported.
- Regular expression functions `sub()`, `gsub()`, and `grepl()`, along with `str_replace()`, `str_replace_all()`, and `str_detect()`, are supported.
- `cast(x, type)` and `dictionary_encode()` allow changing the type of columns in Arrow objects; `as.numeric()`, `as.character()`, etc. are exposed as similar type-altering conveniences.
- `dplyr::between()`; the Arrow version also allows the `left` and `right` arguments to be columns in the data and not just scalars.
- Additionally, any Arrow C++ compute function can be called inside a `dplyr` verb. This enables you to access Arrow functions that don't have a direct R mapping. See `list_compute_functions()` for all available functions, which are available in `dplyr` prefixed by `arrow_`.
- Error messages are clearer when a query is unsupported: for example, `dplyr::filter(arrow_dataset, string_column == 3)` will error with a message about the type mismatch between the numeric `3` and the string type of `string_column`.
- `open_dataset()` now accepts a vector of file paths (or even a single file path). Among other things, this enables you to open a single very large file and use `write_dataset()` to partition it without having to read the whole file into memory.
- `write_dataset()` now defaults to `format = "parquet"` and better validates the `format` argument.
- Invalid input for `schema` in `open_dataset()` is now correctly handled.
- The `Scanner$Scan()` method has been removed; use `Scanner$ScanBatches()`.
- `value_counts()` to tabulate values in an `Array` or `ChunkedArray`, similar to `base::table()`.
- `StructArray` objects gain data.frame-like methods, including `names()`, `$`, `[[`, and `dim()`.
- Assignment (`<-`) with either `$` or `[[`.
- Types in a `Schema` can now be edited by assigning in new types. This enables using the CSV reader to detect the schema of a file, modify the `Schema` object for any columns that you want to read in as a different type, and then use that `Schema` to read the data.
- Creating a `Table` with a schema, with columns of different lengths, and with scalar value recycling.
- If string data contains embedded nul (`\0`) characters, the error message now informs you that you can set `options(arrow.skip_nul = TRUE)` to strip them out. It is not recommended to set this option by default since this code path is significantly slower, and most string data does not contain nuls.
- `read_json_arrow()` now accepts a schema: `read_json_arrow("file.json", schema = schema(col_a = float64(), col_b = string()))`
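A compact sketch of the expanded dplyr support described above; the table contents are invented:

```r
library(arrow)
library(dplyr)

tab <- arrow_table(x = 1:5, s = c("a", "b", "a", "c", "b"))

tab %>%
  filter(x > 1) %>%
  mutate(s_upper = toupper(s)) %>%   # one of the newly mapped string functions
  arrange(desc(x)) %>%
  collect()
```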
vignette("install", package = "arrow")
for details. This
allows a faster, smaller package build in cases where that is useful,
and it enables a minimal, functioning R package build on Solaris.FORCE_BUNDLED_BUILD=true
.arrow
now uses the mimalloc
memory
allocator by default on macOS, if available (as it is in CRAN binaries),
instead of jemalloc
. There are configuration
issues with jemalloc
on macOS, and benchmark
analysis shows that this has negative effects on performance,
especially on memory-intensive workflows. jemalloc
remains
the default on Linux; mimalloc
is default on Windows.ARROW_DEFAULT_MEMORY_POOL
environment
variable to switch memory allocators now works correctly when the Arrow
C++ library has been statically linked (as is usually the case when
installing from CRAN).arrow_info()
function now reports on the additional
optional features, as well as the detected SIMD level. If key features
or compression libraries are not enabled in the build,
arrow_info()
will refer to the installation vignette for
guidance on how to install a more complete build, if desired.vignette("developing", package = "arrow")
.ARROW_HOME
to point to a specific directory where the Arrow
libraries are. This is similar to passing INCLUDE_DIR
and
LIB_DIR
.flight_get()
and
flight_put()
(renamed from push_data()
in this
release) can handle both Tables and RecordBatchesflight_put()
gains an overwrite
argument
to optionally check for the existence of a resource with the same
namelist_flights()
and flight_path_exists()
enable you to see available resources on a Flight serverSchema
objects now have r_to_py
and
py_to_r
methods+
, *
, etc.) are
supported on Arrays and ChunkedArrays and can be used in filter
expressions in Arrow dplyr
pipelines<-
) with either $
or [[
names()
rlang
pronouns .data
and
.env
are now fully supported in Arrow dplyr
pipelines.arrow.skip_nul
(default FALSE
, as
in base::scan()
) allows conversion of Arrow string
(utf8()
) type data containing embedded nul \0
characters to R. If set to TRUE
, nuls will be stripped and
a warning is emitted if any are found.arrow_info()
for an overview of various run-time and
build-time Arrow configurations, useful for debuggingARROW_DEFAULT_MEMORY_POOL
before loading the Arrow package to change memory allocators. Windows
packages are built with mimalloc
; most others are built
with both jemalloc
(used by default) and
mimalloc
. These alternative memory allocators are generally
much faster than the system memory allocator, so they are used by
default when available, but sometimes it is useful to turn them off for
debugging purposes. To disable them, set
ARROW_DEFAULT_MEMORY_POOL=system
.sf
tibbles to faithfully preserved and
roundtripped (#8549).schema()
for more details.write_parquet()
can now write RecordBatchesreadr
’s problems
attribute is removed when
converting to Arrow RecordBatch and table to prevent large amounts of
metadata from accumulating inadvertently (#9092)SubTreeFileSystem
gains a useful print method and no
longer errors when printingr-arrow
package are available with
conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
cmake
versionsvignette("install", package = "arrow")
, especially for
known CentOS issuesdistro
package. If
your OS isn’t correctly identified, please report an issue there.write_dataset()
to Feather or Parquet files with
partitioning. See the end of
vignette("dataset", package = "arrow")
for discussion and
examples.head()
, tail()
, and take
([
) methods. head()
is optimized but the
others may not be performant.collect()
gains an as_data_frame
argument,
default TRUE
but when FALSE
allows you to
evaluate the accumulated select
and filter
query but keep the result in Arrow, not an R
data.frame
read_csv_arrow()
supports specifying column types, both
with a Schema
and with the compact string representation
for types used in the readr
package. It also has gained a
timestamp_parsers
argument that lets you express a set of
strptime
parse strings that will be tried to convert
columns designated as Timestamp
type.libcurl
and
openssl
, as well as a sufficiently modern compiler. See
vignette("install", package = "arrow")
for details.read_parquet()
,
write_feather()
, et al.), as well as
open_dataset()
and write_dataset()
, allow you
to access resources on S3 (or on file systems that emulate S3) either by
providing an s3://
URI or by providing a
FileSystem$path()
. See
vignette("fs", package = "arrow")
for examples.copy_files()
allows you to recursively copy directories
of files from one file system to another, such as from S3 to your local
machine.Flight
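A brief sketch of reading from S3 and copying files locally, per the items above; the bucket and paths are invented, and an S3-enabled build is assumed:

```r
library(arrow)

# Read a Parquet file directly from S3 via URI
df <- read_parquet("s3://some-bucket/path/file.parquet")

# Or work through a filesystem object
bucket <- s3_bucket("some-bucket")
ds <- open_dataset(bucket$path("dataset_dir"))

# Recursively copy a directory from S3 to the local machine
copy_files("s3://some-bucket/dataset_dir", "local_copy")
```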
Flight is a general-purpose client-server framework for high performance transport of large datasets over network interfaces. The `arrow` R package now provides methods for connecting to Flight RPC servers to send and receive data. See `vignette("flight", package = "arrow")` for an overview.

- Comparison (`==`, `>`, etc.) and boolean (`&`, `|`, `!`) operations, along with `is.na`, `%in%` and `match` (called `match_arrow()`), on Arrow Arrays and ChunkedArrays are now implemented in the C++ library.
- Aggregation methods `min()`, `max()`, and `unique()` are implemented for Arrays and ChunkedArrays.
- `dplyr` filter expressions on Arrow Tables and RecordBatches are now evaluated in the C++ library, rather than by pulling data into R and evaluating. This yields significant performance improvements.
- `dim()` (`nrow`) for dplyr queries on Table/RecordBatch is now supported.
- `arrow` now depends on `cpp11`, which brings more robust UTF-8 handling and faster compilation.
- The conversion of an `Int64` type when all values fit in an R 32-bit integer now correctly inspects all chunks in a ChunkedArray, and this conversion can be disabled (so that `Int64` always yields a `bit64::integer64` vector) by setting `options(arrow.int64_downcast = FALSE)`.
- `ParquetFileReader` has additional methods for accessing individual columns or row groups from the file.
- Fixes for: `ParquetFileWriter`; an invalid `ArrowObject` pointer from a saved R object; converting deeply nested structs from Arrow to R.
- The `properties` and `arrow_properties` arguments to `write_parquet()` are deprecated.
- A `%in%` expression now faithfully returns all relevant rows.
- Dataset paths containing `.` or `_` are handled; files and subdirectories starting with those prefixes are still ignored.
- `open_dataset("~/path")` now correctly expands the path.
- The `version` option to `write_parquet()` is now correctly implemented.
- An issue in the bundled `parquet-cpp` library has been fixed.
- The `cmake` detection is more robust, and you can now specify a `/path/to/cmake` by setting the `CMAKE` environment variable.
- `vignette("arrow", package = "arrow")` includes tables that explain how R types are converted to Arrow types and vice versa.
- Support for converting to and from additional types: `uint64`, `binary`, `fixed_size_binary`, `large_binary`, `large_utf8`, `large_list`, `list` of `structs`.
- `character` vectors that exceed 2GB are converted to the Arrow `large_utf8` type.
- `POSIXlt` objects can now be converted to Arrow (`struct`).
- R `attributes()` are preserved in Arrow metadata when converting to Arrow RecordBatch and table and are restored when converting from Arrow. This means that custom subclasses, such as `haven::labelled`, are preserved in round trip through Arrow.
- Schema metadata can be set by assignment, e.g. `batch$metadata$new_key <- "new value"`.
- Arrow types `int64`, `uint32`, and `uint64` are now converted to R `integer` if all values fit in bounds.
- Arrow `date32` is now converted to R `Date` with `double` underlying storage. Even though the data values themselves are integers, this provides more strict round-trip fidelity.
- When converting to R `factor`, `dictionary` ChunkedArrays that do not have identical dictionaries are properly unified.
- By default, `RecordBatch{File,Stream}Writer` will write format metadata version V5, but you can specify an alternate `metadata_version`. For convenience, if you know the consumer you're writing to cannot read V5, you can set the environment variable `ARROW_PRE_1_0_METADATA_VERSION=1` to write V4 without changing any other code.
- Datasets can be created from S3 buckets, e.g. `ds <- open_dataset("s3://...")`. Note that this currently requires a special C++ library build with additional dependencies; this is not yet available in CRAN releases or in nightly packages.
- Aggregation functions `sum()` and `mean()` are implemented for `Array` and `ChunkedArray`.
- `dimnames()` and `as.list()`.
- `reticulate`.
- The `coerce_timestamps` option to `write_parquet()` is now correctly implemented.
- The `type` definition is used if provided by the user.
- `read_arrow` and `write_arrow` are now deprecated; use the `read/write_feather()` and `read/write_ipc_stream()` functions depending on whether you're working with the Arrow IPC file or stream format, respectively.
- Previously deprecated `FileStats`, `read_record_batch`, and `read_table` have been removed.
- Builds now have `jemalloc` included, and Windows packages use `mimalloc`.
- The build scripts now use the `CC` and `CXX` values that R uses.
- Compatibility with `dplyr` 1.0.
- `reticulate::r_to_py()` conversion now correctly works automatically, without having to call the method yourself.
This release includes support for version 2 of the Feather file format. Feather v2 features full support for all Arrow data types, fixes the 2GB per-column limitation for large amounts of string data, and allows files to be compressed using either `lz4` or `zstd`. `write_feather()` can write either version 2 or version 1 Feather files, and `read_feather()` automatically detects which file version it is reading.

Related to this change, several functions around reading and writing data have been reworked. `read_ipc_stream()` and `write_ipc_stream()` have been added to facilitate writing data to the Arrow IPC stream format, which is slightly different from the IPC file format (Feather v2 is the IPC file format).

Behavior has been standardized: all `read_<format>()` functions return an R `data.frame` (default) or a `Table` if the argument `as_data_frame = FALSE`; all `write_<format>()` functions return the data object, invisibly. To facilitate some workflows, a special `write_to_raw()` function is added to wrap `write_ipc_stream()` and return the `raw` vector containing the buffer that was written.

To achieve this standardization, `read_table()`, `read_record_batch()`, `read_arrow()`, and `write_arrow()` have been deprecated.

The 0.17 Apache Arrow release includes a C data interface that allows exchanging Arrow data in-process at the C level without copying and without libraries having a build or runtime dependency on each other. This enables us to use `reticulate` to share data between R and Python (`pyarrow`) efficiently. See `vignette("python", package = "arrow")` for details.

- Datasets have a `dim()` method, which sums rows across all files (#6635, @boshek).
- `UnionDataset` with the `c()` method.
- Dataset filtering treats `NA` as `FALSE`, consistent with `dplyr::filter()`.
- `vignette("dataset", package = "arrow")` now has correct, executable code.
- `NOT_CRAN=true`. See `vignette("install", package = "arrow")` for details and more options.
- `unify_schemas()` to create a `Schema` containing the union of fields in multiple schemas.
- `read_feather()` and other reader functions close any file connections they open.
- Fixed behavior when the `R.oo` package is also loaded.
- `FileStats` is renamed to `FileInfo`, and the original spelling has been deprecated.
- `install_arrow()` now installs the latest release of `arrow`, including Linux dependencies, either for CRAN releases or for development builds (if `nightly = TRUE`).
- Libarrow binaries are downloaded only if the `LIBARROW_DOWNLOAD` or `NOT_CRAN` environment variable is set.
- `write_feather()`, `write_arrow()` and `write_parquet()` now return their input, similar to the `write_*` functions in the `readr` package (#6387, @boshek).
- An R `list` can be converted to a ListArray when all list elements are the same type (#6275, @michaelchirico).

This release includes a `dplyr` interface to Arrow Datasets, which let you work efficiently with large, multi-file datasets as a single entity. Explore a directory of data files with `open_dataset()` and then use `dplyr` methods to `select()`, `filter()`, etc. Work will be done where possible in Arrow memory. When necessary, data is pulled into R for further computation. `dplyr` methods are conditionally loaded if you have `dplyr` available; it is not a hard dependency.

See `vignette("dataset", package = "arrow")` for details.
A source package installation (as from CRAN) will now handle its C++ dependencies automatically. For common Linux distributions and versions, installation will retrieve a prebuilt static C++ library for inclusion in the package; where this binary is not available, the package executes a bundled script that should build the Arrow C++ library with no system dependencies beyond what R requires.

See `vignette("install", package = "arrow")` for details.

- `Table`s and `RecordBatch`es also have `dplyr` methods.
- Outside of `dplyr`, `[` methods for Tables, RecordBatches, Arrays, and ChunkedArrays now support natural row extraction operations. These use the C++ `Filter`, `Slice`, and `Take` methods for efficient access, depending on the type of selection vector.
- An `array_expression` class has also been added, enabling among other things the ability to filter a Table with some function of Arrays, such as `arrow_table[arrow_table$var1 > 5, ]` without having to pull everything into R first.
- `write_parquet()` now supports compression.
- `codec_is_available()` returns `TRUE` or `FALSE` whether the Arrow C++ library was built with support for a given compression library (e.g. gzip, lz4, snappy).
- Coerced to `character` (as R `factor` levels are required to be) instead of raising an error.
- Objects are now created with `Class$create()` methods. Notably, `arrow::array()` and `arrow::table()` have been removed in favor of `Array$create()` and `Table$create()`, eliminating the package startup message about masking `base` functions. For more information, see the new `vignette("arrow")`.
- To write data readable by older Arrow versions, set `ARROW_PRE_0_15_IPC_FORMAT=1`.
- The `as_tibble` argument in the `read_*()` functions has been renamed to `as_data_frame` (#5399, @jameslamb).
- The `arrow::Column` class has been removed, as it was removed from the C++ library.
- `Table` and `RecordBatch` objects have S3 methods that enable you to work with them more like `data.frame`s. Extract columns, subset, and so on. See `?Table` and `?RecordBatch` for examples.
- `read_csv_arrow()` supports more parsing options, including `col_names`, `na`, `quoted_na`, and `skip`.
- `read_parquet()` and `read_feather()` can ingest data from a `raw` vector (#5141).
- File paths like `~/file.parquet` are handled (#5169).
- `double()` and time types can be created with human-friendly resolution strings ("ms", "s", etc.). (#5198, #5201)

Initial CRAN release of the `arrow` package. Key features include: