Creating samples with sampler

The sampler function is part of the whSample package of utilities for sampling. It is a menu-driven R application that reads an Excel spreadsheet or delimited text file and generates a new Excel workbook with a copy of the original data, a Simple Random or Stratified Random Sample, and a report of what it did.

The sampler function calls ssize to get its sample size estimate. Sampler uses the same sample size defaults as ssize and they can be overridden when launching sampler from the command line.
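For orientation, a sample size estimate of this kind is commonly computed with Cochran's formula for a proportion plus a finite population correction. The sketch below illustrates that textbook approach; it is not taken from the ssize source. The function name sketch_ssize, its argument names, and its default values are hypothetical.

```r
# Hypothetical helper: textbook sample size for a proportion with a
# finite population correction. Not the whSample implementation.
sketch_ssize <- function(N, conf = 0.95, me = 0.05, p = 0.5) {
  z  <- qnorm(1 - (1 - conf) / 2)   # two-sided critical value
  n0 <- z^2 * p * (1 - p) / me^2    # estimate for an unlimited population
  n  <- n0 / (1 + (n0 - 1) / N)     # shrink for a population of N rows
  ceiling(n)                        # round up to whole rows
}

sketch_ssize(N = 150)
```

A lower confidence level or a wider margin of error reduces the estimate, and a rate of occurrence further from 0.5 does the same.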

Beyond the sample size arguments, sampler’s own arguments default to backups=5, irisData=F, seed=NULL and keepOrg=F. With the default seed=NULL, sampler uses the current system time in milliseconds as the seed. Any number can be used instead. Whichever seed is used will be listed in the Report output tab. The keep-original option (keepOrg) defaults to FALSE, but can be set to keepOrg=T for smaller populations that won’t exceed Excel’s row limit of 1,048,576 rows.
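The seed behaviour can be sketched in base R. This is an illustration, not sampler's code: the helper name draw_rows is hypothetical, and wrapping the millisecond clock into R's integer range with a modulo is an assumption made so that set.seed() accepts the value.

```r
# Hypothetical helper: draw `size` row numbers from 1..n_rows and
# report the seed so the draw can be repeated later.
draw_rows <- function(n_rows, size, seed = NULL) {
  if (is.null(seed)) {
    # Assumption: derive a seed from the clock in milliseconds,
    # wrapped to fit the integer range that set.seed() accepts.
    ms   <- as.numeric(Sys.time()) * 1000
    seed <- as.integer(ms %% .Machine$integer.max)
  }
  set.seed(seed)
  list(seed = seed, rows = sample.int(n_rows, size))
}

first  <- draw_rows(150, 10, seed = 12345)
second <- draw_rows(150, 10, seed = 12345)
identical(first$rows, second$rows)  # same seed, same rows
```

Recording the seed alongside the sample is what makes a later rerun reproduce the same rows.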

To override any of these defaults, enter name=value as an argument when calling sampler, as in sampler(p=0.6, seed=12345).

Jump right in

Sampler includes .xlsx and .csv versions of Anderson’s Iris dataset, which characterizes variation among iris flowers. The Examples section will get you started with sampler, or keep reading for a detailed walk-through.

Running sampler

The easiest way to run sampler is by calling it from the R console:

sampler()

Choose the source data file

A file chooser will appear so you can select a source file. Sampler will remember which folder the source file came from and will put the resulting output file in the same folder, named <source file name>_Sample.xlsx.
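The naming rule can be sketched with base R path helpers. The source path below is hypothetical, and dropping the original extension before appending _Sample.xlsx is an assumption about how the name is built.

```r
# Hypothetical source path; only the string handling matters here.
src <- file.path("reports", "claims.xlsx")

# Assumption: the source extension is dropped before the suffix is added.
stem <- tools::file_path_sans_ext(basename(src))
out  <- file.path(dirname(src), paste0(stem, "_Sample.xlsx"))
out
```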

Sampler will read Excel or delimited text files.

Choose the data worksheet

If the source file is in an Excel workbook, sampler will display a list of worksheets in that file for you to choose from.
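Listing the worksheets of a workbook can be done with the openxlsx package, one of the dependencies listed under Other necessary packages. The sketch below is a non-interactive illustration rather than sampler's menu code: it writes a small two-sheet workbook to a temporary file, lists the sheet names, and reads the first one. The sheet names are made up for the example.

```r
library(openxlsx)

# Build a throwaway workbook with two named sheets from base R's iris data.
tmp <- tempfile(fileext = ".xlsx")
write.xlsx(list(FirstRows = head(iris), LastRows = tail(iris)), tmp)

# List the worksheets, then read the one that would be chosen from a menu.
sheets <- getSheetNames(tmp)
picked <- read.xlsx(tmp, sheet = sheets[1])
sheets
```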

Choose the number of backups

A “Number of Backups” menu will offer either zero, five, or ten backups (per stratum). The default is five.

There may be times when you want a specific number of backups that isn’t listed. In this case, you need to start sampler with the number of backups desired, as in:

sampler(backups=25)

Important note: if you specified a number of backups on the command line, hit Cancel at this menu and sampler will use the number you specified. Clicking Okay will instead accept whatever is highlighted in the menu. If you didn’t specify a number on the command line and hit Cancel, sampler will use the default of five.

For a Simple Random Sample, the startup example shown above will add the next 25 items from the randomized population to a separate table at the end of the output Excel workbook.
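The idea of taking backups as the next items in the randomized population can be sketched in base R. This is an illustration under stated assumptions, not sampler's code: the primary sample size of 20 is hypothetical, and capping the backups at the rows that remain is an assumption.

```r
set.seed(12345)
population <- iris     # base R's built-in copy of the Iris data
n_primary  <- 20       # hypothetical primary sample size
n_backups  <- 25       # as requested with sampler(backups=25)

# Shuffle every row once, then read the samples off the top in order.
shuffled <- population[sample.int(nrow(population)), ]

# Assumption: never ask for more backups than rows remain.
n_backups <- min(n_backups, nrow(shuffled) - n_primary)

primary <- shuffled[seq_len(n_primary), ]
backups <- shuffled[n_primary + seq_len(n_backups), ]
```

Because both tables come from one shuffle, a backup can never duplicate a primary item.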

Stratified samples will include 25 backups for each stratum. This could be a problem if a given stratum doesn’t have 25 more items to include. In this case, sampler will scale back the number of backups to the number of items available for that stratum. The Report tab will list the number of backups actually provided for each stratum.
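The scaling-back rule amounts to taking the smaller of the requested backups and the items left over in each stratum. A minimal sketch, with hypothetical stratum sizes and primary sample counts:

```r
# Hypothetical strata: population size and primary sample size for each.
stratum_size <- c(A = 120, B = 40, C = 12)
primary      <- c(A = 30,  B = 10, C = 3)
requested    <- 25

# Items still available after the primary sample is drawn.
available <- stratum_size - primary

# Each stratum gets the requested number or whatever is left, if fewer.
actual_backups <- pmin(available, requested)
actual_backups
```

Here strata A and B can supply all 25 backups, while stratum C has only 9 items left and is scaled back to 9.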

Choose the sampling type

Sampler can create a Simple Random Sample, a Stratified Random Sample, or a variation of the stratified sample where each stratum is listed with its backups in a separate Excel worksheet. A “Sample Type” menu provides the options.

Choose what to stratify on

The “Stratify on” menu lists all the column headers in the selected worksheet. Sampler will group all the samples by this name and split out the results accordingly.
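Grouping by a chosen column and sampling within each group can be sketched in base R. This is an illustration, not sampler's code: the dataset, the overall sample size, proportional allocation, and rounding up are all assumptions made for the example.

```r
set.seed(12345)
population <- mtcars   # base R dataset used as a stand-in
key        <- "cyl"    # column picked from the "Stratify on" menu
n_total    <- 10       # hypothetical overall sample size

# One data frame per distinct value of the key column.
strata <- split(population, population[[key]])

# Assumption: allocate in proportion to stratum size, rounding up.
prop   <- vapply(strata, nrow, integer(1)) / nrow(population)
n_each <- ceiling(prop * n_total)

# Draw the allotted number of rows from each stratum.
samples <- Map(function(rows, n) rows[sample.int(nrow(rows), n), , drop = FALSE],
               strata, n_each)
n_each
```

Rounding up can make the stratum counts add up to slightly more than the overall target, which is one reason to report the per-stratum numbers.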

Output

Sampler creates a new Excel workbook in three parts:

a copy of the original (source) data if previously requested,
an Excel spreadsheet with the requested sample, and
a new tab called Report with key reference information:
- path and name of the source file
- size (in rows) of the source file
- sample type (Simple Random Sample, Stratified Random Sample, or Tabbed Stratified Sample)
- sample size
- desired confidence level
- desired margin of error
- anticipated rate of occurrence
- stratification key
- number of strata
- number of backups per stratum
- random number seed used, for documentation and reproducibility
- date-time stamp of when the sample was generated
- stratification information (stratified samples only):
  - name
  - size of the population
  - proportion of the population for each stratum
  - the number of primary samples
  - the actual number of backup samples (may not be the same as that requested)

Installation

To install from CRAN, use:

install.packages("whSample")

All required dependencies will be installed automatically.

You can install* the latest development version of whSample from the R console with:

devtools::install_github("km4ivi/whSample")

* This requires the devtools package to be installed first.

Other necessary packages

sampler depends on several external packages to run properly. Ensure these packages are installed:

tidyverse (or individually: magrittr, dplyr, purrr)
openxlsx
data.table
tools
tcltk
bit64
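A quick way to check which of these packages are available is to test each one with requireNamespace(), which reports availability without attaching the package. The variable names below are illustrative.

```r
# Packages listed above (tidyverse members named individually).
needed <- c("magrittr", "dplyr", "purrr", "openxlsx",
            "data.table", "tools", "tcltk", "bit64")

# TRUE for each package that can be loaded on this machine.
available <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

not_installed <- needed[!available]
if (length(not_installed) > 0) {
  message("Not available: ", paste(not_installed, collapse = ", "))
}
```

Anything reported here can be installed from CRAN with install.packages(), except tools and tcltk, which ship with R itself.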

Examples

Running sampler(irisData=T) from the R console’s > prompt will launch sampler with the other defaults and will open a file chooser with iris.xlsx and iris.csv options (check the file type drop down for different suffixes). sampler will read either format (or delimited text) and generate an Excel file with the sampling parameters you choose. A series of menus will take you step by step through the process and will open the generated spreadsheet with an Excel-compatible app if one is on your computer.

Sampler will pop up a directory chooser to let you decide where to put the output file. By default, it will offer the same folder as the source file. If you want to see the output Excel file without first saving it, hit Cancel. The file will open but won’t automatically be saved. You can save it from Excel if you want.

If you used the example file and don’t have an Excel-compatible app installed, you can find the output folder by entering this command at the > prompt:

system.file("extdata", package="whSample")
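To see what is in that folder without leaving R, the path can be passed to list.files(). system.file() returns an empty string when the package is not installed, so the sketch checks for that first.

```r
# Locate the folder holding the bundled example files.
example_dir <- system.file("extdata", package = "whSample")

if (nzchar(example_dir)) {
  print(list.files(example_dir))   # example files and any output saved there
} else {
  message("whSample is not installed, so there is no example folder to list.")
}
```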

Calling sampler(p=0.6, seed=12345) will override those two defaults. You can override the other defaults in the same way.

If you skipped the middle part of this vignette, now’s a good time to catch up.