---
title: "Introduction to qDEA: Quantile Data Envelopment Analysis"
author: "Joe Atwood"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to qDEA: Quantile Data Envelopment Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

## Overview

The **qDEA** package implements quantile Data Envelopment Analysis (qDEA), an extension of traditional DEA that allows for a pre-specified proportion of observations to lie outside the production frontier. This approach provides robust efficiency estimation in the presence of outliers or measurement error.

### Key Features

- Standard DEA and quantile DEA estimation
- Multiple model orientations (input, output, graph, hyperbolic, directional distance)
- Returns to scale specifications (CRS, VRS, DRS, IRS)
- Bootstrap-based bias correction
- Iterative qDEA with automatic convergence testing
- Peer identification and projection calculations

### Methodology

Traditional DEA assumes all observations should be on or below the production frontier. In contrast, qDEA allows a proportion α (alpha) of observations to lie outside the frontier, making the method more robust to outliers while maintaining computational efficiency through linear programming.

The qDEA method is based on:

- Atwood, J., and S. Shaik. (2020). "Theory and Statistical Properties of Quantile Data Envelopment Analysis." *European Journal of Operational Research*, 286:649-661.
- Atwood, J., and S. Shaik. (2018). "Quantile DEA: Estimating qDEA-alpha Efficiency Estimates with Conventional Linear Programming." In *Productivity and Inequality*. Springer Press.

## Installation

```{r eval=FALSE}
# Install from CRAN
install.packages("qDEA")

# Load the package
library(qDEA)
```

```{r echo=FALSE}
# For vignette building
library(qDEA)
```

## Basic Usage: One Input, One Output

We'll start with the simplest case using the CST11 dataset from Cooper, Seiford, and Tone (2006).

### Standard DEA

```{r basic-dea}
# Load example data
data(CST11)
head(CST11)

# Prepare input and output matrices
X <- as.matrix(CST11$EMPLOYEES)
Y <- as.matrix(CST11$SALES_EJOR)

# Run output-oriented DEA with constant returns to scale
result_dea <- qDEA(X = X, Y = Y, 
                   orient = "out", 
                   RTS = "CRS",
                   qout = 0)  # Standard DEA (no outliers allowed)

# View efficiency scores
result_dea$effvals
```

Efficiency scores range from 0 to 1, where 1 indicates the unit is on the efficient frontier.

### Quantile DEA

Now let's allow one observation (12.5% of 8 observations) to be outside the frontier:

```{r basic-qdea}
# Run qDEA allowing one outlier
qout <- 1/nrow(X)  # Proportion = 1/8 = 0.125

result_qdea <- qDEA(X = X, Y = Y, 
                    orient = "out", 
                    RTS = "CRS",
                    qout = qout)

# Compare DEA and qDEA efficiency scores
comparison <- data.frame(
  Store = CST11$STORE,
  DEA = round(result_dea$effvals, 3),
  qDEA = round(result_qdea$effvalsq, 3),
  Difference = round(result_qdea$effvalsq - result_dea$effvals, 3)
)
print(comparison)
```

Notice how qDEA scores can exceed 1.0, as some units are now allowed to be "super-efficient" relative to the tighter frontier.

## Multiple Inputs: Two Inputs, One Output

Let's examine a more complex example with multiple inputs:

```{r two-inputs}
# Load two-input, one-output data
data(CST21)
head(CST21)

# Prepare matrices
X <- as.matrix(CST21[, c("EMPLOYEES", "FLOOR_AREA")])
Y <- as.matrix(CST21$SALES)

# Input-oriented DEA with VRS
result_vrs <- qDEA(X = X, Y = Y,
                   orient = "in",
                   RTS = "VRS",
                   qout = 1/nrow(X))

# Display results
data.frame(
  Store = CST21$STORE,
  Employees = CST21$EMPLOYEES,
  Floor_Area = CST21$FLOOR_AREA,
  Sales = CST21$SALES,
  Efficiency = round(result_vrs$effvalsq, 3)
)
```

## Multiple Outputs: One Input, Two Outputs

```{r two-outputs}
# Load one-input, two-output data
data(CST12)
head(CST12)

# Prepare matrices
X <- as.matrix(CST12$EMPLOYEES)
Y <- as.matrix(CST12[, c("CUSTOMERS", "SALES")])

# Output-oriented qDEA
result_outputs <- qDEA(X = X, Y = Y,
                       orient = "out",
                       RTS = "CRS",
                       qout = 0.15)  # Allow 15% outliers

# Results
data.frame(
  Store = CST12$STORE,
  Employees = CST12$EMPLOYEES,
  Customers = CST12$CUSTOMERS,
  Sales = CST12$SALES,
  DEA_Eff = round(result_outputs$effvals, 3),
  qDEA_Eff = round(result_outputs$effvalsq, 3)
)
```

## Hospital Example: Two Inputs, Two Outputs

A realistic example using hospital data:

```{r hospitals}
# Load hospital data
data(CST22)
head(CST22)

# Prepare matrices
X <- as.matrix(CST22[, c("DOCTORS", "NURSES")])
Y <- as.matrix(CST22[, c("OUT_PATIENTS", "IN_PATIENTS")])

# Run qDEA with 10% outliers allowed
result_hospital <- qDEA(X = X, Y = Y,
                        orient = "in",
                        RTS = "VRS",
                        qout = 0.10)

# Create results table
hospital_results <- data.frame(
  Hospital = CST22$HOSPITAL,
  Doctors = CST22$DOCTORS,
  Nurses = CST22$NURSES,
  Out_Patients = CST22$OUT_PATIENTS,
  In_Patients = CST22$IN_PATIENTS,
  Efficiency = round(result_hospital$effvalsq, 3)
)

print(hospital_results)

# Identify efficient hospitals
efficient <- hospital_results$Hospital[hospital_results$Efficiency >= 0.99]
cat("\nEfficient hospitals:", paste(efficient, collapse = ", "))
```

## Model Orientations

The qDEA package supports multiple orientations:

### Input Orientation

Minimize inputs while maintaining output levels:

```{r input-orientation}
result_in <- qDEA(X = X, Y = Y, orient = "in", RTS = "CRS", qout = 0.10)
cat("Input-oriented efficiencies:\n")
print(round(result_in$effvalsq, 3))
```

### Output Orientation

Maximize outputs while maintaining input levels:

```{r output-orientation}
result_out <- qDEA(X = X, Y = Y, orient = "out", RTS = "CRS", qout = 0.10)
cat("Output-oriented efficiencies:\n")
print(round(result_out$effvalsq, 3))
```

### Graph (Input-Output) Orientation

Minimize inputs and maximize outputs simultaneously:

```{r graph-orientation}
result_graph <- qDEA(X = X, Y = Y, orient = "inout", RTS = "VRS", qout = 0.10)
cat("Graph-oriented distances:\n")
print(round(result_graph$distvalsq, 3))
```

## Returns to Scale

Compare different returns to scale assumptions:

```{r rts-comparison}
# Constant Returns to Scale (CRS)
result_crs <- qDEA(X = X, Y = Y, orient = "in", RTS = "CRS", qout = 0.10)

# Variable Returns to Scale (VRS)
result_vrs_comp <- qDEA(X = X, Y = Y, orient = "in", RTS = "VRS", qout = 0.10)

# Decreasing Returns to Scale (DRS)
result_drs <- qDEA(X = X, Y = Y, orient = "in", RTS = "DRS", qout = 0.10)

# Increasing Returns to Scale (IRS)
result_irs <- qDEA(X = X, Y = Y, orient = "in", RTS = "IRS", qout = 0.10)

# Compare results
rts_comparison <- data.frame(
  Hospital = CST22$HOSPITAL,
  CRS = round(result_crs$effvalsq, 3),
  VRS = round(result_vrs_comp$effvalsq, 3),
  DRS = round(result_drs$effvalsq, 3),
  IRS = round(result_irs$effvalsq, 3)
)

print(rts_comparison)
```

VRS efficiency is always ≥ CRS efficiency, as VRS is a less restrictive assumption.

## Bootstrap Bias Correction

Bootstrap methods can correct for bias in efficiency estimates:

```{r bootstrap, eval=FALSE}
# Run qDEA with bootstrap (100 replications)
# Note: This takes longer to run
result_boot <- qDEA(X = X, Y = Y,
                    orient = "in",
                    RTS = "VRS",
                    qout = 0.10,
                    nboot = 100,
                    seedval = 12345)

# Access bias-corrected estimates
bias_corrected <- result_boot$BOOT_DATA$effvalsq.bc

# Compare original and bias-corrected
comparison_boot <- data.frame(
  Hospital = CST22$HOSPITAL,
  Original = round(result_boot$effvalsq, 3),
  Bias_Corrected = round(bias_corrected, 3),
  Bias = round(result_boot$effvalsq - bias_corrected, 3)
)

print(comparison_boot)
```

## Peer Identification

Identify which units serve as benchmarks for inefficient units:

```{r peers}
# Run qDEA to get peer information
data(CST11)
X <- as.matrix(CST11$EMPLOYEES)
Y <- as.matrix(CST11$SALES_EJOR)

result_peers <- qDEA(X = X, Y = Y,
                     orient = "out",
                     RTS = "VRS",
                     qout = 0.10)

# Access peer information
peers <- result_peers$PEER_DATA$PEERSq
print(peers)
```

The peers dataframe shows which efficient units (and with what weights) form the benchmark for each inefficient unit.

## Projected Values

Calculate target input/output levels for inefficient units:

```{r projections}
# Run with projection calculation
result_proj <- qDEA(X = X, Y = Y,
                    orient = "out",
                    RTS = "VRS",
                    qout = 0.10,
                    getproject = TRUE)

# Access projected values
X_target <- result_proj$PROJ_DATA$X0HATq
Y_target <- result_proj$PROJ_DATA$Y0HATq

# Compare actual vs. target
projection_comparison <- data.frame(
  Store = CST11$STORE,
  Actual_Input = X[,1],
  Target_Input = round(X_target[,1], 1),
  Actual_Output = Y[,1],
  Target_Output = round(Y_target[,1], 1),
  Efficiency = round(result_proj$effvalsq, 3)
)

print(projection_comparison)
```

## Iterative qDEA

For more precise qDEA estimates, use iterative refinement:

```{r iterative}
# Run with multiple iterations
result_iter <- qDEA(X = X, Y = Y,
                    orient = "out",
                    RTS = "CRS",
                    qout = 0.15,
                    nqiter = 5)  # Allow up to 5 iterations

# Check number of iterations used
cat("Iterations completed:", result_iter$LP_DATA$qiter, "\n")

# Check actual proportion of outliers
cat("Actual outlier proportion:", round(result_iter$LP_DATA$qhat, 4), "\n")
```

The algorithm iteratively adjusts the frontier until the proportion of outliers converges to the specified qout value.

## Practical Tips

### Choosing the Outlier Proportion (qout)

- **qout = 0**: Standard DEA (no outliers allowed)
- **qout = 1/n**: Allow one outlier (good starting point)
- **qout = 0.05-0.10**: Common range for robust estimation
- **qout = 0.20-0.25**: More permissive (use with caution)

Higher qout values make the frontier more robust but may be too permissive.

### Choosing Orientation

- **Input-oriented**: When managers control inputs but not outputs (e.g., production)
- **Output-oriented**: When managers control outputs but not inputs (e.g., sales)
- **Graph (inout)**: When both inputs and outputs can be adjusted
- **Directional distance**: When you want to specify exact directions for improvement

### Choosing Returns to Scale

- **CRS**: Appropriate when all units operate at optimal scale
- **VRS**: Most flexible, appropriate when scale effects vary
- **DRS**: When larger units tend to be less efficient
- **IRS**: When larger units tend to be more efficient

When in doubt, start with VRS as it's the most general assumption.

### Bootstrap Guidelines

- Use **nboot ≥ 100** for preliminary analysis
- Use **nboot ≥ 1000** for publication-quality results
- Bootstrap is computationally intensive for large datasets
- Set **seedval** for reproducible results

## Understanding the Output

The main function `qDEA()` returns a list with several components:

```{r output-structure, eval=FALSE}
result <- qDEA(X = X, Y = Y, orient = "out", RTS = "CRS", qout = 0.10)

# Main efficiency scores
result$effvals    # DEA efficiency scores
result$effvalsq   # qDEA efficiency scores
result$distvals   # DEA distance measures
result$distvalsq  # qDEA distance measures

# Input data (for reference)
result$INPUT_DATA

# Bootstrap data (if nboot > 0)
result$BOOT_DATA

# Peer information
result$PEER_DATA

# Projected values (if getproject = TRUE)
result$PROJ_DATA

# LP solver information
result$LP_DATA
```

## Advanced Example: Complete Analysis

Here's a complete analysis workflow:

```{r complete-analysis}
# Load data
data(CST22)

# Prepare data
X <- as.matrix(CST22[, c("DOCTORS", "NURSES")])
Y <- as.matrix(CST22[, c("OUT_PATIENTS", "IN_PATIENTS")])

# Run comprehensive analysis
result_complete <- qDEA(
  X = X, 
  Y = Y,
  orient = "in",
  RTS = "VRS",
  qout = 0.10,
  nqiter = 3,
  getproject = TRUE
)

# Create comprehensive results table
complete_results <- data.frame(
  Hospital = CST22$HOSPITAL,
  Doctors = CST22$DOCTORS,
  Nurses = CST22$NURSES,
  Out_Patients = CST22$OUT_PATIENTS,
  In_Patients = CST22$IN_PATIENTS,
  DEA_Eff = round(result_complete$effvals, 3),
  qDEA_Eff = round(result_complete$effvalsq, 3),
  Target_Doctors = round(result_complete$PROJ_DATA$X0HATq[,1], 1),
  Target_Nurses = round(result_complete$PROJ_DATA$X0HATq[,2], 1)
)

print(complete_results)

# Summary statistics
cat("\n=== Summary Statistics ===\n")
cat("Mean DEA efficiency:", round(mean(result_complete$effvals), 3), "\n")
cat("Mean qDEA efficiency:", round(mean(result_complete$effvalsq), 3), "\n")
cat("Efficient units (qDEA ≥ 0.99):", 
    sum(result_complete$effvalsq >= 0.99), "out of", nrow(X), "\n")

# Potential input reductions
input_savings <- cbind(
  X - result_complete$PROJ_DATA$X0HATq
)
colnames(input_savings) <- c("Doctor_Savings", "Nurse_Savings")

cat("\nTotal potential input reductions:\n")
cat("Doctors:", round(sum(input_savings[,1]), 1), "\n")
cat("Nurses:", round(sum(input_savings[,2]), 1), "\n")
```

## Visualization

Here's how to visualize efficiency scores:

```{r visualization, fig.width=7, fig.height=5}
# Create efficiency comparison plot
data(CST22)
X <- as.matrix(CST22[, c("DOCTORS", "NURSES")])
Y <- as.matrix(CST22[, c("OUT_PATIENTS", "IN_PATIENTS")])

result_viz <- qDEA(X = X, Y = Y, orient = "in", RTS = "VRS", qout = 0.10)

# Prepare data for plotting
plot_data <- data.frame(
  Hospital = CST22$HOSPITAL,
  DEA = result_viz$effvals,
  qDEA = result_viz$effvalsq
)

# Simple bar plot
barplot(
  t(as.matrix(plot_data[, c("DEA", "qDEA")])),
  beside = TRUE,
  names.arg = plot_data$Hospital,
  col = c("skyblue", "coral"),
  main = "Hospital Efficiency: DEA vs qDEA",
  ylab = "Efficiency Score",
  xlab = "Hospital",
  legend.text = c("DEA", "qDEA"),
  args.legend = list(x = "topright")
)
abline(h = 1, lty = 2, col = "red")
```

## Common Issues and Solutions

### Issue: All efficiency scores equal 1

**Cause**: Dataset may be too small or all units are on the frontier.

**Solution**: Try a larger dataset or verify data quality.

### Issue: Some efficiency scores are very low

**Cause**: Possible outliers or data quality issues.

**Solution**: 
- Increase qout to allow more outliers
- Verify data for errors
- Consider whether VRS is more appropriate than CRS

### Issue: Bootstrap takes too long

**Cause**: Large dataset or too many bootstrap replications.

**Solution**:
- Start with fewer replications (nboot = 100)
- Process a subset of units using dmulist parameter
- Run on a more powerful computer

## References

Cooper, W.W., Seiford, L.M., and Tone, K. (2006). *Introduction to Data Envelopment Analysis and Its Uses*. Springer, New York.

Atwood, J., and Shaik, S. (2020). Theory and Statistical Properties of Quantile Data Envelopment Analysis. *European Journal of Operational Research*, 286:649-661. https://doi.org/10.1016/j.ejor.2020.03.054

Atwood, J., and Shaik, S. (2018). Quantile DEA: Estimating qDEA-alpha Efficiency Estimates with Conventional Linear Programming. In *Productivity and Inequality*. Springer Press. https://doi.org/10.1007/978-3-319-68678-3_4

Simar, L., and Wilson, P.W. (2011). Two-stage DEA: caveat emptor. *Journal of Productivity Analysis*, 36:205-218.

## Getting Help

For questions or issues with the qDEA package:

- Email: jatwood@montana.edu
- Report bugs or request features via your package repository

## Session Information

```{r session-info}
sessionInfo()
```