Subsample table in pepr

Michal Stolarczyk & Nathan Sheffield

2023-11-21

Learn sample subannotations in pepr

This vignette will show you how and why to use the subsample table functionality of the pepr package.

Problem/Goal

The examples below demonstrate how and why to use the sample subannotation functionality to provide multiple input files of the same type for a single sample.

Solutions

Example 1: basic sample subannotation table

This example demonstrates the basic use of a sample subannotation table. Two samples (frog_1 and frog_2) have multiple input files that need merging, while one sample (frog_3) does not. Therefore, frog_3 takes its file value from the sample_table.csv file, while the values for the other two samples come from the several files listed in the subsample_table.csv file. In the tables shown below, every row of sample_table.csv carries the placeholder value multi in the file column; it is replaced for frog_1 and frog_2 and kept for frog_3.

This example is made up of these components:

  • Project config file:
   pep_version: 2.0.0
   sample_table: sample_table.csv
   subsample_table: subsample_table.csv
   looper:
      output_dir: $HOME/example_results
  • Sample table:
    sample_name protocol file
    frog_1 anySampleType multi
    frog_2 anySampleType multi
    frog_3 anySampleType multi
  • Subsample table:
    sample_name subsample_name file
    frog_1 sub_a data/frog1a_data.txt
    frog_1 sub_b data/frog1b_data.txt
    frog_1 sub_c data/frog1c_data.txt
    frog_2 sub_a data/frog2a_data.txt
    frog_2 sub_b data/frog2b_data.txt
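
To make the merge concrete, here is a minimal base R sketch of the idea: subsample rows are collapsed into one list element per sample. It is an illustration only, not pepr's actual implementation; the data frames are typed in by hand from the tables above, and the names sampleTab, subTab, filesBySample and merged are made up for the example.

```r
# Hand-typed copies of the two tables above (illustration only)
sampleTab = data.frame(
  sample_name = c("frog_1", "frog_2", "frog_3"),
  protocol = "anySampleType",
  file = "multi",
  stringsAsFactors = FALSE
)
subTab = data.frame(
  sample_name = c("frog_1", "frog_1", "frog_1", "frog_2", "frog_2"),
  subsample_name = c("sub_a", "sub_b", "sub_c", "sub_a", "sub_b"),
  file = c(
    "data/frog1a_data.txt", "data/frog1b_data.txt", "data/frog1c_data.txt",
    "data/frog2a_data.txt", "data/frog2b_data.txt"
  ),
  stringsAsFactors = FALSE
)

# Group the subsample files by sample name
filesBySample = split(subTab$file, subTab$sample_name)

# Samples with subsample rows get all their files; the rest keep their own value
merged = lapply(seq_len(nrow(sampleTab)), function(i) {
  name = sampleTab$sample_name[i]
  if (name %in% names(filesBySample)) filesBySample[[name]] else sampleTab$file[i]
})
names(merged) = sampleTab$sample_name
```

In this sketch, merged$frog_1 holds three paths, merged$frog_2 holds two, and merged$frog_3 keeps the single value from the sample table.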

Let’s create the Project object and see if multiple files are present:

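library(pepr)
# `branch` is not defined in this excerpt; "master" matches the path in the output below
branch = "master"
projectConfig1 = system.file(
  "extdata",
  paste0("example_peps-", branch),
  "example_subtable1",
  "project_config.yaml",
  package = "pepr"
)
p1 = Project(projectConfig1)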
#> Loading config file: /tmp/RtmpoymTo9/Rinstb3055bff7/pepr/extdata/example_peps-master/example_subtable1/project_config.yaml
# Check the files
p1Samples = sampleTable(p1)
p1Samples$file
#> [[1]]
#> [1] "data/frog1a_data.txt" "data/frog1b_data.txt" "data/frog1c_data.txt"
#> 
#> [[2]]
#> [1] "data/frog2a_data.txt" "data/frog2b_data.txt"
#> 
#> [[3]]
#> [1] "multi"
# Check the subsample names
p1Samples$subsample_name
#> [[1]]
#> [1] "sub_a" "sub_b" "sub_c"
#> 
#> [[2]]
#> [1] "sub_a" "sub_b"
#> 
#> [[3]]
#> NULL

And inspect the whole table in the p1@samples slot:

sample_name  protocol       file                                                              subsample_name
frog_1       anySampleType  data/frog1a_data.txt, data/frog1b_data.txt, data/frog1c_data.txt  sub_a, sub_b, sub_c
frog_2       anySampleType  data/frog2a_data.txt, data/frog2b_data.txt                        sub_a, sub_b
frog_3       anySampleType  multi                                                             NULL
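
Because file and subsample_name are list columns, each element can hold a different number of values. The base R sketch below rebuilds a small stand-in for the table by hand (it does not use the Project object, and the names tab, nFiles and long are made up for the example). It shows two common ways to work with such a column: counting entries per sample with lengths() and flattening to one row per file.

```r
# Stand-in for the merged table, with `file` stored as a list column
tab = data.frame(
  sample_name = c("frog_1", "frog_2", "frog_3"),
  stringsAsFactors = FALSE
)
tab$file = list(
  c("data/frog1a_data.txt", "data/frog1b_data.txt", "data/frog1c_data.txt"),
  c("data/frog2a_data.txt", "data/frog2b_data.txt"),
  "multi"
)

# Number of file entries attached to each sample
nFiles = lengths(tab$file)

# One row per (sample, file) pair
long = data.frame(
  sample_name = rep(tab$sample_name, nFiles),
  file = unlist(tab$file),
  stringsAsFactors = FALSE
)
```

The long form is often easier to pass to tools that expect one file per row.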

You can also access a single subsample by calling the getSubsample method with the appropriate combination of sample_name and subsample_name attributes. Note that this is only possible if the subsample_name column is defined in the subsample_table.csv file.

sampleName = "frog_1"
subsampleName = "sub_a"
getSubsample(p1, sampleName, subsampleName)
#>    sample_name      protocol                 file subsample_name
#> 1:      frog_1 anySampleType data/frog1a_data.txt          sub_a

Example 2: subannotations and derived attributes

This example uses a subsample_table.csv file and derived attributes to point to files. This is a rather complex example. Notice we must include the file_id column in the sample_table.csv file, and leave it blank; it is then populated for just some of the samples (frog_1 and frog_2) from the subsample_table.csv file, but is left empty for the samples that are not merged.

This example is made up of these components:

  • Project config file:
   pep_version: 2.0.0
   sample_table: sample_table.csv
   subsample_table: subsample_table.csv
   looper:
      output_dir: $HOME/hello_looper_results
      pipeline_interfaces: ../pipeline/pipeline_interface.yaml
   sample_modifiers:
      derive:
          attributes: file
          sources:
              local_files: ../data/{identifier}{file_id}_data.txt
              local_files_unmerged: ../data/{identifier}_data.txt
  • Sample annotation table:
    sample_name protocol identifier file
    frog_1 anySampleType frog1 local_files
    frog_2 anySampleType frog2 local_files
    frog_3 anySampleType frog3 local_files_unmerged
    frog_4 anySampleType frog4 local_files_unmerged
  • Sample subannotation table:
    sample_name file_id subsample_name
    frog_1 a a
    frog_1 b b
    frog_1 c c
    frog_2 a a
    frog_2 b b
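
The sources entries in the config are templates whose {placeholders} are named after sample attributes. As a rough illustration of how such a template could be filled in, one subsample row at a time, here is a base R sketch. The helper fillTemplate is hypothetical and written only for this example; it is not part of pepr and does not reproduce pepr's own derivation code.

```r
# Hypothetical helper: replace each {name} placeholder with the matching attribute
fillTemplate = function(template, attrs) {
  for (key in names(attrs)) {
    template = gsub(paste0("{", key, "}"), attrs[[key]], template, fixed = TRUE)
  }
  template
}

# The two source templates from the config above
sources = list(
  local_files = "../data/{identifier}{file_id}_data.txt",
  local_files_unmerged = "../data/{identifier}_data.txt"
)

# frog_1 has three subsample rows, so file_id takes three values
frog1Files = vapply(
  c("a", "b", "c"),
  function(id) {
    fillTemplate(sources$local_files, list(identifier = "frog1", file_id = id))
  },
  character(1),
  USE.NAMES = FALSE
)

# frog_3 has no subsample rows and uses the template without file_id
frog3File = fillTemplate(sources$local_files_unmerged, list(identifier = "frog3"))
```

Here frog1Files contains one path per file_id value, and frog3File is the single path "../data/frog3_data.txt".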

Let’s load the project config, create the Project object and see if multiple files are present:

projectConfig2 = system.file(
  "extdata",
  paste0("example_peps-", branch),
  "example_subtable2",
  "project_config.yaml",
  package = "pepr"
)
p2 = Project(projectConfig2)
#> Loading config file: /tmp/RtmpoymTo9/Rinstb3055bff7/pepr/extdata/example_peps-master/example_subtable2/project_config.yaml
#> Warning in `[<-.data.frame`(x, i, j, value): replacement element 1 has 3 rows
#> to replace 1 rows
#> Warning in `[<-.data.frame`(x, i, j, value): replacement element 1 has 2 rows
#> to replace 1 rows
# Check the files
p2Samples = sampleTable(p2)
p2Samples$file
#> [[1]]
#> [1] "../data/frog1a_data.txt"
#> 
#> [[2]]
#> [1] "../data/frog2a_data.txt"
#> 
#> [[3]]
#> [1] "../data/frog3_data.txt"
#> 
#> [[4]]
#> [1] "../data/frog4_data.txt"

And inspect the whole table in the p2@samples slot:

sample_name  protocol       identifier  file                     file_id  subsample_name
frog_1       anySampleType  frog1       ../data/frog1a_data.txt  a, b, c  a, b, c
frog_2       anySampleType  frog2       ../data/frog2a_data.txt  a, b     a, b
frog_3       anySampleType  frog3       ../data/frog3_data.txt   NULL     NULL
frog_4       anySampleType  frog4       ../data/frog4_data.txt   NULL     NULL

Example 3: subannotations and expansion characters

This example gives the exact same results as Example 2, but in this case, uses a wildcard for frog_2 instead of including it in the subsample_table.csv file. Since we can’t use a wildcard and a subannotation for the same sample, this necessitates specifying a second data source class (local_files_unmerged) that uses an asterisk (*). The outcome is the same.

This example is made up of these components:

  • Project config file:
   pep_version: 2.0.0
   sample_table: sample_table.csv
   subsample_table: subsample_table.csv
   looper:
      output_dir: $HOME/hello_looper_results
      pipeline_interfaces: ../pipeline/pipeline_interface.yaml
   sample_modifiers:
      derive:
          attributes: file
          sources:
              local_files: ../data/{identifier}{file_id}_data.txt
              local_files_unmerged: ../data/{identifier}*_data.txt
  • Sample annotation table:
    sample_name protocol identifier file file_id
    frog_1 anySampleType frog1 local_files NA
    frog_2 anySampleType frog2 local_files_unmerged NA
    frog_3 anySampleType frog3 local_files_unmerged NA
    frog_4 anySampleType frog4 local_files_unmerged NA
  • Sample subannotation table:
    sample_name file_id
    frog_1 a
    frog_1 b
    frog_1 c

Let’s load the project config, create the Project object and see if multiple files are present:

projectConfig3 = system.file(
  "extdata",
  paste0("example_peps-", branch),
  "example_subtable3",
  "project_config.yaml",
  package = "pepr"
)
p3 = Project(projectConfig3)
#> Loading config file: /tmp/RtmpoymTo9/Rinstb3055bff7/pepr/extdata/example_peps-master/example_subtable3/project_config.yaml
#> Warning in `[<-.data.frame`(x, i, j, value): replacement element 1 has 3 rows
#> to replace 1 rows
# Check the files
p3Samples = sampleTable(p3)
p3Samples$file
#> [[1]]
#> [1] "../data/frog1a_data.txt"
#> 
#> [[2]]
#> [1] "../data/frog2*_data.txt"
#> 
#> [[3]]
#> [1] "../data/frog3*_data.txt"
#> 
#> [[4]]
#> [1] "../data/frog4*_data.txt"

And inspect the whole table in the p3@samples slot:

sample_name  protocol       identifier  file                     file_id
frog_1       anySampleType  frog1       ../data/frog1a_data.txt  a, b, c
frog_2       anySampleType  frog2       ../data/frog2*_data.txt
frog_3       anySampleType  frog3       ../data/frog3*_data.txt
frog_4       anySampleType  frog4       ../data/frog4*_data.txt
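
In the output above, the file values for frog_2, frog_3 and frog_4 still contain the literal asterisk, so whatever consumes these paths has to expand the pattern itself. One way to do that in base R is Sys.glob(). The sketch below is an assumption about how a downstream step might handle it, not something pepr does for you; dataDir and the files in it are created only for this demo.

```r
# Create a throwaway directory with files, two of which match the frog2 pattern
dataDir = file.path(tempdir(), "glob_demo")
dir.create(dataDir, showWarnings = FALSE)
file.create(file.path(dataDir, c("frog2a_data.txt", "frog2b_data.txt", "frog3_data.txt")))

# Expand the wildcard into the matching file paths
pattern = file.path(dataDir, "frog2*_data.txt")
frog2Files = sort(basename(Sys.glob(pattern)))
```

With these demo files, frog2Files contains frog2a_data.txt and frog2b_data.txt, while frog3_data.txt is not matched.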

Example 4: subannotations and multiple (separate-class) inputs

Merging is for same-class inputs (for example, multiple files for read1). Different-class inputs (such as read1 vs. read2) are handled by different attributes (or columns). This example shows how to handle paired-end data while also merging multiple files within each read.

This example is made up of these components:

  • Project config file:
   pep_version: 2.0.0
   sample_table: sample_table.csv
   subsample_table: subsample_table.csv
   looper:
      output_dir: $HOME/hello_looper_results
      pipeline_interfaces: ../pipeline/pipeline_interface.yaml
  • Sample annotation table:
    sample_name protocol
    frog_1 anySampleType
    frog_2 anySampleType
    frog_3 anySampleType
    frog_4 anySampleType
  • Sample subannotation table:
    sample_name read1 read2
    frog_1 frog1a_data.txt frog1a_data2.txt
    frog_1 frog1b_data.txt frog1b_data2.txt
    frog_1 frog1c_data.txt frog1b_data2.txt

Let’s load the project config, create the Project object and see if multiple files are present:

projectConfig4 = system.file(
  "extdata",
  paste0("example_peps-", branch),
  "example_subtable4",
  "project_config.yaml",
  package = "pepr"
)
p4 = Project(projectConfig4)
#> Loading config file: /tmp/RtmpoymTo9/Rinstb3055bff7/pepr/extdata/example_peps-master/example_subtable4/project_config.yaml
# Check the read1 and read2 columns
p4Samples = sampleTable(p4)
p4Samples$read1
#> [[1]]
#> [1] "frog1a_data.txt" "frog1b_data.txt" "frog1c_data.txt"
#> 
#> [[2]]
#> NULL
#> 
#> [[3]]
#> NULL
#> 
#> [[4]]
#> NULL
p4Samples$read2
#> [[1]]
#> [1] "frog1a_data2.txt" "frog1b_data2.txt" "frog1b_data2.txt"
#> 
#> [[2]]
#> NULL
#> 
#> [[3]]
#> NULL
#> 
#> [[4]]
#> NULL

And inspect the whole table in the p4@samples slot:

sample_name  protocol       read1                                              read2
frog_1       anySampleType  frog1a_data.txt, frog1b_data.txt, frog1c_data.txt  frog1a_data2.txt, frog1b_data2.txt, frog1b_data2.txt
frog_2       anySampleType  NULL                                               NULL
frog_3       anySampleType  NULL                                               NULL
frog_4       anySampleType  NULL                                               NULL
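
Since read1 and read2 are filled from the same subsample rows, their entries line up by position: the first read1 file belongs with the first read2 file, and so on. A minimal base R sketch of pairing them is shown below. The vectors are copied by hand from the frog_1 output above rather than taken from the Project object, and the name pairs is made up for the example.

```r
# Stand-ins for the frog_1 entries of the read1 and read2 list columns
read1 = c("frog1a_data.txt", "frog1b_data.txt", "frog1c_data.txt")
read2 = c("frog1a_data2.txt", "frog1b_data2.txt", "frog1b_data2.txt")

# Both columns come from the same subsample rows, so the lengths must agree
stopifnot(length(read1) == length(read2))

# One row per subsample: the i-th read1 file next to the i-th read2 file
pairs = data.frame(read1 = read1, read2 = read2, stringsAsFactors = FALSE)
```

Each row of pairs can then be handed to a paired-end step as one unit.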