Introduction to the ctrialsgov Package

Taylor Arnold and Michael J Kane

This vignette gives a very brief overview of the current package. To start, we load the package into R.

library(ctrialsgov)

In the next few sections, we see how to set up the data set, query it, and then visualize the output.

Create the Data

Before querying the ClinicalTrials.gov data, we need to load a pre-processed version of the data into R. There are three ways to do this. If you have installed a copy of the data set locally in PostgreSQL, the data can be created from scratch with the following block of code (it will take a couple of minutes to finish):

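library(DBI)
library(RPostgreSQL)

# Connect to the local PostgreSQL database (here named "aact")
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "aact")

# Build the pre-processed data set from the database
ctgov_create_data(con)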

Alternatively, we can download a static version of the data from GitHub and load this into R without needing to set up a local version of the database. The download is cached locally so that it can be re-loaded without downloading each time. To download and load this data, use the following:

ctgov_load_cache()

Finally, we can load a small sample dataset (2% of the total) that is included with the package itself using the following:

ctgov_load_sample()

This is the version of the data that is used in most of the tests, examples, and in this vignette.

Querying the Data

The primary function for querying the dataset is called ctgov_query. It can be called after using any of the functions in the previous section. We will see a few examples of how the function works here; see the help pages for a complete list of options.

There are a number of fields in the data that use exact matches of categories. Here, for example, we find the interventional studies:

ctgov_query(study_type = "Interventional")
## # A tibble: 2,403 × 32
##    nct_id      start_date phase           enrollment brief_title  official_title
##    <chr>       <date>     <chr>                <int> <chr>        <chr>         
##  1 NCT04999163 2021-12-31 N/A                     50 Aortix Ther… Aortix Therap…
##  2 NCT05002153 2021-11-30 N/A                    300 The Role of… The Role of M…
##  3 NCT04472702 2021-11-30 N/A                     45 Fluoroscopi… Fluoroscopic …
##  4 NCT05032157 2021-11-30 Phase 3                450 A Phase 3 S… A Multicenter…
##  5 NCT04471142 2021-11-08 N/A                    270 Effectivene… Effectiveness…
##  6 NCT04772651 2021-11-01 N/A                    108 Mediterrane… Mediterranean…
##  7 NCT04390451 2021-11-01 Phase 1                 54 Initial Tes… Initial Testi…
##  8 NCT04696861 2021-11-01 N/A                     60 Telehealth … Telehealth to…
##  9 NCT03954431 2021-10-31 Phase 1/Phase 2        100 High-Resolu… Study of High…
## 10 NCT04273022 2021-10-31 N/A                     20 Effect of E… The Effect of…
## # … with 2,393 more rows, and 26 more variables:
## #   primary_completion_date <date>, study_type <chr>, rec_status <chr>,
## #   completion_date <date>, last_update <date>, description <chr>,
## #   eudract_num <chr>, other_id <chr>, allocation <chr>,
## #   intervention_model <chr>, observational_model <chr>, primary_purpose <chr>,
## #   time_perspective <chr>, masking_description <chr>,
## #   intervention_model_description <chr>, sampling_method <chr>, …

Or, all of the interventional studies that have a primary industry sponsor:

ctgov_query(study_type = "Interventional", sponsor_type = "Industry")
## # A tibble: 640 × 32
##    nct_id      start_date phase           enrollment brief_title  official_title
##    <chr>       <date>     <chr>                <int> <chr>        <chr>         
##  1 NCT04999163 2021-12-31 N/A                     50 Aortix Ther… Aortix Therap…
##  2 NCT05032157 2021-11-30 Phase 3                450 A Phase 3 S… A Multicenter…
##  3 NCT05029856 2021-10-04 Phase 1/Phase 2        240 Evaluation … A Randomized,…
##  4 NCT04963179 2021-09-30 N/A                    154 PREvention … PREvention of…
##  5 NCT04875975 2021-09-30 Phase 2                 68 A Study to … A Randomized,…
##  6 NCT04909879 2021-09-30 Phase 2                100 Study of Al… Treatment of …
##  7 NCT04925674 2021-09-29 Phase 1                 60 Study of HE… Phase Ic Clin…
##  8 NCT04935177 2021-09-17 Phase 3                 64 Renal Funct… An Open-label…
##  9 NCT04956744 2021-08-31 Phase 1                 30 A Study to … A Phase 1, Do…
## 10 NCT04920253 2021-08-31 N/A                    180 Real World … Real World Ev…
## # … with 630 more rows, and 26 more variables: primary_completion_date <date>,
## #   study_type <chr>, rec_status <chr>, completion_date <date>,
## #   last_update <date>, description <chr>, eudract_num <chr>, other_id <chr>,
## #   allocation <chr>, intervention_model <chr>, observational_model <chr>,
## #   primary_purpose <chr>, time_perspective <chr>, masking_description <chr>,
## #   intervention_model_description <chr>, sampling_method <chr>, gender <chr>,
## #   minimum_age <dbl>, maximum_age <dbl>, population <chr>, criteria <chr>, …

A few fields have continuous values that can be searched by giving a vector with two values. The query returns the studies whose value falls between the lower bound (first value) and the upper bound (second value), with both bounds included. Here, we find the studies that have between 40 and 42 patients enrolled in them:

ctgov_query(enrollment_range = c(40, 42))
## # A tibble: 125 × 32
##    nct_id      start_date phase           enrollment brief_title  official_title
##    <chr>       <date>     <chr>                <int> <chr>        <chr>         
##  1 NCT04188119 2021-09-30 Phase 2                 42 A Proof of … A Proof of Co…
##  2 NCT04992975 2021-08-31 <NA>                    40 Brain Iron … Brain Iron To…
##  3 NCT05001854 2021-08-31 Phase 2/Phase 3         40 Hemodynamic… Evaluation of…
##  4 NCT04749355 2021-08-14 Phase 2                 40 Phase 2, Op… A Phase 2, Op…
##  5 NCT04648319 2021-04-15 Phase 2                 40 A Study of … A Pilot Study…
##  6 NCT04744779 2021-03-31 N/A                     40 Office Base… Effectiveness…
##  7 NCT04841174 2021-03-30 N/A                     40 The Effect … The Effect of…
##  8 NCT04808180 2021-03-25 N/A                     40 Clinical Ef… Effects of Bi…
##  9 NCT04746105 2021-02-24 Phase 1                 40 A Clinical … A Study to Ev…
## 10 NCT04355780 2021-01-08 <NA>                    40 Immunologic… Immunologic F…
## # … with 115 more rows, and 26 more variables: primary_completion_date <date>,
## #   study_type <chr>, rec_status <chr>, completion_date <date>,
## #   last_update <date>, description <chr>, eudract_num <chr>, other_id <chr>,
## #   allocation <chr>, intervention_model <chr>, observational_model <chr>,
## #   primary_purpose <chr>, time_perspective <chr>, masking_description <chr>,
## #   intervention_model_description <chr>, sampling_method <chr>, gender <chr>,
## #   minimum_age <dbl>, maximum_age <dbl>, population <chr>, criteria <chr>, …

Setting one end of the range to NA leaves that end of the range unbounded. For example, the following finds any studies with 1000 or more patients:

ctgov_query(enrollment_range = c(1000, NA))
## # A tibble: 204 × 32
##    nct_id      start_date phase   enrollment brief_title      official_title    
##    <chr>       <date>     <chr>        <int> <chr>            <chr>             
##  1 NCT05033782 2021-12-01 <NA>          1500 Impact of the M… Impact of the Mod…
##  2 NCT05033548 2021-10-10 <NA>          4000 Technology Enab… Technology Enable…
##  3 NCT04982614 2021-10-01 Phase 4       1400 HPV Vaccination… A Multi-site, Ope…
##  4 NCT05033678 2021-08-16 <NA>          8000 Implementation … Teledermoscopy an…
##  5 NCT04917185 2021-06-30 N/A           1000 EA for PAAS: A … Electro-acupunctu…
##  6 NCT04839757 2021-06-03 <NA>          1400 Dengue Vaccine … Preparing for the…
##  7 NCT04889924 2021-06-01 N/A           1666 ALND vs RDT in … Axillary Lymph No…
##  8 NCT04472845 2021-03-30 N/A           1018 HYPofractionate… HYPofractionated …
##  9 NCT04735744 2021-02-15 <NA>          1315 Evaluation of A… Evaluation of All…
## 10 NCT04626973 2021-01-15 N/A           3048 Effects of Ezet… Effects of Ezetim…
## # … with 194 more rows, and 26 more variables: primary_completion_date <date>,
## #   study_type <chr>, rec_status <chr>, completion_date <date>,
## #   last_update <date>, description <chr>, eudract_num <chr>, other_id <chr>,
## #   allocation <chr>, intervention_model <chr>, observational_model <chr>,
## #   primary_purpose <chr>, time_perspective <chr>, masking_description <chr>,
## #   intervention_model_description <chr>, sampling_method <chr>, gender <chr>,
## #   minimum_age <dbl>, maximum_age <dbl>, population <chr>, criteria <chr>, …
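The range behavior shown above can be sketched in base R. This is only an illustration of the idea, not the package's implementation: the helper in_range and the toy enrollment vector are made up for this example, and treating missing enrollment values as non-matching is an assumption.

```r
# Hypothetical helper (not part of ctrialsgov): inclusive range test in which
# an NA bound means "no limit on this side".
in_range <- function(x, bounds) {
  stopifnot(length(bounds) == 2)
  lower_ok <- if (is.na(bounds[1])) rep(TRUE, length(x)) else x >= bounds[1]
  upper_ok <- if (is.na(bounds[2])) rep(TRUE, length(x)) else x <= bounds[2]
  # Assumption: entries with a missing value are treated as non-matching.
  !is.na(x) & lower_ok & upper_ok
}

# Toy enrollment counts, invented for illustration
enrollment <- c(20, 40, 41, 42, 50, 1000, 3048, NA)

enrollment[in_range(enrollment, c(40, 42))]    # keeps 40, 41, 42
enrollment[in_range(enrollment, c(1000, NA))]  # keeps 1000, 3048
```

Using NA for an open bound keeps the argument a fixed-length vector of two values, so the same argument can express closed and half-open ranges.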

Similarly, we can give a range of dates. These are given in the form of strings as “YYYY-MM-DD”:

ctgov_query(date_range = c("2020-01-01", "2020-02-01"))
## # A tibble: 34 × 32
##    nct_id      start_date phase           enrollment brief_title  official_title
##    <chr>       <date>     <chr>                <int> <chr>        <chr>         
##  1 NCT04224597 2020-02-01 <NA>                    48 Evaluation … Evaluation of…
##  2 NCT04255524 2020-02-01 N/A                    200 Choroidal C… OCTA to Quant…
##  3 NCT04336605 2020-02-01 <NA>                 25000 Killing Pai… Killing Pain …
##  4 NCT04218669 2020-02-01 N/A                    105 The Approac… A Clinical Ra…
##  5 NCT04409613 2020-02-01 N/A                     59 Cost-Effect… Cost-Effectiv…
##  6 NCT04424576 2020-01-31 <NA>                    60 Ovarian Mor… Trajectory of…
##  7 NCT04115397 2020-01-31 Phase 4                 80 Bisphosphon… Towards Effic…
##  8 NCT04497064 2020-01-30 <NA>                   585 Breakfast K… Breakfast Kno…
##  9 NCT03892785 2020-01-27 Phase 3                200 MEthotrexat… MEthotrexate …
## 10 NCT03710122 2020-01-23 Phase 2/Phase 3        102 Vancomycin … A Prospective…
## # … with 24 more rows, and 26 more variables: primary_completion_date <date>,
## #   study_type <chr>, rec_status <chr>, completion_date <date>,
## #   last_update <date>, description <chr>, eudract_num <chr>, other_id <chr>,
## #   allocation <chr>, intervention_model <chr>, observational_model <chr>,
## #   primary_purpose <chr>, time_perspective <chr>, masking_description <chr>,
## #   intervention_model_description <chr>, sampling_method <chr>, gender <chr>,
## #   minimum_age <dbl>, maximum_age <dbl>, population <chr>, criteria <chr>, …
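As a rough sketch of why the "YYYY-MM-DD" form is convenient, strings in that format convert directly to Date objects in base R and can then be compared like numbers. The toy dates below are invented, and the assumption that both bounds are inclusive is inferred only from the output above (which includes studies starting on 2020-02-01); it is not a statement about the package's internals.

```r
# Toy start dates, invented for illustration
start_date <- as.Date(c("2019-12-31", "2020-01-01", "2020-01-23",
                        "2020-02-01", "2020-02-02"))

# "YYYY-MM-DD" strings are parsed by as.Date() without a format argument
bounds <- as.Date(c("2020-01-01", "2020-02-01"))

# Assumption: both ends of the date range are inclusive
in_window <- start_date >= bounds[1] & start_date <= bounds[2]

format(start_date[in_window])
```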

Finally, we can also search free text fields using keywords. The following, for example, finds any study that includes the phrase “lung cancer” (ignoring case) in the description field:

ctgov_query(description_kw = "lung cancer")
## # A tibble: 59 × 32
##    nct_id      start_date phase   enrollment brief_title      official_title    
##    <chr>       <date>     <chr>        <int> <chr>            <chr>             
##  1 NCT04814056 2021-06-01 Phase 4         15 To Evaluate the… An Open-Labeled, …
##  2 NCT04629027 2021-03-03 <NA>            80 Evaluation Syst… Establishment of …
##  3 NCT04179305 2020-10-25 N/A             58 Giving Informat… Giving Informatio…
##  4 NCT04452877 2020-08-19 Phase 2         20 A Study of Dabr… An Open-Label, Si…
##  5 NCT04422392 2020-07-13 Phase 2        107 Neoadjuvant PD-… Neoadjuvant PD-1 …
##  6 NCT04120454 2020-03-16 Phase 2         34 Ramucirumab and… An Investigator-S…
##  7 NCT04332367 2019-12-19 Phase 2         59 Carboplatin, Ta… Phase II, Single-…
##  8 NCT04309955 2019-12-01 N/A             60 Modified Versus… Randomized Clinic…
##  9 NCT04151940 2019-09-26 <NA>            40 PET/CT Changes … An Observational …
## 10 NCT04081688 2019-08-21 Phase 1         15 Atezolizumab an… A Phase I Trial o…
## # … with 49 more rows, and 26 more variables: primary_completion_date <date>,
## #   study_type <chr>, rec_status <chr>, completion_date <date>,
## #   last_update <date>, description <chr>, eudract_num <chr>, other_id <chr>,
## #   allocation <chr>, intervention_model <chr>, observational_model <chr>,
## #   primary_purpose <chr>, time_perspective <chr>, masking_description <chr>,
## #   intervention_model_description <chr>, sampling_method <chr>, gender <chr>,
## #   minimum_age <dbl>, maximum_age <dbl>, population <chr>, criteria <chr>, …

We can search for two terms at once as well. By default, the query returns studies that match at least one of the terms:

ctgov_query(description_kw = c("lung cancer", "colon cancer"))
## # A tibble: 59 × 32
##    nct_id      start_date phase   enrollment brief_title      official_title    
##    <chr>       <date>     <chr>        <int> <chr>            <chr>             
##  1 NCT04814056 2021-06-01 Phase 4         15 To Evaluate the… An Open-Labeled, …
##  2 NCT04629027 2021-03-03 <NA>            80 Evaluation Syst… Establishment of …
##  3 NCT04179305 2020-10-25 N/A             58 Giving Informat… Giving Informatio…
##  4 NCT04452877 2020-08-19 Phase 2         20 A Study of Dabr… An Open-Label, Si…
##  5 NCT04422392 2020-07-13 Phase 2        107 Neoadjuvant PD-… Neoadjuvant PD-1 …
##  6 NCT04120454 2020-03-16 Phase 2         34 Ramucirumab and… An Investigator-S…
##  7 NCT04332367 2019-12-19 Phase 2         59 Carboplatin, Ta… Phase II, Single-…
##  8 NCT04309955 2019-12-01 N/A             60 Modified Versus… Randomized Clinic…
##  9 NCT04151940 2019-09-26 <NA>            40 PET/CT Changes … An Observational …
## 10 NCT04081688 2019-08-21 Phase 1         15 Atezolizumab an… A Phase I Trial o…
## # … with 49 more rows, and 26 more variables: primary_completion_date <date>,
## #   study_type <chr>, rec_status <chr>, completion_date <date>,
## #   last_update <date>, description <chr>, eudract_num <chr>, other_id <chr>,
## #   allocation <chr>, intervention_model <chr>, observational_model <chr>,
## #   primary_purpose <chr>, time_perspective <chr>, masking_description <chr>,
## #   intervention_model_description <chr>, sampling_method <chr>, gender <chr>,
## #   minimum_age <dbl>, maximum_age <dbl>, population <chr>, criteria <chr>, …

Setting the match_all flag to TRUE instead requires a study to match both terms at the same time (here, that returns no matches):

ctgov_query(description_kw = c("lung cancer", "colon cancer"), match_all = TRUE)
## # A tibble: 0 × 32
## # … with 32 variables: nct_id <chr>, start_date <date>, phase <chr>,
## #   enrollment <int>, brief_title <chr>, official_title <chr>,
## #   primary_completion_date <date>, study_type <chr>, rec_status <chr>,
## #   completion_date <date>, last_update <date>, description <chr>,
## #   eudract_num <chr>, other_id <chr>, allocation <chr>,
## #   intervention_model <chr>, observational_model <chr>, primary_purpose <chr>,
## #   time_perspective <chr>, masking_description <chr>, …
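The difference between matching any keyword and matching all keywords can be illustrated with a small base R sketch. The helper kw_match and the toy descriptions are hypothetical, and matching keywords as literal, case-insensitive substrings is an assumption made for this example; the package may match keywords differently.

```r
# Hypothetical helper (not part of ctrialsgov): case-insensitive keyword search
# that returns TRUE for texts matching any keyword, or all keywords.
kw_match <- function(text, keywords, match_all = FALSE) {
  # One column of TRUE/FALSE per keyword; fixed = TRUE matches literal strings.
  hits <- vapply(
    keywords,
    function(kw) grepl(tolower(kw), tolower(text), fixed = TRUE),
    logical(length(text))
  )
  hits <- matrix(hits, nrow = length(text))
  if (match_all) rowSums(hits) == ncol(hits) else rowSums(hits) > 0
}

# Toy descriptions, invented for illustration
description <- c(
  "A trial in Lung Cancer patients",
  "Colon cancer screening",
  "Lung cancer and colon cancer registry",
  "Diabetes management"
)
kw <- c("lung cancer", "colon cancer")

kw_match(description, kw)                    # TRUE TRUE TRUE FALSE
kw_match(description, kw, match_all = TRUE)  # FALSE FALSE TRUE FALSE
```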

Other keyword fields include official_title_kw, source_kw and criteria_kw.

Any of the options can be combined as needed:

ctgov_query(
  description_kw = "cancer",
  enrollment_range = c(100, 200),
  date_range = c("2019-01-01", "2020-02-01")
)
## # A tibble: 5 × 32
##   nct_id      start_date phase   enrollment brief_title official_title primary_complet…
##   <chr>       <date>     <chr>        <int> <chr>       <chr>          <date>          
## 1 NCT04035447 2020-01-22 N/A            120 Symptom Ma… Improving Sym… 2025-10-01      
## 2 NCT04227327 2020-01-07 Phase 2        121 Study Eval… A Phase 2, Op… 2023-07-31      
## 3 NCT04404244 2020-01-01 <NA>           100 Extraordin… Extraordinary… 2022-01-01      
## 4 NCT03902600 2019-03-12 <NA>           115 Moderately… Moderately Hy… 2022-05-31      
## 5 NCT03813953 2019-02-20 N/A            160 The Effect… The Effect of… 2019-05-31      
## # … with 25 more variables: study_type <chr>, rec_status <chr>,
## #   completion_date <date>, last_update <date>, description <chr>,
## #   eudract_num <chr>, other_id <chr>, allocation <chr>,
## #   intervention_model <chr>, observational_model <chr>, primary_purpose <chr>,
## #   time_perspective <chr>, masking_description <chr>,
## #   intervention_model_description <chr>, sampling_method <chr>, gender <chr>,
## #   minimum_age <dbl>, maximum_age <dbl>, population <chr>, criteria <chr>, …

Finally, we can also pass an existing version of the data set (such as the result of an earlier query) to the query function, rather than starting with the full data set. This is useful when you want to combine queries in a more complex way. For example, this is equivalent to the above:

library(dplyr)

ctgov_query() %>%
  ctgov_query(description_kw = "cancer") %>%
  ctgov_query(enrollment_range = c(100, 200)) %>%
  ctgov_query(date_range = c("2019-01-01", "2020-02-01"))
## # A tibble: 5 × 32
##   nct_id      start_date phase   enrollment brief_title official_title primary_complet…
##   <chr>       <date>     <chr>        <int> <chr>       <chr>          <date>          
## 1 NCT04035447 2020-01-22 N/A            120 Symptom Ma… Improving Sym… 2025-10-01      
## 2 NCT04227327 2020-01-07 Phase 2        121 Study Eval… A Phase 2, Op… 2023-07-31      
## 3 NCT04404244 2020-01-01 <NA>           100 Extraordin… Extraordinary… 2022-01-01      
## 4 NCT03902600 2019-03-12 <NA>           115 Moderately… Moderately Hy… 2022-05-31      
## 5 NCT03813953 2019-02-20 N/A            160 The Effect… The Effect of… 2019-05-31      
## # … with 25 more variables: study_type <chr>, rec_status <chr>,
## #   completion_date <date>, last_update <date>, description <chr>,
## #   eudract_num <chr>, other_id <chr>, allocation <chr>,
## #   intervention_model <chr>, observational_model <chr>, primary_purpose <chr>,
## #   time_perspective <chr>, masking_description <chr>,
## #   intervention_model_description <chr>, sampling_method <chr>, gender <chr>,
## #   minimum_age <dbl>, maximum_age <dbl>, population <chr>, criteria <chr>, …
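The equivalence between one combined query and a chain of single-condition queries follows from each step only removing rows. A minimal base R sketch on a made-up data frame shows the idea; the identifiers and values below are invented and the filtering code is not the package's implementation.

```r
# Toy data, invented for illustration (the IDs are not real trial identifiers)
trials <- data.frame(
  nct_id = c("T1", "T2", "T3", "T4"),
  enrollment = c(120, 90, 150, 160),
  start_date = as.Date(c("2020-01-22", "2019-06-01", "2018-05-01", "2019-02-20")),
  description = c("Cancer symptom study", "Cancer screening",
                  "Cancer registry", "Heart failure study")
)
lo <- as.Date("2019-01-01")
hi <- as.Date("2020-02-01")

# All conditions applied at once
combined <- trials[
  grepl("cancer", trials$description, ignore.case = TRUE) &
    trials$enrollment >= 100 & trials$enrollment <= 200 &
    trials$start_date >= lo & trials$start_date <= hi,
]

# The same conditions applied one at a time, each on the previous result
step1 <- trials[grepl("cancer", trials$description, ignore.case = TRUE), ]
step2 <- step1[step1$enrollment >= 100 & step1$enrollment <= 200, ]
chained <- step2[step2$start_date >= lo & step2$start_date <= hi, ]

identical(combined$nct_id, chained$nct_id)  # TRUE
```

Chaining becomes useful when intermediate results need other processing between steps, which a single combined call cannot express.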

Visualizing the Output

The package also contains a number of tools for visualizing the output. Here is one example:

ctgov_query(
  description_kw = "cancer",
  enrollment_range = c(100, 200),
  date_range = c("2019-01-01", "2020-02-01")
) %>%
  ctgov_plot_timeline() +
    ggplot2::theme_minimal()