drugprepr
The Clinical Practice Research Datalink (CPRD) is a UK database of anonymised primary care electronic health records. It is widely used to study the effectiveness and safety of medications. However, prescription data from CPRD are often messy, with systemic issues such as missing information on prescription end dates and medication quantities, and prescribing instructions are typically provided as unstructured free text. In pharmaco-epidemiology studies, the data preparation steps used to determine drug exposure are rarely fully reported, yet assumptions made during this stage can have considerable implications for risk attribution of possible adverse events.
We have previously developed a framework for dealing with missing
information on prescription end dates and other issues such as
overlapping prescription periods. The framework was implemented using
software. Besides being only available for users, the earlier algorithm
did not deal with the problem of free-text prescriptions. The current R
package, drugprepr (formerly drugprepCPRD), builds upon the earlier
algorithm and aims to make the framework available to a wider audience
through its implementation as free, open-source software.
This vignette describes how to use the drugprepr package to transform
CPRD drug data contained in a typical therapy.txt file into information
on individuals’ drug use over time. We will walk through all the steps
needed to perform the transformation. We assume the user is already
familiar with CPRD data and has basic knowledge of R.
CPRD Gold data follow the CPRD Gold specification and are made up of
several tables containing information related to patients. One of the
CPRD tables, called ‘Therapy’, contains the prescription information
for a patient. This table can be linked to the ‘common dosages’ lookup
table to get the text instructions for the prescribed product (i.e. a
drug). The table below, based on the fictional dataset cprd that is
bundled with this package, presents hypothetical prescription data for
two individuals.
patid | pracid | start_date | prodcode | dosageid | text | qty | numdays | dose_duration |
---|---|---|---|---|---|---|---|---|
2156 | 156 | 2011-06-16 | 1 | 1 | TAKE 1 OR 2 4 TIMES/DAY | 40 | 7 | 6 |
2156 | 156 | 2011-06-24 | 1 | 1 | TAKE 1 OR 2 4 TIMES/DAY | 40 | 7 | 6 |
2156 | 156 | 2011-07-10 | 2 | 2 | TAKE 1-2 THREE TIMES A DAY | 24 | 6 | 5 |
2156 | 156 | 2011-07-17 | 2 | 2 | TAKE 1-2 THREE TIMES A DAY | 24 | 6 | 5 |
2156 | 156 | 2011-07-17 | 2 | 2 | TAKE 1-2 THREE TIMES A DAY | 24 | 6 | 5 |
2156 | 156 | 2011-08-16 | 1 | 1 | TAKE 1 OR 2 4 TIMES/DAY | NA | 7 | 6 |
2156 | 156 | 2011-08-24 | 1 | NA | | 40 | 7 | 6 |
2156 | 156 | 2011-09-10 | 2 | 2 | TAKE 1-2 THREE TIMES A DAY | NA | 6 | 5 |
2156 | 156 | 2011-09-17 | 2 | NA | | 24 | 6 | 5 |
2256 | 160 | 2011-06-16 | 1 | 1 | TAKE 1 OR 2 4 TIMES/DAY | 40 | 7 | 6 |
2256 | 160 | 2011-06-24 | 1 | 1 | TAKE 1 OR 2 4 TIMES/DAY | 40 | 7 | 6 |
2256 | 160 | 2011-07-10 | 2 | 2 | TAKE 1-2 THREE TIMES A DAY | 24 | 6 | 5 |
2256 | 160 | 2011-07-17 | 2 | 2 | TAKE 1-2 THREE TIMES A DAY | 24 | 6 | 5 |
2256 | 160 | 2011-07-17 | 2 | 2 | TAKE 1-2 THREE TIMES A DAY | 24 | 6 | 5 |
2256 | 160 | 2011-08-16 | 1 | 1 | TAKE 1 OR 2 4 TIMES/DAY | NA | 7 | 6 |
2256 | 160 | 2011-08-24 | 1 | NA | | 40 | 7 | 6 |
2256 | 160 | 2011-09-10 | 2 | 2 | TAKE 1-2 THREE TIMES A DAY | NA | 6 | 5 |
2256 | 160 | 2011-09-17 | 2 | NA | | 24 | 6 | 5 |
The start_date in the table (the prescription event date) is often used
as the start date of exposure, but the stop date of the exposure is not
available per se and has to be inferred from the available information.
CPRD provides two fields, numdays (number of days) and dose_duration,
which can be used to determine the prescription stop date. However,
these values are often missing, and even when present they give the
researcher no flexibility in computing stop dates for prescriptions
with a variable dose frequency (number of times the prescription is
taken per day) or dose number (e.g. the number of tablets to take at a
time).
As described below, the drugprepr package extracts the dose frequency
and dose number from the text instructions and provides a series of
data processing steps with multiple options to define the start and
stop dates of a given prescription. The package works at the prodcode
level to give granularity for each medication.
The package drugprepr is made up of the following functions that must
be executed sequentially.
You can install the latest development version from GitHub with the
following R commands:
Once the package is installed, we load the drugprepr package into the
working environment as follows.
This step uses another package developed for extraction of dose
frequency and dose number, the R package doseminer, and computes the
numerical daily dose using the formula
$$
\text{ndd} = \frac{\text{DF} \times \text{DN}}{\text{DI}},
$$
where DF is dose frequency (number of doses per day), DN is dose
number (number of units of medication taken per dose), and DI is dose
interval (number of days between doses, where an interval of 1 means
every day). The user must define here which DF and DN values to use
when the free-text prescribing instructions indicate a range of
possible values. Possible choices are min, max, and mean. In the case
of regular prescriptions (prescriptions with fixed instructions such as
‘take 2 tablets 4 times a day’), the min, max, and mean values will be
the same. The numerical daily dose can be computed using the
compute_ndd function as follows.
patid | pracid | prodcode | text | ndd |
---|---|---|---|---|
2156 | 156 | 1 | TAKE 1 OR 2 4 TIMES/DAY | 4 |
2156 | 156 | 2 | TAKE 1-2 THREE TIMES A DAY | 3 |
2156 | 156 | 1 | | NA |
Here, we specified that the minimum values of both DF and DN be used
in the computation of ndd. Running compute_ndd() creates an additional
column named ndd, as illustrated in the table above.
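As a rough check of the formula, the following base-R sketch reproduces the arithmetic by hand for the first instruction in the table above. The helper ndd_by_hand is hypothetical and is not part of drugprepr or doseminer; it also applies the same summary (min, max, or mean) to both DF and DN, which is a simplifying assumption.

```r
# Hypothetical helper (not part of drugprepr): summarise the DF and DN
# ranges with one rule and apply ndd = DF * DN / DI.
ndd_by_hand <- function(df_range, dn_range, di = 1,
                        how = c("min", "max", "mean")) {
  how <- match.arg(how)
  pick <- switch(how, min = min, max = max, mean = mean)
  pick(df_range) * pick(dn_range) / di
}

# "TAKE 1 OR 2 4 TIMES/DAY": DF = 4, DN is 1 or 2, taken every day (DI = 1)
ndd_min  <- ndd_by_hand(df_range = 4, dn_range = c(1, 2), how = "min")
ndd_max  <- ndd_by_hand(df_range = 4, dn_range = c(1, 2), how = "max")
ndd_mean <- ndd_by_hand(df_range = 4, dn_range = c(1, 2), how = "mean")

c(min = ndd_min, max = ndd_max, mean = ndd_mean)
```

With the minimum rule this gives 4, which agrees with the ndd shown for product 1 in the table; the maximum and mean rules give 8 and 6 respectively.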
The next stage is to define the cut-offs for plausible values of
prescription quantity and numerical daily dose for each product. This
information has to be provided by the user in a table format. The table
should have the column names prodcode, max_qty, min_qty, max_rec_ndd,
and min_rec_ndd. Such information might be obtained from the British
National Formulary (BNF).
For our hypothetical case, we defined the min_max data as in the table
below.
prodcode | max_qty | min_qty | max_ndd | min_ndd |
---|---|---|---|---|
1 | 100 | 1 | 10 | 1 |
2 | 50 | 1 | 10 | 1 |
Once we have prepared our min_max data in a table format, we can pass
it as the second argument of drug_prep(). Internally, the function will
join the plausible ranges to the different drugs and mark the
respective quantities and doses as plausible or implausible.
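To make the idea of joining and flagging concrete, here is a minimal base-R sketch. It is not the package's internal code: the column names follow the min_max table above, the rx_example data frame is made up for illustration, and the flag names implausible_qty and implausible_ndd are borrowed from the decision labels as an assumption.

```r
# Plausible ranges per product, copied from the min_max table above
min_max <- data.frame(
  prodcode = c(1, 2),
  max_qty  = c(100, 50),
  min_qty  = c(1, 1),
  max_ndd  = c(10, 10),
  min_ndd  = c(1, 1)
)

# Illustrative prescriptions; the last quantity is deliberately too large
rx_example <- data.frame(
  prodcode = c(1, 2, 2),
  qty      = c(40, 24, 500),
  ndd      = c(4, 3, 3)
)

# Join the ranges onto the prescriptions by product code
flagged <- merge(rx_example, min_max, by = "prodcode")

# Mark values that fall outside the plausible range for their product
flagged$implausible_qty <- flagged$qty < flagged$min_qty |
  flagged$qty > flagged$max_qty
flagged$implausible_ndd <- flagged$ndd < flagged$min_ndd |
  flagged$ndd > flagged$max_ndd

flagged[, c("prodcode", "qty", "ndd", "implausible_qty", "implausible_ndd")]
```

Only the prescription with a quantity of 500 is flagged, because 500 exceeds the maximum of 50 set for product 2; all ndd values lie within their ranges.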
Once these preliminary pre-processing steps are completed, we can use
the main function of drugprepr, drug_prep(), to evaluate the 10
decision nodes described above. This can be done by executing an R
command of the following form:
result <- drug_prep(data_ndd,
                    min_max_dat,
                    decisions = c('b', 'b1', 'b', 'b1', 'b_6',
                                  'c', 'a', 'd', 'a', 'b_15'))
pracid | start_date | prodcode | dosageid | text | qty | numdays | dose_duration | ndd | optional | duration | stop_date |
---|---|---|---|---|---|---|---|---|---|---|---|
156 | 2011-06-16 | 1 | 1 | TAKE 1 OR 2 4 TIMES/DAY | 40 | 7 | 6 | 4 | 0 | 10 | 2011-06-26 |
156 | 2011-06-24 | 1 | 1 | TAKE 1 OR 2 4 TIMES/DAY | 40 | 7 | 6 | 4 | 0 | 10 | 2011-07-04 |
156 | 2011-08-16 | 1 | 1 | TAKE 1 OR 2 4 TIMES/DAY | 40 | 7 | 6 | 4 | 0 | 10 | 2011-08-26 |
156 | 2011-08-24 | 1 | NA | | 40 | 7 | 6 | 4 | 0 | 10 | 2011-09-03 |
160 | 2011-06-16 | 1 | 1 | TAKE 1 OR 2 4 TIMES/DAY | 40 | 7 | 6 | 4 | 0 | 10 | 2011-06-26 |
160 | 2011-06-24 | 1 | 1 | TAKE 1 OR 2 4 TIMES/DAY | 40 | 7 | 6 | 4 | 0 | 10 | 2011-07-04 |
160 | 2011-08-16 | 1 | 1 | TAKE 1 OR 2 4 TIMES/DAY | 40 | 7 | 6 | 4 | 0 | 10 | 2011-08-26 |
160 | 2011-08-24 | 1 | NA | | 40 | 7 | 6 | 4 | 0 | 10 | 2011-09-03 |
156 | 2011-07-10 | 2 | 2 | TAKE 1-2 THREE TIMES A DAY | 24 | 6 | 5 | 3 | 0 | 8 | 2011-07-18 |
156 | 2011-07-17 | 2 | 2 | TAKE 1-2 THREE TIMES A DAY | 24 | 6 | 5 | 3 | 0 | 8 | 2011-07-25 |
156 | 2011-09-10 | 2 | 2 | TAKE 1-2 THREE TIMES A DAY | 24 | 6 | 5 | 3 | 0 | 8 | 2011-09-18 |
156 | 2011-09-17 | 2 | NA | | 24 | 6 | 5 | 3 | 0 | 8 | 2011-09-25 |
160 | 2011-07-10 | 2 | 2 | TAKE 1-2 THREE TIMES A DAY | 24 | 6 | 5 | 3 | 0 | 8 | 2011-07-18 |
160 | 2011-07-17 | 2 | 2 | TAKE 1-2 THREE TIMES A DAY | 24 | 6 | 5 | 3 | 0 | 8 | 2011-07-25 |
160 | 2011-09-10 | 2 | 2 | TAKE 1-2 THREE TIMES A DAY | 24 | 6 | 5 | 3 | 0 | 8 | 2011-09-18 |
160 | 2011-09-17 | 2 | NA | | 24 | 6 | 5 | 3 | 0 | 8 | 2011-09-25 |
The result of running the function drug_prep() provides the start date
and stop date (stop_date) for each prescription.
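The duration and stop_date values in the table above are consistent with dividing the quantity by the numerical daily dose and adding the result to the start date. The base-R sketch below reproduces that arithmetic for two of the rows; it only illustrates this one rule and is not a substitute for drug_prep(), whose other decision nodes may change the outcome. The data frame rx_dates is made up from values in the table.

```r
# Two prescriptions taken from the results table above
rx_dates <- data.frame(
  start_date = as.Date(c("2011-06-16", "2011-07-10")),
  prodcode   = c(1, 2),
  qty        = c(40, 24),
  ndd        = c(4, 3)
)

# Duration in days: how long the quantity lasts at the daily dose
rx_dates$duration <- rx_dates$qty / rx_dates$ndd

# Stop date: start date plus the duration in days
rx_dates$stop_date <- rx_dates$start_date + rx_dates$duration

rx_dates
```

This yields durations of 10 and 8 days and stop dates of 2011-06-26 and 2011-07-18, matching the corresponding rows of the table.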
To save remembering lots of alphanumeric codes, a helper function can generate them for you from more human-readable parameters:
decisions <- make_decisions('ignore',
                            'mean population',
                            'missing',
                            'mean practice',
                            'truncate 6',
                            'qty / ndd',
                            'mean individual',
                            'mean',
                            'allow',
                            'close 15')
decisions
#> implausible_qty missing_qty implausible_ndd
#> "a" "b3" "b"
#> missing_ndd implausible_duration calculate_duration
#> "b2" "c_6" "c"
#> missing_duration clash_start overlapping
#> "b" "b" "a"
#> small_gaps
#> "b_15"
# drug_prep(example_therapy, plausible_values, decisions)