Automatic identification of variables in epidemiological datasets using logic regression
Date
2017
Abstract
Background: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve data quality. For semi-automation, high sensitivity in recognising matching variables is particularly important: it allows software to present, for each target variable, a list of candidate source variables from which a user selects the matching one, with only a low risk that the correct source variable is missing from the list.

Methods: For each variable in a set of target variables, a number of simple rules were created manually. Using logic regression, an optimal Boolean combination of these rules was searched for each target variable in a random subset of a large database of epidemiological and clinical cohort data (construction subset). These optimal combinations of rules were then validated in a second subset of the database (validation subset).

Results: In the construction sample, 41 target variables were allocated with an average positive predictive value (PPV) of 34% and an average negative predictive value (NPV) of 95%. In the validation sample, PPV was 33% and NPV was 94%. PPV was 50% or less for 63% of all variables in the construction sample and for 71% of all variables in the validation sample.

Conclusions: We demonstrated that applying logic regression to a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.
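The evaluation step described in the Methods can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two rules, the source variable names, and the match labels are hypothetical and not taken from the study. The Boolean combination is also written by hand here, whereas the study searched for it with logic regression. The sketch only shows how PPV and NPV follow from applying one combined rule to labelled source variables.

```python
from typing import Callable, List, Tuple

# A rule maps a source variable name to True (candidate match) or False.
Rule = Callable[[str], bool]


# Hypothetical simple rules for a hypothetical target variable "age".
def name_contains_age(name: str) -> bool:
    return "age" in name.lower()


def name_contains_stage(name: str) -> bool:
    return "stage" in name.lower()


def combined_rule(name: str) -> bool:
    # Hand-written Boolean combination: "age" AND NOT "stage".
    # In the study, such a combination was found by logic regression.
    return name_contains_age(name) and not name_contains_stage(name)


def ppv_npv(rule: Rule, labelled: List[Tuple[str, bool]]) -> Tuple[float, float]:
    """Return (PPV, NPV) of a rule on (name, is_true_match) pairs."""
    tp = fp = tn = fn = 0
    for name, is_match in labelled:
        predicted = rule(name)
        if predicted and is_match:
            tp += 1
        elif predicted and not is_match:
            fp += 1
        elif not predicted and is_match:
            fn += 1
        else:
            tn += 1
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv


# Hypothetical labelled source variables (True = really is the age variable).
labelled = [
    ("AGE", True),
    ("age_at_baseline", True),
    ("tumour_stage", False),  # contains "age" but is excluded by the NOT clause
    ("dosage", False),        # contains "age": a false positive
    ("sex", False),
    ("birth_year", True),     # matches no rule: a false negative
]

ppv, npv = ppv_npv(combined_rule, labelled)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

In a semi-automated workflow, the source variables for which the combined rule returns True would form the list of candidates shown to the user for that target variable.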