{"id":282,"date":"2024-11-30T07:02:37","date_gmt":"2024-11-30T07:02:37","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2024\/11\/30\/effortless-data-handling-find-variables-across-multiple-data-files-with-r-e01c9715b4c7\/"},"modified":"2024-11-30T07:02:37","modified_gmt":"2024-11-30T07:02:37","slug":"effortless-data-handling-find-variables-across-multiple-data-files-with-r-e01c9715b4c7","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2024\/11\/30\/effortless-data-handling-find-variables-across-multiple-data-files-with-r-e01c9715b4c7\/","title":{"rendered":"Effortless Data Handling: Find Variables Across Multiple Data Files with R"},"content":{"rendered":"<p>    Effortless Data Handling: Find Variables Across Multiple Data Files with R<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<h4>A practical solution with code and\u00a0workflow<\/h4>\n<p>Lost in a maze of datasets and endless data dictionaries? Say goodbye to tedious variable hunting! Discover how to quickly identify and extract the variables you need from multiple SAS files using two simple R functions. Streamline your workflow, save time, and make data preparation a\u00a0breeze!<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1024\/0*kKw8z8PrtNZT_q-M\"><figcaption>Photo by <a href=\"https:\/\/unsplash.com\/@imgix?utm_source=medium&amp;utm_medium=referral\">imgix<\/a> on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\">Unsplash<\/a><\/figcaption><\/figure>\n<p>As a researcher with over seven years of experience working with health data, I\u2019ve often been handed folders full of datasets. For example, imagine opening a folder containing 56 SAS files, each with unique data (example below). If you\u2019ve been in this situation, you know the frustration: trying to locate a specific variable in a sea of files feels like looking for a needle in a haystack.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/456\/1%2AkSejreAKQPAmzoNnz8potQ.png?ssl=1\"><figcaption>Screenshot taken by the author of a local folder. File names have been blurred to maintain the confidentiality of the datasets.<\/figcaption><\/figure>\n<p>At first glance, this may not seem like an issue if you already know where your variables of interest are. But often, you don\u2019t. While a data dictionary is usually provided, it\u2019s frequently a PDF document that lists variables across multiple pages. Finding what you need might involve searching (Ctrl+F) for a variable on page 100, only to realize the dataset\u2019s name is listed on page 10. Scrolling back and forth wastes\u00a0time.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Avr2E4_lEVGPQbVkWr1dtGg.png?ssl=1\"><figcaption>Screenshot taken by the author of a data dictionary. Variable names and labels have been blurred to maintain the confidentiality of the datasets.<\/figcaption><\/figure>\n<p>To save myself from this tedious process, I created a reproducible R workflow to read all datasets in a folder, generate a consolidated dataframe of variable names and their labels (example below), and identify where each variable is located. This approach has made my work faster and more efficient. Here\u2019s how you can do it, step by\u00a0step.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AXlD9m54zhFEr2XqeDNyXrg.png?ssl=1\"><figcaption>Screenshot taken by the author of of how the names_labels dataset looks like (see step 2). Variable names and labels have been blurred to maintain the confidentiality of the datasets.<\/figcaption><\/figure>\n<p><strong>Step-by-Step Guide<\/strong><\/p>\n<p><strong>Step 1: Use the get_names_labels Function<\/strong><\/p>\n<p>First, use the custom function get_names_labels (code provided at the end of this post). This function requires the folder path where all your datasets are\u00a0stored.<\/p>\n<pre>path_file &lt;- \"D:\/folder1\/folder2\/folder3\/folder_with_datasets\/\"<br>get_names_labels(path_file)<\/pre>\n<p><strong>Step 2: Generate a Variable Dictionary<\/strong><\/p>\n<p>The get_names_labels function will create a dataframe named names_labels (like the example above), which includes:<\/p>\n<p>\u00b7 Variable name (variable_name)<\/p>\n<p>\u00b7 Variable label (variable_label)<\/p>\n<p>\u00b7 The name of the dataset(s) where the variable was found (file_name)<\/p>\n<p>Depending on the number and size of the datasets, this process may take a minute or\u00a0two.<\/p>\n<p><strong>Step 3: Search for Variables<\/strong><\/p>\n<p>Once the names_labels dataframe is generated, you can search for the variables you need. Filter the variable_name or variable_label columns to locate relevant terms. For example, if you\u2019re looking for gender-related variables, they might be labeled as sex, gender, is_male, or is_female.<\/p>\n<p>Be mindful that similar variables might exist in multiple datasets. For instance, age could appear in the main questionnaire, a clinical dataset, and a laboratory dataset. These variables might look identical but differ based on how and where the data was collected. For\u00a0example:<\/p>\n<p>\u00b7 Age in the main questionnaire: Collected from all survey participants.<\/p>\n<p>\u00b7 Age in clinical\/lab datasets: Limited to a subset invited for further assessments or those who agreed to participate.<\/p>\n<p>In such cases, the variable from the main questionnaire might be more representative of the full population.<\/p>\n<p><strong>Step 4: Identify Relevant\u00a0Datasets<\/strong><\/p>\n<p>Once you\u2019ve determined which variables you need, filter the names_labels dataframe to identify the original datasets (file_name) containing them. If a variable appears in multiple datasets (e.g., ID), you\u2019ll need to identify which dataset includes all the variables you\u2019re interested in.<\/p>\n<pre># Say you want these two variables<br>variables_needed &lt;- c('ID', 'VAR1_A')<br>names_labels &lt;- names_labels[which(names_labels$variable_name %in% variables_needed), ]<\/pre>\n<p>If one of the variables can be found in multiple original datasets (e.g., ID), you will filter names_labels to keep only the original dataset with both variables (e.g., ID and VAR1_A). In our case, the names_labels dataframe will be reduced to only two rows, one for each of the two variables we were looking for, both of which will be found in the same original\u00a0dataset.<\/p>\n<pre>names_labels &lt;- names_labels %&gt;%<br>  group_by(file_name) %&gt;%<br>  mutate(count = n()) %&gt;%<br>  filter(count &gt;= 2)<\/pre>\n<p><strong>Step 5: Extract the\u00a0Data<\/strong><\/p>\n<p>Now, use the read_and_select function (provided at the end). Pass the name of the original dataset containing the variables of interest. This function creates a new dataframe in your R environment with only the selected variables. For example, if your variables are in ABC.sas7bdat, the function will create a new dataframe called ABC with just those variables.<\/p>\n<pre>unique(names_labels$file_name) # Sanity check, that there is only one dataframe<br>read_and_select(unique(names_labels$file_name)[1])<\/pre>\n<p><strong>Step 6: Clean Your Environment<\/strong><\/p>\n<p>To keep your workspace tidy, remove unnecessary elements and retain only the new dataframe(s) you need. For example, if your variables of interest came from ABC.sas7bdat, you\u2019ll keep the filtered dataframe ABC which was the output of the read_and_select function.<\/p>\n<pre>length(unique(names_labels$file_name))<br>names_labels$file_name &lt;- str_extract(names_labels$file_name, \"[^.]+\")<br>rm(list = setdiff(ls(), c(unique(names_labels$file_name))))<\/pre>\n<p><strong>Step 7: Merge Multiple Datasets (Optional)<\/strong><\/p>\n<p>If your variables of interest are in more than one dataset (e.g., ABC and DEF), you can merge them. Use a unique identifier, such as ID, to combine the datasets into a single dataframe. The result will be a unified dataframe with all the variables you need. You will get a df dataframe with all the observations and only the variables you\u00a0needed.<\/p>\n<pre># Get a list with the names of the dataframes in the environment (\u201cABC\u201d and \u201cDEF\u201d)<br>object_names &lt;- ls() <br><br><br># Get a list with the actual dataframe<br>object_list &lt;- mget(object_names)<br><br><br># Reduce the dataframes in the list (\u201cABC\u201d and \u201cDEF\u201d) by merging conditional on the unique identifier (\u201cID\u201d)<br>df &lt;- Reduce(function(x, y) merge(x, y, by = \"ID\", all = TRUE), object_list)<br><br># Clean your environment to keep only the dataframes (\u201cABC\u201d and \u201cDEF\u201d) and a new dataframe \u201cdf\u201d which will contain all the variables you needed.<br>rm(object_list, object_names)<\/pre>\n<p><strong>Why This Workflow\u00a0Works?<\/strong><\/p>\n<p>This approach saves time and organizes your work into a single, reproducible script. If you later decide to add more variables, simply revisit steps 2 and 3, update your list, and rerun the script. This flexibility is invaluable when dealing with large datasets. While you\u2019ll still need to consult documentation to understand variable definitions and data collection methods, this workflow reduces the effort required to locate and prepare your data. Handling multiple datasets doesn\u2019t have to be overwhelming. By leveraging my custom functions like get_names_labels and read_and_select, you can streamline your workflow for data preparation.<\/p>\n<p><strong><em>Have you faced similar challenges when working with multiple datasets? Share your thoughts or tips in the comments, or give this article a thumbs up if you found it helpful. Let\u2019s keep the conversation going and learn from each\u00a0other!<\/em><\/strong><\/p>\n<p>Below are the two custom functions. Save them in an R script file, and load the script into your working environment whenever needed. For example, you could save the file as _Functions.R for easy\u00a0access.<\/p>\n<pre># You can load the functions as<br>source('D:\/Folder1\/Folder2\/Folder3\/_Functions.R')<\/pre>\n<pre>library(haven)<br>library(tidyverse)<br>library(stringr)<br><br><br><br><br><br>## STEPS TO USE THESE FUNCTIONS:<br>## 1. DEFINE THE OBJECT 'PATH_FILE', WHICH IS A PATH TO THE DIRECTORY WHERE<br>##    ALL THE DATASETS ARE STORED.<br>## 2. APPLY THE FUNCTION 'get_names_labels' WITH THE PATH. THE FUNCTION WILL <br>##    RETURN A DATAFRAME NAMES 'names_labels'. <br>## 3. THE FUNCTION WILL RETURN A DATASET ('names_labels) SHOWING THE NAMES OF <br>##    THE VARIABLES, THE LABELS, AND THE DATASET. VISUALLY\/MANUALLY EXPLORE THE <br>##    DATASET TO SELECT THE VARIABLES WE NEED. CREATE A VECTOR WITH THE NAMES <br>##    OF THE VARIABLES WE NEED, AND NAME THIS VECTOR 'variables_needed'.<br>## 4. FROM THE DATASET 'names_labels', KEEP ONLY THE ROWS WITH THE VARIABLES WE<br>##    WILL USE (STORED IN THE VECTOR 'variables_needed').<br>## 5. APPLY THE FUNCTION 'read_and_select' TO EACH OF THE DATASETS WITH RELEVANT <br>##    VARIABLES. THIS FUNCTION WILL ONLY NEED THE NAME OF THE DATASET, WHICH IS<br>##    STORED IN THE LAST COLUMN OF DATASET 'names_labels'.<br><br><br><br><br><br><br><br><br><br><br>### FUNCTION TO 1) READ ALL DATASETS IN A FOLDER; 2) EXTRACT NAMES AND LABELS;<br>### 3) PUT NAMES AND LABELS IN A DATASET; AND 4) RETURN THE DATASET. THE ONLY<br>### INPUT NEEDED IS A PATH TO A DIRECTORY WHERE ALL THE DATASETS ARE STORED.<br><br>get_names_labels &lt;- function(path_file){<br>  results_df &lt;- list()<br>  <br>  sas_files &lt;- c(<br>    list.files(path = path_file, pattern = \"\\.sas7bdat$\")<br>  )<br>  <br>  for (i in 1:length(sas_files)) {<br>    print(sas_files[i])<br>    <br>    # Read the SAS file<br>    sas_data &lt;- read_sas(paste0(path_file, sas_files[i]))<br>    sas_data &lt;- as.data.frame(sas_data)<br>    <br>    # Get the variable names and labels<br>    var_names &lt;- names(sas_data)<br>    labels &lt;- sas_data %&gt;% <br>      map(~attributes(.)$label) %&gt;%<br>      map_chr(~ifelse(is.null(.), NA, .))<br>    <br>    # Combine the variable names and labels into a data frame<br>    var_df &lt;- data.frame(<br>      variable_name = var_names,<br>      variable_label = labels,<br>      file_name = sas_files[i],<br>      stringsAsFactors = FALSE<br>    )<br>    <br>    # Append the results to the overall data frame<br>    results_df[[i]] &lt;- var_df<br>  }<br>  <br>  results_df &lt;- do.call(rbind, results_df)<br>  <br>  #return(results_df)<br>  assign('names_labels', results_df, envir = .GlobalEnv)<br>  <br>}<br><br>################################################################################<br><br><br><br><br><br><br><br><br><br><br>### FUNCTION TO READ EACH DATASET AND KEEP ONLY THE VARIABLES WE SELECTED; THE<br>### FUNCTION WILL SAVE EACH DATASET IN THE ENVIRONMENT. THE ONLY INPUNT IS THE<br>### NAME OF THE DATASET.<br><br>read_and_select &lt;- function(df_file){<br>  <br>  df_tmp &lt;- read_sas(paste0(path_file, df_file))<br>  <br>  df_tmp &lt;- df_tmp %&gt;%<br>    select(unique(names_labels[which(names_labels$file_name == df_file), ]$variable_name)) %&gt;%<br>    as.data.frame()<br>  <br>  assign(str_extract(df_file, \"[^.]+\"), df_tmp,envir = .GlobalEnv)<br>  <br>}<br><br>################################################################################<\/pre>\n<p><em>You can find me on <\/em><a href=\"https:\/\/www.linkedin.com\/in\/rodrigo-m-carrillo-larco-md-phd-39016a27\/\"><em>LinkedIn<\/em><\/a><em> and happy to connect and\u00a0discuss.<\/em><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/medium.com\/_\/stat?event=post.clientViewed&amp;referrerSource=full_rss&amp;postId=e01c9715b4c7\" width=\"1\" height=\"1\" alt=\"\"><\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/effortless-data-handling-find-variables-across-multiple-data-files-with-r-e01c9715b4c7\">Effortless Data Handling: Find Variables Across Multiple Data Files with R<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Rodrigo M Carrillo Larco, MD, PhD<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Feffortless-data-handling-find-variables-across-multiple-data-files-with-r-e01c9715b4c7\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Effortless Data Handling: Find Variables Across Multiple Data Files with R A practical solution with code and\u00a0workflow Lost in a maze of datasets and endless data dictionaries? Say goodbye to tedious variable hunting! Discover how to quickly identify and extract the variables you need from multiple SAS files using two simple R functions. Streamline your [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,255,83,254,253,256],"tags":[84,257,258],"class_list":["post-282","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-data-handling","category-data-science","category-dataset","category-health-research","category-r","tag-data","tag-variable","tag-variables"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/282"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=282"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/282\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=282"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=282"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=282"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}