{"id":3097,"date":"2025-04-15T07:03:01","date_gmt":"2025-04-15T07:03:01","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/04\/15\/an-llm-based-workflow-for-automated-tabular-data-validation\/"},"modified":"2025-04-15T07:03:01","modified_gmt":"2025-04-15T07:03:01","slug":"an-llm-based-workflow-for-automated-tabular-data-validation","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/04\/15\/an-llm-based-workflow-for-automated-tabular-data-validation\/","title":{"rendered":"An LLM-Based Workflow for Automated Tabular Data Validation\u00a0"},"content":{"rendered":"<div>\n<p class=\"has-text-align-left wp-block-paragraph\">This article is part of a series of articles on automating data cleaning for any tabular dataset:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/towardsdatascience.com\/effortless-spreadsheet-normalisation-with-llm\/\">Effortless Spreadsheet Normalisation With LLM<\/a><\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">You can test the feature described in this article on your own dataset using the <a href=\"https:\/\/cleanmyexcel.io\/\">CleanMyExcel.io<\/a> service, which is free and requires no registration.<\/p>\n<h2 class=\"wp-block-heading\">What is Data Validity?<\/h2>\n<p class=\"wp-block-paragraph\">Data validity refers to data conformity to expected formats, types, and value ranges. 
This standardisation within a single column ensures the uniformity of data according to implicit or explicit requirements.<\/p>\n<p class=\"wp-block-paragraph\">Common issues related to data validity include:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Inappropriate variable types: Column data types that are not suited to analytical needs, e.g., temperature values in text format.<\/li>\n<li class=\"wp-block-list-item\">Columns with mixed data types: A single column containing both numerical and textual data.<\/li>\n<li class=\"wp-block-list-item\">Non-conformity to expected formats: For instance, invalid email addresses or URLs.<\/li>\n<li class=\"wp-block-list-item\">Out-of-range values: Column values that fall outside what is allowed or considered normal, e.g., negative age values or ages greater than 30 for high school students.<\/li>\n<li class=\"wp-block-list-item\">Time zone and DateTime format issues: Inconsistent or heterogeneous date formats within the dataset.<\/li>\n<li class=\"wp-block-list-item\">Lack of measurement standardisation or uniform scale: Variability in the units of measurement used for the same variable, e.g., mixing Celsius and Fahrenheit values for temperature.<\/li>\n<li class=\"wp-block-list-item\">Special characters or whitespace in numeric fields: Numeric data contaminated by non-numeric elements.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">And the list goes on.<\/p>\n<p class=\"wp-block-paragraph\">Error types such as <strong>duplicated records or entities<\/strong> and <strong>missing values<\/strong> do not fall into this category.<\/p>\n<p class=\"wp-block-paragraph\">But what is the typical strategy for identifying such data validity issues?\u00a0<\/p>\n<h2 class=\"wp-block-heading\">When data meets expectations<\/h2>\n<p class=\"wp-block-paragraph\">Data cleaning, while it can be very complex, can generally be broken down into two key phases:<\/p>\n<p class=\"wp-block-paragraph\">1. 
Detecting data errors\u00a0\u00a0<\/p>\n<p class=\"wp-block-paragraph\">2. Correcting these errors.<\/p>\n<p class=\"wp-block-paragraph\">At its core, data cleaning revolves around identifying and resolving discrepancies in datasets: specifically, values that violate predefined constraints, which are derived from expectations about the data.<\/p>\n<p class=\"wp-block-paragraph\">It\u2019s important to acknowledge a fundamental fact: in real-world scenarios, it\u2019s almost impossible to identify every potential data error. The sources of data issues are virtually infinite, ranging from human input mistakes to system failures, and thus impossible to predict entirely. However, what we <em>can<\/em> do is define what we consider reasonably regular patterns in our data. These are known as data expectations: reasonable assumptions about what \u201ccorrect\u201d data should look like. For example:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">If working with a dataset of high school students, we might expect ages to fall between 14 and 18 years old.<\/li>\n<li class=\"wp-block-list-item\">A customer database might require email addresses to follow a standard format (e.g., user@domain.com).<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">By establishing these expectations, we create a structured framework for detecting anomalies, making the data cleaning process both manageable and scalable.<\/p>\n<p class=\"wp-block-paragraph\">These expectations are derived from both semantic and statistical analysis. We understand that the column name \u201cage\u201d refers to the well-known concept of time spent living. Other column names may be drawn from the lexical field of high school, and column statistics (e.g. minimum, maximum, mean, etc.) offer insights into the distribution and range of values. 
Taken together, this information helps determine our expectations for that column:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Age values should be integers<\/li>\n<li class=\"wp-block-list-item\">Values should fall between 14 and 18<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">The accuracy of expectations tends to grow with the time spent analysing the dataset. Naturally, if a team works with a dataset daily, the likelihood of discovering subtle data issues, and therefore of refining expectations, increases significantly. That said, even simple expectations are rarely checked systematically in most environments, often due to time constraints or simply because it\u2019s not the most enjoyable or high-priority task on the to-do list.<\/p>\n<p class=\"wp-block-paragraph\">Once we\u2019ve defined our expectations, the next step is to check whether the data actually meets them. This means applying data constraints and looking for violations. For each expectation, one or more constraints can be defined. These <a href=\"https:\/\/towardsdatascience.com\/tag\/data-quality\/\" title=\"Data Quality\">Data Quality<\/a> rules can be translated into programmatic functions that return a binary decision: a Boolean value indicating whether a given value violates the tested constraint.<\/p>\n<p class=\"wp-block-paragraph\">This strategy is commonly implemented in many data quality management tools, which offer ways to detect all data errors in a dataset based on the defined constraints. An iterative process then begins to address each issue until all expectations are satisfied, i.e. no violations remain.<\/p>\n<p class=\"wp-block-paragraph\">This strategy may seem straightforward and easy to implement in theory. 
However, that\u2019s often not what we see in practice \u2014 data quality remains a major challenge and a time-consuming task in many organisations.<\/p>\n<h2 class=\"wp-block-heading\">An LLM-based workflow to generate data expectations, detect violations, and resolve them<\/h2>\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXcbmlZVTjnXMlPtEaASw5SoQvLywnpCiBExtnE9pN45Ak-6gmsGzdjRo7Q9xdJdV2aOtK_4IzKZEI3cXEc8SwNuGawU96vSigGikFD2fu_B-apSShpe12hON0niWRiolLpSjqeJ?key=lABtwTjQ29DDn4nC3kBCGRmV\" alt=\"\"><\/figure>\n<p class=\"wp-block-paragraph\">This validation workflow is split into two main components: the validation of column data types and the compliance with expectations.<\/p>\n<p class=\"wp-block-paragraph\">One might handle both simultaneously, but in our experiments, properly converting each column\u2019s values in a data frame beforehand is a crucial preliminary step. It facilitates data cleaning by breaking down the entire process into a series of sequential actions, which improves performance, comprehension, and maintainability. This strategy is, of course, somewhat subjective, but it tends to avoid dealing with all data quality issues at once wherever possible.<\/p>\n<p class=\"wp-block-paragraph\">To illustrate and understand each step of the whole process, we\u2019ll consider this generated example:<\/p>\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXeYpouorauXeJOAvIfWyvphJT3znXGnE3phvilOjpOo0be2Q0EVw0dUbelPo32h_vmiBCvPNFwI8uGkw5ESwre9Lnyyl3zrkZsZ5tBEiQgEsXH3Q221K3behCFNaMBUkQ6L2QPI?key=lABtwTjQ29DDn4nC3kBCGRmV\" alt=\"\"><\/figure>\n<p class=\"wp-block-paragraph\">Examples of data validity issues are spread across the table. 
Each row intentionally embeds one or more issues:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Row 1:<\/strong> Uses a non\u2011standard date format and an invalid URL scheme (non\u2011conformity to expected formats).<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Row 2:<\/strong> Contains a price value as text (\u201ctwenty\u201d) instead of a numeric value (inappropriate variable type).<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Row 3:<\/strong> Has a rating given as \u201c4 stars\u201d mixed with numeric ratings elsewhere (mixed data types).<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Row 4:<\/strong> Provides a rating value of \u201c10\u201d, which is out\u2011of\u2011range if ratings are expected to be between 1 and 5 (out\u2011of\u2011range value). Additionally, there is a typo in the word \u201cFood\u201d.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Row 5:<\/strong> Uses a price with a currency symbol (\u201c20\u20ac\u201d) and a rating with extra whitespace (\u201c5 \u201d), showing a lack of measurement standardisation and special characters\/whitespace issues.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\">Validate Column Data Types<\/h3>\n<h4 class=\"wp-block-heading\">Estimate column data types<\/h4>\n<p class=\"wp-block-paragraph\">The task here is to determine the most appropriate data type for each column in a data frame, based on the column\u2019s semantic meaning and statistical properties. The classification is limited to the following options: string, int, float, datetime, and boolean. These categories are generic enough to cover most data types commonly encountered.<\/p>\n<p class=\"wp-block-paragraph\">There are multiple ways to perform this classification, including deterministic approaches. 
The method chosen here leverages a large language model (<a href=\"https:\/\/towardsdatascience.com\/tag\/llm\/\" title=\"Llm\">LLM<\/a>), prompted with information about each column and the overall data frame context to guide its decision:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The list of <strong>column names<\/strong>\n<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Representative rows<\/strong> from the dataset, randomly sampled<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Column statistics<\/strong> describing each column (e.g. number of unique values, proportion of top values, etc.)<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\"><em>Example<\/em>:<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<tbody>\n<tr>\n<td>1. Column Name: date\u00a0<br \/>\u00a0 Description: Represents the date and time information associated with each record.\u00a0<br \/>\u00a0 Suggested Data Type: datetime<\/p>\n<p>2. Column Name: category\u00a0<br \/>\u00a0 Description: Contains the categorical label defining the type or classification of the item.\u00a0<br \/>\u00a0 Suggested Data Type: string<\/p>\n<p>3. Column Name: price\u00a0<br \/>\u00a0 Description: Holds the numeric price value of an item expressed in monetary terms.\u00a0<br \/>\u00a0 Suggested Data Type: float<\/p>\n<p>4. Column Name: image_url\u00a0<br \/>\u00a0 Description: Stores the web address (URL) pointing to the image of the item.\u00a0<br \/>\u00a0 Suggested Data Type: string<\/p>\n<p>5. Column Name: rating\u00a0<br \/>\u00a0 Description: Represents the evaluation or rating of an item using a numeric score.\u00a0<br \/>\u00a0 Suggested Data Type: int<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<h4 class=\"wp-block-heading\">Convert Column Values into the Estimated Data Type<\/h4>\n<p class=\"wp-block-paragraph\">Once the data type of each column has been predicted, the conversion of values can begin. 
Depending on the table framework used, this step might differ slightly, but the underlying logic remains similar. For instance, in the <a href=\"https:\/\/cleanmyexcel.io\/\">CleanMyExcel.io<\/a> service, Pandas is used as the core data frame engine. However, other libraries like Polars or PySpark are equally capable within the Python ecosystem.<br \/>All non-convertible values are set aside for further investigation.<\/p>\n<h4 class=\"wp-block-heading\">Analyse Non-convertible Values and Propose Substitutes<\/h4>\n<p class=\"wp-block-paragraph\">This step can be viewed as an imputation task. The previously flagged non-convertible values violate the column\u2019s expected data type. Because the potential causes are so diverse, this step can be quite challenging. Once again, an LLM offers a helpful trade-off to interpret the conversion errors and suggest possible replacements.<br \/>Sometimes, the correction is straightforward\u2014for example, converting an age value of twenty into the integer 20. In many other cases, a substitute is not so obvious, and tagging the value with a sentinel (placeholder) value is a better choice. 
In Pandas, for instance, the special object pd.NA is suitable for such cases.<\/p>\n<p class=\"wp-block-paragraph\"><em>Example<\/em>:<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<tbody>\n<tr>\n<td>{<br \/>\u00a0 \u201cviolations\u201d: [<br \/>\u00a0 \u00a0 {<br \/>\u00a0 \u00a0 \u00a0 \u201cindex\u201d: 2,<br \/>\u00a0 \u00a0 \u00a0 \u201ccolumn_name\u201d: \u201crating\u201d,<br \/>\u00a0 \u00a0 \u00a0 \u201cvalue\u201d: \u201c4 stars\u201d,<br \/>\u00a0 \u00a0 \u00a0 \u201cviolation\u201d: \u201cContains non-numeric text in a numeric rating field.\u201d,<br \/>\u00a0 \u00a0 \u00a0 \u201csubstitute\u201d: \u201c4\u201d<br \/>\u00a0 \u00a0 },<br \/>\u00a0\u00a0\u00a0{<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u201cindex\u201d: 1,<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u201ccolumn_name\u201d: \u201cprice\u201d,<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u201cvalue\u201d: \u201ctwenty\u201d,<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u201cviolation\u201d: \u201cTextual representation that cannot be directly converted to a number.\u201d,<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u201csubstitute\u201d: \u201c20\u201d<br \/>\u00a0\u00a0\u00a0\u00a0},<br \/>\u00a0\u00a0\u00a0\u00a0{<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u201cindex\u201d: 4,<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u201ccolumn_name\u201d: \u201cprice\u201d,<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u201cvalue\u201d: \u201c20\u20ac\u201d,<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u201cviolation\u201d: \u201cPrice value contains an extraneous currency symbol.\u201d,<br \/>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u201csubstitute\u201d: \u201c20\u201d<br \/>\u00a0\u00a0\u00a0\u00a0}<br \/>\u00a0 ]<br \/>}<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<h4 class=\"wp-block-heading\">Replace Non-convertible Values<\/h4>\n<p class=\"wp-block-paragraph\">At this point, a programmatic function is applied to replace the problematic values with the proposed 
substitutes. The column is then tested again to ensure all values can now be converted into the estimated data type. If successful, the workflow proceeds to the expectations module. Otherwise, the previous steps are repeated until the column is validated.<\/p>\n<h3 class=\"wp-block-heading\">Validate Column Data Expectations<\/h3>\n<h4 class=\"wp-block-heading\">Generate Expectations for All Columns<\/h4>\n<p class=\"wp-block-paragraph\">The following elements are provided:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Data dictionary<\/strong>: column name, a short description, and the expected data type<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Representative rows<\/strong> from the dataset, randomly sampled<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Column statistics<\/strong>, such as number of unique values and proportion of top values<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Based on each column\u2019s semantic meaning and statistical properties, the goal is to define validation rules and expectations that ensure data quality and integrity. These expectations should fall into one of the following categories related to standardisation:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Valid ranges or intervals<\/li>\n<li class=\"wp-block-list-item\">Expected formats (e.g. for emails or phone numbers)<\/li>\n<li class=\"wp-block-list-item\">Allowed values (e.g. for categorical fields)<\/li>\n<li class=\"wp-block-list-item\">Column data standardisation (e.g. 
\u2018Mr\u2019, \u2018Mister\u2019, \u2018Mrs\u2019, \u2018Mrs.\u2019 becomes [\u2018Mr\u2019, \u2018Mrs\u2019])<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\"><em>Example<\/em>:<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<tbody>\n<tr>\n<td>Column name: date<\/p>\n<p>\u2022 Expectation: Value must be a valid datetime.<br \/>\u2003- Reasoning: The column represents date and time information so each entry should follow a standard datetime format (for example, ISO 8601).\u00a0<br \/>\u2022 Expectation: Datetime values should include timezone information (preferably UTC).<br \/>\u2003- Reasoning: The provided sample timestamps include explicit UTC timezone information. This ensures consistency in time-based analyses.<\/p>\n<p>\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500<br \/>Column name: category<\/p>\n<p>\u2022 Expectation: Allowed values should be standardized to a predefined set.<br \/>\u2003- Reasoning: Based on the semantic meaning, valid categories might include \u201cBooks\u201d, \u201cElectronics\u201d, \u201cFood\u201d, \u201cClothing\u201d, and \u201cFurniture\u201d. 
(Note: The sample includes \u201cFod\u201d, which likely needs correcting to \u201cFood\u201d.)\u00a0<br \/>\u2022 Expectation: Entries should follow a standardized textual format (e.g., Title Case).<br \/>\u2003- Reasoning: Consistent capitalization and spelling will improve downstream analyses and reduce data cleaning issues.<\/p>\n<p>\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500<br \/>Column name: price<\/p>\n<p>\u2022 Expectation: Value must be a numeric float.<br \/>\u2003- Reasoning: Since the column holds monetary amounts, entries should be stored as numeric values (floats) for accurate calculations.<br \/>\u2022 Expectation: Price values should fall within a valid non-negative numeric interval (e.g., price \u2265 0).<br \/>\u2003- Reasoning: Negative prices generally do not make sense in a pricing context. Even if the minimum observed value in the sample is 9.99, allowing zero or positive values is more realistic for pricing data.<\/p>\n<p>\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500<br \/>Column name: image_url<\/p>\n<p>\u2022 Expectation: Value must be a valid URL with the expected format.<br \/>\u2003- Reasoning: Since the column stores image web addresses, each URL should adhere to standard URL formatting patterns (e.g., including a proper protocol schema).<br \/>\u2022 Expectation: The URL should start with \u201chttps:\/\/\u201d.<br \/>\u2003- Reasoning: The sample shows that one URL uses \u201chtp:\/\/\u201d, which is likely a typo. 
Enforcing a secure (https) URL standard improves data reliability and user security.<\/p>\n<p>\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500<br \/>Column name: rating<\/p>\n<p>\u2022 Expectation: Value must be an integer.<br \/>\u2003- Reasoning: The evaluation score is numeric, and as seen in the sample the rating is stored as an integer.<br \/>\u2022 Expectation: Rating values should fall within a valid interval, such as between 1 and 5.<br \/>\u2003- Reasoning: In many contexts, ratings are typically on a scale of 1 to 5. Although the sample includes a value of 10, it is likely a data quality issue. Enforcing this range standardizes the evaluation scale.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<h4 class=\"wp-block-heading\">Generate Validation Code<\/h4>\n<p class=\"wp-block-paragraph\">Once expectations have been defined, the goal is to create structured code that checks the data against these constraints. The code format may vary depending on the chosen validation library, such as <a href=\"https:\/\/pandera.readthedocs.io\/\">Pandera<\/a> (used in <a href=\"https:\/\/cleanmyexcel.io\/\">CleanMyExcel.io<\/a>), <a href=\"https:\/\/docs.pydantic.dev\/latest\/\">Pydantic<\/a>, <a href=\"https:\/\/greatexpectations.io\/\">Great Expectations<\/a>, <a href=\"https:\/\/www.soda.io\/\">Soda<\/a>, etc.<\/p>\n<p class=\"wp-block-paragraph\">To make debugging easier, the validation code should apply checks elementwise so that when a failure occurs, the row index and column name are clearly identified. This helps to pinpoint and resolve issues effectively.<\/p>\n<h4 class=\"wp-block-heading\">Analyse Violations and Propose Substitutes<\/h4>\n<p class=\"wp-block-paragraph\">When a violation is detected, it must be resolved. Each issue is flagged with a short explanation and a precise location (row index + column name). 
An LLM is used to estimate the best possible replacement value based on the violation\u2019s description. Again, this proves useful due to the variety and unpredictability of data issues. If the appropriate substitute is unclear, a sentinel value is applied, depending on the data frame package in use.<\/p>\n<p class=\"wp-block-paragraph\"><em>Example<\/em>:<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<tbody>\n<tr>\n<td>{<br \/>\u00a0 \u201cviolations\u201d: [<br \/>\u00a0 \u00a0 {<br \/>\u00a0 \u00a0 \u00a0 \u201cindex\u201d: 3,<br \/>\u00a0 \u00a0 \u00a0 \u201ccolumn_name\u201d: \u201ccategory\u201d,<br \/>\u00a0 \u00a0 \u00a0 \u201cvalue\u201d: \u201cFod\u201d,<br \/>\u00a0 \u00a0 \u00a0 \u201cviolation\u201d: \u201ccategory should be one of [\u2018Books\u2019, \u2018Electronics\u2019, \u2018Food\u2019, \u2018Clothing\u2019, \u2018Furniture\u2019]\u201d,<br \/>\u00a0 \u00a0 \u00a0 \u201csubstitute\u201d: \u201cFood\u201d<br \/>\u00a0 \u00a0 },<br \/>\u00a0 \u00a0 {<br \/>\u00a0 \u00a0 \u00a0 \u201cindex\u201d: 0,<br \/>\u00a0 \u00a0 \u00a0 \u201ccolumn_name\u201d: \u201cimage_url\u201d,<br \/>\u00a0 \u00a0 \u00a0 \u201cvalue\u201d: \u201chtp:\/\/imageexample.com\/pic.jpg\u201d,<br \/>\u00a0 \u00a0 \u00a0 \u201cviolation\u201d: \u201cimage_url should start with \u2018https:\/\/&#8217;\u201d,<br \/>\u00a0 \u00a0 \u00a0 \u201csubstitute\u201d: \u201chttps:\/\/imageexample.com\/pic.jpg\u201d<br \/>\u00a0 \u00a0 },<br \/>\u00a0 \u00a0 {<br \/>\u00a0 \u00a0 \u00a0 \u201cindex\u201d: 3,<br \/>\u00a0 \u00a0 \u00a0 \u201ccolumn_name\u201d: \u201crating\u201d,<br \/>\u00a0 \u00a0 \u00a0 \u201cvalue\u201d: \u201c10\u201d,<br \/>\u00a0 \u00a0 \u00a0 \u201cviolation\u201d: \u201crating should be between 1 and 5\u201d,<br \/>\u00a0 \u00a0 \u00a0 \u201csubstitute\u201d: \u201c5\u201d<br \/>\u00a0 \u00a0 }<br \/>\u00a0 ]<br \/>}<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">The remaining steps are similar to the 
iteration process used during the validation of column data types. Once all violations are resolved and no further issues are detected, the data frame is fully validated.<\/p>\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXeimHu7_ZOMOLJy2Plhtqo3W6rQcy8m4WgLEK8FqLGVdtVA77VSeVK_p1hR8CzdyQe81hi2vORUZDSU42vEmZD4KPfnWcCWRbTAJx47tK6ZK_9_z5WvYVpCzcsIHY3DIrk07tHUqg?key=lABtwTjQ29DDn4nC3kBCGRmV\" alt=\"\"><\/figure>\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n<p class=\"wp-block-paragraph\">Expectations may sometimes lack domain expertise \u2014 integrating human input can help surface more diverse, specific, and reliable expectations.<\/p>\n<p class=\"wp-block-paragraph\">A key challenge lies in automation during the resolution process. 
A human-in-the-loop approach could introduce more transparency, particularly in the selection of substitute or imputed values.<\/p>\n<p class=\"wp-block-paragraph\">In upcoming articles, we\u2019ll explore related topics already on the roadmap, including:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">A detailed description of the spreadsheet encoder used in the article above.<\/li>\n<li class=\"wp-block-list-item\">Data uniqueness: preventing duplicate entities within the dataset.<\/li>\n<li class=\"wp-block-list-item\">Data completeness: handling missing values effectively.<\/li>\n<li class=\"wp-block-list-item\">Evaluating data reshaping, validity, and other key aspects of data quality.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Stay tuned!<\/p>\n<p class=\"wp-block-paragraph\">Thank you to Marc Hobballah for reviewing this article and providing feedback.<\/p>\n<p class=\"wp-block-paragraph\">All images, unless otherwise noted, are by the author.<\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/an-llm-based-workflow-for-automated-tabular-data-validation\/\">An LLM-Based Workflow for Automated Tabular Data Validation\u00a0<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p>Simon Grah<br \/>\n<a href=\"https:\/\/towardsdatascience.com\/an-llm-based-workflow-for-automated-tabular-data-validation\/\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>An LLM-Based Workflow for 
Automated Tabular Data Validation\u00a0 This article is part of a series of articles on automating data cleaning for any tabular dataset: Effortless Spreadsheet Normalisation With LLM You can test the feature described in this article on your own dataset using the CleanMyExcel.io service, which is free and requires no registration. What [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,1726,577,83,67,87,2375],"tags":[84,2376,760],"class_list":["post-3097","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-data-validation","category-data-quality","category-data-science","category-deep-dives","category-llm","category-tabular-data","tag-data","tag-types","tag-values"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3097"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=3097"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3097\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=3097"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=3097"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=3097"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}