{"id":10497,"date":"2026-02-16T07:02:24","date_gmt":"2026-02-16T07:02:24","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2026\/02\/16\/llms_for_data_pipelines_without_losing_control\/"},"modified":"2026-02-16T07:02:24","modified_gmt":"2026-02-16T07:02:24","slug":"llms_for_data_pipelines_without_losing_control","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2026\/02\/16\/llms_for_data_pipelines_without_losing_control\/","title":{"rendered":"LLMs for data pipelines without losing control (API \u2192 DuckDB in ~10 mins)"},"content":{"rendered":"<div>\n<table>\n<tr>\n<td> <a href=\"https:\/\/www.reddit.com\/r\/datascience\/comments\/1r4vmyi\/llms_for_data_pipelines_without_losing_control\/\"> <img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/preview.redd.it\/18uels5nsijg1.png?ssl=1\" alt=\"LLMs for data pipelines without losing control (API \u2192 DuckDB in ~10 mins)\" title=\"LLMs for data pipelines without losing control (API \u2192 DuckDB in ~10 mins)\"> <\/a> <\/td>\n<td> <!-- SC_OFF --><\/p>\n<div class=\"md\">\n<p>Hey folks,<\/p>\n<p>I\u2019ve been doing data engineering long enough to believe that \u201creal\u201d pipelines meant writing every parser by hand, dealing with pagination myself, and debugging nested JSON until it finally stopped exploding.<\/p>\n<p>I\u2019ve also been pretty skeptical of the \u201cjust prompt it\u201d approach.<\/p>\n<p>Lately though, I\u2019ve been experimenting with a workflow that feels less like hype and more like controlled engineering. Instead of starting with a blank <code>pipeline.py<\/code>, I:<\/p>\n<ul>\n<li>start from a scaffold (template already wired for pagination, config patterns, etc.)<\/li>\n<li>feed the LLM structured 
docs<\/li>\n<li>run it, let it fail<\/li>\n<li>paste the error back<\/li>\n<li>fix in one tight loop<\/li>\n<li>validate using metadata (so I\u2019m checking what actually loaded)<\/li>\n<\/ul>\n<p>The LLM does the mechanical work; I stay in charge of structure + validation.<\/p>\n<p><a href=\"https:\/\/preview.redd.it\/18uels5nsijg1.png?width=1536&amp;format=png&amp;auto=webp&amp;s=5fb68e761f9b30f573f098c7c342f18d73ab741c\">AI-assisted data ingestion<\/a><\/p>\n<p>We\u2019re doing a live session on Feb 17 to test this in real time, going from an empty folder \u2192 GitHub commits dashboard (DuckDB + dlt + marimo) and walking through the full loop.<\/p>\n<p>If you\u2019ve got an annoying API (weird pagination, nested structures, bad docs), bring it. That\u2019s more interesting than the happy path.<\/p>\n<p>We wrote up the full workflow with examples <a href=\"https:\/\/dlthub.com\/blog\/dtc-llm-native\">here<\/a>.<\/p>\n<p>Curious: what\u2019s the dealbreaker for you when using LLMs in pipelines?<\/p>\n<\/p><\/div>\n<p><!-- SC_ON -->   submitted by   <a href=\"https:\/\/www.reddit.com\/user\/Thinker_Assignment\"> \/u\/Thinker_Assignment <\/a> <br \/> <span><a href=\"https:\/\/www.reddit.com\/r\/datascience\/comments\/1r4vmyi\/llms_for_data_pipelines_without_losing_control\/\">[link]<\/a><\/span>   <span><a href=\"https:\/\/www.reddit.com\/r\/datascience\/comments\/1r4vmyi\/llms_for_data_pipelines_without_losing_control\/\">[comments]<\/a><\/span> <\/td>\n<\/tr>\n<\/table>\n<\/div>\n<p><a href=\"https:\/\/www.reddit.com\/r\/datascience\/comments\/1r4vmyi\/llms_for_data_pipelines_without_losing_control\/\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>LLMs for data pipelines without losing control (API \u2192 DuckDB in ~10 mins) Hey folks, I\u2019ve been doing data engineering long enough to believe 
that \u201creal\u201d pipelines meant writing every parser by hand, dealing with pagination myself, and debugging nested JSON until it finally stopped exploding. I\u2019ve also been pretty skeptical of the \u201cjust prompt [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,99],"tags":[84,865,4104],"class_list":["post-10497","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-datascience","tag-data","tag-pipelines","tag-ve"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/10497"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=10497"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/10497\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=10497"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=10497"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=10497"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}