{"id":3124,"date":"2025-04-16T07:02:21","date_gmt":"2025-04-16T07:02:21","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/04\/16\/an-unbiased-review-of-snowflakes-document-ai\/"},"modified":"2025-04-16T07:02:21","modified_gmt":"2025-04-16T07:02:21","slug":"an-unbiased-review-of-snowflakes-document-ai","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/04\/16\/an-unbiased-review-of-snowflakes-document-ai\/","title":{"rendered":"An Unbiased Review of Snowflake\u2019s Document AI"},"content":{"rendered":"<p>    An Unbiased Review of Snowflake\u2019s Document AI<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<p class=\"wp-block-paragraph\">As data <mdspan datatext=\"el1744747249097\" class=\"mdspan-comment\">professionals<\/mdspan>, we\u2019re comfortable with tabular data\u2026<\/p>\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/image-46.png?ssl=1\" alt=\"\" class=\"wp-image-601545\" style=\"width:657px;height:auto\"><figcaption class=\"wp-element-caption\">Tabular data. Image by Author.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">We can also handle words, json, xml feeds, and pictures of cats. But what about a cardboard box full of things like this? 
<\/p>\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/annie-spratt-recgFWxDO1Y-unsplash.jpg?ssl=1\" alt=\"\" class=\"wp-image-601544\" style=\"width:437px;height:auto\"><figcaption class=\"wp-element-caption\">(Image by Annie Spratt, <a href=\"https:\/\/unsplash.com\/photos\/a-receipt-sitting-on-top-of-a-wooden-table-recgFWxDO1Y\">Unsplash<\/a>)<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The info on this receipt wants so badly to be in a tabular database somewhere. Wouldn\u2019t it be great if we could scan all these, run them through an LLM, and save the results in a table? <\/p>\n<p class=\"wp-block-paragraph\">Lucky for us, we live in the era of <a href=\"https:\/\/towardsdatascience.com\/tag\/document-ai\/\" title=\"Document AI\">Document AI<\/a>. Document AI combines OCR with LLMs and allows us to build a bridge between the paper world and the digital database world.<\/p>\n<p class=\"wp-block-paragraph\">All the major cloud vendors have some version of this\u2026 <\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/cloud.google.com\/document-ai?hl=en\">Google (Document AI)<\/a> <\/li>\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/document-ai\/\">Microsoft (Document AI)<\/a> <\/li>\n<li class=\"wp-block-list-item\">\n<a href=\"https:\/\/aws.amazon.com\/ai\/generative-ai\/use-cases\/document-processing\/\">AWS (Intelligent Document Processing)<\/a><\/li>\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/docs.snowflake.com\/en\/user-guide\/snowflake-cortex\/document-ai\/overview\">Snowflake (Document AI)<\/a><\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Here I\u2019ll share my thoughts on <a href=\"https:\/\/towardsdatascience.com\/tag\/snowflake\/\" title=\"Snowflake\">Snowflake<\/a>\u2019s 
Document AI. Aside from using Snowflake at work, I have no affiliation with Snowflake. They didn\u2019t commission me to write this piece and I\u2019m not part of any ambassador program. All of that is to say I can write an <em>unbiased<\/em> review of <a href=\"https:\/\/docs.snowflake.com\/en\/user-guide\/snowflake-cortex\/document-ai\/overview\">Snowflake\u2019s Document AI<\/a>.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">What is Document AI?\u00a0<\/h2>\n<p class=\"wp-block-paragraph\">Document AI allows users to quickly extract information from digital documents. When we say \u201cdocuments\u201d we mean pictures with words.\u00a0Don\u2019t confuse this with <a href=\"https:\/\/aws.amazon.com\/documentdb\/\">niche NoSQL things<\/a>. <\/p>\n<p class=\"wp-block-paragraph\">The product combines OCR and LLM models so that a user can create a set of prompts and execute those prompts against a large collection of documents all at once. <\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"430\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/image-47-1024x430.png?resize=1024%2C430&#038;ssl=1\" alt=\"\" class=\"wp-image-601546\"><figcaption class=\"wp-element-caption\">Snowflake\u2019s Document AI on a (scrubbed) resume. Image by author.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">LLMs and OCR both have room for error. Snowflake solved this by (1) banging their heads against OCR until it\u2019s sharp \u2014 I see you, Snowflake developer \u2014 and (2) letting me fine-tune my LLM.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">Fine-tuning the Snowflake LLM feels a lot more like <a href=\"https:\/\/www.merriam-webster.com\/dictionary\/glamping\">glamping<\/a> than some rugged outdoor adventure. 
I review 20+ documents, hit the \u201ctrain model\u201d button, then rinse and repeat until performance is satisfactory. Am I even a data scientist anymore?<\/p>\n<p class=\"wp-block-paragraph\">Once the model is trained, I can run my prompts on 1000 documents at a time. I like to save the results to a table, but you could do whatever you want with the results in real time.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">Why does it matter?\u00a0<\/h2>\n<p class=\"wp-block-paragraph\">This product is cool for several reasons.<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">You can build a bridge between the paper and digital worlds. I never thought the big box of paper invoices under my desk would make it into my cloud data warehouse, but now it can.\u00a0 Scan the paper invoice, upload it to Snowflake, run my Document AI model, and wham! I have my desired information parsed into a tidy table.\n<\/li>\n<li class=\"wp-block-list-item\">It\u2019s frighteningly convenient to invoke a machine-learning model via SQL. Why didn\u2019t we think of this sooner? In the old days this took a few hundred lines of code to load the raw data (SQL &gt;&gt; Python\/Spark\/etc.), clean it, engineer features, do a train\/test split, train a model, make predictions, and then often write the predictions back into SQL.\u00a0<\/li>\n<\/ul>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">To build this in-house would be a major undertaking. Yes, OCR has been around a long time but can still be finicky. Fine-tuning an LLM obviously hasn\u2019t been around too long, but is getting easier by the week. To piece these together in a way that achieves high accuracy for a variety of documents could take a long time to hack on your own. Months and months of polish. <\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Of course some elements are still built in house. 
Once I extract information from the document I have to figure out what to do with that information. That\u2019s relatively quick work, though.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">Our Use Case \u2014 Bring on Flu Season<\/h2>\n<p class=\"wp-block-paragraph\">I work at a company called <a href=\"https:\/\/www.intelycare.com\/\">IntelyCare<\/a>. We operate in the healthcare staffing space, which means we help hospitals, nursing homes, and rehab centers find quality clinicians for individual shifts, extended contracts, or full-time\/part-time engagements.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">Many of our facilities require clinicians to have an up-to-date flu shot. Last year, our clinicians submitted over 10,000 flu shots in addition to hundreds of thousands of other documents. We reviewed all of these manually to ensure validity. Part of the joy of working in the healthcare staffing world!<\/p>\n<p class=\"wp-block-paragraph\"><strong>Spoiler Alert: Using Document AI, we were able to reduce the number of flu-shot documents needing manual review by ~50%, all in just a couple of weeks.<\/strong><\/p>\n<p class=\"wp-block-paragraph\">To pull this off, we did the following:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Uploaded a pile of flu-shot documents to Snowflake.<\/li>\n<\/ul>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Massaged the prompts, trained the model, massaged the prompts some more, retrained the model some more\u2026\u00a0<\/li>\n<li class=\"wp-block-list-item\">Built out the logic to compare the model output against the clinician\u2019s profile (e.g. 
do the names match?).\u00a0Definitely some trial and error here with formatting names, dates, etc.<\/li>\n<li class=\"wp-block-list-item\">Built out the \u201cdecision logic\u201d to either approve the document or send it back to the humans.<\/li>\n<li class=\"wp-block-list-item\">Tested the full pipeline on a bigger pile of manually reviewed documents. Took a close look at any false positives.<\/li>\n<li class=\"wp-block-list-item\">Repeated until our confusion matrix was satisfactory.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">For this project, false positives pose a business risk. We don\u2019t want to approve a document that\u2019s expired or missing key information. We kept iterating until the false-positive rate on our test documents hit zero. We\u2019ll have some false positives eventually, but fewer than what we have now with a human review process. <\/p>\n<p class=\"wp-block-paragraph\">False negatives, however, are harmless. If our pipeline doesn\u2019t like a flu shot, it simply routes the document to the human team for review.\u00a0If they go on to approve the document, it\u2019s business as usual. <\/p>\n<p class=\"wp-block-paragraph\">The model does well with the clean\/easy documents, which account for ~50% of all flu shots. 
If it\u2019s messy or confusing, it goes back to the humans as before.\u00a0<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">Things we learned along the way<\/h2>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>The model does best at reading the document, not making decisions or doing math based on the document.<\/strong><\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">Initially, our prompts attempted to determine the validity of the document.<\/p>\n<p class=\"has-text-align-center wp-block-paragraph\">Bad: <em>Is the document already expired?<\/em><\/p>\n<p class=\"has-text-align-left wp-block-paragraph\">We found it <em>far<\/em> more effective to limit our prompts to questions that could be answered by looking at the document. The LLM doesn\u2019t <em>determine<\/em> anything. It just grabs the relevant data points off the page.<\/p>\n<p class=\"has-text-align-center wp-block-paragraph\">Good: <em>What is the expiration date?\u00a0<\/em><\/p>\n<p class=\"wp-block-paragraph\">Save the results and do the math downstream. <\/p>\n<ol start=\"2\" class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>You still need to be thoughtful about training data<\/strong><\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">We had a few duplicate flu shots from one clinician in our training data. Call this clinician Ben. One of our prompts was, \u201cWhat is the patient\u2019s name?\u201d Because \u201cBen\u201d was in the training data multiple times, any remotely unclear document would return with \u201cBen\u201d as the patient name. <\/p>\n<p class=\"wp-block-paragraph\">So overfitting is still a thing. Over\/undersampling is still a thing. We tried again with a more thoughtful collection of training documents and things went much better.<\/p>\n<p class=\"wp-block-paragraph\">Document AI is pretty magical, but not <em>that<\/em> magical. Fundamentals still matter. 
<\/p>\n<ol start=\"3\" class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>The model could be fooled by writing on a napkin<\/strong><\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">To my knowledge, Snowflake does not have a way to render the document image as an <a href=\"https:\/\/abdulkaderhelwan.medium.com\/introduction-to-image-embeddings-55b8247d13f2\">embedding<\/a>. You can create an embedding from the extracted text, but that won\u2019t tell you if the text was written by hand or not. As long as the <em>text<\/em> is valid, the model and downstream logic will give it a green light.<\/p>\n<p class=\"wp-block-paragraph\">You could fix this pretty easily by comparing image embeddings of submitted documents to the embeddings of accepted documents. Any document with an embedding way out in left field would be sent back for human review. This is straightforward work, but you\u2019ll have to do it outside Snowflake for now.\u00a0<\/p>\n<ol start=\"4\" class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>Not as expensive as I was expecting\u00a0<\/strong><\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">Snowflake has a reputation for being spendy. And because of HIPAA compliance concerns, we run a higher-tier Snowflake account for this project. I tend to worry about running up a Snowflake tab.<\/p>\n<p class=\"wp-block-paragraph\">In the end we had to try extra hard to spend more than $100\/week while training the model. We ran thousands of documents through the model every few days to measure its accuracy while iterating on the model, but never managed to break the budget. <\/p>\n<p class=\"wp-block-paragraph\">Better still, we\u2019re saving money on the manual review process. The cost of having the AI review 1000 documents (of which it approves ~500) is ~20% of what we spend on humans reviewing the remaining 500. 
All in, a 40% reduction in costs for reviewing flu-shots.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">Summing up<\/h2>\n<p class=\"wp-block-paragraph\">I\u2019ve been impressed with how quickly we could complete a project of this scope using Document AI. We\u2019ve gone from months to days. I give it 4 stars out of 5, and am open to giving it a 5th star if Snowflake ever gives us access to image embeddings.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">Since flu shots, we\u2019ve deployed similar models for other documents with similar or better results. And with all this prep work, instead of dreading the upcoming flu season, we\u2019re ready to bring it on. <\/p>\n<p class=\"wp-block-paragraph\">\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/an-unbiased-review-of-snowflakes-document-ai\/\">An Unbiased Review of Snowflake\u2019s Document AI<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Ben Tengelsen<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/an-unbiased-review-of-snowflakes-document-ai\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>An Unbiased Review of Snowflake\u2019s Document AI As data professionals, we\u2019re comfortable with tabular data\u2026 Tabular data. Image by Author. We can also handle words, json, xml feeds, and pictures of cats. But what about a cardboard box full of things like this? 
(Image by Annie Spratt, Unsplash) The info on this receipt wants so [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,69,2381,2382,71,70,2383],"tags":[98,1939,2384],"class_list":["post-3124","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-artificial-intelligence","category-document-ai","category-healtcare","category-large-language-models","category-machine-learning","category-snowflake","tag-ai","tag-document","tag-snowflake"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3124"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=3124"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3124\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=3124"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=3124"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=3124"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}