Tag: text

Learning with the Nash-Sutcliffe loss

arXiv:2603.00968v1 (announce type: new). Abstract: The Nash-Sutcliffe efficiency ($\text{NSE}$) is a widely used, positively oriented relative measure for evaluating forecasts across multiple time series. However, it lacks a decision-theoretic foundation for this purpose. To address this, we examine its negatively oriented counterpart, which we refer to as Nash-Sutcliffe loss, defined…

March 3, 2026
Detecting and Mitigating Treatment Leakage in Text-Based Causal Inference: Distillation and Sensitivity Analysis

arXiv:2601.02400v1 (announce type: cross). Abstract: Text-based causal inference increasingly employs textual data as proxies for unobserved confounders, yet this approach introduces a previously undertheorized source of bias: treatment leakage. Treatment leakage occurs when text intended to capture confounding information also contains signals…

January 7, 2026
GliNER2: Extracting Structured Information from Text

From unstructured text to structured knowledge graphs. Tomaz Bratanic, Towards Data Science.

January 7, 2026
Separate Numbers and Text in One Column Using Power Query

An Excel sheet with a column containing numbers and text? What a mess! Salvatore Cagliari, Towards Data Science.

December 17, 2025
LLM-Powered CPI Prediction Inference with Online Text Time Series

arXiv:2506.09516v1 (announce type: new). Abstract: Forecasting the Consumer Price Index (CPI) is an important yet challenging task in economics, where most existing approaches rely on low-frequency, survey-based data. With the recent advances of large language models (LLMs), there is growing potential to leverage high-frequency online text…

June 12, 2025
A Two-Sample Test of Text Generation Similarity

arXiv:2505.05269v1 (announce type: new). Abstract: The surge in digitized text data requires reliable inferential methods on observed textual patterns. This article proposes a novel two-sample text test for comparing similarity between two groups of documents. The hypothesis is whether the probabilistic mapping generating the textual data is identical…

May 9, 2025
Real-Time Interactive Sentiment Analysis in Python

You know what the best part of being an engineer is? You can just build stuff. It’s like a superpower. One rainy afternoon I had this random idea of creating a sentiment visualization of a text input with a smiley face that changes its expression based on how positive…

May 8, 2025
Retrieval Augmented Classification: Improving Text Classification with External Knowledge

Text classification stands as one of the most basic yet most important applications of natural language processing. It has a vital role in many real-world applications, ranging from filtering unwanted emails like spam to detecting product categories or classifying user intent in a chatbot application. The…

May 7, 2025
R.E.D.: Scaling Text Classification with Expert Delegation

With the new age of problem-solving augmented by Large Language Models (LLMs), only a handful of problems remain that have subpar solutions. Most classification problems (at a PoC level) can be solved by leveraging LLMs at 70–90% Precision/F1 with just good prompt engineering techniques, as well as adaptive…

March 21, 2025
LLaDA: The Diffusion Model That Could Redefine Language Generation

What if we could make language models think more like humans? Instead of writing one word at a time, what if they could sketch out their thoughts first, and gradually refine them? This is exactly what Large Language Diffusion Models (LLaDA) introduces: a different approach to…

February 27, 2025
Multimodal Search Engine Agents Powered by BLIP-2 and Gemini

This post was co-authored with Rafael Guedes. Traditional models can only process a single type of data, such as text, images, or tabular data. Multimodality is a trending concept in the AI research community, referring to a model’s ability to learn from multiple types of…

February 20, 2025
Machine Learning + openAI: solving a text classification problem

How I migrated an old solution to a more elegant, robust and scalable one using text classification from OpenAI. Ricardo Ribas, Towards Data Science.

January 12, 2025
Who Wrote This? Zero-Shot Statistical Tests for LLM-Generated Text Detection using Finite Sample Concentration Inequalities

arXiv:2501.02406v1 (announce type: new). Abstract: Verifying the provenance of content is crucial to the function of many organizations, e.g., educational institutions, social media platforms, firms, etc. This problem is becoming increasingly difficult as text generated by Large Language Models (LLMs)…

January 7, 2025
Conditional Variational Autoencoders for Text to Image Generation

Investigating an early generative architecture and applying it to image generation from text input. Recently I was tasked with text-to-image synthesis using a conditional variational autoencoder (CVAE). Being one of the earlier generative structures, it has its limitations but is easily implementable. This article will cover CVAEs at…

December 22, 2024
Semantically Compress Text to Save On LLM Costs

LLMs are great… if they can fit all of your data. Originally published at https://blog.developer.bazaarvoice.com on October 28, 2024. Large language models are fantastic tools for unstructured text, but what if your text doesn’t fit in the context window? Bazaarvoice faced exactly this…

December 21, 2024