Tag: dataset

  • Fine-Tuning vLLMs for Document Understanding

    In this article, I discuss how you can fine-tune VLMs (visual large language models, often called vLLMs) like Qwen 2.5 VL 7B. I will introduce you to a dataset of handwritten digits, which the base version of Qwen 2.5 VL struggles with. We will then inspect the dataset, annotate it,…

  • Generalized Kernel Inducing Points by Duality Gap for Dataset Distillation

    arXiv:2502.12607v1. Abstract: We propose Duality Gap KIP (DGKIP), an extension of the Kernel Inducing Points (KIP) method for dataset distillation. While existing dataset distillation methods often rely on bi-level optimization, DGKIP eliminates the need for such optimization by leveraging duality theory in…

  • You Get a Dataset and Need to Find a “Good” Model Quickly (in Hours or Days), what’s your strategy?

    Typical Scenario: Your friend gives you a dataset and challenges you to beat their model’s performance. They don’t tell you what they did, but they provide a single CSV file and the performance metric to optimize.…

  • Transcriptions dataset

    We teamed up with Miska Knapek to transcribe our 170 episodes into full written text, resulting in 1,539,957 spoken words overall, including 61 mentions of weather, 923 mentions of maps and 48 mentions of AI. Check out a little tour of the data, browse and search episodes using our brand new archive page, or, for the technically inclined, check out data and code…