{"id":1277,"date":"2025-01-18T07:03:58","date_gmt":"2025-01-18T07:03:58","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/01\/18\/effective-ml-with-limited-data-where-to-start-194492e7a6f8\/"},"modified":"2025-01-18T07:03:58","modified_gmt":"2025-01-18T07:03:58","slug":"effective-ml-with-limited-data-where-to-start-194492e7a6f8","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/01\/18\/effective-ml-with-limited-data-where-to-start-194492e7a6f8\/","title":{"rendered":"Where to Start When Data is Limited"},"content":{"rendered":"<div>\n<h4>A launch pad for projects with small\u00a0datasets<\/h4>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Az8V20Ww4ZGw4NpytI7DhSA.jpeg?ssl=1\"><figcaption>Photo by Google DeepMind: <a href=\"https:\/\/www.pexels.com\/photo\/an-artist-s-illustration-of-artificial-intelligence-ai-this-image-depicts-how-ai-can-help-humans-to-understand-the-complexity-of-biology-it-was-created-by-artist-khyati-trehan-as-part-17484975\/\">https:\/\/www.pexels.com\/photo\/an-artist-s-illustration-of-artificial-intelligence-ai-this-image-depicts-how-ai-can-help-humans-to-understand-the-complexity-of-biology-it-was-created-by-artist-khyati-trehan-as-part-17484975\/<\/a><\/figcaption><\/figure>\n<p>Machine Learning (ML) has driven remarkable breakthroughs in computer vision, natural language processing, and speech recognition, largely due to the abundance of data in these fields. However, many challenges\u200a\u2014\u200aespecially those tied to specific product features or scientific research\u200a\u2014\u200asuffer from limited data quality and quantity. 
This guide provides a roadmap for tackling small data problems based on your data constraints, and offers potential solutions to guide your decision making early\u00a0on.<\/p>\n<h3>1. Difficulty with Gathering More\u00a0Data<\/h3>\n<p>Raw data is rarely a blocker for ML projects. High-quality labels, on the other hand, are often prohibitively expensive and laborious to collect, because obtaining an expert-labelled \u201cground truth\u201d requires domain expertise, intensive fieldwork, or specialised knowledge. For instance, your problem might focus on rare events such as endangered species monitoring, extreme climate events, or unusual manufacturing defects. Other times, business-specific or scientific questions might be too specialised for off-the-shelf large-scale datasets. Ultimately, this means many projects fail because label acquisition is too expensive.<\/p>\n<h3>2. Core Challenges of Small\u00a0Datasets<\/h3>\n<p>With only a small dataset, any new project starts off with inherent risks. How much of the true variability does your dataset capture? The smaller your dataset gets, the harder this question is to answer. That makes testing and validation increasingly difficult, and leaves a great deal of uncertainty about how well your model actually generalises. Your model doesn\u2019t know what your data doesn\u2019t capture. With potentially only a few hundred samples, both the richness of the features you can extract and the number of features you can use without significant risk of overfitting (which in many cases you can\u2019t measure) decrease. This often leaves you limited to classical ML algorithms (Random Forest, SVM, etc.) or heavily regularised deep learning methods. Class imbalance will only exacerbate your problems. Small datasets are also more sensitive to noise: just a few incorrect labels or faulty measurements can cause havoc and headaches.<\/p>\n<h3>3. 
Questions to Get You\u00a0Started<\/h3>\n<p>For me, working the problem starts with asking a few simple questions about the data, labelling process, and end goals. By framing your problem with a \u201cchecklist\u201d, we can clarify the constraints of your data. Have a go at answering the questions below:<\/p>\n<p><strong>Is your dataset fully labelled, partially labelled, or mostly unlabelled?<\/strong><\/p>\n<ul>\n<li>\n<strong>Fully labelled<\/strong>: You have labels for (nearly) all samples in your\u00a0dataset.<\/li>\n<li>\n<strong>Partially labelled<\/strong>: A portion of the dataset has labels, but there\u2019s a large portion of unlabelled data.<\/li>\n<li>\n<strong>Mostly unlabelled<\/strong>: You have very few (or no) labelled data\u00a0points.<\/li>\n<\/ul>\n<p><strong>How reliable are the labels you do\u00a0have?<\/strong><\/p>\n<ul>\n<li>\n<strong>Highly reliable<\/strong>: Multiple annotators agree on labels, or they are confirmed by trusted experts or well-established protocols.<\/li>\n<li>\n<strong>Noisy or weak<\/strong>: Labels may be crowd-sourced, generated automatically, or prone to human or sensor\u00a0error.<\/li>\n<\/ul>\n<p><strong>Are you solving one problem, or do you have multiple (related) tasks?<\/strong><\/p>\n<ul>\n<li>\n<strong>Single-task<\/strong>: A singular objective, such as a binary classification or a single regression target.<\/li>\n<li>\n<strong>Multi-task<\/strong>: Multiple outputs or multiple objectives.<\/li>\n<\/ul>\n<p><strong>Are you dealing with rare events or heavily imbalanced classes?<\/strong><\/p>\n<ul>\n<li>\n<strong>Yes<\/strong>: Positive examples are very scarce (e.g., \u201cequipment failure,\u201d \u201cadverse drug reactions,\u201d or \u201cfinancial fraud\u201d).<\/li>\n<li>\n<strong>No<\/strong>: Classes are somewhat balanced, or your task doesn\u2019t involve highly skewed distributions.<\/li>\n<\/ul>\n<p><strong>Do you have expert knowledge available, and if so, in 
what\u00a0form?<\/strong><\/p>\n<ul>\n<li>\n<strong>Human experts<\/strong>: You can periodically query domain experts to label new data or verify predictions.<\/li>\n<li>\n<strong>Model-based experts<\/strong>: You have access to well-established simulation or physical models (e.g., fluid dynamics, chemical kinetics) that can inform or constrain your ML\u00a0model.<\/li>\n<li>\n<strong>No<\/strong>: No relevant domain expertise is available to guide or correct the\u00a0model.<\/li>\n<\/ul>\n<p><strong>Is labelling new data possible, and at what\u00a0cost?<\/strong><\/p>\n<ul>\n<li>\n<strong>Feasible and affordable<\/strong>: You can acquire more labelled examples if necessary.<\/li>\n<li>\n<strong>Difficult or expensive<\/strong>: Labelling is time-intensive, costly, or requires specialised domain knowledge (e.g., medical diagnosis, advanced scientific measurements).<\/li>\n<\/ul>\n<p><strong>Do you have prior knowledge or access to pre-trained models relevant to your\u00a0data?<\/strong><\/p>\n<ul>\n<li>\n<strong>Yes<\/strong>: There exist large-scale models or datasets in your domain (e.g., ImageNet for images, BERT for\u00a0text).<\/li>\n<li>\n<strong>No<\/strong>: Your domain is niche or specialised, and there aren\u2019t obvious pre-trained resources.<\/li>\n<\/ul>\n<h3>4. Matching Questions to Techniques<\/h3>\n<p>With your answers to the questions above ready, we can move towards establishing a list of potential techniques for tackling your problem. In practice, small dataset problems require careful, nuanced experimentation. So before implementing the techniques below, give yourself a solid foundation: start with a simple model, get a full pipeline working as quickly as possible, and always cross-validate. This gives you a baseline against which to iteratively apply new techniques based on your error analysis, while keeping experiments small in scale. This also helps avoid building an overly complicated pipeline that\u2019s never properly validated. 
With a baseline in place, chances are your dataset will evolve rapidly. Tools like DVC or MLflow help track dataset versions and ensure reproducibility. In a small-data scenario, even a handful of new labelled examples can significantly change model performance, and version control helps you manage that systematically.<\/p>\n<p>With that in mind, here\u2019s how your answers to the questions above point towards specific strategies described later in this\u00a0post:<\/p>\n<p><strong>Fully Labelled + Single Task + Sufficiently Reliable\u00a0Labels:<\/strong><\/p>\n<ul>\n<li>\n<strong>Data Augmentation<\/strong> (Section 5.7) to increase effective sample\u00a0size.<\/li>\n<li>\n<strong>Ensemble Methods<\/strong> (Section 5.9) if you can afford multiple model training\u00a0cycles.<\/li>\n<li>\n<strong>Transfer Learning<\/strong> (Section 5.1) if a pre-trained model in your domain (or a related domain) is available.<\/li>\n<\/ul>\n<p><strong>Partially Labelled + Labelling is Reliable or Achievable:<\/strong><\/p>\n<ul>\n<li>\n<strong>Semi-Supervised Learning<\/strong> (Section 5.3) to leverage a larger pool of unlabelled data.<\/li>\n<li>\n<strong>Active Learning<\/strong> (Section 5.6) if you have a human expert who can label the most informative samples.<\/li>\n<li>\n<strong>Data Augmentation<\/strong> (Section 5.7) where possible.<\/li>\n<\/ul>\n<p><strong>Rarely Labelled or Mostly Unlabelled + Expert Knowledge Available:<\/strong><\/p>\n<ul>\n<li>\n<strong>Active Learning<\/strong> (Section 5.6) to selectively query an expert (especially if the expert is a\u00a0person).<\/li>\n<li>\n<strong>Process-Aware (Hybrid) Models<\/strong> (Section 5.10) if your \u201cexpert\u201d is a well-established simulation or\u00a0model.<\/li>\n<\/ul>\n<p><strong>Rarely Labelled or Mostly Unlabelled + No Expert \/ No Additional Labels:<\/strong><\/p>\n<ul>\n<li>\n<strong>Self-Supervised Learning<\/strong> (Section 5.2) to exploit inherent structure in unlabelled 
data.<\/li>\n<li>\n<strong>Few-Shot or Zero-Shot Learning<\/strong> (Section 5.4) if you can rely on meta-learning or textual descriptions to handle novel\u00a0classes.<\/li>\n<li>\n<strong>Weakly Supervised Learning<\/strong> (Section 5.5) if your labels exist but are imprecise or high-level.<\/li>\n<\/ul>\n<p><strong>Multiple Related\u00a0Tasks:<\/strong><\/p>\n<ul>\n<li>\n<strong>Multitask Learning<\/strong> (Section 5.8) to share representations between tasks, effectively pooling \u201csignal\u201d across the entire\u00a0dataset.<\/li>\n<\/ul>\n<p><strong>Dealing with Noisy or Weak\u00a0Labels:<\/strong><\/p>\n<ul>\n<li>\n<strong>Weakly Supervised Learning<\/strong> (Section 5.5) which explicitly handles label\u00a0noise.<\/li>\n<li>Combine with <strong>Active Learning<\/strong> or a small \u201cgold standard\u201d subset to clean up the worst labelling errors.<\/li>\n<\/ul>\n<p><strong>Highly Imbalanced \/ Rare\u00a0Events:<\/strong><\/p>\n<ul>\n<li>\n<strong>Data Augmentation<\/strong> (Section 5.7) targeting minority classes (e.g., synthetic minority oversampling).<\/li>\n<li>\n<strong>Active Learning<\/strong> (Section 5.6) to specifically label more of the rare\u00a0cases.<\/li>\n<li>\n<strong>Process-Aware Models<\/strong> (Section 5.10) or domain expertise to confirm rare cases, if possible.<\/li>\n<\/ul>\n<p><strong>Have a Pre-Trained Model or Domain-Specific Knowledge:<\/strong><\/p>\n<ul>\n<li>\n<strong>Transfer Learning<\/strong> (Section 5.1) is often the quickest\u00a0win.<\/li>\n<li>\n<strong>Process-Aware Models<\/strong> (Section 5.10) if combining your domain knowledge with ML can reduce data requirements.<\/li>\n<\/ul>\n<h3>5. Strategies for Tackling Small\u00a0Data<\/h3>\n<p>Hopefully, the above has provided a starting point for solving your small data problem. It\u2019s worth noting that many of the techniques discussed are complex and resource intensive. 
So keep in mind you\u2019ll likely need to get buy-in from your team and project managers before starting. This is best done through clear, concise communication of the potential value these techniques might provide. Frame experiments as strategic, foundational work that can be reused, refined, and leveraged for future projects. Focus on demonstrating clear, measurable impact from a short, tightly-scoped pilot.<\/p>\n<p>Despite the relatively simple picture painted of each technique below, it\u2019s important to keep in mind there\u2019s no one-size-fits-all solution. Applying these techniques isn\u2019t like stacking Lego bricks, nor do they work out of the box. To get you started, I\u2019ve provided a brief overview of each technique. It is by no means exhaustive, but it should offer a starting point for your own research.<\/p>\n<h3>5.1 Transfer\u00a0Learning<\/h3>\n<p>Transfer learning is about reusing existing models to solve new related problems. By starting with pre-trained weights, you leverage representations learned from large, diverse datasets and fine-tune the model on your smaller, target\u00a0dataset.<\/p>\n<p><strong>Why it\u00a0helps:<\/strong><\/p>\n<ul>\n<li>Leverages powerful features learnt from larger, often diverse datasets.<\/li>\n<li>Fine-tuning pre-trained models typically leads to higher accuracy, even with limited samples, while reducing training\u00a0time.<\/li>\n<li>Ideal when compute resources or project timelines prevent training a model from\u00a0scratch.<\/li>\n<\/ul>\n<p><strong>Tips:<\/strong><\/p>\n<ul>\n<li>Select a model aligned with your problem domain or a large general-purpose \u201cfoundation model\u201d like Mistral (language) or CLIP\/SAM (vision), accessible on platforms like Hugging Face. 
These models often outperform domain-specific pre-trained models due to their general-purpose capabilities.<\/li>\n<li>Freeze layers that capture general features while fine-tuning only a few layers on\u00a0top.<\/li>\n<li>To counter the risk of overfitting to your small dataset, try pruning. Here, less important weights or connections are removed, reducing the number of trainable parameters and increasing inference speed.<\/li>\n<li>If interpretability is required, large black-box models may not be\u00a0ideal.<\/li>\n<li>Without access to the pre-trained model\u2019s source dataset, you risk reinforcing sampling biases during fine-tuning.<\/li>\n<\/ul>\n<p>A nice example of transfer learning is described in the <a href=\"https:\/\/pmc.ncbi.nlm.nih.gov\/articles\/PMC8913041\/#:~:text=The%20customized%20ResNet%20model%20presented,the%20clinical%20decision-making%20process.\">following paper<\/a>, where leveraging a pre-trained ResNet model enabled better classification of chest X-ray images for detecting COVID-19. Supported by the use of dropout and batch normalisation, the researchers froze the initial layers of the ResNet base model while fine-tuning the later layers, which capture task-specific, high-level features. This proved to be a cost-effective method for achieving high accuracy with a small\u00a0dataset.<\/p>\n<h3>5.2 Self-Supervised Learning<\/h3>\n<p>Self-supervised learning is a pre-training technique where artificial tasks (\u201cpretext tasks\u201d) are created to learn representations from broad unlabelled data. Examples include predicting masked tokens for text, or rotation prediction and colourisation for images. 
The result is general-purpose representations you can later pair with transfer learning (Section 5.1) or semi-supervised learning (Section 5.3) and fine-tune with your smaller\u00a0dataset.<\/p>\n<p><strong>Why it\u00a0helps:<\/strong><\/p>\n<ul>\n<li>Pre-trained models serve as a strong initialisation point, reducing the risk of future overfitting.<\/li>\n<li>The model learns to represent data in a way that captures intrinsic patterns and structures (e.g., spatial, temporal, or semantic relationships), making the representations more effective for downstream tasks.<\/li>\n<\/ul>\n<p><strong>Tips:<\/strong><\/p>\n<ul>\n<li>Augmentations like cropping, rotation, colour jitter, or noise injection make excellent pretext tasks for visual data. However, it\u2019s a balance, as excessive augmentation can distort the distribution of small\u00a0data.<\/li>\n<li>Ensure unlabelled data is representative of the small dataset\u2019s distribution to help the model learn features that generalise well.<\/li>\n<li>Self-supervised methods can be compute-intensive, often requiring both a large computation budget and enough unlabelled data to truly benefit.<\/li>\n<\/ul>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2010.02559\">LEGAL-BERT<\/a> is a prominent example of self-supervised learning. LEGAL-BERT is a domain-specific variant of the BERT language model, pre-trained on a large dataset of legal documents to improve its understanding of legal language, terminology, and context. The key is the use of unlabelled data: techniques such as masked language modelling (the model learns to predict masked words) and next sentence prediction (learning the relationships between sentences, and determining if one follows another) remove the requirement for labelling. This text embedding model can then be used for more specific legal ML\u00a0tasks.<\/p>\n<h3>5.3 Semi-Supervised Learning<\/h3>\n<p>Semi-supervised learning leverages a small labelled dataset in addition to a larger unlabelled set. 
The model iteratively refines predictions on unlabelled data to generate task-specific predictions that can be used as \u201cpseudo-labels\u201d for further iterations.<\/p>\n<p><strong>Why it\u00a0helps:<\/strong><\/p>\n<ul>\n<li>Labelled data guides the task-specific objective, while the unlabelled data is used to improve generalisation (e.g., through pseudo-labelling, consistency regularisation, or other techniques).<\/li>\n<li>Improves decision boundaries and can boost generalisation.<\/li>\n<\/ul>\n<p><strong>Tips:<\/strong><\/p>\n<ul>\n<li>Consistency regularisation is a method that assumes model predictions should be consistent across small perturbations (noise, augmentations) made to unlabelled data. The idea is to \u201csmooth\u201d the decision boundary in sparsely populated regions of a high-dimensional space.<\/li>\n<li>Pseudo-labelling allows you to train an initial model with a small dataset and use its predictions on unlabelled data as \u201cpseudo\u201d labels for further training, with the aim of generalising better and reducing overfitting.<\/li>\n<\/ul>\n<p>Financial fraud detection is a problem that naturally lends itself to semi-supervised learning, with very little real labelled data (confirmed fraud cases) and a large set of unlabelled transaction data. The <a href=\"https:\/\/arxiv.org\/pdf\/2003.01171\">following paper<\/a> proposes a neat solution: modelling transactions, users, and devices as nodes in a graph, where edges represent relationships such as shared accounts or devices. The small set of labelled fraudulent data is then used to train the model by propagating fraud signals across the graph to the unlabelled nodes. 
For example, if a fraudulent transaction (labelled node) is linked to multiple unlabelled nodes (e.g., related users or devices), the model learns patterns and connections that might indicate\u00a0fraud.<\/p>\n<h3>5.4 Few-Shot and Zero-Shot Learning<\/h3>\n<p>Few-shot and zero-shot learning refer to a broad collection of techniques designed to tackle very small datasets head on. Generally, these methods train a model to identify \u201cnovel\u201d classes unseen during training, with a small labelled dataset used primarily for\u00a0testing.<\/p>\n<p><strong>Why it\u00a0helps:<\/strong><\/p>\n<ul>\n<li>These approaches enable models to quickly adapt to new tasks or classes without extensive retraining.<\/li>\n<li>Useful for domains with rare or unique categories, such as rare diseases or niche object detection.<\/li>\n<\/ul>\n<p><strong>Tips:<\/strong><\/p>\n<ul>\n<li>Probably the most common technique, known as similarity-based learning, trains a model to compare pairs of items and decide if they belong to the same class. By learning a similarity or distance measure, the model can generalise to unseen classes by comparing new instances to class prototypes (built from your small set of labelled data) during testing. This approach requires a good way to represent different types of input (an embedding), often created using Siamese neural networks or similar\u00a0models.<\/li>\n<li>Optimisation-based meta-learning aims to train a model to quickly adapt to new tasks or classes using only a small amount of training data. A popular example is model-agnostic meta-learning (MAML), where a \u201cmeta-learner\u201d is trained on many small tasks, each with its own training and testing examples. The goal is to teach the model to start from a good initial state, so when it encounters a new task, it can quickly learn and adjust with minimal additional training. 
These are not simple methods to implement.<\/li>\n<li>A more classical technique, one-class classification, is where a binary classifier (like one class SVM) is trained on data from only one class, and learns to detect outliers during\u00a0testing.<\/li>\n<li>Zero-shot approaches, such as CLIP or large language models with prompt engineering, enable classification or detection of unseen categories using textual cues (e.g., \u201ca photo of a new product\u00a0type\u201d).<\/li>\n<li>In zero-shot cases, combine with active learning (human in the loop) to label the most informative examples.<\/li>\n<\/ul>\n<p>It\u2019s important to maintain realistic expectations when implementing few-shot and zero-shot techniques. Often, the aim is to achieve usable or \u201cgood enough\u201d performance. As a direct comparison of traditional deep-learning (DL) methods, the <a href=\"https:\/\/pdf.sciencedirectassets.com\/273474\/1-s2.0-S1574954121X00039\/1-s2.0-S1574954121001114\/am.pdf?X-Amz-Security-Token=IQoJb3JpZ2luX2VjEBUaCXVzLWVhc3QtMSJHMEUCIQDfYiAEiSUcw2kzloRHD9NuMQQYU4UDfC9wmkQE6RZb1AIgGjzmOp1Im5GTx8lYQLujZKHyjjH%2BzF7beKSPgHSulvgqvAUI7f%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARAFGgwwNTkwMDM1NDY4NjUiDDj9ztsZsdQl7rPIACqQBZ%2FVvm4rA3XllaxusPEzjNGRUqqZJM1gqAVRbq6H5c3tCw7%2FyrZ6x3bu%2B%2BK%2BEb5kb%2BRKKC29zHCqn8BNutC6hSogpnmpUN31t7lz7VwHiJNmupKmZ14dSkuAvB7XX%2BjHb5X2U7tGJJonZFmKFZ2cAXHjOEUGV0ji%2F1zpupi4Uood9y58OoKtsiewidmtDSAaOo8LvVkYjWlLpGEkEoio2VhFVHhlxM0v%2F4VVrnMRf%2Bp3zvntLHw7cjJGS3i%2BXACH4kkqtUMUaQXTP7NfGMpZEml0aUOnbTk%2FPAwtR8XfJ2N77fLyoFzJNiBDz1NHlqkifHVcGIFKcTsmPPDa1FI%2Fh%2F1aMVhT%2B%2BxjuotZLaipCJqJQ7rNTzTrhaYgg1qrggZR62w93%2BL0v%2FmYuXcrDK4y%2Fs8%2FxhAsDsG5qp9JS%2FOiPeI2Pb598j7mG5srWEHn5c0QsGJyjSvpqDzx%2FcSACK%2BCLNj6I78UbSjs6SfVEaD%2B%2B7xV9i36BWbmxmqCy08OvD95%2Ffk15%2FATPLd5iqcF8%2BsuQqPgA%2Be2y%2BVX9AWqu7HtfUR811wrVccoD%2FVTZvx0TBsN2eJSCx2TF6O1VCTAI%2FQI%2Bw6D4re3%2FgsAjqOYIs67MAkPhPlCXSw0pgXe9AMZis79GZgRyLCnnz0NVQ%2FafuDQomOwpoxuqJAJsYgSzCop1qUVTvt68wTFNujBHYH8Rpw6tonDLeC
ppM5xLM53W0DcanW8UoQZ6LIkTpIzKvbV9XZy%2FvhH6wD2w%2FrIFMoxuE8BYvMafFctxwDPM5h8VAFZVlG49V4dH7JQvM4HaPkFEhDt4teWlsN5KAKBj544VA2nlMnn4rCjgyjn7cMCGHR%2BZx7giCmyQ6rhAq3AMuSHea6AMMyl37sGOrEBWnXjFKRG7wNGIe6CW5I23D%2BwvnxsVUvIv%2FVs3%2FnZ%2Br1cZT6glBVELdl5l3uezsZC9iadJFKQKHlA4O4gXpGDo5LlCAC8LdkeKsfnRDces%2FZZ%2BHcf7R9koLYSWlAC7YHEIGnEATzgqON%2FkAkbjZWQL0tFDcnrlIeNJZRvOu4FmaYQ2Tto0e7PdYOCFGcObr6QPpP2nlnsNLZ%2FxZwjtERbbrwU0SnUbb7lg1T52m3hUQ0b&amp;X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Date=20250103T125851Z&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Expires=299&amp;X-Amz-Credential=ASIAQ3PHCVTYYHKOBB5V%2F20250103%2Fus-east-1%2Fs3%2Faws4_request&amp;X-Amz-Signature=25041d1b37618c306207456b41a3905fc0658b831498fb0c390a8a88c9eebb07&amp;hash=328282127df9bb81cffbb7e6d3a5707280d4a4f28ce847b48bd3349c6f77f785&amp;host=68042c943591013ac2b2430a89b270f6af2c76d8dfd086a07176afe7c76c2c61&amp;pii=S1574954121001114&amp;tid=pdf-5118178c-5c3b-43bb-87b2-198f0b1c130b&amp;sid=8953876d7875a54f3e199841e3e9122c6748gxrqb&amp;type=client\">following study<\/a> compares both DL and few-shot learning (FSL) for classifying 20 coral reef fish species from underwater images with applications for detecting rare species with limited available data. It should come as no surprise that the best model tested was a DL model based on ResNet. With ~3500 examples for each species the model achieved an accuracy of 78%. However, collecting this volume of data for rare species is beyond practical. Subsequently, the number of samples was reduced to 315 per species, and the accuracy dropped to 42%. In contrast, the FSL model, achieved comparable results with as few as 5 labeled images per species, and better performance beyond 10 shots. Here, the Reptile algorithm was used, which is a meta-learning-based FSL approach. This was trained by repeatedly solving small classification problems (e.g., distinguishing a few classes) drawn from the MiniImageNet dataset (a useful benchmark dataset for FSL). 
During fine-tuning, the model was then trained using a few labelled examples (1 to 30 shots per species).<\/p>\n<h3>5.5 Weakly Supervised Learning<\/h3>\n<p>Weakly supervised learning describes a set of techniques for building models that use noisy, inaccurate, or restricted sources to label large quantities of data. We can split the topic into three: incomplete, inexact, and inaccurate supervision, distinguished by the confidence in the labels. Incomplete supervision occurs when only a subset of examples has ground-truth labels. Inexact supervision involves coarse-grained labels, like labelling an MRI image as \u201clung cancer\u201d without specifying detailed attributes. Inaccurate supervision arises when labels are biased or incorrect due to human\u00a0error.<\/p>\n<p><strong>Why it\u00a0helps:<\/strong><\/p>\n<ul>\n<li>Partial or inaccurate data is often simpler and cheaper to get hold\u00a0of.<\/li>\n<li>Enables models to learn from a larger pool of information without the need for extensive manual labelling.<\/li>\n<li>Focuses on extracting meaningful patterns or features from data that can amplify the value of any existing well-labelled examples.<\/li>\n<\/ul>\n<p><strong>Tips:<\/strong><\/p>\n<ul>\n<li>Use a small subset of high-quality labels (or an ensemble) to correct systematic labelling errors.<\/li>\n<li>For scenarios where coarse-grained labels are available (e.g., image-level labels but not detailed instance-level labels), multi-instance learning can be employed, focusing on bag-level classification, where instance-level inaccuracies are less impactful.<\/li>\n<li>Label filtering, correction, and inference techniques can mitigate label noise and minimise reliance on expensive manual\u00a0labels.<\/li>\n<\/ul>\n<p>The primary goal of this technique is to estimate more informative or higher-dimensional data with limited information. 
As an example, <a href=\"https:\/\/arxiv.org\/abs\/1902.09868\">this paper<\/a> presents a weakly supervised learning approach to estimating 3D human poses. The method relies on 2D pose annotations, avoiding the need for expensive 3D ground-truth data. Using an adversarial reprojection network (RepNet), the model predicts 3D poses and reprojects them into 2D views to compare with 2D annotations, minimising reprojection error. This approach leverages adversarial training to enforce plausibility of 3D poses and showcases the potential of weakly supervised methods for complex tasks like 3D pose estimation with limited labelled\u00a0data.<\/p>\n<h3>5.6 Active\u00a0Learning<\/h3>\n<p>Active learning seeks to optimise labelling efforts by identifying unlabelled samples that, once labelled, will provide the model with the most informative data. A common approach is uncertainty sampling, which selects samples where the model\u2019s predictions are least certain. This uncertainty is often quantified using measures such as entropy or the prediction margin. The process is highly iterative; each round influences the model\u2019s next set of predictions.<\/p>\n<p><strong>Why it\u00a0helps:<\/strong><\/p>\n<ul>\n<li>Optimises expert time; you label fewer samples\u00a0overall.<\/li>\n<li>Quickly identifies edge cases that improve model robustness.<\/li>\n<\/ul>\n<p><strong>Tips:<\/strong><\/p>\n<ul>\n<li>Diversity sampling is an alternative selection approach that focuses on diverse areas of the feature space. For instance, clustering can be used to select a few representative samples from each\u00a0cluster.<\/li>\n<li>Try to use multiple selection methods to avoid introducing bias.<\/li>\n<li>Introducing an expert human in the loop can be logistically difficult: you need to manage their availability around a labelling workflow that can be slow or expensive.<\/li>\n<\/ul>\n<p>This technique has been extensively used in chemical analysis and materials research. 
Here, large databases of real and simulated molecular structures and their properties have been collected over decades. These databases are particularly useful for drug discovery, where simulations like docking are used to predict how small molecules (e.g., potential drugs) interact with targets such as proteins or enzymes. However, the computational cost of performing these types of calculations over millions of molecules makes brute-force studies impractical. This is where active learning comes in. <a href=\"https:\/\/pubs.rsc.org\/en\/content\/articlehtml\/2021\/sc\/d0sc06805e#cit18\">One such study<\/a> showed that by training a predictive model on an initial subset of docking results and iteratively selecting the most uncertain molecules for further simulations, researchers were able to drastically reduce the number of molecules tested while still identifying the best candidates.<\/p>\n<h3>5.7 Data Augmentation<\/h3>\n<p>Artificially increase your dataset by applying transformations to existing examples, such as flipping or cropping for images, translation or synonym replacement for text, and time shifts or random cropping for time-series. 
Alternatively, upsample underrepresented data with ADASYN (Adaptive Synthetic Sampling) and SMOTE (Synthetic Minority Over-sampling Technique).<\/p>\n<p><strong>Why it\u00a0helps:<\/strong><\/p>\n<ul>\n<li>The model focuses on more general and meaningful features rather than specific details tied to the training\u00a0set.<\/li>\n<li>Instead of collecting and labelling more data, augmentation provides a cost-effective alternative.<\/li>\n<li>Improves generalisation by increasing the diversity of training data, helping learn robust and invariant features rather than overfitting to specific patterns.<\/li>\n<\/ul>\n<p><strong>Tips:<\/strong><\/p>\n<ul>\n<li>Keep transformations domain-relevant (e.g., flipping images vertically might make sense for flower images, less so for medical\u00a0X-rays).<\/li>\n<li>Pay attention that any augmentations don\u2019t distort the original data distribution, preserving the underlying patterns.<\/li>\n<li>Explore GANs, VAEs, or diffusion models to produce synthetic data\u200a\u2014\u200abut this often requires careful tuning, domain-aware constraints, and enough initial\u00a0data.<\/li>\n<li>Synthetic oversampling (like SMOTE) can introduce noise or spurious correlations if the classes or feature space are complex and not well understood.<\/li>\n<\/ul>\n<p>Data augmentation is an incredibly broad topic, with numerous surveys exploring the current state-of-the-art across various fields, including computer vision (<a href=\"https:\/\/www.sciencedirect.com\/science\/article\/pii\/S2590005622000911\">review paper<\/a>), natural language processing (<a href=\"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00542\/115238\/An-Empirical-Survey-of-Data-Augmentation-for\">review paper<\/a>), and time-series data (<a href=\"https:\/\/journals.plos.org\/plosone\/article?id=10.1371\/journal.pone.0254841\">review paper<\/a>). 
It has become an integral component of most machine learning pipelines due to its ability to enhance model generalisation. This is particularly critical for small datasets, where augmenting input data by introducing variations, such as transformations or noise, and removing redundant or irrelevant features can significantly improve a model\u2019s robustness and performance.<\/p>\n<h3>5.8 Multitask Learning<\/h3>\n<p>Here we train one model to solve several tasks simultaneously. This can improve how well models perform by encouraging them to find patterns or solutions that work well for multiple goals at the same time. Lower layers capture general features that benefit all tasks, even if you have limited data for\u00a0some.<\/p>\n<p><strong>Why it\u00a0helps:<\/strong><\/p>\n<ul>\n<li>Shared representations are learned across tasks, effectively increasing sample\u00a0size.<\/li>\n<li>The model is less likely to overfit, as it must account for patterns relevant to all tasks, not just\u00a0one.<\/li>\n<li>Knowledge learned from one task can provide insights that improve performance on\u00a0another.<\/li>\n<\/ul>\n<p><strong>Tips:<\/strong><\/p>\n<ul>\n<li>Tasks need some overlap or synergy to meaningfully share representations; otherwise this method can hurt performance.<\/li>\n<li>Adjust per-task loss weights carefully to avoid letting one task dominate training.<\/li>\n<\/ul>\n<p>The scarcity of data for many practical applications of ML makes sharing both data and models across tasks an attractive proposition. This is enabled by multitask learning, where tasks benefit from shared knowledge and correlations in overlapping domains. However, it requires a large, diverse dataset that integrates multiple related properties. <a href=\"https:\/\/www.cell.com\/patterns\/fulltext\/S2666-3899(21)00058-1\">Polymer design is one example<\/a> where this has been successful. 
Here, a hybrid dataset of 36 properties across 13,000 polymers, covering a mix of mechanical, thermal, and chemical characteristics, was used to train a deep-learning-based multitask learning (MTL) architecture. The multitask model outperformed single-task models for every polymer property, particularly for underrepresented properties.<\/p>\n<h3>5.9 Ensemble\u00a0Learning<\/h3>\n<p>Ensembles aggregate predictions from several base models to improve robustness. Generally, ML algorithms can be limited in a variety of ways: high variance, high bias, and low accuracy. This manifests as different uncertainty distributions for different models across predictions. Ensemble methods limit the variance and bias errors associated with a single model; for example, bagging reduces variance without increasing the bias, while boosting reduces\u00a0bias.<\/p>\n<p><strong>Why it\u00a0helps:<\/strong><\/p>\n<ul>\n<li>Diversifies \u201copinions\u201d across different model architectures.<\/li>\n<li>Reduces variance, mitigating overfitting risk.<\/li>\n<\/ul>\n<p><strong>Tips:<\/strong><\/p>\n<ul>\n<li>Avoid complex base models, which can easily overfit small datasets. Instead, use regularised models such as shallow trees or linear models with added constraints to control complexity.<\/li>\n<li>Bootstrap aggregating (bagging) methods like Random Forest can be particularly useful for small datasets. By training multiple models on bootstrapped subsets of the data, you can reduce overfitting while increasing robustness. This is effective for algorithms prone to high variance, such as decision\u00a0trees.<\/li>\n<li>Combine different base model types (e.g., SVM, tree-based models, and logistic regression), using a simple meta-model like logistic regression to merge their predictions (stacking).<\/li>\n<\/ul>\n<p>As an example, the <a href=\"https:\/\/www.nature.com\/articles\/s41598-021-93783-8\">following paper<\/a> highlights ensemble learning as a method to improve the classification of cervical cytology images. 
In this case, three pre-trained neural networks\u200a\u2014\u200aInception v3, Xception, and DenseNet-169\u200a\u2014\u200awere used. The diversity of these base models ensured the ensemble benefited from each model\u2019s unique strengths and feature extraction capabilities. This, combined with a fusion of model confidences via a method that rewards confident, accurate predictions while penalising uncertain ones, maximised the utility of the limited data. Combined with transfer learning, the final predictions were robust to the errors of any particular model, despite the small dataset\u00a0used.<\/p>\n<h3>5.10 Process-Aware (Hybrid)\u00a0Models<\/h3>\n<p>Integrate domain-specific knowledge or physics-based constraints into ML models. This embeds prior knowledge, reducing the model\u2019s reliance on large data to infer patterns. For example, partial differential equations can be used alongside neural networks for fluid dynamics.<\/p>\n<p><strong>Why it\u00a0helps:<\/strong><\/p>\n<ul>\n<li>Reduces the data needed to learn patterns that are already well understood.<\/li>\n<li>Acts as a form of regularisation, guiding the model to plausible solutions even when the data is sparse or\u00a0noisy.<\/li>\n<li>Improves interpretability and trust in domain-critical contexts.<\/li>\n<\/ul>\n<p><strong>Tips:<\/strong><\/p>\n<ul>\n<li>Continually verify that model outputs make physical\/biological sense, not just numerical sense.<\/li>\n<li>Keep domain constraints separate from the learned components, and feed them in as inputs or as constraint terms in your model\u2019s loss function.<\/li>\n<li>Be careful to balance domain-based constraints with your model\u2019s ability to learn new phenomena.<\/li>\n<li>In practice, bridging domain-specific knowledge with data-driven methods often involves serious collaboration, specialised code, or hardware.<\/li>\n<\/ul>\n<p>Constraining a model in this way requires a deep understanding of your problem domain, and is often applied to problems where the environment the model operates in is 
well understood, such as physical systems. <a href=\"https:\/\/www.sciencedirect.com\/science\/article\/abs\/pii\/S030626192201546X?via%3Dihub\">An example of this is lithium-ion battery modelling<\/a>, where domain knowledge of battery dynamics is integrated into the ML process. This allows the model to capture complex behaviours and uncertainties missed by traditional physical models, ensuring physically consistent predictions and improved performance under real-world conditions like battery\u00a0aging.<\/p>\n<h3>6. Conclusion<\/h3>\n<p>For me, projects constrained by limited data are some of the most interesting projects to work on\u200a\u2014\u200adespite the higher risk of failure, they offer an opportunity to explore the state-of-the-art and experiment. These are tough problems! However, systematically applying the strategies covered in this post can greatly improve your odds of delivering a robust, effective model. Embrace the iterative nature of these problems: refine labels, employ augmentations, and analyse errors in quick cycles. 
Short pilot experiments help validate each technique\u2019s impact before you invest\u00a0further.<\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/effective-ml-with-limited-data-where-to-start-194492e7a6f8\">Where to Start When Data is Limited<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Jake Minns<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Feffective-ml-with-limited-data-where-to-start-194492e7a6f8\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Where to Start When Data is Limited A launch pad for projects with small\u00a0datasets Photo by Google DeepMind: https:\/\/www.pexels.com\/photo\/an-artist-s-illustration-of-artificial-intelligence-ai-this-image-depicts-how-ai-can-help-humans-to-understand-the-complexity-of-biology-it-was-created-by-artist-khyati-trehan-as-part-17484975\/ Machine Learning (ML) has driven remarkable breakthroughs in computer vision, natural language processing, and speech recognition, largely due to the abundance of data in these fields. 
However, many challenges\u200a\u2014\u200aespecially those tied to specific product features or [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,69,83,67,70,1357],"tags":[84,33,163],"class_list":["post-1277","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-artificial-intelligence","category-data-science","category-deep-dives","category-machine-learning","category-small-data","tag-data","tag-small","tag-your"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1277"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1277"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1277\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1277"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1277"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1277"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}