{"id":9102,"date":"2025-12-15T07:02:27","date_gmt":"2025-12-15T07:02:27","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/12\/15\/has_anyone_tried_training_models_on_raw\/"},"modified":"2025-12-15T07:02:27","modified_gmt":"2025-12-15T07:02:27","slug":"has_anyone_tried_training_models_on_raw","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/12\/15\/has_anyone_tried_training_models_on_raw\/","title":{"rendered":"Has anyone tried training models on raw discussions instead of curated datasets?"},"content":{"rendered":"<p>    Has anyone tried training models on raw discussions instead of curated datasets?<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<!-- SC_OFF --><\/p>\n<div class=\"md\">\n<p>I\u2019ve always followed the usual advice when training models: clean the data, normalize everything, remove noise, structure it nicely.<\/p>\n<p>Recently I tried something different. Instead of polished datasets, I fed models long, messy discussion threads: real conversations, with people arguing, correcting themselves, misunderstanding things, changing their minds mid-sentence, explaining badly before explaining well.<\/p>\n<p>No labels. No clean structure. Just raw text. What surprised me is that in some reasoning and writing tasks, the models trained on this kind of data felt more grounded and less brittle: not necessarily more accurate, but better at handling ambiguity and edge cases.<\/p>\n<p>It made me wonder if what we often call noise is actually part of the signal!<\/p>\n<p>Human reasoning is messy by nature. 
Doubt, uncertainty, shortcuts, corrections: clean datasets remove all of that, but that\u2019s not how people think or talk in the real world.<\/p>\n<p>I\u2019m not saying clean data is bad, just questioning whether we\u2019re over-optimizing for neatness at the cost of realism.<\/p>\n<p>Has anyone else experimented with this or seen similar effects in applied ML work?<\/p>\n<\/p><\/div>\n<p><!-- SC_ON -->   submitted by   <a href=\"https:\/\/www.reddit.com\/user\/Mediocre_Common_4126\"> \/u\/Mediocre_Common_4126 <\/a> <br \/> <span><a href=\"https:\/\/www.reddit.com\/r\/datascience\/comments\/1pmp2zn\/has_anyone_tried_training_models_on_raw\/\">[link]<\/a><\/span>   <span><a href=\"https:\/\/www.reddit.com\/r\/datascience\/comments\/1pmp2zn\/has_anyone_tried_training_models_on_raw\/\">[comments]<\/a><\/span>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    \/u\/Mediocre_Common_4126<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/www.reddit.com\/r\/datascience\/comments\/1pmp2zn\/has_anyone_tried_training_models_on_raw\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Has anyone tried training models on raw discussions instead of curated datasets? I\u2019ve always followed the usual advice when training models: clean the data, normalize everything, remove noise, structure it nicely. Recently I tried something different. 
Instead of polished datasets, I fed models long, messy discussion threads, real conversations, people arguing, correcting themselves, misunderstanding [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,99],"tags":[4427,1429,73],"class_list":["post-9102","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-datascience","tag-clean","tag-datasets","tag-models"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/9102"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=9102"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/9102\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=9102"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=9102"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=9102"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}