{"id":1298,"date":"2025-01-20T07:03:00","date_gmt":"2025-01-20T07:03:00","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/01\/20\/how-to-log-your-data-with-mlflow-23c8027e4021\/"},"modified":"2025-01-20T07:03:00","modified_gmt":"2025-01-20T07:03:00","slug":"how-to-log-your-data-with-mlflow-23c8027e4021","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/01\/20\/how-to-log-your-data-with-mlflow-23c8027e4021\/","title":{"rendered":"How to Log Your Data with MLflow"},"content":{"rendered":"<p>How to Log Your Data with MLflow<\/p>\n<div>\n<h4>MLflow, MLOps, Data\u00a0Science<\/h4>\n<h4>Mastering data logging in MLOps for your AI\u00a0workflow<\/h4>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1024\/0*Nv7A1N7dQOMBIpd6\"><figcaption>Photo by <a href=\"https:\/\/unsplash.com\/@chrisliverani?utm_source=medium&amp;utm_medium=referral\">Chris Liverani<\/a> on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\">Unsplash<\/a><\/figcaption><\/figure>\n<h3>Preface<\/h3>\n<p><strong>Data is one of the most critical components of the machine learning process.<\/strong> In fact, the quality of the data used in training a model often determines the success or failure of the entire project. While algorithms and models are important, they are powerless without data that is accurate, clean, and representative of the problem you\u2019re trying to solve. Whether you are dealing with structured data, unstructured data, or large-scale datasets, the preparation and understanding of data lay the foundation for any machine learning system. 
A well-curated dataset can provide the necessary signals for a model to learn effectively, while poor or biased data can lead to incorrect predictions, overfitting, or even harmful societal impacts when models are deployed in real-world applications.<\/p>\n<p>With the increasing complexity of machine learning workflows, ensuring the reproducibility and traceability of experiments has become a key concern. This is where <strong>MLOps<\/strong> (Machine Learning Operations) comes into play. MLOps is the practice of bringing together data science and operations to automate and streamline machine learning workflows. <strong>One critical aspect of MLOps is the tracking of datasets throughout the entire lifecycle of a machine learning project. <\/strong>Tracking is not just about capturing the parameters and metrics of models; it\u2019s equally important to log the datasets used during each phase of the process. This ensures that, down the line, when models are re-evaluated or retrained, the same datasets can be referenced, tested, or reused. It allows for better comparison of results, understanding of changes, and most importantly, ensuring that the results are reproducible by others. <strong>As a best practice for modern ML systems, I recommend that all ML practitioners worldwide log their data during training.<\/strong><\/p>\n<blockquote><p>Data Yesterday vs. Data\u00a0Today?<\/p><\/blockquote>\n<p>In this context, <strong>MLflow<\/strong> plays a pivotal role by offering a suite of tools that help streamline these MLOps practices. <strong>One of the key components of MLflow is its <\/strong><strong>mlflow.data module, which provides the ability to log datasets as part of machine learning experiments.<\/strong> The mlflow.data module ensures that datasets are properly documented, their metadata is tracked, and they can be retrieved and reused in future runs. 
This makes it easier to detect and diagnose common problems like &#8220;data drift,&#8221; where models start to perform worse because of subtle changes in the underlying data. MLflow\u2019s ability to track datasets alongside models and parameters ensures that any new experiment can reliably compare results against previous runs using the exact same data, providing transparency and\u00a0clarity.<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1024\/0*4yatTi_DeW53ivR1\"><figcaption>Photo by <a href=\"https:\/\/unsplash.com\/@sushioutlaw?utm_source=medium&amp;utm_medium=referral\">Brian McGowan<\/a> on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\">Unsplash<\/a><\/figcaption><\/figure>\n<h3>Overview<\/h3>\n<p><strong>This article will guide you through the best practices of logging a dataset in MLflow using the California Housing Dataset as an example.<\/strong> We\u2019ll explore mlflow.data interfaces and demonstrate how you can track and manage your dataset during an ML experiment.<\/p>\n<p>Imagine you are a data scientist working on a project to predict housing prices in California based on various features such as median income, population, and location. You\u2019ve spent hours curating a dataset from multiple sources, cleaning it, and ensuring it\u2019s ready for training. Now you\u2019re ready to run your machine learning experiments. Logging your dataset at this stage is crucial because it serves as a snapshot of the exact data used in this specific training run. Should you need to revisit the experiment months later (perhaps to improve the model, tune hyperparameters, or audit the results), you want to ensure that you are using the same dataset to maintain consistency and comparability. 
Without logging your dataset, it\u2019s easy to lose track of which version of the data was used, especially if the data is updated or changed over\u00a0time.<\/p>\n<p>Without further ado, let\u2019s get\u00a0started!<\/p>\n<h3>Set Up<\/h3>\n<p>Setting up an MLflow server locally is straightforward. Use the following command:<\/p>\n<pre>mlflow server --host 127.0.0.1 --port 8080<\/pre>\n<p>Then set the tracking\u00a0URI in your Python session:<\/p>\n<pre>import mlflow<br>mlflow.set_tracking_uri(\"http:\/\/127.0.0.1:8080\")<\/pre>\n<p>For more advanced configurations, refer to <a href=\"https:\/\/mlflow.org\/docs\/latest\/tracking\/server.html\">the MLflow documentation<\/a>.<\/p>\n<h3>The California housing\u00a0dataset<\/h3>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1024\/0*edSt566bty4z0rio\"><figcaption>Photo by <a href=\"https:\/\/unsplash.com\/@robertbye?utm_source=medium&amp;utm_medium=referral\">Robert Bye<\/a> on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\">Unsplash<\/a><\/figcaption><\/figure>\n<p>For this article, we are using the California housing dataset (CC BY license). However, you can apply the same principles to log and track any dataset of your\u00a0choice.<\/p>\n<p>For more information on the California housing dataset, refer to <a href=\"https:\/\/inria.github.io\/scikit-learn-mooc\/python_scripts\/datasets_california_housing.html\">this\u00a0doc<\/a>.<\/p>\n<h3>Dataset and DatasetSource<\/h3>\n<h4>mlflow.data.dataset.Dataset<\/h4>\n<p>Before diving into dataset logging, evaluation, and retrieval, it\u2019s important to understand the concept of datasets in MLflow. 
MLflow provides the mlflow.data.dataset.Dataset object, which represents datasets used with MLflow Tracking.<\/p>\n<pre>class mlflow.data.dataset.Dataset(source: mlflow.data.dataset_source.DatasetSource, name: Optional[str] = None, digest: Optional[str] = None)<\/pre>\n<p>This object comes with key properties:<\/p>\n<ul>\n<li>A required parameter, <strong>source<\/strong> (the data source of your dataset as an mlflow.data.dataset_source.DatasetSource object)<\/li>\n<li>\n<strong>digest<\/strong> (fingerprint for your dataset) and <strong>name<\/strong> (name for your dataset), which can be set via parameters.<\/li>\n<li>\n<strong>schema<\/strong> and <strong>profile<\/strong> to describe the dataset\u2019s structure and statistical properties.<\/li>\n<li>Information about the dataset\u2019s <strong>source<\/strong>, such as its storage location.<\/li>\n<\/ul>\n<p>You can easily convert the dataset into a dictionary using to_dict() or a JSON string using to_json().<\/p>\n<h4>Support for Popular Dataset\u00a0Formats<\/h4>\n<p>MLflow makes it easy to work with various types of datasets through specialized classes that extend the core mlflow.data.dataset.Dataset. At the time of writing this article, here are some of the notable dataset classes supported by\u00a0MLflow:<\/p>\n<ul>\n<li>\n<strong>pandas<\/strong>: mlflow.data.pandas_dataset.PandasDataset<\/li>\n<li>\n<strong>NumPy<\/strong>: mlflow.data.numpy_dataset.NumpyDataset<\/li>\n<li>\n<strong>Spark<\/strong>: mlflow.data.spark_dataset.SparkDataset<\/li>\n<li>\n<strong>Hugging Face<\/strong>: mlflow.data.huggingface_dataset.HuggingFaceDataset<\/li>\n<li>\n<strong>TensorFlow<\/strong>: mlflow.data.tensorflow_dataset.TensorFlowDataset<\/li>\n<li>\n<strong>Evaluation Datasets<\/strong>: mlflow.data.evaluation_dataset.EvaluationDataset<\/li>\n<\/ul>\n<p>Most of these classes come with a convenient mlflow.data.from_* API for loading datasets directly into MLflow. 
This makes it easy to construct and manage datasets, regardless of their underlying format.<\/p>\n<h4>mlflow.data.dataset_source.DatasetSource<\/h4>\n<p>The mlflow.data.dataset_source.DatasetSource class is used to represent the origin of the dataset in MLflow. When creating an mlflow.data.dataset.Dataset object, the source parameter can be specified either as a string (e.g., a file path or URL) or as an instance of the mlflow.data.dataset_source.DatasetSource class.<\/p>\n<pre>class mlflow.data.dataset_source.DatasetSource<\/pre>\n<p>If a string is provided as the source, MLflow internally calls the resolve_dataset_source function. This function iterates through a predefined list of data sources and DatasetSource classes to determine the most appropriate source type. However, MLflow&#8217;s ability to accurately resolve the dataset&#8217;s source is limited, especially when the candidate_sources argument (a list of potential sources) is set to None, which is the\u00a0default.<\/p>\n<p>In cases where the DatasetSource class cannot resolve the raw source, an MLflow exception is raised. 
<strong>As a best practice, I recommend explicitly creating and using an instance of the <\/strong><strong>mlflow.data.dataset_source.DatasetSource class when defining the dataset&#8217;s origin.<\/strong><\/p>\n<ul>\n<li>class <strong>HTTPDatasetSource<\/strong>(DatasetSource)<\/li>\n<li>class <strong>DeltaDatasetSource<\/strong>(DatasetSource)<\/li>\n<li>class <strong>FileSystemDatasetSource<\/strong>(DatasetSource)<\/li>\n<li>class <strong>HuggingFaceDatasetSource<\/strong>(DatasetSource)<\/li>\n<li>class <strong>SparkDatasetSource<\/strong>(DatasetSource)<\/li>\n<\/ul>\n<p><a href=\"https:\/\/medium.com\/@yunglinchang\/subscribe\">Get an email whenever Jack Chang publishes.<\/a><\/p>\n<h3>Logging datasets with mlflow.log_input() API<\/h3>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1024\/0*9BDqWar5ruJ85Rv8\"><figcaption>Photo by <a href=\"https:\/\/unsplash.com\/@purzlbaum?utm_source=medium&amp;utm_medium=referral\">Claudio Schwarz<\/a> on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\">Unsplash<\/a><\/figcaption><\/figure>\n<p>One of the most straightforward ways to log datasets in MLflow is through the mlflow.log_input() API. This allows you to log datasets in any format that is compatible with mlflow.data.dataset.Dataset, which can be extremely helpful when managing large-scale experiments.<\/p>\n<h4>Step-by-Step Guide<\/h4>\n<p>First, let\u2019s fetch the California Housing dataset and convert it into a pandas.DataFrame for easier manipulation. 
Here, we create a dataframe that combines both the feature data (california_data) and the target data (california_target).<\/p>\n<pre>import pandas as pd<br>from sklearn.datasets import fetch_california_housing<br><br>california_housing = fetch_california_housing()<br>california_data: pd.DataFrame = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)<br>california_target: pd.DataFrame = pd.DataFrame(california_housing.target, columns=['Target'])<br><br>california_housing_df: pd.DataFrame = pd.concat([california_data, california_target], axis=1)<\/pre>\n<p>To log the dataset with meaningful metadata, we define a few parameters like the data source URL, dataset name, and target column. These will provide helpful context when retrieving the dataset\u00a0later.<\/p>\n<p>If we look into the fetch_california_housing <a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/blob\/311bf6bad\/sklearn\/datasets\/_california_housing.py#L46\">source code<\/a>, we can see that the data originates from <a href=\"https:\/\/www.dcc.fc.up.pt\/~ltorgo\/Regression\/cal_housing.tgz\">https:\/\/www.dcc.fc.up.pt\/~ltorgo\/Regression\/cal_housing.tgz<\/a>.<\/p>\n<pre>from mlflow.data.dataset_source import DatasetSource<br>from mlflow.data.http_dataset_source import HTTPDatasetSource<br><br>dataset_source_url: str = 'https:\/\/www.dcc.fc.up.pt\/~ltorgo\/Regression\/cal_housing.tgz'<br>dataset_source: DatasetSource = HTTPDatasetSource(url=dataset_source_url)<br>dataset_name: str = 'California Housing Dataset'<br>dataset_target: str = 'Target'<br>dataset_tags = {<br>    'description': california_housing.DESCR,<br>}<\/pre>\n<p>Once the data and metadata are defined, we can convert the pandas.DataFrame into an mlflow.data.dataset.Dataset object.<\/p>\n<pre>import mlflow<br>from mlflow.data.pandas_dataset import PandasDataset<br><br>dataset: PandasDataset = mlflow.data.from_pandas(<br>    df=california_housing_df, source=dataset_source, targets=dataset_target, name=dataset_name<br>)<br><br>print(f'Dataset name: {dataset.name}')<br>print(f'Dataset digest: {dataset.digest}')<br>print(f'Dataset source: {dataset.source}')<br>print(f'Dataset schema: {dataset.schema}')<br>print(f'Dataset profile: {dataset.profile}')<br>print(f'Dataset targets: 
{dataset.targets}')<br>print(f'Dataset predictions: {dataset.predictions}')<br>print(dataset.df.head())<\/pre>\n<p>Example Output:<\/p>\n<pre>Dataset name: California Housing Dataset<br>Dataset digest: 55270605<br>Dataset source: &lt;mlflow.data.http_dataset_source.HTTPDatasetSource object at 0x101153a90&gt;<br>Dataset schema: ['MedInc': double (required), 'HouseAge': double (required), 'AveRooms': double (required), 'AveBedrms': double (required), 'Population': double (required), 'AveOccup': double (required), 'Latitude': double (required), 'Longitude': double (required), 'Target': double (required)]<br>Dataset profile: {'num_rows': 20640, 'num_elements': 185760}<br>Dataset targets: Target<br>Dataset predictions: None<br>   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  Target<br>0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23   4.526<br>1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22   3.585<br>2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24   3.521<br>3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25   3.413<br>4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25   3.422<\/pre>\n<p>Note that you can also convert the dataset to a dictionary to access additional properties like source_type:<\/p>\n<pre>for k,v in dataset.to_dict().items():<br>    print(f\"{k}: {v}\")<\/pre>\n<pre>name: California Housing Dataset<br>digest: 55270605<br>source: {\"url\": \"https:\/\/www.dcc.fc.up.pt\/~ltorgo\/Regression\/cal_housing.tgz\"}<br>source_type: http<br>schema: {\"mlflow_colspec\": [{\"type\": \"double\", \"name\": \"MedInc\", \"required\": true}, {\"type\": \"double\", \"name\": \"HouseAge\", \"required\": true}, {\"type\": \"double\", \"name\": \"AveRooms\", \"required\": true}, {\"type\": \"double\", \"name\": \"AveBedrms\", \"required\": true}, {\"type\": 
\"double\", \"name\": \"Population\", \"required\": true}, {\"type\": \"double\", \"name\": \"AveOccup\", \"required\": true}, {\"type\": \"double\", \"name\": \"Latitude\", \"required\": true}, {\"type\": \"double\", \"name\": \"Longitude\", \"required\": true}, {\"type\": \"double\", \"name\": \"Target\", \"required\": true}]}<br>profile: {\"num_rows\": 20640, \"num_elements\": 185760}<\/pre>\n<p>Now that we have our dataset ready, it\u2019s time to log it in an MLflow run. This allows us to capture the dataset\u2019s metadata, making it part of the experiment for future reference.<\/p>\n<pre>with mlflow.start_run():<br>    mlflow.log_input(dataset=dataset, context='training', tags=dataset_tags)<\/pre>\n<pre>\ud83c\udfc3 View run sassy-jay-279 at: http:\/\/127.0.0.1:8080\/#\/experiments\/0\/runs\/5ef16e2e81bf40068c68ce536121538c<br>\ud83e\uddea View experiment at: http:\/\/127.0.0.1:8080\/#\/experiments\/0<\/pre>\n<p>Let\u2019s explore the dataset in the MLflow UI (). You\u2019ll find your dataset listed under the default experiment. In the <strong>Datasets Used<\/strong> section, you can view the context of the dataset, which in this case is marked as being used for training. Additionally, all the relevant fields and properties of the dataset will be displayed.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AOQpmB1aKkpJWmK6bbwQSUw.png?ssl=1\"><figcaption>Training dataset in the MLflow UI; Source:\u00a0Me<\/figcaption><\/figure>\n<blockquote><p>Congrats! 
You have logged your first\u00a0dataset!<\/p><\/blockquote>\n<h3>Logging datasets when evaluating with mlflow.evaluate() API<\/h3>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1024\/0*ePJic03XXazKt726\"><figcaption>Photo by <a href=\"https:\/\/unsplash.com\/@dinoreichmuth?utm_source=medium&amp;utm_medium=referral\">Dino Reichmuth<\/a> on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\">Unsplash<\/a><\/figcaption><\/figure>\n<p>Let\u2019s continue our journey by exploring how to log datasets while evaluating a model with the mlflow.evaluate() API. This functionality, which integrates datasets with MLflow\u2019s evaluation framework, was introduced in <strong>MLflow 2.8.0<\/strong>. Users of earlier versions of MLflow will not have access to this\u00a0feature.<\/p>\n<h4>Step-by-Step Guide<\/h4>\n<p>First, let\u2019s perform a train-test split on the California housing\u00a0data:<\/p>\n<pre>from sklearn.model_selection import train_test_split<br>X_train, X_test, y_train, y_test = train_test_split(california_data, california_target, test_size=0.25, random_state=42)<\/pre>\n<p>For this part, we use similar metadata to create the training dataset. Note that the training and evaluation datasets have different names.<\/p>\n<pre>training_dataset_name: str = 'California Housing Training Dataset'<br>training_dataset_target: str = 'Target'<br>eval_dataset_name: str = 'California Housing Evaluation Dataset'<br>eval_dataset_target: str = 'Target'<br>eval_dataset_prediction: str = 'Prediction'<\/pre>\n<p>For modeling, let\u2019s fit a Random Forest Regression model.<\/p>\n<pre>from sklearn.ensemble import RandomForestRegressor<br>model = RandomForestRegressor(random_state=42)<br>model.fit(X_train, y_train.to_numpy().flatten())<\/pre>\n<p>Once the model is trained, we need to prepare an evaluation dataset. The mlflow.data.from_pandas() function will be used to create this dataset, which will be passed to the mlflow.evaluate() function for model evaluation. 
Note that the predictions parameter is specified here to indicate the column containing the model&#8217;s predicted output.<\/p>\n<pre>y_test_pred = model.predict(X=X_test)<br>eval_df: pd.DataFrame = X_test.copy()<br>eval_df[eval_dataset_target] = y_test.to_numpy().flatten()<br>eval_df[eval_dataset_prediction] = y_test_pred<br><br>eval_dataset: PandasDataset = mlflow.data.from_pandas(<br>    df=eval_df, targets=eval_dataset_target, name=eval_dataset_name, predictions=eval_dataset_prediction<br>)<br><br># Build the training dataset the same way so it can be logged in the run below<br>training_df: pd.DataFrame = X_train.copy()<br>training_df[training_dataset_target] = y_train.to_numpy().flatten()<br>training_dataset: PandasDataset = mlflow.data.from_pandas(<br>    df=training_df, source=dataset_source, targets=training_dataset_target, name=training_dataset_name<br>)<\/pre>\n<p>With the training and evaluation datasets prepared, it\u2019s time to log the model and evaluate its performance using\u00a0MLflow.<\/p>\n<pre>mlflow.sklearn.autolog()<br>with mlflow.start_run():<br>    mlflow.log_input(dataset=training_dataset, context='training')<br><br>    mlflow.sklearn.log_model(model, artifact_path='rf', input_example=X_test)<br><br>    result = mlflow.evaluate(<br>        data=eval_dataset,<br>        predictions=None,<br>        model_type='regressor',<br>    )<br><br>    print(f'metrics: {result.metrics}')<\/pre>\n<p>The <strong>Default Evaluator<\/strong> is used here, and it logs several important metrics automatically:<\/p>\n<blockquote><p>example_count, mean_absolute_error, mean_squared_error, root_mean_squared_error, sum_on_target, mean_on_target, r2_score, max_error, mean_absolute_percentage_error<\/p><\/blockquote>\n<p>These metrics can be found in the Metrics section of the experiment run in the MLflow UI. <strong>I recommend experimenting with different model types using <\/strong><strong>mlflow.evaluate to explore the full capabilities of the Default Evaluator.<\/strong> It provides a range of valuable metrics as well as useful visualizations.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AORosQSoktnVuIM8cpz8qrA.png?ssl=1\"><figcaption>Regression metrics with mlflow.evaluate; Source:\u00a0Me<\/figcaption><\/figure>\n<p>Note that 
if you\u2019re working with an MLflow PandasDataset, you must specify the column containing the model\u2019s predicted output using the predictions parameter in the mlflow.data.from_pandas() function. When calling mlflow.evaluate(), set predictions=None because the predictions column is already included in the dataset. This ensures proper integration and evaluation.<\/p>\n<p>Example Output:<\/p>\n<pre>2025\/01\/16 15:11:36 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...<br>2025\/01\/16 15:11:37 INFO mlflow.models.evaluation.evaluators.shap: Shap explainer ExactExplainer is used.<br>metrics: {'example_count': 5160, 'mean_absolute_error': np.float64(0.32909333102713195), 'mean_squared_error': np.float64(0.2545599452819612), 'root_mean_squared_error': np.float64(0.5045393396772557), 'sum_on_target': np.float64(10646.03931999994), 'mean_on_target': np.float64(2.0631859147286704), 'r2_score': 0.8076205696273513, 'max_error': np.float64(3.626845299999994), 'mean_absolute_percentage_error': np.float64(0.1909308987066793)}<br>\ud83c\udfc3 View run bouncy-fox-193 at: http:\/\/127.0.0.1:8080\/#\/experiments\/0\/runs\/65b25856e28142fd85c54b38db4f2b3d<br>\ud83e\uddea View experiment at: http:\/\/127.0.0.1:8080\/#\/experiments\/0<\/pre>\n<p>Let\u2019s head over to the MLflow UI to view the results. You will see that our evaluation dataset has been successfully logged within the same run as the training dataset. As a result, the run now contains both the training and evaluation datasets.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2As6ydN2FHNmkWi3cXDOGlnA.png?ssl=1\"><figcaption>Evaluation dataset in the MLflow UI; Source:\u00a0Me<\/figcaption><\/figure>\n<h3>Epilogue<\/h3>\n<p>Logging datasets is critical not just for reproducibility but also for accountability. 
For instance, if you are working in a regulated industry such as healthcare or finance, it is essential to demonstrate that the data used to train a model meets certain standards and has not been altered without proper tracking. In collaborative projects, being able to share your dataset logs also makes it easier to work together and share results. By logging datasets with tools like MLflow, you ensure that your experiments are transparent, reproducible, and robust, helping build trust in your machine learning outcomes.<\/p>\n<p>In summary, datasets are the heart of machine learning, and logging them is fundamental to tracking, reproducibility, and transparency within any MLOps workflow. MLflow\u2019s mlflow.data module provides the tools necessary to ensure that every step of the data journey is captured, logged, and retrievable for future use, ensuring consistency and improving the overall reliability of machine learning experiments.<\/p>\n<blockquote><p>Here is <a href=\"https:\/\/github.com\/yunglinchang\/mlflow-examples\">the GitHub repo<\/a> for all the code in the\u00a0article!<\/p><\/blockquote>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1024\/0*4hlmAC5vKBd9LmLX\"><figcaption>Photo by <a href=\"https:\/\/unsplash.com\/@jordancormack?utm_source=medium&amp;utm_medium=referral\">Jordan Cormack<\/a> on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\">Unsplash<\/a><\/figcaption><\/figure>\n<h3>Recap and Takeaways<\/h3>\n<ul>\n<li>\n<strong>Logging datasets with <\/strong><strong>mlflow.log_input() API<\/strong>: This is used for logging your training data, ensuring that all relevant metadata is captured for traceability and reproducibility within your experiments.<\/li>\n<li>\n<strong>Logging datasets when evaluating with <\/strong><strong>mlflow.evaluate() API<\/strong>: This is used for evaluating your models and automatically logs key performance metrics, helping you 
track the effectiveness of your model during evaluation.<\/li>\n<\/ul>\n<p><strong>Feel free to share your thoughts in the comments.<\/strong> I love to learn about data and reflect on (write about) what I\u2019ve learned in practical applications. If you enjoyed this article, please give it a clap to show your support. You can contact me via <a href=\"https:\/\/www.linkedin.com\/in\/yung-linchang\/\">LinkedIn<\/a> if you have more to discuss. Also, feel free to follow me on Medium for more data science articles to\u00a0come!<\/p>\n<blockquote><p><em>Come play along in the data science playground!<\/em><\/p><\/blockquote>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/how-to-log-your-data-with-mlflow-23c8027e4021\">How to Log Your Data with MLflow<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p>Jack Chang<br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fhow-to-log-your-data-with-mlflow-23c8027e4021\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>How to Log Your Data with MLflow MLflow, MLOps, Data\u00a0Science Mastering data logging in MLOps for your AI\u00a0workflow Photo by Chris Liverani on\u00a0Unsplash Preface Data is one of the most critical components of the machine learning process. 
In fact, the quality of the data used in training a model often determines the success or failure [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[151,62,83,70,972,222],"tags":[84,341,1369],"class_list":["post-1298","post","type-post","status-publish","format-standard","hentry","category-ai","category-aimldsaimlds","category-data-science","category-machine-learning","category-mlflow","category-mlops","tag-data","tag-machine","tag-mlops"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1298"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1298"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1298\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1298"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1298"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1298"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}