{"id":3312,"date":"2025-04-24T07:03:19","date_gmt":"2025-04-24T07:03:19","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/04\/24\/exporting-mlflow-experiments-from-restricted-hpc-systems\/"},"modified":"2025-04-24T07:03:19","modified_gmt":"2025-04-24T07:03:19","slug":"exporting-mlflow-experiments-from-restricted-hpc-systems","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/04\/24\/exporting-mlflow-experiments-from-restricted-hpc-systems\/","title":{"rendered":"Exporting MLflow Experiments from Restricted HPC Systems"},"content":{"rendered":"<div>\n<p class=\"wp-block-paragraph\">Many High-Performance Computing (HPC) environments, especially in research and educational institutions, restrict outbound TCP connections from compute nodes. Running a simple <em>ping<\/em> or <em>curl<\/em> against the MLflow tracking URL from the HPC bash shell may succeed, yet the same connection fails and times out when jobs run on the nodes.<\/p>\n<p class=\"wp-block-paragraph\">This makes it impossible to track and manage experiments on MLflow directly. I faced this issue and built a workaround that bypasses direct communication. 
We will focus on:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Set up a local MLflow server on the HPC, on a port of your choice, with local directory storage.<\/li>\n<li class=\"wp-block-list-item\">Use the local tracking URL while running <a href=\"https:\/\/towardsdatascience.com\/tag\/machine-learning\/\" title=\"Machine Learning\">Machine Learning<\/a> experiments.<\/li>\n<li class=\"wp-block-list-item\">Export the experiment data to a local temporary folder.<\/li>\n<li class=\"wp-block-list-item\">Transfer the experiment data from the local temp folder on the HPC to the remote <a href=\"https:\/\/towardsdatascience.com\/tag\/mlflow\/\" title=\"Mlflow\">MLflow<\/a> server.<\/li>\n<li class=\"wp-block-list-item\">Import the experiment data into the databases of the remote MLflow server.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">I have deployed Charmed MLflow (MLflow server, MySQL, MinIO) using Juju, and the whole stack is hosted on MicroK8s on localhost. You can find the installation guide from Canonical <a href=\"https:\/\/documentation.ubuntu.com\/charmed-mlflow\/en\/latest\/tutorial\/mlflow\/\">here<\/a>.<\/p>\n<h2 class=\"wp-block-heading\">Prerequisites<\/h2>\n<p class=\"wp-block-paragraph\">Make sure you have <em>Python<\/em> loaded on your HPC and installed on your MLflow server. For this entire article, I assume you have <em>Python 3.12<\/em>. 
Adjust the commands accordingly if you use a different version.<\/p>\n<h5 class=\"wp-block-heading\"><strong>On HPC:<\/strong><\/h5>\n<p class=\"wp-block-paragraph\">1) Create a virtual environment<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">python3 -m venv mlflow\nsource mlflow\/bin\/activate<\/code><\/pre>\n<p class=\"wp-block-paragraph\">2) Install MLflow<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">pip install mlflow<\/code><\/pre>\n<h5 class=\"wp-block-heading\"><strong>On both HPC and MLflow Server:<\/strong><\/h5>\n<p class=\"wp-block-paragraph\">1) Install mlflow-export-import<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">pip install git+https:\/\/github.com\/mlflow\/mlflow-export-import\/#egg=mlflow-export-import<\/code><\/pre>\n<h2 class=\"wp-block-heading\">On HPC:<\/h2>\n<p class=\"wp-block-paragraph\">1) Decide on a port where you want the local MLflow server to run. You can use the command below to check whether the port is free (the output should list no process IDs):<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">lsof -i :&lt;port-number&gt;<\/code><\/pre>\n<p class=\"wp-block-paragraph\">2) Set the environment variable for applications that want to use MLflow:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">export MLFLOW_TRACKING_URI=http:\/\/localhost:&lt;port-number&gt;<\/code><\/pre>\n<p class=\"wp-block-paragraph\">3) Start the MLflow server using the command below:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">mlflow server \\\n    --backend-store-uri file:\/path\/to\/local\/storage\/mlruns \\\n    --default-artifact-root file:\/path\/to\/local\/storage\/mlruns \\\n    --host 0.0.0.0 \\\n    --port 5000<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Here, we set the path to the local storage in a folder called mlruns. 
Metadata such as experiments, runs, parameters, metrics, and tags, along with artifacts such as model files, loss curves, and other images, will be stored inside the mlruns directory. We can set the host to 0.0.0.0 or 127.0.0.1 (more secure, since it only accepts connections from the same machine). Since the whole process is short-lived, I went with 0.0.0.0. Finally, assign a port number that is not used by any other application, and use the same port in MLFLOW_TRACKING_URI.<\/p>\n<p class=\"wp-block-paragraph\">(Optional) Sometimes, your HPC might not detect <em>libpython3.12<\/em>, the shared library that the Python interpreter needs to run. You can follow the steps below to find it and add its directory to your library path.<\/p>\n<p class=\"wp-block-paragraph\">Search for <em>libpython3.12<\/em>:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">find \/hpc\/packages -name \"libpython3.12*.so*\" 2&gt;\/dev\/null<\/code><\/pre>\n<p class=\"wp-block-paragraph\">This returns something like: \/path\/to\/python\/3.12\/lib\/libpython3.12.so.1.0<\/p>\n<p class=\"wp-block-paragraph\">Add the containing directory to the library path through an environment variable:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">export LD_LIBRARY_PATH=\/path\/to\/python\/3.12\/lib:$LD_LIBRARY_PATH<\/code><\/pre>\n<p class=\"wp-block-paragraph\">4) Export the experiment data from the mlruns local storage directory to a temp folder:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">python3 -m mlflow_export_import.experiment.export_experiment --experiment \"&lt;experiment-name&gt;\" --output-dir \/tmp\/exported_runs<\/code><\/pre>\n<p class=\"wp-block-paragraph\">(Optional) Running the <em>export_experiment<\/em> command on the HPC bash shell may cause thread utilisation errors like:<\/p>\n<p class=\"wp-block-paragraph\"><code>OpenBLAS blas_thread_init: pthread_create failed for thread X of 64: Resource temporarily unavailable<\/code><\/p>\n<p class=\"wp-block-paragraph\">This happens because MLflow internally uses <em>SciPy<\/em> for artifacts and metadata handling, which requests threads through 
<em>OpenBLAS<\/em> in numbers that exceed the limit set by your HPC. If you hit this issue, limit the number of threads by setting the following environment variables.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">export OPENBLAS_NUM_THREADS=4\nexport OMP_NUM_THREADS=4\nexport MKL_NUM_THREADS=4<\/code><\/pre>\n<p class=\"wp-block-paragraph\">If the issue persists, try reducing the thread limit to 2.<\/p>\n<p class=\"wp-block-paragraph\">5) Transfer experiment runs to MLflow Server:<\/p>\n<p class=\"wp-block-paragraph\">Copy the exported folder from the HPC to the temporary folder on the MLflow server.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">rsync -avz \/tmp\/exported_runs &lt;mlflow-server-username&gt;@&lt;host-address&gt;:\/tmp<\/code><\/pre>\n<p class=\"wp-block-paragraph\">6) Stop the local MLflow server and free up the port:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">lsof -i :&lt;port-number&gt;\nkill -9 &lt;pid&gt;<\/code><\/pre>\n<h2 class=\"wp-block-heading\">On MLflow Server:<\/h2>\n<p class=\"wp-block-paragraph\">Our goal is to transfer the experiment data from the tmp folder to <em>MySQL<\/em> and <em>MinIO<\/em>.<\/p>\n<p class=\"wp-block-paragraph\">1) Since MinIO is Amazon S3 compatible, MLflow talks to it through boto3 (the AWS SDK for Python). So, we will set up AWS-style credentials and use them to communicate with MinIO through boto3.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-shell-session\">juju config mlflow-minio access-key=&lt;access-key&gt; secret-key=&lt;secret-access-key&gt;<\/code><\/pre>\n<p class=\"wp-block-paragraph\">2) Below are the commands to transfer the data.<\/p>\n<p class=\"wp-block-paragraph\">First, set the MLflow server and MinIO addresses in the environment. 
To avoid repeating this, we can add these lines to our .bashrc file.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">export MLFLOW_TRACKING_URI=\"http:\/\/&lt;cluster-ip_or_nodeport_or_load-balancer&gt;:&lt;port&gt;\"\nexport MLFLOW_S3_ENDPOINT_URL=\"http:\/\/&lt;cluster-ip_or_nodeport_or_load-balancer&gt;:&lt;port&gt;\"<\/code><\/pre>\n<p class=\"wp-block-paragraph\">All the experiment files can be found under the exported_runs folder in the tmp directory. The <em>import_experiment<\/em> command finishes the job.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">python3 -m mlflow_export_import.experiment.import_experiment --experiment-name \"&lt;experiment-name&gt;\" --input-dir \/tmp\/exported_runs<\/code><\/pre>\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n<p class=\"wp-block-paragraph\">The workaround helped me track experiments even when communications and data transfers were restricted on my HPC cluster. Spinning up a local MLflow server instance, exporting experiments, and then importing them into my remote MLflow server gave me flexibility without having to change my workflow.<\/p>\n<p class=\"wp-block-paragraph\">However, if you are dealing with sensitive data, make sure your transfer method is secure. Cron jobs and automation scripts could remove the manual overhead. Also, be mindful of your local storage, as it is easy to fill up.<\/p>\n<p class=\"wp-block-paragraph\">In the end, if you are working in a similar environment, this article offers a solution that takes little time and requires no admin privileges. Hopefully, this helps teams who are stuck with the same issue. 
Thanks for reading this article!<\/p>\n<p class=\"wp-block-paragraph\">You can connect with me on <a href=\"https:\/\/www.linkedin.com\/in\/nagharjun-mathi-mariappan-b61499169\/\">LinkedIn<\/a>.<\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/exporting-mlflow-experiments-from-restricted-hpc-systems\/\">Exporting MLflow Experiments from Restricted HPC Systems<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p>Nagharjun Mathi Mariappan<br \/>\n<a href=\"https:\/\/towardsdatascience.com\/exporting-mlflow-experiments-from-restricted-hpc-systems\/\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Exporting MLflow Experiments from Restricted HPC Systems Many High-Performance Computing (HPC) environments, especially in research and educational institutions, restrict communications to outbound TCP connections. Running a simple command-line ping or curl with the MLflow tracking URL on the HPC bash shell to check packet transfer can be successful. 
However, communication fails and times out while [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,401,83,70,972,222,160],"tags":[2460,2461,975],"class_list":["post-3312","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-data-engineering","category-data-science","category-machine-learning","category-mlflow","category-mlops","category-programming","tag-hpc","tag-local","tag-mlflow"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3312"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=3312"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3312\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=3312"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=3312"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=3312"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}