{"id":969,"date":"2025-01-05T07:01:14","date_gmt":"2025-01-05T07:01:14","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/01\/05\/journey-to-full-stack-data-scientist-model-deployment-f385f244ec67\/"},"modified":"2025-01-05T07:01:14","modified_gmt":"2025-01-05T07:01:14","slug":"journey-to-full-stack-data-scientist-model-deployment-f385f244ec67","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/01\/05\/journey-to-full-stack-data-scientist-model-deployment-f385f244ec67\/","title":{"rendered":"Journey to Full-Stack Data Scientist: Model Deployment"},"content":{"rendered":"<div>\n<h4>An introduction to productionizing machine learning models using APIs and\u00a0Docker.<\/h4>\n<h3>Growing Responsibilities of Data Scientists<\/h3>\n<p>The title of data scientist is ever-changing and often vague. It usually describes someone who is fluent in mathematics, programming, and machine learning. They spend time cleaning data, building models, fine-tuning, and conducting experiments. They must also have great communication skills, a good grasp of their domain, and other soft\u00a0skills.<\/p>\n<p>However, this is not always exactly the case. If you spend enough time scrolling through job boards, you will see that \u201cData Scientist\u201d roles can differ quite a bit. Some read more like a data engineer, focusing on pipelines and big data platforms. Some are closer to a data analyst, focusing on data cleaning and dashboarding. 
And as of late, there are many that are similar to software or ML engineering, focusing on object-oriented programming, building applications, deploying models, and sometimes even web development.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A4UmVrOm7F2yBkxfP3BMjmA.png?ssl=1\"><figcaption>Image by\u00a0Author<\/figcaption><\/figure>\n<p>And there are those who expect all of this and more, thus, the \u201cfull-stack Data Scientist\u201d. With this in mind, data scientists should consider looking to go beyond developing models in a notebook and expand their skillset to other areas like ML Ops. As <a href=\"https:\/\/datamachines.xyz\/\">Pau Labarta Bajo<\/a> says: \u201c<strong>ML models inside Jupyter notebooks have a business value of $\ud835\udfec.\ud835\udfec\ud835\udfec<\/strong>\u201d.<\/p>\n<p><strong>This article will go over how data scientists can successfully deploy their machine learning models from notebooks to fully productionized APIs by using FastAPI and\u00a0Docker.<\/strong><\/p>\n<h4>Thoughts on the \u201cFull-Stack\u201d Data Scientist<\/h4>\n<p>First, my personal opinion on the \u201cfull-stack data scientist\u201d. With all of these emerging expectations, it is important for us to learn and be comfortable with other skills that we may not have learned in our education or early career. However, the expectation seems to be to master all of these skills, on top of keeping up with traditional data science. And while there are a few out there who are capable of this, it is not feasible for most of\u00a0us.<\/p>\n<p>I don&#8217;t believe that becoming a full-stack data scientist means mastering every one of these skills, technologies, etc. 
<strong>I think that a full-stack data scientist is about being able to wear all of the hats in the data science lifecycle through continuous learning and development.<\/strong><\/p>\n<p>While it may not be my expertise, I should be able to collaborate with data engineers to optimize pipelines. And while I am much more comfortable with developing models, I should be able to wear my \u201cML Engineer\u201d hat and help get a model into deployment. A great data scientist will always have their niches, but will also have a working knowledge of other areas and can quickly learn new skills if and when they need\u00a0to.<\/p>\n<h3>Model Development<\/h3>\n<p>First, for our example, we need to develop a model. <strong>Since this article focuses on model deployment, we will not worry about the performance of the model.<\/strong> <strong>Instead, we will build a simple model with limited features to focus on learning model deployment.<\/strong><\/p>\n<p>In this example, we will predict a data professional\u2019s salary based on a few features, such as experience, job title, company size,\u00a0etc.<\/p>\n<p><em>See data here: <\/em><a href=\"https:\/\/www.kaggle.com\/datasets\/ruchi798\/data-science-job-salaries\"><em>https:\/\/www.kaggle.com\/datasets\/ruchi798\/data-science-job-salaries<\/em><\/a><em> (CC0: Public Domain). 
I slightly modified the data to reduce the number of options for certain features.<\/em><\/p>\n<pre>#import packages for data manipulation<br>import pandas as pd<br>import numpy as np<br><br>#import packages for machine learning<br>from sklearn import linear_model<br>from sklearn.model_selection import train_test_split<br>from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder<br>from sklearn.metrics import mean_squared_error, r2_score<br><br>#import packages for data management<br>import joblib<\/pre>\n<p>First, let\u2019s take a look at the\u00a0data.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/796\/1%2Aww0uUrJKZq8NRBXpQ3-KoA.png?ssl=1\"><figcaption>Image by\u00a0Author<\/figcaption><\/figure>\n<p>Since all of our features are categorical, we will use encoding to transform our data into numerical values. Below, we use ordinal encoders to encode experience level and company size. These are ordinal because they represent some kind of progression (0 = entry level, 1 = mid-level, etc., since the encoder numbers categories from 0).<\/p>\n<p>For job title and employment type, we will create dummy variables for each option (note that we drop the first to avoid multicollinearity).<\/p>\n<pre>#use ordinal encoder to encode experience level<br>encoder = OrdinalEncoder(categories=[['EN', 'MI', 'SE', 'EX']])<br>salary_data['experience_level_encoded'] = encoder.fit_transform(salary_data[['experience_level']])<br><br>#use ordinal encoder to encode company size<br>encoder = OrdinalEncoder(categories=[['S', 'M', 'L']])<br>salary_data['company_size_encoded'] = encoder.fit_transform(salary_data[['company_size']])<br><br>#encode employment type and job title using dummy columns<br>salary_data = pd.get_dummies(salary_data, columns = ['employment_type', 'job_title'], drop_first = True, dtype = int)<br><br>#drop original columns<br>salary_data = salary_data.drop(columns = ['experience_level', 'company_size'])<\/pre>\n<p>Now that we have transformed 
our model inputs, we can create our training and test sets. We will input these features into a simple linear regression model to predict the employee\u2019s salary.<\/p>\n<pre>#define independent and dependent features<br>X = salary_data.drop(columns = 'salary_in_usd')<br>y = salary_data['salary_in_usd']<br><br>#split between training and testing sets<br>X_train, X_test, y_train, y_test = train_test_split(<br>  X, y, random_state = 104, test_size = 0.2, shuffle = True)<br><br>#fit linear regression model<br>regr = linear_model.LinearRegression()<br>regr.fit(X_train, y_train)<br><br>#make predictions<br>y_pred = regr.predict(X_test)<br><br>#print the coefficients<br>print(\"Coefficients: \\n\", regr.coef_)<br><br>#print the MSE<br>print(\"Mean squared error: %.2f\" % mean_squared_error(y_test, y_pred))<br><br>#print the R2 value (not adjusted)<br>print(\"R2: %.2f\" % r2_score(y_test, y_pred))<\/pre>\n<p>Let\u2019s see how our model\u00a0did.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/677\/1%2ALSPVauJjxXMIsoE9ZTN-2w.png?ssl=1\"><figcaption>Image by\u00a0Author<\/figcaption><\/figure>\n<p>Looks like our R-squared is 0.27, yikes. A lot more work would need to be done with this model. We would likely need more data and additional information on the observations. But for the sake of this article, we will move forward and save our\u00a0model.<\/p>\n<pre>#save model using joblib<br>joblib.dump(regr, 'lin_regress.sav')<\/pre>\n<h3>Creating an\u00a0API<\/h3>\n<p>There are several ways to deploy a model. One of those ways is with an API. <strong>An API (Application Programming Interface) enables two pieces of software to communicate with each other.<\/strong> There are several API architectures, such as SOAP, RPC, and REST. 
We will use a REST API, which is one of the most popular and flexible architectures for accessing a\u00a0service.<\/p>\n<p>For our framework, we will use FastAPI (<a href=\"https:\/\/fastapi.tiangolo.com\/\">https:\/\/fastapi.tiangolo.com\/<\/a>), which is great for beginners as it\u2019s fairly easy to use and has tons of documentation and examples.<\/p>\n<p>With REST APIs, there are five methods that are commonly used: POST, GET, PUT, PATCH, and DELETE. These correspond to create (POST), read (GET), update (PUT and PATCH), and delete (DELETE) operations. Our script below (main.py) will follow these\u00a0steps:<\/p>\n<ol>\n<li>Initialize the FastAPI framework and define the request\u00a0format.<\/li>\n<li>Load the saved\u00a0model.<\/li>\n<li>Create a GET endpoint that returns a welcome\u00a0message.<\/li>\n<li>Create a POST endpoint that accepts new data from the user and returns a prediction.<\/li>\n<li>Define the host IP and port (location to operate the\u00a0API).<\/li>\n<\/ol>\n<pre>import uvicorn<br>import pandas as pd<br>from fastapi import FastAPI<br>from pydantic import BaseModel<br>import joblib<br><br># Initialize FastAPI<br>app = FastAPI()<br><br># Define the request body format for predictions<br>class PredictionFeatures(BaseModel):<br>    experience_level_encoded: float<br>    company_size_encoded: float<br>    employment_type_PT: int<br>    job_title_Data_Engineer: int<br>    job_title_Data_Manager: int<br>    job_title_Data_Scientist: int<br>    job_title_Machine_Learning_Engineer: int<br><br># Global variable to store the loaded model<br>model = None<br><br># Load the saved model from disk<br>def download_model():<br>    global model<br>    model = joblib.load('lin_regress.sav')<br><br># Load the model immediately when the script runs<br>download_model()<br><br><br># API Root endpoint<br>@app.get(\"\/\")<br>async def index():<br>    return {\"message\": \"Welcome to the Data Science Income API. 
Use the \/predict feature to predict your income.\"}<br><br># Prediction endpoint<br>@app.post(\"\/predict\")<br>async def predict(features: PredictionFeatures):<br>    <br>    # Create input DataFrame for prediction<br>    input_data = pd.DataFrame([{<br>        \"experience_level_encoded\": features.experience_level_encoded,<br>        \"company_size_encoded\": features.company_size_encoded,<br>        \"employment_type_PT\": features.employment_type_PT,<br>        \"job_title_Data Engineer\": features.job_title_Data_Engineer,<br>        \"job_title_Data Manager\": features.job_title_Data_Manager,<br>        \"job_title_Data Scientist\": features.job_title_Data_Scientist,<br>        \"job_title_Machine Learning Engineer\": features.job_title_Machine_Learning_Engineer<br>    }])<br><br>    # Predict using the loaded model<br>    prediction = model.predict(input_data)[0]<br><br>    return {<br>        \"Salary (USD)\": prediction<br>    }<br><br>if __name__ == \"__main__\":<br>    uvicorn.run(app, host=\"0.0.0.0\", port=8000)<\/pre>\n<p>Now let\u2019s use the command line to test the API. First, change the directory to your project. Then, run the API using\u00a0uvicorn.<\/p>\n<pre>cd \"C:\\Users\\adavi\\OneDrive\\Desktop\\Salary Model\"<br>py -m uvicorn main:app --reload<\/pre>\n<p>The command line gives me a link to follow. I am then greeted with the message from the GET endpoint. Nice!<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2Au3TwjDDIDqneVlwZwbt9FQ.png?ssl=1\"><figcaption>Image by\u00a0Author<\/figcaption><\/figure>\n<p>Lastly, let\u2019s create a test script to submit new data and retrieve a prediction. 
Using the requests library, we define the URL and submit a new observation.<\/p>\n<pre>import requests<br><br>url = 'http:\/\/127.0.0.1:8000\/predict'<br><br>#dummy data to test API<br>#encoded values start at 0, so 3.0 = 'EX' for experience and 2.0 = 'L' for company size<br>data = {\"experience_level_encoded\": 3.0,<br>        \"company_size_encoded\": 2.0,<br>        \"employment_type_PT\": 0,<br>        \"job_title_Data_Engineer\": 0,<br>        \"job_title_Data_Manager\": 1,<br>        \"job_title_Data_Scientist\": 0,<br>        \"job_title_Machine_Learning_Engineer\": 0}<br><br>#make a POST request to the API<br>response = requests.post(url, json=data)<br><br>#print response<br>print(response.json())<\/pre>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/606\/1%2AfgRbzQYybq14C-0lDLekNA.png?ssl=1\"><figcaption>Image by\u00a0Author<\/figcaption><\/figure>\n<p>The prediction is then returned in JSON format thanks to the POST endpoint. Great, we have a functioning API!<\/p>\n<h3>Deploying Model Using\u00a0Docker<\/h3>\n<h4>What is\u00a0Docker?<\/h4>\n<p>Now we have a way to interact with our model, but the model is still not deployed. Let\u2019s say we have a team of 20 people, all of whom we want to have the API running on their computer. This is likely to be a headache. Replicating data science applications can be challenging as there are a number of roadblocks, such as different operating systems, dependencies, tech stacks,\u00a0etc.<\/p>\n<p>This is where Docker comes in. Docker is a platform that enables developers to package their applications and all of their dependencies in \u201ccontainers\u201d. Anyone who has access to a container can run the application without worrying about downloading the correct versions of packages, changing operating systems, etc. 
Docker containers are also fast and lightweight, which gives them an advantage over virtual machines.<\/p>\n<p>Download Docker Desktop here: <a href=\"https:\/\/www.docker.com\/\">https:\/\/www.docker.com\/<\/a><\/p>\n<h4>Creating a Dockerfile and\u00a0Image<\/h4>\n<p>Before we create a container, we must first create an image. A Docker image is a snapshot of the application and its dependencies. It serves as the template from which containers are created.<\/p>\n<p>To create an image, you must create a Dockerfile (<a href=\"https:\/\/docs.docker.com\/reference\/dockerfile\/\">https:\/\/docs.docker.com\/reference\/dockerfile\/<\/a>). The Dockerfile is a text-based document that is stored inside the project and provides the instructions on how to assemble the image. The Dockerfile cannot be a\u00a0.txt file. It must have no extension.<strong> The easiest way to create a Dockerfile is through VSCode. Simply add a new file, and name it \u201cDockerfile\u201d.<\/strong><\/p>\n<p>I built the following Dockerfile using Docker\u2019s beginner documentation. It follows these\u00a0steps:<\/p>\n<ol>\n<li>Start from a Python\u00a03.9 base image.<\/li>\n<li>Create a new directory and copy the project\u00a0files.<\/li>\n<li>Install the necessary packages using requirements.txt.<\/li>\n<li>Specify the port\u00a0(8000).<\/li>\n<li>Run the application.<\/li>\n<\/ol>\n<pre># A Dockerfile is a text document that contains all the commands<br># a user could call on the command line to assemble an image.<br><br>FROM python:3.9.4-buster<br><br># Our Debian with python is now installed.<br><br>RUN mkdir build<br><br># We create a folder named build for our files.<br><br>WORKDIR \/build<br><br># Now we set our WORKDIR to \/build<br><br>COPY . 
.<br><br># FROM [path to files, relative to the folder we run docker build from]<br># TO [current WORKDIR]<br># We copy our files (files from .dockerignore are ignored)<br># to the WORKDIR<br><br>RUN pip install --no-cache-dir -r requirements.txt<br><br># OK, now we pip install our requirements<br><br>EXPOSE 8000<br><br># Instruction informs Docker that the container listens on port 8000<br><br>WORKDIR \/build\/app<br><br># Now we set our WORKDIR to \/build\/app for simplicity<br># (this assumes main.py and the saved model live in an app subfolder)<br><br>CMD [\"uvicorn\", \"main:app\", \"--host\", \"0.0.0.0\", \"--port\", \"8000\"]<br><br># This command runs our uvicorn server<\/pre>\n<p>Now that we have our Dockerfile, we can create the image with the following command. The name of the image will be \u201capiserver\u201d.<\/p>\n<pre>#build docker image<br>docker build . -t apiserver<\/pre>\n<p>If we navigate to Docker Desktop, we can see that the image was successfully created.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AjkOHsGCCSDJKoY1e2EIHqQ.png?ssl=1\"><figcaption>Image by\u00a0Author<\/figcaption><\/figure>\n<h4>Creating a Docker Container<\/h4>\n<p>Now that we have an image, creating the container is very simple. Once we run the image with a few instructions, the container is created. Below, we run the image and specify the\u00a0port.<\/p>\n<pre>#run docker image<br>#access at http:\/\/localhost:8000<br>docker run --rm -it  -p 8000:8000\/tcp apiserver:latest<\/pre>\n<p>If we navigate back to Docker Desktop again, we can see the container. Docker gives containers random names, which can become difficult to track. If you develop many applications, it is useful to rename\u00a0them.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2A8KDdMzeExXsHw0IYm6DKgg.png?ssl=1\"><figcaption>Image by\u00a0Author<\/figcaption><\/figure>\n<p>The model is now deployed! 
Going back to our team of 20, all they need is Docker installed on their machine and access to our image. Then they can run a container from it and use the API as\u00a0needed.<\/p>\n<h3>Conclusion<\/h3>\n<p>In conclusion, with new expectations for data scientists, it is vital to learn other skills like software engineering and ML Ops. The need for \u201cfull-stack data scientists\u201d is growing as organizations need people who can engage in all stages of the data science lifecycle.<\/p>\n<p>Taking machine learning models out of notebooks and into production is a great first step toward becoming a full-stack data scientist. By using tools like FastAPI and Docker, you can share the hard work it took to build your model by allowing others to use it\u00a0too.<\/p>\n<p><em>I hope you have enjoyed my article! Please feel free to comment, ask questions, or request other\u00a0topics.<\/em><\/p>\n<p><em>Connect with me on LinkedIn: <\/em><a href=\"https:\/\/www.linkedin.com\/in\/alexdavis2020\/\"><em>https:\/\/www.linkedin.com\/in\/alexdavis2020\/<\/em><\/a><\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/journey-to-full-stack-data-scientist-model-deployment-f385f244ec67\">Journey to Full-Stack Data Scientist: Model Deployment<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Alex Davis<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fjourney-to-full-stack-data-scientist-model-deployment-f385f244ec67\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n 
<BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Journey to Full-Stack Data Scientist: Model Deployment An introduction to productionizing machine learning models using APIs and\u00a0Docker. Growing Responsibilities of Data Scientists The title of data scientist is ever-changing and often vague. It usually involves one who is fluent in mathematics, programming, and machine learning. They spend time cleaning data, building models, fine-tuning, and conducting [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,1115,83,1082,222,1114],"tags":[84,1116,106],"class_list":["post-969","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-api","category-data-science","category-docker","category-mlops","category-regression","tag-data","tag-full","tag-scientist"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/969"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=969"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/969\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=969"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=969"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=969"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]
}}