Full-stack Data Science: Building & deploying an ML app tutorial – Part 1

Data Scientists NEED to learn to package and deploy their own models.

I’m not being a gatekeeper here; I’m giving you facts. I interview, hire, and lead data professionals, whether they’re Data Scientists, Data Analysts, Machine Learning Engineers, or Data Engineers. Packaging and deploying models is a consistent gap for people without a software engineering background.

That’s why in this article and video I’ll show you a rapid deployment of a Natural Language Processing (NLP) app, from start to finish. I’m not worrying about developing the model because modeling isn’t the gap I see in the market.

Alright, let’s get started.

Setting up the project environment

I used PyCharm as my IDE, as you can see in the video above, but you should be fine with any Python IDE.

To get started, we’re going to open our terminal and run the following command to create the application directory:

mkdir ner-service

Now ‘cd’ into the directory

cd ner-service

Create the poetry project by running:

poetry init

Note: Alternatively, you could run poetry new ner-service which would start a structured poetry project for you.
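For reference, poetry new would give you a skeleton roughly like this (the exact layout, including the README format, varies by Poetry version):

ner-service
├── pyproject.toml
├── README
├── ner_service
│   └── __init__.py
└── tests
    └── ...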

After you run the ‘poetry init’ command, you’ll go through a project setup that looks something like this:

Now that we have the poetry project set up, let’s launch the poetry shell.

poetry shell

If you’ve done everything properly to this point, you should see something like this in your terminal:

Setting up the project file structure

Alright, now that we have our environment up and running, let’s set up our project structure.

The first thing we’ll want to do is create our ‘src’ directory. To do that we’ll run:

mkdir src; cd src

Then we’ll want to create the __init__.py and main.py files.

touch __init__.py main.py

At this point, your project structure should look like this:
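That is, something like this (assuming you created everything from the project root as above):

ner-service
├── pyproject.toml
└── src
    ├── __init__.py
    └── main.py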

SpaCy’s Named Entity Recognition (NER)

In this article and video guide, I didn’t spend much time on the NER or SpaCy explanations, but based on responses from early viewers of the video, that was a mistake. So let’s talk quickly about SpaCy.

Background Information: SpaCy & NER

SpaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython

Wikipedia

SpaCy is a fantastic library that simplifies building and developing NLP solutions. In this project, for the sake of simplicity, we’re using SpaCy’s built-in named-entity recognition (NER) feature.

If you’d like to learn more about NER, here’s the Wikipedia definition:

Named-entity recognition is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

Wikipedia

Implementing SpaCy’s NER system

To get the project moving and keep things iterative, we’re going to use the example code from SpaCy’s Named Entity Recognition 101 as our boilerplate.

#~/ner-service/src/main.py
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Now, to use SpaCy, we’ll need to add it to our environment. You can do that by running the following command:

poetry add spacy

Your terminal will look something like this afterwards:

Then you’ll want to add the language model.

poetry add https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz

Then just to make sure everything has properly installed, run:

poetry update

At this point, your pyproject.toml file should look something like this:
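Mine looks roughly like the sketch below; your Python version, package versions, author details, and build-system section will differ depending on your Poetry version:

#~/ner-service/pyproject.toml
[tool.poetry]
name = "ner-service"
version = "0.1.0"
description = ""
authors = ["Your Name <you@example.com>"]

[tool.poetry.dependencies]
python = "^3.8"
spacy = "^2.3.1"
en_core_web_sm = {url = "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz"}

[tool.poetry.dev-dependencies]

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"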

Now, let’s quickly test that the code works in our environment. Run the following command:

python main.py

If everything is running as expected in your environment you should get an output like this:
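For the example sentence above, the output should be something along these lines (entity text, start character, end character, and label):

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY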

Setup the API with FastAPI

Now that we have Poetry set up and SpaCy working in our environment, let’s set up our API.

Before we jump in, I’m going to introduce FastAPI. Feel free to skip to the Implementing FastAPI section.

What is FastAPI?

FastAPI

According to the FastAPI website:

FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.6+ based on standard Python type hints.

https://fastapi.tiangolo.com/

I don’t want to be too lazy about this, but that quote pretty much sums it up. FastAPI is fast, easy, clean, and extensible.

Why use FastAPI?

There are a ton of reasons to use FastAPI, but I’ll list a few reasons my team and I at Aptive Resources switched over from Flask to FastAPI for most of our Python services.

  • FastAPI comes with Swagger docs built-in. This is awesome for rapid prototyping and testing your API.
  • Clear, concise documentation and examples.
  • Extensibility.
  • Speed, speed, speed to production.
Screenshot from the example NER-Service Swagger doc.

Implementing FastAPI

Ok, so now that we generally know what FastAPI is and why to use it, let’s add it to our project with the following command:

poetry add fastapi uvicorn

That command will install FastAPI and Uvicorn. According to the documentation, “Uvicorn is a lightning-fast ASGI server implementation, using uvloop and httptools.” For our use case, Uvicorn helps us serve our app to the world.

Now that FastAPI and Uvicorn are installed, let’s go back to main.py and implement FastAPI.

#~/ner-service/src/main.py
from fastapi import FastAPI
from typing import List
import spacy

from .models import Payload, Entities


app = FastAPI()

nlp = spacy.load("en_core_web_sm")


@app.post('/ner-service')
async def get_ner(payload: Payload):
    tokenize_content: List[spacy.tokens.doc.Doc] = [
        nlp(content.content) for content in payload.data
    ]
    document_entities = []
    for doc in tokenize_content:
        document_entities.append([ {'text': ent.text, 'entity_type': ent.label_} for ent in doc.ents ])
    return [
        Entities(post_url=post.post_url, entities=ents)
        for post, ents in zip(payload.data, document_entities)
    ]

That’s a lot of new code added to the file, so let’s go through it piece by piece.

 app = FastAPI()

This part simply instantiates the FastAPI application.

@app.post('/ner-service') 

This sets our route or path. For example, www.mktr.ai/ner-service would have the above path if this were a service we ran from the MKTR.AI website.

@app.post('/ner-service') 
async def get_ner(payload: Payload): 
    ...

The async for path operation functions are super helpful and I suggest you take a look at FastAPI’s documentation to learn more.

Notice the payload: Payload portion is telling the application the type of data to expect. We’ll get to that in a minute when we make a models.py file. For now, think of it as a way to format the data we’ll accept in a request to our API.

tokenize_content: List[spacy.tokens.doc.Doc] = [
        nlp(content.content) for content in payload.data
    ]

Here we’re using list comprehension to tokenize the text data that’s passed to our API. The List[spacy.tokens.doc.Doc] portion declares the type/format of the data we’re assigning to the tokenize_content variable. This may be a little redundant but becomes more important as you attempt to account for edge cases and potential issues in production.

document_entities = []
for doc in tokenize_content:
        document_entities.append([ {'text': ent.text, 'entity_type': ent.label_} for ent in doc.ents ])

Here we’re creating a list, document_entities, and using list comprehension to create a dictionary with the text and entity type for each piece of text passed to the API. The document_entities list is a list of dictionaries.
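For example, if the only text passed in were the Apple sentence from earlier, document_entities would end up looking roughly like this (one inner list per document):

[
    [
        {'text': 'Apple', 'entity_type': 'ORG'},
        {'text': 'U.K.', 'entity_type': 'GPE'},
        {'text': '$1 billion', 'entity_type': 'MONEY'},
    ]
]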

return [
        Entities(post_url=post.post_url, entities=ents)
        for post, ents in zip(payload.data, document_entities)
    ]

Finally, we format our response object. Based on the previous chunks, you can probably tell what’s going on here. Basically, the Entities() piece hydrates the Entities objects for each text string passed.

Ok, now that the main.py file is good to go, we’re going to create another Python file named models.py like this:

touch models.py

Then let’s add the following code to models.py:

#~/ner-service/src/models.py
from typing import List
from pydantic import BaseModel


class Content(BaseModel):
    post_url: str
    content: str


class Payload(BaseModel): 
    data: List[Content] # this makes a list of Content objects


class SingleEntity(BaseModel):
    text: str
    entity_type: str


class Entities(BaseModel):
    post_url: str
    entities: List[SingleEntity] # this makes a list of SingleEntity objects

Basically, Content() is a single object with a post_url string and a content string. The Payload() object is a list of Content objects.

The same goes for Entities and SingleEntity.
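To make that concrete, here’s a hypothetical request body the /ner-service endpoint would accept (the URL is just a made-up example), followed by the response it would produce, which mirrors the Entities model:

{
  "data": [
    {
      "post_url": "https://example.com/some-post",
      "content": "Apple is looking at buying U.K. startup for $1 billion"
    }
  ]
}

And the response:

[
  {
    "post_url": "https://example.com/some-post",
    "entities": [
      {"text": "Apple", "entity_type": "ORG"},
      {"text": "U.K.", "entity_type": "GPE"},
      {"text": "$1 billion", "entity_type": "MONEY"}
    ]
  }
]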

Test your FastAPI locally

Now that you have the baseline code written, let’s test it out. Because main.py imports models.py with a relative import, run Uvicorn from the project root (the ner-service directory) and point it at the src package:

uvicorn src.main:app --reload

After you run the above code, your terminal should look something like this:

If everything is going as planned, you should be able to visit http://127.0.0.1:8000/docs and check out your FastAPI app and swagger doc.
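You can poke at the endpoint straight from the Swagger UI, but if you’d rather hit it from code, here’s a minimal sketch using the requests library (you’d need to install it separately, e.g. with poetry add requests; the post_url is just a placeholder):

import requests

# A hypothetical document to tag; the URL is only an example
payload = {
    "data": [
        {
            "post_url": "https://example.com/some-post",
            "content": "Apple is looking at buying U.K. startup for $1 billion",
        }
    ]
}

# POST to the locally running service and print the entities it found
response = requests.post("http://127.0.0.1:8000/ner-service", json=payload)
print(response.json())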

In closing…

Alright, alright, alright! You have a FastAPI app running locally. Congrats!

In the next post, we’ll containerize our app using Docker, push the Docker image to DockerHub, set up a GCP Virtual Machine, and run our app so the world can use it.

If you’re impatient, you can always cut to the chase and watch the original YouTube video for this project and check out the GitHub repo.

Until next time, I’m Mike and this is MKTR.AI.
