Let’s build an industry classifier!

Natural language processing (NLP) is, in my opinion, one of the funnest areas of machine learning to build and goof around with. The results tend to vary from highly useful to inadvertently hilarious.

One task that is tailor-made for classification and NLP is industry code classification. Essentially, you learn about what a company does (i.e., what products or services it grows, makes, or sells) and then assign it an exclusive, hierarchical code that best defines its operations.
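To make "hierarchical" concrete: a NAICS code is read left to right, and each extra digit narrows the category, from a 2-digit sector down to a 6-digit national industry. Here is a small sketch of that idea. The helper name `naics_ancestry` is made up for illustration, and it only handles plain digit codes (sector ranges such as "31-33" are not handled).

```python
# Level names used by NAICS Canada, keyed by the number of digits in the code.
LEVELS = {
    2: "sector",
    3: "subsector",
    4: "industry group",
    5: "industry",
    6: "Canadian industry",
}


def naics_ancestry(code):
    """Return (level name, code prefix) pairs from the sector down to `code`.

    Hypothetical helper for illustration; expects a string of 2 to 6 digits.
    """
    if not (code.isdigit() and 2 <= len(code) <= 6):
        raise ValueError(f"expected a 2- to 6-digit code, got {code!r}")
    return [(LEVELS[n], code[:n]) for n in range(2, len(code) + 1)]


for level, prefix in naics_ancestry("111110"):
    print(f"{level:>17}: {prefix}")
```

This is why a prediction that is wrong at the 6-digit level can still be partly right: it may share its first two or three digits with the correct code.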

For this guide I will be using Statistics Canada as my data source, and the North American Industry Classification System (NAICS) as my target set, but this approach should work with any industry classification scheme.

The NAICS manual used in this example comes from this page. We’ll be using the smaller ‘structure’ data file found here, although the larger ‘elements’ file is probably more interesting from a prediction perspective. Modeling-wise, we will use the usual suspects, including scikit-learn. I’m using Python 3.8, but the code and libraries should work fine with a wide variety of setups.

First let’s read the data from CSV and keep what we’re interested in.

import pandas as pd


# assumption: the CSV file is available in the current working directory here
FILENAME = './naics-scian-2017-structure-v3-eng.csv'

df = pd.read_csv(FILENAME)
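
As for keeping what we're interested in, here's a rough sketch of one way to do it. The column names (`Level`, `Code`, `Class title`, `Class definition`) are my assumption about the file's header, so check them against `df.columns` first. The tiny in-memory CSV is a stand-in so the snippet runs without the real file: the titles are NAICS titles, but the definitions are shortened placeholders, not Statistics Canada's text. The `load_naics` helper is hypothetical.

```python
import io

import pandas as pd

# Stand-in for the real file. Header and definitions are assumptions/placeholders.
SAMPLE_CSV = """Level,Hierarchical structure,Code,Class title,Superscript,Class definition
1,Sector,11,"Agriculture, forestry, fishing and hunting",,"Growing crops, raising animals and harvesting timber."
2,Subsector,111,Crop production,,"Growing crops mainly for food and fibre."
2,Subsector,112,Animal production and aquaculture,,"Raising animals and farming fish."
3,Industry group,1111,Oilseed and grain farming,,
"""

KEEP = ["Level", "Code", "Class title", "Class definition"]


def load_naics(source):
    """Read the CSV, keep a few columns, and drop rows with no definition."""
    # Read codes as strings so they are treated as labels, not numbers.
    df = pd.read_csv(source, dtype={"Code": str})
    df = df[KEEP].dropna(subset=["Class definition"])
    return df.reset_index(drop=True)


sample = load_naics(io.StringIO(SAMPLE_CSV))
print(sample)
```

With the real file you would pass `FILENAME` instead of the `StringIO` object.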

Now let’s run the data through a TF-IDF (term frequency-inverse document frequency) vectorizer to get some term weights. Essentially, this turns the words of each description into numbers we can put into a model. It's a standard approach, and the scikit-learn docs cover the details.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

X = 

…
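
Here's a minimal, self-contained sketch of how the vectorizer and `GaussianNB` can fit together. It is not the exact code from this walkthrough: the three descriptions are made up rather than taken from the NAICS file, `suggest_codes` is a hypothetical helper, and dropping English stop words is my own choice. One thing to watch for is that `GaussianNB` needs a dense array, so the sparse TF-IDF matrix is converted with `.toarray()`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Toy (code, description) pairs for illustration only; not from the NAICS file.
CODES = ["111", "236", "722"]
DESCRIPTIONS = [
    "growing crops such as wheat, corn and vegetables",
    "constructing houses and commercial buildings",
    "restaurants, bars and caterers serving meals and drinks",
]

# Turn each description into a row of TF-IDF weights.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(DESCRIPTIONS).toarray()  # GaussianNB needs dense input

model = GaussianNB()
model.fit(X, CODES)


def suggest_codes(text, top_k=2):
    """Return the top_k codes for a free-text description, best first."""
    features = vectorizer.transform([text]).toarray()
    # Log-probabilities avoid the underflow that plain probabilities can hit.
    scores = model.predict_log_proba(features)[0]
    ranked = scores.argsort()[::-1][:top_k]
    return [str(model.classes_[i]) for i in ranked]


print(suggest_codes("a farm that grows wheat and corn"))
```

The first suggestion for that query should be "111", since "wheat" and "corn" only appear in the crop description. With one description per code, as in the structure file, the model is effectively matching on shared words, so a query that shares no vocabulary with any description gives an arbitrary ranking.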

And there you are! It’s quick-and-dirty, but I think it's pretty fun to goof around with. And frankly, depending on your application, it might be perfectly useful as-is for something like a suggestion field in a drop-down.