Our Machine Learning Process

A Series C startup asked us to find every company that holds corporate retreats. Here is some of the work we did to turn terabytes of unstructured data into structured data, leveraging recent advancements in Natural Language Processing (NLP) and Large Language Models (LLMs).

Good data comes first

Good results come from good data. Since 2018, we’ve been crawling 50-60 million websites weekly, with an emphasis on corporate pages, review sites, news articles, and social media. Combined with our contributor network, this data is the backbone of all our machine learning models, and it improves with every day of new ingestion.

Furthermore, we track when changes occur, allowing us to generate insights from data that has since been deleted or become inaccessible. For instance, if a company offered remote perks last year but stopped this year, that is a signal that its remote policy has changed.
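
As a simplified illustration, change tracking can be reduced to diffing successive snapshots of what a company lists. The record shape and perk names below are hypothetical:

```python
# Minimal sketch of snapshot diffing between two weekly crawls.
# The perk names and snapshot structure are hypothetical.

def diff_perks(last_year: set[str], this_year: set[str]) -> dict[str, set[str]]:
    """Compare two snapshots of a company's listed perks."""
    return {
        "added": this_year - last_year,
        "removed": last_year - this_year,  # removed perks are a signal too
    }

changes = diff_perks(
    last_year={"remote work stipend", "corporate retreats"},
    this_year={"corporate retreats"},
)
print(changes["removed"])  # {'remote work stipend'} -> remote policy may have changed
```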

LLM-based classification models

Searching for general keywords (“retreats,” “social events,” “off-site,” etc.) gives us an initial result of 29.1 million matches, very few of which are actually companies that hold corporate retreats. This simple approach has a significant limitation: English is full of nuance that is hard for algorithms to parse, and the same phrase in different parts of a sentence can mean entirely different things, as these examples (and the sketch after them) show:

“…the company’s social atmosphere…”
This describes workplace culture, not an event.
“…our company’s social in Tahoe annually…”
This describes a corporate retreat.
“…executive corporate retreat, bonuses with good performance.”
Ambiguous attachment: does “with good performance” apply only to “bonuses,” or to the retreat as well?
“…the park provides a unique retreat in the city’s…”
“Retreat” here is used to describe a park.
“…retreat into the market…”
“Retreat” here is a verb.
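
To make the limitation concrete, here is what a naive keyword pass looks like; every example above matches, which is exactly the problem. The regex is illustrative, not our production query:

```python
import re

# A naive keyword matcher: flags any text containing a retreat-related term.
KEYWORDS = re.compile(r"\b(retreat|social|off-?site)\b", re.IGNORECASE)

examples = [
    "the company's social atmosphere",                             # culture, not an event
    "our company's social in Tahoe annually",                      # an actual retreat
    "executive corporate retreat, bonuses with good performance",  # ambiguous attachment
    "the park provides a unique retreat",                          # "retreat" describes a park
    "retreat into the market",                                     # "retreat" as a verb
]

# Every example matches, which is exactly the problem:
# keyword search cannot tell these apart.
for text in examples:
    print(bool(KEYWORDS.search(text)), text)
```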

Traditionally, a human would have to read through every result, but at this volume that is highly impractical. Instead, we use transfer learning on a Large Language Model architecture, combined with few-shot learning (FSL), to rapidly train sophisticated models that adapt and generalize as well as humans do. This gives us the ability to understand nuance and deliver industry-leading results.
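
As a rough illustration, transfer learning here means starting from a pretrained language model, attaching a new classification head, and fine-tuning on labeled examples. The sketch below assumes a Hugging Face-style setup; the checkpoint name, label count, and example sentences are placeholders, not our production stack:

```python
# Minimal sketch of transfer learning for sentence classification.
# The checkpoint and label set are illustrative placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # retreat / not-retreat
)

inputs = tokenizer(
    ["our company's social in Tahoe annually",
     "the park provides a unique retreat in the city's center"],
    padding=True, truncation=True, return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs).logits
predictions = logits.argmax(dim=-1)  # meaningless until the head is fine-tuned
```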

Our approach is good at understanding context: for retreats, our models ignore travel agents, travel agencies, retreat planners, directories, and corporate retreat venues, because these organizations sell corporate retreats rather than hold them. We do this by parsing the company description, official website, reviews, and press releases. Legal documents and notices containing certain keywords are also filtered out, using a classifier we built for another client.
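
As a simplified sketch, this pre-classification step can be thought of as assembling text from each source and filtering out vendors first. The record fields and vendor heuristics below are crude stand-ins for the dedicated classifiers described above:

```python
# Sketch of the pre-classification filtering step. The source fields and
# vendor heuristic are simplified stand-ins, not our actual classifiers.
from dataclasses import dataclass

@dataclass
class CompanyRecord:
    description: str
    website_text: str
    reviews: list[str]
    press_releases: list[str]

def looks_like_vendor(record: CompanyRecord) -> bool:
    """Crude stand-in: sellers of retreats describe themselves that way."""
    text = record.description.lower()
    return any(kw in text for kw in ("retreat planner", "retreat venue", "travel agency"))

def candidate_text(record: CompanyRecord) -> str | None:
    """Assemble classifier input, skipping vendors."""
    if looks_like_vendor(record):
        return None  # sells retreats rather than holds them
    return " ".join([record.description, record.website_text,
                     *record.reviews, *record.press_releases])

record = CompanyRecord(
    description="We build accounting software.",
    website_text="Join our annual company offsite in Tahoe.",
    reviews=["Great retreats every year!"],
    press_releases=[],
)
print(candidate_text(record) is not None)  # True: not a vendor, so it gets classified
```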

Once the corporate retreat classifier is built, we run it on a subset of the matches and compare the results to the human-labeled ones. This was one of our initial passes:

[Table: initial-pass performance against human labels]
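
Under the hood, comparing model output to human labels is a standard precision/recall computation. A minimal sketch, with made-up label vectors:

```python
# Comparing model output to human labels with standard metrics.
# The label vectors here are made up for illustration.
from sklearn.metrics import precision_recall_fscore_support

human_labels = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = holds corporate retreats
model_labels = [1, 0, 1, 0, 0, 1, 1, 0]

precision, recall, f1, _ = precision_recall_fscore_support(
    human_labels, model_labels, average="binary"
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.75 recall=0.75
```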

To increase confidence in the model and improve it, we examined the false-positive and false-negative results. Here is a sample of the problems we noticed:

“Paid lunch and company socials like Cocktail Fridays…”
“Company social” here does not refer to a retreat.
“…company offsite bi-annually.”
“Offsite” here means corporate retreat.
“Company Offsite Manager”
“Offsite” here means “not in office”.
“…company’s social butterfly…”
Idioms are ignored.

We use FSL to update our model with a few examples at a time. Due to the generalizability of our model and its depth of language understanding, we can get rapid performance increases on specific tasks; it often takes under a hundred examples to fine-tune our models for common tasks. More importantly, these additions are cumulative, so the benefits we get from training on one task carry over to the others. The base model is constantly learning, providing us with a unique and extendable asset.
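
Here is a minimal sketch of what one such update might look like, again assuming a Hugging Face-style setup; the checkpoint, learning rate, and the tiny corrected batch (drawn from the error analysis above) are illustrative, not our production pipeline:

```python
# Sketch of a few-shot update: one gradient step on a handful of
# corrected examples. Checkpoint and hyperparameters are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# A few corrected examples from the error analysis above.
texts = [
    "Paid lunch and company socials like Cocktail Fridays",  # 0: not a retreat
    "company offsite bi-annually",                           # 1: a retreat
    "Company Offsite Manager",                               # 0: job title
]
labels = torch.tensor([0, 1, 0])

model.train()
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
loss = model(**batch, labels=labels).loss  # single gradient step on the batch
loss.backward()
optimizer.step()
```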

Going back to our retreat classification, here is our current performance on the task:

[Table: current performance on the retreat classification task]

To demonstrate generalizability, here are some examples that our model classified correctly without being trained on similar examples (the context extraction they rely on is sketched after the list):

“…research options and coordinate corporate retreat…”
This job description signals that the company holds retreats and identifies who coordinates them.
“Collaborates with the Director of Human Resources on retreat presentation…”
Caught using context extraction around “retreat”.
“…take a team-building trip with your colleagues every year…”
Caught using context extraction around “trip”.
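
The “context extraction” mentioned above can be pictured as pulling a window of tokens around the trigger word so the classifier sees local context. A minimal sketch, with an illustrative window size:

```python
# Sketch of context extraction: a window of tokens around the trigger word.
# The window width is an illustrative choice.
def context_window(text: str, keyword: str, width: int = 5) -> str | None:
    tokens = text.split()
    for i, token in enumerate(tokens):
        if keyword in token.lower():
            start = max(0, i - width)
            return " ".join(tokens[start:i + width + 1])
    return None

print(context_window(
    "Collaborates with the Director of Human Resources on retreat presentation",
    "retreat",
))
# -> "Director of Human Resources on retreat presentation"
```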

Furthermore, we collate the data for each company so we can build a rich contextual history from insights across different data sources, which allows us to do post-processing that answers more complex questions (a sketch follows the examples):

“…get-together in Hawaii…”
Marked if the retreat location is not the same as the office location.
“…attend different company conferences…”
Found only in sales job descriptions, not company-wide, suggesting that it is an external conference.
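
Conceptually, the collation step groups every mention of a company across sources before post-processing runs over the combined history. A simplified sketch, with hypothetical record shapes:

```python
# Sketch of collating mentions by company to build a contextual history.
# The record shape and post-processing rule are simplified illustrations.
from collections import defaultdict

mentions = [
    {"company": "Acme", "source": "review", "text": "get-together in Hawaii"},
    {"company": "Acme", "source": "website", "text": "offices in Boston and NYC"},
    {"company": "Acme", "source": "job_posting", "text": "attend different company conferences"},
]

by_company = defaultdict(list)
for m in mentions:
    by_company[m["company"]].append(m)

# Post-processing runs over the collated history, e.g. flagging retreats
# held away from any known office location.
for company, history in by_company.items():
    offices = [m["text"] for m in history if m["source"] == "website"]
    events = [m["text"] for m in history if m["source"] == "review"]
    print(company, "| offices:", offices, "| events:", events)
```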

Results

We follow these steps for every insight we generate. For example, when parsing frequency data, these are common problems we’ve had to take into account (see the sketch after the examples):

“…off-site the weekend after Labor Day…”
This means once a year.
“…travel once every 3 months for our retreats that last 4 days…”
The model infers that the frequency is quarterly; “4 days” is the duration of the retreat, not the frequency.
“…corporate retreats, lunches every week.”
Because it is unlikely that there is a corporate retreat “every week”, the frequency only applies to “lunches”.
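
A toy version of frequency normalization converts such phrases into events per year; the patterns below cover only the two examples above and stand in for the model’s far broader extraction:

```python
# Sketch of normalizing frequency phrases into events per year.
# These patterns are a small illustrative subset.
import re

def events_per_year(phrase: str) -> float | None:
    phrase = phrase.lower()
    if "weekend after labor day" in phrase:
        return 1.0                       # a fixed annual date implies yearly
    m = re.search(r"once every (\d+) months", phrase)
    if m:
        return 12 / int(m.group(1))      # "once every 3 months" -> quarterly
    return None

print(events_per_year("off-site the weekend after Labor Day"))          # 1.0
print(events_per_year("travel once every 3 months for our retreats"))   # 4.0
```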

In the end, this was the custom view we built for the client:

[Table: the custom view built for the client]

We crawl and run the classifiers continuously, updating data whenever changes are detected.
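
A minimal sketch of that update loop: re-run the classifiers only when a page’s content hash changes between crawls. The storage layer here is hypothetical:

```python
# Sketch of the continuous update loop: re-classify only on change.
# The in-memory hash store is a hypothetical stand-in for real storage.
import hashlib

def content_hash(page_text: str) -> str:
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()

previous_hashes: dict[str, str] = {}  # url -> last seen hash

def on_crawl(url: str, page_text: str) -> bool:
    """Return True if the page changed and classifiers should re-run."""
    h = content_hash(page_text)
    changed = previous_hashes.get(url) != h
    previous_hashes[url] = h
    return changed
```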

Limitations

1. Recent major events aren’t taken into account for context. Because new data takes time to be collated with old data, the model lags in gaining awareness of current events (e.g., “Given the recent election results…”).

2. Because we predominantly trained this model in English, it performs less accurately in other languages.

Interested in working on data engineering problems that utilize the latest advancements in Large Language Models? We’re hiring!
