A Series C startup asked us to find every company that holds corporate retreats. Below is some of the work we did to turn terabytes of unstructured data into structured data, leveraging recent advances in Natural Language Processing (NLP) and Large Language Models (LLMs).
Good results come from good data. Since 2018, we’ve been crawling 50–60 million websites weekly, with an emphasis on corporate pages, review sites, news articles, social media, and more. Combined with our contributor network, this data is the backbone of all our machine learning models, and it improves every day as more data is ingested.
Furthermore, we track when changes occur, allowing us to generate insights from data that has since become inaccessible or been deleted. For instance, if a company offered remote perks last year but stopped this year, that is a signal that its remote policy has changed.
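As a toy illustration of that change-tracking idea (the company, attributes, and dates below are invented), a diff between consecutive snapshots is enough to surface a dropped perk:

```python
from datetime import date

# Hypothetical snapshot history: one set of extracted attributes per crawl date.
history = [
    (date(2022, 6, 1), {"remote_perks", "corporate_retreats"}),
    (date(2023, 6, 1), {"corporate_retreats"}),
]

def change_signals(history):
    """Diff consecutive snapshots and emit added/removed attributes."""
    signals = []
    for (_, prev), (curr_date, curr) in zip(history, history[1:]):
        signals += [(curr_date, f"dropped:{a}") for a in prev - curr]
        signals += [(curr_date, f"added:{a}") for a in curr - prev]
    return signals

print(change_signals(history))
# [(datetime.date(2023, 6, 1), 'dropped:remote_perks')]
```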
Searching for the general keywords (“retreats,” “social events,” “off-site,” etc.) gives an initial result of 29.1 million matches, very few of which are companies that actually hold corporate retreats. This simple approach has a significant limitation: English is full of nuance and hard for algorithms to parse, and the same phrase can mean entirely opposite things depending on where it sits in a sentence:
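A small self-contained illustration of the problem, with invented sentences: a bare keyword match cannot tell a company that holds retreats apart from one that sells them, or one that has stopped holding them.

```python
import re

KEYWORDS = re.compile(r"\b(retreats?|off-?sites?|social events)\b", re.IGNORECASE)

sentences = [
    "We fly the whole team to an annual off-site retreat.",        # holds retreats
    "We plan unforgettable corporate retreats for your company.",  # sells retreats
    "Our company cancelled its retreats during the pandemic.",     # no longer holds them
]

# All three sentences match, even though only the first is a true positive.
for sentence in sentences:
    print(bool(KEYWORDS.search(sentence)), "-", sentence)
```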
Traditionally, a human would have to read through every result, but at this volume that is highly impractical. Instead, we use transfer learning on a Large Language Model architecture together with few-shot learning (FSL) to rapidly train sophisticated models that adapt and generalize as well as humans do. This gives us the ability to understand nuance and deliver industry-leading results.
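A minimal sketch of what such a transfer-learning step can look like, assuming the Hugging Face transformers library; the base model (distilbert-base-uncased), the label scheme, and the two training sentences are illustrative stand-ins, not our production setup.

```python
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

texts = ["Join our annual company retreat in the mountains!",
         "We organize corporate retreats for Fortune 500 clients."]
labels = [1, 0]  # 1 = holds retreats, 0 = does not

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

class TinyDataset(torch.utils.data.Dataset):
    """Wraps a handful of labeled sentences for the Trainer."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="retreat-clf", num_train_epochs=3),
    train_dataset=TinyDataset(texts, labels),
)
trainer.train()
```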
Our approach is great at understanding context; for retreats, our models ignore travel agents, travel agencies, retreat planners, directories, and corporate retreat venues, because these organizations sell corporate retreat services rather than hold retreats themselves. We do this by parsing the company description, official website, reviews, and press releases. Legal documents and notices containing certain keywords are also ignored, using a classifier we built for another client.
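Conceptually, this works like chaining classifiers, as in the sketch below; all three classifier callables are hypothetical placeholders.

```python
def is_retreat_evidence(page_text, retreat_clf, vendor_clf, legal_clf):
    """Count a page as evidence only if the company itself holds retreats.

    All three classifiers are hypothetical callables returning True/False.
    """
    if legal_clf(page_text):   # legal documents and notices: ignore
        return False
    if vendor_clf(page_text):  # venues, planners, agencies selling retreats: ignore
        return False
    return retreat_clf(page_text)

# Example wiring with trivial stand-in classifiers:
print(is_retreat_evidence(
    "Join our annual company retreat!",
    retreat_clf=lambda t: "retreat" in t.lower(),
    vendor_clf=lambda t: "book our venue" in t.lower(),
    legal_clf=lambda t: "terms of service" in t.lower(),
))  # True
```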
Once the corporate retreat classifier is built, we run it on a subset of the matches and compare its output against the human-labeled ground truth. This was one of our initial passes:
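The comparison itself is standard; here is a sketch using scikit-learn, with invented labels standing in for the human-annotated subset:

```python
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # human labels (invented for the sketch)
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]  # classifier output on the same pages

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(confusion_matrix(y_true, y_pred))  # rows = true class, cols = predicted
```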
To increase confidence and improve the model, we examined the false positive and false negative results. Here is a sample of the problems we noticed:
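Surfacing those problem cases is a matter of filtering for disagreements between the model and the annotators; the records below are illustrative:

```python
# Records: (snippet, human label, model prediction), invented for the sketch.
records = [
    ("We host a yearly team retreat in the mountains.",      1, 1),
    ("Book our lakeside venue for your corporate retreat.",  0, 1),  # FP: a venue
    ("Retreat details are shared with staff each spring.",   1, 0),  # FN: indirect wording
]

false_positives = [t for t, truth, pred in records if truth == 0 and pred == 1]
false_negatives = [t for t, truth, pred in records if truth == 1 and pred == 0]
print("FP:", false_positives)
print("FN:", false_negatives)
```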
We use FSL to update our model with a few examples at a time. Because our model generalizes well and has deep language understanding, we get rapid performance increases on specific tasks; it often takes under a hundred examples to fine-tune our models to common tasks. More importantly, these additions are cumulative, so the benefits we get from training on one task carry over to the others. The base model is constantly learning, providing us with a unique and extendable asset.
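A hedged sketch of what one such update can look like: reload the previously fine-tuned checkpoint and take a few gradient steps on the corrected examples. The checkpoint path and example are illustrative, and this is a simplification of a real FSL loop.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumes a previously fine-tuned checkpoint was saved to ./retreat-clf.
tokenizer = AutoTokenizer.from_pretrained("retreat-clf")
model = AutoModelForSequenceClassification.from_pretrained("retreat-clf")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# A handful of corrected examples from the error review above.
new_examples = [("Book our lakeside venue for your corporate retreat.", 0)]

model.train()
for text, label in new_examples:
    batch = tokenizer(text, return_tensors="pt")
    loss = model(**batch, labels=torch.tensor([label])).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.save_pretrained("retreat-clf")  # the updated model becomes the new base
```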
Going back to our retreat classification, here is our current performance on the task:
To demonstrate generalizability, here are some examples that our model classified correctly without being specifically trained on similar examples:
Furthermore, we collate the data for the same company to build a rich contextual history from insights across different data sources, which allows our post-processing to answer more complex questions:
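As a toy sketch of this collation-plus-post-processing step (the records are invented), grouping evidence by company makes questions like “do they still hold retreats?” answerable from the most recent data:

```python
from collections import defaultdict

# Invented per-source records for one company.
records = [
    {"company": "acme.example", "source": "press_release", "year": 2022, "retreats": True},
    {"company": "acme.example", "source": "reviews",       "year": 2023, "retreats": False},
    {"company": "acme.example", "source": "careers_page",  "year": 2023, "retreats": False},
]

by_company = defaultdict(list)
for record in records:
    by_company[record["company"]].append(record)

def holds_retreats_currently(history, current_year=2023):
    """Post-processing example: weigh only the most recent year's evidence."""
    latest = [r for r in history if r["year"] == current_year]
    return any(r["retreats"] for r in latest)

print(holds_retreats_currently(by_company["acme.example"]))  # False: the signal changed in 2023
```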
Results
We follow these steps for every insight we generate. For example, when parsing frequency data, these are common problems we’ve had to take into account:
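One recurring problem is that the same cadence is phrased in many different ways. Here is a minimal sketch of normalizing such mentions to a canonical times-per-year value, using a hypothetical phrase table:

```python
import re

# Hypothetical phrase table mapping cadence wording to times per year.
PER_YEAR = {
    "annually": 1, "annual": 1, "once a year": 1,
    "semi-annual": 2, "twice a year": 2,
    "quarterly": 4, "every quarter": 4,
    "monthly": 12,
}

def normalize_frequency(text):
    """Return a canonical times-per-year value, or None if nothing matches."""
    text = text.lower()
    # Longest phrases first, so "semi-annual" wins over its substring "annual".
    for phrase, times in sorted(PER_YEAR.items(), key=lambda kv: -len(kv[0])):
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            return times
    return None

print(normalize_frequency("We hold quarterly off-sites"))      # 4
print(normalize_frequency("Our semi-annual company retreat"))  # 2
```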
In the end, this was the custom view we built for the client:
We crawl and run the classifiers continuously, updating data whenever changes are detected.
Limitations
1.
2.