Building a Defensible Machine Learning Company in the Age of Foundation Models
Revisiting the Defensibility Flywheel
Two years ago, I wrote something along the lines of: “You either die a chatbot startup or live long enough to see yourself become an ML tooling company.” This time, however, I might have to rephrase it to:
“You either die a copywriting AI tool or live long enough to see yourself become out-bundled by Microsoft.”
Many remarkable things have happened in the AI space, turning heads every week since I first wrote about Defensibility in ML. 2022 gave birth to new deep learning models like DALL-E 2 by OpenAI and Stable Diffusion by Stability AI, RunwayML, and LMU Munich, leading to mind-blowing images generated from pure text prompts that swept across Tech Twitter (I used Midjourney to generate the picture above).
However, the award for product of the year goes to the space of language models, namely ChatGPT by OpenAI. While language models had already performed remarkably well, ChatGPT wrapped a chat UI around OpenAI’s state-of-the-art models so everybody could experience the previously “hidden” power behind APIs. (To be fair, ChatGPT is much more than just a UI wrapped around a language model - it is fine-tuned on human interaction using instruction tuning and Reinforcement Learning from Human Feedback.) ChatGPT gained 1 million users within one week, making it one of the most successful product launches in history. In addition, its successor model, GPT-4, is about to make huge waves in early 2023, with its arrival hyped up to the degree that the highly anticipated launch has already turned into a meme.
With the performance and accessibility of those new models reaching new highs, tech and research folks have coined new terms and newly investable categories like Generative AI, Large Language Models, and Foundation Models. Existing and new companies are jumping on the bandwagon, looking to power up their value propositions with those foundation models.
What are those companies, and why do we call the models foundation models?
Companies working with foundation models
“Foundation model” is an umbrella term for any deep learning model pre-trained on lots of data that users fine-tune and adapt to solve various tasks such as generating text, images, and beyond. Due to their scale in terms of data, architecture, and computation, they perform like magic and are generally resource-intensive. Foundation models include OpenAI’s GPT & DALL-E, Stable Diffusion, and BERT.
Three types of companies can deal with foundation models (I took this framework from Elad Gil’s blog post that you should read):
Layer 1 - Platforms / ML infrastructure: This includes companies that come up with new architectures, train (foundation) models, or make them accessible for other companies to use, such as OpenAI, HuggingFace, and Google, as well as companies that provide general infrastructure and tooling to help ML get embedded and go into production, like Groq, Deepset, or Scale.
Layer 2 - Companies consuming those (foundation) models:
Layer 2.1: Existing companies that add/replace features with ML (e.g., GitHub + Copilot, Notion + AI, etc.)
Layer 2.2: Companies that are entirely rethought with ML (e.g., RunwayML, Jasper AI, Adept AI, etc.)
While most of the recent launches of the companies above are groundbreaking and attract a lot of attention and users, the question has increasingly come up in parallel: how can those companies build long-lasting defensible companies while building on top of foundation models? Why do we even care?
There is no free lunch
While one could theoretically train and tune a model from scratch, most (early-stage) companies don’t have the resources (knowledge, talent, money) to do so initially. So instead, many companies initially “outsource” those ML capabilities and use models like GPT out of the box via their chat interface / a wrapper around it (like Texts.com) or, more frequently, for inference and finetuning via their API (like Copy.ai).
But: It’s not like you can download the architecture, including pre-trained weights, and make it “your” model. Suppose you, for example, decide to build on top of OpenAI. In that case, you become dependent on a third-party provider, with everybody else having (almost) the same access to those models as you. And despite its name, there is no reason for OpenAI to open up its product and business model. If a monopoly/oligopoly for foundation models emerges (a little bit more on this in the “Predictions” section below), businesses leveraging those third-party models will need to cope with various implications, especially financial ones.
As a result, Layer 2.1 + 2.2 companies need to think strategically about building a defensible company in the age of Foundation Models. (Different physical laws apply to Layer 1, however, as for them it’s mainly about the Triple R’s: Resources, Research, and Reach; see Microsoft’s $1B investment into OpenAI.)
The Cost of ML and Defensibility Flywheel as the answers?
Two years ago & before the rise of Foundation Models, my answer was that companies should take into account two simple concepts when trying to build a long-term defensible company:
The Cost of ML
The Defensibility Flywheel
Let me briefly explain those two concepts with examples:
Cost of ML
The cost of ML (= the opportunity cost and sunk cost of embedding ML, plus the actual cost of paying for a pre-trained model like GPT, acquiring data, and constantly fine-tuning and updating the model) should ideally not exceed the user utility.
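As a toy back-of-the-envelope check (all numbers below are hypothetical and purely illustrative), the inequality above can be sketched in a few lines: an ML feature only makes sense if the utility a user derives per request exceeds what each request costs you.

```python
# Toy sketch of the "Cost of ML vs. user utility" check.
# All prices and token counts are hypothetical, for illustration only.

def cost_per_request(prompt_tokens: int, completion_tokens: int,
                     price_per_1k_tokens: float) -> float:
    """Cost of a single call to a hosted, pay-per-token foundation model."""
    return (prompt_tokens + completion_tokens) / 1000 * price_per_1k_tokens

def ml_feature_is_viable(utility_per_request: float, prompt_tokens: int,
                         completion_tokens: int, price_per_1k_tokens: float) -> bool:
    """The feature pays off only if user utility exceeds the cost of ML."""
    return utility_per_request > cost_per_request(
        prompt_tokens, completion_tokens, price_per_1k_tokens)

# Hypothetical: 500 prompt + 300 completion tokens at $0.02 per 1K tokens
cost = cost_per_request(500, 300, 0.02)
print(f"cost per request: ${cost:.4f}")  # ~$0.016
print(ml_feature_is_viable(utility_per_request=0.05, prompt_tokens=500,
                           completion_tokens=300, price_per_1k_tokens=0.02))
```

Of course, "utility per request" is much harder to quantify in practice than a token price; the point is only that both sides of the inequality exist and should be estimated.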
Example of taking the Cost of ML into account:
Netflix’s recommender system: The initial user utility of Netflix was derived from the selection of movies and shows on demand. The subsequent utility is derived from those videos recommended by ML. Their users picked up Netflix for the broad selection and production of videos, even without getting ML recommendations initially. Due to the number of users and user engagement on the platform, Netflix could leverage the proprietary data and layer an effective recommender system over time that can even help decide what to produce and feature. Similar things helped Spotify become a role model in recommender systems despite not starting as an ML product.
Example of not taking the Cost of ML into account:
Chatbot companies pre-2021: ML performance was so bad that user utility even decreased when companies introduced chatbots in their customer service. The user experience was so bad that NOT investing in an ML feature would have been wiser. A better strategy probably would have been to start by training students/gig workers to power human-enabled customer service software and introduce ML later over time (e.g., OnePilot out of France)
Defensibility Flywheel
Let’s revisit the Flywheel with a simplified example of how TikTok uses and accelerates the flywheel effect. The Flywheel can be applied to any company in layers 2.1 and 2.2 in Elad’s framework.
Value-driving product: For starters, TikTok gives content creators tools to create videos on their mobile devices easily. TikTok mixes popular content with content that might be relevant to you based on your social graph, location, device, etc. (rule-based system/filters are sufficient; ML is not required yet to deliver high user utility!) In addition, TikTok serves users with video content from various accounts.
More user interaction: The more TikTok can serve relevant content to the user, the more users interact with TikTok (spending more time on a video or liking and following)
More proprietary/active data: TikTok measures and captures the user interaction (time spent on videos, likes, follows, etc.) and builds up a proprietary dataset representing individual user preferences
Pre-trained model: TikTok uses a variety of (external) pre-trained models for several “side”-tasks like
object detection (e.g., this video contains a dog),
speech-to-text / transcription of audio (e.g., transcribing what the video creator is talking about dogs),
classification of video content (e.g., this video gets some classification tags/labels like “dog,” “animal,” and “tips”).
The pre-trained models can be obtained through
Foundation Models/”Off-the-shelf” models: TikTok can use something like OpenAI’s Whisper to transcribe its videos or GPT to help contextualize, summarize, and categorize the videos
Data Partnerships: By partnering up with other companies or data sellers (e.g., using phone numbers of contacts to enrich their social graph), TikTok can pre-train their own model for specific tasks
Fine-tuned model: TikTok uses the proprietary data from step 3 as features to train and fine-tune its own recommender systems and enrich the user experience with even more relevant content for the user (i.e., animal lovers shall get even more cute animal Toks)
Superior user experience: Because TikTok can serve more relevant content, users spend even more time and engage even more on TikTok, and that gets the flywheel rolling
Value-driving product: More engagement is a proxy for a value-driving product. The cycle starts over again, and the Flywheel accelerates.
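The loop above can be sketched as a toy simulation (a drastically simplified illustration, not TikTok's actual system): interactions are logged as proprietary preference data, and that data is immediately used to re-rank the next feed toward what the user engages with.

```python
from collections import defaultdict

# Toy sketch of the defensibility flywheel:
# serve content -> log interactions -> build proprietary data -> re-rank content.
# All weights and signals are made up for illustration.

interaction_log = defaultdict(lambda: defaultdict(float))  # user -> tag -> engagement

def log_interaction(user: str, video_tags: list, watch_seconds: float, liked: bool):
    """Steps 2+3: capture user interaction as proprietary preference data."""
    weight = watch_seconds + (10.0 if liked else 0.0)
    for tag in video_tags:
        interaction_log[user][tag] += weight

def rank_feed(user: str, candidates: list) -> list:
    """Steps 4+5: use the proprietary data to serve more relevant content."""
    prefs = interaction_log[user]
    return sorted(candidates,
                  key=lambda video: sum(prefs[t] for t in video["tags"]),
                  reverse=True)

# A user who engages with dog videos gets dog content ranked on top next time.
log_interaction("alice", ["dog", "animal"], watch_seconds=45, liked=True)
log_interaction("alice", ["cooking"], watch_seconds=3, liked=False)
feed = rank_feed("alice", [{"id": 1, "tags": ["cooking"]},
                           {"id": 2, "tags": ["dog"]}])
print([v["id"] for v in feed])  # → [2, 1]
```

The defensibility sits in `interaction_log`: the ranking logic is trivial and copyable, but the accumulated proprietary data is not.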
What has changed with Foundation Models?
Do those paradigm shifts driven by foundation models change how companies should think about the “Cost of ML” and “Defensibility Flywheel” above?
Not really - the main point of entry and focus for most companies in layers 2.1 and 2.2 will still be to serve a value-driving product and build up as much distribution as possible, with or without ML, initially. Companies, in general, should never be ML-first. In the long term, users don’t care if a company is “Generative AI”-first, just as users today don’t really care if a product is hosted on S3.
But: While foundation models have not changed the entire equation of how companies should think about building up defensibility, there might be some points worth commenting on and (re-) emphasizing now:
The short-term Cost of ML minimizes over time
Companies can kickstart their products by building on top of external APIs or open-source models with minimized costs using “boilerplate” ML, similar to how we can leverage no-code tools for basic use cases like landing pages today. In the short term, the cost of ML will be heavily minimized, and pre-trained models will be easily made available, e.g., via OpenAI’s API. The quality of those pre-trained models/foundation models meets today’s users’ expectations (see ChatGPT’s public perception), so user utility exceeds the Cost of ML. However, the long-term costs of ML are unknown as, for example, we don’t know how dependency on, e.g., OpenAI and maturity of open-source alternatives will pan out exactly. Will this lead to a “model-tax”?
Model-tax in 2020s analogous to Cloud-tax in 2010s?
Analogy: Back when companies like Amazon and Google invested in data centers and the software on top of them - practically giving birth to the cloud - software companies stopped building the infrastructure layer on their own and started to build on top of the cloud.
Now, the estimated annualized committed cloud spend as % of the cost of revenue of a company is ~50% —> this constitutes the “cloud-tax rate” software companies are paying to play.
Similar things might happen with foundation models in the future
OpenAI or Stability AI “took on the burden” and invested in training large-scale models (even though pure model training for the open source model Stable Diffusion supposedly would have only cost $600K).
Both the model training & maintenance, as well as maintaining the infrastructure to deploy those models (especially if companies want to adopt Tesla’s Data Engine), would be too cumbersome (even if costs continue to go down over time like storage and compute did in the traditional cloud space). So it would not make sense for any random startup to train their own models from scratch; rather, they should start building on top of existing APIs (similar to how companies build on top of AWS / Vercel instead of building their own infra hardware and software)
—> Will future ML companies pay a model-tax like companies on the cloud pay a cloud-tax?
There will likely be a model-tax, but supply and demand will determine the “rate” of the model-tax. While demand will only increase, supply depends on whether Google, Amazon, and Meta get their shit together (=give the public access to their state-of-the-art models), leading to a more competitive oligopolistic market with soft commoditization tendencies (more thoughts below in the predictions section). Due to economies of scale, branding, and cornered resources, big tech could reach escape velocity, and we probably won’t see the total commoditization of foundation models, after all.
Platform dependency
While short-term cost will be minimized, long-term cost driven by platform dependency remains unclear, especially as you won’t own your model when starting off with models like GPT:
You can use GPT for inference, but finetuning only works by paying for and using their API
Plus, OpenAI can decide whom to give preferential treatment to, as it has its own venture fund AND accelerator
OpenAI invested in Descript, Spoke, Mem.ai, etc., and probably will give them better pricing and earlier access to newer state-of-the-art models
There is no guarantee that OpenAI won’t fund a company competing with yours in the future that will get access to GPT-X before you do
Even worse, OpenAI could pull off what, for example, Amazon did with its Amazon Basics brand:
Through Amazon’s very own flywheel, Amazon ended up owning the demand, collecting usage and transactional data (e.g., glance views) at a scale no one else was capable of
At one point, Amazon started to launch its own brand, “Amazon Basics,” after it could see what products worked well on its platform.
This means OpenAI could also verticalize more and go into the application layer themselves after collecting data over their API on what types of businesses work really well.
Commoditization by Open Source
To decrease platform dependency, many companies could also turn to open-source alternatives, such as Stable Diffusion, BLOOM, or PaLM + RLHF. Even big tech companies like Google could accelerate commoditization by open-sourcing pre-trained models like LaMDA to level the playing field (coupled with good incentivization mechanisms, so companies pay for GCP instead). But:
the quality of output might always be one step behind commercialized / closed-source versions from big tech,
companies generally don’t have the resources to train the models themselves,
they don’t have the resources to set up and maintain the infrastructure to support their ML efforts (there is much more than just training models to put ML into production),
So most companies will continue to pay for managed and hosted services, and we’ll continue seeing a big chunk of companies go straight to OpenAI, for example - especially because, so far, big tech is pioneering while open source is replicating (e.g., I believe multi-modal foundation models will first emerge from OpenAI, Facebook, and the likes, and not from the open-source space).
Case Study: How Lensa AI won against AvatarAI.me
Summary: While AvatarAI.me did indeed serve a value-driving product, it underestimated that the cost of ML had reached a minimum for everyone and lost to its competitor Lensa AI by Prisma Labs, which had built a value-driving product long before and won via distribution.
What happened: AvatarAI was (supposedly) the first company that could generate so-called AI avatars from your portrait pictures - your face as an Anime character etc.
Under the hood, they were using Stable Diffusion, which is open source and theoretically accessible to everyone.
They hit product-market fit immediately and made 6 digits of $ per day (!) selling AI avatars.
Prisma Labs, which had built AI-powered mobile photo-editing apps before (since 2016), saw that opportunity and quickly copied AvatarAI’s feature by leveraging Stable Diffusion, too.
Because Prisma Labs had the experience and built a more value-driving product (users prefer using a mobile app to generate AI avatars), while AvatarAI was unable to match it, Prisma Labs could catch up and overtake AvatarAI, eventually making $1M per day in revenue.
Prisma Labs uncovered a unique insight: Users had pictures of themselves (which both products need to generate those avatars) available on their mobile devices. It was cumbersome to upload those on desktop/web, in comparison, so Prisma Labs simply had a more value-driving approach.
According to AvatarAI’s founder, it was not only the value-driving product (= users prefer using a mobile app to generate AI avatars) and distribution that made Lensa win, but also its lower price point and its initiative to leverage influencers.
Lesson: There is no advantage in being the first or ML-first; you win with a value-driving product and distribution that potentially can lead to a flywheel. (Note that the game is far from over - companies like Snapchat could also integrate those features essentially for free, making Lensa AI obsolete, in turn)
Predictions and Take-aways
Many companies that will scale in the next era aren’t necessarily AI-first (but maybe fake it via a rule-based approach or even use humans in the back) - they primarily stand out due to their value-driving product and use their scale and distribution.
Objection regarding “AI-first”: You might counter that previous paradigm shifts like mobile allowed products like WhatsApp to emerge that were inherently mobile-first, and that AI will be analogous now. While I agree that there would be no WhatsApp without mobile, I would not say that being mobile-first should have been the primary part of every company’s playbook. My advice would rather be: While you should capitalize on paradigm shifts (like mobile or AI), you should still focus on a value-driving product (in WhatsApp’s case: deliver text messages from A to B instantly and as conveniently as possible)
Objection regarding “distribution”: You might also counter that TikTok overtook other social media platforms without any initial distribution. You might even quote, “users came for the tools and stayed for the network.” While I also generally agree here, I would say TikTok still became successful due to distribution - it was just not their own, as their (and Musical.ly’s) watermarked content initially grew on existing social media platforms like Instagram and YouTube
Similarly, Google won’t lose “Search” - as many people are trying to predict now. They will continue to dominate due to their existing distribution, but other search engines like Bing and You.com will gain more market share. In addition, ML won’t stop with Search (in fact, Google implemented BERT in 2019 already) at big tech companies - they will embed ML in all their products, like Google’s Workspace and Microsoft Office.
While many new ML companies today are only UI wrappers on top of foundation models with little defensibility, the companies that will win are the ones that nevertheless own the interface to the user.
For most cases, the interface to the user is the browser and the address/search bar - which again is dominated by Google - expect them to make a splash here.
And Google has partnerships with all the other browsers as the default search engine, which is the gate to the web. This is not easy for outsiders to crack.
Google is more low-key because the bar for them to deploy features and products is higher than for less established companies. Let’s not forget: They are behind many innovations in the ML space, such as Transformers, Federated Learning, BERT, and PaLM. Despite not being a first mover, they have proven they can build products and exploit their distribution and bundling power.
Many other companies that have a chance at winning will try heavily exploiting the primary interface/browser to the user, like Adept.ai.
It will be interesting to see whether companies can even go more meta and own the UI on top of the browser - my bet is on startups like Raycast and incumbents like Apple that will try to own the “query layer” (Raycast is doing it with an application launcher approach) on top of the browser layer.
Some ML-powered companies will win, but less so because of models alone and more so because of their engineering and algorithmic savviness, e.g., by stitching multiple AI systems together (look at how Adept uses a mix of language models and reinforcement learning to solve tasks on websites end-to-end, e.g., buying something for you on Craigslist; also, imagine Zoom’s opportunity to tap into the asynchronous market by combining transcription + summarization features) or by developing the right infrastructure to support their ML initiatives (think Tesla’s Data Engine). Some of them will naturally evolve into a layer 1 / infra company, too (like RunwayML with its contribution to Stable Diffusion, or Adept, which is both a layer 1 and a 2.2 company)
New job opportunities: Consultancies and software companies with a heavy service part will find a new revenue stream in helping other companies embed Foundation Models in their business. We’ll also see new types of influencers, from famous prompt engineers like Riley Goodside to artists like Fabians.eth, next to established influencers like Roon, who will help the tech become mainstream. While I am not sure if prompt engineers will become a mainstream job role, we’ll see a rise of more sophisticated (gig economy) workers helping improve and finetune the models through instruction tuning and giving feedback/reward (think reCAPTCHA).
Companies that will win and build up defensibility over time have thought about these things from the start:
How to set up an infrastructure to gather proprietary active data and feed it back into model training, e.g., how Tesla does it with their data engine
Don’t overanalyze and build quickly with, e.g., OpenAI’s models, but keep an eye out for alternative layer 1 companies to cope with potential platform risk
E.g., how (pre-)seed startups use AWS cloud credits but also look to use GCP free credits and Vercel’s free tier
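In code, "keeping an eye out for alternatives" can be as simple as a thin provider abstraction, so the application is never hard-wired to a single model vendor. A minimal sketch (the interface and the stub adapters below are hypothetical; real implementations would call the vendor's SDK or a self-hosted open-source model):

```python
from abc import ABC, abstractmethod

# Minimal sketch of a provider abstraction to mitigate platform risk.
# The adapters are stubs; in a real app, each would wrap a vendor API
# (e.g., OpenAI) or a self-hosted open-source model.

class TextModelProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class OpenAIProvider(TextModelProvider):
    def complete(self, prompt: str) -> str:
        # a hosted API call would go here
        return f"[openai] completion for: {prompt}"

class OpenSourceProvider(TextModelProvider):
    def complete(self, prompt: str) -> str:
        # a self-hosted open-source model would run here
        return f"[oss] completion for: {prompt}"

class CopywritingApp:
    """Application code depends only on the interface, not on the vendor."""
    def __init__(self, provider: TextModelProvider):
        self.provider = provider

    def write_tagline(self, product: str) -> str:
        return self.provider.complete(f"Write a tagline for {product}")

# Swapping vendors becomes a one-line change, not a rewrite:
app = CopywritingApp(OpenAIProvider())
print(app.write_tagline("running shoes"))
app.provider = OpenSourceProvider()
print(app.write_tagline("running shoes"))
```

This does not remove the dependency - output quality, pricing, and latency still differ per provider - but it keeps the switching cost low if the model-tax rises.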
Thus, many Layer 1 companies that will emerge/establish are the ones that help companies easily
discover the suitable pre-trained model for the specific task (like HuggingFace)
connect models to proprietary (siloed) data
train their own (large) language models
gather proprietary data
feed proprietary data back into the model
—> essentially help democratize Tesla’s data pipeline for everyone
Proprietary data from different domains will also generate new domain-specific models - those, in turn, can be made available to others and be commercialized, making Andrew Ng’s world of “10,000 custom AI models” or Chamath’s world of “model-as-a-service” come true.
VCs will (over)fund layer 1-2.2 companies, which is good and helps develop and adopt that space. Still, we’ll see many companies rise and fall like AvatarAI, as many capabilities will be bundled up by companies that already have distribution.
We’ll stop speaking about AI in its current form because of the “AI effect”: “Every time we figure out a piece of it, it stops being magical; we say, ‘Oh, that’s just a computation.’” AI will be ubiquitous and not worth mentioning anymore.
There is no way to stop AI from happening - our society will change around it.
e.g., teachers can’t stop students from using Jasper, but teaching and examination methods will adapt to accommodate the use of AI (e.g., through synchronous conversational examinations). Students will always be lazy but smart enough to work around AI detectors.
e.g., roads will eventually be (re-)built in a way that is optimized for autonomous (electrical) vehicles, i.e., you will see fewer / different street signs and less space for car lanes due to improved utilization of space
Unfortunately, AI did not write this post (I tried Jasper), but at least it corrected it (thanks, Grammarly, the OGs in this space!). Thanks to Sebastian Schaal, Milos Rusic, Leigh Marie Braswell, Jeannette zu Fürstenberg, and Paula Hübner for your feedback!