A use case where a typical enterprise would need to have data in the cloud to fully enjoy the benefits of AI

Can anyone explain this tweet?

Talking to a number of larger enterprises and vendors at AWS ReInvent and one thing is clear: “Your data is your differentiator!”

Why? When it comes to AI it’s really hard to move your data to your models. It’s expensive to move data (egress fees…). It’s complicated to move data (lineage breaks, governance policies need to be updated, etc). The path of least resistance is bringing your models and AI applications to your data (and not the other way around).

It becomes even more important to get your data in order so you can capture value from the current wave of AI. As many have said, you don't have an AI strategy without a data strategy.

To me this is very counterintuitive. For regular ML, the kind of workload you'd use Spark or Hadoop for, then obviously, yes: you may need hundreds of terabytes of data in the cloud to train a model.

But for (generative) AI, the egress fees and the data requirements are simply way lower. The cost is the API call to OpenAI or AWS Bedrock, not the egress fees or bandwidth over the internet. GPT-3 was trained on roughly 500 GB of data, which costs about $40 to retrieve from AWS.
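
A minimal sketch of the arithmetic behind that figure, assuming AWS's standard internet data-transfer-out rate of roughly $0.09/GB (the first-tier published price; actual rates vary by region and volume):

```python
# Back-of-the-envelope egress cost for a GPT-3-sized corpus.
# Assumption: ~$0.09/GB, AWS's standard first-tier egress pricing.
EGRESS_USD_PER_GB = 0.09

dataset_gb = 500  # roughly the size of GPT-3's filtered training corpus
egress_cost = dataset_gb * EGRESS_USD_PER_GB
print(f"One-time egress for {dataset_gb} GB: ${egress_cost:.2f}")  # ~$45
```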

I see no reason why an enterprise couldn't be wall-to-wall on Amazon Web Services and, for its AI needs, call OpenAI or GCP over the internet for GPT-4 or Gemini.
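
The plumbing for that cross-cloud setup is trivial. Here is a minimal sketch, assuming the current OpenAI Python SDK (v1-style client); the prompt is a made-up example:

```python
# Sketch: an app running on AWS (EC2, Lambda, etc.) calling OpenAI's
# hosted GPT-4 over the public internet. Nothing here requires the
# data or the model to live in the same cloud.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        # Hypothetical prompt for illustration only.
        {"role": "user", "content": "Summarize this quarter's sales notes."},
    ],
)
print(response.choices[0].message.content)
```

The marginal cost that scales with usage here is the API call itself, not data movement.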

Obviously, I understand the contractual constraints (you can spend your AWS credits on Bedrock) and some safety concerns. But these pale in comparison with the usual reasons enterprises and startups go to the cloud: CapEx to OpEx, division of labor, serverless, hosted services, better access to SaaS and PaaS applications, scalability, cheaper development costs, etc.


He's saying it would be prohibitively expensive to sample, clean, update, transform and store proprietary data the way you have to in order to train AI models, but that this data work is critically important to the success of those models.

When it comes to the ETL and data pipelines typically required to turn some original, internal dataset into something usable in a reliable, repeatable way, the $40 it costs to retrieve the data is the wrong cost to be calculating.

You need a good data strategy to get any dataset to the point where it can confidently be used as training data, because these preliminary steps require a lot of operations and a lot of compute. These operations and this strategy are what give you confidence in the model, or at least the knowledge of what to train on in order to improve it.
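
To make "preliminary steps" concrete, here's a toy sketch (the data, column name, and threshold are entirely made up) of a couple of cleanup operations a real pipeline would chain together, at far larger scale and with many more steps (dedup, PII scrubbing, schema checks, and so on):

```python
# Toy illustration of pre-training data cleanup with pandas.
import pandas as pd

raw = pd.DataFrame({
    "text": ["Acme Q3 notes", "Acme Q3 notes", None, "ok", "Beta launch recap"],
})

clean = (
    raw.dropna(subset=["text"])                 # drop empty records
       .drop_duplicates(subset=["text"])        # deduplicate
       .loc[lambda df: df["text"].str.len() > 5]  # filter trivial rows
)
print(clean)
```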

This is fairly standard and has been known for a long time. ETL and AI models were/are being commoditized, whereas data you collect as a business is priceless.

He is also saying that it's tricky to move data around, but not AI models - and he's not wrong.
