Want your AI/ML project to succeed? Better pay close attention to the infrastructure on which you plan to run it. Research Director Nick Patience returns to talk about the specialized needs of AI/ML with host Eric Hanselman. The more organic nature of AI/ML development demands specialized infrastructure support. Training is not happening only in datacenters, and the volumes of data involved create their own pressures. They dig into the need for data access and, once again, wind up at the edge.
Transcript provided by Kensho
Welcome to Next in Tech, an S&P Global Market Intelligence podcast where the world of emerging tech lives. I'm your host, Eric Hanselman, Principal Research Analyst for the 451 Research arm of S&P Global Market Intelligence. And today, we will be discussing AI infrastructure with Nick Patience, the Research Director for Data, AI and Analytics. Nick, welcome back to the Next in Tech podcast.
Thanks for having me, Eric. Always a pleasure.
When we think about a lot of AI and machine learning and all of the aspects that are wrapped around them, it's easy to overlook some of the pieces that are necessary to actually make all this possible. I think there's a sort of expectation that you just do AI and ML and off you go. The trick is doing all the other bits and pieces to make it all happen, building models and all the rest. But infrastructure is a big part of this.
Yes, it is. I mean -- I guess people may think, why bother focusing on AI infrastructure? Is AI infrastructure any different from regular IT infrastructure? And we found that it is. Our latest Voice of the Enterprise AI & Machine Learning survey focused on infrastructure.
And we found that 2/3 of organizations plan their AI infrastructure differently from their other IT infrastructure. So it's definitely a thing. Whether in 5 years' time it will be any different, I don't know. And just to be clear, what we're talking about as infrastructure here could mean anything. It could mean a complete public cloud strategy. It could mean everything on-premises, or it could mean a hybrid mixture of those environments.
And obviously, hybrid is a big thing. I know you've covered it in various other editions of this podcast. So we're seeing a real mix of those things. As always, there are variations across industries. And so, yes, we find financial services and some other industries are obviously somewhat more prone to use on-premises than cloud. But it differs.
Well, it brings up that general idea that, yes, you can certainly do AI in various different forms, but there are all of the different components. I mean, you were talking about the fact that AI is not just one magical, conceptual thing. This really is a whole process, a whole set of capabilities that all have different requirements and different needs of the systems on which they're going to run.
Now you're going to build models. You have to train the models, then you actually have to run the models. But the environments for training and running the models have very different requirements, to the extent that you've actually got processor manufacturers who are tuning processors to have additional capabilities. You've got specialized silicon that's being brought to bear on these problems. A lot of different pieces for all of the different parts of what is this much larger ecosystem.
Yes, there's a lot to think about. And it is still at that stage where people are having to think about it. If you are a data scientist, you care almost as much about bias in your model as you do about what processor you are running on, because these processors are expensive, the resources are expensive. And the way that your model works and interacts with a GPU or CPU or TPU or anything like that really matters. So for the foreseeable future, this is something that people in various different roles do care about and need to care about.
Well -- and there's the piece of both figuring out how you train the model. And during training, you've got to have access to lots of data. When you're actually running the model, you're running in a different environment. You have to have ways to go distribute the models out to the points at which they're being used. And this winds up being not just any one aspect, but really a fairly complex set of systems that you have to build together to work right.
Yes, indeed. I mean, as ever, when AI and machine learning -- like any previous wave of computing -- becomes more mainstream (and it's not mainstream by any means yet) and becomes more a part of the regular software development process, all those interacting elements, whether software, hardware or people, have to be considered.
I mean, one of the things I always think about with machine learning, versus the way we've been developing software for 50-plus years, is that it's almost organic. Previously, you had data stored in relational databases -- nice, structured data. And then you had if-then-else statements, essentially, as the basis of programs: a set of rules that mirrored business processes. That's all fine, and it works until somebody needs to change the process or bring in different kinds of data.
The thing with machine learning, obviously, is that it is dependent on the data that trains the models to start with, and then on the new data that's coming through all the time. And that can mean the model can decay, it can drift, it can have all these other things going on, which means it needs monitoring, constantly.
And so this is why this is quite different. This is -- it learns. This is machine learning. And so as the model learns and adapts, the organization needs to be aware of how that is adapting because ultimately, the model is making predictions. And these predictions could be absolutely integral to your business. And they could be driving decisions that you are making in the real world.
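The kind of constant monitoring Nick describes is often done by comparing the distribution of a model's inputs at training time against what it sees in production. A minimal sketch using the population stability index, a common drift heuristic -- the data, threshold and function are illustrative assumptions, not from the survey:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Compare a feature's distribution at training time vs. in production."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    # Fraction of samples per bin, with a small epsilon to avoid log(0).
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train_values = rng.normal(0.0, 1.0, 10_000)  # feature seen during training
live_values = rng.normal(1.0, 1.0, 10_000)   # production inputs have shifted

psi = population_stability_index(train_values, live_values)
if psi > 0.2:  # a common rule-of-thumb threshold for significant drift
    print(f"PSI = {psi:.2f}: significant drift, consider retraining")
```

A PSI near zero means the live inputs still look like the training data; once it climbs past the threshold, the model is making predictions on data it was never trained on, which is exactly the decay problem described above.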
And so it's quite a different environment from the regular software development we've had over the decades. But gradually, and we're seeing this with organizations, they are folding in the peculiar elements of machine learning, and roles such as data scientist and machine learning engineer are becoming a more and more common part of the software development process.
Well -- and it's something where you also have to ensure that you've got the various points of training and points of use built out to accommodate those capabilities. I mean, one of the big things we talk about as the next stage of ML implementations is being able to actually get those models closer to the point of data creation, which brings us to the term we come back to, it seems, in every episode of the podcast: the edge. AI and ML at the edge have a whole set of constraints that you've got to build out infrastructure to support as well.
Yes. Exactly. And the edge can mean, as you've said before in many other episodes, many different things. It could be a device. It could be the enterprise edge or the network edge. It could be a server sitting under your desk. It could be a laptop. But it also could be sensors on pipelines. It could be point-of-sale devices in retail stores. It could be all those kinds of things.
Now where AI is concerned, most of the interest, ultimately, is around inference at the edge -- making the predictions at the edge. But there's a lot more to it than that. For instance, as you said, a lot of the interest is around collection from the edge, because those edge devices are often the ones that throw off data. So we see smartphones and end-user devices being extremely popular collection points, but then we get a lot of industry-specific collection points.
So we find in our survey that in manufacturing, for instance, the top edge source is factory and assembly equipment, while in retail, it's point-of-sale and inventory management devices, and in the energy sector, the top one was environmental sensors. All these kinds of devices throw off different types of data, but one thing they all have in common is a need to avoid high latency -- long periods for the data to get from the edge to where it needs to be.
We find different kinds of challenges at the edge, but we find organizations also embracing it. For instance, if you simplify the ML process down to data collection and prep, then training and then inference, we found in our survey that 71% of those surveyed do at least some training at the edge. And again, that edge could be a server, or it could be a very, very small device.
And then we found 81% use 2 or more different venues for model training, because there are lots of other venues. You could have distant data centers. You could have nearby data centers. You could have the actual device itself. There's lots of variation as to where this work is getting done. But it's a mixture of that data collection, data prep, training and, ultimately, inference.
And we all know the stories around autonomous vehicles and that kind of thing, where you want to press the brake and have it react immediately. So latency is a big issue. But there are a lot of other issues around that. There's a lot of need for more high-performance computing, we find, across the board as well.
Well -- there's the issue of latency: you want to be able to get results in a sufficiently timely fashion, with your application close to that source of the data -- the vehicle, what have you. But there's that other aspect, which is that there are all these stories of autonomous vehicles, jet engines, power turbines throwing off gigabytes and terabytes of data. The practical matter is that it's just really hard to consider backhauling all of this data to a data center someplace. You have to have processing at the point of use, at the edge, simply to deal with the data volumes you're working with.
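The backhaul point can be made with simple back-of-envelope arithmetic. All the figures below -- sensor counts, data rates, uplink capacity -- are hypothetical illustrations, not numbers from the survey:

```python
# Why backhauling raw edge data to a data center is often impractical.
sensors = 500           # data-producing devices at one hypothetical edge site
mbps_per_sensor = 2.0   # raw output per device, in megabits per second
uplink_mbps = 300.0     # the site's uplink back to the data center

raw_mbps = sensors * mbps_per_sensor
# Convert megabits/s to terabytes/day: /8 -> MB/s, /1e6 -> TB/s, *86400 -> TB/day
daily_tb = raw_mbps / 8 / 1e6 * 86_400

print(f"raw output: {raw_mbps:.0f} Mb/s (~{daily_tb:.1f} TB/day) "
      f"vs. uplink capacity: {uplink_mbps:.0f} Mb/s")
```

With these assumed figures, the site produces over three times what its uplink can carry, so filtering, aggregation or inference has to happen at the edge before anything crosses the network -- which is the pressure behind the networking result discussed next.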
Yes. Exactly. As you said, people throw out those numbers around data production without necessarily thinking about what the next stage is and what implications that has -- in this case, for machine learning, but also for any other kind of analytics process. And we found organizations are pretty aware that there are some special needs that AI and machine learning and those large data volumes throw up.
So we always ask organizations, "What kind of infrastructure resources, which specific ones would improve the performance of your AI and ML production workloads?" And for the last 2 years in a row, higher-performance networking has come top. And that speaks to that issue that you talked about there around the movement of data.
Yes. People are seeing stress. They're in that problem of: we created all the data, and now, geez, we don't have the networks to actually get it back to process it.
Exactly. Exactly. And that high-performance networking resource was a very close second this year. The top one this year was actually cloud-based accelerators for training and inference. And by accelerators here, we mean primarily GPUs, but also, increasingly, other specific ones from specific vendors. But these are cloud-based ones.
And there's a lot of interest in this area. There's also, obviously, a lot of interest in on-premises GPUs, but they're more common, I guess, and they've been around longer. And when people say this is the thing they think would improve things the most, that doesn't mean they're not using it -- a lot of people already are -- but some of them have specific concerns. We asked people, "What are your concerns around accelerators, whether in the cloud or on-premises?" And reliability was top, performance was second, and scalability was actually quite a distant third.
So there's a lot of interest. There's a little bit of skepticism around it, but we think the cloud-based accelerator market is here to stay and will grow extremely rapidly. And I know our colleague, John Abbott, is working on the new version of his report on accelerators in general, tracking more than 50 start-ups plus all the big companies selling accelerators, whether they sell them directly or sell them to hyperscale cloud vendors. There's a lot going on here, and it's one of the most interesting areas of AI infrastructure, I think.
It is the rebirth of a lot of that silicon support for AI, really a silicon spring in the ML and AI world.
Yes, indeed. I mean, the shortages are well documented, as are the challenges of getting the few companies that can actually make these things to produce your silicon and turn it into some sort of processor. But yes, we'll get there eventually.
And I guess the interesting question from our point of view as analysts is: will the smaller specialist companies that have been well funded in the last 4 or 5 years be the ones to get there? Or will it be the ones that can leverage their relationships with the companies that actually make the chips for them, and jump ahead of the queue, as it were? It will be really interesting to see that play out over the next year or so.
Yes, fab access starts to become one of the critical parts of success in moving this forward. So we've been talking about a lot of the piece parts, but ML and AI really also have this big operational aspect that's just as critical to ensuring that this whole process actually works and gets results that are meaningful. What does it really take to get this into production, and what are the operational pieces of it?
Yes. As you say, operational is the key word. A concept has been introduced in the last 2 or 3 years called MLOps -- machine learning operations. And it depends on who you ask. It used to be a term that referred to the stage from production to monitoring and management, and that was very specific. It didn't concern itself with the initial data access, data preparation, labeling, training and all that kind of stuff.
Increasingly, as more and larger vendors have got into the space, it now essentially means the end-to-end life cycle of models: from an idea at the beginning, right through data identification, preparation, labeling, training of the model, deployment of the model and monitoring of the model.
But in terms of the bottlenecks organizations come across, we find almost 2 in 5 machine learning projects are abandoned in proof of concept, before they get to production. And that's a pretty high number.
Wow. Yes. If you look at sort of the normal IT project success rate, yes, that's...
And it speaks to that issue I was talking about earlier: this is different. This is a different way of writing software, to put it very simplistically. But there are some common issues that people come up against, and it's mainly around data access. If we look at the extreme example, more than half of those we surveyed who were unable to get the data they needed to build their machine learning models abandoned all of their ML projects.
I mean, it makes sense, because if you don't have the data, you don't really have a model. But we found within that, there's a lot of nuance. So although the average was 39% -- almost 2 in 5 -- there's a lot of industry nuance within that.
But alongside data access, those facing difficulties around model explainability have the second-highest rate of project abandonment. So are you able to explain how your model is making a prediction to your regulator, for instance? I always think of that: you're sitting in a room, or on a remote meeting, with your regulator, who is quite likely to be a lawyer and unlikely to be a data scientist or to have a computer science degree. You need to be able to explain how it's making the decisions it's making.
Whereas in the early days of experimentation -- say, 5 or 6 years ago -- that wouldn't have been an issue, because data scientists were experimenting. They were building and training models and throwing off predictions, and people were very interested in the predictions they were making, but they weren't really asking too many questions about how they got there. Those days are essentially over.
And we're finding model explainability, bias detection -- all these issues are becoming huge, partly because, again, this is heading toward the mainstream. We're not there yet by any means, but it's going in that direction. It's making increasingly important decisions in regulated markets.
And we think, ultimately, the AI sector itself will become a regulated sector. Technology in general, as we know, is moving that way. But AI has some very specific examples of potential regulation, which is perhaps a subject for another podcast.
Well, yes, thinking about the regulatory aspects -- given that, as you're describing it, it is such an organic process, it becomes a much more complex thing to govern and explain. But there are the ESG aspects of this. When thinking about the environmental, social and governance aspects, there is, of course, so much focus of late on the explainability of so many different models, but there are also environmental aspects coming out of these environments as well.
Yes, indeed. Well, this is the first time in this survey that we actually asked ESG questions. We asked one of each: one E, one S and one G. The E question asked people how concerned they are about the environmental impact, as you say, of executing, building, training and deploying ML models.
The S question was: how concerned are you about machine learning being used for immoral or unethical purposes? And the governance question was really a government question: how concerned are you about government regulation of AI and ML initiatives?
And we found that as organizations, and the people within them, get more experienced -- which we measure by whether they have ML in production, versus ML in proof of concept, versus neither but planning to adopt it in the next 12 months -- as they move toward having it in production, they become more concerned about all of these issues.
Governance is the one they're most concerned with: we found 44% of people with machine learning in production are very concerned about AI regulations. Then 41% are very concerned about the ethical or moral issues, and 38% were very concerned about the environmental issues. And I think that makes sense, because the governance question and the morals or ethics are probably easier for people to understand than the environmental ones.
But obviously, the models themselves are not particularly large; it's the data sets that are very large. If you have 50 models in production, it's not really an issue. But if you have 15,000, 30,000, 50,000 models in production, there's a lot of compute resource that's going to be needed -- and, to come right back to the beginning, that's regardless of whether it's on-premises, purely in the cloud or hybrid. And those compute resources obviously have an environmental impact.
And we're finding people are not necessarily completely aware of how to measure that yet. I think sustainable AI and AI sustainability are going to be 2 sides of a very interesting coin in the next 5 or 10 years. And it's something we'll be tracking very closely, because, as you say, everybody here at S&P and everyone at 451 is very interested in the ESG aspects of everything we do, and I see exactly the same thing in AI and machine learning.
We're in the relatively early days of being able to understand IT resource impacts directly -- of working on models to improve our understanding of the carbon footprint of various IT processes -- but here's one that gets identified as particularly computationally intensive and, of course, an area in which there's going to be more concern.
Exactly. And that's -- I mean, indeed, that's the entire reason we have a survey about AI infrastructure because it is different as we saw at the beginning, 66% of organizations plan for it differently, and they're arguably right to do so.
Wow. Well, I think we've come full circle in that discussion of why it's special and what the impact of AI and ML infrastructure is. As you said, there's a lot more we could dig into on a lot of these impacts, but we are at time for this episode. So Nick, thanks very much for being back. And clearly, we need to get you back to dig into a lot of the different aspects that go beyond the nuts and bolts of the infrastructure piece we talked about today.
Yes. Thank you very much for having me again, Eric. I'd love to come back.
And thank you very much to our audience for staying with us. I hope that you will join us for our next episode, where we will be touching on a whole range of different topics, because there is always something Next in Tech.
No content (including ratings, credit-related analyses and data, valuations, model, software or other application or output therefrom) or any part thereof (Content) may be modified, reverse engineered, reproduced or distributed in any form by any means, or stored in a database or retrieval system, without the prior written permission of Standard & Poor's Financial Services LLC or its affiliates (collectively, S&P).