AMA With Adrian Loteanu

  1. In your opinion, what is the greatest performance challenge for the current state of ML?
  2. I have experience in Performance and ML Engineering. Any advice on getting a position as an ML Performance Engineer?
  3. What trending architectures does Intel invest in for 2030s technology?
  4. What are the future architectures that will cause a paradigm shift in the face of market trends?
  5. Is there a possibility of breaking TSMC’s dominance and monopoly in the future?
  6. What is the right way of optimizing the workload of ML and AI on the hardware level?
  7. Is it possible in the future that AI can be used in the interview process?
  8. I learnt CoreML is the key ML platform at Apple. So I’m wondering what’s the advantage of CoreML compared with other open source platforms, such as Google’s TFLite or PyTorch Mobile?
  9. How do we maintain class balance in data as defective samples are often fewer than good ones?
  10. How do you identify performance bottlenecks in machine learning workflows and algorithms?
  11. With the emergence of Generative AI and high computing availability will we get closer to the Bayes limit?
  12. If I am an aspiring machine learning engineer, should I focus more on the skill in deploying a model in the cloud compared to the ML algorithms?
  13. Can you share an example of a challenging performance issue you encountered in the past and how you resolved it?
  14. In the field of machine learning, performance is often a trade-off with accuracy. How do you strike the right balance between performance and accuracy when optimizing ML models for Apple devices? Are there any specific challenges you face in this regard?
  15. Emerging Trends: As an ML Performance Engineer, what emerging trends in machine learning or deep learning are you most excited about and how do you think these trends will influence the performance engineering landscape?
  16. Adapting ML Models for Performance: Given that the landscape of hardware platforms for running ML models (like GPUs, ASICs, and FPGAs) is continuously evolving, how do you approach the challenge of optimizing and adapting ML models for these diverse hardware architectures?
  17. What are your thoughts regarding distilled models?
  18. How does Apple balance between on-device machine learning and cloud-based processing, and what are the factors that determine the decision to use one approach over the other?
  19. As machine learning models become more complex and data-intensive, how do you ensure a balance between performance and accuracy in your optimization efforts, especially when deploying models on devices with varying computational capabilities?
  20. Balancing Performance and Accuracy: In many cases, optimizing for speed can lead to a loss in model accuracy. How do you strike a balance between maintaining the accuracy of ML models and improving their performance?
    1. With current models with billions of parameters, do you think the performance gains will come from software, hardware or a hw/sw combined solution?

      In your opinion, what is the greatest performance challenge for the current state of ML?

      Right now ML models are growing in complexity faster than we can improve hardware, so a big challenge is creating new computer architectures specifically optimized for ML so that we can keep fueling the AI revolution.

      It is particularly hard because the ML landscape is rapidly changing and the hardware has to optimize for a moving target. A secondary issue is that we can scale compute power faster than we can scale bandwidth to memory.

      I’d say the biggest challenge is increasing the efficiency of ML hardware (compute per unit of energy), and the main way to do that is to figure out how to minimize data movement (inside the computer) and to move the data we do need as efficiently as possible.

      Dataflow architectures are a big step in this direction (and they’ve been around for a while). New memory technology helps (HBM, GDDR) and on the horizon there are interesting ideas such as photonic computing and, possibly in a more distant future, quantum computers.

      There is also research in neuromorphic computing that can give big efficiency gains for some applications, and there are techniques like smart prefetching that can bring data close to the ALUs right before it is actually needed.

      The world’s top supercomputers at the moment exceed 1 exaflop and, according to estimates, they exceed the human brain in raw compute power, but they fill a whole building and consume megawatts of power whereas the human brain runs on about 20 W obtained from sugar.

      We need to increase compute efficiency and density in order to continue scaling the models we can run.
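
      To make the compute-versus-bandwidth point above concrete, here is a rough back-of-the-envelope roofline-style sketch in Python. The peak compute and bandwidth figures are purely illustrative assumptions, not the numbers of any real product.

      ```python
      # Rough roofline-style estimate: is a matrix multiply compute-bound or
      # memory-bound on a hypothetical accelerator? All hardware numbers are illustrative.

      PEAK_FLOPS = 100e12       # assumed peak compute: 100 TFLOP/s
      PEAK_BANDWIDTH = 1e12     # assumed memory bandwidth: 1 TB/s

      def matmul_analysis(m, n, k, bytes_per_elem=2):           # fp16 operands
          flops = 2 * m * n * k                                 # multiply-accumulate count
          bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # best case: touch each matrix once
          intensity = flops / bytes_moved                       # FLOPs per byte moved
          ridge = PEAK_FLOPS / PEAK_BANDWIDTH                   # intensity needed to saturate compute
          bound = "compute-bound" if intensity >= ridge else "memory-bound"
          attainable = min(PEAK_FLOPS, intensity * PEAK_BANDWIDTH)
          print(f"{m}x{k} @ {k}x{n}: {intensity:7.1f} FLOP/byte "
                f"(ridge {ridge:.0f}) -> {bound}, attainable {attainable / 1e12:.1f} TFLOP/s")

      matmul_analysis(4096, 4096, 4096)   # large square GEMM: compute-bound
      matmul_analysis(1, 4096, 4096)      # GEMV-like step (e.g. LLM token generation): memory-bound
      ```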

      I have experience in Performance and ML Engineering. Any advice on getting a position as an ML Performance Engineer?

      The combination of ML and performance is very sought after in the right places. It’s hard to find people skilled at both. To clarify something, probably all ML Engineers have to keep performance in mind when designing their models and do some level of performance work (e.g. quantization, using sparsity, adjusting the network).

      In my experience when “performance” is explicitly mentioned in the role it is because you work either on hardware (computer architectures for ML) or very low level software (e.g. optimizing math primitives for a given computer architecture).

      Such roles are typically available at companies that have a hardware component to their business or that run large-scale ML operations in big datacenters and therefore need to consider performance and efficiency. It is a niche role in the grand scheme of things, as such companies are not very common.

      In the US in places like Silicon Valley you will find a lot of roles nowadays which pertain to ML Performance specifically as there is a large concentration of companies that focus on ML hardware.

      Examples include Google, Apple, NVidia, Intel and AMD, as well as an assortment of startups like Cerebras. If you are skilled in both categories (performance + ML) I advise you to apply for roles at these companies or similar ones that suit your interests.

      Depending on the specific role you will have to prepare for the interview differently but generally a solid knowledge of coding, ML network architectures and computer architecture will go a long way.

      The companies that need to fill these roles generally have a hard time finding adequate candidates skilled in both these categories at the moment so you have this going for you.

      So apply on the company websites and maybe build solid LinkedIn and Readyfly profiles so that recruiters are impressed and call you for interviews. If you are outside of the US I think it is harder to find such a role (though not impossible), and if you require visa support the bar for the interview will probably be set higher, but this is a secondary consideration.

      With current models with billions of parameters, do you think the performance gains will come from software, hardware or a hw/sw combined solution?

      If by “performance” you mean how intelligent the model seems to be or how well it performs on a given task, I’d say that there are still gains to be made by making models larger, though we are starting to see diminishing returns. This is pretty much the cutting edge of what we know, so stay tuned to the scientific literature for updates.

      The reality is that we can make models much smaller than, for example, GPT-4 that perform just as well or nearly as well. It has historically happened that a model would solve a problem that could not be solved before, and once we had that model we could create more refined ones that work just as well or even better but that are a lot smaller and more efficient.

      So there are gains to be made both by refining the network architectures (software) and by making them bigger. For the latter we of course need hardware support; we are already close to what is feasible on modern computer architectures. LLMs can be challenging because they can’t even fit into the RAM of most computers.

      Rumors are that GPT-3.5 needed to run on 4 NVidia GPUs not because of a lack of compute power but because it did not fit into the memory of a single GPU. And of course training it cost over 30 million dollars.
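
      To illustrate why memory capacity, not just raw compute, becomes the limiting factor, here is a quick back-of-the-envelope sketch of the weight memory a model needs at different precisions. The parameter counts are illustrative, not official figures for any specific model, and activations and KV caches need memory on top of this.

      ```python
      # Back-of-the-envelope weight memory of a model at different precisions.
      # Parameter counts are illustrative; activations, optimizer state and
      # KV caches require additional memory on top of the weights.

      BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

      def weight_memory(params_billions):
          for dtype, nbytes in BYTES_PER_PARAM.items():
              gib = params_billions * 1e9 * nbytes / 2**30
              print(f"{params_billions:>5.0f}B params @ {dtype}: {gib:8.1f} GiB")

      weight_memory(7)      # a small open model: fits on a single consumer GPU at fp16
      weight_memory(175)    # a GPT-3-class model: hundreds of GiB, needs several accelerators
      ```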

      So scaling the networks requires progress in hardware, and I believe big gains are possible from this, but we also need to refine the network architectures in order to solve new problems effectively. Refining the network architectures requires high performance hardware so that researchers can train various options and look at the results in reasonable amounts of time and at reasonable cost.

      At the moment I believe hardware is the main bottleneck, and if we give researchers more powerful hardware they can scale the networks in both size and efficiency (by tuning the network architectures) in order to make them a lot smarter and more efficient for a given level of problem complexity.

      I also believe that it may be possible to find more efficient ways to train our networks than backpropagation and gradient descent, though this is an active area of research. While training is not as visible to the end user of the network, if the network takes too long to train it is impractical for researchers to experiment with it.

      What trending architectures does Intel invest in for 2030s technology? What are the future architectures that will cause a paradigm shift in the face of market trends?

      Disclaimer: I do not speak for Intel or any other company; what I say here are my own opinions and views and is based exclusively on publicly available data. Product roadmaps in the CPU (and GPU or XPU) industry are generally no longer than about 5 years.

      Beyond that there is too much uncertainty with regards to the technologies available and the requirements to be able to define what a product would look like.

      Even at the 5 year horizon it is hard to give more than some general targets for performance and efficiency and high level plans for how to achieve them. 5 years in advance a product exists as some design documents, spreadsheets and possibly high level simulations.

      One problem is that the software landscape changes over time and it is hard to say right now what kind of software users will want to use in 5 years so that you can optimize for it. Intel actually has futurists hired to try to predict this.

      This is how 5-6 years ago they started investing in AI acceleration capabilities, which have now appeared in high-end Xeon servers despite AI not being such a big requirement at the time. We don’t know how people will use their computers and what technology will be available in 2030 and beyond, so I doubt any CPU/GPU company has any products defined. That said, there is forward looking research, often done in collaboration with universities around the world.

      From methods of making smaller transistors to efficient general and problem-specific architectures to technologies like neuromorphic computing, photonic computing, spintronics, carbon nanotubes or quantum computing, there is a lot of effort invested in pathfinding throughout the industry.

      Once we get closer to the 2030s we can evaluate what has the best potential (and is feasible to implement) and start to define products based on the capabilities that the technology has.

      I believe that in the next decade we will see a push towards tightly integrated multi chip package systems with fast and efficient integrated memory as well as special purpose accelerators.

      Transistor scaling will continue to help with improving performance and efficiency and I think we will see some level of discarding legacy capabilities and re-organizing computer architectures and microarchitectures to be faster, more scalable and more efficient.

      Is there a possibility of breaking TSMC’s dominance and monopoly in the future?

      A few years ago the same question could have been asked about Intel. While in this industry it is generally hard to erode an advantage and there is strong inertia, I don’t think anyone is guaranteed dominance unless they out-innovate the others. Given that scaling transistors is becoming increasingly difficult, there is a strong chance that the current leader in the race might hit a roadblock that slows it down and allows others to catch up.

      There is also the chance of new and disruptive technologies, which will inevitably come at some point and might change the landscape. Scaling silicon transistors is becoming increasingly hard and at some point it will stop; at that point others can catch up, and whoever is better at what comes after will take the lead.

      What is the right way of optimizing the workload of ML and AI on the hardware level?

      Optimization generally happens at all levels of the technology stack. The general rule for how to optimize anything is to first analyze it very carefully and find where the bottlenecks are.

      After that you need to look at what is causing the bottlenecks (e.g. a region of code uses all available memory bandwidth) and try to find a better way of doing that operation (e.g. make it more cache friendly to avoid using bandwidth). When thinking about how to create the best hardware for a given task you first need to define what your targets are (performance, power efficiency, thermal constraints, cost, etc.).

      You also need to understand the workloads that you are optimizing for very well. After that it becomes a game of choosing tradeoffs using available technologies in order to hit your goals. For ML a big problem is minimizing data movement while ensuring your ALUs are fed with the required data. ML accelerators are typically systolic arrays (but can also be big SIMD units depending on what you optimize for). Beyond that you need well optimized low level libraries that offer the common kernels and operations used as building blocks for neural networks.

      Engineers dedicated to this will profile how those kernels run on the hardware and try to adjust them to extract as much performance as possible. Higher up still, you have compilers and ML libraries that make use of the low level kernels and that also need to be made efficient. At all levels you need to profile, analyze, find bottlenecks and figure out how to remove them.
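
      As a toy illustration of the “make it more cache friendly” step mentioned above, here is a sketch of loop tiling for a matrix multiply. Real kernels do this in low level, vectorized code tuned per microarchitecture; the block size here is just an illustrative assumption.

      ```python
      import numpy as np

      def blocked_matmul(A, B, block=64):
          """Tiled matrix multiply: work on small blocks that fit in cache so each
          element is reused many times before being evicted, reducing memory traffic."""
          m, k = A.shape
          k2, n = B.shape
          assert k == k2
          C = np.zeros((m, n), dtype=A.dtype)
          for i0 in range(0, m, block):
              for j0 in range(0, n, block):
                  for p0 in range(0, k, block):
                      C[i0:i0+block, j0:j0+block] += (
                          A[i0:i0+block, p0:p0+block] @ B[p0:p0+block, j0:j0+block]
                      )
          return C

      A = np.random.rand(256, 256).astype(np.float32)
      B = np.random.rand(256, 256).astype(np.float32)
      assert np.allclose(blocked_matmul(A, B), A @ B, atol=1e-3)   # same result, less memory traffic
      ```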

      Is it possible in the future that AI can be used in the interview process?

      I am concerned it is already used in morally questionable ways. Yes, I think AI will be increasingly used as part of interview processes by both interviewer and interviewee. From scanning and filtering CVs to scheduling interviews to even conducting initial interview rounds and testing technical skills, I think AI will play a very important role in interview processes in the near future. And candidates themselves will likely use AI tools to write their CVs, manage their schedules and more.

      I learnt CoreML is the key ML platform at Apple. So I’m wondering what’s the advantage of CoreML compared with other open source platforms, such as Google’s TFLite or PyTorch Mobile?

      This is strictly my personal view and is in no way connected to Apple, which I do not talk about. Let’s say you’re a car engine designer. You are tasked with making an engine that will go into a very small set of premium sports car models which are made by the same company you work for.

      You can collaborate closely with the teams that design the other aspects of the car and you can validate, tune and optimize your engine for the small, specific and controlled set of cars that use it.

      You can test your engine in the cars you are targeting as they are being designed and you can work to tune the whole car for the ultimate driving experience. Your friend also designs car engines and is just as good as you but his engines need to work on dozens of car designs from multiple manufacturers at different price points and with different constraints. Worse still, his engine will go into everything from compact economic family cars to tractors.

      Since it goes into so many different designs it is much harder for your friend to spend the same effort on testing and tuning for each individual design it will go into. Which of these engines do you think will perform better? I think it is a similar story with software.

      How do we maintain class balance in data as defective samples are often fewer than good ones?

      This is a problem frequently encountered when training models. There are multiple approaches to solving this and choosing the right one usually depends on the problem you are trying to solve.

      In some cases you can either duplicate the bad samples or create new ones by distorting the good ones in some way. You can also synthetically create bad/noisy samples. These are data augmentation techniques. You can also train multiple models on the bad samples and an equally sized sample of the good ones and combine the models.

      You can also assign different weights to the samples such that the loss function penalizes misclassifying bad samples more than good ones. There are several other methods as well; these would probably be detailed very well by ChatGPT.
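
      As a concrete example of the weighting approach, here is a minimal PyTorch sketch of a class-weighted loss. The class counts are made up for illustration.

      ```python
      import torch
      import torch.nn as nn

      # Illustrative counts: 950 good samples vs 50 defective ones.
      class_counts = torch.tensor([950.0, 50.0])

      # Weight each class inversely to its frequency so the rare "defective" class
      # contributes as much to the loss as the common "good" class.
      weights = class_counts.sum() / (len(class_counts) * class_counts)

      loss_fn = nn.CrossEntropyLoss(weight=weights)

      logits = torch.randn(8, 2)            # model outputs for a batch of 8
      labels = torch.randint(0, 2, (8,))    # ground-truth classes
      loss = loss_fn(logits, labels)        # misclassified defects are penalized ~19x more
      print(weights, loss.item())
      ```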

      How do you identify performance bottlenecks in machine learning workflows and algorithms?

      Like in any other application, the key is to profile it and analyze it in detail. Making use of hardware performance counters and applications that can read them (e.g. VTune, perf, operf) and show them for each region of your code is generally crucial for finding how much time is spent in every part of your code, what the bottlenecks are, and how well the hardware is used (e.g. what percentage of the theoretical FLOPs you are achieving, how much memory bandwidth you are using, what the IPC is, etc.).

      Knowing the application/workflow is also important. If you know, for example, how many math operations are needed to complete a given ML workload, you can calculate the upper bound for the performance you can achieve on given hardware by knowing what its peak FLOPs are. If you are not in the ballpark of that value then there is often an opportunity to improve; you may, for example, be bottlenecked by memory bandwidth, memory latency, data dependencies, etc.
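
      As a simple illustration of that upper-bound comparison, here is a sketch that times a matrix multiply and compares the achieved GFLOP/s against an assumed peak. The peak figure is an illustrative assumption; counter-based tools like VTune or perf give far more detailed answers.

      ```python
      import time
      import numpy as np

      ASSUMED_PEAK_GFLOPS = 500.0   # illustrative peak for this machine, in GFLOP/s

      n = 2048
      A = np.random.rand(n, n).astype(np.float32)
      B = np.random.rand(n, n).astype(np.float32)

      A @ B                          # warm-up so BLAS initialization is not timed
      start = time.perf_counter()
      C = A @ B
      elapsed = time.perf_counter() - start

      flops = 2 * n**3               # multiply-adds in an n x n x n matmul
      achieved = flops / elapsed / 1e9
      print(f"achieved {achieved:.1f} GFLOP/s, "
            f"{100 * achieved / ASSUMED_PEAK_GFLOPS:.0f}% of assumed peak")
      ```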

      By using profiling tools you can find out where most time is being spent and you can analyze these bottlenecks and determine what is preventing them from being faster by looking at what they are supposed to do and what the performance counters say. Another method that is possible in some cases is to use simulators of the platforms and workflows that you are interested in.

      By using simulators you can usually get very in depth insights into how the software is using the “virtual” hardware and figure out what takes the most time and how it can be improved.

      With the emergence of Generative AI and high computing availability will we get closer to the Bayes limit?

      Very good question. I for one don’t understand how generative AI can bring us closer to the Bayes limit. The Bayes limit itself refers generally to classification or prediction tasks.

      For example, if you wanted to predict the outcome of a football match, you could gather more and more data about the players, the conditions, etc. and create very powerful models based on data from past matches, but there is a fundamental limit to the accuracy you can achieve, a minimum error rate that you must live with. While generative AI models could potentially be used to find information about some of the phenomena that need to be modelled, they do not themselves do classification or prediction directly. I could be missing something though.
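
      To make that “minimum error rate you must live with” concrete, here is a small sketch computing the Bayes error for a toy 1-D problem with two equally likely, overlapping Gaussian classes; no classifier, however powerful, can beat this error rate on this problem. The means and variance are made-up illustration values.

      ```python
      from scipy.stats import norm

      # Toy problem: two equally likely classes with overlapping 1-D Gaussian features.
      mu0, mu1, sigma = 0.0, 2.0, 1.0

      # The Bayes-optimal rule picks the class with the higher likelihood; for equal
      # priors and equal variances the decision boundary is the midpoint of the means.
      boundary = (mu0 + mu1) / 2

      # Bayes error = probability that a sample falls on the wrong side of the boundary.
      bayes_error = 0.5 * norm.sf(boundary, mu0, sigma) + 0.5 * norm.cdf(boundary, mu1, sigma)
      print(f"Bayes error rate: {bayes_error:.3f}")   # ~0.159: no model can exceed ~84.1% accuracy
      ```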

      Now with other types of models we can do classification, and increasing compute power and model complexity will bring us closer to the theoretical limit we have for accuracy. But, depending on the task, we will also need to collect increasingly more information to give our models, which can be impractical. In the football match example we may need to know the internal state of the players’ brains, simulate the weather around the stadium to account for wind gusts, simulate every blade of grass. Even with this data and a yet unavailable amount of compute power we’d still not be near the Bayes limit here.

      We might collect data on every particle within a 90 light minute radius of the stadium (in theory nothing outside of that sphere can influence the match) and on a galactic supercomputer simulate every particle and predict the outcome of the match with very high probability. But due to the inherent quantum randomness we will not get perfect predictions. That is probably where the Bayes limit lies for this example.

      Now for other types of problems we will probably get quite close to the Bayes limit as the models will form an understanding of the underlying processes in the data.

      If I am an aspiring machine learning engineer, should I focus more on the skill in deploying a model in the cloud compared to the ML algorithms?

      I think you should have a balance of both these days. There are usually two main types of roles related to ML today. One is the Data Scientist, who is generally the person that creates the models, curates the dataset, trains the models and provides them to the ML Engineer.

      The ML Engineer is the person that generally takes care of the ML pipeline, optimizes models for performance (quantization, sparsity, etc.), can fine-tune the model and generally handles deployment. The latter will generally need to be more aware of cloud systems while the former might not need to know as much, though the datasets will be stored in the cloud and the training will happen there as well.

      In my view it is fundamental to know ML (algorithms, networks, learning techniques, dataset management, etc.). This is generally the harder part. Using cloud systems is generally easier to master once you know the first part.

      Can you share an example of a challenging performance issue you encountered in the past and how you resolved it?

      At one point I was profiling a cluster of servers that was running K-Means clustering on a very large dataset using a popular cloud framework. I noticed that the achieved performance was far below what should have been possible theoretically.

      By running the numbers I noticed the system was off by more than 10x. I then profiled the systems themselves using performance counter reading tools and could confirm that the achieved FLOPS were well below the CPU’s capability at the time. Now, the framework was running in a JVM, which made it hard to see in detail why this was happening, so I used the print-assembly functionality to inspect the generated assembly code.

      Reading it, it became clear that the JVM was generating scalar code and not making use of AVX. I could write a small C++ kernel that was orders of magnitude faster. Looking at why the JVM didn’t generate optimal code, it came down to floating point arithmetic not being associative: the JVM kept trying to maintain the program order of the floating point operations.

      One fix was simply to reorder the operations in the code by unrolling the loops, which would get the JVM to generate better code. Beyond that, given the nature of the algorithm, it was wasteful to use the default double precision operations; 32 bits was more than enough.

      Creating a well optimized C++ library that was called from the Scala code for the math intensive operations improved performance by a very large factor. Though one can argue this makes the code less portable, in some cases it can be worth the tradeoff.
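
      The same pattern shows up in other high level languages: the distance computation at the heart of K-Means gets dramatically faster when it is expressed so that optimized, vectorized native kernels do the math. The sketch below is a Python/NumPy analogy to the Scala/C++ fix described above, not the original code.

      ```python
      import numpy as np

      def assign_naive(points, centroids):
          """Scalar, interpreter-bound assignment step: one distance term at a time."""
          labels = []
          for p in points:
              best, best_d = 0, float("inf")
              for j, c in enumerate(centroids):
                  d = sum((pi - ci) ** 2 for pi, ci in zip(p, c))
                  if d < best_d:
                      best, best_d = j, d
              labels.append(best)
          return np.array(labels)

      def assign_vectorized(points, centroids):
          """Same math, but expressed as array operations that dispatch to
          optimized, vectorized native kernels under the hood."""
          d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
          return d2.argmin(axis=1)

      pts = np.random.rand(10_000, 16).astype(np.float32)
      cts = np.random.rand(8, 16).astype(np.float32)
      assert (assign_naive(pts[:100], cts) == assign_vectorized(pts[:100], cts)).all()
      ```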

      In the field of machine learning, performance is often a trade-off with accuracy. How do you strike the right balance between performance and accuracy when optimizing ML models for Apple devices? Are there any specific challenges you face in this regard?

      I won’t comment about what I do at my job here. But as a general rule when you have a specific task in mind for your model you can determine what level of accuracy is required. Having accuracy greater than that level may not bring extra benefit to the users.

      So as long as you achieve the desired level of accuracy you can trade off some accuracy for extra performance. In fact if accuracy is good enough for a given task it might bring more value to the user for the response to come faster or to use less power to generate it.

      Emerging Trends: As an ML Performance Engineer, what emerging trends in machine learning or deep learning are you most excited about and how do you think these trends will influence the performance engineering landscape?

      I am particularly interested in the idea of using ML for chip design. It has been successfully used for generating floor plans, component placement and routing, but I think there is potential beyond this to have ML algorithms generate the overall microarchitecture as well.

      I also like the idea of federated learning, where we essentially move the model to where the data is, train it there locally and then move it to another dataset, therefore training a model on a distributed data set without having to move data around.
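
      Here is a minimal sketch of the federated averaging idea described above: each site trains on its own local data and only the model parameters travel between sites, never the raw data. The linear model and the data are made up purely for illustration.

      ```python
      import numpy as np

      def local_sgd(weights, X, y, lr=0.1, steps=100):
          """Train a linear regression locally with plain gradient descent."""
          w = weights.copy()
          for _ in range(steps):
              grad = 2 * X.T @ (X @ w - y) / len(y)
              w -= lr * grad
          return w

      rng = np.random.default_rng(0)
      true_w = np.array([3.0, -2.0])

      # Two "sites", each with private data that never leaves the site.
      sites = []
      for _ in range(2):
          X = rng.normal(size=(200, 2))
          y = X @ true_w + rng.normal(scale=0.1, size=200)
          sites.append((X, y))

      global_w = np.zeros(2)
      for round_ in range(5):                        # federated rounds
          local_models = [local_sgd(global_w, X, y) for X, y in sites]
          global_w = np.mean(local_models, axis=0)   # only weights are aggregated
      print(global_w)                                # approaches [3, -2]
      ```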

      Adapting ML Models for Performance: Given that the landscape of hardware platforms for running ML models (like GPUs, ASICs, and FPGAs) is continuously evolving, how do you approach the challenge of optimizing and adapting ML models for these diverse hardware architectures?

      Interesting question. As technology evolves we seem to have more and more heterogeneous computer systems, both at the chip level and at the system and even datacenter level. In your particular example you mention GPUs, ASICs and FPGAs.

      Ultimately a lot of this optimization can be done through abstraction and it is done in a lot of popular frameworks today. You will essentially have a set of compute kernels available in libraries that are optimized and tuned for different types of hardware.

      Depending on what hardware is available and what the user chooses, frameworks like TensorFlow or PyTorch can call these underlying libraries when performing computations and therefore benefit from the performance gains. It generally falls to the developers of the GPUs/ASICs to provide these high performance libraries.

      For FPGAs it will generally fall to whoever writes the circuit that the FPGA implements. This way the data scientist or ML Engineer who develops and deploys the model can focus on optimizing the model itself, while dedicated developers of the high performance compute kernels focus on making sure the network runs fast and efficiently.
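
      From the model developer’s side, that abstraction can look as simple as the PyTorch sketch below: the same model code runs on whichever backend is available and the vendor-provided kernel libraries underneath do the heavy lifting (the MPS check assumes a reasonably recent PyTorch build).

      ```python
      import torch
      import torch.nn as nn

      # Pick whichever accelerated backend is available; the model code itself
      # does not change, only the device the tensors live on.
      if torch.cuda.is_available():
          device = torch.device("cuda")        # NVIDIA GPUs via cuDNN/cuBLAS kernels
      elif torch.backends.mps.is_available():
          device = torch.device("mps")         # Apple-silicon GPU backend
      else:
          device = torch.device("cpu")         # CPU kernels (e.g. oneDNN/BLAS)

      model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
      x = torch.randn(32, 128, device=device)
      with torch.no_grad():
          y = model(x)                         # dispatched to the backend's optimized kernels
      print(device, y.shape)
      ```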

      What are your thoughts regarding distilled models?

      Distilled models are models that are trained by other, more complex, so-called teacher models and that learn to mimic the behavior of the teacher model. A distilled model is generally smaller and less complex, thus offering advantages such as running inference faster and more efficiently, taking up less space in RAM and on disk, using less memory bandwidth and generally being easier to handle.

      The disadvantage can be a loss of accuracy, although often a distilled model can learn to generalize better than the more complex teacher model. I think that using distilled models can be a good tradeoff in some situations; it depends on your desired level of accuracy and the performance and power requirements of the device you are running them on.
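
      For reference, here is a minimal sketch of the standard distillation loss: a blend of the usual hard-label loss with a KL term that pushes the student’s softened outputs toward the teacher’s. The temperature and mixing weight are illustrative hyperparameters, not recommended values.

      ```python
      import torch
      import torch.nn.functional as F

      def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
          """Blend the hard-label loss with a KL term that pushes the student's
          softened output distribution toward the teacher's."""
          hard = F.cross_entropy(student_logits, labels)
          soft = F.kl_div(
              F.log_softmax(student_logits / T, dim=-1),
              F.softmax(teacher_logits / T, dim=-1),
              reduction="batchmean",
          ) * (T * T)                   # rescale so gradients match the hard-loss magnitude
          return alpha * hard + (1 - alpha) * soft

      student_logits = torch.randn(4, 10, requires_grad=True)
      teacher_logits = torch.randn(4, 10)        # from the frozen, larger teacher model
      labels = torch.randint(0, 10, (4,))
      print(distillation_loss(student_logits, teacher_logits, labels))
      ```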

      How does Apple balance between on-device machine learning and cloud-based processing, and what are the factors that determine the decision to use one approach over the other?

      I won’t comment on anything about my employer here; I am speaking only from my own experience and expressing only my own opinions. Generally there are advantages to doing ML processing locally. Among these are privacy (data does not leave your device), latency, which can be better especially if the models are relatively computationally light, and of course the fact that the models can work offline and potentially without needing a subscription to a cloud service.

      Running in the cloud can allow you to run much larger models than you could run on your own device (due to, for example, lack of RAM), models that are significantly more computationally intensive and require more compute power than your device has, and it can save you from having to use your battery for the computations. Depending on what you are trying to achieve and what is most important for that particular application, you can create your own mix of cloud/local services.

      As machine learning models become more complex and data-intensive, how do you ensure a balance between performance and accuracy in your optimization efforts, especially when deploying models on devices with varying computational capabilities?

      First you need to define your accuracy and performance targets and then you start optimizing for them. For example in some cases you may need state of the art accuracy whereas in others quick response time is more important than the best quality answers.

      The main challenges are the compute power of the device (you don’t want to exceed a certain amount of latency for your specific task), possibly memory (model can’t fit in RAM or takes too much from other applications), and power/energy usage (you don’t want the device to overheat, make too much noise and/or drain the battery too fast). Choosing the right model for each application and device is important.

      Techniques like quantization, pruning and sparsity can reduce the size and computational requirements of your model while trading off some accuracy. Training distilled models is another way. Also having different flavors of each model for each device can work, for example a model that uses a smaller resolution input on less powerful devices. And of course a lot of effort goes into writing low level software kernels that implement the model’s functionality.

      Optimizing the low level software is the key to extracting the most performance and efficiency out of your hardware. Finally in some cases you can consider offloading the inference operations to the cloud if you expect your device to be connected most of the time.
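
      As a tiny illustration of the quantization tradeoff mentioned above, here is a sketch of symmetric per-tensor 8-bit weight quantization and the error it introduces. Real toolchains do this far more carefully (per channel, with calibration data), and the weight matrix here is made up.

      ```python
      import numpy as np

      weights = np.random.randn(4096, 4096).astype(np.float32)   # made-up fp32 weight matrix

      # Symmetric per-tensor int8 quantization: map [-max|w|, +max|w|] onto [-127, 127].
      scale = np.abs(weights).max() / 127.0
      q = np.round(weights / scale).astype(np.int8)               # stored weights: 4x smaller
      dequant = q.astype(np.float32) * scale                      # what the kernel computes with

      rel_err = np.abs(dequant - weights).mean() / np.abs(weights).mean()
      print(f"memory: {weights.nbytes / 2**20:.0f} MiB -> {q.nbytes / 2**20:.0f} MiB")
      print(f"mean relative weight error: {rel_err:.3%}")
      ```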

      Balancing Performance and Accuracy: In many cases, optimizing for speed can lead to a loss in model accuracy. How do you strike a balance between maintaining the accuracy of ML models and improving their performance?

      It’s usually a complex tradeoff that needs to be specifically tuned for each application. I’d say the first step is to define your performance and accuracy targets. By performance I mean not only latency but also things like power efficiency and memory requirements.

      For some applications you need really fast answers and may not need state of the art accuracy, for example real time object tracking. Other applications may favor accuracy over response time, for example translation.

      Once you know your targets you can choose an appropriate model architecture that can hit them. Usually you will start by choosing a model that hits and possibly exceeds the accuracy target and then apply various optimization techniques to get it to hit the performance target. Techniques include quantization, sparsity, pruning, distillation and others. You also want to make sure that the low level code that implements your model’s functionality is well tuned to make the best use of your available hardware.

      The model architecture you choose should be one where you have a reasonable chance of hitting your performance targets but through optimization and tuning you can get even up to an order of magnitude better performance in a lot of cases depending on where you start from. It may also be the case that you can’t hit your targets with available technology and perhaps you need to wait for better hardware to become available.

      Disclaimer: Adrian is expressing his personal views and not his employer’s.