
The HammerHAI project proposes to establish an AI Factory at the High-Performance Computing Center Stuttgart (USTUTT/HLRS), supported by a strong consortium from Germany, to meet the growing demand for artificial intelligence (AI) infrastructure across Europe. HammerHAI will be a one-stop shop for AI users, focusing primarily on start-ups, small and medium-sized enterprises (SMEs), and large industrial companies, while also supporting academic institutions and the public sector. It will offer tailored services and infrastructure to accelerate AI innovation and help develop a competitive AI ecosystem in Europe. The AI Factory will be located in a region that is one of Europe's powerhouses of manufacturing and engineering innovation, and it will be embedded in an ecosystem that fosters the talent on which an ongoing digital transition depends.

HammerHAI will provide secure, scalable, and AI-optimised supercomputing resources tailored to the needs of start-ups, SMEs, industry, and research institutions. Its infrastructure will enable users to migrate their AI applications easily from laptops or cloud environments to supercomputers, providing the computing power needed to develop large-scale AI models. In this way, the AI Factory will support the entire AI lifecycle, from data preparation through model training, deployment, and monitoring to retraining, backed by a comprehensive package of services for efficient and effective AI development and operation. By providing scalable, secure, and AI-optimised supercomputing infrastructure integrated into an existing research, education, and innovation ecosystem, HammerHAI will meet Europe's growing demand for sovereign AI products and fast-track start-ups, professionals, SMEs, industry, and research in realising the full potential of AI.
The cloud computing industry has grown massively over the last decade, and with it new areas of application have emerged. Some require specialized hardware placed close to the user: requirements such as ultra-low latency, security, and location awareness are increasingly common, for example in smart cities, industrial automation, and data analytics. Modern cloud applications have also become more complex, as they typically run on distributed systems, split into components that must remain highly available. Unifying such diverse systems into centrally controlled compute clusters and making sophisticated scheduling decisions across them are two major challenges in this field. Scheduling for a cluster consisting of cloud and edge nodes must account for characteristics unique to this setting, such as variability in node and network capacity. The common solution for orchestrating large clusters is Kubernetes; however, it was designed for reliable, homogeneous clusters. Many applications and extensions are available for Kubernetes, but none of them optimises for both performance and energy or addresses data and job locality.

In DECICE, we develop an open, portable cloud management framework for the automatic and adaptive optimisation of applications, mapping jobs to the most suitable resources in a heterogeneous system landscape. Through holistic monitoring, we construct a digital twin that mirrors the original system. An AI scheduler decides where to place jobs and data, and reschedules jobs to adapt to system changes. A virtual training environment generates test data for training ML models and for exploring what-if scenarios. The portable framework is integrated into the Kubernetes ecosystem and validated using relevant use cases on real-world heterogeneous systems.
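To illustrate the kind of placement decision such a scheduler must make, the following is a minimal, hypothetical sketch (not the actual DECICE implementation): each node is scored for a job by combining network latency, an illustrative energy cost, and a data-locality penalty, and the feasible node with the lowest score wins. All class and field names here are assumptions chosen for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    free_cpus: int
    free_mem_gb: int
    latency_ms: float          # network latency to the job's data source
    energy_per_cpu_w: float    # illustrative power draw per allocated CPU
    local_datasets: set = field(default_factory=set)

@dataclass
class Job:
    name: str
    cpus: int
    mem_gb: int
    dataset: str               # dataset the job reads; locality matters

def score(node, job, w_perf=1.0, w_energy=0.5, w_locality=2.0):
    """Lower is better: weighted sum of latency, energy, and locality."""
    if node.free_cpus < job.cpus or node.free_mem_gb < job.mem_gb:
        return float("inf")    # node cannot host the job at all
    locality_penalty = 0.0 if job.dataset in node.local_datasets else 1.0
    return (w_perf * node.latency_ms
            + w_energy * node.energy_per_cpu_w * job.cpus
            + w_locality * locality_penalty * node.latency_ms)

def place(job, nodes):
    """Return the name of the feasible node with the lowest score, or None."""
    best = min(nodes, key=lambda n: score(n, job))
    return best.name if score(best, job) != float("inf") else None

# A small latency-sensitive job with local data lands on the edge node,
# while a large training job overflows to the cloud node.
nodes = [
    Node("edge-1", free_cpus=4, free_mem_gb=8, latency_ms=2.0,
         energy_per_cpu_w=15.0, local_datasets={"sensor-A"}),
    Node("cloud-1", free_cpus=64, free_mem_gb=256, latency_ms=25.0,
         energy_per_cpu_w=10.0),
]
print(place(Job("infer", cpus=2, mem_gb=4, dataset="sensor-A"), nodes))   # edge-1
print(place(Job("train", cpus=32, mem_gb=128, dataset="sensor-A"), nodes))  # cloud-1
```

A production scheduler would additionally react to system changes, which is where the digital twin comes in: rescoring against the twin's updated node state triggers rescheduling without probing the live cluster.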