There are ecosystems of natural language processing, image and video processing, voice processing, and code or software processing and development, as well as robotics and expert systems or business intelligence [99, 100]. Together they are represented by DALL-E (most recently DALL-E 3), ImageGPT, InstructGPT and ChatGPT, Bard or Gemini, Ernie Bot, Tongyi Qianwen, SenseTime SenseChat, Bedrock, and many other tools by OpenAI, Microsoft, Google, Baidu, Alibaba Group, and Amazon; also by MidJourney (which recently released version 6), Stable Diffusion (currently released in version 3, which demonstrates unmatched performance with ControlNet, a network designed to control diffusion models in image generation, and with LayerDiffusion, which introduces latent transparency, allowing the generation of a single transparent image or of multiple transparent layers combined into a single blended image [101]), Tellius, OpenNN, Theano, and many other tools by multiple producers.

These ecosystems have existed and evolved for decades, and some of them have long been at work (though sometimes obscured, even covered up); they are becoming mature. Most importantly, there are massive foundation models [102] – GPT (version 4 released in March 2023) by OpenAI; Gemma, Gemini 1.5 Pro (successor to LaMDA and PaLM 2), the Imagen model family, MedLM, and Codey by Google; foundation models by IBM; and others – trained on large, unlabeled datasets and fine-tuned to become a starting point for a wide array of applications.

Social networks are teeming with posts recommending dozens of artificial intelligence applications that promise users greater performance, time savings, and effortless earnings [103, 104, 105, 106]. Behind the noise, however, there is indeed a solid base of research and development that is delivering impressive technological progress at a breathtaking pace. At the beginning of March 2024, a new generation of the Claude family of LLMs was released – the Claude 3 foundation models, which set new industry benchmarks across a wide range of cognitive tasks [107]. The family includes three state-of-the-art models in ascending order of capability: Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus. Each successive model offers increasingly powerful performance, allowing users to select the optimal balance of intelligence, speed, and cost for their specific application. A product of San Francisco-headquartered Anthropic, Claude 3 challenges the hegemony of GPT-4 [108]. Nvidia, too, competes with OpenAI and its GPT models: launched at the same time, Chat with RTX is free and runs locally on a PC, which guarantees the privacy of user data; it lets the user choose an AI model (Llama or Mistral) and a dataset, which can include getting answers from YouTube [109].

Nonetheless, Claude’s lead appears to be the more robust one [110]. ChatGPT is optimized for dialogue using reinforcement learning from human feedback (RLHF); to enhance their helpfulness and harmlessness, Claude models have undergone fine-tuning using constitutional AI and RLHF. The two recipes share their core; the difference lies in what is added to it. In RLHF, the model is initially trained using supervised learning on human demonstrations: human AI trainers provide example conversations in which they play both the user and the AI assistant, and the model learns to generate responses by imitating these demonstrations. After this initial training, the model is fine-tuned using reinforcement learning: it generates responses, receives feedback (rewards) based on their quality as judged by people, and adjusts its behavior based on this feedback so that its performance improves over time. What Claude’s training adds is constitutional AI, which can be considered a set of guiding principles – a rulebook that helps the AI make decisions and respond helpfully and ethically. Although it shares similarities with rule-based algorithms, constitutional AI is not the same: instead of strict rules, it operates on guiding principles (a “constitution”), which makes it more flexible and adaptive and allows for context-aware decisions. Constitutional AI can self-correct and learn from feedback; it dynamically balances following principles with learning from real-world interactions [111].
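
The reward-driven loop described above can be illustrated with a toy sketch. It is not how ChatGPT or Claude is actually trained: it assumes a hypothetical policy that merely chooses among three canned replies, made-up reward values standing in for feedback, and a plain REINFORCE-style update. All names (`train_policy`, `REWARDS`) are illustrative.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(rewards, steps=3000, lr=0.1, seed=0):
    """REINFORCE-style training of a softmax policy over fixed replies.

    rewards[i] is a hypothetical feedback score for reply i.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)   # initial policy: all replies equally likely
    baseline = 0.0                  # running average of observed rewards
    for _ in range(steps):
        probs = softmax(logits)
        choice = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[choice]    # feedback for the sampled reply
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log pi(choice) w.r.t. logit i is 1[i == choice] - probs[i].
        for i in range(len(logits)):
            indicator = 1.0 if i == choice else 0.0
            logits[i] += lr * advantage * (indicator - probs[i])
    return softmax(logits)


# Assumed feedback scores: reply 1 is rated best, reply 0 worst.
REWARDS = [0.1, 0.9, 0.4]
final_probs = train_policy(REWARDS)
best_reply = max(range(len(final_probs)), key=lambda i: final_probs[i])
print(best_reply, [round(p, 3) for p in final_probs])
```

Replies that earn above-average feedback become more probable and the others less so, which is the sense in which behavior is adjusted based on rewards; a real system replaces the three canned replies with free-form text and the fixed scores with a learned reward model.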

Another challenge comes from Europe: Paris-based Mistral – a nine-month-old startup with only a few dozen employees – is corralling enough investment and attention, including high-profile partnerships with Microsoft, Nvidia, and Salesforce, to put it in the top tier of AI companies globally. Mistral’s top models, Mistral Large and Mixtral, already rival GPT-4’s performance in accuracy and common-sense reasoning. Designed for complex multilingual reasoning tasks, including text understanding, transformation, and code generation, Mistral Large is natively capable of function calling, which facilitates application development and tech stack modernization at scale. Mixtral, on the other hand, is a decoder-only sparse mixture-of-experts (SMoE) model pre-trained on data extracted from the open web; it excels in efficiency, performance, and the handling of multiple languages and code generation, which is highly appreciated by developers, who can fine-tune and modify Mixtral for specific business needs [112, 113]. Mistral also launched Le Chat, a chatbot “multilingual by design” [114].
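
To make the term sparse mixture-of-experts more concrete, the following minimal sketch shows the routing idea: for each input, a gate scores all experts, but only the top-scoring few are evaluated and their outputs are blended. It is an illustration under stated assumptions, not Mixtral’s implementation: the four toy experts, the hand-written gate scores, and the choice of two active experts are all hypothetical; in a real model the experts are feed-forward blocks and the gate is a learned layer.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k=2):
    """Pick the k experts with the highest gate scores.

    Returns (expert_index, weight) pairs; the weights are a softmax over
    the selected scores only, so they sum to 1.
    """
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([gate_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))


# Toy "experts": stand-ins for the feed-forward blocks of a real model.
EXPERTS = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
GATE_LOGITS = [0.1, 2.0, -1.0, 2.0]   # hypothetical gate scores for one input

routes = top_k_route(GATE_LOGITS, k=2)
output = moe_forward(3.0, EXPERTS, GATE_LOGITS, k=2)
print(routes, output)
```

Because the unselected experts are never called, the cost per input depends on `k` rather than on the total number of experts, which is where the efficiency of the sparse design comes from.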

In addition, Meta has announced that it will release its newest LLM, Llama 3, in July 2024; the model is expected to match GPT-4 capabilities and respond to image-based questions. Llama 3 aspires to handle challenging queries by offering context instead of dismissing them, and to reduce inappropriate or inaccurate responses [115]. This timely announcement came just as rival Google was forced to pause its Gemini-powered image-generation feature after it misrepresented historical images.

And there are not only large language models but small ones, too. Designed to perform well on simpler tasks and easier to fine-tune to meet specific needs, these small language models (SLMs) are more accessible and easier to use with limited resources, be it money, time, or training data. What the recent releases mean is not the start of a shift from large to small but an extension of the available categories of models, each adjusted to fit specific performance scenarios best. SLMs are well suited for those looking to build applications that can run locally on a device (as opposed to the cloud), where a task does not require extensive reasoning or where a quick response is needed. Behind the formation of SLMs is a story of bedtime reading: while reading to his 4-year-old daughter, an AI researcher thought to himself, “How did she learn this word? How does she know how to connect these words?” That led the Microsoft Research machine-learning expert to wonder how much an AI model could learn using only words a 4-year-old could understand – and ultimately to an innovative training approach that has produced a new class of more capable language models, which promises to make AI more accessible for a variety of specific needs and purposes. At the end of April 2024, Microsoft announced the Phi-3 family of open models, which it presents as the most capable and cost-effective SLMs available. Phi-3 models outperform models of the same size and of the next size up across various benchmarks that evaluate language, coding, and math capabilities [116]. The first in the family, Phi-3-mini, has 3.8 billion parameters and, according to Microsoft, performs better than models twice its size. Since the end of April 2024, it has been publicly available in the Microsoft Azure AI model catalog; on Hugging Face, a platform for machine learning models; on Ollama, a lightweight framework for running models on local machines; and as an Nvidia NIM microservice with a standard API interface that can be deployed anywhere [117].

Since 2022, we have been experiencing a kind of artificial intelligence storm; yet for many reasons, it would be a mistake to forget history. After the first, largely experimental AI models – the perceptron [118] and Samuel’s checkers player [119] – practical and even commercial AI applications arrived slowly: from ELIZA, the first chatbot in history, developed in 1966 by Joseph Weizenbaum at MIT, which imitated a therapist by giving general answers to users’ questions and so simulated a real conversation; through Siri, one of the first successful virtual assistants, developed in 2010 by Apple for iOS devices and still active and evolving; and Alexa, developed in 2014 by Amazon; to Google Assistant and BERT, developed by Google in 2016 and 2018 respectively. BERT is an artificial intelligence model that can understand the context of a conversation and provide more accurate and personalized responses; it is used as the basis of many modern virtual assistants. The “good old” AI applications deserve noting, too: they have been reading postal codes for USPS since the very beginning of the 1990s, have been deciding on custody and bail in many US states and settling insurance claims, and have been a tool of economic efficiency in US healthcare over the last two or three decades [120].

By their very nature, game developers have been adept at creating artificial worlds and telling stories based on them. Their expertise in creating and deploying diverse algorithms that generate game narratives and scenes often goes back decades before AI was defined as an umbrella term [121]. Generative AI increasingly shapes the fashion industry and the art world, where brands and artists – or AI users? – can create original designs that look as if human artists had created them. Such is the case of [122], a Dutch startup that provides a self-service platform where users can create their own hyper-realistic AI-driven fashion avatars in just minutes. Users may customize the virtual models’ size, body type, shape, and identity – even down to whether they are happy or sad. In the financial sector, banks are using generative AI to automate tasks such as checking account openings and loan approvals [123].

Founded in 2016 and on its way to becoming the GitHub of the next wave of AI, Hugging Face is a platform on which developers can discuss anything from bug tracking to API integration to overall project development [124]. With AI drawing massive enthusiasm as the potential epicenter of future economies, the startup is in a good position to stand the test of time.

Cofounded only in 2023 by Aravind Srinivas, who has an OpenAI background, Perplexity AI developed the first-of-its-kind LLM-powered answer engine in just six months as a low-cost project. Perplexity does not have its own LLM; it uses language models via an API: ChatGPT or Claude Instant in the free version, and the top models GPT-4, Claude 3, or Gemini Ultra in the paid version (subscribers can choose) [125]. According to the Wall Street Journal, Perplexity is finalizing a new investment round that could value the company at around $1 billion, making it a “unicorn” [126]. Jeff Bezos, among others, invested in Perplexity two months ago, and Daniel Gross [127] should bring new money. The company has annual revenues of about $10 million and about 50 million users; the figures deserve attention because they show how little revenue, relative to valuation, investors are ready to accept when it comes to AI.
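
The ratios implied by these figures can be worked out in a few lines. This is only back-of-the-envelope arithmetic on the rounded numbers quoted above (about $1 billion valuation, about $10 million annual revenue, about 50 million users); the variable names are illustrative.

```python
# Rounded figures as quoted in the text.
valuation_usd = 1_000_000_000      # possible valuation, about $1 billion
annual_revenue_usd = 10_000_000    # annual revenues, about $10 million
users = 50_000_000                 # about 50 million users

revenue_multiple = valuation_usd / annual_revenue_usd   # valuation per dollar of revenue
revenue_per_user = annual_revenue_usd / users           # dollars per user per year
valuation_per_user = valuation_usd / users              # dollars of valuation per user

print(f"valuation / revenue: {revenue_multiple:.0f}x")
print(f"revenue per user:    ${revenue_per_user:.2f} per year")
print(f"valuation per user:  ${valuation_per_user:.2f}")
```

In other words, the valuation would be roughly one hundred times annual revenue, with each user bringing in about twenty cents a year.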

Not only professionals but the general public, too, have started to employ AI, moving from initial embarrassment and doubts to more mature approaches and understanding. Today, AI copilots are revolutionizing productivity, creativity, and efficiency across diverse domains: from general use (Microsoft Copilot [128]) through coding (GitHub Copilot [129]), sales, marketing, and customer service (Salesforce Einstein [130]), and cross-system communication (Moveworks [131]) to Stable Diffusion [132], which replaces the need for multiple specialized applications by handling tasks like report building, customer service replies, sales emails, and more.

To advanced users, chatbots serve as a sixth sense or a counterpart able to respond to questions. The quality of the responses as a whole is disputable; however, there are better and worse questions, which generate better or worse answers. Labeled as prompting, the art of questioning, which has stood behind any authorial creation since time immemorial, is experiencing a revival and an upgrade to the next level. Literacy, originally invented to transmit and preserve information of prime importance and quality, has evolved throughout history from a unique skill and the knowledge of a narrow circle of high-ranking experts [133] to the general elementary education of the entire population, applied in everyday life, even for banal purposes. Similarly, the range of computer non-professionals who benefit from bold AI support to create specific amateur applications – or who are getting used to exploiting handy AI applications for the most common needs – is expanding rapidly today. Musavir.AI [134], for example, offers to generate movie-like renders, architecture, and sci-fi in general, and to tailor refined lighting aesthetics – or, through its MyAvatar feature, to generate diverse new hairstyles for oneself. Enjoyed as a hobby or used for entrepreneurial purposes, countless popular text-to-image models form another class of AI applications; Ideogram [135], released in version 1.0 in March 2024, is an example.

AI does not hesitate to enter good old familiar tools like Microsoft Windows, Excel, or Word to enhance productivity, data analysis, and decision-making. Ideas is an AI-powered insights service that identifies trends, patterns, and outliers in data sets, allowing users to analyze and understand their data rapidly. Related functionalities include taking a picture of a printed data table with an Android device and automatically converting it into a fully editable Excel spreadsheet, which eliminates manual data entry, and allowing formulas to spill over multiple cells dynamically, adapting to changes in data size [136]. Windows 11, too, introduces several AI-powered features that enhance productivity, creativity, and overall user experience, including Copilot, updated Paint functionality for photo editing and “art” creation, a photo movie editor, Clipchamp assisting with footage editing, enhanced ink drawing, smart (and safe) app control, and a security tool [137].

Targeting professional users, Devin is a groundbreaking AI software engineer designed to revolutionize software development through collaboration with humans. Devin’s key feature is that it acts as an autonomous coding assistant – a fully autonomous software engineer capable of planning and executing complex coding tasks, learning new technologies on the fly, building and deploying full applications from scratch, automatically finding and fixing bugs, training its own AI models, and contributing to production codebases. Devin handles long-term reasoning, dynamic learning, coding, and design skills [138].

The present is experiencing a real influx of ever-new AI tools. Thousands of startups, alongside wealthy industry tycoons and their spin-offs, have begun applying generative AI to create virtual assistants that can respond appropriately to human requests, with natural language processing, dialogue management capabilities, and image and video processing. During 2023, more than ten thousand new applications were released [139]. Obviously, keeping up with all of them is beyond the limits of any human – unless AI is involved. In March 2023, OpenAI released yet another state-of-the-art large language model, GPT-4, equipped with multimodal capabilities and superior performance on benchmarks designed for humans. Stanford released Alpaca 7B, a relatively small open-source model that matches the performance of GPT-3.5. Concurrently, Google introduced Bard, which recently became Gemini – the “creative and helpful collaborator, here to supercharge your imagination, boost your productivity, and bring your ideas to life” [140]; the Chinese search engine behemoth Baidu released Ernie Bot (just before April 14, 2023) [141]; Alibaba Group introduced the chatbot Tongyi Qianwen (April 11, 2023); SenseTime introduced SenseChat [142]; and Amazon introduced its new AI application Bedrock, which makes generative tools for creating texts and images available to developers.

Throughout 2023, Amazon offered four AI applications, the native Titan included, and it employs the most people in AI development – more than both Google and Microsoft [143]. In addition, Amazon has been investing massively in the OpenAI rival Anthropic: the September 2023 deal put $1.25 billion into the company, with the investment to be topped up to a maximum of $4 billion, which became a reality at the end of March 2024. Lacking the capability to develop adequate models independently, companies like Amazon and Microsoft have had to act vicariously through others, primarily OpenAI and Anthropic. The two have reaped immense benefits by allying with one or the other of these moneyed rivals, and as yet have not seen many downsides [144].

The answer of X, or of Elon Musk, to OpenAI’s ChatGPT is Grok – an AI search assistant designed to answer your queries while keeping you entertained and engaged. xAI launched the Grok-1.5 chatbot on the social media platform X at the end of March 2024, with the enhanced version aiming to surpass current AI technologies. On popular benchmarks, Grok-1 is about as capable as Meta’s open-source Llama 2 chatbot model and surpasses OpenAI’s GPT-3.5, xAI claims [145]. And thirdly, Databricks concurrently released DBRX, a new generative AI model akin to OpenAI’s GPT series and Google’s Gemini. Available on GitHub and the AI dev platform Hugging Face for research as well as for commercial use, the base (DBRX Base) and fine-tuned (DBRX Instruct) versions of DBRX can be run and tuned on public, custom, or otherwise proprietary data. Databricks says that it spent roughly $10 million and two months training DBRX, which it claims (quoting from a press release) “outperform[s] all existing open source models on standard benchmarks.” But – and here is the marketing rub – it is exceptionally hard to use DBRX unless you are a Databricks customer: running DBRX in the standard configuration requires a server or PC with at least four Nvidia H100 GPUs (or any other configuration of GPUs that adds up to around 320GB of memory) [146].
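
The hardware requirement is simple arithmetic, sketched below. The figure of roughly 320 GB comes from the text; the 80 GB per H100 is an assumption about the common variant of that GPU, and the 24 GB card is a purely hypothetical example of a smaller GPU added for comparison.

```python
import math


def gpus_needed(required_gb, per_gpu_gb):
    """Smallest number of identical GPUs whose combined memory covers the requirement."""
    return math.ceil(required_gb / per_gpu_gb)


REQUIRED_GB = 320        # approximate total memory quoted for DBRX
H100_GB = 80             # assumption: the 80 GB H100 variant
SMALL_CARD_GB = 24       # hypothetical smaller GPU, for comparison only

h100_count = gpus_needed(REQUIRED_GB, H100_GB)
small_card_count = gpus_needed(REQUIRED_GB, SMALL_CARD_GB)
print(h100_count, small_card_count)
```

The count only covers raw memory capacity; it says nothing about interconnect speed or whether a given model can actually be split across that many devices.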

In January 2024, OpenAI reached another milestone, launching a specialized AI chatbot store called the GPT Store, which offers a wide range of applications based on advanced AI models such as GPT-4 and DALL-E 3 [147]. The store is accessible free of charge (so far) to subscribers of OpenAI’s premium programs such as ChatGPT Plus, ChatGPT Enterprise, or ChatGPT Team. The chatbots are categorized into areas such as lifestyle, writing, research, programming, and education, democratizing access to AI technology by enabling users to create and use custom GPTs tailored to their specific needs. No less importantly, contrary to the initial hype surrounding the technology, the opening of the GPT Store indicates that in the upcoming years the emphasis will shift away from artificial general intelligence (AGI) towards more specialized chatbots designed to meet personalized demands.

As another example of the trend, Google Maps has received an AI boost [148]: Immersive View, which includes street-level imagery and 3D models of any location, live traffic and weather simulations, and navigation around congestion and unfavorable weather; Lens in Maps, an AI-driven feature for enhanced environmental understanding that identifies and labels every object in the camera view; improved navigation, providing highly accurate and detailed maps, information about local businesses, landmarks, and must-see spots, and exploration along the route, including comprehensive information on charging stations and their compatibility for convenient electric-vehicle journeys; a new Aerial View API (application programming interface), enabling a 3D bird’s-eye view for applications and websites and drawing on Google’s AI for object identification and extraction; and photo-first results for search terms, with AI image recognition for accurate matching [149].

Concerning image processing, the applications Leonardo AI, Playground AI, Bluewillow AI, Bing AI, Adobe Firefly, and Bright AI also deserve attention [150]. In March 2024, Google introduced VideoPoet, literally a ChatGPT for text-to-video, image-to-video, and video editing [151]. The year-to-year progress in the quality of the generated images and videos is undeniable [152]. At the beginning of 2024, significant text-to-video applications were released: Sora by OpenAI (to be discussed in section (5) [153]), Lumiere by Google [154], and EMO [155]. Powered by a state-of-the-art “space-time” neural network and self-supervised learning (like its competitors), Lumiere effortlessly crafts 5-second video clips in one go (Sora works up to 1 minute, but …). It masters stylized generation, video editing, cinemagraphs, and video inpainting [156]. EMO, short for Emote Portrait Alive, is an artificial intelligence system developed by researchers at Alibaba’s Institute for Intelligent Computing that transforms a single image and audio input into expressive portrait videos. It can take a single portrait photo and bring it to life using a direct audio-to-video synthesis approach, bypassing the need for intermediate 3D models or facial landmarks. The system ensures seamless frame transitions and consistent identity preservation throughout the video, resulting in highly expressive and lifelike animations – not only convincing speaking videos but also singing videos in various styles [157]. In the context of this review, what makes these applications worth attention is their state-of-the-art neural networks and learning strategies, which allow for understanding the physical world and a variety of action scenes [158] – a common ground with approaches to architectural concepts.

In March 2024, OpenAI introduced a new model, Voice Engine, which can clone a speaker’s voice based on just 15 seconds of audio recording. The technology has been successfully tested in collaboration with several partners, for example, to create a synthetic voice for a young girl who lost the ability to speak normally due to a brain tumor; this class of use, which allows people with speech disorders to talk in a natural-sounding voice, is being developed by the Brazilian company Livox. Further applications appear in education, where the model helps generate educational content, and in localizing videos into different languages while preserving the speaker’s original accent [159]. OpenAI has not yet released the model as a separate product because of possible misuse, such as the creation of deepfakes.

And the race continues: one month later, Microsoft Research introduced VASA-1, which takes a single portrait photo and speech audio and produces a hyper-realistic talking-face video with precise lip-audio sync, lifelike facial behavior, and naturalistic head movements, generated in real time [160].

Concurrently, in an interview with Bill Gates, Sam Altman of OpenAI revealed what was coming up in GPT-4.5 (or GPT-5) [161]. Altman highlighted integration with other modes of information beyond text, better logic and analysis capabilities, and consistency in performance as priorities for AI progress over the next two years. OpenAI has already launched image and audio capabilities for its current models, but those capabilities will go much further, in line with people’s desire for AI systems that engage with more elements of the real world than just text. Improving logical reasoning and inference is another key priority: the aim is for models to become better at analyzing prompts, synthesizing information, and drawing insightful conclusions rather than just generating speculative or untrustworthy responses. Hopefully, reliability will stem from better reasoning.

As hectic and chaotic as this overview may seem, it is actually quite a polished picture – one which, moreover, skips the field of hardware in general and (with some exceptions) financing and investments. The overview lacks attention to architecture and the development of the built environment; this is neither an omission nor a narrative intention but an image of reality: abundance and glut on the one side and lagging on the other. Architecture and the built environment are on the margins of the current attention paid to application development and to investment in specialized tools and productive AI environments.


Introduction figure: Labbe, J.: Diverse AI-tools deployment at a multi-level design development. Luka Development Scenario, Prague. MS architekti, Prague. 2024. Author’s archive.

Michal Sourek
