An artificial neural network is a collection of connected units, or nodes, called artificial neurons, designed to loosely model how the neurons in a biological brain are thought to look and work. Like synapses in a biological brain, each connection can transmit a signal to other neurons. A deep neural network is an artificial neural network with multiple layers between the input and output layers; in short, a deep neural network is what turns machine learning into deep learning [59, 60]. In essence, two computing principles apply in artificial neural networks today: feedforward computing and backpropagation. The goal is always to train the models to meet the criteria typically imposed by vast sample datasets. Feedforward computing refers to a type of workflow without feedback connections that would form closed loops; backpropagation is a way of computing the partial derivatives during training. When a model is trained in the feedforward manner, the input “flows” forward through the network layers from the input to the output. In backpropagation, the model parameters are updated in the opposite direction: from the output layer to the input layer. Backpropagation, a strategy for computing the gradient in a neural network, is a general technique; it is not restricted to feedforward networks and works for recurrent neural networks (introduced below) as well [61].
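The interplay of the two principles can be illustrated with a minimal sketch. It assumes a tiny network with one sigmoid hidden layer, a linear output, and a squared-error loss; the function names, the learning rate, and the number of steps are illustrative choices, not taken from the cited works.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w1, w2):
    """Feedforward pass: input -> hidden layer (sigmoid) -> linear output."""
    h = [sigmoid(sum(wi * xi for wi, xi in zip(row, x))) for row in w1]
    y = sum(wj * hj for wj, hj in zip(w2, h))
    return h, y

def backward(x, target, h, y, w2):
    """Backpropagation for the loss 0.5 * (y - target)**2."""
    dy = y - target                                   # error signal at the output
    grad_w2 = [dy * hj for hj in h]                   # gradient of the output weights
    # Push the error back through the sigmoid of each hidden unit.
    dh = [dy * wj * hj * (1.0 - hj) for wj, hj in zip(w2, h)]
    grad_w1 = [[dj * xi for xi in x] for dj in dh]    # gradient of the hidden weights
    return grad_w1, grad_w2

def train(x, target, w1, w2, lr=0.5, steps=200):
    """Plain gradient descent on a single training example (assumed setup)."""
    for _ in range(steps):
        h, y = forward(x, w1, w2)
        g1, g2 = backward(x, target, h, y, w2)
        w2 = [w - lr * g for w, g in zip(w2, g2)]
        w1 = [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(w1, g1)]
    return w1, w2
```

The forward function moves the signal from input to output, while the backward function computes the partial derivatives in the opposite direction, starting at the output layer.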

In general, discriminative and/or decoding techniques identify objects and infer what is “true” and what is “fake”. Generative AI systems, by contrast, create objects such as pictures, audio, and writing samples, and can outline anything that computer-controlled systems like 3D printers can build [62]. Most often, generative and discriminative or decoding systems operate in pairs in generative adversarial network models (introduced below), which represent business as usual rather than the state of the art in today’s AI industry. Typically, a system labeled as generative AI is self-learning: it uses unsupervised learning (but can use other types of machine learning, too) and deploys anomaly detection and problem solving. It can come up with innovative solutions or approaches based on its experience with similar input/output pairings in the past [63].

The pioneers were convolutional neural networks (CNNs), first introduced in 1987 and also known as shift invariant or space invariant neural networks; they are most commonly applied to analyze visual imagery [64]. ImageNet, a groundbreaking project from the 2010s, builds on this technology [65]. Introduced by Geoffrey Hinton, capsule networks aim to overcome the limitations of traditional CNNs by representing hierarchical structures in images [66]. They focus on learning spatial relationships between features.
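The "shift invariant" property can be made concrete with a small sketch. It assumes a one-dimensional signal and a hypothetical two-tap edge kernel: shifting the input merely shifts the feature map, and a global maximum over the feature map stays the same.

```python
def conv1d(signal, kernel):
    """Valid 1D cross-correlation: slide the same kernel over every position."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def shift_right(signal, n):
    """Shift a signal n steps to the right, padding with zeros on the left."""
    return [0] * n + signal[:len(signal) - n]

edge = [-1, 1]                           # hypothetical edge-detecting kernel
signal = [0, 0, 1, 1, 0, 0, 0, 0]

features = conv1d(signal, edge)
features_shifted = conv1d(shift_right(signal, 2), edge)

# The feature map moves together with the input ...
same_pattern = features_shifted == shift_right(features, 2)
# ... and a global max-pooling response does not change at all.
same_peak = max(features) == max(features_shifted)
```

Because one kernel is reused at every position, the network does not need to relearn a feature for each location in the image.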

Graph neural networks (GNNs) are another field of research, aimed at processing graph data. The concept of GNNs emerged in the 1980s with the exploration of neural network architectures for processing graph-structured data. Later, in 2016, Thomas Kipf and Max Welling introduced the foundational work on graph convolutional networks (GCNs), which leverage spectral graph theory and convolutional operations to learn node representations in graphs [67]. As extensions of the GCN framework, variants such as GraphSAGE, graph attention networks (GAT), and ChebNet were developed to address scalability, inductive learning, and the handling of heterogeneous graphs [68]. New architectures, attention mechanisms, and graph pooling techniques have been proposed recently to explore applications in recommendation systems, social networks, and drug development. Ongoing work continues to focus on scalability, interpretability, and the handling of large-scale graphs.
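How a graph convolutional layer learns node representations can be sketched as follows. The sketch assumes the commonly used propagation rule with self-loops and symmetric degree normalization, followed by a ReLU; the function name and the toy graph are illustrative.

```python
import math

def gcn_layer(adj, features, weights):
    """One graph convolution: normalize, aggregate neighbors, transform, ReLU."""
    n = len(adj)
    # Add self-loops so that each node keeps its own features.
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    # Symmetric normalization by the square roots of the node degrees.
    norm = [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    n_in, n_out = len(weights), len(weights[0])
    # Aggregate the features of each node's neighborhood.
    agg = [[sum(norm[i][k] * features[k][f] for k in range(n)) for f in range(n_in)]
           for i in range(n)]
    # Apply the weight matrix shared by all nodes, then the ReLU nonlinearity.
    return [[max(0.0, sum(agg[i][f] * weights[f][o] for f in range(n_in)))
             for o in range(n_out)] for i in range(n)]

# Toy path graph 0 - 1 - 2 with a single feature set on node 0 only.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
out = gcn_layer(adj, [[1.0], [0.0], [0.0]], [[1.0]])
```

After one layer, the feature of node 0 has reached its direct neighbor (node 1) but not node 2; stacking layers widens the neighborhood each node can see.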

Designed by Ian Goodfellow and his colleagues in 2014, generative adversarial networks (GANs) are a class of machine learning frameworks that, while still evolving, represented the state of the art in the field until recently [69, 70]. A GAN is composed of a generator, which produces data, and a discriminator, which evaluates the authenticity of the generated data; through adversarial training, GANs learn to create data that aims to become indistinguishable from real (original input) data. … the generator gets no training to minimize the distance to a specific image, but rather to fool the discriminator. This enables the model to learn in an unsupervised manner; however, GANs have also proved useful for semi-supervised learning, fully supervised learning, and reinforcement learning [71]. GANs have revolutionized generative modeling across various domains, including computer vision. Over the years, numerous variants have emerged, such as Conditional GANs, which condition the generator on specific attributes; Wasserstein GANs, which improve training stability; CycleGANs, which excel in image-to-image translation; and StyleGANs, known for high-quality image synthesis [72].
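The adversarial objective can be illustrated with a short sketch of the two loss terms. It assumes the discriminator outputs a probability that a sample is real and uses the widely used non-saturating form of the generator loss; the function names are illustrative.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples should score 1, generated ones 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss: low when the discriminator is fooled."""
    return -math.log(d_fake)

# A discriminator that separates real from generated data well has a low loss ...
confident = discriminator_loss(d_real=0.9, d_fake=0.1)
# ... whereas one that can only guess (probability 0.5) has a loss of 2 * ln 2.
guessing = discriminator_loss(d_real=0.5, d_fake=0.5)
```

Note that the generator loss depends only on the discriminator's verdict, not on the distance to any particular training image.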

Over time, too, the drawbacks of GANs have bothered both application developers and users: mode collapse, where the generator produces a limited variety of samples and fails to capture the full diversity of the target distribution, in other words, focuses on a few modes and ignores others; instability during training, which hinders convergence and the quality of generated samples; the minimax game between generator and discriminator, which can be challenging to optimize because the sought result is a Nash equilibrium, which is not always straightforward to find [73]; inconsistency of generated images, which may suffer from artifacts, blurriness, or lack of fine details; the computational expense of training large-scale GANs on high-resolution images; the difficulty of finding optimal hyperparameters; the difficulty of measuring the performance of GANs, as common metrics like the inception score or the Fréchet inception distance have limitations and may not fully represent the quality of generated samples; the latent space in which the generator operates, the understanding of which remains an open research question [74]; and the substantial reliance of GANs on the quality and diversity of the training data: if the dataset is biased or lacks diversity, the generated samples may inherit these limitations [75].

To overcome these shortcomings, several approaches have been used: modified GAN architectures, such as StyleGAN3 by Nvidia [76] or game-theoretical models [77]; GANs for imbalance problems, which improve the performance of computer vision algorithms [78]; SMOTE-GAN, which combines the synthetic minority oversampling technique (SMOTE) with GANs to generate synthetic minority samples for imbalanced datasets; CycleGAN, which focuses on image-to-image translation tasks, such as style transfer, without paired training data and learns mappings between two domains using a cycle consistency loss; and conditional GANs (cGANs), which incorporate additional information, such as class labels, during training and control the generated samples by conditioning on specific attributes.

Various applications of GANs are still emerging: FrankenGAN for urban context massing, detailing, and texturing; Pix2PixHD by Nvidia for high-resolution photorealistic image-to-image translation; GAN Loci; or GauGAN [79]. Still, as the field evolves, diffusion models continue to push the boundaries of generative AI. In particular, they tend to surpass GANs in training methodology (taking advantage of the denoising process, which refines data step by step), offer stability and robustness during training, are more straightforward to optimize, and excel in producing sharp and detailed features; in short, they deliver better-quality images than GANs [80]. Operating in a compressed or latent space, which can make them more computationally efficient, latent diffusion models push the boundaries further [81].
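The noising and denoising principle behind diffusion models can be sketched as follows. The sketch assumes the common formulation in which a clean sample is blended with Gaussian noise according to a variance schedule; the schedule values and function names are illustrative, and the true noise stands in for the estimate that a trained network would provide.

```python
import math
import random

def alpha_bar(t, betas):
    """Fraction of the original signal left after noising steps 0..t."""
    prod = 1.0
    for beta in betas[:t + 1]:
        prod *= 1.0 - beta
    return prod

def add_noise(x0, t, betas, rng):
    """Forward process: blend the clean sample with Gaussian noise."""
    ab = alpha_bar(t, betas)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    return xt, eps

def remove_noise(xt, eps_estimate, t, betas):
    """Invert the blend, given an estimate of the noise that was added."""
    ab = alpha_bar(t, betas)
    return [(x - math.sqrt(1.0 - ab) * e) / math.sqrt(ab)
            for x, e in zip(xt, eps_estimate)]

# Assumed linear variance schedule with 100 steps.
betas = [1e-4 + (0.02 - 1e-4) * i / 99 for i in range(100)]
rng = random.Random(0)
x0 = [1.0, -0.5, 0.25]
xt, eps = add_noise(x0, 50, betas, rng)
recovered = remove_noise(xt, eps, 50, betas)   # a trained model would estimate eps
```

Training a diffusion model amounts to learning the noise estimate; generation then removes the estimated noise step by step, starting from pure noise.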

Recursive neural networks (RvNNs) [82] and recurrent neural networks (RNNs) [83] represent the state of the art today, also in the context of imitation-based learning and self-learning. RNNs, capable of processing sequential data, are particularly useful for natural language processing (NLP) tasks given the sequential nature of linguistic data, such as sentences and paragraphs, where the time factor differentiates the elements of the sequence. They process sequences until reaching an end token, regardless of sequence length; unlike other networks (e.g., CNNs) that handle only fixed-length inputs, RNNs can process variable-length inputs without changing the network architecture. As the network processes the elements of a sequence one by one, the hidden state stores and combines information from the entire sequence. RNNs also share weights across time steps, which simplifies the network architecture and allows efficient training [84]. A type of RNN designed to handle sequential data, long short-term memory (LSTM) networks excel in tasks like natural language processing and time series prediction due to their ability to capture long-term dependencies [85]. Another type of RNN, echo state networks (ESNs), which have fixed random connections, are particularly useful for time-series prediction and reservoir computing [86]. RvNNs can capture relationships in hierarchical structures and represent nested relationships, which makes them well suited for tasks involving nested or tree-like data [87].
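The role of the hidden state and of weight sharing can be shown in a minimal sketch. It assumes a simple (Elman-style) recurrent cell with a tanh activation and scalar inputs; the weights are arbitrary illustrative values, not trained ones.

```python
import math

def rnn_final_state(sequence, w_in, w_rec, bias):
    """Run a simple recurrent cell over a sequence of scalars."""
    size = len(bias)
    h = [0.0] * size                       # hidden state starts empty
    for x in sequence:                     # elements are processed one by one
        # The same weights are reused at every time step.
        h = [math.tanh(w_in[i] * x
                       + sum(w_rec[i][j] * h[j] for j in range(size))
                       + bias[i])
             for i in range(size)]
    return h

# Arbitrary illustrative weights for a hidden state of size 2.
w_in = [0.5, -0.3]
w_rec = [[0.1, 0.2], [0.0, 0.4]]
bias = [0.0, 0.0]

short = rnn_final_state([1.0, 0.0], w_in, w_rec, bias)
long = rnn_final_state([1.0, 0.0, 0.5, -1.0, 2.0], w_in, w_rec, bias)
```

Both calls use the same network; only the number of loop iterations differs, which is why variable-length inputs need no change to the architecture.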

Radial basis function networks (RBFNs) use radial basis functions as activation functions for function approximation and pattern recognition [88].
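A minimal sketch of such a network follows. It assumes Gaussian basis functions with a shared width and a linear output layer; the centers, width, and weights are illustrative values.

```python
import math

def gaussian_rbf(x, center, width):
    """Activation that depends only on the distance between x and the center."""
    dist_sq = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-dist_sq / (2.0 * width ** 2))

def rbf_network(x, centers, width, weights):
    """Output layer: weighted sum of the radial basis activations."""
    return sum(w * gaussian_rbf(x, c, width) for w, c in zip(weights, centers))

# Two hypothetical centers with opposite output weights.
centers = [[0.0], [1.0]]
weights = [1.0, -1.0]
at_first_center = rbf_network([0.0], centers, 0.5, weights)
midway = rbf_network([0.5], centers, 0.5, weights)
```

Each hidden unit responds most strongly near its own center, so the network approximates a function by combining local responses.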

Introducing another class of neural networks, variational autoencoders (VAEs) and adversarial autoencoders (AAEs) also overcome some of the drawbacks of GANs. Unlike a GAN, a VAE replaces the generator and discriminator pair with two distinct components: an encoder and a decoder [89]. The encoder abstracts data by compressing it, while the decoder brings the data back to its initial format. By sampling from the learned latent distribution (a step known as “reparametrization”) and decompressing the samples, the decoder generates variations of the respective phenomenon [90], typically in furniture design, fashion, photography, or video generation, and also in algorithms concerning architecture and urban design [91]. VAEs combine generative and inference models, that is, elements of both autoencoders and probabilistic modeling. They learn a probabilistic latent space and can generate new samples while preserving the data distribution. VAEs are commonly used for unsupervised learning and generative tasks. In video generation, they can learn a compact representation of video frames (or other data) in a lower-dimensional latent space: the encoder maps input data (video frames) to the latent space, and the decoder reconstructs the data from the latent representation. A hybrid of autoencoders and GANs, AAEs use adversarial training to learn a latent representation. They aim to improve sample quality and diversity [92].
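The probabilistic latent space of a VAE can be sketched with two small functions. The sketch assumes that the encoder outputs a mean and a log-variance per latent dimension, as is common practice; the function names are illustrative, and the encoder and decoder networks themselves are omitted.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Draw a latent sample as z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, sigma^2) to N(0, 1), summed over dimensions."""
    return sum(0.5 * (math.exp(lv) + m * m - 1.0 - lv)
               for m, lv in zip(mu, log_var))

rng = random.Random(0)
# Hypothetical encoder output for one input: mean 2.0, standard deviation 0.5.
mu, log_var = [2.0], [math.log(0.25)]
samples = [reparameterize(mu, log_var, rng)[0] for _ in range(5000)]
mean_of_samples = sum(samples) / len(samples)
```

Decoding different samples drawn around the same mean is what yields variations of one input, while the KL term keeps the latent space close to a standard normal distribution.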

Networks that mimic

Focusing on architectural design, and given its nature and complexity, this paper will further delve into the potential of imitation-based learning and self-learning strategies and techniques. To cope with the specificities and demands of such techniques (with robotics, not architecture or its design, as the intended application), various artificial neural network models have been developed. Autoregressive neural networks generate sequences element by element, which makes them suitable for imitation tasks: in line with imitation learning, they learn to predict the next element based on the previous ones. Equipped with command compensation, they enhance imitation learning based on bilateral control, allowing robots to operate almost simultaneously with an expert’s demonstration [93]. A new autoregressive neural network model (S2SM) has been proposed for this purpose; it requires only the slave state as input and predicts both the next slave and master states. (The goal of imitation-based learning is to learn a policy or control strategy that allows the system to mimic the behavior exhibited during demonstrations, usually performed by an expert. The slave state serves as the input for this learning process, while the master state corresponds to the desired or reference state that the algorithm aims to achieve; it represents the ideal or target configuration or behavior. In a bilateral control framework, the master state often comes from an external source, such as a human operator or an expert, and guides the system.) S2SM improves task success rates and stability under environmental variations.
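The autoregressive principle of learning from a demonstration and then generating element by element can be sketched in a few lines. This is not the S2SM model; it assumes the simplest possible predictor, a single coefficient fitted by least squares, and the function names are illustrative.

```python
def fit_ar1(demonstration):
    """Least-squares fit of x[t+1] ~ a * x[t] to a demonstrated trajectory."""
    num = sum(x * y for x, y in zip(demonstration, demonstration[1:]))
    den = sum(x * x for x in demonstration[:-1])
    return num / den

def rollout(predict_next, seed, steps):
    """Generate a sequence element by element, feeding predictions back in."""
    sequence = list(seed)
    for _ in range(steps):
        sequence.append(predict_next(sequence))
    return sequence

# Hypothetical expert demonstration: a value that halves at every step.
demonstration = [1.0, 0.5, 0.25, 0.125]
a = fit_ar1(demonstration)
imitation = rollout(lambda seq: a * seq[-1], seed=[1.0], steps=3)
```

A neural autoregressive model replaces the single coefficient with a network, but the loop that feeds each prediction back as the next input stays the same.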

Brain-inspired deep imitation networks have been developed to enhance the generalization ability and adaptability of deep neural networks for autonomous driving systems across various scenarios. The model draws insights from inferring functions that are believed to be close to the ones performed in the human brain to improve neural network performance in diverse environments [94].

In transformer-based deep imitation learning, the transformer, a variant of the self-attention architecture, is applied to deep imitation learning to solve dual-arm manipulation tasks in the real world [95]. Typically, transformers possess self-attention mechanisms that allow the network to focus on important features in the input data and ignore distractions, which can improve the performance of the network. They are also capable of parallel processing: unlike RNNs, transformers process all input data at the same time rather than sequentially, which allows for faster training. Other properties of transformers are scalability, which makes them suitable for a wide range of applications, from natural language processing to robot manipulation, and versatility, which allows them to be used for both sequence-to-sequence tasks (like translation) and single-input tasks (like classification). Attention mechanisms are not standalone networks; rather, they enhance the performance of various models, including transformers [96]. They allow networks to focus on relevant parts of input sequences, making them valuable for tasks like machine translation and image captioning.
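The self-attention mechanism can be illustrated with a minimal sketch of scaled dot-product attention. It assumes that queries, keys, and values are given directly; the learned projection matrices of a real transformer are omitted, and the function names are illustrative.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over the whole sequence at once."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this position with every position in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Mix the values according to how relevant each position is.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors used as queries, keys, and values alike.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, attention = self_attention(tokens, tokens, tokens)
```

No position depends on the result for another position, so all rows can be computed in parallel, in contrast to the step-by-step loop of an RNN.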

Inspired by the asymmetry of human neural networks, dual neural circuit policy (NCP) architectures are designed to improve the generalization ability of the networks [97].

In behavioral imitation with artificial neural networks, hidden layers of imitation learning models successfully encode task-relevant neural representations. These models predict individual brain dynamics more accurately than models trained on other subjects’ gameplay or control models [98].

The neural network approaches introduced above cater to different requirements and scenarios, offering solutions for diverse variants of imitation-based learning. In summary, the landscape of deep neural networks is rich and diverse, with each architecture tailored to specific tasks and domains. Research continues to explore novel designs and strategies to improve their capabilities and characteristics for specific purposes. Adaptations of such approaches may be forthcoming once imitation-based learning and self-learning strategies are considered for architectural design tasks; they could overcome (not only) the shortcomings of GANs, which have dominated the field so far.


Introduction figure: Svarovska, E.: PromeAI deployed to progress from a simple volumetric model to a near photo-realistic rendering. Student’s design. Department of Architecture, Faculty of Civil Engineering, CTU in Prague. 2023. Author’s archive.

Michal Sourek
