Definition
Multimodal AI learns and processes relationships across different forms of data, such as text, images, audio and video. A model may connect an image with a caption, speech with text, video with temporal descriptions, or audio with visual and metadata records.
Architecture
Systems may use separate encoders for each modality, a shared representation space, a language model as an orchestration layer, or an end-to-end architecture. The integration method determines which cross-modal relationships the system can learn.
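As an illustration of the first two options (separate encoders feeding a shared representation space), the following is a minimal sketch rather than a description of any particular system. The projection matrices, feature vectors and function names are all hypothetical and hand-picked; in a real model the projections would be learned.

```python
import math

# Hypothetical, hand-picked projection matrices. In practice these are learned.
IMAGE_PROJ = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # 3-d image features -> 2-d shared space
TEXT_PROJ = [[0.0, 1.0], [1.0, 0.0]]              # 2-d text features  -> 2-d shared space


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def project(features, weights):
    """Apply a linear projection: one output value per weight row."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def encode_image(features):
    """Image encoder: its own projection, output in the shared space."""
    return normalize(project(features, IMAGE_PROJ))


def encode_text(features):
    """Text encoder: a different projection, same shared space."""
    return normalize(project(features, TEXT_PROJ))


def cosine(a, b):
    """Dot product of two unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_captions(image_features, captions):
    """Return caption names ordered from most to least similar to the image."""
    image_vec = encode_image(image_features)
    scored = [(cosine(image_vec, encode_text(f)), name) for name, f in captions.items()]
    return [name for _, name in sorted(scored, reverse=True)]
```

Because each modality keeps its own encoder, the only place the two meet is the shared space, so the system can learn only those relationships that similarity in that space is able to express.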
Creative use
Applications include image search from language, audio description, video indexing, generative storyboards, performance systems, accessibility interfaces and compound archives.
Electronic Artefacts position
ORETH and Palimpsests make multimodal structure concrete: sound, images, notes, analyses and provenance form one compound record. A multimodal system should preserve those distinctions rather than flattening every input into an opaque embedding.
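One way to keep those distinctions explicit is to store each component with its modality and provenance instead of merging everything into a single vector. The sketch below is an assumption for illustration only: the class and field names are hypothetical and do not represent the actual ORETH or Palimpsests schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Component:
    """One part of a compound record (hypothetical structure)."""
    modality: str    # e.g. "sound", "image", "note", "analysis"
    content: str     # placeholder for the real payload, such as a path or text
    provenance: str  # where this component came from


@dataclass
class CompoundRecord:
    """A record that keeps its components separate and labelled."""
    record_id: str
    components: list = field(default_factory=list)

    def add(self, modality, content, provenance):
        """Attach a component without merging it into the others."""
        self.components.append(Component(modality, content, provenance))

    def by_modality(self, modality):
        """Return only the components of one modality."""
        return [c for c in self.components if c.modality == modality]

    def modalities(self):
        """List the distinct modalities present, in sorted order."""
        return sorted({c.modality for c in self.components})
```

With this layout a downstream model can still embed each component, but the record itself continues to say which part was sound, which was a note, and where each came from.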
Limitations
Performance can be uneven across modalities. Text may dominate a nominally visual system, temporal structure may be lost, and generated outputs may obscure their sources. Evaluation must therefore test each modality and each cross-modal task independently, since a single aggregate score can hide a failing modality.
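The need for independent testing can be sketched with a small scoring helper. The task labels and the result format (pairs of task name and correctness flag) are assumptions made for this example, not an established benchmark format.

```python
from collections import defaultdict


def accuracy_by_task(results):
    """Compute accuracy separately for each task label.

    `results` is an iterable of (task, correct) pairs; the format is hypothetical.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for task, correct in results:
        totals[task] += 1
        hits[task] += int(bool(correct))
    return {task: hits[task] / totals[task] for task in totals}


def overall_accuracy(results):
    """Compute one pooled accuracy across all tasks."""
    results = list(results)
    if not results:
        raise ValueError("no results to score")
    return sum(int(bool(correct)) for _, correct in results) / len(results)
```

For example, with eight correct text-only items and two incorrect image-only items (invented numbers), the pooled accuracy is 0.8 while the per-task breakdown shows the image-only task at 0.0, which the pooled figure alone would not reveal.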
References
See CLIP, Generative AI, Signal Archaeology, ORETH and Palimpsests.