Retrieval of Spatio-Temporal Knowledge from Cyber-Physical Building Systems

Joseph Aerathu

This research was inspired by J.A.R.V.I.S. from the Iron Man films: can we build something like that for buildings today?

First, let’s check where we are right now…

Current LLMs have no way to access a real building’s information in real time, process that data, and run inference over it. This thesis proposes a way to do exactly that! The final result looked like this:

Overview

This thesis developed data-processing pipelines, agents, and a UI that ingest building information models (IFC), convert them to RDF, integrate telemetry, construct an enterprise graph, and run evaluations. The test site was an office space owned by the CASE lab at RPI in Troy, NY.

The complete thesis defence presentation is available at this Google Slides link.

Contents and highlights

  • IFC ingestion & RDF conversion: ingest/ contains the pipelines, SHACL validation, and converters (ifc_to_rdf.py, ifc_to_rdf_pipeline.py).

  • Graph & storage: api/graphdb.py, api/enterprise_graph.py and related modules handle RDF storage and graph operations.

  • Agents: agents/detection_agent.py, agents/diagnosis_agent.py, agents/recommendation_agent.py implement detection/diagnosis/recommendation workflows.

  • API / tasks: api/ holds HTTP endpoints, Celery wiring (api/celery_app.py, api/tasks.py), telemetry ingestion and webhook adapters.

  • CASE evaluation: case_graphrag/ and data/evaluation/ contain evaluation harnesses, datasets and scripts used for experimental results.

  • UI: ui/ is a Vite + TypeScript frontend for visualization and interaction.

  • Scripts: scripts/ includes helpers to run pipelines, seed timeseries data, and compute evaluation metrics.
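To make the IFC-to-RDF step listed above more concrete, here is a minimal sketch of the general idea: each building entity becomes a subject with a type, a label, and an optional containment link. This is not the code in ifc_to_rdf.py. The IfcRecord class, the to_ntriples function, the containedIn predicate, and the https://example.org/building# namespace are all hypothetical names chosen for illustration, and the sketch uses only the standard library and emits N-Triples rather than Turtle.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional

# Hypothetical namespace; the repository's actual vocabulary is not shown here.
BASE = "https://example.org/building#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


@dataclass(frozen=True)
class IfcRecord:
    """Simplified stand-in for one entity read from an IFC file."""
    global_id: str
    ifc_type: str                       # e.g. "IfcSpace"
    name: str
    contained_in: Optional[str] = None  # global_id of the parent, if any


def _escape(literal: str) -> str:
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return literal.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_ntriples(records: Iterable[IfcRecord]) -> Iterator[str]:
    """Yield one N-Triples line per fact: type, label, optional containment."""
    for rec in records:
        subject = f"<{BASE}{rec.global_id}>"
        yield f"{subject} <{RDF_TYPE}> <{BASE}{rec.ifc_type}> ."
        yield f'{subject} <{RDFS_LABEL}> "{_escape(rec.name)}" .'
        if rec.contained_in is not None:
            yield f"{subject} <{BASE}containedIn> <{BASE}{rec.contained_in}> ."


if __name__ == "__main__":
    demo = [
        IfcRecord("storey1", "IfcBuildingStorey", "Level 1"),
        IfcRecord("room101", "IfcSpace", "Room 101", contained_in="storey1"),
    ]
    for line in to_ntriples(demo):
        print(line)
```

Writing the converter as a generator means triples can be streamed to a file or a graph store without holding the whole graph in memory, which may matter for large models.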

External services used

  • An RDF triple store / graph database for storing Turtle/RDF outputs.

  • A timeseries DB (e.g., TimescaleDB/Postgres) for telemetry.

  • A message broker for Celery (e.g., Redis or RabbitMQ), needed only when running background workers.
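One way to wire up the services listed above is to read their connection details from environment variables and fail early when a required one is missing. The sketch below illustrates that pattern only. The variable names GRAPHDB_URL, TIMESERIES_DSN and CELERY_BROKER_URL, the ServiceSettings class, and the load_settings function are assumptions made for this example, not names confirmed by the repository.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServiceSettings:
    """Connection details for the external services (hypothetical field names)."""
    graphdb_url: str
    timeseries_dsn: str
    broker_url: Optional[str]  # only needed when background workers run


def load_settings(env: Mapping[str, str] = os.environ) -> ServiceSettings:
    """Build settings from a mapping, raising if a required entry is absent or empty."""
    required = ("GRAPHDB_URL", "TIMESERIES_DSN")  # assumed variable names
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("missing required settings: " + ", ".join(missing))
    return ServiceSettings(
        graphdb_url=env["GRAPHDB_URL"],
        timeseries_dsn=env["TIMESERIES_DSN"],
        # The broker is optional; an unset or empty value becomes None.
        broker_url=env.get("CELERY_BROKER_URL") or None,
    )
```

Passing the environment as a mapping keeps the function easy to test: a plain dict can stand in for os.environ.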

Design notes

  • Input data: IFC files, timeseries telemetry, CASE evaluation datasets.

  • Outputs: RDF graphs (as Turtle files and in the graph database), processed evaluation metrics, agent logs and outputs.

  • Error modes: SHACL validation failures, missing external services, malformed inputs.

Edge cases & recommendations

  • Large IFC files may require significant memory; test with representative files and consider streaming or chunked processing.

  • Normalize timeseries data before computing metrics to avoid biased results.

  • Document the required environment variables before deploying to shared environments.
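The normalization recommendation above does not say which method is meant; it could equally refer to unit conversion or resampling onto a common time grid. As one possible reading, the sketch below rescales a single sensor series to zero mean and unit variance (a z-score), so that sensors with large raw ranges do not dominate an aggregate metric. The zscore function is a hypothetical helper, not part of the repository's scripts.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def zscore(values: Sequence[float]) -> List[float]:
    """Rescale one series to zero mean and unit (population) standard deviation."""
    if not values:
        return []
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant series carries no variation; map every point to 0.0
        # rather than dividing by zero.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]
```

Each series would be normalized independently before metrics are computed across sensors.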

Source

Link to the repository.