Publications

You can also find my articles on my Google Scholar profile.

Conference Papers and Pre-prints


LiSu: A Dataset and Method for LiDAR Surface Normal Estimation

Dušan Malić, Christian Fruhwirth-Reisinger, Samuel Schulter, Horst Possegger
In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), 2025

While surface normals are widely used to analyse 3D scene geometry, surface normal estimation from LiDAR point clouds remains severely underexplored. This is caused, on the one hand, by the lack of large-scale annotated datasets and, on the other, by the lack of methods that can robustly handle the sparse and often noisy LiDAR data in reasonable time. We address these limitations using a traffic simulation engine and present LiSu, the first large-scale, synthetic LiDAR point cloud dataset with ground truth surface normal annotations, eliminating the need for tedious manual labeling. Additionally, we propose a novel method that exploits the spatiotemporal characteristics of autonomous driving data to enhance surface normal estimation accuracy. By incorporating two regularization terms, we enforce spatial consistency among neighboring points and temporal smoothness across consecutive LiDAR frames. These regularizers are particularly effective in self-training settings, where they mitigate the impact of noisy pseudo-labels, enabling robust real-world deployment. We demonstrate the effectiveness of our method on LiSu, achieving state-of-the-art performance in LiDAR surface normal estimation. Moreover, we showcase its full potential in addressing the challenging task of synthetic-to-real domain adaptation, leading to improved neural surface reconstruction on real-world data.
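
As a loose illustration of the two regularizers mentioned above, the PyTorch sketch below shows what a spatial-consistency term over neighboring normals and a temporal-smoothness term across consecutive frames could look like. The cosine-based penalties, the precomputed neighbor_idx and corr_idx index tensors, and all tensor shapes are illustrative assumptions, not the paper's actual formulation.

import torch
import torch.nn.functional as F

def spatial_consistency_loss(normals, neighbor_idx):
    # normals: (N, 3) predicted unit normals; neighbor_idx: (N, k) indices of spatial neighbors
    neighbor_normals = normals[neighbor_idx]                 # (N, k, 3)
    cos = (normals[:, None, :] * neighbor_normals).sum(-1)   # cosine similarity to each neighbor
    return (1.0 - cos).mean()                                # encourage neighboring normals to agree

def temporal_smoothness_loss(normals_t, normals_t1, corr_idx):
    # corr_idx: (N,) index of each frame-t point's correspondence in frame t+1
    cos = F.cosine_similarity(normals_t, normals_t1[corr_idx], dim=-1)
    return (1.0 - cos).mean()                                # penalize normal flicker across frames

Both terms would simply be added, with suitable weights, to the main supervised or pseudo-label loss during (self-)training.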

Bibtex @inproceedings{malic2025lisu,
      title = {{LiSu: A Dataset and Method for LiDAR Surface Normal Estimation}},
      author = {Malić, Dušan and Fruhwirth-Reisinger, Christian and Schulter, Samuel and Possegger, Horst},
      booktitle = {Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR)},
      year = {2025}
}

GBlobs: Explicit Local Structure via Gaussian Blobs for Improved Cross-Domain LiDAR-based 3D Object Detection

Dušan Malić, Christian Fruhwirth-Reisinger, Samuel Schulter, Horst Possegger
In Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR), 2025

LiDAR-based 3D detectors need large datasets for training, yet they struggle to generalize to novel domains. Domain Generalization (DG) aims to mitigate this by training detectors that are invariant to such domain shifts. Current DG approaches exclusively rely on global geometric features (point cloud Cartesian coordinates) as input features. Over-reliance on these global geometric features can, however, cause 3D detectors to prioritize object location and absolute position, resulting in poor cross-domain performance. To mitigate this, we propose to exploit explicit local point cloud structure for DG, in particular by encoding point cloud neighborhoods with Gaussian blobs, GBlobs. Our proposed formulation is highly efficient and requires no additional parameters. Without any bells and whistles, simply by integrating GBlobs in existing detectors, we beat the current state-of-the-art in challenging single-source DG benchmarks by over 21 mAP (Waymo->KITTI), 13 mAP (KITTI->Waymo), and 12 mAP (nuScenes->KITTI), without sacrificing in-domain performance. Additionally, GBlobs demonstrate exceptional performance in multi-source DG, surpassing the current state-of-the-art by 17, 12, and 5 mAP on Waymo, KITTI, and ONCE, respectively.
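
To make the Gaussian-blob encoding concrete, the NumPy/SciPy sketch below summarizes each point's k nearest neighbors by their mean offset and covariance, so local structure is kept while absolute position drops out. The neighborhood size k, the cKDTree search, and the flattened 12-dimensional feature layout are illustrative assumptions, not the paper's implementation.

import numpy as np
from scipy.spatial import cKDTree

def gaussian_blob_features(points, k=16):
    # points: (N, 3) Cartesian coordinates; returns (N, 12) local features per point
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)              # (N, k) neighbor indices (includes the point itself)
    neighbors = points[idx]                       # (N, k, 3)
    mean = neighbors.mean(axis=1)                 # neighborhood centroid
    centered = neighbors - mean[:, None, :]
    cov = np.einsum('nki,nkj->nij', centered, centered) / (k - 1)   # (N, 3, 3) covariance
    # mean offset relative to the query point + flattened covariance:
    # local structure is preserved, absolute position drops out
    return np.concatenate([mean - points, cov.reshape(-1, 9)], axis=1)

For example, calling gaussian_blob_features on a (1000, 3) point cloud yields a (1000, 12) feature array that could replace raw Cartesian coordinates as detector input.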

Bibtex @inproceedings{malic2025gblobs,
      title = {{GBlobs: Explicit Local Structure via Gaussian Blobs for Improved Cross-Domain LiDAR-based 3D Object Detection}},
      author = {Malić, Dušan and Fruhwirth-Reisinger, Christian and Schulter, Samuel and Possegger, Horst},
      booktitle = {Proc. of the Conference on Computer Vision and Pattern Recognition (CVPR)},
      year = {2025}
}

STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving

Christian Fruhwirth-Reisinger, Dušan Malić, Wei Lin, David Schinagl, Samuel Schulter, Horst Possegger
arXiv preprint arXiv:2506.06218, 2025

We introduce STSBench, a scenario-based framework to benchmark the holistic understanding of vision-language models (VLMs) for autonomous driving. The framework automatically mines pre-defined traffic scenarios from any dataset using ground-truth annotations, provides an intuitive user interface for efficient human verification, and generates multiple-choice questions for model evaluation. Applied to the nuScenes dataset, we present STSnu, the first benchmark that evaluates the spatio-temporal reasoning capabilities of VLMs based on comprehensive 3D perception. Existing benchmarks typically target off-the-shelf or fine-tuned VLMs for images or videos from a single viewpoint and focus on semantic tasks such as object recognition, dense captioning, risk assessment, or scene understanding. In contrast, STSnu evaluates driving expert VLMs for end-to-end driving, operating on videos from multi-view cameras or LiDAR. It specifically assesses their ability to reason about both ego-vehicle actions and complex interactions among traffic participants, a crucial capability for autonomous vehicles. The benchmark features 43 diverse scenarios spanning multiple views and frames, resulting in 971 human-verified multiple-choice questions. A thorough evaluation uncovers critical shortcomings in existing models' ability to reason about fundamental traffic dynamics in complex environments. These findings highlight the urgent need for architectural advances that explicitly model spatio-temporal reasoning. By addressing a core gap in spatio-temporal evaluation, STSBench enables the development of more robust and explainable VLMs for autonomous driving.

Bibtex @article{fruhwirth2025stsbench,
      title = {{STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving}},
      author = {Fruhwirth-Reisinger, Christian and Malić, Dušan and Lin, Wei and Schinagl, David and Schulter, Samuel and Possegger, Horst},
      journal = {arXiv preprint arXiv:2506.06218},
      year = {2025}
}

Vision-Language Guidance for LiDAR-based Unsupervised 3D Object Detection

Christian Fruhwirth-Reisinger, Wei Lin, Dušan Malić, Horst Bischof, Horst Possegger
In Proc. of the British Machine Vision Conference (BMVC), 2024

Accurate 3D object detection in LiDAR point clouds is crucial for autonomous driving systems. To achieve state-of-the-art performance, the supervised training of detectors requires large amounts of human-annotated data, which is expensive to obtain and restricted to predefined object categories. To mitigate manual labeling efforts, recent unsupervised object detection approaches generate class-agnostic pseudo-labels for moving objects, subsequently serving as supervision signal to bootstrap a detector. Despite promising results, these approaches do not provide class labels or generalize well to static objects. Furthermore, they are mostly restricted to data containing multiple drives from the same scene or images from a precisely calibrated and synchronized camera setup. To overcome these limitations, we propose a vision-language-guided unsupervised 3D detection approach that operates exclusively on LiDAR point clouds. We transfer CLIP knowledge to classify point clusters of static and moving objects, which we discover by exploiting the inherent spatio-temporal information of LiDAR point clouds for clustering, tracking, as well as box and label refinement. Our approach outperforms state-of-the-art unsupervised 3D object detectors on the Waymo Open Dataset (+23 AP3D) and Argoverse 2 (+7.9 AP3D) and provides class labels not solely based on object size assumptions, marking a significant advancement in the field.

Bibtex @inproceedings{reisinger2024vilgod,
      title = {{Vision-Language Guidance for LiDAR-based Unsupervised 3D Object Detection}},
      author = {Fruhwirth-Reisinger, Christian and Lin, Wei and Malić, Dušan and Bischof, Horst and Possegger, Horst},
      booktitle = {Proc. of the British Machine Vision Conference (BMVC)},
      year = {2024}
}

MAELi: Masked Autoencoder for Large-Scale LiDAR Point Clouds

Georg Krispel, David Schinagl, Christian Fruhwirth-Reisinger, Horst Possegger, Horst Bischof
In Proc. of the Winter Conference on Applications of Computer Vision (WACV), 2024

The sensing process of large-scale LiDAR point clouds inevitably causes large blind spots, i.e. regions not visible to the sensor. We demonstrate how these inherent sampling properties can be effectively utilized for self-supervised representation learning by designing a highly effective pre-training framework that considerably reduces the need for tedious 3D annotations to train state-of-the-art object detectors. Our Masked AutoEncoder for LiDAR point clouds (MAELi) intuitively leverages the sparsity of LiDAR point clouds in both the encoder and decoder during reconstruction. This results in more expressive and useful initialization, which can be directly applied to downstream perception tasks, such as 3D object detection or semantic segmentation for autonomous driving. In a novel reconstruction approach, MAELi distinguishes between empty and occluded space and employs a new masking strategy that targets the LiDAR’s inherent spherical projection. Thereby, without any ground truth whatsoever and trained on single frames only, MAELi obtains an understanding of the underlying 3D scene geometry and semantics. To demonstrate the potential of MAELi, we pre-train backbones in an end-to-end manner and show the effectiveness of our unsupervised pre-trained weights on the tasks of 3D object detection and semantic segmentation.

Bibtex @inproceedings{krispel2024maeli,
      title = {{MAELi: Masked Autoencoder for Large-Scale LiDAR Point Clouds}},
      author = {Krispel, Georg and Schinagl, David and Fruhwirth-Reisinger, Christian and Possegger, Horst and Bischof, Horst},
      booktitle = {Proc. of the Winter Conference on Applications of Computer Vision (WACV)},
      year = {2024}
}

GACE: Geometry Aware Confidence Enhancement for Black-Box 3D Object Detectors on LiDAR-Data

David Schinagl, Georg Krispel, Christian Fruhwirth-Reisinger, Horst Possegger, Horst Bischof
In Proc. of the International Conference on Computer Vision (ICCV), 2023

Widely-used LiDAR-based 3D object detectors often neglect fundamental geometric information readily available from the object proposals in their confidence estimation. This is mostly due to architectural design choices, which were often adopted from the 2D image domain, where geometric context is rarely available. In 3D, however, considering the object properties and its surroundings in a holistic way is important to distinguish between true and false positive detections, e.g., occluded pedestrians in a group. To address this, we present GACE, an intuitive and highly efficient method to improve the confidence estimation of a given black-box 3D object detector. We aggregate geometric cues of detections and their spatial relationships, which enables us to properly assess their plausibility and, consequently, improve the confidence estimation. This leads to consistent performance gains over a variety of state-of-the-art detectors. Across all evaluated detectors, GACE proves to be especially beneficial for the vulnerable road user classes, i.e., pedestrians and cyclists.

Bibtex @inproceedings{schinagl2023gace,
      title = {{GACE: Geometry Aware Confidence Enhancement for Black-Box 3D Object Detectors on LiDAR-Data}},
      author = {Schinagl, David and Krispel, Georg and Fruhwirth-Reisinger, Christian and Possegger, Horst and Bischof, Horst},
      booktitle = {Proc. of the International Conference on Computer Vision (ICCV)},
      year = {2023}
}

SAILOR: Scaling Anchors via Insights into Latent Object Representation

Dušan Malić, Christian Fruhwirth-Reisinger, Horst Possegger, Horst Bischof
In Proc. of the Winter Conference on Applications of Computer Vision (WACV), 2023

LiDAR 3D object detection models are inevitably biased towards their training dataset. The detector clearly exhibits this bias when employed on a target dataset, particularly towards object sizes. However, object sizes vary heavily between domains due to, for instance, different labeling policies or geographical locations. State-of-the-art unsupervised domain adaptation approaches outsource methods to overcome the object size bias. Mainstream size adaptation approaches exploit target domain statistics, contradicting the original unsupervised assumption. Our novel unsupervised anchor calibration method addresses this limitation. Given a model trained on the source data, we estimate the optimal target anchors in a completely unsupervised manner. The main idea stems from an intuitive observation: by varying the anchor sizes for the target domain, we inevitably introduce noise or even remove valuable object cues. The latent object representation, perturbed by the anchor size, is closest to the learned source features only under the optimal target anchors. We leverage this observation for anchor size optimization. Our experimental results show that, without any retraining, we achieve competitive results even compared to state-of-the-art weakly-supervised size adaptation approaches. In addition, our anchor calibration can be combined with such existing methods, making them completely unsupervised.

Bibtex @inproceedings{malic2023sailor,
      title = {{SAILOR: Scaling Anchors via Insights into Latent Object Representation}},
      author = {Malić, Dušan and Fruhwirth-Reisinger, Christian and Possegger, Horst and Bischof, Horst},
      booktitle = {Proc. of the Winter Conference on Applications of Computer Vision (WACV)},
      year = {2023}
}

FAST3D: Flow-Aware Self-Training for 3D Object Detectors

Christian Fruhwirth-Reisinger, Michael Opitz, Horst Possegger, Horst Bischof
In Proc. of the British Machine Vision Conference (BMVC), 2021

In the field of autonomous driving, self-training is widely applied to mitigate distribution shifts in LiDAR-based 3D object detectors. This eliminates the need for expensive, high-quality labels whenever the environment changes (e.g., geographic location, sensor setup, weather condition). State-of-the-art self-training approaches, however, mostly ignore the temporal nature of autonomous driving data. To address this issue, we propose a flow-aware self-training method that enables unsupervised domain adaptation for 3D object detectors on continuous LiDAR point clouds. In order to get reliable pseudo-labels, we leverage scene flow to propagate detections through time. In particular, we introduce a flow-based multi-target tracker that exploits flow consistency to filter and refine the resulting tracks. These precise pseudo-labels then serve as a basis for model re-training. Starting with a pre-trained KITTI model, we conduct experiments on the challenging Waymo Open Dataset to demonstrate the effectiveness of our approach. Without any prior target domain knowledge, our results show a significant improvement over the state-of-the-art.

Bibtex @inproceedings{reisinger2021fast3d,
      title = {{FAST3D: Flow-Aware Self-Training for 3D Object Detectors}},
      author = {Fruhwirth-Reisinger, Christian and Opitz, Michael and Possegger, Horst and Bischof, Horst},
      booktitle = {Proc. of the British Machine Vision Conference (BMVC)},
      year = {2021}
}

DRT: Detection Refinement for Multiple Object Tracking

Bisheng Wang, Christian Fruhwirth-Reisinger, Horst Possegger, Horst Bischof, Guo Cao
In Proc. of the British Machine Vision Conference (BMVC), 2021

Deep learning methods have led to remarkable progress in multiple object tracking (MOT). However, when tracking in crowded scenes, existing methods still suffer from both inaccurate and missing detections. This paper proposes Detection Refinement for Tracking (DRT) to address these two issues for people tracking. First, we construct an encoder-decoder backbone network with a novel semi-supervised heatmap training procedure, which leverages human heatmaps to obtain a more precise localization of the targets. Second, we integrate a “one patch, multiple predictions” mechanism into DRT, which refines the detection results and recovers occluded pedestrians at the same time. Additionally, we leverage a data-driven LSTM-based motion model which can recover lost targets at a negligible computational cost. Compared with strong baseline methods, our DRT achieves significant improvements on publicly available MOT datasets. In addition, DRT generalizes well, i.e., it can be applied to any detector to improve its performance.

Bibtex @inproceedings{wang2021drt,
      title = {{DRT: Detection Refinement for Multiple Object Tracking}},
      author = {Wang, Bisheng and Fruhwirth-Reisinger, Christian and Possegger, Horst and Bischof, Horst and Cao, Guo},
      booktitle = {Proc. of the British Machine Vision Conference (BMVC)},
      year = {2021}
}

Towards Data-driven Multi-target Tracking for Autonomous Driving

Christian Fruhwirth-Reisinger, Georg Krispel, Horst Possegger, Horst Bischof
In Proc. of the Computer Vision Winter Workshop (CVWW), 2020

We investigate the potential of recurrent neural networks (RNNs) to improve traditional online multi-target tracking of traffic participants from an ego-vehicle perspective. To this end, we build a modular tracking framework, based on interacting multiple models (IMM) and unscented Kalman filters (UKF). Following the tracking-by-detection paradigm, we leverage geometric target properties provided by publicly available 3D object detectors. We then train and integrate two RNNs: A state prediction network replaces hand-crafted motion models in our filters and a data association network finds detection-to-track assignment probabilities. In our extensive evaluation on the publicly available KITTI dataset we show that our trained models achieve competitive results and are significantly more robust in the case of unreliable object detections.

Bibtex @inproceedings{reisinger2020tddm,
      title = {{Towards Data-driven Multi-target Tracking for Autonomous Driving}},
      author = {Fruhwirth-Reisinger, Christian and Krispel, Georg and Possegger, Horst and Bischof, Horst},
      booktitle = {Proc. of the Computer Vision Winter Workshop (CVWW)},
      year = {2020}
}