Visiting Research Scholar
Bot Intelligence Group, Robotics Institute, Carnegie Mellon University
Supervisor: Jean Oh
M.S./Ph.D. Student in Artificial Intelligence, Dongguk University
I build and evaluate trustworthy multimodal AI systems at the intersection of vision, language, and embodied interaction. My goal is to move from AI model evaluation toward human-centered interactive and robotic systems, with a focus on reliable decision making in generative and foundation models deployed in real-world settings.
sieunchoi@dgu.ac.kr
A sketch is a distilled form of visual abstraction that conveys core concepts through simplified yet purposeful strokes while omitting extraneous detail. Despite this expressive power, quantifying the efficiency of semantic abstraction in sketches remains challenging. Existing evaluation methods that rely on reference images, low-level visual features, or recognition accuracy fail to capture abstraction, the defining property of sketches. To address these limitations, we introduce SEA (Sketch Evaluation metric for Abstraction efficiency), a reference-free metric that assesses how economically a sketch represents class-defining visual elements while preserving semantic recognizability. These elements are derived per class from commonsense knowledge about features typically depicted in sketches. SEA leverages a visual question answering model to determine the presence of each element and returns a quantitative score reflecting semantic retention under visual economy. To support this metric, we present CommonSketch, the first semantically annotated sketch dataset, comprising 23,100 human-drawn sketches across 300 classes, each paired with a caption and element-level annotations. Experiments show that SEA aligns closely with human judgments and reliably discriminates levels of abstraction efficiency, while CommonSketch serves as a benchmark for systematically evaluating element-level sketch understanding across vision-language models.
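As a rough illustration of the element-presence check at the core of SEA, the sketch below queries a generic VQA model for each class-defining element and folds the results into a toy efficiency score; the element list, the `dandelin/vilt-b32-finetuned-vqa` model choice, the stroke-economy term, and the weighting are all hypothetical, not the paper's published formulation.

```python
# Hypothetical SEA-style score: check class-defining elements with a VQA
# model and reward semantic retention under stroke economy. Element lists,
# model choice, and scoring formula are illustrative assumptions only.
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

CLASS_ELEMENTS = {  # commonsense, class-defining elements (example entry)
    "cat": ["pointed ears", "whiskers", "a tail"],
}

def element_presence(sketch_path: str, cls: str) -> list[bool]:
    """Ask the VQA model whether each class-defining element is drawn."""
    present = []
    for element in CLASS_ELEMENTS[cls]:
        answer = vqa(image=sketch_path, question=f"Does the sketch show {element}?")
        present.append(answer[0]["answer"].lower().startswith("yes"))
    return present

def sea_score(sketch_path: str, cls: str, stroke_count: int, max_strokes: int = 50) -> float:
    """Toy abstraction-efficiency score: fraction of retained elements,
    discounted by how many strokes the drawing spends (visual economy)."""
    retention = sum(element_presence(sketch_path, cls)) / len(CLASS_ELEMENTS[cls])
    economy = 1.0 - min(stroke_count, max_strokes) / max_strokes
    return retention * (0.5 + 0.5 * economy)  # weighting is an assumption
```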
Endoscopic retrograde cholangiopancreatography (ERCP) remains the standard treatment for common bile duct (CBD) stone removal. However, current guidelines often lead to unnecessary procedures. This study aimed to develop and validate a machine learning model that uses synthetic data augmentation to improve CBD stone prediction. Electronic health records of patients with suspected CBD stones from three independent tertiary centers were analyzed (733 patients for internal validation, 348 for external validation). A large language model (LLM) generated curated synthetic data to augment the training set. The ExtraTrees classifier was selected after evaluating multiple algorithms. Model performance was assessed using AUROC, calibration curves, and decision curve analysis. Compared with existing clinical guidelines, the model substantially reduced unnecessary ERCPs while maintaining low false-negative rates and strong clinical utility.
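For orientation, a minimal sketch of the modeling and internal-validation step might look as follows, assuming a tabular EHR cohort already augmented with LLM-generated synthetic rows; the file name, feature layout, and hyperparameters are hypothetical, not the study's exact pipeline.

```python
# Minimal sketch of the modeling step: the LLM-based synthetic augmentation
# is abstracted away as a pre-built CSV, and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("cbd_cohort.csv")  # hypothetical real + synthetic cohort
X, y = df.drop(columns=["cbd_stone"]), df["cbd_stone"]
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=42)

# ExtraTrees was the selected classifier in the study; settings here are guesses.
clf = ExtraTreesClassifier(n_estimators=500, random_state=42)
clf.fit(X_train, y_train)

# Internal validation via AUROC, one of the reported evaluation measures.
auroc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
print(f"internal AUROC: {auroc:.3f}")
```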
High success rates on navigation-related tasks do not necessarily translate into reliable decision making by foundation models. To examine this gap, we evaluate current models on six diagnostic tasks spanning three settings: reasoning under complete spatial information, reasoning under incomplete spatial information, and reasoning under safety-relevant information. Our results show that current metrics may indicate good performance while failing to capture critical limitations of the models, underscoring the need for failure-focused analysis to understand model limitations and guide future progress. In a path-planning setting with unknown cells, GPT-5 achieved a high success rate of 93%; yet the failed cases exhibit fundamental limitations, e.g., a lack of the structural spatial understanding essential for navigation. We also find that newer models are not always more reliable than their predecessors in this regard. In reasoning under safety-relevant information, Gemini-2.5 Flash achieved only 67% on the challenging emergency-evacuation task, underperforming Gemini-2.0 Flash, which reached 100% under the same condition. Across all evaluations, models exhibited structural collapse, hallucinated reasoning, constraint violations, and unsafe decisions. These findings show that foundation models still exhibit substantial failures in navigation-related decision making and require fine-grained evaluation before they can be trusted.
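The failure-focused analysis described above can be made concrete with a toy diagnostic that labels how a model-proposed grid path fails, rather than recording a single pass/fail bit; the grid encoding and helper names below are assumptions for illustration.

```python
# Toy failure-focused check for the path-planning setting: verify each
# step of a model-proposed path against the grid so that constraint
# violations are recorded explicitly instead of a bare success flag.
FREE, WALL, UNKNOWN = ".", "#", "?"

def diagnose_path(grid: list[str], path: list[tuple[int, int]]) -> list[str]:
    """Return a list of failure labels for a path over a row-major grid."""
    failures = []
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        if abs(r0 - r1) + abs(c0 - c1) != 1:
            failures.append(f"non-adjacent jump {(r0, c0)} -> {(r1, c1)}")
        cell = grid[r1][c1]
        if cell == WALL:
            failures.append(f"wall violation at {(r1, c1)}")
        elif cell == UNKNOWN:
            failures.append(f"entered unexplored cell {(r1, c1)}")
    return failures

# e.g. diagnose_path(["..#", ".?.", "..."], [(0, 0), (1, 0), (1, 1)])
# -> ["entered unexplored cell (1, 1)"]
```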
As the online food-delivery market expanded rapidly after the COVID-19 pandemic, mobile ordering platforms became mainstream. However, the functions and user interfaces (UIs) of existing delivery services have largely been designed from the perspective of non-disabled users. Interviews with visually impaired users further revealed that the complexity of current delivery applications demands substantial time and effort even for routine ordering tasks. To address this problem, this paper presents BBlink, a user-experience-centered delivery platform designed for visually impaired users.
Accurate cardiac ultrasound segmentation is critical for reliable assessment of ventricular function in intelligent healthcare systems. However, echocardiographic images are inherently challenging due to low contrast, speckle noise, irregular anatomical boundaries, and significant domain shift across acquisition devices and patient populations. Existing methods, primarily driven by appearance-based learning, often struggle to maintain boundary precision and structural consistency under these conditions. To address these limitations, we propose a Contour-Guided Query Refinement Network (CGQR-Net) for boundary-aware cardiac ultrasound segmentation. The proposed framework performs effective information fusion by integrating multi-resolution feature representations with contour-derived structural priors. Specifically, a High-Resolution Network (HRNet) backbone preserves high-resolution spatial information while capturing multi-scale context. A coarse segmentation is first generated, from which anatomical contours are extracted and encoded into learnable query embeddings. These contour-guided queries interact with fused feature maps through cross-attention, enabling structure-aware refinement that enhances boundary delineation and suppresses noise-induced artifacts. In addition, a dual-head supervision strategy jointly optimizes segmentation and boundary predictions to enforce structural consistency. The proposed method is evaluated on the Cardiac Acquisitions for Multi-structure Ultrasound Segmentation (CAMUS) dataset and further validated on the CardiacNet dataset to assess cross-dataset generalization. Experimental results demonstrate that CGQR-Net achieves superior segmentation accuracy, improved boundary precision, and strong robustness across different imaging conditions. These findings highlight the effectiveness of integrating contour-level structural information with feature-level representations, providing a robust and generalizable solution for cardiac ultrasound segmentation in real-world clinical and consumer healthcare applications.
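To make the contour-guided query mechanism concrete, the following schematic PyTorch sketch embeds contour points as queries that cross-attend to fused feature maps; the layer sizes, contour encoder, and overall wiring are assumptions, not the published CGQR-Net configuration.

```python
# Schematic sketch of contour-guided query refinement: contour points from
# the coarse mask become queries that cross-attend to fused features.
import torch
import torch.nn as nn

class ContourQueryRefiner(nn.Module):
    def __init__(self, feat_dim: int = 256, n_heads: int = 8):
        super().__init__()
        self.point_embed = nn.Linear(2, feat_dim)  # (x, y) -> query embedding
        self.cross_attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, contour_pts: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        """contour_pts: (B, N, 2) normalized contour coordinates extracted
        from the coarse mask; feats: (B, C, H, W) fused multi-scale features."""
        q = self.point_embed(contour_pts)         # (B, N, C) contour queries
        kv = feats.flatten(2).transpose(1, 2)     # (B, H*W, C) feature tokens
        refined, _ = self.cross_attn(q, kv, kv)   # queries attend to features
        return self.norm(q + refined)             # structure-aware queries

# The refined queries would then steer a boundary-aware refinement head.
refiner = ContourQueryRefiner()
queries = refiner(torch.rand(1, 64, 2), torch.rand(1, 256, 32, 32))
```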
Although recent advances in diffusion models have significantly improved the quality of generated images, challenges remain in synthesizing pixel-based human-drawn sketches, a representative form of abstract expression. To address these challenges, we propose StableSketcher, a novel framework that enables diffusion models to generate hand-drawn sketches with high prompt fidelity. Within this framework, we fine-tune the variational autoencoder to optimize latent decoding so that it better captures the characteristics of sketches. In parallel, we integrate a new reinforcement-learning reward function based on visual question answering, which improves text-image alignment and semantic consistency. Extensive experiments demonstrate that StableSketcher generates sketches with improved stylistic fidelity and better prompt alignment than the Stable Diffusion baseline. Additionally, we introduce SketchDUO, to the best of our knowledge the first dataset comprising instance-level sketches paired with captions and question-answer pairs, addressing the limitations of existing datasets that rely on image-label pairs.
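As an illustration of a VQA-based reinforcement-learning reward of the kind described here, the sketch below scores a generated image by the fraction of prompt-grounded questions a VQA model answers as expected; the QA construction, model choice, and reward shaping are illustrative rather than StableSketcher's exact recipe.

```python
# Hedged sketch of a VQA-based RL reward: generated sketches score higher
# when a VQA model answers prompt-derived questions as expected.
from PIL import Image
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

def vqa_reward(image: Image.Image, qa_pairs: list[tuple[str, str]]) -> float:
    """Fraction of prompt-grounded questions the VQA model answers as
    expected; usable as an alignment reward during fine-tuning."""
    correct = 0
    for question, expected in qa_pairs:
        answer = vqa(image=image, question=question)[0]["answer"]
        correct += int(answer.strip().lower() == expected.strip().lower())
    return correct / len(qa_pairs)

# e.g. for the prompt "a cat sitting on a chair":
# vqa_reward(img, [("What animal is shown?", "cat"), ("Is it sitting?", "yes")])
```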
Bot Intelligence Group, Robotics Institute, Carnegie Mellon University
Supervisor: Jean Oh
Machine Learning Lab, Dongguk University
Supervisor: Jihie Kim
Cipherome Inc., Technology Team
AI Lab, Dongguk University
Supervisor: Gijoo Yang
M2M Lab, Purdue University
Supervisor: Eric T. Matson
Designed a VLM-driven, director-in-the-loop workflow for stop-motion filmmaking using a robot arm (xArm 6).