Visiting Research Scholar
Bot Intelligence Group, Robotics Institute, Carnegie Mellon University
Supervisor: Jean Oh
M.S./Ph.D. Student in Artificial Intelligence, Dongguk University
I build and evaluate trustworthy multimodal AI systems at the intersection of vision, language, and embodied interaction. My goal is to move from AI model evaluation toward human-centered interactive and robotic systems, with a focus on reliable decision making in generative and foundation models deployed in real-world settings.
sieunchoi@dgu.ac.kr
A sketch is a distilled form of visual abstraction that conveys core concepts through simplified yet purposeful strokes while omitting extraneous detail. Despite this expressive power, quantifying the efficiency of semantic abstraction in sketches remains challenging. Existing evaluation methods that rely on reference images, low-level visual features, or recognition accuracy fail to capture abstraction, the defining property of sketches. To address these limitations, we introduce SEA (Sketch Evaluation metric for Abstraction efficiency), a reference-free metric that assesses how economically a sketch represents class-defining visual elements while preserving semantic recognizability. These elements are derived per class from commonsense knowledge about features typically depicted in sketches. SEA leverages a visual question answering model to determine the presence of each element and returns a quantitative score reflecting semantic retention under visual economy. To support this metric, we present CommonSketch, the first semantically annotated sketch dataset, comprising 23,100 human-drawn sketches across 300 classes, each paired with a caption and element-level annotations. Experiments show that SEA aligns closely with human judgments and reliably discriminates levels of abstraction efficiency, while CommonSketch serves as a benchmark for systematically evaluating element-level sketch understanding across vision-language models.
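As a rough illustration of the element-presence check at the core of SEA, the sketch below queries a generic VQA model for each class-defining element and folds the results into a toy efficiency score; the element list, the `dandelin/vilt-b32-finetuned-vqa` model choice, the stroke-economy term, and the weighting are all hypothetical, not the paper's published formulation.

```python
# Hypothetical SEA-style score: check class-defining elements with a VQA
# model and reward semantic retention under stroke economy. Element lists,
# model choice, and scoring formula are illustrative assumptions only.
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

CLASS_ELEMENTS = {  # commonsense, class-defining elements (example entry)
    "cat": ["pointed ears", "whiskers", "a tail"],
}

def element_presence(sketch_path: str, cls: str) -> list[bool]:
    """Ask the VQA model whether each class-defining element is drawn."""
    present = []
    for element in CLASS_ELEMENTS[cls]:
        answer = vqa(image=sketch_path, question=f"Does the sketch show {element}?")
        present.append(answer[0]["answer"].lower().startswith("yes"))
    return present

def sea_score(sketch_path: str, cls: str, stroke_count: int, max_strokes: int = 50) -> float:
    """Toy abstraction-efficiency score: fraction of retained elements,
    discounted by how many strokes the drawing spends (visual economy)."""
    retention = sum(element_presence(sketch_path, cls)) / len(CLASS_ELEMENTS[cls])
    economy = 1.0 - min(stroke_count, max_strokes) / max_strokes
    return retention * (0.5 + 0.5 * economy)  # weighting is an assumption
```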
Endoscopic retrograde cholangiopancreatography (ERCP) remains the standard treatment for common bile duct (CBD) stone removal. However, current guidelines often lead to unnecessary procedures. This study aimed to develop and validate a machine learning model that uses synthetic data augmentation to improve CBD stone prediction. Electronic health records of patients with suspected CBD stones from three independent tertiary centers were analyzed (733 patients for internal validation, 348 for external validation). A large language model (LLM) generated curated synthetic data to augment the training set. The ExtraTrees classifier was selected after evaluating multiple algorithms. Model performance was assessed using AUROC, calibration curves, and decision curve analysis. Compared with existing clinical guidelines, the model substantially reduced unnecessary ERCPs while maintaining low false-negative rates and strong clinical utility.
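For orientation, a minimal sketch of the modeling and internal-validation step might look as follows, assuming a tabular EHR cohort already augmented with LLM-generated synthetic rows; the file name, feature layout, and hyperparameters are hypothetical, not the study's exact pipeline.

```python
# Minimal sketch of the modeling step: the LLM-based synthetic augmentation
# is abstracted away as a pre-built CSV, and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("cbd_cohort.csv")  # hypothetical real + synthetic cohort
X, y = df.drop(columns=["cbd_stone"]), df["cbd_stone"]
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=42)

# ExtraTrees was the selected classifier in the study; settings here are guesses.
clf = ExtraTreesClassifier(n_estimators=500, random_state=42)
clf.fit(X_train, y_train)

# Internal validation via AUROC, one of the reported evaluation measures.
auroc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
print(f"internal AUROC: {auroc:.3f}")
```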
High success rates on navigation-related tasks do not necessarily translate into reliable decision making by foundation models. To examine this gap, we evaluate current models on six diagnostic tasks spanning three settings: reasoning under complete spatial information, reasoning under incomplete spatial information, and reasoning under safety-relevant information. Our results show that current metrics may indicate good performance while failing to capture critical limitations of the models, underscoring the need for failure-focused analysis to understand model limitations and guide future progress. In a path-planning setting with unknown cells, GPT-5 achieved a high success rate of 93%; yet the failed cases exhibit fundamental limitations, e.g., a lack of the structural spatial understanding essential for navigation. We also find that newer models are not always more reliable than their predecessors in this regard. In reasoning under safety-relevant information, Gemini-2.5 Flash achieved only 67% on the challenging emergency-evacuation task, underperforming Gemini-2.0 Flash, which reached 100% under the same condition. Across all evaluations, models exhibited structural collapse, hallucinated reasoning, constraint violations, and unsafe decisions. These findings show that foundation models still exhibit substantial failures in navigation-related decision making and require fine-grained evaluation before they can be trusted.
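The failure-focused analysis described above can be made concrete with a toy diagnostic that labels how a model-proposed grid path fails, rather than recording a single pass/fail bit; the grid encoding and helper names below are assumptions for illustration.

```python
# Toy failure-focused check for the path-planning setting: verify each
# step of a model-proposed path against the grid so that constraint
# violations are recorded explicitly instead of a bare success flag.
FREE, WALL, UNKNOWN = ".", "#", "?"

def diagnose_path(grid: list[str], path: list[tuple[int, int]]) -> list[str]:
    """Return a list of failure labels for a path over a row-major grid."""
    failures = []
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        if abs(r0 - r1) + abs(c0 - c1) != 1:
            failures.append(f"non-adjacent jump {(r0, c0)} -> {(r1, c1)}")
        cell = grid[r1][c1]
        if cell == WALL:
            failures.append(f"wall violation at {(r1, c1)}")
        elif cell == UNKNOWN:
            failures.append(f"entered unexplored cell {(r1, c1)}")
    return failures

# e.g. diagnose_path(["..#", ".?.", "..."], [(0, 0), (1, 0), (1, 1)])
# -> ["entered unexplored cell (1, 1)"]
```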
As the online food-delivery market expanded rapidly after the COVID-19 pandemic, mobile ordering platforms became mainstream. However, the functions and user interfaces (UIs) of existing delivery services have largely been designed from the perspective of non-disabled users. Interviews with visually impaired users further revealed that the complexity of current delivery applications demands substantial time and effort even for routine ordering tasks. To address this problem, this paper presents BBlink, a user-experience-centered delivery platform designed for visually impaired users.
Accurate cardiac ultrasound segmentation is critical for reliable assessment of ventricular function in intelligent healthcare systems. However, echocardiographic images are inherently challenging due to low contrast, speckle noise, irregular anatomical boundaries, and significant domain shift across acquisition devices and patient populations. Existing methods, primarily driven by appearance-based learning, often struggle to maintain boundary precision and structural consistency under these conditions. To address these limitations, we propose a Contour-Guided Query Refinement Network (CGQR-Net) for boundary-aware cardiac ultrasound segmentation. The proposed framework performs effective information fusion by integrating multi-resolution feature representations with contour-derived structural priors. Specifically, a High-Resolution Network (HRNet) backbone preserves high-resolution spatial information while capturing multi-scale context. A coarse segmentation is first generated, from which anatomical contours are extracted and encoded into learnable query embeddings. These contour-guided queries interact with fused feature maps through cross-attention, enabling structure-aware refinement that enhances boundary delineation and suppresses noise-induced artifacts. In addition, a dual-head supervision strategy jointly optimizes segmentation and boundary predictions to enforce structural consistency. The proposed method is evaluated on the Cardiac Acquisitions for Multi-structure Ultrasound Segmentation (CAMUS) dataset and further validated on the CardiacNet dataset to assess cross-dataset generalization. Experimental results demonstrate that CGQR-Net achieves superior segmentation accuracy, improved boundary precision, and strong robustness across different imaging conditions. These findings highlight the effectiveness of integrating contour-level structural information with feature-level representations, providing a robust and generalizable solution for cardiac ultrasound segmentation in real-world clinical and consumer healthcare applications.
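To make the contour-guided query mechanism concrete, the following schematic PyTorch sketch embeds contour points as queries that cross-attend to fused feature maps; the layer sizes, contour encoder, and overall wiring are assumptions, not the published CGQR-Net configuration.

```python
# Schematic sketch of contour-guided query refinement: contour points from
# the coarse mask become queries that cross-attend to fused features.
import torch
import torch.nn as nn

class ContourQueryRefiner(nn.Module):
    def __init__(self, feat_dim: int = 256, n_heads: int = 8):
        super().__init__()
        self.point_embed = nn.Linear(2, feat_dim)  # (x, y) -> query embedding
        self.cross_attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, contour_pts: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        """contour_pts: (B, N, 2) normalized contour coordinates extracted
        from the coarse mask; feats: (B, C, H, W) fused multi-scale features."""
        q = self.point_embed(contour_pts)         # (B, N, C) contour queries
        kv = feats.flatten(2).transpose(1, 2)     # (B, H*W, C) feature tokens
        refined, _ = self.cross_attn(q, kv, kv)   # queries attend to features
        return self.norm(q + refined)             # structure-aware queries

# The refined queries would then steer a boundary-aware refinement head.
refiner = ContourQueryRefiner()
queries = refiner(torch.rand(1, 64, 2), torch.rand(1, 256, 32, 32))
```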
Although recent advances in diffusion models have significantly improved the quality of generated images, challenges remain in synthesizing pixel-based human-drawn sketches, a representative form of abstract expression. To address these challenges, we propose StableSketcher, a novel framework that enables diffusion models to generate hand-drawn sketches with high prompt fidelity. Within this framework, we fine-tune the variational autoencoder to optimize latent decoding so that it better captures the characteristics of sketches. In parallel, we integrate a new reinforcement-learning reward function based on visual question answering, which improves text-image alignment and semantic consistency. Extensive experiments demonstrate that StableSketcher generates sketches with improved stylistic fidelity and better prompt alignment than the Stable Diffusion baseline. Additionally, we introduce SketchDUO, to the best of our knowledge the first dataset comprising instance-level sketches paired with captions and question-answer pairs, addressing the limitations of existing datasets that rely on image-label pairs.
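As an illustration of a VQA-based reinforcement-learning reward of the kind described here, the sketch below scores a generated image by the fraction of prompt-grounded questions a VQA model answers as expected; the QA construction, model choice, and reward shaping are illustrative rather than StableSketcher's exact recipe.

```python
# Hedged sketch of a VQA-based RL reward: generated sketches score higher
# when a VQA model answers prompt-derived questions as expected.
from PIL import Image
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

def vqa_reward(image: Image.Image, qa_pairs: list[tuple[str, str]]) -> float:
    """Fraction of prompt-grounded questions the VQA model answers as
    expected; usable as an alignment reward during fine-tuning."""
    correct = 0
    for question, expected in qa_pairs:
        answer = vqa(image=image, question=question)[0]["answer"]
        correct += int(answer.strip().lower() == expected.strip().lower())
    return correct / len(qa_pairs)

# e.g. for the prompt "a cat sitting on a chair":
# vqa_reward(img, [("What animal is shown?", "cat"), ("Is it sitting?", "yes")])
```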
Bot Intelligence Group, Robotics Institute, Carnegie Mellon University
Supervisor: Jean Oh
Machine Learning Lab, Dongguk University
Supervisor: Jihie Kim
Cipherome Inc., Technology Team
AI Lab, Dongguk University
Supervisor: Gijoo Yang
M2M Lab, Purdue University
Supervisor: Eric T. Matson
Designed a VLM-driven, director-in-the-loop workflow for stop-motion filmmaking using a robot arm (xArm 6).