I am a fourth-year PhD student at Boston University, advised by Prof. Eshed Ohn-Bar.
Prior to BU, I worked with Dr. Dong Huang as a research assistant at the Human
Sensing Lab. I received my master's degree (2016-2018) from the Robotics Institute of Carnegie Mellon University, where I worked with Prof. Kris Kitani on pedestrian detection for a low-profile robot.
We introduce FeD, a highly efficient MLLM-based sensorimotor driving model, enabled by three key improvements: 1) language-based feedback refinement, trained using auto-generated feedback data; 2) training the model via distillation from a privileged agent with a Bird's Eye View (BEV) of the scene, allowing our model to robustly use only RGB data at test time; 3) predicting driving waypoints in a masked-token fashion from the waypoint tokens' internal representations, i.e., without relying on a slow, sequential generative process.
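A minimal PyTorch sketch of the masked-token waypoint decoding idea: learnable waypoint query tokens are appended to the backbone's output sequence, and their final hidden states are regressed to (x, y) waypoints in a single forward pass instead of autoregressive generation. All module names and sizes here are illustrative assumptions, not FeD's actual implementation.

```python
import torch
import torch.nn as nn

class MaskedWaypointHead(nn.Module):
    def __init__(self, d_model=512, num_waypoints=4, num_layers=2, nhead=8):
        super().__init__()
        # Learnable "masked" waypoint query tokens, one per future waypoint (assumed design).
        self.waypoint_queries = nn.Parameter(torch.randn(num_waypoints, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.regressor = nn.Linear(d_model, 2)  # (x, y) per waypoint

    def forward(self, context_tokens):
        # context_tokens: (B, T, d_model) hidden states from the (M)LLM backbone.
        B = context_tokens.size(0)
        queries = self.waypoint_queries.unsqueeze(0).expand(B, -1, -1)
        hidden = self.encoder(torch.cat([context_tokens, queries], dim=1))
        # Read all waypoints from the query positions at once: no sequential decoding.
        return self.regressor(hidden[:, -queries.size(1):])

# Example: 4 future waypoints predicted in a single pass.
head = MaskedWaypointHead()
waypoints = head(torch.randn(2, 64, 512))   # -> (2, 4, 2)
```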
We propose XVO, a semi-supervised learning method for training generalized monocular Visual Odometry (VO) models with
robust off-the-shelf operation across diverse datasets and settings. We empirically demonstrate the benefits of semi-supervised
training for learning a general-purpose direct VO regression network. Moreover, we demonstrate that multi-modal supervision, including segmentation, flow, depth, and audio auxiliary prediction tasks, facilitates generalized representations for the VO task.
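The multi-task idea behind XVO can be sketched as a shared encoder feeding a direct pose-regression head plus auxiliary heads (segmentation, flow, depth, audio) whose losses regularize the shared representation. The architecture, head resolutions, and loss weights below are illustrative assumptions, not the paper's exact network.

```python
import torch
import torch.nn as nn

class MultiTaskVO(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # Shared encoder over a stacked pair of consecutive RGB frames (2 x 3 channels).
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Main head: direct regression of relative pose (translation + rotation).
        self.pose_head = nn.Linear(feat_dim, 6)
        # Auxiliary heads (coarse, for illustration only).
        self.aux_heads = nn.ModuleDict({
            "seg": nn.Linear(feat_dim, 32 * 32),
            "flow": nn.Linear(feat_dim, 32 * 32 * 2),
            "depth": nn.Linear(feat_dim, 32 * 32),
            "audio": nn.Linear(feat_dim, 128),
        })

    def forward(self, frame_pair):
        feat = self.encoder(frame_pair)
        out = {"pose": self.pose_head(feat)}
        out.update({name: head(feat) for name, head in self.aux_heads.items()})
        return out

def total_loss(outputs, targets, aux_weight=0.1):
    # Pose regression loss plus down-weighted auxiliary prediction losses.
    loss = nn.functional.mse_loss(outputs["pose"], targets["pose"])
    for name in ("seg", "flow", "depth", "audio"):
        loss = loss + aux_weight * nn.functional.mse_loss(outputs[name], targets[name])
    return loss
```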
We propose a novel knowledge distillation framework for effectively teaching a sensorimotor student agent to drive from the
supervision of a privileged teacher agent. Our approach yields a robust image-based behavior cloning
agent in CARLA, improving over current models by over 20.6% in driving score without requiring LiDAR, historical observations,
an ensemble of models, on-policy data aggregation, or reinforcement learning.
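A minimal sketch of privileged distillation under assumed interfaces: a frozen teacher that consumes privileged (e.g., BEV/ground-truth) state supervises an image-only student, which imitates the teacher's action outputs. Shapes, losses, and the simple MLP stand-ins are illustrative, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 2))                 # privileged state -> action
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU(), nn.Linear(256, 2))  # RGB -> action

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

def distill_step(rgb, privileged_state):
    with torch.no_grad():                  # teacher is frozen during distillation
        target_action = teacher(privileged_state)
    pred_action = student(rgb)             # student only ever sees RGB
    loss = nn.functional.l1_loss(pred_action, target_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# At test time only the RGB student is deployed; no LiDAR or privileged state is needed.
loss = distill_step(torch.randn(8, 3, 64, 64), torch.randn(8, 128))
```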
We introduce a novel vision-and-language navigation (VLN) task of learning to provide real-time guidance to a blind follower situated
in complex dynamic navigation scenarios. We collect a multi-modal real-world benchmark with in-situ Orientation and Mobility (O&M)
instructional guidance. We leverage the real-world study to inform the design of a larger-scale simulation benchmark. Finally, we present ASSISTER, an imitation-learned agent that can deliver such effective guidance.
We introduce SelfD, a framework for learning scalable driving by utilizing large amounts of online monocular images. Our key idea is to leverage iterative semi-supervised training when learning imitative agents from unlabeled data. We employ a large dataset of publicly available YouTube videos to train SelfD and comprehensively analyze its generalization benefits across challenging navigation scenarios. SelfD demonstrates consistent improvements (up to 24%) in driving performance when evaluated on nuScenes, Argoverse, Waymo, and CARLA.
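A toy sketch of the iterative semi-supervised (pseudo-labeling) loop that "iterative semi-supervised training" refers to: a policy trained on labeled data annotates unlabeled web-video frames, and the pseudo-labeled data is folded back into training. The model, data, and loop below are simplified placeholders; SelfD's actual pipeline includes additional filtering and refinement of the pseudo-labels.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2))   # toy image-to-action policy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_epoch(dataset):
    model.train()
    for frames, actions in dataset:
        loss = nn.functional.mse_loss(model(frames), actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def pseudo_label(unlabeled_batches):
    # Label unlabeled web-video frames with the current policy.
    model.eval()
    with torch.no_grad():
        return [(frames, model(frames)) for frames in unlabeled_batches]

# Iterative semi-supervised training: alternate supervised training and
# re-labeling of the unlabeled pool with the improved model.
labeled = [(torch.randn(8, 3, 64, 64), torch.randn(8, 2))]        # stand-in for labeled data
unlabeled = [torch.randn(8, 3, 64, 64)]                            # stand-in for YouTube frames
train_epoch(labeled)
for _ in range(3):
    train_epoch(labeled + pseudo_label(unlabeled))
```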
We introduce X-World, an accessibility-centered development environment for vision-based autonomous systems, which enables spawning dynamic agents with various mobility aids. The simulation supports generation of ample amounts of finely annotated, multi-modal data in a safe, cheap, and privacy-preserving manner. We highlight novel difficulties introduced by our benchmark and tasks, as well as opportunities for future developments. We further validate and extend our analysis by introducing a complementary real-world evaluation benchmark.
We propose the Learning by Watching (LbW) framework, which enables learning a driving policy without requiring full knowledge of either the state or the expert actions. To augment its training data with new perspectives and maneuvers, LbW makes use of the demonstrations of other vehicles in a given scene by (1) transforming the ego-vehicle's observations to their points of view, and (2) inferring their expert actions. Our LbW agent learns more robust driving policies while enabling data-efficient learning, including quick adaptation of the policy to rare and novel scenarios.
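A minimal sketch of the two LbW steps for one observed vehicle: (1) re-express the ego observation (here, a toy BEV point set) in that vehicle's reference frame, and (2) infer its "expert" action from its observed motion via a simple inverse-dynamics heuristic. The geometry and the speed/yaw-rate action parameterization are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def to_other_vehicle_frame(points_ego, other_pose):
    """Transform 2D points from the ego frame into another vehicle's frame.
    other_pose = (x, y, yaw) of the other vehicle, expressed in the ego frame."""
    x, y, yaw = other_pose
    c, s = np.cos(-yaw), np.sin(-yaw)
    rot = np.array([[c, -s], [s, c]])
    return (points_ego - np.array([x, y])) @ rot.T

def infer_action(positions, yaws, dt=0.1):
    """Rough inverse dynamics: recover speed and a steering-like yaw rate
    from the other vehicle's consecutive observed poses."""
    speed = np.linalg.norm(positions[1] - positions[0]) / dt
    yaw_rate = (yaws[1] - yaws[0]) / dt
    return speed, yaw_rate

# Example: view a toy ego BEV observation from another vehicle's perspective,
# then infer that vehicle's action from two consecutive observed poses.
bev_points = np.random.rand(100, 2) * 40 - 20
other_view = to_other_vehicle_frame(bev_points, other_pose=(8.0, 2.0, np.pi / 6))
action = infer_action(np.array([[8.0, 2.0], [8.9, 2.4]]), yaws=np.array([np.pi / 6, np.pi / 5]))
```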
We design an end-to-end DNN tracking approach, Flow-Fuse-Tracker (FFT), consisting of two efficient techniques: target flowing and target fusing. In target flowing, a FlowTracker DNN module jointly learns the motions of an indefinite number of targets from pixel-level optical flow. In target fusing, a FuseTracker DNN module refines and fuses targets proposed by FlowTracker and by frame-wise object detection, rather than trusting either of the two inaccurate sources of target proposals.
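The two FFT stages can be illustrated with a simplified, non-learned analogue: "target flowing" propagates each previous-frame box using the average optical flow inside it, and "target fusing" merges the flow-propagated boxes with the current frame's detections by IoU matching rather than trusting either source alone. This is only a hand-written stand-in for the FlowTracker/FuseTracker DNN modules, under assumed box and flow formats.

```python
import numpy as np

def flow_targets(boxes, flow):
    """boxes: (N, 4) as [x1, y1, x2, y2]; flow: (H, W, 2) dense optical flow."""
    moved = []
    for x1, y1, x2, y2 in boxes.astype(int):
        dx, dy = flow[y1:y2, x1:x2].reshape(-1, 2).mean(axis=0)  # mean flow inside the box
        moved.append([x1 + dx, y1 + dy, x2 + dx, y2 + dy])
    return np.array(moved)

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-6)

def fuse_targets(flowed, detections, iou_thresh=0.5):
    """Keep a detection box when it matches a flow-propagated target; otherwise
    fall back to the propagated box (a crude stand-in for FuseTracker's refinement)."""
    fused = []
    for t in flowed:
        matches = [d for d in detections if iou(t, d) > iou_thresh]
        fused.append(matches[0] if matches else t)
    return np.array(fused)

# Example: propagate two boxes with a uniform rightward flow and fuse with one detection.
flow = np.zeros((120, 160, 2)); flow[..., 0] = 3.0
prev_boxes = np.array([[10, 10, 40, 50], [60, 20, 100, 80]], dtype=float)
tracks = fuse_targets(flow_targets(prev_boxes, flow),
                      detections=np.array([[14, 10, 44, 50]], dtype=float))
```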