XL2 &5TypeV2ObjIDDDirkAyF a7EcAlgoEcMEcNEcBSizeEcIndexEcDistCSumAlgoPartNumsPartETags d9cfc5e4ac15f86e282dca413716a79cPartSizesPartASizesPartIdxSizeMTime5MetaSysx-rustfs-internal-inline-datatruex-minio-internal-inline-datatrueMetaUsretag d9cfc5e4ac15f86e282dca413716a79ccontent-typeapplication/x-texv|I null `.w%:^vZQS0%% %% This is file `sample-sigconf-authordraft.tex', %% generated with the docstrip utility. %% %% The original source files were: %% %% samples.dtx (with options: `all,proceedings,bibtex,authordraft') %% %% IMPORTANT NOTICE: %% %% For the copyright see the source file. %% %% Any modified versions of this file must be renamed %% with new filenames distinct from sample-sigconf-authordraft.tex. %% %% For distribution of the original source see the terms %% for copying and modification in the file samples.dtx. %% %% This generated file may be distributed as long as the %% original source files, as listed above, are part of the %% same distribution. (The sources need not necessarily be %% in the same archive or directory.) %% %% %% Commands for TeXCount %TC:macro \cite [option:text,text] %TC:macro \citep [option:text,text] %TC:macro \citet [option:text,text] %TC:envir table 0 1 %TC:envir table* 0 1 %TC:envir tabular [ignore] word %TC:envir displaymath 0 word %TC:envir math 0 word %TC:envir comment 0 0 %% %% The first command in your LaTeX source must be the \documentclass %% command. %% %% For submission and review of your manuscript please change the %% command to \documentclass[manuscript, screen, review]{acmart}. %% %% When submitting camera ready or to TAPS, please change the command %% to \documentclass[sigconf]{acmart} or whichever template is required %% for your publication. %% %% \documentclass[sigconf, screen, review, anonymous]{acmart} \usepackage{multirow} \usepackage{algorithmic} \usepackage{algorithm} \usepackage{colortbl} \usepackage{wrapfig} \usepackage{hyperref} \usepackage{cleveref} %% %% \BibTeX command to typeset BibTeX logo in the docs \AtBeginDocument{% \providecommand\BibTeX{{% Bib\TeX}}} %% Rights management information. This information is sent to you %% when you complete the rights form. These commands have SAMPLE %% values in them; it is your responsibility as an author to replace %% the commands and values with those provided to you when you %% complete the rights form. \setcopyright{acmlicensed} \copyrightyear{2018} \acmYear{2018} \acmDOI{XXXXXXX.XXXXXXX} %% These commands are for a PROCEEDINGS abstract or paper. \acmConference[Conference acronym 'XX]{Make sure to enter the correct conference title from your rights confirmation email}{June 03--05, 2018}{Woodstock, NY} %% %% Uncomment \acmBooktitle if the title of the proceedings is different %% from ``Proceedings of ...''! %% %%\acmBooktitle{Woodstock '18: ACM Symposium on Neural Gaze Detection, %% June 03--05, 2018, Woodstock, NY} \acmISBN{978-1-4503-XXXX-X/2018/06} %% %% Submission ID. %% Use this when submitting an article to a sponsored event. You'll %% receive a unique submission ID from the organizers %% of the event, and this ID should be used as the parameter to this command. %%\acmSubmissionID{123-A56-BU3} %% %% For managing citations, it is recommended to use bibliography %% files in BibTeX format. %% %% You can then either use BibTeX with the ACM-Reference-Format style, %% or BibLaTeX with the acmnumeric or acmauthoryear sytles, that include %% support for advanced citation of software artefact from the %% biblatex-software package, also separately available on CTAN. %% %% Look at the sample-*-biblatex.tex files for templates showcasing %% the biblatex styles. %% %% %% The majority of ACM publications use numbered citations and %% references. The command \citestyle{authoryear} switches to the %% "author year" style. %% %% If you are preparing content for an event %% sponsored by ACM SIGGRAPH, you must use the "author year" style of %% citations and references. %% Uncommenting %% the next command will enable that style. %%\citestyle{acmauthoryear} %% %% end of the preamble, start of the body of the document source. \begin{document} %% %% The "title" command has an optional parameter, %% allowing the author to define a "short title" to be used in page headers. \title{SonoSelect: Efficient Ultrasound Perception via \\ Active Probe Exploration } %% %% The "author" command and its associated commands are used to define %% the authors and their affiliations. %% Of note is the shared affiliation of the first two authors, and the %% "authornote" and "authornotemark" commands %% used to denote shared contribution to the research. \author{Ben Trovato} \authornote{Both authors contributed equally to this research.} \email{trovato@corporation.com} \orcid{1234-5678-9012} \author{G.K.M. Tobin} \authornotemark[1] \email{webmaster@marysville-ohio.com} \affiliation{% \institution{Institute for Clarity in Documentation} \city{Dublin} \state{Ohio} \country{USA} } \author{Lars Th{\o}rv{\"a}ld} \affiliation{% \institution{The Th{\o}rv{\"a}ld Group} \city{Hekla} \country{Iceland}} \email{larst@affiliation.org} \author{Valerie B\'eranger} \affiliation{% \institution{Inria Paris-Rocquencourt} \city{Rocquencourt} \country{France} } \author{Aparna Patel} \affiliation{% \institution{Rajiv Gandhi University} \city{Doimukh} \state{Arunachal Pradesh} \country{India}} \author{Huifen Chan} \affiliation{% \institution{Tsinghua University} \city{Haidian Qu} \state{Beijing Shi} \country{China}} \author{Charles Palmer} \affiliation{% \institution{Palmer Research Laboratories} \city{San Antonio} \state{Texas} \country{USA}} \email{cpalmer@prl.com} \author{John Smith} \affiliation{% \institution{The Th{\o}rv{\"a}ld Group} \city{Hekla} \country{Iceland}} \email{jsmith@affiliation.org} \author{Julius P. Kumquat} \affiliation{% \institution{The Kumquat Consortium} \city{New York} \country{USA}} \email{jpkumquat@consortium.net} %% %% By default, the full list of authors will be used in the page %% headers. Often, this list is too long, and will overlap %% other information printed in the page headers. This command allows %% the author to define a more concise list %% of authors' names for this purpose. \renewcommand{\shortauthors}{Trovato et al.} %% %% The abstract is a short summary of the work to be presented in the %% article. \begin{abstract} Ultrasound perception typically requires multiple scan views through probe movement to reduce diagnostic ambiguity, mitigate acoustic occlusions, and improve anatomical coverage. However, not all probe views are equally informative. Exhaustively acquiring a large number of views can introduce substantial redundancy, increase scanning and processing costs. To address this, we define an active view exploration task for ultrasound and propose SonoSelect, an ultrasound-specific method that adaptively guides probe movement based on current observations. Specifically, we cast ultrasound active view exploration as a sequential decision-making problem. Each new 2D ultrasound view is fused into a 3D spatial memory of the observed anatomy, which guides the next probe position. On top of this formulation, we propose an ultrasound-specific objective that favors probe movements with greater organ coverage, lower reconstruction uncertainty, and less redundant scanning. Experiments on our ultrasound simulator show that SonoSelect achieves promising multi-view organ classification accuracy using only 2 out of N views. Furthermore, for a more difficult kidney cyst detection task, it reaches 49.5\% kidney coverage and 30.7\% cyst coverage, with short trajectories consistently centered on the target cyst. % with more target-focused trajectories. \end{abstract} %% %% The code below is generated by the tool at http://dl.acm.org/ccs.cfm. %% Please copy and paste the code instead of the example below. %% \begin{CCSXML} 10010405.10010444.10010449 Applied computing~Health informatics 300 10010147.10010178.10010224 Computing methodologies~Computer vision 500 \end{CCSXML} \ccsdesc[300]{Applied computing~Health informatics} \ccsdesc[500]{Computing methodologies~Computer vision} %% %% Keywords. The author(s) should pick words that accurately describe %% the work being presented. Separate the keywords with commas. \keywords{Robotic Ultrasound, Multi-view Perception, View Selection} %% %% This command processes the author and affiliation and title %% information and builds the first part of the formatted document. \maketitle \section{Introduction} \label{sec:intro} % Medical ultrasound commonly acquire multiple scan views through probe motion to reduce diagnostic ambiguity (ultrasound 诊断一般需要多去实现view) In medical ultrasound, the acquisition of multiple scan views through probe motion is a common practice to reduce diagnostic ambiguity~\cite{wang2021deep, wu2020deep, zhou2021deep}. As a non-invasive and real-time imaging modality, ultrasound is essential in clinical diagnosis, yet remains highly view-dependent~\cite{jiang2023robotic, munir2025survey, elmekki2025comprehensive}. A single static image often fails to provide sufficient structural information due to anatomical occlusions and a limited field-of-view. Consequently, multi-view perception is necessary to improve anatomical coverage and reduce diagnostic uncertainty ~\cite{dai2021transmed}. % However, the use of multiple cameras comes at a high cost 但是有效去实现这个多view,目前是个难题,这个目前是个成本很高的事情(需要人工什么的) However, efficiently acquiring informative views remains challenging. In manual practice, repositioning the probe to find optimal planes relies on the experience of the operator, requiring repeated repositioning and real-time interpretation, which makes the process tedious and difficult to standardize ~\cite{jiang2023robotic, wang2021deep, munir2025survey}. To reduce this operator burden, existing methods attempt to automate probe navigation, yet most rely on uninformed geometric guidance or localized visual servoing ~\cite{peralta2020next}. These methods optimize probe motion based on immediate geometric or image-quality feedback without considering the broader anatomical context, causing the probe to repeatedly sample nearby regions with diminishing new information while missing viewpoints needed to resolve occlusions or cover unseen anatomy. A key question is then how to determine, from partial observations, which probe positions to acquire next to maximize diagnostic coverage within a limited scanning budget. \begin{figure}[t] \centering \includegraphics[width=\linewidth]{images/intro1.pdf} \caption{\textbf{Motivation of SonoSelect for efficient robotic ultrasound scanning.} \textbf{Left:} Uninformed exploration samples views redundantly within a local region. The target may remain hidden behind an overlying organ and outside the explored area. \textbf{Right:} SonoSelect autonomously selects the next scanning region based on diagnostic value, directing the probe to systematically explore and detect the target through informative views.} \label{fig:teaser} \end{figure} % 所以在我们这篇文章,我们希望把view selection这件事情能够自动化,代替人工的选取(这一段写fig 1的内容, 主要是写我们的任务定义), To automate view selection and reduce reliance on manual navigation, we introduce an active view exploration task. As illustrated in \cref{fig:teaser}, uninformed exploration methods tend to sample views redundantly within a local region, leaving the target organ unobserved when it lies beyond the explored area (\cref{fig:teaser},left). In contrast, active view exploration directs the probe toward regions estimated to contribute the most to diagnosis, systematically expanding the observed region rather than concentrating on locally accessible areas (\cref{fig:teaser}, right). This observation-driven view selection reduces redundant acquisition and increases the likelihood of obtaining the specific viewpoints needed for accurate diagnosis. % 因此,我们提出了SonoSelect,(简单介绍一下,也说下传统ppo的差别) We propose SonoSelect, a method that adaptively guides probe movement based on current observations to acquire informative ultrasound views. Specifically, we cast ultrasound active view exploration as a sequential decision-making problem. Each acquired 2D ultrasound view is fused into a 3D spatial memory of the observed anatomy, which guides the next probe position. We train SonoSelect with an ultrasound-specific objective that favors probe movements with greater organ coverage, lower reconstruction uncertainty, and less redundant scanning. The region selection and motion control modules are jointly optimized through a shared training scheme, keeping the overall framework compact. % 在多个ultrasound任务上,我们的效果很好(比ppo好) % 我们的效率也很高,(一般需要几个view就好了) Experiments on our ultrasound simulator demonstrate that SonoSelect achieves promising multi-view organ classification accuracy using only a small number of views, and reaches effective kidney and cyst coverage on a more challenging cyst detection task. The learned trajectories consistently center on the target anatomy, and SonoSelect maintains its performance when generalizing to unseen patient anatomies. % 未来的意义,我们这个是为了是为了实现Fully Autonomous Robotic Ultrasound system中的重要的一步 The proposed active view exploration approach has practical implications for both current clinical workflows and future automation. The learned policy can serve as a decision support tool in clinical ultrasound, suggesting informative scanning regions to assist sonographers. For robotic ultrasound systems, it can supply target coordinates to guide autonomous probe motion planning. Our contributions are: (1) We cast ultrasound active view exploration as a sequential decision-making problem and show, through a preliminary study, that a small number of adaptively chosen views can match or exceed exhaustive acquisition, motivating observation-driven view selection. (2) We propose SonoSelect, a framework that fuses each 2D view into a 3D spatial memory and uses an ultrasound-specific objective favoring organ coverage, low reconstruction uncertainty, and reduced redundancy to guide the next probe position, jointly optimized within a shared training scheme. (3) We evaluate SonoSelect on both a multi-view classification task and a continuous cyst detection task, demonstrating effective diagnostic coverage and robust generalization to unseen patient anatomies. \section{Related Work} \label{sec:related} \textbf{Ultrasound Perception: From Single-View to Multi-view.} Most existing research in robotic ultrasound focuses on improving the perception of individual 2D slices, such as organ segmentation, classification, and lesion detection ~\cite{jiang2023robotic, huang2023review}. While these methods have achieved high accuracy in controlled planes, they inherently suffer from the limited field-of-view and acoustic occlusions characteristic of single-view ultrasound ~\cite{jiang2025towards}. To acquire more comprehensive observations, autonomous scanning systems have been developed, but they typically follow predefined trajectories or optimize for local image quality and acoustic coupling through force-aware control ~\cite{chatelain2017confidence, ning2023inverse, tirindelli2020force}, without selecting views based on their diagnostic contribution. Recent learning-based advances have attempted to navigate toward target views from local observations ~\cite{hase2020ultrasound, jiang2024intelligent}. However, these approaches primarily focus on acquiring a single predefined standard plane rather than comprehensive 3D perception. Other recent works explore constraint-aware safe exploration ~\cite{duan2024safe} or build tissue-view maps for specific structures ~\cite{su2025tissue}, but these efforts remain limited to local path planning for individual anatomical targets. As a result, the problem of sequentially selecting views to build up a comprehensive 3D understanding of the scanned regionremains open. We formulate it as active multi-view exploration, where the agent selects a sequence of views to maximize diagnostic coverage across the full scanning region rather than navigating to a single target plane. \textbf{Viewpoint Selection: From Computer Vision to Ultrasound.} The concept of Next-Best-View (NBV) planning was originally established in the computer vision community to solve 3D reconstruction and active localization for objects using RGB or RGB-D sensors ~\cite{isler2016information, di2024learning}. These methods, ranging from classical information-theoretic entropy reduction ~\cite{isler2016information} to recent learning-based active vision policies ~\cite{chen2024gennbv, feng2024naruto, xue2024neural}, operate under free-space assumptions: the sensor can be positioned at arbitrary viewpoints around the object, and the imaging process follows predictable optical properties such as known projection geometry and consistent illumination. Translating NBV principles into the ultrasound domain introduces physical challenges that violate these assumptions. The probe is constrained to maintain continuous contact with the body surface, limiting the set of reachable viewpoints. Acoustic shadowing caused by bone or gas occludes structures that would be visible from other orientations, and signal-dependent speckle noise reduces the reliability of pixel-level uncertainty estimates. Together, these factors mean that purely uncertainty-driven exploration strategies, which perform well under free-space conditions, can be misled by imaging artifacts in ultrasound. Our work adapts the core idea of NBV planning, selecting the next observation to maximize diagnostic gain, to the contact-constrained ultrasound setting. Rather than relying on geometric uncertainty alone, the agent learns to account for anatomical context when choosing where to scan next. We train the exploration policy in high-throughput simulators that support robotic ultrasound tasks ~\cite{makoviychuk2021isaac, ao2025sonogym, schmidgall2024surgical}, where large-scale parallel rollouts provide sufficient experience for the agent to learn anatomy-aware scanning strategies that prioritize diagnostically informative coverage over geometric traversal. \section{Methodology} \label{sec:method} \subsection{Problem Definition} \begin{figure}[t] \centering \includegraphics[width=\linewidth]{images/overview.pdf} \caption{\textbf{Overview of the Active Multi-view Ultrasound Exploration system.} Framed as a resource-constrained POMDP, the hierarchical agent executes kinematic actions $a_t$ conditioned on the current state $s_t$. Through continuous environment interaction, the spatial memory $\hat{V}_t$ is iteratively updated with new ultrasound slices until the scanning budget is exhausted.} \label{fig:system_overview} \end{figure} % 一段话写任务定义(这里需要有个大图) We formulate active view exploration for ultrasound perception as a % resource-constrained POMDP~\cite{kaelbling1998planning}. Specifically, the unobservable state $s$ represents the complete 3D anatomy, which the agent can only access through partial 2D ultrasound slices. The objective is to learn an exploration policy $\pi_\phi(a_t|s_t)$ that maps the current state $s_t$ to continuous kinematic actions $a_t$, maximizing the cumulative coverage of the target anatomical structure within a fixed budget of $T$ steps. Concretely, the agent faces three subproblems: (1) estimating, from incomplete observations, how much anatomical coverage each unvisited region would provide; (2) deciding which regions to visit and in what order within the finite budget; and (3) translating each regional decision into a feasible kinematic trajectory. % State \textbf{State}. Because the number of acquired slices grows with each step, directly conditioning the policy on the full observation history is impractical. We instead maintain a fixed-dimensional state $s_t$ that summarizes all spatial information collected up to step $t$. At each step $t$, the agent receives a 2D ultrasound slice $I_t$ at probe pose $(\mathbf{p}_t, \mathbf{q}_t)$ and fuses it into a 3D probability map $\hat{V}_t$ via the volumetric fusion function $U(\cdot)$. We formulate the state $s_t$ as: \begin{equation} s_t = (\hat{V}_t, \mathbf{p}_t, \mathbf{q}_t), \end{equation} Here $\hat{V}_t$ aggregates all slices observed up to step $t$ into a spatial probability map, where each voxel stores the estimated probability of tissue occupancy. $\hat{V}_0$ is initialized to a uniform probability of $0.5$ to represent maximum uncertainty. This representation maintains the same dimensionality across different time steps, allowing the policy to operate on a fixed-size input regardless of the episode length. Although $s_t$ captures the spatial structure observed so far, it does not explicitly indicate how much of the target anatomy has been covered. To provide the critic with a more informative training signal, we define a privileged coverage ratio: \begin{equation} c_t = \frac{\sum_{v} \hat{m}_{t}(v) \cdot g(v)}{\sum_{v} g(v) + \epsilon}, \quad c_t \in [0,1], \end{equation} where $g(v)$ is the ground-truth binary mask of the target structure and $\hat{m}_t(v)$ is the estimated target occupancy from the current reconstruction. Since $c_t$ requires $g(v)$, it is available only during training in simulation. Following the asymmetric actor-critic formulation~\cite{pinto2017asymmetric}, the actor $\pi_\phi(a_t | s_t)$ sees only $s_t$, while the critic $V_\psi(s_t, c_t)$ additionally receives $c_t$ for more accurate value estimation. This separation ensures that the deployed policy does not rely on any privileged information. % Action \textbf{Action}. For a given state $s_t$ at time step $t \in \{1,\dots,T\}$, the agent outputs a continuous 4D action $a_t = (\Delta x, \Delta z, \Delta\phi, \Delta\psi)$, where $\Delta x$ and $\Delta z$ are translational displacements along the x and z axes, and $\Delta\phi$ and $\Delta\psi$ are rotational increments for roll and yaw, respectively. The y-axis translation is omitted because the probe maintains surface contact throughout scanning. The action space is continuous to allow fine-grained kinematic adjustments. \textbf{Transition}. Upon executing action $a_t$, the probe pose is updated to $(\mathbf{p}_{t+1}, \mathbf{q}_{t+1})$ via the environment's kinematic function. The environment then returns a new ultrasound slice $I_{t+1}$, which is fused into the probability map to produce $\hat{V}_{t+1}$, and the state transitions to $s_{t+1} = (\hat{V}_{t+1}, \mathbf{p}_{t+1}, \mathbf{q}_{t+1})$. The scanning process terminates when the step budget $T$ is exhausted or the early stopping condition is met. % Reward \textbf{Reward}. We design a dense, multi-objective reward function: \begin{equation} r_t = w_{cov} \Delta C_t + w_{info} \Delta H_t^{echo} - \ell_t^{path} \end{equation} The first term $\Delta C_t$ measures the incremental coverage gain over the anatomical structures of interest, weighted by $w_{cov}$, and provides the main learning signal. However, a single partial slice may refine the reconstruction without producing measurable coverage gain. To reward such intermediate progress, the second term $\Delta H_t^{echo}$, weighted by $w_{info}$, captures the reduction in volumetric Shannon entropy over the target region, so that steps reducing acoustic uncertainty still receive positive feedback. Because the policy maximizes the cumulative sum of all these terms, entropy reduction alone cannot sustain high returns; the policy is driven toward trajectories that also achieve coverage gains over the structures of interest. This distinguishes our reward from objectives that use entropy reduction as the sole optimization target, where the policy has no incentive to prioritize diagnostically relevant regions over other high-uncertainty areas. Finally, $\ell_t^{path}$ is a conditional kinematic penalty that penalizes large translational and rotational displacements when a step produces no coverage gain, discouraging the agent from moving excessively without acquiring new information. \begin{figure*}[t] \includegraphics[width=\linewidth]{images/architecture.pdf} \caption{\textbf{Architecture of the SonoSelect.} The scanning region is discretized into 16 sectors from which learned features $f_i$ are extracted via shared 2D convolutional encoding and masked pooling. The Sector Selection module produces sector-conditioned Q-values $Q(s_t, z_i)$, and selects the optimal sector $z_t$ whose geometric center is converted into a target guidance vector $\mathbf{v}_t$. Concurrently, the PPO Actor encodes the volumetric state $s_t$ concatenated with the sector guidance $\mathbf{v}_t^{\text{pos}}$ through a convolutional backbone to output the local kinematic increment $\Delta_t$. The Residual Fusion module combines $\mathbf{v}_t^{\text{pos}}$ and $\hat{\Delta}_t$ to produce the final continuous action $a_t$, which drives the probe to a new pose and triggers volumetric fusion $U(\cdot)$ to update the spatial memory $\hat{V}_{t+1}$.} \label{fig:sono} \end{figure*} \subsection{SonoSelect Architecture} A flat continuous policy would need to simultaneously decide which region of the anatomy to visit next and compute the kinematic actions to get there. In practice, this joint optimization is difficult because selecting which anatomical region to scan next requires reasoning over the entire observed volume and operates over long horizons with sparse diagnostic feedback, while executing the probe motion toward that region requires dense, short-horizon kinematic adjustments. These two sub-tasks differ in both temporal scale and input granularity. SonoSelect decomposes this problem into two coupled components. A sector selection module handles the long-horizon decision of where to explore. The selected region then provides a directional target for a continuous control policy, which only needs to solve a simpler, short-range navigation task toward the chosen sector. This decomposition constrains the search space for each sub-problem while maintaining the flexibility required for fine-grained kinematic control. % \begin{figure} % \centering % \includegraphics[width=\linewidth]{images/feature.pdf} % \caption{\textbf{Sector feature extraction pipeline.} By treating elevation slices as input channels, a shared 2D convolutional encoder processes the reconstruction volume $\hat{V}_t$ into a 32-channel feature map. For each sector $i$, a sector-specific mask filters this map, followed by parallel average and max pooling. A shared MLP then projects the concatenated 64-dimensional vector, producing the sector feature $f_i$.} % \label{fig:sector_feature} % \end{figure} We discretize the local operational workspace into $S$ equiangular sectors (Fig.~\ref{fig:sono}). To obtain a feature representation $f_i$ for each sector, the reconstruction volume $\hat{V}_t$ is first rearranged by treating elevation slices as input channels and then processed by a shared 2D convolutional encoder. For each sector $i$, a binary sector mask is applied to the encoded feature map, followed by parallel average and max pooling. The concatenated pooling result is then projected through a shared MLP to produce $f_i$. The sector features $\{f_i\}_{i=1}^{S}$ are each passed through a shared Q-network to produce action values $\{Q(s_t, z_i)\}_{i=1}^{S}$, where $Q(s_t, z_i)$ estimates the cumulative expected reward for navigating toward sector $z_i$. This parameter-sharing design ensures that the Q-network generalizes across all candidate sectors rather than learning separate value estimates for each. During training, the sector is chosen via an $\epsilon$-greedy strategy to balance exploration and exploitation; at deployment, the sector with the highest Q-value is deterministically selected. The geometric center of the selected sector $z_t$ is converted into a positional target vector $\mathbf{v}_t^{\text{pos}} \in \mathbb{R}^2$ in the probe's local coordinate frame, representing the translational direction toward the selected sector. This vector serves as the guidance signal for the downstream continuous control policy. The continuous control policy translates the selected sector into kinematic actions. We employ a PPO-based actor-critic architecture. The actor takes as input the current state $s_t$ concatenated with the sector guidance vector $\mathbf{v}_t^{\text{pos}}$, and outputs a local kinematic increment $\Delta_t = [\Delta_t^{\text{pos}}, \Delta_t^{\text{ang}}] \in \mathbb{R}^4$. A residual scaling factor $\alpha$ is applied to obtain the scaled increment $\hat{\Delta}_t = \alpha \Delta_t$. The final action $a_t$ fuses the sector-derived target with this scaled increment: \begin{equation} a_t^{\text{pos}} = \beta_t \mathbf{v}_t^{\text{pos}} + (1-\beta_t) \hat{\Delta}_t^{\text{pos}}, \quad a_t^{\text{ang}} = \hat{\Delta}_t^{\text{ang}} \end{equation} where $\beta_t$ linearly anneals from an initial value $\beta_0$ to a final value $\beta_f$ over training. In early training, $\beta_t$ is large so that the translational component is dominated by the sector guidance $\mathbf{v}_t^{\text{pos}}$, providing a stable learning signal before the policy has converged. As training progresses, $\beta_t$ decreases and the policy's own output $\hat{\Delta}_t^{\text{pos}}$ takes over. The angular component $a_t^{\text{ang}}$ is determined entirely by the policy, as the sector selection provides only translational guidance. The critic estimates the state value $V_{\psi}(s_t, c_t)$ using the augmented state. \subsection{Training Scheme} We employ a rollout-based sequential updating approach to jointly train the continuous control policy (via Proximal Policy Optimization, PPO~\cite{schulman2017proximal}) and the sector selection module via Q-learning. This joint training scheme allows both modules to co-adapt within the same trajectory data, ensuring consistent learning signals across the two decision levels. The continuous control policy is optimized using the standard PPO objective with Generalized Advantage Estimation (GAE) ~\cite{schulman2015high}. The actor outputs kinematic increments $\Delta_t$ and is updated via clipped surrogate objectives, while the critic estimates $V_\psi(s_t, c_t)$ and provides the baseline for advantage computation. For the sector selection module, we employ Q-learning. The action-value function $Q_{\theta}(s_t, z_t)$ estimates the expected cumulative reward after selecting sector $z_t$ at state $s_t$: \begin{equation} Q_{\theta}(s_{t}, z_{t}) = \mathbb{E} \left( \sum_{\tau=t}^{T} \gamma^{\tau-t} r_{\tau}^{(Q)} \right), \end{equation} where $\mathbb{E}(\cdot)$ denotes the expectation and $\gamma \in [0, 1]$ is the discount factor. The Q-network receives the same environment reward as the continuous control policy, i.e., $r^{(Q)}_t = r_t$. Although both modules share the same reward signal, they require different value representations. The PPO critic learns a state value $V_\psi(s_t, c_t)$ used to compute advantages for the continuous control policy, while the sector selection module learns action-conditional values $Q_\theta(s_t, z_i)$ that compare the expected return of each candidate sector. This difference motivates maintaining separate value functions despite the shared reward. Given this formulation, we compute the return from the collected rollouts as the supervision target: \begin{equation} y_{t} = \begin{cases} r_{t}^{(Q)} + \gamma (1 - d_{t}) y_{t+1}, & \text{if } t < T \\ r_{T}^{(Q)}, & \text{otherwise} \end{cases}, \end{equation} where $d_t$ is the termination mask. The Q-network is then optimized using the $L_2$ distance loss: \begin{equation} \mathcal{L}_{Q} = \lambda_Q \frac{1}{T} \sum_{t=1}^{T} \text{MSE}(Q_{\theta}(s_{t}, z_{t}), y_{t}), \end{equation} where $\lambda_Q$ controls the loss weight. In joint training, the two objectives are optimized in separate backward passes within each iteration. First, the PPO objective $\mathcal{L}_{\text{PPO}}$ updates the continuous control policy and the critic. Then, in a separate backward pass, the Q-learning loss $\mathcal{L}_{Q}$ updates the sector selection module, including the Q-network and its associated feature encoder. This sequential scheme prevents gradient interference between the two objectives. A step-by-step demonstration of this process can be found in ~\cref{alg:sonoselect}. \begin{algorithm}[t] \caption{SonoSelect} \label{alg:sonoselect} \begin{algorithmic}[1] \small \STATE \textbf{Input}: Env $\mathcal{E}$, budget $T$, exploration rate $\epsilon$, scaling factor $\alpha$, annealing weight $\beta_t$, Q-loss weight $\lambda_Q$. \STATE \textbf{Update}: Q-network $Q_{\theta}$, actor $\pi_{\phi}$, critic $V_{\psi}$. \FOR{each training iteration} \STATE Initialize rollout buffers $\mathcal{B}_{\text{PPO}}, \mathcal{B}_{Q} \leftarrow \emptyset$ \STATE Reset environment: $s_1 \leftarrow \mathcal{E}.\text{reset}()$ \FOR{$t = 1$ to $T$} \STATE Extract sector features $\{f_i\}_{i=1}^{S}$ from $\hat{V}_t$ \STATE Select sector using $\epsilon$-greedy: with probability $\epsilon$ adopt a random sector, or else choose $z_t = \arg\max_{z_i} Q_{\theta}(s_t, z_i)$ \STATE Compute guidance $\mathbf{v}_t^{\text{pos}} \leftarrow \text{GeometricCenter}(z_t)$ \STATE Sample $\Delta_t \sim \pi_{\phi}(\cdot \mid s_t, \mathbf{v}_t^{\text{pos}})$; scale $\hat{\Delta}_t \leftarrow \alpha \Delta_t$ \STATE Fuse action: $a_t^{\text{pos}} \leftarrow \beta_t \mathbf{v}_t^{\text{pos}} + (1{-}\beta_t) \hat{\Delta}_t^{\text{pos}}$, $a_t^{\text{ang}} \leftarrow \hat{\Delta}_t^{\text{ang}}$ \STATE Execute $a_t$ in $\mathcal{E}$; observe $s_{t+1}, r_t, d_t$ \STATE Store $(s_t, c_t, a_t, r_t, s_{t+1}, c_{t+1}, d_t)$ in $\mathcal{B}_{\text{PPO}}$ \STATE Store $(s_t, z_t, r_t, s_{t+1}, d_t)$ in $\mathcal{B}_{Q}$ \ENDFOR \STATE Compute GAE advantages from $\mathcal{B}_{\text{PPO}}$; update $\pi_{\phi}, V_{\psi}$ via $\mathcal{L}_{\text{PPO}}$ \STATE Compute n-step returns $\{y_t\}$ from $\mathcal{B}_{Q}$; update $Q_{\theta}$ via $\nabla\mathcal{L}_{Q}$ \ENDFOR \end{algorithmic} \end{algorithm} \section{Experiment} We systematically benchmark the proposed \textit{Active Multi-view Ultrasound Exploration} formulation. Our evaluation is structured into two phases: a discrete preliminary study followed by a fully continuous evaluation in a dynamic environment. The first phase (Sec. \ref{sec:discrete_classification}) validates the core assumption that diagnostic information concentrates in a small, instance-dependent subset of views, using a discrete setting where the probe can access any candidate viewpoint without kinematic constraints. The second phase (Sec. \ref{sec:continuous_detection}) evaluates our complete framework, \textbf{SonoSelect}, in the fully continuous POMDP environment where the agent jointly plans which regions to explore and how to navigate toward them. \subsection{Preliminary Study: Multi-view Classification} \label{sec:discrete_classification} To isolate the view selection problem from continuous navigation, we conduct this study in a discrete setting where the probe can access any candidate viewpoint without kinematic constraints. We employ MVSelect ~\cite{hou2024learning}, a sequential view selection method that chooses the next view conditioned on previously acquired observations. MVSelect provides a suitable testbed because it implements adaptive selection without requiring a continuous control policy, allowing us to focus on whether instance-dependent view selection outperforms fixed or random protocols. \textbf{Datasets.} We construct two custom multi-view ultrasound datasets, both adopting a strict 80\%/20\% train-test split and extracting $120 \times 120$ 2D slices under two distinct viewpoint configurations (12-view and 20-view configurations). \begin{itemize} \item \textit{SonoGeom:} This synthetic dataset comprises 10 distinct categories (sphere, ellipsoid, cube, cuboid, cylinder, capsule, cone, torus, octahedron, and cross), with 150 unique instances per category. \item \textit{SonoOrgan:} To move closer to clinical realism, we introduce a more challenging dataset comprising real human anatomical structures sourced from the publicly available TotalSegmentator dataset ~\cite{wasserthal2023totalsegmentator}. It contains 6 distinct categories: left kidney, liver, pancreas, spleen, aorta, and stomach, with 100 unique patient instances per category. \end{itemize} \textbf{Task Network.} For both datasets, we employ a ResNet-18 ~\cite{he2016deep} backbone combined with a max-pooling aggregation module. The network is trained offline on complete multi-view sequences so that the learned representations are not biased toward any particular view subset. \textbf{Quantitative Results.} The classification performances on both datasets are summarized in \cref{tab:combined_results}. We compare four selection strategies: (1) \textit{dataset-level oracle}, which uses the same fixed pair of views that achieves the highest average accuracy across all instances in the training set; (2) \textit{instance-level oracle}, which selects the optimal pair for each test instance by exhaustive search; (3) \textit{random selection}, which samples two views uniformly; and (4) \textit{MVSelect}, which sequentially selects views conditioned on previous observations. We also report the performance of using all $N$ views as a reference \begin{table}[t] \centering \small \caption{Classification results on the simple geometry dataset (SonoGeom) and real organ dataset (SonoOrgan) with a selection budget of $T=2$ views.} \label{tab:combined_results} \begin{tabular}{l|c|c|c|c} \multirow{2}{*}{view selection ($T=2$)} & \multicolumn{2}{c|}{SonoGeom} & \multicolumn{2}{c}{SonoOrgan} \\ & 12 views & 20 views & 12 views & 20 views \\ \hline N/A: all $N$ views & 84.0 & 93.0 & 92.5 & 91.7 \\ \hline dataset-lvl oracle & 79.1 $\pm$ 0.7 & 83.4 $\pm$ 2.2 & 92.8 $\pm$ 0.6 & 90.1 $\pm$ 2.0 \\ instance-lvl oracle & 93.8 $\pm$ 0.5 & 99.2 $\pm$ 0.6 & 98.1 $\pm$ 0.6 & 99.3 $\pm$ 1.1 \\ \hline random selection & 74.6 $\pm$ 2.3 & 68.7 $\pm$ 8.3 & 87.4 $\pm$ 4.2 & 73.9 $\pm$ 11.6 \\ validation best policy& 71.5 $\pm$ 2.4 & 70.9 $\pm$ 5.6 & 88.5 $\pm$ 1.6 & 83.7 $\pm$ 4.1 \\ \hline SonoSelect & 79.1 $\pm$ 1.3 & 89.6 $\pm$ 1.6 & 96.2 $\pm$ 1.6 & 97.0 $\pm$ 1.0 \\ \end{tabular} \end{table} \begin{figure}[htbp] \centering \includegraphics[width=\linewidth]{images/Experiment1.pdf} \caption{\textbf{Qualitative results of the discrete view selection policy ($T=2$).} We visualize the selected viewpoints for both the SonoGeom (top, cube) and SonoOrgan (bottom, anatomical structure) datasets.} \label{fig:Experiment1} \end{figure} We first note that using all $N$ views does not yield the highest accuracy. The instance-level oracle, which selects the best two views per instance via exhaustive search, substantially surpasses the all-view baseline on both datasets. This indicates that redundant or low-quality views introduce noise that degrades the aggregated representation. However, the dataset-level oracle, which fixes the same two best views across all instances, performs considerably worse than the instance-level oracle and in some cases falls below the all-view baseline. This gap shows that the most informative views vary from one instance to another and cannot be predetermined as a fixed protocol. Random selection performs the worst overall, with high variance reflecting the inconsistency of uninformed view choices. MVSelect, which selects views conditioned on each instance's observations, approaches the instance-level oracle on both datasets. This confirms that an adaptive, observation-driven policy can recover near-optimal view combinations without exhaustive search. Together, these results support the two properties that motivate SonoSelect: (1) a small number of well-chosen views can match or exceed the performance of exhaustive acquisition, and (2) the optimal views are instance-dependent, requiring an observation-driven selection policy. These two properties motivate the core design of SonoSelect: the sector selection module implements observation-driven routing to determine which regions are diagnostically valuable, while the continuous control policy handles the kinematic execution that the discrete setting abstracts away. The following section evaluates whether this decomposition can realize the benefits of active, observation-driven view selection when the agent navigates through continuous probe motion rather than accessing arbitrary viewpoints directly. \textbf{Qualitative Results.} \cref{fig:Experiment1} visualizes the views selected by MVSelect for representative instances from both datasets. The policy avoids ambiguous cross-sections and orients the probe toward acoustic windows that capture discriminative geometric features of each object. \subsection{Continuous Kidney Cyst Detection.} \label{sec:continuous_detection} The discrete study establishes that active, observation-driven view selection is necessary for ultrasound perception: a small number of adaptively chosen views can match or exceed exhaustive acquisition, and the optimal views vary across instances. However, that study assumes the probe can access any viewpoint without kinematic cost. In practice, the probe moves continuously along the body surface, so each view choice carries a motion cost and constrains which views are reachable next. This introduces two challenges absent from the discrete setting: the agent needs to plan an efficient visitation order across regions, and it needs to execute feasible kinematic trajectories to reach each selected region. We evaluate SonoSelect, whose hierarchical design separates these two challenges into sector-level routing and continuous kinematic control, on a kidney cyst detection task to test both diagnostic performance and generalization to unseen patient anatomies. \textbf{Experimental Setup.} The primary task requires the agent to dynamically scan the left kidney and identify renal cysts. We utilize 3D clinical CT volumes from the TotalSegmentator dataset ~\cite{wasserthal2023totalsegmentator}. To evaluate structural generalization, patient anatomies are strictly partitioned into seen and unseen domains. \textbf{Baselines.} In this fully continuous setting, we benchmark SonoSelect against baselines representing alternative exploration strategies: \begin{itemize} \item \textit{Random} applies uniformly sampled kinematic actions at each step, providing a lower bound that quantifies the diagnostic yield achievable without any learned or heuristic guidance. \item \textit{Pure PPO} represents end-to-end reinforcement learning without hierarchical decomposition, testing whether a flat policy can implicitly learn both regional planning and local control. \item \textit{VIG} (Volumetric Information Gain) ~\cite{isler2016information} represents classical Next-Best-View driven by entropy maximization, testing whether uncertainty reduction alone provides sufficient guidance for diagnostic exploration. \item \textit{RND ~\cite{burda2018exploration}} provides a state-visitation driven exploration bonus, testing whether encouraging novel state visits improves coverage without task-specific guidance. \end{itemize} \begin{table*}[t] \centering \label{tab:main_results} \begin{tabular}{l c c c c c c} \toprule \textbf{Method} & \textbf{Kidney Cov. (\%)} & \textbf{Cyst Cov. (\%)} & \textbf{Dice (\%)} & \textbf{IoU (\%)} & \textbf{Trans. (voxels)} & \textbf{Rot. ($^{\circ}$)} \\ \midrule \rowcolor{yellow!50} \multicolumn{7}{c}{\textit{Seen Patient}} \\ Random & 19.3 & 12.2 & 30.6 & 19.4 & 934.3 & 2621.1 \\ Pure PPO~\cite{schulman2017proximal} & 59.2 & 28.5 & 61.7 & 44.9 & 458.5 & 327.2 \\ RND~\cite{burda2018exploration} & 60.8 & 31.3 & 63.1 & 46.3 & \textbf{430.4} & \textbf{253.7} \\ VIG~\cite{isler2016information} & 63.1 & \textbf{44.5} & \textbf{71.7} & \textbf{56.0} & 473.6 & 318.9 \\ \textbf{SonoSelect (Ours)} & \textbf{64.3} & 31.0 & 65.9 & 49.3 & 452.6 & 296.1 \\ \midrule \rowcolor{orange!20} \multicolumn{7}{c}{\textit{Unseen Patient}} \\ Random & 27.4 & 2.7 & 42.3 & 27.8 & 927.8 & 2630.0 \\ Pure PPO & 25.7 & 8.2 & 39.1 & 26.5 & \textbf{210.1} & 375.5 \\ RND & 41.1 & 20.7 & 55.2 & 41.2 & 403.8 & \textbf{237.5} \\ VIG & 48.6 & 23.8 & 64.0 & 45.2 & 489.1 & 469.8 \\ \textbf{SonoSelect (Ours)} & \textbf{49.5} & \textbf{30.7} & \textbf{64.2} & \textbf{48.4} & 667.9 & 291.5 \\ \bottomrule \end{tabular} \caption{Quantitative comparison of active scanning performance. SonoSelect exhibits superior robustness, effectively bridging the generalization gap that plagues standard RL baselines in unseen environments.} \end{table*} \textbf{Quantitative Results.} \cref{tab:main_results} presents scanning performance on seen and unseen patient anatomies. On seen anatomies, VIG achieves the highest cyst coverage and reconstruction accuracy, while SonoSelect achieves the highest kidney coverage, as reported in ~\cref{tab:main_results}. This is consistent with the nature of entropy-based exploration: on training anatomies, the spatial distribution of acoustic uncertainty tends to align with target anatomical structures, so greedy entropy maximization effectively guides the probe toward informative regions. SonoSelect's lower cyst coverage on seen data reflects a trade-off in its hierarchical design: the sector selection module distributes exploration across the scanning workspace based on estimated diagnostic value, producing more uniform spatial coverage rather than concentrating on the regions that happen to contain cysts in the training set. This broader exploration strategy, in turn, favors generalization, as the results on unseen anatomies confirm. For reference, the Random baseline achieves the lowest diagnostic scores across all metrics, while consuming substantially more motion budget, confirming that directed exploration is necessary for this task. All methods degrade on unseen anatomies, but the extent of degradation differs. Among the learned methods, Pure PPO exhibits the largest performance degradation, with kidney coverage dropping from 59.2\% to 25.7\% and cyst coverage from 28.5\% to 8.2\%, indicating that the flat policy does not learn transferable exploration behaviors across different anatomies. VIG's cyst coverage drops from 44.5\% to 23.8\%, accompanied by a sharp increase in rotational motion. On seen anatomies, high-entropy regions tend to coincide with target structures, so entropy maximization effectively guides the probe. On unseen anatomies, this alignment weakens, and the probe spends motion budget pursuing uncertainty reduction in diagnostically uninformative regions. In contrast, SonoSelect's cyst coverage remains stable across seen and unseen anatomies, and it achieves the highest scores across all four diagnostic metrics on unseen data. This stability can be attributed to the hierarchical decomposition of the scanning policy. Because the high-level routing operates on sector-level spatial features rather than raw voxel coordinates, its decisions are less tied to the specific geometry of training anatomies. Similarly, the low-level controller only needs to execute short-range navigation toward a given sector, a skill that depends on local kinematics rather than global anatomical layout. As a result, neither level relies on memorizing the full spatial structure of training patients, which explains why SonoSelect's performance degrades less when the anatomy changes. % 将 wrapfigure 放在目标段落正上方 \begin{figure} \centering \includegraphics[width=\linewidth]{images/ppo&sono.pdf} \caption{Episode-level distribution of Cyst Coverage against Trajectory Length on unseen anatomies. SonoSelect (blue) pushes the Pareto front toward higher diagnostic yields compared to Pure PPO (red).} \label{fig:tradeoff} \end{figure} \textbf{Episode-level analysis.} To further examine the generalization behavior at the episode level, ~\cref{fig:tradeoff} plots the episode-level distribution of cyst coverage against trajectory length on unseen anatomies. Pure PPO exhibits a dense cluster in the bottom-left quadrant, indicating frequent near-zero coverage episodes with short, spatially confined trajectories. This pattern is consistent with the limited transferability of the flat policy: when familiar spatial cues from training anatomies are absent, the agent tends to remain confined to local regions rather than exploring broadly. SonoSelect's distribution occupies the upper-right quadrant, where longer trajectories correspond to higher diagnostic coverage. As \cref{tab:main_results} shows, SonoSelect's average trajectory length is considerably longer than that of Pure PPO, yet this additional motion translates into higher scores across all diagnostic metrics, indicating thorough exploration of the target anatomy rather than aimless wandering. \begin{figure}[t] \centering \includegraphics[width=\linewidth]{images/trajectories.pdf} \caption{Qualitative comparison of scanning trajectories on unseen patient data. Red segments indicate trajectory portions where the probe is actively scanning the kidney or cyst, while gray segments represent motion through non-target regions. The percentage below each example records the proportion of the total trajectory spent on effective target scanning. (a) Pure PPO produces uncoordinated trajectories with low effective scanning ratios. (b) SonoSelect achieves structured, anatomy-centered navigation with substantially higher effective scanning ratios.} \label{fig:qualitative_trajectories} \end{figure} \textbf{Qualitative Results.} ~\cref{fig:qualitative_trajectories} visualizes representative trajectories generated by Pure PPO and SonoSelect on unseen anatomies. As illustrated in \cref{fig:qualitative_trajectories}a, Pure PPO produces uncoordinated circular movements far from the kidney, with the majority of the trajectory passing through non-target regions. The effective scanning ratio in these examples ranges from 13.5\% to 19.6\%, indicating that the agent spends most of its motion budget on non-informative traversal. This is consistent with the low coverage reported in \cref{tab:main_results}, where the agent fails to direct the probe toward the target anatomy. In contrast, SonoSelect (\cref{fig:qualitative_trajectories}b) produces more structured trajectories that closely follow the contours of the kidney. The effective scanning ratios increase substantially, reflecting that a larger fraction of the trajectory contributes to diagnostic observation. This improvement is attributable to the sector-level routing learned by the high-level module, which directs the probe toward the target region and reduces time spent in non-informative areas. \subsection{Ablation Studies} \label{sec:ablation} To validate the core architectural designs of SonoSelect, we conduct ablation experiments on the kidney cyst detection task using unseen patient data. We isolate three components: the learned routing policy, the per-sector feature representation, and the residual control module. Each ablation removes one component while keeping the rest unchanged. The quantitative comparisons are summarized in \cref{tab:ablation}. \textbf{Effect of Learned Routing.} We first evaluate the high-level decision maker by replacing the learned routing policy with random sector selection. Without a task-driven geometric prior, the continuous policy receives arbitrary directional targets, leading to uncoordinated probe motion. As shown in \cref{tab:ablation}, this variant shows a notable drop in cyst coverage, confirming that the learned routing policy is necessary to constrain the search space and direct the continuous policy toward diagnostically relevant regions. \begin{table} \begin{tabular}{lcccc} \toprule Method & Kidney & Cyst & Dice & IoU \\ % 缩短了表头以适应窄列 \midrule Random Routing & 44.4 & 12.5 & 62.2 & 45.5 \\ w/o Sector Features & 44.3 & 18.3 & 61.3 & 45.2 \\ w/o Residual Control & 44.9 & 10.1 & 59.2 & 44.9 \\ \textbf{SonoSelect (Ours)} & \textbf{49.5} & \textbf{30.7} & \textbf{64.2} & \textbf{48.4} \\ \bottomrule \end{tabular} \caption{Ablation study of SonoSelect components.} \label{tab:ablation} \end{table} \textbf{Necessity of Explicit Sector Features.} The w/o Sector Features variant replaces the learned feature vectors with uniform values, making all sectors appear identical to the Q-network. Although the Q-network still receives the global observation $o_t$, it cannot distinguish sectors based on their spatial content in the reconstruction volume. As a result, the Q-network selects sectors without considering what each region contains, leading to reduced coverage for both kidney and cyst targets. The coverage drop confirms that the Q-network relies on per-sector spatial features to make informed routing decisions. \textbf{Role of Residual Control.} The w/o Residual Control variant removes the low-level kinematic adjustments. This variant achieves the lowest cyst coverage among all configurations, while its kidney coverage remains comparable to the other ablated variants. This asymmetry reveals a clear functional division within the framework: the high-level routing policy is sufficient to guide the probe toward the correct anatomical region, but capturing small targets such as cysts requires the fine-grained probe adjustments that only the residual control module provides. \section{Conclusion} We propose SonoSelect, an active multi-view exploration framework for robotic ultrasound that selects informative viewpoints without exhaustive scanning or predefined trajectories. By bridging discrete high-level regional routing with continuous low-level kinematic control, SonoSelect learns to resolve anatomical ambiguities and achieves robust generalization to unseen anatomies where standard reinforcement learning approaches show substantial performance degradation. This approach represents a step toward autonomous robotic ultrasound deployment in clinical workflows. While the current evaluation is conducted in simulation, the hierarchical formulation of coupling discrete region selection with continuous probe control provides a principled way to handle the view-dependent nature of ultrasound imaging. This work suggests that structured, observation-driven exploration can serve as an effective mechanism for multi-view ultrasound perception, reducing the number of views needed for accurate diagnosis while maintaining robust coverage across diverse patient anatomies. %% %% The next two lines define the bibliography style to be used, and %% the bibliography file. \bibliographystyle{ACM-Reference-Format} \bibliography{sample-base} \end{document} \endinput %% %% End of file `sample-sigconf-authordraft.tex'.