2022/3/31 ~ 2022/4/6
Order-Embeddings of Images and Language
Core idea
- Explicitly model the partial-order structure of the hierarchy over language and images -> visual-semantic hierarchy
How to do it
- Penalize order violations: $$ E(x,y) = ||\max(0,y-x)||^2 $$ where $E(x,y)=0 \Leftrightarrow x \preceq y$
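A minimal numpy sketch of this order-violation penalty, assuming non-negative embedding vectors (the 2-D toy embeddings and the `order_violation` helper are illustrative, not from the paper):

```python
import numpy as np

def order_violation(x, y):
    """Order-violation penalty E(x, y) = ||max(0, y - x)||^2.

    E(x, y) == 0 exactly when y <= x coordinate-wise, i.e. x precedes y
    under the reversed product order used for the visual-semantic hierarchy.
    """
    return np.sum(np.maximum(0.0, y - x) ** 2)

# Toy check: a specific concept ("dog") should sit below a general one ("animal").
animal = np.array([0.2, 0.1])   # hypothetical 2-D embeddings
dog    = np.array([0.5, 0.4])
print(order_violation(dog, animal))  # 0.0  -> "dog" precedes "animal", no violation
print(order_violation(animal, dog))  # 0.18 -> the reverse direction is penalized
```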
Modeling heterogeneous hierarchies with relation-specific hyperbolic cones
Core idea
- Embeds entities into hyperbolic cones & models relations as transformations between the cones
How to do it
- Poincaré entailment cone at apex $x$: $$\zeta_x = \left\{y\in \mathbb{B}^d \,\middle|\, \angle_x y\leq\sin^{-1}\left(K\frac{1-||x||^2}{||x||}\right)\right\}$$
- Embed each entity as $h=(h_1,h_2,\cdots,h_{d})$, where $h_i\in\mathbb{B}^2$ is the apex of the $i$-th 2D hyperbolic cone.
- Transformation:
- Rotation transformation: $f_1(h,r) = \exp_o(\mathbf{G}(\theta)\log_o(h))$, where $\mathbf{G}(\theta)=\begin{bmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{bmatrix}$ (see the sketch after this list)
- Restricted rotation transformation: $f_2(h,r)=\exp_h(s\cdot \mathbf{G}(\theta\frac{\phi}{\pi})\bar h)$, where $r=(s,\theta)$
- Loss function:
- $\ell_d=-\log\sigma(\psi(h,r,t))-\sum_{t'}\frac{1}{|T|}\log\sigma(-\psi(h,r,t'))$
- $\ell_a(h,r,t)=m\cdot\max(0,\angle_{h_i}t_i-\phi(h_i))$
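A minimal numpy sketch of the per-dimension rotation transformation $f_1$, assuming the standard curvature $-1$ Poincaré ball with exp/log maps taken at the origin $o$ (the helper names and the example apex are illustrative):

```python
import numpy as np

def log0(x, eps=1e-9):
    """Logarithmic map at the origin o of the Poincare ball (curvature -1)."""
    n = np.linalg.norm(x) + eps
    return np.arctanh(min(n, 1 - 1e-7)) * x / n

def exp0(v, eps=1e-9):
    """Exponential map at the origin o of the Poincare ball (curvature -1)."""
    n = np.linalg.norm(v) + eps
    return np.tanh(n) * v / n

def rotate_apex(h_i, theta):
    """Rotation transformation f_1: lift the 2-D cone apex to the tangent
    space at o, rotate it by G(theta), and map it back onto the ball."""
    c, s = np.cos(theta), np.sin(theta)
    G = np.array([[c, -s],
                  [s,  c]])
    return exp0(G @ log0(h_i))

h_i = np.array([0.3, 0.1])                       # apex of one 2-D hyperbolic cone
t_i = rotate_apex(h_i, np.pi / 4)
print(np.linalg.norm(h_i), np.linalg.norm(t_i))  # a rotation about o preserves the norm
```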
Learning to detect unseen object classes by between-class attribute transfer
Core idea
- Decompose an image/object into high-level semantic attributes such as shape, color, or geographic information, then detect new classes based on their attribute representations.
How to do it
- Direct attribute prediction:
- decouple training images into attribute variables $\alpha_m$
- per-attribute parameters $\beta_m$
- at test time: image $x\rightarrow\beta_m\rightarrow \alpha_m\rightarrow \text{class}$ $$p(z|x)=\sum_{a\in\{0,1\}^M}p(z|a)p(a|x) = \frac{p(z)}{p(a^z)}\prod_{m=1}^{M}p(a_m^z|x)$$ (a sketch of both DAP and IAP follows this list)
- Indirect attribute prediction
- add an attribute layer between two label layers
- for the indirect one, $p(a_m|x)=\sum_{k=1}^K p(a_m|y_k)p(y_k|x)$
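A minimal numpy sketch of the DAP posterior and the IAP attribute posterior above, assuming binary attribute signatures and a factorized attribute prior (all variable names and the toy sizes are illustrative):

```python
import numpy as np

# Hypothetical setup: M binary attributes, K seen classes, L unseen classes.
# a_seen[k]  : attribute signature a^{y_k} of seen class k    (K x M)
# a_unseen[z]: attribute signature a^{z} of unseen class z    (L x M)
# p_attr_x   : per-attribute posteriors p(a_m = 1 | x)         (M,)
# p_seen_x   : seen-class posteriors p(y_k | x)                (K,)

def dap_scores(p_attr_x, a_unseen, attr_prior):
    """Direct attribute prediction: p(z|x) proportional to
    prod_m p(a_m^z | x) / p(a_m^z)."""
    p_match = np.where(a_unseen == 1, p_attr_x, 1.0 - p_attr_x)    # p(a_m^z | x)
    prior   = np.where(a_unseen == 1, attr_prior, 1.0 - attr_prior)
    return np.prod(p_match / prior, axis=1)

def iap_attr_posterior(p_seen_x, a_seen):
    """Indirect attribute prediction: p(a_m|x) = sum_k p(a_m | y_k) p(y_k | x)."""
    return p_seen_x @ a_seen    # p(a_m | y_k) is 1 iff seen class k has attribute m

rng = np.random.default_rng(0)
M, K, L = 5, 4, 2
a_seen, a_unseen = rng.integers(0, 2, (K, M)), rng.integers(0, 2, (L, M))
p_attr_x = rng.uniform(0.1, 0.9, M)
p_seen_x = rng.dirichlet(np.ones(K))
print(dap_scores(p_attr_x, a_unseen, attr_prior=np.full(M, 0.5)))
print(iap_attr_posterior(p_seen_x, a_seen))
```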
Zero-shot learning via joint similarity embedding
Core idea
- Frame zero-shot recognition as testing whether a source-domain (class description) and target-domain (image) pair match each other
Multi-grained vision language pre-training: aligning texts with visual concepts
Core idea
- Locate visual concepts in the image given the associated texts & align the texts with the visual concepts.
How to do it
- Bounding box loss: $$\ell_{bbox} = \mathbb{E}_{(V^j,T^j)\sim I;\,I\sim D}\left[\ell_{iou}(b_j,\hat{b}_j)+||b_j-\hat{b}_j||_1\right]$$
- Contrastive learning (a minimal sketch follows this list): $$p^{v2t}(V) = \frac{\exp(s(V,T)/\tau)}{\sum_{i=1}^{N}\exp(s(V,T^i)/\tau)}$$ $$p^{t2v}(T) = \frac{\exp(s(V,T)/\tau)}{\sum_{i=1}^{N}\exp(s(V^i,T)/\tau)}$$ $$\ell_{cl}=\frac{1}{2}\mathbb{E}_{V,T\sim D}\left[H(y^{v2t}(V),p^{v2t}(V))+H(y^{t2v}(T),p^{t2v}(T))\right]$$
- Match prediction: whether a pair of visual concept and text is matched. $$\ell_{match} = \mathbb{E}_{V,T\sim D}H(y^{match},p^{match}(V,T))$$
- Masked Language Modeling: predict the masked words in the text based on the visual concept $$\ell_{mlm} = \mathbb{E}_{t_j\sim \hat{T};\,(V,\hat{T})\sim D}\,H(y^j, p^j(V,\hat{T}))$$
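A minimal numpy sketch of the symmetric in-batch contrastive objective above (batch size, temperature, and embedding dimension are illustrative; the paper computes $s(V,T)$ with its own encoders):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_loss(v_emb, t_emb, tau=0.07):
    """Symmetric (v2t + t2v) contrastive loss over N paired visual-concept /
    text embeddings; row i of each matrix belongs to the i-th pair."""
    v = v_emb / np.linalg.norm(v_emb, axis=1, keepdims=True)
    t = t_emb / np.linalg.norm(t_emb, axis=1, keepdims=True)
    sim = v @ t.T / tau                 # s(V^i, T^j) for every pair in the batch
    p_v2t = softmax(sim, axis=1)        # each visual concept against all texts
    p_t2v = softmax(sim.T, axis=1)      # each text against all visual concepts
    idx = np.arange(len(v))             # ground truth: matching pairs on the diagonal
    ce = lambda p: -np.mean(np.log(p[idx, idx] + 1e-12))
    return 0.5 * (ce(p_v2t) + ce(p_t2v))

rng = np.random.default_rng(0)
print(contrastive_loss(rng.normal(size=(8, 64)), rng.normal(size=(8, 64))))
```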
Visual Turing test for computer vision systems
Abstract
- An operator-assisted device that produces a stochastic sequence of binary questions from a given test image. The query engine proposes a question; the operator either provides the correct answer or rejects the question as ambiguous.
- The system is designed to produce streams of questions that follow natural storylines, from the instantiation of a unique object, through an exploration of its properties, and on to its relationships with other uniquely instantiated objects.
Introduction
- Alan Turing proposed that the ultimate test of whether a machine could think, or at least think as well as a person, was for a human judge to be unable to tell machine from human based on natural-language conversations in an appropriately cloaked scenario. In other words, the test measures how well a computer can imitate a human.
- Semantic image interpretation: image to words
Research shows that the human brain can only think about one idea at a time.
Visual grounding overview
TransVG: End-to-End Visual Grounding with Transformers
Introduction
- Two-stage methods measure the similarity between region and expression embeddings with an MLP.
- One-stage methods fuse the language vector into the visual feature map by direct concatenation (see the sketch after this list).
- Challenges:
- Fusion modules are built on pre-defined structures of language queries or image scenes, which leads to sub-optimal results on long and complex language expressions.
- Objects are grounded in an indirect way: candidates are sparse region proposals or dense anchors, so performance is limited by the quality of the proposals/anchors.
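A toy numpy sketch of the direct-concatenation fusion mentioned in the introduction above (feature-map size and embedding dimension are illustrative):

```python
import numpy as np

def concat_fuse(visual_feat, lang_vec):
    """One-stage style fusion: broadcast the sentence embedding over every
    spatial location of the visual feature map and concatenate channel-wise."""
    H, W, _ = visual_feat.shape
    lang_map = np.broadcast_to(lang_vec, (H, W, lang_vec.shape[-1]))
    return np.concatenate([visual_feat, lang_map], axis=-1)   # (H, W, C + D)

fused = concat_fuse(np.zeros((13, 13, 256)), np.ones(512))
print(fused.shape)   # (13, 13, 768)
```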
Zero-Shot Robot Manipulation from Passive Human Videos
- Learn robot manipulation only by watching videos of humans doing arbitrary tasks in unstructured settings