AVS Speaker Proposal
Interactive Ranking Aggregation of Multiple Results
The Ad-hoc Video Search (AVS) task requires participants to model a user's text query and retrieve video shots that match the textual description. After reviewing the approaches of last year's teams, we chose to build on embedding models and incorporated several of their successful practices. For the automatic runs, we applied the classic language-image pre-training model CLIP and several of its variants: SLIP, BLIP, BLIP-2, and LaCLIP. In addition, we applied a diffusion model to turn the text query into a set of generated pictures in order to obtain a so-called "mean image query". For the relevance-feedback runs, we used Top-K feedback together with a new algorithm, Quantum-Theoretic Interactive Ranking Aggregation (QT-IRA), which adjusts the models' fusion weights using relevance feedback.
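To make the automatic-run pipeline concrete, the short Python sketch below shows cosine-similarity ranking for one model and weighted late fusion across several models. The helper names (rank_shots, fuse_scores) and the example score arrays are illustrative assumptions for this proposal, not our exact implementation.

import numpy as np

def rank_shots(text_feat, shot_feats):
    # Cosine similarity between one text-query embedding and all shot embeddings.
    text_feat = text_feat / np.linalg.norm(text_feat)
    shot_feats = shot_feats / np.linalg.norm(shot_feats, axis=1, keepdims=True)
    return shot_feats @ text_feat  # higher score = more relevant shot

def fuse_scores(score_lists, weights):
    # Weighted late fusion of per-model similarity scores (one array per model).
    fused = np.zeros_like(score_lists[0])
    for scores, w in zip(score_lists, weights):
        # z-normalise so models with different score ranges are comparable
        fused += w * (scores - scores.mean()) / (scores.std() + 1e-8)
    return fused

# Hypothetical usage with per-model scores for the same query:
# fused = fuse_scores([clip_scores, blip2_scores, laclip_scores], weights=[0.4, 0.3, 0.3])
# ranking = np.argsort(-fused)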
Across the various language-image models, we observed some interesting phenomena. On one hand, the more diverse the types of models, the better the results after fusion. On the other hand, among models of the same kind, the fewer poorly performing models included, the better the results. In addition, inspired by team Waseda_Meisei_SoftBank in 2022, we use a diffusion model to turn the text query into a "mean image query"; a sketch of this idea is given below. In our experiments, this method even outperformed CLIP on certain queries, which suggests that comparing within the same modality, rather than in a shared mapping space, works well. We also made attempts in other directions. For example, we tried to design an appropriate prompt to reformulate the initial query with ChatGPT. Even when integrating a Chain-of-Thought (CoT) approach, the improvement in results remained unstable.
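The mean-image-query idea can be sketched roughly as follows, assuming a Stable Diffusion pipeline from the diffusers library and the Hugging Face CLIP image encoder. The checkpoints, the number of generated images, and the pre-extracted shot features are illustrative assumptions, not necessarily the exact setup used in our runs.

import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

# Generate several images for the text query, then average their image embeddings,
# so retrieval compares image-to-image (same modality) against shot keyframes.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def mean_image_query(text_query, num_images=8):
    images = pipe(text_query, num_images_per_prompt=num_images).images
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)      # (num_images, dim)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.mean(dim=0)                           # the "mean image query"

# shot_feats: pre-extracted, L2-normalised CLIP features of shot keyframes, shape (N, dim)
# scores = shot_feats @ mean_image_query("a man wears black shorts")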
With so many models, our primary task is to determine the optimal fusion weights for them. A natural idea is to adjust the weights using feedback information, which is why we propose a new algorithm, Quantum-Theoretic Interactive Ranking Aggregation (QT-IRA). However, the resulting improvement is limited. The quality of interaction needs to be improved further, by providing feedback on specific details rather than general information. For example, given the query "a man wears black shorts" and a negative-feedback image showing "a woman wears black shorts", we would like the model to recognise that "woman" is what is wrong, rather than "black shorts".
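Since the details of QT-IRA are beyond the scope of this proposal, the simplified sketch below only illustrates the general idea of adjusting fusion weights with relevance feedback. The multiplicative update rule and all names here are assumptions for illustration and should not be read as the QT-IRA algorithm itself.

import numpy as np

def update_weights(weights, per_model_scores, positives, negatives, lr=0.5):
    # Reward models that score positively judged shots above negatively judged ones.
    weights = np.asarray(weights, dtype=float)
    for m, scores in enumerate(per_model_scores):
        pos = scores[positives].mean() if len(positives) else 0.0
        neg = scores[negatives].mean() if len(negatives) else 0.0
        weights[m] *= np.exp(lr * (pos - neg))   # agreement with feedback raises weight
    return weights / weights.sum()               # renormalise to keep a weight distribution

# per_model_scores: list of z-normalised score arrays, one per model (as in the fusion sketch)
# positives / negatives: indices of shots the user marked relevant / non-relevant
# weights = update_weights(weights, per_model_scores, positives=[3, 17], negatives=[5])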