Search-R1: the first reproduction of DeepSeek-R1 (zero) with reinforcement learning

What do you think of something like this?

Search-R1 is a framework for training reasoning and search-augmented LLM agents with reinforcement learning.

This is a step towards training an open-source counterpart to OpenAI's "Deep research" via RL.

3B base LLMs, including not just Qwen 2.5 but also Llama 3.2, learn to reason and call search engines all on their own.

We follow DeepSeek-R1-Zero, starting with a base LLM, prompts, and ground-truth rewards. Then, we apply reinforcement learning (RL). Our experiments are conducted on Natural Questions (NQ), a factual QA dataset on which LLMs struggle to answer directly, making search engine calls crucial. The only supervision? A rule-based outcome reward (string exact match) to determine correctness.
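To make the reward concrete, here is a minimal sketch of a string exact-match outcome reward. The post only says "string exact match"; the normalization step (lowercasing, stripping punctuation and articles) and the function names are assumptions borrowed from common QA evaluation practice, not the project's confirmed implementation.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace.

    Assumption: this normalization is illustrative, not taken from the post.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(prediction: str, gold_answers: list[str]) -> float:
    """Return 1.0 if the prediction equals any gold answer after normalization."""
    pred = normalize_answer(prediction)
    return 1.0 if any(pred == normalize_answer(g) for g in gold_answers) else 0.0
```

A reward like this needs no learned reward model: the policy gets 1.0 only when its final answer matches a reference answer, and 0.0 otherwise.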

We first experiment with RL tuning without search engine access, letting the LLM (Llama 3.2-3B-base) answer questions on its own. Initially, the model produces dummy outputs, but through RL, it gradually learns to generate meaningful answers.

Next, we instruct the LLM (Llama 3.2-3B-base) that it can call a search engine to retrieve relevant information. Surprisingly, even without any supervised fine-tuning (SFT), the base LLM learns to call the search engine, interpret search results, and answer questions, all through RL!
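As a rough illustration of what "instructing the LLM that it can call a search engine" might look like, the sketch below builds an instruction prompt and extracts a search call from the model's output. The prompt wording, the <search> and <answer> tag names, and both helper functions are hypothetical; the post does not show the actual prompt or call format.

```python
import re
from typing import Optional

# Hypothetical instruction text; the tag names are assumptions for illustration.
PROMPT_TEMPLATE = (
    "Answer the question. If you need outside knowledge, write "
    "<search>your query</search> and wait for the results. "
    "Give the final answer inside <answer></answer>.\n"
    "Question: {question}\n"
)

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)


def build_prompt(question: str) -> str:
    """Fill the instruction template with a single question."""
    return PROMPT_TEMPLATE.format(question=question.strip())


def extract_search_query(model_output: str) -> Optional[str]:
    """Return the first search query in the output, or None if there is none."""
    match = SEARCH_RE.search(model_output)
    if match is None:
        return None
    query = match.group(1).strip()
    return query or None
```

Because the model starts as a base LLM, nothing forces it to emit these tags at first; under this setup it would only learn to use them because doing so leads to rewarded answers.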

We compare the performance of the LLM trained without search engine access against the LLM trained with search-augmented learning via RL. The search-enabled model wins!

When training Llama 3.2-3B-base with search engine calling, the response length follows an interesting trend:

First, it decreases, as the model learns to avoid excessive dummy words. Then, it increases, as it learns to call the search engine and reason effectively. Since NQ is a relatively simple task, the response length stabilizes at ~500 tokens.

We experiment with Qwen2.5-3B-base and Qwen2.5-7B-base in RL settings both with and without a search engine. It works for both! Interestingly, in the search-augmented setting, the 3B model achieves performance comparable to the 7B model. Hypothesis: when an LLM is connected to external information, its reasoning ability may not necessarily require a large model size.

Both base and instruction models work! The instruction model converges faster and starts from a better initial performance. However, the final performance of both models is very similar. This suggests that while instruction tuning accelerates learning, reinforcement learning can bridge the gap over time.

We experiment with Qwen2.5-3B-base, Llama3.2-3B-base, and Qwen2.5-7B-base, and they all work! This is notably different from math reasoning, where only the Qwen2.5 series models succeed.

The LLM learns to perform multi-turn search engine calls, refining its queries step by step to gather more relevant information. This showcases its ability to iteratively improve retrieval and reasoning, a key capability for real-world research agents!
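The multi-turn behaviour described above can be pictured as a loop that alternates generation and retrieval. The sketch below is only an assumption about how such a rollout could be structured: the <search>, <information>, and <answer> tags, the rollout function, and the max_turns limit are all hypothetical, and generate and search stand in for the policy model and the search engine.

```python
import re
from typing import Callable, Optional

# Hypothetical tag format; the post does not specify the real one.
SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def rollout(
    question: str,
    generate: Callable[[str], str],
    search: Callable[[str], str],
    max_turns: int = 4,
) -> tuple[Optional[str], list[str]]:
    """Alternate generation and retrieval until an answer appears or turns run out."""
    context = f"Question: {question}\n"
    queries: list[str] = []
    for _ in range(max_turns):
        step = generate(context)  # the model continues from everything seen so far
        context += step
        answer = ANSWER_RE.search(step)
        if answer:
            return answer.group(1).strip(), queries
        call = SEARCH_RE.search(step)
        if call is None:
            break  # neither a search call nor an answer: stop the episode
        query = call.group(1).strip()
        queries.append(query)
        # Retrieved text is appended so the next generation step can read it.
        context += f"\n<information>{search(query)}</information>\n"
    return None, queries
```

The returned answer (or None) is what an outcome reward would score, and the list of queries makes it easy to inspect how the model refined its searches from turn to turn.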

Our framework supports flexible search engine choices, including local retrievers (sparse/dense), online search engines (Google, Bing, etc.), and custom search engines: launch your own on any corpus and integrate it with RL effortlessly!
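One way to picture "flexible search engine choices" is a small shared interface that any backend can satisfy. The sketch below is an assumption, not the framework's actual API: SearchEngine, KeywordRetriever, and retrieve are hypothetical names, and the toy term-overlap scorer merely stands in for a real sparse retriever.

```python
from collections import Counter
from typing import Protocol


class SearchEngine(Protocol):
    """Hypothetical interface: anything with a matching search() method fits."""

    def search(self, query: str, top_k: int = 3) -> list[str]: ...


class KeywordRetriever:
    """Toy sparse retriever: ranks documents by overlapping query-term counts."""

    def __init__(self, corpus: list[str]) -> None:
        self.corpus = corpus
        self.term_counts = [Counter(doc.lower().split()) for doc in corpus]

    def search(self, query: str, top_k: int = 3) -> list[str]:
        terms = query.lower().split()
        scored = [
            (sum(counts[t] for t in terms), idx)
            for idx, counts in enumerate(self.term_counts)
        ]
        # Highest score first; ties broken by original corpus order.
        ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
        return [self.corpus[idx] for score, idx in ranked[:top_k] if score > 0]


def retrieve(engine: SearchEngine, query: str) -> str:
    """Format results the same way regardless of which engine is plugged in."""
    return "\n".join(engine.search(query))
```

With an interface like this, swapping a local retriever for an online engine or a custom corpus would only require another class exposing the same search method; the RL loop calling retrieve would not need to change.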


The pipeline is based on verl (https://github.com/volcengine/verl), a highly efficient RL framework.

Fully open source

Experimental logs
GitHub

Status: Rejected
Board: 💡 Feature Requests
Tags: Web Search
Date: About 1 year ago
Author: JaeSwift
