Ben Burtenshaw’s Post


NLP Engineer

ORPO seems to be a streamlined way of aligning language models right now. It's really handy because benchmarks show that you can get better results than SFT plus DPO in just one step. The main advantages of ORPO:

📊 Mistral fine-tuned with ORPO scores 66.19 on IFEval.
🦄 ORPO is one method that replaces the two stages of SFT + DPO/PPO.
🏆 ORPO outperforms SFT + DPO on Phi-2, Llama 2, Qwen, and Mistral.
🚈 It doesn't require a reference model, so it uses less memory.

This post by Argilla and MantisNLP is a concise overview of the method to bootstrap your experiments. More references are in the comments, including the original repo, the paper, and a minimal implementation for reduced compute.

Argilla

🔎 What if you only need a base model for preference alignment? Welcome to ORPO, summarized in our latest blog with MantisNLP. ORPO eliminates the complexity of multi-stage RLHF and instead uses an odds ratio-based loss. The result? Better performance and significant savings in compute and memory! 🌟 And the cherry on top? The method is showcased on one of our favorite datasets, Argilla's UltraFeedback Binarized.
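For context, the loss described in the ORPO paper combines the standard SFT (negative log-likelihood) loss with an odds-ratio term that favors the chosen response over the rejected one. Because the odds are computed from the policy model's own probabilities, no frozen reference model is needed. In LaTeX notation, following the paper:

\mathcal{L}_{\mathrm{ORPO}} = \mathbb{E}_{(x, y_w, y_l)}\left[ \mathcal{L}_{\mathrm{SFT}} + \lambda \cdot \mathcal{L}_{\mathrm{OR}} \right]

\mathcal{L}_{\mathrm{OR}} = -\log \sigma\left( \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)} \right), \qquad \mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}

Here y_w and y_l are the chosen and rejected responses, and \lambda weights the odds-ratio term against the SFT loss.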

RLHF and alternatives: ORPO

argilla.io

I would also take a look at the paper (https://lnkd.in/eMZjKseK) and repo (https://lnkd.in/e4966VfT) by the ORPO authors.

And see this post from last week, where I shared a quickstart notebook for using ORPO in a limited setup (https://lnkd.in/enXY47rB).
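For anyone who wants to see roughly what an ORPO run looks like in code, here is a minimal sketch using TRL's ORPOTrainer with the UltraFeedback Binarized preferences dataset. It is illustrative only: the model choice, hyperparameters, and the dataset-flattening step are assumptions rather than the exact setup from the blog or the notebook.

# Illustrative sketch only: assumes TRL's ORPOTrainer/ORPOConfig and the
# argilla/ultrafeedback-binarized-preferences-cleaned dataset; the model and
# hyperparameters below are placeholders, not the settings from the blog post.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_id = "mistralai/Mistral-7B-v0.1"  # any causal base model works here
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Preference data with "prompt", "chosen", and "rejected" columns.
dataset = load_dataset(
    "argilla/ultrafeedback-binarized-preferences-cleaned", split="train"
)

# Assumption: "chosen"/"rejected" are lists of chat messages, so flatten them
# to the final assistant reply; adjust this mapping to the dataset's schema.
def to_text(example):
    example["chosen"] = example["chosen"][-1]["content"]
    example["rejected"] = example["rejected"][-1]["content"]
    return example

dataset = dataset.map(to_text)

config = ORPOConfig(
    output_dir="mistral-orpo",
    beta=0.1,  # weight of the odds-ratio term (lambda in the paper)
    max_length=1024,
    max_prompt_length=512,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    num_train_epochs=1,
    logging_steps=10,
)

# Single-stage training on a base model: no separate reference model is passed.
trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,  # newer TRL versions name this argument processing_class
)
trainer.train()

On a limited setup you would typically layer a PEFT/LoRA adapter and quantized model loading on top of this to keep memory down.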

Christopher Williams

Quantitative Analyst | Data Driven Investor

3w

Sounds interesting... but I am still too scared to click on the image.
