Tinker: Call for Community Projects

We launched Tinker to enable builders and researchers to train models their own way, whether they’re conducting studies or customizing models for new applications. We plan to publish regular roundups of the coolest projects from the Tinker community, and we invite you to submit what you’ve been Tinkering on to be featured on our blog.

Below are some broad suggestions for what we hope to see in featured Tinker projects, along with some specific research directions we would particularly love to see pursued.

We’re interested in featuring ML research projects, AI-enabled research in other domains, custom models, and other contributions. Some examples:

  • A reimplementation of a research project or tech report using Tinker, such as a paper comparing algorithmic recipes or datasets.
  • Original research in machine learning, such as exploring new approaches to training or optimization, or applying novel benchmarks and evaluations.
  • Research in a non-AI field that uses fine-tuned models, such as the work on mathematical theorem provers and chemistry models we highlighted previously.
  • Product prototypes built with Tinker, demoing a model that does something fresh or delightful.
  • Novel datasets and task environments for training models.
  • High-level libraries built on top of Tinker that enable less experienced practitioners to perform fine-tuning effectively.
  • Infrastructure contributions, such as a clean self-hosted implementation of the Tinker training API.

Your submission should include a write-up and, preferably, an open-source release of your code. We encourage you to focus on rigor and clear evaluation in your write-ups: crisp charts, examples of raw outputs, and clear comparisons to alternative approaches or models on relevant benchmarks and metrics. Tinkering is experimenting: we want to feature diligent work and transparent results over novelty or hype.

Please send your projects and any related questions to [email protected] with “Featured Project” in the subject line.

Suggested research projects

Here are some research directions that we would personally love to see explored and that Tinker can enable real progress on. We have created a repo with detailed motivation and guidelines for each; we’ll be adding more resources to it over time to help researchers get started. We expect most project ideas to surprise us, but this short list could serve as inspiration.

Replicating Constitutional AI, starting from the base model. Though RLAIF is widely used, it’s most often bootstrapped from existing instruction-tuned models. This makes it difficult to separate the impact of the constitution from the impact of the data-generating model that interprets it. A study of Constitutional AI with and without instruction-tuned models in the pipeline would shed light on the use of constitutions and RLAIF.
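
To make the comparison concrete, here is a minimal sketch of the critique-and-revision data-generation step at the heart of Constitutional AI, written against a hypothetical `generate` completion helper rather than any particular API. The same loop can be run with either a base model or an instruction-tuned model as the data generator, which is exactly the variable such a study would isolate.

```python
# Minimal sketch of Constitutional AI critique-and-revision data generation.
# `generate` is a hypothetical completion helper, injected as a callable.

CRITIQUE_INSTRUCTION = "Identify ways the response is harmful, dishonest, or unhelpful."
REVISE_INSTRUCTION = "Rewrite the response to address the critique while staying helpful."

def critique_and_revise(generate, prompt, response):
    critique = generate(
        f"Prompt: {prompt}\nResponse: {response}\n{CRITIQUE_INSTRUCTION}\nCritique:"
    )
    revision = generate(
        f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
        f"{REVISE_INSTRUCTION}\nRevision:"
    )
    return revision

def build_sft_dataset(generate, prompts):
    # Revised responses become SFT targets; a later RLAIF stage can reuse the
    # same constitution to produce preference labels for reward modeling or RL.
    return [(p, critique_and_revise(generate, p, generate(p))) for p in prompts]
```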

RLVR with Noisy Student. Noisy Student self-distillation was a popular technique in an earlier era of machine learning for making use of large unlabeled datasets, but it hasn’t been widely adapted to LLMs. One possible adaptation is to start RLVR with a small labeled training set and a large unlabeled one, then have the student label the latter set after each RL run and iterate.
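
As a rough illustration (not a prescription), the loop below sketches one way to combine the two. The training, labeling, and confidence-scoring routines are injected as hypothetical callables, and each round promotes confident pseudo-labels from the unlabeled pool into the RLVR training set.

```python
# Sketch of a Noisy Student-style loop around RLVR. `run_rlvr`,
# `pseudo_label`, and `confidence` are hypothetical callables standing in
# for whatever training and sampling stack you use.

def noisy_student_rlvr(model, labeled, unlabeled,
                       run_rlvr, pseudo_label, confidence,
                       rounds=3, threshold=0.9):
    for _ in range(rounds):
        # 1. Train the student with RLVR on everything labeled so far.
        model = run_rlvr(model, labeled)

        # 2. The student pseudo-labels the unlabeled pool, e.g. by majority
        #    vote over several samples per problem.
        pseudo = {x: pseudo_label(model, x) for x in unlabeled}

        # 3. Promote confident pseudo-labels into the labeled set and iterate.
        confident = {x for x in unlabeled if confidence(model, x, pseudo[x]) >= threshold}
        labeled = labeled + [(x, pseudo[x]) for x in confident]
        unlabeled = [x for x in unlabeled if x not in confident]
    return model
```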

On-Policy Context Distillation. Context distillation trains a student model given an empty context to match a teacher model given a long, detailed context. Prior work used off-policy distillation, i.e., fine-tuning on samples from the teacher. We have found that on-policy distillation is often much more effective; it would be useful to compare the two approaches for context distillation in particular.
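
Here is a minimal sketch of one on-policy step, with sampling and scoring interfaces injected as hypothetical callables: the student answers with an empty context, the teacher scores those same tokens while seeing the detailed context, and the per-token gap drives the update. The off-policy baseline would instead fine-tune the student directly on teacher samples.

```python
# One on-policy context-distillation step (sketch). `sample`, `logprobs`,
# and `apply_update` are hypothetical callables, not a specific API.

def on_policy_step(student, teacher, prompt, detailed_context,
                   sample, logprobs, apply_update):
    # Student answers without the detailed context.
    tokens = sample(student, prompt)

    # Teacher scores the student's own tokens, but gets to see the context.
    teacher_lp = logprobs(teacher, detailed_context + prompt, tokens)
    student_lp = logprobs(student, prompt, tokens)

    # Per-token advantage: push the student toward the teacher's distribution
    # on samples the student actually produces (a reverse-KL-style objective).
    advantages = [t - s for t, s in zip(teacher_lp, student_lp)]
    return apply_update(student, prompt, tokens, advantages)
```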

RL memory test. Our post on LoRA presented theoretical arguments about the rate of information acquisition under both SFT and RL. You can set up a toy environment where RL must memorize a completely random number sequence, and compare the empirical learning rate under various reward functions to the theoretical estimate.
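
A toy environment for this might look like the sketch below: the target is a fixed random digit sequence, and different reward variants (exact match, per-digit, correct prefix) let you measure how quickly policies trained under each signal approach the theoretical rate.

```python
import random

# Toy "memory" environment: the policy can only succeed by memorizing a
# fixed random digit sequence, so bits learned per episode can be compared
# against theoretical estimates under different reward signals.

class RandomSequenceEnv:
    def __init__(self, length=64, seed=0):
        rng = random.Random(seed)
        self.target = [rng.randrange(10) for _ in range(length)]

    def reward(self, guess, kind="exact"):
        if kind == "exact":       # sparse: roughly one bit per episode at most
            return float(list(guess) == self.target)
        if kind == "per_digit":   # dense: fraction of positions correct
            return sum(g == t for g, t in zip(guess, self.target)) / len(self.target)
        if kind == "prefix":      # length of the longest correct prefix
            n = 0
            for g, t in zip(guess, self.target):
                if g != t:
                    break
                n += 1
            return n / len(self.target)
        raise ValueError(f"unknown reward kind: {kind}")
```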

Direct RL on pairwise judge. RLHF and RLAIF typically collect datasets of pairwise preferences, train a reward model on them, and then run RL against that reward model. As an alternative “direct” approach, we can do RL using a prompted model that makes pairwise comparisons, without training a reward model at all. It would be interesting to run experiments comparing the direct and indirect approaches.
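
One concrete way to turn pairwise comparisons into scalar rewards is sketched below, with the judge injected as a hypothetical callable: sample a group of completions per prompt, run round-robin comparisons, and use each completion’s win rate as its reward, so no reward model is ever trained.

```python
from itertools import combinations

# "Direct" rewards from a prompted pairwise judge (sketch).
# `judge_prefers(prompt, a, b)` is a hypothetical callable returning True
# if the judge prefers completion `a` over completion `b`.

def pairwise_group_rewards(prompt, completions, judge_prefers):
    wins = [0.0] * len(completions)
    for i, j in combinations(range(len(completions)), 2):
        if judge_prefers(prompt, completions[i], completions[j]):
            wins[i] += 1.0
        else:
            wins[j] += 1.0
    n_opponents = len(completions) - 1
    # Win rates in [0, 1], usable as group-relative rewards.
    return [w / n_opponents for w in wins]
```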

Replicate Open Character Training. Replicate the recent paper on Open Character Training using Tinker.

GAN for jokes. In domains such as humor, it is easier to curate a human-vetted set of demonstrations than to train a reliable judge or reward model. Try implementing GAN-style training for a joke evaluator and a joke generator that can craft jokes on a requested subject with given keywords.
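
One possible shape for the training loop, with the sampling, classifier-training, and RL routines injected as hypothetical callables: the evaluator is repeatedly retrained to separate human-vetted jokes from generated ones, and the generator is trained with RL to fool the current evaluator.

```python
# GAN-style alternation between a joke generator and a joke evaluator
# (sketch). `generate`, `train_classifier`, and `run_rl` are hypothetical
# callables, not a specific API.

def gan_joke_training(generator, evaluator, human_jokes, prompts,
                      generate, train_classifier, run_rl, rounds=5):
    for _ in range(rounds):
        # Generator writes jokes for the requested subjects/keywords.
        generated = [generate(generator, p) for p in prompts]

        # Evaluator (the "discriminator") learns to tell human-vetted jokes
        # from generated ones.
        data = [(j, 1) for j in human_jokes] + [(j, 0) for j in generated]
        evaluator = train_classifier(evaluator, data)

        # Generator is trained with RL to fool the current evaluator, using
        # the evaluator's "human-written" probability as the reward.
        generator = run_rl(generator, prompts,
                           reward=lambda prompt, joke: evaluator(joke))
    return generator, evaluator
```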

Tips for high-quality ML experiments

In closing, we want to offer a few guidelines for running high-quality ML studies, the same guidelines we strive to follow internally when running experiments and documenting the results.

We encourage researchers to examine each result through multiple analyses. When creating datasets or environments, we recommend training a range of models and applying different evals. When developing novel methods, we suggest comparing against simpler baselines and sweeping the hyperparameters that performance is most sensitive to, particularly the learning rate.
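
On the learning-rate point in particular, even a coarse sweep reported in full is far more informative than a single run. A minimal sketch, with training and evaluation injected as hypothetical callables:

```python
# Coarse learning-rate sweep (sketch). `train` and `evaluate` are
# hypothetical callables for your training run and your eval of choice.

def lr_sweep(train, evaluate, base_lr=1e-4, factors=(0.25, 0.5, 1.0, 2.0, 4.0)):
    results = {}
    for f in factors:
        lr = base_lr * f
        results[lr] = evaluate(train(lr=lr))
    return results  # report the full curve, not just the best point
```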

We’d love to see your reasoning in the write-up: assumptions you made, how your approach diverges from previous reports, and what motivated each change. We hope to see examples of the raw data and model rollouts, along with the summarized results. Finally, we appreciate crisp and detailed write-ups with clean and well-labeled charts and illustrations of the inner workings of the methods used.

We are excited to see what our community creates with Tinker, and hope that our featured projects will inspire your own work.