Good morning everyone!
Today, we’re reflecting on the past year and how our AI usage habits have evolved over 365 days of practice and implementation. It's fascinating to see how our relationship with these tools has matured from the initial "use GPT-4 for everything and see what happens" approach to a more nuanced, strategic use of various AI models.
Remember when we all thought having access to GPT-4 was all we needed? Those days are long gone. Let's explore how our understanding and usage of AI have evolved and what we've learned about these tools' strengths and weaknesses.
Note that this is simply what the three of us writing this piece have witnessed in our own experience using and working with LLMs over the year. It is not a piece that applies to everyone! Still, we’d love it if you could take a few minutes to reflect on your own usage of AI systems and let us know whether you agree or have had a very different experience.
The Shift from "One Model Fits All" to Strategic Selection
Last year, many of us defaulted to using GPT-4-Turbo (now 4o) for everything. It was our Swiss Army knife for all AI needs. Now, we've learned to be more selective in our choices. GPT-4o has found its niche in quick, straightforward tasks like simple coding questions and term explanations, though interestingly, our usage has decreased as we've become more experienced.
We now tend to prioritize o1 for more complex (or larger) challenges. It excels at tasks with multiple requirements, from mathematical problems to sophisticated coding challenges. Its ability to handle architecture-level programming decisions and generate detailed reports while respecting every specified requirement makes it the stronger choice for this kind of work.
When it comes to more conversational needs and general knowledge tasks, Claude 3.5 Sonnet has carved out its own space. It shines in RAG-based projects and tasks requiring broad understanding, offering technical capabilities on par with GPT-4o while keeping its responses more concise. What truly sets it apart is its broader general knowledge and more empathetic, nuanced personality compared to OpenAI's more technical and ‘robotic’ approach. Claude is particularly effective for discussing real-world situations requiring emotional intelligence, cultural understanding, or complex reasoning. While OpenAI's models might excel at systematic, structured responses, perfect for technical tasks, Claude often provides more thoughtful, contextually rich (and human) insights on general topics.
The Rise of Specialized Tools
The evolution of our AI usage isn't just about different language models; it's also about the emergence of specialized tools. Integrations of LLMs into existing apps have improved considerably this year. We have been using Cursor as our primary IDE; it integrates directly with the OpenAI and Anthropic APIs. Its pay-per-use option is a refreshing alternative to traditional subscriptions (though a subscription is also offered), and being able to get immediate coding assistance without context switching has made it increasingly valuable.
In search and research, we've seen a significant shift toward using Perplexity for web searches, while tools like Claude Projects and OpenAI Canvas have enhanced our writing and editing capabilities through extended context and specialized features. Again, these are not LLM improvements but integration (and UX) improvements.
Changed Programming Habits
Perhaps the most dramatic shift has been in our programming workflow. AI has become our partner in development: a starting point for template generation, feature integration, and code review. It helps us think through problems, explain code in new ways, and work far more efficiently.
Key Learnings and Best Practices
After a year, our experience has shown that context is critical for getting the best results from AI models. These LLMs now act like advanced interns (or better), but you still need to be precise about what you want (the style, the format, what you like and dislike) and then let them do the work. Features like Claude Projects and Cursor’s “answer with codebase” have proven to be game-changers in this regard. We've also seen a clear shift away from standalone chat interfaces toward integrated solutions that combine multiple tools for optimal results.
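As a rough illustration of what "being precise" can look like, here is a minimal sketch of a reusable prompt template. The structure and field names (`task`, `style`, `fmt`, `preferences`) are our own invention for this example, not any particular tool's API.

```python
def build_prompt(task: str, style: str, fmt: str, preferences: list[str]) -> str:
    """Assemble a context-rich prompt from explicit instructions.

    Every section is spelled out so the model does not have to guess
    the desired style, output format, or personal preferences.
    """
    prefs = "\n".join(f"- {p}" for p in preferences)
    return (
        f"Task: {task}\n"
        f"Style: {style}\n"
        f"Output format: {fmt}\n"
        f"Preferences:\n{prefs}"
    )

# Hypothetical usage: spelling out style and format up front.
prompt = build_prompt(
    task="Summarize this pull request",
    style="concise, technical",
    fmt="bulleted list",
    preferences=["avoid jargon", "include a one-line TL;DR"],
)
```

The point is less the helper itself than the habit: every preference you leave implicit is a guess you are asking the model to make.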
The key to effective programming with AI lies in knowing when to use which model. Sonnet 3.5 or GPT-4o excel at specific, single-task prompts where quick iteration is needed, while o1 shines when dealing with complex, multi-step problems. We've learned to switch between them strategically based on the task's complexity and requirements.
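As a toy sketch of this kind of strategic switching, here is a routing heuristic in Python. The thresholds and keyword lists are assumptions made up for illustration, not a rule we follow literally; in practice the choice is usually made by hand in the editor.

```python
def pick_model(prompt: str, multi_step: bool = False) -> str:
    """Route a task to a model based on rough complexity signals."""
    # Long or explicitly multi-step problems go to the reasoning model.
    if multi_step or len(prompt.split()) > 200:
        return "o1"
    # Conversational / general-knowledge phrasing suits Claude well.
    if any(k in prompt.lower() for k in ("explain", "summarize", "rewrite")):
        return "claude-3-5-sonnet"
    # Quick, single-task prompts iterate fastest on GPT-4o.
    return "gpt-4o"
```

The heuristic just encodes the division of labor described above: reasoning models for multi-step complexity, Claude for conversational breadth, GPT-4o for fast single-task iteration.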
Looking Forward
As we continue to evolve in our AI usage, we're seeing a clear trend: it's not about finding the "best" model, but rather about understanding which tool fits each specific need. The future likely holds even more specialized applications and refined usage patterns. This again follows the generalization-specialization cycle, where we alternate between the two, optimizing one at a time. Now, though, we have tools like Cursor optimizing for specialization on top of the general foundation models built by generalization-focused labs like OpenAI and Google.
What's particularly fascinating is how challenging it is to truly evaluate these models' capabilities. Traditional benchmarks often fail to capture the nuances of real-world tasks and the subtle differences in how models handle complex scenarios. Model providers typically report performance on broad tasks (like general coding problems), hoping that optimization at this level will translate to better performance on more specific use cases. Our understanding of each model's strengths comes primarily from hands-on experience rather than performance metrics alone.
Remember, the key isn't to use the most advanced model for everything, but to use the right tool for each specific task.
Here's a quick reference table from our experience with model selection: