/scripts

Primary LanguageRuby

GalaxyBrain & CodeGPT: A Comprehensive Vision for AI-Assisted Coding


Introduction: GalaxyBrain aims to harness the capabilities of AI, particularly GPT-4, to revolutionize the developer experience. The goal is to provide precise, context-aware coding assistance, bridging the gap between AI models and developers.


Pivot to CodeGPT and Pinecone: Upon discovering CodeGPT's support for embeddings through Pinecone and its integration with VS Code, there was a shift in direction. Pinecone's existing system offers a platform that can be leveraged without building from scratch. While the initial ideas of integrating with GPT Engineer or GPT Pilot remain, there's a knowledge gap regarding how the GPT-4 API session in such tools would read embeddings hosted at Pinecone.

User's Goal with CodeGPT: Develop a robust Ruby script for the VS Code extension, CodeGPT, to enhance code completions. The script should:

  • Scan Gemfile or package.json in the current directory.
  • Identify on-disk code directories of packages.
  • Cache and manage package repos for tracking.
  • Chunk code from repos and the current workspace (excluding sensitive data).
  • Create embeddings via OpenAI API and store vectors in Pinecone.
  • Handle errors gracefully and allow periodic embeddings refresh.

Embeddings & Pinecone: Embeddings are vectorized code representations stored in Pinecone for CodeGPT's richer context. They're crucial for optimized code suggestions, especially in decentralized ecosystems.

Key Points:

  • Use local versions of installed packages.
  • Index the full workspace, excluding duplicates and sensitive files.
  • Utilize the Oj gem for JSON operations.
  • If a package's enabled status is nil, prompt the user for relevance.

Token Limitations & Strategy:

  • OpenAI embeddings API has a token limit of 8191 for the V2 model.
  • Use tiktoken_ruby gem for tokenization.

Refined Chunking Strategy:

  • Chunk Size & Overlap: Opt for a chunk size of 300-500 tokens with a deliberate overlap. This overlap ensures context preservation, especially vital when meaningful constructs like functions or classes risk being split between chunks. The size strikes a balance between capturing sufficient context and maintaining granularity in the embeddings.
  • Metadata Capture: For each chunk, store metadata comprising filename, relative filepath, and an array of line_numbers. This approach not only simplifies line number tracking but is also crucial for accurate git patch generation in future stages. An array representation avoids manual calculations, ensuring precision.
  • Unique Semantic Embeddings: Focus on generating embeddings that represent unique semantic chunks. By doing so, we ensure that the embeddings capture distinct and meaningful information, maximizing the utility and efficiency of the retrieval-augmented generation process.

Caching: Implement caching to avoid re-chunking unchanged files. Calculate file content hashes and compare with stored hashes. Only reprocess changed files.

Integration with CodeGPT: Ensure CodeGPT in VS Code has the Pinecone API key. Regularly update embeddings and monitor code completion quality.

Metadata and Line Numbers: Preserve line numbers in code chunks for future git patches. Store line numbers as metadata in Pinecone.

Embedding Metadata within Code Chunks:

  1. Format: Use a JSON format for the metadata. It's structured, easily recognizable, and can be parsed by most programming languages.

  2. Metadata Structure:

{
    "filename": "main.rb",
    "filepath": "/projects/my_project/",
    "line_numbers": [10, 11, 12, ...],
    "package_name": "sample_package" // or "project_name": "my_project" if it's from the current workspace
}
  1. Integration:
  • Serialize the JSON metadata to a string.
  • Prepend the serialized metadata to the code chunk, separating them with a newline.

Example in Ruby:

metadata = {
    filename: "main.rb",
    filepath: "/projects/my_project/",
    line_numbers: [10, 11, 12, ...],
    package_name: "sample_package" // or project_name: "my_project" if it's from the current workspace
}.to_json

chunk_with_metadata = "#{metadata}\n#{code_chunk}"
  1. Usage:
  • When the LLM or any other system processes the chunk, it can easily extract and parse the metadata from the content.
  • The metadata provides context about the code chunk, including its origin, location in the source file, and associated package or project.

Advanced Features:

  • Auto-Debugging and CLI Interactions: Inspired by GPT Pilot's auto-debugging feature, GalaxyBrain can self-correct, run CLI commands, and adjust based on feedback.
  • Git Patches: Generates git patches for granular code modifications, allowing for easier reviews and direct integration with version control systems.
  • Integration with Existing Tools: Designed to integrate seamlessly with tools like GPT Engineer and GPT Pilot, enhancing their capabilities and user experience.

Challenges & Considerations:

  • The VS Code extension for CodeGPT may not handle git patches as output from GPT-4. It cannot automatically apply those patches, necessitating git integration similar to Aider, which can receive, apply, and potentially auto-commit git patches if they are correctly formatted and safe.

Future Vision and Implementation:

  1. Optimized User Experience:
    • Potential development of a Mac app using Electron.
    • The app would wrap around a backend built on iterated versions of GPT Engineer.
    • Features like drag-and-drop for Gemfiles or project directories, stylish prompts, and in-app terminals are envisioned.
  2. Automated Bug Fixing:
    • Envisions a future where bug fixing is largely automated.
    • Developers can press "fix" on entire projects, streamlining the debugging process.
  3. Strategic and Creative Focus:
    • Developers will prioritize among coding projects, understanding their interdependencies.
    • Align projects with higher aims and ensure sustainable coding practices.
  4. Human vs. AI Decision Making:
    • A central challenge is determining when automated action is appropriate and when human authorization or intervention is necessary.
  5. Future Explorations:
    • Add support for Python’s requirements.txt.
    • Write exhaustive tests, possibly in Rspec.

GitHub Repo: - Check the current code status on https://github.com/semikolon/scripts. - Read todo.md for a list of the next steps. - The Ruby script update_repos is the starting point. It doesn't have an .rb extension because it's meant to be executable (+x) and the directory should be added to the PATH in the terminal so that you can run update_repos from any project directory. That way, it will index the dependency packages of that project and upload/update embeddings of their code in Pinecone.


AI-Assisted Programming Tools Overview:

  • GPT Pilot: A tool developed by Pythagora-io. Link
  • GPT Engineer: Created by Anton Osika. This tool has a unique improve flag feature. Link
  • Aider: A programming assistant by Paul Gauthier. Link
  • MetaGPT: Developed by Geekan, it includes a code_review mode. Link
  • Open Interpreter: A project by Killian Lucas. Link

Comparison Insights:

  • The code_review mode in MetaGPT and the improve flag in GPT Engineer have been compared and found to be quite similar in functionality.
  • A challenge in AI-assisted programming is the finite token context. While users can manually select files, the ideal scenario would be for the AI to understand the complete context of the codebase and its dependencies. One potential solution is using embeddings and retrieval-augmented generation to approximate this ideal. This is a primary goal for future development.