Senior Software Engineer — AI Evaluation & Benchmarks (Python)

G2i Inc.

Job Overview

Location: Miami

Job Type: Contract

Full Job Description

📋 Description

• Design and build coding benchmarks that evaluate frontier AI models on real-world software engineering tasks, including reasoning, debugging, and production-quality code generation
• Develop and maintain scalable data pipelines to support automated evaluation workflows for AI-generated code
• Analyze model-generated code for correctness, reliability, edge-case failures, and adherence to software engineering best practices
• Construct structured evaluation scenarios that span large, multi-repository codebases and multi-language programming environments
• Provide detailed technical feedback on AI model performance, identifying patterns of success and failure to inform iterative benchmark improvements
• Contribute to the development of evaluation frameworks that define industry standards for measuring coding ability in AI systems
• Ensure benchmarks effectively distinguish between high-performing and weak AI models by creating tasks grounded in real software engineering work
• Implement evaluation harnesses that automate task execution, result collection, and failure analysis across diverse coding challenges
• Collaborate with engineering and research teams to refine evaluation methodologies based on empirical model behavior and performance trends
• Maintain version-controlled, well-documented, and tested code for all benchmark components and evaluation infrastructure
• Optimize evaluation pipelines for speed, reproducibility, and resource efficiency while handling large volumes of model outputs
• Work within modern development workflows using Git, code reviews, and automated testing to ensure high-quality, production-grade evaluation systems
• Translate abstract model capabilities into concrete, measurable evaluation criteria that align with real-world software development needs
• Iterate on benchmark design based on feedback from model performance data to continuously improve discriminative power and relevance
• Document evaluation protocols, scoring rubrics, and failure analysis methodologies for internal and external consumption
• Support the creation of datasets used to train and evaluate next-generation AI coding models through rigorous, repeatable testing procedures
• Operate independently to manage end-to-end evaluation pipeline development, from initial design to deployment and ongoing maintenance

🎯 Requirements

• 4+ years of professional software engineering experience (non-negotiable)
• Expert Python — clean, performant, well-tested code
• Hands-on experience working in large, complex codebases
• Proven experience designing and implementing LLM coding benchmarks and evaluation data pipelines
• Strong command of Git and modern development workflows
• Track record at a high-growth tech company or top-tier software organization

🏖️ Benefits

• $80–$100/hr compensation based on location and seniority
• Fully remote work — eligible from accepted countries only
• Weekly payment via PayPal or Stripe
• 3-month contract with potential for extension
• Full-time availability preferred, though hours vary week to week
• Independent contractor (1099) engagement with no visa sponsorship or W-2 employment

Skills & Technologies

• Python
• JavaScript
• Git
• Pytest

Senior, Remote, $80-100/hr, Degree Required

About G2i Inc.

G2i is a technical talent marketplace that pre-vets React, React Native, and Node.js engineers for U.S. companies. Founded by developers to solve hiring pain, it runs extensive code reviews, pair-programming interviews, and background checks before matching engineers for contract or full-time remote roles. G2i emphasizes mental health, offering a monthly wellness stipend and a zero-burnout policy. The company also provides direct-hire services and manages payroll, compliance, and ongoing support, enabling startups and enterprises to scale engineering teams quickly while maintaining code quality and developer well-being.
