Alexey Fateev

Technical Lead / AI Platform & LLMOps

Summary

Technical lead with 6+ years in IT and 3+ years in MLOps/LLMOps, focused on taking AI/LLM solutions from requirements and technical design to production rollout and operations. Combines hands-on AI infrastructure expertise with delivery leadership across Data Science, MLOps, Data Engineering, analytics, and business stakeholders. Specializes in building and optimizing RAG platforms, AI agents, and LLM inference infrastructure.

  • Production experience with RAG platforms, AI agents, LLM inference, model serving, release processes, observability, and incident handling.
  • Able to translate business/product needs into technical plans, clarify ambiguous requirements, surface delivery risks, and unblock cross-functional teams.
  • Active independent researcher in LLM inference and public speaker on RAG, AI platforms, and LLM infrastructure.

Experience

Technical Lead / AI Platform & LLMOps

KTS · August 2024 - Present · Russia

In this role, I lead the MLOps team in a project to create a unified RAG platform for the entire bank. My work combines technical leadership, model optimization, and interaction with business stakeholders to integrate new solutions.

Key Achievements:

  • Leading a cross-functional team of 15+ professionals (Data Scientists, ML Engineers, Data Engineers, System Analysts) as Tech Lead, driving technical strategy and execution across multiple AI initiatives
  • Successfully delivered 5 production-ready RAG-based products and AI Agent solutions, serving the entire bank's AI infrastructure needs
  • Architected and implemented from scratch an A/B testing platform for RAG products leveraging Istio Service Mesh and Argo Rollouts, enabling data-driven product optimization
  • Designed and deployed canary deployment strategy from the ground up, significantly reducing production deployment risks and enabling safer rollouts
  • Established unified technical and infrastructure layer across all AI products, ensuring consistency, scalability, and maintainability
  • Optimized LLM inference for a 40% performance improvement, which halved the response time of the entire RAG service
  • Ensured high performance and reliability of the service, maintaining a 5-second SLA under load of up to 250,000 requests per day
  • Developed and implemented production-ready MLOps pipelines for LLM model deployment using KServe and vLLM
  • Resolved infrastructure constraints by building vLLM from source with flash-attention support for legacy CUDA (11.8)
  • Implemented a unified gateway (Higress) for all LLM models and MCP (Model Context Protocol), centralizing management and access

Core Responsibilities:

  • Designing architecture and participating in RAG system implementation
  • Deploying and maintaining LLM inference infrastructure in new clusters based on KServe, including troubleshooting Knative and Istio components
  • Client interaction: conducting meetings, designing connection schemes for onboarding new clients to the RAG service, and estimating effort
  • Creating unified pipelines for deploying various non-model services across multiple environments (clusters), improving release speed and consistency
  • Researching and implementing best practices for optimizing and accelerating LLM inference

Kubernetes · KServe · vLLM · RAG · ArgoCD · Argo Rollouts · Istio · Python · Jenkins · AI Agents

MLOps Engineer

VK · May 2023 - August 2024 · 1 year 4 months · Russia

  • Developed and maintained a machine learning model deployment platform, managing 100+ ML models as part of a specialized ML team
  • Orchestrated database operations, including table creation and structure optimization for enhanced performance
  • Led critical aspects of a large-scale infrastructure migration, including server relocation and system upgrades
  • Authored and implemented Lua scripts for Tarantool Cartridge cluster during application migration
  • Enhanced a Golang-based database emulator for ClickHouse, improving integration testing capabilities
  • Streamlined Python environment migration through RPM packaging and GitLab CI pipeline development
  • Developed and deployed a chatbot application utilizing OpenAI API, LangChain, and RAG for custom report generation
  • Deployed applications in Kubernetes (k8s) environments, ensuring scalability and efficient container orchestration
  • Utilized Puppet for automated server deployment and configuration management

Python · RAG · Lua · Golang · ClickHouse · RPM · GitLab CI · OpenAI API · LangChain · Kubernetes · Puppet

Data Engineer

Метр квадратный · March 2022 - May 2023 · 1 year 3 months · Russia

  • DWH maintenance
  • Modeling new database objects, converting data from non-relational to relational form
  • Implementing Grafana and Prometheus to track DAG execution metrics
  • Creating and maintaining ETL pipelines to automate CRM interactions with customers through various communication channels (email, SMS, push notifications, etc.)
  • Using asynchronous code to speed up query execution
  • API integration with external systems

Python · DWH · Apache Airflow · Apache Kafka · PostgreSQL

Data Engineer

DataArt · August 2021 - March 2022 · 8 months · Russia

  • Developed data pipelines in GCP for financial data processing, including encryption and anonymization in a PCI environment
  • Built backend services using FastAPI and deployed them to Cloud Run and Cloud Functions
  • Created and maintained data analytics protocols, standards and documentation
  • Developed web application using Django and Plotly Dash for IT job market trend analysis
  • Implemented ETL pipelines using Apache Airflow for data processing
  • Worked with technologies: GKE, Cloud PubSub, BigQuery, Cloud Build, PostgreSQL, Docker, Redis

GCP · FastAPI · Django · Plotly Dash · Apache Airflow · GKE · Cloud PubSub · BigQuery · PostgreSQL · Docker · Redis

Independent Research

LLM Inference Optimization

2025 - Present

Building and maintaining a personal 4×RTX 3090 inference server (96GB VRAM). Experimenting with emerging inference optimization techniques, tracking industry trends, and publishing benchmarks and findings on Telegram.

  • Benchmarked Prefill/Decode disaggregation (SGLang + Mooncake), achieving 5× lower P99 inter-token latency vs unified serving
  • Tested DFlash speculative decoding with Qwen3.5-27B, reaching 90 tok/s single-user (1.6× over baseline)
  • Evaluated ultra-low-bit dynamic quantization (Unsloth IQ2_XXS) for running 256B+ parameter models on consumer hardware
  • Exploring Tensor/Pipeline Parallelism tradeoffs on PCIe-connected multi-GPU setups without NVLink

vLLM · SGLang · Mooncake · DFlash · GGUF/AWQ quantization · RTX 3090 multi-GPU · flash-attention

Public Speaking

MLечный путь 2026

AI agents in a major bank: experience, impact, costs, and mistakes

DevOops 2025

From RAG for operators to a RAG platform for a major bank

Selectel Tech Day 2025

Alfa-Bank case: piloting RAG for 10,000 employees

Certifications

Technical Leadership

Hard&Soft Skills · Certificate of completion for the Technical Leadership program

Education

Master of Mathematical Modeling and Computer Science

Voronezh State University · 2009 - 2015 · Russia