Alexey Fateev

Technical Lead / AI Platform & LLMOps

[email protected] • in/alexfateev • x.com/superalesha • t.me/tarnished_ones • t.me/fuckup_files

Summary

Technical lead with 6+ years in IT and 3+ years in MLOps/LLMOps, focused on taking AI/LLM solutions from requirements and technical design to production rollout and operations. Combines hands-on AI infrastructure expertise with delivery leadership across Data Science, MLOps, Data Engineering, analytics, and business stakeholders. Specializes in building and optimizing RAG platforms, AI agents, and LLM inference infrastructure.

Production experience with RAG platforms, AI agents, LLM inference, model serving, release processes, observability, and incident handling.
Able to translate business/product needs into technical plans, clarify ambiguous requirements, surface delivery risks, and unblock cross-functional teams.
Active independent researcher in LLM inference and public speaker on RAG, AI platforms, and LLM infrastructure.

Experience

Technical Lead / AI Platform & LLMOps

KTS • August 2024 - Present • Russia

Lead the MLOps team on a project to build a unified, bank-wide RAG platform. The role combines technical leadership, model optimization, and collaboration with business stakeholders to integrate new solutions.

Key Achievements:

Lead a cross-functional team of 15+ professionals (Data Scientists, ML Engineers, Data Engineers, System Analysts) as Tech Lead, driving technical strategy and execution across multiple AI initiatives
Successfully delivered 5 production-ready RAG-based products and AI Agent solutions, serving the entire bank's AI infrastructure needs
Architected and implemented an A/B testing platform for RAG products from scratch, using Istio Service Mesh and Argo Rollouts, enabling data-driven product optimization
Designed and deployed a canary deployment strategy from the ground up, reducing production deployment risk and enabling safer rollouts
Established a unified technical and infrastructure layer across all AI products, ensuring consistency, scalability, and maintainability
Optimized LLM inference for a 40% performance improvement, which halved the response time of the entire RAG service
Ensured high performance and reliability of the service, maintaining a 5-second SLA under load of up to 250,000 requests per day
Developed and implemented production-ready MLOps pipelines for LLM deployment using KServe and vLLM
Resolved infrastructure constraints by building vLLM from source with flash-attention support for legacy CUDA (11.8)
Implemented a unified gateway (Higress) for all LLMs and MCP (Model Context Protocol), centralizing management and access

Core Responsibilities:

Designing architecture and participating in RAG system implementation
Deploying and maintaining LLM inference infrastructure in new clusters based on KServe, including troubleshooting Knative and Istio components
Working with clients: running meetings, designing connection schemes for onboarding new clients to the RAG service, and estimating effort
Creating unified pipelines for deploying various non-model services across multiple environments (clusters), improving release speed and consistency
Researching and implementing best practices for optimizing and accelerating LLM inference

Kubernetes • KServe • vLLM • RAG • ArgoCD • Argo Rollouts • Istio • Python • Jenkins • AI Agents

MLOps Engineer

VK • May 2023 - August 2024 · 1 year 4 months • Russia

Developed and maintained a machine learning model deployment platform, managing 100+ ML models as part of a specialized ML team
Orchestrated database operations, including table creation and structure optimization for enhanced performance
Led critical aspects of a large-scale infrastructure migration, including server relocation and system upgrades
Authored and implemented Lua scripts for a Tarantool Cartridge cluster during application migration
Enhanced a Golang-based database emulator for ClickHouse, improving integration testing capabilities
Streamlined Python environment migration through RPM packaging and GitLab CI pipeline development
Developed and deployed a chatbot application using the OpenAI API, LangChain, and RAG for custom report generation
Deployed applications in Kubernetes (k8s) environments, ensuring scalability and efficient container orchestration
Utilized Puppet for automated server deployment and configuration management

Python • RAG • Lua • Golang • ClickHouse • RPM • GitLab CI • OpenAI API • LangChain • Kubernetes • Puppet

Data Engineer

Метр квадратный • March 2022 - May 2023 · 1 year 3 months • Russia

Maintaining the data warehouse (DWH)
Modeling new database objects, converting them from non-relational to relational form
Implementing Grafana and Prometheus to track DAG execution metrics
Creating and maintaining ETL pipelines to automate CRM interactions with customers through various communication channels (email, SMS, push notifications, etc.)
Using asynchronous execution to speed up queries
API integration with external systems

Python • DWH • Apache Airflow • Apache Kafka • PostgreSQL

Data Engineer

DataArt • August 2021 - March 2022 · 8 months • Russia

Developed data pipelines in GCP for financial data processing, including encryption and anonymization in a PCI environment
Built backend services using FastAPI and deployed them to Cloud Run and Cloud Functions
Created and maintained data analytics protocols, standards and documentation
Developed a web application using Django and Plotly Dash for IT job market trend analysis
Implemented ETL pipelines using Apache Airflow for data processing
Worked with GKE, Cloud Pub/Sub, BigQuery, Cloud Build, PostgreSQL, Docker, and Redis

GCP • FastAPI • Django • Plotly Dash • Apache Airflow • GKE • Cloud Pub/Sub • BigQuery • PostgreSQL • Docker • Redis

Independent Research

LLM Inference Optimization

2025 - Present

Building and maintaining a personal 4×RTX 3090 inference server (96 GB VRAM) and sharing the work in public on X and Telegram. Focused on running large models on consumer GPUs: quantization, speculative decoding, prefill/decode disaggregation, and multi-GPU parallelism without NVLink.

Sped up DeepSeek-V4 prefill by 29× on the rig with no custom kernels, only by fixing the inference engine (attention was silently running on CPU while all 4 GPUs sat idle)
Run 100B+ parameter models locally through quantization, e.g. Qwen3.5-122B (AWQ / AutoRound int4) with ~200K context on 96 GB
Publish reproducible benchmarks across vLLM, SGLang, and llama.cpp; one deep-dive thread reached 117K+ views

vLLM • SGLang • llama.cpp • AWQ/AutoRound quantization • speculative decoding • RTX 3090 multi-GPU • flash-attention

Public Speaking

DevOops 2025

From RAG for operators to a RAG platform for a major bank

Selectel Tech Day 2025

Alfa-Bank case — piloting RAG for 10,000 employees

GR2Hub local meetup

RAG in banking

MLечный путь 2026

AI agents in a major bank — experience, impact, costs, mistakes

Certifications

Technical Leadership

Hard&Soft Skills · Certificate of completion for the Technical Leadership program

Education

Master of Mathematical Modeling and Computer Science

Voronezh State University • 2009 - 2015 • Russia