KIV: 1M token context window on an RTX 4070 (12GB VRAM), no retraining, drop-in HuggingFace cache replacement - works with any model that uses DynamicCache [P]
KIV replaces the standard key-value cache in HuggingFace transformers with a tiered retrieval system, enabling a 1 million token context window on an RTX 4070 with only 12GB of VRAM and no retraining. Because it slots in wherever DynamicCache is used, it remains a drop-in cache replacement while scaling context length drastically, improving inference for long-context LLM applications.
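To make the "tiered retrieval" idea concrete, here is a toy sketch of the general technique: keep a small "hot" window of recent keys/values in fast memory, evict older blocks to a cold store with a summary vector, and fetch back only the blocks most similar to the current query. This is a minimal illustration of the concept, not KIV's actual implementation; the class name, block summarization via mean centroids, and dot-product scoring are all assumptions.

```python
import numpy as np


class TieredKVCache:
    """Toy tiered KV cache (illustrative sketch, not KIV's real code):
    recent tokens stay in a hot tier; older tokens are evicted in blocks
    to a cold tier and retrieved per-query by centroid similarity."""

    def __init__(self, hot_capacity=4, block_size=2, top_k=1):
        self.hot_capacity = hot_capacity  # max tokens kept in the hot tier
        self.block_size = block_size      # tokens per evicted cold block
        self.top_k = top_k                # cold blocks fetched per query
        self.hot_k, self.hot_v = [], []   # recent keys/values (fast tier)
        self.cold = []                    # list of (centroid, keys, values)

    def append(self, k, v):
        """Add one token's key/value; spill oldest blocks when full."""
        self.hot_k.append(k)
        self.hot_v.append(v)
        while len(self.hot_k) > self.hot_capacity:
            ks = [self.hot_k.pop(0) for _ in range(self.block_size)]
            vs = [self.hot_v.pop(0) for _ in range(self.block_size)]
            centroid = np.mean(ks, axis=0)  # summary used for retrieval
            self.cold.append((centroid, ks, vs))

    def gather(self, query):
        """Return top-k query-similar cold blocks plus all hot tokens."""
        scored = sorted(self.cold, key=lambda b: -float(query @ b[0]))
        keys, vals = [], []
        for _, ks, vs in scored[: self.top_k]:
            keys.extend(ks)
            vals.extend(vs)
        return keys + self.hot_k, vals + self.hot_v
```

A real drop-in replacement would presumably subclass `transformers.DynamicCache` and override its `update` path so attention only ever sees the gathered subset, which is what keeps VRAM bounded regardless of total context length.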
