How Grab Built a Vision LLM to Scan Images

Digital services require accurate extraction of information from user-submitted documents such as identification cards, driver’s licenses, and vehicle registration certificates. This process is essential for electronic know-your-customer (eKYC) verification. However, the diversity of languages and document formats across Southeast Asia makes this task particularly challenging. The Grab Engineering Team faced significant obstacles with traditional Optical Character Recognition (OCR) systems, which struggled to handle the variety of document templates. While powerful proprietary Large Language Models (LLMs) were available, they often failed to adequately understand Southeast Asian languages, produced errors and hallucinations, and suffered from high latency. Open-source Vision LLMs offered better efficiency but lacked the accuracy required for production deployment. This situation prompted Grab to fine-tune existing models and eventually build a lightweight, specialized Vision LLM from the ground up. In this article, we will look at the complete architecture, the technical decisions made, and the results achieved.
Disclaimer: This post is based on publicly shared details from the Grab Engineering Team. Please comment if you notice any inaccuracies.

Understanding Vision LLMs

Before diving into the solution, it helps to understand what a Vision LLM is and how it differs from traditional text-based language models. A standard LLM processes text inputs and generates text outputs. A Vision LLM extends this capability by enabling the model to understand and process images. The architecture consists of three essential components working together: a vision encoder that converts images into visual features, a projector that maps those features into the language model’s embedding space, and a language model decoder that generates text conditioned on both the visual and textual inputs.
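To make the three-component design concrete, here is a minimal PyTorch-style sketch of how a generic Vision LLM wires an encoder, projector, and decoder together. The class and argument names are illustrative assumptions, not Grab’s or Qwen’s actual implementation.

```python
import torch
import torch.nn as nn

class MinimalVisionLLM(nn.Module):
    """Illustrative three-component Vision LLM: vision encoder -> projector -> language decoder."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int, text_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder              # turns pixels into visual feature vectors
        self.projector = nn.Linear(vision_dim, text_dim)  # maps visual features into the decoder's embedding space
        self.language_model = language_model              # autoregressive text decoder

    def forward(self, images: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
        visual_features = self.vision_encoder(images)     # (batch, num_patches, vision_dim)
        visual_tokens = self.projector(visual_features)   # (batch, num_patches, text_dim)
        # The projected visual tokens are placed in front of the text tokens,
        # so the decoder attends to the image while generating its answer.
        combined = torch.cat([visual_tokens, text_embeddings], dim=1)
        return self.language_model(combined)
```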
Selecting the Base Model

Grab evaluated several open-source models capable of performing OCR and Key Information Extraction (KIE). The options included Qwen2VL, miniCPM, Llama3.2 Vision, Pixtral 12B, GOT-OCR2.0, and NVLM 1.0. After thorough evaluation, Grab selected Qwen2-VL 2B as the base multimodal LLM. This decision was driven by several critical factors, including the model’s compact size and its native support for dynamic image resolutions, which helps preserve text integrity in document images.
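As a rough illustration of what running the off-the-shelf base model looks like, here is a sketch that queries the public Qwen/Qwen2-VL-2B-Instruct checkpoint through Hugging Face transformers. The prompt wording, file name, and extraction fields are assumptions for illustration; Grab’s production prompts and fine-tuned weights are not public.

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Public base checkpoint; Grab's fine-tuned weights are internal.
MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)

image = Image.open("id_card.jpg")  # hypothetical document image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract the name, ID number, and date of birth as JSON."},
    ],
}]

# Build the chat prompt, bundle it with the image, and generate the extraction.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```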
Initial benchmarking of Qwen2VL and miniCPM on Grab’s dataset revealed low accuracy, primarily due to the limited coverage of Southeast Asian languages. This finding motivated the team to pursue fine-tuning to improve OCR and KIE accuracy. However, training LLMs is both data-intensive and GPU resource-intensive, which brings up two important questions: how to use open-source and internal data effectively, and how to customize the model to reduce latency while maintaining high accuracy.

Training Dataset Generation

Grab developed two approaches to generate training data for the model.

1 - Synthetic OCR Dataset

The first approach involved creating synthetic training data. Grab extracted Southeast Asian language text content from Common Crawl, a large online text corpus that contains data from across the internet. Using an in-house synthetic data pipeline, the team generated text images by rendering this content in various fonts, backgrounds, and augmentations (a simplified rendering sketch appears after this section). The resulting dataset included text in Bahasa Indonesia, Thai, Vietnamese, and English. Each image contained a paragraph of random sentences extracted from the corpus. This synthetic approach offered several advantages. It allowed controlled generation of training examples, enabled the creation of unlimited variations, and ensured coverage of different visual styles and document conditions.

2 - Documint: The Auto-Labelling Framework

The second approach leveraged real documents collected by Grab. Experiments showed that applying document detection and orientation correction significantly improved OCR and information extraction. To generate a preprocessing dataset, Grab built Documint, an internal platform that provides an auto-labelling and preprocessing framework for document understanding. Documint prepares high-quality, labelled datasets through various submodules that execute the full OCR and KIE task. The team used this pipeline with a large volume of Grab-collected cards and documents to extract training labels. Human reviewers then refined the data to achieve high label accuracy. Documint consists of four main modules covering the full pipeline, from document detection and orientation correction through text recognition to key information extraction.
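The synthetic pipeline itself is internal, but the following sketch shows the general idea of rendering corpus text into augmented training images with Pillow. The font paths, colors, and augmentation ranges are assumptions for illustration, not Grab’s actual settings.

```python
import random
from PIL import Image, ImageDraw, ImageFont, ImageFilter

# Illustrative assets; Grab's pipeline uses its own fonts, backgrounds, and augmentations.
FONT_PATHS = ["fonts/NotoSansThai-Regular.ttf", "fonts/NotoSans-Regular.ttf"]
BACKGROUND_COLORS = [(255, 255, 255), (245, 240, 225), (230, 230, 230)]

def render_text_image(paragraph: str, width: int = 800, height: int = 240) -> Image.Image:
    """Render a paragraph of corpus text as a lightly augmented training image."""
    background = Image.new("RGB", (width, height), random.choice(BACKGROUND_COLORS))
    draw = ImageDraw.Draw(background)
    font = ImageFont.truetype(random.choice(FONT_PATHS), size=random.randint(18, 28))

    # Naive word wrap: draw line by line until we run out of vertical space.
    words, line, y = paragraph.split(), "", 10
    for word in words:
        candidate = f"{line} {word}".strip()
        if draw.textlength(candidate, font=font) > width - 20:
            draw.text((10, y), line, fill=(20, 20, 20), font=font)
            y += font.size + 6
            line = word
            if y > height - font.size:
                break
        else:
            line = candidate
    if line and y <= height - font.size:
        draw.text((10, y), line, fill=(20, 20, 20), font=font)

    # Mild blur and rotation mimic scanner or phone-camera noise.
    background = background.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 0.8)))
    return background.rotate(random.uniform(-2, 2), expand=False, fillcolor=(255, 255, 255))
```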
The Experimentation Journey

Grab conducted the model development in three distinct phases, each building on the lessons learned from the previous phase.

Phase 1: LoRA Fine-Tuning

The first attempt at fine-tuning involved a technique called Low-Rank Adaptation, or LoRA. This method is efficient because it updates only a small portion of the model’s parameters rather than retraining the entire model. Specifically, LoRA adds small trainable matrices to the model while keeping most of the original weights frozen (see the sketch below). This approach minimizes computational resource requirements and reduces training time. Grab trained the model on curated document data that included various document templates in multiple languages. The performance showed promise for documents with Latin scripts. The LoRA fine-tuned Qwen2VL-2B achieved high field-level accuracy for Indonesian documents. However, the fine-tuned model still struggled with two categories of documents, in particular those written in non-Latin scripts such as Thai.
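For readers unfamiliar with LoRA, here is a minimal sketch of what such an adapter setup looks like with the Hugging Face peft library. The rank, targeted modules, and other hyperparameters are assumptions for illustration and not Grab’s configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto"
)

# Small low-rank adapters are attached to the attention projections;
# the original weights stay frozen, so only a tiny fraction of parameters train.
lora_config = LoraConfig(
    r=16,                      # rank of the low-rank update matrices (assumed)
    lora_alpha=32,             # scaling factor for the adapter output (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```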
The experiments revealed a key limitation. While open-source Vision LLMs often have extensive multilingual text corpus coverage for the language model decoder’s pre-training, they lack visual examples of text in Southeast Asian languages during vision encoder training. The language model might understand Thai text, but the vision encoder had never learned to recognize what Thai characters look like in images. This insight drove the decision to pursue full parameter fine-tuning.

Phase 2: Full Fine-Tuning

Drawing from the Large Language and Vision Assistant (LLAVA) methodology, Grab implemented a two-stage training approach, following LLAVA’s general pattern of first aligning the visual features with the language model and then fine-tuning on task data (a simplified staging sketch follows).
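A simple way to express this kind of staged training in code is to toggle which components receive gradients. The module attribute names below are placeholders; the real Qwen2-VL module layout differs, and Grab’s exact staging recipe is not public.

```python
import torch

def set_requires_grad(module: torch.nn.Module, flag: bool) -> None:
    """Enable or disable gradient updates for every parameter in a module."""
    for param in module.parameters():
        param.requires_grad = flag

def configure_stage(model: torch.nn.Module, stage: int) -> None:
    """Stage 1: train only the projector to align visual and text features.
    Stage 2: unfreeze everything for full fine-tuning on task data.
    Attribute names (vision_encoder, language_model, projector) are illustrative."""
    if stage == 1:
        set_requires_grad(model.vision_encoder, False)
        set_requires_grad(model.language_model, False)
        set_requires_grad(model.projector, True)
    else:
        set_requires_grad(model, True)
```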
The results were significant. For example, Thai document accuracy increased by 70 percentage points from the baseline. Vietnamese document accuracy rose by 40 percentage points from the baseline. Indonesian documents saw a 15 percentage point improvement, and Philippine documents improved by 6 percentage points. The fully fine-tuned Qwen2-VL 2B model delivered substantial improvements, especially on documents that the LoRA model had struggled with.

Phase 3: Building a 1B Model from Scratch

While the 2B model succeeded, full fine-tuning pushed the limits of available GPUs. To optimize resource usage and create a model perfectly tailored to their needs, Grab decided to build a lightweight Vision LLM with approximately 1 billion parameters from scratch. The strategy involved combining the best components from different models. Grab took the powerful vision encoder from the larger Qwen2-VL 2B model, which had proven effective at understanding document images. The team paired it with the compact and efficient language decoder from the Qwen2.5 0.5B model. They connected these components with an adjusted projector layer to ensure seamless communication between the vision encoder and language decoder (a composition sketch appears below). This combination created a custom Vision LLM with approximately 1 billion parameters, optimized for both training and deployment.

Four-Stage Training Process

Grab trained this new model using a comprehensive four-stage process.
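The sketch below illustrates, in PyTorch terms, what pairing the 2B model’s vision encoder with the 0.5B decoder through a new projector could look like. The attribute names, config fields, and projector shape are assumptions; Grab’s actual architecture and training code are not public.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, Qwen2VLForConditionalGeneration

# Donor models: vision encoder from Qwen2-VL 2B, language decoder from Qwen2.5 0.5B.
vlm_2b = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
decoder_05b = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

class CustomVisionLLM(nn.Module):
    """Roughly 1B-parameter Vision LLM assembled from donor components (illustrative)."""

    def __init__(self, vision_encoder, decoder, vision_dim: int, text_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder
        # The projector is re-sized so encoder outputs match the smaller decoder's hidden size.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, text_dim), nn.GELU(), nn.Linear(text_dim, text_dim)
        )
        self.decoder = decoder

    def forward(self, pixel_values: torch.Tensor, text_embeddings: torch.Tensor):
        visual_tokens = self.projector(self.vision_encoder(pixel_values))
        combined = torch.cat([visual_tokens, text_embeddings], dim=1)
        return self.decoder(inputs_embeds=combined)

custom_model = CustomVisionLLM(
    vision_encoder=vlm_2b.visual,                        # assumed attribute name
    decoder=decoder_05b,
    vision_dim=vlm_2b.config.vision_config.hidden_size,  # assumed config field
    text_dim=decoder_05b.config.hidden_size,
)
```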
Results and Performance

The final 1B model achieved remarkable results across two key metrics: accuracy and latency. For accuracy, the model performed comparably to the larger 2B model, staying within a 3 percentage point accuracy gap across most document types. The model also maintained strong generalization when trained on quality-augmented datasets, meaning it could handle variations it had not seen during training. For latency, the results were even more impressive. The 1B model achieved 48 percent faster processing at the P50 latency (median response time), 56 percent faster at P90 latency (90th percentile), and 56 percent faster at P99 latency (99th percentile, representing worst-case scenarios). These latency improvements are particularly important. Grab identified that one of the biggest weaknesses of external APIs like ChatGPT or Gemini was the P99 latency, which can easily be 3 to 4 times higher than the P50 latency. This variability would not be acceptable for large-scale production rollouts where consistent performance is essential.

Key Technical Insights

The project yielded several important insights that can guide similar efforts.

Full parameter fine-tuning proved superior to LoRA for specialized, non-Latin script domains. While LoRA is efficient, it cannot match the performance gains of updating all model parameters when dealing with significantly different data distributions.

Lightweight models can be highly effective. A smaller model of approximately 1 billion parameters, built from scratch and trained comprehensively, can achieve near state-of-the-art results. This validates the approach of custom architecture over simply using the largest available model.

The choice of base model matters significantly. Starting with a model that has native support for target languages is crucial for success. Trying to force a model to learn languages it was not designed for leads to suboptimal results.

Data quality plays a critical role. Meticulous dataset preprocessing and augmentation are as important as model architecture in achieving consistent and accurate results. The effort invested in building Documint and creating synthetic datasets directly contributed to the final model’s success.

Finally, native resolution support is transformative for OCR tasks. A model that can handle dynamic image resolutions preserves text integrity and dramatically improves OCR capabilities. This feature prevents the distortion that occurs when images are resized to fit fixed input dimensions.

Conclusion

Grab’s journey of building a Vision LLM demonstrates that specialized Vision LLMs can effectively replace traditional OCR pipelines with a single, unified, highly accurate model. This opens new possibilities for document processing at scale. The project shows that with strategic training approaches, high-quality data preparation, and thoughtful model architecture decisions, smaller specialized models can outperform larger general-purpose alternatives. The resulting system processes documents faster and more accurately than previous solutions while using fewer computational resources. Grab continues to enhance these capabilities. The team is developing Chain of Thought-based OCR and KIE models to strengthen generalization and tackle even more diverse document scenarios. They are also extending support to all Grab markets, bringing advanced document processing to Myanmar, Cambodia, and beyond.