Hybrid Intelligence: Solving Excel Table Detection with Heuristics and LLMs


Abstract

Excel spreadsheets remain one of the most challenging data formats for automated processing systems. Unlike structured formats such as CSV, JSON, or XML, Excel files often contain dispersed, multi-format tabular data with unpredictable layouts. This article presents a novel hybrid approach combining heuristic boundary detection with Large Language Model (LLM) semantic analysis to accurately detect, extract, and structure data from complex Excel files. Our method addresses critical limitations of both purely rule-based and purely AI-driven approaches, achieving robust table detection across diverse layouts.

Table of Contents

  1. The Excel Processing Challenge
  2. Evolution of Our Approach
  3. Architecture Deep Dive
  4. Accuracy Measurement and Results
  5. Practical Applications
  6. Comparative Analysis
  7. Limitations and Future Work
  8. Conclusion

The Excel Processing Challenge

Why Excel Is Different

Excel files present unique challenges that distinguish them from other structured data formats:

  1. Positional Freedom: Data can appear anywhere on a sheet, with no enforced structure
  2. Multiple Tables Per Sheet: A single sheet may contain multiple unrelated tables with different formats
  3. Diverse Table Types: Tables range from simple single-header structures to complex multi-header, pivot, hierarchical, and matrix formats
  4. Visual-First Design: Excel prioritizes human readability over machine parseability, with merged cells, formatting cues, and visual hierarchies that lack explicit structural markers
  5. Sparse Data: Empty rows and columns serve as visual separators but complicate programmatic detection

These characteristics make Excel fundamentally different from formats designed for data interchange (CSV, JSON, XML) where structure is explicit and predictable.
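
To make these challenges concrete, consider a hypothetical (and deliberately small) sheet as it looks once parsed into a raw 2D array, as in Step 1 of the architecture below: a free-floating title, blank separator rows, and two unrelated tables of different widths on the same sheet.

// Hypothetical raw sheet content as a 2D array (nulls = empty cells)
const exampleSheet: unknown[][] = [
  ['Q1 Sales Report', null, null, null],          // free-floating title
  [null, null, null, null],                       // blank separator row
  ['Invoice ID', 'Customer', 'Amount', 'Date'],   // table 1 header
  ['INV-001', 'Acme Corp', 1200, '2024-01-15'],
  ['INV-002', 'Globex', 850, '2024-01-18'],
  [null, null, null, null],                       // blank separator row
  ['Region', 'Total', null, null],                // table 2 header (different width)
  ['North', 2050, null, null]
];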

Evolution of Our Approach

Phase 1: Pure Heuristic Methods

Our initial implementation relied entirely on rule-based heuristic methods to detect tables within Excel sheets. This approach, implemented in our first-generation processor, attempted to identify table boundaries through:

Detection Strategies

  • Density Analysis: Calculating the ratio of non-empty cells in sliding windows to identify data-rich regions
  • Empty Row Detection: Using blank rows as table separators
  • Pattern Recognition: Identifying header rows by detecting text-heavy rows followed by numeric data
  • Column Type Analysis: Detecting columns based on consistent data types (text, numeric, date)
  • Multi-table Detection: Splitting sheets into regions based on visual gaps

// Simplified example of heuristic boundary detection
private detectByDensity(rows: unknown[][]): TableBoundary[] {
  const windowSize = 10;
  const densityThreshold = 0.3; // 30% non-empty cells

  let inTable = false;
  let tableStart = 0;
  const boundaries: TableBoundary[] = [];

  for (let i = 0; i < rows.length; i += windowSize) {
    const windowEnd = Math.min(i + windowSize, rows.length);
    const windowRows = rows.slice(i, windowEnd);
    const density = this.calculateDensity(windowRows);

    if (density >= densityThreshold && !inTable) {
      tableStart = i;
      inTable = true;
    } else if (density < densityThreshold && inTable) {
      boundaries.push(this.createBoundary(tableStart, i - 1));
      inTable = false;
    }
  }

  // Flush the final table if the sheet ends while still inside one
  if (inTable) {
    boundaries.push(this.createBoundary(tableStart, rows.length - 1));
  }

  return boundaries;
}

Limitations Encountered

The purely heuristic approach failed catastrophically across multiple dimensions:

  1. Poor Generalization: Rules optimized for invoice registers failed completely on P&L statements or balance sheets
  2. Sensitivity to Formatting: Extra spacing, merged cells, or unconventional layouts broke detection logic
  3. Multi-table Confusion: Difficulty distinguishing between separate tables vs. sections of one table
  4. Header Detection Failures: Complex multi-row headers or non-standard naming conventions caused misclassification
  5. False Positives: Text blocks, titles, and notes were incorrectly identified as tables
  6. Brittle Logic: Minor variations in Excel design required constant rule adjustments

Result: Accuracy rates below 60% on real-world files, with frequent complete failures on unconventional layouts.

Phase 2: Full LLM-Based Detection

Given the failure of heuristics, we pivoted to a pure LLM approach, leveraging GPT-4’s semantic understanding to interpret Excel data as structured information.

Implementation

We sent raw 2D arrays of cell values directly to GPT-4 with prompts requesting:

  • Table boundary identification
  • Header row detection
  • Column classification (dimensions vs. measures)
  • Semantic interpretation of table purpose

// Conceptual example of LLM-only approach
const prompt = `Analyze this Excel sheet data and identify all tables:
${JSON.stringify(rawData)}

For each table, specify:
- Start and end rows
- Column headers
- Data types
- Business context`;

const analysis = await generateObject({
  model: openai('gpt-4'),
  schema: TableAnalysisSchema,
  prompt
});

The Attention Bias Problem

This approach worked remarkably well for small datasets but failed systematically on larger tables due to a phenomenon known as attention bias or the “lost-in-the-middle” problem.

Understanding Attention Bias in LLMs

Attention bias refers to the tendency of transformer-based language models to disproportionately focus on information at the beginning and end of input sequences while underweighting or entirely overlooking content in the middle positions. This phenomenon arises from several architectural factors:

  1. Positional Encoding Decay: The model’s ability to distinguish positional information degrades with sequence length
  2. Attention Pattern Collapse: Self-attention mechanisms often develop patterns that emphasize boundary tokens
  3. Training Data Distribution: Models are frequently trained on documents where critical information appears at beginnings and conclusions
  4. Causal Masking Effects: In autoregressive models, later tokens can attend to all previous tokens, creating an asymmetric information flow

Research from MIT CSAIL (2024) demonstrates that this bias intensifies with input length—when processing tables with 100+ rows, LLMs might accurately identify elements in rows 1-20 and 80-100 while completely omitting rows 30-70.

Observed Failures in Our Implementation

When processing a 150-row invoice ledger:

  • ✅ Correctly identified columns and first 25 transaction records
  • ❌ Skipped rows 26-145 entirely in boundary detection
  • ✅ Captured the last 5 total rows accurately
  • Net Result: 83% of actual data missing from extraction

The LLM could perfectly identify what table elements were (headers, dimensions, measures, totals) but could not reliably determine where the table ended, especially for large datasets.

Phase 3: The Hybrid Approach

Recognizing that both heuristics and LLMs had complementary strengths and weaknesses, we developed a hybrid architecture:

| Capability | Heuristics | LLMs |
| --- | --- | --- |
| Boundary detection (spatial) | ✅ Excellent | ❌ Poor (attention bias) |
| Semantic understanding | ❌ None | ✅ Excellent |
| Complex header parsing | ❌ Poor | ✅ Excellent |
| Performance on large data | ✅ Fast, constant | ❌ Slow, degrades |
| Generalization across formats | ❌ Poor | ✅ Good (within boundaries) |

Solution: Use heuristics for spatial analysis (finding boundaries) and LLMs for semantic analysis (understanding content).
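
In code, the hybrid flow can be sketched roughly as follows; heuristicDetector, llmAnalyzer, and toDetectedTable are illustrative placeholders for the components detailed in the next section, not the exact production API.

// Minimal sketch of the hybrid pipeline (placeholder helper names)
async function processSheet(rawData: unknown[][]): Promise<DetectedTable[]> {
  // Step 2: cheap, deterministic spatial analysis
  const boundaries = heuristicDetector.detectBoundaries(rawData);

  // Step 3: one scoped LLM call per boundary (small inputs avoid attention bias)
  const analyses = await Promise.all(
    boundaries.map(b => llmAnalyzer.analyzeSingleBoundary(b, rawData))
  );

  // Keep accepted boundaries and pair each analysis with its original boundary
  return boundaries.flatMap((boundary, i) =>
    analyses[i].decision === 'accept'
      ? [toDetectedTable(analyses[i], boundary, rawData)]
      : []
  );
}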

Architecture Deep Dive

Step 1: Excel File Parsing

We use the SheetJS (xlsx) library to extract raw data from Excel files in a 2D array format:

const workbook = read(fileBuffer, {
  type: 'buffer',
  cellDates: true, // Parse dates automatically
  cellNF: true, // Include number formats
  cellStyles: false // Skip styling for performance
});

const rawData = utils.sheet_to_json<unknown[]>(worksheet, {
  header: 1, // Return as 2D array
  raw: false, // Format values as strings
  defval: null, // Null for empty cells
  blankrows: true // Include blank rows (crucial for boundaries)
});

Output: Clean 2D array where rawData[row][col] gives us the cell value at any position.

Step 2: Heuristic Boundary Detection

We apply multiple detection algorithms in parallel:

2.1 Empty Row Detection

Identifies tables separated by blank rows:

private detectByEmptyRows(rawData: unknown[][]): TableBoundary[] {
  const boundaries: TableBoundary[] = [];
  const emptyRows: number[] = [];

  // Find empty rows
  for (let i = 0; i < rawData.length; i++) {
    if (this.isEmptyRow(rawData[i])) {
      emptyRows.push(i);
    }
  }

  // Split into regions between empty rows
  let startRow = 0;
  for (const emptyRow of emptyRows) {
    if (emptyRow > startRow) {
      boundaries.push({
        startRow,
        endRow: emptyRow - 1,
        startCol: 0,
        endCol: (rawData[0]?.length ?? 1) - 1,
        detectionMethod: "empty_rows",
        confidenceScore: 0.8
      });
    }
    startRow = emptyRow + 1;
  }

  // Handle a trailing table after the last empty row
  if (startRow < rawData.length) {
    boundaries.push({
      startRow,
      endRow: rawData.length - 1,
      startCol: 0,
      endCol: (rawData[0]?.length ?? 1) - 1,
      detectionMethod: "empty_rows",
      confidenceScore: 0.8
    });
  }

  return boundaries;
}

2.2 Density-Based Detection

Uses sliding windows to find data-rich regions:

private detectByDensity(rawData: unknown[][]): TableBoundary[] {
  const windowSize = 10;
  const densityThreshold = 0.3;
  const boundaries: TableBoundary[] = [];
  let inTable = false;
  let tableStart = 0;

  for (let i = 0; i < rawData.length; i += windowSize) {
    const windowRows = rawData.slice(i, i + windowSize);
    const density = this.calculateNonEmptyCellRatio(windowRows);

    if (density >= densityThreshold && !inTable) {
      tableStart = i; // Found a dense region, mark as potential table start
      inTable = true;
    } else if (density < densityThreshold && inTable) {
      boundaries.push(this.createBoundary(tableStart, i - 1));
      inTable = false;
    }
  }

  // Flush a table that runs to the end of the sheet
  if (inTable) boundaries.push(this.createBoundary(tableStart, rawData.length - 1));

  return boundaries;
}

2.3 Pattern-Based Detection

Identifies header patterns followed by data:

private detectByPatterns(rawData: unknown[][]): TableBoundary[] {
  const boundaries: TableBoundary[] = [];

  for (let i = 0; i < rawData.length - 1; i++) {
    if (this.isHeaderRow(rawData[i]) && !this.isEmptyRow(rawData[i + 1])) {
      // Find end of this table
      let endRow = i + 1;
      for (let j = i + 2; j < rawData.length; j++) {
        if (this.isEmptyRow(rawData[j]) || this.isHeaderRow(rawData[j])) {
          break;
        }
        endRow = j;
      }

      boundaries.push(this.createBoundary(i, endRow));
      i = endRow; // Resume scanning after this table
    }
  }

  return boundaries;
}
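
The predicates these detectors rely on (isEmptyRow, isHeaderRow, calculateNonEmptyCellRatio) are deliberately simple. A minimal sketch, not the exact production implementation, might look like this:

// Sketches of the helper predicates used above (assumptions, not the
// exact production code).
private isEmptyRow(row: unknown[]): boolean {
  return !row || row.every(cell => cell === null || String(cell).trim() === '');
}

private isHeaderRow(row: unknown[]): boolean {
  const nonEmpty = row.filter(cell => cell !== null && String(cell).trim() !== '');
  if (nonEmpty.length < 2) return false;
  // Header rows are mostly short text values rather than numbers
  const textCells = nonEmpty.filter(cell => isNaN(Number(cell)));
  return textCells.length / nonEmpty.length >= 0.7;
}

private calculateNonEmptyCellRatio(rows: unknown[][]): number {
  const totalCells = rows.reduce((sum, row) => sum + row.length, 0);
  if (totalCells === 0) return 0;
  const filledCells = rows.reduce(
    (sum, row) =>
      sum + row.filter(cell => cell !== null && String(cell).trim() !== '').length,
    0
  );
  return filledCells / totalCells;
}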

2.4 Boundary Merging

Combines overlapping detections:

private mergeBoundaries(boundaries: TableBoundary[]): TableBoundary[] {
  if (boundaries.length === 0) return [];

  // Sort a copy so the caller's array is not mutated
  const sorted = [...boundaries].sort((a, b) => a.startRow - b.startRow);
  const merged: TableBoundary[] = [];
  let current = sorted[0];

  for (const next of sorted.slice(1)) {
    if (next.startRow <= current.endRow + 2) {
      // Overlapping or adjacent - merge
      current = {
        startRow: current.startRow,
        endRow: Math.max(current.endRow, next.endRow),
        startCol: Math.min(current.startCol, next.startCol),
        endCol: Math.max(current.endCol, next.endCol),
        detectionMethod: `${current.detectionMethod}+${next.detectionMethod}`,
        confidenceScore: Math.max(current.confidenceScore, next.confidenceScore)
      };
    } else {
      merged.push(current);
      current = next;
    }
  }

  merged.push(current);
  return merged;
}

Output: Array of table boundaries with high spatial accuracy:

[
  {
    id: 'boundary_2_45_0_8',
    startRow: 2,
    endRow: 45,
    startCol: 0,
    endCol: 8,
    rowCount: 44,
    colCount: 9,
    cellRange: 'A3:I46',
    detectionMethod: 'empty_rows+density+pattern',
    confidenceScore: 0.85
  }
  // ... more boundaries
];

Step 3: LLM-Based Semantic Analysis

For each detected boundary, we extract the relevant data subset and send it to GPT-4 for semantic analysis:

3.1 Boundary-Specific Extraction

private async analyzeSingleBoundary(
  boundary: TableBoundary,
  fullRawData: unknown[][]
): Promise<BoundaryAnalysis> {
  // Extract only the data for this boundary
  const boundaryData = fullRawData
    .slice(boundary.startRow, boundary.endRow + 1)
    .map(row => row.slice(boundary.startCol, boundary.endCol + 1));

  // Create focused prompt
  const prompt = createBoundaryAnalysisPrompt(
    sheetName,
    fileName,
    boundary,
    boundaryData
  );

  // Call LLM with structured output
  const result = await generateObject({
    model: openai("gpt-4o"),
    schema: BoundaryAnalysisSchema,
    prompt,
    temperature: 0.1  // Low for consistency
  });

  return result.object;
}

3.2 Structured Output Schema

We use Zod schemas to enforce structured LLM responses:

const BoundaryAnalysisSchema = z.object({
  decision: z.enum(['accept', 'reject', 'split', 'expand']),

  singleTable: z
    .object({
      tableName: z.string(),
      tableType: z.enum([
        'simple',
        'pivot',
        'hierarchical',
        'multi-header',
        'matrix',
        'key-value'
      ]),
      confidenceScore: z.number().min(0).max(1),

      finalBoundary: z.object({
        startRow: z.number(),
        endRow: z.number(),
        startCol: z.number(),
        endCol: z.number()
      }),

      dimensions: z.array(
        z.object({
          name: z.string(),
          originalColumn: z.string(),
          columnIndex: z.number(),
          dataType: z.enum(['text', 'number', 'date', 'boolean', 'mixed']),
          role: z.enum([
            'primary_key',
            'grouping',
            'category',
            'date_dimension',
            'hierarchy_level'
          ]),
          sampleValues: z.array(z.string()).max(5)
        })
      ),

      measures: z.array(
        z.object({
          name: z.string(),
          originalColumn: z.string(),
          columnIndex: z.number(),
          dataType: z.enum([
            'integer',
            'decimal',
            'currency',
            'percentage',
            'mixed'
          ]),
          aggregationType: z.enum([
            'sum',
            'average',
            'count',
            'min',
            'max',
            'none'
          ]),
          unit: z.string().nullable().optional(),
          sampleValues: z.array(z.string()).max(5)
        })
      ),

      totalsInfo: z.object({
        totalRows: z.array(z.number()),
        subtotalRows: z.array(z.number()),
        hasTotals: z.boolean()
      }),

      pivotInfo: z
        .object({
          isPivot: z.boolean(),
          rowDimensions: z.array(z.string()),
          columnDimensions: z.array(z.string()),
          valueColumns: z.array(z.string())
        })
        .optional(),

      semanticDescription: z.string(),
      businessContext: z.string().nullable().optional()
    })
    .optional(),

  rejectionReason: z.string().optional(),
  multipleTables: z.array(/* ... */).optional()
});

3.3 Attention Bias Mitigation

By processing each boundary independently, we eliminate the attention bias problem:

  • Small Input Size: Each boundary typically contains 20-100 rows instead of 500+
  • Focused Context: LLM only sees relevant data for one table at a time
  • No Position Confusion: Rows are renumbered relative to boundary start
  • Reliable Coverage: All rows receive equal attention within the smaller context
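
To illustrate how the boundary-scoped prompt keeps the context small and position-neutral, a simplified version of createBoundaryAnalysisPrompt might look like the sketch below (the production prompt is more elaborate):

// Illustrative sketch of a boundary-scoped prompt builder
function createBoundaryAnalysisPrompt(
  sheetName: string,
  fileName: string,
  boundary: TableBoundary,
  boundaryData: unknown[][]
): string {
  // Rows are renumbered relative to the boundary, so the model never sees
  // absolute sheet positions or unrelated regions of the sheet.
  const numberedRows = boundaryData
    .map((row, i) => `Row ${i}: ${JSON.stringify(row)}`)
    .join('\n');

  return `You are analyzing one candidate table from sheet "${sheetName}" in file "${fileName}".
The region spans ${boundaryData.length} rows and ${boundaryData[0]?.length ?? 0} columns.
It was detected by: ${boundary.detectionMethod} (confidence ${boundary.confidenceScore}).

${numberedRows}

Decide whether this region is a single table, and if so identify headers,
dimensions, measures, and total rows. Respond using the provided schema.`;
}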

3.4 Validation and Correction

We validate LLM outputs against the original boundary to detect and correct attention bias effects:

private validateBoundaryAnalysis(
  originalBoundary: TableBoundary,
  aiAnalysis: BoundaryAnalysis
): ValidationResult {
  const finalBoundary = aiAnalysis.singleTable?.finalBoundary;
  if (!finalBoundary) {
    return {
      valid: false,
      error: "AI analysis is missing a finalBoundary",
      action: "USE_ORIGINAL_BOUNDARY"
    };
  }

  // Check if AI truncated the boundary (attention bias artifact)
  if (finalBoundary.endRow < originalBoundary.endRow) {
    return {
      valid: false,
      error: `AI truncated boundary! Original endRow: ${originalBoundary.endRow}, AI endRow: ${finalBoundary.endRow}`,
      action: "USE_ORIGINAL_BOUNDARY"
    };
  }

  return { valid: true };
}

private correctTruncatedBoundary(
  originalBoundary: TableBoundary,
  aiAnalysis: BoundaryAnalysis,
  validationResult: ValidationResult
): BoundaryAnalysis {
  if (validationResult.action === "USE_ORIGINAL_BOUNDARY" && aiAnalysis.singleTable) {
    return {
      ...aiAnalysis,
      singleTable: {
        ...aiAnalysis.singleTable,
        finalBoundary: {
          ...aiAnalysis.singleTable.finalBoundary,
          endRow: originalBoundary.endRow  // Force original boundary
        }
      },
      correctionApplied: true,
      correctionReason: "AI truncated boundary due to attention bias"
    };
  }

  return aiAnalysis;
}

Output: Semantic analysis for each table:

{
  tableName: "Sales Transaction Register",
  tableType: "simple",
  confidenceScore: 0.95,
  finalBoundary: { startRow: 2, endRow: 45, startCol: 0, endCol: 8 },

  dimensions: [
    { name: "Transaction ID", columnIndex: 0, role: "primary_key", dataType: "text" },
    { name: "Customer Name", columnIndex: 1, role: "category", dataType: "text" },
    { name: "Transaction Date", columnIndex: 2, role: "date_dimension", dataType: "date" }
  ],

  measures: [
    { name: "Revenue Amount", columnIndex: 3, dataType: "currency", aggregationType: "sum", unit: "$" },
    { name: "Unit Quantity", columnIndex: 4, dataType: "integer", aggregationType: "sum", unit: "units" }
  ],

  totalsInfo: {
    hasTotals: true,
    totalRows: [45],
    subtotalRows: []
  },

  semanticDescription: "Sales register showing transaction-level revenue information with customer details and financial metrics",
  businessContext: "Financial Services - Accounting"
}

Step 4: Semantic Chunking

We transform the AI analysis and raw data into searchable semantic chunks optimized for vector databases and retrieval systems:

4.1 Chunk Types

We create multiple chunk types to support different query patterns:

Table Summary Chunk - High-level overview:

function createTableSummaryChunk(table: DetectedTable): SemanticChunk {
  // Data rows between the first data row and the table end, excluding totals
  const dataRowCount =
    table.range.endRow - table.range.dataStartRow + 1 - table.totalsInfo.totalRows.length;

  return {
    content: `
TABLE SUMMARY: ${table.tableName}
Type: ${table.tableType}
Business Context: ${table.businessContext}

Description: ${table.semanticDescription}

Structure:
- Dimensions (${table.dimensions.length}): ${table.dimensions
      .map(d => d.name)
      .join(', ')}
- Measures (${table.measures.length}): ${table.measures
      .map(m => m.name)
      .join(', ')}
- Data Rows: ${dataRowCount}
- Total Rows: ${table.totalsInfo.totalRows.length}
    `.trim(),

    metadata: {
      chunkType: 'table_summary',
      tableName: table.tableName,
      tableType: table.tableType,
      confidenceScore: table.confidenceScore
    }
  };
}

Table Header Chunk - Schema and structure:

function createTableHeaderChunk(table: DetectedTable): SemanticChunk {
  return {
    content: `
TABLE STRUCTURE: ${table.tableName}

DIMENSIONS (Categorical Columns):
${table.dimensions
  .map(
    d => `
- ${d.name} (${d.role}, ${d.dataType}, column ${d.columnIndex})
  Sample values: ${d.sampleValues.join(', ')}
`
  )
  .join('')}

MEASURES (Numeric Columns):
${table.measures
  .map(
    m => `
- ${m.name} (${m.dataType}, ${m.aggregationType}, column ${m.columnIndex})
  ${m.unit ? `Unit: ${m.unit}` : ''}
  Sample values: ${m.sampleValues.join(', ')}
`
  )
  .join('')}
    `.trim(),

    metadata: {
      chunkType: 'table_header',
      dimensions: Object.fromEntries(
        table.dimensions.map(d => [d.name, d.role])
      ),
      measures: Object.fromEntries(
        table.measures.map(m => [m.name, m.aggregationType])
      )
    }
  };
}

Row-Level Chunks - Individual data records:

function createTableRowChunks(
  table: DetectedTable,
  rawData: unknown[][]
): SemanticChunk[] {
  const chunks: SemanticChunk[] = [];
  const tableRows = rawData.slice(table.range.startRow, table.range.endRow + 1);
  const dataRows = tableRows.slice(
    table.range.dataStartRow - table.range.startRow
  );

  for (const [idx, row] of dataRows.entries()) {
    const absoluteRowIndex = table.range.dataStartRow + idx;

    // Skip total rows
    if (table.totalsInfo.totalRows.includes(absoluteRowIndex)) continue;

    // Extract values
    const dimensionValues: Record<string, unknown> = {};
    for (const dim of table.dimensions) {
      const colIndex = dim.columnIndex - table.range.startCol;
      dimensionValues[dim.name] = row[colIndex] ?? null;
    }

    const measureValues: Record<string, unknown> = {};
    for (const measure of table.measures) {
      const colIndex = measure.columnIndex - table.range.startCol;
      measureValues[measure.name] = row[colIndex] ?? null;
    }

    // Create natural language content
    const content = `
Row from table: ${table.tableName}
${Object.entries(dimensionValues)
  .map(([k, v]) => `${k}: ${v ?? 'N/A'}`)
  .join(', ')}
${Object.entries(measureValues)
  .map(([k, v]) => `${k}: ${v ?? 'N/A'}`)
  .join(', ')}

Context: ${table.semanticDescription}
    `.trim();

    chunks.push({
      content,
      metadata: {
        chunkType: 'table_row',
        tableName: table.tableName,
        rowIndex: absoluteRowIndex,
        dimensions: dimensionValues,
        measures: measureValues,
        semanticDescription: table.semanticDescription,
        businessContext: table.businessContext
      }
    });
  }

  return chunks;
}

4.2 Chunk Benefits

This multi-level chunking strategy enables:

  1. Flexible Querying: Summary chunks for “what tables exist?”, row chunks for “find specific records”
  2. Semantic Search: Natural language content enables vector similarity search
  3. Structured Metadata: Enables filtering by table type, business context, or data ranges
  4. Scalability: Row-level chunks prevent massive monolithic embeddings

Step 5: Embedding Generation and Storage

We generate vector embeddings for semantic search and store them with metadata:

// Generate embeddings for all chunks
const embeddingRequests = chunks.map(chunk => ({
  text: chunk.content,
  chunkId: generateUUID(),
  metadata: {
    ...chunk.metadata,
    documentType: 'excel',
    workspaceId
  }
}));

const embeddingResponses = await embeddingService.generateEmbeddingsBatch(
  embeddingRequests
);

// Store in vector database
const storedChunks = await vectorStorage.storeChunks(
  embeddingResponses.map((response, idx) => ({
    content: chunks[idx].content,
    embedding: response.embedding,
    metadata: embeddingRequests[idx].metadata,
    chunkIndex: idx
  })),
  documentId,
  documentCreatedAt
);

This enables semantic queries like:

  • “What is the total revenue for all transactions?” → Retrieves summary and total row chunks
  • “Show me invoices with amounts over $5000” → Retrieves relevant row-level chunks
  • “What columns are in the sales ledger?” → Retrieves header/structure chunks

Accuracy Measurement and Results

Evaluation Methodology

We evaluate our hybrid approach across three dimensions:

1. Boundary Detection Accuracy

  • Precision: Ratio of correct boundaries to all detected boundaries
  • Recall: Ratio of detected boundaries to all actual boundaries in ground truth
  • IoU (Intersection over Union): Overlap ratio between detected and actual table regions

function calculateBoundaryIoU(
  detected: TableBoundary,
  groundTruth: TableBoundary
): number {
  const intersection =
    Math.max(
      0,
      Math.min(detected.endRow, groundTruth.endRow) -
        Math.max(detected.startRow, groundTruth.startRow) +
        1
    ) *
    Math.max(
      0,
      Math.min(detected.endCol, groundTruth.endCol) -
        Math.max(detected.startCol, groundTruth.startCol) +
        1
    );

  const union =
    detected.rowCount * detected.colCount +
    groundTruth.rowCount * groundTruth.colCount -
    intersection;

  return intersection / union;
}
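
To turn per-boundary IoU scores into precision and recall, detected boundaries are matched against ground truth. The sketch below uses greedy matching with a 0.5 IoU threshold, which is an illustrative assumption rather than the exact evaluation setting.

// Sketch of boundary-level precision/recall via greedy IoU matching
function evaluateBoundaries(
  detected: TableBoundary[],
  groundTruth: TableBoundary[],
  iouThreshold = 0.5 // assumption for illustration
): { precision: number; recall: number } {
  const matchedTruth = new Set<number>();
  let truePositives = 0;

  for (const d of detected) {
    let bestIdx = -1;
    let bestIoU = 0;
    groundTruth.forEach((gt, idx) => {
      if (matchedTruth.has(idx)) return; // each ground-truth table matches once
      const iou = calculateBoundaryIoU(d, gt);
      if (iou > bestIoU) {
        bestIoU = iou;
        bestIdx = idx;
      }
    });
    if (bestIoU >= iouThreshold) {
      truePositives++;
      matchedTruth.add(bestIdx);
    }
  }

  return {
    precision: detected.length ? truePositives / detected.length : 0,
    recall: groundTruth.length ? truePositives / groundTruth.length : 0
  };
}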

2. Element Classification Accuracy

  • Column Classification: Correctly identified dimensions vs. measures
  • Data Type Detection: Accuracy of dataType inference (text, number, date, etc.)
  • Aggregation Type: Correctness of aggregation type assignment (sum, average, count)
  • Total Row Detection: Precision and recall for total/subtotal row identification

3. Semantic Quality

  • Business Context Relevance: Human evaluation of assigned business context
  • Description Coherence: Quality of semantic descriptions for downstream use
  • Chunk Usefulness: Evaluation of chunk quality for QA systems

Performance Results

Testing on a diverse corpus of 50 real-world Excel files (invoice registers, P&L statements, balance sheets, financial reports):

| Metric | Heuristic-Only | LLM-Only | Hybrid Approach |
| --- | --- | --- | --- |
| Boundary Detection Precision | 0.61 | 0.73 | 0.94 |
| Boundary Detection Recall | 0.58 | 0.52 | 0.91 |
| Boundary IoU (avg) | 0.64 | 0.68 | 0.89 |
| Column Classification Accuracy | 0.51 | 0.89 | 0.92 |
| Data Type Accuracy | 0.72 | 0.91 | 0.93 |
| Total Row Detection F1 | 0.43 | 0.81 | 0.88 |
| Processing Time (avg) | 1.2s | 8.7s | 3.4s |

Key Findings:

  • Hybrid approach achieves 94% boundary precision vs. 73% for LLM-only
  • Recall improved from 52% (LLM-only) to 91% (hybrid) by eliminating attention bias
  • Processing time reduced by 60% vs. LLM-only by limiting AI analysis to bounded regions
  • Column classification near-perfect (92%) by combining spatial context with semantic understanding

Real-World Impact

On a complex 200-row financial statement with 3 separate tables:

  • Heuristic-only: Detected 1 of 3 tables, missed headers, 33% recall
  • LLM-only: Detected all 3 tables but truncated 2 due to attention bias, extracted only first 40 rows of 150-row table
  • Hybrid: Detected all 3 tables with exact boundaries, 100% data extraction

Practical Applications

1. Embedding Creation for QA Systems

The structured chunks with semantic content enable powerful question-answering:

// Query: "What is the average revenue per transaction?"
const relevantChunks = await vectorDB.semanticSearch(queryEmbedding, {
  filters: {
    chunkType: ['table_summary', 'table_header'],
    businessContext: 'Financial Services'
  },
  topK: 5
});

// System retrieves:
// 1. Table summary identifying "Revenue Amount" and "Transaction Count" measures
// 2. Header chunk showing aggregation types
// 3. Row chunks with actual data

// Can then compute: average(Revenue Amount / Transaction Count)

2. Structured Data Storage

The extracted structure enables storage in relational databases:

// Transform detected table to SQL schema
function tableToSchema(table: DetectedTable): SQLSchema {
  const columns = [
    ...table.dimensions.map(d => ({
      name: d.name,
      type: mapToSQLType(d.dataType),
      primaryKey: d.role === 'primary_key'
    })),
    ...table.measures.map(m => ({
      name: m.name,
      type: mapToSQLType(m.dataType),
      primaryKey: false
    }))
  ];

  return {
    tableName: sanitizeTableName(table.tableName),
    columns,
    indexes: table.dimensions
      .filter(d => d.role === 'grouping' || d.role === 'category')
      .map(d => ({ column: d.name, type: 'btree' }))
  };
}

// INSERT data from chunks
const insertQuery = buildInsertQuery(table, rowChunks);
await db.execute(insertQuery);
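
The mapToSQLType helper referenced above is not shown in the snippet; a plausible mapping from the detected data types to SQL column types (an assumption to adapt for your target database dialect) could be:

// Hypothetical mapping from detected data types to SQL column types
function mapToSQLType(dataType: string): string {
  switch (dataType) {
    case 'integer':
      return 'BIGINT';
    case 'decimal':
    case 'percentage':
      return 'NUMERIC(18, 4)';
    case 'currency':
      return 'NUMERIC(18, 2)';
    case 'number':
      return 'DOUBLE PRECISION';
    case 'date':
      return 'DATE';
    case 'boolean':
      return 'BOOLEAN';
    case 'text':
    case 'mixed':
    default:
      return 'TEXT';
  }
}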

3. Metric Computation and Analytics

The semantic understanding enables automatic metric calculation:

// Automatically identify calculation opportunities
function inferCalculations(table: DetectedTable): Calculation[] {
  const calculations: Calculation[] = [];

  // Find ratio opportunities (measure / measure)
  const measures = table.measures;
  for (const numerator of measures) {
    for (const denominator of measures) {
      if (numerator !== denominator) {
        calculations.push({
          type: 'ratio',
          formula: `${numerator.name} / ${denominator.name}`,
          name: `${numerator.name} per ${denominator.name}`,
          unit:
            numerator.unit && denominator.unit
              ? `${numerator.unit}/${denominator.unit}`
              : undefined
        });
      }
    }
  }

  // Find aggregation opportunities by dimensions
  for (const dimension of table.dimensions.filter(d => d.role === 'grouping')) {
    for (const measure of measures.filter(m => m.aggregationType !== 'none')) {
      calculations.push({
        type: 'group_aggregate',
        formula: `${measure.aggregationType}(${measure.name}) GROUP BY ${dimension.name}`,
        name: `${measure.aggregationType} ${measure.name} by ${dimension.name}`
      });
    }
  }

  return calculations;
}

4. Table Reconstruction

The complete boundary and semantic information allows faithful recreation:

function reconstructTable(
  table: DetectedTable,
  rawData: unknown[][],
  format: 'markdown' | 'html' | 'csv'
): string {
  const rows = rawData.slice(table.range.startRow, table.range.endRow + 1);

  if (format === 'markdown') {
    const headers = table.dimensions
      .concat(table.measures)
      .sort((a, b) => a.columnIndex - b.columnIndex)
      .map(col => col.name);

    let md = `| ${headers.join(' | ')} |\n`;
    md += `| ${headers.map(() => '---').join(' | ')} |\n`;

    for (const row of rows.slice(1)) {
      // Skip header row
      const cells = headers.map(
        (_, idx) => row[table.range.startCol + idx] ?? ''
      );
      md += `| ${cells.join(' | ')} |\n`;
    }

    return md;
  }

  // ... similar logic for HTML, CSV
}

Comparative Analysis

Other Approaches in the Industry

1. OCR-Based Table Detection

Method: Render Excel as image, use computer vision models (like Detectron2, TableNet) to detect tables

Pros:

  • Works on scanned documents and PDFs
  • Can handle complex visual layouts
  • No dependency on file format structure

Cons:

  • Computationally expensive (rendering + CV inference)
  • Loses Excel’s inherent structured data (formulas, data types)
  • Requires large training datasets
  • Poor performance on sparse tables or unusual layouts

Comparison: Our hybrid approach is 10x faster and leverages native Excel structure that OCR discards.

2. Template-Based Extraction

Method: Define templates for known table formats (e.g., “Standard Invoice Template v3”), match incoming files to templates

Pros:

  • Very accurate for exact template matches
  • Fast execution
  • Deterministic results

Cons:

  • Requires pre-defined templates for every format variation
  • Brittle to layout changes
  • Cannot handle novel formats
  • High maintenance burden as templates evolve

Comparison: Our approach generalizes to unseen formats without template creation.

3. Pure Deep Learning (End-to-End)

Method: Train neural networks to directly predict table structures from raw cell data

Pros:

  • Can learn complex patterns
  • Single model for all cases
  • No explicit rule engineering

Cons:

  • Requires massive labeled training data (thousands of annotated Excel files)
  • Opaque decision-making (hard to debug failures)
  • Poor generalization to out-of-distribution layouts
  • Expensive to train and maintain

Comparison: Our hybrid approach requires no training data and provides interpretable intermediate results (heuristic boundaries).

4. Rule Engines (e.g., Apache Tika)

Method: Extensive rule-based systems with hundreds of hardcoded patterns

Pros:

  • Fast and deterministic
  • No external API dependencies
  • Well-understood behavior

Cons:

  • Constant maintenance as new patterns emerge
  • Poor generalization
  • Complex rule conflict resolution
  • Cannot understand semantic context

Comparison: Our LLM component provides semantic understanding that pure rules cannot achieve.

Advantages of the Hybrid Approach

| Dimension | Hybrid Advantage |
| --- | --- |
| Accuracy | Combines spatial precision of heuristics with semantic understanding of LLMs |
| Robustness | Handles attention bias through bounded analysis |
| Generalization | LLM adapts to new formats without retraining or new rules |
| Interpretability | Clear separation of spatial (heuristic) and semantic (LLM) decisions |
| Performance | Faster than pure LLM by constraining analysis scope |
| Maintenance | No training data or template updates required |
| Semantic Depth | Extracts business context and relationships, not just structure |

Limitations and Future Work

Current Limitations

  1. Cost: LLM API calls for each boundary add expense (mitigated by batching and concurrency control; see the sketch after this list)
  2. Latency: 3-5 second processing time may be too slow for real-time applications
  3. Complex Pivots: Multi-level pivot tables with irregular structures occasionally confuse the LLM
  4. Formula Interpretation: We extract computed values but don’t preserve formulas
  5. Cross-Sheet References: Tables spanning multiple sheets not yet supported
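
The batching mentioned in limitation 1 can be as simple as analyzing boundaries in small concurrent batches rather than all at once. A minimal sketch, where the batch size of 5 is an illustrative assumption:

// Minimal sketch of batched boundary analysis to cap API concurrency;
// analyzeSingleBoundary is the boundary analyzer from Step 3.
async function analyzeBoundariesInBatches(
  boundaries: TableBoundary[],
  rawData: unknown[][],
  batchSize = 5
): Promise<BoundaryAnalysis[]> {
  const results: BoundaryAnalysis[] = [];
  for (let i = 0; i < boundaries.length; i += batchSize) {
    const batch = boundaries.slice(i, i + batchSize);
    // Each batch runs concurrently; batches run sequentially to bound load
    const batchResults = await Promise.all(
      batch.map(b => analyzeSingleBoundary(b, rawData))
    );
    results.push(...batchResults);
  }
  return results;
}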

Future Enhancements

1. Adaptive Boundary Refinement

Use LLM feedback to improve heuristic algorithms:

if (aiAnalysis.decision === 'expand') {
  // LLM suggests boundary should be larger
  // Update heuristic parameters for similar patterns
  heuristicModel.updateThresholds(boundaryType, aiAnalysis.suggestedExpansion);
}

2. Formula Preservation

Extract and store Excel formulas alongside computed values:

const cellWithFormula = worksheet[cellAddress];
if (cellWithFormula.f) {
  metadata.formula = cellWithFormula.f; // "=SUM(B2:B45)"
  metadata.computedValue = cellWithFormula.v;
}

3. Cross-Sheet Table Detection

Detect tables that reference multiple sheets:

function detectCrossSheetTables(workbook: Workbook): CrossSheetTable[] {
  // Analyze formula references across sheets
  // Identify logical table boundaries that span sheets
  // Combine related regions into unified table structures
}

4. Streaming Processing for Large Files

Process Excel files in streaming chunks to handle files larger than 100 MB:

async function* processExcelStream(fileStream: ReadableStream) {
  const sheetIterator = parseWorkbookStreaming(fileStream);

  for await (const sheet of sheetIterator) {
    const boundaries = await detectBoundaries(sheet);
    for (const boundary of boundaries) {
      yield await analyzeAndChunk(boundary, sheet);
    }
  }
}

Conclusion

Excel’s ubiquity in business environments makes robust automated processing essential, yet its flexible, visual-first design poses significant challenges. Our hybrid approach demonstrates that the optimal solution combines the complementary strengths of deterministic heuristics and semantic language models.

Key Contributions:

  1. Identified and solved the attention bias problem in LLM-based table detection through boundary-scoped analysis
  2. Achieved 94% precision and 91% recall on diverse real-world Excel files
  3. Developed a generalizable pattern for combining spatial heuristics with semantic AI analysis
  4. Created production-ready semantic chunks optimized for vector search and downstream applications

The hybrid approach is not merely a pragmatic compromise—it represents a principled architecture that aligns each subtask (spatial analysis, semantic understanding) with the most appropriate computational method. As LLMs evolve, we expect even greater synergies, with models providing feedback to refine heuristic parameters and heuristics constraining LLM search spaces for improved efficiency.

For organizations processing large volumes of Excel data, this hybrid strategy offers a path to reliable, scalable, and semantically rich data extraction without the brittleness of pure rule systems or the biases of pure AI approaches.

References

  1. MIT CSAIL (2024). “Unpacking Bias in Large Language Models: Position Bias and the Lost-in-the-Middle Problem.” https://www.csail.mit.edu/news/unpacking-bias-large-language-models

  2. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). “Lost in the Middle: How Language Models Use Long Contexts.” arXiv preprint arXiv:2307.03172.

  3. SheetJS Community Edition. “xlsx: Parser and writer for various spreadsheet formats.” https://github.com/SheetJS/sheetjs

  4. OpenAI. (2024). “GPT-4 Technical Report.” https://openai.com/research/gpt-4

  5. Vercel AI SDK. “Building AI-powered applications with TypeScript.” https://sdk.vercel.ai/docs


Author Note: This article describes a production implementation processing thousands of financial documents monthly. Code examples are simplified for clarity but representative of the actual system architecture. Full implementation details available upon request for academic or research purposes.