Logo

Pandas Excel Reading Failure Case Analysis

Published on
...
Authors

Date: 2025-10-16 Tags: #Python #Pandas #Excel #Debug #Experience Summary

🔍 Problem Description

When processing an Excel file (AP_LF 300 in DI Water 50 1Hz_1_20251016_100357上午.xlsx), an anomaly occurred when reading column B data from the Oscillation worksheet using pandas' read_excel() method:

  • Visible to the eye: Column B rows 2-15 in Excel have data (17 values like 2.64, -0.07, 2.64, etc.)
  • Pandas reading: Only 3 values read (B16-B18), first 15 rows are all NaN
  • Actual situation: B1 is empty, B2-B18 all have data

🔬 Troubleshooting Process

1. Initial Attempt (Failed ❌)

df = pd.read_excel(file, sheet_name='Oscillation', header=None)
b_column = df[1]

2. Try Different Pandas Parameters (All Failed ❌)

  • header=None - Failed
  • header=0 - Column B named __EMPTY, but data still NaN
  • skiprows=1 - Failed
  • engine='openpyxl' - Failed

3. Direct Reading with openpyxl (Success ✅)

import openpyxl

wb = openpyxl.load_workbook(file, data_only=True)
ws = wb['Oscillation']

for row in range(1, 19):
    value = ws[f'B{row}'].value
    print(f"B{row}: {value}")

4. Verify Data Exists

Using SheetJS (JavaScript) to check raw Excel file structure:

const cell = oscillationSheet['B2'];
console.log(cell.v);  // 2.64
console.log(cell.t);  // 'n' (number type)

✅ Confirmed data actually exists in the Excel file.

🎯 Root Cause Analysis

Pandas' "Smart" Recognition Mechanism Failure

1. Incorrect Data Area Judgment

  • Excel internal dimension tag defines worksheet range as A1:C18
  • Cell B1 is empty
  • Pandas may incorrectly judge B2-B15 are not in valid data area during parsing due to B1 being empty + some internal marker anomalies

2. Difference Between pandas vs openpyxl Reading

pandas.read_excel():
└─ Calls openpyxl
└─ Smart data area recognition (may fail)
└─ Data type conversion and optimization
└─ Skip "invalid" areas

openpyxl.iter_rows():
└─ Iterate directly by row
└─ Read all cells
└─ No smart judgment (more reliable)

3. Possible File Format Issues

  • Excel file may have special internal markers
  • First 15 rows of column B may be marked as "blank" rather than NULL
  • Pandas' parser cannot properly handle this edge case

✅ Solution

Best Solution: Bypass pandas for Direct Reading

import pandas as pd
import openpyxl

def read_excel_robust(filepath, sheet_name):
    """
    Use openpyxl for direct reading, bypassing pandas' smart recognition
    Suitable for "strange" Excel files where pandas reading fails
    """
    wb = openpyxl.load_workbook(filepath, data_only=True)
    ws = wb[sheet_name]

    # Iterate all rows directly without assumptions
    raw_rows = [row for row in ws.iter_rows(values_only=True)]
    wb.close()

    # Manually build DataFrame
    df = pd.DataFrame(raw_rows)
    return df

# Usage example
df = read_excel_robust('file.xlsx', 'Oscillation')
b_column = df[1]  # Column B data (index starts from 0)

Solution Advantages

Completely Reliable - Does not depend on pandas' data area judgment ✅ Raw Data - Reads actual Excel cell content ✅ Strong Adaptability - Suitable for various "abnormal" Excel files ✅ Transparent and Controllable - Clear understanding of every step

📚 Key Parameter Description

openpyxl.load_workbook(
    filename,
    data_only=True,  # ⭐ Read calculated formula values, not formulas themselves
    read_only=False  # Set to True for read-only to improve performance
)

🚀 Production Environment Best Practices

def read_oscillation_columns(filepath: str, retries: int = 2):
    """
    Production-grade Excel reading function
    Includes retry mechanism, error handling, resource management
    """
    last_err = None

    # Retry mechanism (handle file being occupied)
    for attempt in range(retries + 1):
        try:
            wb = openpyxl.load_workbook(filepath, data_only=True)
            break
        except PermissionError as e:
            last_err = e
            if attempt < retries:
                import time
                time.sleep(0.4)
        except Exception as e:
            return None
    else:
        return None

    try:
        if 'Oscillation' not in wb.sheetnames:
            return None

        ws = wb['Oscillation']
        raw_rows = [r for r in ws.iter_rows(values_only=True)]

        # Ensure resource release
        wb.close()

        if not raw_rows:
            return None

        df = pd.DataFrame(raw_rows)
        return df

    except Exception:
        wb.close()
        return None

💡 Experience Summary

Core Lessons

  1. Don't Blindly Trust Pandas' "Smart" Features

    • Pandas' smart recognition works well in most cases
    • But may fail with specially formatted Excel files
    • Direct reading is more reliable for critical data processing
  2. Layered Diagnosis

    Pandas reading failed
    Try different pandas parameters
    Use underlying library (openpyxl) for direct reading
    Check raw file structure (SheetJS/openpyxl)
    
  3. Robust Design for Production Environment

    • Retry mechanism (handle file occupation)
    • Comprehensive error handling
    • Resource management (ensure wb.close())
    • Logging (facilitate debugging)

Applicable Scenarios

This solution is especially suitable for:

  • ❌ Pandas reading returns abnormal data
  • ❌ Incorrect data area judgment
  • ❌ Specially formatted Excel files
  • ❌ Scenarios requiring precise control over reading process

Performance Considerations

  • Direct openpyxl reading is slightly slower than optimized pandas reading
  • But for "strange" files, Reliability > Performance
  • For large files, consider:
    • read_only=True parameter
    • Read only specific needed ranges
    • Use iter_rows(min_row=x, max_row=y) to limit range
  • [Python Data Processing Best Practices](/Python Data Processing Best Practices)
  • [Production Environment Error Handling Strategy](/Production Environment Error Handling Strategy)
  • [Handling Corrupted Excel Files](/Handling Corrupted Excel Files)

Last Updated: 2025-10-16 Author: Your Name

Pandas Excel Reading Failure Case Analysis | 原子比特之间