Mail

Date: 2025-10-16 Tags: #Python #Pandas #Excel #Debug #Experience Summary

🔍 Problem Description

When processing an Excel file (AP_LF 300 in DI Water 50 1Hz_1_20251016_100357上午.xlsx), an anomaly occurred when reading column B data from the Oscillation worksheet using pandas' read_excel() method:

Visible to the eye: Column B rows 2-15 in Excel have data (17 values like 2.64, -0.07, 2.64, etc.)
Pandas reading: Only 3 values read (B16-B18), first 15 rows are all NaN
Actual situation: B1 is empty, B2-B18 all have data

🔬 Troubleshooting Process

1. Initial Attempt (Failed ❌)

df = pd.read_excel(file, sheet_name='Oscillation', header=None)
b_column = df[1]

2. Try Different Pandas Parameters (All Failed ❌)

header=None - Failed
header=0 - Column B named __EMPTY, but data still NaN
skiprows=1 - Failed
engine='openpyxl' - Failed

3. Direct Reading with openpyxl (Success ✅)

import openpyxl

wb = openpyxl.load_workbook(file, data_only=True)
ws = wb['Oscillation']

for row in range(1, 19):
    value = ws[f'B{row}'].value
    print(f"B{row}: {value}")

4. Verify Data Exists

Using SheetJS (JavaScript) to check raw Excel file structure:

const cell = oscillationSheet['B2'];
console.log(cell.v);  // 2.64
console.log(cell.t);  // 'n' (number type)

✅ Confirmed data actually exists in the Excel file.

🎯 Root Cause Analysis

Pandas' "Smart" Recognition Mechanism Failure

1. Incorrect Data Area Judgment

Excel internal dimension tag defines worksheet range as A1:C18
Cell B1 is empty
Pandas may incorrectly judge B2-B15 are not in valid data area during parsing due to B1 being empty + some internal marker anomalies

2. Difference Between pandas vs openpyxl Reading

pandas.read_excel():
└─ Calls openpyxl
└─ Smart data area recognition (may fail)
└─ Data type conversion and optimization
└─ Skip "invalid" areas

openpyxl.iter_rows():
└─ Iterate directly by row
└─ Read all cells
└─ No smart judgment (more reliable)

3. Possible File Format Issues

Excel file may have special internal markers
First 15 rows of column B may be marked as "blank" rather than NULL
Pandas' parser cannot properly handle this edge case

✅ Solution

Best Solution: Bypass pandas for Direct Reading

import pandas as pd
import openpyxl

def read_excel_robust(filepath, sheet_name):
    """
    Use openpyxl for direct reading, bypassing pandas' smart recognition
    Suitable for "strange" Excel files where pandas reading fails
    """
    wb = openpyxl.load_workbook(filepath, data_only=True)
    ws = wb[sheet_name]

    # Iterate all rows directly without assumptions
    raw_rows = [row for row in ws.iter_rows(values_only=True)]
    wb.close()

    # Manually build DataFrame
    df = pd.DataFrame(raw_rows)
    return df

# Usage example
df = read_excel_robust('file.xlsx', 'Oscillation')
b_column = df[1]  # Column B data (index starts from 0)

Solution Advantages

✅ Completely Reliable - Does not depend on pandas' data area judgment ✅ Raw Data - Reads actual Excel cell content ✅ Strong Adaptability - Suitable for various "abnormal" Excel files ✅ Transparent and Controllable - Clear understanding of every step

📚 Key Parameter Description

openpyxl.load_workbook(
    filename,
    data_only=True,  # ⭐ Read calculated formula values, not formulas themselves
    read_only=False  # Set to True for read-only to improve performance
)

🚀 Production Environment Best Practices

def read_oscillation_columns(filepath: str, retries: int = 2):
    """
    Production-grade Excel reading function
    Includes retry mechanism, error handling, resource management
    """
    last_err = None

    # Retry mechanism (handle file being occupied)
    for attempt in range(retries + 1):
        try:
            wb = openpyxl.load_workbook(filepath, data_only=True)
            break
        except PermissionError as e:
            last_err = e
            if attempt < retries:
                import time
                time.sleep(0.4)
        except Exception as e:
            return None
    else:
        return None

    try:
        if 'Oscillation' not in wb.sheetnames:
            return None

        ws = wb['Oscillation']
        raw_rows = [r for r in ws.iter_rows(values_only=True)]

        # Ensure resource release
        wb.close()

        if not raw_rows:
            return None

        df = pd.DataFrame(raw_rows)
        return df

    except Exception:
        wb.close()
        return None

💡 Experience Summary

Core Lessons

Don't Blindly Trust Pandas' "Smart" Features
- Pandas' smart recognition works well in most cases
- But may fail with specially formatted Excel files
- Direct reading is more reliable for critical data processing

Layered Diagnosis

Pandas reading failed
↓
Try different pandas parameters
↓
Use underlying library (openpyxl) for direct reading
↓
Check raw file structure (SheetJS/openpyxl)

Robust Design for Production Environment
- Retry mechanism (handle file occupation)
- Comprehensive error handling
- Resource management (ensure wb.close())
- Logging (facilitate debugging)

Applicable Scenarios

This solution is especially suitable for:

❌ Pandas reading returns abnormal data
❌ Incorrect data area judgment
❌ Specially formatted Excel files
❌ Scenarios requiring precise control over reading process

Performance Considerations

Direct openpyxl reading is slightly slower than optimized pandas reading
But for "strange" files, Reliability > Performance
For large files, consider:
- read_only=True parameter
- Read only specific needed ranges
- Use iter_rows(min_row=x, max_row=y) to limit range

openpyxl Official Documentation
pandas read_excel Documentation
Project code: oscillation_aggregator.py

[Python Data Processing Best Practices](/Python Data Processing Best Practices)
[Production Environment Error Handling Strategy](/Production Environment Error Handling Strategy)
[Handling Corrupted Excel Files](/Handling Corrupted Excel Files)

Last Updated: 2025-10-16 Author: Your Name

Pandas Excel Reading Failure Case Analysis

🔍 Problem Description

🔬 Troubleshooting Process

1. Initial Attempt (Failed ❌)

2. Try Different Pandas Parameters (All Failed ❌)

3. Direct Reading with openpyxl (Success ✅)

4. Verify Data Exists

🎯 Root Cause Analysis

Pandas' "Smart" Recognition Mechanism Failure

1. Incorrect Data Area Judgment

2. Difference Between pandas vs openpyxl Reading

3. Possible File Format Issues

✅ Solution

Best Solution: Bypass pandas for Direct Reading

Solution Advantages

📚 Key Parameter Description

🚀 Production Environment Best Practices

💡 Experience Summary

Core Lessons

Applicable Scenarios

Performance Considerations

Related Posts

Complete Guide to Migrating from GitHub Pages to Cloudflare Pages

Fixing NoMachine Menu Unresponsive Issue on macOS 26

Git Fundamentals and Workflow: A Complete Guide from Principles to Team Collaboration

🔍 Problem Description

🔬 Troubleshooting Process

1. Initial Attempt (Failed ❌)

2. Try Different Pandas Parameters (All Failed ❌)

3. Direct Reading with openpyxl (Success ✅)

4. Verify Data Exists

🎯 Root Cause Analysis

Pandas' "Smart" Recognition Mechanism Failure

1. Incorrect Data Area Judgment

2. Difference Between pandas vs openpyxl Reading

3. Possible File Format Issues

✅ Solution

Best Solution: Bypass pandas for Direct Reading

Solution Advantages

📚 Key Parameter Description

🚀 Production Environment Best Practices

💡 Experience Summary

Core Lessons

Applicable Scenarios

Performance Considerations

🔗 Related Resources

🔗 Related Notes

Related Posts

Complete Guide to Migrating from GitHub Pages to Cloudflare Pages

Fixing NoMachine Menu Unresponsive Issue on macOS 26

Git Fundamentals and Workflow: A Complete Guide from Principles to Team Collaboration