  • StaxEventItemReader ISO-8859-1 Character Normalization

    Hello,

    Newbie here. I have a batch program that uses a StaxEventItemReader to input some XML. The XML is UTF-8 and contains some ISO-8859-1 Latin characters.

    The Stax parser works fine issuing nextEvent calls until it gets to retrieving the XMLEvent that contains one of these characters, throwing this exception:

    com.ctc.wstx.exc.WstxIOException: Invalid UTF-8 middle byte 0x6e (at char #35463, byte #31999)

    I'm looking for ideas on how to either normalize the data at the point of retrieving the event, or configure the StaxEventItemReader so that it normalizes the data. Any ideas?

    Thanks!

  • #2
    You probably need to consult the documentation for your XML library (Woodstox, by the looks of it). But it is telling you there is an invalid byte, so are you sure it is really ISO-8859-1?
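
One way to answer that question is to test whether the file's raw bytes decode as strict UTF-8. The following is a minimal sketch using only the JDK; the class and method names (`Utf8Check`, `isValidUtf8`) are hypothetical and not part of Spring Batch or Woodstox. A single-byte Latin-1 character such as ñ (0xF1) followed by an ASCII letter is not a legal UTF-8 sequence, which would be consistent with the "invalid middle byte 0x6e" message, though that is an inference from the error text rather than something confirmed in the thread.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reports whether a byte array is well-formed UTF-8.
class Utf8Check {
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            // At least one byte sequence is not legal UTF-8.
            return false;
        }
    }
}
```

If this returns false for the input file, the file is not really UTF-8, whatever its XML declaration says, and the parser needs to be told the actual encoding.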


    • #3
      Thanks for the reply, Dave. Yes, Woodstox is our XML library. I took your advice and checked the documentation, but unfortunately it did not provide any information or clues on how to handle this scenario. The links to their issues/bugs database and logs show as unavailable.

      I am also pretty sure it is ISO-8859-1. Googling, I have found similar issues with the same character - when I delete that character, it runs fine. Posts seem to hint at differences between encoding used by StAX reader implementations (e.g. StAXUtils.createXMLStreamReader(InputStream)) vs. encoding the String uses (JVM default).

      Was really hoping to see some properties/interface of the StaxEventItemReader available to normalize characters, set encoding options, alter the type of underlying reader it uses, etc. Any ideas are welcomed. Thank you!
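
If the bytes turn out to be ISO-8859-1, one possible workaround is to decode them explicitly and hand the StAX parser a `Reader` instead of a raw `InputStream`, so the parser never has to guess the encoding. The sketch below uses only the JDK's `javax.xml.stream` API; the class and method names (`Latin1StaxDemo`, `readText`) are hypothetical, and how to wire such a reader into StaxEventItemReader is not shown because the thread does not establish which hooks are available.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical demo: parse XML whose bytes are in a known, explicit charset.
class Latin1StaxDemo {
    // Returns the concatenated character data of the document.
    static String readText(InputStream in, Charset charset) {
        // Decode bytes ourselves; the parser then works on characters only.
        Reader decoded = new InputStreamReader(in, charset);
        StringBuilder text = new StringBuilder();
        try {
            XMLEventReader events =
                    XMLInputFactory.newInstance().createXMLEventReader(decoded);
            try {
                while (events.hasNext()) {
                    XMLEvent event = events.nextEvent();
                    if (event.isCharacters()) {
                        text.append(event.asCharacters().getData());
                    }
                }
            } finally {
                events.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
        return text.toString();
    }
}
```

For example, `Latin1StaxDemo.readText(stream, StandardCharsets.ISO_8859_1)` reads a Latin-1 file without the UTF-8 decoding error. An alternative that avoids code changes, assuming the producer of the file can be changed, is to have the XML declaration state the real encoding (`encoding="ISO-8859-1"`) or to have the file written as genuine UTF-8.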
