Dealing with an arbitrary text file encoded in an unknown way and trying to normalize to UTF-8 (with minimal data loss) is a tricky thing.
This is a little test bed for trying out various detection and transcoding strategies. The problem
I was solving had to do with user-uploaded CSV or TXT files for importing data into an app. Due to
the way Excel handles CSVs, I fully expected to deal with UTF-16LE files with a BOM
(byte order mark) and/or various flavors of ISO-8859. In a perfect world you would just specify
that all files must be valid UTF-8, but most developers don’t really understand what UTF-8 is, let
alone your average user! You can’t expect a user to do anything more than hit export on their spreadsheet
and dump the resulting mess into your file uploader.
I started my exercise by laying a baseline with Ruby 1.9’s built-in string encoding methods.
Explicitly transcoding the files works flawlessly (in MRI and JRuby) as long as the source encoding is
set to exactly match the file’s actual encoding. One trick here is setting the File.open mode
to rb:bom|utf-8. This strips a BOM (byte order mark) if one is present and sets the
encoding to UTF-8 even if that’s not the actual string encoding. Once you have this BOM-stripped
string you can do the explicit transcode and everything comes out nice. This is great and all, but
if you don’t know the precise source encoding of the file you are dealing with, then the results
of the encode might not be so pretty.
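The explicit transcode described above can be sketched roughly as follows. This is a minimal illustration, not the test bed's actual code: the helper name read_as_utf8 and the file name in the usage comment are made up, and the caller is assumed to already know the real source encoding.

```ruby
# Hypothetical helper (name invented for this sketch): read a file,
# let Ruby drop a leading BOM, then transcode to UTF-8 from an
# encoding the caller supplies.
def read_as_utf8(path, source_encoding)
  # "bom|utf-8" consumes a byte order mark if the file starts with one.
  raw = File.open(path, "rb:bom|utf-8") { |f| f.read }

  # Whatever label Ruby attached on read, relabel the bytes with the
  # encoding we believe they are really in. This changes no bytes.
  raw.force_encoding(source_encoding)

  # Now convert for real. This raises an EncodingError subclass if the
  # bytes do not fit the claimed source encoding.
  raw.encode("UTF-8")
end

# Usage (assumed file name):
#   text = read_as_utf8("upload.csv", "UTF-16LE")
```

Note that force_encoding only changes the label on the string, while encode rewrites the bytes, which is why a wrong source encoding either raises or quietly produces the wrong characters.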
In the quest to detect the source encoding I tried several gems. rchardet19 comes very close to
getting things right: it nails the Unicode files (UTF-8 and UTF-16LE), and at least returns a flavor of ISO-8859
(ISO_8859_8) for the Windows-1252 and ISO-8859-1 files. Unfortunately, transcoding to UTF-8 with the source
encoding set to 8859-8 yields not-so-great results, since ISO-8859-8 is the Latin/Hebrew set and the accented
Latin characters come out wrong. Close but no cigar.
charlock_holmes is the next gem I tried. I expected this one to be the winner as it is built on
top of ICU4C, which is supposedly the most badass character encoding detector to ever walk these
lands. It did not even come close to guessing the encodings correctly. As you can see below, it guessed
correctly for valid UTF-8 (congrats), binary of all things for UTF-16LE (helpful!), and EUC-JP for the ISO-flavored
files.
Ensure-encoding won the day. One of the strategies this gem employs
is very similar to what I was planning on writing by hand: using an educated guess, pick a small subset of
encodings and test the unknown string against them one by one until you get a valid encoding. Once the source encoding
is established you can then transcode to UTF-8 successfully. If all your anticipated encodings fail to get a valid match,
then you can fall back to an encode without an explicit source set and just pass in the options so that unknown or
invalid characters get tossed out or replaced rather than raising an encoding error. Example:
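For comparison, the same try-each-candidate idea can be sketched by hand in plain Ruby, without the gem. This is not Ensure-encoding's API or implementation: the helper name normalize_to_utf8 and the candidate list are assumptions made for illustration, and the fallback here treats the input as raw bytes, which is one possible reading of the fallback described above.

```ruby
# Encodings to try, in order. This list is an assumption for
# illustration; adjust it to the files you actually expect.
CANDIDATE_ENCODINGS = ["UTF-8", "Windows-1252"].freeze

# Hypothetical helper (name invented for this sketch): returns a UTF-8
# copy of +bytes+, guessing the source encoding from the candidates.
def normalize_to_utf8(bytes, candidates = CANDIDATE_ENCODINGS)
  candidates.each do |name|
    # Relabel a copy of the bytes and see whether they are well formed
    # under this candidate.
    attempt = bytes.dup.force_encoding(name)
    next unless attempt.valid_encoding?

    begin
      return attempt.encode("UTF-8")
    rescue EncodingError
      next # some byte has no mapping from this candidate; try the next
    end
  end

  # No candidate fit: treat the input as raw bytes and replace anything
  # that cannot be converted (every non-ASCII byte) with "?".
  bytes.dup.force_encoding("BINARY")
       .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end
```

The order of the candidates matters: strict encodings such as UTF-8 should come first, because a permissive single-byte encoding accepts almost anything. UTF-16LE is deliberately left out of the list, since most even-length byte strings validate as UTF-16LE; it is safer to decide that case from the BOM before this step.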