Monday, February 20, 2012

Get file encoding even if no Byte Order Marker

Every now and then you need to be able to programmatically determine a file's encoding.  Maybe you are writing a utility that rewrites files and you want to ensure you maintain the original encoding.  Perhaps you want to make sure that certain files have a Byte Order Marker (BOM).

If the file has a BOM, this is easy.  If it doesn't... aw, crap.  At that point you have to analyze the file's contents and make a judgement call based on what you see.  I wrote a function to do this: Get-DTWFileEncoding

Get-DTWFileEncoding returns a System.Text.Encoding object for the file specified.
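
For anyone curious what this kind of detection can look like, here is a minimal sketch in Python. It is not the PowerShell source of Get-DTWFileEncoding, and the real algorithm may differ; the function name guess_encoding, the 40% null-byte threshold, and the cp1252 fallback are all assumptions made for illustration. It checks for a BOM first, then falls back to looking at the bytes themselves.

```python
import codecs

# Known BOM signatures. UTF-32 must be checked before UTF-16 because the
# UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding(data):
    """Return (encoding_name, has_bom) for a bytes object (hypothetical helper)."""
    # Easy case: the file starts with a BOM.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, True

    # No BOM: mostly-ASCII text stored as UTF-16 has a zero byte in every
    # other position. The 0.4 threshold is an arbitrary assumption.
    pairs = len(data) // 2
    if pairs:
        even_nulls = data[0::2].count(b"\x00")
        odd_nulls = data[1::2].count(b"\x00")
        if odd_nulls >= 0.4 * pairs and even_nulls == 0:
            return "utf-16-le", False
        if even_nulls >= 0.4 * pairs and odd_nulls == 0:
            return "utf-16-be", False

    # Otherwise try the strict decoders, narrowest first.
    for name in ("ascii", "utf-8"):
        try:
            data.decode(name)
            return name, False
        except UnicodeDecodeError:
            pass

    # Judgement call: assume a single-byte Windows code page (an assumption).
    return "cp1252", False


def guess_file_encoding(path):
    """Read a file as raw bytes and guess its encoding."""
    with open(path, "rb") as handle:
        return guess_encoding(handle.read())
```

The last step is where the judgement call lives: bytes that are neither ASCII nor valid UTF-8 could be any number of legacy code pages, so a sketch like this can only pick a default.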





It will also return the appropriate encoding type based on whether or not the original file had a BOM.
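
Why the BOM distinction matters when rewriting a file can be shown with a small Python sketch. This is an illustration only, not how Get-DTWFileEncoding works internally; encode_like_original and BOM_FOR are hypothetical names. The idea is to emit the BOM only if the original file had one, so a rewrite does not silently add or strip it.

```python
import codecs

# BOM bytes for each encoding name; names follow Python's codec registry.
BOM_FOR = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-32-le": codecs.BOM_UTF32_LE,
    "utf-32-be": codecs.BOM_UTF32_BE,
}


def encode_like_original(text, encoding, had_bom):
    """Encode text, prepending a BOM only if the original file had one."""
    # These explicit-endianness codecs never write a BOM on their own,
    # so the BOM is added by hand when needed.
    payload = text.encode(encoding)
    bom = BOM_FOR.get(encoding, b"") if had_bom else b""
    return bom + payload


def rewrite_file(path, text, encoding, had_bom):
    """Write text back to path using the original encoding and BOM state."""
    with open(path, "wb") as handle:
        handle.write(encode_like_original(text, encoding, had_bom))
```

In .NET the same choice is made when constructing the encoding object; for example, the System.Text.UTF8Encoding constructor takes a flag that controls whether the UTF-8 identifier is emitted.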




The included sample test script will run it against the sample files.


You can download a module containing the Get-DTWFileEncoding function, along with some sample files with different encodings, here.  Check out the function help and the source if you want more information on the detection algorithm.


5 comments:

  1. I love it when you find someone has done all the hard work for you. Thanks this is brilliant.

  2. Thanks! I think there might be a bug in the algorithm (there might be a corner case I missed); let me know if you find any issues.

  3. "file --mime filename" on Linux works so much faster

  4. I think you have your bit order mark wrong. Your script has the BOM reversed for Big and Little Endian. http://en.wikipedia.org/wiki/Byte_order_mark

    1. * Byte Order Mark. sorry, buzzed.
