Monday, February 20, 2012

Get file encoding even if no Byte Order Marker

Note: you can find the latest version of the encoding functions in my PowerShell beautifier project:
https://github.com/DTW-DanWard/PowerShell-Beautifier
Check out file src/DTW.PS.FileSystem.Encoding.psm1

Every now and then you need to be able to programmatically determine a file's encoding. Maybe you are writing a utility that edits files and you want to ensure you maintain the original encoding. Perhaps you want to make sure that certain files have a Byte Order Marker (BOM).

If the file has a BOM, this is easy. If it doesn't... aw, crap. At that point you have to analyze the file's contents and make a judgement call based on what you see. I wrote a function to do this: Get-DTWFileEncoding.
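To make the easy case concrete, here's a minimal sketch of BOM sniffing. This is not the code from Get-DTWFileEncoding; Get-BomName is a hypothetical helper that takes the first bytes of a file and matches them against the standard Unicode BOM signatures.

```powershell
# Hypothetical helper (not part of the DTW module): name the BOM, if any,
# found at the start of a byte array.
function Get-BomName {
    param(
        [Parameter(Mandatory)]
        [AllowEmptyCollection()]
        [byte[]]$Bytes
    )

    # Hex string of up to the first four bytes, e.g. 'FEFF0048'.
    [string]$hex = -join ($Bytes | Select-Object -First 4 | ForEach-Object { $_.ToString('X2') })

    # Order matters: the UTF-32 LE BOM (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
    if     ($hex.StartsWith('0000FEFF')) { 'UTF-32 BE' }
    elseif ($hex.StartsWith('FFFE0000')) { 'UTF-32 LE' }
    elseif ($hex.StartsWith('EFBBBF'))   { 'UTF-8' }
    elseif ($hex.StartsWith('FEFF'))     { 'UTF-16 BE' }
    elseif ($hex.StartsWith('FFFE'))     { 'UTF-16 LE' }
    else                                 { 'None' }
}

# Example: FE FF followed by a big-endian 'H'.
Get-BomName -Bytes ([byte[]](0xFE, 0xFF, 0x00, 0x48))   # UTF-16 BE
```

To try it on a real file, one option is to pass the result of [System.IO.File]::ReadAllBytes($path). That reads the whole file, so a real tool would probably read only the first few bytes. Note that FF FE 00 00 is ambiguous: it could also be a UTF-16 LE BOM followed by a NUL character, and this sketch simply assumes UTF-32 LE.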

Get-DTWFileEncoding returns a System.Text.Encoding object for the specified file. Here's an example of a big-endian file with a BOM:






As you can see, a System.Text.Encoding object is returned and its BOM (the preamble) has the correct value for big-endian UTF-16: FE FF
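If you want to check the FE FF value without a file handy, the built-in .NET encoding objects expose the same information. This sketch uses only standard System.Text.Encoding members and does not call Get-DTWFileEncoding; the variable names are just for illustration.

```powershell
# Built-in .NET encoding object for big-endian UTF-16 that emits a BOM.
$encoding = [System.Text.Encoding]::BigEndianUnicode

# GetPreamble() returns the BOM bytes this encoding writes at the start of a file.
$preambleHex = ($encoding.GetPreamble() | ForEach-Object { $_.ToString('X2') }) -join ' '

$preambleHex   # FE FF
```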

Here's an example for another big-endian file, this time with no BOM:
The returned Encoding info looks the same as the first, but if you inspect the preamble, it is empty.
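For a feel of what a "judgement call" on a BOM-less file can look like, here's a rough sketch of one common heuristic. It is an assumption for illustration, not the algorithm in Get-DTWFileEncoding: in mostly-ASCII UTF-16 text, big-endian data has zero bytes at even offsets and little-endian data has them at odd offsets. Get-Utf16EncodingGuess and its "more than half" threshold are hypothetical.

```powershell
# Hypothetical helper: guess the UTF-16 byte order of BOM-less, mostly-ASCII text.
# Returns a BOM-less UnicodeEncoding, or $null when the bytes give no clear signal.
function Get-Utf16EncodingGuess {
    param([Parameter(Mandatory)][byte[]]$Bytes)

    $evenZeros = 0
    $oddZeros = 0
    for ($i = 0; $i -lt $Bytes.Length; $i++) {
        if ($Bytes[$i] -eq 0) {
            if ($i % 2 -eq 0) { $evenZeros++ } else { $oddZeros++ }
        }
    }

    $pairs = [Math]::Floor($Bytes.Length / 2)
    if ($pairs -eq 0) { return $null }

    # Assumed threshold: more than half of the byte pairs have a zero on one
    # side and there are no zeros at all on the other side.
    if ($evenZeros -gt $pairs / 2 -and $oddZeros -eq 0) {
        return New-Object System.Text.UnicodeEncoding($true, $false)    # big-endian, no BOM
    }
    if ($oddZeros -gt $pairs / 2 -and $evenZeros -eq 0) {
        return New-Object System.Text.UnicodeEncoding($false, $false)   # little-endian, no BOM
    }
    return $null
}

# Example: 'Hello' encoded as big-endian UTF-16 without a BOM.
$sample = (New-Object System.Text.UnicodeEncoding($true, $false)).GetBytes('Hello')
$guess = Get-Utf16EncodingGuess -Bytes $sample
$guess.GetPreamble().Length   # 0, because this encoding object carries no BOM
```

A heuristic like this falls over on text that is mostly non-Latin characters, which is exactly why BOM-less detection is a judgement call rather than a lookup.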




There are some other handy functions in there as well:
  • Add-DTWFileEncodingByteOrderMarker - adds a byte order marker to a file.
  • Compare-DTWFiles - compares two files and returns $true if same, $false otherwise.  Uses the two functions below to do comparisons.
  • Compare-DTWFilesIgnoringBOM - compares two files, ignoring BOMs, returning $true if same, $false otherwise.
  • Compare-DTWFilesIncludingBOM - compares two files, including BOMs, returning $true if same, $false otherwise.

Again, you can get the encoding functions from the beautifier project linked at the top of this post.


5 comments:

  1. I love it when you find someone has done all the hard work for you. Thanks this is brilliant.

  2. Thanks! I think there might be a bug in the algorithm (there might be a corner case I missed); let me know if you find any issues.

  3. "file --mime filename" on Linux works so much faster

  4. I think you have your bit order mark wrong. Your script has the BOM reversed for Big and Little Endian. http://en.wikipedia.org/wiki/Byte_order_mark

    Replies
    1. * Byte Order Mark. sorry, buzzed.
