Convert File Encoding

On several occasions I have been asked to convert files from their original encoding to something else so that another process or system can use them.

Below is an example of a simple approach:

get-content -path inputFile.txt | out-file -filePath outputFile.txt -encoding UTF8

If the file to convert is small and you're not in a hurry, this will probably work fine. However, this approach can struggle with larger files for a couple of reasons.

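1. get-content can use a lot of memory on very large files: every line becomes a separate string object decorated with extra PowerShell metadata, and if the output is collected rather than streamed, the entire file ends up in memory.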

2. get-content by default will send each line down the pipeline separately, which will dramatically slow down your writes.

The write speed can be worked around by using “-readCount n” – n being the number of lines to send down the pipeline at once.

get-content -path inputFile.txt -readCount 100 | out-file -filePath outputFile.txt -encoding UTF8
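To see what -readCount actually changes, the sketch below counts how many objects travel down the pipeline for a small throwaway file. The temp file name and the 250-line sample are made up for the illustration.

```powershell
# Build a 250-line sample file in the temp directory (hypothetical file name).
$sample = Join-Path ([System.IO.Path]::GetTempPath()) 'readcount-demo.txt'
1..250 | ForEach-Object { "line $_" } | Set-Content -Path $sample

# Default behaviour: every line is its own pipeline object.
$perLine = 0
Get-Content -Path $sample | ForEach-Object { $perLine++ }

# With -ReadCount 100, each pipeline object is an array of up to 100 lines.
$batchSizes = @(Get-Content -Path $sample -ReadCount 100 | ForEach-Object { $_.Count })

"objects without -readCount : $perLine"
"objects with -readCount 100: $($batchSizes.Count) (sizes: $($batchSizes -join ', '))"

Remove-Item -Path $sample
```

For this sample the first run sends 250 objects down the pipeline and the second only 3 (batches of 100, 100 and 50 lines), which is why out-file has far fewer writes to perform.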

Alternatively, you could use “-raw” to read the entire file as one string and send it down the pipeline all at once. Neither option addresses the memory concern, though, and -raw by definition holds the whole file in memory as a single string.

get-content -path inputFile.txt -raw | out-file -filePath outputFile.txt -encoding UTF8

Occasionally the files I was given were very large, or I needed a somewhat obscure encoding that out-file does not support natively, so I turned to .NET. This is a simplified version of the end product:

—————————————————————————–

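[string]$inputFile=$args[0]
[string]$inputEncode=$args[1]
[string]$outputEncode=$args[2]
[string]$outputFile="output.txt"

#.NET resolves relative paths against the process working directory,
#which can differ from the PowerShell location, so make both paths absolute
$inputFile=(resolve-path -literalPath $inputFile).ProviderPath
$outputFile=join-path -path (get-location).ProviderPath -childPath $outputFile

#create stream reader and writer
$streamReader=new-object -typeName System.IO.StreamReader(
    $inputFile,
    [System.Text.Encoding]::GetEncoding($inputEncode))

#the writer takes (path, append, encoding); $false overwrites any existing file
$streamWriter=new-object -typeName System.IO.StreamWriter(
    $outputFile,
    $false,
    [System.Text.Encoding]::GetEncoding($outputEncode))

try{
    #read and write each line
    while($streamReader.peek() -ge 0){
        $line=$streamReader.readLine()
        $streamWriter.writeLine($line)
    }
}
finally{
    #dispose flushes and closes the underlying files, even after an error
    $streamReader.dispose()
    $streamWriter.dispose()
}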

—————————————————————————–

Three arguments are required: the source file path, the input file encoding, and the output file encoding. Once the script completes, a file named output.txt has been created. Example usage:

./encode.ps1 test.txt "UTF-8" "US-ASCII"
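Since the script always writes to output.txt, a reusable variant might take the destination as a parameter instead. The following is only a sketch under a few assumptions: the function name Convert-FileEncoding and its parameter names are hypothetical, both paths are expected to be absolute file system paths, and the demo file names in the temp directory are made up.

```powershell
# Hypothetical helper: name and parameters are illustrative, not part of the original script.
function Convert-FileEncoding {
    param(
        [Parameter(Mandatory)][string]$Path,            # absolute source path
        [Parameter(Mandatory)][string]$Destination,     # absolute target path
        [Parameter(Mandatory)][string]$InputEncoding,   # e.g. 'UTF-8'
        [Parameter(Mandatory)][string]$OutputEncoding   # e.g. 'US-ASCII'
    )

    $reader = New-Object -TypeName System.IO.StreamReader -ArgumentList $Path, ([System.Text.Encoding]::GetEncoding($InputEncoding))
    try {
        # (path, append, encoding): $false overwrites an existing destination file.
        $writer = New-Object -TypeName System.IO.StreamWriter -ArgumentList $Destination, $false, ([System.Text.Encoding]::GetEncoding($OutputEncoding))
        try {
            # ReadLine returns $null at end of file.
            while ($null -ne ($line = $reader.ReadLine())) {
                $writer.WriteLine($line)
            }
        }
        finally { $writer.Dispose() }
    }
    finally { $reader.Dispose() }
}

# Demo: convert a small UTF-8 file to UTF-16.
$src = Join-Path ([System.IO.Path]::GetTempPath()) 'encode-demo-in.txt'
$dst = Join-Path ([System.IO.Path]::GetTempPath()) 'encode-demo-out.txt'
[System.IO.File]::WriteAllText($src, "first`r`nsecond`r`n", [System.Text.Encoding]::UTF8)
Convert-FileEncoding -Path $src -Destination $dst -InputEncoding 'UTF-8' -OutputEncoding 'UTF-16'
```

Nesting the two try/finally blocks means the reader is still released if the writer cannot be created, for example when the destination path is not writable.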

The primary thing to note is the use of the streamReader and streamWriter from the System.IO .NET namespace.

This combination reads and writes each line very quickly with minimal memory usage, since only the current line (plus a small buffer) is held in memory at any time.

In my experience a file that would take several minutes to process using get-content and out-file may only take a few seconds using this approach.
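A rough way to check this on your own data is Measure-Command. The sketch below generates a throwaway file and times both approaches. The 50,000-line size and the temp file names are arbitrary assumptions, and the actual numbers will depend heavily on the machine, the PowerShell version, and the file.

```powershell
$tmp    = [System.IO.Path]::GetTempPath()
$source = Join-Path $tmp 'timing-demo-in.txt'    # hypothetical file names
$viaPs  = Join-Path $tmp 'timing-demo-ps.txt'
$viaNet = Join-Path $tmp 'timing-demo-net.txt'

# Generate sample data.
$lines = foreach ($i in 1..50000) { "row $i of the sample file" }
[System.IO.File]::WriteAllLines($source, [string[]]$lines)

# Approach 1: get-content piped to out-file.
$psTime = Measure-Command {
    Get-Content -Path $source | Out-File -FilePath $viaPs -Encoding UTF8
}

# Approach 2: StreamReader and StreamWriter.
$netTime = Measure-Command {
    $reader = New-Object -TypeName System.IO.StreamReader -ArgumentList $source, ([System.Text.Encoding]::UTF8)
    $writer = New-Object -TypeName System.IO.StreamWriter -ArgumentList $viaNet, $false, ([System.Text.Encoding]::UTF8)
    try {
        while ($null -ne ($line = $reader.ReadLine())) { $writer.WriteLine($line) }
    }
    finally {
        $reader.Dispose()
        $writer.Dispose()
    }
}

"get-content | out-file : {0:N0} ms" -f $psTime.TotalMilliseconds
"streamReader/Writer    : {0:N0} ms" -f $netTime.TotalMilliseconds
```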

The documentation for the .NET System.Text.Encoding class lists supported encodings.

https://msdn.microsoft.com/en-us/library/system.text.encoding(v=vs.110).aspx
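The same information can also be queried from the shell. As a small sketch, GetEncodings() lists what the current runtime reports, and GetEncoding() confirms that a particular name resolves. The list can differ between Windows PowerShell and newer PowerShell versions, so treat the output as specific to the machine it runs on.

```powershell
# List the encodings the current runtime reports, including the names GetEncoding accepts.
[System.Text.Encoding]::GetEncodings() |
    Sort-Object -Property Name |
    Select-Object -Property Name, CodePage, DisplayName

# Look up a single encoding by name to confirm it is available.
$ascii = [System.Text.Encoding]::GetEncoding('US-ASCII')
$ascii.WebName
```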
