Adjust file encoding from the Finder context menu: "GB2312 to UTF-8 with 1-click"

One of the most annoying things about the Mac is text encoding, especially if you're living in a non-Mac world.

Mac uses UTF-8 as the default encoding for text files, but Windows uses a local encoding that changes according to the OS language. For Chinese users, Windows uses GB2312 as the default encoding.

As a result, movie subtitle files, song lyrics files, plain-text novels and source code containing Chinese, whether downloaded from web sites or received from others, often cannot be read because of the wrong encoding.
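To make the mismatch concrete, here is a minimal Ruby sketch (Ruby 1.9 or later assumed; the sample text and variable names are only illustrative). It shows that the same two Chinese characters are stored as different bytes in GB2312 and in UTF-8, which is why a GB2312 file looks broken when it is read as UTF-8.

```ruby
# "\u4F60\u597D" is a two-character Chinese greeting, written with
# escapes so that the snippet itself stays pure ASCII.
text = "\u4F60\u597D"

utf8_bytes   = text.bytes                    # bytes as stored in a UTF-8 file
gb2312_bytes = text.encode('GB2312').bytes   # bytes as stored in a GB2312 file

puts utf8_bytes.map { |b| format('%02X', b) }.join(' ')
puts gb2312_bytes.map { |b| format('%02X', b) }.join(' ')

# Treating the GB2312 bytes as if they were UTF-8 gives an invalid string.
misread = gb2312_bytes.pack('C*').force_encoding('UTF-8')
puts misread.valid_encoding?
```

The two byte sequences differ in both length and content, and the last line reports that the misread string is not valid UTF-8.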

So I really wanted an item in Finder's context menu that adjusts the encoding of the selected files with 1 click.

Luckily, with the help of Ruby, an Automator workflow and a Mac OS X service, it isn't that hard.

Basically, OS X loads all the workflow files saved in ~/Library/Services/ and displays them in Finder's context menu.

To build the service, work through the following steps:

1. To create a new service, just pick Service in Automator’s ‘create new document’ dialog.

2. Set the service input to “files and folders from any application”.

3. Run a Ruby script to transcode the files

Add a “Run Shell Script” action to execute the following Ruby code, which transcodes the files passed to the service. (For more detail about how to embed Ruby in a workflow, check out Using RVMed Ruby in Mac Automator Workflow.)

Make sure the input is passed as arguments to the Ruby script.

Transcode the files
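old_files = []
ARGV.each do |name|
  # Skip folders and anything else that is not a regular file
  next unless File.file?(name)

  # Keep the original as <name>.old and write the UTF-8 copy under the old name
  backup_name = name + '.old'
  File.rename(name, backup_name)
  File.open(backup_name, 'r:GB2312:UTF-8') do |source|
    File.open(name, 'w:UTF-8') do |dest|
      while (line = source.gets)
        dest.puts line
      end
    end
  end

  puts name
  old_files << backup_name
end
# Remember the backup files for the clean-up step
ENV['Transcode_Backup_Files'] = old_files.join('|')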
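If you want to try the transcoding logic outside Automator first, the following is a small sketch of the same idea as a function. The helper name transcode_file and its from: parameter are made up for illustration. It reads the whole file at once and replaces invalid bytes instead of raising, which is an assumption about what you want for damaged files, not something the service script above does.

```ruby
require 'tmpdir'

# Hypothetical helper: re-encode one file to UTF-8 and keep the original
# bytes in "<name>.old". Returns the name of the backup file.
def transcode_file(path, from: 'GB2312')
  backup = path + '.old'
  File.rename(path, backup)
  raw = File.binread(backup).force_encoding(from)
  # Replace bytes that cannot be converted instead of raising an error.
  utf8 = raw.encode('UTF-8', invalid: :replace, undef: :replace)
  File.binwrite(path, utf8)
  backup
end

# Try it on a throwaway file so that no real data is touched.
Dir.mktmpdir do |dir|
  sample = File.join(dir, 'lyrics.txt')
  File.binwrite(sample, "\u4F60\u597D\n".encode('GB2312'))
  backup = transcode_file(sample)
  puts File.binread(sample).force_encoding('UTF-8') == "\u4F60\u597D\n"
  puts File.exist?(backup)
end
```

Working on a temporary directory keeps the experiment safe; the real service still receives its file names through ARGV.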

4. Display a Growl message when processing is done

5. Ask the user whether to keep the backup files

I use an Ask for Confirmation action to ask whether the user wants to keep the backup files.
The workflow aborts if the user clicks “No”, so make sure you update the text on the buttons and put each text on the right button.

6. Add a script to remove the backup files

Add another “Run Shell Script” action to execute another piece of Ruby code.

Remove backup files
if ENV['Transcode_Backup_Files']
  ENV['Transcode_Backup_Files'].split('|').each do |file|
    File.delete file
  end
  ENV.delete 'Transcode_Backup_Files'
end

7. Display a notification to tell the user that the backup files have been deleted

TIP: The transcode Ruby script requires Ruby 1.9+, but Mac OS X ships with Ruby 1.8.x by default, which doesn’t support encodings. To interpret the code embedded in the workflow with Ruby 1.9+, please refer to Using RVMed Ruby in Mac Automator Workflow.

Pitfall in node crypto and base64 encoding

Today we found a huge pitfall in the node.js crypto module! Decipher has a potential problem when processing Base64-encoded input.

We’re building a RESTful web service based on Node.js, which talks to some other services implemented in Ruby.

Ruby

In Ruby, we use the standard Base64 module to handle Base64 encoding.

Base64.encode64 has a very interesting feature:
It adds a line break (\n) to the output every 60 characters. This format makes the output look pretty and friendly for human reading:

Ruby Base64 Block
MSwyLDMsNCw1LDYsNyw4LDksMTAsMTEsMTIsMTMsMTQsMTUsMTYsMTcsMTgs
MTksMjAsMjEsMjIsMjMsMjQsMjUsMjYsMjcsMjgsMjksMzAsMzEsMzIsMzMs
MzQsMzUsMzYsMzcsMzgsMzksNDAsNDEsNDIsNDMsNDQsNDUsNDYsNDcsNDgs
NDksNTAsNTEsNTIsNTMsNTQsNTUsNTYsNTcsNTgsNTksNjAsNjEsNjIsNjMs
NjQsNjUsNjYsNjcsNjgsNjksNzAsNzEsNzIsNzMsNzQsNzUsNzYsNzcsNzgs
NzksODAsODEsODIsODMsODQsODUsODYsODcsODgsODksOTAsOTEsOTIsOTMs
OTQsOTUsOTYsOTcsOTgsOTksMTAw

Base64.decode64 ignores the line breaks (\n) when parsing the Base64-encoded data, so the line breaks won’t pollute the data.
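The wrapping behavior is easy to reproduce. The sketch below assumes that the sample block above was produced from the numbers 1 to 100 joined by commas (that is what its first line decodes to); the variable names are illustrative. It also shows Base64.strict_encode64 and Base64.strict_decode64, the variants in Ruby's standard library that never emit and never accept line breaks.

```ruby
require 'base64'

text = (1..100).to_a.join(',')           # "1,2,3,...,100"

wrapped = Base64.encode64(text)          # line break after every 60 characters
strict  = Base64.strict_encode64(text)   # one single line, no line breaks

puts wrapped.lines.first.chomp.length    # length of the first wrapped line
puts strict.include?("\n")

# decode64 skips the line breaks, so both forms decode to the same text.
puts Base64.decode64(wrapped) == text
puts Base64.decode64(strict) == text

# strict_decode64 rejects the wrapped form instead of skipping line breaks.
begin
  Base64.strict_decode64(wrapped)
rescue ArgumentError => e
  puts "strict_decode64 refused the input: #{e.message}"
end
```

A decoder that behaves like strict_decode64 cannot read the output of encode64, and that is the kind of mismatch this post is about.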

Node.js

Node.js takes Base64 as one of its 5 standard encodings (ascii, utf8, base64, binary, hex). Ideally, data or strings can be transcoded between these 5 encodings without data loss.

The Buffer class is the simplest way to transcode the data:

Base64 Encoder in Node.js
Base64 =
  encode64: (text) ->
    new Buffer(text, 'utf8').toString('base64')
  decode64: (base64) ->
    new Buffer(base64, 'base64').toString('utf8')

Although the encode64 function in node.js won’t add line breaks to the output, the decode64 function does ignore line breaks when parsing the data. This is consistent with the behavior of the Ruby Base64 module, so we can use this decode64 function to decode data coming from Ruby.

Since base64 is one of the standard encodings, and some of the node.js APIs allow setting the encoding for input and output, ideally we can do the base64 encoding and decoding while processing the data.
In this respect, Node.js seems more convenient than Ruby when dealing with Base64.

E.g. we can combine reading a file and base64 encoding its content into one operation by passing the encoding to the readFileSync API.

Read file content as Base64
fs = require('fs')
fileName = './binary.dat' # this file contains binary data
base64 = fs.readFileSync(fileName, 'base64') # file content has been base64 encoded

It looks like we can always use this trick to avoid manually base64 encoding and decoding whenever an API has an encoding parameter. But that is not true! There is a BIG pitfall here!

In our real case, we use the crypto module to decrypt a JSON document that was encrypted and base64 encoded by Ruby:

Base64 Decode and Decrypt
crypto = require('crypto')

parse = (data, algorithm, key, iv) ->
  decipher = crypto.createDecipheriv(algorithm, key, iv)
  # Set the input encoding to 'base64' to ask the API to base64 decode the input before decryption
  decrypted = decipher.update(data, 'base64', 'utf8')
  decrypted += decipher.final('utf8')
  JSON.parse(decrypted)

Manually Base64 Decoding
crypto = require('crypto')

parse = (data, algorithm, key, iv) ->
  decipher = crypto.createDecipheriv(algorithm, key, iv)
  binary = new Buffer(data, 'base64') # Manually base64 decode
  decrypted = decipher.update(binary, 'binary', 'utf8') # Set the input encoding to 'binary'
  decrypted += decipher.final('utf8')
  JSON.parse(decrypted)

The 2 implementations above are very similar, except that the second one base64 decodes the data manually using Buffer. Ideally they should be equivalent in behavior. But in fact, they are NOT equivalent!

The first implementation throws “TypeError: DecipherFinal fail”.
The reason is that the shortcut doesn’t ignore the line breaks, but Buffer does! So in the first implementation, the data is polluted by the line breaks.
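If you also control the Ruby side, one way to sidestep the problem is to never send the line breaks in the first place. The following is only a sketch under that assumption: the helper name encrypt_for_node, the aes-128-cbc algorithm, and the key and IV values are all invented for the demo and are not taken from our service.

```ruby
require 'openssl'
require 'base64'

# Hypothetical helper for the Ruby side: encrypt, then Base64 encode
# without the line breaks that encode64 would insert.
def encrypt_for_node(plain, algorithm, key, iv)
  cipher = OpenSSL::Cipher.new(algorithm)
  cipher.encrypt
  cipher.key = key
  cipher.iv = iv
  Base64.strict_encode64(cipher.update(plain) + cipher.final)
end

key  = 'k' * 16   # demo values only; real keys must be random and secret
iv   = 'i' * 16
json = '{"numbers":[' + (1..40).to_a.join(',') + ']}'

payload = encrypt_for_node(json, 'aes-128-cbc', key, iv)
puts payload.include?("\n")

# If the sender cannot be changed, removing all whitespace from the
# wrapped form gives back exactly the same single-line payload.
wrapped = Base64.encode64(Base64.strict_decode64(payload))
puts wrapped.gsub(/\s+/, '') == payload
```

Either approach removes the line breaks before the decoder sees them, so it no longer matters whether a particular decoder skips them.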

Conclusion

Be careful when you ask an API to base64 decode the data by setting the encoding argument to ‘base64’. Its behavior is inconsistent with that of the Buffer class.

I’m not sure whether it is a node.js bug or whether it works as designed. But it is indeed a deeply hidden pitfall, and one that is usually extremely hard to track down, since encrypted binary data is hard for humans to read and debugging across 2 languages is also kind of hard!

HTML codes to put special characters on your Web page

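I tried using LinqPad to convert all kinds of odd letters to HTML encoding, and found that not every character gets converted.
The tools built into .NET cannot handle all HTML encoding perfectly.

Source of the characters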

Query

Get html escaped unicodes
"A,a,À,à,Á,á,Â,â,Ã,ã,Ä,ä,Å,å,Ā,ā,Ă,ă,Ą,ą,Ǟ,ǟ,Ǻ,ǻ,Æ,æ,Ǽ,ǽ,B,b,Ḃ,ḃ,C,c,Ć,ć,Ç,ç,Č,č,Ĉ,ĉ,Ċ,ċ,D,d,Ḑ,ḑ,Ď,ď,Ḋ,ḋ,Đ,đ,Ð,ð,DZ,dz,DŽ,dž,E,e,È,è,É,é,Ě,ě,Ê,ê,Ë,ë,Ē,ē,Ĕ,ĕ,Ę,ę,Ė,ė,Ʒ,ʒ,Ǯ,ǯ,F,f,Ḟ,ḟ,ƒ,ff,fi,fl,ffi,ffl,ſt,G,g,Ǵ,ǵ,Ģ,ģ,Ǧ,ǧ,Ĝ,ĝ,Ğ,ğ,Ġ,ġ,Ǥ,ǥ,H,h,Ĥ,ĥ,Ħ,ħ,I,i,Ì,ì,Í,í,Î,î,Ĩ,ĩ,Ï,ï,Ī,ī,Ĭ,ĭ,Į,į,İ,ı,IJ,ij,J,j,Ĵ,ĵ,K,k,Ḱ,ḱ,Ķ,ķ,Ǩ,ǩ,ĸ,L,l,Ĺ,ĺ,Ļ,ļ,Ľ,ľ,Ŀ,ŀ,Ł,ł,LJ,lj,M,m,Ṁ,ṁ,N,n,Ń,ń,Ņ,ņ,Ň,ň,Ñ,ñ,ʼn,Ŋ,ŋ,NJ,nj,O,o,Ò,ò,Ó,ó,Ô,ô,Õ,õ,Ö,ö,Ō,ō,Ŏ,ŏ,Ø,ø,Ő,ő,Ǿ,ǿ,Œ,œ,P,p,Ṗ,ṗ,Q,q,R,r,Ŕ,ŕ,Ŗ,ŗ,Ř,ř,ɼ,S,s,Ś,ś,Ş,ş,Š,š,Ŝ,ŝ,Ṡ,ṡ,ſ,ß,T,t,Ţ,ţ,Ť,ť,Ṫ,ṫ,Ŧ,ŧ,Þ,þ,U,u,Ù,ù,Ú,ú,Û,û,Ũ,ũ,Ü,ü,Ů,ů,Ū,ū,Ŭ,ŭ,Ų,ų,Ű,ű,V,v,W,w,Ẁ,ẁ,Ẃ,ẃ,Ŵ,ŵ,Ẅ,ẅ,X,x,Y,y,Ỳ,ỳ,Ý,ý,Ŷ,ŷ,Ÿ,ÿ,Z,z,Ź,ź,Ž,ž,Ż,ż"
.Split(',')
.ToDictionary(k=>k,HttpUtility.HtmlEncode)

Result

Dictionary<String,String>
(304 items)

Key  Value
A  A
a  a
À  &#192;
à  &#224;
Á  &#193;
á  &#225;
Â  &#194;
â  &#226;
Ã  &#195;
ã  &#227;
Ä  &#196;
ä  &#228;
Å  &#197;
å  &#229;
Ā  Ā
ā  ā
Ă  Ă
ă  ă
Ą  Ą
ą  ą
Ǟ  Ǟ
ǟ  ǟ
Ǻ  Ǻ
ǻ  ǻ
Æ  &#198;
æ  &#230;
Ǽ  Ǽ
ǽ  ǽ
B  B
b  b
C  C
c  c
Ć  Ć
ć  ć
Ç  &#199;
ç  &#231;
Č  Č
č  č
Ĉ  Ĉ
ĉ  ĉ
Ċ  Ċ
ċ  ċ
D  D
d  d
Ď  Ď
ď  ď
Đ  Đ
đ  đ
Ð  &#208;
ð  &#240;
DZ  DZ
dz  dz
DŽ  DŽ
dž  dž
E  E
e  e
È  &#200;
è  &#232;
É  &#201;
é  &#233;
Ě  Ě
ě  ě
Ê  &#202;
ê  &#234;
Ë  &#203;
ë  &#235;
Ē  Ē
ē  ē
Ĕ  Ĕ
ĕ  ĕ
Ę  Ę
ę  ę
Ė  Ė
ė  ė
Ʒ  Ʒ
ʒ  ʒ
Ǯ  Ǯ
ǯ  ǯ
F  F
f  f
ƒ  ƒ
G  G
g  g
Ǵ  Ǵ
ǵ  ǵ
Ģ  Ģ
ģ  ģ
Ǧ  Ǧ
ǧ  ǧ
Ĝ  Ĝ
ĝ  ĝ
Ğ  Ğ
ğ  ğ
Ġ  Ġ
ġ  ġ
Ǥ  Ǥ
ǥ  ǥ
H  H
h  h
Ĥ  Ĥ
ĥ  ĥ
Ħ  Ħ
ħ  ħ
I  I
i  i
Ì  &#204;
ì  &#236;
Í  &#205;
í  &#237;
Î  &#206;
î  &#238;
Ĩ  Ĩ
ĩ  ĩ
Ï  &#207;
ï  &#239;
Ī  Ī
ī  ī
Ĭ  Ĭ
ĭ  ĭ
Į  Į
į  į
İ  İ
ı  ı
IJ  IJ
ij  ij
J  J
j  j
Ĵ  Ĵ
ĵ  ĵ
K  K
k  k
Ķ  Ķ
ķ  ķ
Ǩ  Ǩ
ǩ  ǩ
ĸ  ĸ
L  L
l  l
Ĺ  Ĺ
ĺ  ĺ
Ļ  Ļ
ļ  ļ
Ľ  Ľ
ľ  ľ
Ŀ  Ŀ
ŀ  ŀ
Ł  Ł
ł  ł
LJ  LJ
lj  lj
M  M
m  m
N  N
n  n
Ń  Ń
ń  ń
Ņ  Ņ
ņ  ņ
Ň  Ň
ň  ň
Ñ  &#209;
ñ  &#241;
ʼn  ʼn
Ŋ  Ŋ
ŋ  ŋ
NJ  NJ
nj  nj
O  O
o  o
Ò  &#210;
ò  &#242;
Ó  &#211;
ó  &#243;
Ô  &#212;
ô  &#244;
Õ  &#213;
õ  &#245;
Ö  &#214;
ö  &#246;
Ō  Ō
ō  ō
Ŏ  Ŏ
ŏ  ŏ
Ø  &#216;
ø  &#248;
Ő  Ő
ő  ő
Ǿ  Ǿ
ǿ  ǿ
Œ  Œ
œ  œ
P  P
p  p
Q  Q
q  q
R  R
r  r
Ŕ  Ŕ
ŕ  ŕ
Ŗ  Ŗ
ŗ  ŗ
Ř  Ř
ř  ř
ɼ  ɼ
S  S
s  s
Ś  Ś
ś  ś
Ş  Ş
ş  ş
Š  Š
š  š
Ŝ  Ŝ
ŝ  ŝ
ſ  ſ
ß  &#223;
T  T
t  t
Ţ  Ţ
ţ  ţ
Ť  Ť
ť  ť
Ŧ  Ŧ
ŧ  ŧ
Þ  &#222;
þ  &#254;
U  U
u  u
Ù  &#217;
ù  &#249;
Ú  &#218;
ú  &#250;
Û  &#219;
û  &#251;
Ũ  Ũ
ũ  ũ
Ü  &#220;
ü  &#252;
Ů  Ů
ů  ů
Ū  Ū
ū  ū
Ŭ  Ŭ
ŭ  ŭ
Ų  Ų
ų  ų
Ű  Ű
ű  ű
V  V
v  v
W  W
w  w
Ŵ  Ŵ
ŵ  ŵ
X  X
x  x
Y  Y
y  y
Ý  &#221;
ý  &#253;
Ŷ  Ŷ
ŷ  ŷ
Ÿ  Ÿ
ÿ  &#255;
Z  Z
z  z
Ź  Ź
ź  ź
Ž  Ž
ž  ž
Ż  Ż
ż  ż
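
In the result above, only the characters with code points up to 255 came out as numeric entities; everything beyond that range was returned unchanged. If every non-ASCII character should be escaped, a few lines of code are enough. The sketch below is in Ruby rather than C#, the helper name encode_non_ascii is made up for illustration, and it deliberately leaves &, < and > alone, so it complements a regular HTML escaper rather than replacing one.

```ruby
# Hypothetical helper: replace every non-ASCII character with a decimal
# numeric character reference such as &#192;.
def encode_non_ascii(text)
  text.gsub(/[^\x00-\x7F]/) { |ch| format('&#%d;', ch.ord) }
end

puts encode_non_ascii("A")        # plain ASCII is left alone
puts encode_non_ascii("\u00C0")   # U+00C0, inside the range that was encoded above
puts encode_non_ascii("\u0100")   # U+0100, the first code point beyond that range
```

The characters are written as \u escapes only to keep the snippet ASCII-safe; passing the literal letters works the same way when the source file is UTF-8.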