I love working on a Mac. My PowerBook is not the fastest computer in the world, but it works reliably and is virus- and malware-free (thus far). And when you’re working on text-heavy document sets (such as websites), OS X’s Unix tools can be incredible time savers.
An example: running a word count on a published site. This is a request I get fairly frequently; translators usually want to know how much work it will take to translate a site from one language to another (e.g., Spanish to English). Fortunately, two Unix tools make this work very easy: lynx and wc.
The following is the sequence of commands I usually employ:
Open up the terminal and type the following:
cd ~/Desktop
mkdir sitename_com
cd sitename_com
This creates a new folder called sitename_com on your Desktop, and then places you in it. Now type:
lynx -traversal -crawl http://www.sitename.com
lynx is an amazing command-line web browser that does many things. Here we’re using it with the -traversal switch, which follows every link it finds in the site you pointed it to (http://www.sitename.com). The -crawl switch saves each page it finds as a text file with a .dat extension, without the HTML markup. Just what we want!
Note: if lynx isn’t on your system, you can install it using Fink. Explaining how to do this is beyond the scope of this post; check out the documentation on the Fink site for more info.
Next step:
wc -w *.dat > ~/Desktop/wordcount.txt
wc is a word count utility. Here we are telling it to count only words (hence the -w switch) in all files with the .dat extension (in other words, the files that lynx saved in the current directory in the previous step). The results, one line per file followed by a total, are saved to a file called wordcount.txt on your desktop. Open this file up in a text editor, and you’re done!
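If all you need is the grand total rather than the per-file breakdown, a small helper along these lines may be handy. This is only a sketch: site_word_total is a made-up name, and it assumes every .dat file in the folder is a crawled page. As far as I know, lynx also drops a few bookkeeping files such as traverse.dat into the same folder, so check what is actually there before trusting the number.

```shell
# site_word_total DIR
# Print one total word count for every .dat file in DIR (default: current folder).
# site_word_total is a hypothetical helper name, not part of lynx or wc.
site_word_total() {
  dir="${1:-.}"
  # Feeding wc a single stream makes it print one number with no per-file rows.
  # tr strips the padding spaces that some versions of wc add.
  cat "$dir"/*.dat | wc -w | tr -d ' '
}
```

You would call it as site_word_total ~/Desktop/sitename_com after pasting the function into your terminal session.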
Well, not quite. Pages on the same site usually have many words in common. For example, navigation menus are usually the same throughout the site. It wouldn’t be fair to count the navigation labels as “new words” on every page, because they will only need to be translated once. I usually take a look at a few of the .dat files that lynx created to guesstimate the percentage of repeated words. (It can be anywhere from 10% to 40% or more of the site content.) I then subtract that share from the total. (I always make it clear that the number I’m giving is at best a rough estimate. But this is better than nothing!)
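The subtraction, plus one possible way to put a number on the repetition instead of eyeballing it, could be sketched like this. Both function names are made up for illustration. Note that unique_line_words is a different technique from the guesstimate described above: it counts each distinct line of text only once across all pages, so it only catches repeats that are identical line for line, and it should be treated as another rough estimate.

```shell
# adjusted_count TOTAL PERCENT
# Subtract an estimated percentage of repeated words from a total.
# Hypothetical helper; both arguments are whole numbers.
adjusted_count() {
  total="$1"
  percent="$2"
  # Shell arithmetic is integer-only, so the result is rounded down.
  echo $(( total * (100 - percent) / 100 ))
}

# unique_line_words DIR
# Count words after collapsing lines that repeat across the .dat files in DIR.
# Hypothetical helper; assumes repeated menus show up as identical lines.
unique_line_words() {
  dir="${1:-.}"
  # sort -u keeps a single copy of each distinct line across all pages.
  cat "$dir"/*.dat | LC_ALL=C sort -u | wc -w | tr -d ' '
}
```

For instance, adjusted_count 12000 25 takes a 12,000-word total, removes an estimated 25% of repeats, and prints 9000. Comparing the output of unique_line_words with the plain total gives a second opinion on what that percentage should be.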
Of course, none of these tools are Mac-specific; these things can be done in Linux and even Windows (using Cygwin).
If you have any Unix web-dev tips to share, or if you know of ways of improving this technique, please let me know.