Counting Words and Characters
Tim Watson
Word and character counting is a subject close to the heart of all freelance translators, as it’s the basis for job costing and getting paid. This article considers some of the issues involved in word counting.
Different word processors and translation tools very often produce different word count values for the same document, though typically not wildly different. The differences can be due to the use of different rules for counting as well as deficiencies in the applications used.
Many people rely on the document property statistics produced by Microsoft Word to determine the word and character counts. In many instances this is perfectly adequate. There are, however, a few things Word gets wrong that one should be aware of, as I will explain.
When one is handling a large number of documents at a time, getting an overall word count for all documents can be time consuming, especially if this means opening several documents in Microsoft Word, noting the count values for each file and then totalling them all together. There are third-party tools to automate the process of counting words. These allow a number of files, which may be of different formats, to be selected, and the word/character counts are then summarized and totalled. When one is faced with many files, these tools are real time savers. For example, when one is working with Web pages, it’s quite common for a customer to supply dozens of separate files. The utilities typically support multiple file formats such as Word, HTML, PDF, PowerPoint, Excel, text and so on. These dedicated word counting tools can also be more accurate as they don’t have the deficiencies that standard applications such as Microsoft Word have. The table “Word Counts From Three Applications” shows the word count from three different applications, including Microsoft Word, for a set of test documents.
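To make the totalling step concrete, here is a minimal sketch in Python of what such a utility does for plain text files. It is not how any particular commercial tool works; the function names are my own, the files are assumed to be UTF-8 text, and the count simply treats everything between whitespace as a word. Other formats such as Word or PDF would need a conversion step that is not shown.

```python
from pathlib import Path


def count_text(text):
    """Return (words, characters) for a string, splitting on whitespace."""
    words = len(text.split())
    chars = len(text)  # characters including spaces and line breaks
    return words, chars


def total_counts(paths):
    """Return per-file counts and grand totals for plain text files.

    Assumption: every path points to a UTF-8 text file.
    """
    per_file = {}
    total_words = 0
    total_chars = 0
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        words, chars = count_text(text)
        per_file[str(path)] = (words, chars)
        total_words += words
        total_chars += chars
    return per_file, (total_words, total_chars)


# Hypothetical usage: total every .txt file in a customer's folder.
# per_file, (words, chars) = total_counts(sorted(Path("job").glob("*.txt")))
```

Keeping the per-file figures alongside the grand total makes it easy to check the result against a customer's own count, file by file.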
Readers who wish to try the test documents on their own systems may download them from www.surefiresoftware.com/testdocs.htm.
Scanned and electronically faxed images are another matter. These will typically be in bitmap (.bmp), .jpg, .gif, .tif or some other graphical format. Acrobat PDF documents or Word documents may also contain scanned images. Text in a scanned image is not stored in the form of a character encoding, but is described like a picture and is made up of colored dots or pixels. In order for a computer program to count words, one must first convert the graphical image back into a character encoded format, such as Word, rich text file (RTF), text and so on. This can be done with the aid of an optical character recognition (OCR) application. Several OCR applications are commercially available.
Counting in Word
Let’s now consider Microsoft Word in more depth and look at the areas where caution is needed. Word basically counts words by assuming everything between spaces is a word. This includes symbols such as %, &, @, * and #. Translation tools are generally a bit smarter and will not include these symbols as words.
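The difference between the two rules is easy to see in a short Python sketch. Neither function is the actual algorithm used by Word or by any translation tool; the first simply counts everything between whitespace, and the second assumes that a token with no letter or digit in it is a symbol and should be left out.

```python
def count_whitespace_tokens(text):
    """Count everything between whitespace as a word."""
    return len(text.split())


def count_excluding_symbols(text):
    """Count only tokens that contain at least one letter or digit."""
    return sum(
        1 for token in text.split()
        if any(ch.isalnum() for ch in token)
    )


sample = "Sales rose 5 % in Q3 & Q4 @ the # 1 store"
print(count_whitespace_tokens(sample))   # 13
print(count_excluding_symbols(sample))   # 9
```

The four stand-alone symbols account for the whole difference here, which is why symbol-heavy material such as price lists shows a wider gap between tools than running prose does.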
Text from text boxes, grouped shapes, auto-shapes, headers, footers and comments is not included in the Word-generated document statistics. Headers and footers usually contain little text, so the error introduced by Word from ignoring this text is minor. The use of text boxes can be more significant. Some document authors use many text boxes, particularly to annotate drawings or to help produce complex text layouts. In these cases, ignoring this text can produce large errors, causing the word count to be far too low.
Microsoft Word counts numbers as words. For example, 4.7 would count as one word. Some other packages may exclude numbers from the word count. General opinion seems to be split on how to consider numbers. Some say that because numbers don’t need to be translated, they should not be included. Others say that because numbers need to be transcribed and checked for errors, they should be included. The difference is typically not significant for documents that contain only a few numbers.
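Whichever view one takes, it helps to know how much the choice changes the total. The following Python sketch counts a text both ways. The pattern used to recognise a number is my own assumption (digits with optional sign, decimal or thousands separators and a trailing percent sign) and is not taken from any particular package.

```python
import re

# Assumed definition of a number: optional sign, digits, optional groups
# separated by "." or ",", and an optional trailing percent sign.
NUMBER = re.compile(r"[+-]?\d+(?:[.,]\d+)*%?")


def count_words(text, include_numbers=True):
    """Count whitespace-separated tokens, optionally leaving out numbers."""
    tokens = text.split()
    if include_numbers:
        return len(tokens)
    # Strip surrounding punctuation so that "4.7," is still seen as a number.
    return sum(
        1 for token in tokens
        if not NUMBER.fullmatch(token.strip(".,;:()"))
    )


sample = "The pump delivers 4.7 litres per minute at 1,200 rpm."
print(count_words(sample))                          # 10
print(count_words(sample, include_numbers=False))   # 8
```

Running both counts on a document before quoting shows at once whether the numbers question is worth raising with the customer.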
Word does not count the text contained in any embedded objects. These objects, sometimes also known as OLE objects, are inserted into a Word document through use of Word’s Insert menu and the Object… item. For example, an Excel worksheet can be embedded within a Word document. Inserted OLE objects in Word documents are often diagrams with little or no text; but this is not always the case, and caution is needed. For example, an embedded Excel worksheet may contain significant amounts of text.
Using Microsoft Word to open HTML files and provide statistics needs some additional care. If the HTML file contains a form with predefined options for a drop-down type combo box, then Word will not count the predefined drop-down text options. When the HTML contains forms, this can lead to the word count being significantly lower than the true figure. The Word statistics also do not include the HTML page title, button text, and text in meta tags such as those for description and keywords. Text that is part of a graphic (scanned images and, very often, buttons) will also not be counted.
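To estimate how much text is at stake, one can count these items separately. The Python sketch below uses the standard library's html.parser module to collect the page title, drop-down options, button text and the description and keywords meta tags. The class and function names are my own, and the choice of which tags and attributes to look at is an assumption based on the list above; text inside graphics is not touched and still needs OCR or a manual count.

```python
from html.parser import HTMLParser


class HiddenTextCollector(HTMLParser):
    """Collect text from the places listed above as missing from Word's count."""

    def __init__(self):
        super().__init__()
        self.found = []        # list of (source, text) pairs
        self._current = None   # tag name while inside title/option/button

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "option", "button"):
            self._current = tag
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.found.append(("meta", attrs.get("content") or ""))
        elif tag == "input":
            kind = (attrs.get("type") or "").lower()
            if kind in ("button", "submit", "reset"):
                self.found.append(("button", attrs.get("value") or ""))

    def handle_endtag(self, tag):
        # Closing the element (or the enclosing select) ends collection.
        if tag == self._current or tag == "select":
            self._current = None

    def handle_data(self, data):
        if self._current and data.strip():
            self.found.append((self._current, data.strip()))


def hidden_word_count(html):
    """Return the number of whitespace-separated words in the collected text."""
    parser = HiddenTextCollector()
    parser.feed(html)
    parser.close()
    return sum(len(text.split()) for _, text in parser.found)


page = """<html><head><title>Order form</title>
<meta name="keywords" content="pumps, valves, spare parts">
</head><body><p>Choose a colour:</p>
<select><option>Dark red</option><option>Blue</option></select>
<input type="submit" value="Send order">
</body></html>"""
print(hidden_word_count(page))   # 11
```

In this small page the visible paragraph has three words, while the title, keywords, options and button add another eleven, which illustrates how far a form-heavy page can drift from the figure Word reports.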
Counting in PowerPoint
In common with Word, PowerPoint does not count the text contained in OLE objects, which are commonly used in PowerPoint presentations. Microsoft Word tables can be easily inserted as embedded objects, using the PowerPoint Insert menu, Picture sub-menu. Excel worksheets are also commonly embedded into a slide. When embedded objects exist, they typically contain significant amounts of text, and this should be taken into account manually.
PowerPoint 97 and 2000 are not consistent with Word in the rules used for counting. For example, hyphenated words are counted as two words. PowerPoint XP corrects this difference. This means that two different users with different PowerPoint versions may disagree about the word count on the same document. PowerPoint, of course, doesn’t provide character count statistics. A third-party tool must be used for this purpose.
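The effect of the hyphenation rule can be illustrated with a few lines of Python. This is only a sketch of the two rules, not the code used by any PowerPoint version; the function names are hypothetical.

```python
import re


def count_hyphen_as_one(text):
    """Treat a hyphenated word as a single word (split on whitespace only)."""
    return len(text.split())


def count_hyphen_as_two(text):
    """Treat each part of a hyphenated word as a separate word."""
    # Split on runs of whitespace and hyphens, then drop empty pieces.
    return len([part for part in re.split(r"[\s-]+", text) if part])


sample = "A well-known, third-party tool"
print(count_hyphen_as_one(sample))   # 4
print(count_hyphen_as_two(sample))   # 6
```

On a slide deck full of compound terms, the gap between the two rules grows with every hyphen, which is enough to explain two users reporting different totals for the same file.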
Summary
Understanding the tools available and the shortcomings of different approaches to word and character counting is important. Minor word-count differences are probably not worth getting hung up on, and a pragmatic approach is sensible. A few words make little difference to the overall time for translation; it is far more important to consider carefully the type and difficulty of the material. This, of course, is an altogether more skilled task.
