Microsoft Excel
Back to jkp-ads.com

Ron de Bruin
Excel Automation

Microsoft MVP Program

Cleaning "Dirty" Data

Important message to visitors of this page

Ron de Bruin decided to remove all Windows Excel content from his website for personal reasons. If you want to know why, head over to rondebruin.nl.

Luckily, Ron was kind enough to allow me to publish all of his Excel content here.

Most of these pages are slightly outdated and may contain links that don 't work. Please inform me if you find such an error and I'll try to fix it.

Kind regards

Jan Karel Pieterse


To
1. Remove excess spaces,
2. Substitute " " for all CHAR(160), and non printing characters (especially CHAR(10)) and,
3. Remove ".", "?" and "!" from end of text:

These are formulas or derivations from them are especially, but not solely, used for preliminary handling of "bad" text entries and "bad" imported data.

Use:
=TRIM(CLEAN(SUBSTITUTE(LEFT(TRIM(A7),LEN(TRIM(A7))-OR(RIGHT(TRIM(A7))={"?","!","."})),CHAR(160)," ")))

Formula Comments:
Wow! 7 different Text functions and a coerced logical formula using an array structure in a calculation (of length)! Look Mum! No VBA! The above formula is a composite of the two formulas below. We've further explained the construction of the second formula. Note that if used on a large data range, the result will be increased size of workbook and slower re-calculation. In most cases this formula will be used once only and after use you should use Copy > Paste Special > Values > OK. The source data can then be deleted or you can copy result over source data.

To just Remove Excess Spaces and Substitute " " for CHAR(160) and non-printing characters (especially CHAR(10))
Use:
=TRIM(CLEAN(SUBSTITUTE(A7,CHAR(160)," ")))

Formula Comments:
The formula replaces the non breaking character space CHAR(160) by a space. This often causes trouble with data imported from HTML sources. CLEAN removes all non printed characters with the most common "culprit" in Excel being CHAR(10). CHAR(10) is inserted when you use Alt-Enter to force a line wrap. It is not "seen" in the cell it is entered in but will appear as a box in cells that reference it. The logic of the formula is that we TRIM after CLEANing after SUBSTITUTE of CHAR(160).

Just Remove Terminating Punctuation and TRIM
Use:
=LEFT(TRIM(A7),LEN(TRIM(A7))-OR(RIGHT(TRIM(A7))={"?","!","."}))

Formula Comments:
First look at how we determine if there are terminating punctuation characters:
=-OR(RIGHT(TRIM(A7))={"?","!","."})
Note the "-" before the OR. This forces a return of -1 for TRUE and 0 for FALSE. We've used an internal array within the OR function to check for existence of "?", "!" or "." in the TRIMmed target cell. We have to TRIM the target cell in this OR function and elsewhere in the main formula just in case some darned fool has put a space after the punctuation. The use of the internal array structure allows us to cycle through the options efficiently and without a long OR function that tests different values of the same parameter. The OR function will normally return TRUE or FALSE but we negate the function and force to return -1 or 0. Forcing to -1 or 0 allows us to use this element to calculate the LENgth of the TRIMmed formula for use by the LEFT function. This forcing return of logical expressions to is commonly used where addition, subtraction or multiplications by 1 or zero can serve a given objective.

 

Example workbook

Download a workbook created by Norman Harker showing these formulas in operation that allows you to insert test data for "cleaning". Download Example workbook

 

Acknowledgements

Dave McRitchie's website at http://dmcritchie.mvps.org/excel/strings.htm
as a general resource and information repository on string manipulation.