CloverETL's Blog

August 3, 2011

Address Cleansing and Transliteration with CloverETL and AddressDoctor

Filed under: Using CloverETL — Tags: , , , , — Agata Vackova @ 2:55 pm

Data quality usually goes hand in hand with data integration. The new version CloverETL 3.1 has enriched its data cleansing capabilities through integration with AddressDoctor. AddressDoctor contains address and geo data for more than 240 countries all over the globe. Along with correcting and fixing mail addresses, AddressDoctor can also be used for transliteration of non-Latin writing systems into Latin characters or enriching addresses with latitude and longitude information.

CloverETL integrates AddressDoctor software through a dedicated AddressDoctor component. In all cases you need to have java library AddressDoctor5.jar, native libraries (they all need to be on the class-path when running a graph) and country databases. You also need unlock codes for your databases.

AddressDoctor component has 4 required parameters:

  • Configuration is driven by configXml or configFileparameter.  The simplest configuration file can look as follows:
           <?xml version="1.0" encoding="UTF-8"?>
           <SetConfig>
             <General MaxAddressObjectCount="10"/>
             <UnlockCode><Your unlock code here></UnlockCode>
             <DataBase CountryISO3="ALL" Path="/home/user/AD/db"/>
           </SetConfig>
  • Parameters can be set by parameterXml or parameterFileattribute. The simplest parameter file can look as follows:
           <?xml version="1.0" encoding="UTF-8"?>
           <Parameters>
             <Process Mode="PARSE"/>
             <Input/>
             <Result/>
           </Parameters>
  • Input mapping defines mapping between Clover input fields and AddressDoctor address properties.
  • Output mapping defines mapping between AddressDoctor output address properties and your output.

Transliteration example

Imagine that you have addresses all over the world saved in their original languages. For some languages, you would not even recognize them– to say nothing of their correct reading. Let’s try with following addresses:

Tomáš Novák;Vohradského 5;Česká Lípa;Czech republic
Hans-Peter Feiertag;Metro-Straße 1;Düsseldorf 4;Deutschland
Michał Dąbrowski;Marszałkowska 142;00-132 Warszawa;Polska
Борис Николаевич Ельцин; Казанская пл., 23;Санкт-Петербург;RUSSIAN FEDERATION
John Smith;100 Main Street;New York NY 10023;USB

To transliterate above addresses to Ascii, we need to set following parameters:

  • Mode to PARSE to transliterate without processing the address, Optimization level to STANDARD as it produces the best results. Both in Process tab of Parameters attribute:
  • Encoding to UTF-16 in Input tab of Parametersattribute:
  • Preferred language to ENGLISH and Preferred script to ASCII_SIMPLIFIED in Result tab of Parameters attribute:

Now we need to feed the Addressdoctor with input mapping. Input metadata:

Corresponds with Contact name, Street complete, Locality complete and Country name address properties:

–>

Similar mapping needs to be provided for output properties:

–>

We can also connect the Error port for invalid (unrecognised) addresses:

Now we can run our graph getting following results:

Tomas Novak;Vohradskeho;Ceska Lipa;CZECH REPUBLIC
Hans-Peter Feiertag;Metro-Strase;Dusseldorf 4;GERMANY
Michal Dabrowski;Marszalkowska;Warsaw;POLAND
Boris Nikolaevic Elcin;Kazanskaa Pl.;Sankt-Peterburg;RUSSIAN FEDERATION

But one address is missing. Look at the error port:

N1 means Validation Error: No validation performed because country was not recognized

The output can be even improved by setting ASCII_EXTENDED as Preferred script in Result tab of Parameters attribute:

Tomash Novak;Vohradskeho;Cheska Lipa;CZECH REPUBLIC
Hans-Peter Feiertag;Metro-Strasse;Duesseldorf 4;GERMANY
Michal Dabrowski;Marszalkowska;Warsaw;POLAND
Boris Nikolaevich Elcin;Kazanskaya Pl.;Sankt-Peterburg;RUSSIAN FEDERATION

Now the transcription corresponds with pronunciation.

Enrichment example

If you have proper database, you can enrich addresses with data you don’t know, e.g. ZIP code, geocoding, certification status.

Let’s consider following addresses:

To validate an address according to the Canada Post SERP rules we would need certified database for Canada. We need to set in Configuration as well as in Parameters:

and

After AddressDoctor transformation we get:

Count ResultNumber ElementItem Country Province Locality PostalCode Street Number SERPStatus SERPCategory ResultProcessStatus ResultModeUsed ResultPreferredScript ResultPreferredLanguage ResultDataMailabilityScore ResultDataElementResultStatus ResultDataElementInputStatus ResultDataElementRelevance
1 1 1 CANADA ON TORONTO M4P 3J5 YONGE STREET 2384 ESE1 C C3 CERTIFIED ASCII_SIMPLIFIED DATABASE 3 80F0F0F0F000004000E0 00606060600000200060 10101010100000100010
1 1 1 CANADA SK LA RONGE FINLAYSON ST 108 ESE1 N I1 CERTIFIED ASCII_SIMPLIFIED DATABASE 0 000000000000000000E0 00601010100000000060 00000000000000000000
1 1 1 CANADA SK PRINCE ALBERT S6V 0C7 15TH ST E 801 ESE1 C C4 CERTIFIED ASCII_SIMPLIFIED DATABASE 4 80F0F0F0F00000F000E0 00606060600000600060 10101010100000100010

Note the fields:

  • PostalCode – for recognized addresses we get valid postal code
  • SERPCategory :
    • C: Corrected
    • N: Incorrect
  • ResultProcessStatus and ResultDataMailabilityScore:
    • C3 (3) : Corrected – but some elements could not be checked (should be fine)
    • I1 (0) : Data could not be corrected and is pretty unlikely to be delivered (futile)
    • C4 (4) : Corrected – all (postally relevant) elements have been checked (almost certain)

Download the transformation graph with data

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s

Theme: Shocking Blue Green. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.