CloverETL's Blog

September 6, 2011

Handling Errors in Heterogeneous Input Data

Filed under: Using CloverETL — Tags: , , — stysm @ 11:17 am

ComplexDataReader is a powerful new component in CloverETL meant for reading elaborate heterogeneous data. However, all data cannot be read easily even if you spend a lot of time configuring the component. Sometimes you need to think in advance: What if you come across unknown metadata you have not handled? Normally, the graph crashes.

This post will examine a way of preventing that or, more specifically, how to handle errors in input data.

Example Input Data

Input Data

What We Will Do

We can instantly distinguish three kinds of metadata on the input: product, product_range and service. ComplexDataReader is the best component to parse these using three states of a state machine. As you can see, there is one line that does not fit into the data. The magic trick of this example lies in preparing one extra state – the error state. The state will be responsible for “catching” all incorrect data which would cause the component to fail. In order to be able to decide which data are “bad,” or, more precisely, when to switch to the error state, you have to write a custom Selector class in Java. The idea behind the code is very simple and will be explained below:

“Prep Work”

First, we need to prepare metadata for all three states of the state machine plus one extra. The extra metadata will represent error lines on the input we need to “throw away.”

Second, do not forget to connect the component to its succeeding components and assign metadata to output edges.

Third, set the “File URL” property to point the component to the input file.

Here are the three aforementioned metadata:

Metadata: Product

Metadata: Service

Metadata: Product Range

And one extra metadata for error lines:

Metadata for Error Lines

Designing  State Machine

We are going to create four states:

Note: There are no transition edges to be seen in the graph. It is because the Selector itself will decide when to change between states.

Start configuring the component via the “Transform” property. Create four states corresponding to the metadata and set “Initial state” to “Let selector decide”:

Switch to state “$0 product” and define its output mapping. In this state, we will send all fields to the output. Thus, drag state $0 to the “Value” column in the right-hand pane. You will produce the “$0.*” directive. In the “Transition table”, switch “Target state” to “Let selector decide”:

Repeat the same procedure for all remaining states (including the error state). Always send everything to the output port and “Let selector decide” about the target state:

Writing Custom Selector

We are now going to prepare a Java class that will do the magic of this example – switch between states “$0 product”, “$1 service”, “$2 product_range” and the “$3 error” state in case there are errors on reading. This particular prefix Selector will assume there is another record on the following line(s) and will try to read it. If there really is a new record, we can recover from the error line and carry on reading.

You can prepare the Java class in any editor of your choice. After writing it, just remember to place it into the “trans” folder of your project. On that condition, CloverETL will automatically compile the class for you.

The Selector class will look like this:

public class CustomPrefixInputMetadataSelector1 extends com.opensys.cloveretl.component.complexdatareader.PrefixInputMetadataSelector {

	private static final int DEFAULT = 3;

	@Override
	public int select(int prevState) {
		int result = super.select(prevState);
		if(result == org.jetel.component.RecordTransform.ALL) {
			return DEFAULT;
		}
		return result;
	}
}

A few comments concerning the code:

  • int result = super.select(prevState);
    First, we try to call the default selector and store the number of the next state into result.
  • if(result == org.jetel.component.RecordTransform.ALL)
    And if the default selector cannot decide…
  • return DEFAULT;
    We return the default state number – number 3. This is the error state.

Now that you are done with the code, switch to the “Selector” tab in “State transitions”. In “Selector URL”, browse for your custom Selector. Notice that after you specify its location, the “Selector properties” area changes:

Conclusions & Pitfalls

In this article, we have presented a way of handling flaws in the input data. We have been capable of addressing a situation when the selector looks on the following metadata and cannot decide which state goes next.

However, there are numerous cases when you just cannot prevent reading errors from occurring. For instance, even if the selector recognizes the following metadata but then fails on parsing them, we cannot react and the graph fails. You can imagine that as a file whose field types suddenly change, (e.g. from integer to date – the selector starts parsing an integer and crashes). Another known case we cannot handle is changeable number of fields in one record. If new fields occur or their number decreases, the graph execution fails. The only exception to this are fields added at the end of a record. These can be handled with the help of lenient data policy.

Download a complete CloverETL project – error handling in ComplexDataReader

August 18, 2011

Usability Improvements in CloverETL 3.1

Filed under: Using CloverETL — Tags: , , , — tichyj @ 1:16 pm

One of the most noticeable set of changes in CloverETL version 3.1 is the interface improvements, substantially improving Clover’s usability and understandability. These improvements save both new and old users valuable time when creating or manipulating their data transformation graphs and further cement CloverETL’s place as one of the most easy to use ETL tools on the market.

The biggest improvement was the addition of drag-and-drop functionality to a number of different aspects of Clover. You can drag files to the graph, files to components, files to metadata, and metadata to edges, saving innumerable clicks through menus.

We have also made it easier to link your metadata and edges while creating the edges. If you right-click on the Edge tool in your palette, it will give you a list of every metadata you have created on the current graph. If you select one of the metadata, whenever you create an edge with the edge tool, it will automatically assign that metadata to the edge.

Not only is it easier to link metadata and edges, we’ve also made it easier to create and manipulate the edges themselves. Edges can now be created simply by dragging from one component’s out port to another’s in port. If you find you want to change where the edge is connected, that too is now one-click. Simply click and drag an edge’s endpoints to any other port.

The last shortcut that version 3.1 added to CloverETL is an easier way to set the description on a component. Before, the description field was buried in the component’s properties, but now it has been moved to the header of the properties window. This improvement makes it substantially easier to clarify the purpose of your components, making your graph easier to read overall.

August 3, 2011

Address Cleansing and Transliteration with CloverETL and AddressDoctor

Filed under: Using CloverETL — Tags: , , , , — Agata Vackova @ 2:55 pm

Data quality usually goes hand in hand with data integration. The new version CloverETL 3.1 has enriched its data cleansing capabilities through integration with AddressDoctor. AddressDoctor contains address and geo data for more than 240 countries all over the globe. Along with correcting and fixing mail addresses, AddressDoctor can also be used for transliteration of non-Latin writing systems into Latin characters or enriching addresses with latitude and longitude information.

CloverETL integrates AddressDoctor software through a dedicated AddressDoctor component. In all cases you need to have java library AddressDoctor5.jar, native libraries (they all need to be on the class-path when running a graph) and country databases. You also need unlock codes for your databases.

AddressDoctor component has 4 required parameters:

  • Configuration is driven by configXml or configFileparameter.  The simplest configuration file can look as follows:
           <?xml version="1.0" encoding="UTF-8"?>
           <SetConfig>
             <General MaxAddressObjectCount="10"/>
             <UnlockCode><Your unlock code here></UnlockCode>
             <DataBase CountryISO3="ALL" Path="/home/user/AD/db"/>
           </SetConfig>
  • Parameters can be set by parameterXml or parameterFileattribute. The simplest parameter file can look as follows:
           <?xml version="1.0" encoding="UTF-8"?>
           <Parameters>
             <Process Mode="PARSE"/>
             <Input/>
             <Result/>
           </Parameters>
  • Input mapping defines mapping between Clover input fields and AddressDoctor address properties.
  • Output mapping defines mapping between AddressDoctor output address properties and your output.

Transliteration example

Imagine that you have addresses all over the world saved in their original languages. For some languages, you would not even recognize them– to say nothing of their correct reading. Let’s try with following addresses:

Tomáš Novák;Vohradského 5;Česká Lípa;Czech republic
Hans-Peter Feiertag;Metro-Straße 1;Düsseldorf 4;Deutschland
Michał Dąbrowski;Marszałkowska 142;00-132 Warszawa;Polska
Борис Николаевич Ельцин; Казанская пл., 23;Санкт-Петербург;RUSSIAN FEDERATION
John Smith;100 Main Street;New York NY 10023;USB

To transliterate above addresses to Ascii, we need to set following parameters:

  • Mode to PARSE to transliterate without processing the address, Optimization level to STANDARD as it produces the best results. Both in Process tab of Parameters attribute:
  • Encoding to UTF-16 in Input tab of Parametersattribute:
  • Preferred language to ENGLISH and Preferred script to ASCII_SIMPLIFIED in Result tab of Parameters attribute:

Now we need to feed the Addressdoctor with input mapping. Input metadata:

Corresponds with Contact name, Street complete, Locality complete and Country name address properties:

–>

Similar mapping needs to be provided for output properties:

–>

We can also connect the Error port for invalid (unrecognised) addresses:

Now we can run our graph getting following results:

Tomas Novak;Vohradskeho;Ceska Lipa;CZECH REPUBLIC
Hans-Peter Feiertag;Metro-Strase;Dusseldorf 4;GERMANY
Michal Dabrowski;Marszalkowska;Warsaw;POLAND
Boris Nikolaevic Elcin;Kazanskaa Pl.;Sankt-Peterburg;RUSSIAN FEDERATION

But one address is missing. Look at the error port:

N1 means Validation Error: No validation performed because country was not recognized

The output can be even improved by setting ASCII_EXTENDED as Preferred script in Result tab of Parameters attribute:

Tomash Novak;Vohradskeho;Cheska Lipa;CZECH REPUBLIC
Hans-Peter Feiertag;Metro-Strasse;Duesseldorf 4;GERMANY
Michal Dabrowski;Marszalkowska;Warsaw;POLAND
Boris Nikolaevich Elcin;Kazanskaya Pl.;Sankt-Peterburg;RUSSIAN FEDERATION

Now the transcription corresponds with pronunciation.

Enrichment example

If you have proper database, you can enrich addresses with data you don’t know, e.g. ZIP code, geocoding, certification status.

Let’s consider following addresses:

To validate an address according to the Canada Post SERP rules we would need certified database for Canada. We need to set in Configuration as well as in Parameters:

and

After AddressDoctor transformation we get:

Count ResultNumber ElementItem Country Province Locality PostalCode Street Number SERPStatus SERPCategory ResultProcessStatus ResultModeUsed ResultPreferredScript ResultPreferredLanguage ResultDataMailabilityScore ResultDataElementResultStatus ResultDataElementInputStatus ResultDataElementRelevance
1 1 1 CANADA ON TORONTO M4P 3J5 YONGE STREET 2384 ESE1 C C3 CERTIFIED ASCII_SIMPLIFIED DATABASE 3 80F0F0F0F000004000E0 00606060600000200060 10101010100000100010
1 1 1 CANADA SK LA RONGE FINLAYSON ST 108 ESE1 N I1 CERTIFIED ASCII_SIMPLIFIED DATABASE 0 000000000000000000E0 00601010100000000060 00000000000000000000
1 1 1 CANADA SK PRINCE ALBERT S6V 0C7 15TH ST E 801 ESE1 C C4 CERTIFIED ASCII_SIMPLIFIED DATABASE 4 80F0F0F0F00000F000E0 00606060600000600060 10101010100000100010

Note the fields:

  • PostalCode – for recognized addresses we get valid postal code
  • SERPCategory :
    • C: Corrected
    • N: Incorrect
  • ResultProcessStatus and ResultDataMailabilityScore:
    • C3 (3) : Corrected – but some elements could not be checked (should be fine)
    • I1 (0) : Data could not be corrected and is pretty unlikely to be delivered (futile)
    • C4 (4) : Corrected – all (postally relevant) elements have been checked (almost certain)

Download the transformation graph with data

July 21, 2011

Processing Heterogeneous Data with ComplexDataReader

Filed under: Using CloverETL — Tags: , , — javlinkrivanekm @ 8:25 am

ComplexDataReader – Example How-to

ComplexDataReader is a new component for reading heterogeneous data (data which contains multiple types of records that can also depend on each other) without the need of hard coding. Instead, the component is driven by a state machine which can be set-up using the GUI.

The following example will present some of the capabilities of ComplexDataReader, as well as guide you through the design of a simple automaton, which is used for processing a text file containing two types of shipments grouped into batches. Each batch starts with a batch header; the number of items in a batch is variable and it is part of the header.

Input Data

Input Data for ComplexDataReader

What We Want to Achieve

For every parcel and every letter, send to the output the address and the charge to the output, also add the batch ID, customer ID, and the date from the respective batch header.

The first element of a batch header determines the type of its elements, and the third element contains the number of items in the batch.

Preparation

Before starting the configuration of the component, all the required metadata should be defined. Also, the component should be connected to the succeeding component(s) and the output edge(s) should have metadata assigned.

You may also set the “File URL” property of the component to point to the input file.

Internal metadata (used for parsing the input):

Batch Metadata

Parcel Metadata

Letter Metadata

Output metadata (used for output mapping):

Shipment Metadata

ComplexDataReader Configuration

First, we have to design an automaton, which will guide the component through parsing the input. The automaton may look like this:

ComplexDataReader Automaton

The idea behind it is that we start by reading a batch header, therefore the initial state is set to “$0 – Batch”. Then we can decide, depending on the value of the “type” field, whether to proceed to “$1 – Letter” or “$2 – Parcel”. In either of these states, we read as many records as specified in the “count” field of the previous batch header, then return to “$0 – Batch” and expect a new batch header.

To start building the automaton, open the configuration dialog by double clicking the component and then its “Transform” property.

Create three states by dragging the “batch”, “letter” and “parcel” metadata, respectively, from the list of Available Metadata on the left to the list of States on the right. You can also edit the labels of the states. Set the Initial state to “State $0″ by selecting it from the drop-down list.

Optionally, you may switch to the Overview tab and press the Undock button to get an interactive overview of the automaton being built.

Switch to the State $0 tab. This state represents a new batch. Set the automaton to reset the counters for state $1 and $2 by pressing the Actions button and ticking Reset counter for “State $1″ and “State $2″. Add two rows to the Transition table. Set the condition of the first row to $batch.type == "LETTERS" and the condition of the second row to $batch.type == "PARCELS". Set their target states to “State $1″ and “State $2″, respectively. You may also set the target of the default transition to Fail to detect unexpected batch types.

Note that in state $0, no output mapping is defined; hence no data will be sent to the output.

The configuration of state $1 and $2 will be very similar. In these states we want to produce output, therefore we have to define output mapping. For example, in state $1 we need to send to the output “address” and “charge” fields from internal record $1 (last letter record) and “batchID”, “customerID” and “date” from internal record $0 (last batch header record).

For state $1, define Output mapping by dragging row “$1″ from the left table onto “Port 0″ in the right table. Then expand row $0 on the left and Port 0 on the right and drag “batchID”, “customerID” and “date” from the left onto “$0.batchID”, “$0.customerID” and “$0.date” on the right, respectively.

Add one row to the Transition table and set its condition to counter1 < $batch.count and its target to “State $1″. Also set the target of the default transition to “State $0″.

Letter State

Similarly, for state $2, drag row “$2″ onto “Port 0″ and “batchID”, “customerID” and “date” from row $0 onto “$0.batchID”, “$0.customerID” and “$0.date”. Add one row to the Transition table and set its condition to counter2 < $batch.count and its target to “State $2″. Again, set the target of the default transition to “State $0″.

Parcel State

Download the transformation graph with data

June 21, 2011

Speed-up Installation of Plugins in Eclipse

Filed under: Others, Using CloverETL — Tags: , — Jaroslav Urban @ 12:56 pm

Installing plug-ins into Eclipse can in some situations take a long time. This can also affect users of CloverETL Designer if they choose the Online or Offline Eclipse Plugin Installation download type. This blog post describes a workaround for the slow install process.

The reason for the long installation time is that by default Eclipse contacts all available update sites to try to resolve dependencies of the plugin being installed. There can be a large number of update sites, some can be not responding or slow and overall the connection can be bad. To disable contacting of all update sites, uncheck the “Contact all update sites during install to find required software” checkbox in the installation dialog:

Contact all update sites checkbox

This workaround can help not only when installing CloverETL Designer, but also when using Eclipse in general. However, it does have a drawback – some dependencies of the plugin being installed might not be resolved. This can also happen when installing CloverETL Designer, because it depends on the GEF and RSE plugins. The plugins are found in the main Eclipse update site which would not be contacted when using the described workaround. Eclipse will detect that some dependecies of CloverETL Designer are not met and will not proceed with the installation:

CloverETL dependencies not met

In case some dependencies are not resolved, there are 2 options:

  •  find and install the dependencies manually (in case of GEF and RSE they can be found in the main Eclipse update site)
  •  accept the long installation time and enable the checkbox back. Eclipse will resolve the dependencies automatically

Hopefully this hint will help some users of CloverETL Designer or Eclipse with the slow installation. However it’s important to understand that it’s NOT mandatory to use the workaround as the installation is quick in many cases – use it only in case of issues.

May 10, 2011

Handling of JSON Objects in CloverETL

Filed under: Using CloverETL — Tags: , — Agata Vackova @ 2:14 pm

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is language independent but uses conventions that are familiar to programmers. This format is often used for serializing and transmitting structured data over a network connection. It is primarily used to transmit data between a server and a web application, serving as an alternative to XML. CloverETL doesn’t have any JSONReader/Writer components, but has numerous components that can handle XML files. So all you need to do is to convert the JSON structure to an XML one or vice versa.  If you have a XSLT 1.0 stylesheet that transforms XML file into a JSON object, you can use the XSLTransformer component. I downloaded xml2json-xslt and created a transformation graph with only one component:

That converts an XML file into a JSON structure.

Another solution is to use JSON Java API and create a transformation in Java as an attribute of Reformat component. Then simple transformations need to be written:

  • To transform XML to JSON:

import org.jetel.component.DataRecordTransform;
import org.jetel.data.DataRecord;
import org.jetel.exception.ComponentNotReadyException;
import org.jetel.exception.TransformException;
import org.json.JSONException;
import org.json.XML;

public class XML2JSON extends DataRecordTransform {

    int counter;

    @Override
    public void preExecute() throws ComponentNotReadyException {
        super.preExecute();
        counter = 1;
    }

    @Override
    public int transform(DataRecord[] inputRecords, DataRecord[] outputRecords)
            throws TransformException {
        try {
            outputRecords[0].getField(0).setValue(XML.toJSONObject(inputRecords[0].getField(0).toString()).toString());
        } catch (JSONException e) {
            throw new TransformException("Can't convert XML to JSON.", e, counter, 0);
        }
        counter++;
        return 0;
    }

}
  • To transform JSON to XML:

import org.jetel.component.DataRecordTransform;
import org.jetel.data.DataRecord;
import org.jetel.exception.ComponentNotReadyException;
import org.jetel.exception.TransformException;
import org.json.JSONException;
import org.json.JSONObject;
import org.json.XML;

public class JSON2XML extends DataRecordTransform {
    int counter;

    @Override
    public void preExecute() throws ComponentNotReadyException {
        super.preExecute();
        counter = 1;
    }

    @Override
    public int transform(DataRecord[] inputRecords, DataRecord[] outputRecords)
            throws TransformException {
        try {
            outputRecords[0].getField(0).setValue(XML.toString(new JSONObject(inputRecords[0].getField(0).toString())));
        } catch (JSONException e) {
            throw new TransformException("Can't convert XML to JSON.", e, counter, 0);
        }
        counter++;
        return 0;
    }

}

Since CloverETL 3.1 the implementation above is going to be embedded into CloverETL Engine in the form of xml2json and json2xml ctl functions. Then your transformations are easy indeed:

  • To transform XML to JSON:

  • To transform JSON to XML:

Above transformation converts following XML structure:

<?xml version="1.0" encoding="ISO-8859-1"?>
<employees>
    <employee>
        <empID>1</empID>
        <jobID>3</jobID>
           <salary>5000</salary>
        <name>
            <firstname>Mark</firstname>
            <surname>Fish</surname>
        </name>
        <child>
            <chname>Ann</chname>
            <age>4</age>
        </child>
        <child>
            <chname>Mark</chname>
            <age>6</age>
        </child>
        <benefits>
            <car>none</car>
            <mobilephone>yes</mobilephone>
            <financial>
                <monthly_bonus>1000</monthly_bonus>
                <yearly_bonus>0</yearly_bonus>    
            </financial>
        </benefits>
        <project>
            <projName>JSP</projName>
            <projManager>John Smith</projManager>
            <inProjectID>34</inProjectID>
            <Start>06062006</Start>
            <End>in progress</End>
            <customer>
                            <name>Sunny</name>
            </customer>
            <customer>
                <name>Weblea</name>
            </customer>
        </project>
        <project>
            <projName>Data warehouse</projName>
            <projManager>John Major</projManager>
            <inProjectID>51</inProjectID>
            <Start>01062005</Start>
            <End>31052006</End>
            <customer>
                <name>Hanuman</name>
            </customer>
            <customer>
                <name>Weblea</name>
            </customer>
            <customer>
                <name>SomeBank</name>
            </customer>
        </project>
    </employee>
</employees>

into the following JSON object (needs to be formatted manually):

{"employees":
    {"employee":{
        "child":[
            {"chname":"Ann","age":4},
            {"chname":"Mark","age":6}
        ],
        "project":[{
            "End":"in progress",
            "projManager":"John Smith",
            "Start":6062006,
            "inProjectID":34,
            "projName":"JSP",
            "customer":[
                {"name":"Sunny"},
                {"name":"Weblea"}
            ]},{
            "End":31052006,
            "projManager":"John Major",
            "Start":1062005,
            "inProjectID":51,
            "projName":"Data warehouse",
            "customer":[
                {"name":"Hanuman"},
                {"name":"Weblea"},
                {"name":"SomeBank"}
            ]}
        ],
        "jobID":3,
        "empID":1,
        "name":
            {"surname":"Fish","firstname":"Mark"},
        "benefits":{
            "financial":
                {"monthly_bonus":1000,"yearly_bonus":"0"},
            "car":"none",
            "mobilephone":"yes"},
        "salary":5000}
    }
}

Download the example transformation graph

April 12, 2011

Usage of Internal and External Graph Elements

Filed under: Using CloverETL — Tags: , — tomaswaller @ 9:20 am

GRAPH ELEMENTS

In addition to components, edges, notes, etc., any CloverETL graphs may contain the following graph elements:

  • Metadata
  • Connections
  • Lookup tables
  • Sequences
  • Parameters

Each of these graph elements may be either created and written in the graph XML source file, or specified in an external file and linked to the graph. Such elements are called internal, or external, respectively. In case of external elements, graph XML file only contains a link to such file with graph element definition.

Internal Graph Elements

Graph XML file can contain internal graph elements that may look like those in the following examples:

 <Metadata id="Metadata1"> <Record fieldDelimiter="|" name="LUT" recordDelimiter="\r\n" type="delimited"> <Field name="field1" type="integer"/> <Field name="field2" type="string"/> </Record> </Metadata> <Connection database="POSTGRE" dbURL="jdbc:postgresql://hostname/database" id="JDBC0" jdbcSpecific="POSTGRE" name="NewConnection" password="mypassword" type="JDBC" user="username"/> <LookupTable charset="ISO-8859-1" id="LookupTable0" initialSize="512" key="field1" keyDuplicates="true" metadata="Metadata1" name="simpleLookup0" type="simpleLookup"/> <Sequence cached="1" fileURL="${SEQ_DIR}/seq_withdata.txt" id="Sequence0" name="seq" start="1" step="1" type="SIMPLE_SEQUENCE"/> <Property id="GraphParameter1" name="NUMBER" value="6"/>

(In addition to such internal graph elements, graph contains a link to the workspace.prm file defining project parameters – “CONN_DIR”, “META_DIR”, etc.)

<Property fileURL="workspace.prm" id="GraphParameter0"/>

The resulting Outline pane looks like this:

Internal Graph Elements

External Graph Elements

When you want to link an external graph element, you need to right-click any of the five categories (Metadata, Connections, Parameters, Sequences, or Lookups) in the Outline pane and select in the context menu:

  • New metadata → Link shared definition
  • Connections → Link DB connection, Link JMS connection, or Link QuickBase connection
  • Parameters → Link parameter file
  • Sequences → Link shared sequence
  • Lookup tables → Link shared lookup table

After that, the URL Dialog opens.

URL Dialog

The dialog consists of three tabs: Workspace view, Local files, and Remote files tabs. They serve for browsing local files within your workspace, outside your workspace, or files located on a remote computer, respectively.

Files with graph elements are located in the same project as the graph itself.

To browse your project, you need to use the Workspace view tab of the URL Dialog.

The workspace.prm file in the project itself contains the definition of parameters for the conn, meta, lookup, seq directories (among others). They are designed as “CONN_DIR”, “META_DIR”, “SEQ_DIR”, and “LOOKUP_DIR”, respectively. The value of each of these parameters is resolved to its value using a dollar and curly brackets. For example, “${META_DIR}”.

Once you link a file containing graph element definition located in the project, the “_DIR” termination of these names assures that the path to such files located in the conn, meta, lookup, or seq directories is automatically replaced with one of these parameters.

The graph XML file contains the following links to metadata, db connection, simple sequence, simple lookup table, and parameter files:

<Metadata fileURL="${META_DIR}/LUT.fmt" id="Metadata1"/> <Connection dbConfig="${CONN_DIR}/NewConnection.cfg" id="JDBC0" type="JDBC"/> <LookupTable id="LookupTable0" lookupConfig="${LOOKUP_DIR}/simpleLookup0.cfg"/> <Sequence id="Sequence0" seqConfig="${SEQ_DIR}/seq.cfg"> <attr name="type"><![CDATA[SIMPLE_SEQUENCE]]></attr> </Sequence> <Property fileURL="parameters.prm" id="GraphParameter1"/>

(In addition to the mentioned graph elements, graph contains a link to the workspace.prm file defining project parameters – “CONN_DIR”, “META_DIR”, etc.)

<Property fileURL="workspace.prm" id="GraphParameter0"/>

The resulting Outline pane looks like this:

External Graph Elements

Files with graph elements are located outside the project containing the graph.

To locate external graph elements OUTSIDE the project, you need to use the Local files tab of the URL Dialog. It allows to browse the file system of your local computer.

You can also define a parameter for the whole path with the mentioned “_DIR” termination. If you have defined such parameter (e.g, “PATH_TO_GRAPH_ELEMENTS_DIR”), the whole path will automatically be replaced with this parameter (as ${PATH_TO_GRAPH_ELEMENTS_DIR}).

Remember that you need to define such parameter BEFORE you link a graph element in the specified location, otherwise, the whole path will not be replaced with the parameter name!

The whole path to the graph elements is specified in the graph XML file. The graph XML file contains the following links to metadata, db connection, simple sequence, simple lookup table, and parameter files:

<Metadata fileURL="D:/ExternalGraphElements/meta/HashJoinInput.fmt" id="Metadata0"/> <Metadata fileURL="${PATH_TO_GRAPH_ELEMENTS_DIR}/meta/LUT.fmt" id="Metadata1"/> <Connection dbConfig="${PATH_TO_GRAPH_ELEMENTS_DIR}/NewConnection.cfg" id="JDBC0" type="JDBC"/> <LookupTable id="LookupTable0" lookupConfig="${PATH_TO_GRAPH_ELEMENTS_DIR}/simpleLookup0.cfg"/> <Sequence id="Sequence0" seqConfig="${_TO_GRAPH_ELEMENTS_DIR}/seq.cfg"> <attr name="type"><![CDATA[SIMPLE_SEQUENCE]]></attr> </Sequence> <Property fileURL="parameters.prm" id="GraphParameter1"/>

The last mentioned parameters.prm file defines two parameters:

NUMBER, whose value is 6

and

PATH_TO_GRAPH_ELEMENTS_DIR, whose value is D:/ExternalGraphElements.

(In addition to the mentioned graph elements, graph contains a link to the workspace.prm file defining project parameters – “CONN_DIR”, “META_DIR”, etc.)

<Property fileURL="workspace.prm" id="GraphParameter0"/>

Note:

Note that the value of the PATH_TO_GRAPH_ELEMENTS_DIR parameter is D:/ExternalGraphElements.

  • The first external metadata element (with ”Metadata0” id) was linked BEFORE the PATH_TO_GRAPH_ELEMENTS_DIR parameter was defined. The whole path was NOT converted into ${PATH_TO_GRAPH_ELEMENTS}/…
  • But the next elements (with “Metadata1” id, and all of the other elements) were automatically converted to the ${PATH_TO_GRAPH_ELEMENTS_DIR}/… AS SOON AS they were linked to the graph.

The resulting Outline pane looks like this:

External Outside Project

Files with graph elements are located on a remote computer.

To locate the files on remote computer, you need to use the Remote files tab of the URL Dialog. It allows to specify details of the remote computer file system.

To connect a remote computer, click the Create/Edit URL button at the right side from the Server URL combobox. In the Edit URL Dialog that opens, you need to specify all the authentication details:

Edit URL Dialog

The general structure of a remote path is:

<protocol>://<username>:<password>@<hostname|IP>:<portnumber>/<pathtoexternalelements>

Supported protocols are: http, https, ftp, ftps, sftp. The first two do not allow browsing the remote file system, whereas the other four allow it.

The whole path to the graph elements is specified in the graph XML file. The graph XML file contains the following links to metadata, db connection, simple sequence, simple lookup table, and parameter files:

<Metadata fileURL="sftp://smithjohn:1a2b3c@192.168.1.12/home/smithjohn/meta/LUT.fmt" id="Metadata1"/>

<Connection dbConfig="sftp://smithjohn:1a2b3c@192.168.1.12/home/smithjohn/conn/NewConnection.cfg" id="JDBC0" type="JDBC"/>

<LookupTable id="LookupTable0"
lookupConfig="sftp://smithjohn:1a2b3c@192.168.1.12/home/smithjohn/lookup/simpleLookup0.cfg"/>

<Sequence id="Sequence0" seqConfig="://smithjohn:1a2b3c@192.168.1.12/home/smithjohn/seq/seq.cfg">
<attr name="type"><![CDATA[SIMPLE_SEQUENCE]]></attr>
</Sequence>

<Property fileURL="parameters.prm" id="GraphParameter1"/>

The last mentioned parameters.prm file defines two parameters:

NUMBER, whose value is 6

and

PATH_TO_REMOTE_GRAPH_ELEMENTS_DIR, whose value is sftp://smithjohn:1a2b3c@192.168.1.12/home/smithjohn/ExternalGraphElements.

(In addition to the mentioned graph elements, graph contains a link to the workspace.prm file defining project parameters – “CONN_DIR”, “META_DIR”, etc.)

<Property fileURL="workspace.prm" id="GraphParameter0"/> 

The resulting Outline pane without parameter usage looks like this:

Remote Graph Elements

Note:

Remember that the paths to linked external elements located on a remote computer do NOT use automatically the ${PATH_TO_REMOTE_GRAPH_ELEMENTS_DIR} value instead of sftp://smithjohn:1a2b3c@192.168.1.12/home/smithjohn/ExternalGraphElements.

You may switch to the Source tab of your graph and replace the sftp://smithjohn:1a2b3c@192.168.1.12/home/smithjohn/ExternalGraphElements with ${PATH_TO_REMOTE_GRAPH_ELEMENTS_DIR} by hand.

Thus, after replacing the paths in the Source tab of the Graph editor with the value of graph parameter, the Outline pane looks like this:

Remote Graph Elements With Parameter

COMPARISON OF INTERNAL AND EXTERNAL GRAPH ELEMENTS

All the graph elements, both the internal and the external, may be converted into the other form. Any internal element may become an external one, and vice versa.

For more details consult our documentation.

VARIOUS FORMATS OF GRAPH ELEMENTS MAY BE USED AT THE SAME TIME

Remember that you can use all forms of graph elements in a single graph: internal, external, located on local computer, and external elements, located remotely, accessible via various protocols.

March 22, 2011

Launch Services – Part 2 – Configuration

Filed under: Using CloverETL — Tags: , — csochor @ 1:42 pm

In the last blog post, you learned what the Launch Services are. In this post you will see how to configure them.

Let us study an example scenario to become acquainted with configuration. We have a database containing the highest mountains on Earth along with their heights. The user enters an elevation above sea-level and hits the enter key. The Excel sheet is then displayed listing all mountains with the given minimal elevation.

Mountains example of Launch Services

How to Configure It?

First, we must create a transformation graph that uses a dictionary to receive parameters and to store results. Create a new graph in CloverETL Designer. In the outline pane, right click on Dictionary and choose Edit. Add a new entry named heightMin: with the “As Input” field set to true, and “type” set to Integer. Then add another entry named mountains.xls of type writable.channel, content type text/csv, and “As Output” set to true.

Dictionary

Now we may build a transformation graph. Components can use a dictionary in three different ways:

  1. Via file URL: data readers and writers may specify a File URL in the format dictionary:field-name. In our example, we set a data writer File URL to dictionary:mountains.xls.URL Dialog
  2. In CTL: anywhere in CTL code, we can use an expression of type dictionary.field-name to read or write the dictionary. In our example we use Filter expression $0.heightM >= dictionary.heightMin
  3. In Java code: using methods transformationGraph.getDictionary().getValue(String fieldName) and transformationGraph.getDictionary().setValue(String fieldName, Object value)

When a transformation graph is designed and ready, we must publish it as a Launch Service. In CloverETL Server administration, go to section Launch Services and click New launch configuration. Now enter a name, a sandbox and a graph name. Then open the Detail page for the new service, and click on Edit Parameters tab. Create a new parameter with heightMin name.

CloverETL Server Interface

Now we may test it. When we click a test link, the server generates a simple form which executes a launch service. We can copy, customize and use this form in a web site.

Test Page

March 3, 2011

Versatility of the File URL Attribute Used in Readers of CloverETL

Filed under: Using CloverETL — Tags: , , , — tomaswaller @ 9:42 am

CloverETL allows users to read several different kinds of files. These files may have various formats, they can be located on a local or remote computer, they can be accessed through a proxy, and they can also be compressed into zip, gzip, or tar archives. Users can also read data from the Console, from an input Port, or from a selected Dictionary entry.

A File URL must be specified using the URL dialog.

File URL Dialog

  • Workspace view tab serves to specify files within the workspace independently of whether the workspace belongs to CloverETL Project or CloverETL Server Project.
  • Local files tab displays the file structure of a local computer.
  • Remote files tab displays the file structure of a remote computer. It allows the user to specify the protocol, username, password, port, server, proxy, username for proxy, password for proxy, identification of proxy server, and the port for the proxy. After specifying these properties, the file structure of the remote computer is displayed (except for http and https protocols).
  • Port tab displays the input fields that are string, byte, or cbyte data types and allows the user to select one of them and also select a processing type from a combo box
  • Dictionary tab displays declared dictionary entries and allows to select one of them and choose the processing type from a combo

Now I will present the list of supported values of the File URL attribute.

Local Files (without compression)

Examples:
/path/file1.txt – reads one file
/path/file1.txt;/path/file2.txt – reads two files in one directory (semicolon separates files that will be read one after another)
/path1/fileA.txt;/path2/fileB.txt – reads two files in two directories (semicolon separates files that will be read one after another)
/path?/file*.txt – reads files in directories, when both the directories and the files must match the specified pattern.
/path/* – reads all files in the specified directory

Local Files (with compression)

Examples:
zip:(/path/file.zip) – reads the first file added to the zip archive
zip:(/path/file.zip)#innerfolder/innerfile.txt – reads the innerfile.txt contained in the innerfolder which has been compressed into the specified zip archive.
zip:(/path/file??.zip)#innerfolder?/innerfile*.txt – reads files contained in the innerfolders which have been compressed into the specified zip archives (each of these files, innerfolders, archive files must match their respective pattern)
gzip:(/path/file.gz) – reads the file compressed in the gzip archive
gzip:(/path/file??.gz) – reads the files compressed into specified gzip archives (each of these archives must match specified pattern)
tar:(/path/file.tar) – reads the first file added to the tar archive
tar:(/path/file.tar)#innerfolder/innerfile.txt – reads the innerfile.txt contained in the innerfolder which is compressed in the specified tar archive.
tar:(/path/file??.tar)#innerfolder?/innerfile*.txt – reads files contained in the innerfolders which have been compressed in the specified tar archives (each of these files, innerfolders, archive files must match their respective pattern)
zip:((zip:/path/file*.zip)#innerfolder/innerfile.zip)#innermostfolder??/innermostfile*.txt –reads innermost files contained in the innermostfolders which have been compressed into the specified innerfile zip archive contained in the innerfolder which has been compressed into the specified external zip archives (each of these innermostfiles, innermostfolders, external zip archives must match their respective pattern) Remember that innerfile.zip and innerfolder may not contain wildcards.

Remote Files  (without compression)

Unlike locally stored files, files on remote computers are accessible using a set of supported protocols. Sometimes it is also necessary to use a proxy server.

The following protocols are supported for accessing a remote server: sftp, ftp, ftps, http, https.

Access without proxy:

The structure of all remote files that are accessible directly, without a proxy, is as follows:
protocol://username:password@serverpassword@server :port/(whole|relative)path/file

Here, the whole path should be used for the sftp protocol, the other four protocols use relative paths.

Examples:
sftp://johnsmith:mypassword@myserver/home/johnsmith/relativepath/filename.txt
ftp://johnsmith:mypassword@myserver/relativepath/filename.txt
ftps://johnsmith:mypassword@myserver/relativepath/filename.txt
http://johnsmith:mypassword@myserver/relativepath/filename.txt
https://johnsmith:mypassword@myserver/relativepath/filename.txt

In the patterns shown above, username, password, and port may be ommitted if possible, whereas the other parts of such File URL are required.

Example (with username, password, and port ommitted):

http://myserver/relativepath/filename.txt

Access through proxy:

The structure of all remote files that are accessible through a proxy is as follows:
protocol:(proxy:proxyuser:proxypassword@proxyserver:proxyport)//username:password@server:port/(whole|relative)path/file

or with SOCKS V4 or V5 proxy:
protocol:(proxysocks:proxyuser:proxypassword@proxyserver:proxyport)//username:password@server:port/(whole|relative)path/file

Example:
ftp:(proxy:proxyuser:proxypassword@proxyserver:proxyport)//johnsmith:mypassword@myserver/relativepath/filename.txt

Also in this case, proxyuser, proxypassword, and proxyport can be ommitted if possible; the other parts of this pattern are required.

With SOCKS V4 or V5 proxy an example follows:
ftp:(proxysocks:proxyuser:proxypassword@proxyserver:proxyport)//johnsmith:mypassword@myserver/relativepath/filename.txt

Remote Files (with compression)

Remote File URLs may also be combined with archiving protocols in a similar manner to local File URLs.

Example:
zip:(ftp://johnsmith:mypassword@myserver/relativepath/myarchive.zip)#innerfolder/filename.txt

Wildcards may also be used in a similar way:

Example:
zip:(ftp://johnsmith:mypassword@myserver/relativepath/myarchive*.zip)#innerfolder??/filename?.txt

Note:

Remember that http and https protocols do not support wildcards in top level files or archives.

Console Input

File URL for Console input will be: – (hyphen character)

User types the input into Console after the graph starts, types data separated by field delimiters, presses Enter to specify end of records, and finishes the input after the last record by pressing Ctrl+Z.

Input Port Reading

CloverETL also supports reading incoming data through the input port of some Readers. Metadata connected to the input port must contain at least one field of string, byte, or cbyte data type. The user selects the field from which data should be read and parsed according to the output metadata. Three processing types can be selected in CloverETL:

  • discrete (the default value)
  • stream
  • source

File URL pattern is the following:

port:$0.fieldname:discrete|stream|source

Discrete processing type:

When the processing type is discrete, each record is parsed separately, according to the output metadata.

Example:
port:$0.customer:discrete

Note:
The colon and the word discrete can be ommitted.

Example:
port:$0.customer

Stream processing type:

When processing type is stream, all records are concatenated and parsed according to the output metadata. If input metadata contains a null value, this null means eof and separates groups of records. All records before such a null are concatenated, but separately from all records after such a null, which are also concatenated into another data source.

Example:
port:$0.customer:stream

Source processing type:

Example:
port:$0.file:source

When processing type is source, values of the selected field ($0.file) are valid URLs. The Reader to which such input is connected takes the file accessible with this URL and reads the contents. Metadata on the output must match the structure of the files specified with the help of these URLs.

Dictionary Entry Reading

Dictionary tab allows the selection of one of the graph dictionary entries. The processing type in the combo box should also be specified.

File URL pattern is:
dict:myentry:discrete|source

Discrete processing type:

Example:
dict:customer:discrete

Reads contents of dictionary entry whose name is customer.

Source processing type:

Example:
dict:file:source

When processing type is source, the value of the selected dictionary entry (file) is a valid URL. The reader with this File URL takes the contents of the file accessible with the help of this dictionary entry and reads the file contents. Metadata on the component’s output must match the structure of the file specified with the help of this URL.

January 31, 2011

Data sampling with CloverETL

Filed under: Using CloverETL — Tags: , , , — Agata Vackova @ 1:40 pm

Testing data transformations is generally not an easy task. When creating and testing a transformation you might want to get a data sample to check if your transformation works properly. In this point a question arises: How to create a representative data probe on the full data set? Obviously, the easiest way is to read just part of data from the beginning. But such data sample can be very unreliable.  I’ve prepared a few simple graphs that create a data probe which can be regarded as representative for the full data set.

All graphs were created based on the sampling methods described in the article Sampling (statistics).

Simple random sampling

In this method each record has the same probability of selection. Filtering is based on double value chosen (approximately) uniformly from the range 0.0d (inclusive) to 1.0d (exclusive): record is selected if the drawn number is lower than required sample set size:

Systematic sampling

Systematic sampling relies on arranging the data set according to some ordering scheme and then selecting elements in regular intervals through that ordered list. Systematic sampling involves a random start and then proceeds with the selection of every k-th element from then onwards:

Sorting can be disabled in this graph. Then it is selected just every k-th element from the full data set, starting from a randomly selected record from the interval [1, k].

Stratified sampling

If the data set embraces a number of distinct categories, the frame can be organized by these categories into separate strata. Each stratum is then sampled as an independent sub-population out of which individual elements can be randomly selected. At least one record from each stratum must be selected:

Probability proportional to size sampling

Probability for each record is set to be proportional to its stratum size, up to a maximum of 1. Strata are defined by the value of the selected field. For each group of records it it is used systematic sampling method:

Methods comparison

Simple random sampling method is the simplest and fastest. It is sufficient in most cases. Systematic sampling with disabled sorting is as fast as simple random sampling and produces also strongly representative data probe. The stratified sampling method is the trickiest one. It is useful only if the data set can be split into the separated groups that have reasonable sizes. In other cases the data probe is a lot of bigger than requested.

Please see the attached CloverETL project with the above graphs. It also contains the graph for comparison of samples created with different sampling methods. I’ve done some tests for the file containing 5,000,000 rows with information about financial transactions. Each row contains unique transaction id, id of a customer, transaction amount and currency info. Total number of customers is 50,001; number of possible currencies is 35. I performed two sets of tests: one for the group defined by customer id and one defined by currency id.

Results for the sampling_field = CustomerId

Stratum is defined by id of customer. All data can be split to 50,001 groups with sizes from 61 to 143 transactions.

Following table shows testing results for some groups. Sorting was enabled for systematic sampling method.

defined sample size ratio: 0.01

sampling field (CustomerId) value simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 22 s 772 ms 0 h 1 m 34 s 965 ms 0 h 1 m 33 s 831 ms 0 h 1 m 30 s 973 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 4 71 0.0563 1 71 0.0140 1 71 0.0140 1 71 0.0140
1 0 94 0.0000 1 94 0.0106 1 94 0.0106 1 94 0.0106
10 2 110 0.0181 1 110 0.0090 2 110 0.0181 1 110 0.0090
100 1 93 0.0107 1 93 0.0107 1 93 0.0107 1 93 0.0107
1000 0 83 0.0000 1 83 0.0120 1 83 0.0120 1 83 0.0120
10000 2 101 0.0198 1 101 0.0099 1 101 0.0099 1 101 0.0099
10001 0 99 0.0000 1 99 0.0101 1 99 0.0101 1 99 0.0101
10002 0 109 0.0000 1 109 0.0091 3 109 0.0275 1 109 0.0091
10003 1 86 0.0116 1 86 0.0116 2 86 0.0232 1 86 0.0116
10004 1 86 0.0116 1 86 0.0116 1 86 0.0116 1 86 0.0116
total 49937 5000000 0.0099 50000 5000000 0.0100 68172 5000000 0.0136 50011 5000000 0.0100
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 28 s 741 ms 0 h 1 m 34 s 474 ms 0 h 1 m 32 s 628 ms 0 h 1 m 33 s 949 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 0 71 0.0000 0 71 0.0000 1 71 0.0140 0 71 0.0000
1 0 94 0.0000 1 94 0.0106 1 94 0.0106 1 94 0.0106
10 1 110 0.0090 1 110 0.0090 3 110 0.0272 1 110 0.0090
100 1 93 0.0107 1 93 0.0107 1 93 0.0107 1 93 0.0107
1000 0 83 0.0000 1 83 0.0120 2 83 0.0240 0 83 0.0000
10000 1 101 0.0099 1 101 0.0099 1 101 0.0099 1 101 0.0099
10001 2 99 0.0202 1 99 0.0101 1 99 0.0101 1 99 0.0101
10002 1 109 0.0091 1 109 0.0091 1 109 0.0091 1 109 0.0091
10003 0 86 0.0000 1 86 0.0116 1 86 0.0116 1 86 0.0116
10004 1 86 0.0116 1 86 0.0116 1 86 0.0116 0 86 0.0000
total 49931 5000000 0.0099 50000 5000000 0.0100 68369 5000000 0.0136 50010 5000000 0.0100
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 24 s 975 ms 0 h 1 m 37 s 446 ms 0 h 1 m 29 s 98 ms 0 h 1 m 32 s 857 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 0 71 0.0000 1 71 0.0140 1 71 0.0140 1 71 0.0140
1 1 94 0.0106 1 94 0.0106 1 94 0.0106 1 94 0.0106
10 0 110 0.0000 1 110 0.0090 2 110 0.0181 2 110 0.0181
100 0 93 0.0000 1 93 0.0107 1 93 0.0107 1 93 0.0107
1000 1 83 0.0120 1 83 0.0120 1 83 0.0120 1 83 0.0120
10000 2 101 0.0198 1 101 0.0099 1 101 0.0099 1 101 0.0099
10001 1 99 0.0101 1 99 0.0101 1 99 0.0101 1 99 0.0101
10002 0 109 0.0000 1 109 0.0091 3 109 0.0275 1 109 0.0091
10003 1 86 0.0116 1 86 0.0116 1 86 0.0116 1 86 0.0116
10004 0 86 0.0000 1 86 0.0116 1 86 0.0116 1 86 0.0116
total 49983 5000000 0.0099 50000 5000000 0.0100 68258 5000000 0.0136 49900 5000000 0.0099
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling

Results for the same test but with data sorting disabled in systematic sampling method:

defined sample size ratio: 0.01

sampling field (CustomerId) value simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 28 s 168 ms 0 h 0 m 23 s 117 ms 0 h 1 m 35 s 414 ms 0 h 1 m 30 s 985 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 1 71 0.0140 0 71 0.0000 1 71 0.0140 1 71 0.0140
1 1 94 0.0106 0 94 0.0000 2 94 0.0212 1 94 0.0106
10 1 110 0.0090 2 110 0.0181 1 110 0.0090 1 110 0.0090
100 0 93 0.0000 0 93 0.0000 1 93 0.0107 1 93 0.0107
1000 0 83 0.0000 0 83 0.0000 1 83 0.0120 1 83 0.0120
10000 2 101 0.0198 0 101 0.0000 1 101 0.0099 1 101 0.0099
10001 0 99 0.0000 1 99 0.0101 1 99 0.0101 1 99 0.0101
10002 3 109 0.0275 1 109 0.0091 3 109 0.0275 1 109 0.0091
10003 1 86 0.0116 2 86 0.0232 1 86 0.0116 1 86 0.0116
10004 1 86 0.0116 0 86 0.0000 2 86 0.0232 0 86 0.0000
total 50081 5000000 0.0100 50000 5000000 0.0100 68227 5000000 0.0136 49966 5000000 0.0099
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 23 s 78 ms 0 h 0 m 19 s 178 ms 0 h 1 m 33 s 148 ms 0 h 1 m 29 s 261 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 0 71 0.0000 0 71 0.0000 1 71 0.0140 0 71 0.0000
1 0 94 0.0000 0 94 0.0000 1 94 0.0106 1 94 0.0106
10 0 110 0.0000 3 110 0.0272 1 110 0.0090 1 110 0.0090
100 3 93 0.0322 1 93 0.0107 1 93 0.0107 1 93 0.0107
1000 1 83 0.0120 0 83 0.0000 2 83 0.0240 1 83 0.0120
10000 0 101 0.0000 1 101 0.0099 1 101 0.0099 1 101 0.0099
10001 1 99 0.0101 1 99 0.0101 3 99 0.0303 1 99 0.0101
10002 1 109 0.0091 0 109 0.0000 1 109 0.0091 1 109 0.0091
10003 1 86 0.0116 0 86 0.0000 1 86 0.0116 1 86 0.0116
10004 3 86 0.0348 0 86 0.0000 1 86 0.0116 1 86 0.0116
total 50056 5000000 0.0100 50000 5000000 0.0100 68528 5000000 0.0137 50033 5000000 0.0100
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 28 s 244 ms 0 h 0 m 27 s 52 ms 0 h 1 m 35 s 49 ms 0 h 1 m 27 s 725 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 1 71 0.0140 0 71 0.0000 2 71 0.0281 0 71 0.0000
1 1 94 0.0106 0 94 0.0000 2 94 0.0212 1 94 0.0106
10 0 110 0.0000 2 110 0.0181 4 110 0.0363 1 110 0.0090
100 2 93 0.0215 2 93 0.0215 1 93 0.0107 1 93 0.0107
1000 2 83 0.0240 0 83 0.0000 1 83 0.0120 1 83 0.0120
10000 0 101 0.0000 0 101 0.0000 1 101 0.0099 1 101 0.0099
10001 0 99 0.0000 4 99 0.0404 1 99 0.0101 1 99 0.0101
10002 1 109 0.0091 2 109 0.0183 1 109 0.0091 1 109 0.0091
10003 1 86 0.0116 1 86 0.0116 2 86 0.0232 0 86 0.0000
10004 0 86 0.0000 0 86 0.0000 1 86 0.0116 1 86 0.0116
total 50116 5000000 0.0100 50000 5000000 0.0100 68470 5000000 0.0136 50010 5000000 0.0100
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling

Since the groups are really small, there should be selected none or one record from each group and for the smaller groups we should have more often zero selected records. In relation to this criteria the PPS sampling method and systematic sampling method with sorting data enabled give the best results. Data sample created with stratified method is always oversized.

Results for the sampling_field = CurrencyId

Stratum is defined by id of currency. All data can be split to 35 groups with very similar sizes from 142,042 to 143,572 transactions.

The following table shows testing results for some groups. Sorting was enabled for systematic sampling method.

defined sample size ratio: 0.01

sampling field (CurrencyId) value simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 21 s 681 ms 0 h 1 m 26 s 859 ms 0 h 1 m 25 s 970 ms 0 h 1 m 27 s 85 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 1450 142623 0.0101 1427 142623 0.0100 1447 142623 0.0101 1426 142623 0.0099
1 1371 142925 0.0095 1429 142925 0.0099 1430 142925 0.0100 1429 142925 0.0099
10 1420 142897 0.0099 1429 142897 0.0100 1432 142897 0.0100 1429 142897 0.0100
11 1448 142896 0.0101 1429 142896 0.0100 1443 142896 0.0100 1429 142896 0.0100
12 1383 142522 0.0097 1425 142522 0.0099 1488 142522 0.0104 1425 142522 0.0099
13 1468 142461 0.0103 1425 142461 0.0100 1395 142461 0.0097 1424 142461 0.0099
14 1449 142997 0.0101 1430 142997 0.0100 1479 142997 0.0103 1430 142997 0.0100
15 1401 142697 0.0098 1426 142697 0.0099 1438 142697 0.0100 1427 142697 0.0100
16 1396 143137 0.0097 1432 143137 0.0100 1387 143137 0.0096 1431 143137 0.0099
17 1464 142517 0.0102 1425 142517 0.0099 1413 142517 0.0099 1425 142517 0.0099
total 49959 5000000 0.0099 50000 5000000 0.0100 50075 5000000 0.0100 49997 5000000 0.0099
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CurrencyId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 22 s 949 ms 0 h 1 m 25 s 726 ms 0 h 1 m 27 s 629 ms 0 h 1 m 24 s 537 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 1449 142623 0.0101 1427 142623 0.0100 1496 142623 0.0104 1426 142623 0.0099
1 1468 142925 0.0102 1429 142925 0.0099 1442 142925 0.0100 1429 142925 0.0099
10 1436 142897 0.0100 1429 142897 0.0100 1406 142897 0.0098 1429 142897 0.0100
11 1436 142896 0.0100 1429 142896 0.0100 1402 142896 0.0098 1429 142896 0.0100
12 1410 142522 0.0098 1425 142522 0.0099 1454 142522 0.0102 1425 142522 0.0099
13 1438 142461 0.0100 1425 142461 0.0100 1414 142461 0.0099 1425 142461 0.0100
14 1420 142997 0.0099 1430 142997 0.0100 1450 142997 0.0101 1430 142997 0.0100
15 1412 142697 0.0098 1427 142697 0.0100 1400 142697 0.0098 1427 142697 0.0100
16 1453 143137 0.0101 1431 143137 0.0099 1442 143137 0.0100 1431 143137 0.0099
17 1431 142517 0.0100 1425 142517 0.0099 1372 142517 0.0096 1425 142517 0.0099
total 50163 5000000 0.0100 50000 5000000 0.0100 49709 5000000 0.0099 50000 5000000 0.0100
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CurrencyId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 27 s 716 ms 0 h 1 m 26 s 865 ms 0 h 1 m 26 s 657 ms 0 h 1 m 26 s 254 ms
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 1488 142623 0.0104 1426 142623 0.0099 1416 142623 0.0099 1426 142623 0.0099
1 1353 142925 0.0094 1429 142925 0.0099 1434 142925 0.0100 1429 142925 0.0099
10 1417 142897 0.0099 1429 142897 0.0100 1390 142897 0.0097 1429 142897 0.0100
11 1448 142896 0.0101 1429 142896 0.0100 1438 142896 0.0100 1429 142896 0.0100
12 1448 142522 0.0101 1425 142522 0.0099 1408 142522 0.0098 1425 142522 0.0099
13 1412 142461 0.0099 1425 142461 0.0100 1432 142461 0.0100 1424 142461 0.0099
14 1440 142997 0.0100 1430 142997 0.0100 1471 142997 0.0102 1430 142997 0.0100
15 1445 142697 0.0101 1427 142697 0.0100 1530 142697 0.0107 1427 142697 0.0100
16 1436 143137 0.0100 1431 143137 0.0099 1456 143137 0.0101 1432 143137 0.0100
17 1381 142517 0.0096 1425 142517 0.0099 1365 142517 0.0095 1426 142517 0.0100
total 50089 5000000 0.0100 50000 5000000 0.0100 49707 5000000 0.0099 49999 5000000 0.0099
sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CurrencyId simple sampling systematic sampling stratified sampling pps sampling

With such large groups all the methods give very good results. Although no doubt we get the best results using the systematic sampling or PPS sampling methods where the sample size is always within the limits 0.0099 to 0.0100.

Download the transformation graph with data

Older Posts »

Theme: Shocking Blue Green. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.