<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>CloverETL&#039;s Blog</title>
	<atom:link href="http://cloveretl.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://cloveretl.wordpress.com</link>
	<description>Life, the Universe, CloverETL and everything ...</description>
	<lastBuildDate>Fri, 30 Sep 2011 14:16:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='cloveretl.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://1.gravatar.com/blavatar/dd4c2411bcdf90b36e88bda58e3fce7c?s=96&#038;d=http%3A%2F%2Fs2.wp.com%2Fi%2Fbuttonw-com.png</url>
		<title>CloverETL&#039;s Blog</title>
		<link>http://cloveretl.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://cloveretl.wordpress.com/osd.xml" title="CloverETL&#039;s Blog" />
	<atom:link rel='hub' href='http://cloveretl.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Handling Errors in Heterogeneous Input Data</title>
		<link>http://cloveretl.wordpress.com/2011/09/06/handling-errors-in-heterogeneous-input-data/</link>
		<comments>http://cloveretl.wordpress.com/2011/09/06/handling-errors-in-heterogeneous-input-data/#comments</comments>
		<pubDate>Tue, 06 Sep 2011 11:17:02 +0000</pubDate>
		<dc:creator>stysm</dc:creator>
				<category><![CDATA[Using CloverETL]]></category>
		<category><![CDATA[CSV]]></category>
		<category><![CDATA[data integration]]></category>
		<category><![CDATA[heterogeneous data]]></category>

		<guid isPermaLink="false">http://blog.cloveretl.com/?p=1257</guid>
		<description><![CDATA[ComplexDataReader is a powerful new component in CloverETL meant for reading elaborate heterogeneous data. However, not all data can be read easily, even if you spend a lot of time configuring the component. Sometimes you need to think in advance: What if you come across unknown metadata you have not handled? Normally, the graph crashes. This [...]]]></description>
			<content:encoded><![CDATA[<p>ComplexDataReader is a powerful new component in CloverETL meant for reading elaborate heterogeneous data. However, not all data can be read easily, even if you spend a lot of time configuring the component. Sometimes you need to think in advance: What if you come across unknown metadata you have not handled? Normally, the graph crashes.</p>
<p>This post will examine a way of preventing that or, more specifically, how to handle errors in input data.</p>
<h3>Example Input Data</h3>
<p><a href="http://cloveretl.files.wordpress.com/2011/09/input.png"><img class="alignnone size-full wp-image-1259" title="input" src="http://cloveretl.files.wordpress.com/2011/09/input.png" alt="Input Data" width="961" height="356" /></a></p>
<h3>What We Will Do</h3>
<p>We can instantly distinguish three kinds of metadata on the input: <span style="background-color:#00ffff;">product</span>, <span style="background-color:#808000;">product_range</span> and <span style="background-color:#ff0000;">service</span>. ComplexDataReader is the best component for parsing these, using three states of a state machine. As you can see, there is one line that does not fit into the data. The magic trick of this example lies in preparing one extra state – the <strong>error state</strong>. This state will be responsible for “catching” all incorrect data that would otherwise cause the component to fail. To decide which data are “bad,&#8221; or, more precisely, when to switch to the error state, you have to write a <strong>custom Selector class</strong> in Java. The idea behind the code is very simple and is explained below.</p>
<h3>&#8220;Prep Work&#8221;</h3>
<p>First, we need to prepare metadata for all three states of the state machine plus one extra. The extra metadata will represent error lines on the input we need to “throw away.&#8221;</p>
<p>Second, do not forget to connect the component to its succeeding components and assign metadata to output edges.</p>
<p>Third, set the “File URL” property to point the component to the input file.</p>
<p>Here are the three aforementioned metadata:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/09/meta_product.png"><img class="alignnone size-full wp-image-1261" title="meta_product" src="http://cloveretl.files.wordpress.com/2011/09/meta_product.png" alt="Metadata: Product" width="384" height="148" /></a></p>
<p><a href="http://cloveretl.files.wordpress.com/2011/09/meta_service.png"><img class="alignnone size-full wp-image-1263" title="meta_service" src="http://cloveretl.files.wordpress.com/2011/09/meta_service.png" alt="Metadata: Service" width="381" height="124" /></a></p>
<p><a href="http://cloveretl.files.wordpress.com/2011/09/meta_product_range.png"><img class="alignnone size-full wp-image-1262" title="meta_product_range" src="http://cloveretl.files.wordpress.com/2011/09/meta_product_range.png" alt="Metadata: Product Range" width="382" height="91" /></a></p>
<p>And one extra metadata for error lines:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/09/meta_bad.png"><img class="alignnone size-full wp-image-1260" title="meta_bad" src="http://cloveretl.files.wordpress.com/2011/09/meta_bad.png" alt="Metadata for Error Lines" width="382" height="90" /></a></p>
<h3>Designing the State Machine</h3>
<p>We are going to create four states:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/09/automaton.png"><img class="alignnone size-full wp-image-1258" title="automaton" src="http://cloveretl.files.wordpress.com/2011/09/automaton.png" alt="" width="726" height="368" /></a></p>
<p>Note: There are no transition edges to be seen in the graph. This is because the Selector itself will decide when to change between states.</p>
<p>Start configuring the component via the “Transform” property. Create four states corresponding to the metadata and set “Initial state” to “Let selector decide”:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/09/states.png"><img class="alignnone size-full wp-image-1270" title="states" src="http://cloveretl.files.wordpress.com/2011/09/states.png" alt="" width="740" height="769" /></a></p>
<p>Switch to state “$0 product” and define its output mapping. In this state, we will send all fields to the output. Thus, drag state $0 to the “Value” column in the right-hand pane. You will produce the “$0.*” directive. In the “Transition table”, switch “Target state” to “Let selector decide”:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/09/state0.png"><img class="alignnone size-full wp-image-1265" title="state0" src="http://cloveretl.files.wordpress.com/2011/09/state0.png" alt="" width="975" height="769" /></a></p>
<p>Repeat the same procedure for all remaining states (including the error state). Always send everything to the output port and “Let selector decide” about the target state:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/09/state1.png"><img class="alignnone size-full wp-image-1266" title="state1" src="http://cloveretl.files.wordpress.com/2011/09/state1.png" alt="" width="936" height="769" /></a></p>
<p><a href="http://cloveretl.files.wordpress.com/2011/09/state2.png"><img class="alignnone size-full wp-image-1267" title="state2" src="http://cloveretl.files.wordpress.com/2011/09/state2.png" alt="" width="936" height="769" /></a></p>
<p><a href="http://cloveretl.files.wordpress.com/2011/09/state3.png"><img class="alignnone size-full wp-image-1268" title="state3" src="http://cloveretl.files.wordpress.com/2011/09/state3.png" alt="" width="936" height="769" /></a></p>
<h3>Writing a Custom Selector</h3>
<p>We are now going to prepare a Java class that will do the magic of this example: switch between the states “$0 product”, “$1 service” and “$2 product_range”, and fall back to the “$3 error” state if reading errors occur. This particular prefix Selector will assume there is another record on the following line(s) and will try to read it. If there really is a new record, we can recover from the error line and carry on reading.</p>
<p>You can prepare the Java class in any editor of your choice. After writing it, just remember to place it into the “trans” folder of your project. If you do, CloverETL will automatically compile the class for you.</p>
<p>The Selector class will look like this:</p>
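<pre>// Extends the built-in prefix selector and adds a fallback to the error state
public class CustomPrefixInputMetadataSelector1 extends com.opensys.cloveretl.component.complexdatareader.PrefixInputMetadataSelector {

	// Number of the error state ("$3 error")
	private static final int DEFAULT = 3;

	@Override
	public int select(int prevState) {
		// Let the default prefix selector choose the next state first
		int result = super.select(prevState);
		// The default selector could not decide: switch to the error state
		if(result == org.jetel.component.RecordTransform.ALL) {
			return DEFAULT;
		}
		return result;
	}
}</pre>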
<p>A few comments concerning the code:</p>
<ul>
<li><code>int result = super.select(prevState);</code><br />
First, we call the default selector and store the number of the next state in <code>result</code>.</li>
<li><code>if(result == org.jetel.component.RecordTransform.ALL)</code><br />
And if the default selector cannot decide&#8230;</li>
<li><code>return DEFAULT;</code><br />
We return the default state number – number 3. This is the error state.</li>
</ul>
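<p>To see the fallback logic in isolation, here is a minimal, self-contained sketch that runs without CloverETL. It is only an illustration: the class name, the <code>UNDECIDED</code> constant, the prefixes and the sample lines are all assumptions made up for this example, and <code>selectByPrefix</code> merely stands in for the default prefix selector.</p>
<pre>public class ErrorStateFallbackSketch {

	// State numbers follow this post: $0 product, $1 service, $2 product_range, $3 error
	static final int ERROR_STATE = 3;

	// Hypothetical stand-in for "the default selector could not decide"
	static final int UNDECIDED = -1;

	// Hypothetical prefixes; the real ones depend on your input file
	static final String[] PREFIXES = { "P;", "S;", "R;" };

	// Plays the role of the default prefix selector: returns the state whose
	// prefix matches the line, or UNDECIDED when no prefix matches
	static int selectByPrefix(String line) {
		for (int state = 0; state != PREFIXES.length; state++) {
			if (line.startsWith(PREFIXES[state])) {
				return state;
			}
		}
		return UNDECIDED;
	}

	// Mirrors the override above: delegate first, then fall back to the error state
	static int select(String line) {
		int result = selectByPrefix(line);
		if (result == UNDECIDED) {
			return ERROR_STATE;
		}
		return result;
	}

	public static void main(String[] args) {
		String[] lines = { "P;1;Keyboard", "S;7;Installation", "???garbage???", "R;2;Peripherals" };
		for (String line : lines) {
			System.out.println(select(line) + " : " + line);
		}
	}
}</pre>
<p>In this sketch, the third line matches no prefix and is routed to state 3, while reading continues normally with the line after it.</p>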
<p>Now that you are done with the code, switch to the “Selector” tab in “State transitions”. In “Selector URL”, browse for your custom Selector. Notice that after you specify its location, the “Selector properties” area changes:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/09/selector.png"><img class="alignnone size-full wp-image-1264" title="Selector" src="http://cloveretl.files.wordpress.com/2011/09/selector.png" alt="" width="705" height="670" /></a></p>
<h3>Conclusions &amp; Pitfalls</h3>
<p>In this article, we have presented a way of handling flaws in the input data. We were able to address the situation in which the selector looks at the following metadata and cannot decide which state comes next.</p>
<p>However, there are many cases in which you simply cannot prevent reading errors. For instance, if the selector recognizes the following metadata but parsing them then fails, we cannot react and the graph fails. Imagine a file whose field types suddenly change, e.g. from integer to date: the selector starts parsing an integer and crashes. Another known case we cannot handle is a varying number of fields in one record. If new fields appear or fields are missing, graph execution fails. The only exception is fields added at the end of a record. These can be handled with the help of the <em>lenient</em> data policy.</p>
<p><a href="http://www.cloveretl.com/sites/applicationcraft/files/files/blog/ComplexDataReader_error.zip"><strong>Download a complete CloverETL project – error handling in ComplexDataReader</strong></a></p>
]]></content:encoded>
			<wfw:commentRss>http://cloveretl.wordpress.com/2011/09/06/handling-errors-in-heterogeneous-input-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/210df512f1d36b67ed2a70dde06f09f0?s=96&#38;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96&#38;r=G" medium="image">
			<media:title type="html">stysm</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/09/input.png" medium="image">
			<media:title type="html">input</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/09/meta_product.png" medium="image">
			<media:title type="html">meta_product</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/09/meta_service.png" medium="image">
			<media:title type="html">meta_service</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/09/meta_product_range.png" medium="image">
			<media:title type="html">meta_product_range</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/09/meta_bad.png" medium="image">
			<media:title type="html">meta_bad</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/09/automaton.png" medium="image">
			<media:title type="html">automaton</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/09/states.png" medium="image">
			<media:title type="html">states</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/09/state0.png" medium="image">
			<media:title type="html">state0</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/09/state1.png" medium="image">
			<media:title type="html">state1</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/09/state2.png" medium="image">
			<media:title type="html">state2</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/09/state3.png" medium="image">
			<media:title type="html">state3</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/09/selector.png" medium="image">
			<media:title type="html">Selector</media:title>
		</media:content>
	</item>
		<item>
		<title>Usability Improvements in CloverETL 3.1</title>
		<link>http://cloveretl.wordpress.com/2011/08/18/usability-improvements-in-cloveretl-3-1/</link>
		<comments>http://cloveretl.wordpress.com/2011/08/18/usability-improvements-in-cloveretl-3-1/#comments</comments>
		<pubDate>Thu, 18 Aug 2011 13:16:54 +0000</pubDate>
		<dc:creator>tichyj</dc:creator>
				<category><![CDATA[Using CloverETL]]></category>
		<category><![CDATA[CloverETL Designer]]></category>
		<category><![CDATA[data integration]]></category>
		<category><![CDATA[ETL tool]]></category>
		<category><![CDATA[usability]]></category>

		<guid isPermaLink="false">http://blog.cloveretl.com/?p=1163</guid>
		<description><![CDATA[One of the most noticeable sets of changes in CloverETL version 3.1 is the interface improvements, which substantially improve Clover&#8217;s usability and understandability. These improvements save both new and old users valuable time when creating or manipulating their data transformation graphs and further cement CloverETL&#8217;s place as one of the easiest-to-use ETL tools [...]]]></description>
			<content:encoded><![CDATA[<p>One of the most noticeable sets of changes in CloverETL version 3.1 is the interface improvements, which substantially improve Clover&#8217;s usability and understandability. These improvements save both new and old users valuable time when creating or manipulating their data transformation graphs and further cement CloverETL&#8217;s place as one of the easiest-to-use ETL tools on the market.</p>
<p>The biggest improvement was the addition of drag-and-drop functionality to a number of different aspects of Clover. You can drag files to the graph, files to components, files to metadata, and metadata to edges, saving innumerable clicks through menus.</p>
<p>We have also made it easier to link your metadata and edges while creating the edges. If you right-click on the Edge tool in your palette, it will give you a list of every metadata you have created on the current graph. If you select one of the metadata, whenever you create an edge with the edge tool, it will automatically assign that metadata to the edge.</p>
<p>Not only is it easier to link metadata and edges, we&#8217;ve also made it easier to create and manipulate the edges themselves. Edges can now be created simply by dragging from one component’s out port to another&#8217;s in port. If you find you want to change where the edge is connected, that too is now one-click. Simply click and drag an edge&#8217;s endpoints to any other port.</p>
<p>The last shortcut that version 3.1 added to CloverETL is an easier way to set the description on a component. Before, the description field was buried in the component&#8217;s properties, but now it has been moved to the header of the properties window. This improvement makes it substantially easier to clarify the purpose of your components, making your graph easier to read overall.</p>
<span style="text-align:center; display: block;"><a href="http://cloveretl.wordpress.com/2011/08/18/usability-improvements-in-cloveretl-3-1/"><img src="http://img.youtube.com/vi/v_g59r6L8bs/2.jpg" alt="" /></a></span>
]]></content:encoded>
			<wfw:commentRss>http://cloveretl.wordpress.com/2011/08/18/usability-improvements-in-cloveretl-3-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/5045cfc83bceb54c7ad747683abd6efc?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96&#38;r=G" medium="image">
			<media:title type="html">tichyj</media:title>
		</media:content>
	</item>
		<item>
		<title>Address Cleansing and Transliteration with CloverETL and AddressDoctor</title>
		<link>http://cloveretl.wordpress.com/2011/08/03/address-cleansing-and-transliteration-with-cloveretl-and-addressdoctor/</link>
		<comments>http://cloveretl.wordpress.com/2011/08/03/address-cleansing-and-transliteration-with-cloveretl-and-addressdoctor/#comments</comments>
		<pubDate>Wed, 03 Aug 2011 14:55:31 +0000</pubDate>
		<dc:creator>Agata Vackova</dc:creator>
				<category><![CDATA[Using CloverETL]]></category>
		<category><![CDATA[address cleansing]]></category>
		<category><![CDATA[addressdoctor]]></category>
		<category><![CDATA[data cleansing]]></category>
		<category><![CDATA[data quality]]></category>
		<category><![CDATA[transliteration]]></category>

		<guid isPermaLink="false">http://blog.cloveretl.com/?p=1086</guid>
		<description><![CDATA[Data quality usually goes hand in hand with data integration. The new CloverETL 3.1 release has enriched its data cleansing capabilities through integration with AddressDoctor. AddressDoctor contains address and geo data for more than 240 countries all over the globe. Along with correcting and fixing mailing addresses, AddressDoctor can also be used for transliteration of [...]]]></description>
			<content:encoded><![CDATA[<p>Data quality usually goes hand in hand with data integration. The new CloverETL 3.1 release has enriched its data cleansing capabilities through integration with AddressDoctor. AddressDoctor contains address and geo data for more than 240 countries all over the globe. Along with correcting and fixing mailing addresses, AddressDoctor can also be used for transliterating non-Latin writing systems into Latin characters or for enriching addresses with latitude and longitude information.</p>
<p>CloverETL integrates the AddressDoctor software through a dedicated AddressDoctor component. In all cases, you need the Java library AddressDoctor5.jar, the native libraries (all of them need to be on the classpath when running a graph) and the country databases. You also need unlock codes for your databases.</p>
<p>The AddressDoctor component has four required parameters:</p>
<ul>
<li><strong>Configuration</strong> is driven by the <em>configXml</em> or <em>configFile</em> parameter. The simplest configuration file can look as follows:
<pre>       &lt;?xml version="1.0" encoding="UTF-8"?&gt;
       &lt;SetConfig&gt;
         &lt;General MaxAddressObjectCount="10"/&gt;
         &lt;UnlockCode&gt;Your unlock code here&lt;/UnlockCode&gt;
         &lt;DataBase CountryISO3="ALL" Path="/home/user/AD/db"/&gt;
       &lt;/SetConfig&gt;</pre>
</li>
<li><strong>Parameters</strong> can be set by the <em>parameterXml</em> or <em>parameterFile</em> attribute. The simplest parameter file can look as follows:
<pre>       &lt;?xml version="1.0" encoding="UTF-8"?&gt;
       &lt;Parameters&gt;
         &lt;Process Mode="PARSE"/&gt;
         &lt;Input/&gt;
         &lt;Result/&gt;
       &lt;/Parameters&gt;</pre>
</li>
<li><strong>Input mapping</strong> defines mapping between Clover input fields and AddressDoctor address properties.</li>
<li><strong>Output mapping</strong> defines mapping between AddressDoctor output address properties and your output.</li>
</ul>
<h3>Transliteration example</h3>
<p>Imagine that you have addresses from all over the world saved in their original languages. For some languages, you would not even recognize the characters, to say nothing of reading them correctly. Let&#8217;s try the following addresses:</p>
<pre>Tomáš Novák;Vohradského 5;Česká Lípa;Czech republic
Hans-Peter Feiertag;Metro-Straße 1;Düsseldorf 4;Deutschland
Michał Dąbrowski;Marszałkowska 142;00-132 Warszawa;Polska
Борис Николаевич Ельцин; Казанская пл., 23;Санкт-Петербург;RUSSIAN FEDERATION
John Smith;100 Main Street;New York NY 10023;USB</pre>
<p>To transliterate the above addresses to ASCII, we need to set the following parameters:</p>
<ul>
<li><strong>Mode</strong> to PARSE, to transliterate without processing the address, and <strong>Optimization level</strong> to STANDARD, as it produces the best results. Both are on the <strong>Process</strong> tab of the <strong>Parameters</strong> attribute:<br />
<a href="http://cloveretl.files.wordpress.com/2011/08/process.png"><img class="alignnone size-full wp-image-1099" title="Process" src="http://cloveretl.files.wordpress.com/2011/08/process.png" alt="" width="613" height="850" /></a></li>
<li><strong>Encoding</strong> to UTF-16 on the <strong>Input</strong> tab of the <strong>Parameters</strong> attribute:<br />
<a href="http://cloveretl.files.wordpress.com/2011/08/input.png"><img class="alignnone size-full wp-image-1103" title="Input" src="http://cloveretl.files.wordpress.com/2011/08/input.png" alt="" width="613" height="850" /></a></li>
<li><strong>Preferred language</strong> to ENGLISH and <strong>Preferred script</strong> to ASCII_SIMPLIFIED on the <strong>Result</strong> tab of the <strong>Parameters</strong> attribute:<br />
<a href="http://cloveretl.files.wordpress.com/2011/08/result.png"><img class="alignnone size-full wp-image-1102" title="Result" src="http://cloveretl.files.wordpress.com/2011/08/result.png" alt="" width="613" height="850" /></a></li>
</ul>
<p>Now we need to provide AddressDoctor with the input mapping. Here is the input metadata:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/08/inputmetadata.png"><img class="alignnone size-full wp-image-1113" title="InputMetadata" src="http://cloveretl.files.wordpress.com/2011/08/inputmetadata.png" alt="" width="900" height="600" /></a></p>
<p>Its fields correspond to the Contact name, Street complete, Locality complete and Country name address properties:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/08/adinputproperties1.png"><img class="alignnone size-full wp-image-1115" title="ADInputProperties" src="http://cloveretl.files.wordpress.com/2011/08/adinputproperties1.png" alt="" width="565" height="838" /></a> &#8211;&gt; <a href="http://cloveretl.files.wordpress.com/2011/08/adinputmapping.png"><img class="alignnone size-full wp-image-1116" title="ADInputMapping" src="http://cloveretl.files.wordpress.com/2011/08/adinputmapping.png" alt="" width="730" height="838" /></a></p>
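<p>The correspondence between input fields and address properties can also be sketched in plain Java. This is only an illustration of the idea, not what the component does internally: the class and method names are made up for this example, and the field order is assumed to match the semicolon-delimited sample lines above.</p>
<pre>public class AddressLineSketch {

	// Address properties named in this post, in the assumed field order
	static final String[] PROPERTIES = {
		"Contact name", "Street complete", "Locality complete", "Country name"
	};

	// Splits one semicolon-delimited line and trims stray spaces around each field
	static String[] parse(String line) {
		String[] fields = line.split(";", -1);
		if (fields.length != PROPERTIES.length) {
			throw new IllegalArgumentException("Expected 4 fields: " + line);
		}
		for (int i = 0; i != fields.length; i++) {
			fields[i] = fields[i].trim();
		}
		return fields;
	}

	public static void main(String[] args) {
		String[] fields = parse("John Smith;100 Main Street;New York NY 10023;USB");
		for (int i = 0; i != fields.length; i++) {
			System.out.println(PROPERTIES[i] + " = " + fields[i]);
		}
	}
}</pre>
<p>Trimming matters here because one of the sample lines has a space after the delimiter.</p>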
<p>A similar mapping needs to be provided for the output properties:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/08/outputproperties.png"><img class="alignnone size-full wp-image-1125" title="OutputProperties" src="http://cloveretl.files.wordpress.com/2011/08/outputproperties.png" alt="" width="538" height="875" /></a> &#8211;&gt; <a href="http://cloveretl.files.wordpress.com/2011/08/output-mapping.png"><img class="alignnone size-full wp-image-1142" title="output mapping" src="http://cloveretl.files.wordpress.com/2011/08/output-mapping.png" alt="" width="732" height="668" /></a></p>
<p>We can also connect the <strong>Error port</strong> for invalid (unrecognised) addresses:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/08/error-mapping.png"><img class="alignnone size-full wp-image-1143" title="error mapping" src="http://cloveretl.files.wordpress.com/2011/08/error-mapping.png" alt="" width="732" height="668" /></a></p>
<p>Now we can run our graph and get the following results:</p>
<pre>Tomas Novak;Vohradskeho;Ceska Lipa;CZECH REPUBLIC
Hans-Peter Feiertag;Metro-Strase;Dusseldorf 4;GERMANY
Michal Dabrowski;Marszalkowska;Warsaw;POLAND
Boris Nikolaevic Elcin;Kazanskaa Pl.;Sankt-Peterburg;RUSSIAN FEDERATION</pre>
<p>But one address is missing. Look at the error port:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/08/error.png"><img class="alignnone size-full wp-image-1144" title="error" src="http://cloveretl.files.wordpress.com/2011/08/error.png" alt="" width="586" height="68" /></a></p>
<p><strong>N1</strong> means <em>Validation Error: No validation performed because country was not recognized</em>. The country in the last input line, &#8220;USB&#8221;, could not be recognized, so that address went to the error port.</p>
<p>The output can be improved even further by setting <strong>Preferred script</strong> to ASCII_EXTENDED on the <strong>Result</strong> tab of the <strong>Parameters</strong> attribute:</p>
<pre>Tomash Novak;Vohradskeho;Cheska Lipa;CZECH REPUBLIC
Hans-Peter Feiertag;Metro-Strasse;Duesseldorf 4;GERMANY
Michal Dabrowski;Marszalkowska;Warsaw;POLAND
Boris Nikolaevich Elcin;Kazanskaya Pl.;Sankt-Peterburg;RUSSIAN FEDERATION</pre>
<p>Now the transcription corresponds more closely to the pronunciation.</p>
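<p>For comparison, here is a rough sketch of what simple diacritics removal looks like using only the Java standard library. It is not how AddressDoctor works and is no substitute for it: it only decomposes accented Latin letters and drops the accents. It cannot transliterate Cyrillic, and letters without a canonical decomposition, such as &#8220;ł&#8221; or &#8220;ß&#8221;, pass through unchanged. The class and method names are made up for this example.</p>
<pre>import java.text.Normalizer;

public class DiacriticsSketch {

	// Decomposes accented letters (NFD) and removes the combining marks
	static String stripDiacritics(String text) {
		String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
		return decomposed.replaceAll("\\p{M}+", "");
	}

	public static void main(String[] args) {
		// First contact name from the sample data, written with Unicode escapes
		// so that the source file stays plain ASCII
		System.out.println(stripDiacritics("Tom\u00e1\u0161 Nov\u00e1k"));
	}
}</pre>
<p>This prints <code>Tomas Novak</code>, the same as the ASCII_SIMPLIFIED result for that name.</p>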
<h3>Enrichment example</h3>
<p>If you have the appropriate database, you can enrich addresses with data you don&#8217;t have yet, e.g. ZIP code, geocoding or certification status.</p>
<p>Let&#8217;s consider the following addresses:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/08/addresses.png"><img class="alignnone size-full wp-image-1132" title="Addresses" src="http://cloveretl.files.wordpress.com/2011/08/addresses.png" alt="" width="421" height="95" /></a></p>
<p>To validate an address according to the Canada Post SERP rules, we need a certified database for Canada. This has to be set up in <strong>Configuration</strong> as well as in <strong>Parameters</strong>:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/08/configuration.png"><img class="alignnone size-full wp-image-1133" title="Configuration" src="http://cloveretl.files.wordpress.com/2011/08/configuration.png" alt="" width="613" height="459" /></a> and <a href="http://cloveretl.files.wordpress.com/2011/08/parameters.png"><img class="alignnone size-full wp-image-1134" title="Parameters" src="http://cloveretl.files.wordpress.com/2011/08/parameters.png" alt="" width="613" height="850" /></a></p>
<p>After the AddressDoctor transformation, we get:</p>
<table border="2">
<tbody>
<tr>
<th>Count</th>
<th>ResultNumber</th>
<th>ElementItem</th>
<th>Country</th>
<th>Province</th>
<th>Locality</th>
<th>PostalCode</th>
<th>Street</th>
<th>Number</th>
<th>SERPStatus</th>
<th>SERPCategory</th>
<th>ResultProcessStatus</th>
<th>ResultModeUsed</th>
<th>ResultPreferredScript</th>
<th>ResultPreferredLanguage</th>
<th>ResultDataMailabilityScore</th>
<th>ResultDataElementResultStatus</th>
<th>ResultDataElementInputStatus</th>
<th>ResultDataElementRelevance</th>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>1</td>
<td>CANADA</td>
<td>ON</td>
<td>TORONTO</td>
<td>M4P 3J5</td>
<td>YONGE STREET</td>
<td>2384</td>
<td>ESE1</td>
<td>C</td>
<td>C3</td>
<td>CERTIFIED</td>
<td>ASCII_SIMPLIFIED</td>
<td>DATABASE</td>
<td>3</td>
<td>80F0F0F0F000004000E0</td>
<td>00606060600000200060</td>
<td>10101010100000100010</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>1</td>
<td>CANADA</td>
<td>SK</td>
<td>LA RONGE</td>
<td></td>
<td>FINLAYSON ST</td>
<td>108</td>
<td>ESE1</td>
<td>N</td>
<td>I1</td>
<td>CERTIFIED</td>
<td>ASCII_SIMPLIFIED</td>
<td>DATABASE</td>
<td>0</td>
<td>000000000000000000E0</td>
<td>00601010100000000060</td>
<td>00000000000000000000</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>1</td>
<td>CANADA</td>
<td>SK</td>
<td>PRINCE ALBERT</td>
<td>S6V 0C7</td>
<td>15TH ST E</td>
<td>801</td>
<td>ESE1</td>
<td>C</td>
<td>C4</td>
<td>CERTIFIED</td>
<td>ASCII_SIMPLIFIED</td>
<td>DATABASE</td>
<td>4</td>
<td>80F0F0F0F00000F000E0</td>
<td>00606060600000600060</td>
<td>10101010100000100010</td>
</tr>
</tbody>
</table>
<p>Note the fields:</p>
<ul>
<li>PostalCode &#8211; for recognized addresses we get a valid postal code</li>
<li>SERPCategory:</li>
<ul>
<li>C: Corrected</li>
<li>N: Incorrect</li>
</ul>
<li>ResultProcessStatus and ResultDataMailabilityScore:</li>
<ul>
<li>C3 (3) : Corrected – but some elements could not be checked (should be fine)</li>
<li>I1 (0) : Data could not be corrected and is pretty unlikely to be delivered (futile)</li>
<li>C4 (4) : Corrected – all (postally relevant) elements have been checked (almost certain)</li>
</ul>
</ul>
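<p>As an illustration of how these fields could be used downstream, here is a minimal Java sketch that interprets the three status codes shown above and applies a simple acceptance rule. The class name, the <code>accept</code> rule and its threshold are assumptions made for this example; they are not part of AddressDoctor or CloverETL, and codes other than C3, C4 and I1 are deliberately not covered.</p>

```java
final class AddressResultSketch {

    /** Describes the three ResultProcessStatus codes listed above; other codes are not covered. */
    static String describe(String processStatus) {
        switch (processStatus) {
            case "C4": return "corrected, all postally relevant elements checked";
            case "C3": return "corrected, some elements could not be checked";
            case "I1": return "could not be corrected, unlikely to be delivered";
            default:   return "unknown (not covered by this sketch)";
        }
    }

    /** Hypothetical routing rule: accept only SERP category C with a score at or above minScore. */
    static boolean accept(String serpCategory, int mailabilityScore, int minScore) {
        return "C".equals(serpCategory) && mailabilityScore >= minScore;
    }

    public static void main(String[] args) {
        // Rows copied from the result table: Locality, SERPCategory, ResultProcessStatus, score.
        String[][] rows = {
            {"TORONTO", "C", "C3", "3"},
            {"LA RONGE", "N", "I1", "0"},
            {"PRINCE ALBERT", "C", "C4", "4"}
        };
        for (String[] row : rows) {
            boolean ok = accept(row[1], Integer.parseInt(row[3]), 3);
            System.out.println(row[0] + ": " + describe(row[2]) + " / " + (ok ? "accept" : "reject"));
        }
    }
}
```

<p>The minimum score of 3 used in <code>main</code> is only an example value; pick a threshold that matches how much risk of undeliverable mail you can tolerate.</p>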
<p><a href="http://www.cloveretl.com/sites/applicationcraft/files/files/blog/AddressDoctor.zip">Download the transformation graph with data</a></p>
]]></content:encoded>
			<wfw:commentRss>http://cloveretl.wordpress.com/2011/08/03/address-cleansing-and-transliteration-with-cloveretl-and-addressdoctor/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/934c88184df6c0034450ae00a1695ee8?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96&#38;r=G" medium="image">
			<media:title type="html">agad</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/08/process.png" medium="image">
			<media:title type="html">Process</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/08/input.png" medium="image">
			<media:title type="html">Input</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/08/result.png" medium="image">
			<media:title type="html">Result</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/08/inputmetadata.png" medium="image">
			<media:title type="html">InputMetadata</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/08/adinputproperties1.png" medium="image">
			<media:title type="html">ADInputProperties</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/08/adinputmapping.png" medium="image">
			<media:title type="html">ADInputMapping</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/08/outputproperties.png" medium="image">
			<media:title type="html">OutputProperties</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/08/output-mapping.png" medium="image">
			<media:title type="html">output mapping</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/08/error-mapping.png" medium="image">
			<media:title type="html">error mapping</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/08/error.png" medium="image">
			<media:title type="html">error</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/08/addresses.png" medium="image">
			<media:title type="html">Addresses</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/08/configuration.png" medium="image">
			<media:title type="html">Configuration</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/08/parameters.png" medium="image">
			<media:title type="html">Parameters</media:title>
		</media:content>
	</item>
		<item>
		<title>Processing Heterogeneous Data with ComplexDataReader</title>
		<link>http://cloveretl.wordpress.com/2011/07/21/processing-heterogeneous-data-with-complexdatareader/</link>
		<comments>http://cloveretl.wordpress.com/2011/07/21/processing-heterogeneous-data-with-complexdatareader/#comments</comments>
		<pubDate>Thu, 21 Jul 2011 08:25:02 +0000</pubDate>
		<dc:creator>javlinkrivanekm</dc:creator>
				<category><![CDATA[Using CloverETL]]></category>
		<category><![CDATA[complexdatareader]]></category>
		<category><![CDATA[example]]></category>
		<category><![CDATA[heterogeneous data]]></category>

		<guid isPermaLink="false">http://blog.cloveretl.com/?p=1067</guid>
		<description><![CDATA[ComplexDataReader &#8211; Example How-to ComplexDataReader is a new component for reading heterogeneous data (data which contains multiple types of records that can also depend on each other) without the need of hard coding. Instead, the component is driven by a state machine which can be set-up using the GUI. The following example will present some [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=cloveretl.wordpress.com&amp;blog=7070972&amp;post=1067&amp;subd=cloveretl&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<h3>ComplexDataReader &#8211; Example How-to</h3>
<p>ComplexDataReader is a new component for reading heterogeneous data (data which contains multiple types of records that can also depend on each other) without the need for hard-coding. Instead, the component is driven by a state machine which can be set up using the GUI.</p>
<p>The following example will present some of the capabilities of ComplexDataReader, as well as guide you through the design of a simple automaton, which is used for processing a text file containing two types of shipments grouped into batches. Each batch starts with a batch header; the number of items in a batch is variable and it is part of the header.</p>
<h3>Input Data</h3>
<p><a href="http://cloveretl.files.wordpress.com/2011/07/data.png"><img class="alignnone size-full wp-image-1068" title="data" src="http://cloveretl.files.wordpress.com/2011/07/data.png" alt="Input Data for ComplexDataReader" width="650" height="193" /></a></p>
<h3>What We Want to Achieve</h3>
<p>For every parcel and every letter, send the <span style="background-color:#ffd320;">address</span> and the <span style="background-color:#aecf00;">charge</span> to the output, and also add the <span style="background-color:#0000ff;color:#fff;">batch ID</span>, <span style="background-color:#008000;color:#fff;">customer ID</span>, and the <span style="background-color:#dc2300;color:#fff;">date</span> from the respective batch header.</p>
<p>The first element of a batch header determines the <span style="background-color:#800000;color:#fff;">type</span> of its elements, and the third element contains the <span style="background-color:#800080;color:#fff;">number of items</span> in the batch.</p>
<h3>Preparation</h3>
<p>Before starting the configuration of the component, all the required metadata should be defined. Also, the component should be connected to the succeeding component(s) and the output edge(s) should have metadata assigned.</p>
<p>You may also set the “File URL” property of the component to point to the input file.</p>
<p>Internal metadata (used for parsing the input):</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/07/md_batch.png"><img class="alignnone size-full wp-image-1069" title="md_batch" src="http://cloveretl.files.wordpress.com/2011/07/md_batch.png" alt="Batch Metadata" width="391" height="146" /></a></p>
<p><a href="http://cloveretl.files.wordpress.com/2011/07/md_parcel.png"><img class="alignnone size-full wp-image-1071" title="md_parcel" src="http://cloveretl.files.wordpress.com/2011/07/md_parcel.png" alt="Parcel Metadata" width="375" height="125" /></a></p>
<p><a href="http://cloveretl.files.wordpress.com/2011/07/md_letter.png"><img class="alignnone size-full wp-image-1070" title="md_letter" src="http://cloveretl.files.wordpress.com/2011/07/md_letter.png" alt="Letter Metadata" width="377" height="90" /></a></p>
<p>Output metadata (used for output mapping):</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/07/md_shipment.png"><img class="alignnone size-full wp-image-1072" title="md_shipment" src="http://cloveretl.files.wordpress.com/2011/07/md_shipment.png" alt="Shipment Metadata" width="371" height="141" /></a></p>
<h3>ComplexDataReader Configuration</h3>
<p>First, we have to design an automaton, which will guide the component through parsing the input. The automaton may look like this:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/07/overview.png"><img class="alignnone size-full wp-image-1073" title="overview" src="http://cloveretl.files.wordpress.com/2011/07/overview.png" alt="ComplexDataReader Automaton" width="251" height="184" /></a></p>
<p>The idea behind it is that we start by reading a batch header, therefore the initial state is set to &#8220;$0 – Batch&#8221;. Then we can decide, depending on the value of the &#8220;type&#8221; field, whether to proceed to &#8220;$1 – Letter&#8221; or &#8220;$2 – Parcel&#8221;. In either of these states, we read as many records as specified in the &#8220;count&#8221; field of the previous batch header, then return to &#8220;$0 – Batch&#8221; and expect a new batch header.</p>
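<p>To make the idea concrete, the automaton can be simulated in a few lines of plain Java. This is only a sketch of the logic, not how ComplexDataReader is implemented. The semicolon-delimited layout is an assumption: the header is taken to be <code>type;batchID;count;customerID;date</code> and each item <code>address;charge</code>. Only the positions of the type (first) and the count (third) come from the description above.</p>

```java
final class BatchAutomatonSketch {

    static final int BATCH = 0;   // state $0
    static final int LETTER = 1;  // state $1
    static final int PARCEL = 2;  // state $2

    /** Runs the automaton over the input lines and returns one output line per shipment. */
    static String run(String[] lines) {
        StringBuilder out = new StringBuilder();
        int state = BATCH;            // initial state: expect a batch header
        int[] counters = new int[3];  // one record counter per state
        String[] header = null;       // last batch header read

        for (String line : lines) {
            String[] fields = line.split(";");
            switch (state) {
                case BATCH:
                    header = fields;
                    counters[LETTER] = 0;  // "Reset counter" action for states $1 and $2
                    counters[PARCEL] = 0;
                    if ("LETTERS".equals(header[0])) {
                        state = LETTER;
                    } else if ("PARCELS".equals(header[0])) {
                        state = PARCEL;
                    } else {
                        // default transition: Fail
                        throw new IllegalStateException("Unexpected batch type: " + header[0]);
                    }
                    break;
                default:  // LETTER or PARCEL: emit the item together with the header fields
                    counters[state]++;
                    out.append(fields[0]).append('|').append(fields[1]).append('|')
                       .append(header[1]).append('|').append(header[3]).append('|')
                       .append(header[4]).append('\n');
                    int count = Integer.parseInt(header[2]);
                    // Stay in this state while more items are expected, else go back to $0.
                    boolean moreItems = count > counters[state];
                    if (!moreItems) {
                        state = BATCH;
                    }
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data in the assumed layout.
        String[] sample = {
            "LETTERS;B1;2;C7;2011-07-21",
            "1 Main St;0.50",
            "2 Oak Ave;0.75",
            "PARCELS;B2;1;C9;2011-07-22",
            "3 Elm Rd;4.20"
        };
        System.out.print(run(sample));
    }
}
```

<p>Note that this sketch, as written, assumes every batch contains at least one item: after a header it always reads the next line as an item.</p>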
<p>To start building the automaton, open the configuration dialog by double-clicking the component and then its &#8220;Transform&#8221; property.</p>
<p>Create three states by dragging the &#8220;batch&#8221;, &#8220;letter&#8221; and &#8220;parcel&#8221; metadata, respectively, from the list of <strong>Available Metadata</strong> on the left to the list of <strong>States</strong> on the right. You can also edit the labels of the states. Set the <strong>Initial state</strong> to &#8220;State $0&#8221; by selecting it from the drop-down list.</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/07/states_hl.png"><img class="alignnone size-full wp-image-1077" title="states_hl" src="http://cloveretl.files.wordpress.com/2011/07/states_hl.png" alt="" width="727" height="718" /></a></p>
<p>Optionally, you may switch to the <strong>Overview</strong> tab and press the <strong>Undock</strong> button to get an interactive overview of the automaton being built.</p>
<p>Switch to the <strong>State $0</strong> tab. This state represents a new batch. Set the automaton to reset the counters for states $1 and $2 by pressing the <strong>Actions</strong> button and ticking <strong>Reset counter</strong> for &#8220;State $1&#8221; and &#8220;State $2&#8221;. Add two rows to the <strong>Transition table</strong>. Set the condition of the first row to <code>$batch.type == "LETTERS"</code> and the condition of the second row to <code>$batch.type == "PARCELS"</code>. Set their target states to &#8220;State $1&#8221; and &#8220;State $2&#8221;, respectively. You may also set the target of the <strong>default</strong> transition to <strong>Fail</strong> to detect unexpected batch types.</p>
<p>Note that in state $0, no output mapping is defined; hence no data will be sent to the output.</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/07/state_batch_hl.png"><img class="alignnone size-full wp-image-1074" title="state_batch_hl" src="http://cloveretl.files.wordpress.com/2011/07/state_batch_hl.png" alt="" width="727" height="760" /></a></p>
<p>The configuration of states $1 and $2 is very similar. In these states we want to produce output, therefore we have to define an output mapping. For example, in state $1 we need to send to the output the &#8220;address&#8221; and &#8220;charge&#8221; fields from internal record $1 (the last letter record) and &#8220;batchID&#8221;, &#8220;customerID&#8221; and &#8220;date&#8221; from internal record $0 (the last batch header record).</p>
<p>For state $1, define the <strong>Output mapping</strong> by dragging row &#8220;$1&#8221; from the left table onto &#8220;Port 0&#8221; in the right table. Then expand row $0 on the left and Port 0 on the right and drag &#8220;batchID&#8221;, &#8220;customerID&#8221; and &#8220;date&#8221; from the left onto &#8220;$0.batchID&#8221;, &#8220;$0.customerID&#8221; and &#8220;$0.date&#8221; on the right, respectively.</p>
<p>Add one row to the <strong>Transition table</strong> and set its condition to <code>counter1 &lt; $batch.count</code> and its target to &#8220;State $1&#8221;. Also set the target of the <strong>default</strong> transition to &#8220;State $0&#8221;.</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/07/state_letter_hl.png"><img class="alignnone size-full wp-image-1075" title="state_letter_hl" src="http://cloveretl.files.wordpress.com/2011/07/state_letter_hl.png" alt="Letter State" width="727" height="760" /></a></p>
<p>Similarly, for state $2, drag row &#8220;$2&#8221; onto &#8220;Port 0&#8221; and &#8220;batchID&#8221;, &#8220;customerID&#8221; and &#8220;date&#8221; from row $0 onto &#8220;$0.batchID&#8221;, &#8220;$0.customerID&#8221; and &#8220;$0.date&#8221;. Add one row to the <strong>Transition table</strong> and set its condition to <code>counter2 &lt; $batch.count</code> and its target to &#8220;State $2&#8221;. Again, set the target of the <strong>default</strong> transition to &#8220;State $0&#8221;.</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/07/state_parcel_hl.png"><img class="alignnone size-full wp-image-1076" title="state_parcel_hl" src="http://cloveretl.files.wordpress.com/2011/07/state_parcel_hl.png" alt="Parcel State" width="727" height="760" /></a></p>
<p><a href="http://www.cloveretl.com/sites/applicationcraft/files/files/blog/ComplexDataReader_HowTo.zip">Download the transformation graph with data</a></p>
]]></content:encoded>
			<wfw:commentRss>http://cloveretl.wordpress.com/2011/07/21/processing-heterogeneous-data-with-complexdatareader/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/751ac603c220b4895e39e703711da114?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96&#38;r=G" medium="image">
			<media:title type="html">javlinkrivanekm</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/07/data.png" medium="image">
			<media:title type="html">data</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/07/md_batch.png" medium="image">
			<media:title type="html">md_batch</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/07/md_parcel.png" medium="image">
			<media:title type="html">md_parcel</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/07/md_letter.png" medium="image">
			<media:title type="html">md_letter</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/07/md_shipment.png" medium="image">
			<media:title type="html">md_shipment</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/07/overview.png" medium="image">
			<media:title type="html">overview</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/07/states_hl.png" medium="image">
			<media:title type="html">states_hl</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/07/state_batch_hl.png" medium="image">
			<media:title type="html">state_batch_hl</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/07/state_letter_hl.png" medium="image">
			<media:title type="html">state_letter_hl</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/07/state_parcel_hl.png" medium="image">
			<media:title type="html">state_parcel_hl</media:title>
		</media:content>
	</item>
		<item>
		<title>Speed-up Installation of Plugins in Eclipse</title>
		<link>http://cloveretl.wordpress.com/2011/06/21/speed-up-installation-of-plugins-in-eclipse/</link>
		<comments>http://cloveretl.wordpress.com/2011/06/21/speed-up-installation-of-plugins-in-eclipse/#comments</comments>
		<pubDate>Tue, 21 Jun 2011 12:56:54 +0000</pubDate>
		<dc:creator>Jaroslav Urban</dc:creator>
				<category><![CDATA[Others]]></category>
		<category><![CDATA[Using CloverETL]]></category>
		<category><![CDATA[Eclipse]]></category>
		<category><![CDATA[installation]]></category>

		<guid isPermaLink="false">http://blog.cloveretl.com/?p=1060</guid>
		<description><![CDATA[Installing plug-ins into Eclipse can in some situations take a long time. This can also affect users of CloverETL Designer if they choose the Online or Offline Eclipse Plugin Installation download type. This blog post describes a workaround for the slow install process. The reason for the long installation time is that by default Eclipse [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=cloveretl.wordpress.com&amp;blog=7070972&amp;post=1060&amp;subd=cloveretl&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>Installing plug-ins into Eclipse can in some situations take a long time. This can also affect users of CloverETL Designer if they choose the Online or Offline Eclipse Plugin Installation download type. This blog post describes a workaround for the slow install process.</p>
<p>The reason for the long installation time is that by default Eclipse contacts all available update sites to try to resolve dependencies of the plugin being installed. There can be a large number of update sites, some of them may be slow or not responding at all, and the connection overall may be poor. To stop Eclipse from contacting all update sites, uncheck the &#8220;Contact all update sites during install to find required software&#8221; checkbox in the installation dialog:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/06/contact_all_update_sites_checkbox.png"><img class="alignnone size-full wp-image-1062" title="contact_all_update_sites_checkbox" src="http://cloveretl.files.wordpress.com/2011/06/contact_all_update_sites_checkbox.png" alt="Contact all update sites checkbox" width="735" height="776" /></a></p>
<p>This workaround can help not only when installing CloverETL Designer, but also when using Eclipse in general. However, it does have a drawback &#8211; some dependencies of the plugin being installed might not be resolved. This can also happen when installing CloverETL Designer, because it depends on the GEF and RSE plugins. These plugins are found in the main Eclipse update site, which is not contacted when the described workaround is used. Eclipse will detect that some dependencies of CloverETL Designer are not met and will not proceed with the installation:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/06/clover_dependencies_not_met.png"><img class="alignnone size-full wp-image-1061" title="clover_dependencies_not_met" src="http://cloveretl.files.wordpress.com/2011/06/clover_dependencies_not_met.png" alt="CloverETL dependencies not met" width="736" height="730" /></a></p>
<p>In case some dependencies are not resolved, there are two options:</p>
<ul>
<li> find and install the dependencies manually (in case of GEF and RSE they can be found in the main Eclipse update site)</li>
<li> accept the long installation time and re-enable the checkbox; Eclipse will then resolve the dependencies automatically</li>
</ul>
<p>Hopefully this hint will help some users of CloverETL Designer or Eclipse with slow installations. However, it&#8217;s important to understand that it&#8217;s NOT mandatory to use the workaround, as the installation is quick in many cases &#8211; use it only in case of issues.</p>
]]></content:encoded>
			<wfw:commentRss>http://cloveretl.wordpress.com/2011/06/21/speed-up-installation-of-plugins-in-eclipse/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/e2a5af2f4ce1baa2e6bf56ec03fe0565?s=96&#38;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96&#38;r=G" medium="image">
			<media:title type="html">jaroslavurban</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/06/contact_all_update_sites_checkbox.png" medium="image">
			<media:title type="html">contact_all_update_sites_checkbox</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/06/clover_dependencies_not_met.png" medium="image">
			<media:title type="html">clover_dependencies_not_met</media:title>
		</media:content>
	</item>
		<item>
		<title>Handling of JSON Objects in CloverETL</title>
		<link>http://cloveretl.wordpress.com/2011/05/10/handling-of-json-objects-in-cloveretl/</link>
		<comments>http://cloveretl.wordpress.com/2011/05/10/handling-of-json-objects-in-cloveretl/#comments</comments>
		<pubDate>Tue, 10 May 2011 14:14:00 +0000</pubDate>
		<dc:creator>Agata Vackova</dc:creator>
				<category><![CDATA[Using CloverETL]]></category>
		<category><![CDATA[JSON]]></category>
		<category><![CDATA[XML]]></category>

		<guid isPermaLink="false">http://blog.cloveretl.com/?p=1025</guid>
		<description><![CDATA[JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is language independent but uses conventions that are familiar to programmers. This format is often used for serializing and transmitting structured data over a network connection. It is primarily used to transmit data between a server and a web application, serving as an alternative to XML. CloverETL doesn&#8217;t [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=cloveretl.wordpress.com&amp;blog=7070972&amp;post=1025&amp;subd=cloveretl&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><a title="JSON" href="http://www.json.org/" target="_blank"><strong>JSON</strong></a> (JavaScript Object Notation) is a lightweight data-interchange format. It is language independent but uses conventions that are familiar to programmers. This format is often used for serializing and transmitting structured data over a network connection. It is primarily used to transmit data between a server and a web application, serving as an alternative to XML. CloverETL doesn&#8217;t have any JSONReader/Writer components, but has numerous components that can handle XML files. So all you need to do is convert the JSON structure to an XML one or vice versa. If you have an XSLT 1.0 stylesheet that transforms an XML file into a JSON object, you can use the XSLTransformer component. I downloaded <a title="xml2json-xslt" href="http://code.google.com/p/xml2json-xslt/" target="_blank">xml2json-xslt</a> and created a transformation graph with only one component:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/05/xml2json5.png"><img class="alignnone size-full wp-image-1034" title="xml2json" src="http://cloveretl.files.wordpress.com/2011/05/xml2json5.png" alt="" width="960" height="689" /></a></p>
<p>That converts an XML file into a JSON structure.</p>
<p>Another solution is to use the <a href="http://www.json.org/java/index.html" target="_blank">JSON Java API</a> and create a transformation in Java as an attribute of the Reformat component. Only simple transformations need to be written:</p>
<ul>
<li>
<h4>To transform XML to JSON:</h4>
</li>
</ul>
<pre style="padding-left:60px;">import org.jetel.component.DataRecordTransform;
import org.jetel.data.DataRecord;
import org.jetel.exception.ComponentNotReadyException;
import org.jetel.exception.TransformException;
import org.json.JSONException;
import org.json.XML;

public class XML2JSON extends DataRecordTransform {

    // Number of the record being processed; passed to TransformException on failure.
    int counter;

    @Override
    public void preExecute() throws ComponentNotReadyException {
        super.preExecute();
        counter = 1;
    }

    @Override
    public int transform(DataRecord[] inputRecords, DataRecord[] outputRecords)
            throws TransformException {
        try {
            outputRecords[0].getField(0).setValue(XML.toJSONObject(inputRecords[0].getField(0).toString()).toString());
        } catch (JSONException e) {
            throw new TransformException("Can't convert XML to JSON.", e, counter, 0);
        }
        counter++;
        return 0;
    }

}</pre>
<ul>
<li>
<h4>To transform JSON to XML:</h4>
</li>
</ul>
<pre style="padding-left:60px;">import org.jetel.component.DataRecordTransform;
import org.jetel.data.DataRecord;
import org.jetel.exception.ComponentNotReadyException;
import org.jetel.exception.TransformException;
import org.json.JSONException;
import org.json.JSONObject;
import org.json.XML;

public class JSON2XML extends DataRecordTransform {
    // Number of the record being processed; passed to TransformException on failure.
    int counter;

    @Override
    public void preExecute() throws ComponentNotReadyException {
        super.preExecute();
        counter = 1;
    }

    @Override
    public int transform(DataRecord[] inputRecords, DataRecord[] outputRecords)
            throws TransformException {
        try {
            outputRecords[0].getField(0).setValue(XML.toString(new JSONObject(inputRecords[0].getField(0).toString())));
        } catch (JSONException e) {
            throw new TransformException("Can't convert JSON to XML.", e, counter, 0);
        }
        counter++;
        return 0;
    }

}</pre>
<p>Starting with CloverETL 3.1, the implementation above is going to be embedded into the CloverETL Engine in the form of the <em>xml2json</em> and <em>json2xml</em> CTL functions. Your transformations then become very simple:</p>
<ul>
<li>
<h4>To transform XML to JSON:</h4>
<p><a href="http://cloveretl.files.wordpress.com/2011/05/xml2jsontl2.png"><img class="alignnone size-full wp-image-1053" title="xml2jsonTL2" src="http://cloveretl.files.wordpress.com/2011/05/xml2jsontl2.png" alt="" width="718" height="736" /></a></li>
<li>
<h4>To transform JSON to XML:</h4>
<p><a href="http://cloveretl.files.wordpress.com/2011/05/json2xmltl2.png"><img class="alignnone size-full wp-image-1054" title="json2xmlTL2" src="http://cloveretl.files.wordpress.com/2011/05/json2xmltl2.png" alt="" width="707" height="721" /></a></li>
</ul>
<p>The above transformation converts the following XML structure:</p>
<pre style="padding-left:60px;">&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
&lt;employees&gt;
    &lt;employee&gt;
        &lt;empID&gt;1&lt;/empID&gt;
        &lt;jobID&gt;3&lt;/jobID&gt;
        &lt;salary&gt;5000&lt;/salary&gt;
        &lt;name&gt;
            &lt;firstname&gt;Mark&lt;/firstname&gt;
            &lt;surname&gt;Fish&lt;/surname&gt;
        &lt;/name&gt;
        &lt;child&gt;
            &lt;chname&gt;Ann&lt;/chname&gt;
            &lt;age&gt;4&lt;/age&gt;
        &lt;/child&gt;
        &lt;child&gt;
            &lt;chname&gt;Mark&lt;/chname&gt;
            &lt;age&gt;6&lt;/age&gt;
        &lt;/child&gt;
        &lt;benefits&gt;
            &lt;car&gt;none&lt;/car&gt;
            &lt;mobilephone&gt;yes&lt;/mobilephone&gt;
            &lt;financial&gt;
                &lt;monthly_bonus&gt;1000&lt;/monthly_bonus&gt;
                &lt;yearly_bonus&gt;0&lt;/yearly_bonus&gt;    
            &lt;/financial&gt;
        &lt;/benefits&gt;
        &lt;project&gt;
            &lt;projName&gt;JSP&lt;/projName&gt;
            &lt;projManager&gt;John Smith&lt;/projManager&gt;
            &lt;inProjectID&gt;34&lt;/inProjectID&gt;
            &lt;Start&gt;06062006&lt;/Start&gt;
            &lt;End&gt;in progress&lt;/End&gt;
            &lt;customer&gt;
                &lt;name&gt;Sunny&lt;/name&gt;
            &lt;/customer&gt;
            &lt;customer&gt;
                &lt;name&gt;Weblea&lt;/name&gt;
            &lt;/customer&gt;
        &lt;/project&gt;
        &lt;project&gt;
            &lt;projName&gt;Data warehouse&lt;/projName&gt;
            &lt;projManager&gt;John Major&lt;/projManager&gt;
            &lt;inProjectID&gt;51&lt;/inProjectID&gt;
            &lt;Start&gt;01062005&lt;/Start&gt;
            &lt;End&gt;31052006&lt;/End&gt;
            &lt;customer&gt;
                &lt;name&gt;Hanuman&lt;/name&gt;
            &lt;/customer&gt;
            &lt;customer&gt;
                &lt;name&gt;Weblea&lt;/name&gt;
            &lt;/customer&gt;
            &lt;customer&gt;
                &lt;name&gt;SomeBank&lt;/name&gt;
            &lt;/customer&gt;
        &lt;/project&gt;
    &lt;/employee&gt;
&lt;/employees&gt;</pre>
<p>into the following JSON object (formatted manually here for readability):</p>
<pre style="padding-left:60px;">{"employees":
    {"employee":{
        "child":[
            {"chname":"Ann","age":4},
            {"chname":"Mark","age":6}
        ],
        "project":[{
            "End":"in progress",
            "projManager":"John Smith",
            "Start":6062006,
            "inProjectID":34,
            "projName":"JSP",
            "customer":[
                {"name":"Sunny"},
                {"name":"Weblea"}
            ]},{
            "End":31052006,
            "projManager":"John Major",
            "Start":1062005,
            "inProjectID":51,
            "projName":"Data warehouse",
            "customer":[
                {"name":"Hanuman"},
                {"name":"Weblea"},
                {"name":"SomeBank"}
            ]}
        ],
        "jobID":3,
        "empID":1,
        "name":
            {"surname":"Fish","firstname":"Mark"},
        "benefits":{
            "financial":
                {"monthly_bonus":1000,"yearly_bonus":"0"},
            "car":"none",
            "mobilephone":"yes"},
        "salary":5000}
    }
}</pre>
<p><a href="http://www.cloveretl.com/sites/applicationcraft/files/files/blog/JSON.zip">Download the example transformation graph</a></p>
]]></content:encoded>
			<wfw:commentRss>http://cloveretl.wordpress.com/2011/05/10/handling-of-json-objects-in-cloveretl/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/934c88184df6c0034450ae00a1695ee8?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96&#38;r=G" medium="image">
			<media:title type="html">agad</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/05/xml2json5.png" medium="image">
			<media:title type="html">xml2json</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/05/xml2jsontl2.png" medium="image">
			<media:title type="html">xml2jsonTL2</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/05/json2xmltl2.png" medium="image">
			<media:title type="html">json2xmlTL2</media:title>
		</media:content>
	</item>
		<item>
		<title>Usage of Internal and External Graph Elements</title>
		<link>http://cloveretl.wordpress.com/2011/04/12/usage-of-internal-and-external-graph-elements/</link>
		<comments>http://cloveretl.wordpress.com/2011/04/12/usage-of-internal-and-external-graph-elements/#comments</comments>
		<pubDate>Tue, 12 Apr 2011 09:20:15 +0000</pubDate>
		<dc:creator>tomaswaller</dc:creator>
				<category><![CDATA[Using CloverETL]]></category>
		<category><![CDATA[data processing]]></category>
		<category><![CDATA[graph elements]]></category>

		<guid isPermaLink="false">http://blog.cloveretl.com/?p=993</guid>
		<description><![CDATA[GRAPH ELEMENTS In addition to components, edges, notes, etc., any CloverETL graphs may contain the following graph elements: Metadata Connections Lookup tables Sequences Parameters Each of these graph elements may be either created and written in the graph XML source file, or specified in an external file and linked to the graph. Such elements are [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=cloveretl.wordpress.com&amp;blog=7070972&amp;post=993&amp;subd=cloveretl&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<h3>GRAPH ELEMENTS</h3>
<p>In addition to components, edges, notes, etc., any <strong>CloverETL</strong> graph may contain the following graph elements:</p>
<ul>
<li>Metadata</li>
<li>Connections</li>
<li>Lookup tables</li>
<li>Sequences</li>
<li>Parameters</li>
</ul>
<p>Each of these graph elements may either be defined directly in the graph XML source file, or be specified in an external file and linked to the graph. Such elements are called <em>internal</em> and <em>external</em>, respectively. For an external element, the graph XML file contains only a link to the file holding the element definition.</p>
<h3>Internal Graph Elements</h3>
<p>A graph XML file can contain <em>internal</em> graph elements like those in the following examples:</p>
<pre><code>&lt;Metadata id="Metadata1"&gt;
    &lt;Record fieldDelimiter="|" name="LUT" recordDelimiter="\r\n" type="delimited"&gt;
        &lt;Field name="field1" type="integer"/&gt;
        &lt;Field name="field2" type="string"/&gt;
    &lt;/Record&gt;
&lt;/Metadata&gt;

&lt;Connection database="POSTGRE" dbURL="jdbc:postgresql://hostname/database" id="JDBC0" jdbcSpecific="POSTGRE" name="NewConnection" password="mypassword" type="JDBC" user="username"/&gt;

&lt;LookupTable charset="ISO-8859-1" id="LookupTable0" initialSize="512" key="field1" keyDuplicates="true" metadata="Metadata1" name="simpleLookup0" type="simpleLookup"/&gt;

&lt;Sequence cached="1" fileURL="${SEQ_DIR}/seq_withdata.txt" id="Sequence0" name="seq" start="1" step="1" type="SIMPLE_SEQUENCE"/&gt;

&lt;Property id="GraphParameter1" name="NUMBER" value="6"/&gt;</code></pre>
<p>(In addition to these <em>internal</em> graph elements, the graph contains a link to the <code>workspace.prm</code> file, which defines project parameters such as “CONN_DIR” and “META_DIR”.)</p>
<p><code>&lt;Property fileURL="workspace.prm" id="GraphParameter0"/&gt;</code></p>
<p>The resulting <strong>Outline</strong> pane looks like this:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/04/00_internalgraphelements.png"><img class="alignnone size-full wp-image-997" title="00_InternalGraphElements" src="http://cloveretl.files.wordpress.com/2011/04/00_internalgraphelements.png" alt="Internal Graph Elements" width="349" height="344" /></a></p>
<h3>External Graph Elements</h3>
<p>To link an <em>external</em> graph element, right-click one of the five categories (<strong>Metadata</strong>, <strong>Connections</strong>, <strong>Parameters</strong>, <strong>Sequences</strong>, or <strong>Lookups</strong>) in the <strong>Outline</strong> pane and select the corresponding item from the context menu:</p>
<ul>
<li>New metadata → Link shared definition</li>
<li>Connections → Link DB connection, Link JMS connection, or Link QuickBase connection</li>
<li>Parameters → Link parameter file</li>
<li>Sequences → Link shared sequence</li>
<li>Lookup tables → Link shared lookup table</li>
</ul>
<p>After that, the <strong>URL Dialog</strong> opens.</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/04/urldialog.png"><img class="alignnone size-full wp-image-1003" title="URLDialog" src="http://cloveretl.files.wordpress.com/2011/04/urldialog.png" alt="URL Dialog" width="800" height="600" /></a></p>
<p>The dialog consists of three tabs: <strong>Workspace view</strong>, <strong>Local files</strong>, and <strong>Remote files</strong>. They let you browse files within your workspace, local files outside your workspace, and files located on a remote computer, respectively.</p>
<h4>Files with graph elements are located in the same project as the graph itself.</h4>
<p>To browse your project, you need to use the <strong>Workspace view</strong> tab of the <strong>URL Dialog</strong>.</p>
<p>The <code>workspace.prm</code> file in the project defines parameters for the <code>conn</code>, <code>meta</code>, <code>lookup</code>, and <code>seq</code> directories (among others). They are named “CONN_DIR”, “META_DIR”, “LOOKUP_DIR”, and “SEQ_DIR”, respectively. A parameter is resolved to its value when its name is written inside a dollar sign and curly brackets, for example “${META_DIR}”.</p>
<p>When you link a file with a graph element definition that is located in the project, the “_DIR” suffix of these names ensures that the path to the <code>conn</code>, <code>meta</code>, <code>lookup</code>, or <code>seq</code> directory is automatically replaced with the corresponding parameter.</p>
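<p>To make the “${META_DIR}” notation concrete, here is a minimal Java sketch of how such references could be expanded. It is an illustration only, not CloverETL's own resolver: the <code>ParamResolver</code> class, its <code>resolve</code> method, and the directory values <code>./meta</code> and <code>./seq</code> are assumptions made for this example.</p>

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; it is not part of CloverETL.
final class ParamResolver {

    // Matches a reference of the form ${NAME}; group 1 captures NAME.
    private static final Pattern REFERENCE = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces every ${NAME} whose NAME is defined in params.
    // References to undefined names are copied through unchanged.
    static String resolve(String text, Properties params) {
        Matcher matcher = REFERENCE.matcher(text);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String value = params.getProperty(matcher.group(1));
            String replacement = (value != null) ? value : matcher.group();
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Properties params = new Properties();
        // Assumed values; a real workspace.prm may point elsewhere.
        params.setProperty("META_DIR", "./meta");
        params.setProperty("SEQ_DIR", "./seq");

        System.out.println(resolve("${META_DIR}/LUT.fmt", params)); // ./meta/LUT.fmt
        System.out.println(resolve("${SEQ_DIR}/seq.cfg", params));  // ./seq/seq.cfg
    }
}
```

<p>Leaving undefined references untouched is a design choice of this sketch; it keeps a missing parameter visible in the resulting path instead of silently producing an empty directory name.</p>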
<p>The graph XML file contains the following links to metadata, db connection, simple sequence, simple lookup table, and parameter files:</p>
<pre><code>&lt;Metadata fileURL="${META_DIR}/LUT.fmt" id="Metadata1"/&gt;

&lt;Connection dbConfig="${CONN_DIR}/NewConnection.cfg" id="JDBC0" type="JDBC"/&gt;

&lt;LookupTable id="LookupTable0" lookupConfig="${LOOKUP_DIR}/simpleLookup0.cfg"/&gt;

&lt;Sequence id="Sequence0" seqConfig="${SEQ_DIR}/seq.cfg"&gt;
    &lt;attr name="type"&gt;&lt;![CDATA[SIMPLE_SEQUENCE]]&gt;&lt;/attr&gt;
&lt;/Sequence&gt;

&lt;Property fileURL="parameters.prm" id="GraphParameter1"/&gt;</code></pre>
<p>(In addition to the graph elements mentioned above, the graph contains a link to the <code>workspace.prm</code> file, which defines project parameters such as “CONN_DIR” and “META_DIR”.)</p>
<p><code>&lt;Property fileURL="workspace.prm" id="GraphParameter0"/&gt;</code></p>
<p>The resulting <strong>Outline</strong> pane looks like this:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/04/01_externalgraphelements.png"><img class="alignnone size-full wp-image-998" title="01_ExternalGraphElements" src="http://cloveretl.files.wordpress.com/2011/04/01_externalgraphelements.png" alt="External Graph Elements" width="393" height="544" /></a></p>
<h4>Files with graph elements are located outside the project containing the graph.</h4>
<p>To locate external graph elements OUTSIDE the project, you need to use the <strong>Local files</strong> tab of the <strong>URL Dialog</strong>. It lets you browse the file system of your local computer.</p>
<p>You can also define a parameter for the whole path, using the “_DIR” suffix mentioned above. If you have defined such a parameter (e.g., “PATH_TO_GRAPH_ELEMENTS_DIR”), the whole path is automatically replaced with this parameter (as ${PATH_TO_GRAPH_ELEMENTS_DIR}).</p>
<p>Remember that you need to define such a parameter BEFORE you link a graph element from the specified location; otherwise, the path will not be replaced with the parameter name!</p>
<p>The whole path to the graph elements is specified in the graph XML file. The graph XML file contains the following links to metadata, db connection, simple sequence, simple lookup table, and parameter files:</p>
<pre><code>&lt;Metadata fileURL="D:/ExternalGraphElements/meta/HashJoinInput.fmt" id="Metadata0"/&gt;

&lt;Metadata fileURL="${PATH_TO_GRAPH_ELEMENTS_DIR}/meta/LUT.fmt" id="Metadata1"/&gt;

&lt;Connection dbConfig="${PATH_TO_GRAPH_ELEMENTS_DIR}/NewConnection.cfg" id="JDBC0" type="JDBC"/&gt;

&lt;LookupTable id="LookupTable0" lookupConfig="${PATH_TO_GRAPH_ELEMENTS_DIR}/simpleLookup0.cfg"/&gt;

&lt;Sequence id="Sequence0" seqConfig="${PATH_TO_GRAPH_ELEMENTS_DIR}/seq.cfg"&gt;
    &lt;attr name="type"&gt;&lt;![CDATA[SIMPLE_SEQUENCE]]&gt;&lt;/attr&gt;
&lt;/Sequence&gt;

&lt;Property fileURL="parameters.prm" id="GraphParameter1"/&gt;</code></pre>
<p>The last mentioned <code>parameters.prm</code> file defines two parameters:</p>
<p>NUMBER, whose value is 6</p>
<p>and</p>
<p>PATH_TO_GRAPH_ELEMENTS_DIR, whose value is <code>D:/ExternalGraphElements</code>.</p>
<p>(In addition to the graph elements mentioned above, the graph contains a link to the <code>workspace.prm</code> file, which defines project parameters such as “CONN_DIR” and “META_DIR”.)</p>
<p><code>&lt;Property fileURL="workspace.prm" id="GraphParameter0"/&gt;</code></p>
<p><strong>Note:</strong></p>
<p>Recall that the value of the PATH_TO_GRAPH_ELEMENTS_DIR parameter is <code>D:/ExternalGraphElements</code>.</p>
<ul>
<li>The first external metadata element (with the “<code>Metadata0</code>” id) was linked BEFORE the PATH_TO_GRAPH_ELEMENTS_DIR parameter was defined. Its path was therefore NOT converted into ${PATH_TO_GRAPH_ELEMENTS_DIR}/&#8230;</li>
<li>The following elements (the one with the “<code>Metadata1</code>” id and all the others) were automatically converted to ${PATH_TO_GRAPH_ELEMENTS_DIR}/&#8230; AS SOON AS they were linked to the graph.</li>
</ul>
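<p>The before/after behaviour described in this note can be modelled with a short Java sketch. It is only a hypothetical model: the <code>PathParameterizer</code> class and its <code>parameterize</code> method are invented for illustration, and the rule that the longest matching “_DIR” value wins is an assumption, since the exact matching rules of the Designer are not described here.</p>

```java
import java.util.Properties;

// Hypothetical model of the path replacement described above; not CloverETL code.
final class PathParameterizer {

    // Returns the path with its leading directory replaced by ${NAME} when a
    // parameter whose name ends in "_DIR" holds that directory. If no such
    // parameter is defined, the path is returned unchanged.
    static String parameterize(String path, Properties params) {
        String bestName = null;
        String bestValue = "";
        for (String name : params.stringPropertyNames()) {
            if (!name.endsWith("_DIR")) {
                continue; // only "_DIR" parameters take part in the replacement
            }
            String value = params.getProperty(name);
            if (path.startsWith(value + "/")) {
                // Assumption: prefer the longest (most specific) matching directory.
                if (value.length() > bestValue.length()) {
                    bestName = name;
                    bestValue = value;
                }
            }
        }
        if (bestName == null) {
            return path;
        }
        return "${" + bestName + "}" + path.substring(bestValue.length());
    }

    public static void main(String[] args) {
        String path = "D:/ExternalGraphElements/meta/LUT.fmt";
        Properties params = new Properties();

        // Linked before the parameter exists: the full path is kept.
        System.out.println(parameterize(path, params));

        // Linked after the parameter is defined: the prefix becomes a reference.
        params.setProperty("PATH_TO_GRAPH_ELEMENTS_DIR", "D:/ExternalGraphElements");
        System.out.println(parameterize(path, params));
    }
}
```

<p>The first call prints the absolute path, as with “<code>Metadata0</code>”; the second prints <code>${PATH_TO_GRAPH_ELEMENTS_DIR}/meta/LUT.fmt</code>, as with “<code>Metadata1</code>”.</p>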
<p>The resulting <strong>Outline</strong> pane looks like this:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/04/02_externaloutsideproject1.png"><img class="alignnone size-full wp-image-1019" title="02_ExternalOutsideProject" src="http://cloveretl.files.wordpress.com/2011/04/02_externaloutsideproject1.png" alt="External Outside Project" width="541" height="380" /></a></p>
<h4>Files with graph elements are located on a remote computer.</h4>
<p>To locate files on a remote computer, you need to use the <strong>Remote files</strong> tab of the <strong>URL Dialog</strong>. It lets you specify the details of the remote computer's file system.</p>
<p>To connect to a remote computer, click the <strong>Create/Edit URL</strong> button to the right of the <strong>Server URL</strong> combo box. In the <strong>Edit URL Dialog</strong> that opens, you need to specify all the authentication details:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/04/editurldialog.png"><img class="alignnone size-full wp-image-1002" title="EditURLDialog" src="http://cloveretl.files.wordpress.com/2011/04/editurldialog.png" alt="Edit URL Dialog" width="710" height="513" /></a></p>
<p>The general structure of a remote path is:</p>
<p>&lt;protocol&gt;://&lt;username&gt;:&lt;password&gt;@&lt;hostname|IP&gt;:&lt;portnumber&gt;/&lt;pathtoexternalelements&gt;</p>
<p>Supported protocols are: <code>http, https, ftp, ftps, sftp</code>. The first two do not allow browsing the remote file system, whereas the other three do.</p>
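<p>As a rough illustration of this structure, the following Java sketch splits a remote path into its parts with the standard <code>java.net.URI</code> class. The <code>RemoteUrlParts</code> class and its methods are hypothetical names chosen for this example; the list of browsable protocols simply restates the sentence above and is not read from CloverETL.</p>

```java
import java.net.URI;
import java.util.Arrays;

// Hypothetical helper for illustration; it is not part of CloverETL.
final class RemoteUrlParts {

    // Protocols that allow browsing the remote file system, per the text above.
    static boolean isBrowsable(String protocol) {
        return Arrays.asList("ftp", "ftps", "sftp").contains(protocol);
    }

    // Summarizes the parts of a remote path; the password is deliberately not echoed.
    static String describe(String remotePath) {
        URI uri = URI.create(remotePath);
        String userInfo = uri.getUserInfo(); // "username:password", or null if absent
        String user = (userInfo == null) ? "" : userInfo.split(":", 2)[0];
        // getPort() returns -1 when the URL does not name a port explicitly.
        String port = (uri.getPort() == -1) ? "default" : String.valueOf(uri.getPort());
        return "protocol=" + uri.getScheme()
                + " user=" + user
                + " host=" + uri.getHost()
                + " port=" + port
                + " path=" + uri.getPath()
                + " browsable=" + isBrowsable(uri.getScheme());
    }

    public static void main(String[] args) {
        System.out.println(describe(
                "sftp://smithjohn:1a2b3c@192.168.1.12/home/smithjohn/meta/LUT.fmt"));
    }
}
```

<p>Because the credentials are part of the URL, they end up in plain text wherever the URL is stored, which is one reason this sketch avoids printing the password.</p>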
<p>The whole path to the graph elements is specified in the graph XML file. The graph XML file contains the following links to metadata, db connection, simple sequence, simple lookup table, and parameter files:</p>
<pre>&lt;Metadata fileURL="sftp://smithjohn:1a2b3c@192.168.1.12/home/smithjohn/meta/LUT.fmt" id="Metadata1"/&gt;

&lt;Connection dbConfig="sftp://smithjohn:1a2b3c@192.168.1.12/home/smithjohn/conn/NewConnection.cfg" id="JDBC0" type="JDBC"/&gt;

&lt;LookupTable id="LookupTable0"
lookupConfig="sftp://smithjohn:1a2b3c@192.168.1.12/home/smithjohn/lookup/simpleLookup0.cfg"/&gt;

&lt;Sequence id="Sequence0" seqConfig="sftp://smithjohn:1a2b3c@192.168.1.12/home/smithjohn/seq/seq.cfg"&gt;
&lt;attr name="type"&gt;&lt;![CDATA[SIMPLE_SEQUENCE]]&gt;&lt;/attr&gt;
&lt;/Sequence&gt;

&lt;Property fileURL="parameters.prm" id="GraphParameter1"/&gt;</pre>
<p>The last mentioned <code>parameters.prm</code> file defines two parameters:</p>
<p>NUMBER, whose value is 6</p>
<p>and</p>
<p>PATH_TO_REMOTE_GRAPH_ELEMENTS_DIR, whose value is <code>sftp://smithjohn:1a2b3c@192.168.1.12/home/smithjohn/ExternalGraphElements</code>.</p>
<p>(In addition to the graph elements mentioned above, the graph contains a link to the <code>workspace.prm</code> file, which defines project parameters such as “CONN_DIR” and “META_DIR”.)</p>
<pre><code>&lt;Property fileURL="workspace.prm" id="GraphParameter0"/&gt; </code></pre>
<p>The resulting <strong>Outline</strong> pane without parameter usage looks like this:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/04/03_remotegraphelements.png"><img class="alignnone size-full wp-image-1000" title="03_RemoteGraphElements" src="http://cloveretl.files.wordpress.com/2011/04/03_remotegraphelements.png" alt="Remote Graph Elements" width="795" height="507" /></a></p>
<p><strong>Note:</strong></p>
<p>Remember that the paths to linked external elements located on a remote computer are NOT automatically rewritten to use ${PATH_TO_REMOTE_GRAPH_ELEMENTS_DIR} in place of <code>sftp://smithjohn:1a2b3c@192.168.1.12/home/smithjohn/ExternalGraphElements</code>.</p>
<p>You may switch to the <strong>Source</strong> tab of your graph and replace <code>sftp://smithjohn:1a2b3c@192.168.1.12/home/smithjohn/ExternalGraphElements</code> with ${PATH_TO_REMOTE_GRAPH_ELEMENTS_DIR} by hand.</p>
<p>After the paths have been replaced with the graph parameter in the <strong>Source</strong> tab of the <strong>Graph editor</strong>, the <strong>Outline</strong> pane looks like this:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/04/03_remotegraphelementswithparameter.png"><img class="alignnone size-full wp-image-1001" title="03_RemoteGraphElementsWithParameter" src="http://cloveretl.files.wordpress.com/2011/04/03_remotegraphelementswithparameter.png" alt="Remote Graph Elements With Parameter" width="775" height="380" /></a></p>
<h3>COMPARISON OF INTERNAL AND EXTERNAL GRAPH ELEMENTS</h3>
<p>All the graph elements, both the <em>internal</em> and the <em>external</em>, may be converted into the other form. Any <em>internal</em> element may become an <em>external</em> one, and vice versa.</p>
<p>For more details consult our <a href="http://www.cloveretl.com/documentation/UserGuide/topic/com.cloveretl.gui.docs/docs/internal-external-graph-elements.html">documentation</a>.</p>
<h3>VARIOUS FORMATS OF GRAPH ELEMENTS MAY BE USED AT THE SAME TIME</h3>
<p>Remember that you can combine all forms of graph elements in a single graph: <em>internal</em> elements, <em>external</em> elements located on the <em>local</em> computer, and <em>external</em> elements located <em>remotely</em> and accessed via <em>various protocols</em>.</p>
]]></content:encoded>
			<wfw:commentRss>http://cloveretl.wordpress.com/2011/04/12/usage-of-internal-and-external-graph-elements/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2c39e475cb676a45a1a1e3a6ee1bf9b1?s=96&#38;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96&#38;r=G" medium="image">
			<media:title type="html">tomaswaller</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/04/00_internalgraphelements.png" medium="image">
			<media:title type="html">00_InternalGraphElements</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/04/urldialog.png" medium="image">
			<media:title type="html">URLDialog</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/04/01_externalgraphelements.png" medium="image">
			<media:title type="html">01_ExternalGraphElements</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/04/02_externaloutsideproject1.png" medium="image">
			<media:title type="html">02_ExternalOutsideProject</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/04/editurldialog.png" medium="image">
			<media:title type="html">EditURLDialog</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/04/03_remotegraphelements.png" medium="image">
			<media:title type="html">03_RemoteGraphElements</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/04/03_remotegraphelementswithparameter.png" medium="image">
			<media:title type="html">03_RemoteGraphElementsWithParameter</media:title>
		</media:content>
	</item>
		<item>
		<title>Launch Services &#8211; Part 2 &#8211; Configuration</title>
		<link>http://cloveretl.wordpress.com/2011/03/22/launch-services-part-2-configuration/</link>
		<comments>http://cloveretl.wordpress.com/2011/03/22/launch-services-part-2-configuration/#comments</comments>
		<pubDate>Tue, 22 Mar 2011 13:42:25 +0000</pubDate>
		<dc:creator>csochor</dc:creator>
				<category><![CDATA[Using CloverETL]]></category>
		<category><![CDATA[CloverETL Server]]></category>
		<category><![CDATA[Launch Services]]></category>

		<guid isPermaLink="false">http://blog.cloveretl.com/?p=982</guid>
		<description><![CDATA[In the last blog post, you learned what the Launch Services are. In this post you will see how to configure them. Let us study an example scenario to become acquainted with configuration. We have a database containing the highest mountains on Earth along with their heights. The user enters an elevation above sea-level and [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=cloveretl.wordpress.com&amp;blog=7070972&amp;post=982&amp;subd=cloveretl&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>In the <a href="http://blog.cloveretl.com/2011/01/17/launch-services-first-steps/">last blog post</a>, you learned what the Launch Services are. In this post you will see how to configure them.</p>
<p>Let us study an example scenario to become acquainted with the configuration. We have a database containing the highest mountains on Earth along with their heights. The user enters an elevation above sea level and presses the Enter key. An Excel sheet is then displayed, listing all mountains that reach at least the given elevation.</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/03/mountains.png"><img class="alignnone size-full wp-image-984" title="mountains" src="http://cloveretl.files.wordpress.com/2011/03/mountains.png" alt="Mountains example of Launch Services" width="1006" height="244" /></a></p>
<h3>How to Configure It?</h3>
<p>First, we must create a transformation graph that uses a <strong>dictionary</strong> to receive parameters and to store results. Create a new graph in CloverETL Designer. In the Outline pane, right-click Dictionary and choose Edit. Add a new entry named <strong>heightMin</strong> with the “As Input” field set to true and “type” set to Integer. Then add another entry named <strong>mountains.xls</strong> of type <strong>writable.channel</strong>, content type <strong>text/csv</strong>, and “As Output” set to true.</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/03/dictionary.png"><img class="alignnone size-full wp-image-983" title="dictionary" src="http://cloveretl.files.wordpress.com/2011/03/dictionary.png" alt="Dictionary" width="1024" height="581" /></a></p>
<p>Now we may build a transformation graph. Components can use a dictionary in three different ways:</p>
<ol>
<li>Via file URL: data readers and writers may specify a <strong>File URL</strong> in the format <strong>dictionary:field-name</strong>. In our example, we set a data writer File URL to <em>dictionary:mountains.xls</em>.<a href="http://cloveretl.files.wordpress.com/2011/03/url-dialog.png"><img class="alignnone size-full wp-image-987" title="url-dialog" src="http://cloveretl.files.wordpress.com/2011/03/url-dialog.png" alt="URL Dialog" width="958" height="624" /></a></li>
<li>In CTL: anywhere in CTL code, we can use an expression of the form <strong>dictionary.field-name</strong> to read or write the dictionary. In our example we use the Filter expression <code>$0.heightM &gt;= dictionary.heightMin</code>.</li>
<li>In Java code: using the methods <code>transformationGraph.getDictionary().getValue(String fieldName)</code> and <code>transformationGraph.getDictionary().setValue(String fieldName, Object value)</code>.</li>
</ol>
<p>Once the transformation graph is designed and ready, we must publish it as a Launch Service. In the CloverETL Server administration, go to the <strong>Launch Services</strong> section and click <strong>New launch configuration</strong>. Enter a name, a sandbox, and a graph name. Then open the Detail page for the new service and click the <strong>Edit Parameters</strong> tab. Create a new parameter named <strong>heightMin</strong>.</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/03/server.png"><img class="alignnone size-full wp-image-985" title="server" src="http://cloveretl.files.wordpress.com/2011/03/server.png" alt="CloverETL Server Interface" width="900" height="458" /></a></p>
<p>Now we may test it. When we click the <strong>test</strong> link, the server generates a simple form that executes the launch service. We can copy and customize this form and use it on a web site.</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/03/test-page.png"><img class="alignnone size-full wp-image-986" title="test-page" src="http://cloveretl.files.wordpress.com/2011/03/test-page.png" alt="Test Page" width="635" height="274" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://cloveretl.wordpress.com/2011/03/22/launch-services-part-2-configuration/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d3860e88dab62bbcf080d6bce52143e1?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96&#38;r=G" medium="image">
			<media:title type="html">csochor</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/03/mountains.png" medium="image">
			<media:title type="html">mountains</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/03/dictionary.png" medium="image">
			<media:title type="html">dictionary</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/03/url-dialog.png" medium="image">
			<media:title type="html">url-dialog</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/03/server.png" medium="image">
			<media:title type="html">server</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/03/test-page.png" medium="image">
			<media:title type="html">test-page</media:title>
		</media:content>
	</item>
		<item>
		<title>Versatility of the File URL Attribute Used in Readers of CloverETL</title>
		<link>http://cloveretl.wordpress.com/2011/03/03/versatility-of-the-file-url/</link>
		<comments>http://cloveretl.wordpress.com/2011/03/03/versatility-of-the-file-url/#comments</comments>
		<pubDate>Thu, 03 Mar 2011 09:42:34 +0000</pubDate>
		<dc:creator>tomaswaller</dc:creator>
				<category><![CDATA[Using CloverETL]]></category>
		<category><![CDATA[connectivity]]></category>
		<category><![CDATA[ftp]]></category>
		<category><![CDATA[url]]></category>
		<category><![CDATA[User Interface]]></category>

		<guid isPermaLink="false">http://blog.cloveretl.com/?p=967</guid>
		<description><![CDATA[CloverETL allows users to read several different kinds of files. These files may have various formats, they can be located on a local or remote computer, they can be accessed through a proxy, and they can also be compressed into zip, gzip, or tar archives. Users can also read data from the Console, from an [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=cloveretl.wordpress.com&amp;blog=7070972&amp;post=967&amp;subd=cloveretl&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><strong>CloverETL</strong> allows users to read several different kinds of files. These files may have various formats, they can be located on a local or remote computer, they can be accessed through a proxy, and they can also be compressed into zip, gzip, or tar archives. Users can also read data from the <strong>Console</strong>, from an input <strong>Port</strong>, or from a selected <strong>Dictionary</strong> entry.</p>
<p>A <strong>File URL</strong> must be specified using the <strong>URL dialog</strong>.</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/02/urldialog.png"><img class="alignnone size-full wp-image-970" title="File URL Dialog" src="http://cloveretl.files.wordpress.com/2011/02/urldialog.png" alt="File URL Dialog" width="800" height="600" /></a></p>
<ul>
<li>The <strong>Workspace view</strong> tab is used to specify files within the workspace, regardless of whether the workspace belongs to a <strong>CloverETL Project</strong> or a <strong>CloverETL Server Project</strong>.</li>
<li>The <strong>Local files</strong> tab displays the file structure of the local computer.</li>
<li>The <strong>Remote files</strong> tab displays the file structure of a remote computer. It allows the user to specify the protocol, username, password, server, and port, as well as the proxy server, proxy port, proxy username, and proxy password. After these properties are specified, the file structure of the remote computer is displayed (except for the <em>http</em> and <em>https</em> protocols).</li>
<li>The <strong>Port</strong> tab displays the input fields of the <em>string</em>, <em>byte</em>, or <em>cbyte</em> data types and allows the user to select one of them and to choose a processing type from a combo box.</li>
<li>The <strong>Dictionary</strong> tab displays the declared dictionary entries and allows the user to select one of them and to choose a processing type from a combo box.</li>
</ul>
<p>Now I will present the list of supported values of the <strong>File URL</strong> attribute.</p>
<h3>Local Files (without compression)</h3>
<p><span style="text-decoration:underline;">Examples:</span><br />
<em>/path/file1.txt</em> – reads one file<br />
<em>/path/file1.txt;/path/file2.txt</em> – reads two files in one directory (semicolon separates files that will be read one after another)<br />
<em>/path1/fileA.txt;/path2/fileB.txt</em> – reads two files in two directories (semicolon separates files that will be read one after another)<br />
<em>/path?/file*.txt</em> – reads files in directories, where both the directories and the files must match the specified pattern.<br />
<em>/path/*</em> – reads all files in the specified directory</p>
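<p>The semicolon and wildcard rules above can be illustrated with a small Python sketch. It is not CloverETL code: the function name <em>expand_file_url</em> is hypothetical, and Python&#8217;s <em>glob</em> module is only assumed to behave similarly for the <em>?</em> and <em>*</em> wildcards (it also understands <em>[...]</em> character classes, which the text above does not mention).</p>

```python
import glob


def expand_file_url(file_url):
    """Return the local files a semicolon-separated, wildcard File URL would name.

    Hypothetical helper for illustration only; the order in which a real
    Reader visits matching files is an assumption here (sorted by name).
    """
    files = []
    for part in file_url.split(";"):  # a semicolon separates files read one after another
        part = part.strip()
        if not part:
            continue
        # glob expands ? (one character) and * (any run of characters)
        # in both the directory and the file name
        files.extend(sorted(glob.glob(part)))
    return files
```

<p>For instance, <em>expand_file_url("/path?/file*.txt")</em> would list every existing file that matches both the directory pattern and the file pattern.</p>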
<h3>Local Files (with compression)</h3>
<p>Examples:<br />
<em>zip:(/path/file.zip)</em> – reads the first file added to the zip archive<br />
<em>zip:(/path/file.zip)#innerfolder/innerfile.txt</em> – reads the innerfile.txt contained in the innerfolder which has been compressed into the specified zip archive.<br />
<em>zip:(/path/file??.zip)#innerfolder?/innerfile*.txt</em> – reads files contained in the innerfolders which have been compressed into the specified zip archives (each of these files, innerfolders, archive files must match their respective pattern)<br />
<em>gzip:(/path/file.gz)</em> – reads the file compressed in the gzip archive<br />
<em>gzip:(/path/file??.gz)</em> – reads the files compressed into specified gzip archives (each of these archives must match specified pattern)<br />
<em>tar:(/path/file.tar)</em> – reads the first file added to the tar archive<br />
<em>tar:(/path/file.tar)#innerfolder/innerfile.txt</em> – reads the innerfile.txt contained in the innerfolder which is compressed in the specified tar archive.<br />
<em>tar:(/path/file??.tar)#innerfolder?/innerfile*.txt</em> – reads files contained in the innerfolders which have been compressed in the specified tar archives (each of these files, innerfolders, archive files must match their respective pattern)<br />
<em>zip:(zip:(/path/file*.zip)#innerfolder/innerfile.zip)#innermostfolder??/innermostfile*.txt</em> – reads the innermost files contained in the innermostfolders which have been compressed into the specified innerfile zip archive, which is itself contained in the innerfolder compressed into the specified external zip archives (each of these innermost files, innermostfolders, and external zip archives must match its respective pattern). Remember that innerfile.zip and innerfolder may not contain wildcards.</p>
<h3>Remote Files (without compression)</h3>
<p>Unlike locally stored files, files on remote computers are accessible using a set of supported protocols. Sometimes it is also necessary to use a proxy server.</p>
<p>The following protocols are supported for accessing a remote server: <em>sftp</em>, <em>ftp</em>, <em>ftps</em>, <em>http</em>, <em>https</em>.</p>
<h4>Access without proxy:</h4>
<p>The structure of all remote files that are accessible directly, without a proxy, is as follows:<br />
<em>protocol://username:password@server:port/(whole|relative)path/file</em></p>
<p>Here, the whole path should be used for the <em>sftp</em> protocol; the other four protocols use relative paths.</p>
<p>Examples:<br />
<em>sftp://johnsmith:mypassword@myserver/home/johnsmith/relativepath/filename.txt</em><br />
<em>ftp://johnsmith:mypassword@myserver/relativepath/filename.txt</em><br />
<em>ftps://johnsmith:mypassword@myserver/relativepath/filename.txt</em><br />
<em>http://johnsmith:mypassword@myserver/relativepath/filename.txt</em><br />
<em>https://johnsmith:mypassword@myserver/relativepath/filename.txt</em></p>
<p>In the patterns shown above, the username, password, and port may be omitted where possible, whereas the other parts of the <strong>File URL</strong> are required.</p>
<p>Example (with username, password, and port omitted):</p>
<p><em>http://myserver/relativepath/filename.txt</em></p>
<h4>Access through proxy:</h4>
<p>The structure of all remote files that are accessible through a proxy is as follows:<br />
<em>protocol:(proxy:proxyuser:proxypassword@proxyserver:proxyport)//username:password@server:port/(whole|relative)path/file</em></p>
<p>or with SOCKS V4 or V5 proxy:<br />
<em>protocol:(proxysocks:proxyuser:proxypassword@proxyserver:proxyport)//username:password@server:port/(whole|relative)path/file</em></p>
<p>Example:<br />
<em>ftp:(proxy:proxyuser:proxypassword@proxyserver:proxyport)//johnsmith:mypassword@myserver/relativepath/filename.txt</em></p>
<p>In this case too, proxyuser, proxypassword, and proxyport can be omitted where possible; the other parts of this pattern are required.</p>
<p>Example with a SOCKS V4 or V5 proxy:<br />
<em>ftp:(proxysocks:proxyuser:proxypassword@proxyserver:proxyport)//johnsmith:mypassword@myserver/relativepath/filename.txt</em></p>
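<p>To make the optional parts of these patterns concrete, here is a minimal Python sketch that assembles a remote <strong>File URL</strong> string. The function <em>build_file_url</em> and its parameter names are assumptions made for this illustration; it does no escaping of special characters and is not part of CloverETL.</p>

```python
def build_file_url(protocol, server, path, username=None, password=None,
                   port=None, proxy=None):
    """Assemble protocol:(proxy)//username:password@server:port/path.

    Hypothetical helper. `proxy` is the text that goes inside the
    parentheses, e.g. "proxy:user:pass@host:port" or "proxysocks:...".
    """
    auth = ""
    if username:
        auth = username
        if password:
            auth += ":" + password
        auth += "@"
    host = server
    if port:
        host += ":" + str(port)
    proxy_part = "(" + proxy + ")" if proxy else ""
    # exactly one slash between the server and the path
    return protocol + ":" + proxy_part + "//" + auth + host + "/" + path.lstrip("/")
```

<p>Calling <em>build_file_url("http", "myserver", "relativepath/filename.txt")</em> gives the shortest form shown above, while passing <em>proxy="proxy:proxyuser:proxypassword@proxyserver:proxyport"</em> adds the parenthesized proxy section after the protocol.</p>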
<h3>Remote Files (with compression)</h3>
<p>Remote <strong>File URLs</strong> may also be combined with archiving protocols in a similar manner to local <strong>File URLs</strong>.</p>
<p>Example:<br />
<em>zip:(ftp://johnsmith:mypassword@myserver/relativepath/myarchive.zip)#innerfolder/filename.txt</em></p>
<p>Wildcards may also be used in a similar way:</p>
<p>Example:<br />
<em>zip:(ftp://johnsmith:mypassword@myserver/relativepath/myarchive*.zip)#innerfolder??/filename?.txt</em></p>
<p><strong>Note:</strong></p>
<p>Remember that the <em>http</em> and <em>https</em> protocols do not support wildcards in top-level files or archives.</p>
<h3>Console Input</h3>
<p>The <strong>File URL</strong> for Console input is a single hyphen character: <em>-</em></p>
<p>After the graph starts, the user types the input into the Console: field values separated by field delimiters, <strong>Enter</strong> to end each record, and <strong>Ctrl+Z</strong> after the last record to finish the input.</p>
<h3>Input Port Reading</h3>
<p><strong>CloverETL</strong> also supports reading incoming data through the input port of some Readers. Metadata connected to the input port must contain at least one field of <em>string</em>, <em>byte</em>, or <em>cbyte</em> data type. The user selects the field from which data should be read and parsed according to the output metadata. Three processing types can be selected in <strong>CloverETL</strong>:</p>
<ul>
<li><em>discrete</em> (the default value)</li>
<li><em>stream</em></li>
<li><em>source</em></li>
</ul>
<p>The <strong>File URL</strong> pattern is as follows:</p>
<p><em>port:$0.fieldname:discrete|stream|source</em></p>
<h4>Discrete processing type:</h4>
<p>When the processing type is discrete, each record is parsed separately, according to the output metadata.</p>
<p>Example:<br />
<em>port:$0.customer:discrete</em></p>
<p><strong>Note:</strong><br />
The <em>colon</em> and the word <em>discrete</em> can be omitted.</p>
<p>Example:<br />
<em>port:$0.customer</em></p>
<h4>Stream processing type:</h4>
<p>When the processing type is stream, all records are concatenated and parsed according to the output metadata. If the input data contains a null value, this null means EOF and separates groups of records: all records before such a null are concatenated into one data source, and all records after it are concatenated into another.</p>
<p>Example:<br />
<em>port:$0.customer:stream</em></p>
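<p>The grouping rule of the stream processing type can be sketched in a few lines of Python. This only illustrates the behaviour described above and is not the Reader&#8217;s implementation; the name <em>stream_sources</em> is hypothetical, and skipping empty groups (for example between two consecutive nulls) is an assumption.</p>

```python
def stream_sources(field_values):
    """Concatenate incoming field values into data sources, split at nulls.

    `field_values` is the sequence of values of the selected input field;
    None stands for a null value, which marks the end of one data source.
    """
    sources = []
    current = []
    for value in field_values:
        if value is None:
            # a null closes the group collected so far
            if current:
                sources.append("".join(current))
            current = []
        else:
            current.append(value)
    if current:  # the last group needs no trailing null
        sources.append("".join(current))
    return sources
```

<p>With the discrete processing type, by contrast, each incoming value would be parsed as its own data source without any concatenation.</p>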
<h4>Source processing type:</h4>
<p>Example:<br />
<em>port:$0.file:source</em></p>
<p>When the processing type is source, the values of the selected field (<em>$0.file</em>) must be valid URLs. The Reader to which such input is connected opens the file at each URL and reads its contents. Metadata on the output must match the structure of the files specified by these URLs.</p>
<h3>Dictionary Entry Reading</h3>
<p>The <strong>Dictionary</strong> tab allows the user to select one of the graph dictionary entries. The processing type should also be chosen in the combo box.</p>
<p>The <strong>File URL</strong> pattern is:<br />
<em>dict:myentry:discrete|source</em></p>
<h4>Discrete processing type:</h4>
<p>Example:<br />
<em>dict:customer:discrete</em></p>
<p>Reads the contents of the dictionary entry named <em>customer</em>.</p>
<h4>Source processing type:</h4>
<p>Example:<br />
<em>dict:file:source</em></p>
<p>When the processing type is source, the value of the selected dictionary entry (<em>file</em>) must be a valid URL. The Reader with this <strong>File URL</strong> opens the file at that URL and reads its contents. Metadata on the component&#8217;s output must match the structure of the file specified by this URL.</p>
]]></content:encoded>
			<wfw:commentRss>http://cloveretl.wordpress.com/2011/03/03/versatility-of-the-file-url/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2c39e475cb676a45a1a1e3a6ee1bf9b1?s=96&#38;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96&#38;r=G" medium="image">
			<media:title type="html">tomaswaller</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/02/urldialog.png" medium="image">
			<media:title type="html">File URL Dialog</media:title>
		</media:content>
	</item>
		<item>
		<title>Data sampling with CloverETL</title>
		<link>http://cloveretl.wordpress.com/2011/01/31/data-sampling/</link>
		<comments>http://cloveretl.wordpress.com/2011/01/31/data-sampling/#comments</comments>
		<pubDate>Mon, 31 Jan 2011 13:40:00 +0000</pubDate>
		<dc:creator>Agata Vackova</dc:creator>
				<category><![CDATA[Using CloverETL]]></category>
		<category><![CDATA[data profiling]]></category>
		<category><![CDATA[data quality]]></category>
		<category><![CDATA[data quality scorecard]]></category>
		<category><![CDATA[data sampling]]></category>

		<guid isPermaLink="false">http://blog.cloveretl.com/?p=917</guid>
		<description><![CDATA[Testing data transformations is generally not an easy task. When creating and testing a transformation you might want to get a data sample to check if your transformation works properly. In this point a question arises: How to create a representative data probe on the full data set? Obviously, the easiest way is to read [...]]]></description>
			<content:encoded><![CDATA[<p>Testing data transformations is generally not an easy task. When creating and testing a transformation, you might want to get a data sample to check whether your transformation works properly. At this point a question arises: how do you create a <strong>representative</strong> data probe from the full data set? Obviously, the easiest way is to read just a part of the data from the beginning. But such a data sample can be very unreliable. I&#8217;ve prepared a few simple graphs that create a data probe which can be regarded as representative of the full data set.</p>
<p>All graphs were created based on the sampling methods described in the article <a title="Sampling methods" href="http://en.wikipedia.org/wiki/Sampling_%28statistics%29#Sampling_methods" target="_blank">Sampling (statistics)</a>.</p>
<h4>Simple random sampling</h4>
<p>In this method each record has the same probability of selection. Filtering is based on a double value chosen (approximately) uniformly from the range 0.0d (inclusive) to 1.0d (exclusive): a record is selected if the drawn number is lower than the required sample size ratio:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/01/simplerandomsampling4.png"><img class="alignnone size-full wp-image-928" title="SimpleRandomSampling" src="http://cloveretl.files.wordpress.com/2011/01/simplerandomsampling4.png" alt="" width="516" height="173" /></a></p>
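<p>The same filter can be written as a short Python sketch. It mirrors the rule described above rather than the graph itself; the function name and the optional <em>rng</em> parameter are assumptions for illustration.</p>

```python
import random


def simple_random_sample(records, ratio, rng=None):
    """Keep each record independently with probability `ratio`.

    `rng` is an optional random.Random instance (useful for repeatable runs).
    """
    if rng is None:
        rng = random.Random()
    # rng.random() is drawn from [0.0, 1.0); the record is kept
    # when the drawn number is lower than the requested ratio
    return [record for record in records if rng.random() < ratio]
```

<p>Because every record is decided independently, the size of the result only approximates <em>ratio</em> times the size of the input, and small groups of records may be missing from the sample entirely.</p>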
<h4>Systematic sampling</h4>
<p>Systematic sampling relies on arranging the data set according to some ordering scheme and then selecting elements at regular intervals through that ordered list. Systematic sampling involves a random start and then proceeds with the selection of every k-th element from then onwards:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/01/systematicsampling1.png"><img class="alignnone size-full wp-image-930" title="SystematicSampling" src="http://cloveretl.files.wordpress.com/2011/01/systematicsampling1.png" alt="" width="689" height="202" /></a></p>
<p>Sorting can be disabled in this graph. In that case, every k-th element is simply selected from the full data set in its original order, starting from a randomly selected record in the interval [1, k].</p>
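<p>As a rough Python sketch of this method (not the graph itself), the function below optionally sorts the records and then takes every k-th one from a random start. The names <em>systematic_sample</em>, <em>sort</em>, and <em>rng</em> are assumptions for illustration.</p>

```python
import random


def systematic_sample(records, k, sort=True, rng=None):
    """Select every k-th record after a random start in the interval [1, k]."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if rng is None:
        rng = random.Random()
    # with sorting disabled the records keep their original order
    ordered = sorted(records) if sort else list(records)
    start = rng.randint(1, k)  # 1-based position of the first selected record
    return ordered[start - 1::k]
```

<p>If k is chosen as roughly 1 divided by the requested sample size ratio (an assumption about how the two relate), the sample size is fixed by the input size, unlike in simple random sampling.</p>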
<h4>Stratified sampling</h4>
<p>If the data set embraces a number of distinct categories, the frame can be organized by these categories into separate <em>strata</em>. Each <em>stratum</em> is then sampled as an independent sub-population out of which individual elements can be randomly selected. At least one record from each <em>stratum</em> must be selected:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/01/stratifiedsampling21.png"><img class="alignnone size-full wp-image-933" title="StratifiedSampling2" src="http://cloveretl.files.wordpress.com/2011/01/stratifiedsampling21.png" alt="" width="677" height="210" /></a></p>
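<p>One possible way to express this in Python is sketched below. It is an assumption about how the &#8220;at least one record per stratum&#8221; rule could be met, not a description of the graph: each stratum is filtered with simple random sampling, and a stratum that would otherwise be empty contributes one randomly chosen record. The names <em>stratified_sample</em> and <em>stratum_of</em> are hypothetical.</p>

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_of, ratio, rng=None):
    """Sample each stratum independently, keeping at least one record from each.

    `stratum_of` maps a record to its stratum key (e.g. a customer id).
    """
    if rng is None:
        rng = random.Random()
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for members in strata.values():
        picked = [r for r in members if rng.random() < ratio]
        if not picked:
            # guarantee: every stratum is represented by at least one record
            picked = [rng.choice(members)]
        sample.extend(picked)
    return sample
```

<p>Under this sketch, many small strata inflate the sample: with strata of about 100 records and a ratio of 0.01, roughly 37% of the strata would draw no record on their own (0.99 to the power of 100 is about 0.37), so each of them adds one extra record to the total.</p>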
<h4>Probability proportional to size sampling</h4>
<p>The selection probability for each record is set to be proportional to the size of its <em>stratum</em>, up to a maximum of 1. <em>Strata</em> are defined by the value of the selected field. The systematic sampling method is used for each group of records:</p>
<p><a href="http://cloveretl.files.wordpress.com/2011/01/ppssampling1.png"><img class="alignnone size-full wp-image-935" title="PPSSampling1" src="http://cloveretl.files.wordpress.com/2011/01/ppssampling1.png" alt="" width="676" height="220" /></a></p>
<h3>Methods comparison</h3>
<p>The simple random sampling method is the simplest and fastest, and it is sufficient in most cases. Systematic sampling with sorting disabled is as fast as simple random sampling and also produces a strongly representative data probe. The stratified sampling method is the trickiest one. It is useful only if the data set can be split into separate groups of reasonable sizes. Otherwise the data probe is a lot bigger than requested.</p>
<p>Please see the attached CloverETL project with the above graphs. It also contains a graph for comparing samples created with the different sampling methods. I&#8217;ve run some tests on a file containing 5,000,000 rows with information about financial transactions. Each row contains a unique transaction id, the id of a customer, the transaction amount, and currency info. The total number of customers is 50,001; the number of possible currencies is 35. I performed two sets of tests: one with groups defined by customer id and one with groups defined by currency id.</p>
<h4>Results for the sampling_field = CustomerId</h4>
<p>The <em>stratum</em> is defined by the id of the customer. All data can be split into 50,001 groups with sizes from 61 to 143 transactions.</p>
<p>The following table shows test results for some groups. Sorting was enabled for the systematic sampling method.</p>
<p>Defined sample size ratio: 0.01</p>
<table border="1">
<tbody>
<tr align="center">
<th>sampling field (CustomerId) value</th>
<th colspan="3">simple sampling</th>
<th colspan="3">systematic sampling</th>
<th colspan="3">stratified sampling</th>
<th colspan="3">pps sampling</th>
</tr>
<tr align="center" bgcolor="#ffff99">
<td>Sampling time</td>
<td colspan="3">0 h 0 m 22 s 772 ms</td>
<td colspan="3">0 h 1 m 34 s 965 ms</td>
<td colspan="3">0 h 1 m 33 s 831 ms</td>
<td colspan="3">0 h 1 m 30 s 973 ms</td>
</tr>
<tr>
<th></th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
</tr>
<tr>
<td>0</td>
<td>4</td>
<td>71</td>
<td>0.0563</td>
<td>1</td>
<td>71</td>
<td>0.0140</td>
<td>1</td>
<td>71</td>
<td>0.0140</td>
<td>1</td>
<td>71</td>
<td>0.0140</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>94</td>
<td>0.0000</td>
<td>1</td>
<td>94</td>
<td>0.0106</td>
<td>1</td>
<td>94</td>
<td>0.0106</td>
<td>1</td>
<td>94</td>
<td>0.0106</td>
</tr>
<tr>
<td>10</td>
<td>2</td>
<td>110</td>
<td>0.0181</td>
<td>1</td>
<td>110</td>
<td>0.0090</td>
<td>2</td>
<td>110</td>
<td>0.0181</td>
<td>1</td>
<td>110</td>
<td>0.0090</td>
</tr>
<tr>
<td>100</td>
<td>1</td>
<td>93</td>
<td>0.0107</td>
<td>1</td>
<td>93</td>
<td>0.0107</td>
<td>1</td>
<td>93</td>
<td>0.0107</td>
<td>1</td>
<td>93</td>
<td>0.0107</td>
</tr>
<tr>
<td>1000</td>
<td>0</td>
<td>83</td>
<td>0.0000</td>
<td>1</td>
<td>83</td>
<td>0.0120</td>
<td>1</td>
<td>83</td>
<td>0.0120</td>
<td>1</td>
<td>83</td>
<td>0.0120</td>
</tr>
<tr>
<td>10000</td>
<td>2</td>
<td>101</td>
<td>0.0198</td>
<td>1</td>
<td>101</td>
<td>0.0099</td>
<td>1</td>
<td>101</td>
<td>0.0099</td>
<td>1</td>
<td>101</td>
<td>0.0099</td>
</tr>
<tr>
<td>10001</td>
<td>0</td>
<td>99</td>
<td>0.0000</td>
<td>1</td>
<td>99</td>
<td>0.0101</td>
<td>1</td>
<td>99</td>
<td>0.0101</td>
<td>1</td>
<td>99</td>
<td>0.0101</td>
</tr>
<tr>
<td>10002</td>
<td>0</td>
<td>109</td>
<td>0.0000</td>
<td>1</td>
<td>109</td>
<td>0.0091</td>
<td>3</td>
<td>109</td>
<td>0.0275</td>
<td>1</td>
<td>109</td>
<td>0.0091</td>
</tr>
<tr>
<td>10003</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>2</td>
<td>86</td>
<td>0.0232</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
</tr>
<tr>
<td>10004</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
</tr>
<tr align="center" bgcolor="#ff99cc">
<td>total</td>
<td>49937</td>
<td>5000000</td>
<td>0.0099</td>
<td>50000</td>
<td>5000000</td>
<td>0.0100</td>
<td>68172</td>
<td>5000000</td>
<td>0.0136</td>
<td>50011</td>
<td>5000000</td>
<td>0.0100</td>
</tr>
<tr>
<th></th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
</tr>
<tr align="center">
<th>CustomerId</th>
<th colspan="3">simple sampling</th>
<th colspan="3">systematic sampling</th>
<th colspan="3">stratified sampling</th>
<th colspan="3">pps sampling</th>
</tr>
<tr align="center" bgcolor="#ffff99">
<td>Sampling time</td>
<td colspan="3">0 h 0 m 28 s 741 ms</td>
<td colspan="3">0 h 1 m 34 s 474 ms</td>
<td colspan="3">0 h 1 m 32 s 628 ms</td>
<td colspan="3">0 h 1 m 33 s 949 ms</td>
</tr>
<tr>
<th></th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>71</td>
<td>0.0000</td>
<td>0</td>
<td>71</td>
<td>0.0000</td>
<td>1</td>
<td>71</td>
<td>0.0140</td>
<td>0</td>
<td>71</td>
<td>0.0000</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>94</td>
<td>0.0000</td>
<td>1</td>
<td>94</td>
<td>0.0106</td>
<td>1</td>
<td>94</td>
<td>0.0106</td>
<td>1</td>
<td>94</td>
<td>0.0106</td>
</tr>
<tr>
<td>10</td>
<td>1</td>
<td>110</td>
<td>0.0090</td>
<td>1</td>
<td>110</td>
<td>0.0090</td>
<td>3</td>
<td>110</td>
<td>0.0272</td>
<td>1</td>
<td>110</td>
<td>0.0090</td>
</tr>
<tr>
<td>100</td>
<td>1</td>
<td>93</td>
<td>0.0107</td>
<td>1</td>
<td>93</td>
<td>0.0107</td>
<td>1</td>
<td>93</td>
<td>0.0107</td>
<td>1</td>
<td>93</td>
<td>0.0107</td>
</tr>
<tr>
<td>1000</td>
<td>0</td>
<td>83</td>
<td>0.0000</td>
<td>1</td>
<td>83</td>
<td>0.0120</td>
<td>2</td>
<td>83</td>
<td>0.0240</td>
<td>0</td>
<td>83</td>
<td>0.0000</td>
</tr>
<tr>
<td>10000</td>
<td>1</td>
<td>101</td>
<td>0.0099</td>
<td>1</td>
<td>101</td>
<td>0.0099</td>
<td>1</td>
<td>101</td>
<td>0.0099</td>
<td>1</td>
<td>101</td>
<td>0.0099</td>
</tr>
<tr>
<td>10001</td>
<td>2</td>
<td>99</td>
<td>0.0202</td>
<td>1</td>
<td>99</td>
<td>0.0101</td>
<td>1</td>
<td>99</td>
<td>0.0101</td>
<td>1</td>
<td>99</td>
<td>0.0101</td>
</tr>
<tr>
<td>10002</td>
<td>1</td>
<td>109</td>
<td>0.0091</td>
<td>1</td>
<td>109</td>
<td>0.0091</td>
<td>1</td>
<td>109</td>
<td>0.0091</td>
<td>1</td>
<td>109</td>
<td>0.0091</td>
</tr>
<tr>
<td>10003</td>
<td>0</td>
<td>86</td>
<td>0.0000</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
</tr>
<tr>
<td>10004</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>0</td>
<td>86</td>
<td>0.0000</td>
</tr>
<tr align="center" bgcolor="#ff99cc">
<td>total</td>
<td>49931</td>
<td>5000000</td>
<td>0.0099</td>
<td>50000</td>
<td>5000000</td>
<td>0.0100</td>
<td>68369</td>
<td>5000000</td>
<td>0.0136</td>
<td>50010</td>
<td>5000000</td>
<td>0.0100</td>
</tr>
<tr>
<th></th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
</tr>
<tr align="center">
<th>CustomerId</th>
<th colspan="3">simple sampling</th>
<th colspan="3">systematic sampling</th>
<th colspan="3">stratified sampling</th>
<th colspan="3">pps sampling</th>
</tr>
<tr align="center" bgcolor="#ffff99">
<td>Sampling time</td>
<td colspan="3">0 h 0 m 24 s 975 ms</td>
<td colspan="3">0 h 1 m 37 s 446 ms</td>
<td colspan="3">0 h 1 m 29 s 98 ms</td>
<td colspan="3">0 h 1 m 32 s 857 ms</td>
</tr>
<tr>
<th></th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>71</td>
<td>0.0000</td>
<td>1</td>
<td>71</td>
<td>0.0140</td>
<td>1</td>
<td>71</td>
<td>0.0140</td>
<td>1</td>
<td>71</td>
<td>0.0140</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>94</td>
<td>0.0106</td>
<td>1</td>
<td>94</td>
<td>0.0106</td>
<td>1</td>
<td>94</td>
<td>0.0106</td>
<td>1</td>
<td>94</td>
<td>0.0106</td>
</tr>
<tr>
<td>10</td>
<td>0</td>
<td>110</td>
<td>0.0000</td>
<td>1</td>
<td>110</td>
<td>0.0090</td>
<td>2</td>
<td>110</td>
<td>0.0181</td>
<td>2</td>
<td>110</td>
<td>0.0181</td>
</tr>
<tr>
<td>100</td>
<td>0</td>
<td>93</td>
<td>0.0000</td>
<td>1</td>
<td>93</td>
<td>0.0107</td>
<td>1</td>
<td>93</td>
<td>0.0107</td>
<td>1</td>
<td>93</td>
<td>0.0107</td>
</tr>
<tr>
<td>1000</td>
<td>1</td>
<td>83</td>
<td>0.0120</td>
<td>1</td>
<td>83</td>
<td>0.0120</td>
<td>1</td>
<td>83</td>
<td>0.0120</td>
<td>1</td>
<td>83</td>
<td>0.0120</td>
</tr>
<tr>
<td>10000</td>
<td>2</td>
<td>101</td>
<td>0.0198</td>
<td>1</td>
<td>101</td>
<td>0.0099</td>
<td>1</td>
<td>101</td>
<td>0.0099</td>
<td>1</td>
<td>101</td>
<td>0.0099</td>
</tr>
<tr>
<td>10001</td>
<td>1</td>
<td>99</td>
<td>0.0101</td>
<td>1</td>
<td>99</td>
<td>0.0101</td>
<td>1</td>
<td>99</td>
<td>0.0101</td>
<td>1</td>
<td>99</td>
<td>0.0101</td>
</tr>
<tr>
<td>10002</td>
<td>0</td>
<td>109</td>
<td>0.0000</td>
<td>1</td>
<td>109</td>
<td>0.0091</td>
<td>3</td>
<td>109</td>
<td>0.0275</td>
<td>1</td>
<td>109</td>
<td>0.0091</td>
</tr>
<tr>
<td>10003</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
</tr>
<tr>
<td>10004</td>
<td>0</td>
<td>86</td>
<td>0.0000</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
</tr>
<tr align="center" bgcolor="#ff99cc">
<td>total</td>
<td>49983</td>
<td>5000000</td>
<td>0.0099</td>
<td>50000</td>
<td>5000000</td>
<td>0.0100</td>
<td>68258</td>
<td>5000000</td>
<td>0.0136</td>
<td>49900</td>
<td>5000000</td>
<td>0.0099</td>
</tr>
<tr>
<th></th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
</tr>
<tr align="center">
<th>CustomerId</th>
<th colspan="3">simple sampling</th>
<th colspan="3">systematic sampling</th>
<th colspan="3">stratified sampling</th>
<th colspan="3">pps sampling</th>
</tr>
</tbody>
</table>
<p>Results for the same test, but with data sorting disabled in the systematic sampling method:</p>
<p>Defined sample size ratio: 0.01</p>
<table border="1">
<tbody>
<tr align="center">
<th>sampling field (CustomerId) value</th>
<th colspan="3">simple sampling</th>
<th colspan="3">systematic sampling</th>
<th colspan="3">stratified sampling</th>
<th colspan="3">pps sampling</th>
</tr>
<tr align="center" bgcolor="#ffff99">
<td>Sampling time</td>
<td colspan="3">0 h 0 m 28 s 168 ms</td>
<td colspan="3">0 h 0 m 23 s 117 ms</td>
<td colspan="3">0 h 1 m 35 s 414 ms</td>
<td colspan="3">0 h 1 m 30 s 985 ms</td>
</tr>
<tr>
<th></th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>71</td>
<td>0.0140</td>
<td>0</td>
<td>71</td>
<td>0.0000</td>
<td>1</td>
<td>71</td>
<td>0.0140</td>
<td>1</td>
<td>71</td>
<td>0.0140</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>94</td>
<td>0.0106</td>
<td>0</td>
<td>94</td>
<td>0.0000</td>
<td>2</td>
<td>94</td>
<td>0.0212</td>
<td>1</td>
<td>94</td>
<td>0.0106</td>
</tr>
<tr>
<td>10</td>
<td>1</td>
<td>110</td>
<td>0.0090</td>
<td>2</td>
<td>110</td>
<td>0.0181</td>
<td>1</td>
<td>110</td>
<td>0.0090</td>
<td>1</td>
<td>110</td>
<td>0.0090</td>
</tr>
<tr>
<td>100</td>
<td>0</td>
<td>93</td>
<td>0.0000</td>
<td>0</td>
<td>93</td>
<td>0.0000</td>
<td>1</td>
<td>93</td>
<td>0.0107</td>
<td>1</td>
<td>93</td>
<td>0.0107</td>
</tr>
<tr>
<td>1000</td>
<td>0</td>
<td>83</td>
<td>0.0000</td>
<td>0</td>
<td>83</td>
<td>0.0000</td>
<td>1</td>
<td>83</td>
<td>0.0120</td>
<td>1</td>
<td>83</td>
<td>0.0120</td>
</tr>
<tr>
<td>10000</td>
<td>2</td>
<td>101</td>
<td>0.0198</td>
<td>0</td>
<td>101</td>
<td>0.0000</td>
<td>1</td>
<td>101</td>
<td>0.0099</td>
<td>1</td>
<td>101</td>
<td>0.0099</td>
</tr>
<tr>
<td>10001</td>
<td>0</td>
<td>99</td>
<td>0.0000</td>
<td>1</td>
<td>99</td>
<td>0.0101</td>
<td>1</td>
<td>99</td>
<td>0.0101</td>
<td>1</td>
<td>99</td>
<td>0.0101</td>
</tr>
<tr>
<td>10002</td>
<td>3</td>
<td>109</td>
<td>0.0275</td>
<td>1</td>
<td>109</td>
<td>0.0091</td>
<td>3</td>
<td>109</td>
<td>0.0275</td>
<td>1</td>
<td>109</td>
<td>0.0091</td>
</tr>
<tr>
<td>10003</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>2</td>
<td>86</td>
<td>0.0232</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
</tr>
<tr>
<td>10004</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>0</td>
<td>86</td>
<td>0.0000</td>
<td>2</td>
<td>86</td>
<td>0.0232</td>
<td>0</td>
<td>86</td>
<td>0.0000</td>
</tr>
<tr align="center" bgcolor="#ff99cc">
<td>total</td>
<td>50081</td>
<td>5000000</td>
<td>0.0100</td>
<td>50000</td>
<td>5000000</td>
<td>0.0100</td>
<td>68227</td>
<td>5000000</td>
<td>0.0136</td>
<td>49966</td>
<td>5000000</td>
<td>0.0099</td>
</tr>
<tr>
<th></th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
</tr>
<tr align="center">
<th>CustomerId</th>
<th colspan="3">simple sampling</th>
<th colspan="3">systematic sampling</th>
<th colspan="3">stratified sampling</th>
<th colspan="3">pps sampling</th>
</tr>
<tr align="center" bgcolor="#ffff99">
<td>Sampling time</td>
<td colspan="3">0 h 0 m 23 s 78 ms</td>
<td colspan="3">0 h 0 m 19 s 178 ms</td>
<td colspan="3">0 h 1 m 33 s 148 ms</td>
<td colspan="3">0 h 1 m 29 s 261 ms</td>
</tr>
<tr>
<th></th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>71</td>
<td>0.0000</td>
<td>0</td>
<td>71</td>
<td>0.0000</td>
<td>1</td>
<td>71</td>
<td>0.0140</td>
<td>0</td>
<td>71</td>
<td>0.0000</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>94</td>
<td>0.0000</td>
<td>0</td>
<td>94</td>
<td>0.0000</td>
<td>1</td>
<td>94</td>
<td>0.0106</td>
<td>1</td>
<td>94</td>
<td>0.0106</td>
</tr>
<tr>
<td>10</td>
<td>0</td>
<td>110</td>
<td>0.0000</td>
<td>3</td>
<td>110</td>
<td>0.0272</td>
<td>1</td>
<td>110</td>
<td>0.0090</td>
<td>1</td>
<td>110</td>
<td>0.0090</td>
</tr>
<tr>
<td>100</td>
<td>3</td>
<td>93</td>
<td>0.0322</td>
<td>1</td>
<td>93</td>
<td>0.0107</td>
<td>1</td>
<td>93</td>
<td>0.0107</td>
<td>1</td>
<td>93</td>
<td>0.0107</td>
</tr>
<tr>
<td>1000</td>
<td>1</td>
<td>83</td>
<td>0.0120</td>
<td>0</td>
<td>83</td>
<td>0.0000</td>
<td>2</td>
<td>83</td>
<td>0.0240</td>
<td>1</td>
<td>83</td>
<td>0.0120</td>
</tr>
<tr>
<td>10000</td>
<td>0</td>
<td>101</td>
<td>0.0000</td>
<td>1</td>
<td>101</td>
<td>0.0099</td>
<td>1</td>
<td>101</td>
<td>0.0099</td>
<td>1</td>
<td>101</td>
<td>0.0099</td>
</tr>
<tr>
<td>10001</td>
<td>1</td>
<td>99</td>
<td>0.0101</td>
<td>1</td>
<td>99</td>
<td>0.0101</td>
<td>3</td>
<td>99</td>
<td>0.0303</td>
<td>1</td>
<td>99</td>
<td>0.0101</td>
</tr>
<tr>
<td>10002</td>
<td>1</td>
<td>109</td>
<td>0.0091</td>
<td>0</td>
<td>109</td>
<td>0.0000</td>
<td>1</td>
<td>109</td>
<td>0.0091</td>
<td>1</td>
<td>109</td>
<td>0.0091</td>
</tr>
<tr>
<td>10003</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>0</td>
<td>86</td>
<td>0.0000</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
</tr>
<tr>
<td>10004</td>
<td>3</td>
<td>86</td>
<td>0.0348</td>
<td>0</td>
<td>86</td>
<td>0.0000</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
</tr>
<tr align="center" bgcolor="#ff99cc">
<td>total</td>
<td>50056</td>
<td>5000000</td>
<td>0.0100</td>
<td>50000</td>
<td>5000000</td>
<td>0.0100</td>
<td>68528</td>
<td>5000000</td>
<td>0.0137</td>
<td>50033</td>
<td>5000000</td>
<td>0.0100</td>
</tr>
<tr>
<th></th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
</tr>
<tr align="center">
<th>CustomerId</th>
<th colspan="3">simple sampling</th>
<th colspan="3">systematic sampling</th>
<th colspan="3">stratified sampling</th>
<th colspan="3">pps sampling</th>
</tr>
<tr align="center" bgcolor="#ffff99">
<td>Sampling time</td>
<td colspan="3">0 h 0 m 28 s 244 ms</td>
<td colspan="3">0 h 0 m 27 s 52 ms</td>
<td colspan="3">0 h 1 m 35 s 49 ms</td>
<td colspan="3">0 h 1 m 27 s 725 ms</td>
</tr>
<tr>
<th></th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>71</td>
<td>0.0140</td>
<td>0</td>
<td>71</td>
<td>0.0000</td>
<td>2</td>
<td>71</td>
<td>0.0281</td>
<td>0</td>
<td>71</td>
<td>0.0000</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>94</td>
<td>0.0106</td>
<td>0</td>
<td>94</td>
<td>0.0000</td>
<td>2</td>
<td>94</td>
<td>0.0212</td>
<td>1</td>
<td>94</td>
<td>0.0106</td>
</tr>
<tr>
<td>10</td>
<td>0</td>
<td>110</td>
<td>0.0000</td>
<td>2</td>
<td>110</td>
<td>0.0181</td>
<td>4</td>
<td>110</td>
<td>0.0363</td>
<td>1</td>
<td>110</td>
<td>0.0090</td>
</tr>
<tr>
<td>100</td>
<td>2</td>
<td>93</td>
<td>0.0215</td>
<td>2</td>
<td>93</td>
<td>0.0215</td>
<td>1</td>
<td>93</td>
<td>0.0107</td>
<td>1</td>
<td>93</td>
<td>0.0107</td>
</tr>
<tr>
<td>1000</td>
<td>2</td>
<td>83</td>
<td>0.0240</td>
<td>0</td>
<td>83</td>
<td>0.0000</td>
<td>1</td>
<td>83</td>
<td>0.0120</td>
<td>1</td>
<td>83</td>
<td>0.0120</td>
</tr>
<tr>
<td>10000</td>
<td>0</td>
<td>101</td>
<td>0.0000</td>
<td>0</td>
<td>101</td>
<td>0.0000</td>
<td>1</td>
<td>101</td>
<td>0.0099</td>
<td>1</td>
<td>101</td>
<td>0.0099</td>
</tr>
<tr>
<td>10001</td>
<td>0</td>
<td>99</td>
<td>0.0000</td>
<td>4</td>
<td>99</td>
<td>0.0404</td>
<td>1</td>
<td>99</td>
<td>0.0101</td>
<td>1</td>
<td>99</td>
<td>0.0101</td>
</tr>
<tr>
<td>10002</td>
<td>1</td>
<td>109</td>
<td>0.0091</td>
<td>2</td>
<td>109</td>
<td>0.0183</td>
<td>1</td>
<td>109</td>
<td>0.0091</td>
<td>1</td>
<td>109</td>
<td>0.0091</td>
</tr>
<tr>
<td>10003</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>2</td>
<td>86</td>
<td>0.0232</td>
<td>0</td>
<td>86</td>
<td>0.0000</td>
</tr>
<tr>
<td>10004</td>
<td>0</td>
<td>86</td>
<td>0.0000</td>
<td>0</td>
<td>86</td>
<td>0.0000</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
<td>1</td>
<td>86</td>
<td>0.0116</td>
</tr>
<tr align="center" bgcolor="#ff99cc">
<td>total</td>
<td>50116</td>
<td>5000000</td>
<td>0.0100</td>
<td>50000</td>
<td>5000000</td>
<td>0.0100</td>
<td>68470</td>
<td>5000000</td>
<td>0.0136</td>
<td>50010</td>
<td>5000000</td>
<td>0.0100</td>
</tr>
<tr>
<th></th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
</tr>
<tr align="center">
<th>CustomerId</th>
<th colspan="3">simple sampling</th>
<th colspan="3">systematic sampling</th>
<th colspan="3">stratified sampling</th>
<th colspan="3">pps sampling</th>
</tr>
</tbody>
</table>
<p>Since the groups are very small, we expect zero or one record to be selected from each group, and zero should occur more often for the smaller groups. By this criterion, the PPS sampling method and the systematic sampling method with sorting enabled give the best results. The sample created with the stratified method is always oversized (for example, the totals above show a ratio of 0.0136 to 0.0137 instead of the defined 0.01).</p>
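<p>The expectation above can be checked with a little arithmetic. The following is only a minimal sketch: the helper names are hypothetical, the group sizes are taken from the table, and the zero-record probability assumes that every record is selected independently with probability 0.01, which may not match how each sampling method in the graph actually works.</p>

```python
def expected_count(group_size, ratio):
    """Expected number of sampled records from one group."""
    return group_size * ratio


def prob_zero(group_size, ratio):
    """Chance that no record of the group is picked.

    Assumption: each record is selected independently with
    probability `ratio` (a simple Bernoulli model).
    """
    return (1 - ratio) ** group_size


RATIO = 0.01  # defined sample size ratio used in the tests

# Group sizes as they appear in the CustomerId table
for size in (71, 83, 86, 93, 94, 99, 101, 109, 110):
    print(size,
          round(expected_count(size, RATIO), 2),
          round(prob_zero(size, RATIO), 3))
```

<p>Under that assumption, a group of 71 records ends up with no selected record in roughly one run out of two, a group of 110 records in roughly one run out of three, and the expected count stays close to one record per group.</p>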
<h4>Results for the sampling_field = CurrencyId</h4>
<p>The <em>stratum</em> is defined by the currency ID. The data can be split into 35 groups of very similar size, from 142,042 to 143,572 transactions each.</p>
<p>The following table shows the test results for selected groups. Sorting was enabled for the systematic sampling method.</p>
<p>Defined sample size ratio: 0.01</p>
<table border="1">
<tbody>
<tr align="center">
<th>sampling field (CurrencyId) value</th>
<th colspan="3">simple sampling</th>
<th colspan="3">systematic sampling</th>
<th colspan="3">stratified sampling</th>
<th colspan="3">pps sampling</th>
</tr>
<tr align="center" bgcolor="#ffff99">
<td>Sampling time</td>
<td colspan="3">0 h 0 m 21 s 681 ms</td>
<td colspan="3">0 h 1 m 26 s 859 ms</td>
<td colspan="3">0 h 1 m 25 s 970 ms</td>
<td colspan="3">0 h 1 m 27 s 85 ms</td>
</tr>
<tr>
<th></th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
</tr>
<tr>
<td>0</td>
<td>1450</td>
<td>142623</td>
<td>0.0101</td>
<td>1427</td>
<td>142623</td>
<td>0.0100</td>
<td>1447</td>
<td>142623</td>
<td>0.0101</td>
<td>1426</td>
<td>142623</td>
<td>0.0099</td>
</tr>
<tr>
<td>1</td>
<td>1371</td>
<td>142925</td>
<td>0.0095</td>
<td>1429</td>
<td>142925</td>
<td>0.0099</td>
<td>1430</td>
<td>142925</td>
<td>0.0100</td>
<td>1429</td>
<td>142925</td>
<td>0.0099</td>
</tr>
<tr>
<td>10</td>
<td>1420</td>
<td>142897</td>
<td>0.0099</td>
<td>1429</td>
<td>142897</td>
<td>0.0100</td>
<td>1432</td>
<td>142897</td>
<td>0.0100</td>
<td>1429</td>
<td>142897</td>
<td>0.0100</td>
</tr>
<tr>
<td>11</td>
<td>1448</td>
<td>142896</td>
<td>0.0101</td>
<td>1429</td>
<td>142896</td>
<td>0.0100</td>
<td>1443</td>
<td>142896</td>
<td>0.0100</td>
<td>1429</td>
<td>142896</td>
<td>0.0100</td>
</tr>
<tr>
<td>12</td>
<td>1383</td>
<td>142522</td>
<td>0.0097</td>
<td>1425</td>
<td>142522</td>
<td>0.0099</td>
<td>1488</td>
<td>142522</td>
<td>0.0104</td>
<td>1425</td>
<td>142522</td>
<td>0.0099</td>
</tr>
<tr>
<td>13</td>
<td>1468</td>
<td>142461</td>
<td>0.0103</td>
<td>1425</td>
<td>142461</td>
<td>0.0100</td>
<td>1395</td>
<td>142461</td>
<td>0.0097</td>
<td>1424</td>
<td>142461</td>
<td>0.0099</td>
</tr>
<tr>
<td>14</td>
<td>1449</td>
<td>142997</td>
<td>0.0101</td>
<td>1430</td>
<td>142997</td>
<td>0.0100</td>
<td>1479</td>
<td>142997</td>
<td>0.0103</td>
<td>1430</td>
<td>142997</td>
<td>0.0100</td>
</tr>
<tr>
<td>15</td>
<td>1401</td>
<td>142697</td>
<td>0.0098</td>
<td>1426</td>
<td>142697</td>
<td>0.0099</td>
<td>1438</td>
<td>142697</td>
<td>0.0100</td>
<td>1427</td>
<td>142697</td>
<td>0.0100</td>
</tr>
<tr>
<td>16</td>
<td>1396</td>
<td>143137</td>
<td>0.0097</td>
<td>1432</td>
<td>143137</td>
<td>0.0100</td>
<td>1387</td>
<td>143137</td>
<td>0.0096</td>
<td>1431</td>
<td>143137</td>
<td>0.0099</td>
</tr>
<tr>
<td>17</td>
<td>1464</td>
<td>142517</td>
<td>0.0102</td>
<td>1425</td>
<td>142517</td>
<td>0.0099</td>
<td>1413</td>
<td>142517</td>
<td>0.0099</td>
<td>1425</td>
<td>142517</td>
<td>0.0099</td>
</tr>
<tr align="center" bgcolor="#ff99cc">
<td>total</td>
<td>49959</td>
<td>5000000</td>
<td>0.0099</td>
<td>50000</td>
<td>5000000</td>
<td>0.0100</td>
<td>50075</td>
<td>5000000</td>
<td>0.0100</td>
<td>49997</td>
<td>5000000</td>
<td>0.0099</td>
</tr>
<tr>
<th></th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
</tr>
<tr align="center">
<th>CurrencyId</th>
<th colspan="3">simple sampling</th>
<th colspan="3">systematic sampling</th>
<th colspan="3">stratified sampling</th>
<th colspan="3">pps sampling</th>
</tr>
<tr align="center" bgcolor="#ffff99">
<td>Sampling time</td>
<td colspan="3">0 h 0 m 22 s 949 ms</td>
<td colspan="3">0 h 1 m 25 s 726 ms</td>
<td colspan="3">0 h 1 m 27 s 629 ms</td>
<td colspan="3">0 h 1 m 24 s 537 ms</td>
</tr>
<tr>
<th></th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
</tr>
<tr>
<td>0</td>
<td>1449</td>
<td>142623</td>
<td>0.0101</td>
<td>1427</td>
<td>142623</td>
<td>0.0100</td>
<td>1496</td>
<td>142623</td>
<td>0.0104</td>
<td>1426</td>
<td>142623</td>
<td>0.0099</td>
</tr>
<tr>
<td>1</td>
<td>1468</td>
<td>142925</td>
<td>0.0102</td>
<td>1429</td>
<td>142925</td>
<td>0.0099</td>
<td>1442</td>
<td>142925</td>
<td>0.0100</td>
<td>1429</td>
<td>142925</td>
<td>0.0099</td>
</tr>
<tr>
<td>10</td>
<td>1436</td>
<td>142897</td>
<td>0.0100</td>
<td>1429</td>
<td>142897</td>
<td>0.0100</td>
<td>1406</td>
<td>142897</td>
<td>0.0098</td>
<td>1429</td>
<td>142897</td>
<td>0.0100</td>
</tr>
<tr>
<td>11</td>
<td>1436</td>
<td>142896</td>
<td>0.0100</td>
<td>1429</td>
<td>142896</td>
<td>0.0100</td>
<td>1402</td>
<td>142896</td>
<td>0.0098</td>
<td>1429</td>
<td>142896</td>
<td>0.0100</td>
</tr>
<tr>
<td>12</td>
<td>1410</td>
<td>142522</td>
<td>0.0098</td>
<td>1425</td>
<td>142522</td>
<td>0.0099</td>
<td>1454</td>
<td>142522</td>
<td>0.0102</td>
<td>1425</td>
<td>142522</td>
<td>0.0099</td>
</tr>
<tr>
<td>13</td>
<td>1438</td>
<td>142461</td>
<td>0.0100</td>
<td>1425</td>
<td>142461</td>
<td>0.0100</td>
<td>1414</td>
<td>142461</td>
<td>0.0099</td>
<td>1425</td>
<td>142461</td>
<td>0.0100</td>
</tr>
<tr>
<td>14</td>
<td>1420</td>
<td>142997</td>
<td>0.0099</td>
<td>1430</td>
<td>142997</td>
<td>0.0100</td>
<td>1450</td>
<td>142997</td>
<td>0.0101</td>
<td>1430</td>
<td>142997</td>
<td>0.0100</td>
</tr>
<tr>
<td>15</td>
<td>1412</td>
<td>142697</td>
<td>0.0098</td>
<td>1427</td>
<td>142697</td>
<td>0.0100</td>
<td>1400</td>
<td>142697</td>
<td>0.0098</td>
<td>1427</td>
<td>142697</td>
<td>0.0100</td>
</tr>
<tr>
<td>16</td>
<td>1453</td>
<td>143137</td>
<td>0.0101</td>
<td>1431</td>
<td>143137</td>
<td>0.0099</td>
<td>1442</td>
<td>143137</td>
<td>0.0100</td>
<td>1431</td>
<td>143137</td>
<td>0.0099</td>
</tr>
<tr>
<td>17</td>
<td>1431</td>
<td>142517</td>
<td>0.0100</td>
<td>1425</td>
<td>142517</td>
<td>0.0099</td>
<td>1372</td>
<td>142517</td>
<td>0.0096</td>
<td>1425</td>
<td>142517</td>
<td>0.0099</td>
</tr>
<tr align="center" bgcolor="#ff99cc">
<td>total</td>
<td>50163</td>
<td>5000000</td>
<td>0.0100</td>
<td>50000</td>
<td>5000000</td>
<td>0.0100</td>
<td>49709</td>
<td>5000000</td>
<td>0.0099</td>
<td>50000</td>
<td>5000000</td>
<td>0.0100</td>
</tr>
<tr>
<th></th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
</tr>
<tr align="center">
<th>CurrencyId</th>
<th colspan="3">simple sampling</th>
<th colspan="3">systematic sampling</th>
<th colspan="3">stratified sampling</th>
<th colspan="3">pps sampling</th>
</tr>
<tr align="center" bgcolor="#ffff99">
<td>Sampling time</td>
<td colspan="3">0 h 0 m 27 s 716 ms</td>
<td colspan="3">0 h 1 m 26 s 865 ms</td>
<td colspan="3">0 h 1 m 26 s 657 ms</td>
<td colspan="3">0 h 1 m 26 s 254 ms</td>
</tr>
<tr>
<th></th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
</tr>
<tr>
<td>0</td>
<td>1488</td>
<td>142623</td>
<td>0.0104</td>
<td>1426</td>
<td>142623</td>
<td>0.0099</td>
<td>1416</td>
<td>142623</td>
<td>0.0099</td>
<td>1426</td>
<td>142623</td>
<td>0.0099</td>
</tr>
<tr>
<td>1</td>
<td>1353</td>
<td>142925</td>
<td>0.0094</td>
<td>1429</td>
<td>142925</td>
<td>0.0099</td>
<td>1434</td>
<td>142925</td>
<td>0.0100</td>
<td>1429</td>
<td>142925</td>
<td>0.0099</td>
</tr>
<tr>
<td>10</td>
<td>1417</td>
<td>142897</td>
<td>0.0099</td>
<td>1429</td>
<td>142897</td>
<td>0.0100</td>
<td>1390</td>
<td>142897</td>
<td>0.0097</td>
<td>1429</td>
<td>142897</td>
<td>0.0100</td>
</tr>
<tr>
<td>11</td>
<td>1448</td>
<td>142896</td>
<td>0.0101</td>
<td>1429</td>
<td>142896</td>
<td>0.0100</td>
<td>1438</td>
<td>142896</td>
<td>0.0100</td>
<td>1429</td>
<td>142896</td>
<td>0.0100</td>
</tr>
<tr>
<td>12</td>
<td>1448</td>
<td>142522</td>
<td>0.0101</td>
<td>1425</td>
<td>142522</td>
<td>0.0099</td>
<td>1408</td>
<td>142522</td>
<td>0.0098</td>
<td>1425</td>
<td>142522</td>
<td>0.0099</td>
</tr>
<tr>
<td>13</td>
<td>1412</td>
<td>142461</td>
<td>0.0099</td>
<td>1425</td>
<td>142461</td>
<td>0.0100</td>
<td>1432</td>
<td>142461</td>
<td>0.0100</td>
<td>1424</td>
<td>142461</td>
<td>0.0099</td>
</tr>
<tr>
<td>14</td>
<td>1440</td>
<td>142997</td>
<td>0.0100</td>
<td>1430</td>
<td>142997</td>
<td>0.0100</td>
<td>1471</td>
<td>142997</td>
<td>0.0102</td>
<td>1430</td>
<td>142997</td>
<td>0.0100</td>
</tr>
<tr>
<td>15</td>
<td>1445</td>
<td>142697</td>
<td>0.0101</td>
<td>1427</td>
<td>142697</td>
<td>0.0100</td>
<td>1530</td>
<td>142697</td>
<td>0.0107</td>
<td>1427</td>
<td>142697</td>
<td>0.0100</td>
</tr>
<tr>
<td>16</td>
<td>1436</td>
<td>143137</td>
<td>0.0100</td>
<td>1431</td>
<td>143137</td>
<td>0.0099</td>
<td>1456</td>
<td>143137</td>
<td>0.0101</td>
<td>1432</td>
<td>143137</td>
<td>0.0100</td>
</tr>
<tr>
<td>17</td>
<td>1381</td>
<td>142517</td>
<td>0.0096</td>
<td>1425</td>
<td>142517</td>
<td>0.0099</td>
<td>1365</td>
<td>142517</td>
<td>0.0095</td>
<td>1426</td>
<td>142517</td>
<td>0.0100</td>
</tr>
<tr align="center" bgcolor="#ff99cc">
<td>total</td>
<td>50089</td>
<td>5000000</td>
<td>0.0100</td>
<td>50000</td>
<td>5000000</td>
<td>0.0100</td>
<td>49707</td>
<td>5000000</td>
<td>0.0099</td>
<td>49999</td>
<td>5000000</td>
<td>0.0099</td>
</tr>
<tr>
<th></th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
<th>sample size</th>
<th>dataset size</th>
<th>sample size ratio</th>
</tr>
<tr align="center">
<th>CurrencyId</th>
<th colspan="3">simple sampling</th>
<th colspan="3">systematic sampling</th>
<th colspan="3">stratified sampling</th>
<th colspan="3">pps sampling</th>
</tr>
</tbody>
</table>
<p>With such large groups, all the methods give very good results. The best results, however, clearly come from the systematic sampling and PPS sampling methods, where the sample size ratio always stays between 0.0099 and 0.0100.</p>
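<p>Why systematic sampling on sorted data is so stable can be illustrated with a short sketch. It uses the textbook definition of systematic sampling (a random start, then every k-th record) and is not taken from the CloverETL graph. The function name is hypothetical, and the data is just three CurrencyId values repeated according to the group sizes from the table.</p>

```python
import random
from collections import Counter


def systematic_sample(records, ratio, rng):
    """Textbook systematic sampling: random start, then every k-th record."""
    step = round(1 / ratio)      # ratio 0.01 means every 100th record
    start = rng.randrange(step)  # random offset inside the first interval
    return records[start::step]


# Hypothetical input: group sizes copied from the CurrencyId table
group_sizes = {0: 142623, 1: 142925, 10: 142897}

# "Sorting enabled": records of the same group are contiguous
records = [cid for cid, size in sorted(group_sizes.items())
           for _ in range(size)]

sample = systematic_sample(records, 0.01, random.Random(42))
per_group = Counter(sample)

for cid, size in group_sizes.items():
    print(cid, per_group[cid], size, per_group[cid] / size)
```

<p>Because each group forms one contiguous block in the sorted input, taking every 100th record can only yield the group size divided by 100, rounded down or up. That is consistent with the systematic sampling columns above, where the per-group sample sizes barely change between runs.</p>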
<p><a href="http://www.cloveretl.com/sites/applicationcraft/files/files/blog/DataSampling.zip">Download the transformation graph with data</a></p>
]]></content:encoded>
			<wfw:commentRss>http://cloveretl.wordpress.com/2011/01/31/data-sampling/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/934c88184df6c0034450ae00a1695ee8?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96&#38;r=G" medium="image">
			<media:title type="html">agad</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/01/simplerandomsampling4.png" medium="image">
			<media:title type="html">SimpleRandomSampling</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/01/systematicsampling1.png" medium="image">
			<media:title type="html">SystematicSampling</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/01/stratifiedsampling21.png" medium="image">
			<media:title type="html">StratifiedSampling2</media:title>
		</media:content>

		<media:content url="http://cloveretl.files.wordpress.com/2011/01/ppssampling1.png" medium="image">
			<media:title type="html">PPSSampling1</media:title>
		</media:content>
	</item>
	</channel>
</rss>
