ABInitio FAQ
Q. How does the force_error function work? If we set "never abort" in a Reformat, will force_error
stop the graph, or will it continue to process the next set of records?
A. Here you can set two conditions for the Reformat component:
1. If you want the graph to fail, set the reject threshold to "Abort on first reject".
2. If you don't want it to fail, set the reject threshold to "Never abort".
force_error is used to abort a graph when its conditions are not met; you can write the error
records to a file and then abort the graph. This can be done in different ways.
Or
The force_error() function will not stop the graph; it will write the error message to the error port
for that record and will process the next record.
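A minimal sketch of the second behaviour, assuming "never abort" is set (the field names and
message are illustrative, not from any real graph):

out :: reformat(in) =
begin
  // If the amount is negative, force_error sends this record to the
  // error port with the given message; because the reject threshold is
  // "never abort", the graph then continues with the next record.
  out.amount :: if (in.amount >= 0) in.amount
                else force_error("negative amount");
  out.* :: in.*;
end;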
Q. Phase versus checkpoint?
A. A phase breaks the graph into blocks that execute one after another. It creates temporary files
while running, which are deleted as soon as the phase completes, so if the graph fails it must be
rerun from the beginning. A checkpoint is a phase break that also saves intermediate state to disk,
so a failed graph can be restarted from the last completed checkpoint rather than from the start.
Q. Briefly, what is the function of xfr (what does it do, where is it stored, how does it affect the graph)?
A. As you know, when you create a new sandbox in the Ab Initio environment, the following
directories are created:
1. mp
2. dml
3. xfr
4. db
etc.
xfr is the directory in Ab Initio where we can write our own functions and use them during
transformation (Rollup, Reformat, etc.).
For example, you can write a function to convert a string into a decimal, or to get the maximum
length of a string. You could write these in a file called user_define_function.xfr in the xfr
directory; inside this file you can define a function called string_to_integer or
get_string_max_length, or both. In any transform component you can include the file like this:
include "<full path>/user_define_function.xfr"
You can then call the function like any other function in Ab Initio.
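As a hedged sketch, user_define_function.xfr might contain something like the following (the
cast and function bodies are illustrative assumptions, not a definitive implementation):

/* Convert a string such as "123" to an integer by casting. */
out :: string_to_integer(str) =
begin
  out :: (integer(8)) str;
end;

/* Return the length of the longer of two strings. */
out :: get_string_max_length(s1, s2) =
begin
  out :: if (string_length(s1) > string_length(s2)) string_length(s1)
         else string_length(s2);
end;

Any transform that includes the file as shown above can then call these like built-in functions.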
Q. If the data is partitioned by key A and the output is passed to a Join component whose join
key is (A, B), will it join or not, and why?
A. It will join. Partitioning by A sends all records with the same value of A to the same partition,
and any two records that match on the full key (A, B) necessarily share the same A, so matching
records are always co-located and the Join produces correct results.
Q. In my sandbox I have 10 graphs, and I checked those graphs in to the EME. I then checked a
graph out again and made modifications, and found that the modifications were wrong. What do
I have to do if I want to get the original graph back?
A. Because the wrong modifications exist only in your sandbox and were never checked in,
simply check the graph out from the EME again; the checked-in version will replace the modified
copy in your sandbox.
The above-mentioned structure will exist under the OS (e.g. Unix); for instance, for a project
called fin, it is usually the name of the top-level directory.
In the EME, a similar structure will exist for the project fin.
When you check out or check in a whole project, or an object belonging to a project, the
information is exchanged between these two structures.
For instance, if you check out a dml called fin.dml for the project called fin, you need a sandbox
with the same structure as the EME project fin. Once you've created that, as shown above,
fin.dml (or a copy of it) will come out of the EME and be placed in the dml directory of your sandbox.
Q. I have a job that does the following: it FTPs files from a remote server, reformats the data in
those files and updates the database, then deletes the temporary files. How do we trap errors
generated by Ab Initio when an FTP fails? If I have to re-run / re-start a graph, what are the
points to be considered? Does the *.rec file have anything to do with it?
A. Ab Initio has very good restartability and recovery features built into it. In your situation you
can do the tasks you mentioned in one graph with phase breaks:
the FTP in phase 0, your transformation in the next phase, and then the DB update in another
phase. (This is just an example; it may not be the best way of doing it, as the best design
depends on various other factors.)
If the graph fails during the FTP, then it fails in phase 0 and you can simply restart the graph. If
the graph fails in phase 1, then the AB_JOB.rec file exists, and when you restart the graph you
will see a message saying that a recovery file exists and asking whether you want to start the
graph from the last successful checkpoint or restart from the beginning. The same applies if it
fails in phase 2.
Phases are expensive from a disk I/O perspective, so you have to be careful not to use too much
phasing.
Coming back to error trapping: each component has reject, error, and log ports. The reject port
captures rejected records, the error port captures the corresponding error messages, and the
log port captures the execution statistics of the component. You can control the reject status of
each component by setting the reject threshold to "Never abort", "Abort on first reject", or a
ramp/limit.
Recovery files keep track of crucial information for recovering the graph from a failed status,
such as which node each component is executing on. It is a bad idea to just remove the *.rec
files; you always want to roll back the recovery files cleanly, so that temporary files created
during graph execution don't hang around, occupy disk space, and create issues.
Always use m_rollback -d.
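For example, a minimal invocation (the recovery file name is illustrative):

m_rollback -d my_graph.rec

This rolls the failed job back cleanly and removes its temporary files before you rerun the graph.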
Q. What is Ad hoc multifile? How is it used?
A. Here is a description of Ad hoc multifile:
Ad hoc multifiles treat several serial files having the same record format as a single graph
component.
Frequently, the input of a graph consists of a set of serial files, all of which have to be processed
as a unit. An Ad hoc multifile is a multifile created 'on the fly' out of a set of serial files, without
needing to define a multifile system to contain it. This enables you to represent the needed set of
serial files with a single input file component in the graph. Moreover, the set of files used by the
component can be determined at runtime. This lets the user customize which set of files the
graph uses as input without having to change the graph itself, even after it goes into production.
Ad hoc multifiles can be used as output, intermediate, and lookup files as well as input files.
The simplest way to define an Ad hoc multifile is to list the files explicitly as follows:
1. Insert an input file component in your graph.
2. Open the Properties dialog and select the Description tab.
3. Select Partitions in the Data Location section of the Description tab.
4. Click Edit to open the Define Multifile Partitions dialog box.
5. Click New and enter the first file name. Click New again and enter the second file name and so
on.
6. Click OK.
If you have added 'n' files, then the input file now acts something like a file in an n-way multifile
system, whose data partitions are the n files you listed. It is possible for components to run in the
layout of the input file component. However, there is no way to run commands such as m_ls or
m_dump on the files, because they do not comprise a real multifile system.
There are other ways to define an Ad hoc multifile than listing the input files explicitly:
1. Listing files using wildcards - If the input file names have a common pattern, you can use a
wildcard for all the files. E.g. $AI_SERIAL/ad_hoc_input_*.dat. All the files found at runtime that
match the wildcard pattern will be taken for the Ad hoc multifile.
2. Listing files in a variable. You can create a runtime parameter for the graph and inside the
parameter you can list all the files separated by spaces.
3. Listing files using a command - E.g. $(ls $AI_SERIAL/ad_hoc_input_*.dat), which produces the
list of files to be used for the Ad hoc multifile. This method gives maximum flexibility in choosing
the input files, since you can also use complex commands that involve the file's owner or
date-time stamp; see the sketch below.
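For instance, as an illustrative sketch (the file pattern and age test are hypothetical), a command
that picks up only the matching files modified within the last day might be:

$(find $AI_SERIAL -name 'ad_hoc_input_*.dat' -mtime -1)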
Q. What is the difference between Replicate and Broadcast?
A. Broadcast and Replicate are similar components, but generally Replicate is used to increase
component parallelism, emitting multiple straight flows to separate pipelines. Broadcast is used
to increase data parallelism by feeding records to fan-out or all-to-all flows.
Or
Replicate is an older component than Broadcast. You can use Broadcast as a join component,
whereas you cannot use Replicate that way. By default, Replicate emits straight flows while
Broadcast emits fan-out or all-to-all flows.
Broadcast is used for data parallelism, whereas Replicate is used for component parallelism.
Or
Replicate (supports component parallelism):

Input File ------> Replicate ------> Reformat ------> Output File
                       |
                       |
                       +-----------> Rollup --------> Output File

Broadcast (supports data parallelism):

Input File ------> Broadcast --(fan-out / all-to-all)--> downstream partitions
2. Another major benefit of component folding is the reduction of DML interpretation time
between processes, because the graph ends up with folded multitool processes communicating
with other multitool or unitool processes.
3. Apart from that, an increase in the number of processes results in higher interprocess
communication. Data movement between two or more processes consumes not only time but
also memory. In a CFG (Continuous Flow Graph), interprocess communication is always very
high, so it is well worth enabling component folding in a CFG.
Disadvantages of Component Folding:
1. Pipeline parallelism: Because component folding folds different components into a single
process, it hurts the pipeline parallelism of Ab Initio. Suppose the flow of our graph is:
Input File -> Filter by Expression -> Reformat -> Output File. In the traditional method, pipeline
parallelism lets the FBE and the Reformat execute concurrently. But if these two components are
folded together, there is no chance of parallel execution.
2. Address space: In a 32-bit OS, the maximum address space for a process is 4 GB. So if we
combine 4 different components into a single process by component folding, the OS allows only
4 GB of address space for all 4, instead of 4 x 4 = 16 GB in total. We should therefore avoid
folding components whose memory use is very high, such as in-memory Rollup, Join, and
Reformat with a lookup. Some components, like Sort and in-memory Join, cause internal
buffering of data; combining them in a single process will result in writing to disk (higher I/O).
Set the AB_MULTITOOL_MAXCORE variable to limit the maximum allowable memory for the
folded component group.
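For example, assuming the variable takes a byte count like other max-core settings in Ab Initio
(the value here is purely illustrative):

export AB_MULTITOOL_MAXCORE=104857600   # ~100 MB for the folded component group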
Excluding a component from Component Folding:
Sometimes you may wish to prevent components from being folded, either to allow pipeline
parallelism or to give them access to more address space. To exclude components from being
folded, set the AB_FOLD_COMPONENTS_EXCLUDE_MPNAMES configuration variable to the
space-separated mpnames of the components, in your $HOME/.abinitiorc or the system-wide
$AB_HOME/config/abinitiorc file. E.g.:
export AB_FOLD_COMPONENTS_EXCLUDE_MPNAMES="hash-rollup reformat-transform"
Alternatively, to prevent two particular components from being folded together, right-click on
the flow between them and uncheck the Allow Component Folding option.
Everything has its cost, so it is always worth benchmarking before making a decision. Prevent
and allow component folding for the components of your graph, and tune it for the highest
performance.
CPU tracking report of folded components in a graph:
To report the execution detail of a folded graph on the console, override the AB_REPORT
variable with the show-folding option, e.g.:
AB_REPORT="show-folding flows times interval=180 scroll=true spillage totals file-percentages"
The folded components are displayed as a multitool process in the CPU tracking information.
The CPU time for a folded component is shown twice: once for the component itself and once
as part of the multitool process.
Parameter Definition Language (PDL):
PDL is used to put inline computation logic into a parameter value, and it provides great
flexibility in terms of interpretation. It supports both $ and ${} substitution. To use it, set the
parameter's interpretation to PDL and write the DML expression within $[ ]. This approach is
much faster than traditional shell scripting and is the way forward to a more flexible and robust
design technique; with it we can do away with old-style shell scripting, since script-start and
script-end have been beaten to death over the last few years. You can also use PDL
interpretation for the condition of a conditional component.
NOTE: The documentation of PDL within the GDE lacks consistency. Basically, you can use the
majority of the Ab Initio DML functions. I would recommend looking at the metaprogramming
section for starters, then playing with the parameters editor.
E.g., suppose a graph has a conditional component which runs based on the existence of a file
called emp.dat.
The FILE_NAME parameter is defined as /home/xyz/emp.dat, and a conditional parameter
called EXIST is defined as:
$[if (file_information($FILE_NAME).found) 1 else 0]
We can also define a parameter containing a type or a transform function with the help of the
AB_DML_DEFS parameter.
E.g., suppose AB_DML_DEFS is defined as:
out :: sqrt(in) =
begin
  out :: math_sqrt(in);
end;
Now a parameter called SQRT is defined as $[sqrt(16)].
The resolved value of this parameter will be 4.
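Similarly, as a purely illustrative example (the parameter name and path are hypothetical), a
parameter OUT_FILE with PDL interpretation could be defined as:

$[string_concat($AI_SERIAL, "/emp_out.dat")]

which resolves by calling the DML string_concat function when the parameter is evaluated.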
Ensure that dynamic script generation is enabled in your host run settings, and read the 2.14
patchset notes for further details.