Py | Generic XML 2 Relational Data Convertor

Py | Generic XML 2 Relational Data Convertor - Basic

- December 13, 2015

Why I need this Generic XML Parser

Its often required to load xml data into tables, so that business users can access the XML data and also use those table to write reports.
XMLs I get normally are from 3rd parties and most of the time I dont get the xsd for those XML. Plus we get different XMLs from time to time and if we had hard-code the path to read a specific XML then I had to write code for every XML that I received.

So I thought it is better to make a GENERIC XML PARSER which can take any XML file and convert that into RELATIONAL DATA style and write that into CSVs ( which can be used to load the tables)

Logic applied for XML 2 Relational Data conversion

Even Level in the XML is a table. ie

items/ item is a table
items/item/baters is a table

Can not handle Namespaces
All key columns + columns created (by XML2RD) for reference created by with prefix = '_' (eg _xid , _xpath etc)
<Element> name becomes the column name.
In case of attributes the column name will be <element_name>#<attribute_name>. In the below example of <Item id="0001" ... the column name for "id" will be item#id
If there are same element name under a level then __1 __2 __3 and so on will be added. In the below example for topping we will have

topping#id = 5001 and topping = "None"
topping#id__1 = 5002 and topping__1 = "Glazed"
topping#id__2 = 5003 and topping__2 = "Sugar"
and so on......

Example

sample xml

CSV Generated for the above XML

There will be 2 CSV generated and it will be inside csv_dir (default is .\venpy\bin\csv_dir )

one for xmlpath - /items/item = ninjaCSV_table_prefix_item.csv
one for xmlpath - /items/item/batter = ninjaCSV_table_prefix_batters.csv

NOTE:

One CSV = one Table
each CSV has pk column = _xid + first ID column
the childs can be linked to the parent using _xid + Parent's ID column ( which will be available in the child CSV)

in our Example

ninjaCSV_table_prefix_item.csv

has primary_key columns - _xid + _table_prefix_item_id

ninjaCSV_table_prefix_batters.csv

has primary_key column = _xid + _table_prefix_batters_id
Foregin key to item's Table = _xid + _table_prefix_item_id

XML2RD Code and sample test code along with sample XMLs are available here

Venkat's GitHub Page for XML2RD

Sample code to test the Generic XML 2 Relational Data Convertor

#!/usr/bin/python -tt 

__version__ = '1.0'

import sys
import time
import re
import os
import subprocess
import datetime as dt
import json
import csv
from os import listdir
from  venpy.logger  import MyLogger
from  venpy.xml2rd  import XML2RD

def test_xml2rd():
 logger = MyLogger(pTag=0)
 xml2rd = XML2RD()
 
 
 #---- lets process a simple XML
 logger.log( "----- test for simple XML into CSV -----")
 l_pwd = os.path.dirname(os.path.abspath(sys.argv[0]))
 l_xml_file = os.path.join(l_pwd,'plant_catalog.xml')
 
 #---- It loops the same xml 2times. firsttime to get the column definition and secondtime to write the CSV
 #--   If you just need CSV then you can sk
 for ix in range(1,3): 
  if ix == 1:
   #--get the table definition and i use this to dynamically create the tables in my DB
   ld_table_defn = xml2rd.process_xml_getDefn( pXML_pathfilename = l_xml_file, 
              pAddInfo    = "info here goes into _add_info in the csv. I used this for zip file name") #--additional file info     
   logger.log("--------------- Table Definition ----------------")
   logger.log(ld_table_defn)
   logger.log("-------------------------------------------------")
  elif ix == 2:
   #----Now process the files and generate CSV
   xml2rd.process_xml_csv( pXML_pathfilename = l_xml_file, 
         pAddInfo   = "info here goes into _add_info in the csv. I used this for zip file name") #--additional file info     
         
   logger.log( "you can see the files in CSV directory - csv_dir")
  #end if  
 #end for
 
 
 #---- lets process a NESTED XML
 logger.log( " ----- test for nested XML into CSV -----")
 l_pwd = os.path.dirname(os.path.abspath(sys.argv[0]))
 l_xml_file = os.path.join(l_pwd,'nested.xml')
 
 #---- It loops the same xml 2times. firsttime to get the column definition and secondtime to write the CSV
 #--   If you just need CSV then you can sk
 for ix in range(1,3): 
  if ix == 1:
   #--get the table definition and i use this to dynamically create the tables in my DB
   ld_table_defn = xml2rd.process_xml_getDefn( pXML_pathfilename = l_xml_file, 
              pAddInfo    = "info here goes into _add_info in the csv. I used this for zip file name") #--additional file info     
   logger.log("--------------- Table Definition ----------------")
   logger.log(ld_table_defn)
   logger.log("-------------------------------------------------")
  elif ix == 2:
   #----Now process the files and generate CSV
   xml2rd.process_xml_csv( pXML_pathfilename = l_xml_file, 
         pAddInfo   = "info here goes into _add_info in the csv. I used this for zip file name") #--additional file info     
         
   logger.log( "you can see the files in CSV directory - csv_dir")
  #end if  
 #end for
 
#-----------------------------------------------------------------------------------------------------------------
if __name__ == "__main__":  #will be called from command line
 test_xml2rd()

Search This Blog

Venkat$ echo

Py | Generic XML 2 Relational Data Convertor - Basic

Logic applied for XML 2 Relational Data conversion

Comments

Popular posts from this blog

Tableau - Accessing Tableau's DB

react-bootstrap-table | header column alignment fix

Tableau : Convert ESRI shapes into Tableau Format