Py | Generic XML 2 Relational Data Convertor - Basic

Why I need this Generic XML Parser
  1. Its often required to load xml data into tables, so that business users can access the XML data and also use those table to write reports. 
  2. XMLs I get normally are from 3rd parties and most of the time I dont get the xsd for those XML. Plus we get different XMLs from time to time and if we had hard-code the path to read a specific XML then I had to write code for every XML that I received.
So I thought it is better to make a GENERIC XML PARSER which can take any XML file and convert that into RELATIONAL DATA style and write that into CSVs ( which can be used to load the tables)

Logic applied for  XML 2 Relational Data conversion

  1. Even Level in the XML is a table. ie 
    • items/ item is a table 
    • items/item/baters is a table
  2. Can not handle Namespaces
  3. All key columns + columns created (by XML2RD) for reference created by with prefix = '_' (eg _xid , _xpath etc)
  4. <Element> name becomes the column name.
  5. In case of attributes the column name will be <element_name>#<attribute_name>. In the below example of <Item id="0001" ... the column name for "id" will be item#id
  6. If there are same element name under a level then __1 __2 __3 and so on will be added. In the below example for topping we will have 
    • topping#id = 5001 and topping = "None"
    • topping#id__1 = 5002 and topping__1 = "Glazed"
    • topping#id__2 = 5003 and topping__2 = "Sugar"
    • and so on......
Example
sample xml


CSV Generated for the above XML
There will be 2 CSV generated and it will be inside  csv_dir (default is .\venpy\bin\csv_dir )
  1. one for xmlpath - /items/item  = ninjaCSV_table_prefix_item.csv
  2. one for xmlpath - /items/item/batter  = ninjaCSV_table_prefix_batters.csv


NOTE:
  • One CSV = one Table  
  • each CSV has pk column = _xid + first ID column
  • the childs can be linked to the parent using _xid + Parent's ID column ( which will be available in the child CSV)
in our Example 
  1. ninjaCSV_table_prefix_item.csv 
    • has primary_key columns - _xid + _table_prefix_item_id
  2. ninjaCSV_table_prefix_batters.csv 
    • has primary_key column  = _xid + _table_prefix_batters_id
    • Foregin key to item's Table =  _xid + _table_prefix_item_id
XML2RD Code and sample test code along with sample XMLs are available here
Venkat's GitHub Page for XML2RD

Sample code to test the Generic XML 2 Relational Data Convertor

#!/usr/bin/python -tt 

__version__ = '1.0'

import sys
import time
import re
import os
import subprocess
import datetime as dt
import json
import csv
from os import listdir
from  venpy.logger  import MyLogger
from  venpy.xml2rd  import XML2RD

def test_xml2rd():
 logger = MyLogger(pTag=0)
 xml2rd = XML2RD()
 
 
 #---- lets process a simple XML
 logger.log( "----- test for simple XML into CSV -----")
 l_pwd = os.path.dirname(os.path.abspath(sys.argv[0]))
 l_xml_file = os.path.join(l_pwd,'plant_catalog.xml')
 
 #---- It loops the same xml 2times. firsttime to get the column definition and secondtime to write the CSV
 #--   If you just need CSV then you can sk
 for ix in range(1,3): 
  if ix == 1:
   #--get the table definition and i use this to dynamically create the tables in my DB
   ld_table_defn = xml2rd.process_xml_getDefn( pXML_pathfilename = l_xml_file, 
              pAddInfo    = "info here goes into _add_info in the csv. I used this for zip file name") #--additional file info     
   logger.log("--------------- Table Definition ----------------")
   logger.log(ld_table_defn)
   logger.log("-------------------------------------------------")
  elif ix == 2:
   #----Now process the files and generate CSV
   xml2rd.process_xml_csv( pXML_pathfilename = l_xml_file, 
         pAddInfo   = "info here goes into _add_info in the csv. I used this for zip file name") #--additional file info     
         
   logger.log( "you can see the files in CSV directory - csv_dir")
  #end if  
 #end for
 
 
 #---- lets process a NESTED XML
 logger.log( " ----- test for nested XML into CSV -----")
 l_pwd = os.path.dirname(os.path.abspath(sys.argv[0]))
 l_xml_file = os.path.join(l_pwd,'nested.xml')
 
 #---- It loops the same xml 2times. firsttime to get the column definition and secondtime to write the CSV
 #--   If you just need CSV then you can sk
 for ix in range(1,3): 
  if ix == 1:
   #--get the table definition and i use this to dynamically create the tables in my DB
   ld_table_defn = xml2rd.process_xml_getDefn( pXML_pathfilename = l_xml_file, 
              pAddInfo    = "info here goes into _add_info in the csv. I used this for zip file name") #--additional file info     
   logger.log("--------------- Table Definition ----------------")
   logger.log(ld_table_defn)
   logger.log("-------------------------------------------------")
  elif ix == 2:
   #----Now process the files and generate CSV
   xml2rd.process_xml_csv( pXML_pathfilename = l_xml_file, 
         pAddInfo   = "info here goes into _add_info in the csv. I used this for zip file name") #--additional file info     
         
   logger.log( "you can see the files in CSV directory - csv_dir")
  #end if  
 #end for
 
#-----------------------------------------------------------------------------------------------------------------
if __name__ == "__main__":  #will be called from command line
 test_xml2rd()

Comments

Popular posts from this blog

Tableau - Accessing Tableau's DB

react-bootstrap-table | header column alignment fix

Tableau : Convert ESRI shapes into Tableau Format