Py | Generic XML 2 Relational Data Convertor - Basic
Why I need this Generic XML Parser
- Its often required to load xml data into tables, so that business users can access the XML data and also use those table to write reports.
- XMLs I get normally are from 3rd parties and most of the time I dont get the xsd for those XML. Plus we get different XMLs from time to time and if we had hard-code the path to read a specific XML then I had to write code for every XML that I received.
So I thought it is better to make a GENERIC XML PARSER which can take any XML file and convert that into RELATIONAL DATA style and write that into CSVs ( which can be used to load the tables)
Logic applied for XML 2 Relational Data conversion
- Even Level in the XML is a table. ie
- items/ item is a table
- items/item/baters is a table
- Can not handle Namespaces
- All key columns + columns created (by XML2RD) for reference created by with prefix = '_' (eg _xid , _xpath etc)
- <Element> name becomes the column name.
- In case of attributes the column name will be <element_name>#<attribute_name>. In the below example of <Item id="0001" ... the column name for "id" will be item#id
- If there are same element name under a level then __1 __2 __3 and so on will be added. In the below example for topping we will have
- topping#id = 5001 and topping = "None"
- topping#id__1 = 5002 and topping__1 = "Glazed"
- topping#id__2 = 5003 and topping__2 = "Sugar"
- and so on......
Example
sample xml
CSV Generated for the above XML
There will be 2 CSV generated and it will be inside csv_dir (default is .\venpy\bin\csv_dir )
- one for xmlpath - /items/item = ninjaCSV_table_prefix_item.csv
- one for xmlpath - /items/item/batter = ninjaCSV_table_prefix_batters.csv
NOTE:
- One CSV = one Table
- each CSV has pk column = _xid + first ID column
- the childs can be linked to the parent using _xid + Parent's ID column ( which will be available in the child CSV)
in our Example
- ninjaCSV_table_prefix_item.csv
- has primary_key columns - _xid + _table_prefix_item_id
- ninjaCSV_table_prefix_batters.csv
- has primary_key column = _xid + _table_prefix_batters_id
- Foregin key to item's Table = _xid + _table_prefix_item_id
XML2RD Code and sample test code along with sample XMLs are available here
Venkat's GitHub Page for XML2RD
Sample code to test the Generic XML 2 Relational Data Convertor
Sample code to test the Generic XML 2 Relational Data Convertor
#!/usr/bin/python -tt __version__ = '1.0' import sys import time import re import os import subprocess import datetime as dt import json import csv from os import listdir from venpy.logger import MyLogger from venpy.xml2rd import XML2RD def test_xml2rd(): logger = MyLogger(pTag=0) xml2rd = XML2RD() #---- lets process a simple XML logger.log( "----- test for simple XML into CSV -----") l_pwd = os.path.dirname(os.path.abspath(sys.argv[0])) l_xml_file = os.path.join(l_pwd,'plant_catalog.xml') #---- It loops the same xml 2times. firsttime to get the column definition and secondtime to write the CSV #-- If you just need CSV then you can sk for ix in range(1,3): if ix == 1: #--get the table definition and i use this to dynamically create the tables in my DB ld_table_defn = xml2rd.process_xml_getDefn( pXML_pathfilename = l_xml_file, pAddInfo = "info here goes into _add_info in the csv. I used this for zip file name") #--additional file info logger.log("--------------- Table Definition ----------------") logger.log(ld_table_defn) logger.log("-------------------------------------------------") elif ix == 2: #----Now process the files and generate CSV xml2rd.process_xml_csv( pXML_pathfilename = l_xml_file, pAddInfo = "info here goes into _add_info in the csv. I used this for zip file name") #--additional file info logger.log( "you can see the files in CSV directory - csv_dir") #end if #end for #---- lets process a NESTED XML logger.log( " ----- test for nested XML into CSV -----") l_pwd = os.path.dirname(os.path.abspath(sys.argv[0])) l_xml_file = os.path.join(l_pwd,'nested.xml') #---- It loops the same xml 2times. firsttime to get the column definition and secondtime to write the CSV #-- If you just need CSV then you can sk for ix in range(1,3): if ix == 1: #--get the table definition and i use this to dynamically create the tables in my DB ld_table_defn = xml2rd.process_xml_getDefn( pXML_pathfilename = l_xml_file, pAddInfo = "info here goes into _add_info in the csv. I used this for zip file name") #--additional file info logger.log("--------------- Table Definition ----------------") logger.log(ld_table_defn) logger.log("-------------------------------------------------") elif ix == 2: #----Now process the files and generate CSV xml2rd.process_xml_csv( pXML_pathfilename = l_xml_file, pAddInfo = "info here goes into _add_info in the csv. I used this for zip file name") #--additional file info logger.log( "you can see the files in CSV directory - csv_dir") #end if #end for #----------------------------------------------------------------------------------------------------------------- if __name__ == "__main__": #will be called from command line test_xml2rd()
Comments