Script to analyze the structure of an XML document

When working with XML data, you often don’t have the WSDL files and end up reading through the documents by hand to understand their structure. On my current project I ran into a few hundred XML files and had to analyze them to understand what data was available. Here is a script I wrote which prints every distinct node path found in the input files:

#!/usr/bin/env ruby
# Author: Khaja Minhajuddin <minhajuddin.k@gmail.com>
require 'nokogiri'

# Collects the distinct node and attribute paths found in a set of XML files.
class XmlAnalyze
  def initialize(filepaths)
    @filepaths = filepaths
    @node_paths = {}
  end

  # Returns the sorted list of unique paths across all input files.
  def analyze
    @filepaths.each { |filepath| analyze_file(filepath) }
    @node_paths.keys.sort
  end

  private

  def analyze_file(filepath)
    doc = File.open(filepath) { |f| Nokogiri::XML(f) }
    # Start at the root element; children.first may be a comment or DTD node.
    analyze_node(doc.root)
  end

  def analyze_node(node)
    # Skip a missing root and text nodes, which carry content, not structure.
    return if node.nil? || node.is_a?(Nokogiri::XML::Text)

    add_path(node.path)
    node.attributes.each_key do |attr|
      add_path("#{node.path}:#{attr}")
    end
    node.children.each do |child|
      analyze_node(child)
    end
  end

  def add_path(path)
    # Strip positional indices such as [2] so repeated siblings share one path.
    path = path.gsub(/\[\d+\]/, '')
    @node_paths[path] = true
  end
end

if ARGV.empty?
  puts 'Usage: ./analyze_xml.rb file1.xml file2.xml ....'
  exit(1)
end

puts XmlAnalyze.new(ARGV).analyze

For the XML document below, it prints the list of paths that follows it:

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <person>
    <name type="full">Khaja</name>
    <age>31</age>
  </person>
  <person>
    <name type="full">Khaja</name>
    <dob>Jan</dob>
  </person>
</root>

Output:

/root
/root/person
/root/person/age
/root/person/dob
/root/person/name
/root/person/name:type
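
Both person elements collapse into a single set of paths because add_path strips the positional index from each path before storing it. Here is a minimal sketch of just that step, with no Nokogiri involved. The raw_paths list is hand-written in the shape Nokogiri's path method typically returns for repeated siblings (an assumption for illustration, not captured output), and normalize_path is a hypothetical helper name that does not appear in the script.

```ruby
# Hand-written sample paths, assumed to resemble what node.path yields for
# the example document; repeated <person> siblings carry an index like [2].
raw_paths = [
  '/root',
  '/root/person[1]',
  '/root/person[1]/name',
  '/root/person[1]/name:type',
  '/root/person[1]/age',
  '/root/person[2]',
  '/root/person[2]/name',
  '/root/person[2]/name:type',
  '/root/person[2]/dob'
]

# Hypothetical helper: the same substitution add_path performs, removing
# every positional predicate such as "[1]" or "[2]".
def normalize_path(path)
  path.gsub(/\[\d+\]/, '')
end

# uniq stands in for the hash keys used in the script; sort mirrors analyze.
unique_paths = raw_paths.map { |path| normalize_path(path) }.uniq.sort
puts unique_paths
```

Run on its own, this prints the same six lines as the output above, which shows that the deduplication comes entirely from the gsub call plus the uniqueness of hash keys.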

Hope you find it useful!

