Script to analyze the structure of an XML document

When working with XML data, you often don't have the WSDL files and end up reading through the documents by hand to understand their structure. On my current project I ran into a few hundred XML files and had to analyze them to understand what data was available. Here is a script I wrote that prints every distinct node path found in the input files:

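#!/usr/bin/env ruby
# Author: Khaja Minhajuddin <minhajuddin.k@gmail.com>

require 'nokogiri'

# Collects every distinct element and attribute path found in a set of XML files.
class XmlAnalyze
  def initialize(filepaths)
    @filepaths = filepaths
    @node_paths = {}
  end

  # Returns the sorted list of unique paths across all input files.
  def analyze
    @filepaths.each { |filepath| analyze_file(filepath) }
    @node_paths.keys.sort
  end

  private

  def analyze_file(filepath)
    doc = File.open(filepath) { |f| Nokogiri::XML(f) }
    # Start at the root element; the document's first child can be a comment or DOCTYPE.
    analyze_node(doc.root) if doc.root
  end

  def analyze_node(node)
    # Only elements contribute paths; skip text, comments, and other node types.
    return unless node.element?

    add_path(node.path)

    # Record attributes as "path:attribute".
    node.attributes.each_key do |attr|
      add_path("#{node.path}:#{attr}")
    end

    node.children.each do |child|
      analyze_node(child)
    end
  end

  def add_path(path)
    # Strip positional predicates such as [2] so repeated siblings share one path.
    path = path.gsub(/\[\d+\]/, '')
    @node_paths[path] = true
  end
end

if ARGV.empty?
  puts 'Usage: ./analyze_xml.rb file1.xml file2.xml ....'
  exit(-1)
end

puts XmlAnalyze.new(ARGV).analyze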
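
The deduplication in `add_path` is easy to try on its own. Here is a minimal sketch in plain Ruby, with no Nokogiri needed. The `raw_paths` list is hand-written and assumes `node.path` reports repeated siblings with positional predicates such as `person[2]`; the `normalize` helper is a hypothetical name for the same `gsub` the script uses.

```ruby
# Hand-written sample paths, in the form node.path is assumed to report them
# when an element has repeated siblings (positional predicates like [2]).
raw_paths = [
  '/root',
  '/root/person[1]',
  '/root/person[1]/name',
  '/root/person[1]/age',
  '/root/person[2]',
  '/root/person[2]/name',
  '/root/person[2]/dob'
]

# Hypothetical helper: the same substitution add_path performs.
def normalize(path)
  path.gsub(/\[\d+\]/, '')
end

# Hash keys are unique, so paths that normalize to the same string collapse.
seen = {}
raw_paths.each { |path| seen[normalize(path)] = true }

puts seen.keys.sort
```

Seven raw paths collapse to five unique ones here, which is why hundreds of files with thousands of repeated records still produce a short list.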

Given the XML below, the script prints the list of paths that follows it.

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <person>
    <name type="full">Khaja</name>
    <age>31</age>
  </person>
  <person>
    <name type="full">Khaja</name>
    <dob>Jan</dob>
  </person>
</root>

Output:

/root
/root/person
/root/person/age
/root/person/dob
/root/person/name
/root/person/name:type
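
The list shows which paths exist, but not how often each one occurs, so you can't tell that `age` and `dob` each appear in only one of the two `person` records. Below is a rough sketch of a counting variant, not part of the original script. It makes two assumptions of its own: it uses REXML (which ships with standard Ruby installations) instead of Nokogiri, and it builds each path while walking the tree rather than stripping indices afterwards. `count_paths` is a hypothetical helper name.

```ruby
require 'rexml/document'

# Hypothetical helper: walks the element tree and tallies each index-free path.
def count_paths(element, parent_path = '', counts = Hash.new(0))
  path = "#{parent_path}/#{element.name}"
  counts[path] += 1

  # Attributes follow the script's "path:attribute" convention.
  element.attributes.each_attribute do |attr|
    counts["#{path}:#{attr.name}"] += 1
  end

  # each_element yields only child elements, so text nodes are skipped.
  element.each_element { |child| count_paths(child, path, counts) }
  counts
end

xml = <<~XML
  <root>
    <person>
      <name type="full">Khaja</name>
      <age>31</age>
    </person>
    <person>
      <name type="full">Khaja</name>
      <dob>Jan</dob>
    </person>
  </root>
XML

counts = count_paths(REXML::Document.new(xml).root)
counts.sort.each { |path, n| puts "#{n}\t#{path}" }
```

For the sample document this counts `/root/person` twice and `/root/person/age` and `/root/person/dob` once each, which hints at which fields are optional.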

Hope you find it useful!

