Script to analyze the structure of an XML document

While working with XML data, you often don’t have the schema or WSDL files and may end up working through the documents by hand to understand their structure. On my current project I ran into a few hundred XML files and had to analyze them to understand what data was available. Here is a script I wrote that prints every distinct element and attribute path found in the input files:

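#!/usr/bin/env ruby
# Author: Khaja Minhajuddin <minhajuddin.k@gmail.com>

require 'nokogiri'

# Collects every distinct element and attribute path found in a set of XML files.
class XmlAnalyze
  def initialize(filepaths)
    @filepaths = filepaths
    @node_paths = {} # used as a set: path => true
  end

  # Returns the sorted list of unique paths across all input files.
  def analyze
    @filepaths.each { |filepath| analyze_file(filepath) }
    @node_paths.keys.sort
  end

  private

  def analyze_file(filepath)
    doc = File.open(filepath) { |f| Nokogiri::XML(f) }
    # Start from the root element; the first child of the document may be
    # a comment or DTD rather than the root.
    analyze_node(doc.root) if doc.root
  end

  def analyze_node(node)
    # Only elements contribute paths; skip text, comments, and so on.
    return unless node.element?

    add_path(node.path)

    # Record attributes as "<element path>:<attribute name>".
    node.attributes.each_key do |attr|
      add_path("#{node.path}:#{attr}")
    end

    node.children.each do |child|
      analyze_node(child)
    end
  end

  def add_path(path)
    # Strip positional predicates such as [2] so repeated siblings collapse.
    path = path.gsub(/\[\d+\]/, '')
    @node_paths[path] = true
  end
end

if ARGV.empty?
  warn 'Usage: ./analyze_xml.rb file1.xml file2.xml ....'
  exit(1)
end

puts XmlAnalyze.new(ARGV).analyze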
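
The step that makes the output compact is the path normalization in add_path. Nokogiri reports repeated siblings with a positional index (for example /root/person[2]/name), and removing those indexes lets all the siblings collapse into one entry. Here is a minimal sketch of just that step, using plain strings so it runs without Nokogiri. The normalize_path helper name and the sample paths are made up for illustration:

```ruby
# Hypothetical helper mirroring the normalization done in add_path.
def normalize_path(path)
  # Drop positional predicates such as [1] or [12] from every step.
  path.gsub(/\[\d+\]/, '')
end

# Sample paths in the indexed form reported for repeated siblings
# (illustrative values, not taken from a real run).
raw_paths = [
  '/root/person[1]/name',
  '/root/person[2]/name',
  '/root/person[2]/dob'
]

# After normalization the two name paths become identical, so uniq
# keeps only one of them.
unique_paths = raw_paths.map { |path| normalize_path(path) }.uniq.sort
puts unique_paths
```

The script gets the same effect by using the normalized path as a hash key, which discards duplicates as they are added.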

Given the XML below, the script prints the paths listed after it:

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <person>
    <name type="full">Khaja</name>
    <age>31</age>
  </person>
  <person>
    <name type="full">Khaja</name>
    <dob>Jan</dob>
  </person>
</root>

Output:

/root
/root/person
/root/person/age
/root/person/dob
/root/person/name
/root/person/name:type
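
If the flat list gets long, it can help to view it as an indented tree. The following is a rough sketch, not part of the original script. The build_tree and render_tree helpers are hypothetical names, and the code assumes element names carry no namespace prefix, so the only colon in a path is the one that introduces an attribute:

```ruby
# Hypothetical helper: turn flat paths into a nested hash.
# Assumes ":" appears only as the attribute separator used by the script.
def build_tree(paths)
  tree = {}
  paths.each do |path|
    element_path, attribute = path.split(':', 2)
    steps = element_path.split('/').reject(&:empty?)
    steps << "@#{attribute}" if attribute # mark attributes with "@"
    # Walk down the tree, creating a level for each step as needed.
    steps.reduce(tree) { |level, step| level[step] ||= {} }
  end
  tree
end

# Hypothetical helper: render the nested hash as indented lines.
def render_tree(tree, depth = 0)
  tree.sort_by { |name, _children| name }.flat_map do |name, children|
    ["#{'  ' * depth}#{name}"] + render_tree(children, depth + 1)
  end
end

paths = %w[
  /root
  /root/person
  /root/person/age
  /root/person/dob
  /root/person/name
  /root/person/name:type
]

tree_lines = render_tree(build_tree(paths))
puts tree_lines
```

For the six paths above this prints root, then person indented beneath it, then age, dob, and name one level deeper, with @type nested under name.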

Hope you find it useful!

