5. Practical session: Working with Beautiful Soup

In this lesson, we are going to explore how we can use the package Beautiful Soup to extract content from XML files. We use the same example file that was used in lesson 2 (download here).

This lesson is divided into the following steps:

Load the XML file;
Examine the structure of the XML file;
Extract the book titles and descriptions;
Extract name and surname of the author;
Extract the book identifier;
Structure all information;
Explore namespaces;
Extra: Filter information

Open a new Jupyter Notebook and type all the code examples and code exercises in your Notebook.

Install Beautiful Soup

Beautiful Soup is not part of the Python standard library, so it needs to be installed first. This can be done directly in the Jupyter Notebook using:

!pip install beautifulsoup4

or through the command line (see lesson 1)

pip install beautifulsoup4

Import Beautiful Soup and load the xml file

Before we can use the package, we have to let Python know we want to use it. We do this by importing the package. Type the following in a code cell:

from bs4 import BeautifulSoup  

Now we can use the package to extract data from the XML.

Examine the structure of the file

Now we want to open the XML file from which we want to extract information. Add a new code cell and type:

with open("data/example.xml") as f:
    root = BeautifulSoup(f, 'xml')

Note

In the code above, replace ‘data/example.xml’ with the path to the folder and the filename where you stored the file.
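To try the parser without the example file, a small XML string can be parsed directly. The sketch below is only an illustration: the string `sample_xml` and the names `sample_root` and `book_count` are hypothetical stand-ins, not part of the lesson's data. Note that the `'xml'` feature relies on the separate `lxml` package (`pip install lxml`); without it, Beautiful Soup raises a `FeatureNotFound` error.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the example file.
sample_xml = """<catalog>
  <book id="bk101">
    <title>XML Developer's Guide</title>
  </book>
  <book id="bk102">
    <title>Midnight Rain</title>
  </book>
</catalog>"""

# BeautifulSoup accepts a string as well as an open file object.
sample_root = BeautifulSoup(sample_xml, "xml")

# Count the book elements to confirm that parsing worked.
book_count = len(sample_root.find_all("book"))
print(book_count)
```

If the count matches the number of ‘book’ elements you expect, the file or string was parsed as XML.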

When you want to extract information from an XML file, it is important that you are familiar with the structure of the file. There are two ways to do this.

You can open the file in a program like Notepad++ or open it in your browser
You can show the file in your Jupyter Notebook with the following code:

print(root)

<?xml version="1.0" encoding="utf-8"?>
<catalog>
<book id="bk101">
<author>
<name>Matthew</name>
<surname>Gambardella</surname>
</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications 
      with XML.</description>
</book>
<book id="bk102">
<author>
<name>Kim</name>
<surname>Ralls</surname>
</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
</book>
<book id="bk103">
<author>
<name>Eva</name>
<surname>Corets</surname>
</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.</description>
</book>
<book id="bk104">
<author>
<name>Eva</name>
<surname>Corets</surname>
</author>
<title>Oberon's Legacy</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-03-10</publish_date>
<description>In post-apocalypse England, the mysterious 
      agent known only as Oberon helps to create a new life 
      for the inhabitants of London. Sequel to Maeve 
      Ascendant.</description>
</book>
<book id="bk105">
<author>
<name>Eva</name>
<surname>Corets</surname>
</author>
<title>The Sundered Grail</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-09-10</publish_date>
<description>The two daughters of Maeve, half-sisters, 
      battle one another for control of England. Sequel to 
      Oberon's Legacy.</description>
</book>
<book id="bk106">
<author>
<name>Cynthia</name>
<surname>Randall</surname>
</author>
<title>Lover Birds</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-09-02</publish_date>
<description>When Carla meets Paul at an ornithology 
      conference, tempers fly as feathers get ruffled.</description>
</book>
<book id="bk107">
<author>
<name>Paula</name>
<surname>Thurman</surname>
</author>
<title>Splish Splash</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-11-02</publish_date>
<description>A deep sea diver finds true love twenty 
      thousand leagues beneath the sea.</description>
</book>
<book id="bk108">
<author>
<name>Stefan</name>
<surname>Knorr</surname>
</author>
<title>Creepy Crawlies</title>
<genre>Horror</genre>
<price>4.95</price>
<publish_date>2000-12-06</publish_date>
<description>An anthology of horror stories about roaches,
      centipedes, scorpions  and other insects.</description>
</book>
<book id="bk109">
<author>
<name>Peter</name>
<surname>Kress</surname>
</author>
<title>Paradox Lost</title>
<genre>Science Fiction</genre>
<price>6.95</price>
<publish_date>2000-11-02</publish_date>
<description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems 
      of being quantum.</description>
</book>
<book id="bk110">
<author>
<name>Tim</name>
<surname>O'Brien</surname>
</author>
<title>Microsoft .NET: The Programming Bible</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-09</publish_date>
<description>Microsoft's .NET initiative is explored in 
      detail in this deep programmer's reference.</description>
</book>
<book id="bk111">
<author>
<name>Tim</name>
<surname>O'Brien</surname>
</author>
<title>MSXML3: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-01</publish_date>
<description>The Microsoft MSXML3 parser is covered in 
      detail, with attention to XML DOM interfaces, XSLT processing, 
      SAX and more.</description>
</book>
<book id="bk112">
<author>
<name>Mike</name>
<surname>Galos</surname>
</author>
<title>Visual Studio 7: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>49.95</price>
<publish_date>2001-04-16</publish_date>
<description>Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are 
      integrated into a comprehensive development 
      environment.</description>
</book>
</catalog>

Extract the book titles and descriptions

Exercise

Look at the XML structure. Which elements do we need to extract the title and the description?

Solution

We need the element ‘book’, and its children ‘title’ and ‘description’.

First, type the following code in your Jupyter Notebook to get the title from every book:

for book in root.find_all('book'):
    title = book.find('title').text
    print(title)

XML Developer's Guide
Midnight Rain
Maeve Ascendant
Oberon's Legacy
The Sundered Grail
Lover Birds
Splish Splash
Creepy Crawlies
Paradox Lost
Microsoft .NET: The Programming Bible
MSXML3: A Comprehensive Guide
Visual Studio 7: A Comprehensive Guide

Note

Although the basic for loops for ElementTree and Beautiful Soup look identical, please note that there is a small difference: ElementTree uses ‘findall’ and Beautiful Soup uses ‘find_all’ (with an underscore).

We shall explain what every line of the code does.

First, we iterate through the complete XML file and search for every element with the tag name ‘book’.

for book in root.find_all('book'):

Then, for every ‘book’ element that exists, we create a temporary variable with the name ‘title’. As the value for this variable, we use the content of the tag ‘title’ (which is a direct child of the element ‘book’). We add ‘.text’ to let Python know that we are interested in the value between the tags. Without the ‘.text’ addition, Beautiful Soup would give us the complete element including its tags, like ‘<title>XML Developer's Guide</title>’.

title = book.find('title').text

Then, we print the output of the title

print(title)

After this, the loop proceeds to the next ‘book’ element, extracts its title, prints it, and so on.
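The effect of ‘.text’ can be checked on a single element. The following is a minimal sketch, assuming a hypothetical one-book string (`sample_xml`) instead of the example file; it compares the element itself with its text.

```python
from bs4 import BeautifulSoup

# Hypothetical one-book stand-in for the example file.
sample_xml = "<catalog><book id='bk101'><title>XML Developer's Guide</title></book></catalog>"
sample_root = BeautifulSoup(sample_xml, "xml")

title_tag = sample_root.find("book").find("title")

tag_markup = str(title_tag)  # the element including its tags
tag_text = title_tag.text    # only the text between the tags

print(tag_markup)
print(tag_text)
```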

We can get the description of each book in the same way.

Exercise

Alter the code above to retrieve and print all the descriptions.

Solution

for book in root.find_all('book'):
	description = book.find('description').text
	print(description)

This leads to the following output:

An in-depth look at creating applications 
      with XML.
A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.
After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.
In post-apocalypse England, the mysterious 
      agent known only as Oberon helps to create a new life 
      for the inhabitants of London. Sequel to Maeve 
      Ascendant.
The two daughters of Maeve, half-sisters, 
      battle one another for control of England. Sequel to 
      Oberon's Legacy.
When Carla meets Paul at an ornithology 
      conference, tempers fly as feathers get ruffled.
A deep sea diver finds true love twenty 
      thousand leagues beneath the sea.
An anthology of horror stories about roaches,
      centipedes, scorpions  and other insects.
After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems 
      of being quantum.
Microsoft's .NET initiative is explored in 
      detail in this deep programmer's reference.
The Microsoft MSXML3 parser is covered in 
      detail, with attention to XML DOM interfaces, XSLT processing, 
      SAX and more.
Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are 
      integrated into a comprehensive development 
      environment.

We can use one for loop to extract both the book title and the description from the XML file. Combining multiple items is preferable because it avoids repeating the lines that do the same thing (here, the loop over the books). This makes the code more readable and easier to maintain.

Combining the two snippets above leads to the following code:

for book in root.find_all('book'):
	title = book.find('title').text
	description = book.find('description').text
	print(title, description)

XML Developer's Guide An in-depth look at creating applications 
      with XML.
Midnight Rain A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.
Maeve Ascendant After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.
Oberon's Legacy In post-apocalypse England, the mysterious 
      agent known only as Oberon helps to create a new life 
      for the inhabitants of London. Sequel to Maeve 
      Ascendant.
The Sundered Grail The two daughters of Maeve, half-sisters, 
      battle one another for control of England. Sequel to 
      Oberon's Legacy.
Lover Birds When Carla meets Paul at an ornithology 
      conference, tempers fly as feathers get ruffled.
Splish Splash A deep sea diver finds true love twenty 
      thousand leagues beneath the sea.
Creepy Crawlies An anthology of horror stories about roaches,
      centipedes, scorpions  and other insects.
Paradox Lost After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems 
      of being quantum.
Microsoft .NET: The Programming Bible Microsoft's .NET initiative is explored in 
      detail in this deep programmer's reference.
MSXML3: A Comprehensive Guide The Microsoft MSXML3 parser is covered in 
      detail, with attention to XML DOM interfaces, XSLT processing, 
      SAX and more.
Visual Studio 7: A Comprehensive Guide Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are 
      integrated into a comprehensive development 
      environment.
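The descriptions keep the line breaks and indentation of the source file, which is why each one is spread over several lines. If a single-line text is preferred, the whitespace can be collapsed with standard string methods. This is a sketch on one hard-coded description; the variable names are illustrative only.

```python
# Sample description copied from the output above,
# including its line break and indentation.
raw_description = "An in-depth look at creating applications \n      with XML."

# split() without arguments splits on any run of whitespace;
# joining the parts with a single space puts the text on one line.
clean_description = " ".join(raw_description.split())

print(clean_description)
```

Inside the loop, the same expression could be applied to `book.find('description').text`.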

Extract name and surname of the author

You can use the same method as described above to extract the names and surnames of all the authors from the example XML. However, if we look at the structure of the XML file, there is a difference between the placement of the elements ‘title’ and ‘description’, and the elements ‘name’ and ‘surname’ in the XML structure.

<catalog>
	<book id="bk101">
		<author>
			<name>Matthew</name>
			<surname>Gambardella</surname>
		</author>
		<title>XML Developer's Guide</title>
		<genre>Computer</genre>
		<price>44.95</price>
		<publish_date>2000-10-01</publish_date>
		<description>An in-depth look at creating applications with XML.</description>
   </book>

Exercise

Look at the XML snippet above. What is the difference between the element ‘title’ and the element ‘name’?

Solution

The element ‘title’ is a child of the element ‘book’. The element ‘name’, however, is a child of the element ‘author’ and therefore a grandchild of the element ‘book’.

Because of the difference in the place between elements, we need to alter our code a bit. We can use two approaches:

Add another for loop inside our first loop;
‘Escape’ the element hierarchy.

For the first approach, instead of a single for loop that iterates through all the ‘book’ elements, we also need a second for loop that runs through the ‘author’ element of ‘book’. We can do this with the following code:

for book in root.find_all('book'):
    for author in book.find_all('author'):
        name = author.find('name').text
        print(name) 

Matthew
Kim
Eva
Eva
Eva
Cynthia
Paula
Stefan
Peter
Tim
Tim
Mike

Exercise

The above code extracts only the name of an author. Alter the code, so that it extracts both the name and the surname.

Solution

for book in root.find_all('book'):
	for author in book.find_all('author'):
		name = author.find('name').text
		surname = author.find('surname').text
		print(name, surname) 

Matthew Gambardella
Kim Ralls
Eva Corets
Eva Corets
Eva Corets
Cynthia Randall
Paula Thurman
Stefan Knorr
Peter Kress
Tim O'Brien
Tim O'Brien
Mike Galos

The second approach is to ‘escape’ the element hierarchy and directly select all subelements, on all levels beneath the current element. This is useful if you have an XML file with a lot of nested levels. Because ‘find_all’ searches all levels beneath the element it is called on, you can just insert the name of the subchild, as explained in lesson 3 and shown in the following code:

for author in root.find_all('author'):
    name = author.find('name').text
    surname = author.find('surname').text
    print(name, surname)

Matthew Gambardella
Kim Ralls
Eva Corets
Eva Corets
Eva Corets
Cynthia Randall
Paula Thurman
Stefan Knorr
Peter Kress
Tim O'Brien
Tim O'Brien
Mike Galos
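Both approaches give the same result because ‘find_all’ searches every level beneath an element by default. The sketch below, which uses a hypothetical one-book string rather than the example file, contrasts this default with the `recursive=False` argument, which limits the search to direct children.

```python
from bs4 import BeautifulSoup

# Hypothetical one-book stand-in for the example file.
sample_xml = (
    "<catalog><book id='bk101'>"
    "<author><name>Matthew</name><surname>Gambardella</surname></author>"
    "<title>XML Developer's Guide</title>"
    "</book></catalog>"
)
sample_root = BeautifulSoup(sample_xml, "xml")
book = sample_root.find("book")

# Default: search children, grandchildren, and deeper levels.
all_levels = book.find_all("name")
# recursive=False: look at the direct children of 'book' only.
direct_only = book.find_all("name", recursive=False)

print(len(all_levels), len(direct_only))
```

The second search finds nothing, because ‘name’ sits one level deeper, inside ‘author’.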

Extract the book identifier

As you can see in the XML, each book has its own identifier. As books can have the same name, and authors can have written multiple books, it is good practice to always use the identifier to point to a specific item.

In the previous exercises, we extracted the content that was presented between the tags of an element. For example:

<title>XML Developer's Guide</title>

In this example, you see that the title ‘XML Developer’s Guide’ is stored between the tags <title> and </title>. We extracted this content by adding ‘.text’.

Exercise

Look at this example of the ‘book’ element with its identifier. What is the difference between the place of the content of the identifier and the place of the content of the title?

	<book id="bk101">
	</book>

Solution

The content of the identifier is stored in an attribute of the ‘book’ element, with the name ‘id’.

To extract content from attributes, we need to use the ‘.get’ method. We still use the for loop to iterate through all the books, but instead of the content of certain elements, we now extract the content of the attribute.

for book in root.find_all('book'):
    identifier = book.get('id')
    print(identifier)

bk101
bk102
bk103
bk104
bk105
bk106
bk107
bk108
bk109
bk110
bk111
bk112
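Besides ‘.get’, an attribute can also be read with square brackets, and all attributes of an element are available as a dictionary. The following sketch shows the differences on a hypothetical one-element string; the attribute name ‘isbn’ is made up to demonstrate what happens when an attribute is missing.

```python
from bs4 import BeautifulSoup

sample_root = BeautifulSoup("<catalog><book id='bk101'/></catalog>", "xml")
book = sample_root.find("book")

by_get = book.get("id")                 # value of the attribute
by_index = book["id"]                   # same value, but raises KeyError if absent
missing = book.get("isbn")              # None: the attribute does not exist
fallback = book.get("isbn", "unknown")  # a default value of our own choosing
all_attributes = book.attrs             # dictionary of every attribute

print(by_get, by_index, missing, fallback, all_attributes)
```

Using ‘.get’ is the safer choice when not every element is guaranteed to carry the attribute.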

Structure all information

We can combine the different snippets we have used above into one cell. To do this, we can use the following scheme:

Iterate through all books	
	Get content from id attribute
	Get title content	
	Get description content	
	Iterate through all authors
		Get name
		Get surname
	Print id, title, description, name and surname

Exercise

Create the code that extracts, from every book, all the information we have used so far, and prints this information (see scheme above).

Solution

for book in root.find_all('book'):
	identifier = book.get('id')
	title = book.find('title').text
	description = book.find('description').text
	for author in book.find_all('author'):
		name = author.find('name').text
		surname = author.find('surname').text
	print(identifier, title, description, name, surname)

This leads to the following output:

bk101 XML Developer's Guide An in-depth look at creating applications 
      with XML. Matthew Gambardella
bk102 Midnight Rain A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world. Kim Ralls
bk103 Maeve Ascendant After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society. Eva Corets
bk104 Oberon's Legacy In post-apocalypse England, the mysterious 
      agent known only as Oberon helps to create a new life 
      for the inhabitants of London. Sequel to Maeve 
      Ascendant. Eva Corets
bk105 The Sundered Grail The two daughters of Maeve, half-sisters, 
      battle one another for control of England. Sequel to 
      Oberon's Legacy. Eva Corets
bk106 Lover Birds When Carla meets Paul at an ornithology 
      conference, tempers fly as feathers get ruffled. Cynthia Randall
bk107 Splish Splash A deep sea diver finds true love twenty 
      thousand leagues beneath the sea. Paula Thurman
bk108 Creepy Crawlies An anthology of horror stories about roaches,
      centipedes, scorpions  and other insects. Stefan Knorr
bk109 Paradox Lost After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems 
      of being quantum. Peter Kress
bk110 Microsoft .NET: The Programming Bible Microsoft's .NET initiative is explored in 
      detail in this deep programmer's reference. Tim O'Brien
bk111 MSXML3: A Comprehensive Guide The Microsoft MSXML3 parser is covered in 
      detail, with attention to XML DOM interfaces, XSLT processing, 
      SAX and more. Tim O'Brien
bk112 Visual Studio 7: A Comprehensive Guide Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are 
      integrated into a comprehensive development 
      environment. Mike Galos

As you can see, it displays all information we wanted, but the output is quite unreadable. For example, it is not clear which part of the content belongs to the title, and which to the description.

To make the output more readable, we can put text before our output variables. In Python, this can be done like this:

print(f"This is the string we type and {this_is_the_variable}")
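As a small runnable illustration of this syntax (an f-string), the sketch below uses two hard-coded, hypothetical values in place of the variables that the loop extracts.

```python
# Hypothetical values standing in for what the loop extracts.
identifier = "bk101"
title = "XML Developer's Guide"

# Everything inside {} is replaced by the value of that variable.
line = f"Identifier= {identifier} title= {title}"
print(line)
```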

So in our example, we could add the following:

for book in root.find_all('book'):
    identifier = book.get('id')
    title = book.find('title').text
    description = book.find('description').text
    for author in book.find_all('author'):
        name = author.find('name').text
        surname = author.find('surname').text
    ## add text to identify the extracted parts
    print(f"Identifier= {identifier} title= {title} description= {description} name= {name} {surname}")

Identifier= bk101 title= XML Developer's Guide description= An in-depth look at creating applications 
      with XML. name= Matthew Gambardella
Identifier= bk102 title= Midnight Rain description= A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world. name= Kim Ralls
Identifier= bk103 title= Maeve Ascendant description= After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society. name= Eva Corets
Identifier= bk104 title= Oberon's Legacy description= In post-apocalypse England, the mysterious 
      agent known only as Oberon helps to create a new life 
      for the inhabitants of London. Sequel to Maeve 
      Ascendant. name= Eva Corets
Identifier= bk105 title= The Sundered Grail description= The two daughters of Maeve, half-sisters, 
      battle one another for control of England. Sequel to 
      Oberon's Legacy. name= Eva Corets
Identifier= bk106 title= Lover Birds description= When Carla meets Paul at an ornithology 
      conference, tempers fly as feathers get ruffled. name= Cynthia Randall
Identifier= bk107 title= Splish Splash description= A deep sea diver finds true love twenty 
      thousand leagues beneath the sea. name= Paula Thurman
Identifier= bk108 title= Creepy Crawlies description= An anthology of horror stories about roaches,
      centipedes, scorpions  and other insects. name= Stefan Knorr
Identifier= bk109 title= Paradox Lost description= After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems 
      of being quantum. name= Peter Kress
Identifier= bk110 title= Microsoft .NET: The Programming Bible description= Microsoft's .NET initiative is explored in 
      detail in this deep programmer's reference. name= Tim O'Brien
Identifier= bk111 title= MSXML3: A Comprehensive Guide description= The Microsoft MSXML3 parser is covered in 
      detail, with attention to XML DOM interfaces, XSLT processing, 
      SAX and more. name= Tim O'Brien
Identifier= bk112 title= Visual Studio 7: A Comprehensive Guide description= Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are 
      integrated into a comprehensive development 
      environment. name= Mike Galos

As you can see, we can now detect the various parts that we extracted. However, it is still not easy to read. To resolve this, we can add line breaks between the variables and between the different books. We add a line break by adding ‘\n’ after each variable, leading to the following code:

for book in root.find_all('book'):
    identifier = book.get('id')
    title = book.find('title').text
    description = book.find('description').text
    for author in book.find_all('author'):
        name = author.find('name').text
        surname = author.find('surname').text
    ## add line breaks
    print(f"Identifier= {identifier}\n title= {title}\n description= {description} \n name= {name} {surname}\n")

Identifier= bk101
 title= XML Developer's Guide
 description= An in-depth look at creating applications 
      with XML. 
 name= Matthew Gambardella

Identifier= bk102
 title= Midnight Rain
 description= A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world. 
 name= Kim Ralls

Identifier= bk103
 title= Maeve Ascendant
 description= After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society. 
 name= Eva Corets

Identifier= bk104
 title= Oberon's Legacy
 description= In post-apocalypse England, the mysterious 
      agent known only as Oberon helps to create a new life 
      for the inhabitants of London. Sequel to Maeve 
      Ascendant. 
 name= Eva Corets

Identifier= bk105
 title= The Sundered Grail
 description= The two daughters of Maeve, half-sisters, 
      battle one another for control of England. Sequel to 
      Oberon's Legacy. 
 name= Eva Corets

Identifier= bk106
 title= Lover Birds
 description= When Carla meets Paul at an ornithology 
      conference, tempers fly as feathers get ruffled. 
 name= Cynthia Randall

Identifier= bk107
 title= Splish Splash
 description= A deep sea diver finds true love twenty 
      thousand leagues beneath the sea. 
 name= Paula Thurman

Identifier= bk108
 title= Creepy Crawlies
 description= An anthology of horror stories about roaches,
      centipedes, scorpions  and other insects. 
 name= Stefan Knorr

Identifier= bk109
 title= Paradox Lost
 description= After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems 
      of being quantum. 
 name= Peter Kress

Identifier= bk110
 title= Microsoft .NET: The Programming Bible
 description= Microsoft's .NET initiative is explored in 
      detail in this deep programmer's reference. 
 name= Tim O'Brien

Identifier= bk111
 title= MSXML3: A Comprehensive Guide
 description= The Microsoft MSXML3 parser is covered in 
      detail, with attention to XML DOM interfaces, XSLT processing, 
      SAX and more. 
 name= Tim O'Brien

Identifier= bk112
 title= Visual Studio 7: A Comprehensive Guide
 description= Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are 
      integrated into a comprehensive development 
      environment. 
 name= Mike Galos

Well, that output looks way better, does it not?

Store the information in a .csv or .txt file

In a lot of cases, you not only want the extracted content in your Jupyter Notebook, but you also want to store it for future use. We will show you how to store the output in two different ways:

as one file with the information of all books in .csv format (which, for example, can be opened in Excel);
as one text file per book.

Store in one file

The easiest way to store and save Python output in one file is to put it in a DataFrame from the Python package ‘pandas’ and then save this DataFrame. You can add data to a pandas DataFrame directly from the for loops we created above, but we prefer to first collect the data in a list and then turn this list into a DataFrame, as adding rows to a DataFrame one at a time can become fairly slow with large amounts of data.

To create a list, we first have to declare an empty list. This is done with the following syntax:

booklist = []

Now, we alter our for loop a bit. Instead of printing the output to the screen, as we did above, we store our output in a list. We can use the following code:

booklist = []

for book in root.find_all('book'):
    identifier = book.get('id')
    title = book.find('title').text
    description = book.find('description').text
    for author in book.find_all('author'):
        name = author.find('name').text
        surname = author.find('surname').text
    booklist.append([identifier, title, description, name+" "+surname])

This leads to a list, called ‘booklist’, in which all extracted information is stored for every book. We can then easily transform this list into a pandas DataFrame. To do so, we need to import pandas first with the code:

import pandas as pd

Then we type:

books = pd.DataFrame(booklist, columns=["identifier", "title", "description", "name"])

This code works as follows. You declare the variable ‘books’, which will be used to store all the information. Then you let Python know that you want to create a Dataframe. The content of this Dataframe is the list ‘booklist’, which we just created. We then tell Python how we want to name the columns (this should be in the same order as the order of the variables in the list).

You can show the dataframe you just created by typing:

books

This results in the following output:

	identifier	title	description	name
0	bk101	XML Developer's Guide	An in-depth look at creating applications \n ...	Matthew Gambardella
1	bk102	Midnight Rain	A former architect battles corporate zombies, ...	Kim Ralls
2	bk103	Maeve Ascendant	After the collapse of a nanotechnology \n ...	Eva Corets
3	bk104	Oberon's Legacy	In post-apocalypse England, the mysterious \n ...	Eva Corets
4	bk105	The Sundered Grail	The two daughters of Maeve, half-sisters, \n ...	Eva Corets
5	bk106	Lover Birds	When Carla meets Paul at an ornithology \n ...	Cynthia Randall
6	bk107	Splish Splash	A deep sea diver finds true love twenty \n ...	Paula Thurman
7	bk108	Creepy Crawlies	An anthology of horror stories about roaches,\...	Stefan Knorr
8	bk109	Paradox Lost	After an inadvertant trip through a Heisenberg...	Peter Kress
9	bk110	Microsoft .NET: The Programming Bible	Microsoft's .NET initiative is explored in \n ...	Tim O'Brien
10	bk111	MSXML3: A Comprehensive Guide	The Microsoft MSXML3 parser is covered in \n ...	Tim O'Brien
11	bk112	Visual Studio 7: A Comprehensive Guide	Microsoft Visual Studio 7 is explored in depth...	Mike Galos

Now we can save this dataframe into a csv file by typing:

books.to_csv('book.csv')

Note

This saves the csv file in the current working directory, which is usually the folder that contains your Notebook. If you want it saved in a specific location, put the path before the filename, for example books.to_csv('C:/Users/Documents/book.csv'). Please remember to use a forward slash (‘/’) between the folders.
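By default, ‘to_csv’ also writes the row numbers of the DataFrame (0, 1, 2, ...) as an unnamed first column. If that column is not wanted, it can be left out with `index=False`. The sketch below assumes a hypothetical two-row list instead of the full ‘booklist’ and returns the csv text instead of writing a file, so the difference is visible directly.

```python
import pandas as pd

# Hypothetical two-row stand-in for 'booklist'.
sample_list = [
    ["bk101", "XML Developer's Guide"],
    ["bk102", "Midnight Rain"],
]
sample_books = pd.DataFrame(sample_list, columns=["identifier", "title"])

# Without a filename, to_csv returns the csv text instead of writing a file.
with_index = sample_books.to_csv()
without_index = sample_books.to_csv(index=False)

# Compare the header lines of both versions.
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

With a filename, the same argument applies: `books.to_csv('book.csv', index=False)`.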

Create a textfile per book

If you want to create a textfile for every book, you can add the code directly in your for loop.

First, you open a text file in Python and give it a name. Then, you write content to it. After this, the file has to be closed, so that all content is actually written to disk before you open the next file. The ‘with’ statement used below closes the file automatically at the end of the indented block. You can try this with the following code:

with open("test.txt", "w") as f:
    f.write("This is just a test file")

Note

By default, Python stores the text file in the same folder as where you run your Jupyter Notebook. You can alter this by adding a path to your textfile, for example:

with open('C:/Users/Documents/test.txt', 'w') as f:
    f.write("This is just a test file")

Please remember to use a forward slash (/) between the folders.

With a few alterations, we can use this code to save our book information to a separate file per book. First, we give the text file the name of the book identifier. We can do that by adding the variable into the name of the file like this:

with open(f"{identifier}.txt", "w") as f:

Then, we create the content of the file based on the content we extracted from the book.

f.write(name + " " + surname + "\n" + title + "\n" + description)

If we put these lines into our for loop, Python will save every book with its own name and information. The code looks like this:

for book in root.find_all('book'):
    identifier = book.get('id')
    title = book.find('title').text
    description = book.find('description').text
    for author in book.find_all('author'):
        name = author.find('name').text
        surname = author.find('surname').text
    with open(f"{identifier}.txt", "w") as f:
        f.write(name + " " + surname + "\n" + title + "\n" + description)
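The same pattern can be tried without the XML file. The sketch below is an illustration under assumptions: the list `records` is a hypothetical stand-in for the values extracted in the loop, and the files are written to a temporary folder (`output_dir`) so that your working directory stays clean. It also passes `encoding="utf-8"`, so the result does not depend on the default encoding of your system.

```python
import tempfile
from pathlib import Path

# Hypothetical records standing in for the values extracted in the loop.
records = [
    ("bk101", "Matthew Gambardella", "XML Developer's Guide"),
    ("bk102", "Kim Ralls", "Midnight Rain"),
]

# A temporary folder for this sketch; replace it with a folder of your own.
output_dir = Path(tempfile.mkdtemp())

for identifier, author, title in records:
    target = output_dir / f"{identifier}.txt"
    # The file is closed automatically when the 'with' block ends.
    with open(target, "w", encoding="utf-8") as f:
        f.write(author + "\n" + title)

# List the names of the files that were created.
written = sorted(p.name for p in output_dir.iterdir())
print(written)
```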

Filter information

You can also search for specific elements in your XML. For example, just the title information from the book ‘bk109’. To do so, you can start with the same for loop as we created in this lesson. However, before you print the output, you first check if you have the element you want (in this case: book 109). This can be done with an ‘if’ statement and it looks like this:

for book in root.find_all('book'):
    identifier = book.get('id')
    if identifier == "bk109":
        title = book.find('title').text
        print(title)

Paradox Lost
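When only one specific element is needed, the attribute value can also be passed to ‘find’ itself, which saves the loop and the ‘if’ statement. This is a sketch of that alternative on a hypothetical two-book string, not a replacement for the approach above.

```python
from bs4 import BeautifulSoup

# Hypothetical two-book stand-in for the example file.
sample_xml = (
    "<catalog>"
    "<book id='bk108'><title>Creepy Crawlies</title></book>"
    "<book id='bk109'><title>Paradox Lost</title></book>"
    "</catalog>"
)
sample_root = BeautifulSoup(sample_xml, "xml")

# The second argument is a dictionary of attribute values to match.
book = sample_root.find("book", {"id": "bk109"})

# find returns None when nothing matches, so check before using the result.
found_title = book.find("title").text if book is not None else None
print(found_title)
```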

You can also filter on the content of XML elements by checking whether it contains a certain text. For example, if we want to print all titles that contain the word ‘XML’, we can use the following code:

for book in root.find_all('book'):
    title = book.find('title').text
    if "XML" in title:
        print(title)

XML Developer's Guide
MSXML3: A Comprehensive Guide

Note

Strings in Python are case-sensitive! This means that ‘XML’ is not equal to ‘xml’ or ‘Xml’ in Python.
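If the capitalisation should not matter, both the search term and the text can be converted to lower case before comparing. The sketch below uses a hard-coded, hypothetical list of titles in place of the titles extracted from the XML.

```python
# Hypothetical list standing in for the extracted titles.
titles = ["XML Developer's Guide", "Midnight Rain", "MSXML3: A Comprehensive Guide"]

search_term = "xml"

# Lower-casing both sides makes the comparison ignore capitalisation.
matches = [t for t in titles if search_term.lower() in t.lower()]
print(matches)
```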

Exercise

Print out the description of all books that have England in their description.

Solution

for book in root.find_all('book'):
    description = book.find('description').text
    if "England" in description:
        print(description)

After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.
In post-apocalypse England, the mysterious 
      agent known only as Oberon helps to create a new life 
      for the inhabitants of London. Sequel to Maeve 
      Ascendant.
The two daughters of Maeve, half-sisters, 
      battle one another for control of England. Sequel to 
      Oberon's Legacy.

Namespaces

As we mentioned in lesson 2 during our introduction to XML, some XML files contain namespaces. In lesson 3, we mentioned that Beautiful Soup omits these namespaces in elements, so you don’t have to declare them.

Let’s look at the example with namespaces from lesson 2:

<p:student xmlns:p="http//www.imaginarypythoncourses.com/student">
  <p:id>3235329</p:id>
  <p:name>Jeff Smith</p:name>
  <p:language>Python</p:language>
  <p:rating>9.5</p:rating>
</p:student>

Imagine, we want to extract the name of the student from this XML file.

First, we load the file into our Notebook (the file is called ‘namespaces.xml’ and can be downloaded here):

with open("data/namespaces.xml") as f:
    root_ns = BeautifulSoup(f, 'xml')

Then, we create a for loop that iterates through the file and returns the values of all ‘name’ elements.

for student in root_ns.find_all('name'):
    print(student.text)

Jeff Smith

As you can see, Beautiful Soup has no problem with printing the name of the student.

However, in some XML documents, attributes can have a namespace. In such cases, you have to put the namespace identifier in your code.

Let’s imagine the XML looks as follows:

<p:student xmlns:p="http//www.imaginarypythoncourses.com/student">
  <p:name p:id='3235329'>Jeff Smith</p:name>
  <p:language>Python</p:language>
  <p:rating>9.5</p:rating>
</p:student>

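Imagine we want to extract the attribute ‘id’. We see that this attribute has a namespace prefix (‘p’). With the ‘xml’ parser, Beautiful Soup keeps that prefix as part of the attribute name, so we ask for ‘p:id’ instead of ‘id’. The curly-bracket notation that ElementTree uses (the namespace identifier in curly brackets before the attribute name) is not how Beautiful Soup names the attribute.

Assuming the XML above has been loaded into ‘root_ns’, the code looks as follows:

for student in root_ns.find_all('name'):
    identifier = student.get('p:id')
    print(identifier)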
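When in doubt about the name under which an attribute is stored, the attributes of an element can be inspected through ‘.attrs’. The sketch below parses the imagined XML from a string (the names `sample_xml` and `sample_ns` are illustrative) and lists the attribute keys of the ‘name’ element; it assumes the ‘xml’ parser with `lxml` installed.

```python
from bs4 import BeautifulSoup

# The imagined XML from above, shortened and written as a string.
sample_xml = (
    '<p:student xmlns:p="http//www.imaginarypythoncourses.com/student">'
    "<p:name p:id='3235329'>Jeff Smith</p:name>"
    "</p:student>"
)
sample_ns = BeautifulSoup(sample_xml, "xml")

name_tag = sample_ns.find("name")

# attrs shows the keys under which the attributes are stored.
attribute_keys = list(name_tag.attrs)
# Look the attribute up by its prefixed name.
student_id = name_tag.get("p:id")

print(attribute_keys, student_id)
```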

We now have a good basis to start exploring some real-life examples of XML files used in Digital Humanities research. We will introduce some of these formats in the following section.
