12. Practical lesson: Automatically extract information from a batch of files
In the previous lessons, we have seen how to extract information from Alto, Didl, Tei, and Page XML files. In this lesson, we will show how you can take the code from those lessons and, with a few small alterations, use it to automatically extract content from batches of XML files and save the results as either text files or csv files.
We will provide the following examples:
Extract complete page content with newspaper metadata from various Alto and corresponding Didl files (basic).
Extract the poems from various Tei files and store them in separate csv files per book (moderate).
Extract the content, including reading order, from various Page files and store the content in csv files (advanced).
Extract complete page content with newspaper metadata from various Alto and corresponding Didl files
In lesson 7, we used the following code to extract the content and page number from a newspaper Alto XML and the title and publication year from the corresponding Didl file.
import xml.etree.ElementTree as ET

tree_alto = ET.parse('data/alto_id1.xml')
root_alto = tree_alto.getroot()
tree_didl = ET.parse('data/didl_id1.xml')
root_didl = tree_didl.getroot()

ns_alto = {'ns0': 'http://schema.ccs-gmbh.com/ALTO'}
ns_didl = {'dc': 'http://purl.org/dc/elements/1.1/',
           'ns2': 'urn:mpeg:mpeg21:2002:02-DIDL-NS',
           'ns4': 'info:srw/schema/1/dc-v1.1'}

# Collect the text content of every TextBlock in the Alto file
article_content = ""
for book in root_alto.findall('.//ns0:TextBlock', ns_alto):
    for article in book.findall('.//ns0:String', ns_alto):
        content = article.get('CONTENT')
        article_content = article_content + content
    article_content = article_content + "\n"

# Get the page number from the Alto file
for book in root_alto.findall('.//ns0:Page', ns_alto):
    pagenr = book.get('ID')

# Get the title and publication date from the corresponding Didl file
item = root_didl.find('.//ns2:Resource', ns_didl)
for article in item.findall('.//ns4:dcx', ns_didl):
    title = article.find('.//dc:title', ns_didl).text
    date = article.find('.//dc:date', ns_didl).text

# Save the content to a text file named after the metadata
filename = f'{title}_{date}_{pagenr}.txt'
with open(filename, "w", encoding="utf-8") as f:
    f.write(article_content)
With a few small alterations, we can use this code to automatically work with a batch of files. Two things are needed:
We need code that automatically searches through a folder on your computer;
We need to add a piece of code that finds the corresponding Didl file for every Alto file.
For the following code, we assume you have a folder called ‘alto’, which contains the Alto XML files, and a folder ‘didl’ that contains the Didl files (downloaded here). Both the Alto and Didl files have filenames that start with an identifier, followed by either _alto or _didl. Make sure that there are no other files in the folders.
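If you cannot guarantee that the folders are clean, you can filter the directory listing on the expected naming pattern instead. A minimal sketch, assuming every Alto file ends in _alto.xml:

import os

directory = 'data/alto/'
# Keep only the files that follow the expected naming pattern
alto_files = [f for f in os.listdir(directory) if f.endswith('_alto.xml')]
print(alto_files)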
We start with a little loop that runs through your alto folder and returns all the file names.
import os

# assign directory
directory = 'data/alto/'

for filename in os.listdir(directory):
    print(filename)
ddd_010097934_alto.xml
ddd_010097935_alto.xml
ddd_010097936_alto.xml
ddd_010097937_alto.xml
ddd_010097938_alto.xml
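Note that os.listdir returns the file names in an arbitrary order. If you want the files to be processed in a fixed, predictable order, you can wrap the call in sorted():

for filename in sorted(os.listdir(directory)):
    print(filename)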
Now we need code that strips the identifier from the Alto filename and builds the filename of the matching Didl file. This can be done with string manipulation in Python, as we also did in lessons 7 and 8.
filename = 'ddd_010097934_alto.xml'
filename_didl = filename.split('_alto')[0]  # Split the string at '_alto' and keep only the identifier
filename_didl = filename_didl + "_didl.xml"
print(filename_didl)
ddd_010097934_didl.xml
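As an aside, the same filename can also be built in a single step with str.replace, which swaps the _alto suffix for _didl directly:

filename = 'ddd_010097934_alto.xml'
filename_didl = filename.replace('_alto.xml', '_didl.xml')
print(filename_didl)  # ddd_010097934_didl.xml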
Now we have a way to retrieve all Alto files and their corresponding Didl files, so the only thing left is to put everything together in one big loop.
import xml.etree.ElementTree as ET
import os

directory_alto = 'data/alto/'
directory_didl = 'data/didl/'

for filename in os.listdir(directory_alto):
    tree_alto = ET.parse(directory_alto + filename)
    root_alto = tree_alto.getroot()

    # Derive the Didl filename from the Alto filename
    filename_didl = filename.split('_alto')[0]  # Keep only the identifier
    filename_didl = filename_didl + "_didl.xml"
    tree_didl = ET.parse(directory_didl + filename_didl)
    root_didl = tree_didl.getroot()

    ns_alto = {'ns0': 'http://schema.ccs-gmbh.com/ALTO'}
    ns_didl = {'dc': 'http://purl.org/dc/elements/1.1/',
               'ns2': 'urn:mpeg:mpeg21:2002:02-DIDL-NS',
               'ns4': 'info:srw/schema/1/dc-v1.1'}

    # Collect the text content of every TextBlock in the Alto file
    article_content = ""
    for book in root_alto.findall('.//ns0:TextBlock', ns_alto):
        for article in book.findall('.//ns0:String', ns_alto):
            content = article.get('CONTENT')
            article_content = article_content + content
        article_content = article_content + "\n"

    # Get the page number from the Alto file
    for book in root_alto.findall('.//ns0:Page', ns_alto):
        pagenr = book.get('ID')

    # Get the title and publication date from the Didl file
    item = root_didl.find('.//ns2:Resource', ns_didl)
    for article in item.findall('.//ns4:dcx', ns_didl):
        title = article.find('.//dc:title', ns_didl).text
        date = article.find('.//dc:date', ns_didl).text

    # Save the content of this page to its own text file
    filename = f'{title}_{date}_{pagenr}.txt'
    with open(filename, "w", encoding="utf-8") as f:
        f.write(article_content)
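One thing this loop assumes is that every Alto file has a matching Didl file; if one is missing, ET.parse raises a FileNotFoundError and the whole batch stops. A minimal sketch of how you could skip such files instead (an addition of ours, not part of the original lesson):

import os
import xml.etree.ElementTree as ET

directory_alto = 'data/alto/'
directory_didl = 'data/didl/'

for filename in os.listdir(directory_alto):
    filename_didl = filename.split('_alto')[0] + '_didl.xml'
    didl_path = directory_didl + filename_didl
    if not os.path.isfile(didl_path):
        # No matching Didl file: report it and move on to the next Alto file
        print(f'No Didl file found for {filename}, skipping')
        continue
    tree_didl = ET.parse(didl_path)
    # ... the rest of the extraction stays the same ...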
Extract the poems from various Tei files and store them in separate csv files per book
In lesson 10, we extracted poems from a Tei file and stored them in a csv file. The code for this looked as follows:
from bs4 import BeautifulSoup
import pandas as pd

with open("data/tei.xml", encoding='utf8') as f:
    root = BeautifulSoup(f, 'xml')

poem_list = []
counter = 1
for div in root.find_all('lg'):
    if div.get('type') == 'poem':
        poem = "poem_" + str(counter)
        content = div.text
        poem_list.append([poem, content])
        counter += 1

poems = pd.DataFrame(poem_list, columns=['poem', 'content'])
poems.to_csv('poems.csv')
Just as we did with the Alto files in the previous section, we can do the same for the Tei files.
For the following exercises, we assume you have a folder on your computer with the name ‘tei’, in which you stored the various Tei files (downloaded here).
Exercise
What steps do we need to take to create a batch output?
Solution
Create a loop that runs through the tei folder;
Create a file name for each file, based on its identifier.
Exercise
Create a loop that runs through the files in your tei folder and print their names.
Solution
import os

# assign directory
directory = 'data/tei/'

for filename in os.listdir(directory):
    print(filename)
Exercise
Create the variable ‘filename’ with the value ‘bild001dich01_01.xml’. Strip the suffix .xml from the filename and print the result.
Solution
filename = 'bild001dich01_01.xml'
filename = filename.split('.')[0]  # Keep everything before the dot
print(filename)
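The split('.') approach keeps everything before the first dot, which would go wrong for a filename that happens to contain more than one dot. The standard library function os.path.splitext, which splits off only the extension, is a safer alternative:

import os

filename = 'bild001dich01_01.xml'
identifier, extension = os.path.splitext(filename)
print(identifier)  # bild001dich01_01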
Now we have all the ingredients to automatically extract the poems from the batch of Tei files and save them as csv files with their identifier as name.
Exercise
Write code that loops through the Tei files, extracts the poems, and stores them as csv files.
Solution
from bs4 import BeautifulSoup
import pandas as pd
import os

# assign directory
directory = 'data/tei/'

for filename in os.listdir(directory):
    with open(directory + filename, encoding='utf8') as f:
        root = BeautifulSoup(f, 'xml')
    identifier = filename.split('.')[0]

    # Extract every poem from this file
    poem_list = []
    counter = 1
    for div in root.find_all('lg'):
        if div.get('type') == 'poem':
            poem = "poem_" + str(counter)
            content = div.text
            poem_list.append([poem, content])
            counter += 1

    # Save the poems of this book to its own csv file
    poems = pd.DataFrame(poem_list, columns=['poem', 'content'])
    poems.to_csv(identifier + '.csv')
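Note that to_csv also writes the DataFrame index as an extra, unnamed first column. If you do not want that column in your output, you can pass index=False:

poems.to_csv(identifier + '.csv', index=False)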
Extract the content, including reading order, from various Page files and store the content in csv files
And of course, we can do the same for the Page XML files. Let's start by repeating the code we wrote to extract the content, including the region information, and save it to a csv file.
import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse('data/page.xml')
root = tree.getroot()
ns = {'ns0': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19'}

# Build a lookup table: region id -> [group id, reading-order index]
dict_order = {}
for order in root.findall('.//ns0:ReadingOrder', ns):
    for group in root.findall('.//ns0:OrderedGroup', ns):
        groupnr = group.get('id')
        for suborder in group.findall('.//ns0:RegionRefIndexed', ns):
            region = suborder.get('regionRef')
            index = suborder.get('index')
            dict_order.setdefault(region, []).append([groupnr, index])

# Collect the text of every region, together with its reading-order information
content_list = []
for newspaper in root.findall('.//ns0:TextRegion', ns):
    region = newspaper.get('id')
    if region in dict_order:
        groupvalues = dict_order[region]
        group = groupvalues[0][0]
        index = groupvalues[0][1]
    else:
        group = 0
        index = 0
    for content in newspaper.findall('.//ns0:Unicode', ns):
        content = content.text
        content_list.append([group, index, region, content])

newspaper_with_order = pd.DataFrame(content_list, columns=["Group", "Index", "Region", "Content"])
newspaper_with_order = newspaper_with_order.sort_values(['Group', 'Index'], ascending=[True, True])
newspaper_with_order.to_csv('newspaper_with_order.csv')
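One caveat: the index attribute is read from the XML as a string, so the sort at the end is alphabetical, which puts '10' before '2'. The fallback value 0 is also an integer, and pandas cannot compare integers with strings when sorting. A minimal sketch of how you could make the columns uniformly sortable before calling sort_values:

# Make the columns sortable: group ids as strings, reading-order indexes as numbers
newspaper_with_order['Group'] = newspaper_with_order['Group'].astype(str)
newspaper_with_order['Index'] = newspaper_with_order['Index'].astype(int)
newspaper_with_order = newspaper_with_order.sort_values(['Group', 'Index'], ascending=[True, True])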
For the following exercises, we assume you have a folder on your computer with the name ‘page’, in which you stored the various Page files (downloaded here).
Exercise
Write code that loops through the Page files in your ‘page’ folder, extracts the content with region information, and stores it in a csv file named after the identifier of the Page file.
Solution
import xml.etree.ElementTree as ET
import pandas as pd
import os

directory = 'data/page/'

for filename in os.listdir(directory):
    tree = ET.parse(directory + filename)
    root = tree.getroot()
    identifier = filename.split('.')[0]
    print(filename)  # Show progress

    ns = {'ns0': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19'}

    # Build a lookup table: region id -> [group id, reading-order index]
    dict_order = {}
    for order in root.findall('.//ns0:ReadingOrder', ns):
        for group in root.findall('.//ns0:OrderedGroup', ns):
            groupnr = group.get('id')
            for suborder in group.findall('.//ns0:RegionRefIndexed', ns):
                region = suborder.get('regionRef')
                index = suborder.get('index')
                dict_order.setdefault(region, []).append([groupnr, index])

    # Collect the text of every region, together with its reading-order information
    content_list = []
    for newspaper in root.findall('.//ns0:TextRegion', ns):
        region = newspaper.get('id')
        if region in dict_order:
            groupvalues = dict_order[region]
            group = groupvalues[0][0]
            index = groupvalues[0][1]
        else:
            group = 0
            index = 0
        for content in newspaper.findall('.//ns0:Unicode', ns):
            content = content.text
            content_list.append([group, index, region, content])

    newspaper_with_order = pd.DataFrame(content_list, columns=["Group", "Index", "Region", "Content"])
    newspaper_with_order = newspaper_with_order.sort_values(['Group', 'Index'], ascending=[True, True])
    newspaper_with_order.to_csv(identifier + '.csv')
And that’s all! You have now seen multiple ways of automatically extracting content from batches of files, which can save a lot of time and prevent errors.