# 8. Practical lesson: Alto/Didl and Beautiful Soup

In this lesson we are going to work with the Alto and Didl formats. As shown in lesson ***6***, Alto and Didl files are connected to each other.
The Alto file stores the plain text and the Didl file stores the metadata of the newspaper. For this lesson, we assume that you have followed practical lesson 5.

This lesson contains the following content:
* Load the Alto file and examine the structure <span style="color:#ef6079">(*basic*)</span>;
* Extract the complete content of a newspaper page from the Alto file <span style="color:#ef6079">(*basic*)</span>;
* Load the Didl file and examine the structure <span style="color:#ef6079">(*basic*)</span>;
* Extract newspaper metadata from the Didl file <span style="color:#ef6079">(*basic*)</span>;
* Extract all separate articles from the total newspaper from the Didl file <span style="color:#ef6079">(*moderate*)</span>;
* Extract all separate articles from a specific page of the newspaper from the Didl file <span style="color:#ef6079">(*advanced*)</span>.

Open a new Jupyter Notebook and type all code examples and code exercises in your Notebook.

## Load the Alto file and examine the structure

We first need to prepare the Notebook by importing the package we need and loading the XML file into the environment.

```{admonition} Exercise
:class: attention
Import the Beautiful Soup package and load the XML file into your Notebook.
You can look back to lesson 4 if you need a reminder on how to do this. 
The XML file is named 'alto_id1.xml' and can be [downloaded here](https://github.com/KBNLresearch/xml-workshop/tree/main/data).
```

````{admonition} Solution
:class: tip, dropdown
```Python
from bs4 import BeautifulSoup    

with open("data/alto_id1.xml", encoding='utf8') as f:
    root_alto = BeautifulSoup(f, 'xml')
```
````

In order to extract the required information from the file, we have to examine the structure.

```{admonition} Exercise
:class: attention
Print the file in your Notebook or look at the file in your browser, either way you prefer.
```
````{admonition} Solution
:class: tip, dropdown
```Python
print(root_alto)
```
````

In [1]:
from bs4 import BeautifulSoup    

with open("data/alto_id1.xml", encoding='utf8') as f:
    root_alto = BeautifulSoup(f, 'xml')
print(root_alto)

<?xml version="1.0" encoding="utf-8"?>
<alto xmlns="http://schema.ccs-gmbh.com/ALTO" xmlns:xlink="http://www.w3.org/1999/xlink">
<Description>
<MeasurementUnit>pixel</MeasurementUnit>
<sourceImageInformation>
<fileName>DDD_010097934_001.jp2</fileName>
</sourceImageInformation>
<OCRProcessing ID="OCRPROCESSING_1">
<preProcessingStep>
<processingDateTime>2010-01-07</processingDateTime>
<processingAgency>CCS Content Conversion Specialists GmbH, www.content-conversion.com</processingAgency>
<processingStepDescription>align</processingStepDescription>
<processingStepSettings>CCS OCR Processing Filter</processingStepSettings>
<processingSoftware>
<softwareCreator>CCS Content Conversion Specialists GmbH, Germany</softwareCreator>
<softwareName>CCS docWORKS</softwareName>
<softwareVersion>6.3-0.96</softwareVersion>
<applicationDescription/>
</processingSoftware>
</preProcessingStep>
<ocrProcessingStep>
<processingSoftware>
<softwareCreator>ABBYY (BIT Software), Russia</softwareCreator>
<softwa

```{note}
We will work with two XML files in this lesson. Therefore, we will name the root of each XML file according to its type: 'root_alto' for the Alto XML and 'root_didl' for the Didl XML.
```

The Alto XML contains a lot of information that is not part of the textual content of the newspaper.
There is information about the layout (where the content is placed on the page), about word confidence, etc.
It also contains elements in which the plain text is stored.
We start by searching for this element, and then check whether the content is stored as the value of a tag or as the
value of an attribute.

```{admonition} Exercise
:class: attention
Look at the XML structure. In which element is the content stored?
*Hint: one of the news articles mentions the word 'spoorwegmaatschappij'.*
```

```{admonition} Solution
:class: tip, dropdown
The content of the newspaper articles is stored in the element 'String', for example:

	<String ID="P1_ST00323" HPOS="244" VPOS="2387" WIDTH="318" HEIGHT="35" CONTENT="spoorwegmaatschappij" WC="0.99" CC="88668080809486709965"/>

It is stored as an attribute of the element. 
```

```{admonition} Exercise
:class: attention
If we compare the element 'String' to our example XML, we see that there is a difference in how the content is stored. 
What is the difference? 
```

```{admonition} Solution
:class: tip, dropdown
The content of the elements of the example XML was stored as the text value of the elements.
The content of the String element is stored in an attribute called 'CONTENT'. 
```

```{admonition} Exercise
:class: attention
There are a lot of nested elements in this XML file.
Do we have to bother about these parent elements while extracting content from the file?
```

```{admonition} Solution
:class: tip, dropdown
With Beautiful Soup you can call any item directly without worrying about their parents. 
```

Remember namespaces? Before we start to extract the data we are interested in we need to stop for a moment and examine the file to 
see if we need to take namespaces into account.

```{admonition} Exercise
:class: attention
Are there any namespaces in the file that we have to take into account? If there are, how can we declare these?
```

````{admonition} Solution
:class: tip, dropdown
The XML file does contain a namespace. However, since we are working with Beautiful Soup, we don't have to do anything special to deal with it.
````
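
To see this behaviour on a small scale, the sketch below parses a hand-written fragment that mimics the structure of an Alto file. The fragment and its ID values are made up for illustration and are not the real data. It assumes that the `lxml` package is installed, because the `'xml'` parser of Beautiful Soup relies on it.

```python
from bs4 import BeautifulSoup

# A tiny, made-up fragment in the style of an Alto file (not the real data)
sample = """<alto xmlns="http://schema.ccs-gmbh.com/ALTO">
  <Layout>
    <Page ID="P1">
      <TextBlock ID="P1_TB00001">
        <TextLine>
          <String ID="P1_ST00001" CONTENT="ALGEMEEN"/>
          <String ID="P1_ST00002" CONTENT="HANDELSBLAD."/>
        </TextLine>
      </TextBlock>
    </Page>
  </Layout>
</alto>"""

sample_root = BeautifulSoup(sample, 'xml')

# 'String' is found directly, without naming its parents or the namespace
words = [element.get('CONTENT') for element in sample_root.find_all('String')]
print(words)
```

The same call, `find_all('String')`, works on `root_alto` regardless of how deeply the 'String' elements are nested.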

Now we know some important information about this Alto file, so let's see if we can extract the content. 

## Extract the complete content of a newspaper page from the Alto file

We will start by extracting all the text, without worrying about the division between the articles. 

```{admonition} Exercise
:class: attention
As you have seen, the plain text of the newspaper is stored in the 'CONTENT' attribute of the 'String' element.
How can you extract the values from attributes?
```

```{admonition} Solution
:class: tip, dropdown
This can be done with the .get method, for example: book.get('id'). 
```

In lesson 5 we learned that it is possible to access the elements with a for loop, like:
```Python
for book in root.find_all('book'):
```

````{admonition} Exercise
:class: attention
The text content that we wish to extract is stored in the 'CONTENT' attribute of the 'String' element.
Use Python and Beautiful Soup to extract this content.
````

````{admonition} Solution
:class: tip, dropdown
```Python
for page in root_alto.find_all('String'):
    content = page.get('CONTENT')
    print(content)	
```
```` 

This leads to the following output:

In [2]:
from bs4 import BeautifulSoup    

with open("data/alto_id1.xml", encoding='utf8') as f:
    root_alto = BeautifulSoup(f, 'xml')

for page in root_alto.find_all('String'):
    content = page.get('CONTENT')
    print(content)	

p.
u«svd.
r-\j
iiWAv
(-tyWw+Avi
*,*i
:'"«-«f4»eV*^
üi/jyw
H^li
-A.
•
Xö
f
!!•
ALGEMEEN
HANDELSBLAD.
ABONNEMENTSPRIJS
VOOR
3
MAANDEN.
f
n!?,
terdam
ƒ6.-.
Voor
de
overige
plaatsen
des
ri>
..ƒ7.-.
ei'lijke
nummers
.„
0.08.
Bestellingen
en
aanvragen/y-öwA-o
in
te
zenden.
moet.
v<>U!
aan
ijn
uiterlijk
op
den
15
d"
van
ilc
3**
maand,
iad
verschijnt
dagelijks,
behalve
op
enkele
letfltdegfeh.
Zaterdag*
1
Januari.
UITGEVERS-DIREiITETJREN:
GEBROEDERS
DÏÊJÖERÏCHS,
PRIJS
DER
ADVERTENTIES:
Van
een
tot
vijf
gewone
regele
f
1.28.
Elke
regel
meer
f
0.25.
Aanvragen
en
vermelding
van
liefdegiften
worden
geplaatst
per
regel
a
15
cents.
Reklamés
Of
aard
«jvelingen
(beneden
het
Binnenlandsch
Nieuws)
por
regel
aöO
cents.
Groote
letters
worden
berekend
naar
het
aantal
gewone
regels,
dat
zij
beslaan.
Het
jaar
1869.
fcdi.'
hl
,S,;loo
Pen
Jaar
ondel-
Onheilspellende
tecken-ni.
Ta
Tnrir
**
•*""**
Iladdt'u
«e^e».
die
de
spanning
tt
h>
•
e"
UriL''ceilland
hielden
voor
het
eind,
niét
egio
vnn
en
djr^atieken
veldt

in
het'
staat»
kuudige
,
want
—
zegt
bij
—
moge
de
minister
een
man
zijn
van
eene
verdachte
kleur,
aan
het
hof,
"de
bron
van
alle
eer",
daar
wappert
nog
hoog
en
fier
de
vlag
des
behouds.
In
de
beneden
regioenen
kan
bet
insgelijks
veranderen,
en
is
dat
het
geval,
dan
ben
ik
daar,
om
door
middel
van
het
hof
goede
dieusten
aan
de
goede
zaak
te
bewijzen.
leder
herinnert
ï.ich
nog
hoe
in
de
dagen
van
het
konfiikt
tnzschen
de
legeering
en
den
landdag
het
hof
werkzaam
is
geweest
voor
de
ware
beginselen.
Op
dien
terugkeer
Van
den
goeden
tijd
hoopt
de
heer
Uhden.
ïe
Hallo
heeft
gedurende
eene
reeks
van
jaren
eene
inrich
ting
van
onderwijs,
liet
zoogenaamde
paedagogium,
ingrooten
bloei
verkeerd.
Het
was
gesticbt
door
August
Hermanh
Francke,
een
der
hoofden
van
de
piëtistische
partij;
maar
het
moet
wor
den
erkend,
en
bet
wordt
ook
openlijk
orkeiid,
dal
de
heel'
Fl'ancke
eene
uitstekende
methode
bezat,
eene
uitmunten
de
school
in
het
leven
heeft
geroepen
eu
eene
menigte
knappe
mannen
heeft
gevormd

uiteengesjiat
en
in
tal
van
fraktiën
verbrokkeld,
even'
als
een
werk
van
mo
zaïek,
dat
op
den
grond
is
gevallen.
*Tüü2
les
eentres
s'y
trouvent,
excepté
Ie
centre
de
gravité."
/'Daarin
is,
naar
wij
gelooven,
de
eersie
en
groote
moeielijk
lleid
gelegen,
welke
de
toepassing
der
nieuwe
staatkunde
zal
ondervinden,
Er
is
gebi'ek
aan
mannen.
Uit
dien
hoofde
is
het
misschien
te
betreuren,
dat
men
het
engelsche
stelsel
niet
ten
einde
toe
heeft
gevolgd;
dat
men
namelijk
aan
den
eersten
minister
geen
algemeene
funktie'n,
zonder
bijzondere
attributen,
heeft
gegeven.
De
overgang
zou,
indien
men
een
minister
van
staat,
belast
met
de
leiding
en
de
verdedi
ging
van
de
algemeene
staatkunde
der
regeeriug,
behouden
had,
gemakkelijker
zijn
geweest.
Eniilo
Ollivier
kan
er
op
rekenen,
dat
de
last
van
den
parlementairen
strijd
bijna
geheel
op
hem
zal
rusten,
en
do
bijzondere
werkzaamheden,
aan
dit
of
dat
departement
verbonden,
kunnen
alzoo
zijn
vrijheid
van
handelen
slechts
belemmeren,
on
eene
taak,
op
zich

hij
een
groot
voordeel
voor
de
schatkist,
frankrijk
,
dat
zeer
ruim
is
in
het
uitgeven
van
schatkistbiljetten,
is
een
voorbeeld,
dat
de
minister
niet
zou
willen
volgen,
en
toch
betaalt
het
slechts
een
half
pCt.
rente
's
jaara
-,
voorts
geeft
hij
de
verzekering,
dat
de
aanneming
der
wét
oji
de
middelen
niets
praejudiciëert
aangaande
het
aan
hangig
ontwerp,
dat
de
zaak
op
een
anderen
voet
regelt;
ter
wijl
bovendien
de
minister
van
finantiën
geen
gebruik
van
die
eventueele
wet
maken
kan
dan
met
medewerking
van
de
wet
gevende
macht,
omdat
slechts
een
zeer
miniem
bedrag
aan
rente
uitgetrokken
is,
en
du
sbij
elke
uitgifte,
krachtens
de
nieuwe
wet,
de
tusschenkomst
van
de
wetgevende
macht
moet
worden
ingeroepen.
De
heer
Sasse
vak
Ysselt
betuigt
zijne
verwondering
dat,
terwijl
de
minister
van
justitie
een
ontwerp
gereed
maakt
tot
afschaffing
der
tienden,
't
departement
van
finantiën
nieuwe
tienden
creëert
en
vooral
in
t
land
van
Kuyk,
waar
geheven
wordt
van
heidegronden,
die
ontgonnen
worden
d

As you can see, the text is printed as separate words that all appear in one long list.
This is quite unreadable.
We can store the text in a *string* variable in which we concatenate all words.

In [3]:
all_content = ""

for page in root_alto.find_all('String'):
	content = page.get('CONTENT')
	all_content = all_content + " " + content
	
print(all_content)

 p. u«svd. r-\j iiWAv (-tyWw+Avi *,*i :'"«-«f4»eV*^ üi/jyw H^li -A. • Xö f !!• ALGEMEEN HANDELSBLAD. ABONNEMENTSPRIJS VOOR 3 MAANDEN. f n!?, terdam ƒ6.-. Voor de overige plaatsen des ri> ..ƒ7.-. ei'lijke nummers .„ 0.08. Bestellingen en aanvragen/y-öwA-o in te zenden. moet. v<>U! aan ijn uiterlijk op den 15 d" van ilc 3** maand, iad verschijnt dagelijks, behalve op enkele letfltdegfeh. Zaterdag* 1 Januari. UITGEVERS-DIREiITETJREN: GEBROEDERS DÏÊJÖERÏCHS, PRIJS DER ADVERTENTIES: Van een tot vijf gewone regele f 1.28. Elke regel meer f 0.25. Aanvragen en vermelding van liefdegiften worden geplaatst per regel a 15 cents. Reklamés Of aard «jvelingen (beneden het Binnenlandsch Nieuws) por regel aöO cents. Groote letters worden berekend naar het aantal gewone regels, dat zij beslaan. Het jaar 1869. fcdi.' hl ,S,;loo Pen Jaar ondel- Onheilspellende tecken-ni. Ta Tnrir ** •*""** Iladdt'u «e^e». die de spanning tt h> • e" UriL''ceilland hielden voor het eind, niét egio vnn en djr^atieken veld

The content is now more readable, however, it is still one long blob of the complete text of the newspaper.
As you can see in the XML file, the content is divided into sections. 

```{admonition} Exercise
:class: attention
Look at the XML file. There are different elements that divide the text. Which element would likely be used to separate articles from each other?
```

```{admonition} Solution
:class: tip, dropdown
The element 'TextBlock'
```

Now that we know how we can divide the various sections, let's put this into code.
Instead of storing all the output in one variable, we create a variable and store in it the text of one
section. Then we print the variable and empty it, so it can be reused for a new section.

In code, this looks like this:

In [4]:
article_content = ""

for book in root_alto.find_all('TextBlock'):
    for article in book.find_all('String'):
        content = article.get('CONTENT')
        article_content = article_content + " " + content
    print(article_content)
    print("") ## add a linebreak between the separate sections
    article_content = ""

 p. u«svd.

 r-\j iiWAv (-tyWw+Avi *,*i :'"«-«f4»eV*^ üi/jyw H^li

 -A. • Xö f !!•

 ALGEMEEN HANDELSBLAD.

 ABONNEMENTSPRIJS VOOR 3 MAANDEN. f n!?, terdam ƒ6.-. Voor de overige plaatsen des ri> ..ƒ7.-. ei'lijke nummers .„ 0.08. Bestellingen en aanvragen/y-öwA-o in te zenden. moet. v<>U! aan ijn uiterlijk op den 15 d" van ilc 3** maand, iad verschijnt dagelijks, behalve op enkele letfltdegfeh.

 Zaterdag* 1 Januari.

 UITGEVERS-DIREiITETJREN: GEBROEDERS DÏÊJÖERÏCHS,

 PRIJS DER ADVERTENTIES: Van een tot vijf gewone regele f 1.28. Elke regel meer f 0.25. Aanvragen en vermelding van liefdegiften worden geplaatst per regel a 15 cents. Reklamés Of aard «jvelingen (beneden het Binnenlandsch Nieuws) por regel aöO cents. Groote letters worden berekend naar het aantal gewone regels, dat zij beslaan.

 Het jaar 1869.

 fcdi.' hl ,S,;loo Pen Jaar ondel- Onheilspellende tecken-ni. Ta Tnrir ** •*""** Iladdt'u «e^e». die de spanning tt h> • e" UriL''ceilland hielden voor het eind, niét egio vnn e
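
As a variation on the loop above, the sketch below keeps the text of every block in a list instead of printing and emptying a variable, so that the structured text is still available afterwards (for example, to write it to a file). It runs on a made-up fragment with two text blocks; the variable names `blocks` and `page_text` are chosen for this illustration only.

```python
from bs4 import BeautifulSoup

# Made-up fragment with two text blocks (illustration only, not the real file)
sample = """<alto xmlns="http://schema.ccs-gmbh.com/ALTO">
  <TextBlock ID="TB1">
    <String CONTENT="ALGEMEEN"/><String CONTENT="HANDELSBLAD."/>
  </TextBlock>
  <TextBlock ID="TB2">
    <String CONTENT="Het"/><String CONTENT="jaar"/><String CONTENT="1869."/>
  </TextBlock>
</alto>"""
sample_root = BeautifulSoup(sample, 'xml')

blocks = []
for block in sample_root.find_all('TextBlock'):
    words = [element.get('CONTENT') for element in block.find_all('String')]
    # join() puts a space between the words only, so there is no leading space
    blocks.append(" ".join(words))

# Separate the blocks with an empty line
page_text = "\n\n".join(blocks)
print(page_text)
```

Replacing `sample_root` with `root_alto` should give the same sections as the output above, collected in one string.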

Now we have a page of plain text that is better structured. 
The only thing left is to retrieve the page number, and then we'll have all the information to save this data to a text file.

```{admonition} Exercise
:class: attention
Look at the XML file. Where can we find the page number?
```

```{admonition} Solution
:class: tip, dropdown
The page number is stored in the 'Page' element. 
```

```{admonition} Exercise
:class: attention
Write the code to extract the page number from the XML. 
```

````{admonition} Solution
:class: tip, dropdown
```Python
for book in root_alto.find_all('Page'):
    pagenr = book.get('ID')
    print(pagenr)
```
````

The page number is:

In [5]:
for book in root_alto.find_all('Page'):
    pagenr = book.get('ID')
    print(pagenr)

P1


## Load the Didl file and examine the structure 

We now have a more readable page with the corresponding page number. However, if we store this as is, we will have no idea from which newspaper this page was extracted, which limits its reusability.
In lesson 6 we described that we can find metadata corresponding to an Alto file in a Didl file.
The Alto and Didl files have the same identifier, so you can match them.

In our case, they both have the identifier 1. 

```{admonition} Exercise
:class: attention
Load the corresponding Didl file in your notebook. Name the root 'root_didl'. Look at the structure of the file. 
```

````{admonition} Solution
:class: tip, dropdown
```Python
with open("data/didl_id1.xml", encoding='utf8') as f:
    root_didl = BeautifulSoup(f, 'xml')
print(root_didl)
```
````

This leads to the following output:

In [6]:
with open("data/didl_id1.xml", encoding='utf8') as f:
    root_didl = BeautifulSoup(f, 'xml')
print(root_didl)

<?xml version="1.0" encoding="utf-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/          http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2015-04-14T14:15:45.386Z</responseDate>
<request metadataPrefix="didl" verb="GetRecord">http://services.kb.nl/mdo/oai	</request>
<GetRecord>
<record>
<header>
<identifier>DDD:ddd:010097934:mpeg21</identifier>
<datestamp>2012-07-19T01:01:12.372Z</datestamp>
<setSpec>DDD</setSpec>
</header>
<metadata>
<DIDL xmlns="urn:mpeg:mpeg21:2002:02-DIDL-NS" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcmitype="http://purl.org/dc/dcmitype/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dcx="http://krait.kb.nl/coop/tel/handbook/telterms.html" xmlns:ddd="http://www.kb.nl/namespaces/ddd" xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS" xmlns:didmodel="urn:mpeg:mpeg21:2002:02-DIDMODEL-NS" xmlns:srw_dc="info:srw/schema/1/dc-v1.1

```{admonition} Exercise
:class: attention
Look at the Didl file and see if you can find in which element the title of the newspaper is stored. Hint: the title is 'Algemeen Handelsblad'.
What parent of this element contains all information we need to extract the title and the publication date?
```

```{admonition} Solution
:class: tip, dropdown
The title is stored in the element 'title', and the publication date in the element 'date'. 
They can both be found in an element called 'Resource'. 
```

```{admonition} Exercise
:class: attention
Are there any namespaces in the file that we have to take into account? 
```

```{admonition} Solution
:class: tip, dropdown
Yes, there are multiple namespaces in the Didl file, both in element tags and in element attributes.
However, since we work with Beautiful Soup, we don't have to bother about them. 
```

## Extract newspaper metadata from the Didl file

We have seen that the element 'Resource' contains all the information we want. If we look closely at the file,
we see that there are multiple elements with the name 'Resource', but the one we want is the first.
If you want the information from all Resource blocks, you can use the find_all() method as we did before.
However, we now only want information from the first block. In that case, you can simply use find() as follows:

```
item = root_didl.find('Resource')
```
This will return the first element it finds. 
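
The difference between the two methods can be tried out on a small, hand-written fragment. The fragment below is a simplified stand-in for the Didl file (the real file has more levels and more namespaces), and it assumes that `lxml` is installed for the `'xml'` parser.

```python
from bs4 import BeautifulSoup

# Simplified, made-up stand-in for the Didl structure
sample = """<didl:DIDL xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
  <didl:Resource><dc:title>Algemeen Handelsblad</dc:title></didl:Resource>
  <didl:Resource><dc:title>Het jaar 1869.</dc:title></didl:Resource>
</didl:DIDL>"""
sample_root = BeautifulSoup(sample, 'xml')

# find() returns only the first matching element
first = sample_root.find('Resource')
print(first.find('title').text)

# find_all() returns all matching elements
all_resources = sample_root.find_all('Resource')
print(len(all_resources))

# When nothing matches, find() returns None, so check before using .text
missing = sample_root.find('Nonexistent')
print(missing)
```

Because `find()` returns `None` when there is no match, calling `.text` on the result of a failed search raises an error. This is why the later examples check for `None` first.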


```{admonition} Exercise
:class: attention
Write code that gets only the first 'Resource' element, and then create a for loop that loops through the 'dcx' elements inside this element.
Extract the title of the newspaper and the publication date. Store them in two separate variables. 
```

````{admonition} Solution
:class: tip, dropdown
```
item = root_didl.find('Resource')

for article in item.find_all('dcx'):
    title = article.find('dc:title')
    date = article.find('dc:date')
    print(title.text, date.text) 
```
````

This leads to the following output:

In [7]:
item = root_didl.find('Resource')

for article in item.find_all('dcx'):
    title = article.find('dc:title')
    date = article.find('dc:date')
    print(title.text, date.text) 

Algemeen Handelsblad 1870-01-01


Now we can store the content of this newspaper page in a text file whose name is a combination of the title of the newspaper, the publication date, and the page number.
We can create the filename like this:

```
filename = f'{title.text}_{date.text}_{pagenr}.txt'
```
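
Titles can contain spaces or characters that some file systems do not accept. The sketch below shows one possible way to clean the parts before joining them. The function name `make_filename` and the replacement rule are assumptions made for this illustration, and the values are passed in as plain strings (for example the text of the title element, not the element itself).

```python
import re

def make_filename(title, date, pagenr, extension=".txt"):
    """Join the parts with underscores and replace unsafe characters."""
    parts = [title, date, pagenr]
    # Replace every run of characters that are not word characters
    # (letters, digits, underscore), '-' or '.' with a single underscore
    cleaned = [re.sub(r"[^\w.-]+", "_", part.strip()) for part in parts]
    return "_".join(cleaned) + extension

print(make_filename("Algemeen Handelsblad", "1870-01-01", "P1"))
```

For the values found in this lesson, this prints `Algemeen_Handelsblad_1870-01-01_P1.txt`.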

```{admonition} Exercise
:class: attention
Save the content in a file.
```

````{admonition} Solution
:class: tip, dropdown
```
# all_content holds the full page text; article_content was emptied after the last block
with open(filename, "w", encoding="utf-8") as f:
    f.write(all_content)
```
````

## Extract all separate articles from the total newspaper from the Didl file 

As you saw in the above sections, the Alto format has no clear separation between the articles and is therefore especially suitable when you are interested in the complete newspaper page.

However, there are a lot of cases in which you would be interested in the separate articles and metadata about these articles (for example, the type of article).

The collection of the KB makes use of Didl XML files to store additional information. You can use the Didl XML to extract this information and to gather the articles. 

```{admonition} Exercise
:class: attention
Look at the Didl file, do you see information about the articles?
```

````{admonition} Solution
:class: tip, dropdown
Yes, they are stored in the 'Resource' elements.  
```
<didl:Resource mimeType="text/xml">
<srw_dc:dcx>
<dc:subject>artikel</dc:subject>
<dc:title>Het jaar 1869.</dc:title>
<dcterms:accessRights>accessible</dcterms:accessRights>
<dcx:recordIdentifier>ddd:010097934:mpeg21:a0001</dcx:recordIdentifier>
<dc:identifier>http://resolver.kb.nl/resolve?urn=ddd:010097934:mpeg21:a0001</dc:identifier>
<dc:type xsi:type="dcterms:DCMIType">Text</dc:type>
</srw_dc:dcx>
</didl:Resource>
```
````

As you can see, there are blocks with information about the articles. The articles themselves are not present in the Didl, but we can retrieve them through their identifier. To do this we will perform the following two steps:

- Extract article information and identifier from the Didl;
- Download the articles and extract the plain text.

We start by extracting the subject, title and identifier from the Resource element.
However, as we saw before, there is also other information stored in the Resource elements, such as the newspaper title
and publication date.

You can distinguish the articles from the newspaper metadata based on the element 'subject'.
All articles have a subject ('artikel', 'familiebericht' etc.) whilst the other metadata does not.

This distinction can be made with an 'if' statement, in which we check whether an element with the name 'subject' is present in the element block.

We will start with extracting the type of article, title, and identifier from the Didl XML. The identifier will later be used to download the articles.

In [8]:
for item in root_didl.find_all('Resource'):
	for article in item.find_all('dcx'):
		a_type = article.find('subject')
		## The first block will not have a subject as it contains newspaper metadata instead of article metadata.
		## It is skipped with an 'if [subject] is not None' check.
		if a_type is not None:
			title = article.find('title')
			identifier = article.find('identifier')
			print(a_type.text, title.text, identifier.text)

artikel Het jaar 1869. http://resolver.kb.nl/resolve?urn=ddd:010097934:mpeg21:a0001
artikel BUITENLAND. Groot-Brittannie. http://resolver.kb.nl/resolve?urn=ddd:010097934:mpeg21:a0002
artikel Duitschland. http://resolver.kb.nl/resolve?urn=ddd:010097934:mpeg21:a0003
artikel Oostenrijk. http://resolver.kb.nl/resolve?urn=ddd:010097934:mpeg21:a0004
artikel Frankrijk. http://resolver.kb.nl/resolve?urn=ddd:010097934:mpeg21:a0005
artikel Rumenie. http://resolver.kb.nl/resolve?urn=ddd:010097934:mpeg21:a0006
artikel BINNENLAND. Eerste Kamer der Staten-Generaal. http://resolver.kb.nl/resolve?urn=ddd:010097934:mpeg21:a0007
artikel BINNENLAND. Eerste Kamer der Staten-Generaal. http://resolver.kb.nl/resolve?urn=ddd:010097934:mpeg21:a0007
artikel AMSTERDAM, Vrijdag 31 December. http://resolver.kb.nl/resolve?urn=ddd:010097934:mpeg21:a0008
artikel WEERKUNDIGE WAARNEMINGEN. http://resolver.kb.nl/resolve?urn=ddd:010097934:mpeg21:a0009
artikel Gouden en Zilveren Specien. http://resolver.kb.nl/resolve?urn=

```{admonition} Exercise
:class: attention
Adapt the code above to store the variables into a list of articles.
```

````{admonition} Solution
:class: tip, dropdown
Your code should look like the code below:
```
article_list = []

for item in root_didl.find_all('Resource'):
	for article in item.find_all('dcx'):
		a_type = article.find('subject')
		## The first block will not have a subject as it contains newspaper metadata instead of article metadata.
		## It is skipped with an 'if [subject] is not None' check.
		if a_type is not None:
			title = article.find('title')
			identifier = article.find('identifier')
			article_list.append([a_type.text, title.text, identifier.text])
```
````

Now we have the identifier for every article in the dataset. This identifier can be used to download the XML of its article 
and extract the text from it. We will demonstrate this for one article. 

As an example, we will use the identifier 'http://resolver.kb.nl/resolve?urn=ddd:010097934:mpeg21:a0001'. 
If we click on this, we will be led to the image of the newspaper page on the digital heritage website Delpher.nl (property of the KB).
However, if we add ':ocr' to the identifier, we will be led to the XML containing the OCR of that article:
'http://resolver.kb.nl/resolve?urn=ddd:010097934:mpeg21:a0001:ocr'

This OCR can be saved to file, either manually or by using Python.

To save the OCR using Python we will need the *urllib* package.

```{note}
We recommend always saving the identifier in the name of the file. In this case the ***a0001*** indicates the article number, so we will save the whole identifier. Because Windows does not allow ***:*** in filenames, we will change this to an underscore.
Everything up to and including ***urn=*** will be removed from the identifier, as it has no distinguishing features.
We can perform these alterations through string manipulation in Python.
```

```
## import urllib, it is a standard library so does not need to be installed
from urllib.request import urlopen

filename = 'http://resolver.kb.nl/resolve?urn=ddd:010097934:mpeg21:a0001:ocr'
## Remove the first part from the filename, so you keep only ddd:010097934:mpeg21:a0001:ocr'
filename = filename.split('=')[1]
## Replace the : with _
filename = filename.replace(':', '_')

url = 'http://resolver.kb.nl/resolve?urn=ddd:010097934:mpeg21:a0001:ocr'

## write XML to file, downloading happens in this step too.
with open(filename + ".xml", "w", encoding="utf-8") as f:
    f.write(urlopen(url).read().decode('utf-8'))
```
Now we can open this XML file and look at the structure.

In [9]:
with open("ddd_010097934_mpeg21_a0001_ocr.xml", encoding='utf8') as f:
    root_article = BeautifulSoup(f, 'xml')
print(root_article)


```{admonition} Exercise
:class: attention
Extract the title and content from the article, and store these in separate variables.
```

````{admonition} Solution
:class: tip, dropdown
Your code should look like the code below
```Python
title = ""
for titles in root_article.find_all('title'):
    title = titles.text

# Add every paragraph to the content instead of keeping only the last one
content = ""
for contents in root_article.find_all('p'):
    content += contents.text + "\n"
```
````

This can then be saved to a text file:

```Python
with open(filename + ".txt", "w", encoding="utf-8") as f:
    f.write(title + "\n" + content)
```

The above workflow now consists of the following steps:
- Downloading the file;
- Opening the file;
- Extracting the contents;
- Saving the contents to file.

This can also be combined into one piece of code that handles all these steps. An advantage of this method is that 
there is no need to manually save and re-open every separate article file. 

```Python
from urllib.request import urlopen
from bs4 import BeautifulSoup

identifier = 'http://resolver.kb.nl/resolve?urn=ddd:010097934:mpeg21:a0001:ocr'
# Keep only the part after 'urn=' and replace ':' with '_'
filename = identifier.split('=')[1]
filename = filename.replace(':', '_')

# Download the XML and parse it directly, without saving it first
file = urlopen(identifier)
root = BeautifulSoup(file, 'xml')

title = ""
for titles in root.find_all('title'):
    title = titles.text

# Add every paragraph to the content
content = ""
for contents in root.find_all('p'):
    content += contents.text + "\n"

with open(filename + ".txt", "w", encoding="utf-8") as f:
    f.write(title + "\n" + content)
```

Until now we have manually selected a single article from a page and saved this. Of course one article is generally not enough and manually changing the identifier for every file is a lot of work.
Luckily, just as we have used a for loop to iterate through an XML file, we can use a for loop to iterate through a list of identifiers.

The following code does just that. It iterates through **article_list** and grabs the identifier of an article.
Then it adds *:ocr* behind the identifier, downloads the file, and extracts the text. 
Finally, it saves the result as a textfile, with the identifier as filename.

```Python
from urllib.request import urlopen
from bs4 import BeautifulSoup

for article in article_list:
    # We want the third object of the list, but Python counts from 0.
    identifier = article[2] + ":ocr"
    # Prepare the filename
    filename = identifier.split('=')[1]
    filename = filename.replace(':', '_')

    # Download the XML and load it into Python
    file = urlopen(identifier)
    root = BeautifulSoup(file, 'xml')

    # Extract the content; start empty so nothing is left over from the previous article
    title = ""
    for titles in root.find_all('title'):
        title = titles.text

    content = ""
    for contents in root.find_all('p'):
        content += contents.text + "\n"

    # Some content, like advertisements, has no title.
    if title == "":
        text = content
    else:
        text = title + "\n" + content

    # Save the content in a file
    with open(filename + ".txt", "w", encoding="utf-8") as f:
        f.write(text)
```
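
When many files are downloaded in a row, a single failed request would stop the whole loop. The sketch below shows one way to guard against that. It is an illustration under assumptions: the helper names `fetch_text` and `download_all` are made up here, and the pause of one second between requests is an arbitrary choice, not a documented requirement of the KB service.

```python
import time
from urllib.request import urlopen

def fetch_text(url):
    """Download a URL and return its content as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode('utf-8')

def download_all(identifiers, fetcher=fetch_text, pause=1.0):
    """Return a dict of identifier -> XML text and a list of failed identifiers."""
    results = {}
    failed = []
    for identifier in identifiers:
        try:
            results[identifier] = fetcher(identifier + ":ocr")
        except OSError as error:
            # Network errors from urllib are subclasses of OSError.
            # Remember the failure and continue with the next article.
            failed.append(identifier)
            print(f"Could not download {identifier}: {error}")
        time.sleep(pause)  # wait a moment between requests
    return results, failed
```

It could be called as `results, failed = download_all([row[2] for row in article_list])`, after which each downloaded XML string can be parsed with `BeautifulSoup(xml_text, 'xml')` as before. The `fetcher` parameter makes it possible to try the function without a network connection.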

## Extract all separate articles from a specific page of the newspaper from the Didl file

In the above we treated two options:
* Extracting the whole content of a page and saving into one file;
* Extracting all the articles of a newspaper and saving this to file per article.

It is also possible to download the articles per page.
If you look into the XML file you will see the element 'Component' with the attribute 'dc:identifier'.
For example:
```XML
<didl:Component dc:identifier="ddd:010097934:mpeg21:p001:a0003:zoning">
```

In this case the ***p001*** indicates that this concerns the first page. 
If the code to retrieve all the articles from a newspaper is adapted to loop via the element 'Component' instead of
the element 'Resource', it becomes possible to select only those elements whose attribute contains ***p001***. This can be done using:

```Python
if 'p001' in [variable in which the content of dc:identifier is stored]
```

Then the rest of the code can be made similarly to the code we used to extract all identifiers of all articles of the whole newspaper. 

```{admonition} Note
If an ***attribute*** has a namespace, you HAVE to add the namespace prefix before the attribute name in Beautiful Soup
for it to be recognized.
```
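
The effect of the prefix can be checked on a small, hand-written fragment. The first identifier below is the example shown above; the second one (with ***p002***) is made up for this illustration. The sketch assumes that `lxml` is installed for the `'xml'` parser.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the Didl structure; the p002 identifier is made up
sample = """<didl:DIDL xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
  <didl:Component dc:identifier="ddd:010097934:mpeg21:p001:a0003:zoning"/>
  <didl:Component dc:identifier="ddd:010097934:mpeg21:p002:a0010:zoning"/>
</didl:DIDL>"""
sample_root = BeautifulSoup(sample, 'xml')

component = sample_root.find('Component')
print(component.get('dc:identifier'))  # with the prefix, the value is found
print(component.get('identifier'))     # without the prefix, the result is None

# Keep only the identifiers that belong to page 1.
# The default '' avoids an error for elements without the attribute.
page_one = [c.get('dc:identifier') for c in sample_root.find_all('Component')
            if 'p001' in c.get('dc:identifier', '')]
print(page_one)
```

Note the contrast with element names: 'Component' is found without its `didl:` prefix, while the attribute is only found as 'dc:identifier'.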

```{admonition} Exercise
:class: attention
Write code to collect the identifiers from page 1 and store them row by row in a DataFrame together with the page number, type of text, and title. Then print this DataFrame.

```

````{admonition} Solution
:class: tip, dropdown
Your code should look like the code below:
```Python
# Declare the page variable here so it can easily be changed
article_list = []
page = 'p001'

for item in root_didl.find_all('Component'):
    # Use an empty string when a Component has no dc:identifier attribute
    identifier_page = item.get('dc:identifier', '')
    if page in identifier_page:
        for article in item.find_all('dcx'):
            a_type = article.find('subject')
            if a_type is not None:
                title = article.find('title')
                identifier = article.find('identifier')
                article_list.append([page, a_type.text, title.text, identifier.text])
 
import pandas as pd
articles = pd.DataFrame(article_list, columns = ['Page', 'Type', 'Title', 'Identifier'])

articles
```
````


You now have a DataFrame with metadata from all articles of one page. You can use the same steps as described above to download the content of these articles and store it in text files.
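
As a starting point for that last step, the sketch below turns rows of the shape `[page, type, title, identifier]` into a download URL and a filename, using the same string manipulations as earlier in this lesson. The helper name `ocr_url_and_filename` is made up for this illustration. The two example rows reuse titles and identifiers from the output above; whether both articles are really on page 1 is not checked here.

```python
# Two example rows in the same shape as article_list
rows = [
    ['p001', 'artikel', 'Het jaar 1869.',
     'http://resolver.kb.nl/resolve?urn=ddd:010097934:mpeg21:a0001'],
    ['p001', 'artikel', 'Duitschland.',
     'http://resolver.kb.nl/resolve?urn=ddd:010097934:mpeg21:a0003'],
]

def ocr_url_and_filename(identifier):
    """Return the OCR URL and a Windows-safe filename for one identifier."""
    url = identifier + ":ocr"
    # Keep the part after 'urn=' and replace ':' with '_'
    filename = url.split('=')[1].replace(':', '_') + ".txt"
    return url, filename

for page, a_type, title, identifier in rows:
    url, filename = ocr_url_and_filename(identifier)
    print(url, '->', filename)
```

With the DataFrame from the exercise, the same rows can be obtained with `articles.values.tolist()`.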