How to create a corpus from the web archive

In December, RESAW is hosting a two-day Tech Meet-Up in Aarhus.

Tech Meet-Up
Aarhus University, December 16-17, 2015

NetLab in collaboration with the Danish web archive, Netarchive.


The meet-up will be about corpus creation in web archives.


RESAW acknowledges the need for discussions of a more technical kind. These talks will start off with a presentation of the concept of a corpus as seen from the researcher's perspective. From there we will go into discussions about the design and implementation of a corpus.

Presentation format

We are doing lightning talks in order to get straight to the matter and leave more time for discussions, whiteboard drawings and demos.

What is it?

This is not a workshop, in the sense that we will not be going through the installation of software. We will focus on the design and implementation issues concerning “a corpus from the web archive”. Which technical infrastructure is required? How do we manage the different corpora researchers will create? How can they be referenced and shared? How do we preserve them for 10 years or longer?

The Programme

On day 1 the theme is “A web corpus – what is it?”. We will get a better understanding of researchers' needs when creating a corpus.
On day 2 the theme is “A web corpus – how do we make it?”. We will discuss solutions to the problem of creating a corpus.
You can download the programme here: Tech Meet-Up Programme.

The Outcome

We aim to find a common understanding of the concept of a corpus and to specify a technological solution that can be implemented in web archives.

May 2016: The discussion has been continued in the IIPC OpenWayback Machine group on CDX as a corpus format.


The meet-up will take place at Aarhus University. Wi-Fi, food and water are available, and maybe cakes too 🙂 Dietary preferences will be accommodated. We have whiteboards and presentation equipment available.

Hotel and Travel

Hotel rooms will be reserved for attendees from outside Aarhus. NetLab will cover the cost of your hotel room. You will only need to pay for transportation to and from Aarhus.
The Radisson Blu Scandinavia Hotel is a 500 m walk from the Central Station in Aarhus. When you register, your room will be booked for you. Just check in when you arrive.

You will most likely be flying to one of these airports: Copenhagen (CPH), Aarhus (AAR) or Billund (BLL). From CPH, take a connecting flight to AAR or take the train, which departs from the airport. Check the timetable of DSB, the Danish national rail operator. You can purchase your ticket from the vending machines on the platform; international payment cards are accepted.

Aarhus Airport (AAR) is located 40 km north of the city and connects to many European cities. There is an airport bus to Aarhus Central Station. It departs every 20 minutes after each flight arrival. You can look up your flight in the timetable.

Billund Airport (BLL), close to the famous LegoLand Park, is 100 km south of Aarhus and has direct flight connections to many European destinations. There is an airport bus to Aarhus with several daily departures; please check the timetable. The travel time is approximately 1 hour and 30 minutes. The bus stops at the bus station, which is only 200 m from Aarhus Central Station.

The Venue

Day 1 – December 16

We will kick off the meet-up at Statsbiblioteket. From the city centre you can take bus 2A, get off at the bus stop Langelandsgade/Kaserneboulevarden and walk the last 600 m to the main entrance of the library. See map and buses at 9 AM.

Day 2 – December 17

The second day will be in the Nygaard building, IT Campus Katrinebjerg. Again, you can take bus 2A, get off at the bus stop Storcenter Nord and walk 400 m to the IT Campus. See map and buses at 9 AM.

Tickets for the bus

The ticket (2 zones) can be purchased on the bus and costs DKK 20 (cash only, coins not notes). Further information and timetables are available from Midttrafik.


Registration ended on Monday December 7.