Week 1 - Introduction to Data Structures

In this first week, we will introduce the general concept of data structures. We will also look at two simple data structure, the array and the list, and write some very basic Python code to create and use a list.

The importance of data structures

We will begin by taking a look at the concept of data structures. First of all, what is a data structure and why is it important? When it comes to programming, we frequently need to deal with collections of data, rather than a single item of data. For example, we might write a program to manage student records. This program would need to deal with not just one student, but many students. Similarly, you might want to write a program for a live music venue, to allow the venue to store concerts and allow users to book tickets. Again, such a program would have to deal with many concerts and many users.

To store multiple items in a computer program, we need to choose an appropriate data structure. A data structure is a programming construct which allows us to store multiple items of data. When dealing with data structures, an important thing to consider is which is the most appropriate data structure for the problem we are trying to solve? Certain data structures are more suited to specific types of problem, and we will be looking at this through the module. We will take a look at various data structures, from very basic ones to more advanced, specialised data structures, including:

As we do so, we will consider their advantages, and in some cases disadvantages, for particular tasks.

A basic data structure - the array

Let's start with the most basic data structure of all, the array. An array is a data structure which can hold more than one item of data, and can be thought of as a series of "boxes" which can hold data. With an array, the key thing is that each box is adjacent to its neighbours in memory which has some consequences in terms of efficiency which we will consider as we go on

The diagram below shows an example of an array which holds the names of five items of fruit.

An array

The things to note here are:

Arrays do have some disadvantages. Their simplicity makes them easy to work with but this simplicity can lead to limitations.

Advantages and disadvantages of arrays

Advantage: Fast lookup

As we saw above, the advantage of using an array is that it is very fast to index an array. This is because arrays are always sequentially stored in memory, so that in the example above, "apple", "orange", "pear", "grape" and "cherry" will be stored adjacent to each other in memory.

When indexing an array, the computer is able to work out where in memory the data is stored, because each member of an array uses up the same number of bytes. The equation is simply:

Address = Start Address + Index * Size Of One Member

So imagine, for example, that each member of an array uses up 32 bytes, and the start of the array is at memory location 24576. It's an easy calculation to work out where the member with the index of 3 is located. If we substitute the values into the equation we get:

Address = 24576 + 3 * 32

which is 24576 + 96, which is 24672.

We ourselves do not need to do this calculation. The computer does it. But we can see, because it's simple arithmetic, the computer is able to do it very quickly. So, arrays are optimised for fast look-up of data using a numerical index.

Coding arrays

We will be using the Repl online development environment. Repl is an online environment which allows you to program without setting up software on your own computer.

You can login to Repl using either a Repl account or your GitHub account. You may well already have a GitHub account, if not, you will find it useful throughout your time at university and also when coding in general. GitHub is a collaborative platform for developing projects which allows you to version-control your code, in other words you can commit your code every time you reach a certain stage, and then "roll back" to previous commits if you are unhappy with new code. You will learn more about GitHub, and the Git version-control system, on other modules, but for now, simply create an account on GitHub.

Once you have created a Repl account or GitHub account, you can log in with it on Repl. If you have created a simple "hello world" project already in COM411 you do not need to do anything else. If not, test that it all works by creating a new project.

Repl interface

If you select the menu icon on the top left, you can select "New Repl". A Repl is a project - essentially, a single program. The idea is that you create a new Repl for each program you are going to write. You can then go back and view your previous Repls to load the code.

New Repl

Once you have done this, just enter the following Python code:

print("Hello World!")

Run the code and you should see the output appear in the console to the right:

Hello World!

Repl Console Output

Variables

An important concept in programming is that of variables. A variable is a "box" which can be used to hold information for later use. For example here is a simple Python program which asks the user for their name and then greets them:

first_name = input("Please enter your name:")
print("Hello " + first_name)
The program prompts the user for their name, with the input() statement. The name is read into a variable called first_name. The variable is a "pointer" to an area of computer memory where the name is stored. Later on in the program, if we want to reference the name the user entered, we use this variable first_name outside quotes. So the statement:
print("Hello " + first_name)
will print the text "Hello" plus the value of the variable first_name. Note how the variable is outside quotes. Any text inside quotes is displayed literally (e.g the text "Hello") but any text outside quotes is interpreted as a variable.

The next example uses two variables, for the first name and the last name.

first_name = input("Please enter your first name:")
last_name = input("Please enter your last name:")
print("Hello " + first_name +" " + last_name)
The print() statement now displays the text "Hello", plus the first_name variable, plus a space, plus the last_name variable.

Exploring arrays

Creating a simple program making use of an array

Here is a program which makes use of an array (it's actually, technically, something called a list, which you will be introduced to soon, but for now, I'd like you to assume it's an array. We are creating an array called people to hold the names of members of a team.


print("Please enter the three people in your team")
people = [None] * 3
people[0] = input("Please enter the first person:")
people[1] = input("Please enter the second person:")
people[2] = input("Please enter the third person:")

print("The first person is " + people[0])
print("The second person is " + people[1])
print("The third person is " + people[2])

Note a couple of things:

Try running this code but before you do so, try to predict what will be displayed.

Disadvantage: not flexible, an array cannot be resized

Now, imagine you want to add two more entries to the array. Before trying it, do you think this would work?

people[3] = input("Please enter the fourth person:") 
people[4] = input("Please enter the fifth person:")

Now run it. Do you get the result that you expected? See how this illustrates an issue with arrays: they are not resizable.

Not enough space, or too much space!

Imagine we have a larger array storing company employees, as shown in the diagram below. (Just the initials of the employees are shown).

Employee array

Let's say, though, the company takes on an 11th employee, "Ulysses Vernon" (UV). Do we have space in the array to fit this 11th employee? No, we don't, because an array has fixed size!

So what happens if the company expands and takes on more employees? What could we do?

Adding to an array

Process needed to extend an array

So what can we do? One solution is to create a new array with more space when the old one runs out of space. If we wanted to to do this, we would need to:

This is shown in the diagram below, which returns to the original Python code example, in which team members were added to an array. Imagine we have an array with three team members, and we want to add two more members to the team. Hopefully you can see that this is a rather long-winded, and thus slow and inefficient process, as multiple steps are needed. Throughout this module, we will be focusing on how to maximise efficiency when coding, in terms of speed (time taken) and memory usage.

Array extension

Further disadvantage of arrays: insertion into the middle difficult

Above we saw that adding extra data to the end of an array was not possible, and we had to inefficiently create a new array and copy the data across if we wished to do that.Last week we saw that adding extra data to the end of an array was not possible, and we had to inefficiently create a new array and copy the data across if we wished to do that. What if we wanted to add a new member at a specific position in the array? For example, imagine we had an array of contacts that we wished to keep in alphabetical order, e.g:

Alex Acland
Bonnie Black
Charley Chase
Earl Edwards
Frances Freeman

Imagine we had a new name, "Danielle Dawson" which we wanted to insert between "Charley Chase" and "Earl Edwards" in the array above. What we would have to do is:

Array Insertion

Clearly this is inefficient as we have to create a new array, which is expensive in memory usage, as we have to have the old and new arrays in memory at the same time.

More flexible data structures

We have seen, through the exercise above, the limitations of the basic array. Consequently, there are a whole range of more flexible and specialised data structures which we can use for particular scenarios.

In fact Python uses one such data structure, the list. A list is essentially a wrapper round an underlying array. With the list you can add new items of data to the end of the list using the append operation, and keep doing so until the computer runs out of memory. However, because the list is implemented internally using an array, the append operation may require the inefficient process we saw last week, in which a new internal array is created with additional space to hold the new elements, and then the old data is copied across from the old array to the new array.

This is shown below. This diagram shows a simplified version of what happens when you append to a list. It is a repeat of the diagram showing array resizing from last week, but also shows the list (in green) as a wrapper round the array. It shows how the internal array has to be recreated and the old data copied across when you append to a list.

Appending to a list

(The Python list does however include some optimisations to improve the efficiency of append and insertion operations. For instance, more memory is allocated for the internal array than is needed, meaning that a new internal array need not be created if only a small number of items are added as there will be spare space at the end of the array to hold them. See here for details.)

Other languages have similar "extensible array" data structures, for instance C++ has the vector and Java has the ArrayList.

Creating a simple program making use of an list

Here is a reminder of the program we wrote above using an array.


print("Please enter the three people in your team")
people = [None] * 3
people[0] = input("Please enter the first person:")
people[1] = input("Please enter the second person:")
people[2] = input("Please enter the third person:")

print("The first person is " + people[0])
print("The second person is " + people[1])
print("The third person is " + people[2])

The key thing to note is, actually, people is NOT an array!. It's actually a list, and thus can be resized. As discussed above, to add new elements to a list, we use append. This will perform the process above, in which the list's internal array is resized, for us automatically:
people.append("Tim Jones")
people.append("Jane Smith")

// This will print out "Tim Jones" and "Jane Smith"
print("The fourth person is " + people[3])
print("The fifth person is " + people[4])

Advanced - Looping through lists

This section is not compulsory for the moment. It is provided if you want to get ahead.

In COM411 you will be introduced, in more detail than here, to the concept of "looping" to perform the same operation multiple times. However, because it's such a useful thing to do with data structures, I will give you a preview now of how to use loops to do something which each member of a list. (This work is not compulsory for this week, but is offered as an extra for those who have got through the basics).

The for (foreach) loop

If we have a data structure, we might well want to display all the items in that data structure. How this is done depends on the individual data structure that we're dealing with, but for a list in Python, it's actually quite easy. Here is a simple example:

all_languages = ["Python", "JavaScript", "Java", "Kotlin"]
for current_language in all_languages:
    print("At university you will be studying " + current_language)
Note how we create a list of languages (in this case, the languages that you will be studying while at university on the Software Engineering course) and store it in the variable "all_languages". sWe then use a so-called "for loop" to perform a particular block of code on each member of the list.
for current_language in all_languages:
This literally means "go through each language in the list 'all_languages' in turn, store the current language in the variable 'current_language', and perform the block of code below the 'for' statement for the variable 'current_language'. In other words, the print statement will be executed for EACH language in the list. Hence, this type of 'for' loop is also known as a 'for-each' loop.

The diagram below shows how the 'for' loop is working:
Operation of for (for each) loop

The while loop

As well as using the 'for' loop, we can use another type of loop, the 'while' loop. This allows us to run the code while some condition is true. In data structures, 'while' loops are generally used with indices. Here is the previous example rewritten to use a 'while' loop:

all_languages = ["Python", "JavaScript", "Java", "Kotlin"]
current_index = 0
while current_index < len(all_languages):
    print("At university you will be studying " + all_languages[current_index])
    current_index += 1
Note how we now declare a variable current_index to hold the current index we are dealing with, starting at 0. We then repeat the block of code below the while statement while a certain condition is true - in this case, while current_index is less than the length of the list.

So, as a result, this block of code:

print("At university you will be studying " + all_languages[current_index])
current_index += 1
will be repeated until current_index reaches the length of the list. In this code, we use the variable current_index to index the list and print out the value at that index in the list, and then increase current_index by one. So the first time this block of code runs, current_index will be 0, and it will display Python. The second time the block of code runs, current_index will be 1, and it will display JavaScript, and so on.

The diagram below shows how the 'while' loop is working:
Operation of a while loop with indices

Exercises

Exercise 1: Arrays: Paper-based

You are writing a program to store employees for a company. It's a small company, with only 10 employees. Try drawing, on paper, an array containing these 10 employees. Draw each name in each position in the array, similar to the diagram above containing an array of fruit.

John Stevenson
Jane Smith
Tim Wilson
Kate Stevenson
Kate Palmer
Tom Eastman
Laura Green
Mike Watson
Sally Black
Mark Ramsey

Answer these questions:

Exercise 2: Adding new data to an array (paper)

Do this exercise on paper. This is a similar example to that given in the lecture, and is designed to illustrate the problems with adding new data to an array.

Exercise 3: Lists (code): Creating a simple program making use of a list

Write a program which creates a list with the 10 employees mentioned in Exercise 1. The 10 names should be specified in your code - do not read them in from the user. Display "Jane Smith" and "Mike Watson" by indexing the list appropriately.

Advanced Exercise: Looping through a list

Enhance your code to loop through the list. The code should output the list in this format:

Jane Smith is an employee of Advanced Widgets Ltd.
Tim Wilson is an employee of Advanced Widgets Ltd.
(etc)

More advanced: Ask the user to enter a name. By looping through the employees and comparing each name to the name the user entered, the program should tell the user whether the name they entered is an employee of the company or not. You will need to use an if statement for this but it must also contain a loop. If you do this, you will have coded your first searching algorithm, albeit a basic and not-so-efficient one!